Next Steps - LabVIEW RIO Evaluation Kit

DMA FIFOs

Hello Everybody!

For the third blog post, we are going to discuss DMA FIFOs.

When you need a lossless stream of data going to or from your FPGA target, the best method to use is a DMA FIFO.  An example of this would be acquiring or writing an analog waveform.

The DMA FIFO actually consists of two buffers.  One resides on the FPGA and the other on the RT Controller.  The data transfer between these two buffers automatically occurs ‘behind the scenes.’  This help page, How DMA Transfers Work (FPGA Module), does a good job of explaining the basic mechanics of the FIFO.

The FPGA buffer is configured in the project explorer and the RT buffer is configured in the RT VI, as described in the first link.  To ensure a smooth flow of data, it is important to size your buffers appropriately.  If your buffers are too small, it is possible to overflow them and lose data.  If your buffers are too large, you waste resources: block RAM on the FPGA and system memory on the host.
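
If you'd rather see the host side as text than as a VI snippet, here is a rough sketch using NI's nifpga Python bindings.  The bitfile, resource, FIFO name, and depths below are placeholders, so treat this as an illustration rather than a drop-in:

    from nifpga import Session

    # Hypothetical bitfile/resource/FIFO names -- substitute your own.
    with Session(bitfile="MyAcquisition.lvbitx", resource="RIO0") as session:
        fifo = session.fifos["Target-to-Host FIFO"]

        # Size the host-side buffer; the driver rounds the request up and
        # returns the depth actually allocated.  Too small risks overflow,
        # too large wastes host memory.
        actual_depth = fifo.configure(requested_depth=65536)
        fifo.start()

        # elements_remaining is a handy overflow early warning: if it keeps
        # growing between reads, the host is not keeping up.
        read_result = fifo.read(number_of_elements=4096, timeout_ms=5000)
        samples = read_result.data
        backlog = read_result.elements_remaining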

With many FPGA targets you are limited to 3 DMA FIFO channels, which can become a problem if you need to stream more than 3 channels of data.  To get around this, you can use a technique called interleaving, where you write and read the elements in a fixed order so that multiple channels of data share the same FIFO.  The help page, Transferring Multi-Channel Data in DMA Applications (FPGA Module), does a great job of showing how to implement this technique.
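
Here is a quick illustrative sketch of the idea in Python, with made-up channel data; on the real target, the "interleave" side is simply the order in which your FPGA loop writes elements:

    NUM_CHANNELS = 3

    def interleave(channels):
        # channels: one sample list per channel, all the same length.
        # FPGA side: write ch0, ch1, ch2, ch0, ch1, ch2, ... into one FIFO.
        return [s for sample_set in zip(*channels) for s in sample_set]

    def deinterleave(flat, num_channels=NUM_CHANNELS):
        # Host side: read a multiple of NUM_CHANNELS elements, then split.
        return [flat[i::num_channels] for i in range(num_channels)]

    chans = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    stream = interleave(chans)   # [1, 10, 100, 2, 20, 200, 3, 30, 300]
    assert deinterleave(stream) == chans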

If you have any questions about implementing DMA FIFOs, please feel free to post them here!

Best,

Jeff S.
National Instruments

Thanks for the post, Jeff.

I had a quick question about overflow problems with target-to-host DMA FIFOs. Other than increasing the size of the buffers and the number of elements read at one time, is there any other way to reduce overflow errors on the host side of the FIFO? I've seen some articles about increasing the speed of the loop on the host side, but for a given collection of logic/operations, isn't the loop already running at maximum speed? Or is the intent to reduce the number of operations occurring in the loop?

Thanks for any help!


Best,
Adrian

Hi Adrian,

While increasing the buffer sizes and the number of elements you read at one time are the first things to change, there may be some other optimizations you can make. Have you checked the FIFO Depth property? What else is your host-side loop doing? Can you split off any currently serialized tasks into another loop? Better yet, you could dedicate a loop to ONLY reading the FIFO and then put the data in a queue to be processed in another loop or on a Windows host (if you are running real-time).
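
As a rough illustration of that pattern, here is a Python stand-in for what would be two LabVIEW loops connected by a queue (fifo_read and the queue size are invented):

    import queue
    import threading

    data_q = queue.Queue(maxsize=256)   # bounded, so backpressure is visible

    def reader_loop(fifo_read, block_size=4096):
        # This loop does nothing but drain the DMA FIFO and hand blocks off.
        while True:
            data_q.put(fifo_read(block_size))

    def processing_loop():
        # Anything slow (scaling, logging, display) lives here, where it
        # can never stall the FIFO reads.
        while True:
            block = data_q.get()
            # ... process block ...

    # Each loop runs in its own thread, e.g.:
    threading.Thread(target=processing_loop, daemon=True).start()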

Is your host Windows or real-time? On a real-time target in particular, you may be able to tune and vary the maximum loop speed.

Regards,

Deborah Burke
NI Hardware and Drivers Product Manager
Certified LabVIEW Architect

Hi Deborah,

Thanks for your follow-up, especially given how long ago this blog post was written.

I have taken a look at the FIFO depth and have scaled it up, although I have noticed that its effect diminishes the larger it becomes. Is that conclusion correct?

My host is Windows, and my experience has been that, especially when using 3 DMAs to transfer data back and forth (2 target-to-host and 1 host-to-target), Windows cannot consistently read the data fast enough, although I am continuing to try different solutions.

When reading from the DMA FIFO, is it best to read all elements at once into a 1D array, or to read smaller subsets inside a For Loop? For example, if I want to read 120 values from the FIFO each loop iteration, is it better to read 120 elements at once into a 1D array, or to read 12 elements at a time, 10 times, storing the result in a 2D array (auto-indexed by an outer For Loop that runs 10 times)? Or is there an optimal trade-off between the two? Does the answer change if I'm trying to read 1,000,000 elements instead of 120?

If we off-load the work from the DMA FIFO to a Queue, won't that simply move the problem further down the line, meaning that the queue will overflow until it fills up all memory? If not, does that mean that there are some distinctions between the way in which a queue is implemented and the way in which the DMA FIFO is implemented?

In essence, I'm trying to really understand the inner workings of the DMA FIFO. Although I've seen some documentation on recommended approaches, I haven't yet stumbled upon anything that explains what it is doing at a fundamental level, which would let me answer questions like the ones above myself. Do you know of anything like that?

In the meantime, I've reduced the duty cycle of the FPGA (essentially throwing away some of the DAQ values that I receive), and I believe I am now squeaking by with no timeouts, but only just.

Thanks so much for all of your help Deborah!

Adrian

Hi Adrian,

Sorry for the delayed response. I chatted with a couple of colleagues and also referenced the LabVIEW FPGA Help best practices for DMA FIFOs. Overall, because each read call carries fixed overhead, it's best to pull larger chunks to maximize FIFO efficiency, but we have a couple of other thoughts.
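
Borrowing the hypothetical nifpga-style read from the sketch in the original post, the comparison looks roughly like this: ten 12-element reads pay the per-call overhead ten times for the same data as one 120-element read.

    BLOCK = 120

    # Preferred: one large read per loop iteration (one call's overhead).
    samples = fifo.read(number_of_elements=BLOCK, timeout_ms=5000).data

    # Avoid: many small reads per iteration (10x the per-call overhead).
    rows = [fifo.read(number_of_elements=12, timeout_ms=5000).data
            for _ in range(10)]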

Regarding your data throughput to real-time, you could consider a form of triggering to gate when data is sent; this technique is sometimes used in NI FlexRIO applications, which typically deal with pushing very high throughput. Note that you will receive a subset of the data each time instead of the full stream. To start, assert a trigger that allows the FPGA to begin acquiring data. When the FIFO gets close to full (via monitoring), deassert that trigger, which gives the host some breathing room to pull data off. Then, when the host has nearly drained the FIFO, assert the trigger again to resume acquisition on the FPGA.
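
A bare-bones sketch of that hysteresis, with invented watermarks (on a real target this logic lives on the FPGA, driven by the FIFO's fill level):

    HIGH_WATER = 0.90   # deassert the trigger above 90% full
    LOW_WATER = 0.10    # reassert it once the host has drained to 10%

    def update_trigger(acquiring, fill_fraction):
        if acquiring and fill_fraction >= HIGH_WATER:
            return False   # pause acquisition; let the host catch up
        if not acquiring and fill_fraction <= LOW_WATER:
            return True    # plenty of headroom again; resume acquisition
        return acquiring   # otherwise hold the current state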

Next, you could bit pack your data: if you are acquiring multiple samples, hold on to a few, pack them together, and send them as a single U64. The downside is the amount of FPGA space that wider FIFO would use, depending on how many other tasks you are performing and the resources remaining; on the host, it may also be harder to keep the desired loop rate. It's all about finding the right compromise between these constraints and deciding which target is more resource- or performance-strapped.
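
For example, here is an illustrative sketch (assuming 16-bit samples) of packing four samples into one U64 element on the FPGA side and unpacking them on the host:

    def pack4_u16(s0, s1, s2, s3):
        # FPGA side: four 16-bit samples -> one U64 FIFO element.
        return (s3 << 48) | (s2 << 32) | (s1 << 16) | s0

    def unpack4_u16(word):
        # Host side: one U64 element -> four 16-bit samples, oldest first.
        return [(word >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]

    word = pack4_u16(0x1111, 0x2222, 0x3333, 0x4444)
    assert unpack4_u16(word) == [0x1111, 0x2222, 0x3333, 0x4444]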

For more dedicated resources and a second pair of eyes on your code (a phone call might also make these topics easier to discuss), I'd recommend creating a service request at ni.com/support.

Is your current configuration still working without timeouts?

Deborah Burke
NI Hardware and Drivers Product Manager
Certified LabVIEW Architect

Hi Deborah,

Thanks so much for your response; this is very helpful.

Before I begin on the rest of the details: I have been attempting to either call or start a ticket on the support line; however, I'm running into a globalization issue. I am currently working for a company in Korea, and all of the local numbers (as well as the default ticketing systems) I've been able to find are in Korean or have led me only to Korean speakers. Do you know if there is some way to schedule a phone call with someone in the US or Europe? I'm comfortable working around the time difference; I would just prefer to know that someone will be on the other end if I stay up.

To make sure I'm on the same page with the trigger: the host should write a Boolean control on the FPGA that gates the FPGA-to-host DMA, controlling whether values are written. I think that's a very good idea. It is similar in effect to what we are doing currently, where our specific application allows us to discard some amount of data on each acquisition scan, which lets the host catch up to the FPGA during the intervals when no new values are being recorded. At the same time, however, if we delay writing to the DMA, won't we miss values that were never collected? Our application uses a scan to collect data and displays results in a 2D matrix, so I believe each missed value would essentially be a dead pixel; since our application assumes fixed-width data, the entire rest of the matrix would be shifted by 1 pixel each time this occurs. Please let me know if there is anything I am misunderstanding.

I think the idea of bit packing is very interesting. We are already bit-packing 2 channels into a U32, and we could theoretically also pack 2 consecutive time samples into 1 U64 written to the DMA. This would halve the rate at which elements arrive at the host. The first time the data arrives at the host (we are sending it back and forth, unfortunately), we simply package it together and apply a transpose. I am still thinking through whether packing in time and then transposing would work, but I think it does. This looks like a great next step to test.
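
Sketching it for myself (assuming 16 bits per channel; this is not our actual code), the two-stage packing would look roughly like:

    def pack_channels(ch_a, ch_b):
        # Stage 1: two 16-bit channels -> one U32 per time step.
        return ((ch_b & 0xFFFF) << 16) | (ch_a & 0xFFFF)

    def pack_samples(u32_t0, u32_t1):
        # Stage 2: two consecutive U32 samples -> one U64 FIFO element,
        # halving the element rate at the host.
        return (u32_t1 << 32) | u32_t0

    def unpack(word64):
        # Host side: U64 -> [(ch_a, ch_b) at t0, (ch_a, ch_b) at t1].
        return [((u32 & 0xFFFF), (u32 >> 16) & 0xFFFF)
                for u32 in (word64 & 0xFFFFFFFF, word64 >> 32)]

    w = pack_samples(pack_channels(1, 2), pack_channels(3, 4))
    assert unpack(w) == [(1, 2), (3, 4)]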

I have managed to tweak my system into working without timeouts; however, I'm still seeing some non-real-time behaviour. I have optimized most of my paths on the host (I have 7 while loops running in parallel: 2 for UI control, 3 for DMA communication (FPGA-to-host, host-to-FPGA, FPGA-to-host), and 2 for final processing and matrix display). I have now eliminated the timeouts on the 3 DMA loops by optimizing the way they read and send data (and by moving each role into its own loop); however, I am now dropping pixels somewhere else. The behaviour I'm seeing is the matrix 'losing' rows and thus 'shifting' the picture down towards the bottom of the screen. Is there anywhere else I can expect to drop pixels? Perhaps in queues?
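
To illustrate what I'm worried about (a toy Python example, not our actual code): a lossy enqueue silently drops elements when the consumer lags, whereas a blocking enqueue stalls the producer instead:

    import queue

    q = queue.Queue(maxsize=4)

    def lossy_enqueue(item):
        try:
            q.put_nowait(item)
            return True
        except queue.Full:
            return False         # item silently dropped -- a lost row/pixel

    def lossless_enqueue(item):
        q.put(item)              # blocks until the consumer catches up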

Thanks again for all of your help!

Best,

Adrian

Here is what I was able to do to fix my problem:

I think I was able to get it working, mostly from the ideas you gave me. Specifically, bundling multiple data elements together into a U64 reduced the amount of processing necessary to the point that the host no longer drops data. In implementing it, I had to be careful about the use of interleaving and decimating to reconstruct the data correctly on the other side, but I think it ended up working well. From a conversation with a support technician who came by yesterday, it appears my issues had a few causes that should be investigated should this problem resurface:

1. The host could not keep up with the DMA FIFOs being streamed to it. Even without causing timeouts, at some point data was lost in the transition between the DMA and the host processing.

2. My usage of the FPGA board was close to 98%. I was advised that this level of utilization may be causing some of the data loss issues: although the FPGA can be used up to 100%, it begins exhibiting strange behaviour past 90-92% usage.

3. Other programs running concurrently on the host may have also played a part. This includes virus scanners, other windows open in LabVIEW, and other programs altogether.
