FPGA target to host DMA transfer speed

michaeljoseph · ‎09-05-2013

Hello,

------------Task summary:

I'm currently working on a data acquisition-heavy project using a PXIe chassis system (plus a host computer). The components are listed below.

-------
PXIe-PCIe8388 x16 Gen 2 MXI-express (controller)*
PXIe-1082 (chassis)
PXIe-7966R (FPGA)
NI 5772 (AC version, IO Module)
-------
*note: the controller is connected to a PCIe port on the host computer with the full x16 bandwidth.

For my application, I need to acquire a fixed number of samples (16000) from each channel of the IO module at a fixed sampling rate (800MS/s). Each acquisition will be externally triggered at a fixed frequency, 50kHz. The number of acquisitions will also be fixed. Right now I'm aiming for about 90000 acquisitions per session.

So in summary, for each acquisition session, I will need (16000 samples per acquisition) * (90000 acquisitions) * (2 AI channels) = 2.88e9 samples per acquisition session.

Since each sample is transferred as a 16-bit number, this equates to 5.76GB per acquisition session.

The total time per acquisition session is (90000 acquisitions) / (50000 acquisitions per second) = 1.8 seconds.
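
As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python. This is only a sketch: the variable names are illustrative, and GB is taken to mean 1e9 bytes.

```python
# Requirements as stated in the post; variable names are illustrative.
samples_per_acq = 16_000      # samples per channel per acquisition
acquisitions = 90_000         # acquisitions per session
channels = 2                  # AI channels on the IO module
bytes_per_sample = 2          # each sample is transferred as a 16-bit number
trigger_rate_hz = 50_000      # external trigger frequency

total_samples = samples_per_acq * acquisitions * channels
total_bytes = total_samples * bytes_per_sample
session_seconds = acquisitions / trigger_rate_hz
required_rate = total_bytes / session_seconds    # bytes per second

print(f"samples per session : {total_samples:.3e}")
print(f"data per session    : {total_bytes / 1e9:.2f} GB")
print(f"session length      : {session_seconds:.2f} s")
print(f"required throughput : {required_rate / 1e9:.2f} GB/s")
```

Note that 16000 samples at 800MS/s take 20 microseconds, which equals the 50kHz trigger period, so these numbers describe continuous (100% duty cycle) streaming.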

--------------Problems:

I'm having problems transferring the acquired data from the FPGA to host. I think I'm seeing an overflow on the FPGA before the data is transferred to the host. I can go into more detail pending an answer to my questions below.

--------------Questions:

I want to ask a few general questions before posting any code screenshots. Assuming my math is correct and the host computer is 'good' enough, is it theoretically possible to transfer data at my required throughput, 5.76GB / 1.8 seconds = 3.2GB/s, using the hardware that I have?

If it is possible, I can post the FPGA and host VIs that I'm using. If not, I will have another set of problems!

Thanks,
Michael

nathand · ‎09-06-2013

@michaeljoseph wrote:

I want to ask a few general questions before posting any code screenshots. Assuming my math is correct and the host computer is 'good' enough, is it theoretically possible to transfer data at my required throughput, 5.76GB / 1.8 seconds = 3.2GB/s, using the hardware that I have?

I didn't check your math, but 3.2GB/s would be difficult to achieve. The product page for the 7966R says "16 DMA channels for high-speed data streaming at more than 800 MB/s." I'm not clear as to whether that's 800MB/s combined across all DMA channels, or per channel. PXIe is based on PCIe, and the specs for your board say it supports PXIe v1.0. According to Wikipedia, the maximum transfer rate for PCIe v1.0 is 4GB/s, and it's always difficult to achieve maximum theoretical transfer rates. Also, your board has onboard memory and, according to the specs, the maximum theoretical transfer rate to the onboard memory is 1.6GB/s per bank, with 2 banks, so an absolute maximum of 3.2GB/s. I'd be surprised if you can transfer to off-board memory faster than to on-board memory, although that's mostly a guess as I'm a long way from being a digital design engineer. So I'd say it's somewhere between optimistic and impossible to get the transfer rate you want, although you might want to conduct a few simple experiments where you put constant data into several DMA channels simultaneously and benchmark the speed. Also see this thread containing an older table of DMA transfer rates.

thibber · ‎09-06-2013

Hi Michael,

I have a few questions / observations for you based on your post:

First, you mention that you are using the PXIe-PCIe8388 x16 Gen 2 MXI-express. This is only compatible with the NI RMC-8354, so when you mention the streaming speeds you are looking to achieve, is this streaming back to the RMC, or to something else? Is the NI RMC-8354 the host computer you are mentioning?

When it comes to streaming data with the NI 5772 and PXIe-7966R, there are a few different important data rates. First, the NI 5772 can acquire at a maximum rate of 1.6 GS/s with 12-bit resolution = 2.4 GB/s. This is only if you are using 1 channel; for 2 channels the rate is halved. Are you planning on using 2 separate 5772s and 7966Rs?

The 7966R can stream data at a maximum rate of 800 MB/s, so we have a data rate coming into the FlexRIO's FPGA (2.4 GB/s) and going out of the FlexRIO's FPGA (0.8 GB/s). The data that isn't being sent back to the host accumulates in the FPGA's DRAM. Let's say we have all of the FPGA's DRAM available to store this data (512 MB). Our effective accumulation rate is 2.4 - 0.8 = 1.6 GB/s, so our FPGA is going to fill up in about 1/3 s, streaming a total of 0.8+0.512 = ~1.3 GB back to the host before saturating and losing data.

There are a few options, therefore, to reach your requirement. One might be duplicating your setup to have more cards. 1.3 GB x 3 = 4GB, which meets your need. Also, the 7975R can stream data back to the host twice as fast and has 2GB of DRAM onboard, so you could store more data and stream faster, therefore meeting your requirement.
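
The buffer reasoning above can be sketched as a simple fill-rate model. This is a rough sketch only: it assumes the figures quoted in this post (2.4 GB/s in, 0.8 GB/s out, all 512 MB of DRAM available), uses decimal units, and the function name is hypothetical. In this model the amount captured before overflow is the input rate multiplied by the fill time, which comes out closer to 0.77 GB than to the ~1.3 GB estimate, so treat both numbers as approximate.

```python
def time_to_overflow(in_rate_gb_s, out_rate_gb_s, buffer_gb):
    """Seconds until a buffer fills when data arrives faster than it drains."""
    net_rate = in_rate_gb_s - out_rate_gb_s
    if net_rate <= 0:
        return float("inf")   # the buffer drains at least as fast as it fills
    return buffer_gb / net_rate

# Figures quoted in the post (assumed, not measured).
IN_RATE = 2.4     # GB/s from the adapter module into the FPGA
OUT_RATE = 0.8    # GB/s from the FPGA to the host over DMA
DRAM = 0.512      # GB of onboard DRAM used as a buffer

fill_s = time_to_overflow(IN_RATE, OUT_RATE, DRAM)
captured_gb = IN_RATE * fill_s    # everything acquired before the buffer is full

print(f"buffer full after {fill_s:.2f} s, about {captured_gb:.2f} GB captured")
```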

I hope that this information helps clarify what concerns come into play for this type of application. Please let me know if anything above is unclear or if you have further questions.

Andrew T.
National Instruments

michaeljoseph · ‎09-09-2013

@nathand wrote:

@michaeljoseph wrote:

I want to ask a few general questions before posting any code screenshots. Assuming my math is correct and the host computer is 'good' enough, is it theoretically possible to transfer data at my required throughput, 5.76GB / 1.8 seconds = 3.2GB/s, using the hardware that I have?

I didn't check your math, but 3.2GB/s would be difficult to achieve. The product page for the 7966R says "16 DMA channels for high-speed data streaming at more than 800 MB/s." I'm not clear as to whether that's 800MB/s combined across all DMA channels, or per channel. PXIe is based on PCIe, and the specs for your board say it supports PXIe v1.0. According to Wikipedia, the maximum transfer rate for PCIe v1.0 is 4GB/s, and it's always difficult to achieve maximum theoretical transfer rates. Also, your board has onboard memory and, according to the specs, the maximum theoretical transfer rate to the onboard memory is 1.6GB/s per bank, with 2 banks, so an absolute maximum of 3.2GB/s. I'd be surprised if you can transfer to off-board memory faster than to on-board memory, although that's mostly a guess as I'm a long way from being a digital design engineer. So I'd say it's somewhere between optimistic and impossible to get the transfer rate you want, although you might want to conduct a few simple experiments where you put constant data into several DMA channels simultaneously and benchmark the speed. Also see this thread containing an older table of DMA transfer rates.

Thanks for your response, those are good points about the specifications. I will definitely not be able to stream at the full sampling rate, 800MS/s, if I am using both AI channels. So at best I will probably have to make do with 400MS/s per AI channel.

I have been running rough tests where I put the data into multiple target to host FIFOs per AI channel. It seems like it improves performance versus putting the data into one large target to host FIFO per AI channel. The only issue is that it increases timing delays and uses more resources to the point where I'm using almost half of the FPGA.

I will need to run more thorough tests and try to quantify the transfer performance. Once I do, I will post my VIs.

thibber · ‎09-09-2013

Hi Michael,

A few updates to my previous post:

First, I think I could have explained the sampling rate a bit more clearly. Using 2 channels instead of 1 means that each channel will have half the sampling rate (800 MS/s), but the total acquisition rate will still be the same (1.6 GS/s).

There are some other options you might want to look into as well regarding your acquisition. For instance, is it acceptable to use only the 8 most significant or least significant bits of your measurement? Or to discard a section of your acquisition that is irrelevant to the measurement?

Also, if you do end up wanting to look in the direction of a 7975R, you would likely also want to switch to a 1085 chassis to fully utilize the improved streaming speeds. The 1082 has a limitation of 1 GB/s per slot, while the 1085 can achieve up to 4 GB/s per slot.

I look forward to hearing what other observations or concerns arise in your testing.

Andrew T.
National Instruments

michaeljoseph · ‎09-09-2013

@thibber wrote:

Hi Michael,

I have a few questions / observations for you based on your post:

First, you mention that you are using the PXIe-PCIe8388 x16 Gen 2 MXI-express. This is only compatible with the NI RMC-8354, so when you mention the streaming speeds you are looking to achieve, is this streaming back to the RMC, or to something else? Is the NI RMC-8354 the host computer you are mentioning?

When it comes to streaming data with the NI 5772 and PXIe-7966R, there are a few different important data rates. First, the NI 5772 can acquire at a maximum rate of 1.6 GS/s with 12-bit resolution = 2.4 GB/s. This is only if you are using 1 channel; for 2 channels the rate is halved. Are you planning on using 2 separate 5772s and 7966Rs?

The 7966R can stream data at a maximum rate of 800 MB/s, so we have a data rate coming into the FlexRIO's FPGA (2.4 GB/s) and going out of the FlexRIO's FPGA (0.8 GB/s). The data that isn't being sent back to the host accumulates in the FPGA's DRAM. Let's say we have all of the FPGA's DRAM available to store this data (512 MB). Our effective accumulation rate is 2.4 - 0.8 = 1.6 GB/s, so our FPGA is going to fill up in about 1/3 s, streaming a total of 0.8+0.512 = ~1.3 GB back to the host before saturating and losing data.

There are a few options, therefore, to reach your requirement. One might be duplicating your setup to have more cards. 1.3 GB x 3 = 4GB, which meets your need. Also, the 7975R can stream data back to the host twice as fast and has 2GB of DRAM onboard, so you could store more data and stream faster, therefore meeting your requirement.

I hope that this information helps clarify what concerns come into play for this type of application. Please let me know if anything above is unclear or if you have further questions.

Thanks for replying. To answer your first question: I'm transferring to a desktop computer. The controller is able to connect with a PCI express x16 slot in the desktop computer. I'm not sure how to technically describe it, but the controller plugs into the PXIe chassis, then there is another card that plugs into the host computer's PCI express x16 slot, and finally there is a large cable that connects the card in the host computer and the controller.

For your second paragraph: the reason I used 16-bit numbers in my calculations is that this is how the data is handled in the FPGA after it has been acquired (assuming I keep it as an integer). Is that correct? The samples are then packed in chunks of 4 (one U64) before being written to the target to host FIFO (that's how the NI 5772 examples do it). Right now I'm only using one FPGA and I/O module, and I'm using both AI channels (I need to simultaneously sample two different inputs).

I might be able to live with half of the sampling rate, 400MS/s for both channels, if that means I will be able to acquire a larger amount of data. Getting another FPGA and IO module is also an appealing option. It depends on what my advisors think (I'm a graduate student), and if they want to buy another FPGA and IO module.

Questions:

I have a question about the 7966R vs the 7975R that you mentioned. I could probably find the information in the specifications, but I figured I would just ask you here. Is there any advantage to using the 7966R over the 7975R in terms of programmable logic elements? From what I could quickly read, the 7975R has more DSP slices and RAM, but does it have less general purpose logic blocks than the 7966R? The reason I'm asking is because the project that I'm working on will eventually involve implementing as much signal processing on the FPGA as possible. But obviously figuring out the acquisition part of the project is more important right now.

The other question I have is related to something nathand said in response to my first post. Is using multiple target to host FIFOs faster than using 1 target to host FIFO (assuming the combined sizes are equivalent)? I noticed that the FPGA has a max of 16 target to host FIFOs. Does each target to host FIFO reserve some amount of bandwidth? Or is the total bandwidth just divided by the amount of target to host FIFOs that I use in a given FPGA VI? Ex: If I only define 2 target to host FIFOs, each would have half of the total bandwidth, if I define 3 target to host FIFOs each would have 1/3, etc.

@thibber wrote:

Hi Michael,

A few updates to my previous post:

First, I think I could have explained the sampling rate a bit more clearly. Using 2 channels instead of 1 means that each channel will have half the sampling rate (800 MS/s), but the total acquisition rate will still be the same (1.6 GS/s).

There are some other options you might want to look into as well regarding your acquisition. For instance, is it acceptable to use only the 8 most significant or least significant bits of your measurement? Or to discard a section of your acquisition that is irrelevant to the measurement?

Also, if you do end up wanting to look in the direction of a 7975R, you would likely also want to switch to a 1085 chassis to fully utilize the improved streaming speeds. The 1082 has a limitation of 1 GB/s per slot, while the 1085 can achieve up to 4 GB/s per slot.

I look forward to hearing what other observations or concerns arise in your testing.

Andrew T.
National Instruments

I'll go ahead and respond to your latest response too. Thanks again for your help.

I think I understand the streaming rate concept. I'm not using time interleaved sampling. My application requires using the simultaneous sampling mode. I need two channels of input data.

Unfortunately I don't think I can sacrifice on bit depth. But for right now I can probably sacrifice half of the sampling rate, and reduce my acquisition duty cycle from 100% (constantly streaming) to 50% (acquiring only half of the time). My acquisition rate will still need to be 50kHz though. I'm planning to compromise on sampling rate by summing pairs of data points instead of simply decimating, and then transferring the data to the host.

Questions:

We (my advisors and I) think that the summing pairs approach would preserve more information than simply throwing away every other point. Also, we can avoid overflow because each 16-bit number only contains 12 bits of actual information. The 16-bit number will just need to be divided by 16 before summing because the 12 bits of information are placed in the 12 MSBs of the 16-bit number. Does that sound right?

As for upgrading the hardware, that would be something I would need to discuss with my advisors (like I said in my above response to your previous post). It would also depend on any exchange programs that NI may have. Is it possible to exchange current hardware for some discount on new hardware?

nathand · ‎09-10-2013

@michaeljoseph wrote:

The other question I have is related to something nathand said in response to my first post. Is using multiple target to host FIFOs faster than using 1 target to host FIFO (assuming the combined sizes are equivalent)? I noticed that the FPGA has a max of 16 target to host FIFOs. Does each target to host FIFO reserve some amount of bandwidth? Or is the total bandwidth just divided by the amount of target to host FIFOs that I use in a given FPGA VI? Ex: If I only define 2 target to host FIFOs, each would have half of the total bandwidth, if I define 3 target to host FIFOs each would have 1/3, etc.

I had a reason to benchmark this, on a less-powerful setup, just this afternoon, although I was testing communication from the FPGA target to the host. At least in my setup - a PCI-7813R installed in a standard desktop PC running LabVIEW RT - the total bandwidth was the same for both one and two DMA channels. I calculated 20 MB/s for a single DMA FIFO, and 10 MB/s per FIFO when using two DMA FIFOs.

nathand · ‎09-10-2013

@nathand wrote:

I had a reason to benchmark this, on a less-powerful setup, just this afternoon, although I was testing communication from the FPGA target to the host.

Sorry, correction! I wrote it backward. I tested communication from the host, TO the FPGA target - the slower direction and the opposite of your situation.

Dave.T · ‎09-10-2013

12-bit numbers stored in 16 bits

There’s nothing stopping you from getting creative with the bit packing and storing the samples more efficiently, it just takes more processing on the host to unpack everything.
For example, you could take 16 12-bit samples and store them in 3 U64’s.
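
To illustrate the packing idea (16 x 12 bits = 192 bits = 3 x 64 bits), here is a minimal Python sketch of the pack step and the matching host-side unpack. The function names and the bit layout (sample 0 in the least significant bits of the first word) are assumptions made for illustration, not the layout used by any NI example. Samples are treated as unsigned 12-bit codes, so signed data would need sign extension after unpacking.

```python
SAMPLE_BITS = 12
SAMPLES_PER_GROUP = 16
WORD_BITS = 64
WORDS_PER_GROUP = SAMPLE_BITS * SAMPLES_PER_GROUP // WORD_BITS   # 192 / 64 = 3

def pack_group(samples):
    """Pack 16 unsigned 12-bit samples into three 64-bit words."""
    if len(samples) != SAMPLES_PER_GROUP:
        raise ValueError("expected exactly 16 samples")
    acc = 0
    for i, s in enumerate(samples):
        if not 0 <= s < (1 << SAMPLE_BITS):
            raise ValueError("sample does not fit in 12 bits")
        acc |= s << (SAMPLE_BITS * i)          # sample i occupies bits 12*i .. 12*i+11
    word_mask = (1 << WORD_BITS) - 1
    return [(acc >> (WORD_BITS * w)) & word_mask for w in range(WORDS_PER_GROUP)]

def unpack_group(words):
    """Host-side inverse of pack_group: three 64-bit words back to 16 samples."""
    if len(words) != WORDS_PER_GROUP:
        raise ValueError("expected exactly 3 words")
    acc = 0
    for w, word in enumerate(words):
        acc |= word << (WORD_BITS * w)
    sample_mask = (1 << SAMPLE_BITS) - 1
    return [(acc >> (SAMPLE_BITS * i)) & sample_mask for i in range(SAMPLES_PER_GROUP)]

samples = [(i * 273) & 0xFFF for i in range(16)]   # arbitrary 12-bit test pattern
words = pack_group(samples)
assert unpack_group(words) == samples
```

Compared with storing each sample in its own 16-bit slot (4 samples per U64), this layout moves 16 samples in 3 words instead of 4, a 25% reduction in transferred data.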

High Throughput

There is a High Throughput example in the example finder under FlexRIO. Using this example you should be able to test the throughput of your system. It tests throughput to memory, to disk, and from disk. What do you plan on doing with the data once it gets back to the host? Do you have a way to store the data at that rate?
Can you move the processing down to the FPGA, and send back less data to the host?
Can you share any more details of your application beyond the streaming rates?

7966R vs the 7975R

The 7975 will have 4x the onboard RAM at 2 GB, with 2x the DSP blocks, and 1.5x total resources, along with 2x the PXIe bandwidth. All of these specifications sound like they would be beneficial in your application. The device is orderable, but not shipping until later this year (Nov 1st). It sounds like you’re in the middle of development right now, which would present difficulties. It may be worth a call to your salesperson if you are interested in this to see if there’s anything we can do for you in the meantime.

Is using multiple target to host FIFOs faster than using 1 target to host FIFO

Using multiple FIFOs does not increase efficiency. The ability to use multiple FIFOs is mostly to make data transfer of multiple channels easier to deal with on the host. Otherwise you’d have to use more processing to split up the samples back to their original channels.

If I only define 2 target to host FIFOs, each would have half of the total bandwidth, if I define 3 target to host FIFOs each would have 1/3, etc

Yep, the bandwidth is limited by the PXIe bus; therefore, you only get ~800 MB/s per slot regardless of how many DMA FIFOs you instantiate.

National Instruments
FlexRIO & R-Series Product Support Engineer

nathand · ‎09-10-2013

@michaeljoseph wrote:

We (my advisors and I) think that the summing pairs approach would preserve more information than simply throwing away every other point. Also, we can avoid overflow because each 16-bit number only contains 12 bits of actual information. The 16-bit number will just need to be divided by 16 before summing because the 12 bits of information are placed in the 12 MSBs of the 16-bit number. Does that sound right?

Depends on what the code is actually doing, but generally if you convert a 12-bit value to 16 bits, the useful bits are stored in the low 12 bits so that the value remains constant. No division by 16 should be necessary.
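
For reference, here is a minimal Python model of the summing-pairs idea that covers both conventions discussed above. It is a sketch under assumptions: samples are taken to be signed 12-bit values carried in signed 16-bit integers, and the function name and the `left_justified` flag are hypothetical. Which alignment the 5772 code actually uses would need to be checked against the FPGA VI.

```python
def sum_pairs(samples, left_justified=False):
    """Sum adjacent pairs of samples, halving the sample count.

    samples        : signed 16-bit integers carrying 12-bit data
    left_justified : True if the 12 data bits sit in the 12 MSBs, in which
                     case each value is shifted right by 4 (divided by 16)
                     before summing; False if they sit in the low 12 bits.
    """
    out = []
    for a, b in zip(samples[0::2], samples[1::2]):
        if left_justified:
            a >>= 4   # Python's >> on ints is an arithmetic (sign-preserving) shift
            b >>= 4
        out.append(a + b)
    return out

# Worst cases for signed 12-bit data: the sums need 13 bits, so a signed
# 16-bit result cannot overflow in either convention.
low_aligned = [2047, 2047, -2048, -2048]
high_aligned = [v << 4 for v in low_aligned]

print(sum_pairs(low_aligned))
print(sum_pairs(high_aligned, left_justified=True))
```

Both calls give the same sums, and the extreme results (4094 and -4096) fit comfortably in a signed 16-bit number.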
