implicit buffered counter output with 100MHz internal clock locked to external 10MHz clock (PXIe-6738, ANSI-C)

andiT0815 · ‎03-06-2024

Hi all!

We use a PXIe-6738 analog output card in a PXIe chassis. We use one (or two) of the counters of the PXIe-6738 card to generate pulses at programmable times (implicit mode, buffered) on one of the PFI outputs. This PFI serves as the timebase of the analog outputs of the same PXIe-6738 card and/or of another analog or digital output card (PXIe-6535) in the same chassis. We can also use an external trigger on a PFI input to start generation of the pulses, and an external 10MHz signal on another PFI input as the clock source. With the internal 100MHz clock source the minimum programmable low and high times of the pulses are 60ns, which is sufficient for our application, but with the external 10MHz clock the minimum low and high times are 200ns, which is not good enough. So our problem is: we would like to have 60ns resolution independent of the clock source.

A possible solution would be to use the PLL of the PXIe-6738 card to lock the internal 100MHz clock to the external 10MHz clock. With this configuration we should always have 60ns resolution, whether or not we use an external 10MHz clock. However, so far we have not managed to get this working.

We usually use Python on Windows 10, but for testing we also use simpler C code with MS Visual Studio. This is a modified version of the DigPulseTrain-Cont-Buff-Implicit.c example code, which I attach here. We use the function DAQmxCreateCOPulseChanTicks, which allows setting the external clock input terminal, and DAQmxWriteCtrTicks to program the pulse times. To set the frequency of the external clock we tried the function DAQmxSetCOCtrTimebaseRate, but it gives error -200486: "Specified channel is not in the task", although the PFI channel was programmed before with DAQmxCreateCOPulseChanTicks. We also tried DAQmxCreateCOPulseChanTime and DAQmxWriteCtrTime to program the times in seconds, with DAQmxSetSampClkTimebaseSrc to set an external clock source and DAQmxSetCOCtrTimebaseRate to set the external clock frequency. DAQmxSetSampClkTimebaseSrc does not give an error, but DAQmxSetCOCtrTimebaseRate again gives the same error -200486. Using DAQmxSetSampClkTimebaseSrc without DAQmxSetCOCtrTimebaseRate does not produce any fixed phase relation between the counter output and the external clock, so this does not seem to work either.

So how can I configure the 100MHz internal clock of the PXIe-6738 card to use the 10MHz external clock and still have 60ns minimum low and high time of the counter?

Many thanks in advance!

Andreas

Kevin_Price · ‎03-06-2024

I've got a few offhand remarks that I don't have a ton of confidence in. Consider them for whatever they may be worth.

1. I would expect the standard behavior is for your 6738 device to PLL its internal master timebase (100 MHz) to the PXI 10 MHz chassis backplane clock.

2. If you tried and failed to PLL to a distinct external 10 MHz source, perhaps you first need to "detach" the default timing sync between your 6738 and the chassis.

3. Alternately, perhaps you need to configure your chassis to use the external 10 MHz source as its backplane clock. Then your 6738 can sync indirectly to the external 10 MHz by way of the chassis.

4. Either way, since counter output needs to use integer timebase periods, the task will need to be defined in terms of the internal 100 MHz timebase, providing 10 nanosec quantized increments for high and low times.

5. Internally, you typically need a minimum of 2 intervals each for high and low times, which is 20 nanosec in this case. I'm not sure what led you to conclude that you need those intervals to be 60 nanosec.

6. However, I do know some devices have input and/or output circuitry that puts further limitations on minimum high and low times when the signal routes from or to the outside world. Perhaps that's playing a role?

7. Your error # and text (-200486: "Specified channel is not in the task") seem fishy because sometimes you're getting that error for things that aren't directly referencing a channel. While searching this site for that error, one possibly useful tidbit is to use MAX to perform a self-test on your device and chassis and perhaps reset both too.

8. I haven't dabbled with trying to sync my higher-freq timebase to a slower external clock. You may need to get involved with configuring "Reference Clocks" and "Sync Pulses" as described in this very thorough doc.

-Kevin P

CAUTION! New LabVIEW adopters -- it's too late for me, but you *can* save yourself. The new subscription policy for LabVIEW puts NI's hand in your wallet for the rest of your working life. Are you sure you're *that* dedicated to LabVIEW? (Summary of my reasons in this post, part of a voluminous thread of mostly complaints starting here).

andiT0815 · ‎03-08-2024

Hi Kevin, thank you for your suggestions!

We have the PXIe-1073 chassis which has no onboard clock and no external clock input.

Looking for "Reference Clocks" we have successfully solved our issue (after further problems, see below) with the following two functions in the counter task:

DAQmxSetRefClkSrc(clock_terminal)
DAQmxSetRefClkRate(10e6)

Before, we were using the following two functions, which did not work or gave an error:

DAQmxSetSampClkTimebaseSrc(clock_terminal)
DAQmxSetSampClkTimebaseRate(10e6)

With these new functions we do not get an error when programming the counter task, and when the external clock is not attached to the PFI input we get the expected error that the PLL cannot be locked. So we can assume this works with the counter.

However, we have got a new error when writing the data for the analogue output task (I show you the error from Python, in C I guess it is similar):

"PyDAQmx.DAQmxFunctions.ResourcesInUseForRoute_RoutingError: Specified route cannot be satisfied, because it requires resources that are currently in use by another route."

This error is a bit puzzling, since the analogue outputs do not use the RefClock but the PFI output of the counter (programmed as clock source with the function DAQmxCfgSampClkTiming). The error does not occur when the analogue outputs and the counter are on different PXIe-6738 cards.

Explicitly calling DAQmxDisconnectTerms to disconnect the onboard 100MHz clock from the analogue output clock source (we tested several possible names as given by NI MAX) did not give an error but the previous error remained.

Finally, we managed to get the counter and analogue outputs working on the same PXIe-6738 card by calling the two DAQmxSetRefClkSrc/Rate functions for the analogue output task as well. I cannot say why this solution works or whether it has side effects.

But I would conclude that the problem with the PLL is solved. Sorry for the lengthy explanation above, but I hope the details might be helpful to someone.

Thank you very much!

Andreas

Kevin_Price · ‎03-11-2024

I can't comment with any authority about the issues you're seeing when configuring a Reference Clock for your 6738. I would only express *surprise* that you seem to need to do it for both tasks on a given device. I had thought that doing it just once would be enough -- my (admittedly limited) understanding is that configuring a Ref Clock would cause the board's internal timebase to PLL to the Ref Clock. Thereafter, *any* device that derives timing from that internal clock should be sync'ed to the Ref Clock.

-Kevin P

andiT0815 · ‎03-13-2024

Hi Kevin and other members of the NI community!

Indeed I agree it is surprising that we have to program the reference clock for the analogue task as well which runs on the same board as the counter. But this works.

However, we now face a new surprising issue when going to higher output rates: even without using any analogue or digital channels, i.e. just running the counter, our maximum continuous output rate is 2.5MHz. Above that we get the error "PyDAQmx.DAQmxFunctions.DAQError: Onboard device memory underflow.". Sometimes it runs a bit longer, sometimes a bit shorter, before the error occurs. We have also seen the error once at 2.5MHz. This means that the output data rate is faster than the rate at which the onboard FIFO of the PXIe-6738 card can be loaded with new samples from the computer memory. We use the PXIe-1073 chassis, which is supposed to have a maximum system bandwidth of 250MB/s, which according to my calculation should easily be sufficient: we use ticks to program the counter, i.e. the low and high times are given as 2x 32bit numbers, i.e. 2x4Bytes/sample. Running at 2.5MHz means 2.5Msps*2*4Bytes/sample = 20MB/s, which is more than 10x smaller than the data rate specified by National Instruments. So I do not understand why we cannot run the counter at higher rates.
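The estimate above can be written out as a short calculation. This is only a sketch of the arithmetic in the post; the function name and the constant are made up for illustration, and the 250MB/s figure is the "up to" number quoted for the chassis, not a measured value.

```python
# Sketch of the bandwidth estimate from the post (names are illustrative only).
CHASSIS_LIMIT_BYTES_PER_S = 250e6  # "up to 250MB/s" quoted for the PXIe-1073


def counter_stream_rate(pulse_rate_hz, values_per_pulse=2, bytes_per_value=4):
    """Bytes per second needed to stream (low, high) tick pairs to one counter."""
    return pulse_rate_hz * values_per_pulse * bytes_per_value


required = counter_stream_rate(2.5e6)            # 2.5 Msps * 2 * 4 bytes
headroom = CHASSIS_LIMIT_BYTES_PER_S / required  # quoted limit / required rate

print(f"required: {required / 1e6:.1f} MB/s, headroom: {headroom:.1f}x")
```

On these numbers the raw payload is far below the quoted bus limit, which is why the underflow is surprising if only bandwidth is considered.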

My suspicion is the very small internal (FIFO) memory size of 127 samples per counter of the PXIe-6738 card. Since it is so small, it needs to be refilled continuously and fast, and if there is some latency it might run empty before it can be filled again with new samples. But this is only my guess.

Another possibility is that the computer is busy doing other things on the bus (e.g. high Ethernet traffic), but we have not found anything suspicious. We have tested running in airplane mode, with the same result. Also, the CPU load increases only a little while running the tests. So far I have not found how to display IRQ counts (like "cat /proc/interrupts" on Linux), but the low CPU load already hints that there are no excessive IRQ requests.
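To put a number on how little latency such a small FIFO tolerates, here is a rough sketch. It assumes (my assumption, not something stated in the post or confirmed here) that one FIFO sample is consumed per output pulse and that the FIFO starts full; the names are illustrative.

```python
# Rough sketch: how long a full counter FIFO lasts with no refill at all.
# Assumption: one FIFO sample (a low/high pair) is consumed per output pulse.
FIFO_DEPTH_SAMPLES = 127  # per-counter FIFO size mentioned in the post


def fifo_drain_time_s(pulse_rate_hz, depth=FIFO_DEPTH_SAMPLES):
    """Time in seconds until a full FIFO runs empty if no new data arrives."""
    return depth / pulse_rate_hz


for rate_hz in (2.0e6, 2.5e6, 10e6):
    drain_us = fifo_drain_time_s(rate_hz) * 1e6
    print(f"{rate_hz / 1e6:4.1f} MHz -> FIFO empty after {drain_us:5.1f} us")
```

Under this assumption, any stall in the data transfer longer than a few tens of microseconds would already be enough to empty the FIFO at MHz pulse rates.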

We have already tested the functions DAQmxSetCODataXferMech and DAQmxSetCODataXferReqCond but without success.

The computer is relatively new, runs Windows 11 and has 32GB of memory available.

Either there is something we do not understand properly, or my rate calculation is wrong, although I think it is reasonable unless 10x more data has to be sent to the card.

It could also be a misunderstanding of mine regarding the stated data rates of the chassis. Unfortunately, the datasheet DOES NOT specify any data rate; the 250MB/s is given only on the overview page of the chassis, which says "up to 250MB/s PXI". So it's hard to say under which conditions this rate can be achieved.

Or there is some problem with our hardware, either the computer or the chassis or the cards?

Maybe we need to further customize the kernel driver/hardware to improve the DMA performance?

Any thoughts are appreciated!

Thanks in advance,

Andreas

Kevin_Price · ‎03-18-2024

Anything I say here will be pretty speculative, somewhere in the "semi-educated guess" realm.

Looking back at the code you posted in msg #1, it appears that you were defining a 5-pulse task buffer and leaving the task in its default mode of regeneration based on task buffer contents. This mode would require regular delivery of the task buffer contents across the PXIe bus to the DAQ board. With a rather small buffer, this is apt to need to happen much more frequently.

What to do? Well, here are several thoughts of things to *try*:

1. If you don't need to change the pulse specs mid-run, you could configure the task to do "onboard regeneration" (or some similar term). As long as the entire set of unique pulse specs can fit in the onboard FIFO, this option should probably be preferred.

2. Create a much bigger task buffer, even if that means replicating the 5-sample contents back-to-back-to-back N times. This at least gives DAQmx some options for delivering larger chunks of samples less often. I don't know if it *will*, but at least it'll become more possible.

3. DAQmx has a property that controls when and how often data is transferred from the task buffer to the device. It's called the "data transfer request condition" and the options are:

a. onboard memory less than full (the default setting. It tries to keep the onboard FIFO full, leading to more latency for changes but less likely to get underflow errors)

b. onboard memory half full (this reduces the *frequency* of bus traffic, delivering bigger chunks of data during each access time period)

c. onboard memory empty (this is more of a "just-in-time" condition used to minimize latency for changes but with higher risk of underflow errors)

-Kevin P

andiT0815 · ‎03-28-2024

Hi Kevin and NI community,

thanks for sharing your thoughts!

We have adapted our C-code (which I attach) to include digital output (DO) and analogue output (AO) channels using the counters as clock source (which we call "pseudo-clock"). We have done more testing and now we understand the problem with the memory underflow and the supposed low data transmission rate between the computer and the chassis.

It turned out that the memory underflow error occurs regardless of the number of channels used during the test. So it does not matter if one uses a single counter without any DO or AO channels or if one runs the test with all counters and all DO and AO channels which we have available. This means that the driver sends the data for all possible channels to the chassis regardless of how many channels are actually used. Therefore, my estimate of the transmission rate, which assumed that only data for the used channels is transmitted, was completely off. Taking all channels with the minimum possible data size, I would estimate 140MB/s at 2.0MHz output rate. This is still smaller than the 250MB/s of our chassis, but the difference is less than a factor of 2, and it just means that more data is sent than I had estimated. This is perfectly reasonable and I would consider the memory underflow problem as understood.

We find that with our PXIe-1073 chassis we can output data at only 2.0MHz when doing it continuously. In the attached C-code we demonstrate that one can also output data at higher rates, i.e. at 10MHz, which is the maximum specified for our digital output channels, but only for a short time, and one has to add some waiting time (depending on the output time at full rate) to allow the internal FIFO of the counters to be refilled. For example, one can generate 110 pulses at 10MHz but one has to wait a minimum of 300us after these pulses. This can be done indefinitely without getting the memory underflow error. For a smaller output rate or fewer pulses the waiting time is smaller. Unfortunately, due to the small FIFO size of the counters we cannot go above 110 pulses at 10MHz without getting the memory underflow error.

The memory transfer condition MEM_XFER_COND cannot be changed for the counters to any value other than DAQmx_Val_OnBrdMemNotFull, and the memory transfer mode MEM_XFER_MODE is already set to DMA.

This situation is not ideal for us, but having understood it we can design our pulse sequences accordingly. In the worst case we have to upgrade to a faster chassis. Keeping the sequence only in onboard memory or repeating the same sequence (memory regeneration) is not an option for us, since our arbitrary pulse sequences vary from experiment to experiment, can last for several seconds and contain many samples.

I would now close the issue. Thanks!

Andreas

Kevin_Price · ‎03-28-2024

Just looking to help. It's been a looooong time since I did much C and never with any DAQ devices. I tried looking over your code nonetheless, b/c a couple of the things you said didn't sound quite right to me. Specifically:

"It turned out that the memory underflow error occurs regardless of the number of channels used during the test. So it does not matter if one uses a single counter without any DO or AO channels or if one runs the test with all counters and all DO and AO channels which we have available. This means that the driver sends the data for all possible channels to the chassis regardless of how many channels are actually used."

I don't think that's the correct interpretation. The driver doesn't send data for channels that are not being used. I *think* the correct interpretation is that the counter output task is your weak link, such that even 1 counter task is enough to produce a buffer underflow. The # of DO or AO channels becomes irrelevant because the counter task alone is enough to cause underflow.

I *know* I don't have a full handle on your code, but there were a few things that seemed pretty unexpected.

I first noticed that you call the same create_samples() function for each task type - AO, DO, and CO. Given the likely distinct needs of each kind of output, that was pretty unexpected, though admittedly I didn't try to decipher all the details inside that function.

I also noticed discussion in the header comments about variable rate counter output (and thus variable rate sample clocks for AO, DO), but it seems that you *also* calculate your "num_samples" by multiplying a rate by a duration in seconds. That calculation *seems* to imply a constant output rate, in which case you wouldn't need a buffer of values for your CO task.

It also appears that you define your counter_rate as 10 MHz and duration as 5 seconds, such that you'd calculate num_samples=50e6 for your counter task.

That's when I figured I'd respond to the thread. There's a lot going on and I'm not entirely seeing how it all fits together. It seems that you want to use CO to generate variable-timing pulses with 10 nanosec resolution (you were initially concerned with "mere" 60 nanosec resolution), and use these pulses as a sample clock to update DO and AO signals. Despite this timing resolution, the DO and AO devices will require sample intervals of at least 100 and 1000 nanosec respectively. So that makes me wonder what 10 nanosec (or even 60 nanosec) resolution is going to accomplish.

The header comments also make it sound like you'd want to define very specific sample times within your 5 second duration when counter pulses should happen, such that the # of samples for CO should be very much less than 50e6.

Can you help me understand some details about the experiment you ran that led you to the following comment:

"For example, one can generate 110 pulses at 10MHz but one has to wait a minimum of 300us after these pulses. This can be done indefinitely without getting the memory underflow error."

Exactly how did you set up "110 pulses at 10 MHz"? Did you define a buffer of output values for CO that produced a constant pulse rate of 10 MHz? How big a buffer did you create? How did you determine and implement the 300 usec wait time after those pulses and before starting another set of 110? When you tried to generate more than 110 pulses at 10 MHz, what exactly happened? Did you get an error from the CO task? From the DO task? Both?

I'm asking a bunch of stuff because maybe, just maybe some of the constraints you're running into can be overcome by approaching certain key things differently. I'm not at all *sure* of that, but I'm still a little suspicious that it *might* be true.

-Kevin P

andiT0815 · ‎03-29-2024

Hi Kevin, thanks again for your time!

I agree that the weak link in our case must be the counter: it has to run at twice the frequency of the fastest output channel since it acts as the "sample clock" for the output channels, it requires 2x4 bytes per counter to program the low/high time in ticks, and it has the smallest onboard FIFO memory size of all the channels on the PXIe-6738 card. This card has 4 counters, of which always 2 are linked together, such that only 2 independent counters are available for our purpose. We have two 6738 cards and can produce continuous digital output pulses at a sample rate of 2MHz on all digital channels without getting the buffer underflow error. We can use all DO or AO channels simultaneously at this rate or at different rates without producing the error. However, as soon as we try to increase the sample rate above 2MHz, say to 2.5MHz, we observe the memory underflow error. Even using only a single counter (with or without DO and AO channels) does not help to avoid the error. If we can run 4 counters simultaneously at 2MHz DO sample rate, one would naively expect that two counters could run at 4MHz and a single counter at 8MHz if only the bandwidth were limiting. But we observe that a single counter cannot run at 2.5MHz continuously, although for a short time (discussed further below) we show it can run at 8.333MHz. This brings me to the conclusion that there must be a fixed bandwidth allocated per counter. I agree with you that my earlier interpretation of a fixed amount of data sent to the device was probably reading too much into it, and a fixed bandwidth per channel would have the same effect.

Regarding the example code: first, this is not our experiment, but it can be considered a benchmark test of the limitations of the hardware. As I wrote already, we program (in Python) our pulse sequences and analogue waveforms from shot to shot (where only small parts change) and they are neither continuous nor repetitive. We often do linear and exponential ramps (duration on the order of ms to s) of a few analogue channels at modest output rates (1-10kHz) where nearly no digital pulses occur, then there might be waiting times on the order of seconds with no or only a few outputs, and then we have tight bunches of some tens of digital pulses of varying length, from a few 100ns to a few us, where the exact timing can be important on the 100ns time scale. The number of samples strongly depends on the number of ramps we are doing and varies from a few 10k up to 10M.

The create_samples function just produces continuous output data, or bunches of output data with waiting time in between, for a given DURATION in seconds. It simply generates some triangular ramps for the analogue channels and a simple pattern for all digital channels. It is intended to generate samples such that something happens on the outputs which can be checked with the oscilloscope. Using one function is convenient since the logic is nearly identical for AO, DO and for the times used by the counters. The write_task function also uses similar logic for the different tasks; I purposely wanted to use one function to be sure every step done is the same, and it makes it easier to spot task-specific code.

For each digital and analogue device the output sample rate DO_RATE and AO_RATE can be specified and a counter can be assigned. For simplicity the sample rate is kept constant for each device in the example code. In the true experiment this is highly variable. Note that all rates which I am specifying here are sample rates for the DO and AO channels, while for the counter it is the output frequency. I.e. the counter runs at twice the frequency of the DO and AO channels. This is because it acts as the "sample clock" for the AO and DO channels.

To test how to overcome the 2MHz limit, the two times T_FULL and T_WAIT can be specified in the code. If they are set to 0, create_samples produces continuous samples for the given DURATION. If they are non-zero, T_FULL gives the time in seconds during which samples are produced at the given output rate, and T_WAIT gives the time in seconds during which no output is produced. create_samples repeats the T_FULL and T_WAIT sequences for the given DURATION of the test. This allows generating digital output samples at a sample rate of 8.333MHz for a given short time, which will empty the FIFO, and waiting afterwards to allow the FIFO to be refilled. By varying T_FULL and T_WAIT we can find the limits for avoiding the memory underflow error. We find that for a few high-frequency pulses we only need a short waiting time, while for more pulses we have to wait longer. This test also shows that the times do not change regardless of whether we use one or two counters at the same sample rate (we have not tested 4 counters). This shows that the counters can indeed run at 8.333MHz but are limited by a fixed data transmission bandwidth allocated to each of them.

About the timing and resolution. The counters can be programmed either in units of time in seconds or in units of ticks which is 1/100MHz = 10ns. We experimentally find that when they are programmed in seconds we get an error when we try to set times shorter than 60ns. This is the number I gave in my first post. Later we tested to program the times in ticks and magically the smallest tick one can set without error is 3, i.e. 30ns. So this is the smallest time we can use for the counter low and high times. Thus the smallest counter period is 60ns which is 16.667MHz. Since we can program in ticks our resolution is 10ns on the counter. We have tested single pulses with 30ns pulse width but we have not tested trains of pulses with this pulse width.
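The tick arithmetic above can be sketched as follows. The 3-tick minimum is the value found experimentally in this thread and is treated here as an assumption rather than a documented device limit; the function names are illustrative and not part of DAQmx.

```python
# Sketch of the tick arithmetic described above (names are illustrative only).
TIMEBASE_HZ = 100e6  # internal counter timebase, 1 tick = 10 ns
MIN_TICKS = 3        # smallest value accepted without error, per this thread


def seconds_to_ticks(t_s, timebase_hz=TIMEBASE_HZ, min_ticks=MIN_TICKS):
    """Convert a low or high time in seconds to whole timebase ticks."""
    ticks = round(t_s * timebase_hz)
    if ticks < min_ticks:
        raise ValueError(f"{t_s} s is {ticks} ticks, below the minimum of {min_ticks}")
    return ticks


def max_pulse_rate_hz(timebase_hz=TIMEBASE_HZ, min_ticks=MIN_TICKS):
    """Highest pulse rate when both low and high time sit at the minimum."""
    return timebase_hz / (2 * min_ticks)


print(seconds_to_ticks(30e-9))             # shortest allowed low or high time
print(f"{max_pulse_rate_hz() / 1e6:.3f}")  # maximum pulse rate in MHz
```

With 3 ticks each for low and high, the shortest period is 6 ticks = 60ns, which matches the 16.667MHz stated above.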

The digital channels can change state only on one edge of the counter output, i.e. rising or falling but not both, so the highest sampling rate would be 1/60ns = 16.667MHz. If I understand it correctly this would be out of spec for our DO channels and we have not tested such high rates, but 8.333MHz definitely works for a short time. Still, the timing resolution is given by the "sample clock", i.e. the counter, and it should remain 10ns. We should investigate this a bit more, since there might be some jitter on the order of 10ns when different internal clocks are used for counter and DO. But 10ns jitter is most likely no problem for our application.

The analogue channels cannot change faster than every 1us, but we do not need to wait an integer multiple of 1us to change the state and could still program them with 10ns resolution, although this is not really needed for our application.

About your question regarding how we tested the 110 pulses + 300us waiting time: it was done with the example code, setting DO_RATE = 10e6, T_FULL = 110*0.1e-6, T_WAIT = 3000*0.1e-6 and DURATION = 5s. Note that the true sample rate is 8.333MHz due to roundoff errors. It is important to understand that the sequence of 110 pulses and 300us waiting time is repeated many times in this test. The test creates 1,768,581 samples with about 90MB of memory usage. One could also set DURATION = 311e-6 to have just a single sequence. However, we wanted to observe the behavior for many repetitions such that loading/reloading of the FIFO is in a steady state. Close to the threshold the error comes randomly, and only with many repetitions or a long DURATION can one be sure it does not happen once in a while. Setting T_FULL = 120*0.1e-6 we could not find any T_WAIT for which the memory underflow error does not occur. In this case it happens almost immediately even for very short DURATION, and it occurs for all counters running at the nominal 10e6 rate. We do not get an error on the digital channels, but obviously the number of output samples is less than it should be, since the counter stopped working. This is the case where we are clearly limited by the small FIFO size of the counter. I cannot say why we cannot output 127 samples = FIFO size, but it might be related to some details in the hardware. Anyway, we are close to this limit.
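For orientation, the average pulse rate of this burst-and-wait pattern can be estimated from the nominal settings. This is only a sketch using the nominal 10MHz burst rate (the post notes the true rate is 8.333MHz); the function and variable names are illustrative and do not come from the attached code.

```python
# Sketch: average pulse rate of a burst/wait pattern from nominal settings.
def burst_stats(n_pulses, burst_rate_hz, t_wait_s, duration_s):
    """Return (cycle time, average pulse rate, approximate total pulses)."""
    t_full_s = n_pulses / burst_rate_hz  # time spent outputting at full rate
    cycle_s = t_full_s + t_wait_s        # one burst plus one waiting period
    avg_rate_hz = n_pulses / cycle_s     # long-term average pulse rate
    total_pulses = avg_rate_hz * duration_s
    return cycle_s, avg_rate_hz, total_pulses


cycle_s, avg_rate_hz, total_pulses = burst_stats(
    n_pulses=110, burst_rate_hz=10e6, t_wait_s=3000 * 0.1e-6, duration_s=5.0
)
print(f"cycle: {cycle_s * 1e6:.0f} us, average: {avg_rate_hz / 1e3:.1f} kHz, "
      f"total: {total_pulses:.0f} pulses")
```

With these nominal values the total comes out near the roughly 1.77 million samples reported for the 5s test, and the long-term average rate is well below the 2MHz continuous limit, so the burst length rather than the average rate appears to be the constraint.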

I hope I could now remove your doubts.

Andreas
