GPU Computing

Beginning advice on GPU computing

Hi,

I was hoping I could get a little advice on computing with the GPU in the LabVIEW environment. We are computing a large number of FFTs, anywhere from 10^5 to 10^6, each about 100-200 elements long. Right now, with our current machine, this is done in a post-processing phase where time is not an issue. I was wondering whether GPU computing would let us do this more quickly and, if so, what kind of performance gains we could expect. Could this be done in perhaps 1 s? I understand there are a lot of factors involved, so right now I am mainly wondering whether this is worth pursuing. Also, how much experience is involved? I use LabVIEW on a daily basis, but what skills are necessary outside of LabVIEW?

Thanks

Message 1 of 11

Hi Jeff2,

I have been trying to implement the same thing in my lab. The only differences are the number of FFTs (~1000) and their lengths (1024-2048). I successfully built and tested a 64-bit DLL that does the FFT using CUDA in LabVIEW 2009 (64-bit) and compared its performance with that of the built-in LabVIEW FFT function. I found that at my target number of FFTs (1000), the CUDA DLL performed 10x faster than the built-in function. There might be room for improvement by optimizing the code.

I would expect some speed improvement for your combination of number of FFTs and FFT lengths, but I really can't say right now how much faster it would be. I could try running my code with that combination and get back to you so you have a rough idea of whether this is worth pursuing.

Regarding the experience and skills that are necessary outside of LabVIEW, it depends on how you would like to bring CUDA into LabVIEW. From what I know, you can either download and work with the "LVCUDA" package posted in the document section of this group, or you can write your own DLL and call it from LabVIEW as I have done.

I downloaded the LVCUDA package but couldn't get it to work because I am on a 64-bit platform, so I really can't provide you with any more information on it.

For writing a DLL, I used C with the CUDA extension (.cu) and compiled it in Visual Studio 2008. In this case, you only need to know a bit of C and can pick up the rest as you write your code. (That's what I did, at least.) The CUDA Programming Guide that comes with the toolkit provides a really good starting point if you are completely new to it.
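
To give a rough idea of what such a DLL can look like, here is a minimal sketch of a batched single-precision FFT exported for LabVIEW's Call Library Function node. This is not my exact code; the function name, error codes and the choice to allocate everything per call are just for illustration.

// fft_batch.cu - minimal sketch of a CUDA FFT routine callable from LabVIEW.
// Built as a 64-bit DLL, e.g. with something like:
//   nvcc --shared -o fft_batch.dll fft_batch.cu -lcufft
#include <cuda_runtime.h>
#include <cufft.h>

// LabVIEW passes the interleaved complex data (re, im, re, im, ...) as a float array.
extern "C" __declspec(dllexport)
int BatchFFT(float *inout, int fftLength, int batchCount)
{
    cufftComplex *d_data = NULL;
    size_t bytes = sizeof(cufftComplex) * (size_t)fftLength * batchCount;

    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) return -1;
    if (cudaMemcpy(d_data, inout, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d_data); return -2;
    }

    // One plan covers the whole batch of equal-length transforms.
    cufftHandle plan;
    if (cufftPlan1d(&plan, fftLength, CUFFT_C2C, batchCount) != CUFFT_SUCCESS) {
        cudaFree(d_data); return -3;
    }

    int status = 0;
    if (cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) != CUFFT_SUCCESS) status = -4;
    if (status == 0 &&
        cudaMemcpy(inout, d_data, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) status = -5;

    cufftDestroy(plan);
    cudaFree(d_data);
    return status;
}

In LabVIEW you would call BatchFFT through a Call Library Function node, passing the interleaved complex array by pointer along with the FFT length and batch count.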

I am by no means an expert, but hope that helped.

Cheers,

Message 2 of 11

I plan to comment on this thread in more detail, but feel like I should probably point out a few issues with the DLL-only approach presented here.

Let's first look at the algorithmic basis. The FFT in LabVIEW uses precomputed tables for each specific signal size being processed. That means the first call to an FFT of size N takes a performance hit to build the table; after that, the precomputed values are used to compute the FFT result. From what I can tell, the same is true of the cuFFT library for CUDA, although I believe any precomputed resources are allocated manually through the library API. So a reasonable performance comparison can be made without comparing two radically different algorithm styles (beyond the dramatic differences in instruction execution on the processors).
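
To make the "precomputed table" analogy concrete, here is a rough sketch of the cuFFT side (the function and variable names are mine, purely illustrative): creating the plan is the expensive build-the-table step, and every execution afterward reuses it.

// Sketch only: the cuFFT plan plays the role of LabVIEW's precomputed table.
#include <cufft.h>

void RepeatedFFT(cufftComplex *d_signals, int fftLength, int batchCount, int iterations)
{
    cufftHandle plan;

    // Expensive step, analogous to LabVIEW building its table on the first call of size N.
    cufftPlan1d(&plan, fftLength, CUFFT_C2C, batchCount);

    // Cheap steps: every execution reuses the precomputed plan.
    for (int i = 0; i < iterations; i++)
        cufftExecC2C(plan, d_signals, d_signals, CUFFT_FORWARD);

    cufftDestroy(plan);
}

(Here d_signals is assumed to already reside on the device; host transfers are omitted to keep the sketch focused on the plan.)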

Now for the more complicated part: resource management. Since each FFT size relies on a precomputed batch of values, there is a shared resource that persists from call to call on each compute device. On the CPU this isn't an issue, since LabVIEW executes code in the same memory space. However, the same is not true for the CUDA FFT. Herein lies the value of LVCUDA, because there is no flexible alternative in LabVIEW for managing the CUDA context.

Let's consider the two mechanisms that are available and cover their limitations:

  1. Configuring the Call Library Function node that executes the CUDA code in the DLL to run in the UI thread
    This maintains the host-thread-to-CUDA-runtime relationship that allows for persistent CUDA resources. The downside is that UI I/O can be dramatically affected by coupling all front panel updates on LabVIEW diagrams with possibly expensive CUDA calls. You can imagine that this approach could result in some execution jitter, as it pits the GPU device against itself (display requests versus CUDA execution) on the same host thread.

    What's perhaps worse in this scenario is that the jitter is not easy to detect unless you follow a manufacturing-line approach to testing/benchmarking, and it varies differently depending on download and upload sizes. Note that on a GPU, download is considered far more important in practice than upload, so there is no guarantee the two will behave equally well (or badly); in practice, hardware is optimized for download.

  2. Configuring the Call Library Function node to run in any thread
    Without external management of resources on the CUDA device, this configuration requires that all resources, including precomputed tables on the device, be allocated, computed and deallocated for every FFT call. The same is true of the buffers that store the intermediate FFT inputs and outputs. If this is not done, LabVIEW may use more than one thread to call the DLL over the course of execution, and when the FFT is called with cached resource information from a different thread, the CUDA call will fail with an 'unknown/invalid reference' error (see the sketch after this list).

    On the host side, every thread that executes the CUDA call creates an independent CUDA context. This may or may not include loading another copy of the CUDA runtime - I have not researched the exact behavior. In LabVIEW, at least one thread is allocated for each core (including virtual cores through hyperthreaded CPUs), this batch of threads is duplicated for each possible VI priority (of which there are 6), AND this bundle is created for EACH execution system (again, there are 6 to choose from).

    If you thought it wasn't likely that a single DLL call in an application running on a dual-core system would run on more than one thread, you might want to reconsider those odds.
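
As a sketch of the failure mode in option #2 (hypothetical code, not taken from LVCUDA): caching the plan in a static variable works only while LabVIEW happens to call the DLL from the thread, and therefore the CUDA context, that created it.

// Hypothetical sketch of the hazard described above (not LVCUDA code).
// Host/device transfers are omitted to keep the focus on the cached plan.
#include <cufft.h>

static cufftHandle g_plan  = 0;   // cached "precomputed table" on the GPU
static int         g_size  = 0;
static int         g_batch = 0;

extern "C" __declspec(dllexport)
int CachedFFT(cufftComplex *d_data, int fftLength, int batchCount)
{
    // Valid only if every call arrives on the thread (and CUDA context) that
    // created g_plan. Under 'run in any thread', LabVIEW may invoke this from
    // a different thread, and the cached handle is then no longer valid there.
    if (g_plan == 0 || g_size != fftLength || g_batch != batchCount) {
        if (g_plan != 0) cufftDestroy(g_plan);
        if (cufftPlan1d(&g_plan, fftLength, CUFFT_C2C, batchCount) != CUFFT_SUCCESS)
            return -1;
        g_size  = fftLength;
        g_batch = batchCount;
    }
    return (cufftExecC2C(g_plan, d_data, d_data, CUFFT_FORWARD) == CUFFT_SUCCESS) ? 0 : -2;
}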

I still plan to address the original question - essentially "What's a LabVIEW customer got to do to get some GPU computing around here, anyway?" - but thought I should point out a few key issues. This is by no means a complete coverage of the topic, and it should not be construed as a knock against the integration approach you are using. It is certainly viable for some applications, and I implemented some of the original prototypes using these methods.

NOTE: The Installation thread has touched on another solution, relevant to CVI, that corresponds to model #1 above.

Message 3 of 11

In my opinion, your question required more than just a quick email response. So, I've written a short document that covers the basic components of GPU computing from a LabVIEW user perspective. The existing documentation treats the subject as if the user already has code running on a GPU.

While this is by no means a complete coverage of the topic, it covers the most important details that I encountered when I started using GPUs from LabVIEW.

Please feel free to ask further questions and make comments:

What's a LabVIEW Developer Got To Do To Get Some GPU Computing Around Here, Anyway?

Message was edited by: MathGuy - Attachment moved to general documents location.

Message 4 of 11

Thanks for the documentation and advice. It looks to be very informative for me and others as well. I believe we have decided to explore the GPU computing route, so I'm sure it will be very helpful, especially since keeping things in the LabVIEW environment will be much simpler in the end. Hopefully I will be able to post some positive results and/or feedback in the future. Thanks again.

Message 5 of 11

I have been working on developing code for GPU computing to process FFTs. It's been slow going, mostly because I'm learning C in the process. Right now I'm looking at loading the data and using FFTW, since CUDA's FFT library mirrors it. I have been investigating other alternatives as well, partly because there may be a need to interface the processing with other environments for user interfaces and acquisition, and partly because it may speed the process up for me.
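
As a point of reference, a minimal FFTW skeleton looks something like the sketch below (sizes and names purely illustrative). The cuFFT version mirrors it closely: fftw_plan_dft_1d/fftw_execute/fftw_destroy_plan become cufftPlan1d/cufftExecC2C/cufftDestroy, plus the host-to-device copies.

// Minimal FFTW sketch (illustrative): plan once, execute, clean up.
// Compile as C and link with -lfftw3.
#include <fftw3.h>

int main(void)
{
    const int n = 128;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    // Planning is the expensive, one-time step (like cufftPlan1d in cuFFT).
    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; i++) { in[i][0] = (double)i; in[i][1] = 0.0; }

    fftw_execute(plan);          // analogous to cufftExecC2C

    fftw_destroy_plan(plan);     // analogous to cufftDestroy
    fftw_free(in);
    fftw_free(out);
    return 0;
}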

However, I had a question about the prospect of getting GPU computing in LabVIEW. MATLAB currently seems to be working on incorporating CUDA into their environment, apparently requiring little CUDA knowledge on the user's part. There are products that already do this as an add-on:

http://gp-you.org/

http://www.accelereyes.com/

My question is: since LabVIEW can run MATLAB code, would it be possible to get GPU processing by running the MATLAB code that handles the GPU computing with these products installed, or is there a fundamental difference between running the code in the MATLAB environment versus the LabVIEW environment? Would LabVIEW even be able to see these add-on packages?

Message 6 of 11

Hi,

LabVIEW can call MATLAB functions in two ways: using MathScript (very basic MATLAB) and using the MATLAB script node (I think that's the name), which needs a MATLAB command window running in the background. The latter is more extensive and can see other packages (I used the dipImage library). So try it out. This is not the best option, but in your case it may be the easiest.

You could learn C and CUDA, but it is probably just a matter of time before NI and MathWorks make GPU support official, so maybe it's worth waiting.

Message 7 of 11

Let's walk through the requirements for a MATLAB-based tool to run from MathScript.

  1. The tool must manage the host thread used to execute the GPU calls. This is a requirement of CUDA when resources on the GPU are cached between function calls.
  2. If the tool does not manage the host thread, then the MATLAB environment must call functions in external libraries using the same execution thread.

I do not know if #1 is true. If it is, then it is possible this tool would work from MathScript. If the tool vendor doesn't specify this thread support, it can be difficult to test: even though threads can be swapped out by the OS at any time, it happens less often than you might think. I've had to run tests for hundreds of iterations to expose the issue.

If #2 is the case, then the tool will not work out of the box. This is due to LabVIEW's multi-threaded nature - it calls functions in external libraries from one of many threads. The same would be true for functions in the toolkit called from MathScript. However, creating a MathScript plug-in that wraps these functions in VIs configured in a specific way might work.

There are two ways I've used to get a VI to run in a given thread. The easiest is setting the execution system to 'user interface', but this pits GPU computing against updates to objects on the VI front panels. If the front panels have complicated graphics (e.g. 3D surface plots), the result is larger execution jitter, depending on how much data is being transferred to and from the GPU device.

The other option is more complicated. Both of these alternatives are covered on the last two pages of the online document, http://decibel.ni.com/content/docs/DOC-7707 - What's a LabVIEW Developer Got To Do To Get Some GPU Computing Around Here, Anyway?

Message 8 of 11

Hi,

I'm interested in using LabVIEW GPU computing.

I understand that LVCUDA enables LabVIEW users to interface with the CUDA runtime and library functions, but that all functions to be executed on the GPU have to be built as a C DLL.

I can imagine that it could be possible for developers to use only LabVIEW if NI provided a toolkit to compile the code for the GPU, as is the case for FPGAs, for instance (as I'm not a software specialist, this may be a stupid question).

Is such a toolkit in development?

Message 9 of 11

Your summary of LVCUDA is accurate - it is only an interface to external code compiled by other tools (such as NVIDIA's NVCC compiler). The option to generate this code from a LabVIEW diagram could benefit those familiar with programming in G. This is especially the case for those not versed in the textual languages that currently recognize and produce CUDA-compliant executables.

Currently LabVIEW's compiler understands (in a code-generation sense) x86-based CPUs and Xilinx FPGAs. Support for FPGAs was a significant development effort over several years. Since GPUs represent a processor architecture different from these, generating code for execution on that target class is non-trivial.

We have hopes that our interface to CUDA-based code will give us insight into future ways of incorporating GPU computing into LabVIEW applications. The option you're asking about here - automatic deployment of a (portion of a) G diagram directly to a GPU target - is not a committed LabVIEW feature at this time.

This doesn't mean that the current NILabs module for GPU computing will not (or cannot) evolve.

Message 10 of 11