DIAdem


Can DIAdem work with very large, multiple, sparse CSV data sets?

Having read the manuals for DIAdem, and searched this forum, I'm still not sure DIAdem can do what we need, so here is my question.

 

We will be acquiring from about 10 channels in our test rig and saving the data to CSV files.

The sample rate is configurable for each channel, therefore, in order to keep file sizes down, we will either be saving the data:

a) to multiple files, each file containing channel data at a common sampling rate (partitioned data files), or

b) to a single file, leaving cells empty where a channel has not been acquired at that particular timestamp (sparse data file)

The tests will run for weeks, moving to a new file/set of files each hour. Therefore, there will be many files, totalling up to about 4.8 GB.

 

We will need to mine this data, looking for trends etc. and pulling out sections for plotting. Therefore, DIAdem will need to be able to consider the library of datafiles as 'one' dataset, understanding that each channel could be in a separate file (representing a different sampling rate), and that each file only contains up to one hour of a potentially many-hour test.

 

Is DIAdem able to handle very large datasets, spread across multiple files, without having to load them all into memory at once?

And can it cope with sparse CSV content?

 

As an example of "sparse", to be sure I'm explaining myself properly, the following represents three channels of data, channel 1 acquired at 50Hz, channel 2 at 25Hz, channel 3 at 100Hz.

Timestamp,Channel1,Channel2,Channel3

0.00,1.0000,0.1230,0.4500

0.01,,,0.4400

0.02,1.0500,,0.4300

0.03,,,0.4150

0.04,1.0300,0.1240,0.4170

0.05,,,0.4190          

 

This needs to be interpreted by DIAdem as:

Timestamp,Channel1,Channel2,Channel3

0.00,1.0000,0.1230,0.4500

0.01,1.0000,0.1230,0.4400

0.02,1.0500,0.1230,0.4300

0.03,1.0500,0.1230,0.4150

0.04,1.0300,0.1240,0.4170

0.05,1.0300,0.1240,0.4190 

 

It is important that DIAdem realises that the 'empty' cells simply mean no data was acquired, not NaN, zero or error. Can it be configured to interpret the data file this way?
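A minimal sketch of this hold-last-value (sample-and-hold) interpretation, using the example rows above. This is illustrative Python only, not DIAdem DataPlugin code:

```python
import csv
import io

# The sparse example from above: empty cells mean "no new sample".
SPARSE_CSV = """Timestamp,Channel1,Channel2,Channel3
0.00,1.0000,0.1230,0.4500
0.01,,,0.4400
0.02,1.0500,,0.4300
0.03,,,0.4150
0.04,1.0300,0.1240,0.4170
0.05,,,0.4190
"""

def forward_fill(text):
    """Hold the last acquired value wherever a cell is empty."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    last = [None] * (len(header) - 1)   # last seen value per channel
    filled = []
    for row in reader:
        for i, cell in enumerate(row[1:]):
            if cell != "":              # empty = not acquired, not NaN/zero
                last[i] = cell
        filled.append([row[0]] + list(last))
    return header, filled

header, rows = forward_fill(SPARSE_CSV)
print(rows[1])   # ['0.01', '1.0000', '0.1230', '0.4400']
```

The second table in the post is exactly what this produces: each channel simply carries its last acquired value forward until the next real sample arrives.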

Thoric (CLA, CLED, CTD and LabVIEW Champion)


Message 1 of 4
(4,996 Views)

Hey Thoric -

 

First off, congrats again on becoming the UK's first LabVIEW Champion!  Don't think we didn't hear about it over in this corner of the forum... Your reputation precedes you. :)  I apologize for the delayed response - I'm out of the office on international travel but wanted to get you an update, even if it isn't entirely thorough.

 

Suffice it to say that on both accounts, the answer is "yes, DIAdem can do that" - and to be quite frank, I haven't run across another data post-processing / management platform that can say the same.  The key to success here is twofold: the DataPlugin that will need to be written for your custom file format (*.CSV is generic, after all), and the DataFinder utility.  Allow me to explain.

 

This is the point where I would normally make a shameless plea for you to use TDMS instead of CSV.  I'm going to spare you this plea because I assume you know the benefits already, particularly in a use case such as this (with large data sets, binary is both smaller and faster; with disparate sampling rates, you could write waveform channels directly, which would represent time implicitly as merely t0 and dt).  Plus, it sounded like you were already committed to using *.CSV.

 

Since *.CSV is a generic file format, you'll be defining a custom implementation.  Once you narrow down your standardized way of saving your file, you'll need a DataPlugin specific to that standardized format.  DataPlugins teach the DataFinder how a given file format is structured - where to find the metadata, where to find the channels (etc).  They're written as small pieces of code, and as such can be intelligent in nature.  I've seen DataPlugins that view multiple files as one data source (the *.MME DataPlugin does this, for example), perform custom metadata calculations, interpret data in a specified way that differs from the way in which the data is literally stored to file, and so on.  You've got a ton of flexibility here, which speaks to the power of the technology.  Both of your options - (A) and (B) - can be accommodated.  In fact, there are a couple of options for handling the "sparse" nature of the data - at the moment I'm not prepared to do a performance evaluation of the ideas I have in my head, I simply know it can be done.
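As an illustration of what such a DataPlugin might conceptually do for option (A), here is a hedged Python sketch that merges rate-partitioned CSV files into one time-aligned view keyed on Timestamp. The file contents and channel names are invented for this example, and a real DataPlugin would be written against the DIAdem DataPlugin API, not in Python:

```python
import csv
import io

# Invented two-file example: one file per sampling rate, as in option (A).
FAST = "Timestamp,Channel3\n0.00,0.4500\n0.01,0.4400\n0.02,0.4300\n"
SLOW = "Timestamp,Channel1\n0.00,1.0000\n0.02,1.0500\n"

def read_table(text):
    """Return (channel names, {timestamp: row values}) for one file."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header[1:], {row[0]: row[1:] for row in reader}

def merge(texts):
    """Present several rate-partitioned files as one time-aligned table."""
    names_all, tables = [], []
    for text in texts:
        names, data = read_table(text)
        names_all.extend(names)
        tables.append((names, data))
    timestamps = sorted({t for _, data in tables for t in data}, key=float)
    rows = []
    for t in timestamps:
        row = [t]
        for names, data in tables:
            row.extend(data.get(t, [""] * len(names)))  # blank = not sampled
        rows.append(row)
    return ["Timestamp"] + names_all, rows

header, rows = merge([SLOW, FAST])
print(header)   # ['Timestamp', 'Channel1', 'Channel3']
```

Note that the merged view is itself "sparse" in exactly the sense described in the original question, which is why options (A) and (B) can be handled by the same interpretation logic.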

 

Once the DataPlugin exposes the structure of the file, DataFinder will index the metadata exposed in the file and build its DataFinder Index, the queryable database that gives you the searching and mining functionality you're looking for when you say "...we will need to mine this data, looking for trends etc., and pulling out sections for plotting."  You'll be able to do things like load isolated channel(s) from the larger data set (whether it is one or many files) with literally zero extra work; however, your mining flexibility will depend on the available metadata in the file.  Therefore, my general recommendation is:

 

  • If you know you're going to want to perform a search based upon a given piece of metadata, make sure you calculate it and put it in the file(s)
  • Calculate and store the aforementioned metadata with the same scope as the lowest level of time granularity you'll want in your trending and querying.  
    • For example, if you know you'll want to query/trend things on a per-hour basis, then calculate and store the metadata for each hour segment.
    • If you think you'll want to query/trend things on an every-30-minute basis, then calculate and store the metadata for each 30-minute segment.  
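The per-segment metadata recommendation above can be sketched as follows. This is a Python illustration only; the 3600-second segment length and the choice of min/max/mean statistics are assumptions for the example, not a DIAdem API:

```python
def segment_metadata(samples, seg_len=3600.0):
    """Pre-compute queryable stats per fixed time segment.

    samples: iterable of (timestamp_seconds, value) pairs.
    seg_len: segment length in seconds (3600 = per-hour granularity).
    """
    segments = {}
    for t, v in samples:
        # Bucket each sample into its segment index.
        segments.setdefault(int(t // seg_len), []).append(v)
    return {
        seg: {"min": min(vs), "max": max(vs), "mean": sum(vs) / len(vs)}
        for seg, vs in segments.items()
    }

# Three samples: two in hour 0, one in hour 1.
stats = segment_metadata([(0.0, 1.0), (1800.0, 3.0), (3600.0, 10.0)])
print(stats[0])   # {'min': 1.0, 'max': 3.0, 'mean': 2.0}
```

The point of the recommendation is simply that whatever you might later query on (per hour, per 30 minutes) must exist as stored metadata at that granularity before indexing.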

The last considerations I have for you, for now:

 

  • You won't be able to load the entire 4.8 GB data set into DIAdem memory at one time.  However, there are several tools built into DIAdem that nonetheless make it possible to trend across many files and manipulate large data sets.
  • Yes, you can easily concatenate segmented data files (for example, if you save one file each hour and you want to concatenate Hours 3-10) into one data set for processing - everybody always asks ;) .
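Conceptually, concatenating segmented files amounts to keeping one header and appending data rows in time order. A tiny Python illustration with invented hourly files (not DIAdem's built-in functionality):

```python
import csv
import io

# Invented hourly segments, each with an identical header row.
HOUR_3 = "Timestamp,Channel1\n10800.00,1.0100\n10800.02,1.0200\n"
HOUR_4 = "Timestamp,Channel1\n14400.00,1.0300\n"

def concatenate(texts):
    """Join hour-segment files: keep the first header, append all rows."""
    header, rows = None, []
    for text in texts:
        reader = csv.reader(io.StringIO(text))
        h = next(reader)          # read each file's header row
        header = header or h      # keep the first, skip the repeats
        rows.extend(reader)
    return header, rows

header, rows = concatenate([HOUR_3, HOUR_4])
print(len(rows))   # 3
```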

I expect you'll have followup questions - feel free to ask them, we're happy to help anytime.

Derrick S.
Product Manager
NI DIAdem
National Instruments
Message 2 of 4
(4,961 Views)

Derrick,

 

Thank you very much for your comprehensive reply. It's good news that DIAdem can perform so well; it certainly sounds like a powerful tool. And thank you for the congratulations - I wasn't aware I had such a widespread reputation. I hope it's a positive one? :)

 

You say you're not aware of another data-processing platform that can do the same. Our customer is considering another product which they already happen to have, called FAMOS. They aren't sure if it can do what we need, but given they have it already I suspect they will try it before considering buying another product. Plus, it has a free Viewer release which allows anyone to view the compiled output files, which is highly attractive to them.

 

Are you aware of FAMOS? If so, do you know if it can/cannot do what we need? I ask because, if you can clarify that FAMOS cannot, then we can justify promoting DIAdem to them. I'm not aware of its capabilities.

 

The project has advanced a little, and we now know we will be using separate files for groups of channels acquired at the same sample rate, using a 'key' index as a timestamp to allow for collation. Each file will be of length 'n', configurable for each test, but probably on the order of an hour of data each, with tests lasting potentially many weeks. Therefore some files will be many megabytes, whereas others might be less than 10 kB. Alongside all these files will be a single file containing the parameters of the test, such as operator name, test start and end times etc., which probably needs to be imported too.

 

I appreciate you're out of the office on business, so I'll patiently await your reply.

 

Again, many thanks,

 

Thoric

Thoric (CLA, CLED, CTD and LabVIEW Champion)


Message 3 of 4
(4,939 Views)

Hi Thoric -

 

Thanks for your patience.  It's definitely a positive reputation - your face was posted on our wall here for a couple of weeks in a row. ;)

 

I'm definitely aware of FAMOS, but I'm not comfortable talking about the details of my experiences here on the forum (feel free to email me directly if you'd like), especially because I'm in no way a technical expert on their product and its capabilities.  

 

Suffice it to say that I'm confident that DIAdem will be a more powerful approach and I'd be happy to help you put together whatever materials/demos are necessary to justify its consideration by your customer.  For starters, *if* FAMOS can load the data in the format you've specified (it can load CSV files, but I'm not sure how they handle the case when the data is spread across multiple files), then I still don't think you'll have any option for data mining and querying, and it sounded like that was a fundamental part of the system you're considering.  FAMOS will inevitably include use of the term "data management" on their website, but our definition of the term tends to be far more inclusive and robust...    

 

As soon as you can get ahold of a few example files, I'd recommend we sync up for a web-based demo using your files - it might help to better show off exactly what we can do.  Let me know if you're interested?

Derrick S.
Product Manager
NI DIAdem
National Instruments
Message 4 of 4
(4,917 Views)