Actor Framework Discussions


AF not suitable for working with big data sets

So far I am familiar with the 'new' Channeled Message Handler and have already set up some modules using it.

I am still looking for a nice way to share my acquired data between my modules (post-processing, export, acquisition).

So far I have used FGVs (functional global variables), which hurt reusability. Although the modules run asynchronously, the actual steps happen synchronously:

import -> configure/manipulate all devices -> acquisition -> post-processing -> export

 

Now I want to learn AF and set up a system where I can compose different devices/configs into a set of studies more dynamically.

 

I read that every Actor should hold its own copy of data because of its async philosophy. Is this a show stopper for AF when working on big datasets? How should I communicate datasets of several GB between different actors without copying them? I thought about GOOP4.

 

Message 1 of 7

AF is fine for this sort of thing. If you truly need single copies of the data to hang around, send DVRs or something. "Every Actor has its own independent copy of the data" isn't a hard requirement, more like a (very) strong suggestion. If you need multiple copies of the data, then you need multiple copies of the data. Think about something like a database connected to an Actor-based program: you don't duplicate the entire thing for each Actor.

 

Your problem will be in making sure one Actor isn't modifying the data while another one reads it in. One potential solution here is just regular ol' Objects. You can use GOOP to make singleton classes or you can roll your own, and send by-ref Objects to each Actor that needs it.
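
To make that concrete, here is a rough Python sketch of the idea (Python only stands in for what a DVR or a by-ref GOOP object gives you in LabVIEW; SharedDataset and its methods are invented for illustration). One copy of the data lives in a holder, access is guarded by a lock, and every actor is handed the same reference.

import threading
import numpy as np

class SharedDataset:
    """One in-memory copy of the acquired data, guarded by a lock.
    Rough text-language stand-in for a DVR / by-ref GOOP object."""

    def __init__(self, data):
        self._data = data
        self._lock = threading.Lock()

    def read(self, start, stop):
        # Copy out only the slice a consumer actually needs.
        with self._lock:
            return self._data[start:stop].copy()

    def overwrite(self, start, chunk):
        # In practice only the owning actor should call this.
        with self._lock:
            self._data[start:start + len(chunk)] = chunk

# Every actor receives the same SharedDataset reference in a message,
# so the multi-GB array itself is never duplicated.
dataset = SharedDataset(np.zeros(1_000_000))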

 

One thing to consider, though: does each Actor actually need to store the data? What exactly are you doing with the data?

 

For example, I had a project taking data from two multifunction DAQ cards continuously for about a month, monitoring very small but very sudden changes in a voltage measurement. I was reading two cards at 1 MS/s, both doubles, so that's 16 MB/sec of data coming in. I didn't need to log all of this data, but I did need to do some processing on it. I just used regular messages to pass data acquired from my DAQ cards up the stream. Each Actor that received a chunk of data operated on it, then sent out messages with the new information that Actor created. IIRC, that one needed to measure peak to peak voltage, an average value, and some time information. It all worked fine and only used like 500 MB of RAM to do so.
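
For illustration, here is a minimal Python sketch of that kind of chunked pipeline (queue.Queue stands in for the Actor Framework message queues, and all names are invented): each chunk arrives as a message, is reduced to a few summary numbers, and only those small results travel on.

import queue
import threading
import numpy as np

data_q = queue.Queue()     # stands in for the analysis actor's message queue
result_q = queue.Queue()   # stands in for messages sent further upstream

def analysis_actor():
    """Receives one chunk per message, computes summary values,
    forwards only the small results, and lets the raw chunk go."""
    while True:
        chunk = data_q.get()
        if chunk is None:          # shutdown message
            break
        result_q.put({
            "peak_to_peak": float(chunk.max() - chunk.min()),
            "mean": float(chunk.mean()),
            "n_samples": len(chunk),
        })

worker = threading.Thread(target=analysis_actor, daemon=True)
worker.start()

# 2 channels * 1 MS/s * 8 bytes per double = 16 MB/s, sent as e.g. 100 ms chunks
for _ in range(10):
    data_q.put(np.random.randn(100_000))
data_q.put(None)
worker.join()

while not result_q.empty():
    print(result_q.get())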

 

Now, if you have several GB that you need to process, that's different from a stream of incoming data, but my point is that regular messages do just fine with a lot of data coming through them, and that it's not against the rules to share references to big hunks of data.

 

By combining a Singleton object with multiple asynchronous Actors, you can let the Singleton handle the access (by making your function calls non-reentrant, for one) and each Actor can therefore get access to it immediately when the resource is available. Thus, each Actor only blocks while waiting for a resource that it can't continue without. I hope that makes sense.
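
Continuing the hypothetical SharedDataset sketch above, here is a short usage example of that blocking behaviour: each consumer thread (standing in for an actor) blocks only while another one holds the lock.

import threading

def exporter(shared, start, stop):
    # Blocks only while another actor holds the lock.
    block = shared.read(start, stop)
    print(f"exporting {len(block)} samples")

def post_processor(shared, start, stop):
    block = shared.read(start, stop)
    print(f"mean of slice: {block.mean():.3f}")

threads = [
    threading.Thread(target=exporter, args=(dataset, 0, 500_000)),
    threading.Thread(target=post_processor, args=(dataset, 500_000, 1_000_000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()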

Message 2 of 7
I am not sure if I have to go with the singleton. If I put a GOOP4 object into a message, it is shared by reference out of the box, right? A singleton feels like it would limit flexibility and scalability.
Message 3 of 7

Why is each actor keeping a copy of everything?

 

If you're talking about a lot of processing-pipeline-type stuff, I don't see why each one needs to keep data around. Class private data would hold processing config/params/state; the actor acts on data that comes in on messages and either passes it along to its next destination or stores some result. Bus stops along the route don't necessarily mean copies. Storing extra copies of things sounds like a design issue that could be overcome.
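
A tiny Python sketch of that idea, with invented names: the actor's private data is only configuration, and the dataset merely passes through on its way to the next stop.

class ScalingStage:
    """Pipeline actor: private data holds only configuration,
    never the dataset itself."""

    def __init__(self, gain, offset, send_next):
        self.gain = gain          # config lives in the actor...
        self.offset = offset
        self.send_next = send_next

    def handle_data_msg(self, chunk):
        # ...while the data just passes through to the next destination.
        self.send_next(chunk * self.gain + self.offset)

stage = ScalingStage(gain=2.0, offset=0.5, send_next=print)
stage.handle_data_msg(3.0)   # prints 6.5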

~ The wizard formerly known as DerrickB ~
Gradatim Ferociter
Message 4 of 7

@Quiztus2 wrote:
I am not sure if I have to go with the singleton. If I put a GOOP4 object into a message, it is shared by reference out of the box, right? A singleton feels like it would limit flexibility and scalability.

You're correct, I flubbed my terms a bit on Friday. I suppose I was thinking "singleton" in terms of "one place to access the data" but of course you could do that to multiple data sets, which I wasn't thinking about.

 

I don't use GOOP4 but as long as it handles by-ref stuff for you then yeah, you're fine.

Message 5 of 7

When talking about big data sets, there is no other way than to keep only one copy of them.

There are only a few more or less dirty strategies, differing in their point of view on the data:

 

1) PUBLIC DATA

You can share the data with all actors who need it by sending a reference.
=> Sharing internal data goes directly against the actor concept (who is responsible for data consistency?),
BUT for me this is acceptable as long as only the owner can write (which is also technically enforceable).

 

2) MSG DATA

You can send the only copy from actor to actor if the processing is a simple chain (as you mentioned it)

=> this is great if you need to "add calculated columns" because it allows pipelining

BUT you have to be very careful with references created in a different actor, because the memory allocation is lost as soon as the actor that created them stops.

 

3) INTERNAL DATA

Only the data-owning actor can access and manipulate the data (DB actor concept already mentioned)

=> this could be seen as a clean OOP concept

BUT the more complicated the requests you have to implement, the more complicated (and slower) the DB actor becomes, OR the more data it copies out, OR the more functionality it implements that really belongs to the requester. Another disadvantage is that this actor can become a bottleneck (a sketch of this data-owner pattern follows after 3b).

 

3b) If you pack the data into a special class, you can separate the implementation of the requestors' functionality by interfaces.
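
As mentioned under 3), here is a minimal Python sketch of the data-owner (DB actor) pattern; queue.Queue stands in for actor message queues and all names are invented. Only the owner touches the dataset, and requesters get back just the small results they asked for.

import queue
import threading
import numpy as np

class DataOwnerActor:
    """Only this actor touches the dataset; other actors send
    requests and receive small results instead of the whole array."""

    def __init__(self, data):
        self._data = data                 # the single copy lives here
        self.inbox = queue.Queue()

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:               # shutdown message
                break
            kind, args, reply_q = msg
            if kind == "mean_of_slice":
                start, stop = args
                reply_q.put(float(self._data[start:stop].mean()))
            elif kind == "slice_copy":
                start, stop = args
                reply_q.put(self._data[start:stop].copy())

owner = DataOwnerActor(np.arange(1_000_000, dtype=float))
threading.Thread(target=owner.run, daemon=True).start()

reply = queue.Queue()
owner.inbox.put(("mean_of_slice", (0, 1000), reply))
print(reply.get())                        # requester gets a scalar, not GBs
owner.inbox.put(None)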

 

Message 6 of 7

Note: if you are working with data in the "several GB" range, you should consider combining LabVIEW with a technology designed for storing and querying large data, such as SQLite, MySQL, HDF5, etc.
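
For example, here is a minimal sketch using Python's built-in sqlite3 module (the same idea applies when calling SQLite from LabVIEW through a toolkit; the file name and table layout are invented): chunks go to disk as they are acquired, and any consumer queries back only the piece it needs.

import sqlite3
import numpy as np

# Store acquired chunks on disk and query back only what is needed,
# instead of holding several GB in every actor's memory.
conn = sqlite3.connect("acquisition.db")
conn.execute("""CREATE TABLE IF NOT EXISTS samples (
    channel INTEGER, t0 REAL, dt REAL, data BLOB)""")

chunk = np.random.randn(100_000)
conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
             (0, 0.0, 1e-6, chunk.tobytes()))
conn.commit()

# Later, any actor can pull back just the chunk it needs.
row = conn.execute("SELECT data FROM samples WHERE channel = 0").fetchone()
restored = np.frombuffer(row[0])
print(restored.shape)
conn.close()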

Message 7 of 7