Total shot in the dark. I want to build a platform capable of attempting to parse a malformed CSV file from persistent storage, while returning all of the best guess fixes for the malformed file. This file would be too large to fit into the main memory of any peer.
In this case, the order of bytes matters. Line delimiting is not reliable when a cell has a CR or LF inside it and the cell is improperly quoted. Hell, even converting the byte stream into a stream of codepoints means the file can't be randomly split into segments.
How does this ordering get preserved on deserialization? And what's the best practice for getting neighboring bytes (with offset numbers) onto the same virtual peer? Any experience with how the presumably unbounded combine phases would behave?
The idea, of course, is that a properly written CSV would yield a seq of seqs.
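For a concrete picture of the happy path, a well-formed file already gives that shape with, e.g., clojure.data.csv (just an illustration, not necessarily the parser we'd end up using):

```clojure
(require '[clojure.data.csv :as csv])

;; A well-formed CSV parses straight into a seq of seqs (vectors of strings):
(csv/read-csv "a,b,c\n1,2,3\n")
;; => (["a" "b" "c"] ["1" "2" "3"])
```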
Any thoughts or experiences with this?
One thing I noticed is that Onyx's stream-input bias shows when the AWS S3 plugin doesn't even address deserialization, at least on my reading of the code. I may be totally wrong here. I hope I am.
Onyx is a distributed-systems solution for processing data over time. As such, it wouldn't seem to address the issues you're raising, though you're much closer to the problem than I am. There are several CSV-processing libs in Clojure that make lazy loading easy, which might solve the in-memory issues. Also, it's usually worth scaling memory up, unless you have uptime guarantees to your customers. I'm a bit confused about the rest of what you're describing, but that's likely because I haven't thought about CSVs for a while. Hope that helps. @me1238
Thanks, @drewverlee! To clarify, "data over time" means small, internally complete values that arrive at different as-of times?
No relationship to small (or any size). Not sure what "internally complete" would mean. "Arrive at different as-of times" rings closer to what I was describing, in that the data people process typically contains timestamps, and they're interested in "what" in relation to "when". This is true for all computation of course, but Onyx gives more control over describing it. In some contexts that control isn't worth the extra steps. E.g. if my boss asks me for a weekly report and wants me to read it to him, setting up Onyx is a total waste because I have to trigger the computation at his convenience; he isn't interested in watching a real-time dashboard. Put another way, Onyx can start a computer (or computers) that stays up and running over time (days, weeks, forever) such that as more data arrives it is processed automatically (as opposed to a human in the loop triggering an execution). No one part of the Onyx/Flink/Spark solutions is novel in that regard, but they combine a number of trade-offs that are desirable at a certain scale. Does that make sense? Am I moving closer to or further from being helpful 🙂 @me1238
I apologize. Internally complete is bad terminology. I mean fully bound, as in nothing in the value is free.
I could model the seq of bytes from a file as fully bound by including the byte offset of each byte, if I were willing to pay the cost of allocating a segment per byte. I wouldn't want to do that for performance reasons, but if I did, then as long as there were a shuffle strategy that brought neighboring bytes to the same virtual peer, those bytes could be combined into bigger segments. Does that make sense?
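Roughly what I'm picturing for the combine step, written as plain Clojure rather than actual Onyx API (the chunk shape, the names, and the use of strings instead of byte arrays are all just for illustration):

```clojure
;; Each segment carries its starting byte offset, so a combine step can stitch
;; adjacent chunks back together in order, regardless of arrival order.
(defn combine-chunks [chunks]
  (reduce (fn [acc {:keys [offset bytes]}]
            (let [{prev-off :offset prev-bytes :bytes} (peek acc)]
              (if (and prev-off (= offset (+ prev-off (count prev-bytes))))
                ;; Adjacent to the previous chunk: merge them.
                (conj (pop acc) {:offset prev-off :bytes (str prev-bytes bytes)})
                ;; Not adjacent (yet): keep it as its own chunk.
                (conj acc {:offset offset :bytes bytes}))))
          []
          (sort-by :offset chunks)))

(combine-chunks [{:offset 4 :bytes "e"} {:offset 0 :bytes "ab"} {:offset 2 :bytes "cd"}])
;; => [{:offset 0 :bytes "abcde"}]
```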
I guess my question at this point is whether the responsibility for maintaining timestamps lies with the data-recording process? It seems like Onyx itself tries to avoid this responsibility?
And I say small because I assume Onyx couldn't handle a terabyte file as a single segment?
No need to apologize; I'm likely just confused in a different way. I'm mostly just fishing to make sure Onyx makes sense here, which is presumptuous. My concern is that CSVs tend to be sub-100 GB and Onyx tends to be useful at much larger scales.
> terabyte file
OK, that's much larger than I assumed. Is this computation a one-time deal, or is the CSV growing? Is the information set being added to all the time? I'm also confused because you're talking about bytes, and from what I recall as a user of Onyx, that's well below the abstraction layer. As in, I don't believe you need to worry about specifying how the data is broken apart beyond picking the windows, triggers, etc.
> maintaining timestamps lies with the data-recording process?
I believe Onyx handles the physical organization of which computer gets what data.
What wouldn't work about using https://github.com/clojure/data.csv#laziness to lazily read the CSV and write the fixes out to a file? I'm guessing the individual corrections require only a super-small subset of the data to be held in memory.
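Something along these lines is what I have in mind, untested, and assuming a known column count plus a hypothetical fix-row function standing in for whatever correction logic you actually need:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Hypothetical fix: pad or truncate each row to the expected width.
(defn fix-row [expected-width row]
  (vec (take expected-width (concat row (repeat "")))))

(defn fix-csv-file [in-path out-path expected-width]
  (with-open [r (io/reader in-path)
              w (io/writer out-path)]
    ;; read-csv is lazy, and write-csv consumes it row by row inside with-open,
    ;; so only a small window of the file is in memory at any time.
    (->> (csv/read-csv r)
         (map #(fix-row expected-width %))
         (csv/write-csv w))))
```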
Thanks so much. It depends how degenerate the file is. Sometimes a malformed file causes its entire contents to be loaded as one row.
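For example, one stray opening quote makes a conforming parser read across line breaks until it hits the next quote, collapsing several physical lines into a single row:

```clojure
(require '[clojure.data.csv :as csv])

(csv/read-csv "id,note\n1,\"oops\n2,fine\n3,fine\"\n")
;; => (["id" "note"] ["1" "oops\n2,fine\n3,fine"])
```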
It doesn't seem like Onyx is a good fit for us. Thanks!
> Sometimes a malformed file causes its entire contents to be loaded as one row.
Interesting! Yeah, maybe you can build something that starts processing the file as a CSV and, when it runs into issues, outputs the data that causes the problem. It sounds like you might have to do a lot of manual tinkering. My advice is to try one computer and lazy loading before going distributed; going distributed tends to mean, among other things, that you need fault tolerance. Which is why I was asking the questions I was; I was worried Onyx was going to make things more challenging.
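A single-machine triage sketch of what I mean, assuming (for simplicity, and wrongly for your worst cases) one record per physical line, so anything with embedded newlines or the wrong width just lands in a reject file for the manual tinkering; triage-csv, expected-width, and the paths are all made up:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn triage-csv [in-path good-path bad-path expected-width]
  (with-open [r    (io/reader in-path)
              good (io/writer good-path)
              bad  (io/writer bad-path)]
    ;; line-seq is lazy, so the file is streamed one physical line at a time.
    (doseq [line (line-seq r)]
      (let [row (try (first (csv/read-csv line))
                     (catch Exception _ nil))]
        (if (and row (= expected-width (count row)))
          ;; Parsed cleanly with the expected shape: pass it through.
          (csv/write-csv good [row])
          ;; Anything else goes to the reject file for inspection.
          (do (.write bad line)
              (.write bad "\n")))))))
```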