@michaeldrogalis Thank you for the pointers. Yes, they run consecutively. Sounds like we should consider external stores (kafka, redis, db, etc) rather than serializing large values
@aaron51 Yup! You have plenty of choices there.
will the onyx-seq plugin ensure that peers coordinate on the input segments and don't process a given segment n times (n being the number of peers associated with that input task)? In other words, if I'm passing the same list of segments to onyx-seq 3 times during task definition on 3 peers, will the peers coordinate to track which one has seen a given segment?
It won’t, though assuming you’re just passing it a static list, one input peer would likely be enough to distribute the work downstream. It’d be pretty easy to update onyx-seq to partition the work based on the number of peers, but I haven’t seen a reason to yet.
Alternatively if you inject the data via a lifecycle you can partition it based on the slot id and number of peers
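A rough sketch of that slot-based split as a plain function (the helper name and the modulo scheme are my own, not part of Onyx or onyx-seq):

```clojure
;; Hypothetical helper: given the full data seq, this peer's slot id,
;; and the number of peers on the task, return this peer's share.
;; A simple modulo scheme sends each segment to exactly one peer.
(defn partition-for-slot [data slot-id n-peers]
  (keep-indexed (fn [idx segment]
                  (when (= (mod idx n-peers) slot-id)
                    segment))
                data))

;; With 3 peers, slot 0 sees indices 0, 3, 6, ...
(partition-for-slot (map (fn [i] {:n i}) (range 9)) 0 3)
;; => ({:n 0} {:n 3} {:n 6})
```

Contiguous chunks (first third, second third, last third) would work just as well; modulo is only one way to make the partitions disjoint and complete.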
What if there's a lot of data to be put into the seq? Would it still work to put all of it in the lifecycle? I had the understanding that putting data in the lifecycle might have the unintended consequence of making your barriers really big
Yes, so what I meant by that is that you have a before-task-start fn that injects, say, (map (fn [i] {:n i}) (range 10000))
but instead you check :onyx.core/slot-id
inside that fn, and based on the slot, you would return differently partitioned ranges for each peer
that way each peer gets a different part of your data set
it also works if you know that you want to read, say, a bunch of S3 objects. Based on the particular slot the peer is on, you would have each peer’s injection return a different partition of the objects that you passed in
(e.g. first peer gets the first 1/3, second gets the second 1/3, third gets the last 1/3)
the idea works whether you’re generating the messages or passing in via the lifecycle
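Putting the pieces above together, a hedged sketch of such a lifecycle (the `:seq/seq` injection key and the fixed `n-peers` var are assumptions here; only `:onyx.core/slot-id` is taken directly from the conversation):

```clojure
;; Assumed number of peers on the input task; in a real job you'd
;; derive this from the task's :onyx/n-peers setting.
(def n-peers 3)

;; before-task-start fn: each peer inspects its own slot id in the
;; event map and injects only its partition of the generated data.
(defn inject-partitioned-seq [event lifecycle]
  (let [slot (:onyx.core/slot-id event)
        data (map (fn [i] {:n i}) (range 10000))]
    {:seq/seq (keep-indexed (fn [idx segment]
                              (when (= (mod idx n-peers) slot)
                                segment))
                            data)}))

(def in-calls
  {:lifecycle/before-task-start inject-partitioned-seq})
```

Because each peer only materializes its own slice, no single injection has to hold the whole data set, which is the point of the slot-id trick.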
Gotcha, ty