onyx

FYI: alternative Onyx :onyx: chat is at <https://gitter.im/onyx-platform/onyx> ; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
aaron51 2018-06-15T03:21:05.000011Z

@michaeldrogalis Thank you for the pointers. Yes, they run consecutively. Sounds like we should consider external stores (kafka, redis, db, etc) rather than serializing large values

michaeldrogalis 2018-06-15T13:43:14.000537Z

@aaron51 Yup! You have plenty of choices there.

dbernal 2018-06-15T17:08:32.000450Z

will the onyx seq plugin ensure that peers are coordinating the input segments and not processing a unique segment n-times (n being the number of peers associated with that input task)? In other words, if I'm passing a list of segments to onyx-seq 3 times during task definition on 3 peers, will the peers coordinate to see which one has seen a given segment?

lucasbradstreet 2018-06-15T19:08:00.000215Z

It won’t, though assuming you’re just passing it a static list, one input peer would likely be enough to distribute the work downstream. It’d be pretty easy to update onyx-seq to partition the work based on the number of peers but I haven’t seen a reason to yet

lucasbradstreet 2018-06-15T19:16:45.000287Z

Alternatively if you inject the data via a lifecycle you can partition it based on the slot id and number of peers

dbernal 2018-06-15T22:48:25.000197Z

What if there's a lot of data to be put into the seq? Would it still work to put all of it in the lifecycle? I had the understanding that putting data in the lifecycle might have the unintended consequence of making your barriers really big

lucasbradstreet 2018-06-15T22:52:55.000071Z

Yes, so what I meant by that is that you have a before-task-start fn that injects say (map (fn[i] {:n i}) (range 10000))

lucasbradstreet 2018-06-15T22:53:35.000164Z

but instead you check :onyx.core/slot-id inside that fn, and based on the slot, you would return differently partitioned ranges for each peer

lucasbradstreet 2018-06-15T22:53:41.000229Z

that way each peer gets a different part of your data set
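A minimal sketch of the slot-based injection described above, assuming the lifecycle's before-task-start function receives the event map and the lifecycle map and returns a map to merge into the event. The helper names, the `:example/n-peers` and `:example/total` lifecycle keys, and the use of `:seq/seq` as the key the seq plugin reads from are assumptions made for illustration; only `:onyx.core/slot-id` comes from the discussion. Each slot takes every n-th index, so the peers cover the range without overlap.

```clojure
;; Hypothetical helper: the segments one peer should emit.
;; Starts at the peer's slot id and strides by the number of peers,
;; so slots 0, 1, 2 of 3 get indices 0,3,6.. / 1,4,7.. / 2,5,8..
(defn segments-for-slot
  [slot-id n-peers total]
  (map (fn [i] {:n i})
       (range slot-id total n-peers)))

;; Hypothetical before-task-start fn. The :example/* keys are made up
;; for this sketch, and :seq/seq is assumed to be the key the seq
;; plugin reads its input from.
(defn inject-partitioned-seq
  [event lifecycle]
  (let [slot-id (:onyx.core/slot-id event)
        n-peers (:example/n-peers lifecycle)
        total   (:example/total lifecycle)]
    {:seq/seq (segments-for-slot slot-id n-peers total)}))
```

Because the sequence is generated lazily per peer, nothing large has to be placed in the job definition itself; only the peer count and the total travel in the lifecycle map.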

lucasbradstreet 2018-06-15T22:54:33.000278Z

it also works if you know that you want to read, say, a bunch of S3 objects. Based on the particular slot the peer is on, you would have each peer’s injection return a different partition of the objects that you passed in

lucasbradstreet 2018-06-15T22:55:05.000148Z

(e.g. first peer gets the first 1/3, second gets the second 1/3, third gets the last 1/3)
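The "first third, second third, last third" split can be sketched as a pure function over a list that is passed in, for example a list of object keys. The function name and the example keys are hypothetical; the slot id is assumed to be zero-based. When the item count does not divide evenly, the integer arithmetic below gives the extra items to the later slots.

```clojure
;; Hypothetical helper: the contiguous slice of `items` owned by one slot.
;; Slot i of n-peers gets indices [i*n/k, (i+1)*n/k), using integer division.
(defn contiguous-partition
  [items slot-id n-peers]
  (let [v     (vec items)
        n     (count v)
        start (quot (* slot-id n) n-peers)
        end   (quot (* (inc slot-id) n) n-peers)]
    (subvec v start end)))

;; Example with made-up object keys: three peers split six keys.
(def example-keys ["a" "b" "c" "d" "e" "f"])

(contiguous-partition example-keys 0 3) ; first third
(contiguous-partition example-keys 1 3) ; second third
(contiguous-partition example-keys 2 3) ; last third
```

The slices are disjoint and together cover the whole input, so no object is read by more than one peer.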

lucasbradstreet 2018-06-15T22:55:24.000133Z

the idea works whether you’re generating the messages or passing in via the lifecycle

dbernal 2018-06-18T13:28:09.000023Z

Gotcha, ty