If you need lots of streaming functions, but not much raw processing power (small data) what options are out there? I like the onyx model, but it seems like onyx-local-rt isn’t built to be used for more than testing. Is there a middle ground where its possible to get the dataflow model but scale it up and down from dataflow models running in threads to dataflow models running across distributed nodes.
I’m also curious what, if anything, needs to be done to move onyx-local-rt into a position where it could be used in a production system as a lightweight tool for realtime dataflow processing. I would be excited to do the work if necessary.
running onyx in single node mode should be perfectly fine, though you will start to top out with the current defaults if you run too many peers on one node.
It’s smart enough not to serialize messages that it’s sending to peers on the same node, so there’s not that much overhead, it’s just a bit aggressive when idling to reduce latency.
The main thing that I would like to see for more efficient scale up on a single node under the data flow model is for onyx peers to share worker threads.
the peer state machine is written in a non-blocking way, so when instead of idling it’d be easy enough to just switch to another task.
any ideas?
Hmm. It might be restricting all messages that flow to all tasks to ::error? segments
might be a weird handling case with :all
and short circuit / error handling flow conditions
I am not sure you would ever want :all
for those
I was under the impression that :all
in this context meant "all possible tasks that follow :transact
", hence it would equal [:out]
in my case
Yeah, it might be due to how the short circuited / error handling conditions are special cased
You’re right that that’s probably how it should work.
Create an issue on github for me and I’ll give it a look
I might have found something! 🙂 it's not big deal though, I just used [:out]
, but it was confusing (going to make that issue now)
I might just block :all
from being used on the short circuit exceptions
as it doesn’t make a lot of sense since the point is generally to restrict flow to only certain tasks
I’ll have to think about it a bit more though.
in my case what I wanted to achieve is to auto-handle exceptions across all tasks
the only way to do it (that I know of) is to create one flow condition for each, :all
doesn't seem to work
@joelsanchez uploaded a file: https://clojurians.slack.com/files/U5LPUJ7AP/FAK3PQD4M/-.clj
Oh, don’t you want :flow/from :all
, :flow/to :out
?
with a short circuited flow condition?
That should send all error exceptions down to out while leaving everything else working as normal
that would be it, yes, trying now
so...it kind of works but at the same time it doesn't
it works if the task that errored is the one that's connected to out (transact)
but I get nothing if it's another one that's not connected to out
Ah. Right, all tasks will need to be connected to out
If you’re trying to achieve that it should be pretty easy with code
ah but I can prevent them from going to out
with a flow condition, and connect all of them to out
yes
ok I'll try 🙂 thanks
got it working!
@joelsanchez uploaded a file: https://clojurians.slack.com/files/U5LPUJ7AP/FAJVAL43C/-.clj
just realized the shortr
typo...anyway, this works 100% as expected
I wire all tasks to out and catch exceptions with this flow condition, for all tasks
Great
@lucasbradstreet The problem i'm trying to solve is that often times we would like to leverage the dataflow model (windows, triggers, etc...) but our processing needs are best served in a none distributed (multiple servers) model. I feel i could craft together the time widowing aspect with coreasync or go channels (in golang) but i'm surprised i dont see this being done already.
Right. I guess I’m saying that if you run embedded aeron and ZooKeeper you’re pretty much getting that (though you still need s3 checkpointing). The main problem is that onyx peers are currently thread heavy.
Can the onyx-kafka plugin be used without direct access to ZK, such as with a managed kafka service like https://github.com/CloudKarafka/java-kafka-example?
It can take bootstrap servers rather than looking up via ZK. you’ll still need ZK for onyx but it doesn’t have to touch the kafka ZK servers
That worked great, thanks!
:)