onyx

FYI: an alternative Onyx chat is at <https://gitter.im/onyx-platform/onyx>; logs can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
sparkofreason 2018-07-14T17:37:27.000043Z

Seeing a lot of these messages in my local dev env after changing a job. Does that just indicate I'm overloading Aeron?

lucasbradstreet 2018-07-14T17:43:59.000009Z

Are you seeing any peer timeout messages in your logs?

sparkofreason 2018-07-14T17:51:29.000124Z

Yes.

sparkofreason 2018-07-14T17:51:52.000053Z

Bumped up the number of peers, and it seems happier.

lucasbradstreet 2018-07-14T18:06:19.000096Z

Right. Those unavailable-image messages are more of a symptom of the peers timing out. I may remove that message since it’s not particularly helpful. What was probably happening was that a single peer was doing too much work and didn’t have a chance to heartbeat in time to avoid being timed out.

lucasbradstreet 2018-07-14T18:06:34.000002Z

If they’re doing lots of work, you may want to reduce the batch size and/or increase the timeouts.
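
Something like this, for example (a sketch; `::process-segment` is a placeholder, and the peer timeout key names are from memory, so verify them against the Onyx cheat sheet for your version):

```clojure
;; Task map: smaller batches mean each pass does less work, so the
;; peer gets to heartbeat more often.
{:onyx/name :my-task
 :onyx/fn ::process-segment
 :onyx/type :function
 :onyx/batch-size 20
 :onyx/batch-timeout 50}

;; Peer config: lengthen the liveness timeouts so busy peers aren't
;; declared dead. Key names are assumptions to verify for your version.
{:onyx.peer/heartbeat-ms 500
 :onyx.peer/subscriber-liveness-timeout-ms 60000
 :onyx.peer/publisher-liveness-timeout-ms 60000}
```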

sparkofreason 2018-07-14T18:23:03.000019Z

Eventually got this, which again I suspect is an artifact of running locally.

lucasbradstreet 2018-07-14T18:24:56.000049Z

I’ll improve that message too. What’s happening there is that you are generating too many segments in one pass. So maybe you have a batch size of 200, and each segment generates over 100 segments. You will end up with over 20000 segments, which overflows the preallocated buffer.

sparkofreason 2018-07-14T18:39:14.000009Z

Could it occur just because aeron fell behind? I have a custom input generating segments, and there's a single aggregation task downstream that emits on a timer trigger. The custom input will definitely output more than 20K total segments, though.

lucasbradstreet 2018-07-14T18:49:32.000021Z

Total segments is fine. I think this is emitting that many segments in a single pass. Is it possible that your trigger is emitting 20000 messages via a single trigger/emit?

sparkofreason 2018-07-14T18:53:38.000049Z

Don't think so. The emit function returns a single map.

lucasbradstreet 2018-07-14T18:53:55.000011Z

Hmm

sparkofreason 2018-07-14T18:54:14.000050Z

There could be a lot of windows active.

lucasbradstreet 2018-07-14T18:54:34.000065Z

Right, I was about to say if there are more than 20000 windows that are emitting at the same time that could be a problem too.

lucasbradstreet 2018-07-14T18:55:12.000028Z

This is especially a problem for the timer trigger since it can end up firing for all windows at the same time.

sparkofreason 2018-07-14T18:57:07.000021Z

trigger/fire-all-extents? is false. Should that make a difference?

lucasbradstreet 2018-07-14T18:58:31.000083Z

For timer triggers it’s global so that will apply anyway.

lucasbradstreet 2018-07-14T18:58:49.000079Z

I mean it’ll still fire all extents since the timer trigger is global

lucasbradstreet 2018-07-14T18:59:18.000023Z

I’m trying to think of a better strategy for this situation

sparkofreason 2018-07-14T18:59:44.000004Z

Hmmm. So I'm actually running a custom trigger. Is there something I can do in that implementation?

lucasbradstreet 2018-07-14T19:02:03.000045Z

Hmm. You’ve returned true for whether it should fire, which means that all windows will be flushed. I think we could either make it so that the messages are written out in multiple phases, or we could increase the buffer size, or possibly we could give you some way of ensuring the number of windows doesn’t grow too big before flushing.

lucasbradstreet 2018-07-14T19:02:37.000029Z

I’m leaning towards the last option, as generally the timer is supposed to put an upper bound on how much is buffered up before you flush, but if you have built a lot of segments up to emit you may want to flush early.

lucasbradstreet 2018-07-14T19:04:51.000001Z

Would that option work for your use case?

sparkofreason 2018-07-14T19:11:12.000040Z

The trigger logic is supposed to work as a combination of segment and timer. It is supposed to trigger in a time period only if a new segment was received. So: a new segment arrives and starts the clock, after which any further segments have no effect. Once the timer fires the state is reset, so unless a new segment is received the clock will not start again. My reasoning was to avoid the situation where a lot of windows would fire with no changes.
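
Roughly like this (a sketch of the idea in Onyx's trigger-spec shape; the spec keys and `:event-type` values are assumptions to check against onyx.triggers, and resetting the state after a fire is elided):

```clojure
(def segment-timer
  {:trigger/init-state
   (fn [trigger]
     ;; nil = clock stopped; a timestamp = clock started by a segment
     nil)
   :trigger/next-state
   (fn [trigger state {:keys [event-type]}]
     (if (and (nil? state) (= event-type :new-segment))
       (System/currentTimeMillis) ;; first segment starts the clock
       state))                    ;; further segments have no effect
   :trigger/trigger-fire?
   (fn [trigger state state-event]
     ;; fire only when the clock is running and the period has elapsed
     (boolean (and state
                   (>= (- (System/currentTimeMillis) state) 5000))))})
```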

lucasbradstreet 2018-07-14T19:13:32.000050Z

OK, right, that’s not working correctly then. I think what’s happening is we’re defaulting to fire-all-extents? on all non-segment triggers and that’s causing you issues. Are you on 0.13.0?

lucasbradstreet 2018-07-14T19:14:06.000036Z

I can send you a snapshot to see if we can fix it by respecting fire-all-extents?, and then add validation on all of the trigger types where fire-all-extents? must be true.

lucasbradstreet 2018-07-14T19:16:29.000078Z

I have to run out for a sec. I’ve pushed a snapshot which respects fire-all-extents? for all trigger types: 0.13.1-20180714.191549-15
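
To pull it in, the dependency would look something like this (assuming the usual org.onyxplatform group id and that a snapshots repository is enabled in :repositories):

```clojure
;; project.clj dependency for the snapshot build above
[org.onyxplatform/onyx "0.13.1-20180714.191549-15"]
```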

sparkofreason 2018-07-14T19:16:37.000070Z

Yes, I am on 0.13.0

lucasbradstreet 2018-07-14T19:16:41.000048Z

if you want to test it out and let me know how you go, I can figure out the right way to make the change

lucasbradstreet 2018-07-14T19:17:16.000088Z

We haven’t had anyone create a composite timer/segment type trigger so this hadn’t come up yet.

sparkofreason 2018-07-14T19:17:34.000007Z

Thanks, I'll give it a whirl after lunch.

lucasbradstreet 2018-07-14T19:18:06.000011Z

sure thing. Lunch for me too

lucasbradstreet 2018-07-14T19:20:36.000075Z

If this turns out to be the problem I’ll be pretty happy, as the 20000-segments-per-pass issue was a bit of a smell for a streaming job.

sparkofreason 2018-07-14T20:23:00.000029Z

Looks like that was the answer. Running much faster, with far fewer restarts, at least so far.

lucasbradstreet 2018-07-14T20:23:28.000048Z

Great. Yeah, I could see a lot of bad behaviour coming from that. I’ll have a think about how to make that change right.

lucasbradstreet 2018-07-14T20:24:44.000034Z

I think with more validation or settings on the trigger implementation side it should work out well.

sparkofreason 2018-07-14T20:29:39.000024Z

These windows are all time-based, so once time has passed the extent, should I evict them? It just clicked that perhaps that's the point of watermark triggers.

lucasbradstreet 2018-07-14T20:35:48.000062Z

Yeah, that’s the point of watermark triggers. You could add something like that to your trigger + input plugin

lucasbradstreet 2018-07-14T20:36:20.000062Z

Pretty much have to evict at some point if you have long-running streaming jobs. Otherwise you’ll just keep adding state.
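
A watermark-based setup would look something like this (a minimal sketch; the ids and the ::emit fn are placeholders):

```clojure
;; Watermark trigger on a time-based window: it fires once the
;; watermark passes a window extent's upper bound, at which point the
;; window's state can be discarded.
{:trigger/window-id :collect-events
 :trigger/id :flush-on-watermark
 :trigger/on :onyx.triggers/watermark
 :trigger/emit ::emit}
```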

sparkofreason 2018-07-14T21:33:06.000009Z

I must be missing something, as the final-emit function specified in the watermark trigger isn't getting called. Can you both emit and evict with the same trigger?

lucasbradstreet 2018-07-14T21:48:46.000093Z

Oh, you probably haven’t implemented the watermark protocol on your input plugin.

lucasbradstreet 2018-07-14T21:49:04.000057Z

The input plugin is responsible for feeding timestamps down through the pipeline

lucasbradstreet 2018-07-14T21:50:02.000069Z

The way it works is that all of the segments will be between two barriers, each with their own timestamp. This is so that when watermarks from multiple input sources are mismatched, the minimum of them can be taken.
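
Implementing it on the plugin looks roughly like this (the protocol and method names here are from memory, so verify them against onyx.plugin.protocols for your version; the rest of the input plugin implementation is elided):

```clojure
(ns my.plugin
  (:require [onyx.plugin.protocols :as p]))

(defrecord MyInput [latest-event-time]
  p/WatermarkedInput
  (watermark [this]
    ;; epoch millis of the newest event time this input has read;
    ;; Onyx stamps this onto the barriers flowing downstream
    @latest-event-time))
```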

sparkofreason 2018-07-14T21:51:26.000023Z

I was going off this in the docs: "Trigger only fires if the value of :window/window-key in the segment exceeds the upper-bound in the extent of an active window. " Is that no longer valid?

lucasbradstreet 2018-07-14T21:52:15.000048Z

That’s no longer valid now that we have a better way of doing watermarks. I’ll fix the doc. Thanks

sparkofreason 2018-07-14T21:53:14.000010Z

Makes sense, and though I only have a single input for my simulation case, that may not hold in production.

lucasbradstreet 2018-07-14T21:55:16.000013Z

assign-watermark-fn also works if your data may change for a given input plugin
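
i.e. on the input task map, something like this (a sketch; ::event-time is a placeholder and other required input-task keys are elided):

```clojure
;; Derive the watermark per segment rather than from the plugin:
;; :onyx/assign-watermark-fn points at a fn of segment -> epoch millis.
{:onyx/name :read-events
 :onyx/type :input
 :onyx/assign-watermark-fn ::event-time
 :onyx/batch-size 20}

(defn event-time
  "Placeholder: pull the event timestamp (epoch millis) out of a segment."
  [segment]
  (:event-time segment))
```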

sparkofreason 2018-07-14T22:48:54.000009Z

So it looks like the :fire-all-extents patch broke watermarks. But for my immediate purposes, it doesn't matter. The reason I wound up with the segment/timer trigger and a large number of active windows was that I didn't grok the watermark/eviction connection. It looks like I can use a combination of the out-of-the-box timer and watermark triggers to get the desired outcome. Thanks for all of your help.

lucasbradstreet 2018-07-14T22:49:51.000024Z

That makes sense too. Cool. I’ll think about what we should do with the fire-all-extents change in the future, but for now I won’t make any changes there.