onyx

FYI: alternative Onyx :onyx: chat is at <https://gitter.im/onyx-platform/onyx> ; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
lucasbradstreet 2018-07-15T00:40:27.000023Z

eviction should happen immediately after emit. Do the messages arrive somewhat in order? If they are at completely random times then watermarks plus a delay won’t help

lucasbradstreet 2018-07-15T00:43:11.000012Z

if no more segments will arrive for that window once you start receiving segments that occur 30 minutes later, then it should be working fine and there’s definitely an issue somewhere else

lucasbradstreet 2018-07-15T00:43:41.000014Z

could it be the presence of the timer trigger in your composite trigger?

lucasbradstreet 2018-07-15T00:43:53.000008Z

Now that you’re evicting, those windows may just be getting flushed by the timer trigger
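
(For reference, a rough sketch of the kind of trigger combination being discussed. The option names assume standard Onyx trigger keys, and the actual composite trigger from the job isn’t shown in this log; the point is that once eviction is enabled, any trigger that fires will flush the evicted state, so a timer trigger can emit partial windows ahead of the watermark trigger.)

```clojure
;; Hypothetical trigger maps: a watermark trigger plus a timer trigger on the
;; same window. With :trigger/post-evictor [:all], whichever trigger fires
;; evicts the window contents, so the timer can flush partial state before
;; the watermark trigger would have.
(def triggers
  [{:trigger/window-id :daily-results
    :trigger/id :watermark-flush
    :trigger/on :onyx.triggers/watermark
    :trigger/emit ::emit-final
    :trigger/post-evictor [:all]}
   {:trigger/window-id :daily-results
    :trigger/id :periodic-flush
    :trigger/on :onyx.triggers/timer
    :trigger/period [5 :seconds]
    :trigger/emit ::emit-final
    :trigger/post-evictor [:all]}])
```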

sparkofreason 2018-07-15T00:56:35.000014Z

Is the delay wall time or watermark time? I'm simulating a month in terms of internal timestamps, but the whole thing runs in about ten minutes.

lucasbradstreet 2018-07-15T01:00:22.000122Z

Watermark time

lucasbradstreet 2018-07-15T01:00:30.000024Z

Vs time of the given segment

sparkofreason 2018-07-15T02:00:43.000029Z

It seems to be working if I set the delay equal to the window range. I think I must be misunderstanding the purpose of the delay. And as such I'm not clear if the delay is really required here, or if I just managed to hack around something else I did wrong. Anyway, it seems to be solid now.

sparkofreason 2018-07-15T02:05:51.000044Z

Except when I get this exception: "Messenger subscription image was lost, and rejoined. This is generally due to timeout conditions e.g. garbage collection, or other causes of connection flapping. Rebooting peer." In that case I seem to lose segments, though maybe that's because I managed to botch the checkpointing of my custom input.

lucasbradstreet 2018-07-15T04:32:00.000013Z

Most likely bad checkpointing

sparkofreason 2018-07-15T14:46:12.000058Z

Is it expected that a delay would be required for eviction to work properly, in the sense of ensuring that the window has captured all segments in the time interval? The input is just generating segments equally spaced in time and in order, and the aggregation is counting segments and doing some simple reduction of other values in the segment.
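
(For context, the window being described is roughly a fixed one-day window keyed on the segment timestamp. A sketch follows, using the built-in count aggregation as a stand-in for the custom counting/reducing aggregation; the window id, task name, and :event-time key are placeholders, not taken from the job.)

```clojure
;; Fixed one-day window over the segment's timestamp, counting segments.
(def windows
  [{:window/id :daily-results
    :window/task :results
    :window/type :fixed
    :window/window-key :event-time
    :window/range [1 :day]
    :window/aggregation :onyx.windowing.aggregation/count}])
```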

sparkofreason 2018-07-15T17:08:53.000014Z

A little more info: I had accidentally removed the delay in the case where segments appeared to be lost. I added some debuggery to the function called to emit the final state, triggered by the watermark with no delay. Under normal conditions, the final-emit function is called once, and the state indicates all expected segments were aggregated in the window. But after receiving a "Messenger subscription image was lost", final-emit gets called twice, with each call having a different part of the window. So if I expect 1000 segments in a window, the first call might show 800, and the second 200. Putting in a trigger delay equal to the window range appears to alleviate this.

lucasbradstreet 2018-07-15T19:00:38.000005Z

So, generally the way it should work is that the watermark + delay puts an upper bound on the time before a window will be emitted, with the delay there to handle any stragglers that arrive after segments with later timestamps. But when you are completing the job you need to flush the remaining messages (which I believe you have handled with punctuation messages). It’s strange that the final emit gets called with different sets of data. I’d be worried that the delay is just papering over it, unless it’s because the data is arriving out of order. I think you should print out the messages as they are added to your window, as it’d be helpful to know whether your watermark + delay is acting correctly to wait for stragglers to arrive.
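
(One lightweight way to get the logging suggested here is a pass-through :onyx/fn on the windowed task that prints each segment as it arrives; a sketch, with the :event-time key assumed rather than taken from the job.)

```clojure
;; Pass-through fn for the windowed task: logs each segment's timestamp so
;; out-of-order or post-watermark arrivals are visible, then returns the
;; segment unchanged for the window aggregation.
(defn log-segment [segment]
  (println "results task saw segment with :event-time" (:event-time segment))
  segment)
```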

sparkofreason 2018-07-15T20:01:08.000072Z

What would cause "stragglers"? The input is generating the segments in strict time order, and the watermark is just the ms of the generated time stamp. The only time I see this issue is when recovering from an exception, the "Rebooting peer" scenario. I'm pretty sure the checkpointing/recovery is working right, and the input is pretty simple, essentially the collection plugin with a little extra state.

lucasbradstreet 2018-07-15T21:38:28.000017Z

ok, if everything is in order, that can’t be it.

lucasbradstreet 2018-07-15T21:39:56.000014Z

if everything is in order, then I believe it should work with delay 0. I assume the peer is rebooting after the subscription image is lost?

sparkofreason 2018-07-15T21:45:22.000061Z

Correct.

lucasbradstreet 2018-07-15T21:48:29.000014Z

Hmm. That’s pretty odd. What does the parallelism look like on the tasks prior to the window task?

sparkofreason 2018-07-15T21:57:02.000031Z

This is the workflow. Input goes to several results tasks which run in parallel, each with a different window range, and those emit to the output. All tasks have :max-peers 1, so each path through the DAG should be input -> results -> output.
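
(Roughly, the shape being described, with task names invented for illustration:)

```clojure
;; One input fanning out to several parallel windowed results tasks, each
;; feeding the same output; every catalog entry carries :onyx/max-peers 1.
(def workflow
  [[:input :results-daily]
   [:input :results-hourly]
   [:results-daily :output]
   [:results-hourly :output]])
```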

lucasbradstreet 2018-07-15T21:57:45.000025Z

Max peers was what I was looking for. Since it’s 1 it can’t be related to what I was thinking.

lucasbradstreet 2018-07-15T21:58:59.000060Z

Are the timestamps unique? Or can some segments have the same timestamp?

sparkofreason 2018-07-15T22:04:38.000064Z

Should be unique, generated via range as in the code above.

lucasbradstreet 2018-07-15T22:13:00.000047Z

k, can’t be that then.

sparkofreason 2018-07-15T22:36:02.000091Z

I've captured the segments seen by one of the results aggregator tasks, and looked for places where they are not in time order (shown above). These are as expected, three exceptions and corresponding restarts. What is interesting is that the double watermark issue and "lost" segments only occurred when the recovery crossed windows, in the last two cases (window range is one day). I'll run a few more and see if that pattern persists.

sparkofreason 2018-07-15T22:46:57.000041Z

Never mind about that last part, saw a case where the recovery spanned windows with no issue.

lucasbradstreet 2018-07-15T22:56:09.000021Z

Is it possible you’re setting a watermark somewhere in your input plugin but not resetting it when recover is called?

lucasbradstreet 2018-07-15T22:56:24.000050Z

thus passing down a later watermark than indicated from the point you recovered?

lucasbradstreet 2018-07-15T22:57:19.000045Z

If you’re using assign-watermark-fn rather than implementing the protocol, I will look at onyx’s code instead

sparkofreason 2018-07-15T23:14:53.000021Z

Using assign-watermark-fn.

lucasbradstreet 2018-07-15T23:15:14.000027Z

Ok. We may have a recovery bug there. I’ll let you know when I’ve had a chance to look at it.

sparkofreason 2018-07-15T23:16:52.000064Z

Some more data points: the issue does not always occur on recovery, only sometimes. I also forced some random exceptions in the job, and they show the same behavior, so it isn't isolated to the messenger exception.

lucasbradstreet 2018-07-15T23:17:23.000017Z

K. Seems reasonably likely to be a recovery bug that is timing dependent then

lucasbradstreet 2018-07-15T23:18:45.000066Z

Seems like we may be handling recovery badly for watermarks. Let me run the tests and cut you a snapshot to try

sparkofreason 2018-07-15T23:20:42.000068Z

Cool, thanks.

lucasbradstreet 2018-07-15T23:27:21.000058Z

Give this guy a try: 0.13.1-20180715.232706-16
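
(For anyone following along, pulling in that snapshot would look something like this in project.clj; the project name and Clojure version are placeholders, and it assumes the artifact resolves from wherever Onyx snapshots are normally published.)

```clojure
(defproject my-onyx-job "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [org.onyxplatform/onyx "0.13.1-20180715.232706-16"]])
```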

sparkofreason 2018-07-15T23:44:31.000002Z

No issues on the first run. I'll give it some more punishment and keep you posted.

lucasbradstreet 2018-07-15T23:44:39.000082Z

:thumbsup: