onyx

FYI: alternative Onyx :onyx: chat is at <https://gitter.im/onyx-platform/onyx> ; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
twashing 2017-11-20T19:03:37.000419Z

Hey, I just placed an issue on onyx-kafka, for troubleshooting simple job not writing to a topic. https://github.com/onyx-platform/onyx-kafka/issues/47

twashing 2017-11-20T19:03:59.000232Z

Any ideas here would be great. I’m probably just missing something very simple.

eriktjacobsen 2017-11-20T19:50:53.000737Z

Getting an error that looks like it's stemming from onyx itself: integer overflow

eriktjacobsen 2017-11-20T19:51:18.000340Z

Anyone seen similar?

jasonbell 2017-11-20T20:16:56.000013Z

So :conform-health-check-msg threw the exception. Do you have any lifecycles set up? (http://www.onyxplatform.org/docs/user-guide/0.12.x/#_example) If you don’t, the task will die and not restart, which will render the job broken and everything will need restarting.

jasonbell 2017-11-20T20:17:15.000100Z

If you handle the exception you can :restart the task so it starts up again.

jasonbell 2017-11-20T20:17:43.000078Z

Though it’s still prudent to look at what’s happening in the task itself and catch the exception in the first place, obviously.
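
A minimal sketch of that kind of failsafe, assuming a hypothetical my.app.lifecycles namespace; the handle-exception hook is the documented Onyx lifecycle and its return value must be :restart, :kill, or :defer:

(ns my.app.lifecycles)

;; Calls map with a handle-exception hook; here we always ask Onyx to
;; restart the failed task rather than let the job die.
(def handle-exception-calls
  {:lifecycle/handle-exception (fn [event lifecycle lifecycle-name throwable]
                                 :restart)})

;; Lifecycle entries attaching the calls map to every task in the job.
(def lifecycles
  [{:lifecycle/task :all
    :lifecycle/calls :my.app.lifecycles/handle-exception-calls}])

In practice you would likely inspect or log the throwable before deciding whether to return :restart, :kill, or :defer.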

lucasbradstreet 2017-11-20T21:15:19.000447Z

I’ll pop in with some additional info later

eriktjacobsen 2017-11-20T21:28:01.000123Z

@jasonbell So I see the task name, but since there was nothing in the stack trace and the code is this:

(defn conform-health-check-message
  [segment]
  (let [result (:result segment)
        ts (if-let [ts (:timestamp segment)]
             ts ; expect this to be in msecs
             (.getMillis (time/now)))
        output (-> segment
                   (select-keys [:hash :config :lambda :commit])
                   (assoc :timestamp ts :result (keyword result)))]
    (debug "Conformed: " output)
    output))

We don't have any lifecycles set up for this task, and it didn't look like we were modifying any integers, which is why I thought it might be in the internals of onyx.

lucasbradstreet 2017-11-20T21:31:32.000587Z

It definitely is onyx internals. More when I’m done with a call.

eriktjacobsen 2017-11-20T21:32:18.000122Z

Sure, no rush. The job successfully restarted from the resume point, it seems. Thanks

jasonbell 2017-11-20T21:38:58.000554Z

@eriktjacobsen As long as you’re sorted. Previously I’ve wrapped each task with lifecycle events just in case and then handled all the exceptions so the job doesn’t have a chance to fail.

lucasbradstreet 2017-11-20T21:56:55.000457Z

Alright, so the problem there is that it took a really long time to write out a batch to the task downstream of your task, and we overflowed the long tracking how many nanoseconds it took.

lucasbradstreet 2017-11-20T21:57:29.000543Z

The second problem, as @jasonbell accurately described, is that you don’t have a handle-exception lifecycle on your tasks as a failsafe for whether to continue running the job.

lucasbradstreet 2017-11-20T21:58:19.000514Z

So, I would think the action for us is to fix the overflow. Your actions are to figure out why it might have taken so long for that task to write the batch, as well as to add the exception lifecycle.

lucasbradstreet 2017-11-20T21:59:05.000650Z

My bad for assuming we would never overflow that long 😄

eriktjacobsen 2017-11-20T21:59:19.000341Z

Ah. Looking through the logs, it seems there were some ZK timeouts happening around that time.

lucasbradstreet 2017-11-20T22:01:59.000062Z

Yeah, I’m guessing you got blocked downstream, and so upstream was trying to offer the segments to it and got stuck.

lucasbradstreet 2017-11-20T22:02:30.000595Z

@jasonbell hah, I have a helper just like that. Actually in this case it’s already a long, but nanoseconds are kinda big to start with 😮, so we overflowed the long anyway.

jasonbell 2017-11-20T22:02:44.000180Z

🙂

lucasbradstreet 2017-11-20T22:29:55.000568Z

I’m actually not sure how that overflowed, as it would have had to be a lot of hours (many many thousands)
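
For scale, a quick REPL check independent of Onyx: Clojure’s default long arithmetic is overflow-checked, which is where a bare “integer overflow” ArithmeticException comes from, and Long/MAX_VALUE nanoseconds is roughly 292 years of wall-clock time, so simple elapsed time alone would not get near the limit.

;; Checked long addition throws instead of wrapping around:
(+ Long/MAX_VALUE 1)
;; => ArithmeticException: integer overflow

;; Long/MAX_VALUE nanoseconds expressed in years:
(/ Long/MAX_VALUE 1e9 60 60 24 365.25)
;; => ~292.3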

lucasbradstreet 2017-11-20T22:30:13.000309Z

I’ll have to figure it out anyway.

lucasbradstreet 2017-11-20T22:34:58.000175Z

Ahhh, it’s not resetting the accumulated time when you’re processing zero-size batches, so if you have a long-running job that isn’t receiving any segments it’ll continue to accumulate. How long was that job running for, approximately?

eriktjacobsen 2017-11-20T22:37:37.000046Z

The weekend, since Friday. Looks like timeouts were happening here and there, but they started majorly ramping up about an hour before the exception, which is ultimately what stopped the job.

lucasbradstreet 2017-11-20T22:38:42.000027Z

OK, the overflow still doesn’t completely make sense to me then.

lucasbradstreet 2017-11-20T22:40:43.000299Z

Anyway, I’ll put in some code to prevent the overflow, and with the lifecycle addition the job would have recovered.
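
A hypothetical sketch of the kind of guard being described, not the actual Onyx patch; the helper name and arguments are made up for illustration:

;; Reset the elapsed-time accumulator on zero-size batches so an idle task
;; cannot keep accumulating until checked long arithmetic throws
;; "integer overflow"; only accumulate while a real batch is being timed.
(defn update-offer-timer
  [accumulated-ns batch-size elapsed-ns]
  (if (zero? batch-size)
    0
    (+ accumulated-ns elapsed-ns)))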