Back on the exception/restart topic: it happened again. The lifecycle exception handler was called several times, and Onyx logged a warning a few times in there as well. The last thing logged was the Onyx warning "Caught exception inside task lifecycle :lifecycle/offer-heartbeats.", and then everything shut down.
And as before, after the peers are restarted the job does not restart on its own, and requires manual resubmission.
Thanks. There must be a bug in the supervision where handle-exception isn’t invoked under certain circumstances (probably in offer-heartbeats)
I assume handle-exception is set for :all and always returns :restart?
I believe so; the code is above, let me know if I missed something. It did actually restart successfully several times.
Looks right to me. Just double checking stuff before going digging.
Do you know whether the old job moved to the killed-jobs key in the cluster replica? I'm 99% sure it did.
How would I check?
On my phone so I haven’t checked this. If you have onyx-peer-http-query set up, you can query /replica and see whether the latest job-id is under killed-jobs.
Either that, or look under killed-jobs in the replica you get from however you’ve played back the log for diagnostics.
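For anyone following along, the killed-jobs check could be scripted roughly like this. It is only a sketch: it assumes the replica has already been fetched and decoded into a plain dict, and that the job lists live under "jobs", "killed-jobs" and "completed-jobs" keys holding job-id strings. The function name and the sample job ids are made up for illustration, and the real response from /replica may be shaped differently.

```python
def job_state(replica, job_id):
    """Return the name of the replica list containing job_id, or None.

    Assumes `replica` is a dict whose "killed-jobs", "completed-jobs"
    and "jobs" entries are lists of job-id strings (an assumption about
    the decoded replica, not a documented format).
    """
    for key in ("killed-jobs", "completed-jobs", "jobs"):
        if job_id in replica.get(key, []):
            return key
    return None


# Hypothetical replica snapshot, for illustration only.
replica = {
    "jobs": ["job-b"],
    "killed-jobs": ["job-a"],
    "completed-jobs": [],
}

print(job_state(replica, "job-a"))
```

If the job that stopped shows up under "killed-jobs", that would be consistent with the job having been killed rather than restarted, which is why it needs resubmitting.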