The job should be able to stay running regardless, assuming the lifecycles are set up correctly. You may just have a period where one node prevents progress. We have generally used 30 seconds in the past, but it really depends on your needs.
`:lifecycle/handle-exception` is set, and it got called several times with `io.aeron.exceptions.ConductorServiceTimeoutException`, followed by a couple of `org.agrona.concurrent.AgentTerminationException`. I'm guessing what follows in the log snippet above is Onyx attempting to restart and not succeeding, as the lifecycle handler does not get called again and subsequent exceptions are logged from Onyx code.
Yup, definitely looks like it’s not succeeding
Do you know if a kill-job log entry is written out though? If so, the job should come back up without a new job submission if the container is cycled
I don't know specifically whether kill-job was written (getting monitoring wired up in production is next on the task list), but I don't think the job came back up on its own after recycling the container. I waited a while after the peers started and nothing seemed to be happening, so I resubmitted the job.
Same tenancy?
Yes
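For anyone following along, a lifecycle with `:lifecycle/handle-exception` like the one mentioned above might look roughly like this. It's only a sketch: the namespace and the names `handle-exception`, `restart-calls`, and `lifecycles` are made up for illustration, and the handler here always returns `:restart`, which may not be what the job in this thread actually does.

```clojure
(ns example.lifecycles)

;; Hypothetical handler. Onyx calls it with the event map, the lifecycle
;; entry, the name of the lifecycle phase that threw, and the exception.
;; It should return :restart, :kill, or :defer.
(defn handle-exception
  [event lifecycle lifecycle-name exception]
  (println "Lifecycle" lifecycle-name "threw:" (.getMessage ^Throwable exception))
  :restart)

;; Calls map that the lifecycle entry below points at (assumed name).
(def restart-calls
  {:lifecycle/handle-exception handle-exception})

;; Lifecycle entries to pass in the job map under :lifecycles.
(def lifecycles
  [{:lifecycle/task :all
    :lifecycle/calls :example.lifecycles/restart-calls}])
```

Returning `:restart` keeps the job alive and reboots the task, whereas `:kill` ends the job, which is when you'd expect a kill-job entry in the log.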