onyx

FYI: alternative Onyx chat is at <https://gitter.im/onyx-platform/onyx>; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
dbernal 2018-05-15T14:43:43.000512Z

has anyone run into any issues with C3P0 deadlocks on the onyx-sql plugin before?

lucasbradstreet 2018-05-15T17:48:23.000108Z

I haven’t seen them, but onyx-sql is mostly user maintained so I’ve never run it in prod

lucasbradstreet 2018-05-15T17:49:07.000543Z

I have seen initial connection timeouts lately on CI, but that was because our CI was using the latest tag for its docker images, and we ended up with an incompatible mysql version.

2018-05-15T18:26:18.000769Z

i know the code quite well and it’s definitely not normal to have deadlocks, especially since the whole thing is single threaded. how are you using the plugin, mysql?

2018-05-15T18:26:50.000606Z

also, when you say deadlocks, do you mean a timeout when acquiring a connection from the pool? those are two different things.

2018-05-15T18:28:38.000341Z

@lucasbradstreet congrats on the release :)

2018-05-15T18:28:59.000235Z

very happy to see curator updated

lucasbradstreet 2018-05-15T18:31:41.000829Z

Thanks. Just checking over everything before announcing 🙂

lucasbradstreet 2018-05-15T19:13:26.000204Z

Hi everyone! Onyx 0.13.0 is out with a very minor breaking change that probably won’t affect anyone. It bumps curator so that ZooKeeper SSL is now supported. https://github.com/onyx-platform/onyx/blob/0.13.x/changes.md#0130

eriktjacobsen 2018-05-15T19:16:55.000501Z

@eriktjacobsen uploaded a file: https://clojurians.slack.com/files/U1CTH1TUY/FARBGTKGF/stops_peers.clj

eriktjacobsen 2018-05-15T19:19:44.000672Z

Hey, we recently updated our code. Everything passes unit tests and works fine for several days, but then at random our job goes from processing messages to dumping something like this, and then... nothing. The job looks like it is running and the peers look fine, but no application or onyx level logging ever appears again from the peers. (This is in an environment on a single box with 22 virtual peers.) Just throwing it out there in case anyone has seen something similar; the only major addition is switching to an onyx output plugin that does Amazonica Lambda invocation.

lucasbradstreet 2018-05-15T19:20:33.000550Z

Looks like the uri is missing for your request? uri=-}

lucasbradstreet 2018-05-15T19:20:52.000319Z

not sure if something changed in how your urls got passed down to your lambda plugin but I’d check on the segments at that plugin.

eriktjacobsen 2018-05-15T19:22:52.000116Z

yes, I'm trying to debug the output plugin itself and adding more judicious try-catch. I'm more concerned that this seems to shut the entire system down with zero logging from onyx: no missed heartbeats, no virtual peers closing down, no aeron messages, etc. It's like once this triggers, everything just stops. My understanding is there is a threadpool for the virtual peers, so I'm curious why this seems to freeze the entire thing.

lucasbradstreet 2018-05-15T19:24:36.000614Z

Ah right. Um, this is a plugin you wrote, right?

lucasbradstreet 2018-05-15T19:25:05.000395Z

If you’re not checking whether your async requests failed in the plugin, from Onyx’s perspective everything may be working fine.

lucasbradstreet 2018-05-15T19:25:40.000255Z

Just a stab in the dark
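
To illustrate the point about unchecked async requests, here is a minimal sketch in plain Clojure. It assumes the plugin keeps each async call as a future; `failed-requests` is a hypothetical helper and not part of Onyx or Amazonica.

```clojure
;; Hypothetical helper (an assumption, not an Onyx or Amazonica API).
;; Given the futures for in-flight async requests, returns a vector of
;; the exceptions raised by those that have already finished.
(defn failed-requests
  [futures]
  (into []
        (keep (fn [f]
                (when (realized? f)
                  (try
                    (deref f)
                    nil ; finished successfully, nothing to report
                    (catch Exception e
                      ;; deref wraps the original error in an
                      ;; ExecutionException; unwrap it when possible
                      (or (.getCause e) e))))))
        futures))
```

One option is to call something like this before each write and throw if the result is non-empty, so a failed request should show up as an exception in the peer rather than passing unnoticed.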

eriktjacobsen 2018-05-15T19:27:03.000733Z

From the point that error dump happens, no further messages seem to be processed. Literally the log file just stops, though the peers remain reporting as up and things seem like they are running from zk / onyx perspective, just no messages are consumed. I get that the output plugin might be FUBAR and not actually saving anything, and I'm fine with that; I'm more concerned that everything else fails silently. Will circle back around once the error is figured out.

lucasbradstreet 2018-05-15T19:39:14.000398Z

I assume it’s not possible that it just processed everything?

lucasbradstreet 2018-05-15T19:39:21.000554Z

Anyway, let me know how you go.

eriktjacobsen 2018-05-15T19:42:52.000707Z

Correct, input is a kafka stream that receives messages every minute, and we have another onyx cluster running with the former version of code which has no hiccups.

lucasbradstreet 2018-05-15T19:45:20.000683Z

In that case I could see a situation where the plugin is returning false from synced?, prepare-batch, or write-batch https://github.com/onyx-platform/onyx-plugin/blob/0.12.x/resources/leiningen/new/onyx_plugin/medium_output.clj#L42

lucasbradstreet 2018-05-15T19:46:46.000472Z

It will continue to heartbeat, because Onyx is waiting for your work to finish and the plugin is signalling it to wait. So it’s still probably a problem with how the async requests are handled.
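
The waiting behaviour described above can be sketched without Onyx itself. The only thing taken from the discussion is the convention that returning false means "not finished yet, call me again". The function name `write-done?`, its arguments, and the timeout are all hypothetical.

```clojure
;; Hypothetical stand-in for the body of a write-batch style function.
;; in-flight  - collection of futures (or promises) for async requests
;; started-ms - when the oldest request was issued, in millis
;; now-ms     - current time in millis
;; timeout-ms - how long we are willing to keep signalling "wait"
(defn write-done?
  [in-flight started-ms now-ms timeout-ms]
  (cond
    ;; Everything has finished: deref each one so that a failed request
    ;; rethrows here, then report that the write is done.
    (every? realized? in-flight)
    (do (run! deref in-flight) true)

    ;; Still pending after the deadline: fail loudly instead of
    ;; returning false forever.
    (> (- now-ms started-ms) timeout-ms)
    (throw (ex-info "Async writes still pending after timeout"
                    {:pending   (count (remove realized? in-flight))
                     :waited-ms (- now-ms started-ms)}))

    ;; Otherwise keep signalling "wait".
    :else false))
```

Without the timeout branch, a request that never completes keeps this returning false indefinitely, which would match the symptom of peers that stay up but consume nothing.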

lucasbradstreet 2018-05-15T19:48:31.000063Z

There are metrics/health checks that you’d be able to use to detect when it’s processing a certain epoch for a long time.
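
As a rough illustration of such a check, assume you can periodically sample the epoch a peer is working on. The sample shape (`{:epoch n :at-ms t}`) and the `stalled?` function are hypothetical; the actual Onyx metric and health check names are not given in this conversation.

```clojure
;; Hypothetical stall detector over periodic epoch samples.
;; samples      - vector of {:epoch n :at-ms t}, oldest first
;; now-ms       - current time in millis
;; max-stall-ms - how long one epoch may last before we call it stalled
(defn stalled?
  [samples now-ms max-stall-ms]
  (when-let [latest (:epoch (last samples))]
    (let [first-seen (->> samples
                          (filter #(= latest (:epoch %)))
                          first
                          :at-ms)]
      ;; Truthy when the current epoch was first observed too long ago.
      (> (- now-ms first-seen) max-stall-ms))))
```

For example, with samples showing epoch 2 first seen at 1000 ms, a 60 second threshold flags a stall once `now-ms` passes 61000.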