onyx

FYI: alternative Onyx :onyx: chat is at <https://gitter.im/onyx-platform/onyx> ; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
sparkofreason 2018-07-07T15:48:16.000020Z

Any recommendations on how best to monitor an Onyx cluster running in Kubernetes? It's not clear to me how/if you could use onyx-peer-http-query in this case.

lucasbradstreet 2018-07-07T16:28:04.000081Z

@j0ni ah yeah, it’s tough for us to know when your job is complete. You could do a few different things, like checking whether you have processed all of the expected results.

lucasbradstreet 2018-07-07T16:28:39.000045Z

@dave.dixon I would only be using onyx-peer-http-query with Kubernetes for health checks and metrics.
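
A minimal sketch of what that looks like in the peer-config, assuming the config keys from the onyx-peer-http-query README (key names and the port are assumptions to check against your version):

```clojure
;; Minimal sketch: enable the embedded HTTP query server on each peer,
;; assuming the onyx-peer-http-query config keys. Values are placeholders.
(def peer-config
  {:onyx/tenancy-id "my-tenancy"       ;; hypothetical tenancy id
   :zookeeper/address "zk:2181"        ;; hypothetical ZooKeeper address
   :onyx.query/server? true            ;; start the HTTP query server with the peer
   :onyx.query.server/ip "0.0.0.0"     ;; bind all interfaces so the pod port is reachable
   :onyx.query.server/port 8080})      ;; port for health checks and metrics scraping
```

Each peer pod then serves its own endpoint, which Kubernetes probes and Prometheus can scrape directly.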

sparkofreason 2018-07-07T16:34:36.000051Z

@lucasbradstreet Yes. I'm just a little lost on how you wire it up. Each node runs its own server, right? Presumably you then want to aggregate all those results into Prometheus or something; I'm wondering how people are actually wiring it up.

sparkofreason 2018-07-07T16:35:08.000061Z

There's probably some key k8s concept I'm missing here.

Travis 2018-07-07T16:39:59.000034Z

@dave.dixon I am a little rusty, but I think one way to do this is to add an annotation to your peer containers so Prometheus can identify them through the Kubernetes service-discovery/scraping mechanism.
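
For reference, a hypothetical sketch of that convention, written as the EDN a Clojure deploy tool would render into the pod template's metadata (the `prometheus.io/*` annotations are a community convention honored by the stock Prometheus Kubernetes scrape configs, not a Kubernetes built-in; the port and path here are assumptions):

```clojure
;; Hypothetical: conventional Prometheus scrape annotations for the peer
;; pod template. Prometheus's kubernetes_sd relabeling rules, if configured
;; to honor them, use these to discover and scrape each peer pod.
(def peer-pod-annotations
  {"prometheus.io/scrape" "true"       ;; opt this pod in to scraping
   "prometheus.io/port"   "8080"       ;; assumed onyx-peer-http-query port
   "prometheus.io/path"   "/metrics"}) ;; assumed metrics endpoint
```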

sparkofreason 2018-07-07T16:42:29.000066Z

@camechis Thanks, that sounds similar to something I was reading about using Prometheus with JMX in k8s; sounds like it will get me headed in the right direction.

lucasbradstreet 2018-07-07T16:44:35.000031Z

That’s right

sparkofreason 2018-07-07T16:47:35.000008Z

Another question as I puzzle through my deployment issues: at start-up I get several messages like "Job ID 5b3651ac-0ad0-4005-b503-79609899c3d1 has already been submitted, and will not be scheduled again.", but those jobs don't show up when I call rq/jobs. I assume those can be safely ignored and aren't using vpeers or other resources?

lucasbradstreet 2018-07-07T16:49:55.000072Z

Yep. I’ll have to double-check, but I think it’s just a job being submitted with the same job ID, and it’s the idempotency at work.
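
A sketch of the idempotency in question, assuming Onyx's job-metadata mechanism where a fixed `:job-id` under `:metadata` makes resubmission a no-op (the peer-config and job here are placeholders; the UUID is the one from the log message above):

```clojure
(require '[onyx.api])

;; Sketch: submit a job with a fixed job id, assuming the :metadata
;; :job-id mechanism. Resubmitting the same id is ignored, which is what
;; produces the "has already been submitted" message.
(def job-id #uuid "5b3651ac-0ad0-4005-b503-79609899c3d1")

(defn submit-idempotently [peer-config job]
  (onyx.api/submit-job peer-config
                       (assoc job :metadata {:job-id job-id})))
```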

sparkofreason 2018-07-07T16:52:34.000039Z

It occurs when the peer starts, before I submit any jobs.

Travis 2018-07-07T16:53:19.000071Z

Sounds like a previously run job in the same Onyx tenancy?

sparkofreason 2018-07-07T16:56:51.000022Z

Yes.

lucasbradstreet 2018-07-07T16:57:15.000018Z

Right, it’s just playing back the log.

sparkofreason 2018-07-07T21:56:00.000075Z

I'm trying to debug some really odd behavior, and I'm beginning to suspect it may be related to non-Onyx parts of the application behaving badly with Kafka and thus ZooKeeper. In an effort to better understand how things work: how might I wind up getting this warning on peer startup? "Log parameters have yet to be written to ZooKeeper by a peer. Backing off 500ms and trying again..."

lucasbradstreet 2018-07-07T22:07:02.000046Z

Hmm, where’s that warning coming from? When the Onyx peers start, they’ll write out the log parameters for a given tenancy.

lucasbradstreet 2018-07-07T22:07:28.000034Z

So it’s probably some other component starting up on a tenancy which hasn’t had any peers start yet.
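
A sketch of the invariant, assuming the standard config keys: whichever component reads the log must use the same `:onyx/tenancy-id` as the peers that wrote the log parameters (values are placeholders):

```clojure
;; Sketch: peers write log parameters under their tenancy; any client that
;; starts on a different tenancy will back off waiting for parameters that
;; no peer has written.
(def tenancy-id "my-tenancy")

(def peer-config
  {:onyx/tenancy-id tenancy-id
   :zookeeper/address "zk:2181"})

;; Client-side configs (job submission, subscribers, dashboards) must
;; reuse the exact same tenancy id:
(def client-config
  {:onyx/tenancy-id tenancy-id
   :zookeeper/address "zk:2181"})
```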

sparkofreason 2018-07-07T22:29:59.000043Z

Here's the stack trace. I'll add some more logging to see whether start-peer-group or start-peers is responsible.

sparkofreason 2018-07-07T22:38:43.000037Z

No other component is starting, at least not intentionally. I also tried again with a fresh ZK (supposedly, it's a managed service) and still saw the same thing. Cleaning out all of the checkpoints now.

lucasbradstreet 2018-07-07T22:41:59.000025Z

Hmm

lucasbradstreet 2018-07-07T23:18:50.000021Z

Do you see any messages in there about Aeron?

lucasbradstreet 2018-07-07T23:19:10.000010Z

It’s possible it’s not able to connect to Aeron and isn’t starting any peers as a result.
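
One way to rule that out, sketched with the standard Aeron messaging keys in the peer-config (key names may vary by Onyx version; values are placeholders):

```clojure
;; Sketch: Aeron messaging settings in the peer-config. Running the
;; embedded media driver removes the dependency on a separately started
;; driver; otherwise the peers need a reachable external driver.
(def peer-config
  {:onyx.messaging/impl :aeron
   :onyx.messaging/peer-port 40200
   :onyx.messaging/bind-addr "localhost"
   :onyx.messaging.aeron/embedded-driver? true})
```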