Any recommendations on how best to monitor an Onyx cluster running in Kubernetes? It's not clear to me how/if you could use onyx-peer-http-query in this case.
@j0ni ah yeah, it’s tough for us to know when your job is complete. You could do some different things like check whether you have processed all of the expected results.
@dave.dixon I would only be using onyx-peer-http-query with Kubernetes for health checks and metrics
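It's not much wiring on the Onyx side. IIRC you just require the onyx.http-query namespace and add a couple of keys to the peer-config, and then each peer serves its own HTTP endpoint. A rough sketch (the IP/port values are just examples):
```
;; require the namespace so the query server hooks into peer startup
(require '[onyx.http-query])

;; merged into the usual peer-config map; values are illustrative
{:onyx.query/server-ip   "0.0.0.0" ; bind all interfaces so k8s probes can reach it
 :onyx.query/server-port 8080}     ; each peer pod exposes its own server here
```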
@lucasbradstreet Yes. I'm just a little lost on how you wire it up. Each node runs its own server, right? Presumably you then want to aggregate all those results into Prometheus or something; I'm wondering how people are actually wiring that up.
There's probably some key k8s concept I'm missing here.
@dave.dixon I am a little rusty, but I think one way to do this is to add annotations to your peer pods so Prometheus can discover them through its k8s service-discovery scraping mechanism.
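Assuming your Prometheus uses the common kubernetes_sd annotation-based scrape config, the pod template would carry something like this (port and path are just examples matching the query server above):
```
# pod template metadata; assumes a kubernetes_sd scrape config that
# honors these conventional annotations
metadata:
  annotations:
    prometheus.io/scrape: "true"   # opt the pod in to scraping
    prometheus.io/port: "8080"     # the onyx-peer-http-query port
    prometheus.io/path: "/metrics" # wherever the metrics endpoint lives
```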
@camechis Thanks, that sounds similar to something I was reading about using Prometheus with JMX in k8s. Sounds like it will get me headed in the right direction.
That’s right
Another question as I puzzle through my deployment issues: at start-up I get several messages like "Job ID 5b3651ac-0ad0-4005-b503-79609899c3d1 has already been submitted, and will not be scheduled again.", but those jobs don't show up when I call rq/jobs. I assume those can safely be ignored and aren't using vpeers or other resources?
Yep I’ll have to double check but I think it’s just a job being submitted with the same job id, and it’s the idempotency at work
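For context, the id comes from the job's metadata: if you submit with a fixed :job-id, resubmitting is a no-op, which is exactly that warning. A sketch (the uuid and the workflow/catalog/lifecycles vars are placeholders):
```
(require '[onyx.api])

;; a fixed :job-id in :metadata makes submit-job idempotent; submitting
;; the same id twice is a no-op and logs the "already been submitted" warning
(onyx.api/submit-job
 peer-config
 {:metadata {:job-id #uuid "5b3651ac-0ad0-4005-b503-79609899c3d1"}
  :workflow workflow
  :catalog catalog
  :lifecycles lifecycles
  :task-scheduler :onyx.task-scheduler/balanced})
```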
Occurs when the peer starts, before I submit any jobs.
Sounds like a job from a previous run in the same onyx tenancy?
Yes.
Right, it's just playing back the log
Trying to debug some really odd behavior, beginning to suspect it may be related to non-onyx parts of the application behaving badly with kafka and thus ZK. In an effort to better understand how things work, how might I wind up getting this warning on peer startup: "Log parameters have yet to be written to ZooKeeper by a peer. Backing off 500ms and trying again..."
Hmm, where’s that warning coming from? When the onyx peers start they’ll write out log parameters for a given tenancy
So it’s probably some other component starting up on a tenancy which hasn’t had any peers start yet.
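i.e. the peers and anything subscribing to the log need to agree on the tenancy. Roughly this (the key is :onyx/tenancy-id on recent versions, :onyx/id on older ones; values are illustrative):
```
;; shared by the peers and any log subscriber (dashboard, monitoring, etc.)
{:onyx/tenancy-id   "my-tenancy"           ; log parameters are written per tenancy
 :zookeeper/address "zk.default.svc:2181"} ; same ZK ensemble for every component
```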
Here's the stack trace. I'll add some more logging to see whether start-peer-group or start-peers is responsible.
No other component is starting, at least not intentionally. I also tried again with a fresh ZK (supposedly, it's a managed service) and still saw the same thing. Cleaning out all of the checkpoints now.
Hmm
Do you see any messages in there about aeron?
It’s possible it’s not able to connect to aeron and it’s not starting any peers as a result
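If it helps, the messaging section of the peer-config is where I'd look first. Roughly this shape (values are illustrative):
```
;; messaging settings in peer-config; if the Aeron media driver isn't
;; reachable with these, the peer group can start while no peers boot
{:onyx.messaging/impl      :aeron
 :onyx.messaging/bind-addr "localhost" ; must be resolvable inside the pod
 :onyx.messaging/peer-port 40200
 :onyx.messaging.aeron/embedded-driver? true} ; or run a standalone media driver
```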