I'm trying to send data to an AWS Kinesis stream and write the output to S3: https://gist.github.com/danielstockton/1004aed11873daa4730199829f2bef19
Can't seem to get it working, whereas I do get output onto a core.async channel.
I'm thinking there is an exception somewhere, but feedback-exception!
causes the job to hang.
Anyone know what the problem might be?
i keep getting 17-12-14 11:35:16 WARN [onyx.messaging.aeron.publication-manager:79] [aeron-client-conductor] - Aeron messaging publication error: io.aeron.exceptions.ConductorServiceTimeoutException: Timeout between service calls over 5000000000ns
after peers have been running for a few days
are there any known issues which could cause this to happen?
@mccraigmccraig Are the heartbeats timing out?
Ignore me
dunno @jasonbell
They wouldn’t at that point I don’t think.
i've got three containers running onyx peer processes, with aeron running in a separate process managed by the s6-overlay, which is the recommended configuration i think
the container logging the aeron timeouts seems to stop functioning as a peer, unsurprisingly
@danielstockton I assume no exceptions in the logs? First thing to check is that your AWS keys are getting picked up by the S3 writer.
Assuming they are, the next thing you can do -- if you're running locally -- is drop a debug function here: https://gist.github.com/danielstockton/1004aed11873daa4730199829f2bef19#file-etl-clj-L23
e.g. (fn [x] (prn x) x)
Remember to return the segment
If you're having issues in a real environment, throughput metrics will tell you where something's not flowing properly.
Thanks, that should get me started!
@mccraigmccraig What version are you on now?
Also have you changed your cluster's hardware or experienced changes in data intensity? That looks like starvation of Aeron again
@michaeldrogalis 0.9.15 on production - we'll be upgrading with our new cluster (with newer docker and kafka which both required upgrades), but i'm stuck on 0.9.15 for the moment
there has been a considerable increase in data intensity
Allocate more resources to the Aeron container.
i'm currently giving 2.5GB to the container, with a 0.6 ratio (1.5GB) of that going to peer heap and 0.2 (0.5GB) going to aeron... the onyx peers don't seem to need anything like that amount of heap though - they only seem to use a couple of hundred MB according to yourkit, so i could give 1.5GB to aeron easily enough - does that seem reasonable?
Yeah, that would likely help. I'd have to look at the metrics to give you a good answer, but it's definitely in the right direction.
Running them in separate containers would help more if your setup supports it.
running onyx and aeron in separate containers. hmm. i think that would be difficult with my current setup (mesos+marathon) but might be feasible on the new cluster (dc/os) with pods
Ah, wasn't sure what you meant by your last message when you said "the container". Got it
@mccraigmccraig you’re probably hitting GCs which are causing timeouts in the Aeron conductor. You could give it a bit more RAM, and you can increase the conductor service timeout to make it not time out quite so easily. Taking a flight recorder log would help you diagnose it further
Unfortunately I can only help so much with 0.9 issues
My understanding is that the onyx kafka input task manages its own offsets in the format partition->offset, like {0 50, 1 90}
that can be passed in as :kafka/start-offsets
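For reference, a sketch of what a catalog entry using :kafka/start-offsets might look like -- the topic, zookeeper address, and deserializer fn here are made up for illustration, and the exact plugin keys should be checked against the onyx-kafka README for your version:

```clojure
;; Hedged sketch of a kafka input catalog entry with explicit start offsets.
;; Topic name, zookeeper address, and deserializer are placeholders.
{:onyx/name :in
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "events"                     ;; placeholder topic
 :kafka/zookeeper "127.0.0.1:2181"        ;; placeholder address
 :kafka/deserializer-fn :my.app/deserialize ;; placeholder fn
 ;; partition -> offset to begin reading from
 :kafka/start-offsets {0 50, 1 90}
 :onyx/max-peers 2
 :onyx/batch-size 100}
```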
I'm curious, for a given snapshot / resume point, how would I get that offset map? For instance, I have a resume point:
:in
{:input
{:tenancy-id "2",
:job-id #uuid "23709a1c-ae8b-0b3b-a1a5-5bce4cc935cb",
:replica-version 25,
:epoch 144223,
:created-at 1512603190357,
:mode :resume,
:task-id :in,
:slot-migration :direct}},
What path would I look in ZK to get the actual offsets? (I have S3 checkpointing on, if it's in the checkpoint.) Additionally, is there anything in onyx.api or an onyx lib that could pull those offsets out of ZK into my repl env?
It’s in the S3 checkpoint.
I would love one to exist. Currently the best you can do is instantiate this: https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/storage/s3.clj#L122
and have it read the checkpoint: https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/storage/s3.clj#L218
via the coordinates stored in ZK at https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/api.clj#L319
@lucasbradstreet Thanks a bunch! looking into it
it shouldn’t be so hard to wire up. If you have any questions let me know, but I would love it if you shared it after 🙂
this is a better link to the coordinates https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/api.clj#L306
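Piecing the three links together, the wiring might look roughly like the sketch below. Every name in it (checkpoint/storage, checkpoint/read-checkpoint, the argument order) is a guess at the code behind those links and must be verified against the 0.12.x source before use:

```clojure
;; ROUGH SKETCH - all names below are guesses from the linked 0.12.x
;; source (onyx.storage.s3 / onyx.api); verify signatures before relying on it.
(require '[onyx.checkpoint :as checkpoint])

(defn input-checkpoint
  "Fetch one task's input checkpoint (e.g. kafka offsets) from S3."
  [peer-config {:keys [tenancy-id job-id replica-version epoch task-id]}]
  ;; s3.clj#L122: instantiate the S3 storage implementation
  (let [storage (checkpoint/storage peer-config)]
    ;; s3.clj#L218: read the checkpoint via the coordinates that
    ;; api.clj#L306 looks up in ZK (slot 0, :input checkpoint type assumed)
    (checkpoint/read-checkpoint storage tenancy-id job-id
                                replica-version epoch
                                task-id 0 :input)))
```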
We're trying. Going through the corporate open sourcing process in January. We also forked onyx-dashboard so you can view each task's window state, it's been super helpful for us.
Ooh
The extension from here will be that you will be able to repartition the offsets a bit more easily
I’ve been considering allowing resume points to be passed as values for these sorts of circumstances, as an alternative to :kafka/start-offsets
e.g.
:in
{:input
{:tenancy-id "2",
:job-id #uuid "23709a1c-ae8b-0b3b-a1a5-5bce4cc935cb",
:replica-version 25,
:epoch 144223,
:created-at 1512603190357,
:mode :value,
:task-id :in,
:value {0 {0 35} 1 {1 99}}
:slot-migration :direct}},
That would definitely be helpful as well
Would remove the complexity from the plugin side
So, I’ve been using the new :reduce type and noticed that it breaks the visualization graph in the onyx-visualization library and therefore the dashboard as well. If I were to open a pull request to fix that in onyx-visualization, should it share the same node colour as a :function?
@niamu Sure, works for us.
Happy to merge it when it's ready.
PR is up now: https://github.com/onyx-platform/onyx-visualization/pull/9
Are there any long-term plans for the onyx-visualization library? I’ve been talking with a coworker about making a fork of the library that would visualize flow conditions and other components of the job as well.
onyx-dashboard and onyx-visualization have taken a back seat to all the plugins and helper utilities, but we do love PRs
Sorry about the reduce breakage :/
No worries. We’re still actively working on deploying our first Onyx job so it wasn’t a big issue for us - something easily fixed once we noticed it.
Great :)
How are you finding the reduce type?
It’s great. Perfectly solved the hack of using a flow condition to stop the segments from flowing downstream in our pipeline.
Excellent.
We started working in earnest with Onyx during the 0.12 release so noticing that in the release notes allowed us to fix that portion of the job that we found a little ugly.
Nice. Yes, it was about time for a solution for that. It actually solved a number of issues quite cleanly (e.g. needing to use a null plugin for aggregating terminal tasks). Glad it worked out for you too.
Today we started our first steps migrating from docker-compose to a Kubernetes deployment and we’ve encountered a problem that we’re not sure how to debug involving the Aeron media driver not starting.
>17-12-14 21:21:14 onyx-peer-79d6d498d4-b8c6r WARN [onyx.peer.peer-group-manager:277] - Aeron media driver has not started up. Waiting for media driver before starting peers, and backing off for 500ms.
Any thoughts?
Sounds like you’re not starting a media driver when you start the peers. We run it in a sidecar container sharing memory between the two containers. If you want to get unstuck for now you can start the embedded driver (search the cheat sheet for the right peer config)
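As a sketch, the embedded-driver option looks something like the peer config below -- the tenancy id and addresses are placeholders, and the exact key name should be confirmed in the cheat sheet:

```clojure
;; Peer-config sketch for running the embedded Aeron media driver in-process.
;; Tenancy id, zookeeper address, and bind address are placeholders.
{:onyx/tenancy-id "dev"
 :zookeeper/address "127.0.0.1:2181"
 :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
 :onyx.messaging/impl :aeron
 :onyx.messaging/bind-addr "localhost"
 ;; starts a media driver inside the peer JVM instead of a separate process
 :onyx.messaging.aeron/embedded-driver? true}
```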
I don’t think we have an example for the sidecar up anywhere yet. @gardnervickers?
Not yet, no.
We’ve been following the manifests from the onyx-twitter-sample (https://github.com/onyx-platform/onyx-twitter-sample/tree/master/kubernetes) but it sounds like there are improvements to be made in that process.
@niamu are you using Helm?
No.
First I’ve heard of it actually.
We need some examples soon for compiling/running the peer sidecar container. What it consists of is running two containers in a single pod, sharing /dev/shm as a type: Memory volume.
Ok, sounds similar to what is defined already in the sample manifests here: https://github.com/onyx-platform/onyx-twitter-sample/blob/master/kubernetes/peer.deployment.yaml#L26
Apart from the two containers in a single pod bit.
Exactly
Essentially you don’t want to run multiple processes in a single container, so splitting the template docker container into two containers is best practice.
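A sketch of that pod layout, assuming a Deployment; the image names are placeholders for your own peer and media-driver images:

```yaml
# Two containers in one pod sharing /dev/shm via a memory-backed emptyDir.
# Image names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: onyx-peer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: onyx-peer
  template:
    metadata:
      labels:
        app: onyx-peer
    spec:
      volumes:
        - name: aeron-shm
          emptyDir:
            medium: Memory        # tmpfs shared between the two containers
      containers:
        - name: peer
          image: my-registry/onyx-peer:latest          # placeholder
          volumeMounts:
            - name: aeron-shm
              mountPath: /dev/shm
        - name: media-driver
          image: my-registry/aeron-media-driver:latest # placeholder
          volumeMounts:
            - name: aeron-shm
              mountPath: /dev/shm
```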
oh I see. So we just need a container whose sole job is to start the media driver and share its volume with the other container in the pod that runs the peer process.
Yes
Thanks a lot. I’ll try that out tomorrow and see how far I get.
@niamu just a heads up for when you go to prod with k8s and onyx. I highly recommend wiring up health checks to https://github.com/onyx-platform/onyx-peer-http-query#route
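In k8s terms that could be a probe along these lines -- the port, path, and threshold here are assumptions to check against the onyx-peer-http-query README linked above:

```yaml
# Sketch: liveness probe hitting the onyx-peer-http-query health route.
# Port, path, and threshold value are assumptions; see the linked README.
livenessProbe:
  httpGet:
    path: /health?threshold=5000
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```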
Yes, I believe I saw you recommend that a while back to someone else. I have that bookmarked to revisit. 🙂
Good good. Ah, forgot that I collected all of this stuff into: http://www.onyxplatform.org/docs/user-guide/0.12.x/#production-check-list. Less of a reason to tell everyone 🙂