onyx

FYI: alternative Onyx :onyx: chat is at <https://gitter.im/onyx-platform/onyx> ; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
2017-12-14T11:18:50.000369Z

I'm trying to send data to an aws kinesis stream and send the output to S3: https://gist.github.com/danielstockton/1004aed11873daa4730199829f2bef19 Can't seem to get it working, whereas I do get output onto a core.async channel. I'm thinking there is an exception somewhere, but feedback-exception! causes the job to hang.

2017-12-14T11:19:18.000293Z

Anyone know what the problem might be?

mccraigmccraig 2017-12-14T11:38:29.000271Z

i keep getting 17-12-14 11:35:16 WARN [onyx.messaging.aeron.publication-manager:79] [aeron-client-conductor] - Aeron messaging publication error: io.aeron.exceptions.ConductorServiceTimeoutException: Timeout between service calls over 5000000000ns after peers have been running for a few days

mccraigmccraig 2017-12-14T11:38:42.000382Z

are there any known issues which could cause this to happen ?

jasonbell 2017-12-14T11:38:59.000021Z

@mccraigmccraig Are the heartbeats timing out?

jasonbell 2017-12-14T11:39:19.000224Z

Ignore me

mccraigmccraig 2017-12-14T11:39:24.000260Z

dunno @jasonbell

jasonbell 2017-12-14T11:39:31.000272Z

They wouldn’t at that point I don’t think.

mccraigmccraig 2017-12-14T11:41:15.000217Z

i've got three containers running onyx peer processes, with aeron running in a separate process managed by the s6-overlay, which is the recommended configuration i think

mccraigmccraig 2017-12-14T11:42:05.000198Z

the container logging the aeron timeouts seems to stop functioning as a peer, not surprisingly

michaeldrogalis 2017-12-14T16:16:43.000015Z

@danielstockton I assume no exceptions in the logs? First thing to check is that your AWS keys are getting picked up by the S3 writer.

michaeldrogalis 2017-12-14T16:17:26.000005Z

Assuming they are, the next thing you can do, if you're running locally, is drop a debug function here: https://gist.github.com/danielstockton/1004aed11873daa4730199829f2bef19#file-etl-clj-L23

michaeldrogalis 2017-12-14T16:17:39.000494Z

e.g. (fn [x] (prn x) x). Remember to return the segment.
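A minimal sketch of that debug function pulled out into a named var you could point a catalog entry's :onyx/fn at; the namespace and name are illustrative, not from the gist:

  (ns etl.debug)

  ;; Pass-through debug function: print each segment for
  ;; inspection, then return it unchanged so it keeps flowing
  ;; downstream.
  (defn log-segment [segment]
    (prn segment)
    segment)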

michaeldrogalis 2017-12-14T16:18:09.000610Z

If you're having issues in a real environment, throughput metrics will tell you where something's not flowing properly.

2017-12-15T08:40:39.000214Z

Thanks, that should get me started!

michaeldrogalis 2017-12-14T16:20:39.000049Z

@mccraigmccraig What version are you on now?

michaeldrogalis 2017-12-14T16:21:43.000008Z

Also have you changed your cluster's hardware or experienced changes in data intensity? That looks like starvation of Aeron again

mccraigmccraig 2017-12-14T16:23:00.000112Z

@michaeldrogalis 0.9.15 on production - we'll be upgrading with our new cluster (with newer docker and kafka which both required upgrades), but i'm stuck on 0.9.15 for the moment

mccraigmccraig 2017-12-14T16:23:09.000594Z

there has been a considerable increase in data intensity

michaeldrogalis 2017-12-14T16:25:05.000264Z

Allocate more resources to the Aeron container.

mccraigmccraig 2017-12-14T16:29:29.000427Z

i'm currently giving 2.5GB to the container, with 0.6 of that (1.5GB) going to peer heap and 0.2 (0.5GB) going to aeron... the onyx peers don't seem to need anything like that amount of heap though - they only seem to use a couple of hundred MB according to yourkit, so i could give 1.5GB to aeron easily enough - does that seem reasonable?

michaeldrogalis 2017-12-14T16:30:52.000095Z

Yeah, that would likely help. I'd have to look at the metrics to give you a good answer, but it's definitely in the right direction.

michaeldrogalis 2017-12-14T16:31:23.000870Z

Running them in separate containers would help more if your setup supports it.

mccraigmccraig 2017-12-14T16:34:09.000387Z

running onyx and aeron in separate containers. hmm. i think that would be difficult with my current setup (mesos+marathon) but might be feasible on the new cluster (dc/os) with pods

michaeldrogalis 2017-12-14T16:35:52.000581Z

Ah, wasn't sure what you meant by your last message when you said "the container". Got it

lucasbradstreet 2017-12-14T17:46:07.000281Z

@mccraigmccraig you’re probably hitting GCs which are causing timeouts in the Aeron conductor. You could give it a bit more RAM, and you can increase the conductor service timeout to make it not time out quite so easily. Taking a flight recorder log would help you diagnose it further
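For reference, a hedged sketch of JVM options that would capture such a flight recorder log, assuming the peers run on an Oracle JDK 8 (where Flight Recorder sits behind the commercial-features flag); the project name and output path are placeholders:

  ;; project.clj (illustrative)
  (defproject my-peers "0.1.0"
    :jvm-opts ["-XX:+UnlockCommercialFeatures"
               "-XX:+FlightRecorder"
               "-XX:StartFlightRecording=duration=120s,filename=/tmp/peer.jfr"])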

lucasbradstreet 2017-12-14T17:47:06.000338Z

Unfortunately I can only help so much with 0.9 issues

eriktjacobsen 2017-12-14T19:49:55.000726Z

My understanding is that the onyx kafka input task manages its own offsets in a partition->offset map like {0 50, 1 90} that can be passed in as :kafka/start-offsets. I'm curious: for a given snapshot / resume point, how would I get that offset map? For instance, I have a resume point:

:in
 {:input
  {:tenancy-id "2",
   :job-id #uuid "23709a1c-ae8b-0b3b-a1a5-5bce4cc935cb",
   :replica-version 25,
   :epoch 144223,
   :created-at 1512603190357,
   :mode :resume,
   :task-id :in,
   :slot-migration :direct}},

What path in ZK would I look at to get the actual offsets? (I have S3 checkpointing on, if it's in the checkpoint)

eriktjacobsen 2017-12-14T19:51:58.000083Z

additionally, is there anything in onyx.api or an onyx lib that could pull those offsets out of zk into my repl env?
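For context, a hedged sketch of where :kafka/start-offsets sits in an onyx-kafka input catalog entry; connection keys are elided, and the other keys and values are illustrative, so check them against the plugin README for your version:

  {:onyx/name :in
   :onyx/plugin :onyx.plugin.kafka/read-messages
   :onyx/type :input
   :onyx/medium :kafka
   :kafka/topic "events"
   :kafka/deserializer-fn :my.ns/deserialize
   ;; partition -> offset: resume partition 0 at offset 50,
   ;; partition 1 at offset 90
   :kafka/start-offsets {0 50, 1 90}
   :onyx/batch-size 20
   :onyx/max-peers 2}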

lucasbradstreet 2017-12-14T19:59:47.000412Z

It’s in the S3 checkpoint.

lucasbradstreet 2017-12-14T20:00:42.000522Z

I would love one to exist. Currently the best you can do is instantiate this: https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/storage/s3.clj#L122

lucasbradstreet 2017-12-14T20:00:53.000805Z

and have it read the checkpoint: https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/storage/s3.clj#L218

lucasbradstreet 2017-12-14T20:01:25.000355Z

via the coordinates stored in ZK at https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/api.clj#L319

eriktjacobsen 2017-12-14T20:05:06.000739Z

@lucasbradstreet Thanks a bunch! looking into it

lucasbradstreet 2017-12-14T20:05:43.000237Z

it shouldn’t be so hard to wire up. If you have any questions let me know, but I’d love it if you shared it afterwards 🙂

lucasbradstreet 2017-12-14T20:06:16.000315Z

this is a better link to the coordinates https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/api.clj#L306
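Pulling those three links together, a rough sketch of the wiring. Every name below other than onyx.api/job-snapshot-coordinates is a hypothetical stand-in for the code at the linked lines, and even that function's arity should be verified against your Onyx version:

  ;; 1. fetch the latest checkpoint coordinates from ZooKeeper
  ;;    (see onyx/api.clj#L306); signature is illustrative
  (def coords
    (onyx.api/job-snapshot-coordinates peer-config job-id))

  ;; 2. instantiate the S3 storage (s3.clj#L122), read the input
  ;;    task's checkpoint bytes (s3.clj#L218), and deserialize to
  ;;    recover the partition->offset map. These three calls are
  ;;    illustrative names, not the real API.
  (def offsets
    (-> (make-s3-storage peer-config)
        (read-checkpoint coords :in 0)
        (deserialize-checkpoint)))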

eriktjacobsen 2017-12-14T20:08:20.000340Z

We're trying. Going through the corporate open-sourcing process in January. We also forked onyx-dashboard so you can view each task's window state; it's been super helpful for us.

lucasbradstreet 2017-12-14T20:08:30.000299Z

Ooh

lucasbradstreet 2017-12-14T20:11:35.000624Z

The extension from here will be that you'll be able to repartition the offsets a bit more easily

lucasbradstreet 2017-12-14T20:12:44.000053Z

I’ve been considering allowing resume points to be passed as values for these sorts of circumstances, as an alternative to :kafka/start-offsets

lucasbradstreet 2017-12-14T20:12:48.000511Z

e.g.

lucasbradstreet 2017-12-14T20:13:01.000245Z

:in
 {:input
  {:tenancy-id "2",
   :job-id #uuid "23709a1c-ae8b-0b3b-a1a5-5bce4cc935cb",
   :replica-version 25,
   :epoch 144223,
   :created-at 1512603190357,
   :mode :value,
   :task-id :in,
   :value {0 {0 35} 1 {1 99}}
   :slot-migration :direct}},

eriktjacobsen 2017-12-14T20:18:44.000381Z

That would definitely be helpful as well

lucasbradstreet 2017-12-14T20:27:05.000532Z

Would remove the complexity from the plugin side

niamu 2017-12-14T22:22:37.000180Z

So, I’ve been using the new :reduce type and noticed that it breaks the visualization graph in the onyx-visualization library and therefore the dashboard as well. If I were to open a pull request to fix that in onyx-visualization, should it share the same node colour as a :function?

michaeldrogalis 2017-12-14T22:41:30.000119Z

@niamu Sure, works for us.

michaeldrogalis 2017-12-14T22:41:35.000292Z

Happy to merge it when it's ready.

niamu 2017-12-14T22:58:53.000391Z

PR is up now: https://github.com/onyx-platform/onyx-visualization/pull/9

niamu 2017-12-14T23:00:34.000116Z

Are there any long-term plans for the onyx-visualization library? I’ve been talking with a coworker about making a fork of the library that would visualize flow conditions and other components of the job as well.

lucasbradstreet 2017-12-14T23:01:35.000290Z

onyx-dashboard and onyx-visualization have taken a back seat to all the plugins and helper utilities, but we do love PRs

lucasbradstreet 2017-12-14T23:01:51.000378Z

Sorry about the reduce breakage :/

niamu 2017-12-14T23:03:10.000279Z

No worries. We’re still actively working on deploying our first Onyx job, so it wasn’t a big issue for us, but it was something we noticed that was easily fixed.

lucasbradstreet 2017-12-14T23:03:36.000378Z

Great :)

lucasbradstreet 2017-12-14T23:03:49.000290Z

How are you finding the reduce type?

niamu 2017-12-14T23:04:37.000080Z

It’s great. It perfectly replaced the hack of using a flow condition to stop the segments from flowing downstream in our pipeline.
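A minimal sketch of such a terminal reduce task, with a window attached to hold the aggregated state; the task and window names and the aggregation choice are illustrative:

  ;; Catalog entry: segments stop here, so no output plugin and no
  ;; flow-condition hack are needed.
  {:onyx/name :aggregate
   :onyx/type :reduce
   :onyx/max-peers 1
   :onyx/batch-size 20}

  ;; Window over the reduce task, e.g. a global conj aggregation.
  {:window/id :collect
   :window/task :aggregate
   :window/type :global
   :window/aggregation :onyx.windowing.aggregation/conj}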

lucasbradstreet 2017-12-14T23:05:34.000134Z

Excellent.

niamu 2017-12-14T23:06:26.000214Z

We started working in earnest with Onyx during the 0.12 release so noticing that in the release notes allowed us to fix that portion of the job that we found a little ugly.

lucasbradstreet 2017-12-14T23:07:28.000239Z

Nice. Yes, it was about time for a solution to that. It actually solved a number of issues quite cleanly (e.g. needing to use a null plugin for aggregating terminal tasks). Glad it worked out for you too.

niamu 2017-12-14T23:09:54.000378Z

Today we started our first steps migrating from docker-compose to a Kubernetes deployment and we’ve encountered a problem that we’re not sure how to debug involving the Aeron media driver not starting.

niamu 2017-12-14T23:09:57.000244Z

>17-12-14 21:21:14 onyx-peer-79d6d498d4-b8c6r WARN [onyx.peer.peer-group-manager:277] - Aeron media driver has not started up. Waiting for media driver before starting peers, and backing off for 500ms.

niamu 2017-12-14T23:10:10.000254Z

Any thoughts?

lucasbradstreet 2017-12-14T23:10:59.000346Z

Sounds like you’re not starting a media driver when you start the peers. We run it in a sidecar container, sharing memory between the two containers. If you want to get unstuck for now, you can start the embedded driver (search the cheat sheet for the right peer config)
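The embedded-driver escape hatch is a peer-config flag; a minimal sketch, where all the surrounding values are placeholders:

  {:onyx/tenancy-id "dev"
   :zookeeper/address "127.0.0.1:2181"
   :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"
   :onyx.messaging/peer-port 40200
   ;; start the Aeron media driver inside the peer JVM; handy for
   ;; getting unstuck, but a separate driver process is preferred
   ;; in production
   :onyx.messaging.aeron/embedded-driver? true}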

lucasbradstreet 2017-12-14T23:11:37.000136Z

I don’t think we have an example of the sidecar up anywhere yet. @gardnervickers?

gardnervickers 2017-12-14T23:12:20.000179Z

Not yet, no.

niamu 2017-12-14T23:12:24.000089Z

We’ve been following the manifests from the onyx-twitter-sample (https://github.com/onyx-platform/onyx-twitter-sample/tree/master/kubernetes) but it sounds like there are improvements to be made in that process.

gardnervickers 2017-12-14T23:13:23.000089Z

@niamu are you using Helm?

niamu 2017-12-14T23:13:40.000152Z

No.

niamu 2017-12-14T23:15:48.000416Z

First I’ve heard of it actually.

gardnervickers 2017-12-14T23:17:02.000075Z

We need some examples soon for compiling/running the peer sidecar container. It consists of running two containers in a single pod, sharing /dev/shm as a type: Memory volume.
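A hedged sketch of that pod shape, since no example exists yet; the image names are placeholders and only the shared memory volume is the point:

  apiVersion: v1
  kind: Pod
  metadata:
    name: onyx-peer
  spec:
    volumes:
      - name: aeron-shm
        emptyDir:
          medium: Memory   # the shared /dev/shm between containers
    containers:
      - name: peer
        image: example/onyx-peer:latest            # placeholder
        volumeMounts:
          - name: aeron-shm
            mountPath: /dev/shm
      - name: media-driver
        image: example/aeron-media-driver:latest   # placeholder
        volumeMounts:
          - name: aeron-shm
            mountPath: /dev/shm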

niamu 2017-12-14T23:18:34.000318Z

Ok, sounds similar to what is defined already in the sample manifests here: https://github.com/onyx-platform/onyx-twitter-sample/blob/master/kubernetes/peer.deployment.yaml#L26

niamu 2017-12-14T23:20:16.000220Z

Apart from the two containers in a single pod bit.

gardnervickers 2017-12-14T23:21:43.000263Z

Exactly

gardnervickers 2017-12-14T23:22:15.000201Z

Essentially you don’t want to run multiple processes in a single container, so splitting out the template docker container into two containers is best practice.

niamu 2017-12-14T23:23:33.000385Z

oh I see. So we just need a container whose sole job is to start the media driver, and share its volume with the other container in the pod that runs the peer process.

gardnervickers 2017-12-14T23:29:30.000412Z

Yes

niamu 2017-12-14T23:30:34.000244Z

Thanks a lot. I’ll try that out tomorrow and see how far I get.

lucasbradstreet 2017-12-14T23:33:32.000067Z

@niamu just a heads up for when you go to prod with k8s and onyx. I highly recommend wiring up health checks to https://github.com/onyx-platform/onyx-peer-http-query#route
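Wiring that up starts with a peer-config change to expose the HTTP query server; a minimal sketch per the onyx-peer-http-query README (the port and bind address are arbitrary):

  ;; merged into the peer config: serve the query/health routes
  {:onyx.query/server? true
   :onyx.query.server/ip "0.0.0.0"
   :onyx.query.server/port 8080}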

niamu 2017-12-14T23:34:45.000189Z

Yes, I believe I saw you recommend that a while back to someone else. I have that bookmarked to revisit. 🙂

lucasbradstreet 2017-12-14T23:36:51.000105Z

Good good. Ah, forgot that I collected all of this stuff into: http://www.onyxplatform.org/docs/user-guide/0.12.x/#production-check-list. Less of a reason to tell everyone 🙂