onyx

FYI: alternative Onyx :onyx: chat is at <https://gitter.im/onyx-platform/onyx> ; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
kenny 2017-11-22T04:48:52.000146Z

When testing out Onyx + Kafka my peers continually restart after throwing this exception:

java.lang.NullPointerException: 
    clojure.lang.ExceptionInfo: Caught exception inside task lifecycle :lifecycle/initializing. Rebooting the task. -> Exception type: java.lang.NullPointerException. Exception message: null
       job-id: #uuid "30fbff49-4319-a757-6a0b-b7f28f795426"
     metadata: {:job-id #uuid "30fbff49-4319-a757-6a0b-b7f28f795426", :job-hash "db786ebce6742af642d9d60805225644110d2b6ded6bfd599e2f3823e62c4"}
      peer-id: #uuid "c01c9e46-3a41-efb7-88f7-7f2dd7b8cf7c"
    task-name: :in
That is the whole stacktrace. Seems like it is cut off but that's it. Any ideas on how to proceed?

lucasbradstreet 2017-11-22T04:50:22.000110Z

That’s odd. I think there should be more to the stack trace. I assume this is from onyx.log?

kenny 2017-11-22T04:51:14.000196Z

Yes. I do have these additional config settings:

{:level     :info
 :appenders {:println {:ns-whitelist ["compute.*" "dev.*"]}
             :spit    (assoc (appenders/spit-appender {:fname "server.log"})
                        :ns-blacklist ["org.apache.zookeeper.*"
                                       "onyx.messaging.*"])}}
I wouldn't imagine that would affect the printing of a stacktrace though.

lucasbradstreet 2017-11-22T04:52:10.000228Z

Ok. If your error is reproducible, turning that off and trying again is all I can really think to try

kenny 2017-11-22T04:52:45.000127Z

Very reproducible. Will try that now.

kenny 2017-11-22T04:56:53.000207Z

The full stacktrace is now visible. Though it wasn't hidden due to that log config, rather the JVM optimization that omits the stacktrace root.

lucasbradstreet 2017-11-22T04:59:35.000073Z

Ah. That one. I do usually overrule that in my dev JVM opts

lucasbradstreet 2017-11-22T05:02:53.000207Z

Anything we need to be worried about?

kenny 2017-11-22T05:05:06.000035Z

Yep just added that to my JVM opts. Nope.
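
For context: the optimization mentioned above is presumably HotSpot's fast-throw behaviour. Once certain built-in exceptions (such as NullPointerException) have been thrown often enough from hot code, the JVM starts throwing a preallocated exception with no stack trace. The flag `-XX:-OmitStackTraceInFastThrow` switches that off. A minimal sketch of setting it for development, assuming a Leiningen `project.clj` (the project name and profile layout are hypothetical; with Boot the same flag can go in the `BOOT_JVM_OPTIONS` environment variable):

```clojure
;; project.clj: hypothetical project, only the :jvm-opts entry matters here.
(defproject my-onyx-app "0.1.0-SNAPSHOT"
  ;; Keep full stack traces for repeated NPEs and similar exceptions in dev.
  :profiles {:dev {:jvm-opts ["-XX:-OmitStackTraceInFastThrow"]}})
```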

lucasbradstreet 2017-11-22T05:05:20.000152Z

Ok, cheers.

2017-11-22T08:49:49.000035Z

now that we're talking about this, i recall there were some JVM options that allow you to emit a stderr log any time any exception is generated, bypassing any logging framework. or am i confused?

lucasbradstreet 2017-11-22T08:50:34.000003Z

Not sure about that one. I know there is a way to hook into unhandled exceptions, which is handy when you have a lot of threads doing things.
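
The hook referred to here is most likely the JDK's `Thread/setDefaultUncaughtExceptionHandler`. A minimal sketch of wiring it up from Clojure follows; the `uncaught` atom and the thread name `worker-1` are illustrative names only and are not part of Onyx.

```clojure
;; Collects a record of every exception that escaped a thread (illustrative only).
(def uncaught (atom []))

(defn install-handler!
  "Installs a JVM-wide handler for exceptions that no thread caught."
  []
  (Thread/setDefaultUncaughtExceptionHandler
    (reify Thread$UncaughtExceptionHandler
      (uncaughtException [_ thread ex]
        (swap! uncaught conj {:thread  (.getName thread)
                              :message (.getMessage ex)})
        ;; Write straight to stderr, bypassing any logging framework.
        (binding [*out* *err*]
          (println "Uncaught exception on" (.getName thread) ":" (.getMessage ex)))))))

(install-handler!)

;; Usage: a background thread that dies with an unhandled exception.
(doto (Thread. (fn [] (throw (ex-info "boom" {}))) "worker-1")
  (.start)
  (.join))
```

The handler only sees exceptions that propagate out of a thread's `run` method, so anything caught and logged by a framework along the way never reaches it.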

2017-11-22T09:23:53.000243Z

hmmm might be mistaking jvm for another language

2017-11-22T12:38:18.000149Z

question regarding job submission. we have four peers running and they all submit the same set of 3 jobs. for 2 of the jobs, the job submission returns success on all 4 peers. for the 3rd, success is only returned on one peer and the other 3 peers fail with :incorrect-job-hash. is this just the result of a race condition in job submission? or are we somehow generating different job hashes on our peers? I believe the job only needs to be submitted once to the cluster, but just want to make sure I understand what is happening here. we are also running on 0.9, currently.

michaeldrogalis 2017-11-22T15:49:01.000491Z

@kenny That JVM opt kills me every time

michaeldrogalis 2017-11-22T15:53:59.000340Z

@djjolicoeur I think you're confused about how job submission works. You only want to submit each job once - and probably not from the peers on start up. I'd start them from a different process

michaeldrogalis 2017-11-22T15:54:43.000177Z

If you're running into a hash error, you're trying to submit the same job ID with different job content

2017-11-22T15:58:18.000600Z

thanks @michaeldrogalis, we are actually looking to make that change WRT job submission in the near future. I essentially inherited this system and, to date, the current job submission process has not caused any issues other than that submission quirk I mentioned. that being said, the content of the jobs should be idempotent, so I need to track down what the diff is.

michaeldrogalis 2017-11-22T16:00:00.000428Z

@djjolicoeur Understood. Yeah, there's some underlying difference causing them to hash to different values. This should be the place to figure out what's up: https://github.com/onyx-platform/onyx/blob/0.9.x/src/onyx/api.clj#L209

michaeldrogalis 2017-11-22T16:00:20.000395Z

Or, more directly hash the job yourself offline: https://github.com/onyx-platform/onyx/blob/0.9.x/src/onyx/api.clj#L170
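
One way to track down such a difference offline is sketched below. It assumes the two job maps can be captured as plain data. It does not reproduce Onyx's own hash function (see the api.clj links above for that); `content-hash` is a hypothetical stand-in that hashes the printed form of the job, and `job-a` and `job-b` are made-up examples.

```clojure
(require '[clojure.data :as data])
(import 'java.security.MessageDigest)

(defn sha256-hex
  "Hex-encoded SHA-256 digest of a string."
  [^String s]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256")
                        (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))

(defn content-hash
  "Stand-in content hash: NOT Onyx's implementation, just a hash of pr-str."
  [job]
  (sha256-hex (pr-str job)))

;; Two hypothetical jobs that differ in a single catalog entry.
(def job-a {:workflow [[:in :out]]
            :catalog  [{:onyx/name :in :onyx/batch-size 10}]})
(def job-b (assoc-in job-a [:catalog 0 :onyx/batch-size] 20))

;; clojure.data/diff returns [only-in-a only-in-b in-both].
(let [[only-a only-b _] (data/diff job-a job-b)]
  (println "hashes equal?" (= (content-hash job-a) (content-hash job-b)))
  (println "only in A:" only-a)
  (println "only in B:" only-b))
```

Running the diff on the job maps produced by two different peers should point at the key whose value varies between them.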

2017-11-22T16:01:08.000087Z

thanks @michaeldrogalis, I will take a look

michaeldrogalis 2017-11-22T16:01:38.000564Z

Np. 🙂

kenny 2017-11-22T18:44:36.000106Z

How do you guys develop an Onyx job that uses a Kafka queue at the REPL? Do you start an embedded version of Kafka? Or maybe replace Kafka queues with a Core async channel? What's the best approach?

lucasbradstreet 2017-11-22T18:46:12.000494Z

We used to use an embedded version and/or swap out core async, but lately we’ve moved closer to just using kafka directly via docker/docker-compose https://github.com/onyx-platform/onyx-kafka/blob/0.12.x/docker-compose.yml

lucasbradstreet 2017-11-22T18:46:48.000043Z

My preference is to minimise differences between dev, tests and prod, rather than get a bit nicer of a dev experience by swapping out core async.

lucasbradstreet 2017-11-22T18:47:19.000151Z

I’ve often initially developed it against core.async and then moved it to kafka via docker-compose later though.

lucasbradstreet 2017-11-22T18:47:35.000331Z

We use circleci which allows us to stand up ZK + Kafka for our tests via that compose yaml.

kenny 2017-11-22T18:48:27.000138Z

I like that approach. How do you handle the creation of topics that are needed for your job? Do you use the Kafka admin API?

kenny 2017-11-22T19:02:36.000411Z

Makes sense. I'll try that approach out. Thank you 🙂

kennethkalmer 2017-11-22T21:32:24.000048Z

I’m curious what the smallest practical EC2 instance types are for a Zookeeper ensemble to power onyx… more specifically, onyx mostly for monthly batch jobs and a few adhoc jobs throughout the month

kennethkalmer 2017-11-22T21:33:11.000300Z

For this project it really isn’t about scaling as much as it is about breaking up the batch processing code into neatly defined workflows that are easier to reason about

kennethkalmer 2017-11-22T21:37:26.000231Z

Hmm, that cluster might need kafka too… still lots to explore and figure out, just trying to think ahead to what happens if my poc is a success 🙂

lucasbradstreet 2017-11-22T21:41:35.000196Z

I’m not really sure how small you can go because we have always biased towards that piece being as solid as possible. We don’t do all that much with ZK when you’re using the s3 checkpointer though, so it just needs to be able to respond reliably.

kennethkalmer 2017-11-22T21:46:54.000385Z

I'll play when it comes down to it and see!

kenny 2017-11-22T21:50:31.000293Z

Any idea what this Schema error is talking about?

clojure.lang.ExceptionInfo: Value does not match schema: {(not (= (name (:kafka)) (namespace :kafka/offset-reset))) invalid-key}
     error: {(not (= (name (:kafka)) (namespace :kafka/offset-reset))) invalid-key}
I don't think I can run spec on the task def because I can't find any Kafka plugin specs.

kenny 2017-11-22T21:50:45.000141Z

I'm not sure why :kafka is in a list in that exception. Seems strange.

michaeldrogalis 2017-11-22T21:58:07.000065Z

@kenny Can I see your catalog?

michaeldrogalis 2017-11-22T21:58:17.000146Z

I feel like I say this on a weekly basis now. We gotta ditch Schema. 😕

lucasbradstreet 2017-11-22T22:15:08.000160Z

From a first look, kafka/offset-reset looks ok. I’m on my phone though so it’s hard to look deeper

lucasbradstreet 2017-11-22T22:15:17.000084Z

What version of onyx kafka is this?

lucasbradstreet 2017-11-22T22:17:11.000196Z

I assume this exception was thrown after you started the job?

kenny 2017-11-22T22:19:24.000024Z

[org.onyxplatform/onyx-kafka "0.11.1.0" :exclusions [org.slf4j/slf4j-log4j12]]

kenny 2017-11-22T22:19:59.000338Z

Yes, thrown during initialization lifecycle.

lucasbradstreet 2017-11-22T22:21:31.000363Z

Yknow I think that it’s being double namespaced

lucasbradstreet 2017-11-22T22:21:45.000254Z

Keyword validation is nonexistent in Clojure

lucasbradstreet 2017-11-22T22:22:09.000240Z

That was a pretty WTF one to figure out. It all looked perfect

kenny 2017-11-22T22:22:35.000048Z

Not sure what double namespaced means 🙂

lucasbradstreet 2017-11-22T22:23:06.000309Z

Oh nope. Not it

lucasbradstreet 2017-11-22T22:23:38.000262Z

Thought you were using the map name spacing form where you would supply the namespace before the map, and then I thought it had a second namespace inside the map, but no
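
For readers unfamiliar with the form being described, here is a small sketch of the namespaced map syntax (Clojure 1.9+) and of how a "double namespaced" keyword can come about. The topic name and option values are illustrative and are not taken from kenny's catalog.

```clojure
;; Namespaced map literal: the prefix qualifies every bare keyword key.
(def short-form #:kafka{:topic "my-topic" :offset-reset :earliest})
(def long-form  {:kafka/topic "my-topic" :kafka/offset-reset :earliest})

;; `keyword` performs no validation, so an already-qualified name can be
;; qualified a second time and still yields a keyword.
(def doubled (keyword "kafka" "kafka/offset-reset"))

(println (= short-form long-form))            ; the two maps are equal
(println (namespace doubled) (name doubled))  ; namespace and name of the odd keyword
```

A keyword like `doubled` looks much like `:kafka/offset-reset` at a glance but is not equal to it, which is the sort of mismatch a schema check would reject.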

kenny 2017-11-22T22:26:15.000179Z

I'm not actually typing those namespaced maps - it's output from a DSL we have. Personally, I don't like using the namespaced map syntax in my actual code.

lucasbradstreet 2017-11-22T22:26:18.000155Z

There’s definitely something odd going on, but I can’t diagnose it further from my phone. If you figure it out let me know. The schemas are in here: https://github.com/onyx-platform/onyx-kafka/blob/0.12.x/src/onyx/tasks/kafka.clj

kenny 2017-11-22T22:27:50.000033Z

Ok. Will keep staring at it to see if something catches my eye.

lucasbradstreet 2017-11-22T22:27:58.000051Z

Yeah. I hate that format too

michaeldrogalis 2017-11-22T22:34:03.000200Z

Man, I have no idea. That's bizarre.

kenny 2017-11-22T22:34:43.000035Z

Would the Onyx log shed any light here?

michaeldrogalis 2017-11-22T22:36:05.000316Z

Probably not. This is a Schema check before Onyx ever boots up

kenny 2017-11-22T22:37:07.000004Z

This exception occurs at runtime during the lifecycle. Is that what you mean when you say before Onyx boots up?

lucasbradstreet 2017-11-22T22:37:14.000209Z

This one is the onyx kafka schema check on task start @michaeldrogalis

michaeldrogalis 2017-11-22T22:37:39.000065Z

It's clearly in the enum https://github.com/onyx-platform/onyx-kafka/blob/0.11.x/src/onyx/tasks/kafka.clj#L23

lucasbradstreet 2017-11-22T22:37:40.000144Z

Can you give us the full exception stack trace just in case though?

michaeldrogalis 2017-11-22T22:38:15.000198Z

This is the last commit on the 0.11 branch https://github.com/onyx-platform/onyx-kafka/commit/bfe465e3488dfdbdc8641514f70ca655ecb60153

michaeldrogalis 2017-11-22T22:39:05.000195Z

@kenny I bet if you upgrade to "0.11.1.1" this'll be fixed

michaeldrogalis 2017-11-22T22:39:08.000083Z

For onyx-kafka

kenny 2017-11-22T22:39:46.000081Z

Good call. Will try that now.

kenny 2017-11-22T22:45:55.000127Z

Upgrading to 0.11.1.1 causes this exception:

clojure.lang.ExceptionInfo: No reader for tag cond
clojure.lang.Compiler$CompilerException: clojure.lang.ExceptionInfo: No reader for tag cond {:tag cond, :value {:default false, :test false, :ci false}}, compiling:(simple_job.clj:17:15)
I'm pretty sure I've run into this before with Onyx. Something with an Aero version mismatch.

lucasbradstreet 2017-11-22T22:47:12.000105Z

Yeah I think this is aero related

lucasbradstreet 2017-11-22T22:47:39.000038Z

Though I don’t know why it’s bringing in aero. Probably a dev dependency thing when it should really be a test dependency

kenny 2017-11-22T22:49:49.000161Z

The strange thing is that the changes don't seem to indicate something changed with Aero: https://github.com/onyx-platform/onyx-kafka/compare/0.11.1.0...0.11.1.1

lucasbradstreet 2017-11-22T22:50:27.000206Z

Hail Mary lein clean?

kenny 2017-11-22T22:50:34.000212Z

Using boot 🙂

lucasbradstreet 2017-11-22T23:00:52.000196Z

Jealous :D

lucasbradstreet 2017-11-22T23:01:20.000291Z

Love it. Just no time to switch over

kenny 2017-11-22T23:39:37.000072Z

Figured out the problem but not the solution. You guys may have some insight. We have an internal config library that stores our app's configuration in a config.edn file that is on the classpath. This library is brought in as a dependency in my Onyx project. We are able to read the config perfectly fine this way. This is with a project using [org.onyxplatform/onyx-kafka 0.11.1.0]. For some reason that is not shown in the GitHub compare, using [org.onyxplatform/onyx-kafka 0.11.1.1] places a new config file on the classpath using the same name, config.edn. This config.edn overrides the one from our internal config library. The exception pasted above is due to the overriding config.edn using a tag literal, #cond, that has been removed in newer versions of Aero. The real problem here is onyx-kafka placing a new config.edn on the classpath and those changes not being included in the GitHub compare.

kenny 2017-11-22T23:48:42.000228Z

It looks like a config.edn is always included in org.onyxplatform/onyx-kafka but for some reason upgrading from 0.11.1.0 to 0.11.1.1 causes my config to get overridden (yay Maven). Is there a reason that a config.edn is included with the onyx-kafka jar?
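
A collision like this can be confirmed from the REPL by listing every copy of a resource on the classpath. The sketch below uses only the JDK's `ClassLoader.getResources` and `clojure.java.io/resource`; the helper name `resources-named` is hypothetical.

```clojure
(require '[clojure.java.io :as io])

(defn resources-named
  "All classpath URLs (as strings) for `resource-name`, in class loader order."
  [resource-name]
  (let [loader (.getContextClassLoader (Thread/currentThread))]
    (mapv str (enumeration-seq (.getResources loader resource-name)))))

;; Print every config.edn visible on the classpath, one URL per line.
(doseq [url (resources-named "config.edn")]
  (println url))

;; io/resource returns a single URL, typically the first match listed above.
(println "io/resource picks:" (str (io/resource "config.edn")))
```

If more than one URL is printed, whichever jar comes first in classpath order wins, which would explain why a dependency version bump can change the file that gets read.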

lucasbradstreet 2017-11-22T23:49:30.000075Z

Yuck. No, that should absolutely be under test-resources and should not be included outside of tests

lucasbradstreet 2017-11-22T23:49:38.000102Z

Sorry about that.

kenny 2017-11-22T23:53:42.000006Z

Yeah that was nasty. Any way to get a 0.11.x release out with the config.edn removed?