When testing out Onyx + Kafka my peers continually restart after throwing this exception:
java.lang.NullPointerException:
clojure.lang.ExceptionInfo: Caught exception inside task lifecycle :lifecycle/initializing. Rebooting the task. -> Exception type: java.lang.NullPointerException. Exception message: null
job-id: #uuid "30fbff49-4319-a757-6a0b-b7f28f795426"
metadata: {:job-id #uuid "30fbff49-4319-a757-6a0b-b7f28f795426", :job-hash "db786ebce6742af642d9d60805225644110d2b6ded6bfd599e2f3823e62c4"}
peer-id: #uuid "c01c9e46-3a41-efb7-88f7-7f2dd7b8cf7c"
task-name: :in
That is the whole stack trace. Seems like it is cut off, but that's it. Any ideas on how to proceed?
That’s odd. I think there should be more to the stack trace. I assume this is from onyx.log?
Yes. I do have these additional config settings:
{:level :info
 :appenders {:println {:ns-whitelist ["compute.*" "dev.*"]}
             :spit (assoc (appenders/spit-appender {:fname "server.log"})
                          :ns-blacklist ["org.apache.zookeeper.*"
                                         "onyx.messaging.*"])}}
I wouldn't imagine that would affect the printing of a stack trace though.
Ok. If your error is reproducible, turning that off and trying again is all I can really think to try.
Very reproducible. Will try that now.
The full stack trace is now visible. It wasn't hidden by that log config though; rather, it was the JVM optimization that omits stack traces for exceptions thrown repeatedly.
Ah. That one. I do usually overrule that in my dev JVM opts
Anything we need to be worried about?
Yep, just added that to my JVM opts.
Nope.
Ok, cheers.
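For reference, the JVM optimization being discussed is presumably -XX:-OmitStackTraceInFastThrow. A minimal sketch of disabling it via Leiningen's :jvm-opts (boot reads BOOT_JVM_OPTIONS for the same purpose):

;; project.clj fragment: keep full stack traces for frequently re-thrown exceptions
:jvm-opts ["-XX:-OmitStackTraceInFastThrow"]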
now that we're talking about this, i recall there was some JVM options that allows you to emit a stderr log any time any exception is generated. that bypasses any logging framework. or am i confused ?
Not sure about that one. I know there is a way to hook into unhandled exceptions, which is handy when you have a lot of threads doing things.
hmmm might be mistaking jvm for another language
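A minimal Clojure sketch of the unhandled-exception hook mentioned above, for reference; it prints to stderr independently of any logging framework:

;; Report any exception that escapes a thread without being caught.
(Thread/setDefaultUncaughtExceptionHandler
 (reify Thread$UncaughtExceptionHandler
   (uncaughtException [_ thread ex]
     (binding [*out* *err*]
       (println "Uncaught exception on thread" (.getName thread)))
     (.printStackTrace ex))))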
Question regarding job submission: we have four peers running and they all submit the same set of 3 jobs. For 2 of the jobs, the submission returns success on all 4 peers. For the 3rd, success is only returned on one peer and the other 3 peers fail with :incorrect-job-hash. Is this just the result of a race condition in job submission, or are we somehow generating different job hashes on our peers? I believe the job only needs to be submitted once to the cluster, but I just want to make sure I understand what is happening here. We are also running on 0.9, currently.
@kenny That JVM opt kills me everytime
@djjolicoeur I think you're confused about how job submission works. You only want to submit each job once - and probably not from the peers on start up. I'd start them from a different process
If you're running into a hash error, you're trying to submit the same job ID with different job content
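A minimal sketch of submitting the job once from a standalone process (peer-config and job are assumed to be defined elsewhere):

(require '[onyx.api :as onyx])

;; Run this from a dedicated submission process, not from every peer on startup.
(println (onyx/submit-job peer-config job))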
Thanks @michaeldrogalis, we are actually looking to make that change WRT job submission in the near future. I essentially inherited this system and, to date, the current job submission process has not caused any issues other than the quirk I mentioned. That being said, the content of the jobs should be identical, so I need to track down what the diff is.
@djjolicoeur Understood. Yeah, there's some underlying difference causing them to hash to different values. This should be the place to figure out what's up: https://github.com/onyx-platform/onyx/blob/0.9.x/src/onyx/api.clj#L209
Or, more directly hash the job yourself offline: https://github.com/onyx-platform/onyx/blob/0.9.x/src/onyx/api.clj#L170
thanks @michaeldrogalis, I will take a look
Np. 🙂
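One rough way to find the difference is to dump the job map each peer is about to submit to an EDN file (the filenames here are hypothetical) and diff the two data structures:

(require '[clojure.data :as data]
         '[clojure.edn :as edn])

;; peer-a-job.edn and peer-b-job.edn are pr-str dumps of the job map from two peers.
(let [job-a (edn/read-string (slurp "peer-a-job.edn"))
      job-b (edn/read-string (slurp "peer-b-job.edn"))
      [only-in-a only-in-b _] (data/diff job-a job-b)]
  (println "Only in peer A:" only-in-a)
  (println "Only in peer B:" only-in-b))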
How do you guys develop an Onyx job that uses a Kafka queue at the REPL? Do you start an embedded version of Kafka? Or maybe replace Kafka queues with a Core async channel? What's the best approach?
We used to use an embedded version and/or swap out core async, but lately we’ve moved closer to just using kafka directly via docker/docker-compose https://github.com/onyx-platform/onyx-kafka/blob/0.12.x/docker-compose.yml
My preference is to minimise differences between dev, tests, and prod, rather than get a slightly nicer dev experience by swapping in core.async.
I’ve often initially developed it against core.async and then moved it to kafka via docker-compose later though.
We use circleci which allows us to stand up ZK + Kafka for our tests via that compose yaml.
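For the swap-it-out approach, a rough sketch of what a dev-time input catalog entry can look like with the core.async plugin standing in for Kafka (the lifecycle that injects the channel is omitted):

;; Dev-only input task backed by a core.async channel instead of a Kafka topic.
{:onyx/name :in
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/batch-size 10
 :onyx/max-peers 1
 :onyx/doc "Reads segments from a core.async channel"}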
I like that approach. How do you handle the creation of topics that are needed for your job? Do you use the Kafka admin API?
Yeah https://github.com/onyx-platform/onyx-kafka/blob/0.12.x/src/onyx/kafka/helpers.clj#L150
Makes sense. I'll try that approach out. Thank you 🙂
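If the helpers namespace doesn't fit, topics can also be created with the plain Kafka AdminClient from Clojure; a minimal sketch (broker address, topic name, and partition/replication counts are placeholders):

(import '(org.apache.kafka.clients.admin AdminClient NewTopic)
        '(java.util Properties))

;; Create the topic the job reads from before submitting the job.
(let [props  (doto (Properties.)
               (.put "bootstrap.servers" "localhost:9092"))
      client (AdminClient/create props)]
  (try
    (-> (.createTopics client [(NewTopic. "my-topic" (int 1) (short 1))])
        (.all)
        (.get))
    (finally
      (.close client))))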
I’m curious what the smallest practical EC2 instance types are for a ZooKeeper ensemble to power Onyx… more specifically, Onyx mostly for monthly batch jobs and a few ad hoc jobs throughout the month
For this project it really isn’t about scaling as much as it is about breaking up the batch processing code into neatly defined workflows that are easier to reason about
Hmm, that cluster might need kafka too… still lots to explore and figure out, just trying to think ahead to what happens if my poc is a success 🙂
I’m not really sure how small you can go because we have always biased towards that piece being as solid as possible. We don’t do all that much with ZK when you’re using the s3 checkpointer though, so it just needs to be able to respond reliably.
I'll play when it comes down to it and see!
Any idea what this Schema error is talking about?
clojure.lang.ExceptionInfo: Value does not match schema: {(not (= (name (:kafka)) (namespace :kafka/offset-reset))) invalid-key}
error: {(not (= (name (:kafka)) (namespace :kafka/offset-reset))) invalid-key}
I don't think I can run spec on the task def because I can't find any Kafka plugin specs.
I'm not sure why :kafka is in a list in that exception. Seems strange.
@kenny Can I see your catalog?
I feel like I say this on a weekly basis now. We gotta ditch Schema. 😕
From a first look, kafka/offset-reset looks ok. I’m on my phone though so it’s hard to look deeper
What version of onyx kafka is this?
I assume this exception was thrown after you started the job?
[org.onyxplatform/onyx-kafka "0.11.1.0" :exclusions [org.slf4j/slf4j-log4j12]]
Yes, thrown during initialization lifecycle.
Y’know, I think it’s being double-namespaced
Keyword validation is non-existent in Clojure
That was a pretty WTF one to figure out. It all looked perfect
Not sure what double namespaced means 🙂
Oh nope. Not it
Thought you were using the namespaced map form, where you supply the namespace before the map, and then I thought it had a second namespace inside the map, but no
I'm not actually typing those namespaced maps - it's output from a DSL we have. Personally, I don't like using the namespaced map syntax in my actual code.
There’s definitely something odd going on, but I can’t diagnose it further from my phone. If you figure it out let me know. The schemas are in here: https://github.com/onyx-platform/onyx-kafka/blob/0.12.x/src/onyx/tasks/kafka.clj
Ok. Will keep staring at it to see if something catches my eye.
Yeah. I hate that format too
Man, I have no idea. That's bizarre.
Would the Onyx log shed any light here?
Probably not. This is a Schema check before Onyx ever boots up
This exception occurs at runtime during the lifecycle. Is that what you mean when you say before Onyx boots up?
This one is the onyx kafka schema check on task start @michaeldrogalis
It's clearly in the enum https://github.com/onyx-platform/onyx-kafka/blob/0.11.x/src/onyx/tasks/kafka.clj#L23
Can you give us the full exception stack trace just in case though?
This is the last commit on the 0.11 branch https://github.com/onyx-platform/onyx-kafka/commit/bfe465e3488dfdbdc8641514f70ca655ecb60153
@kenny I bet if you upgrade to "0.11.1.1"
this'll be fixed
For onyx-kafka
Good call. Will try that now.
Upgrading to 0.11.1.1 causes this exception:
clojure.lang.ExceptionInfo: No reader for tag cond
clojure.lang.Compiler$CompilerException: clojure.lang.ExceptionInfo: No reader for tag cond {:tag cond, :value {:default false, :test false, :ci false}}, compiling:(simple_job.clj:17:15)
I'm pretty sure I've run into this before with Onyx. Something with an Aero version mismatch.
Yeah, I think this is Aero related
Though I don’t know why it’s bringing in aero. Probably a dev dependency thing when it should really be a test dependency
The strange thing is that the changes don't seem to indicate something changed with Aero: https://github.com/onyx-platform/onyx-kafka/compare/0.11.1.0...0.11.1.1
Hail Mary lein clean?
Using boot 🙂
Jealous :D
Love it. Just no time to switch over
Figured out the problem but not the solution. You guys may have some insight.
We have an internal config library that stores our app's configuration in a config.edn file that is on the classpath. This library is brought in as a dependency in my Onyx project. We are able to read the config perfectly fine this way. This is with a project using [org.onyxplatform/onyx-kafka 0.11.1.0].
For some reason that is not shown in the GitHub compare, using [org.onyxplatform/onyx-kafka 0.11.1.1] places a new config file on the classpath with the same name, config.edn. This config.edn overrides the one from our internal config library. The exception pasted above is due to the overridden config.edn using a tag literal #cond that has been removed in newer versions of Aero. The real problem here is onyx-kafka placing a new config.edn on the classpath, with those changes not shown in the GitHub compare.
It looks like a config.edn is always included in org.onyxplatform/onyx-kafka, but for some reason upgrading from 0.11.1.0 to 0.11.1.1 causes my config to get overridden (yay Maven). Is there a reason that a config.edn is included with the onyx-kafka jar?
Yuck. No, that should absolutely be under test-resources and should not be included outside of tests
Sorry about that.
Yeah, that was nasty. Any way to get a 0.11.x release out with the config.edn removed?
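For what it's worth, a quick way to confirm which config.edn wins on the classpath, and which jars ship one, is just:

(require '[clojure.java.io :as io])

;; The resource that wins classpath resolution:
(println (io/resource "config.edn"))

;; Every config.edn visible to the context classloader:
(doseq [url (enumeration-seq
             (.getResources (.getContextClassLoader (Thread/currentThread))
                            "config.edn"))]
  (println url))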