onyx

FYI: alternative Onyx chat is at <https://gitter.im/onyx-platform/onyx>; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
pfeodrippe 2018-01-12T09:56:20.000311Z

@jasonbell Awesome, man!

ninja 2018-01-12T12:12:20.000116Z

Hi, I've got a problem with tags being used in my project. I've got 2 separate projects, each holding implementations for different tasks. Tags are used to avoid a task being assigned to a peer which doesn't offer the appropriate functionality. The strange thing is that after adding the tag information to the catalog entries and the peer-configuration in the 2 projects, the submitted job fails to kick off, stating that there are too few virtual peers available. When the tags are removed the job kicks off, but, as expected, fails because a task gets assigned to a peer without the necessary functionality. Note that in the latter case there seem to be enough virtual peers, but not in the first one (nothing but the tag information changed). Is this expected behavior? Is there something else that needs to be adjusted when using tags?

michaeldrogalis 2018-01-12T17:08:38.000733Z

@atwrdik It sounds like the scheduler can't find a valid set of peers to assign all the work to in the 1st case.
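
For context, a hedged sketch of how tagging is wired up in Onyx; the :datomic tag and the names below are illustrative, not taken from the projects discussed above. A task carrying :onyx/required-tags can only be scheduled onto peers whose peer-config lists matching :onyx.peer/tags, so each required tag needs enough tagged virtual peers running to cover the job.

```clojure
;; Catalog entry: this task may only be scheduled on peers whose peer-config
;; advertises a matching tag. Task name, function and tag are placeholders.
{:onyx/name :write-datoms
 :onyx/fn :my.app/write-datoms
 :onyx/type :function
 :onyx/required-tags [:datomic]
 :onyx/batch-size 20}

;; Peer-config for the project that actually ships the Datomic code. Peers
;; started without a matching :onyx.peer/tags entry will never be offered
;; the task above, so enough tagged virtual peers must be running to satisfy
;; the job's minimum peer requirements.
{:onyx.peer/tags [:datomic]
 ;; ... rest of the usual peer-config ...
 }
```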

milk 2018-01-12T19:36:22.000356Z

I am trying to understand complete_latency a bit better. I read here https://github.com/onyx-platform/onyx-metrics/tree/0.9.x that this is the length of time for 1 segment to get through the job. Does this include the length of time that it is in the pending messages queue? If that is the case, is there a metric for the time a segment spends in the pending message queue for an input task?

milk 2018-01-12T19:36:47.000618Z

And if that is the case, then the key thing is that we definitely don’t want the complete_latency to ever go above the pending-timeout

milk 2018-01-12T19:36:53.000226Z

this is for 0.9

lucasbradstreet 2018-01-12T19:38:16.000010Z

@milk 0.10+ will get rid of that metric completely, so I wouldn’t spend too much time on it. There’s no metric for the current time pending. 0.10+ does away with the completion semantics entirely, so you never have to worry about timing out messages.

milk 2018-01-12T19:39:56.000012Z

I see that, but we have a system on 0.9.x, so I'm just trying to make sure I understand what those metrics mean

milk 2018-01-12T19:40:11.000240Z

we are planning an upgrade, but we need to manage this for the time being

souenzzo 2018-01-12T19:41:24.000104Z

I'm getting "Unfreezable type: class datomic.index.Index" but stacktrace says nothing about "where" in my job it occurs

lucasbradstreet 2018-01-12T19:42:32.000147Z

@milk understood

milk 2018-01-12T19:42:59.000360Z

yeah I don’t want to dig too deep

lucasbradstreet 2018-01-12T19:43:00.000165Z

@milk I think the best you have are the retry metrics.

lucasbradstreet 2018-01-12T19:43:22.000348Z

@souenzzo Could you stick the stack trace somewhere? Generally when we throw we include task metadata.

milk 2018-01-12T19:43:30.000582Z

just trying to manage this one a bit. I see the retry metrics, so I think I have that handled there; I was just going to look for any way to see it creeping up

lucasbradstreet 2018-01-12T19:43:34.000047Z

It’s possible that it’s being dropped by clojure.test if you’re using clojure.test

milk 2018-01-12T19:44:06.000252Z

essentially, I think the main thing is that I don’t want complete_latency to go above pending-timeout

souenzzo 2018-01-12T19:44:07.000515Z

I'm on <http://localhost:8080/job/exception?job-id=3446ca4b-ad72-cb81-b63c-29b6c02e5047>

milk 2018-01-12T19:44:48.000310Z

I just saw that the complete_latency was bigger than the batch latencies of my tasks in the job, so I thought it might be getting that extra time sitting on the queue, but I was just verifying my understanding of it

lucasbradstreet 2018-01-12T19:44:48.000578Z

Yeah, I mean, it won’t really because it’ll be retried at that point anyway.

milk 2018-01-12T19:45:20.000211Z

right, I think that makes sense, but we shouldn’t let that get near our timeout or we are going to be in a bit of a retry loop

lucasbradstreet 2018-01-12T19:45:39.000408Z

Yeah, generally what will be happening is that you’re queuing up too many messages at once

lucasbradstreet 2018-01-12T19:45:49.000092Z

If the sum of batch latencies >>>> pending-timeout

lucasbradstreet 2018-01-12T19:46:06.000249Z

you can lower :onyx/max-pending to reduce the queue size, which will help a lot

lucasbradstreet 2018-01-12T19:46:09.000708Z

that’s your main backpressure knob.
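
To make those knobs concrete, a minimal sketch of a 0.9.x input catalog entry; the task name, plugin keys, and values are illustrative placeholders, not recommendations, and the connection settings are elided.

```clojure
;; Input task (0.9.x). :onyx/max-pending caps how many segments may be pending
;; (read but not yet acked) at once; :onyx/pending-timeout is how long, in ms,
;; a pending segment may sit before it is retried. Keeping max-pending small
;; enough that everything queued can complete well within the timeout avoids
;; the retry loop described above. Values here are placeholders.
{:onyx/name :read-from-kafka
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :onyx/max-pending 2000      ; default 10000
 :onyx/pending-timeout 60000 ; default 60000 ms
 :onyx/batch-size 20
 ;; ... Kafka connection settings elided ...
 }
```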

milk 2018-01-12T19:47:13.000656Z

yep, makes sense. One last, somewhat related question: when it retries, it’s just re-queueing those messages on the internal queue, right, not putting them back on the Kafka topic?

lucasbradstreet 2018-01-12T19:47:53.000365Z

Yes, but if they had actually been written out by the output plugin downstream and still triggered a retry, you will end up with duplicates

lucasbradstreet 2018-01-12T19:48:05.000401Z

It’s impossible to get exactly-once writes to the output plugins in 0.9

milk 2018-01-12T19:48:38.000072Z

cool much appreciated @lucasbradstreet!

lucasbradstreet 2018-01-12T19:50:51.000277Z

Ah, hmm.

lucasbradstreet 2018-01-12T19:51:02.000763Z

Can you check the onyx.log? Guessing we’re not adding extra metadata there.
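
For anyone hitting the same error: "Unfreezable type" is thrown when something Onyx has to serialize with Nippy (typically a segment) contains a value Nippy can't freeze, and datomic.index.Index suggests a Datomic database value ended up inside a segment. That isn't confirmed as the cause of the job above, but one common way to avoid it is to keep the connection/database out of the segments and inject it as a task parameter via lifecycles. A hedged sketch with hypothetical names:

```clojure
(ns my.app.lifecycles
  (:require [datomic.api :as d]))

;; Hypothetical lifecycle: obtain the Datomic connection on the peer and pass
;; it to the task function as its first argument, rather than carrying a
;; database/index value inside segments (which Nippy cannot freeze).
(defn inject-conn [event lifecycle]
  {:onyx.core/params [(d/connect (:datomic/uri lifecycle))]})

(def conn-calls
  {:lifecycle/before-task-start inject-conn})

;; Corresponding lifecycle entry for the job (task name and URI are placeholders):
;; {:lifecycle/task :write-datoms
;;  :lifecycle/calls :my.app.lifecycles/conn-calls
;;  :datomic/uri "datomic:dev://localhost:4334/my-db"}
```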