I'm preparing batches of segments to insert into a database that needs large batch inserts to be efficient. Is a reduce task with a window and a trigger the way to do this, or am I missing something obvious? batch-fn doesn't look appropriate, since it requires that the number of segments produced be the same as the number of input segments.
you can configure a batch-size of the output connector
that will make onyx (try to) work in batches of that size
if you combine that with, say, the postgresql copy implementation (part of onyx-sql), you can actually have fairly efficient inserts
Thanks @lmergen -- I was hoping for something like this. Looking at onyx-sql, though, it seems it always inserts each segment as a separate transaction. Maybe I'll need to implement an output plugin myself. The database -- clickhouse -- accepts writes by http request, so jdbc isn't helpful anyway. https://github.com/onyx-platform/onyx-sql/blob/0.13.x/src/onyx/plugin/sql.clj#L168
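A minimal sketch of such an http insert from Clojure (assuming clj-http and cheshire are available; the url, table name, and insert-rows! helper are only illustrative) could look like:

(require '[clj-http.client :as http]
         '[cheshire.core :as json]
         '[clojure.string :as str])

(defn insert-rows!
  "POSTs a batch of row maps to clickhouse's HTTP interface as JSONEachRow."
  [base-url table rows]
  (http/post base-url
             {:query-params {"query" (str "INSERT INTO " table " FORMAT JSONEachRow")}
              :body (str/join "\n" (map json/generate-string rows))}))

;; e.g. (insert-rows! "http://localhost:8123/" "events" [{:col1 "row1-val"} {:col1 "row2-val"}])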
ah right
i implemented copy support for postgresql about a year or so ago
you can take a look at it and see whether you can do something similar for clickhouse
it would be the best way to do it, anyway
see also: https://github.com/onyx-platform/onyx-sql/blob/0.13.x/src/onyx/plugin/pgsql.clj#L50
In the case of the postgresql copy -- would the right approach be to have an upstream task that combined multiple rows into a single segment? Would this upstream step be using windows and triggers?
well, then you would need to make the sql plugin "understand" these batches -- since you would probably be putting vectors inside single segments in this case
i wouldn't go there
you will always want inserts to be as efficient as possible, so it makes the most sense to reuse the actual batches that onyx uses
then you can configure this using :onyx/batch-size
in your task map
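a rough sketch of the output task's catalog entry, for context -- the plugin keyword and numbers here are placeholders, not a real plugin:

{:onyx/name :write-rows
 :onyx/plugin :my.plugin.clickhouse/output ;; hypothetical output plugin
 :onyx/type :output
 :onyx/medium :clickhouse
 :onyx/batch-size 1000    ;; max number of segments handed to write-batch at once
 :onyx/batch-timeout 50   ;; ms to wait before flushing a smaller batch
 :onyx/doc "Writes row segments to clickhouse in batches"}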
I think the sql plugin would accept this segment right? -- {:rows [{:col1 "row1-val"} {:col1 "row2-val"}]}
But I see what you are saying about using the batching mechanism.
oh yeah that's correct
i think a window could work in this case, but.. you will probably want to have logic like "every 3 seconds or every 3000 rows, whichever is reached first"
i have found it difficult to define triggers like that
maybe @lucasbradstreet knows how to do that
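one rough way to approximate it is to attach two triggers to the same window -- a segment trigger and a timer trigger -- though they fire independently and don't reset each other, which is part of what makes it awkward. something like this, where :rows-window and ::flush-rows! are hypothetical:

[{:trigger/window-id :rows-window
  :trigger/id :flush-on-count
  :trigger/on :onyx.triggers/segment
  :trigger/threshold [3000 :elements]
  :trigger/sync ::flush-rows!}
 {:trigger/window-id :rows-window
  :trigger/id :flush-on-time
  :trigger/on :onyx.triggers/timer
  :trigger/period [3 :seconds]
  :trigger/sync ::flush-rows!}]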
I'm aiming more towards batches of 1 million rows and will have enough traffic that I don't need to worry about it taking too much time to accumulate enough rows
well then, luxury problems :)
Thanks for discussing -- the options I'm seeing are:
1. Prepare the batches of rows using a window aggregation, then submit the results using the onyx-http output plugin
2. Roll the batch preparation and database submission into my own output plugin and rely on :onyx/batch-size
if you need help with it, i would be happy to assist
i've been meaning to take a look at clickhouse anyway, been hearing a lot about it the past year :)
@lmergen Do you happen to know why each segment is inserted as a separate database call rather than batching all the segments into a single database call?
i think it was chosen that way initially out of simplicity
"make it work, then improve"
Got it -- so to write the whole batch in a single database call, I would do something like:
(write-batch [this {:keys [onyx.core/write-batch]} replica messenger]
  ;; assuming pool and insert-fn are available from the plugin record / task state
  (jdbc/with-db-transaction [conn pool]
    (insert-fn conn (mapcat :rows write-batch)))
  true)
yep, pretty much
As a side question do you know of any examples using onyx.plugin.protocols/prepare-batch?
i saw this the other day: https://github.com/onyx-platform/onyx-amazon-s3/blob/0.13.x/src/onyx/plugin/s3_output.clj#L74
i think the idea is that prepare-batch should be pure
or rather, can be called multiple times without nasty side effects
but i hardly ever use prepare-batch myself
For clickhouse the rows need to be loaded into a large byte array using some classes provided by clickhouse, so I thought prepare-batch might be the right place to do this.
If you use prepare-batch it's mostly to set up a buffer that you drain with write-batch, since write-batch may be called multiple times. Usually you can get away without it, but it's useful sometimes.
^ sounds like prepare-batch is the right place then for what you want
Right, that could be a good place to put it
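A stripped-down sketch of that buffer pattern (only the two Output methods shown, with p aliased to onyx.plugin.protocols; encode-rows! and flush-buffer! are hypothetical clickhouse helpers, and buffer is an atom the record is built with):

(defrecord ClickhouseOutput [buffer]
  p/Output
  (prepare-batch [this {:keys [onyx.core/write-batch]} replica messenger]
    ;; one-off work per batch: encode the rows into the payload clickhouse expects
    (reset! buffer (encode-rows! (mapcat :rows write-batch)))
    true)
  (write-batch [this event replica messenger]
    ;; may be called more than once, so only drain whatever is still buffered
    (when-let [payload @buffer]
      (flush-buffer! payload)
      (reset! buffer nil))
    true))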