sql

All things SQL and JDBC...
antonmos 2020-09-10T20:39:52.291600Z

@seancorfield hello! I am back with more questions about streaming. After you added support for folding over a streaming result set, we put the change in production and eventually ran into “java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker”. This happened because the ForkJoin framework backing r/fold has magic that detects blocking operations and adds threads to compensate (see "Compensation Threads" at http://moi.vonos.net/java/forkjoin/). Based on this quote:

One special case of blocking operations has been optimised for work-stealing threadpools: the "synchronization primitive" classes in package java.util.concurrent recognise when a thread belonging to a work-stealing threadpool is about to perform a blocking operation (eg wait on a mutex). These classes then tell the associated threadpool to create an additional "compensation thread" before the current thread blocks. The number of "active threads" in the threadpool thus remains the same.
I looked at the thread dump and found where we were running into synchronization: it turned out to be https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LazySeq.java#L50, which is not easily avoidable. So we had to revert the change to use r/fold. I am now back to wanting to stream the results, and again struggling to figure out how to do that. Note that now my main motivation is to reduce heap usage, i.e. I'd like the results of the query to be post-processed, converted to transit, and put on the wire continuously. It may be that streaming all the way through will make parallel processing unnecessary, but that remains to be seen. The problem with plan is that I have to reduce over it. If I am not mistaken, even if I spin off futures to do the actual work, reduce will not return to the calling code until the entire result set has been traversed, which means I would not have avoided having the entirety of the result set in memory. Is there another next.jdbc function that I am missing? Thanks again for your help!

seancorfield 2020-09-10T20:58:50.293300Z

plan is the only way to stream large result sets and process them as they are streamed in. That processing must be eager (since it must all happen before the connection is closed) -- otherwise you have to realize the result set into memory (which you don't want).
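
A minimal sketch of that eager, streaming consumption, assuming a next.jdbc datasource `ds` and a hypothetical side-effecting `process-row!` (both names are illustrative):

```
(require '[next.jdbc :as jdbc])

;; run! reduces over the plan: each row is processed eagerly and
;; then discarded, so nothing accumulates in memory.
(run! process-row! (jdbc/plan ds ["select * from big_table"]))
```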

seancorfield 2020-09-10T20:59:17.293900Z

You mention LazySeq but you don't say whether that was in your code or in next.jdbc?

antonmos 2020-09-10T20:59:53.294600Z

It was in our code. Actually, it's in clojure.walk, which is used by camel-snake-kebab, which we are calling.

seancorfield 2020-09-10T21:00:08.294900Z

Well... don't do that 🙂

seancorfield 2020-09-10T21:01:31.297200Z

The point of plan is to lift values out of each row quickly, without trying to realize the rows, process those column values, and move on to the next row.
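
For example, a minimal sketch of that pattern, assuming a datasource `ds` and a hypothetical `orders` table: only the `:amount` value is lifted out of each row, and no row is ever realized as a Clojure map:

```
(require '[next.jdbc :as jdbc])

;; The reducing function reads one column value per row and moves
;; on; only the running total lives in memory.
(reduce (fn [total row] (+ total (:amount row)))
        0
        (jdbc/plan ds ["select amount from orders"]))
```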

antonmos 2020-09-10T21:02:49.298900Z

I think I am misunderstanding what you mean by
> That processing must be eager -- otherwise you have to realize the result set into memory (which you don't want).
Aren't those directly opposed? If I understand correctly, you are requiring the use of reduce in order to prevent references to the ResultSet from escaping. However, that is also exactly what's preventing me from starting to stream the results before streaming from the DB is finished.

antonmos 2020-09-10T21:03:37.300100Z

I can certainly build around it: access the DB on a separate thread, put the rows onto a blocking queue, and consume it from the request thread.
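
A minimal sketch of that workaround, with a bounded queue (the bound is what provides back pressure) and a sentinel value to mark the end of the results -- `ds`, the SQL, and `process-row` are all illustrative:

```
(require '[next.jdbc :as jdbc])
(import '(java.util.concurrent LinkedBlockingQueue))

(defn stream-rows!
  "Reduces over plan on a separate thread, pushing each processed
  row onto a bounded queue; returns the queue immediately."
  [ds sql-params process-row]
  (let [q (LinkedBlockingQueue. 1024)] ; .put blocks when full => back pressure
    (future
      (try
        ;; process-row must lift the values it needs out of the row
        ;; (e.g. via select-keys) since plan's rows are not realized,
        ;; and must return a non-nil value to go on the queue.
        (reduce (fn [_ row] (.put q (process-row row)))
                nil
                (jdbc/plan ds sql-params))
        (finally
          (.put q ::done)))) ; sentinel: no more rows
    q))

;; Consumer side, e.g. on the request thread:
;; (loop []
;;   (let [row (.take q)]
;;     (when-not (= ::done row)
;;       (write-row! row) ; hypothetical writer
;;       (recur))))
```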

antonmos 2020-09-10T21:04:05.301Z

But it seems like I'd be wrestling with a limitation imposed by next.jdbc.

seancorfield 2020-09-10T21:04:51.301700Z

What does your reducing function do?

seancorfield 2020-09-10T21:05:27.302500Z

You aren't producing a data structure from the result of reduce, I hope?

antonmos 2020-09-10T21:05:36.302800Z

It does a bunch of CPU-bound calculations on the rows and adds the results to the rows.

antonmos 2020-09-10T21:06:24.303800Z

The intent is to get the rows out of the DB, apply business logic, and return them to the caller. Indeed, it is to produce a data structure from reduce, not a scalar.

seancorfield 2020-09-10T21:06:56.304500Z

That can't possibly work if the result set is large enough to warrant streaming it from the DB.

antonmos 2020-09-10T21:08:04.305600Z

Can you clarify why you say that? Note that the calculations are applied and added to each individual row.

seancorfield 2020-09-10T21:08:30.305900Z

I don't know how to say it any clearer, sorry.

seancorfield 2020-09-10T21:09:32.307300Z

If a result set won't fit in memory, you have to stream it from the DB, processing it eagerly (consuming it -- not producing a data structure because that would be too big to fit in memory).

seancorfield 2020-09-10T21:09:49.307700Z

If your result set would fit in memory, you might as well just use execute!

seancorfield 2020-09-10T21:10:24.309Z

Reducing over plan works (with streaming) if what you produce is small enough to fit in memory.

antonmos 2020-09-10T21:10:44.309700Z

I think we are on the same page there. However, the issue is that in my case “consuming” equals “writing to the HTTP response”, but reduce requires waiting until the whole result set has been read from the DB.

antonmos 2020-09-10T21:11:00.310500Z

> Reducing over `plan` works (with streaming) if what you produce is small enough to fit in memory.
Ah yes, that is what I am saying as well.

seancorfield 2020-09-10T21:11:04.310700Z

If you need to stream the results to another process, then you can reduce over plan to put it on a queue that another process can consume.

seancorfield 2020-09-10T21:11:59.312300Z

If you can stream results into your HTTP response, then reduce over plan works there -- with the reducing function pushing each (processed) row out to the HTTP response.

seancorfield 2020-09-10T21:12:31.313200Z

`(reduce (fn [_ row] (send (transformed row) to-output)) nil (jdbc/plan ...))`

seancorfield 2020-09-10T21:13:32.314700Z

Otherwise put some sort of queue between the reducing function and the thing writing to the HTTP response.
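
One way to wire up the direct approach from the preceding messages, as a minimal sketch assuming Ring 1.6+ (for `ring.core.protocols/StreamableResponseBody`) and a hypothetical `write-row!` that serializes one row to the output stream (e.g. as transit):

```
(require '[next.jdbc :as jdbc]
         '[ring.core.protocols :as rp])

(defn streaming-body
  "A Ring response body that reduces over plan only once Ring is
  ready to write, pushing each transformed row to the HTTP stream."
  [ds sql-params transform write-row!]
  (reify rp/StreamableResponseBody
    (write-body-to-stream [_ _response output-stream]
      (with-open [out output-stream]
        (reduce (fn [_ row] (write-row! out (transform row)))
                nil
                (jdbc/plan ds sql-params))))))

;; Used as e.g.
;; {:status 200 :body (streaming-body ds ["select ..."] transform write-row!)}
;; -- note this bypasses middleware that expects a fully-realized body.
```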

antonmos 2020-09-10T21:13:52.315300Z

Hmmm, that would require that all layers of code above the DB are able to consume a reducible. In my case, I am using ring+muuntaja to convert to transit. I will check if it can consume one directly.

seancorfield 2020-09-10T21:14:08.315900Z

You need back pressure somewhere 🙂

seancorfield 2020-09-10T21:15:15.317200Z

> that would require that all layers of code above the db are able to consume a reducible
I don't understand why.

seancorfield 2020-09-10T21:15:37.317900Z

When you reduce over plan, you're getting rid of the reducible.

antonmos 2020-09-10T21:15:54.318300Z

That works only if the result of reduce fits into memory.

antonmos 2020-09-10T21:16:45.319200Z

If it doesn't, I have to return a reducible up to the point where it will be reduced by writing to the output stream.

antonmos 2020-09-10T21:17:38.320400Z

> that would require that all layers of code above the db are able to consume a reducible
Well, normally, I can return a seq to middleware and it knows how to serialize it to the output stream.

seancorfield 2020-09-10T21:18:08.321300Z

And that's why you need a queue here. You need back pressure.

antonmos 2020-09-10T21:18:31.322200Z

> You need back pressure somewhere
I imagined that a lazy seq could provide that. Is there any reason why that's not as good as a reducible?

seancorfield 2020-09-10T21:18:53.322400Z

Resource management.

seancorfield 2020-09-10T21:19:16.323Z

Lazy sequences are the wrong thing to use when you have side effects and/or I/O.

antonmos 2020-09-10T21:19:57.323800Z

Hmmm, I wonder why… in my mind, a reducible is closed over a connection just as much as a lazy seq would have to be.

seancorfield 2020-09-10T21:20:15.324100Z

The lazy sequence would not be.

antonmos 2020-09-10T21:21:32.325100Z

Why not? (I believe you, I am just trying to learn here.)

antonmos 2020-09-11T14:19:35.333900Z

mmmmm, nope still not getting it.

seancorfield 2020-09-11T16:43:24.334100Z

What are you actually going to implement finalize() on? We're talking about regular lazy seqs here. How would you "attach" a context to specify finalization that would be guaranteed to do the right thing? Since you have no idea how much of the result set will be consumed, you'd essentially need it attached to every part of the lazy seq, and then you run the risk of an early part being consumed and then GC'd while the rest of the sequence could still be consumed later (requiring the connection to stay open). You can't attach it only to the end of the sequence because you don't know where it is: you would have to stream the entire result set from the DB to find the end of the sequence -- which is exactly what we're trying to avoid here.

antonmos 2020-09-12T02:28:11.334300Z

Oh, I was thinking that we'd have to extend or implement our own lazy seq for this.

dominicm 2020-09-10T21:23:07.326200Z

While on the topic, at the risk of distracting the main conversation, can the reducible interface be used to implement a dynamic limit? Is there chunking?

seancorfield 2020-09-10T21:28:58.326300Z

I'm not sure what you mean.

dominicm 2020-09-10T21:30:55.326500Z

I mean SQL limit, or take-while, that kind of thing.

seancorfield 2020-09-10T21:32:40.327Z

`(into [] (take 100) (jdbc/plan ...))` -- like that, you mean?

dominicm 2020-09-10T21:33:13.327200Z

Yeah. That would be an example.

dominicm 2020-09-10T21:33:31.327400Z

Will it blow through memory if there's a bajillion rows in the result?

seancorfield 2020-09-10T21:34:24.327600Z

I don't have the cycles right now to dig into that, sorry.

seancorfield 2020-09-10T21:36:07.327800Z

If the take transducer short-circuits the reduce then I believe it would stop streaming from the DB and close the connection (but if you were streaming 1,000 rows at a time, it would still have read 1,000 rows into the JDBC driver -- but it wouldn't realize them into Clojure data structures).

seancorfield 2020-09-10T21:37:15.328Z

Looks like that would work -- the transducing take returns `(ensure-reduced result)` when it has consumed enough items.
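
A concrete version of that, assuming a datasource `ds`: once `take` has seen 100 items it wraps the accumulator in `reduced`, which short-circuits the reduction over plan and lets it close the connection:

```
(require '[next.jdbc :as jdbc])

;; Realizes at most 100 values into the vector, then stops
;; consuming the (potentially huge) streamed result set.
(into []
      (comp (map :id) ; lift one column value out of each row
            (take 100))
      (jdbc/plan ds ["select id from big_table"]))
```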

dominicm 2020-09-10T21:37:34.328200Z

Cool. No rush on this. Future planning. I don't suppose chunking can be controlled?

dominicm 2020-09-10T21:37:50.328400Z

Thanks for figuring it out regardless!

seancorfield 2020-09-10T21:38:03.328600Z

(and of course this is kind of the whole point behind plan and reduce and streaming from the DB -- that's the assumed behavior, if you can get your DB to actually cooperate 🙂 )

seancorfield 2020-09-10T21:38:26.328800Z

Different DBs require different tickling in this area (and some just plain ol' don't support it).

dominicm 2020-09-10T21:40:25.329Z

I see. Where would I find the docs on this stuff? What's the terminology used by jdbc?

seancorfield 2020-09-10T21:40:50.329200Z

Which "stuff"? Streaming result sets?

dominicm 2020-09-10T21:44:19.329400Z

Yeah, although that might be it! :)

seancorfield 2020-09-10T21:44:43.329600Z

I think searching for "streaming jdbc results" will probably get you what you need, with your specific DB added.

seancorfield 2020-09-10T21:45:06.329800Z

JDBC is a mighty finicky beast and horribly non-portable 😐

seancorfield 2020-09-10T21:45:27.330Z

Does that answer your question @antonmos?

dominicm 2020-09-10T21:51:33.330200Z

https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor looks like I need to do something with a statement. I don't suppose next.jdbc handles this for me somehow? 😀

seancorfield 2020-09-10T22:03:59.331100Z

You can just pass those arguments into plan and it will pass them into the PreparedStatement it creates.

seancorfield 2020-09-10T22:04:24.331300Z

(so you don't have to use jdbc/prepare yourself)

seancorfield 2020-09-10T22:04:50.331500Z

"All The Options" is a good section of the docs to read for the sort of things that are possible.

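For PostgreSQL in particular, the combination that typically makes the driver use a cursor looks something like this sketch (the option names are next.jdbc's; the fetch size and the `process!` consumer are illustrative):

```
(require '[next.jdbc :as jdbc])

;; PostgreSQL only streams when auto-commit is off, the statement is
;; forward-only / read-only, and a non-zero fetch size is set.
(with-open [con (jdbc/get-connection ds {:auto-commit false})]
  (reduce (fn [_ row] (process! row))
          nil
          (jdbc/plan con
                     ["select * from big_table"]
                     {:fetch-size  1000
                      :concurrency :read-only
                      :cursors     :close
                      :result-type :forward-only})))
```
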
antonmos 2020-09-10T22:21:29.332700Z

Ohhh, I see. The issue is that a lazy sequence hangs around until garbage collection gets to it, which is potentially way too long to hold on to a connection.

antonmos 2020-09-10T22:22:40.333Z

Thank you for clarifying that! I will try using plan or fold with a queue.

seancorfield 2020-09-10T22:34:43.333200Z

Even GC'ing a lazy sequence won't signal the "end" for resource management.

antonmos 2020-09-10T22:35:34.333400Z

In theory, couldn't you implement finalize() to make it release the connection?

seancorfield 2020-09-10T22:39:02.333700Z

I'll leave it up to you to think about that in some more depth... 🙂