hi, @seancorfield I am trying to stream process a largish dataset like this
(let [res (jdbc/plan (db/connection) params db-opts)]
  (r/foldcat
    (r/map #(do (println (.getId (Thread/currentThread)))
                (do-stuff %))
           res)))
But it seems to be getting processed in a single thread.
Any idea what’s going on? Thanks!
That's nothing to do with next.jdbc. It's just about clojure.core.reducers.
it seems that r/fold requires the collection to be foldable
is there any reason why that idea is not compatible with next.jdbc?
i am trying to understand if i am thinking about this wrong 🙂
next.jdbc/plan returns an IReduceInit, like lots of things.
do you know of any other way to do streaming parallel processing?
of the next.jdbc/plan result
Streaming data onto a channel and processing it in parallel as it is taken off?
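Something like this minimal core.async sketch, perhaps (ds and do-stuff are hypothetical; the channel sizes and worker count are arbitrary):
(require '[clojure.core.async :as a]
         '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])
(let [in  (a/chan 100)
      out (a/chan 100)]
  ;; 8 worker threads take rows off `in` and process them in parallel
  (a/pipeline-blocking 8 out (map do-stuff) in)
  ;; feed the channel from a plain reduce over plan, on its own thread
  (a/thread
    (run! #(a/>!! in (rs/datafiable-row % ds {})) ; realize each row first
          (jdbc/plan ds ["select * from big_table"]))
    (a/close! in))
  ;; collect the processed rows
  (a/<!! (a/into [] out)))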
i can certainly try to implement foldable on top of IReduceInit, but not sure if that’s conceptually possible or not
No idea. I've never used the clojure.core.reducers stuff.
gotcha!
thank you for the channel idea
(of course, now you have me curious -- which is dangerous on a Friday afternoon!)
muahaha 🙂
@dpassen1 and I are trying to grok what it takes to implement CollFold from clojure.core.reducers
Looking at clojure.core.reducers, the answer to your original question is: a plain old IReduceInit is just reduce'd directly by the reducers stuff, so that's a single thread.
yup
btw, i am curious, how come plan returns an IReduceInit and not a LazySeq?
Resource management: you can't let an open connection escape in a lazy sequence.
It was one of the big problems people naively fell into with clojure.java.jdbc whenever they provided a :result-set-fn that was lazy. And, earlier on, it was even easier to make that mistake with the old API.
Here's an example of that "bug" from 2017 https://stackoverflow.com/questions/46886158/result-set-fn-in-clojure-jdbc-return-error-the-result-set-is-closed-why
ah yes, that makes sense
I believe you could write a reducing function over an IReduceInit that conj'd rows into vectors of size N and dispatched reduce over each of those via fork-join; at the end of that process you'd have a set of reduces happening in threads plus a final vector of < size N, and you'd then combine all of those results (and reduce over the last piece).
Essentially implementing the foldvec helper from clojure.core.reducers to work inside an into-like reduction...
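A rough sketch of that idea (all hypothetical names: rf and combinef follow the fold contract, n is the batch size, ds is your datasource, and plain futures stand in for fork-join tasks):
(require '[next.jdbc.result-set :as rs])
(defn fold-plan [n combinef rf ds plan-result]
  (let [{:keys [batch tasks]}
        (reduce (fn [{:keys [batch tasks]} row]
                  ;; rows must be realized before they can leave this thread
                  (let [batch (conj batch (rs/datafiable-row row ds {}))]
                    (if (= n (count batch))
                      {:batch [] :tasks (conj tasks (future (reduce rf (combinef) batch)))}
                      {:batch batch :tasks tasks})))
                {:batch [] :tasks []}
                plan-result)]
    ;; combine the in-flight reduces, then the final vector of < size n
    (combinef (reduce combinef (combinef) (map deref tasks))
              (reduce rf (combinef) batch))))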
we just wrote something similar
(->> (reduce (fn [{:keys [batch futures] :as acc} row]
               (if (= (count batch) 100)
                 (assoc acc :batch []
                            :futures (conj futures (future (map db/ybr-loan-fix-base batch))))
                 (update acc :batch conj row)))
             {:batch [] :futures []}
             res)
     :futures
     (mapcat deref))
where res is the result of plan
yup, that was a todo 🙂
Let me know if you start using the built-in fold support and how that goes @antonmos
So that's "just" a parallel map with an unbounded thread pool, I think?
yea
kinda like pmap + batching, over a reducible
That's certainly an easier thing to implement than a fork-join-based parallel fold, but it's good to see it can be done "easily".
ill let you know if it works 🙂
testing now
I think it would be very interesting to have plan's result implement CollFold...
Note that it would have to fully realize each row (because of the mutable result set under the hood).
that’s fine with me, i am not going after navigation
It's not an issue of navigation, it's about the performance boost you can get by not realizing rows when reducing.
Currently, reducing over plan can completely avoid converting ResultSet "rows" into Clojure data structures if you access columns by name (label) in your reducing function and don't do anything that requires a sequence.
It can't avoid that if there's going to be any parallel evaluation (because mutability).
Which also means that the reducing function you used in the fold would have to expect to operate on the fully built hash maps, and therefore the final column names, rather than being able to use the short names (labels) as in a simple reduce over plan.
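To illustrate with a hypothetical invoice table (ds being your datasource):
;; a simple reduce over plan can use the short label -- no row is realized:
(reduce (fn [total row] (+ total (:amount row)))
        0
        (jdbc/plan ds ["select amount from invoice"]))
;; a parallel fold would see fully realized rows, which by default have
;; qualified keys, e.g. {:invoice/amount 42.00M, ...}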
when we ran the code above, row objects came out as #object[next.jdbc.result_set$mapify_result_set$reify__40720 0x76d61149 "{row} from `plan` -- missing `map` or `reduce`?"]
what’s the best way to realize it?
the big thing is fold needs some way to split the results, building a tree of parallel computations, and it is hard to see how you could do that generically on top of a sql query
should i be using rs/datafiable-row?
if you assume some common things, like ids and getting the highest and lowest ids being cheap, you can build something like that for scanning whole tables, or ranges of tables, but that breaks down for complex queries
(I built something like this a long time ago against s3 if I recall, not sql, where getting the highest id for a split was not cheap, but I did a binary search kind of probe for it)
something to bear in mind is it is kind of frowned on to do io in a forkjoin threadpool (which is what fold puts tasks on)
the forkjoin pools have this notion of managed blockers where you can tell the pool you are doing a blocking task, so it spins up another thread to avoid the threadpool doing nothing while waiting for io
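A minimal sketch of that mechanism via plain interop (nothing to do with next.jdbc; blocking-io is a hypothetical helper):
(import '(java.util.concurrent ForkJoinPool ForkJoinPool$ManagedBlocker))
(defn blocking-io
  "Run blocking `f`, telling the fork-join pool it may block so the
  pool can spin up a compensating thread in the meantime."
  [f]
  (let [result (promise)]
    (ForkJoinPool/managedBlock
     (reify ForkJoinPool$ManagedBlocker
       (block [_] (deliver result (f)) true) ; true = blocking is done
       (isReleasable [_] (realized? result))))
    @result))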
@antonmos Yes, I missed that you need to realize the rows in your code.
could you help me understand why next.jdbc.result-set/mapify-result-set does not have a path for just calling (row-builder @builder)?
Because it wants to discourage you from doing it -- because you are losing the performance benefit of what plan tries to do.
hmmm i feel like i am missing something.
what i am trying to do is to start processing the result set on all my cpus while the record set is being paged in from the db
which means that i need to realize each row as it gets downloaded (as governed by :fetch-size)
So you need to call rs/datafiable-row on each row before you decide whether to put it into a batch...
yup, that’s what i am trying now. it’s just odd that i have to do that even though i have no intention of using the navigized version of it
as i understand it, the value of plan is that i can get the data as it comes in, in fetch-size batches
that’s the primary value
the secondary value is the ability to get it in an even thinner fashion and delay realization of rows until navigation occurs
No, the primary value is not realizing ResultSet data as Clojure data structures.
is there another way to get the first thing only, without the second thing?
btw, i am just trying to understand if i am using the library correctly, not arguing about what’s more useful or not
You have to realize the row from the ResultSet data as a Clojure data structure if you are going to defer processing (beyond that row).
Realizing that data is expensive -- hence the very first bullet of the README motivation for next.jdbc: "Performance: there's a surprising amount of overhead in how ResultSet objects are converted to sequences of hash maps in clojure.java.jdbc – which can be really noticeable for large result sets – so I wanted a better way to handle that."
The datafiable-row call is deliberately ugly/complex (requiring the connectable etc) so that you are very clear in your code that you are indeed eating that cost.
You can use select-keys on the row to pull out a subset of the data -- and that will produce just a bare hash map with the simple keys that you asked for. That's probably the fastest way to produce a simple hash map containing just the data you want.
(it relies on the get-column-by-label machinery to extract the column value without otherwise realizing any data)
oh, i can call select-keys on the row produced by plan?
“extract column value without otherwise realizing any data” do you mean without pulling data for columns that i don’t want?
It's about more than just that but, yes, it would not pull columns you did not specify.
(`select-keys` is mentioned in the Getting Started page, BTW)
"This means that select-keys
can be used to create regular Clojure hash map from (a subset of) columns in the row, without realizing the row, and it will not implement Datafiable
or Navigable
."
oh i missed that!
The important thing to bear in mind is that the builder fn is not used in that case.
gotcha. so, in my use case, is it better to use select-keys or rs/datafiable-row?
If you know the keys you want to operate on in your db/ybr-loan-fix-base function, use select-keys -- although it ties the logic together a bit.
Also, in your code, you're going to want to use mapv inside the future I think, not map.
i do know the columns, but ideally i wouldn’t have to hardcode them. is there a difference in performance between select-keys and rs/datafiable-row?
(btw, i really appreciate your time on this friday afternoon)
btw, looks like rs/datafiable-row approach is working with our parallel reduction
The difference in performance is that select-keys will only cause a column read by label from the current result set row for each key, and then however select-keys builds the hash map, whereas datafiable-row will construct the result set builder and will then call row-builder on it, which in turn does this:
(->> (reduce (fn [r i] (with-column builder r i))
             (->row builder)
             (range 1 (inc (column-count builder))))
     (row! builder))
->row just creates a transient hash map and row! just makes it persistent.
If you are using a builder adapter with your own choice of label-fn (and maybe qualifier-fn) that could be quite a bit of code being executed.
(on the other hand, select-keys doesn't use a transient hash map to build the result -- probably because it assumes you're building small hash maps?)
it might be as simple as "it was written before transients existed and nobody prioritized improving its performance"
Good point.
wow! i had no idea
good to know!
For what it's worth, the difference between calling (row-builder @builder) and (datafiable-row row connectable opts) is almost nothing.
A call to ResultSet.getRow() (fast), two field accesses on @builder (fast), a call to with-meta to add a hash map with four keys that have function literals as values (pretty fast).
(note, in particular, that the call to navize-row simply returns a function literal!)
totally, very minor overhead. It’s just a bit confusing to the reader why the caller must provide a connection when all the data has already been queried
i had to read the source to understand that it’s required for navigation only
Connectable, not connection. I would suggest passing in a datasource 🙂
(of course, if you want to avoid any attempt at navigation, you could pass nil into it)
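For example, with row and opts from your reducing context:
(rs/datafiable-row row nil opts) ; nil connectable, since you never intend to navigate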
@seancorfield Hello Sean, I have a T-SQL script whose last command is a SELECT, but the result is [{:next.jdbc/update-count 85}], and I'm using (jdbc/execute! (buscar-datasource) [query]) to execute it. How does it work?
If I execute (jdbc/execute! (buscar-datasource) ["BEGIN SELECT 1 AS VALOR END"]) it returns a resultset
our (db/connection) is a datasource, poorly named
@seancorfield thank you for pointing out that we have to use mapv in the future!!
The behavior of trying to run multiple SQL commands in a single call is database-specific @fabrao
You may also need to use the (work-in-progress) multiple result set support on the multi-rs branch and pass :multi true in as an option.
Naming is hard! 🙂
but I think this is only one resultset
You said the script has select as the last command -- which implies it has other commands before that?
as the result in Management Studio
yes, updates, inserts ...
Are you using deps.edn?
yes
is it different using it?
Can you change your next.jdbc dependency to use git and this SHA d974ed23139331015fda3ac281336eccccfe3ff2?
how come?
{:git/url "https://github.com/seancorfield/next-jdbc" :sha "d974ed23139331015fda3ac281336eccccfe3ff2"} instead of {:mvn/version "..."}
and then in your execute! call, pass {:multi true} as options (a third argument).
let me try
Instead of just one result, you will get a vector of all available results.
the same result
(jdbc/execute! (buscar-datasource) [query] {:multi true})
[{:next.jdbc/update-count 85}]
I don't think you got the updated version.
You restarted your REPL or process or whatever? To make sure you pick up the different dependency.
:deps {org.clojure/clojure {:mvn/version "1.10.1"}
       seancorfield/next.jdbc {:git/url "https://github.com/seancorfield/next-jdbc"
                               :sha "d974ed23139331015fda3ac281336eccccfe3ff2"}
       com.microsoft.sqlserver/mssql-jdbc {:mvn/version "7.0.0.jre8"}}
last query is select DISTINCT * from #TAB_RESULTADO
I have one insert before
Oh, my bad! Should be {:multi-rs true}
Sorry, I was working from memory as I had a different branch checked out locally.
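So the corrected call would be:
(jdbc/execute! (buscar-datasource) [query] {:multi-rs true})
;; => a vector of all available results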
going to try now
you are the SQLMAN!!!
worked
You are the first user of the "coming soon" multiple result set functionality! 🙂
@seancorfield again, many thanks for your help!! it seems that we are seeing the improvement that we were looking for! needs a bit more testing but looking pretty good.
So, you can commit it !!!
@fabrao Haha... it's committed... it's just not merged to develop yet because I haven't written the documentation 🙂
But you've given me an idea for something I can use to test it further, without trying to wrangle stored procedures (which are all over the map across different DBs!).
@antonmos That's great to hear!
yes, I don't want to use SPs, just a script
and it works
@antonmos I realized I misspoke on this. There's a datafy call in there on the result set metadata but that won't do anything unless you've also required the next.jdbc.datafy namespace.
So datafiable-row is still very cheap -- unless you require next.jdbc.datafy, and even then it's not horribly expensive.
many thanks
do I have to use this git dependency for good?
Once I've written the docs for multiple result set support, I'll merge it to develop and cut a new release. That's the only thing holding it up.
@antonmos This seems to ignore the row when the batch has 100 elements in it -- and I assume there's some additional step needed to mapv over the last batch if there aren't exactly a multiple of 100 elements in the result set?
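A corrected sketch of that batching reduce (same hypothetical names as before) that keeps the current row, uses mapv inside the future, and doesn't drop the final partial batch:
(let [{:keys [batch futures]}
      (reduce (fn [{:keys [batch futures]} row]
                (let [batch (conj batch (rs/datafiable-row row ds {}))]
                  (if (= 100 (count batch))
                    {:batch [] :futures (conj futures (future (mapv db/ybr-loan-fix-base batch)))}
                    {:batch batch :futures futures})))
              {:batch [] :futures []}
              res)]
  (concat (mapcat deref futures)
          (mapv db/ybr-loan-fix-base batch))) ; the last, partial batch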