sql

All things SQL and JDBC...
antonmos 2020-06-26T19:57:49.424600Z

hi, @seancorfield I am trying to stream process a largish dataset like this

(let [res (jdbc/plan (db/connection) params db-opts)]
  (r/foldcat 
    (r/map #(do (println (.getId (Thread/currentThread)))
                (do-stuff %)) 
           res)))
But it seems to be getting processed in a single thread. Any idea what’s going on? Thanks!

seancorfield 2020-06-26T19:59:51.425500Z

That's nothing to do with next.jdbc. It's just about clojure.core.reducers.

antonmos 2020-06-26T20:00:15.425800Z

it seems that r/fold requires the collection to be foldable

antonmos 2020-06-26T20:00:45.426400Z

is there any reason why that idea is not compatible with next.jdbc?

antonmos 2020-06-26T20:01:05.427200Z

i am trying to understand if i am thinking about this wrong 🙂

seancorfield 2020-06-26T20:01:12.427500Z

next.jdbc/plan returns an IReduceInit like lots of things.

antonmos 2020-06-26T20:01:52.428200Z

do you know of any other way to do streaming parallel processing?

antonmos 2020-06-26T20:02:06.428600Z

of next.jdbc/plan's result

seancorfield 2020-06-26T20:02:43.429700Z

Streaming data onto a channel and processing it in parallel as it is taken off?
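
A minimal sketch of that channel approach with core.async's pipeline-blocking -- the datasource ds, the query, and process-row are placeholders:

(require '[clojure.core.async :as async]
         '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

(let [rows (async/chan 100)
      out  (async/chan 100)]
  ;; producer: stream realized rows from the db onto the channel
  (async/thread
    (reduce (fn [_ row] (async/>!! rows (rs/datafiable-row row ds {})))
            nil
            (jdbc/plan ds ["select * from some_table"]))
    (async/close! rows))
  ;; consumers: process rows in parallel on 8 threads
  (async/pipeline-blocking 8 out (map process-row) rows)
  (async/<!! (async/into [] out)))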

antonmos 2020-06-26T20:02:43.429800Z

i can certainly try to implement foldable on top of IReduceInit, but not sure if that's conceptually possible or not

seancorfield 2020-06-26T20:02:59.430200Z

No idea. I've never used clojure.core.reducers stuff.

antonmos 2020-06-26T20:03:17.430400Z

gotcha!

antonmos 2020-06-26T20:03:31.430700Z

thank you for the channel idea

seancorfield 2020-06-26T20:09:28.431100Z

(of course, now you have me curious -- which is dangerous on a Friday afternoon!)

antonmos 2020-06-26T20:10:09.431400Z

muahaha 🙂

antonmos 2020-06-26T20:10:44.432Z

@dpassen1 and I are trying to grok what it takes to implement CollFold from clojure.core.reducers

seancorfield 2020-06-26T20:13:49.433200Z

Looking at clojure.core.reducers, the answer to your original question is: a plain old IReduceInit is just reduce'd directly by the reducers stuff, so that's a single thread.
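
For reference, the dispatch in clojure.core.reducers looks roughly like this (paraphrased from its source) -- only vectors and hash maps get a parallel fold; everything else, including the IReduceInit from plan, hits the Object fallback:

(extend-protocol CollFold
  Object
  (coll-fold [coll n combinef reducef]
    ;; generic fallback: a plain single-threaded reduce
    (reduce reducef (combinef) coll))
  clojure.lang.IPersistentVector
  (coll-fold [v n combinef reducef]
    (foldvec v n combinef reducef)))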

antonmos 2020-06-26T20:16:38.433400Z

yup

antonmos 2020-06-26T20:24:32.434800Z

btw, i am curious, how come plan returns an IReduceInit and not a LazySeq?

seancorfield 2020-06-26T20:25:04.435500Z

Resource management: you can't let an open connection escape in a lazy sequence.

seancorfield 2020-06-26T20:25:55.436600Z

It was one of the big problems people naively fell into with clojure.java.jdbc whenever they provided a :result-set-fn that was lazy. And, earlier on, it was even easier to make that mistake with the old API.

seancorfield 2020-06-26T20:29:02.436900Z

Here's an example of that "bug" from 2017 https://stackoverflow.com/questions/46886158/result-set-fn-in-clojure-jdbc-return-error-the-result-set-is-closed-why
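
The shape of that bug, roughly -- db-spec, the table, and the column are placeholders:

(require '[clojure.java.jdbc :as old-jdbc])

;; a lazy :result-set-fn escapes the with-open around the connection, so
;; the ResultSet is closed before the seq is realized:
(old-jdbc/query db-spec ["select amount from invoices"]
                {:result-set-fn #(map :amount %)})  ; "The result set is closed"

;; the eager fix:
(old-jdbc/query db-spec ["select amount from invoices"]
                {:result-set-fn #(mapv :amount %)})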

antonmos 2020-06-26T20:30:59.437300Z

ah yes, that makes sense

seancorfield 2020-06-26T20:36:45.439900Z

I believe you could write a reducing function over an IReduceInit that conj'd rows into vectors of size N and dispatched a reduce over each of those via fork-join; at the end of that process you'd have a set of reduces happening in threads plus a final vector of < size N, and you'd then combine all of those results (and reduce over the last piece).

seancorfield 2020-06-26T20:38:03.440700Z

Essentially implementing the foldvec helper from clojure.core.reducers to work inside an into-like reduction...

antonmos 2020-06-26T20:40:00.441200Z

we just wrote something similar

(->>
  (reduce (fn [{:keys [batch futures] :as acc} row]
            (if (= (count batch) 100)
              (assoc acc :batch []
                         :futures (conj futures (future (map db/ybr-loan-fix-base batch))))
              (update acc :batch conj row)))
          {} res)
  (mapcat deref))
where res is the result of plan

seancorfield 2020-06-26T20:43:22.442400Z

So that's "just" parallel map with an unbounded thread pool I think?

antonmos 2020-06-26T20:43:52.443400Z

yea

antonmos 2020-06-26T20:44:07.444100Z

kinda like pmap + batching, over a reducible

seancorfield 2020-06-26T20:44:07.444200Z

That's certainly an easier thing to implement than a fork-join-based parallel fold, but it's good to see it can be done "easily".

antonmos 2020-06-26T20:44:19.444500Z

ill let you know if it works 🙂

antonmos 2020-06-26T20:44:25.444700Z

testing now

seancorfield 2020-06-26T20:46:57.445600Z

I think it would be very interesting to have plan's result implement CollFold...

seancorfield 2020-06-26T20:50:13.445800Z

https://github.com/seancorfield/next-jdbc/issues/125

❤️ 1
seancorfield 2020-06-26T20:50:42.446600Z

Note that it would have to fully realize each row (because of the mutable result set under the hood).

antonmos 2020-06-26T20:51:11.447200Z

that’s fine with me, i am not going after navigation

seancorfield 2020-06-26T20:51:47.447900Z

It's not an issue of navigation, it's about the performance boost you can get by not realizing rows when reducing.

seancorfield 2020-06-26T20:53:15.449300Z

Currently, reducing over plan can completely avoid converting ResultSet "rows" into Clojure data structures if you access columns by name (label) in your reducing function and don't do anything that requires a sequence.
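
Concretely, something like this reads :amount straight off the ResultSet by label and never builds a hash map per row (a sketch -- ds, the table, and the column are placeholders):

(reduce (fn [total row] (+ total (:amount row)))
        0
        (jdbc/plan ds ["select amount from invoices"]))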

seancorfield 2020-06-26T20:54:40.450Z

It can't avoid that if there's going to be any parallel evaluation (because mutability).

seancorfield 2020-06-26T20:56:56.451500Z

Which also means that the reducing function you used in the fold would have to expect to operate on the fully built hash maps and therefore the final column names, rather than being able to use the short names (labels) as in a simple reduce over plan.

antonmos 2020-06-26T21:00:45.452600Z

when we ran the code above, row objects came out as #object[next.jdbc.result_set$mapify_result_set$reify__40720 0x76d61149 "{row} from `plan` -- missing `map` or `reduce`?"]

antonmos 2020-06-26T21:01:01.453400Z

what’s the best way to realize it?

2020-06-26T21:01:56.454400Z

the big thing is fold needs some way to split the results, building a tree of parallel computations, and it is hard to see how you could do that generically on top of a sql query

👍 1
antonmos 2020-06-26T21:02:36.455100Z

should i be using rs/datafiable-row?

2020-06-26T21:03:18.455900Z

if you assume some common things like ids and getting the highest and lowest ids being cheap you can build something like that for scanning whole tables, or ranges of tables, but that breaks down for complex queries

2020-06-26T21:05:27.456800Z

(I built something like this a long time ago against s3 if I recall, not sql, where getting the highest id for a split was not cheap, but I did a binary search kind of probe for it)

2020-06-26T21:08:49.458Z

something to bear in mind is it is kind of frowned upon to do IO in a ForkJoin thread pool (which is what fold puts tasks on)

2020-06-26T21:10:34.459400Z

the forkjoin pools have this notion of managed blockers where you can tell the pool you are doing a blocking task, so it spins up another thread to avoid the threadpool doing nothing while waiting for io

seancorfield 2020-06-26T21:10:36.459600Z

@antonmos Yes, I missed that you need to realize the rows in your code.

antonmos 2020-06-26T21:11:26.460200Z

could you help me understand why next.jdbc.result-set/mapify-result-set does not have a path for just calling

(row-builder @builder)

seancorfield 2020-06-26T21:12:44.461100Z

Because it wants to discourage you from doing it -- because you are losing the performance benefit of what plan tries to do.

antonmos 2020-06-26T21:13:22.461300Z

hmmm i feel like i am missing something.

antonmos 2020-06-26T21:14:00.462100Z

what i am trying to do is to start processing the result set on all my cpus while it is being paged in from the db

antonmos 2020-06-26T21:14:42.463100Z

which means that i need to realize each row as it gets downloaded (as governed by :fetch-size)

seancorfield 2020-06-26T21:15:06.463700Z

So you need to call rs/datafiable-row on each row before you decide whether to put it into a batch...
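
i.e. something along these lines -- a sketch where ds, the query, and the add-to-batch / initial-acc batching helpers are placeholders:

(reduce (fn [acc row]
          ;; realize the row eagerly so it can safely outlive the
          ;; (mutable) ResultSet cursor, then hand it to the batching logic
          (add-to-batch acc (rs/datafiable-row row ds {})))
        initial-acc
        (jdbc/plan ds ["select * from some_table"] {:fetch-size 1000}))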

antonmos 2020-06-26T21:15:41.464400Z

yup, that’s what i am trying now. it’s just odd that i have to do that even though i have no intention of using the navigized version of it

antonmos 2020-06-26T21:17:07.465400Z

as i understand it, the value of plan is that i can get the data as it comes in, in fetch-size batches

antonmos 2020-06-26T21:17:14.465800Z

that’s the primary value

antonmos 2020-06-26T21:17:38.466500Z

secondary value is the ability to get it in an even thinner fashion and delay realization of rows until navigation occurs

seancorfield 2020-06-26T21:17:51.466800Z

No, the primary value is not realizing ResultSet data as Clojure data structures.

antonmos 2020-06-26T21:18:36.467400Z

is there another way to get the first thing only, without the second thing?

antonmos 2020-06-26T21:19:12.468Z

btw, i am just trying to understand if i am using the library correctly, not arguing about what’s more useful or not

seancorfield 2020-06-26T21:20:09.468700Z

You have to realize the row from the ResultSet data as a Clojure data structure if you are going to defer processing (beyond that row).

seancorfield 2020-06-26T21:20:39.469300Z

Realizing that data is expensive -- hence the very first bullet of the README motivation for next.jdbc: "Performance: there's a surprising amount of overhead in how ResultSet objects are converted to sequences of hash maps in clojure.java.jdbc – which can be really noticeable for large result sets – so I wanted a better way to handle that."

seancorfield 2020-06-26T21:22:13.470300Z

The datafiable-row call is deliberately ugly/complex (requiring the connectable etc) so that you are very clear in your code that you are indeed eating that cost.

seancorfield 2020-06-26T21:23:39.471700Z

You can use select-keys on the row to pull out a subset of the data -- and that will produce just a bare hash map with the simple keys that you asked for. That's probably the fastest way to produce a simple hash map containing just the data you want.

seancorfield 2020-06-26T21:24:33.472500Z

(it relies on the get-column-by-label machinery to extract the column value without otherwise realizing any data)
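
For example, this yields plain hash maps containing just those (placeholder) keys, without ever invoking the row builder (a sketch, assuming a datasource ds):

(into []
      (map #(select-keys % [:id :amount :status]))
      (jdbc/plan ds ["select * from invoices"]))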

antonmos 2020-06-26T21:25:29.473100Z

oh, i can call select-keys on the row produced by plan?

antonmos 2020-06-26T21:27:21.474400Z

“extract column value without otherwise realizing any data” -- do you mean without pulling data for columns that i don’t want?

seancorfield 2020-06-26T21:29:17.475200Z

It's about more than just that but, yes, it would not pull columns you did not specify.

seancorfield 2020-06-26T21:30:16.475600Z

(`select-keys` is mentioned in the Getting Started page, BTW)

seancorfield 2020-06-26T21:30:27.475900Z

"This means that select-keys can be used to create regular Clojure hash map from (a subset of) columns in the row, without realizing the row, and it will not implement Datafiable or Navigable."

antonmos 2020-06-26T21:30:40.476200Z

oh i missed that!

seancorfield 2020-06-26T21:31:20.477400Z

The important thing to bear in mind is that the builder fn is not used in that case.

antonmos 2020-06-26T21:32:53.478600Z

gotcha. so, in my use case, is it better to use select-keys or rs/datafiable-row?

seancorfield 2020-06-26T21:33:51.479400Z

If you know the keys you want to operate on in your db/ybr-loan-fix-base function, use select-keys -- although it ties the logic together a bit.

seancorfield 2020-06-26T21:34:17.480100Z

Also, in your code, you're going to want to use mapv inside the future I think, not map.

antonmos 2020-06-26T21:35:33.481400Z

i do know the columns, but ideally i wouldn’t have to hardcode them. is there a difference in performance between select-keys and rs/datafiable-row?

antonmos 2020-06-26T21:35:56.481700Z

(btw, i really appreciate your time on this friday afternoon)

antonmos 2020-06-26T21:38:26.482300Z

btw, looks like rs/datafiable-row approach is working with our parallel reduction

seancorfield 2020-06-26T21:42:24.484300Z

The difference in performance is that select-keys will only cause a column read by label from the current result set row for each key, and then however select-keys builds the hash map, whereas datafiable-row will construct the result set builder and will then call row-builder on it, which in turn does this

(->> (reduce (fn [r i] (with-column builder r i))
             (->row builder)
             (range 1 (inc (column-count builder))))
     (row! builder))

seancorfield 2020-06-26T21:43:03.485Z

->row just creates a transient hash map and row! just makes it persistent.

seancorfield 2020-06-26T21:44:57.486100Z

If you are using a builder adapter with your own choice of label-fn (and maybe qualifier-fn) that could be quite a bit of code being executed.

seancorfield 2020-06-26T21:48:01.487Z

(on the other hand, select-keys doesn't use a transient hash map to build the result -- probably because it assumes you're building small hash maps?)

2020-06-26T21:51:13.487600Z

it might be as simple as "it was written before transients existed and nobody prioritized improving its performance"

seancorfield 2020-06-26T21:51:46.488200Z

Good point.

antonmos 2020-06-26T21:53:56.488500Z

wow! i had no idea

antonmos 2020-06-26T21:54:00.488700Z

good to know!

seancorfield 2020-06-26T21:59:25.489600Z

For what it's worth, the difference between calling (row-builder @builder) and (datafiable-row row connectable opts) is almost nothing.

seancorfield 2020-06-26T22:00:50.491Z

A call to ResultSet.getRow() (fast), two field accesses on @builder (fast), a call to with-meta to add a hash map with four keys that have function literals as values (pretty fast).

seancorfield 2020-06-26T22:01:26.491800Z

(note, in particular, that the call to navize-row simply returns a function literal!)

antonmos 2020-06-26T22:02:22.492800Z

totally, very minor overhead. it’s just a bit confusing to the reader why the caller must provide a connection when all the data has already been queried

antonmos 2020-06-26T22:02:40.493800Z

i had to read the source to understand that it’s required for navigation only

seancorfield 2020-06-26T22:03:57.495500Z

Connectable, not connection. I would suggest passing in a datasource 🙂

seancorfield 2020-06-26T22:04:30.496400Z

(of course, if you want to avoid any attempt at navigation, you could pass nil into it)

fabrao 2020-06-26T22:04:48.496600Z

@seancorfield Hello Sean, I have a T-SQL script whose last command is a SELECT, but the result is [{:next.jdbc/update-count 85}], and I'm using (jdbc/execute! (buscar-datasource) [query]) to execute it. How does it work?

fabrao 2020-06-26T22:05:16.497100Z

If I execute (jdbc/execute! (buscar-datasource) ["BEGIN SELECT 1 AS VALOR END"]) it returns a resultset

antonmos 2020-06-26T22:05:26.497200Z

our (db/connection) is a datasource, poorly named

antonmos 2020-06-26T22:05:54.498300Z

@seancorfield thank you for pointing out that we have to use mapv in the future!!

seancorfield 2020-06-26T22:05:59.498500Z

The behavior of trying to run multiple SQL commands in a single call is database-specific @fabrao

seancorfield 2020-06-26T22:07:41.000200Z

You may also need to use the (work-in-progress) multiple result set support on the multi-rs branch and pass :multi true in as an option.

seancorfield 2020-06-26T22:08:14.000600Z

Naming is hard! 🙂

fabrao 2020-06-26T22:08:25.001Z

but I think this is only one resultset

seancorfield 2020-06-26T22:08:52.001800Z

You said the script has select as the last command -- which implies it has other commands before that?

fabrao 2020-06-26T22:08:54.001900Z

as the result in Management Studio

fabrao 2020-06-26T22:09:12.002200Z

yes, updates, inserts ...

seancorfield 2020-06-26T22:09:25.002400Z

Are you using deps.edn?

fabrao 2020-06-26T22:09:29.002600Z

syes

fabrao 2020-06-26T22:09:31.002800Z

yes

fabrao 2020-06-26T22:09:56.003400Z

does it work differently with deps.edn?

seancorfield 2020-06-26T22:10:05.003600Z

Can you change your next.jdbc dependency to use git and this SHA d974ed23139331015fda3ac281336eccccfe3ff2?

fabrao 2020-06-26T22:10:23.004Z

how come?

seancorfield 2020-06-26T22:10:47.004500Z

{:git/url "https://github.com/seancorfield/next-jdbc" :sha "d974ed23139331015fda3ac281336eccccfe3ff2"} instead of {:mvn/version "..."}

seancorfield 2020-06-26T22:11:12.005100Z

and then in your execute! call, pass {:multi true} as options (a third argument).

fabrao 2020-06-26T22:12:01.005400Z

let me try

seancorfield 2020-06-26T22:12:25.005800Z

Instead of just one result, you will get a vector of all available results.

fabrao 2020-06-26T22:15:13.006100Z

the same result

fabrao 2020-06-26T22:15:41.006300Z

(jdbc/execute! (buscar-datasource) [query] {:multi true})

fabrao 2020-06-26T22:15:49.006500Z

[{:next.jdbc/update-count 85}]

seancorfield 2020-06-26T22:16:24.006900Z

I don't think you got the updated version.

seancorfield 2020-06-26T22:16:49.007400Z

You restarted your REPL or process or whatever? To make sure you pick up the different dependency.

fabrao 2020-06-26T22:17:49.008600Z

:deps {org.clojure/clojure {:mvn/version "1.10.1"}
       seancorfield/next.jdbc {:git/url "https://github.com/seancorfield/next-jdbc"
                               :sha "d974ed23139331015fda3ac281336eccccfe3ff2"}
       com.microsoft.sqlserver/mssql-jdbc {:mvn/version "7.0.0.jre8"}}

fabrao 2020-06-26T22:18:30.008900Z

last query is select DISTINCT * from #TAB_RESULTADO

fabrao 2020-06-26T22:18:43.009200Z

I have one insert before

seancorfield 2020-06-26T22:19:00.009500Z

Oh, my bad! Should be {:multi-rs true}
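
i.e., presumably:

(jdbc/execute! (buscar-datasource) [query] {:multi-rs true})
;; => a vector of all available results (result sets and/or update counts)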

seancorfield 2020-06-26T22:19:17.009900Z

Sorry, I was working from memory as I had a different branch checked out locally.

fabrao 2020-06-26T22:20:06.010900Z

going to try now

fabrao 2020-06-26T22:21:15.011600Z

you are the SQLMAN !!!

fabrao 2020-06-26T22:21:21.012Z

worked

seancorfield 2020-06-26T22:21:43.012800Z

You are the first user of the "coming soon" multiple result set functionality! 🙂

antonmos 2020-06-26T22:21:47.013Z

@seancorfield again, many thanks for your help!! it seems that we are seeing the improvement that we were looking for! needs a bit more testing but looking pretty good.

fabrao 2020-06-26T22:22:13.013300Z

So, you can commit it !!!

seancorfield 2020-06-26T22:22:56.014Z

@fabrao Haha... it's committed... it's just not merged to develop yet because I haven't written the documentation 🙂

seancorfield 2020-06-26T22:24:06.014800Z

But you've given me an idea for something I can use to test it further, without trying to wrangle stored procedures (which are all over the map across different DBs!).

seancorfield 2020-06-26T22:24:18.014900Z

@antonmos That's great to hear!

fabrao 2020-06-26T22:25:38.015900Z

yes, I don't want to use SPs, just a script

fabrao 2020-06-26T22:25:47.016100Z

and it works

seancorfield 2020-06-26T22:25:53.016200Z

@antonmos I realized I misspoke on this. There's a datafy call in there on the result set metadata but that won't do anything unless you've also required the next.jdbc.datafy namespace.

seancorfield 2020-06-26T22:26:27.016400Z

So datafiable-row is still very cheap -- unless you require next.jdbc.datafy and even then it's not horribly expensive.

fabrao 2020-06-26T22:27:17.016700Z

many thanks

fabrao 2020-06-26T22:27:38.017Z

do I have to use this git for good?

seancorfield 2020-06-26T22:31:24.017800Z

Once I've written the docs for multiple result set support, I'll merge it to develop and cut a new release. That's the only thing holding it up.

seancorfield 2020-06-26T22:59:42.018200Z

@antonmos This seems to ignore the row when the batch has 100 elements in it -- and I assume there's some additional step needed to mapv over the last batch if there aren't exactly a multiple of 100 elements in the result set?

antonmos 2020-06-27T14:16:55.023600Z

yup, that was a todo 🙂

seancorfield 2020-06-27T18:05:25.024Z

Let me know if you start using the built-in fold support and how that goes @antonmos
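
For what it's worth, a corrected version of the batching sketch from earlier in the thread might look like the following, folding in the fixes discussed: realize each row with rs/datafiable-row, use mapv (not map) inside the future, don't drop the row that triggers a flush, and flush the final partial batch. ds, params, db-opts, and db/ybr-loan-fix-base are the same placeholders as in the original snippet.

(let [{:keys [batch futures]}
      (reduce (fn [{:keys [batch futures] :as acc} row]
                (let [batch (conj batch (rs/datafiable-row row ds {}))]
                  (if (= (count batch) 100)
                    {:batch   []
                     :futures (conj futures (future (mapv db/ybr-loan-fix-base batch)))}
                    (assoc acc :batch batch))))
              {:batch [] :futures []}
              (jdbc/plan ds params db-opts))
      ;; flush whatever is left in the final, partial batch
      futures (cond-> futures
                (seq batch) (conj (future (mapv db/ybr-loan-fix-base batch))))]
  (mapcat deref futures))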