sql

All things SQL and JDBC...
2021-01-19T16:59:15.022500Z

I'm using next.jdbc to do a simple select query, and it's awfully slow compared to the same query being run in C#. I'm finding Clojure to be around 10x slower. I'm using the exact same select query, what am I doing wrong?

(defn query-sentinel [computer-name sql-query]
  (let [db {:dbtype   "sqlserver"
            :dbname   dbname
            :host     host
            :user     user
            :password password}
        ds (jdbc/get-datasource db)]
    (jdbc/execute! ds [sql-query])))
sql-query is just a string that holds my select query. The query returns one field. In Clojure it's taking around 600ms, in C# around 50ms.
var query = GetRole(computerName, databaseName, instanceName);
using(var sqlConnection = new SqlConnection(ConnectionString))
using(var sqlCommand = new SqlCommand(query, sqlConnection))
{
    sqlConnection.Open();
    var result = sqlCommand.ExecuteReader();
    return !result.Read() ? null : result["RoleSummary"].ToString();
}
Is there something I can do to speed up the Clojure version? The difference over five hundred to a thousand queries really adds up. Sometimes the Clojure version is as high as 750ms per query.

Nassin 2021-01-19T17:35:32.023500Z

you are creating a new db connection each time you call the function, which is expensive

seancorfield 2021-01-19T17:42:38.024500Z

@qmstuart As it says in the docs, it's a good idea to use a connection pooling library. Support for c3p0 and HikariCP is built into next.jdbc -- see next.jdbc.connection/->pool
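For illustration, a rough sketch of the HikariCP route -- assuming the HikariCP dependency is on the classpath, with placeholder connection values standing in for the real ones:

(ns app.db
  (:require [next.jdbc :as jdbc]
            [next.jdbc.connection :as connection])
  (:import (com.zaxxer.hikari HikariDataSource)))

;; placeholder connection details -- substitute your real dbname/host/user/password
(def db-spec {:dbtype   "sqlserver"
              :dbname   "my-database"
              :host     "db-host"
              :user     "db-user"
              :password "secret"})
;; note: depending on your next.jdbc/HikariCP versions, the pool may expect :username rather than :user

;; build the pool once at startup and reuse it for every query
(defonce ^HikariDataSource pool (connection/->pool HikariDataSource db-spec))

(defn query-sentinel [sql-query]
  (jdbc/execute! pool [sql-query]))

;; and (.close pool) when the app shuts down

The point is that the pool is created once; each query then borrows a warm connection instead of paying the TCP + login handshake every time.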

Nassin 2021-01-19T18:00:52.026100Z

for simple use cases one could just use get-connection (to reuse the connection object) and close it on shutdown?

seancorfield 2021-01-19T18:03:41.028300Z

Yup, calling get-connection on the db spec hash map and reusing that across multiple SQL operations would also work -- but it would not necessarily be robust enough for production code (if the connection is broken for whatever reason, your app is hosed, whereas with a connection pool, bad connections can usually be detected and thrown away, and then you get a new, valid connection from the pool on the next call).
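A minimal sketch of that (single shared connection, placeholder connection values again):

;; open one connection up front and reuse it for every query
(def conn (jdbc/get-connection {:dbtype   "sqlserver"
                                :dbname   "my-database"
                                :host     "db-host"
                                :user     "db-user"
                                :password "secret"}))

(defn query-sentinel [sql-query]
  (jdbc/execute! conn [sql-query]))

;; close it once, when the program is finished:
;; (.close conn)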

πŸ‘ 1
seancorfield 2021-01-19T18:06:31.030600Z

How expensive it is to stand up a brand new connection depends very much on the DB and whatever network you have between the app and the actual DB. In production, we use a c3p0 connection pool and we use ProxySQL in front of MySQL (Percona), so our app->db hop is via loopback (to the proxy), and that makes things pretty robust in terms of bad connections / connection loss.

2021-01-19T18:22:07.033800Z

Thanks, I'll have a look into connection pooling. I didn't bother about that since it's just a console application that needs to make some db queries. It's not long running, but it does hit the db a lot. I was just surprised by the time difference, as even the C# version is creating a new connection each time, yet it's still really fast.

Nassin 2021-01-19T18:29:49.035Z

What database? FWIW, with postgres the overhead of creating a new connection each time (after the first one) is ~25ms for me

2021-01-19T18:30:55.035400Z

it's MS SQL Server.

seancorfield 2021-01-19T18:30:58.035700Z

@kaxaw75836 According to the db spec, it's MS SQL Server

2021-01-19T18:31:14.036200Z

I mean, the C# creates the connection, runs the SQL, and returns me the result in 50ms.

Nassin 2021-01-19T18:31:26.036600Z

HikariCP is a pretty light dependency anyway

2021-01-19T18:31:29.036800Z

Or maybe my Clojure isn't equivalent to my C#?

seancorfield 2021-01-19T18:33:09.037900Z

It's also hard to do timing/benchmarking "in the small" so you may not be comparing like with like.

seancorfield 2021-01-19T18:34:09.039100Z

The first time you open a connection in a Java app, it's doing a lot of work including loading and initializing the JDBC driver, but subsequent connections should be much faster.
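As a rough (non-rigorous) illustration, you could warm things up once and then time only the steady-state calls -- ds here is a datasource built from a db spec like the one above:

(def ds (jdbc/get-datasource db-spec))

(jdbc/execute! ds ["SELECT 1"])        ; first call pays driver loading + first-connection cost

(time
  (dotimes [_ 100]
    (jdbc/execute! ds ["SELECT 1"])))  ; later calls show the steady-state per-query cost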

2021-01-19T18:34:32.039600Z

I know if I run the C# version of the program, it runs in about 2 mins, and that's also writing the result of each db call to an Excel spreadsheet. In Clojure I'm just making the db calls with no file IO, and it's taking closer to 10-12 mins.

2021-01-19T18:35:11.040100Z

OK, I'll look into creating the connection once and reusing it to see how it goes, thanks guys

Aidan 2021-01-19T19:40:24.053200Z

👋 first time poster here! I'm using next.jdbc and trying to get my results as kebab-case maps when using plan for a streamable result set. I have two approaches in the works. Approach 1 = use :builder-fn

(let [db' (jdbc/with-options db {:auto-commit false})
      q (-> (honey/select etc...))]
  (jdbc/plan db' (hsql/format q) {:fetch-size 5000
                                  :concurrency :read-only
                                  :cursors :close
                                  :result-type :forward-only
                                  :builder-fn rs/as-kebab-maps}))  ;;<===
However, when I try to reduce over the resulting ResultSet and access the kebab-case keys, I can't. I still have to use snake_case to pull keys from the rows. If I explicitly convert the rows to hashmaps using next.jdbc.result-set/datafiable-row that works, but I'm wondering if I can tweak something to be able to access my kebab-case keys right away, since I don't love having to use next.jdbc functions outside of my database namespaces. I also noticed certain functions like println will print the kebab-case keys with no prior datafication. Is the best solution to use next.jdbc.result-set/datafiable-row? Or is there a better way to do this that keeps the next.jdbc-specific logic within the function? Approach 2 = wrap the ResultSet to create a new reducible
(let [db' (jdbc/with-options db {:auto-commit false})
      q (-> (honey/select etc...))]
  (eduction (map my-kebab-maps-fn)  ;;<===
            (jdbc/plan db' (hsql/format q) {:fetch-size 5000
                                            :concurrency :read-only
                                            :cursors :close
                                            :result-type :forward-only})))
My issue with this is that I've never used eduction before and I'm not sure I'm using it correctly here. I really want to make sure I'm not accidentally evaluating the result set too greedily or in a way that would cause a memory error on large result sets. Does using eduction here preserve the memory-efficient, streamable nature of jdbc/plan? Thanks!

2021-01-19T19:49:54.053900Z

with #1, when it doesn't work, how are you accessing the kebab case keys?

Aidan 2021-01-19T19:53:16.055Z

I've been doing something like this to test:

(run! #(-> % :ns/my-field println) result-set)

Aidan 2021-01-19T19:53:53.055700Z

So just "calling" the keyword fieldname on the row, as if it were a map

seancorfield 2021-01-19T19:55:29.057Z

Inside the reducing function over plan, the builder function is irrelevant -- you are working with column labels because there is no Clojure data structure in play. That's the whole point of using plan: to avoid that overhead.
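For example (hypothetical ds, table, and column), a reduction over plan that touches only column labels, with no row ever realized:

(reduce (fn [total row]
          (+ total (:unit_price row)))  ; plain column-label lookup on the row abstraction
        0
        (jdbc/plan ds ["SELECT unit_price FROM order_items"]))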

seancorfield 2021-01-19T19:56:12.057700Z

If you just want a fully realized vector of hash maps back, use execute!, not plan.
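e.g., with the same hypothetical table, this returns fully-built kebab-case maps directly (rs being next.jdbc.result-set):

(jdbc/execute! ds
               ["SELECT unit_price, created_at FROM order_items"]
               {:builder-fn rs/as-kebab-maps})
;; => [{:order-items/unit-price ..., :order-items/created-at ...} ...]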

Aidan 2021-01-19T20:01:45.060900Z

I see, that makes sense. I was under the impression that execute! would not work well for streaming large result sets though, because it fully realizes the entire thing (I'm using pedestal's http://pedestal.io/reference/streaming by passing a fn in my response body). So should I stick with plan if I want to handle the results as a stream, or am I misunderstanding how execute! can be used?

seancorfield 2021-01-19T20:03:13.061400Z

It's important to understand that plan lets you stream-from-the-DB but you must reduce it eagerly.

seancorfield 2021-01-19T20:04:11.062100Z

Resource management is why: when you reduce over plan, it opens a connection, reads and processes the data, and then closes the connection.

Aidan 2021-01-19T20:12:05.064600Z

Thanks for the clarification. So if I understand, does that mean I don't need to use plan here, since I plan to materialize every row? Or should I still use plan because I can materialize one row at a time, which will save memory?

Aidan 2021-01-19T20:40:35.066900Z

And from what I understand, the use of eduction around the result set should also be fine, as in Approach 2

seancorfield 2021-01-19T21:34:36.068700Z

@aidan Sorry, stepped away for lunch... If you use execute! the entire result set is realized and returned. If you can fit the result easily in memory, that's fine. If you can't fit the result in memory -- and so you need streaming-results-from-the-DB -- or you want to avoid the overhead associated with creating a full Clojure data structure from the result set object, then you need plan.

seancorfield 2021-01-19T21:35:04.069200Z

If you use execute!, the builder function is used -- to build the Clojure data structure.

πŸ‘ 1
seancorfield 2021-01-19T21:36:13.070800Z

If you use plan, the intent is that you can avoid the overhead of building the Clojure data structure by accessing columns by their label (a simple string or keyword). You are responsible for creating whatever data structure you need from the row-abstraction. There are many ways to approach that, depending on exactly what you want to do.

seancorfield 2021-01-19T21:36:59.071800Z

You could call datafiable-row on each row if you want streaming but also want a fully-realized hash map for each row that uses your builder function.
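Roughly (ds and q as in the earlier snippets, and write-csv-row! standing in for whatever does the output):

(run! (fn [row]
        (-> (rs/datafiable-row row ds {:builder-fn rs/as-kebab-maps})
            (write-csv-row!)))          ; write-csv-row! is a hypothetical output fn
      (jdbc/plan ds (hsql/format q) {:fetch-size 5000}))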

seancorfield 2021-01-19T21:37:32.072400Z

But if your data volumes are large enough to require streaming, you probably don't want to fully-realize a Clojure hash map for each row!

Aidan 2021-01-19T21:37:44.072500Z

I see, so while I may like my idiomatic kebab-case (I'm recently aware there is a spirited online debate about this in Clojure), it may be most efficient to use snake case with plan

seancorfield 2021-01-19T21:38:08.072700Z

It is most efficient inside the reducing function over plan to use the simple column names.

seancorfield 2021-01-19T21:38:31.073Z

It really depends what you want to do with the data from each row.

Aidan 2021-01-19T21:39:50.073200Z

I'm streaming a csv file to the client with the results, so just accessing certain fields of the row in a given order. Seems like the best bet is to leave it as snake_case

seancorfield 2021-01-19T21:44:05.077100Z

Streaming the result of a plan call is problematic because plan expects the result to be consumed eagerly.

seancorfield 2021-01-19T21:47:11.079900Z

eduction returns a reducible so you're in the same situation as plan: you are expected to eagerly consume the result, as I understand it.

Aidan 2021-01-19T21:50:09.080100Z

Ah, I see, I think even though I'm streaming to the client the result set may be consumed eagerly, since I just have a call to run! that puts all the results onto the IO stream.

Aidan 2021-01-19T21:51:53.080400Z

is it still better than execute! even if I realize/materialize the maps as I reduce over the plan result set? As I reduce using run!, only one row is realized at a time right? So while the time complexity would be worse, I was hoping the space complexity would not be too much worse.

seancorfield 2021-01-19T22:18:53.080900Z

(run! <something> (plan ...))
You control that <something> so you can determine what gets realized. If you can send snake_case to the IO stream, then you could use #(select-keys % [:the :set_of :cols :you_need]) and it will only create a simple hash map with just those keys, rather than realize the entire row into a hash map with transformed keys.
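A sketch of that <something> (hypothetical column names, and write-row! standing in for whatever pushes one CSV line onto the response's output stream):

(run! (fn [row]
        (write-row! (select-keys row [:role_summary :computer_name])))
      (jdbc/plan ds (hsql/format q) {:fetch-size 5000}))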

seancorfield 2021-01-19T22:22:10.081100Z

execute! will materialize the entire result set into memory, yes. With plan you can choose how much or how little you materialize. If you are realizing each row completely, you'll use about the same amount of memory overall but it can be GC'd as you walk the result set.

seancorfield 2021-01-19T22:22:42.081300Z

(unless you are reducing each row into a data structure, in which case you're just doing what execute! would do anyway! 🙂)

😅 1
Aidan 2021-01-19T22:29:59.081600Z

First off, thanks so much for taking the time to respond to my million questions, it's a great help. When you say "If you are realizing each row completely, you'll use the same amount of memory overall", do you mean the same amount of memory as execute! or plan ? I just figured garbage collection was quick enough that the memory used for the materialized row would be freed up with each iteration...

seancorfield 2021-01-19T22:32:25.081800Z

Realizing a row to a (qualified/transformed) hash map takes X amount of memory and Y amount of time. If your result set has 1,000 rows, you'll use X,000 memory and Y,000 time -- in both execute! and plan (if you are realizing rows).

seancorfield 2021-01-19T22:33:48.082Z

It is possible to overrun GC even with plan but relatively unlikely (although it depends what else your system is doing).

seancorfield 2021-01-19T22:34:41.082200Z

I'm really just pointing out that plan offers the ability to avoid using X amount of memory per row if you can use just the column labels -- but it's going to depend on what you do with those column values.

seancorfield 2021-01-19T22:35:09.082400Z

There's no simple nor single answer here. "It. Depends."

Aidan 2021-01-19T22:35:16.082600Z

hmm ok thanks for spelling it out for me! So I guess most of the memory improvement I was seeing was from following https://github.com/seancorfield/next-jdbc/blob/develop/doc/tips-and-tricks.md#streaming-result-sets-1 in your docs, which mitigated the fact that I was fully realizing the rows

seancorfield 2021-01-19T22:35:56.082900Z

I'm not sure what "memory improvement" you're talking about there.

seancorfield 2021-01-19T22:37:32.083100Z

If you're saying "with execute! I got an Out Of Memory error and when I used plan I did not" well, yes, your throughput on processing the rows allowed the GC to avoid the OOM error.

Aidan 2021-01-19T22:38:24.083300Z

If I recall we were getting an OOM before we added those options to plan, but I'm confirming now with someone who made the PR

seancorfield 2021-01-19T22:40:59.083500Z

The "problem" here is that SQL and databases are fussy things and every one is different, so there's no single, "true" answer for a variety of problems. Which means it's important to understand what's really going on in your code and what trade-offs you are making. When I reach for plan, I'm generally also trying to avoid the overhead of realizing a full hash map from a row in the first place -- more so than trying to stream a very large result set that wouldn't fit in memory.

seancorfield 2021-01-19T22:42:09.083700Z

(which is why I may seem to be a) laboring the point b) being obtuse c) not answering the question definitively 🙂 )

Aidan 2021-01-19T22:43:00.083900Z

Ok, that's very useful to know. I think given this discussion I'll move forward and not attempt to kebabify the maps as it would interfere with the gains from plan

Aidan 2021-01-19T22:44:15.084100Z

I realize most issues like this don't have a definitive answer, so I appreciate the thoughtful responses 🙂