datascript

Immutable database and Datalog query engine for Clojure, ClojureScript and JS
superstructor 2021-03-16T05:30:13.002600Z

I've heard that there can be performance issues with Datascript in single page apps with "large" databases? Is that still true, or an old myth? What are your experiences with performance? Are there any types of queries / use cases to avoid? Looking at it from the perspective of re-frame, possibly official support, etc.

superstructor 2021-03-16T07:25:30.003300Z

Thanks for the reply @huxley 🙂 Roughly how big is your datascript db?

2021-03-16T07:41:48.003500Z

several thousand entities

simongray 2021-03-16T07:44:28.003800Z

@huxley nice - if damning - review. It’s good to hear from people who have used it in anger (literally, it would seem).

2021-03-16T07:47:03.004Z

there was once a pretty interesting conversation on this topic here, but the archive doesn't work

2021-03-16T07:47:45.004200Z

lilactown created autonormal https://github.com/lilactown/autonormal

2021-03-16T07:48:53.004500Z

roman01la also mentioned problems with re-posh and datascript in general

2021-03-16T07:51:33.005Z

joinr commented on this same topic on Reddit

simongray 2021-03-16T08:24:18.005200Z

Hah, he even made that comment as a reply to me 😛

❤️ 1
2021-03-16T08:28:26.005400Z

I'm a compulsive github browser, and overall what I've noticed recently is that everyone seems to have run into the same problems and everyone is trying to solve them somehow.

2021-03-16T08:32:34.005600Z

one is state management, preferably using as flat and normalized a db as possible, or using a datalog. another is getting rid of unnecessary repeated recalculations using graphs. the third is closer integration with react; rumext, helix or uix are only the first examples.

simongray 2021-03-16T08:44:43.005800Z

yup, noticed the same thing

simongray 2021-03-16T08:45:24.006Z

been trying to get an overview by compiling a list of the graph stuff: https://github.com/simongray/clojure-graph-resources

superstructor 2021-03-16T08:47:23.006300Z

Thanks! That is all very helpful. I'll look into all that. We want to improve the state management and subs (graph) story in re-frame, but I doubt we'll ever diverge from reagent as the backwards compat is probably too much of an issue.

2021-03-16T08:48:29.006500Z

@simongray you can also add the mentioned autonormal

2021-03-16T08:49:13.006700Z

there is also no problem using meander to search graphs; it's even in the examples

simongray 2021-03-16T08:49:18.006900Z

@huxley yup, I have a few things I need to add, just been a bit busy (just got a kid)

👍 1
🎉 1
superstructor 2021-03-16T08:49:43.007400Z

@simongray congrats! How old? I have a 3-month-old boy.

👍 1
simongray 2021-03-16T08:49:55.007600Z

6 weeks on Thursday 🙂

1
simongray 2021-03-16T08:50:12.007800Z

and thank you - you too!

superstructor 2021-03-16T08:50:37.008Z

Oh awesome, that is a good milestone. Those initial weeks can be full-on; after that it's all a bit more reasonable.

2021-03-16T08:53:02.008200Z

@superstructor reagent is great. So great that one of the main inspirations for hooks was reagent itself. However, I understand people who are just now facing the choice of a React wrapper, and I also understand that they may want to use the leanest one possible. Today reagent doesn't offer as much over plain React as it did a few years ago.

2021-03-16T08:53:47.008400Z

@simongray congratulations. Even though I don't have children myself, I am glad that others do;)

superstructor 2021-03-16T08:54:31.008600Z

@huxley That meander datascript example is cool; basically a macro-based conversion of datalog to meander at compile time.

simongray 2021-03-16T08:55:33.008800Z

very cool, though not sure how practical it is

simongray 2021-03-16T08:55:45.009100Z

@huxley thank you!

2021-03-16T08:56:48.009300Z

I wrote it for fun and to get to know meander better

2021-03-16T08:58:46.009500Z

it's not practical at all and probably has a lot of bugs, but it's just an example that, silly as it may be, it's possible ;)

pithyless 2021-03-16T09:19:15.009700Z

Another anecdotal experience report: I don't have much experience with re-posh, but last week I pitched in to investigate some performance issues with the athensresearch project. The codebase uses re-posh and re-frame and does a lot of recursive pulls, which seems to cause havoc on the posh pull-analyzer. Here's an example: https://github.com/athensresearch/athens/pull/665#issuecomment-790088361

👍 2
simongray 2021-03-16T09:25:59.010200Z

Are you involved with Athens, @pithyless? I’m amazed at how that project came out of a single tweet and just started snowballing.

pithyless 2021-03-16T09:33:04.010500Z

I came across it by accident when I was reading about all these new org-like tools that are popping up, and ended up submitting a couple of PRs. They definitely seem to have a lot of momentum right now, their discord channel has a lot of activity (and IIUC, they have some funding sources); but the competition is fierce. Also, they're definitely going to have to fight through some scaling pains - I mentioned the re-posh stuff and also the way it's now handling durable storage.

👍 1
pithyless 2021-03-16T09:35:13.010700Z

I found it interesting when comparing Athens to what the LogSeq project (also a CLJS project) is doing; LogSeq is e.g. using git for their sync layer and OCaml for their Markdown parser (and modeling data at the page, not block, level).

simongray 2021-03-16T11:01:00.011200Z

@pithyless If you had to pick some stack for modelling data in the frontend, what would you go with?

pithyless 2021-03-16T12:04:13.011500Z

@simongray not sure what you mean; if you're talking about frameworks/libraries, my go-to stack is fulcro+shadow-cljs (vs say re-frame+figwheel); but you know... it depends. 🙂 If you're talking more specifically about datastores, Fulcro's generic 3-layer DB approach is fast enough for most cases (since it's just maps and lookup-refs); you can always add reactive mutations if you'd like; and if you put Pathom behind it you're free to swap out and add a more complicated datastore (Datomic / DataScript / SQL / etc). I'm definitely keeping an eye on Asami for its speed and durability promises and I hope to use it in anger sometime.

👍 1
pithyless 2021-03-16T12:05:59.011700Z

I think that was kind of rambling; so I usually need a reason not to hide everything behind a Pathom EQL API (irrespective of what ends up resolving the query).

pithyless 2021-03-16T12:06:49.011900Z

but with that approach, Fulcro's DB map with lookup-refs works nicely for fetching data locally to components
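
For readers who haven't seen it, a minimal sketch of what a Fulcro-style normalized db with lookup-refs looks like (the table and attribute names here are invented for illustration):

(def client-db
  {:person/id {1 {:person/id 1 :person/name "Ann"
                  :person/friends [[:person/id 2]]}
               2 {:person/id 2 :person/name "Bob"}}
   :root/current-user [:person/id 1]})

;; an ident is just a [table id] vector, so a component read is a plain get-in
(get-in client-db (:root/current-user client-db))
;; => {:person/id 1, :person/name "Ann", :person/friends [[:person/id 2]]}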

raspasov 2021-03-16T12:13:26.012100Z

I am pretty new to Datascript but I noticed the not-so-good performance as well. On less than 2000 entities, on a very recent iPhone (React Native), queries can take as much as 100ms+ (!!!) to fetch around 100 entities.

raspasov 2021-03-16T12:13:56.012300Z

As far as I can tell, the query performance is proportional to the size of the result.

raspasov 2021-03-16T12:14:21.012500Z

It seems to be about ~1ms per entity (that’s a very rough ballpark estimate)

raspasov 2021-03-16T12:16:40.012700Z

One solution I've devised is to use DataScript very carefully and avoid using it as the primary source of reads. Basically I came up with a solution that puts an additional atom as a sort of cache in front of DataScript. I use that cache atom to do most reads (instead of going directly to datascript via (query… ) etc, which is quite slow)
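
raspasov doesn't show the code, but the general shape of such a cache is roughly the following sketch (the schema, attribute and query are invented; d/listen! is DataScript's transaction listener):

(require '[datascript.core :as d])

(def conn (d/create-conn {}))

;; derived views cached in a plain atom; hot reads never touch the query engine
(def cache (atom {}))

(defn refresh-cache! [db]
  (reset! cache
          {:active-items (d/q '[:find [(pull ?e [*]) ...]
                                :where [?e :item/active? true]]
                              db)}))

;; recompute only when the db actually changes
(d/listen! conn :cache
           (fn [tx-report]
             (refresh-cache! (:db-after tx-report))))

(defn active-items []
  (:active-items @cache))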

raspasov 2021-03-16T12:17:00.012900Z

But I like the expressive power of the Datalog queries… so it’s definitely a trade-off.

raspasov 2021-03-16T12:17:57.013100Z

I wish I could use it directly as a primary source of reads but it's simply too slow for the needs of my mobile application, where sometimes I need to read values from the app state dozens of times a second.

2021-03-16T12:29:54.013300Z

Even though datalog is infinitely powerful, it's usually not used to its full potential on the frontend, and thanks to clojure's expressiveness you can achieve the same effect with not much more code. For the more ambitious there is still meander.

raspasov 2021-03-16T12:30:57.013600Z

@huxley how many datoms approx. do you have in your database when you noticed the slowdown? Were you using indices in DataScript?

2021-03-16T12:31:28.013900Z

As much as I was a big proponent of datascript, I currently advise everyone against it. Datalog on the BE side ❤️. On the frontend side, state is best managed in fulcro or in a way identical to fulcro.

simongray 2021-03-16T12:32:16.014100Z

@pithyless I was just wondering what libs you used for handling state and how you handle those transitions between frontend and backend, basically. Thank you for answering.

2021-03-16T12:32:37.014400Z

@raspasov We have several thousand entities in production.

simongray 2021-03-16T12:34:04.014600Z

It doesn't make much sense to me that an in-memory db can be that slow.

raspasov 2021-03-16T12:34:13.014800Z

@huxley Did you have the requirement to run on mobile? That's where I noticed the bulk of the slowdown. I tested on non-mobile and the perf was quite a bit better.

raspasov 2021-03-16T12:34:21.015Z

@simongray I was SUPER surprised as well.

2021-03-16T12:34:34.015300Z

yes

2021-03-16T12:34:58.015500Z

if someone really wants a datalog, I recommend asami

2021-03-16T12:35:09.015700Z

it is 100-200x faster

raspasov 2021-03-16T12:35:10.015900Z

There’s an explanation by tonsky here: https://github.com/tonsky/datascript/issues/130

simongray 2021-03-16T12:36:08.016200Z

So what does Asami do differently? AFAIK it started as a fork of Datascript.

simongray 2021-03-16T12:36:26.016400Z

Just like Datahike and Datalevin

raspasov 2021-03-16T12:37:01.016600Z

Perhaps the query planner? (I have no knowledge of asami): • Query planner: Queries are analyzed to find an efficient execution plan. This can be turned off.

2021-03-16T12:37:21.016800Z

Asami has a planner that is additionally cached.

raspasov 2021-03-16T12:38:32.017Z

Yeah… I felt that explanation by tonsky gives a lot of clarity: “DataScript is in different category, so expect different tradeoffs: query speed depends on the size of result set, you need to sort clasuses to have smaller joins, accessing entity properties is not free given its id, etc. As a benefit, you gain ability to query dataset for different projections, forward and reverse reference lookups, joins between different sets, etc. And direct index lookup (`datascript.core/datoms`) is still fast and comparable to lookup in a map (at least comparable, think binary lookup vs hashtable lookup, logarithm vs constant). Queries do much more than that.”

raspasov 2021-03-16T12:38:50.017200Z

This was key for me: “query speed depends on the size of result set”

raspasov 2021-03-16T12:39:05.017400Z

Can’t expect to fetch a giant result set in constant time… It feels more like linear time.
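
The "direct index lookup" from that quote is datascript.core/datoms; a small sketch of the difference (entities and attributes invented, and :db/index is needed for the AVET lookup):

(require '[datascript.core :as d])

(def db
  (d/db-with (d/empty-db {:person/name {:db/index true}})
             [{:db/id 1 :person/name "Ivan" :person/age 30}
              {:db/id 2 :person/name "Petr" :person/age 40}]))

;; query engine: cost grows with the size of the result set
(d/q '[:find ?e :where [?e :person/name "Ivan"]] db)
;; => #{[1]}

;; direct index lookup: a seek into the sorted AVET index
(map :e (d/datoms db :avet :person/name "Ivan"))
;; => (1)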

2021-03-16T12:40:25.017600Z

[(datascript-q1) (asdb-q1) (mdb-q1)]
;; => [3.99 0.15 17.51]

[(datascript-q4) (asami-q4) (mdb-q4)]
;; => [169.43 3.58 167.25]

raspasov 2021-03-16T12:41:00.017800Z

@huxley are those times in ms?

2021-03-16T12:41:06.018Z

yes

2021-03-16T12:41:07.018200Z

jvm

raspasov 2021-03-16T12:41:34.018400Z

asdb: asami?

2021-03-16T12:41:36.018600Z

mdb is a simple replica of the datalog in meander, which I posted here

2021-03-16T12:41:39.018800Z

yes

raspasov 2021-03-16T12:41:43.019Z

Cool

raspasov 2021-03-16T12:42:20.019200Z

Any downsides of asami you’ve noticed?

2021-03-16T12:42:47.019400Z

apart from testing, I have not had the opportunity to use it

raspasov 2021-03-16T12:43:35.019600Z

It seems around 50x faster

raspasov 2021-03-16T12:43:44.019800Z

Based on those two queries

2021-03-16T12:44:12.020Z

I was talking with noprompt from cisco while discussing meander, and they are using asami in production along with re-frame

2021-03-16T12:44:24.020200Z

so it's battle tested

raspasov 2021-03-16T12:44:32.020400Z

It feels like DataScript is a pretty simple implementation and leaves a lot on the table for improvement.

2021-03-16T12:45:31.020600Z

actually, considering the speed it offers, I'd say it's rather complicated

raspasov 2021-03-16T12:45:34.020800Z

Databases are tricky things (in memory or not), you need to resort to clever tricks to squeeze performance.

raspasov 2021-03-16T12:46:04.021Z

@huxley alright 🙂 I haven’t explored the internals, so I can’t speak; only speculate.

2021-03-16T12:46:05.021200Z

it's simpler to just use filter, and it's not particularly slower, despite the lack of indexing

raspasov 2021-03-16T12:46:57.021400Z

Rrrright 🙂

pithyless 2021-03-16T12:47:15.021600Z

> Any downsides of asami you’ve noticed? It's not a port of Datascript - it was started independently around the same time - and it doesn't try to be 1:1 feature compatible with the Datomic API. So you might be surprised by how certain things are incompatible with your existing queries (e.g. no pull syntax at the moment, db/idents work differently than DS/Datomic, etc.)

raspasov 2021-03-16T12:47:31.021800Z

Hmmm… giving me a lot of food for thought here; I like the organization of data that datalog provides

pithyless 2021-03-16T12:48:42.022Z

And FYI - there is an active #asami channel on Slack ;]

raspasov 2021-03-16T12:48:58.022200Z

@pithyless just joined, thanks! 🙂

pithyless 2021-03-16T12:49:39.022400Z

There was also a #datalog channel that was created sometime ago, meant for these kind of cross-library discussions, but it has been quiet recently

2021-03-16T12:50:34.022600Z

;; q4 rewritten as a plain transducer over the flat map db
;; (`mdb100k` is presumably the id -> entity-map db with 100k entities;
;;  e/qb looks like a quick-bench over 10 runs)
(defn transduce-q4 []
  (e/qb 1e1
    (into []
          (comp
           (filter (fn [[_ m]] (= "Ivan" (m :name))))
           (filter (fn [[_ m]] (= :male (m :sex))))
           (map (fn [[_ m]] (select-keys m [:db/id :last-name :age]))))
          mdb100k)))

[(datascript-q4) (asami-q4) (mdb-q4) (transduce-q4)]
;; => [158.78 4.39 151.03 46.03]

raspasov 2021-03-16T12:51:42.022800Z

transduce-q4 is just regular Clojure transduce code, yes?

2021-03-16T12:52:07.023Z

yes

2021-03-16T12:52:27.023200Z

you have the code above

pithyless 2021-03-16T12:52:54.023400Z

@huxley have you tried q4 with specter?

2021-03-16T12:53:04.023600Z

yes

raspasov 2021-03-16T12:53:06.023800Z

Yes… Well… One “trick” I’ve resorted to on React Native is runAfterInteractions https://reactnative.dev/docs/interactionmanager (not sure if there’s comparable browser API/trick)

2021-03-16T12:53:21.024200Z

I even have the code, just let me find it

raspasov 2021-03-16T12:53:23.024400Z

Basically it delays the execution of a given fn after all user interactions have ended
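
For reference, in ClojureScript on React Native that trick is plain interop; a sketch (assuming a shadow-cljs style string require):

(ns app.defer
  (:require ["react-native" :refer [InteractionManager]]))

(defn run-after-interactions!
  "Run f once current touches/animations have settled."
  [f]
  (.runAfterInteractions InteractionManager f))

;; e.g. defer an expensive DataScript read so it doesn't block a gesture
(run-after-interactions!
 #(js/console.log "safe to run the heavy query now"))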

raspasov 2021-03-16T12:53:45.024600Z

Another option I’ve explored is running DataScript in its own worker…

raspasov 2021-03-16T12:53:58.024800Z

(That would definitely help, but it comes with its own set of challenges)

raspasov 2021-03-16T12:54:33.025Z

Aka, you can only communicate with your in-memory db asynchronously… but to be fair… with runAfterInteractions… it’s already happening! Lol

pithyless 2021-03-16T12:54:42.025200Z

@huxley I have a suspicion the destructs [_ m] are killing your perf in transduce-q4

2021-03-16T12:55:35.025500Z

Yes, but it's just a quick write-up

2021-03-16T12:56:14.025800Z

db is of the form

{?id {:db/id ?id ?k ?v ...} ...}

2021-03-16T12:57:07.026100Z

this is slower to query, but pull syntax/eql is lightning fast
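
A toy illustration of why pull/EQL over that shape is cheap: it bottoms out in get-in on nested maps (the db contents and helper below are invented):

(def flat-db
  {1 {:db/id 1 :name "Ivan" :friend 2}
   2 {:db/id 2 :name "Petr"}})

(defn pull-flat
  "Resolve a pull-like pattern against an id->entity map.
   A map entry in the pattern is treated as a one-level join."
  [db id pattern]
  (reduce (fn [m k]
            (if (map? k)
              (let [[attr sub] (first k)]
                (assoc m attr (pull-flat db (get-in db [id attr]) sub)))
              (assoc m k (get-in db [id k]))))
          {}
          pattern))

(pull-flat flat-db 1 [:name {:friend [:name]}])
;; => {:name "Ivan", :friend {:name "Petr"}}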

raspasov 2021-03-16T13:01:52.026300Z

I am really curious where the major slow down in DataScript is compared to other options.

raspasov 2021-03-16T13:03:07.026500Z

The similarity to Datomic is still very compelling for me, and the power of Datalog + pull syntax is definitely useful.

raspasov 2021-03-16T13:03:39.026700Z

@huxley have you explored putting :db/index on certain schema elements in DataScript?

pithyless 2021-03-16T13:09:07.026900Z

I find Fulcro's approach pragmatic - seldom do you need the full power of Datalog when you're re-rendering a component; I think of it as a UI data cache for my EQL-backed data (which can still be a proxy for a DataScript instance running in the browser; just not something that needs to run every animation frame).

2021-03-16T13:11:54.027100Z

[(datascript-q4) (asami-q4) (mdb-q4) (transduce-q4) (specter-q4)]
;; => [168.09 4.36 153.06 48.14 49.84]

2021-03-16T13:12:35.027400Z

I had to rewrite it because I lost the specter version of q4

2021-03-16T13:13:16.027600Z

please note that I am far from proficient with specter

raspasov 2021-03-16T13:14:57.027800Z

Specter is quite an amazing tool IMO… Esp. when it comes down to data transformation (less so for just data reading)

pithyless 2021-03-16T13:16:01.028Z

@huxley if you're yak-shaving you may be interested in updating that transduce with some macros from https://github.com/bsless/clj-fast (and bsless also has this library I never played with - https://github.com/bsless/impedance#performance-differences)

pithyless 2021-03-16T13:16:31.028500Z

but I think it's going to be hard to beat asami, since it looks like the query-planner short-circuits a lot of work in your benchmark 😄

quoll 2021-03-17T15:15:41.000700Z

I only joined this channel this morning, so I didn’t see any of the questions here until now. If anyone is interested I can explain how Asami works? It’s quite different to the structures in DataScript.

quoll 2021-03-17T15:16:12.000900Z

Truth be told, if someone had told me about DataScript 5 years ago then I wouldn’t have started Asami

quoll 2021-03-17T15:16:36.001100Z

(Asami was originally part of Naga, and that project started in 2016)

Filipe Silva 2021-03-17T16:05:17.001300Z

heya, I'm the sync person from Roam Research

Filipe Silva 2021-03-17T16:05:56.001500Z

I can't talk about query performance very much, as I mostly operate on transaction semantics and database persistence

Filipe Silva 2021-03-17T16:06:44.001700Z

I can say that for large databases (50+ mb of datascript transit) transact starts getting slow

Filipe Silva 2021-03-17T16:06:51.001900Z

proportionally to the database size

Filipe Silva 2021-03-17T16:07:28.002100Z

transit deserialization is also overall slow, but that's unsurprising

Filipe Silva 2021-03-17T16:07:36.002300Z

this is all from browser CLJS

Filipe Silva 2021-03-17T16:08:26.002500Z

@quoll hey there! kinda curious about the asami query planner, is it still efficient when the data keeps changing?

quoll 2021-03-17T16:08:40.002700Z

We’re still working on durable storage for CLJS, so that’ll be a while, sorry

Filipe Silva 2021-03-17T16:08:53.002900Z

e.g. query query query vs query transact query transact query

quoll 2021-03-17T16:09:06.003100Z

Yes, the whole point of the planner is to base the plan on the data

Filipe Silva 2021-03-17T16:09:45.003300Z

uhm... at Roam we have a mostly generic persistence layer for Datascript

quoll 2021-03-17T16:09:46.003500Z

It relies on the “count” of resolution of individual patterns (these get cached too, so it’s not hitting the DB too much for this)

quoll 2021-03-17T16:10:12.003700Z

Can you explain what you mean by that please?

Filipe Silva 2021-03-17T16:10:29.003900Z

it syncs datascript transactions as a totally ordered list locally first and then remotely

Filipe Silva 2021-03-17T16:10:40.004100Z

right now we use it to sync first to indexeddb, then to firebase

Filipe Silva 2021-03-17T16:10:52.004300Z

but it's based on an abstract driver system

Filipe Silva 2021-03-17T16:11:07.004500Z

so it was easy to make variants for indexeddb+datomic

Filipe Silva 2021-03-17T16:11:18.004700Z

the in-memory db is still datascript

Filipe Silva 2021-03-17T16:11:41.004900Z

but the only things that matter as far as syncing is concerned are the transaction fn and error handling

Filipe Silva 2021-03-17T16:12:09.005100Z

so that can be abstracted to use asami or anything else (e.g. datahike) as long as it's an in-memory database

Filipe Silva 2021-03-17T16:12:53.005300Z

it's important to do in-memory because there can be a lot of rollbacks as optimistic transactions are turned into confirmed txs

Filipe Silva 2021-03-17T16:13:38.005500Z

e.g. two clients doing txs at the same time will have different optimistic orders than the final confirmed order, the sync system "rebases" the optimistic txs on top of the confirmed as these come

Filipe Silva 2021-03-17T16:13:53.005700Z

we were thinking of open sourcing this

Filipe Silva 2021-03-17T16:16:00.005900Z

can asami run as an in-memory db? if so maybe we could work together to make the sync system generic

Filipe Silva 2021-03-17T16:16:39.006100Z

then you could use arbitrary persistence layers via these drivers

quoll 2021-03-17T16:17:02.006300Z

Asami on CLJS is currently only in-memory

Filipe Silva 2021-03-17T16:17:13.006500Z

oh cool then that'd definitely work

Filipe Silva 2021-03-17T16:17:37.006700Z

our sync thing (we call it Link) could persist it to IndexedDB and other places

quoll 2021-03-17T16:17:41.006900Z

I don’t have a pull API for it yet. It hasn’t been a priority

Filipe Silva 2021-03-17T16:20:10.007100Z

are you interested in some collaboration if we can provide an open source persistence layer for in-memory asami dbs?

quoll 2021-03-17T16:22:43.007400Z

Sure. I keep it open source for a reason 🙂

Filipe Silva 2021-03-17T16:23:01.007600Z

coolio, going to see what I can do WRT making our stuff open source

Filipe Silva 2021-03-17T16:23:06.007800Z

will keep you posted

quoll 2021-03-17T16:24:11.008Z

I’m doing persistence right now though. Everything is based on a block abstraction that can be stored in anything (the first implementation of this is memory-mapped files on the JVM, but the second one is going to be IndexedDB… partly implemented now)

quoll 2021-03-17T16:25:27.008200Z

Probably better to ask in #asami 🙂

Filipe Silva 2021-03-17T16:45:41.008400Z

the sync persistence we have is just based on the transaction log (and optionally snapshots)

Filipe Silva 2021-03-17T16:46:20.008600Z

saving immutable serialized transactions instead of mutable data structures
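
A minimal sketch of that transaction-log idea against plain DataScript (the atom-backed log stands in for IndexedDB/Firebase; everything here is illustrative, not Roam's actual Link code):

(require '[datascript.core :as d])

(def conn (d/create-conn {}))

(def tx-log (atom []))  ;; stand-in for a durable, ordered store

(defn datom->op [d]
  [(if (:added d) :db/add :db/retract) (:e d) (:a d) (:v d)])

;; every transaction is appended to the log as plain, serializable data
(d/listen! conn :sync
           (fn [tx-report]
             (swap! tx-log conj (mapv datom->op (:tx-data tx-report)))))

;; on startup, replay the ordered log into a fresh connection
(defn restore! [fresh-conn log]
  (doseq [ops log]
    (d/transact! fresh-conn ops)))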

quoll 2021-03-17T16:49:47.008800Z

https://clojurians.slack.com/archives/C07V8N22C/p1615898168016200?thread_ts=1615872613.002600&cid=C07V8N22C @simongray from a high level, one of the main differences I’d seen is that DataScript (and Datomic) store datoms, and then index them. Asami doesn’t do that. Instead, it has indexes for the valid statements, without pointing at instances of statements. It’s all just nested maps. The main consequence of this is that searching for when statements get created or deleted isn’t so straightforward. But so far we haven’t needed that.

If you’re looking for a :where clause with a single pattern in it, then that might be [entity :my-property '?value]. In this case, both the entity and the attribute have been set. So you can just go to the EAV index, and say: (get-in eav [entity :my-property]) and you have your values. So simple queries that just do a single pattern are literally just a lookup in a map, followed by a lookup in the nested map.

Joins cost a bit more. For instance: [?person :name "Betty"] [?person :age 20]. First of all, the optimizer figures out the pattern with a smaller result, and uses the above to get a result. If this first one is people named “Betty”, then it will go to the AVE index, to get the set of all person entities. It then iterates over that, and uses it to modify the second pattern, which it then looks up. So the first person named “Betty” may be an entity identified by :node-123, which means that the second pattern gets updated to [:node-123 :age 20]. This is resolved with (get-in eav [:node-123 :age 20]), and if it is true, then that value for ?person gets returned. The same goes for every other person who was resolved as well.

How does this compare to joins in DataScript? I don’t actually know! I never looked 🙂

👍 1
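
A toy version of what quoll describes above, with the indexes as plain nested maps (illustrative only, not Asami's actual structures):

;; EAV index: entity -> attribute -> set of values
(def eav
  {:node-1 {:name #{"Betty"} :age #{20}}
   :node-2 {:name #{"Betty"} :age #{35}}
   :node-3 {:name #{"Ivan"}  :age #{20}}})

;; AVE index: attribute -> value -> set of entities
(def ave
  {:name {"Betty" #{:node-1 :node-2} "Ivan" #{:node-3}}
   :age  {20 #{:node-1 :node-3} 35 #{:node-2}}})

;; single pattern [?e :name "Betty"]: just a nested lookup
(get-in ave [:name "Betty"])
;; => #{:node-1 :node-2}

;; join [?person :name "Betty"] [?person :age 20]:
;; resolve the narrower pattern first, then test each binding
(->> (get-in ave [:name "Betty"])
     (filter #(get-in eav [% :age 20])))
;; => (:node-1)
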
quoll 2021-03-17T16:50:47.009100Z

@filipematossilva That makes sense. And it’s easy to replay. It’s not what Asami is using though. I’m saving immutable data structures.

Filipe Silva 2021-03-17T18:55:51.009300Z

the difference in this model (for asami) is that the in-memory version would be fed the relevant transactions on load, and those transactions would be persisted to disk or network separately from asami

quoll 2021-03-17T19:34:12.009500Z

I see. Well, Asami doesn’t store the relevant transactions. That said, they do get returned from a call to transact (like datomic does), meaning that they’re easy to accumulate

2021-03-16T06:57:56.002700Z

We use datascript in production, or rather what's left of it. We had to cut out most of the functionality due to tragic performance with more products in the db. Basically, all that's left is the pull syntax. Datascript, contrary to what we can read on github, has very poor performance. For example datalevin, despite the fact that it uses data stored on disk, is much faster. Almost everything is faster than datascript, even queries written in meander operating on a flat db in fulcro style. The simplest macro that creates a transducer on the fly beats datascript. As for re-posh, sometimes it loses changes, especially if you evict data from the db. It also doesn't allow you to use all the possibilities of datascript, and with a bigger number of arguments in :in it loses order, so you have to wrap the arguments in a vector.

1
2021-03-16T06:59:21.002900Z

a much better db, though still with many rough edges, is Asami