datahike

https://datahike.io/, Join the conversation at https://discord.com/invite/kEBzMvb, history for this channel is available at https://clojurians.zulipchat.com/#narrow/stream/180378-slack-archive/topic/datahike
whilo 2019-11-11T06:59:27.151Z

@csm301 very nice 🙂. one thing that i have not done for the hitchhiker-tree is to flush the dirty segments to disk in parallel before returning the root node. right now they are written in sequence, which is unnecessary. https://github.com/replikativ/hitchhiker-tree/blob/master/src/hitchhiker/tree.cljc#L484

whilo 2019-11-11T07:01:18.152400Z

as you have pointed out, only the root needs to be written atomically; all the other nodes can be written in parallel.

whilo 2019-11-11T07:04:29.153800Z

this requires two phases: 1. walk the tree and trigger all write operations and 2. walk the tree again and wait for each write operation.

whilo 2019-11-11T07:06:26.155500Z

for core.async that is easy to do; unfortunately core.async is a bit slow, so we compile (macro-expand) it away on the JVM. but we could decide not to compile it away here, i.e. use it directly in the code and do the two phases.
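To make the two phases concrete, here is a minimal core.async sketch. `write-node!` and the `:dirty-children` lookup are stand-ins for illustration, not hitchhiker-tree's actual API; the only assumption is that starting a write returns a channel, as konserve's operations do.

```clojure
(require '[clojure.core.async :refer [go <! <!!]])

;; Stand-in: start a write and return a channel that yields when durable.
(defn write-node! [node]
  (go (println "writing" (:id node)) node))

(defn flush-tree! [root]
  (go
    ;; Phase 1: trigger every child write at once, keeping the channels.
    (let [pending (mapv write-node! (:dirty-children root []))]
      ;; Phase 2: park on each channel until all children are durable.
      (doseq [ch pending]
        (<! ch))
      ;; Only then write the root, committing the new tree atomically.
      (<! (write-node! root)))))

(<!! (flush-tree! {:id :root :dirty-children [{:id 1} {:id 2}]}))
```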

whilo 2019-11-11T07:08:45.156Z

maybe it is easier to use futures though, i would have to think about it a bit more.
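The futures variant of the same two phases, again with a hypothetical blocking write for illustration:

```clojure
;; Hypothetical synchronous write, for illustration only.
(defn write-node-sync! [node]
  (println "writing" (:id node))
  node)

(defn flush-tree-futures! [root]
  ;; Phase 1: start every child write on a pooled thread.
  (let [pending (mapv #(future (write-node-sync! %)) (:dirty-children root []))]
    ;; Phase 2: deref blocks until each write has finished.
    (run! deref pending)
    ;; Root last, as before.
    (write-node-sync! root)))
```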

csm 2019-11-11T07:11:40.158100Z

Konserve returns channels already, though, so it’s not a stretch to think about firing off writes and then collecting the results
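With konserve directly it looks much the same, since every operation returns a channel. A sketch against the in-memory store, assuming konserve's `new-mem-store` and `assoc-in`:

```clojure
(require '[konserve.memory :refer [new-mem-store]]
         '[konserve.core :as k]
         '[clojure.core.async :refer [<!!]])

(def store (<!! (new-mem-store)))

;; Fire off all writes without waiting on any of them ...
(def pending (mapv #(k/assoc-in store [(:id %)] %)
                   [{:id :a} {:id :b} {:id :c}]))

;; ... then collect the results.
(run! <!! pending)
```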

csm 2019-11-11T07:12:16.159Z

I’m surprised to hear that core.async itself is a bottleneck, though

whilo 2019-11-11T07:14:45.160500Z

it is when the hitchhiker-tree is used in memory, especially if we use the go-try variant. it seems exception handling + dispatching to the thread pool adds an overhead of roughly one order of magnitude. @mpenet had also done some experiments and came to the same conclusion.
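A crude way to see that kind of overhead (not @mpenet's experiments, just an illustrative micro-benchmark): compare a plain call with the same call round-tripped through a go block, where the state-machine allocation, channel ops and thread-pool dispatch dominate.

```clojure
(require '[clojure.core.async :refer [go <!!]])

;; Plain call: negligible cost per iteration.
(time (dotimes [i 100000] (inc i)))

;; Round-trip through a go block: each iteration pays for the state
;; machine, a channel, and a dispatch to the thread pool.
(time (dotimes [i 100000] (<!! (go (inc i)))))
```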

whilo 2019-11-11T07:16:12.161400Z

we do not use the hitchhiker-tree for in-memory databases anymore, though; we use @tonsky’s in-memory indices instead: https://github.com/tonsky/persistent-sorted-set
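For reference, the persistent-sorted-set API is small; range queries via `slice` are what an index scan needs:

```clojure
(require '[me.tonsky.persistent-sorted-set :as set])

(def s (set/sorted-set 1 2 3 4 5))

;; Inclusive range query over the sorted elements:
(set/slice s 2 4) ;; => (2 3 4)

;; A custom comparator allows datom-like tuples:
(set/sorted-set-by compare [1 :age 30] [1 :name "Ivan"])
```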

whilo 2019-11-11T07:17:39.162100Z

so for datahike it would not matter too much anymore.

whilo 2019-11-11T07:18:48.163100Z

unless we use redis, hmm.

csm 2019-11-11T07:18:57.163200Z

Yeah after writing that it occurred to me why that could be a problem

whilo 2019-11-11T07:20:17.164400Z

it is pretty cool that you could import a million datoms in half an hour without doing a lot of tweaking. i think there is still a lot of room for improvement though. the first step is to write as much as possible in parallel.

csm 2019-11-11T07:24:02.166600Z

Yeah, I was getting discouraged, but that last test went well

whilo 2019-11-11T07:24:30.167600Z

what was discouraging?

csm 2019-11-11T07:24:51.168100Z

It does sorely need a GC though— I spent a good hour cleaning up unused nodes
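A GC here amounts to mark-and-sweep over storage keys: walk the live roots, collect every reachable key, delete the rest. A sketch given the full key listing plus hypothetical `load-node`/`child-keys`/`delete-key!` helpers (none of these are real datahike/konserve functions):

```clojure
(defn reachable-keys
  "Walk from root-key, following child-keys, and return all live keys."
  [load-node child-keys root-key]
  (loop [todo (list root-key) seen #{}]
    (if-let [k (first todo)]
      (if (seen k)
        (recur (rest todo) seen)
        (recur (into (rest todo) (child-keys (load-node k)))
               (conj seen k)))
      seen)))

(defn gc!
  "Delete every stored key not reachable from any live root."
  [all-keys load-node child-keys delete-key! root-keys]
  (let [live (reduce into #{}
                     (map #(reachable-keys load-node child-keys %) root-keys))]
    (doseq [k all-keys
            :when (not (live k))]
      (delete-key! k))))
```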

whilo 2019-11-11T07:25:26.169100Z

fair point

csm 2019-11-11T07:25:35.169300Z

Oh I wasn’t getting the import to work well until that point

whilo 2019-11-11T07:32:30.169900Z

the consistent key function trick for konserve is neat.
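The trick itself isn't spelled out in the conversation; one plausible reading is deriving the storage key deterministically from the node's value, e.g. with replikativ's hasch, so equal content always lands under the same key and re-writes become idempotent:

```clojure
(require '[hasch.core :refer [uuid]])

;; hasch's `uuid` hashes an edn value into a deterministic UUID,
;; so equal values always produce the same storage key.
(uuid {:keys [1 2 3] :children [:a :b]})
;; => the same UUID on every run for this value
```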