datahike

https://datahike.io/, Join the conversation at https://discord.com/invite/kEBzMvb, history for this channel is available at https://clojurians.zulipchat.com/#narrow/stream/180378-slack-archive/topic/datahike
csm 2019-11-05T01:12:08.104800Z

following on https://github.com/csm/datahike-s3/issues/1, this here is my handle on slack

🎉 1
csm 2019-11-05T01:15:48.107400Z

when I last left that project, I was experimenting with importing the mbrainz example dataset into datahike on a local, mocked s3 store, but the import was glacial. I didn’t give it much thought, not even enough to figure out whether the bottleneck was the local mocked s3 store or something else

whilo 2019-11-05T07:28:49.108200Z

i see, could you fire up a profiler to see where it is slow? just to get a sense of whether s3 is the problem or something else

csm 2019-11-05T18:38:28.118100Z

my guess right now is that S4 itself (not a typo; s4 is a local mocked s3) gets slower as you add more objects to it. I’m going to run a new test against an actual S3 bucket and see how that goes. I actually had this partially set up in EC2 but got distracted by other things.

whilo 2019-11-05T20:03:54.118500Z

i see. that would be very interesting. let me know if you need help.

csm 2019-11-05T21:07:46.120300Z

I’m running some transactions against a real S3/DDB store and it’s… much slower. It started out at a couple of seconds per transaction, but now it’s up to 300+ seconds per transaction 😞

whilo 2019-11-05T07:30:51.108800Z

i have a redis backend with konserve-carmine working

whilo 2019-11-05T07:34:06.111600Z

by virtue of redis, this would be the first distributed variant of datahike besides our experiment with dat replication. an explicit deref on the local machine is still necessary to get the current db root. we have no code for that yet, but it is easy to do (just query redis whenever you deref the root).
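A minimal sketch of that "query redis on deref" idea, using carmine directly; the connection map and the `datahike:root:<db-name>` key layout are assumptions for illustration, not datahike or konserve-carmine API:

```clojure
;; Minimal sketch: read the current db root from redis on every deref.
;; The key layout ("datahike:root:<db-name>") is hypothetical.
(require '[taoensso.carmine :as car :refer [wcar]])

(def redis-conn {:pool {} :spec {:uri "redis://localhost:6379"}})

(defn current-db-root
  "Fetch the latest db root pointer from redis, so a deref always
  reflects the most recent commit instead of a locally cached root."
  [db-name]
  (wcar redis-conn (car/get (str "datahike:root:" db-name))))

;; e.g. (current-db-root "mbrainz") would run whenever the connection
;; is dereffed on a reading peer.
```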

whilo 2019-11-05T07:39:06.112200Z

if somebody is interested in couchdb, then this is easy to add as well. but i am not sure how popular it is nowadays

whilo 2019-11-05T07:40:09.113100Z

i would also suggest replacing leveldb with rocksdb, using @bandarra's nice work on https://github.com/purrgrammer/konserve-rocksdb

whilo 2019-11-05T07:40:17.113500Z

any reasons for keeping level db around?

kkuehne 2019-11-05T11:52:11.113800Z

what's the difference between the two?

rschmukler 2019-11-05T16:58:45.114500Z

Just chiming in - they're very similar - they both use an LSM tree as their underlying data structure. RocksDB exposes a lot more performance optimizations that you can take advantage of by writing custom code targeting it. E.g. RocksDB lets you set up prefix bloom filters, which could potentially be useful for some of the indexes. RocksDB originally began as a fork of LevelDB. https://github.com/facebook/rocksdb/wiki/Features-Not-in-LevelDB
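For illustration, a rough sketch of what enabling a prefix bloom filter looks like through RocksDB's Java API from Clojure; method names vary a bit between RocksJava versions, and the 8-byte prefix length and example key are arbitrary assumptions:

```clojure
;; Rough sketch of a prefix bloom filter via RocksJava interop.
;; Assumes the org.rocksdb artifact is on the classpath.
(import '[org.rocksdb RocksDB Options BlockBasedTableConfig BloomFilter])

(RocksDB/loadLibrary)

(def opts
  (doto (Options.)
    (.setCreateIfMissing true)
    ;; treat the first 8 bytes of every key as its prefix
    (.useFixedLengthPrefixExtractor 8)
    ;; attach a bloom filter so prefix seeks can skip whole SST files
    (.setTableFormatConfig
     (doto (BlockBasedTableConfig.)
       (.setFilterPolicy (BloomFilter. 10))))))

(with-open [db (RocksDB/open opts "/tmp/rocksdb-example")]
  ;; arbitrary example key; a real index key layout would differ
  (.put db (.getBytes "eavt|e1|a1") (.getBytes "v1")))
```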

whilo 2019-11-05T20:07:46.119200Z

i see. yes, i read something like that some time ago. i am not sure whether we will be able to use these rocksdb features, but it is still supposed to be faster, i think.

rschmukler 2019-11-05T17:05:51.117500Z

Also, while we're wishing for alternative backends - I highly recommend checking out https://github.com/dgraph-io/badger - which is used by DGraph. The major advantage it offers is that it can do key-only iteration, meaning that when you iterate it doesn't load the values stored outside the tree. I always saw this as a big advantage for the EAVT etc. indexes. It's written in golang so it'd be a fair amount of work to write the bindings etc

whilo 2019-11-08T10:21:56.125600Z

konserve's list-keys function also does that, at least for our own filestore implementation. the problem is that the indices are stored in the hitchhiker-tree and the query engine benefits from scanning the datoms directly, so keys and values are not split at the moment. if you need to store large values, you can point to an external medium as well, though that will make joining with them slower.