datahike

https://datahike.io/, Join the conversation at https://discord.com/invite/kEBzMvb, history for this channel is available at https://clojurians.zulipchat.com/#narrow/stream/180378-slack-archive/topic/datahike
timo 2020-06-01T09:55:40.319400Z

Hi @ben.sless. There were some changes in the last release regarding the configuration and there should be no reload-config. The mention of it in the api is a leftover. We will fix this asap. The database you are using for datahike should exist in Postgres and your postgres-user should also not have the permission to create or delete databases anyway. If you need any help, let me know what your current problem is and I'll try to help.

Ben Sless 2020-06-01T10:01:31.319500Z

Thanks timo, I managed to get everything working shortly after making the post, but thank you for taking the time. I'd say the biggest hurdles are the clarity of the documentation and reload-config. If you'd like, I can try to pinpoint all the parts I stumbled over and make suggestions. All in all, datahike is pretty cool and I want to see it succeed 🙂

timo 2020-06-01T10:03:07.319700Z

πŸ‘we are very much interested in your hurdles and suggestions

Ben Sless 2020-06-01T10:11:28.319900Z

Alright, I'll try to compile a list of all the pain points

2020-06-01T10:32:50.320500Z

I am continuing to explore Datahike. Very excited about this project. I have a few questions:
1. Does Datahike maintain a live in-memory index, so as to avoid flushing to storage on every transaction, or does it flush to storage on every transaction?
2. Does Datahike do any storage GC to delete outdated indexes?
3. Datomic Cloud strings are limited to 4096 characters. Any limitations in Datahike? Any recommendations on string size to keep queries performant?

2020-06-03T13:09:12.349100Z

For the volatile in-memory backend, does the JVM GC take care of outdated indexes?

2020-06-03T13:09:48.349300Z

@konrad.kuehne

whilo 2020-06-04T12:51:43.351400Z

I think it depends on whether you keep the history or not. If you don't, it should clean up the outdated indices.

πŸ‘ 1
2020-06-04T13:23:38.351800Z

Thanks

Ben Sless 2020-06-01T12:00:29.320700Z

@timok another quick win "bug" in the docs is any reference to :dbname in the config map, when it should in fact be :path. A more subtle one is that the path must start with a /.

Ben Sless 2020-06-01T14:46:05.324800Z

Transaction question: I have in my schema entities with scores which can be modified by users (other entities) in increments of +1/-1 (upvote/downvote). I chose to represent it as a score integer plus a set of refs to the entities which voted +1 and -1. Is this a good model? What's the idiomatic way to go about it? I guess I have to work with db.fn/call? My thinking was, for upvote (without loss of generality):
• find user in downvotes
• if found, increment score by 2, remove from downvotes, add to upvotes
• if not found, increment score by 1, add to upvotes
Does this make sense?
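[Editor's note] The upvote steps above could be sketched as a transaction function invoked via :db.fn/call. The attribute names (:post/score, :post/upvotes, :post/downvotes) and the d alias for datahike.api are assumptions, not confirmed by the thread, and error handling is omitted:

```clojure
(require '[datahike.api :as d])

;; Sketch of the upvote branch described above; attribute names are
;; hypothetical. Returns tx-data, as transaction functions must.
(defn upvote [db post-id user-id]
  (let [{score :post/score downvotes :post/downvotes}
        (d/pull db [:post/score {:post/downvotes [:db/id]}] post-id)
        downvoted? (some #(= user-id (:db/id %)) downvotes)]
    (if downvoted?
      ;; user previously downvoted: net change is +2, move the ref over
      [[:db/add post-id :post/score (+ score 2)]
       [:db/retract post-id :post/downvotes user-id]
       [:db/add post-id :post/upvotes user-id]]
      ;; fresh vote: +1
      [[:db/add post-id :post/score (+ score 1)]
       [:db/add post-id :post/upvotes user-id]])))

;; invoked from a transaction:
;; (d/transact conn [[:db.fn/call upvote post-id user-id]])
```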

Ben Sless 2020-06-01T15:43:20.325500Z

Also, how would I avoid duplication? Upvote and downvote look almost the same, and the two branches within each also look pretty similar.

Ben Sless 2020-06-01T15:44:02.325600Z

thoughts?

Ben Sless 2020-06-01T17:21:46.328200Z

@timok another really weird behavior I just came across: all my tests work in the REPL, but all of them throw NPEs when I run them with lein test. They throw at this part:

Exception in thread "async-dispatch-6" java.lang.NullPointerException
        at clojure.core$deref_future.invokeStatic(core.clj:2300)
        at clojure.core$deref.invokeStatic(core.clj:2320)
        at clojure.core$deref.invoke(core.clj:2306)
        at konserve.core$get_lock.invokeStatic(core.cljc:29)
        at konserve.core$get_lock.invoke(core.cljc:28)
        at konserve.core$assoc_in$fn__16363$state_machine__13314__auto____16370$fn__16373.invoke(core.clj 

kkuehne 2020-06-01T18:06:46.328400Z

Happy to hear that you're excited.
1. Datahike supports a volatile in-memory backend. We flush to the index, and thus to the backend, with each transaction.
2. GC is being worked on for the next version of the storage protocol, konserve.
3. We haven't thought about limitations for strings yet. Thanks for that, we will add it to our benchmark suite.

πŸ‘ 1
timo 2020-06-01T18:17:56.328600Z

thanks @ben.sless I don't really know what the second one is about :thinking_face: The other issues I appended to an already-open issue about the documentation in the api namespace. We will get to it soon.

Ben Sless 2020-06-01T18:20:55.328800Z

Maybe it has to do with how the db is initialized. I created this utility function

(defn- empty-db
  []
  (-> (dc/empty-db) (dc/db-with s/schema) dc/conn-from-db))

Ben Sless 2020-06-01T18:21:14.329Z

is there a chance the connection is nil during the tests?

Ben Sless 2020-06-01T18:34:04.329200Z

Update with regards to the NPE, tried a different approach:

(def cfg {:backend :file :path "/tmp/hnc"})

(defn- empty-db
  []
  (sut/connect cfg))

;;; 
(defn connect
  [cfg]
  (d/delete-database cfg)
  (d/create-database cfg :initial-tx s/schema)
  (d/connect cfg))
It works, so in the meantime I'll go with this, but the above issue might need to be investigated.

2020-06-01T18:55:49.329400Z

I may be wrong about this, but I thought I remembered a discussion with @whilo to the effect that, at least in theory, the underlying hitchhiker tree could be flushed on a different basis. But @konrad.kuehne can probably speak to that better than I can.

kkuehne 2020-06-01T18:56:50.329600Z

Sure, it can be flushed differently, but at the moment Datahike only flushes it at tx time.

πŸ‘ 2
2020-06-01T19:03:53.332100Z

@ben.sless I would probably just store votes as {:vote/user ... :vote/post ... :vote/score ...} and compute the total :post/score using an aggregate (sum) query.
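[Editor's note] The vote-per-entity model suggested above could be sketched like this. The schema attributes follow the {:vote/user ... :vote/post ... :vote/score ...} shape from the message; everything else (the d alias, the exact query) is illustrative:

```clojure
(require '[datahike.api :as d])

;; One entity per vote: user, post, and a score of +1 or -1.
(def vote-schema
  [{:db/ident :vote/user  :db/valueType :db.type/ref  :db/cardinality :db.cardinality/one}
   {:db/ident :vote/post  :db/valueType :db.type/ref  :db/cardinality :db.cardinality/one}
   {:db/ident :vote/score :db/valueType :db.type/long :db/cardinality :db.cardinality/one}])

;; Total score of one post as an aggregate (sum) query.
;; :with ?v keeps duplicate score values from distinct votes, which
;; set-based Datalog semantics would otherwise collapse before summing.
(defn post-score [db post-id]
  (d/q '[:find (sum ?s) .
         :with ?v
         :in $ ?post
         :where [?v :vote/post ?post]
                [?v :vote/score ?s]]
       db post-id))
```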

2020-06-02T23:53:14.348700Z

You could use a transaction function for that

Ben Sless 2020-06-03T06:40:30.348900Z

That's what I ended up doing in the end, thanks πŸ™‚

2020-06-01T19:05:10.333400Z

Ensuring unique votes per post+user is a little bit more work, and this is where Datomic's composite tuples as unique/identity attributes come in handy.

2020-06-01T19:05:27.333700Z

But you could use tx fns to ensure that

2020-06-01T19:05:48.334600Z

A big problem with :post/downvotes and :post/upvotes as you have it here is that you're going to blow up your indices

Ben Sless 2020-06-01T19:05:48.334700Z

Which is what my ungainly solution uses

Ben Sless 2020-06-01T19:05:56.334900Z

ah

2020-06-01T19:06:20.335400Z

Yeah, but it would be a different tx function with the data modelling approach I'm describing

Ben Sless 2020-06-01T19:06:59.336700Z

I'm not sure I'll make the switch because this is just an exercise, but thanks for the advice, I didn't think about storing the votes as their own entities

πŸ‘ 1
2020-06-01T19:07:11.337Z

The index problem is that you'd end up with very big blocks in your primary EAV index, killing performance on any queries that end up needing to scan over those entities (potentially for other attributes).

2020-06-01T19:07:46.337800Z

@ben.sless Gotcha

Ben Sless 2020-06-01T19:07:48.337900Z

Do I have to keep an index for :post/upvotes?

2020-06-01T19:07:57.338300Z

No

2020-06-01T19:08:14.338700Z

Manual indexing is just for performance on reverse lookups
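[Editor's note] In the datascript/Datomic-style schema datahike uses, that opt-in indexing is the :db/index flag, which adds the attribute to the AVET index for fast value-to-entity lookups. A sketch (the attribute itself is just an example from this discussion):

```clojure
;; Opt-in value index: useful when you frequently look entities up
;; by this attribute's value (a reverse lookup), otherwise skippable.
{:db/ident       :post/score
 :db/valueType   :db.type/long
 :db/cardinality :db.cardinality/one
 :db/index       true}
```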

Ben Sless 2020-06-01T19:10:14.339600Z

Thanks. Your suggestion makes more and more sense; I might make the switch now just so it stops bugging me

Ben Sless 2020-06-01T19:12:09.340300Z

I'm also having trouble with the aggregate functions, as I now want to model a lookup of the top N rated posts
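[Editor's note] One way to sketch that top-N lookup over the vote-per-entity model discussed above. Names are assumptions carried over from the earlier sketch; note that Datalog has no ORDER BY or LIMIT, so the sorting and truncation happen in ordinary Clojure around the query:

```clojure
(require '[datahike.api :as d])

;; Sum each post's votes in the query, then sort and take the top n
;; outside it, since Datalog offers no ORDER BY / LIMIT.
(defn top-posts [db n]
  (->> (d/q '[:find ?post (sum ?s)
              :with ?v
              :where [?v :vote/post ?post]
                     [?v :vote/score ?s]]
            db)
       (sort-by second >)
       (take n)))
```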

BjΓΆrn Ebbinghaus 2020-06-01T20:55:38.343200Z

When does flushing happen with transact!? I am using a file backend. I am creating a database and then I transact! a schema. So far so good; the file in store/data grows to 2.6 kB. After that I do more transact!s and the connection gets updated, but not the file. Any idea?