@quoll This is an interesting benchmark. Do you have a reproducible version somewhere? We can compare the numbers with the hitchhiker-tree then. Was this with or without persisting to a durable medium?
I’ve been messing with it since 🙂 And there’s a bug that I’m tracking down, though I don’t think it shows up in the version that led to that benchmark.
The string pool does not store short strings. Instead, it encodes them in a negative number. Longer strings are stored.
A lot of the document was short strings, so it’s hard to say what the ratio of encoding into numbers vs storing was. To figure that out, I told it to store everything. And it turns out that there’s a bug when the stored string is shorter than the size of the data in a tree node. I’m debugging this now. 🙂
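To make the "short strings become a negative number" idea concrete, here is a minimal sketch. It is not Asami's actual layout: the names `encode-short` and `decode-short`, the 7-byte limit, and the bit positions are all assumptions chosen for illustration. The sign bit marks the value as an inline string, three bits hold the byte length, and the low 56 bits hold the UTF-8 bytes.

```clojure
;; Hypothetical layout (an assumption, not Asami's real encoding):
;;   bit 63      = 1, so the long is negative and marks "inline string"
;;   bits 56-58  = byte length (0-7)
;;   bits 0-55   = up to 7 UTF-8 bytes, first byte in the lowest 8 bits

(defn encode-short
  "Returns a negative long holding s, or nil if s needs more than 7 UTF-8 bytes."
  [^String s]
  (let [bs (.getBytes s "UTF-8")
        n  (alength bs)]
    (when (<= n 7)
      (reduce (fn [acc i]
                ;; mask to 0-255 first, since JVM bytes are signed
                (bit-or acc (bit-shift-left (bit-and (long (aget bs i)) 0xFF)
                                            (* 8 i))))
              (bit-or Long/MIN_VALUE (bit-shift-left (long n) 56))
              (range n)))))

(defn decode-short
  "Rebuilds the string from a long produced by encode-short."
  [^long id]
  (let [n  (bit-and (unsigned-bit-shift-right id 56) 0x7)
        bs (byte-array n)]
    (dotimes [i n]
      (aset-byte bs i (unchecked-byte
                        (bit-and (unsigned-bit-shift-right id (* 8 i)) 0xFF))))
    (String. ^bytes bs "UTF-8")))
```

With this sketch, `(encode-short "Darcy")` gives a negative long that `decode-short` turns back into `"Darcy"`, while `(encode-short "Pemberley")` returns nil because it needs 9 bytes, so a real pool would have to store it instead.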
So it was a bit premature to be talking about that above, but I was really excited that it worked!
It’s possible to reproduce what I was talking about, though it’s tricky…
• checkout the “storage” branch
• checkout the working commit: 0b6031f
• execute the code in the attached thread…
(require '[clojure.string :as s])
(require '[asami.durable.common :refer :all])
(require '[asami.durable.pool :refer [create-pool]])
;; read a book
(def book (slurp "resources/pride_and_prejudice.txt"))
(def words (s/split book #" "))
;; create a data pool
(def pool (create-pool "book"))
;; load words into the pool, accumulating the numbers the pool assigns to the words
(def coded-pool (time (reduce (fn [[ids p] w]
                                (let [[id p'] (write! p w)]
                                  [(conj ids id) p']))
                              [[] pool]
                              words)))
(def coded (first coded-pool))
(def bpool (second coded-pool))
;; transactions are handled outside of the pool, so get transaction point for later
(def root (:root-id bpool))
;; coded now contains numbers for every word in the document
(count coded)
(take 10 coded)
;; ask the data pool for the data associated with each number
(def output-words (map #(find-object bpool %) coded))
(time (count output-words))
;; 3.1s 3.4s
(= words output-words)
;; truncate the files to just what is in use
(close bpool)
;; come back and reopen the file. Use the transaction point from above
;; root was 68 in the original run
(def pool2 (create-pool "book" root))
;; ask the data pool again for the data associated with each number in coded
(def output-words2 (map #(find-object pool2 %) coded))
;; is it all still available?
(count output-words2)
(= words output-words2)
The file pride_and_prejudice.txt is at:
<https://www.gutenberg.org/files/1342/1342-0.txt>
Either that, or give me a day or two to get it back into shape.
The reason for this work is that the tree nodes are bigger than they seem to require. It’s not slowing anything down, but it’s using more space.