tree-sitter

tree-sitter fun
2020-03-30T12:54:59.012300Z

over the last few days i worked on using the grammar to locate def-like forms within a clojure file. with the grammar the way it was, this turned out to be difficult. the main contributing factor was that the tree from tree-sitter contained pieces that were not quite atomic (e.g. you get back ^, single quote, etc. as individual nodes, detached from the forms they would normally be considered part of).

initially i wrote a parser to parse the parse tree (inspired by the github semantic folks). i emulated the approach taken by rewrite-clj and this worked out. the resulting secondary tree was much easier to work with and, perhaps just as importantly, i improved my understanding of various clojurish bits. i also got to see a cost of doing it that way: going from a file to the secondary parse tree (for clojure.core) took about 270 ms. estimating that tree-sitter's own tree construction takes 20 ms to 50 ms (depending on whether one uses rust / c vs wasm), the secondary pass possibly costs more than 200 ms on its own.

motivated by my improved mental model and the performance numbers, i set out to rework the grammar so that it does not yield any non-atomic bits, and i think i may have succeeded. i've run the > 100,000 sample files through the new grammar about 10 times (alternating test runs with adjustments to the grammar). the result is a lower detected error count than when i started (fewer than 80 out of 140,000 or so files), and the performance of the grammar hasn't suffered too much: it's now in the mid-to-high 20 ms range when run via rust.
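to make the "non-atomic pieces" problem concrete, here is a minimal python sketch of the kind of secondary pass described above. it is not the actual code and it does not use tree-sitter's api: the `Node` class, the node kind names (`meta_marker`, `quote_marker`, etc.) and the "head symbol starts with `def`" rule are all assumptions made up for illustration. the idea is just that a prefix marker gets folded together with the form(s) it applies to, after which finding the name in a def-like form is straightforward.

```python
from dataclasses import dataclass, field


# hypothetical node shape for illustration; not tree-sitter's real node type
@dataclass
class Node:
    kind: str                      # e.g. "symbol", "list", "meta_marker"
    text: str = ""
    children: list["Node"] = field(default_factory=list)


# assumed marker kinds and how many following forms each one absorbs:
# 'x -> quote takes 1 form; ^:private x -> meta takes 2 (metadata, target)
PREFIX_ARITY = {"quote_marker": 1, "meta_marker": 2}


def read_form(nodes, i):
    """read one complete form starting at nodes[i]; return (form, next_index)."""
    node = nodes[i]
    arity = PREFIX_ARITY.get(node.kind)
    if arity is None:
        # ordinary node: keep it, but fold its children recursively
        return Node(node.kind, node.text, fold(node.children)), i + 1
    parts = []
    i += 1
    for _ in range(arity):
        if i >= len(nodes):
            raise ValueError(f"dangling {node.kind!r}")
        form, i = read_form(nodes, i)
        parts.append(form)
    # "meta_marker" becomes an atomic "meta" node wrapping its parts
    return Node(node.kind.replace("_marker", ""), node.text, parts), i


def fold(nodes):
    """return a sibling list where each marker has absorbed its forms."""
    out, i = [], 0
    while i < len(nodes):
        form, i = read_form(nodes, i)
        out.append(form)
    return out


def strip_meta(node):
    """unwrap meta nodes; the target is the last child of a meta node."""
    while node.kind == "meta":
        node = node.children[-1]
    return node


def def_names(top_level):
    """collect names bound by def-like forms (assumed rule: head starts with 'def')."""
    names = []
    for form in fold(top_level):
        form = strip_meta(form)
        if form.kind != "list" or len(form.children) < 2:
            continue
        head, name = form.children[0], strip_meta(form.children[1])
        if head.kind == "symbol" and head.text.startswith("def") and name.kind == "symbol":
            names.append(name.text)
    return names


# flat tree for: (def ^:private x 1)
flat = [Node("list", children=[
    Node("symbol", "def"),
    Node("meta_marker", "^"),
    Node("keyword", ":private"),
    Node("symbol", "x"),
    Node("number", "1"),
])]
print(def_names(flat))  # ['x']
```

without the folding step, the second child of the list is the bare `^` marker rather than the name, which is what made the original tree awkward to query. reworking the grammar so that it emits the atomic node directly removes the need for this extra pass.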