possibly redundant (if you've seen borkdude's recent poc based on the grammar) but fwiw, i've uploaded recent work. details of changes:
* the grammar has been simplified a bit and returned to a single grammar as a result of further testing and helpful discussion with pez (ty!)
* i've tested building and running on linux and macos
* docs have instructions on building and running on macos as well now
* dependencies have been updated
i'm currently working on improving the testing -- having the tests has been helpful in catching unexpected consequences of changing the grammar (which unfortunately appears to be a very common thing when working with tree-sitter).

one difficult part about the testing has been that straight-forward testing would likely lead to embedding the specifics of the grammar (e.g. node names) into the tests. i have held off on this so far (after having tried it for a while) and looked to figure out other ways to test (as details of use still remain unclear). a few of the adopted methods include:

1. use generators from test.check to check tokenization without specifically mentioning node names. also use some of the collection generators and count the number of children. this approach allows one to test (at least tokenization of) different clojure grammars without baking in grammar details into the tests.
2. parse a lot of different input looking for errors. i have close to 150,000 .clj, .cljs, and .cljc files that i feed to a grammar to look for errors. this does turn up some problems with the grammar, but it isn't as good for assessing certain properties of the grammar as having specific expected results. the samples were obtained from github, but the method used was such that this is not so reproducible by others. so i'm working on something that fetches from clojars instead.

running the samples-based tests currently takes more than an hour though so i'm working on parallelization as well as not forking a process for each file tested.
Wonderful. You are doing a super great job there!
thanks for taking a look and the ongoing discussion -- very much appreciated!
so i got `xargs` to work for testing across ~100,000 files.
before the tests used to take about 80 minutes.
once an initial `find` command has been run across all of the files (so i presume some caching is going on) it only takes 47 seconds(!) on subsequent runs (including subsequent invocations of `find`).
one caveat is that there is at least one file that has a single quote in its name -- so for `find` using `-print0` and for `xargs` using `-0` seems to be effective in working around this issue.
using the `-P` option for `xargs` further improved things.
now the whole thing takes 9 seconds.
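for anyone wanting to try something similar, here is a minimal sketch of that kind of pipeline. it is not the actual test script: the sample directory, the three tiny files, the worker count of 4, the batch size of 50, and `wc -c` (a stand-in for whatever command really parses a file and reports errors) are all assumptions made up for illustration.

```shell
# stand-in sample tree; the real corpus location is not given in this thread
samples=$(mktemp -d)
printf '(ns a)\n' > "$samples/a.clj"
printf '(ns b)\n' > "$samples/b.cljs"
printf '(ns c)\n' > "$samples/it's.cljc"   # note the single quote in the name

# -print0 / -0: file names are passed NUL-delimited, so quotes and spaces
#               in names are not interpreted by xargs
# -P 4:         run up to four workers in parallel (assumed count)
# -n 50:        give each worker a batch of files instead of one process per file
# wc -c:        placeholder for the real parse-and-check command
find "$samples" -type f \
     \( -name '*.clj' -o -name '*.cljs' -o -name '*.cljc' \) -print0 \
  | xargs -0 -P 4 -n 50 wc -c > "$samples/out.txt"
```

batching with `-n` is one way to avoid forking a process per file; the right batch size and worker count likely depend on the machine and on how expensive each parse is.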