@aengelberg I have a parser that I'm not sure how else to improve: https://github.com/naddeoa/elm-clojurescript-hello-parser/blob/master/src/elm_toolkit/parser.cljs
I've eliminated a ton of ambiguity but I'm still seeing that its performance scales poorly with the size of the input, even when there are only 1 or two possible parses
but that same input when fed chunk by chunk adds up to a much more reasonable parse time
I was reading about the :optimize :memory
option, but that seems to make it worse
I even tried it with Java instead of Node to see if it was just a JS thing and the results were still similar
My next step was about to be to write a function that breaks the input up into chunks, but to do that I'll need to know where a chunk starts/stops or impose some sort of convention around the input
have you seen https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md? I feel like I have seen splitting input in to chunks recommended for larger inputs to instaparse
Yeah that's where I got it from
It just seems like something I shouldn't have to do
If only because I need my parser to determine when a readable chunk starts/stops
it isn't a line by line thing
I might try and eliminate regexes in the grammar definition, if I recall correctly those can be (or maybe they were?) inefficient
Are there any examples you know of large grammars that scale linearly with the size of the input?
Before I go down a rabbit hole I want to make sure I know what to expect
no, I don't
I don't know of any large grammars, and I have only used smallish grammars on small inputs
I'm just afraid that the performance actually won't get better as hard as I may try and I want to cut my losses if I can.
Or just use this for things that only require parsing snippets
Thanks for the advice though @hiredman
it would be neat if instaparse could emit warnings and suggest fixes if your grammar is not LL(1)
At this point, it seems like what I really want is a mode where it parses in chunks itself. My grammar is really just a repetition of 4 possible blocks. It seems like it could just parse the first block independent of the second. That is to say, as soon as it matches just consider it a block and move on.
Given the right level of ambiguity I would be ok with massaging the results
How large are your problematic inputs?
Well, it scales poorly. There isn't a size in specific. It parses the Elm programming langauge. When testing, I start with a single function, then I just keep duplicating that function and observe the performance
each additional function adds more than its fair share of time
I'll paste a snippet of something that takes too long
(def input "nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
nextSource : Source a -> Source a
nextSource (Source a next) =
Source (next a) next
")
That one I actually just had to kill
but if you cut it in half it only takes about a second or so to parse
And its a pretty reasonable size for an Elm source file
Ouch
I'll try to take a look soon
Cool, thanks a lot
Also, definitely pretty new to clojure still. If I can make the code more approachable/replable feel free to point that out
I kind of just hacked everything together
@anthony.naddeo I fixed it before:
start = (<ws>* block <ws>*)+
after:
start = (<ws> block <ws>)+
That? Let me try now
the large input now takes 100ms on my machine
100ms? I just tried it and it is WAY better, but I wish I had those numbers
are you just in a node repl?
I'm down to 900ms now
which is great
compared to never stopping
I wouldn't have thought to do that
sorry I'm in JVM, not JS
900ms makes sense for node, instaparse in cljs is known to be ~ 10x slower
oh interesting. Did you use my parser.clj file?
yeah this is way better
Thanks a ton
I just copied the grammar string from parser.cljs
into my clojure file since the EBNF syntax is the same on both platforms
np
One odd thing though
I would expect that the *
impacts performance because it adds ambiguity right?
If so, shouldn't there be more than a single parse for it? (time (count (insta/parses parser/parser input :unhide :all)))
returned 1 unless I did something wrong
ambiguity can mean multiple results, but it can also mean a single result that required checking lots of different cases to arrive at
What metrics could people use to determine ambiguity if not parse counts?
#6 in that peformance doc suggests looking at rules individually to see if you have rules that in isolation can result in multiple parses
Instaparse de-duplicates results, so some internal ambiguity can lurk around but still impact the perf
interesting
well, I'm glad that was an easy fix, thanks guys
I don't think I can go back to other parsers. I would have been heart broken
btw @anthony.naddeo, I made the following optimization to your Name
and name
which halved the time for me:
Name = #'(?!\\b(if|then|else|in|let|case|of)\\b)[A-Z][a-zA-Z0-9]*'
name = #'(?!\\b(if|then|else|in|let|case|of|type)\\b)[a-z][a-zA-Z0-9]*'
Actually the first \b doesn't really do anything, because the first character of this particular token is the first char in the string from the regex's perspective
But my optimization was to shift more responsibility into the regex; that is always ideal