@aengelberg many thanks for the confirmation! of course, it stems from the difference between what regular expression language is and what grammars are, not from how instaparse works doesn't it 🙂 Would you agree to this?
You mean regexes’ greediness? I would say that’s an artifact of regexes’ supported use case not matching what Instaparse needs it to do. If regexes had a way to lazily emit multiple parse results of the same string, we could maybe use that to make regex non-terminals behave more intuitively.
e.g. if I match #"^a+"
on "aaa"
I’d like to see "a", "aa", "aaa"
Yes, well, at a higher level, regular expressions and grammars generate disparate automatons, whereas what you mention is one difference in the detail thereof (the most relevant one). This relates to two area I'm lately looking at ― fuzzy parsing and scannerless parsing
As to regex, have you ever used something like this? https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches
Oh, not really, ignore that last one
That’s returning one match for each of a variety of starting indexes, which is not quite what I want
Yes, again, ignore that
You could theoretically do (for [i (range (count s))] (re-match re (subs s 0 i)))
but that is super slow and might not actually work depending on the regex
the automaton structure of regexes is simply not designed to (efficiently) reason about the set of all possible matches for one string, using backtracking and laziness
actually maybe if you compile a NDFA
then run it through one character at a time
and mark every time you are in a success state
The interaction of a grammar with the need to account for syntactic categories (a.k.a variables, and/or non-terminals) that translate to non-finite sets of productions is very interesting. Right now, we typically "escape" to regular expressions for that. Of course, this goes beyond the scope of my original question with its particular silly use case.
In this case, the set of productions is not infinite; for the current index i
, all I really need to know is, the set of indexes j \in [0,N)
such that s[i:j]
is a complete parse according to the regex.
Right, obviously, when looking at a specific input text (or string).
But in the general sense, I also meant to say that we use regular expressions for where we want to stipulate a non-finite set, whereas a plain CFG doesn't provide (IIRC) that kind of support, which is why we use regex along with it. I have to refresh and brush up more on formal languages though, maybe what I just said is totally incorrect, or just sufficiently inaccurate to be incorrect.