hello people, I'm stuck with a instaparse rule, maybe someone here can help me out 🙂
I'm writing a parser for Javascript regexes
this is the current grammar:
Regex = <'/'> Alternation <'/'> MatchFlag*
Alternation = Concatenation (<'|'> Concatenation)*
Concatenation = SuffixedExpr*
SuffixedExpr = SingleExpr Suffix?
SingleExpr = BaseExpr | ParenthesizedExpr
ParenthesizedExpr = <'('> GroupFlags? Alternation <')'>
Suffix = (Optional | Positive | NonNegative | CurlyRepetition) Quantifier?
Optional = <'?'>
Positive = <'+'>
NonNegative = <'*'>
CurlyRepetition = <'{'> #"\d+" (<','> #"\d+" ?) ? <'}'>
Quantifier = '?' | '+'
BaseExpr = CharExpr | LiteralChar | Anchor | BackReference
Anchor = '^' | '$' | '\\' #"[bB]"
LiteralChar = PlainChar | EscapedChar
BackReference = <'\\'> #"[1-9][0-9]*"
PlainChar = #"[^.|\\+*$^\[(){?]"
CharExpr = Dot | LiteralChar | BCC
Dot = '.'
BCC = <'['> BCCUnionLeft? <']'>
BCCUnionLeft = BCCNegation? BCCElemBase*
BCCNegation = '^'
BCCElemBase = BCCCharNonRange | SpecialCharClass | BCCRange | BCC
BCCRangeRightable = BCCCharEndRange | SpecialCharClass
BCCRange = BCCChar <'-'> BCCCharEndRange
BCCRangeWithBracket = <']-'> BCCCharEndRange
BCCCharNonRange = BCCChar !('-' BCCRangeRightable)
BCCChar = BCCPlainChar | EscapedChar
BCCCharEndRange = BCCPlainChar | EscapedChar
BCCPlainChar = #"[^\]\[\\]" | '\\b'
EscapedChar = SpecialCharClass | NormalSlashedCharacters | ControlChar | HexChar | BasicEscapedChar
HexChar = ShortHexChar | MediumHexChar | LongHexChar | VeryLongHexChar
ShortHexChar = <'\\x'> #'[0-9a-fA-F]{2}'
MediumHexChar = <'\\u'> #'[0-9a-fA-F]{4}'
LongHexChar = <'\\x{'> #'[0-9a-fA-F]{4}' <'}'>
VeryLongHexChar = <'\\x{'> #'[0-9a-fA-F]{6}' <'}'>
BasicEscapedChar = <'\\'> #"[\s\S]"
SpecialCharClass = <'\\'> #"[dDwWsSv0]"
NormalSlashedCharacters = #"\\[tnrf]"
ControlChar = <'\\c'> #"[A-Z]"
(** FLAGS **)
GroupFlags = NonCapturingMatchFlags
| PositiveLookAheadFlag
| NegativeLookAheadFlag
NonCapturingMatchFlags = <'?'> !')' <':'>
PositiveLookAheadFlag = <'?='>
NegativeLookAheadFlag = <'?!'>
MatchFlag = #"[gimuy]"
I would like it to parse {
, as a plain char, currently on the PlainChar
definition this char is excluded to allow for the CurlyRepetition
I'm probably missing something
but if I allow the {
at PlainChar
then when I try to do: a{2}
I was expect it to go into SuffixedExpr
and match a PlainChar
followed by a Suffix
that would be a CurlyRepetition
but instead seems like it's not matching the suffix, and instead matches a series of plain chars
sorry the noise, I got a much simplified version now, by this grammar:
Concatenation = SuffixedExpr*
SuffixedExpr = LiteralChar CurlyRepetition?
CurlyRepetition = <'{'> #"\d+" <'}'>
LiteralChar = #"."
I expected it to match "a{2}"
as a LiteralChar
followed by a CurlyRepetition
, but instead it matches as [:Concatenation [:SuffixedExpr [:LiteralChar "a"]] [:SuffixedExpr [:LiteralChar "{"]] [:SuffixedExpr [:LiteralChar "2"]] [:SuffixedExpr [:LiteralChar "}"]]]
how can make it try force match the CurlyRepetition
before stepping a level up and matching more literal chars?
@wilkerlucio maybe ordered choice /
in instaparse might help here?
so you want a{2
to parse as 3 PlainChars, but a{2}
to parse as a PlainChar followed by a CurlyRepetition?
@aengelberg yes, with some help I was able to figure it out, here is a way to handle it:
Concatenation = SuffixedExpr*
SuffixedExpr = LiteralChar CurlyRepetition? / AnyLiteralChar
CurlyRepetition = <'{'> #"\d+" <'}'>
LiteralChar = #"[^{]"
AnyLiteralChar = #"."
I had to restrict the first one a little and have a second more permissive rule, now it parses the way I was expecting 🙂
I suggest you take away the ?
after CurlyRepetition
, to make the grammar unambiguous (improving performance).
@aengelberg but it is needed there, because the curlyrepetition is optional
ah, I think I got you said, it would match anyway
in my real case its a bit more complicated
and the latest AnyLiteral actually just matches the {
, otherwise other complications arise
parsing regex is pretty annoying to be honest -.-
lol yeah
also, something to watch out for: .
does not include the newline character
but [^{]
does
yeah, when I need everything I like to use something like [\s\S]
so it matches everything
that’s exactly what I was going to suggest
in case you wonder, this is what my current grammar looks like (for parsing JS RegExp):
Regex = <'/'> Alternation <'/'> MatchFlag*
Alternation = Concatenation (<'|'> Concatenation)*
Concatenation = SuffixedExpr*
SuffixedExpr = SingleExpr Suffix? / CurlyRepetition / LiteralSpecialChar
SingleExpr = BaseExpr | ParenthesizedExpr
ParenthesizedExpr = <'('> GroupFlags? Alternation <')'>
Suffix = (Optional | Positive | NonNegative | CurlyRepetition) Quantifier?
Optional = <'?'>
Positive = <'+'>
NonNegative = <'*'>
CurlyRepetition = <'{'> #"\d+" (<','> #"\d+" ?) ? <'}'>
Quantifier = '?' | '+'
BaseExpr = CharExpr | LiteralChar | Anchor | BackReference
Anchor = '^' | '$' | '\\' #"[bB]"
LiteralChar = PlainChar | EscapedChar
LiteralSpecialChar = '{'
BackReference = <'\\'> #"[1-9][0-9]*"
PlainChar = #"[^.|\\+*$^\[(){?]"
CharExpr = Dot / LiteralChar / BCCEmpty / BCC
Dot = '.'
BCC = <'['> BCCUnionLeft? <']'>
BCCEmpty = '[]'
BCCUnionLeft = BCCNegation? BCCElemBase*
BCCNegation = '^'
BCCElemBase = BCCCharNonRange | SpecialCharClass | BCCRange | BCC
BCCRangeRightable = BCCCharEndRange | SpecialCharClass
BCCRange = BCCChar <'-'> BCCCharEndRange
BCCRangeWithBracket = <']-'> BCCCharEndRange
BCCCharNonRange = BCCChar !('-' BCCRangeRightable)
BCCChar = BCCPlainChar | EscapedChar
BCCCharEndRange = BCCPlainChar | EscapedChar
BCCPlainChar = #"[^\]\[\\]" | '\\b'
EscapedChar = SpecialCharClass / NormalSlashedCharacters / ControlChar / HexChar / BasicEscapedChar
HexChar = ShortHexChar | MediumHexChar | LongHexChar | VeryLongHexChar
ShortHexChar = <'\\x'> #'[0-9a-fA-F]{2}'
MediumHexChar = <'\\u'> #'[0-9a-fA-F]{4}'
LongHexChar = <'\\x{'> #'[0-9a-fA-F]{4}' <'}'>
VeryLongHexChar = <'\\x{'> #'[0-9a-fA-F]{6}' <'}'>
BasicEscapedChar = <'\\'> #"[\s\S]"
SpecialCharClass = <'\\'> #"[dDwWsSv0]"
NormalSlashedCharacters = #"\\[tnrf]"
ControlChar = <'\\c'> #"[A-Z]"
(** FLAGS **)
GroupFlags = NonCapturingMatchFlags
| PositiveLookAheadFlag
| NegativeLookAheadFlag
NonCapturingMatchFlags = <'?'> !')' <':'>
PositiveLookAheadFlag = <'?='>
NegativeLookAheadFlag = <'?!'>
MatchFlag = #"[gimuy]"
it doesn't need to be perfect, the usage of this is to port the test.chuck
string-from-regex
to CLJS
gfredericks has already done work in test.chuck
to parse Java regexes in instaparse, which may be useful: https://github.com/gfredericks/test.chuck/blob/master/resources/com/gfredericks/test/chuck/regex.bnf
yeah, this is actually based of that
I'm trying to port it to CLJS
oh lol, you’re right, I just needed to look closer
you’re porting test.chuck to cljs?
or just the regex generator?
just the regex generator, the rest is already all cljc
actually
ok this explains a lot 🙂
so then how did you get the grammar into a weird state that behaved improperly with curly repetitions, just by removing certain things not part of the EcmaScript regex spec?
or was it already like that?
the Java Regex have many different features compared to JS one
hmm I was under the impression that Java had a super-set of features to JS
clearly I was mistaken
no, JS is more tolerant in some cases
for example, those are invalid on JVM, but ok on JS: /a{/
/a{}/
/[]/
when JS sees an incomplete curly braces, it threats as literals
where JVM throws an exception
but in general the JS is simpler, since it doens't support character class unions (that feature adds a lot of complexity on the JVM Regex grammar, see: http://www.regular-expressions.info/charclassintersect.html)
so, this is the kind of feature that has a strong dependency on the platform
Gary did a great job making generative testing to check if the custom parser conforms with the platform regex parser, check this test: https://github.com/gfredericks/test.chuck/blob/master/test/com/gfredericks/test/chuck/regexes_test.clj#L117-L119
it generates random regexs and try to parse it with custom and native regex parsers, and they have to conform (all fail or all pass)
it's just very hard to get the grammar to work exactly like the native one, with all quirks dealt with