Clojurians Log v2

Clojure programming

Channels

# 100-days-of-code # aatree # admin-announcements # adventofcode # ai # alda # aleph # all-the-channels # announcements # arachne # architecture # asami # atlanta-clojurians # atom-editor # autochrome-github # avi # aws # aws-lambda # babashka # babashka-sci-dev # bangalore-clj # beginners # berlin # biff # bigdata # bitcoin # boot # boot-dev # boulder-clojurians # braid-chat # braveandtrue # brevis # bristol-clojurians # business # calva # capetown # carry # cbus # cestmeetup # chestnut # chlorine-clover # cider # circleci # clara # clj-commons # cljdoc # cljfx # clj-http # clj-kondo # clj-on-windows # cljs-dev # cljs-experience # cljsfiddle # cljsjs # cljsrn # cljtogether # clojars # clojure # clojure-android # clojure-argentina # clojure-art # clojure-austin # clojure-australia # clojure-austria # clojure-bangladesh # clojure-bay-area # clojure-beijing # clojure-belgium # clojure-berlin # clojure-boston # clojure-brasil # clojurebridge # clojurebridge-ams # clojure-canada # clojure-chennai # clojure-chicago # clojure-china # clojure-colombia # clojure-conj # clojurecup # clojure-czech # clojured # clojure-denmark # clojure-denver # clojure-derby # clojuredesign-podcast # clojure-dev # clojure-dusseldorf # clojure-ecuador # clojure-egypt # clojure-estonia # clojure-europe # clojure-filipino # clojure-finland # clojure-france # clojure-gamedev # clojure-germany # clojure-greece # clojure-guangzhou # clojure-hamburg # clojure-hk # clojure-houston # clojure-hungary # clojure-india # clojureindia # clojure-indonesia # clojure-ireland # clojure-israel # clojure-italy # clojure-japan # clojure-kc # clojure-korea # clojure-losangeles # clojure-madison # clojure-mexico # clojure-miami # clojure-mk # clojure-mke # clojure-morsels # clojure-my # clojure-new-zealand # clojure-nl # clojure-nlp # clojure-norway # clojure-poland # clojure-portugal # clojure-provo # clojure-quebec # clojureremote # clojure-romania # clojure-russia # clojure-sanfrancisco # clojurescript # clojurescript-ios # clojure-sdn # clojure-seattle # clojure-serbia # clojure-sg # clojure-shanghai # clojure-spain # clojure-spec # clojuresque # clojure-survey # clojure-sweden # clojure-switzerland # clojure-taiwan # clojure-turkiye # clojure-uk # clojure-ukraine # clojureverse-ops # clojurewerkz # clojurewest # clojurex # clojure-za # clojurian-chat-app # clojutre # cloverage # cloxp # clr # code-art # code-reviews # community-development # component # conf-proposals # conjure # consulting # contributions-welcome # copenhagen-clojurians # core-async # core-logic # core-matrix # core-typed # cryogen # crypto # css # cursive # cz-clojure # d2q # datacrypt # datahike # datalevin # datalog # data-oriented-programming # data-science # datascript # datavis # dato # datomic # defnpodcast # deps-new # depstar # devcards # devops # dirac # docker # docs # domino-clj # duct # dunaj # eastwood # editors # emacs # error-message-catalog # etaoin # ethereum # euroclojure # events # exercism # expound # figwheel # figwheel-main # flambo # fulcro # funcool # functionalprogramming # funimage # garden # ghostwheel # girouette # gis # google-cloud # gorilla # graalvm # graalvm-mobile # graclj # graphql # gratitude # gsoc # hammock-driven-dev # helix # heroku # hispano # holy-lambda # honeysql # hoplon # hugsql # humor # hypercrud # hyperfiddle # immutant # improve-getting-started # incanter # indycljs # inf-clojure # instaparse # integrant # interceptors # interop # introduce-yourself # iot # iotivity # ipfs # jackdaw # jaunt # java # javascript # javelin # jobs # jobs-discuss # jobs-rus # joker # jukebox # juxt # jvm # kaocha # keechma # kekkonen # keyboards # klipse # kosmos # lambdaisland # ldnclj # ldnproclodo # lein-figwheel # leiningen # liberator # liquid # livestream # local-first-clojure # london-clojurians # lsp # luminus # lumo # mail # malli # mathematics # meander # melbourne # membrane # mental-health # microservices # mid-cities-meetup # midje # minecraft # minimallist # missionary # monads # mount # music # new-channels # new-clojure # nextjournal # nginx # nrepl # numerical-computing # nyc # observability # off-topic # om # om-next # onyx # other-languages # other-lisps # overtone # pamela # parinfer # pathom # pedestal # perun # philosophy # phzr # planck # plastic # play-clj # podcasts # polylith # portal # portkey # portland-or # powderkeg # practicalli # precept # prelude # programming-beginners # project-updates # proletarian # proton # protorepl # pulsar # pure-frame # qa # qlkit # quil # random # rdf # react # reactive # reading-clojure # reagent # reclojure # re-frame # reitit # releases # remote-jobs # respo # rethinkdb # reveal # rewrite-clj # ring # ring-swagger # robots # rum # schema # sci # sfcljs # shadow-cljs # _silence # sim-testing # sioux-falls # slack-help # sneer # sneer-br # spacemacs # specmonstah # specter # speculative # spirituality-ethics # sql # startup-in-a-month # sydney # test200 # test-check # testing # thejaloniki # timbre # tmp-json-parsing # tools-build # tools-deps # trading # tree-sitter # uncomplicate # unrepl # untangled # utah-clojurians # videos # vim # vrac # vscode # wasm # web-security # windows # xtdb # yada # yleinen

Apps

instaparse

If you're not trampolining your parser, why bother getting up in the morning?

wilkerlucio 2017-06-06T12:46:28.033584Z

hello people, I'm stuck with a instaparse rule, maybe someone here can help me out 🙂

wilkerlucio 2017-06-06T12:46:36.035452Z

I'm writing a parser for Javascript regexes

wilkerlucio 2017-06-06T12:46:40.036584Z

this is the current grammar:

wilkerlucio 2017-06-06T12:46:49.038868Z

Regex = &lt;'/'&gt; Alternation &lt;'/'&gt; MatchFlag*

Alternation = Concatenation (&lt;'|'&gt; Concatenation)*

Concatenation = SuffixedExpr*

SuffixedExpr = SingleExpr Suffix?
SingleExpr = BaseExpr | ParenthesizedExpr
ParenthesizedExpr = &lt;'('&gt; GroupFlags? Alternation &lt;')'&gt;
Suffix = (Optional | Positive | NonNegative | CurlyRepetition) Quantifier?
Optional = &lt;'?'&gt;
Positive = &lt;'+'&gt;
NonNegative = &lt;'*'&gt;
CurlyRepetition = &lt;'{'&gt; #"\d+" (&lt;','&gt; #"\d+" ?) ? &lt;'}'&gt;
Quantifier = '?' | '+'
BaseExpr = CharExpr | LiteralChar | Anchor | BackReference

Anchor = '^' | '$' | '\\' #"[bB]"
LiteralChar = PlainChar | EscapedChar

BackReference = &lt;'\\'&gt; #"[1-9][0-9]*"

PlainChar = #"[^.|\\+*$^\[(){?]"
CharExpr = Dot | LiteralChar | BCC
Dot = '.'

BCC = &lt;'['&gt; BCCUnionLeft? &lt;']'&gt;

BCCUnionLeft = BCCNegation? BCCElemBase*

BCCNegation = '^'

BCCElemBase = BCCCharNonRange | SpecialCharClass | BCCRange | BCC
BCCRangeRightable = BCCCharEndRange | SpecialCharClass
BCCRange = BCCChar &lt;'-'&gt; BCCCharEndRange
BCCRangeWithBracket = &lt;']-'&gt; BCCCharEndRange
BCCCharNonRange = BCCChar !('-' BCCRangeRightable)
BCCChar = BCCPlainChar | EscapedChar
BCCCharEndRange = BCCPlainChar | EscapedChar
BCCPlainChar = #"[^\]\[\\]" | '\\b'

EscapedChar = SpecialCharClass | NormalSlashedCharacters | ControlChar | HexChar | BasicEscapedChar

HexChar = ShortHexChar | MediumHexChar | LongHexChar | VeryLongHexChar
ShortHexChar = &lt;'\\x'&gt; #'[0-9a-fA-F]{2}'
MediumHexChar = &lt;'\\u'&gt; #'[0-9a-fA-F]{4}'
LongHexChar = &lt;'\\x{'&gt; #'[0-9a-fA-F]{4}' &lt;'}'&gt;
VeryLongHexChar = &lt;'\\x{'&gt; #'[0-9a-fA-F]{6}' &lt;'}'&gt;
BasicEscapedChar = &lt;'\\'&gt; #"[\s\S]"
SpecialCharClass = &lt;'\\'&gt; #"[dDwWsSv0]"

NormalSlashedCharacters = #"\\[tnrf]"

ControlChar = &lt;'\\c'&gt; #"[A-Z]"

(** FLAGS **)
GroupFlags = NonCapturingMatchFlags
           | PositiveLookAheadFlag
           | NegativeLookAheadFlag

NonCapturingMatchFlags = &lt;'?'&gt; !')' &lt;':'&gt;
PositiveLookAheadFlag = &lt;'?='&gt;
NegativeLookAheadFlag = &lt;'?!'&gt;

MatchFlag = #"[gimuy]"

wilkerlucio 2017-06-06T12:48:39.067072Z

I would like it to parse {, as a plain char, currently on the PlainChar definition this char is excluded to allow for the CurlyRepetition

wilkerlucio 2017-06-06T12:48:47.069241Z

I'm probably missing something

wilkerlucio 2017-06-06T12:49:01.072898Z

but if I allow the { at PlainChar

wilkerlucio 2017-06-06T12:49:15.076763Z

then when I try to do: a{2}

wilkerlucio 2017-06-06T12:49:46.084308Z

I was expect it to go into SuffixedExpr and match a PlainChar followed by a Suffix that would be a CurlyRepetition

wilkerlucio 2017-06-06T12:50:11.090777Z

but instead seems like it's not matching the suffix, and instead matches a series of plain chars

wilkerlucio 2017-06-06T12:55:31.177168Z

sorry the noise, I got a much simplified version now, by this grammar:

wilkerlucio 2017-06-06T12:55:34.178031Z

Concatenation = SuffixedExpr*

SuffixedExpr = LiteralChar CurlyRepetition?

CurlyRepetition = &lt;'{'&gt; #"\d+" &lt;'}'&gt;

LiteralChar = #"."

wilkerlucio 2017-06-06T12:56:22.191481Z

I expected it to match "a{2}" as a LiteralChar followed by a CurlyRepetition, but instead it matches as [:Concatenation [:SuffixedExpr [:LiteralChar "a"]] [:SuffixedExpr [:LiteralChar "{"]] [:SuffixedExpr [:LiteralChar "2"]] [:SuffixedExpr [:LiteralChar "}"]]]

wilkerlucio 2017-06-06T12:57:02.202234Z

how can make it try force match the CurlyRepetition before stepping a level up and matching more literal chars?

aengelberg 2017-06-06T16:00:42.435558Z

@wilkerlucio maybe ordered choice / in instaparse might help here?

aengelberg 2017-06-06T16:02:33.481808Z

so you want a{2 to parse as 3 PlainChars, but a{2} to parse as a PlainChar followed by a CurlyRepetition?

wilkerlucio 2017-06-06T17:34:21.513188Z

@aengelberg yes, with some help I was able to figure it out, here is a way to handle it:

wilkerlucio 2017-06-06T17:34:23.513971Z

Concatenation = SuffixedExpr*

SuffixedExpr = LiteralChar CurlyRepetition? / AnyLiteralChar

CurlyRepetition = &lt;'{'&gt; #"\d+" &lt;'}'&gt;

LiteralChar = #"[^{]"
AnyLiteralChar = #"."

wilkerlucio 2017-06-06T17:34:49.523743Z

I had to restrict the first one a little and have a second more permissive rule, now it parses the way I was expecting 🙂

aengelberg 2017-06-06T17:35:55.547653Z

I suggest you take away the ? after CurlyRepetition, to make the grammar unambiguous (improving performance).

wilkerlucio 2017-06-06T17:36:50.567211Z

@aengelberg but it is needed there, because the curlyrepetition is optional

wilkerlucio 2017-06-06T17:37:16.576619Z

ah, I think I got you said, it would match anyway

wilkerlucio 2017-06-06T17:37:24.579828Z

in my real case its a bit more complicated

wilkerlucio 2017-06-06T17:37:39.584812Z

and the latest AnyLiteral actually just matches the {, otherwise other complications arise

wilkerlucio 2017-06-06T17:37:57.590912Z

parsing regex is pretty annoying to be honest -.-

aengelberg 2017-06-06T17:38:04.593433Z

lol yeah

aengelberg 2017-06-06T17:38:21.599856Z

also, something to watch out for: . does not include the newline character

aengelberg 2017-06-06T17:38:33.604146Z

but [^{] does

wilkerlucio 2017-06-06T17:38:54.611626Z

yeah, when I need everything I like to use something like [\s\S]

wilkerlucio 2017-06-06T17:38:57.612823Z

so it matches everything

aengelberg 2017-06-06T17:39:08.616498Z

that’s exactly what I was going to suggest

wilkerlucio 2017-06-06T17:39:16.619531Z

in case you wonder, this is what my current grammar looks like (for parsing JS RegExp):

wilkerlucio 2017-06-06T17:39:20.621292Z

Regex = &lt;'/'&gt; Alternation &lt;'/'&gt; MatchFlag*

Alternation = Concatenation (&lt;'|'&gt; Concatenation)*

Concatenation = SuffixedExpr*

SuffixedExpr = SingleExpr Suffix? / CurlyRepetition / LiteralSpecialChar
SingleExpr = BaseExpr | ParenthesizedExpr
ParenthesizedExpr = &lt;'('&gt; GroupFlags? Alternation &lt;')'&gt;
Suffix = (Optional | Positive | NonNegative | CurlyRepetition) Quantifier?
Optional = &lt;'?'&gt;
Positive = &lt;'+'&gt;
NonNegative = &lt;'*'&gt;
CurlyRepetition = &lt;'{'&gt; #"\d+" (&lt;','&gt; #"\d+" ?) ? &lt;'}'&gt;
Quantifier = '?' | '+'
BaseExpr = CharExpr | LiteralChar | Anchor | BackReference

Anchor = '^' | '$' | '\\' #"[bB]"
LiteralChar = PlainChar | EscapedChar
LiteralSpecialChar = '{'

BackReference = &lt;'\\'&gt; #"[1-9][0-9]*"

PlainChar = #"[^.|\\+*$^\[(){?]"
CharExpr = Dot / LiteralChar / BCCEmpty / BCC
Dot = '.'

BCC = &lt;'['&gt; BCCUnionLeft? &lt;']'&gt;
BCCEmpty = '[]'

BCCUnionLeft = BCCNegation? BCCElemBase*

BCCNegation = '^'

BCCElemBase = BCCCharNonRange | SpecialCharClass | BCCRange | BCC
BCCRangeRightable = BCCCharEndRange | SpecialCharClass
BCCRange = BCCChar &lt;'-'&gt; BCCCharEndRange
BCCRangeWithBracket = &lt;']-'&gt; BCCCharEndRange
BCCCharNonRange = BCCChar !('-' BCCRangeRightable)
BCCChar = BCCPlainChar | EscapedChar
BCCCharEndRange = BCCPlainChar | EscapedChar
BCCPlainChar = #"[^\]\[\\]" | '\\b'

EscapedChar = SpecialCharClass / NormalSlashedCharacters / ControlChar / HexChar / BasicEscapedChar

HexChar = ShortHexChar | MediumHexChar | LongHexChar | VeryLongHexChar
ShortHexChar = &lt;'\\x'&gt; #'[0-9a-fA-F]{2}'
MediumHexChar = &lt;'\\u'&gt; #'[0-9a-fA-F]{4}'
LongHexChar = &lt;'\\x{'&gt; #'[0-9a-fA-F]{4}' &lt;'}'&gt;
VeryLongHexChar = &lt;'\\x{'&gt; #'[0-9a-fA-F]{6}' &lt;'}'&gt;
BasicEscapedChar = &lt;'\\'&gt; #"[\s\S]"
SpecialCharClass = &lt;'\\'&gt; #"[dDwWsSv0]"

NormalSlashedCharacters = #"\\[tnrf]"

ControlChar = &lt;'\\c'&gt; #"[A-Z]"

(** FLAGS **)
GroupFlags = NonCapturingMatchFlags
           | PositiveLookAheadFlag
           | NegativeLookAheadFlag

NonCapturingMatchFlags = &lt;'?'&gt; !')' &lt;':'&gt;
PositiveLookAheadFlag = &lt;'?='&gt;
NegativeLookAheadFlag = &lt;'?!'&gt;

MatchFlag = #"[gimuy]"

wilkerlucio 2017-06-06T17:40:24.644587Z

it doesn't need to be perfect, the usage of this is to port the test.chuck string-from-regex to CLJS

aengelberg 2017-06-06T17:40:40.650305Z

gfredericks has already done work in test.chuck to parse Java regexes in instaparse, which may be useful: https://github.com/gfredericks/test.chuck/blob/master/resources/com/gfredericks/test/chuck/regex.bnf

wilkerlucio 2017-06-06T17:40:55.655726Z

yeah, this is actually based of that

wilkerlucio 2017-06-06T17:41:01.657789Z

I'm trying to port it to CLJS

aengelberg 2017-06-06T17:41:06.659511Z

oh lol, you’re right, I just needed to look closer

aengelberg 2017-06-06T17:41:15.662835Z

you’re porting test.chuck to cljs?

aengelberg 2017-06-06T17:41:20.664624Z

or just the regex generator?

wilkerlucio 2017-06-06T17:41:33.669242Z

just the regex generator, the rest is already all cljc actually

aengelberg 2017-06-06T17:41:42.672534Z

ok this explains a lot 🙂

aengelberg 2017-06-06T17:42:26.688181Z

so then how did you get the grammar into a weird state that behaved improperly with curly repetitions, just by removing certain things not part of the EcmaScript regex spec?

aengelberg 2017-06-06T17:42:34.691019Z

or was it already like that?

wilkerlucio 2017-06-06T17:43:04.701872Z

the Java Regex have many different features compared to JS one

aengelberg 2017-06-06T17:43:22.708350Z

hmm I was under the impression that Java had a super-set of features to JS

aengelberg 2017-06-06T17:43:26.709570Z

clearly I was mistaken

wilkerlucio 2017-06-06T17:43:34.712365Z

no, JS is more tolerant in some cases

wilkerlucio 2017-06-06T17:43:54.720002Z

for example, those are invalid on JVM, but ok on JS: /a{/ /a{}/ /[]/

wilkerlucio 2017-06-06T17:44:20.729567Z

when JS sees an incomplete curly braces, it threats as literals

wilkerlucio 2017-06-06T17:44:27.731898Z

where JVM throws an exception

wilkerlucio 2017-06-06T17:45:16.750124Z

but in general the JS is simpler, since it doens't support character class unions (that feature adds a lot of complexity on the JVM Regex grammar, see: http://www.regular-expressions.info/charclassintersect.html)

wilkerlucio 2017-06-06T17:45:36.757593Z

so, this is the kind of feature that has a strong dependency on the platform

wilkerlucio 2017-06-06T17:47:11.793075Z

Gary did a great job making generative testing to check if the custom parser conforms with the platform regex parser, check this test: https://github.com/gfredericks/test.chuck/blob/master/test/com/gfredericks/test/chuck/regexes_test.clj#L117-L119

wilkerlucio 2017-06-06T17:47:38.803432Z

it generates random regexs and try to parse it with custom and native regex parsers, and they have to conform (all fail or all pass)

wilkerlucio 2017-06-06T17:48:03.812542Z

it's just very hard to get the grammar to work exactly like the native one, with all quirks dealt with