instaparse

If you're not trampolining your parser, why bother getting up in the morning?
2016-12-31T03:00:27.000019Z

instaparse requires keywords for the names of the whatchamacallits?

2016-12-31T03:00:55.000020Z

I think I might be using instaparse in a weird enough way for that to be a very mild problem

2016-12-31T03:01:13.000021Z

because I have to gensym the names and so it's a memory leak

seylerius 2016-12-31T04:47:01.000022Z

@gfredericks It outputs either hiccup or enlive notation, so yes it probably would want keywords in reverse.

aengelberg 2016-12-31T09:52:28.000023Z

@gfredericks:

;; (keyword 0) returns nil, so go through str to get :0, :1, ...
(def all-keywords-ever (map (comp keyword str) (range)))

;; each time you dynamically create a parser
(let [my-syms ...
      kws (zipmap my-syms all-keywords-ever)]
  ...)

aengelberg 2016-12-31T09:52:40.000024Z

That might be a way to conserve on keywords

aengelberg 2016-12-31T09:55:21.000025Z

Or do a string replace in the grammar to substitute non terminals with reusable symbols, then postwalk the resulting tree to convert back
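
A minimal sketch of the two ideas above combined, assuming the parse tree is plain hiccup-style vectors. The names `keyword-pool`, `pooled-names`, and `restore-names` are hypothetical and not part of instaparse; the gensym-like symbols are made up for illustration.

```clojure
(require '[clojure.walk :as walk]
         '[clojure.set :as set])

;; Hypothetical pool of reusable keywords: :nt0, :nt1, ...
;; Reusing these avoids interning a fresh keyword per generated name.
(def keyword-pool (map #(keyword (str "nt" %)) (range)))

(defn pooled-names
  "Maps each dynamically created name to a keyword from the shared pool."
  [syms]
  (zipmap syms keyword-pool))

(defn restore-names
  "Walks a parse tree and swaps pooled keywords back to the original names."
  [sym->kw tree]
  (let [kw->sym (set/map-invert sym->kw)]
    (walk/postwalk #(if (keyword? %) (get kw->sym % %) %) tree)))

(def names (pooled-names ['expr__1 'term__2]))

(restore-names names [:nt0 [:nt1 "x"] "+" [:nt1 "y"]])
;; => [expr__1 [term__2 "x"] "+" [term__2 "y"]]
```

Any keyword in the tree that is not in the pool mapping is left untouched.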

2016-12-31T14:24:50.000026Z

I'm using the combinators, so it shouldn't be too hard to do something like that if I decide this matters

zmaril 2016-12-31T19:41:39.000027Z

@gfredericks @aengelberg if we can actually get generating from grammars going I'd still be really stoked

zmaril 2016-12-31T19:42:36.000028Z

I've been working on https://github.com/zmaril/instaparse-c the past few weeks and am getting within spitting distance of doing some fun stuff.

zmaril 2016-12-31T19:43:09.000030Z

It can basically parse C at this point and I'm working on finishing the macro preprocessor now.

zmaril 2016-12-31T19:45:17.000032Z

The goal is to get the output into datascript and queryable. But a side product of this is that if you have something that can generate strings from grammars then we already have something that can produce c programs (sans macros).

2016-12-31T19:58:30.000033Z

@zmaril do you or anybody know if all instaparse grammars are implemented using the combinators?

2016-12-31T19:58:39.000034Z

s/grammars/parser/

zmaril 2016-12-31T19:58:55.000035Z

Yes they should be

zmaril 2016-12-31T19:59:37.000036Z

My understanding is that the ebnf notation that everybody uses is actually parsed by a parser expressed in the combinators that transforms the output into combinators

2016-12-31T20:00:11.000038Z

I just glanced at the combinator list -- I think only the lookaheads are problematic, but that's probably a big deal for sophisticated parsers

zmaril 2016-12-31T20:00:24.000039Z

yep

2016-12-31T20:00:29.000040Z

so...oh well.

zmaril 2016-12-31T20:00:56.000041Z

how does one express negation in generators now?

2016-12-31T20:01:09.000042Z

you could implement them with gen/such-that but the generator would fail if the lookahead condition is unlikely to pass by chance

2016-12-31T20:01:37.000043Z

I have no idea how that would play out IRL

zmaril 2016-12-31T20:01:43.000044Z

That should be fine then. For the parsers I write lookahead is typically used to implement reserved keywords.
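
A rough sketch of the reserved-keyword case with test.check's `gen/such-that`, assuming identifiers are short lowercase strings. The `reserved` set and `identifier-gen` are hypothetical names for illustration; nothing here is generated from an actual instaparse grammar.

```clojure
(require '[clojure.string :as str]
         '[clojure.test.check.generators :as gen])

;; Hypothetical reserved words that a negative lookahead would exclude.
(def reserved #{"if" "else" "while" "return"})

;; Random lowercase strings of length 1 to 8.
(def raw-identifier-gen
  (gen/fmap str/join
            (gen/vector (gen/elements (seq "abcdefghijklmnopqrstuvwxyz")) 1 8)))

;; such-that retries until the predicate passes and throws after 100 tries.
;; That is fine here because reserved words are rare among random strings,
;; but it would fail for a lookahead condition that seldom holds by chance.
(def identifier-gen
  (gen/such-that (complement reserved) raw-identifier-gen 100))

(gen/sample identifier-gen 5)
```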

zmaril 2016-12-31T20:02:03.000045Z

I've never used positive lookahead actually now that I think about it

2016-12-31T20:02:18.000046Z

when I made the regex→string generator I just decided not to support look[ahead|behind] for the same reason

zmaril 2016-12-31T20:02:38.000047Z

It's one of those things that is academic to me at this point

zmaril 2016-12-31T20:03:05.000048Z

I'm pretty sure that 99% of the time gen/such-that would be fine

2016-12-31T20:03:29.000050Z

it might not be too hard to throw together a PoC

2016-12-31T20:03:42.000051Z

in fact that would potentially be useful for what I'm working on right now

zmaril 2016-12-31T20:04:39.000052Z

yeah, I think that would fit really well and mirror what spec is doing

zmaril 2016-12-31T20:04:52.000053Z

I've been using spec/conform the same way I use instaparse and it works really well

zmaril 2016-12-31T20:05:12.000054Z

So I imagine we could use generators the same way spec does and it would work well (fingers crossed)

2016-12-31T20:07:32.000055Z

😂 I just realized that it would require using string-from-regex from test.chuck to support regexes in the grammars, and string-from-regex uses instaparse to parse the regex.

zmaril 2016-12-31T20:07:49.000056Z

turtles

2016-12-31T20:08:00.000057Z

indeed

zmaril 2016-12-31T20:08:10.000058Z

that was the thing that was holding me up actually

zmaril 2016-12-31T20:08:14.000059Z

was that I didn't want to mess with regexes

aengelberg 2016-12-31T20:09:30.000060Z

just catching up

aengelberg 2016-12-31T20:10:30.000061Z

After I wrote "instagenerate" I realized going the generator route (as opposed to core.logic) would probably be easier, despite the lookahead such-that problem

aengelberg 2016-12-31T20:10:41.000062Z

But what do you want to do about hide-tags?

zmaril 2016-12-31T20:11:10.000063Z

I think I have an idea, h/o

zmaril 2016-12-31T20:11:44.000064Z

well, hmmm what is the problem you see with hide-tags?

aengelberg 2016-12-31T20:12:14.000065Z

It depends on what you expect the "input" to the generator to be

aengelberg 2016-12-31T20:12:24.000066Z

a parse tree still?

2016-12-31T20:12:34.000067Z

it'd be the combinator

2016-12-31T20:12:46.000068Z

it would generate totally random parsable things

2016-12-31T20:12:53.000069Z

not based on some partial input

aengelberg 2016-12-31T20:13:24.000070Z

ok, in that case I don't really have a problem with hide tags despite just waking up

zmaril 2016-12-31T20:13:40.000071Z

I think if we got something going that just took a grammar and gave back random strings, that would be a good first step

aengelberg 2016-12-31T20:14:50.000072Z

part of why I did core.logic in instagenerate is @zmaril's initial request to go from partial input -> parseable strings, so I felt the need to put in the sophistication of logic programming as a general solver for all cases

zmaril 2016-12-31T20:15:15.000073Z

oh, if we want to do partial input, we can provide skeletons with places to start generating from

zmaril 2016-12-31T20:15:41.000074Z

then we just walk the skeleton and generate random strings at the indicated places

zmaril 2016-12-31T20:16:04.000075Z

still not fully general but better

zmaril 2016-12-31T20:17:19.000076Z

and then we could restrict the grammar inside the combinator somehow

aengelberg 2016-12-31T20:21:48.000078Z

(def p (insta/parser "
S = A B A | B A B
<A> = ('a' <'c'> 'b')+
<B> = ('b' 'a')+
"))

(generate p [:S "a" "b" "b" "a" "a" "b"])
=> ("acbbaacb")

aengelberg 2016-12-31T20:23:35.000079Z

seems hard to performantly solve generally

zmaril 2016-12-31T20:24:34.000080Z

who said anything about performance

aengelberg 2016-12-31T20:24:39.000081Z

🙂 fair enough

aengelberg 2016-12-31T20:25:00.000082Z

but a generator approach using such-that may never complete on a large enough grammar

zmaril 2016-12-31T20:25:31.000083Z

cross that bridge when we get there

zmaril 2016-12-31T20:25:48.000084Z

computers are like really fast

zmaril 2016-12-31T20:26:20.000085Z

this is more of a what's possible idea than a production thing

aengelberg 2016-12-31T20:27:43.000086Z

cool

aengelberg 2016-12-31T20:28:00.000087Z

let me know if I can help out in whichever path you decide to try out

zmaril 2016-12-31T20:28:28.000088Z

for sure!

2016-12-31T20:38:51.000089Z

yeah generators aren't generally for production stuff

2016-12-31T20:43:20.000090Z

I want a combinator that doesn't match anything

2016-12-31T20:43:41.000091Z

I thought maybe (combo/alt) but that returns ε

zmaril 2016-12-31T20:44:11.000092Z

(gen/such-that (constantly false)) or something?

2016-12-31T20:44:18.000093Z

a combinator, not a generator

zmaril 2016-12-31T20:44:22.000094Z

oh right sorry

2016-12-31T20:44:38.000095Z

I guess I can do negative lookahead with epsilon?

zmaril 2016-12-31T20:44:52.000096Z

or a really unlikely string?

zmaril 2016-12-31T20:45:27.000097Z

like (string "THISWILLNEVERBEMATCHEDHOPEFULLY")

2016-12-31T20:46:12.000098Z

🙂

zmaril 2016-12-31T20:46:30.000099Z

we're not fancy here

2016-12-31T20:46:38.000100Z

(string (str (java.util.UUID/randomUUID)))

zmaril 2016-12-31T20:46:56.000101Z

that works!

2016-12-31T20:48:24.000102Z

I have an alternate thing in my codebase that could be called a parser, but instaparse also has something by that name so I called it a parsifier instead

2016-12-31T20:48:32.000103Z

and it's hard to remember that word because it could also have been parsinator

zmaril 2016-12-31T20:49:34.000104Z

hahaha

zmaril 2016-12-31T20:50:17.000105Z

(defn enlive-output->datascript-datums [m]
  (if-not (map? m)
    {:type :value :value m}
    (as-> m $
      (assoc $ :meta (meta m))
      (assoc $ :db/id (d/tempid :mcc))
      (transform [:content ALL] enlive-output->datascript-datums $))))
This will take enlive output and make it so you can query it from datascript

2016-12-31T20:53:24.000106Z

does instaparse use its own regex engine?

zmaril 2016-12-31T20:53:37.000107Z

no

2016-12-31T20:53:40.000108Z

I just got a misparse where the thing matches the regex but instaparse disagrees

zmaril 2016-12-31T20:53:42.000109Z

depends on java if I recall

2016-12-31T20:53:52.000110Z

and reordering a disjunction in the regex fixes it

zmaril 2016-12-31T20:54:04.000111Z

hmm

2016-12-31T20:54:10.000112Z

this is the instaparse-cljs thing in particular, but still on the jvm

zmaril 2016-12-31T20:54:16.000113Z

check if instaparse passes any flags in

2016-12-31T20:55:53.000114Z

here's the failing version: https://www.refheap.com/124435

zmaril 2016-12-31T20:58:11.000115Z

hmm

zmaril 2016-12-31T20:58:18.000116Z

"0/2" parses

zmaril 2016-12-31T20:58:44.000119Z

can you add in some parens to the second part to clarify your intent

2016-12-31T20:59:59.000120Z

"0/2" is not supposed to parse o_O

2016-12-31T21:00:34.000121Z

I see that's my fault though

zmaril 2016-12-31T21:03:49.000122Z

ha

aengelberg 2016-12-31T21:59:41.000123Z

I second !epsilon as the "don't parse"

aengelberg 2016-12-31T22:00:29.000124Z

also instaparse fails on infinite loop grammars, so this might work

never-succeed = never-succeed
(then use never-succeed wherever)
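
A small sketch of the `!epsilon` idea using the combinators, assuming the `combo` alias from earlier in the conversation refers to `instaparse.combinators`. The name `never-match` and the toy grammar are hypothetical.

```clojure
(require '[instaparse.core :as insta]
         '[instaparse.combinators :as combo])

;; Epsilon always matches, so a negative lookahead on it never does.
(def never-match (combo/neg combo/Epsilon))

;; Toy grammar: S is either the string "a" or the never-matching branch.
(def p
  (insta/parser
   {:S (combo/alt (combo/string "a") never-match)}
   :start :S))

(p "a")                  ; parses via the first alternative
(insta/failure? (p "b")) ; the never-match branch cannot rescue this input
```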

2016-12-31T22:01:58.000125Z

@aengelberg do you think the current behavior of (combo/alt) is bad/weird?

2016-12-31T22:02:44.000126Z

my hunch is that According To Math it should either throw or not match anything

aengelberg 2016-12-31T22:03:01.000127Z

yeah I agree with your instinct. Not really sure what the thinking was in that design.

2016-12-31T22:03:17.000128Z

my argument is that because (combo/alt p) probably does not match ε, neither should (combo/alt)

aengelberg 2016-12-31T22:03:23.000129Z

Maybe since "don't parse anything" isn't really a common use case

2016-12-31T22:03:33.000130Z

you shouldn't parse more things by removing an arg from combo/alt

aengelberg 2016-12-31T22:03:46.000131Z

agreed

2016-12-31T22:04:03.000132Z

yeah I always end up finding the uncommon use cases

2016-12-31T22:04:25.000133Z

for a while every time I tried to use CLJS I ended up creating a jira ticket

aengelberg 2016-12-31T22:04:55.000134Z

#gobigorgohome

aengelberg 2016-12-31T22:06:09.000135Z

I think I know why your parser is failing

aengelberg 2016-12-31T22:06:48.000136Z

The regex for the denominator, when given "25" as input, may arbitrarily decide to match either "2" or "25"

aengelberg 2016-12-31T22:07:04.000137Z

In instaparse, whatever the regex decides is the one and only possible parse

aengelberg 2016-12-31T22:07:53.000138Z

user=> (re-matches #"[2-9]|[1-9][0-9]+" "25")
"25"
user=> (re-seq #"[2-9]|[1-9][0-9]+" "25")
("2" "5")
user=> (re-find #"[2-9]|[1-9][0-9]+" "25")
"2"

2016-12-31T22:08:33.000139Z

oh it's about re-matches vs re-find?

2016-12-31T22:08:47.000142Z

oh I think I see

aengelberg 2016-12-31T22:09:04.000143Z

you could instead do #"[2-9]" | #"[1-9][0-9]+"

aengelberg 2016-12-31T22:09:22.000144Z

If you move logic from regexes into instaparse, you get flexibility at the cost of speed

2016-12-31T22:11:10.000146Z

so the fact that I fixed it by rearranging the regex is sort of an implementation detail I guess?

aengelberg 2016-12-31T22:12:02.000147Z

Yes, so I would call rearranging the regex an improper solution

aengelberg 2016-12-31T22:12:25.000148Z

but #"[2-9]" | #"[1-9][0-9]+" is proper
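
A quick illustration of why reordering the disjunction appeared to fix things, using plain Clojure regexes rather than instaparse. Java's regex alternation tries branches left to right, so a prefix match takes the first branch that succeeds; the names `original` and `reordered` are just for this sketch.

```clojure
(def original  #"[2-9]|[1-9][0-9]+")
(def reordered #"[1-9][0-9]+|[2-9]")

;; The single-digit branch comes first, so only "2" is consumed.
(re-find original "25")   ;=> "2"

;; The multi-digit branch comes first, so all of "25" is consumed.
(re-find reordered "25")  ;=> "25"

;; Single digits still match through the second branch.
(re-find reordered "2")   ;=> "2"
```

The reordered regex only works because of that branch ordering, which is why moving the alternation into the grammar is the sturdier fix.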

2016-12-31T22:15:10.000149Z

okay fine I'll switch it 😛