instaparse

If you're not trampolining your parser, why bother getting up in the morning?
2016-07-04T01:53:46.000012Z

What's the status of cljs support ?

2016-07-04T01:54:49.000013Z

Is it still living as a fork?

aengelberg 2016-07-04T01:57:56.000014Z

The only cljs support still lives in lbradstreet/instaparse-cljs

aengelberg 2016-07-04T01:59:57.000015Z

But I'm currently in the process of rewriting instaparse-cljs into a form that we'd be willing to accept back into upstream, now that cljsee exists

aengelberg 2016-07-04T07:46:31.000016Z

@seylerius: Here's a grammar that parses exponents like you were asking:

boot.user=> (def p (insta/parser "
<S> = ows (exponent ows)+
<exponent> = token <'^'> super
super = token | <'{'> token <'}'>
<token> = #'[^\\s\\^{}]+'
<ows> = <#'\\s*'>
"))
#'boot.user/p
boot.user=> (p "foo^2 x^{x+1}")
("foo" [:super "2"] "x" [:super "x+1"])
This parser is pretty naive about the range of possible inputs, since I'm not totally sure myself what that range of inputs is in your use case.

seylerius 2016-07-04T16:43:30.000018Z

Thanks!

seylerius 2016-07-04T16:47:13.000020Z

Another question: * / + = & ~ can appear in singles without being tokens. How would you represent that? Current parser: http://sprunge.us/GNDe

seylerius 2016-07-04T16:54:58.000021Z

@aengelberg: What I have will do for the moment, but it's a part of the spec I'd like to meet eventually.

2016-07-04T17:03:26.000022Z

Hi, we recently switched from parsing user input with plain regex to instaparse. The code looks way better. However, there are two corner cases where I am not sure what the idiomatic way would be: 1) parsing a certain domain of inputs should result in a no-op. Our current solution is:

"sentence = define / explain / help / catchall
<<skipped definitions>>
 catchall = #'(.|[\n\r])*'"
with the intention of just ignoring the last part during transformation: :catchall (fn [_] nil). Now I wonder if there is another way to catch this case and ignore it without using exceptions. 2) `'(.|[\n\r])*'` contains |, which on the JVM leads to recursion and might result in a stack overflow. In fact, it happened to us once. Is there a better way to write catchall that would account for anything, including \n and \r?

aengelberg 2016-07-04T17:10:05.000023Z

@happy.lisper for catchall you could do #'[\s\S]*'
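A minimal sketch of the suggested character class, checked here with plain Clojure regex functions rather than instaparse. The names `catchall` and `big-input` are made up for illustration. Inside a grammar string the backslashes would need doubling, e.g. `catchall = #'[\\s\\S]*'`.

```clojure
;; [\s\S] matches any character, including \n and \r, as a single
;; character class, so no alternation group has to be repeated.
(def catchall #"[\s\S]*")

;; Matches across line breaks and returns the whole input.
(re-matches catchall "line one\nline two\r\n")

;; A long multi-line input (hypothetical stress case).
(def big-input (apply str (repeat 100000 "x\n")))
(some? (re-matches catchall big-input))
```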

2016-07-04T17:10:24.000024Z

ty

aengelberg 2016-07-04T17:11:16.000025Z

So your use case is: "Parse the entire string as a define, an explain, or a help, but if that doesn't work then return nil"?

aengelberg 2016-07-04T17:11:43.000026Z

Because you could just run the parse and a transform, then check (insta/failure? result)

2016-07-04T17:11:52.000027Z

yes, where nil is just a signal to ignore the input.

aengelberg 2016-07-04T17:13:54.000028Z

(def p (insta/parser ...))
(let [result (p input-string)
      transformed (insta/transform {...} result)]
  (when-not (insta/failure? transformed)
    transformed))

aengelberg 2016-07-04T17:14:12.000029Z

Note that insta/transform is specifically designed to pass through failures
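To make the failure check concrete, here is a small sketch assuming instaparse is on the classpath. The grammar, the `command` parser, and the `interpret` function are invented for illustration and are not the grammar discussed in this conversation.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical two-command grammar; anything else fails to parse.
(def command
  (insta/parser
    "sentence = define | explain
     define = <'define '> word
     explain = <'explain '> word
     word = #'\\w+'"))

(defn interpret
  "Returns a command map, or nil when the input does not parse."
  [input]
  (let [result (insta/transform
                 {:word     identity
                  :define   (fn [w] {:op :define :word w})
                  :explain  (fn [w] {:op :explain :word w})
                  :sentence identity}
                 (command input))]
    ;; A failed parse comes back from transform unchanged,
    ;; so one failure? check after the transform is enough.
    (when-not (insta/failure? result)
      result)))

(interpret "define foo")
(interpret "some unrelated chatter")
```

With this shape no catchall rule is needed: unrecognized input simply yields nil.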

2016-07-04T17:15:13.000030Z

Let me consider that 🙂.

aengelberg 2016-07-04T17:19:50.000031Z

@seylerius: Given an input ~a ~b, how do you know the a and b are to be parsed as individual ~'s, as opposed to a code string of "a " followed by "b"?

seylerius 2016-07-04T17:24:06.000034Z

@aengelberg: If I'm reading this correctly, the characters touching the inside of the tokens need to be alphanumeric, or at least non-whitespace.

aengelberg 2016-07-04T17:27:43.000035Z

so *a b c* shouldn't be allowed?

aengelberg 2016-07-04T17:28:24.000036Z

the current grammar that I suggested would allow that. Just trying to get a sense of the range of inputs so I can help design a parser accordingly

seylerius 2016-07-04T17:29:24.000037Z

*foo* *bar* ➡️ [:b "foo" "bar"]
foo* bar* ➡️ "foo* bar*"

seylerius 2016-07-04T17:33:36.000038Z

@aengelberg: that make sense?

aengelberg 2016-07-04T17:34:50.000039Z

for the first example do you mean [:b "foo"] [:b "bar"]?

aengelberg 2016-07-04T17:37:16.000040Z

is there a guarantee that *a**b* won't happen?

seylerius 2016-07-04T17:38:46.000041Z

@aengelberg: Yes. And guarantee? No. Ambiguity in the spec we can lock to an interpretation? Yes.

seylerius 2016-07-04T17:45:17.000042Z

We basically get to decide if that's a pair of bold characters or a flat string we'll leave be.

seylerius 2016-07-04T17:45:28.000043Z

It would only likely happen as a typo.

seylerius 2016-07-04T17:45:41.000044Z

(Or a stupid user)

seylerius 2016-07-04T17:48:07.000045Z

@aengelberg: I'm basically upgrading organum. Sample org file: http://sprunge.us/KBbL

aengelberg 2016-07-04T17:51:01.000046Z

hmm, thinking through how to enforce alphanumeric chars on the insides of tokens.

aengelberg 2016-07-04T17:52:22.000047Z

doing a "lookbehind" on the last * is nontrivial.

seylerius 2016-07-04T18:01:16.000048Z

What if I stripped leading and trailing whitespace before parsing, and modified the base string rule to start and end alphanumeric? Would that be easier?

seylerius 2016-07-04T18:05:37.000049Z

But, no, that wouldn't quite work.

seylerius 2016-07-04T18:11:29.000050Z

@aengelberg: Will the parser ignore escaped tokens, like \*?

seylerius 2016-07-04T18:12:48.000051Z

Ach. Clojure doesn't like \* in a string

seylerius 2016-07-04T18:30:43.000052Z

@aengelberg: Is there any way to mark tokens to not be parsed?

2016-07-04T18:33:35.000053Z

would angle brackets <> to hide parsed elements work?

aengelberg 2016-07-04T18:35:29.000054Z

@seylerius you'd have to do \\* if inside a Clojure string

aengelberg 2016-07-04T18:36:54.000055Z

the goal is to avoid parsing *a * as [:b "a "]

seylerius 2016-07-04T18:37:34.000056Z

@aengelberg: Anything special I have to do to mark that? I just tried parsing \\*foo\\* and got ("\\" [:b "foo\\"])

aengelberg 2016-07-04T18:38:22.000057Z

instaparse doesn't automatically handle backslashes in any special way besides what has been defined in your grammar.

seylerius 2016-07-04T18:41:42.000059Z

Okay. How do you define a simple backslash replacement in this type of grammar, then?

aengelberg 2016-07-04T18:45:59.000060Z

Maybe replace <string> with:

<string> = '\\\\*' | #'[^*/_+=~^_\\\\]+'
user> (inline-markup "a\\* b")
("a" "\\*" " b")

aengelberg 2016-07-04T18:46:17.000061Z

Pretty messy, I know. (four backslashes :face_with_rolling_eyes:)
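The backslash count is easier to follow when the Clojure string layer is peeled off first. The sketch below uses only plain Clojure; the var names are made up. As far as I understand it, the grammar's own string-literal escaping then reduces the remaining two backslashes to one.

```clojure
;; What the Clojure reader produces from the grammar fragment:
;; four backslashes in source become two in the actual string.
(def grammar-literal "'\\\\*'")
(println grammar-literal)   ; the text instaparse actually receives
(count grammar-literal)     ; => 5  (quote, backslash, backslash, star, quote)

;; The test input: two backslashes in source become one in the string.
(def input "a\\* b")
(println input)
(count input)               ; => 5  (a, backslash, star, space, b)
```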

aengelberg 2016-07-04T18:48:03.000062Z

I don't know if this solves your problem though; you don't want to escape *'s in every ** My Subsection text, do you?

aengelberg 2016-07-04T18:49:13.000063Z

sorry if I'm a bit unhelpful; phasing in and out of AFK

seylerius 2016-07-04T18:50:38.000064Z

I'm thinking I'm just going to tell users that if they want a plain * they have to escape it.

seylerius 2016-07-04T18:51:23.000065Z

Headlines are already handled by the time this stage of parsing is invoked, so those won't be an issue.

seylerius 2016-07-04T18:53:21.000066Z

And your special case of *a**b* is apparently already readily converted to ([:b "a"] [:b "b"])

seylerius 2016-07-04T20:11:06.000067Z

@aengelberg: Separate (earlier stage) parser: Is it possible (other than by having respective rules for #'^* ', #'^** ', #'^*** ', etc) to easily produce h1, h2, h3, etc?

seylerius 2016-07-04T20:20:25.000068Z

Actually, yeah. Just don't hide the token, and I can put that through a counter after the fact.
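A rough sketch of that idea, assuming instaparse is available. The rule names, the `headline` parser, and the `heading` function are hypothetical and not taken from the organum code.

```clojure
(require '[instaparse.core :as insta])

;; The stars are kept as a visible token so they can be counted later.
(def headline
  (insta/parser
    "headline = stars <' '> title
     stars = #'\\*+'
     title = #'.*'"))

(defn heading
  "Turns a headline such as \"** Title\" into [:h2 \"Title\"]."
  [line]
  (insta/transform
    {:stars    count      ; number of leading asterisks
     :title    identity
     :headline (fn [level title]
                 [(keyword (str "h" level)) title])}
    (headline line)))

(heading "* Top level")
(heading "*** Deeper section")
```

One rule covers every depth, at the cost of doing the level arithmetic in the transform instead of in the grammar.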