@seylerius this is a good place for that.
You could put further "insta/parse"s in the functions inside the "insta/transform" map
Wat
This is awesome.
(insta/transform {:x (fn [s] (insta/parse otherparser s))} (insta/parse firstparser s))
Hard to bang out a good example on mobile
Lolyep.
That looks fascinating.
It would get weird if the nested parser had an error though.
Yeah.
So how deep does it go looking for :x?
And how do you make it check for loose strings?
It does a full traversal of the hiccup / enlive, as long as all structures around the :x are valid hiccup / enlive
Nice
@aengelberg: How do you get solo strings?
Gah, what's wrong with this parser? doc-metadata works fine, but running headlines on the remaining content just returns flat content. https://github.com/seylerius/organum
@aengelberg: Got any clues?
Simple reproduction: (headlines (last (doc-metadata (slurp "sample.org"))))
It's something in the h token, because that's the last thing I changed before it started failing.
At a first glance, the #'.+' looks suspicious to me. Is greediness biting you here? (Did not try it out, though)
@seylerius the regex you put for :content is probably not what you want. Due to the (?s) flag, it seems to match everything including newlines, as long as the first character is not a *.
I'm not sure what your desired behavior is though.
BTW, both the first ^ and the ? in your regex appear redundant, if I understand it correctly.
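A minimal sketch of the (?s) effect described above, using plain Clojure regexes rather than the actual organum grammar. The pattern [^*].* and the sample text are assumptions made up for illustration; the real :content regex may differ.

```clojure
;; Hypothetical sample: a non-headline line followed by a headline.
(def sample "intro line\n* Headline\nbody")

;; Without (?s), `.` stops at a newline, so only the first line matches.
(def line-only (re-find #"[^*].*" sample))

;; With (?s), `.` also matches newlines, so the match swallows the headline.
(def everything (re-find #"(?s)[^*].*" sample))

(println (pr-str line-only))
(println (pr-str everything))
```

If the real rule behaves like the second pattern, a single content match would cover the whole document, which fits the flat output reported earlier.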
The content regexp is fine. It's after I changed a few things to tidy up :h and added tag parsing that it started failing.
Basically, a headline starts with some number of stars. Everything else isn't a headline.
I cloned your project and am looking at that parser. Is there a different version / branch I missed?
Nope, I pushed the latest version just before I spoke up today.
Sorry I may have been unclear. When I said :content I meant the content inside the headlines parser.
Not the doc-metadata parser
As an experiment I removed all the hide-tags from the headlines parser, since I got that behavior you were talking about (flat content). That exposed the headlines' :content rule as being greedy.
organum.core> (headlines content)
[:S [:token [:content "This is an attempt...
Yep. I've got an ordered choice making it prefer to define a section (headline then content) if possible, and just content if not. The defining difference between content and headline is whether it starts with stars.
Although, Hmmm. You've got a point about the mode there.
I think this is what happened:
- The section rule failed at the start of the string
- It then fell back to the content rule due to ordered choice
- The content rule mistakenly parses the whole string (for the reason I mentioned above)
- Parse is done
Yeah. You're right. Making the content rule less accepting (not (?s)) fixes that part, and now I'm seeing failures to parse the first headline. Joy.
How does instaparse play with non-capturing groups?
Not familiar with that term; are you referring to the groups returned by a Java regex match?
Non-capturing groups are for saying, "this should be here, but don't return it in a group"
Okay, new push. Can't manage to get tags out separate.
oh, you mean things like regex lookahead and lookbehind?
They work if I make them mandatory, but get eaten by the headline body if they're optional. Would lookahead allow saying "if there's whitespace followed by a colon, stop here"?
This is the instaparse source code that applies regexes, may shed some light on whether certain constructs would work. https://github.com/Engelberg/instaparse/blob/master/src/instaparse/gll.clj#L670
I would expect regex non matching lookaheads to work, but non-matching lookbehinds to NOT work. Instaparse runs a regex match on the substring of the current index onward, so previous characters are invisible. EDIT: I misunderstood the term "non-matching"
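A small sketch of why a lookbehind would lose its context, assuming the regex really is applied to the input from the current index onward as described. Here subs stands in for whatever instaparse does internally; the strings are made up for illustration.

```clojure
(def text "ab")

;; On the full string, the lookbehind can see the preceding "a".
(def full-match (re-find #"(?<=a)b" text))

;; On the substring starting at index 1, the "a" is gone, so the
;; lookbehind has nothing to look at and the match fails.
(def sub-match (re-find #"(?<=a)b" (subs text 1)))

(println (pr-str full-match))
(println (pr-str sub-match))
```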
I see you're using (?:) now. I don't think "non capturing" is what you want
I think you're right.
organum.core> (re-find #"a" "a")
"a"
organum.core> (re-find #"(?:a)" "a")
"a"
What's weird is non-greedy options fail entirely.
(?:) basically means, if there are any other groups () inside that block, DON'T return them as an additional output.
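To make the difference visible in plain Clojure (no instaparse involved), here is a short sketch comparing a capturing and a non-capturing group. The pattern and input are illustrative only.

```clojure
;; A capturing group makes re-find return a vector: whole match, then groups.
(def with-group (re-find #"(a)b" "ab"))

;; A non-capturing group matches the same text but reports no extra group,
;; so re-find returns just the matched string.
(def without-group (re-find #"(?:a)b" "ab"))

(println (pr-str with-group))
(println (pr-str without-group))
```

Both patterns consume exactly the same characters; only the reported groups differ.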
Ah, it looks like negative lookahead is the trick.
(?!=) ?
the ?: flag shouldn't affect Instaparse's usage of regexes at all. Instaparse throws away match groups
(?!\\s+:) seems legit
Nope. Pushing. Still eats the tags.
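One possible reason a negative lookahead can still eat the tags, sketched with plain Clojure regexes. The patterns and the headline string are assumptions for illustration, not the regexes actually in organum.

```clojure
(def headline "The First Section :foo:bar:")

;; Greedy .+ runs to the end of the line first; the lookahead is then
;; checked only at that final position, where it trivially succeeds.
(def eats-tags (re-find #".+(?!\s+:)" headline))

;; Checking the lookahead before every character stops the match
;; as soon as whitespace followed by a colon comes next.
(def stops-early (re-find #"(?:(?!\s+:).)+" headline))

(println (pr-str eats-tags))
(println (pr-str stops-early))
```

The second form applies the lookahead per character instead of once after the greedy run, which is the part a trailing lookahead alone does not give you.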
hmm
Pushed
need to run now, can probably help more in an hour or so. I'd say the next step is manually parsing the regexes on the strings.
and try gradually taking characters away from the regex to see what the problem is
Okay, thanks for the help. Talk with ya when you've got time.
feel free to dump any further findings here
Will do. Slack has persistence, which is pretty handy
Okay, trying reluctance means I only get the first character of the headline, and the rest becomes part of the content.
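The one-character result is what a reluctant quantifier does when nothing after it has to match. A minimal sketch in plain Clojure, with a made-up headline and patterns that are assumptions rather than the organum ones:

```clojure
(def headline "Headline text :tag:")

;; With nothing required after it, a reluctant .+? is satisfied by one character.
(def first-char (re-find #".+?" headline))

;; Giving it something it must reach (here: whitespace plus a colon, or end
;; of input) forces it to grow until that condition holds.
(def title (re-find #".+?(?=\s+:|$)" headline))

(println (pr-str first-char))
(println (pr-str title))
```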
Trying lookahead seems to just fail.
Okay, tags are mostly fixed, but it's only grabbing the first one.
Pushed.
Would appreciate a look when you have time, @aengelberg
Ach. It's also not getting second headlines. They're turning into content lines due to newline weirdness.
Pushed again. Fixed newline weirdness
Hah, fixed it. Required post-tag newline/whitespace.
Gah. Org is a beautiful format, but it's a bitch to parse.
The parser breaks if I put this into the file:
* The First : Section :foo:bar:
Not sure if that's valid org-mode.
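A hedged sketch of one way to keep a stray colon in the title from being mistaken for tags: only treat a run of :name: items as tags when it sits at the very end of the line. The function name split-headline and the tag character class [^\s:]+ are assumptions for illustration, not org-mode's exact rules and not the organum grammar.

```clojure
;; Hypothetical helper: splits a headline body into title and trailing tags.
;; Group 1 is the title (reluctant), group 2 the optional tag run at end of line.
(defn split-headline [s]
  (let [[_ title tags] (re-matches #"(.*?)(?:\s+(:(?:[^\s:]+:)+))?\s*" s)]
    {:title title :tags tags}))

(def tricky (split-headline "The First : Section :foo:bar:"))
(def plain  (split-headline "Plain headline"))

(println (pr-str tricky))
(println (pr-str plain))
```

Because re-matches has to consume the whole line, the reluctant title keeps growing past the lone " : " until the remainder is a well-formed tag run or nothing at all.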