I am trying to write a simple grammar that parses comments like `/* some text */`. Is there a way in instaparse to say "any character"?
e.g. "comment = '/*' .* '*/'"
@andrei Instaparse doesn't have a special character for that, but you can use regular expressions to cover any character
e.g. comment = '/*' #'[\\s\\S]'* '*/'
(`#"[\s\S]"` is my personal favorite way to match any character in a regex)
@andrei: Yeah, you'll want something like this:
"comment = <'/*'> #'.*' <'*/'>"
My version hides the comment tokens, though @aengelberg's regexp might be more appropriate.
@aengelberg @seylerius thank you for the suggestions. I think I got a bit misled by the source code, https://github.com/Engelberg/instaparse/blob/master/src/instaparse/abnf.clj#L19-L40 I thought there were some defaults in instaparse
but now reading through the doc strings, these are only to parse the grammar itself https://github.com/Engelberg/instaparse/blob/master/src/instaparse/abnf.clj#L2
a couple things I see in @seylerius's solution:
1) `.` in a regex doesn't include newlines
2) `.*` will greedily match past the `*/` and won't be able to parse the end of a comment
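Both points can be checked at a Clojure REPL with plain `re-matches` and `re-find`, no instaparse needed. This is only a sketch of the regex behavior; the sample string and the var names are made up for illustration.

```clojure
;; 1) `.` does not match a newline by default; `[\s\S]` does.
(def dot-on-newline (re-matches #"." "\n"))        ; nil
(def any-on-newline (re-matches #"[\s\S]" "\n"))   ; "\n"

;; 2) A greedy `.*` runs to the end of the line and swallows the closing `*/`,
;;    so a following '*/' token in the grammar has nothing left to match.
(def comment-body " some text */")                 ; made-up input after the opening /*
(def greedy-match (re-find #".*" comment-body))    ; " some text */"

;; A reluctant quantifier stops at the first `*/` instead.
(def reluctant-match
  (second (re-find #"([\s\S]*?)\*/" comment-body))) ; " some text "
```

The reluctant form is one possible fix; whether it fits depends on how the rest of the grammar consumes the closing token.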
@andrei Sorry for the misleading code. Those constants are available but only to the ABNF format.
EBNF is the default
are there constants for ebnf? looking at the code I think not
@andrei A point to keep in mind with @aengelberg's solution is that you'll need to condense the individual characters of the output.
@seylerius @aengelberg is there a way for specifying in instaparse to group matches together, s.t. one doesn’t need to condense the matches?
yeah, thanks for clarifying that @seylerius
You'll get output like [:comment "f" "o" "o" " " "b" "a" "r"] from input like /*foo bar*/
exactly
there are ways to use transform and apply str on it
Yep.
@andrei The official specification for ABNF is more strict and specific than EBNF, and it dictates that those constants are available. EBNF is more of an ambiguous mashup of a variety of standards we were able to find on the internet
it just feels that there should be a grammar direct way
So there are no constants in EBNF, since none of the EBNF resources we found seemed to indicate such
And remember to wrap your comment tokens in <> like I did, so you don't save the markup itself.
Sadly there is no grammar direct way to concat the strings
Transform works pretty well, though.
hmm, or a more elaborate regexp
I am using smth like this for strings
<string> = dqoute #'([^"\\]|\\.)*' dqoute
<dqoute> = <'\"'>
(insta/transform {:comment (partial apply str)} (comment-parser input-data))
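A transform function is called with the node's children as separate arguments, which is why `(partial apply str)` ends up joining them. The sketch below mimics that on a hand-written tree without calling instaparse at all; `condense-comment` is a hypothetical helper, not part of the library.

```clojure
;; Hiccup-style tree shaped like the output quoted earlier.
(def tree [:comment "f" "o" "o" " " "b" "a" "r"])

;; The children are what the function registered under :comment would receive.
(def children (rest tree))

(def joined-with-str (apply str children))                     ; "foo bar"
(def joined-with-partial (apply (partial apply str) children)) ; "foo bar"

;; Hypothetical variant that keeps the :comment tag around the joined text.
(defn condense-comment [& parts]
  [:comment (apply str parts)])

(def condensed-node (apply condense-comment children))         ; [:comment "foo bar"]
```

Since the children arrive as separate arguments, plain `str` should work as the transform function too. Note that either way the node is replaced by whatever the function returns, so a bare string comes back unless the function rebuilds the vector as `condense-comment` does.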
and probably the performance impact is small if one applies transforms
Lolyep. Far as I can tell, instaparse does a good job with efficient transforms.
it depends on the size of the file. Probably actually creating all those individual strings is going to be the bottleneck rather than concatenating them later
I must admit I was led astray by the question of which is more efficient, regexps or transforms, although I think it's a very premature optimisation
A regex is a sensible solution if you can get it right 🙂
My first thought is to do a negative lookahead for `*/` as part of the regex
Trouble is, from what I've found, that the `*/` will get eaten in the `.*`
And the negative lookahead will pass because the end token was already eaten
so more reg exp magic for me to look into. to give a bit more context I am playing around with parsing localizable strings.
/* This is a comment */
"hello" = "Hello!";
/* This is another comment */
"click_button" = "Click";
/* Title bar, prints the number of selected products (The translation should be short due to the limit of 100 characters for the title of the mobile app) */
"bar_print_$_selected_products" = "You Selected %@ Products";
just an experiment, nothing production related.
@aengelberg @seylerius thank you for your help, so far I enjoyed using instaparse. It's cool that I can use some things that I learned in college to do some useful things
although I must say that I need to re-learn things about parsers and defining grammars
@seylerius I meant a regex negative lookahead, i.e. #".*(?!\*/)" or something
@andrei glad you're having fun! feel free to ask here if you have any more questions
@aengelberg: That's what I thought. It winds up eating the end-token in the `.*` and passes the negative lookahead anyway. I was fighting that with the headline parser in organum over the weekend, when I was trying to get it to parse tags.
oh, I guess the regex would pass, saying "here's a sequence of characters (including `*/`), and look, there is not a `*/` *after* these characters!"
Bingo
so maybe #"((?!\*/).)*"
that would generate a bunch of match groups though due to the ()
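The difference between the two lookahead placements, and the effect of the parentheses, can be seen with plain `re-find`. This is a sketch with `*/` as the end token and a made-up input; `[\s\S]` stands in for `.` so newlines are covered, and `(?:...)` is a non-capturing group, one way to avoid the extra match group.

```clojure
(def input " foo */ bar")   ; made-up text following an opening /*

;; Greedy `.*` followed by a lookahead: `.*` eats the `*/` too, and at the end
;; of the input nothing follows, so the negative lookahead succeeds.
(def greedy-then-lookahead (re-find #".*(?!\*/)" input))     ; " foo */ bar"

;; Checking the lookahead before every single character stops at `*/`.
;; With a capturing group, re-find returns [whole-match last-group].
(def tempered-capturing (re-find #"((?!\*/)[\s\S])*" input)) ; [" foo " " "]

;; A non-capturing group returns just the matched string.
(def tempered (re-find #"(?:(?!\*/)[\s\S])*" input))         ; " foo "
```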
Gah, lemme see what I did for that in the tags in organum.
https://github.com/seylerius/organum/blob/master/src/organum/core.clj
Yeah, ordered choice wound up featuring heavily.
Maybe (<'*/'> / #'.')+ ?
Always prefer to end a comment if possible, otherwise continue eating characters?
Wait, not quite
That'll continue past the end.
Ach. I need to drive back to the store; I'm done with this client. Check in with y'all in about ten.
I will also catch up with you guys a bit later too or early tomorrow, it's getting a bit late here in Berlin.
Have a good one.