instaparse

If you're not trampolining your parser, why bother getting up in the morning?
sova-soars-the-sora 2021-04-18T20:41:23.009200Z

Hi. I was wondering, is there a way to do fuzzy matching or fuzzy parsing? What I mean is that I want to draw rectangles around Japanese text, effectively parsing it, but I would like to be able to parse even if some terms are unknown or undefined. Is there a way to do that in instaparse, a way to parse with fuzziness so that not every single term in the parse input has to be defined by the rules?

aengelberg 2021-04-18T20:44:06.011200Z

Sadly no. Instaparse was designed to turn strings into data using a well-defined language, so partial matching and fuzzy matching aren't well-supported.

sova-soars-the-sora 2021-04-18T20:46:14.011800Z

No problem. I'm wondering how I can do this ^.^ Maybe I can pre-process everything and do a sort of mini dictionary prep step.

sova-soars-the-sora 2021-04-18T20:51:56.014100Z

So if I do a dictionary scan of the input text, I think it is smartest to start with longest strings first

sova-soars-the-sora 2021-04-18T20:52:37.015Z

match all the 7-letter words, 6-letter words, 5-letter words, and so on.

sova-soars-the-sora 2021-04-18T20:54:43.016700Z

maybe just cut it into slices? "Shewenttothemuseum" -> "Shewent" (no results), "emuseum" (no results)... but then "Shewent" (also no results) .... "museum": result found. mark it. keep it moving. Kinda like a sieve of Eratosthenes, but on text
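
A minimal sketch of the longest-slices-first idea above, in Python rather than Clojure. The function name `sieve`, the `max_len` cutoff of 7, and the toy dictionary are illustrative assumptions, not anything from instaparse. It tries every slice of length 7, then 6, and so on, marks a span when the slice is in the dictionary, and skips slices that overlap a span already marked. Anything left unmarked is an unknown term.

```python
def sieve(text, dictionary, max_len=7):
    """Return sorted (start, end, word) spans, trying the longest slices first."""
    taken = [False] * len(text)  # which characters are already claimed
    spans = []
    for n in range(min(max_len, len(text)), 0, -1):
        for i in range(len(text) - n + 1):
            piece = text[i:i + n]
            # Accept a dictionary hit only if it does not overlap an earlier one.
            if piece in dictionary and not any(taken[i:i + n]):
                spans.append((i, i + n, piece))
                taken[i:i + n] = [True] * n
    return sorted(spans)


# Hypothetical toy dictionary, for illustration only.
words = {"museum", "went", "the", "She", "to", "he", "we"}
spans = sieve("Shewenttothemuseum", words)
```

Here "museum" is claimed first (length 6), then "went", then "She" and "the", then "to"; the shorter "he" and "we" are rejected because they overlap spans that were already marked.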

sova-soars-the-sora 2021-04-18T20:56:53.018600Z

Making m stringlets of size n from a string sounds like it's linear in the size of the input, so we could probably do pretty large datasets, but maybe not a whole novel conveniently this way. Hmm, I suppose it is easy if we split on sentence ends (periods 。) and then do the sieve approach on each sentence
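
Splitting on sentence ends could look roughly like this in Python. The name `split_sentences` is made up for the example, and treating only "。" and "." as sentence ends is an assumption that will mis-split things like decimal numbers. Each resulting sentence can then be scanned on its own.

```python
import re

# Assumption: a sentence is a run of non-terminator characters,
# optionally followed by one terminator ("。" or ".").
SENTENCE = re.compile(r"[^。.]+[。.]?")


def split_sentences(text):
    """Split text into sentences, keeping the closing period on each one."""
    return [s for s in SENTENCE.findall(text) if s.strip()]


sentences = split_sentences("彼女は行った。美術館は広い。")
```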

sova-soars-the-sora 2021-04-18T20:57:29.018800Z

This might actually work pretty darn well!

sova-soars-the-sora 2021-04-18T20:58:16.020300Z

Preprocess the input with a sieve + dictionary lookup, figure out the nouns and verbs, throw them into the rules, then try and run the parse on it. i'll still need some core rules for grammar, but the idea is to have a lot of them hard-coded

aengelberg 2021-04-18T20:59:10.021300Z

A regex could be a good fit to quickly scan for valid dictionary words. Some regex libraries let you compile a large union of words ( #"word1|word2|word3|..." ) into a finite state machine that can do a linear-time scan of text.
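
A rough sketch of the word-union regex in Python. `build_scanner` and `scan` are hypothetical helper names. One caveat: Python's built-in `re` module is a backtracking engine, not one of the libraries that compile to a finite state machine, so this shows the shape of the idea without the linear-time guarantee. Words are sorted longest first so that, at any position, a longer word is preferred over a shorter one that shares its prefix.

```python
import re


def build_scanner(words):
    """Compile a union of literal words, longest alternatives first."""
    ordered = sorted({w for w in words if w}, key=len, reverse=True)
    # re.escape keeps characters like "." or "|" in a word from acting as regex syntax.
    return re.compile("|".join(re.escape(w) for w in ordered))


def scan(text, scanner):
    """Return (start, end, word) for each non-overlapping match, left to right."""
    return [(m.start(), m.end(), m.group()) for m in scanner.finditer(text)]


scanner = build_scanner(["museum", "went", "the", "She", "to", "he"])
hits = scan("Shewenttothemuseum", scanner)
```

Unlike a longest-slices-first sieve, this scan works left to right: it takes the longest word that matches at the current position and moves on, so unknown stretches of text are simply skipped over.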

sova-soars-the-sora 2021-04-18T21:02:13.022Z

ohhh cool. that's a really neat idea. i think i might need to use web lookups but if i keep tabs on those results they could go into such a regex.