meander

All things about https://github.com/noprompt/meander Need help and no one responded? Feel free to ping @U5K8NTHEZ
timothypratley 2020-01-01T18:21:01.109300Z

I needed to do some HTML scraping in Meander; here is what I came up with:

(ns company.directory
  (:require [hickory.core :as h]
            [meander.strategy.epsilon :as m]))

(def company-directory-page "directory.html")

(def scrape-employee-from-table-row
  "Describes the structure of a table containing employee information."
  (m/rewrite
   [:tr {} _
    [:td {} _ _ _ [:a {} ?name . _ ...] . _ ...]
    _
    [:td {} ?title . _ ...]
    _
    [:td {} _ [:a {:href ?mailto} . _ ...] . _ ...]
    . _ ...]
   ;;=>
   {:name ?name
    :title ?title
    :email ~(subs ?mailto 7)}))

(def scrape-employees-from-table
  "Finds the table rows."
  (m/rewrite
   [:table {} _ _ _ . [:tbody {} _ . !trs _ ...] _ ...]
   ~(map scrape-employee-from-table-row !trs)))

(defn scrape-department-employees
  [department table]
  (for [employee (scrape-employees-from-table table)]
    (assoc employee :department department)))

(def scrape-employee-directory
  "Matches by id and collects per department the tables of employee information."
  (m/rewrite
   [:div {:id "employee_dashboard_directory"} . _ ...
    [:div {} _
     .
     [:h3 {} !department _] _
     [:table {} . _ ... :as !table] _
     ...]
    . _ ...]
   ;;=>
   ~(mapcat scrape-department-employees !department !table)))

(defn find-employee-directory
  "Given a hiccup tree, find the employee directory subtree."
  [hiccup-tree]
  (->> hiccup-tree
       (tree-seq vector? seq)
       (map scrape-employee-directory)
       (remove m/fail?)
       (first)))

(defn fetch-as-hiccup
  "Reads the source and parses it into a hiccup tree."
  [source]
  (-> (slurp source)
      (h/parse)
      (h/as-hiccup)
      (first)))

(defn company-directory
  "Extracts a sequence of employee information from the configured source."
  []
  (-> (fetch-as-hiccup company-directory-page)
      (find-employee-directory)))

jimmy 2020-01-01T18:25:09.112900Z

I've done very similar things, but instead of tree-seq I used $ for most things. I will say that I think it would be better to use match here rather than rewrite for most of these, since you want code on your rhs and aren't using any substitution features.

jimmy 2020-01-01T18:25:58.114Z

Actually, I'm confused about why you need to use fail? here.

jimmy 2020-01-01T18:26:29.114400Z

Oh you are using the strategy rewrite. Got it
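
A minimal sketch of the match-based alternative suggested above, assuming meander.epsilon is on the classpath. The name employee-from-row and the simplified row shape (no whitespace text nodes between cells) are assumptions for illustration, not the structure of the original page. Unlike the strategy rewrite, which returns a failure value that is checked with fail?, m/match throws when no clause matches, so the sketch adds a catch-all clause that returns nil.

```clojure
(ns example.match-sketch
  (:require [meander.epsilon :as m]))

(defn employee-from-row
  "Returns an employee map for a matching row, or nil when the row does not match.
  Hypothetical helper: assumes each row holds exactly three cells."
  [row]
  (m/match row
    ;; The right-hand side is ordinary Clojure code, so no ~ unquote is needed.
    [:tr _
     [:td _ ?name]
     [:td _ ?title]
     [:td _ [:a {:href ?mailto} & _]]]
    {:name ?name
     :title ?title
     ;; Drop the 7-character "mailto:" prefix, as in the original snippet.
     :email (subs ?mailto 7)}

    ;; Catch-all clause so a non-matching row yields nil instead of throwing.
    _ nil))
```

Because a non-matching row yields nil here, callers can use keep or remove nil? rather than filtering on fail?.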

timothypratley 2020-01-01T18:38:52.121200Z

Pros:
• Writing hiccup to match is much nicer than writing dom selectors
• It works

Thoughts:
• Often there is “stuff in between and after that I don’t care about”, so there are lots of _ _ _ and . _ ... which are actually pretty tedious to get right. If you mess up you get no match. In this case it would be nice to have some way to say “Get me these things in the structure I specify, but ignore stuff around them”.
• There is a lot more code in this solution than I expected… probably things I am missing… for example find-employee-directory is code to try to find the first match top down; these seem like very meander concepts but I couldn’t figure out how to use meander features to do it.
• I broke the patterns out into separate operations mainly because I couldn’t figure out how to get it to work any other way… it feels like I over complicated things.
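
One possible way to express the "find the first match top down" step with Meander itself is the $ subtree operator, sketched below under the assumption that meander.epsilon is available. m/find returns the first match or nil, and m/search returns every match. The link-texts helper and the sample shapes are hypothetical, added only to show that $ ignores whatever surrounds the matched subtree.

```clojure
(ns example.find-sketch
  (:require [meander.epsilon :as m]))

(defn find-employee-directory
  "Returns the first subtree that is a :div with the directory id, or nil.
  A sketch that stands in for the tree-seq + fail? pipeline."
  [hiccup-tree]
  (m/find hiccup-tree
    ;; m/$ tries the pattern against the tree and each of its subtrees;
    ;; m/and binds the whole matching subtree to ?directory.
    (m/$ (m/and ?directory
                [:div {:id "employee_dashboard_directory"} & _]))
    ?directory))

(defn link-texts
  "Hypothetical helper: collects [href text] pairs for anchors anywhere in the tree."
  [hiccup-tree]
  (m/search hiccup-tree
    (m/$ [:a {:href ?href} ?text & _])
    [?href ?text]))
```

The trade-off is that $ matches at any depth, so it is less strict about where the element sits than a fully spelled-out pattern.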

timothypratley 2020-01-01T18:39:56.121300Z

I’d love to see alternatives! Do you happen to have your scraping example in a gist or something?

timothypratley 2020-01-01T18:44:45.121500Z

I feel like writing meander to match hiccup patterns works well, but I’m kinda doing it wrong (it took me a long time and my solution feels a bit brittle)

jimmy 2020-01-01T18:46:03.121700Z

I'll try to find some time today to post an example.

timothypratley 2020-01-01T18:47:27.121900Z

thank you! no rush

jimmy 2020-01-01T20:44:53.123200Z

You don’t happen to have an example HTML/hiccup you could share, do you? I’d actually love to tackle the exact problem directly. (I know it might be hard because you have to remove information.)

jimmy 2020-01-01T20:49:50.123400Z

My uses have been a little less systematic and more about grabbing little bits of information and traversing to more pages. Things like:

(m/match archive-index
  (m/$ [:div {:class "pagination"} . _ _ . [:span _ ?first] . _ ... . [:a _ ?last] _ _])
  (range (Integer/parseInt ?first) (inc (Integer/parseInt ?last))))

Yours, though, is a more comprehensive thing, and it would be really great to see what could be done.