specter

Latest version: 1.1.3
nathanmarz 2018-01-08T00:17:28.000002Z

@alexisvincent great to hear

2018-01-08T08:59:17.000238Z

New to Specter. I'm scraping http://docs.h2o.ai/h2o/latest-stable/h2o-docs/rest-api-reference.html to build a vector of maps, where each map will have keys for :http-verb, :rest-path, :inputs, and :outputs. Another challenge is that the HTML appears to be in 4 conceptual sections: 1) a section of a href links with REST endpoints, 2) a section of h2 headings with the HTTP verb and REST endpoint, each followed by a table with Input and Output, 3) a section of a href links with schema nouns, and 4) a final section of h2 headings with the schema noun name, each followed by a table of keys and their descriptions. How might I keep the four sections separate before combining them? I'm also unclear whether I should use select, collect, codewalker, or continue-then-stay to collect and surface nested pieces of information. Thanks in advance.

2018-01-08T09:04:17.000288Z

https://pastebin.com/q0RQzLic

nathanmarz 2018-01-08T14:46:15.000252Z

@aaelony you're going to have to be more specific

nathanmarz 2018-01-08T14:46:53.000282Z

you want to use specter to extract information out of html?

nathanmarz 2018-01-08T14:47:23.000299Z

can you paste a sample of the html you're scraping, and what you want as output?

2018-01-08T16:39:34.000381Z

ok, let me take some time to formulate a better question.

2018-01-08T20:20:43.000386Z

hi @nathanmarz, here is the code in clojure that I'm wondering how to produce in Specter.

2018-01-08T20:20:45.000285Z

(ns testing
  (:require [net.cgrand.enlive-html :as html]
            [org.httpkit.client :as http]
            [clojure.string :as str]))

(->> (html/html-snippet
      (:body @(http/get "http://docs.h2o.ai/h2o/latest-stable/h2o-docs/rest-api-reference.html"
                        {:insecure? false})))
     ;; Descend html -> body -> div, taking the first matching node at each level.
     (filterv #(= (:tag %) :html))
     first
     :content
     (filterv #(= (:tag %) :body))
     first
     :content
     (filterv #(= (:tag %) :div))
     first
     :content
     (filterv #(= (:tag %) :h2))
     ;; Split each h2 text on the space into verb and endpoint; inputs are the {name} parts.
     (mapv #(let [[verb endpoint] (-> % :content first (str/split #" "))
                  inputs (when endpoint
                           (re-seq #"\{(.*?)\}" endpoint))]
              {:verb verb :endpoint endpoint :inputs inputs}))
     ;; Keep only headings that start with an HTTP verb.
     (filterv #(or (= (:verb %) "GET")
                   (= (:verb %) "POST")
                   (= (:verb %) "DELETE")
                   (= (:verb %) "HEAD"))))
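
One possible way to express the same traversal as a single Specter path is sketched below. It is only a sketch, not an answer given in this thread: `tag=`, `h2->endpoint`, `known-verb?`, `endpoints`, and `sample-nodes` are hypothetical names introduced for illustration, and the sample data is made up so the example runs without fetching the page. It assumes Enlive-style nodes (maps with :tag and :content) and that the first child of each h2 is a string, as the code above does.

```clojure
(ns testing.specter-sketch
  (:require [clojure.string :as str]
            [com.rpl.specter :refer [select ALL FIRST filterer view]]))

;; Hypothetical helper (not part of Specter or Enlive):
;; builds a predicate that matches nodes with the given :tag.
(defn tag= [t]
  (fn [node] (= (:tag node) t)))

;; Hypothetical helper: turns one h2 node into the same map as the mapv step above.
(defn h2->endpoint [node]
  (let [[verb endpoint] (str/split (first (:content node)) #" ")]
    {:verb     verb
     :endpoint endpoint
     :inputs   (when endpoint (re-seq #"\{(.*?)\}" endpoint))}))

(def verbs #{"GET" "POST" "DELETE" "HEAD"})

;; Hypothetical helper: true when the parsed heading starts with an HTTP verb.
(defn known-verb? [m]
  (contains? verbs (:verb m)))

(defn endpoints
  "Selects endpoint maps from a seq of Enlive-style nodes."
  [nodes]
  (select [(filterer (tag= :html)) FIRST :content   ; first html node
           (filterer (tag= :body)) FIRST :content   ; first body node
           (filterer (tag= :div))  FIRST :content   ; first div node
           ALL (tag= :h2)                           ; every h2 child of that div
           (view h2->endpoint)                      ; parse the heading text
           known-verb?]                             ; keep only HTTP verbs
          nodes))

;; Made-up sample data in the shape Enlive produces.
(def sample-nodes
  [{:tag :html
    :content [{:tag :head :content []}
              {:tag :body
               :content [{:tag :div
                          :content [{:tag :h2 :content ["GET /3/Frames/{frame_id}"]}
                                    {:tag :table :content []}
                                    {:tag :h2 :content ["FrameV3"]}]}]}]}])

(endpoints sample-nodes)
```

Here `(filterer pred) FIRST` stands in for each `filterv` + `first` pair, a function in the path acts as a filter, and `view` applies the parsing step. If taking only the first html/body/div is not important, `ALL (tag= :div) :content` style steps would visit every match instead.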