I’m trying to use Planck for processing a huuge file. I thought it might be cool to make it stream-based and do ./blah.cljs < file > output instead of reading the file, processing, and then writing
I’m trying to decipher the planck.io docs but I don’t understand how to get the stdin and stdout streams
@nooga If your processing is textual and line-based, planck.core/read-line might be useful
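(For reference, a minimal sketch of line-based processing with read-line; untested, and it assumes only that read-line returns nil at end of input:)

(require '[planck.core :refer [read-line]])

(loop []
  (when-some [line (read-line)]  ;; read-line returns nil at EOF
    (println line)               ;; replace with real per-line processing
    (recur)))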
An interesting thing that may easily occur when processing an absolutely huge file this way is head-holding.
My immediate thought on that issue is to try to build something that reduces on (iterate (fn [_] (planck.core/read-line)) nil)
I’ve got ~300MB of stanzas like:
AA=12345678
BBA=12345678 CCC=12345678
and I basically need to make it so that they end up as 12345678 12345678 12345678 on separate lines
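(For reference, the per-stanza transformation might be expressed something like this; an untested sketch, and the #"=(\d+)" pattern is an assumption about the value format:)

(require '[clojure.string :as string])

;; Turn one two-line stanza into a single line of its values, e.g.
;; ["AA=12345678" "BBA=12345678 CCC=12345678"] -> "12345678 12345678 12345678"
(defn stanza->values [stanza]
  (string/join " " (map second (mapcat #(re-seq #"=(\d+)" %) stanza))))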
Ahh, that's cool, perhaps a partition transducer could help get the pairs of lines.
Also, to really go the transducer route, you'd need the reducible iterate that is in ClojureScript head, which isn't yet in the shipping Planck. (It is easily built, though, via script/pilot in the Planck source tree.)
@nooga The reason I mention holding head is that if blah.cljs looked like
(require '[planck.core :refer [line-seq *in*]])
(run! println (partition 2 (line-seq *in*)))
Then it would print the pairs of lines, and arguably be a clean streaming solution. But it will still hold all lines in memory, if that's a concern.
it may be since these files are huuge 😉
And in that case, the new iterate may be self-hosted's friend 🙂
I’m writing an OpenRISC emulator in Java to have Linux running inside the JVM, and my main method of debugging is comparing CPU state logs from my emu and OpenRISC QEMU
300 MB should easily fit in RAM. The transducer approach is fun to mess around with though.
yeah, got 16 GB of RAM here but somehow this feels dirty 😄
I tried sed but it drove me crazy
I agree. The only reason ClojureScript doesn't clear locals is that there hasn't been much demand for it. Maybe if self-hosted ClojureScript becomes popular, that could create some demand. In the meantime, I've been exploring the "reducible" route, if that makes sense. In other words, you could transduce on the sequence produced by iterate without consuming RAM. The only dirty thing about that approach for this problem is that you'd need to write to stdout as a side effect of the reduction 😞
@nooga I'm checking to see if this doesn't consume RAM:
(require '[planck.core :refer [read-line]])

(transduce (comp (drop 1)           ;; skip the initial nil seed
                 (take-while some?) ;; stop at EOF, where read-line yields nil
                 (partition-all 2)) ;; group the lines into pairs
           ;; completing supplies the arity-1 completion step transduce expects
           (completing (fn [_ x] (println x)))
           nil
           (iterate (fn [_] (read-line)) nil))
cool, I settled for a simple loop
and it did the job… slowly
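(For reference, such a simple loop might look roughly like this; a hypothetical reconstruction that reads one two-line stanza per iteration with read-line, so no earlier lines are retained:)

(require '[planck.core :refer [read-line]]
         '[clojure.string :as string])

(loop []
  (let [a (read-line)
        b (read-line)]  ;; one two-line stanza per iteration
    (when (and a b)
      ;; print the stanza's values space-separated, as described above
      (println (string/join " " (map second (mapcat #(re-seq #"=(\d+)" %) [a b]))))
      (recur))))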
Cool. FWIW, Planck also has -s, -f, and -O simple as ways to try to make things run faster.
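(So presumably an invocation along these lines; the flags are the ones named above, though combining all three this way is just an assumption:)

planck -s -f -O simple blah.cljs < file > output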
nice! didn’t know that
ah, I converted the files and tried to use them but now I see that they’re rubbish :F
debugging the Linux kernel on a CPU that you wrote is no fun
😄
Hah
esp. after writing mostly Clojure and functional langs for the last 3 years
Well, FWIW, the transducer approach using iterate (with ClojureScript master) doesn't consume RAM
that’s awesome!
thanks for checking it out 🙂
On Planck master, line-seq is directly reducible. This allows reducing over gigantic files without consuming RAM, avoiding ClojureScript head-holding.
This example is over a 1 GB file.
cljs.user=> (require '[planck.core :refer [line-seq]]
       #_=>          '[planck.io :as io]
       #_=>          '[clojure.string :as string])
nil
cljs.user=> (reduce
       #_=>   (fn [c line]
       #_=>     (cond-> c
       #_=>       (string/starts-with? line "a") inc))
       #_=>   0
       #_=>   (line-seq (io/reader "big.txt")))
134217728