clojure-europe

For people in Europe... or elsewhere... UGT https://indieweb.org/Universal_Greeting_Time
dharrigan 2020-11-12T07:25:27.356900Z

Good Morning!

ordnungswidrig 2020-11-12T08:09:36.357100Z

Good morning

2020-11-12T08:23:46.357300Z

Morning

ordnungswidrig 2020-11-12T08:24:12.357500Z

moin

borkdude 2020-11-12T08:41:55.357700Z

morning

jasonbell 2020-11-12T09:21:07.357900Z

Morning

raymcdermott 2020-11-12T13:51:36.358800Z

good AM me

borkdude 2020-11-12T13:52:33.359Z

morning.

2020-11-12T14:40:47.360100Z

@ordnungswidrig cool. Glad it worked. What stack did you use in the end?

ordnungswidrig 2020-11-12T14:41:18.360900Z

tech.ml and vega

2020-11-12T14:41:27.361100Z

Anyone here got a good suggestion for something like nippy but that writes out records in a file rather than just a big take it or leave it data structure?

ordnungswidrig 2020-11-12T14:41:43.361700Z

file per record?

2020-11-12T14:41:43.361800Z

I've done something before with baldr and record separators, but that felt a bit janky

2020-11-12T14:42:24.362500Z

file per record would overwhelm the OS file handles I think. There are about 2-10 million records

2020-11-12T14:42:59.362600Z

I'd really be into seeing how you did that if you can share the repo

2020-11-12T14:44:00.363800Z

I like the speed of nippy, and the compression is pretty good too, but I lose a lot of compression by needing to split things up and I lose a lot of file efficiency by having each file be a single vector of records that gets read in

plexus 2020-11-13T08:37:40.376100Z

probably not the performance you are looking for, but this is the main reason for ednl https://github.com/lambdaisland/edn-lines
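(Editor's note: the idea behind newline-delimited EDN is simple enough to sketch with plain clojure.edn; this is a hypothetical sketch of the pattern, not edn-lines' actual API:)

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Write one EDN form per line, so records can be read back
;; individually instead of as one big data structure.
(defn write-ednl [file records]
  (with-open [w (io/writer file)]
    (doseq [r records]
      (.write w (pr-str r))
      (.write w "\n"))))

;; Read one record per line; doall realizes the seq
;; before the reader is closed.
(defn read-ednl [file]
  (with-open [rdr (io/reader file)]
    (doall (map edn/read-string (line-seq rdr)))))
```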

2020-11-13T08:57:39.377200Z

thx 🙂

ordnungswidrig 2020-11-12T15:10:38.363900Z

let me collect this into a gist

2020-11-12T15:20:33.364100Z

😄

ordnungswidrig 2020-11-12T15:34:36.364500Z

it might run out of the box

ordnungswidrig 2020-11-12T15:34:37.364700Z

😛

2020-11-12T18:30:58.369Z

as this is often the eduction channel, I've been looking at @ben.hammond's blog post here: https://juxt.pro/blog/ontheflycollections-with-reducible and thinking that you don't need a reducible for the directory of files, just a reducible for each file type; you can then have a vector of eductions over those reducibles, which gives you all your short-circuiting/reduced? functionality if you do something like

(eduction ;; changed from sequence thanks to Ben Hammond's advice
  cat
  [(eduction mappify-record (reducible-type-1 file-1))
   (eduction mappify-record (reducible-type-1 file-2))])

2020-11-12T18:31:46.370100Z

you can replace sequence with eduction depending on whether or not you want to have the results in memory or recalculate them each time (from what I understand)

2020-11-12T18:32:11.370400Z

(errors of misunderstanding of the blog post are mine)

2020-11-12T18:33:36.370700Z

I think this simplifies the chaining-reducible bit. I think

2020-11-12T18:33:57.370800Z

I'm sure it is as bug free as all code is

2020-11-12T18:34:50.371200Z

the real magic is happening in cat
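(Editor's note: `cat` here is the concatenating transducer from clojure.core; a minimal illustration with toy data:)

```clojure
;; cat flattens one level of nesting.
(into [] cat [[1 2] [3 4] [5 6]])
;; => [1 2 3 4 5 6]

;; Early termination still works: take-while wraps the result in
;; reduced, stopping the reduction before the later inner
;; collections are touched.
(transduce (comp cat (take-while #(< % 4))) conj [] [[1 2] [3 4] [5 6]])
;; => [1 2 3]
```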

2020-11-12T18:46:43.371300Z

looks like transit, based on fressian, might be the sweet spot? Looks like you can read and write individual objects from a stream. https://cognitect.github.io/transit-clj/#cognitect.transit/read
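(Editor's note: a hedged sketch of the record-per-object pattern with transit-clj; it assumes com.cognitect/transit-clj is on the classpath, and it relies on transit/read throwing when the stream is exhausted, which the loop catches to stop:)

```clojure
(require '[cognitect.transit :as transit])
(import '[java.io FileInputStream FileOutputStream])

;; Write records one at a time through a single writer.
(defn write-records [path records]
  (with-open [out (FileOutputStream. path)]
    (let [w (transit/writer out :msgpack)]
      (doseq [r records]
        (transit/write w r)))))

;; Read records back one at a time; transit/read throws at
;; end-of-stream, which terminates the loop.
(defn read-records [path]
  (with-open [in (FileInputStream. path)]
    (let [r (transit/reader in :msgpack)]
      (loop [acc []]
        (let [rec (try (transit/read r)
                       (catch Exception _ ::eof))]
          (if (= rec ::eof)
            acc
            (recur (conj acc rec))))))))
```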

2020-11-12T18:47:28.371500Z

and there is a reducible friendly wrapper already https://gitlab.com/pjstadig/reducibles

2020-11-12T18:54:20.374Z

An eduction of a reducible might not implement ISeq, at which point things start breaking
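(Editor's note: this is easy to reproduce with a reducible that implements only IReduceInit; reducing the eduction works, but anything that needs a seq or iterator fails:)

```clojure
;; A reducible that supports IReduceInit and nothing else.
(def r (reify clojure.lang.IReduceInit
         (reduce [_ f init]
           (reduce f init [1 2 3]))))

;; Reducing the eduction is fine: it goes through IReduceInit.
(reduce + 0 (eduction (map inc) r))
;; => 9

;; But seq-style access needs an iterator, which a bare
;; IReduceInit cannot provide, so this throws.
;; (seq (eduction (map inc) r))
```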

2020-11-12T19:55:51.374300Z

Ah, TIL.