clojure

New to Clojure? Try the #beginners channel. Official docs: https://clojure.org/ Searchable message archives: https://clojurians-log.clojureverse.org/
Nazral 2021-02-15T02:13:11.135600Z

@dpsutton I'm crawling some data and generating ~ 20GB of uncompressed data per day (the watcher appends to gzipped version of the file directly). There are about 5000 files that I am writing to
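
A quick sketch of that append-to-gzip mechanic, assuming plain java.util.zip (append-gzipped! is a hypothetical helper, not from the thread): concatenated gzip members are themselves valid gzip, so each append can write one complete member to the end of the file.

(require '[clojure.java.io :as io])
(import '(java.io FileOutputStream)
        '(java.util.zip GZIPOutputStream))

(defn append-gzipped!
  [fname ^String s]
  ;; FileOutputStream's second argument opens the file in append mode;
  ;; each call writes a self-contained gzip member, and gunzip reads
  ;; concatenated members back as one stream
  (with-open [out (GZIPOutputStream. (FileOutputStream. (io/file fname) true))]
    (.write out (.getBytes s "UTF-8"))))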

2021-02-15T16:15:36.136300Z

> too many files to have one agent per file name
> have the futures reset an agent...

2021-02-15T16:17:24.136500Z

I think we disagree about what agents are for? They are lightweight, in Clojure terms at least; you can have a lot of them. And you send them actions - you wouldn't reset their file, you would close over the file handle and send them a function that makes them write. Serializing writes is exactly the sort of thing agents are good at (as long as you aren't too worried about waiting for writes to finish before you move forward)

1 ➕
2021-02-15T16:40:01.136800Z

a quick example:

(require '[clojure.java.io :as io])

(defn writing-agent
  [fname]
  (let [handle (io/file fname)
        writer (io/writer handle)]
    (agent
     {:fresh (.createNewFile handle)
      :handle handle
      :writer writer
      :written 0})))

(defn append-to
  [the-agent the-string]
  (send the-agent
        (fn [{:keys [writer] :as m}]
          (.write writer the-string)
          (.flush writer)
          (update m :written + (count the-string)))))

(defn close
  [the-agent]
  (send the-agent
        (fn [{:keys [writer] :as m}]
          (.close writer)
          (assoc m :handle nil :writer nil))))

user=> (.exists (io/file "foo.out"))
false
user=> (def a (writing-agent "foo.out"))
#'user/a
user=> (pprint @a)
{:fresh false,
 :handle #object[java.io.File 0xff6077 "foo.out"],
 :writer
 #object[java.io.BufferedWriter 0x1280851e "java.io.BufferedWriter@1280851e"],
 :written 0}
nil
user=> (append-to a "hello")
#object[clojure.lang.Agent 0x12a160c2 {:status :ready, :val {:fresh false, :handle #object[java.io.File 0xff6077 "foo.out"], :writer #object[java.io.BufferedWriter 0x1280851e "java.io.BufferedWriter@1280851e"], :written 5}}]
user=> (.exists (io/file "foo.out"))
true
user=> (slurp "foo.out")
"hello"
user=> (append-to a ", world")
#object[clojure.lang.Agent 0x12a160c2 {:status :ready, :val {:fresh false, :handle #object[java.io.File 0xff6077 "foo.out"], :writer #object[java.io.BufferedWriter 0x1280851e "java.io.BufferedWriter@1280851e"], :written 5}}]
user=> (slurp "foo.out")
"hello, world"
user=> (close a)
#object[clojure.lang.Agent 0x12a160c2 {:status :ready, :val {:fresh false, :handle #object[java.io.File 0xff6077 "foo.out"], :writer #object[java.io.BufferedWriter 0x1280851e "java.io.BufferedWriter@1280851e"], :written 12}}]
user=> (pprint @a)
{:fresh false, :handle nil, :writer nil, :written 12}
nil

2021-02-15T16:46:25.137Z

I saw some weird behavior with .createNewFile btw - the docs say it creates a new, empty file only if it doesn't exist yet, but in my experiments it seems to be truncating the existing file before any writes

2021-02-15T16:46:44.137200Z

or maybe the writer I create is doing that - I should decouple for testing

dpsutton 2021-02-15T16:51:08.137400Z

i think you need {:append true} as options to the writer
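
For reference, clojure.java.io/writer takes :append as a keyword option rather than a map, so the fix to writing-agent above would be roughly:

(let [handle (io/file fname)
      ;; :append true makes the writer add to an existing file
      ;; instead of truncating it on open
      writer (io/writer handle :append true)]
  ...)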

2021-02-15T17:47:49.137900Z

Hum, ya it mixes weird with the (seq [1 2]) idiom though

2021-02-15T17:48:06.138100Z

Which made sense that you either have a seq with elements in it, or nil

2021-02-15T17:48:21.138300Z

But gets confusing when a seq can be empty

2021-02-15T17:49:49.138500Z

yes, it used to be the case that (if s ... ...) was safe, but now (for years - the change was pretty close to clojure 1.0) you need (if (seq s) ... ...)
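
To see why, in the style of the REPL transcript above: an empty seq is truthy, and only seq normalizes empty to nil:

user=> (rest [1])
()
user=> (if (rest [1]) :truthy :falsy)
:truthy
user=> (seq (rest [1]))
nil
user=> (if (seq (rest [1])) :truthy :falsy)
:falsy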

2021-02-15T17:53:30.138700Z

I guess in the taxonomy, what is the difference between s and what seq returns?

2021-02-15T17:55:55.138900Z

I'm trying to see if there's a new word for a seq that can't be empty?

2021-02-15T18:11:43.139100Z

looking at it again, I think he is distinguishing between a seq and a lazy-seq (not a seq and a sequence), a seq is never empty (it is a seq of something or nil), a lazy-seq can be empty, and both of those are a sequence
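
A quick REPL check of that distinction (seq never returns an empty result, while a lazy seq can be empty and non-nil):

user=> (seq [])
nil
user=> (lazy-seq [])
()
user=> (if (lazy-seq []) :truthy :falsy)
:truthy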

2021-02-15T18:14:14.139300Z

but that doesn't help to make sense of the comment about (rest coll)

2021-02-15T18:15:25.139500Z

and like the type hint on rest says it returns an ISeq, the method call it bottoms out on (ISeq.more) says it returns an ISeq

Quest 2021-02-15T18:51:18.143400Z

My team would like to standardize on a Clojure library for API and/or function validation.
• Clojure.spec can be used for these purposes but we've found difficulties in usage that make it unappealing for our org.
• We're heavily considering Malli as our standardization target.
I'm responsible for demo'ing this and other possibilities soon. Would anyone suggest other libraries or things to consider during this task?

Quest 2021-02-17T00:14:18.225200Z

Huh, we were looking for something with instrumentation built in but this is exactly the sort of smaller library I was hoping someone would bring up. Thank you @vemv & everyone else who contributed here 🙇

1 🙂
Quest 2021-02-15T18:52:36.143800Z

I'm aware of Prismatic Schema but have never used it myself. We were previously having some success with Orchestra & Spec Tools, but we've generally seen low adoption of Clojure.spec.

borkdude 2021-02-15T18:53:23.144Z

I'm hoping spec2 will come out some time, I think the alpha status of spec1 and spec2 slows down momentum

borkdude 2021-02-15T18:53:45.144200Z

Malli seems to be actively developed and embraced

Quest 2021-02-15T18:54:11.144400Z

That was our take on looking at the library. We're a larger enterprise and must be a little more risk averse, but Malli seems very reasonable

dharrigan 2021-02-15T18:55:06.144600Z

I've been using malli for several of my projects (in tandem with Reitit) and it works very well imho.

1 👍
Quest 2021-02-15T18:55:46.145300Z

In my own personal work, I've done some really hacky work on clojure.spec to get runtime specs working. I'm also really looking forward to spec2, but my fingers have been crossed for a long time there 🙂

dpsutton 2021-02-15T18:56:03.146100Z

Cognitect is committed to spec as part of the library. It’s inconceivable that spec won’t be supported. That’s not necessarily the case with malli

borkdude 2021-02-15T18:58:05.146300Z

Even Schema has been maintained while the company behind it pivoted.

Quest 2021-02-15T18:58:11.146500Z

Good point, though I don't believe I'd have success trying to force adoption of spec1. I'll need to come to a conclusion by end of month, and I haven't checked but I'm assuming spec2 is still a ways out

borkdude 2021-02-15T18:58:26.146700Z

As long as the community has a big interest in it, it will probably be maintained

borkdude 2021-02-15T18:58:48.146900Z

And Metosin has a good track record I'd say

Quest 2021-02-15T18:59:09.147100Z

Yeah, I'll ➕1 that. Most of our team was pleasantly surprised to see Metosin was the publisher

ikitommi 2021-02-15T19:36:08.147400Z

If you decide to go with Malli, there is #malli to get help with your demo. Also, 0.3.0 is around the corner (parsers, function & sequence schemas).

1πŸ‘2πŸŽ‰
Leonid Korogodski 2021-02-15T20:15:05.152200Z

Does there exist a ready Clojure solution for the following? I want to zip several very large files while streaming the result on the fly to a consumer. The ZIP compression algorithm allows sending chunks of the output even before the entire zip archive is complete, and I don't want to hold the entire thing in memory at any given time. Obviously, Java's ZipOutputStream is unsuitable for that. Alpakka has a solution for this problem in Java and Scala: https://doc.akka.io/docs/alpakka/current/file.html#zip-archive However, while I can certainly call Alpakka from Clojure, I don't want to drag in the dependency on Akka and have to initialize its actor systems just for this little thing. Any suggestions?

2021-02-15T20:22:14.152900Z

what makes you think ZipOutputStream requires holding the entire output in memory?

2021-02-15T20:22:35.153500Z

surely if the target it writes to isn't in memory, it doesn't need to hold the output in memory

Leonid Korogodski 2021-02-15T20:22:58.154Z

Well, you cannot stream the results from its buffers as an input stream until it's done.

2021-02-15T20:23:22.154500Z

oh, that's counterintuitive, I'll read up, pardon my ignorance

Leonid Korogodski 2021-02-15T20:23:51.154800Z

Unless you can and it's me who's ignorant, of course. πŸ™‚

Leonid Korogodski 2021-02-15T20:24:10.155100Z

I'd love to be proven wrong.

2021-02-15T20:24:37.155500Z

it takes an outputstream as a constructor argument (that part I did expect)

Leonid Korogodski 2021-02-15T20:25:03.156Z

I want an input stream constructed out of several input streams, which simply zips their contents.

2021-02-15T20:25:21.156300Z

an input stream is something you read from :D

2021-02-15T20:25:35.156800Z

(simple typo I'm sure)

Leonid Korogodski 2021-02-15T20:25:41.157Z

Yup. So I could read the zipped contents from it even before the files are done being zipped.

Leonid Korogodski 2021-02-15T20:26:34.157900Z

Here's how Alpakka does it in Java:

Source<ByteString, NotUsed> source1 = ...
Source<ByteString, NotUsed> source2 = ...

Pair<ArchiveMetadata, Source<ByteString, NotUsed>> pair1 =
    Pair.create(ArchiveMetadata.create("akka_full_color.svg"), source1);
Pair<ArchiveMetadata, Source<ByteString, NotUsed>> pair2 =
    Pair.create(ArchiveMetadata.create("akka_icon_reverse.svg"), source2);

Source<Pair<ArchiveMetadata, Source<ByteString, NotUsed>>, NotUsed> source =
    Source.from(Arrays.asList(pair1, pair2));

Sink<ByteString, CompletionStage<IOResult>> fileSink = FileIO.toPath(Paths.get("logo.zip"));
CompletionStage<IOResult> ioResult = source.via(Archive.zip()).runWith(fileSink, mat);

2021-02-15T20:26:34.158Z

oh - so you want data -> zipper -> (dupe) -> output, where the dupe step creates something you can read

2021-02-15T20:27:05.158700Z

that's a limit algorithmically - with that kind of buffering you always need to hold onto the unread backlog, no matter what you do

2021-02-15T20:27:12.158900Z

(unless I still misunderstand)

Leonid Korogodski 2021-02-15T20:28:32.161Z

Yeah, kind of. If you look at how ZipOutputStream works, it reads a chunk at a time, deflates it, adds it to a buffer, updates the checksum, gets another chunk, etc. In the meantime, what's already processed can already be sent downstream. That's what I want.
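
To make that chunk-at-a-time behavior concrete, a small sketch against java.util.zip.Deflater (the compressor ZipOutputStream uses underneath); each .deflate call yields compressed bytes that could be pushed downstream immediately, before the input is exhausted:

(import '(java.util.zip Deflater))

(let [d   (Deflater.)
      buf (byte-array 1024)]
  (.setInput d (.getBytes "hello hello hello hello" "UTF-8"))
  (.finish d)                       ; signal no more input
  ;; drain compressed output chunk by chunk
  (loop [total 0]
    (if (.finished d)
      total                         ; total compressed bytes produced
      (recur (+ total (.deflate d buf))))))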

2021-02-15T20:28:34.161100Z

of course the (dupe) step can be implemented so it only holds backlog and doesn't keep read state

2021-02-15T20:28:56.161300Z

deflates?

Leonid Korogodski 2021-02-15T20:29:07.161600Z

That's their terminology for compression.

dpsutton 2021-02-15T20:29:56.162100Z

> The ZIP file format permits a number of compression algorithms, though DEFLATE is the most common.

Leonid Korogodski 2021-02-15T20:31:20.163600Z

So, something like this: Input: files A.a, B.b, C.c. Zipped state: filename -> send right away downstream. first chunk -> compress -> send downstream. second chunk -> compress -> send downstream. filename -> ...

Leonid Korogodski 2021-02-15T20:31:40.164100Z

Don't wait to send the entire thing downstream when everything is complete.

Leonid Korogodski 2021-02-15T20:32:02.164300Z

The files are too big for that.

2021-02-15T20:42:11.167300Z

it seems like the composition of output streams should allow this, if you use an IO-based stream and don't use in-memory, if I understand your situation. The gotcha is that you want to be able to "tee" the output, so you need something with sensible buffering behavior (if you control both producer and consumer you can make sure the buffer never grows too large in simple use cases). More generally you can get into backpressure and document that the stream will stall if the secondary consumer doesn't read fast enough, or use a dropping buffer and document that the secondary reader will not see all the data if it reads too slowly

2021-02-15T20:42:38.167900Z

this is something that can be done with outputstreams

Leonid Korogodski 2021-02-15T20:44:55.169100Z

Well, I can get to solving the backpressure problem later. To begin with, what do you mean by "composition of output streams" in this particular scenario?

2021-02-15T20:46:21.170Z

what I mean is that by chaining streams you can have two destinations for the zipped data - one you can read back in process, and one that gets sent to IO

2021-02-15T20:46:36.170400Z

if you don't need two destinations, your problem is simpler than I thought

2021-02-15T20:47:03.170800Z

I might have misunderstood your usage of the term "input stream" relating to the zipped data

2021-02-15T20:49:55.171500Z

based on your step by step, all you need is for the output stream you supply to ZipOutputStream to be an IO stream and not an in-memory buffer

Leonid Korogodski 2021-02-15T20:51:58.173Z

Yes, I want something like ZipOutputStream (in the sense of building up zipped data in its buffers) that can also be queried for available zipped data and read from - to wit, one that also functions as an input stream.

Leonid Korogodski 2021-02-15T20:52:35.173500Z

I suppose that channels can be used for that, no?

2021-02-15T20:53:02.173700Z

channels?

Leonid Korogodski 2021-02-15T20:53:25.174200Z

core.async

2021-02-15T20:53:48.175Z

this is totally unrelated to core.async, that will just make your problem more complicated

2021-02-15T20:54:12.175800Z

you can make an OutputStream that writes data to two other OutputStreams

phronmophobic 2021-02-15T20:54:18.176100Z

you can use PipedInputStream and PipedOutputStream to turn your ZipOutputStream into an InputStream

1 💯
2021-02-15T20:54:37.176600Z

you can make an OutputStream that sends to an InputStream

Leonid Korogodski 2021-02-15T20:55:38.177400Z

Thanks. I'll take a look at the piped streams.

phronmophobic 2021-02-15T20:56:07.178400Z

the big caveat with the Piped*Streams is that you need two threads, whereas you might be able to get away with a single thread
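
A minimal sketch of the piped approach under that caveat, zipping on a helper thread while the caller reads the returned InputStream (zip-input-stream is a hypothetical name; entries are [entry-name input-stream] pairs):

(require '[clojure.java.io :as io])
(import '(java.io PipedInputStream PipedOutputStream)
        '(java.util.zip ZipEntry ZipOutputStream))

(defn zip-input-stream
  [entries]
  (let [in  (PipedInputStream.)
        out (PipedOutputStream. in)]
    (future                               ; the second thread
      (with-open [zout (ZipOutputStream. out)]
        (doseq [[entry-name src] entries]
          (.putNextEntry zout (ZipEntry. ^String entry-name))
          (io/copy src zout)              ; streams chunk by chunk
          (.closeEntry zout))))
    in))

If the consumer stops reading, the pipe's internal buffer fills and the zipping thread blocks on write, which is the backpressure behavior mentioned earlier.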

2021-02-15T20:56:32.178900Z

@lkorogodski essentially, even without the Piped stuff, you can reify java.io.OutputStream and in the implementation simply delegate by writing to two other streams

2021-02-15T20:56:42.179100Z

this only needs one thread

1πŸ‘1πŸ’―
2021-02-15T20:56:54.179500Z

(but will block if one reader blocks)
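
A sketch of that delegation; strictly, java.io.OutputStream is an abstract class, so in Clojure this is proxy territory rather than reify, which only implements interfaces (tee-output-stream is a hypothetical name):

(defn tee-output-stream
  [^java.io.OutputStream a ^java.io.OutputStream b]
  (proxy [java.io.OutputStream] []
    (write
      ([x]
       ;; the single-arg write receives either an int or a byte[]
       (if (integer? x)
         (do (.write a (int x)) (.write b (int x)))
         (do (.write a ^bytes x) (.write b ^bytes x))))
      ([x off len]
       (.write a ^bytes x (int off) (int len))
       (.write b ^bytes x (int off) (int len))))
    (flush []
      (.flush a)
      (.flush b))
    (close []
      (.close a)
      (.close b))))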

Leonid Korogodski 2021-02-15T20:57:43.180400Z

Ok, that will work, I think.

2021-02-15T20:58:21.181100Z

the reason I so glibly rejected core.async is that the synchronization here is not actually complex, and the mechanics of putting the problem into core.async lead to potential pitfalls (blocking IO should not be in go blocks, for example)

2021-02-15T20:58:59.181600Z

and the core.async solution wouldn't simplify the management of the data consumption, just displace the complexity

dpsutton 2021-02-15T20:59:13.181800Z

also, solve one problem at a time

Leonid Korogodski 2021-02-15T21:04:13.182500Z

Thanks!

Leonid Korogodski 2021-02-15T21:15:36.183300Z

Yeah, backing a ZipOutputStream by an OutputStream that uses a CircularByteBuffer would do the trick, as the circular buffer has an associated input stream, too.

vemv 2021-02-15T22:21:34.183500Z

https://github.com/nedap/speced.def takes a bit of a Schema-like approach (i.e. no instrumentation, using what essentially boils down to a pre/postcondition system) while using Spec1 as its backing validation system. The magic being that Spec1 is just an impl detail. I have a branch where I replaced it with Schema (because it's relevant for my day job). I could do the same for Spec2, Malli, etc. It's a bit of an underdog but also a completely unique, robust proposition when considered carefully

1 👍