@dpsutton I'm crawling some data and generating ~20GB of uncompressed data per day (the watcher appends to a gzipped version of the file directly). There are about 5000 files that I am writing to.
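For context, a rough sketch of that append-to-gzip pattern (the helper name and file path below are invented, not from the thread): each append writes a new gzip member to the end of the file, which gzip/zcat read back as one concatenated stream.

(require '[clojure.java.io :as io])
(import '(java.util.zip GZIPOutputStream))

(defn append-gzipped-line! [path ^String line]
  ;; open the file in append mode and write one fresh gzip member
  (with-open [out (GZIPOutputStream. (io/output-stream path :append true))]
    (.write out (.getBytes (str line "\n") "UTF-8"))))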
My team would like to standardize on a Clojure library for API and/or function validation. • Clojure.spec can be used for these purposes, but we've found difficulties in usage that make it unappealing for our org. • We're heavily considering Malli as our standardization target. I'm responsible for demo'ing this and other possibilities soon. Would anyone suggest other libraries or things to consider during this task?
Huh, we were looking for something with instrumentation built in but this is exactly the sort of smaller library I was hoping someone would bring up. Thank you @vemv & everyone else who contributed here
I'm aware of Prismatic Schema but have never used it myself. We were previously having some success with Orchestra & Spec Tools, but we've generally seen low adoption of Clojure.spec.
I'm hoping spec2 will come out sometime; I think the alpha status of spec1 and spec2 slows down momentum
Malli seems to be actively developed and embraced
That was our take on looking at the library. We're a larger enterprise and must be a little more risk averse, but Malli seems very reasonable
I've been using malli for several of my projects (in tandem with Reitit) and it works very well imho.
In my own personal work, I've done some really hacky work on clojure.spec to get runtime specs working. I'm also really looking forward to spec2, but my fingers have been crossed for a long time there
Cognitect is committed to spec as part of the library. It's inconceivable that spec won't be supported. That's not necessarily the case with malli
Even Schema has been maintained while the company behind it pivoted.
Good point, though I don't believe I'd have success trying to force adoption of spec1. I'll need to come to a conclusion by end of month, and I haven't checked but I'm assuming spec2 is still a ways out
As long as the community has a big interest in it, it will probably be maintained
And Metosin has a good track record I'd say
Yeah, I'll +1 that. Most of our team was pleasantly surprised to see Metosin was the publisher
If you decide to go with Malli, there is #malli to get help with your demo. Also, 0.3.0 is around the corner (parsers, function & sequence schemas).
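For reference, a minimal sketch of Malli-based validation (the User schema and example data are invented for illustration; assumes malli is on the classpath):

(require '[malli.core :as m]
         '[malli.error :as me])

(def User
  [:map
   [:id uuid?]
   [:email string?]
   [:age nat-int?]])

(m/validate User {:id (java.util.UUID/randomUUID) :email "a@b.c" :age 30})
;; => true

(me/humanize (m/explain User {:email 42 :age -1}))
;; => per-key human-readable error messages, e.g. {:email ["should be a string"], ...}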
Does there exist a ready Clojure solution for the following?
I want to zip several very large files while streaming the result on the fly to a consumer. The ZIP compression algorithm allows sending chunks of the output even before the entire zip archive is complete, and I don't want to hold the entire thing in memory at any given time. Obviously, Java's ZipOutputStream is unsuitable for that.
Alpakka has a solution for this problem in Java and Scala: https://doc.akka.io/docs/alpakka/current/file.html#zip-archive
However, while I can certainly call Alpakka from Clojure, I don't want to drag in the Akka dependency and have to initialize its actor system just for this little thing. Any suggestions?
what makes you think ZipOutputStream requires holding the entire output in memory?
surely if the target it writes to isn't in memory, it doesn't need to hold the output in memory
Well, you cannot stream the results from its buffers as an input stream until it's done.
oh, that's counterintuitive, I'll read up, pardon my ignorance
Unless you can and it's me who's ignorant, of course.
I'd love to be proven wrong.
it takes an outputstream as a constructor argument (that part I did expect)
I want an input stream constructed out of several input streams, which simply zips their contents.
an input stream is something you read from :D
(simple typo I'm sure)
Yup. So I could read the zipped contents from even before the entire files are done being zipped.
Here's how Alpakka does it in Java:
Source<ByteString, NotUsed> source1 = ...
Source<ByteString, NotUsed> source2 = ...
Pair<ArchiveMetadata, Source<ByteString, NotUsed>> pair1 =
Pair.create(ArchiveMetadata.create("akka_full_color.svg"), source1);
Pair<ArchiveMetadata, Source<ByteString, NotUsed>> pair2 =
Pair.create(ArchiveMetadata.create("akka_icon_reverse.svg"), source2);
Source<Pair<ArchiveMetadata, Source<ByteString, NotUsed>>, NotUsed> source =
Source.from(Arrays.asList(pair1, pair2));
Sink<ByteString, CompletionStage<IOResult>> fileSink = FileIO.toPath(Paths.get("logo.zip"));
CompletionStage<IOResult> ioResult = source.via(Archive.zip()).runWith(fileSink, mat);
oh - so you want data -> zipper -> (dupe) -> output, where the dupe step creates something you can read
that's a limit algorithmically - with that kind of buffering you always need to hold onto the unread backlog, no matter what you do
(unless I still misunderstand)
Yeah, kind of. If you look at how ZipOutputStream works, it reads a chunk at a time, deflates it, adds it to a buffer, updates the checksum, gets another chunk, etc. In the meantime, what's already processed can already be sent downstream. That's what I want.
of course the (dupe) step can be implemented so it only holds backlog and doesn't keep read state
deflates?
That's their terminology for compression.
> The ZIP file format permits a number of compression algorithms, though DEFLATE is the most common.
So, something like this: Input: files A.a, B.b, C.c. Zipped state: filename -> sent downstream right away. first chunk -> compress -> sent downstream. second chunk -> compress -> sent downstream. filename -> ...
Don't wait to send the entire thing downstream when everything is complete.
The files are too big for that.
it seems like the composition of output streams should allow this, if you use an IO-based stream rather than an in-memory one. if I understand your situation, the gotcha is that you want to be able to "tee" the output, so you need something with sensible buffering behavior (if you control both producer and consumer you can make sure the buffer never grows too large in simple use cases). more generally you get into backpressure: either document that the stream will stall if the secondary consumer doesn't read fast enough, or use a dropping buffer and document that the secondary reader will not see all the data if it reads too slowly
this is something that can be done with outputstreams
Well, I can get to solving the backpressure problem later. To begin with, what do you mean by "composition of output streams" in this particular scenario?
what I mean is that by chaining streams you can have two destinations for the zipped data - one you can read back in process, and one that gets sent to IO
if you don't need two destinations, your problem is simpler than I thought
I might have misunderstood your usage of the term "input stream" relating to the zipped data
based on your step by step, all you need is for the output stream you supply to ZipOutputStream to be an IO stream and not an in-memory buffer
Yes, I want something like ZipOutputStream (in the sense of building up zipped data in its buffers) that can also be queried for available zipped data and read from -- to wit, something that also functions as an input stream.
I suppose that channels can be used for that, no?
channels?
core.async
this is totally unrelated to core.async, that will just make your problem more complicated
you can make an OutputStream that writes data to two other OutputStreams
you can use PipedInputStream and PipedOutputStream to turn your ZipOutputstream into an InputStream
you can make an OutputStream that sends to an InputStream
Thanks. I'll take a look at the piped streams.
the big caveat with the Piped*Streams is that you need two threads, whereas you might be able to get away with a single thread
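A sketch of that Piped*Stream approach (the function name, buffer size, and entry format are made up here): the future supplies the second thread, and the returned InputStream yields compressed bytes as entries are written instead of after the whole archive is finished.

(require '[clojure.java.io :as io])
(import '(java.io PipedInputStream PipedOutputStream InputStream)
        '(java.util.zip ZipOutputStream ZipEntry))

(defn zip-input-stream
  "entries is a seq of [entry-name input-stream] pairs; returns an InputStream
   of the resulting zip archive, produced incrementally."
  [entries]
  (let [pipe-in  (PipedInputStream. (* 64 1024))   ; the pipe buffer bounds memory use
        pipe-out (PipedOutputStream. pipe-in)]
    (future                                        ; writer thread
      (with-open [zos (ZipOutputStream. pipe-out)]
        (doseq [[entry-name ^InputStream in] entries]
          (.putNextEntry zos (ZipEntry. ^String entry-name))
          (io/copy in zos)                         ; copies chunk by chunk
          (.closeEntry zos))))
    pipe-in))

Backpressure falls out of the pipe's fixed buffer: the zipping thread blocks whenever the consumer lags by more than the buffer size.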
@lkorogodski essentially, even without the Piped stuff, you can reify java.io.OutputStream and in the implementation simply delegate by writing to two other streams
this only needs one thread
(but will block if one reader blocks)
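A sketch of that delegation idea (all names illustrative); since OutputStream is an abstract class rather than an interface, proxy rather than reify does the extending:

(defn tee-output-stream
  "Returns an OutputStream that duplicates every write to out1 and out2."
  [^java.io.OutputStream out1 ^java.io.OutputStream out2]
  (proxy [java.io.OutputStream] []
    (write
      ([b]
       ;; arity 1 covers both write(int) and write(byte[])
       (if (integer? b)
         (do (.write out1 (int b))
             (.write out2 (int b)))
         (let [^bytes arr b]
           (.write out1 arr)
           (.write out2 arr))))
      ([b off len]
       (let [^bytes arr b]
         (.write out1 arr (int off) (int len))
         (.write out2 arr (int off) (int len)))))
    (flush []
      (.flush out1)
      (.flush out2))
    (close []
      (.close out1)
      (.close out2))))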
Ok, that will work, I think.
the reason I so glibly rejected core.async here is that the synchronization is not actually complex, and the mechanics of putting the problem into core.async lead to potential pitfalls (blocking IO should not be in go blocks, for example)
and the core.async solution wouldn't simplify the management of the data consumption, just displace the complexity
also, solve one problem at a time
Thanks!
Yeah, backing a ZipOutputStream by an OutputStream that uses a CircularByteBuffer would do the trick, as the circular buffer has an associated input stream, too.
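Roughly like this, assuming an implementation such as com.Ostermiller.util.CircularByteBuffer, which exposes a paired getOutputStream/getInputStream over one ring buffer (an extra dependency, and its exact blocking-when-full behavior is worth checking in its docs); the file names are just the A.a/B.b/C.c ones from the example above:

(require '[clojure.java.io :as io])
(import '(java.util.zip ZipOutputStream ZipEntry)
        '(com.Ostermiller.util CircularByteBuffer))

(let [buf    (CircularByteBuffer. (* 256 1024))    ; fixed-size ring buffer
      zip-in (.getInputStream buf)]                ; read compressed bytes from here
  (future                                          ; writer side still needs its own thread
    (with-open [zos (ZipOutputStream. (.getOutputStream buf))]
      (doseq [f ["A.a" "B.b" "C.c"]]
        (.putNextEntry zos (ZipEntry. ^String f))
        (with-open [in (io/input-stream f)]
          (io/copy in zos))
        (.closeEntry zos))))
  zip-in)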
https://github.com/nedap/speced.def takes a bit of a Schema-like approach (i.e. no instrumentation, using what essentially boils down to a pre/postcondition system) while using Spec1 as its backing validation system. The magic being that Spec1 is just an impl detail. I have a branch where I replaced it with Schema (because it's relevant for my day job). I could do the same for Spec2, Malli, etc. It's a bit of an underdog but also a completely unique, robust proposition when considered carefully