@mfiano You're in the right place, just share your code and hopefully some people will read and review (also try #beginners if there's little reaction here)
Heya
I wrote a project that parses a few different binary audio file types to get metadata https://github.com/mfiano/mediacat
in mfiano.mediacat.parser
ns, usage is (parse-file "/path/to/file")
supported file types are *.ogg, *.flac, and *.mpc
This is the first bit of Clojure I did, so be nice 🙂
Also noteworthy, the first time I touched Java, interop or not 🙂
@mfiano the biggest thing I see is many usages of loop, where eg. dotimes
(for numeric bindings in series) or reduce
(for doing some action for each entry of an input and building a result) could have been used
and then the usage of the humanize library by full name instead of requiring with an alias in the util namespace
in util/sequence* - I don't think (lazy-seq (loop ...))
makes any sense at all
I tend not to alias a namespace if I'm only using it once, but fair points
yeah, that's a style thing I guess - the risk there is that eg. it works accidentally when moved to a new ns even though you don't add a require
the lazy-seq / loop combo is the one thing I see so far that might actually be a correctness concern
Can you explain that last thing about lazy-seq
. A friend of mine wrote that for me
you can wrap any sequence in lazy-seq, but that won't make it lazily generate
loop is eager, so what that code says is "create this collection eagerly, and once you are done, wrap it in a lazy-seq abstraction"
which is weird
and might or might not indicate a logic error - it's probably not even that though
also, now that I see what's inside the loop, it's a classic case of a potential concat-bomb
(you'd need a large enough input before ever seeing the issue though)
I asked them to stop by here and talk about it
but it can blow up the stack
all in all, for a first clojure project, the quality is very good
Which function was it that had the loop in the lazy seq?
sequence*
@suskeyhose https://github.com/mfiano/mediacat/blob/master/src/mfiano/mediacat/util.clj#L15
Right.
It's fully intended to be eager for that portion.
https://github.com/mfiano/mediacat/blob/master/src/mfiano/mediacat/util.clj#L19
so, the lazy-seq call around it doesn't do anything except making an empty lazy-seq instead of nil - in the loop body it's using lazy-cat on every branch
Yeah, this was after a refactor and it never occurred to me to just raise the loop form
so the lazy-seq on the outside isn't doing much but that isn't an error - just something that makes me suspicious
aha
The lazy-cat is already taking care of all the needed laziness
second issue is the eager nested calls of lazy-cat, see the blog post I linked above
Yup, I've read it.
with a large enough input it will blow the stack
Yes. Although it's unlikely to occur in this case, because the part which will blow the stack is the side that's being catenated onto, which is the set of return results from calling a transducer, which means that it's unlikely to be large.
it's straightforward to replace the usage of loop/concat with a recursive call that doesn't have that danger
but you know your domain better than I do
I might try to put a little thought into it today to see if there is a better, more idiomatic way to handle this, perhaps with giving the function more intermediate values as a part of the argument list
actually yeah, that looks like an easy refactor
I'm forgetting the precise locations now, but I saw some usage of count
as a local, this is a style issue but it tends to trip me up as a reader and that kind of thing tends to cause weird bugs when refactoring
shadowing of functions with local values in general that is
Yeah, that's fair. Shadowing bindings do tend to get weird
@noisesmith Thanks a lot for the criticism. This project was very tedious...some of these binary formats were under-specified or needlessly convoluted. lazy sequences helped a bunch with transforming the data efficiently. I learned a lot for sure.
Also thanks to @suskeyhose and @seancorfield for answering all my questions along the way. You guys are awesome
(defn- sequence*
[xf coll]
(let [rf (xf (fn [xs v] (conj xs v)))
f (fn f [coll xs]
(if (seq xs)
(lazy-seq (cons (first xs) (f coll xs)))
(loop [coll coll]
(when (seq coll)
(if-let [res (rf [] (first coll))]
(if (reduced? res)
(lazy-seq (cons (first @res) (f nil xs)))
(lazy-seq (cons (first res) (f (rest coll) (rest res)))))
(recur (rest coll)))))))]
(f coll [])))
I dislike how slack doesn't allow code highlighting for specified languages
It does, just not like that
ah
Anyway, this one shouldn't explode the stack, and has no redundant lazy-seq call
Oh, oops, there's actually one mistake there
the (f nil xs)
should actually be (f nil (rest @res))
pardon my reading skills, but what differentiates sequence*
from cat
?
sequence*
isn't a trasducer, it's a trasducible context that's identical to sequence
except that it is not chunked and does not require realizing at least two results or consuming the whole sequence.
I would not have guessed
What would make that more clear?
Probably me stabilizing the code to warrant docstrings for every function 🙂
I'm not sure - maybe local binding names that are meaningful instead of repeated calls to eg. first, rest on the same coll in the same scope
or comments even
Unfortunately it actually does have a bug now that it didn't before. gotta track that down.
@suskeyhose No rush, you can come back to it another time. I'm actually doing some CL code today and tomorrow.
Okay, sounds good.
Anyway, the original will likely never run into the problem of blowing the stack unless you decide to use partition-all
with a size in the thousands.
of all the code I saw so far, (not claiming I read deeply...) sequence* would likely benefit from unit tests (which would likely also elucidate what it is meant to do)
it's the only code that made me question my own ability to read it correctly
that's definitely true, and in my own project I'd probably have built some around it. In this case I'm just helping Michael out with some helper functions and thinking functionally for his project.
I think @suskeyhose refactored that function like 4 times already. It's definitely pretty obtuse to me
This is true. It's trying something that is rather complex.
To be honest, sequence
works just as well for this project with the input it is given, so I'm not sure chunking would be an issue here. I could be wrong though, I just know that it works with the builtin.
The problem with chunking is that you read in more data than necessary by at least one. Which can be a problem if you're reading a very large sequence and use a rather specific filter on it. So basically it damages performance in the case when computing the next value is expensive.
(and you only need a very specific number of values)
This is true, and also I haven't tested the builtin once we added the reified bit reader, which may possibly advance the stream too much.
Like for example that notes app that I made near the beginning of all this. It used sequence
which made getting the first item from the sequence slow when the only other item which matched the filter was at the end.
yeah, it might, which is the other reason
Anyway, this project was fun. A few of the formats I parsed would have been pretty impossible without using lots of imperative code, or putting lots of binary data on the heap for processing. lazy sequencing helped a ton
I wouldn't think of using another language for parsing binary files after those few I encountered
I'd say that's impressive considering just how much better it could be if Java didn't throw so many walls at you for doing binary IO.
The Java library I'm using, bitio
, really was a great help with a lot of that. I tried about 3 others, which were all lacking.
the whole "no unsigned types" thing is a bit odd
if they really wanted to be consistent with that they could have defined the integer value of Boolean to be -1 and 0
🙂
That was my reasoning for doing https://github.com/mfiano/mediacat/blob/master/src/mfiano/mediacat/parser/types.clj#L19-L21
So I wasn't wasting tons of space for small bit reads
nice
If I recall correctly, the only mutable function I wrote was tr/reverse-bytes!
for reversing a byte array to be read in little endian, since it would have needlessly consed when its only use is going to be discarded afterwards anyway.
I was concerned about performance there a bit, because ultimately I'd like a thread pool to parse an entire directory tree of files to be parsed and sent off to Datalog or something.
Here's the commented, better-named, and better-tested version which doesn't use lazy-cat. What do you think @noisesmith, @mfiano?
I'd probably have used a letfn
But looks good
doesn't work with reducing-function
since you have to call a function on the lambda
before it can be used.
Also, is that more understandable Michael?
I meant for pull-next
Yes very
Suggesting having a nested let and then letfn?
I guess that works. It's actually tabbed over less, so yeah
that's probably not a bad idea
I was just thinking of a way to prevent using the symbol pull-next
a few times outside of itself