I just returned from the Spark codebase and I’m more hopeful than before about Encoders
so if I don't have time for my sql branch you can take over the relevant code
that is neat :)
encoder relief :)
ExpressionEncoder is the type to use https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L219
But now we need to provide two arguments: `serializer: Seq[Expression]` and `deserializer: Expression`
Expressions will be built using https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala which provides `StaticInvoke`, `Invoke`, and `NewInstance`.
You have to understand that these expressions are just builders for Java source code.
(that will be compiled on the fly)
So one has to think the way one does when calling Clojure from Java
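To make the "builders for Java source" point concrete, here is a toy sketch in plain Java. It is not Spark's API: `Expr`, `gen`, `input`, `staticInvoke`, `invoke` and `newInstance` are hypothetical names that only mimic the idea behind the three expression types mentioned above. The generated string calls `clojure.lang.RT.nth`, which is the kind of interop call one would write by hand from Java.

```java
public class CodegenSketch {
    // Hypothetical toy interface: a node only knows how to print Java source.
    interface Expr { String gen(); }

    // A leaf: a variable name or literal already in scope in the generated code.
    static Expr input(String name) { return () -> name; }

    // Toy counterpart of a static method call node.
    static Expr staticInvoke(String cls, String method, Expr... args) {
        return () -> cls + "." + method + "(" + join(args) + ")";
    }

    // Toy counterpart of an instance method call node.
    static Expr invoke(Expr target, String method, Expr... args) {
        return () -> target.gen() + "." + method + "(" + join(args) + ")";
    }

    // Toy counterpart of a constructor call node.
    static Expr newInstance(String cls, Expr... args) {
        return () -> "new " + cls + "(" + join(args) + ")";
    }

    // Renders the arguments as a comma-separated list of generated source.
    static String join(Expr[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < args.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append(args[i].gen());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Expr nth = staticInvoke("clojure.lang.RT", "nth", input("row"), input("0"));
        // prints: clojure.lang.RT.nth(row, 0).toString()
        System.out.println(invoke(nth, "toString").gen());
        // prints: new java.lang.StringBuilder(s)
        System.out.println(newInstance("java.lang.StringBuilder", input("s")).gen());
    }
}
```

Nothing is evaluated while the tree is built; only the printed source would ever run, which is why it helps to think in terms of the Java calls you want to end up with.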
I think we can achieve an API like `(df src spec & xforms-then-options)`
very close to what we have for rdd
that would be awesome
the snippet above is totally untested (well it compiles and produces an expression)
while learning about CollReduce, what about parallel fold?
on todo list for rdd I guess: https://github.com/HCADatalab/powderkeg/blob/master/src/main/clojure/powderkeg/core.clj#L662
not managing a PR tonight, maybe later
This code has been there since the early days. I don't remember why it got commented out.
there is a related todo comment
on another note, where is the book on Clojure collection protocols, would need one :)
I do remember. Transducers assume linear traversal. So you have to solve transducers+fold first.
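A small illustration of why linear traversal matters, in plain Java rather than Clojure. This is only a sketch: `dedupe` here is a hand-written stand-in for a stateful step (similar in spirit to Clojure's `dedupe`), and `chunkedDedupe` is a hypothetical fold-like split that processes each chunk independently and then concatenates.

```java
import java.util.ArrayList;
import java.util.List;

public class LinearTraversalSketch {
    // Stateful step: drop consecutive duplicates. It remembers the previous element.
    static List<Integer> dedupe(List<Integer> in) {
        List<Integer> out = new ArrayList<>();
        Integer prev = null;
        for (Integer x : in) {
            if (prev == null || !prev.equals(x)) out.add(x);
            prev = x;
        }
        return out;
    }

    // Fold-like processing: split at an index, run the step on each chunk
    // with fresh state, then concatenate the partial results.
    static List<Integer> chunkedDedupe(List<Integer> in, int split) {
        List<Integer> out = new ArrayList<>(dedupe(in.subList(0, split)));
        out.addAll(dedupe(in.subList(split, in.size())));
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 1, 2, 2, 2, 3);
        System.out.println(dedupe(data));           // prints [1, 2, 3]
        System.out.println(chunkedDedupe(data, 3)); // prints [1, 2, 2, 3]
    }
}
```

The state held by the step is lost at the chunk boundary, so the answer depends on where the input was split. Stateless steps (map, filter) don't have this problem.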
saw some stackoverflow question about that
Rich's Strange Loop talk on transducers mentioned parallelism on one slide, wonder where it is at now :)
If I understand Alex's mention of kv, I believe I solved it my way in xforms.
have to grok that too, but now have to try sleeping :)
With by-key I deal with deterministic partitioning. Fold is non-deterministic.
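Roughly what deterministic partitioning by key buys you, sketched in plain Java. This is not how xforms or powderkeg implement `by-key`; `partitionFor` and `sumByKey` are hypothetical helpers, and hash-mod partitioning is just one assumed scheme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ByKeySketch {
    // Deterministic: the partition depends only on the key and the partition count.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Route (key, value) pairs to partitions, then sum per key partition by partition.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs, int numPartitions) {
        List<List<Map.Entry<String, Integer>>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (Map.Entry<String, Integer> p : pairs) {
            parts.get(partitionFor(p.getKey(), numPartitions)).add(p);
        }
        // All values of a given key sit in exactly one partition, in input order,
        // so per-key state never has to be merged across partitions.
        Map<String, Integer> result = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> part : parts) {
            for (Map.Entry<String, Integer> p : part) {
                result.merge(p.getKey(), p.getValue(), Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
            List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        System.out.println(sumByKey(pairs, 4)); // prints {a=4, b=2}
    }
}
```

Each key gets a linear traversal of its own values, which is what a stateful step needs. With a plain fold the step cannot rely on where the input gets split.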