hmmm
i should look through all of my shit
https://github.com/joshrotenberg/cassandra-flambo-read-write
i’m going to update this to use dataframes for both ends
because i think from a wrapping/interop perspective that's gonna be way more flexible
@ccann: thanks for the tip!
Can anyone here tell me about how they’re currently using Flambo at their companies? I work with Apache Spark on a daily basis (mostly from Scala), but would love to push my clients to move over to Clojure + Spark
last year i started using flambo to process lots and lots of log data to update various metadata entries in one of our systems
basically we have a 10 minute log window
typically 6-12 million records per window
spread across about 1000 logs that end up pushed to a shared directory
so i used flambo to build a small system that looks for a complete set of logs, parses every single entry, and runs anywhere from zero to 5 functions on each line based on various params
and creates a set of updates
initially those updates were inline but it ended up being safer to create a separate set of updates, and then run an “updater” task against them
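the guts of it were roughly like this (a from-memory sketch; the namespace and the parse/update fns are made-up stand-ins for the real ones):
```clojure
(ns log-updater.core
  (:require [flambo.conf :as conf]
            [flambo.api :as f]
            [clojure.string :as str]))

;; hypothetical stand-ins for the real parse/update logic
(defn parse-line [line]
  (zipmap [:ts :host :msg] (str/split line #"\t")))

(defn entry->updates [entry]
  ;; in the real job zero to five fns fire here, depending on the entry
  (if (:host entry) [{:type :host-seen :host (:host entry)}] []))

(defn -main [& [log-dir]]
  (let [sc (-> (conf/spark-conf)
               (conf/master "local[*]")
               (conf/app-name "log-updater")
               f/spark-context)
        updates (-> (f/text-file sc (str log-dir "/*")) ; the shared log dir
                    (f/map (f/fn [line] (parse-line line)))
                    (f/flat-map (f/fn [entry] (entry->updates entry)))
                    f/collect)]
    ;; a separate "updater" task then applies these to the metadata store
    (println (count updates) "updates generated")))
```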
this was more of a ‘see how it works’ project
but it worked really well, and let me write the whole thing in clojure
so that's a win
now we are writing more sophisticated stuff and a lot of it is in scala
but i’m pushing for clojure for at least a few things and hopefully we’ll get more as time goes on
using the datastax spark/cassandra connector fairly heavily as well so my main task lately has been sorting that api out in clojure land
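to give a flavor, the read side through the connector's java api looks something like this (a sketch; the host, keyspace, and table names are made up):
```clojure
(ns cass-read.core
  (:require [flambo.conf :as conf]
            [flambo.api :as f])
  (:import [com.datastax.spark.connector.japi CassandraJavaUtil]))

(def c
  (-> (conf/spark-conf)
      (conf/master "local[*]")
      (conf/app-name "cassandra-read")
      ;; point the connector at the cluster (host is made up)
      (conf/set "spark.cassandra.connection.host" "127.0.0.1")))

(def sc (f/spark-context c))

;; read a table as an RDD of CassandraRow via the japi, then hand it
;; straight back to flambo for the clojure side of the work
(def rows
  (-> (CassandraJavaUtil/javaFunctions sc)
      (.cassandraTable "my_keyspace" "my_table")))

(-> rows
    (f/map (f/fn [row] (.getString row "id")))
    (f/take 5))
```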
That sounds like fun
Was your organization already using Spark and you decided to write your app using Clojure and Flambo?
yes
we have data processing pipelines written in just about everything that’s existed for the past 12+ years
so
oracle
hadoop
and most recently spark
mixed with everything in between
basically a huge shitshow of infrastructure and legacy code
heh
That sounds simultaneously fun and terrifying
it's both, pretty much 100%
@stephenmhopper: we (yieldbot) use flambo for all our production spark jobs
I’m fairly certain we are the largest company using it (we are quite large).
@sorenmacbeth: Cool. What’s feature parity look like between flambo and Spark 1.6?
is everything supported right now?
@stephenmhopper We were using it at http://weft.io (until last week, when we ran out of money) for all our historical processing of maritime data. We had ~1TB worth. In Clojure we had one job for enforcing data schemas and one for enriching the data with new fields from other sources. We used DataFrame interfaces (spark-csv, spark-avro) for loading and saving (which required a fair bit of java interop), and the meat of our work was usually a group-by on an id field followed by some iterative transforming of the resulting grouped data (and eventually reformulations of that idea once we realized the grouped data was too big not to process in a distributed fashion)
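for concreteness, the loading/grouping part looked roughly like this (a sketch; the path and column names are made up, and it's the spark 1.6 SQLContext api):
```clojure
(ns weft-etl.core
  (:require [flambo.conf :as conf]
            [flambo.api :as f]
            [flambo.tuple :as ft])
  (:import [org.apache.spark.sql SQLContext]))

(def sc
  (f/spark-context
   (-> (conf/spark-conf)
       (conf/master "local[*]")
       (conf/app-name "historical-enrichment"))))

;; SQLContext wants the underlying scala SparkContext
(def sql-ctx (SQLContext. (.sc sc)))

;; load via spark-csv through the DataFrameReader -- plain java interop
(def df
  (-> (.read sql-ctx)
      (.format "com.databricks.spark.csv")
      (.option "header" "true")
      (.load "/data/maritime/positions.csv")))   ; path is made up

;; drop back down to an RDD of Rows, key each row by its id column, and
;; group -- the per-group transforms/enrichment happen after this point
(def grouped
  (-> (.javaRDD df)
      (f/map-to-pair (f/fn [row] (ft/tuple (.getAs row "id") row)))
      f/group-by-key))
```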
we were a smalllll startup so I didn’t have to convince anybody to let me use spark or clojure 🙂
Oh, cool
Were you using plumatic/schema for the schema checking or something else?
yeah
Cool
Given the option to start over again, would you do it with Spark and Flambo or would you use Onyx?
mostly for testing though, I’d turn off validation when running on the bigger dataset
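e.g. it was pretty much just this (fields are made up for the sketch):
```clojure
(require '[schema.core :as s])

;; made-up record schema for the sketch
(s/defschema VesselRecord
  {:id  s/Str
   :ts  s/Num
   :lat s/Num
   :lon s/Num})

;; while testing, validate records as they come off the RDD
(s/validate VesselRecord {:id "a1" :ts 1.0 :lat 51.2 :lon 4.4})

;; on the full dataset we'd skip the validate call (or gate it on a flag)
```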
ok
I wouldn’t use Onyx
I don’t know enough about that project’s maturity
I can’t really decide if I’d use Spark with Clojure again. Flambo is great and helpful but it mostly focuses on the RDD API while DataFrames et al. are evolving rapidly
but the java interop really isn’t all that rough
dataframes are just a PR away for flambo 😉
🙂
flambo works just fine on spark 1.6.1
just as another data point I’ve used flambo on spark 1.6.1 with no issues
all the common RDD functions and transformations that we use are supported by flambo. adding currently unsupported ones is trivial
and we largely leave those to PRs from others to add anything missing they want.
the real fun with flambo is starting an nrepl server up on a cluster, connecting to that with emacs, and doing interactive data exploration on the cluster 🙂
combine that with vizard http://github.com/yieldbot/vizard and you can have real fun
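the setup is basically just starting an nrepl server from the driver, roughly like this (a sketch; the port is arbitrary and it's the old tools.nrepl namespace):
```clojure
(ns explore.core
  (:require [clojure.tools.nrepl.server :as nrepl]
            [flambo.conf :as conf]
            [flambo.api :as f]))

;; the spark context lives in the driver jvm; once nrepl is up you can
;; ssh-tunnel to the driver, cider-connect from emacs, and poke at `sc` live
(defonce sc
  (f/spark-context
   (-> (conf/spark-conf)
       (conf/app-name "repl-on-cluster"))))

(defn -main [& _]
  (nrepl/start-server :port 7888)   ; port is arbitrary
  (println "nrepl listening on 7888")
  @(promise))                       ; keep the driver alive for the session
```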
@sorenmacbeth: have you guys had issues with clojure’s core data structures being quite a bit larger than java’s (e.g. a Clojure map vs a java HashMap)? Slow jobs or the like?
that interactive exploration with nrepl and vizard sounds really great
@ccann: no, because all of the clojure data structures are serialized with kryo
so they usually end up smaller than the java equivalents
plus the java api does a ton of disgusting implicit conversions to and from scala data structures
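the conf side is just a couple of settings, something like this (the registrator class name is from memory, double-check the flambo readme):
```clojure
(require '[flambo.conf :as conf])

;; tell spark to serialize with kryo; flambo layers serializers for
;; clojure's data structures on top of that via its registrator
(def c
  (-> (conf/spark-conf)
      (conf/master "local[*]")
      (conf/app-name "kryo-example")
      (conf/set "spark.serializer" "org.apache.spark.serializer.KryoSerializer")
      ;; NOTE: class name from memory -- check the flambo readme
      (conf/set "spark.kryo.registrator" "flambo.kryo.BaseFlamboRegistrator")))
```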
Oh, wow, okay. So the deserialization is fast enough to warrant that approach?
that’s reassuring. When I saw that the spark documentation recommends using something like http://fastutil.di.unimi.it/ instead of heavy java data structures, I always assumed I was getting slowdowns with clojure data structures
the serialization is handled by spark itself
spark supports kryo
yeah
it’s faster than the default java serialization
much faster
right okay my question makes less sense now
gotcha
I knew that kryo was faster than the default java serialization, but how do clojure structures end up smaller than java (assuming java is being serialized by kryo too)?
they probably end up about the same if you’re using kryo serialization in spark for java data structures
This all sounds great. I think I’m going to start doing local development with Clojure and Flambo and see where I get
i was assuming normal java serialization
gotcha okay
@stephenmhopper: cool