flambo

jrotenberg 2016-05-24T16:42:29.000086Z

hmmm

jrotenberg 2016-05-24T16:42:35.000087Z

i should look through all of my shit

jrotenberg 2016-05-24T16:43:20.000090Z

i’m going to update this to use dataframes for both ends

jrotenberg 2016-05-24T16:43:39.000091Z

because i think from a wrapping/interop perspective that’s gonna be way more flexible
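
For reference, loading and saving through the DataFrame API from Clojure on Spark 1.6 is mostly plain java interop — a rough sketch (paths, format choices, and options are illustrative):

```clojure
;; DataFrame in, DataFrame out, from Clojure on Spark 1.6 via interop
;; (paths, formats, and options here are illustrative)
(import '[org.apache.spark.sql SQLContext])

(defn roundtrip [^SQLContext sql-ctx in-path out-path]
  (let [df (-> (.read sql-ctx)
               (.format "com.databricks.spark.csv") ; spark-csv data source
               (.option "header" "true")
               (.load in-path))]
    ;; ...transformations on df would go here...
    (-> (.write df)
        (.format "parquet")
        (.save out-path))))
```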

jrotenberg 2016-05-24T16:43:58.000092Z

@ccann: thanks for the tip!

stephenmhopper 2016-05-24T18:09:18.000094Z

Can anyone here tell me about how they’re currently using Flambo at their companies? I work with Apache Spark on a daily basis (mostly from Scala), but would love to push my clients to move over to Clojure + Spark

jrotenberg 2016-05-24T18:22:33.000096Z

last year i started using flambo to process lots and lots of log data to update various metadata entries in one of our systems

jrotenberg 2016-05-24T18:22:46.000097Z

basically we have a 10 minute log window

jrotenberg 2016-05-24T18:23:00.000098Z

typically 6-12 million records per window

jrotenberg 2016-05-24T18:23:29.000099Z

spread across about 1000 logs that end up pushed to a shared directory

jrotenberg 2016-05-24T18:25:17.000100Z

so i used flambo to build a small system that looks for a complete set of logs, parses every single entry, and runs anywhere from zero to five functions on each line based on various params

jrotenberg 2016-05-24T18:25:24.000101Z

and creates a set of updates

jrotenberg 2016-05-24T18:25:57.000102Z

initially those updates were inline but it ended up being safer to create a separate set of updates, and then run an “updater” task against them
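
A rough sketch of the pipeline shape described above, with hypothetical names throughout (parse-entry, all-update-fns, and the :applies?/:update-fn keys are all illustrative):

```clojure
;; All names here are hypothetical: parse-entry is the log parser,
;; all-update-fns a seq of {:applies? pred, :update-fn f} maps.
(require '[flambo.api :as f])

(defn build-updates [sc log-dir]
  (-> (f/text-file sc log-dir)                 ; the shared log directory
      (f/map (f/fn [line] (parse-entry line)))
      ;; run zero or more update fns per entry, chosen per entry
      (f/flat-map (f/fn [entry]
                    (for [{:keys [applies? update-fn]} all-update-fns
                          :when (applies? entry)]
                      (update-fn entry))))))
;; the resulting RDD of updates is materialized as its own dataset;
;; a separate "updater" task then applies them, rather than updating inline
```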

jrotenberg 2016-05-24T18:26:08.000103Z

this was more of a ‘see how it works’ project

jrotenberg 2016-05-24T18:26:20.000104Z

but it worked really well, and let me write the whole thing in clojure

jrotenberg 2016-05-24T18:26:22.000105Z

so that’s a win

jrotenberg 2016-05-24T18:26:47.000106Z

now we are writing more sophisticated stuff and a lot of it is in scala

jrotenberg 2016-05-24T18:27:06.000107Z

but i’m pushing for clojure for at least a few things and hopefully we’ll get more as time goes on

jrotenberg 2016-05-24T18:27:38.000108Z

using the datastax spark/cassandra connector fairly heavily as well so my main task lately has been sorting that api out in clojure land
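
One way to reach the DataStax connector from Clojure is through its Java API — a minimal sketch, assuming the japi classes are on the classpath, with made-up keyspace/table names:

```clojure
(import '[com.datastax.spark.connector.japi CassandraJavaUtil])

;; sc is the JavaSparkContext flambo gives you; keyspace/table are made up
(defn cassandra-rows [sc]
  (-> (CassandraJavaUtil/javaFunctions sc)
      (.cassandraTable "my_keyspace" "my_table")))
```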

stephenmhopper 2016-05-24T18:27:58.000109Z

That sounds like fun

stephenmhopper 2016-05-24T18:28:18.000110Z

Was your organization already using Spark and you decided to write your app using Clojure and Flambo?

jrotenberg 2016-05-24T18:54:30.000111Z

yes

jrotenberg 2016-05-24T18:55:04.000112Z

we have data processing pipelines written in just about everything that’s existed for the past 12+ years

jrotenberg 2016-05-24T18:55:05.000113Z

so

jrotenberg 2016-05-24T18:55:07.000114Z

oracle

jrotenberg 2016-05-24T18:55:09.000115Z

hadoop

jrotenberg 2016-05-24T18:55:14.000116Z

and most recently spark

jrotenberg 2016-05-24T18:55:21.000117Z

mixed with everything in between

jrotenberg 2016-05-24T18:55:40.000118Z

basically a huge shitshow of infrastructure and legacy code

jrotenberg 2016-05-24T18:55:41.000119Z

heh

stephenmhopper 2016-05-24T18:58:01.000120Z

That sounds simultaneously fun and terrifying

jrotenberg 2016-05-24T20:07:05.000121Z

it’s both, pretty much 100%

sorenmacbeth 2016-05-24T22:09:24.000122Z

@stephenmhopper: we (yieldbot) use flambo for all our production spark jobs

sorenmacbeth 2016-05-24T22:10:14.000123Z

I’m fairly certain we are the largest company using it (we are quite large).

stephenmhopper 2016-05-24T22:38:25.000124Z

@sorenmacbeth: Cool. What’s feature parity look like between flambo and Spark 1.6?

stephenmhopper 2016-05-24T22:38:36.000125Z

is everything supported right now?

ccann 2016-05-24T22:53:57.000126Z

@stephenmhopper We were using it at http://weft.io (until last week when we ran out of money) for all our historical processing of maritime data. We had ~1TB worth. In Clojure we had one job for enforcing data schemas and one for enriching the data with new fields from other sources. We used DataFrame interfaces (spark-csv, spark-avro) for loading and saving (which required a fair bit of java interop) and the meat of our work was usually a group-by on an id field followed by some iterative transforming of the resulting grouped data (and eventually reformulations of this idea when we realized the grouped data was too big not to process in a distributed fashion)
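
A sketch of that group-by-then-transform pattern with flambo, assuming flambo’s group-by wrapper; the "id" field and transform-group are placeholders:

```clojure
;; group by an id field, then transform each group (names illustrative;
;; transform-group stands in for the iterative per-group work)
(require '[flambo.api :as f])

(defn enrich-groups [rows-rdd]
  (-> rows-rdd
      (f/group-by (f/fn [row] (get row "id")))
      ;; each element is now a scala.Tuple2 of [id, iterable of rows]
      (f/map (f/fn [t]
               (let [id   (._1 t)
                     rows (seq (._2 t))]
                 [id (transform-group rows)])))))
```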

ccann 2016-05-24T22:55:30.000128Z

we were a smalllll startup so I didn’t have to convince anybody to let me use spark or clojure 🙂

stephenmhopper 2016-05-24T22:55:37.000129Z

Oh, cool

stephenmhopper 2016-05-24T22:55:59.000131Z

Were you using plumatic/schema for the schema checking or something else?

ccann 2016-05-24T22:56:02.000132Z

yeah

stephenmhopper 2016-05-24T22:56:07.000133Z

Cool

stephenmhopper 2016-05-24T22:56:24.000134Z

Given the option to start over again, would you do it with Spark and Flambo or would you use Onyx?

ccann 2016-05-24T22:56:25.000135Z

mostly for testing though, I’d turn off validation when running on the bigger dataset
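
The validate-while-testing, skip-on-the-full-dataset pattern might look like this with plumatic/schema (the schema and flag are illustrative, loosely shaped after the maritime data mentioned above):

```clojure
;; validate in tests and small runs, pass records through untouched
;; when the flag is off (schema fields are made up)
(require '[schema.core :as s])

(s/defschema VesselRecord
  {:id        s/Str
   :timestamp s/Int
   :lat       s/Num
   :lon       s/Num})

(defn check-record [validate? record]
  (if validate?
    (s/validate VesselRecord record) ; throws on mismatch
    record))
```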

stephenmhopper 2016-05-24T22:56:30.000136Z

ok

ccann 2016-05-24T22:56:37.000137Z

I wouldn’t use Onyx

ccann 2016-05-24T22:56:52.000138Z

I don’t know enough about that project’s maturity

ccann 2016-05-24T23:00:22.000140Z

I can’t really decide if I’d use Spark with Clojure again. Flambo is great and helpful but it mostly focuses on the RDD API while DataFrames et al. are evolving rapidly

ccann 2016-05-24T23:00:53.000141Z

but the java interop really isn’t all that rough

sorenmacbeth 2016-05-24T23:02:41.000142Z

dataframes are just a PR away for flambo 😉

ccann 2016-05-24T23:02:47.000143Z

🙂

sorenmacbeth 2016-05-24T23:03:11.000144Z

flambo works just fine on spark 1.6.1

ccann 2016-05-24T23:03:32.000145Z

just as another data point I’ve used flambo on spark 1.6.1 with no issues

sorenmacbeth 2016-05-24T23:03:53.000146Z

all the common RDD functions and transformations that we use are supported by flambo. adding currently unsupported ones is trivial

sorenmacbeth 2016-05-24T23:04:14.000147Z

and we largely leave it to PRs from others to add anything missing that they want.
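
Since flambo hands back ordinary JavaRDD objects, an unwrapped transformation is one interop call away — zipWithIndex here purely as an illustration:

```clojure
;; lines-rdd is any existing flambo/JavaRDD value; zipWithIndex is just
;; one example of a method reachable directly when no wrapper exists yet
(def indexed (.zipWithIndex lines-rdd))
```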

sorenmacbeth 2016-05-24T23:05:32.000148Z

the real fun with flambo is starting an nrepl server up on a cluster, connecting to that with emacs, and doing interactive data exploration on the cluster 🙂

🦜 1
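
A minimal sketch of that setup, assuming tools.nrepl is on the driver’s classpath (the port is arbitrary); connect from emacs with M-x cider-connect:

```clojure
;; started from the driver program once the SparkContext is up;
;; port 7888 is an arbitrary choice
(require '[clojure.tools.nrepl.server :refer [start-server]])

(defonce repl-server (start-server :port 7888))
```
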
sorenmacbeth 2016-05-24T23:06:06.000149Z

combine that with vizard http://github.com/yieldbot/vizard and you can have real fun

ccann 2016-05-24T23:07:15.000151Z

@sorenmacbeth: have you guys had issues with clojure’s core data structures being quite a bit larger than java’s (e.g. map vs HashMap)? Slow jobs or the like?

ccann 2016-05-24T23:07:42.000152Z

that interactive exploration with nrepl and vizard sounds really great

sorenmacbeth 2016-05-24T23:12:26.000153Z

@ccann: no, because all of the clojure data structures are serialized with kryo

sorenmacbeth 2016-05-24T23:12:45.000154Z

so they usually end up smaller than the java equivalents

sorenmacbeth 2016-05-24T23:13:20.000155Z

plus the java api does a ton of disgusting implicit conversions to and from scala data structures
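
A sketch of the corresponding config — the registrator class name is taken from memory of flambo’s README, so treat it as an assumption and check your version:

```clojure
(require '[flambo.conf :as conf])

(def c (-> (conf/spark-conf)
           (conf/app-name "kryo-example")
           ;; tell spark to serialize with kryo...
           (conf/set "spark.serializer" "org.apache.spark.serializer.KryoSerializer")
           ;; ...and register clojure's data structures (class name per
           ;; flambo's README — an assumption, verify against your version)
           (conf/set "spark.kryo.registrator" "flambo.kryo.BaseFlamboRegistrator")))
```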

ccann 2016-05-24T23:14:51.000157Z

Oh, wow, okay. So the deserialization is fast enough to warrant that approach?

ccann 2016-05-24T23:16:40.000158Z

that’s reassuring. When I saw in the spark documentation that it recommends use of something like http://fastutil.di.unimi.it/ instead of heavy java data structures, I always assumed I was getting slowdowns with clojure data structures

sorenmacbeth 2016-05-24T23:17:00.000159Z

the serialization is handled by spark itself

sorenmacbeth 2016-05-24T23:17:12.000160Z

spark supports kryo

ccann 2016-05-24T23:17:16.000161Z

yeah

sorenmacbeth 2016-05-24T23:17:35.000162Z

it’s faster than the default java serialization

sorenmacbeth 2016-05-24T23:17:42.000163Z

much faster

ccann 2016-05-24T23:17:43.000164Z

right okay my question makes less sense now

ccann 2016-05-24T23:17:48.000165Z

gotcha

ccann 2016-05-24T23:18:32.000166Z

I knew that kryo was faster than the default java serialization, but how do clojure structures end up smaller than java’s (assuming java is being serialized by kryo too)?

sorenmacbeth 2016-05-24T23:19:58.000167Z

they probably end up about the same if you’re using kryo serialization in spark for java data structures

stephenmhopper 2016-05-24T23:20:11.000168Z

This all sounds great. I think I’m going to start doing local development with Clojure and Flambo and see where I get
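
A minimal local session to start from, following the shape of flambo’s README (app name is arbitrary):

```clojure
;; a tiny end-to-end run on a local[*] master
(require '[flambo.conf :as conf]
         '[flambo.api :as f])

(def c (-> (conf/spark-conf)
           (conf/master "local[*]")
           (conf/app-name "flambo-local-dev")))

(def sc (f/spark-context c))

(-> (f/parallelize sc [1 2 3 4 5])
    (f/map (f/fn [x] (* x x)))
    (f/reduce (f/fn [a b] (+ a b)))) ;=> 55
```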

sorenmacbeth 2016-05-24T23:20:13.000169Z

i was assuming normal java serialization

ccann 2016-05-24T23:20:32.000170Z

gotcha okay

sorenmacbeth 2016-05-24T23:30:46.000171Z

@stephenmhopper: cool