flambo

ccann 2016-05-23T13:44:36.000069Z

What I should have said was that Spark (and working w/ Spark in Clojure), on the particular problems that I had to solve, was a nightmare. I think for me it was that I was new to Spark and writing jobs in both Python and Clojure, and the Python versions were always much faster to write and get working with the rapidly changing API. With Clojure, flambo was super helpful, but because I was so new to Spark, when I saw performance differences between really similar Spark code in each language I didn’t know how to “fix it”. So I blamed everything iteratively (most recently switching to dataframes for loading and saving data, for their optimizations; I’d try to stay in dataframe-land as long as possible, but that quickly turned into using RDDs implicitly) and learned a lot, but ultimately I still can’t explain the performance differences between these Spark jobs across languages.

jrotenberg 2016-05-23T17:01:20.000070Z

@ccann: it’s definitely hard to know when something is “right”, that’s what i’ve found

jrotenberg 2016-05-23T17:01:32.000071Z

where performance is kind of the marker for right

jrotenberg 2016-05-23T17:02:09.000072Z

for me, so far, the best part is that everything is new so as long as i never make it slower, i’m usually happy with it

sorenmacbeth 2016-05-23T17:50:14.000073Z

never done or looked at the python performance

sorenmacbeth 2016-05-23T17:50:19.000074Z

but for clojure/jvm

sorenmacbeth 2016-05-23T17:51:23.000075Z

I look at the task times, GC times, make sure there are no straggler tasks, check for reflection in my code, check for lazy sequences
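
A minimal sketch of the "check for reflection" step, assuming plain Clojure with no Spark involved. `*warn-on-reflection*` is a standard Clojure var; the two functions are hypothetical names made up for illustration. With the flag on, the compiler prints a warning for any interop call whose target type it cannot determine, and a type hint is the usual way to remove the reflective call.

```clojure
;; Ask the compiler to warn about reflective interop calls
;; in the code that is compiled after this point.
(set! *warn-on-reflection* true)

;; Hypothetical example: the type of `s` is unknown at compile time,
;; so `.length` is looked up reflectively on every call and the
;; compiler emits a reflection warning for this form.
(defn unhinted-length [s]
  (.length s))

;; Same function with a ^String type hint: the compiler emits a
;; direct method call and no warning.
(defn hinted-length [^String s]
  (.length s))
```

Both versions return the same value; only the dispatch differs, which can add up when the function is called once per record.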

ccann 2016-05-23T20:31:34.000076Z

@jrotenberg: 100% agreed about it being hard to know when something’s “right”, or even “right enough”. Moreover I still don’t really have a great intuition in some cases for how to make something better.

ccann 2016-05-23T20:32:41.000077Z

@sorenmacbeth: that’s good advice. Lazy sequences in particular I never really looked into — are they a known problem area for spark?

sorenmacbeth 2016-05-23T20:34:26.000078Z

@ccann: not just spark, any distributed JVM stuff

ccann 2016-05-23T20:36:02.000079Z

do you know of a resource off-hand you could point me at to learn more? I can obviously just google around

sorenmacbeth 2016-05-23T22:46:53.000080Z

@ccann: not that I know of. just stuff from experience. favor vectors over seqs, wrap lazy evaluations in doall

sorenmacbeth 2016-05-23T22:47:06.000081Z

(doall (map…))

sorenmacbeth 2016-05-23T22:47:23.000082Z

(doall (for…))

sorenmacbeth 2016-05-23T22:47:34.000084Z

or wrap things in into

sorenmacbeth 2016-05-23T22:47:46.000085Z

(into [] (map…))
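
The mechanics behind that advice can be sketched in plain Clojure, with no Spark involved. The names `calls`, `expensive-parse`, and the `def`s are hypothetical, made up for illustration; the thread does not say why laziness hurts, so this only shows when the work actually runs.

```clojure
;; Counter used only to observe when elements are computed.
(def calls (atom 0))

;; Hypothetical stand-in for some per-record work.
(defn expensive-parse [x]
  (swap! calls inc)
  (* 2 x))

;; `map` returns a lazy sequence: nothing has been computed yet,
;; and the work happens whenever something later walks the seq.
(def lazy-result (map expensive-parse [1 2 3]))

;; `doall` walks the whole sequence immediately and returns it.
(def eager-seq (doall (map expensive-parse [1 2 3])))

;; `into []` also forces the work and yields a vector.
(def eager-vec (into [] (map expensive-parse [1 2 3])))
```

After this runs, `lazy-result` is still unrealized, while `eager-seq` and `eager-vec` have already done their work at the point where they were defined.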