What I should have said was that Spark (and working with Spark in Clojure) was a nightmare on the particular problems I had to solve. I think it came down to being new to Spark while writing jobs in both Python and Clojure: the Python versions were always much faster to write and get working against the rapidly changing API. With Clojure, flambo
was super helpful, but because I was so new to Spark, when I saw performance differences between really similar Spark code in each language I didn't know how to "fix it", so I blamed everything iteratively (most recently switching to DataFrames for loading and saving data, for their optimizations; I'd try to stay in DataFrame-land as long as possible, but that quickly turned into using RDDs implicitly). I learned a lot, but ultimately I still can't explain the performance differences between these Spark jobs across languages.
@ccann: it's definitely hard to know when something is "right"; that's what I've found
where performance is kind of the marker for right
for me, so far, the best part is that everything is new, so as long as I never make it slower I'm usually happy with it
I've never looked at the Python performance side
but for Clojure/JVM
I look at the task times and GC times, make sure there are no straggler tasks, check for reflection in my code, and check for lazy sequences
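For reference, a minimal sketch of the reflection check (the namespace and function names here are made up):

(ns myjob.core)  ;; hypothetical job namespace

;; With *warn-on-reflection* set, the compiler warns whenever a Java
;; interop call has to fall back to runtime reflection, which is a
;; common source of slow per-record code in Spark tasks.
(set! *warn-on-reflection* true)

;; Reflective: the compiler doesn't know the type of s, so this warns.
(defn trim-slow [s]
  (.trim s))

;; Type-hinted: resolved at compile time, no warning, direct method call.
(defn trim-fast [^String s]
  (.trim s))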
@jrotenberg: 100% agreed about it being hard to know when something’s “right”, or even “right enough”. Moreover I still don’t really have a great intuition in some cases for how to make something better.
@sorenmacbeth: that's good advice. Lazy sequences in particular I never really looked into; are they a known problem area for Spark?
@ccann: not just Spark, any distributed JVM stuff
do you know of a resource off-hand you could point me at to learn more? I can obviously just google around
@ccann: not that I know of, just stuff from experience. Favor vectors over seqs, and wrap lazy evaluations in doall
(doall (map…))
(doall (for…))
or wrap things in into
(into [] (map…))
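To make that concrete, here's a plain Clojure sketch of forcing lazy sequences before they leave a function (the names are made up, and this isn't tied to flambo's API):

(ns example.force-seqs  ;; hypothetical namespace
  (:require [clojure.string :as str]))

;; Lazy: map returns an unrealized seq, so the work happens later,
;; possibly after the context that created it (task scope, open
;; resources) is gone, and it serializes poorly.
(defn parse-lines-lazy [lines]
  (map str/upper-case lines))

;; Eager with doall: realizes the whole seq right here.
(defn parse-lines-doall [lines]
  (doall (map str/upper-case lines)))

;; Eager with into: realizes into a vector, which also lines up with
;; the "favor vectors over seqs" advice above.
(defn parse-lines-into [lines]
  (into [] (map str/upper-case lines)))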