Anyone used Cascalog on a very recent version of Hadoop? I’m using Hadoop 2.7.1 and my Map-Reduce jobs run fine on a local Hadoop instance for that version but not on the cluster. I get a Class Not Found Exception for cascading.tap.hadoop.io.MultiInputSplit. See https://www.refheap.com/112124
Oh yeah, Good Morning everyone
Good morning.
morning
@agile_geek: does your uberjar contain the cascading classes ?
@mccraigmccraig: I’ve not specifically included them but not sure if they are a transitive dependency of Cascalog and I’m not including hadoop-client (referenced as :provided to allow compilation). Should I be including them?
i'm guessing - i don't have specific experience, but either they will need to be in the uberjar, or you will need to install them on the cluster nodes
if they aren't being transitively included, then you could try adding them to your lein project
that's assuming that they aren't already installed on your cluster, which it looks like they aren't
or perhaps you have some terrible jar-hell problem
i think i like the node approach to dependencies better than the jvm approach
What are cascading classes?
@pupeno: cascading is the high-level hadoop interface on which cascalog builds
Ah, ok.
Thanks.
it lets you model your map-reduce ops as more familiar joins, aggregations etc
@mccraigmccraig: i thought similar myself, but the Cascalog docs are clear about not including any hadoop jars, and the examples don’t include anything other than cascalog. I have a feeling it’s something to do with the Hortonworks distribution not having Cascading on it.
that would make sense... i have no idea whether it's even possible to run with cascading in the uberjar - i presume there are some gnarly ClassLoader hierarchies inside hadoop, and components like cascading might have to be installed at a higher-level than the app
I assumed that too
I need to read around how cascading gets installed
I think I’ll start by looking at the transitive dependencies for Cascalog and unpacking my uberjar
@agile_geek: lein deps :tree is your friend :simple_smile:
Yep!
Along with piping its output to a file so I can search.
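A minimal sketch of that search in Python, assuming the output of `lein deps :tree` has already been saved to a file. The helper name `find_deps` and the file name `deps.txt` are made up for illustration; plain `grep` would do the same job.

```python
def find_deps(tree_text, needle):
    """Return the lines of saved `lein deps :tree` output that mention needle.

    Matching is case-insensitive, and leading indentation is stripped, so the
    nesting level of each dependency is not preserved in the result.
    """
    needle = needle.lower()
    return [
        line.strip()
        for line in tree_text.splitlines()
        if needle in line.lower()
    ]
```

Something like `find_deps(open("deps.txt").read(), "cascading")` would then list every line that mentions a cascading artifact, which makes it easier to see which version Cascalog pulls in.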
@mccraigmccraig: hmm, that class is in the uberjar…as it’s a transitive dep of Cascalog. It’s an older version (2.5.3 instead of 3.0.2) of it but it’s there. I wonder if version is causing an issue.
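For anyone wanting to repeat that check without unpacking the jar by hand, here is a rough sketch in Python. It relies only on the fact that a jar is a zip archive and that a class `a.b.C` is stored as the entry `a/b/C.class`. The helper names `class_entry` and `jar_contains_class` are hypothetical, not part of any tool mentioned here.

```python
import zipfile


def class_entry(class_name):
    """Map a fully qualified class name to its entry path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jar_contains_class(jar_path, class_name):
    """Return True if the jar at jar_path has an entry for class_name."""
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry(class_name) in jar.namelist()
```

A call such as `jar_contains_class("target/myjob-standalone.jar", "cascading.tap.hadoop.io.MultiInputSplit")` (the jar path is an assumed example) only says whether the class is packaged. It says nothing about which version it is, or whether the cluster's class loaders will actually see it.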
@agile_geek: so u have a newer version of cascading on hadoop, and an older version in your uberjar ? exclude cascading from your uberjar and pray 🙏
@mccraigmccraig: I’ll try it but I’m a bit confused. The stack trace is that this class is missing which suggests it’s not on the cluster OR in my uberjar. In all the examples of Hadoop-Cascading-Cascalog the Cascading jar needed to be jar’ed up and deployed, which it is - I’ve unpacked my uberjar and it’s there. Admittedly, it’s a slightly older version. I’ve tried excluding the older version of cascading and building on a newer one but I get the same error. I’ll try excluding altogether but can’t see how that can work as the class is definitely missing then!
As suspected excluding the cascading lib altogether means the job fails to even compile (eval) when the cascalog functions try to resolve any references to cascading. Previously it failed when it hit the cluster.
@agile_geek: can you pre-compile your sources, then exclude cascading from the uberjar ? then, if you are lucky and they are api compatible, your .classes will perhaps link to the cascading classes on the hadoop cluster
if that fails, then can i suggest spark on mesos 😉
That’s what I did. AOT all on uberjar but it fails as soon as I submit to hadoop
@mccraigmccraig: Unfortunately it took 5 years for the client to get Hortonworks distro of Hadoop approved! Not sure Spark and Mesos will take less than 10!
so you can't run against an EMR cluster instead of the one you are using ?
Nope
and presumably the hadoop distro you have is deeply frozen and there's no chance of getting anything on to or off of the node classpaths ?
You guessed it!
bugger
This job runs ok locally on same version!
I’m going to give up and write it in Java! Ouch!
you mean same version of hadoop or same version of same hortonworks distro ?
version of Hadoop
@agile_geek: you might have a look at : https://github.com/damballa/parkour/
i've not used it, but it looks interesting as a nice interface to vanilla hadoop
Unfortunately the only reason I got to do this bit in Clojure is I said it would be faster but as I’ve lost 2 days to this problem I think I’ve burnt my ‘goodwill’ and I will be forced back to Java.
ha, i guess the argument that "jar-hell is not peculiar to clojure and can burn any attempt to use just about anything on a fixed platform" won't melt much ice, huh ?
Nope. The ppl I talk to would hear Charlie Brown’s teacher “whah, whah, whah Clojure whah, whah, whah, doesn’t work whah whah…”
i shall not complain. this is the mechanism through which large organisations get their lunch eaten by smaller organisations. without it the world would still be dominated by feudal organisations which have been around for thousands of years. oh, wait...
:simple_smile:
:simple_smile:
@mccraigmccraig: So that's why Windows exists? I'd never thought about it that way!
Hello you lovelies
I've not been hanging here much, but should be less busy in 2016
anyone going (or submitting) to http://www.clojured.de/ ?