OR is easy, you just union the two sets, AND ... I guess you could do intersection
@mobileink: ^ but I think, either way, there needs to be a short way to formulate "I want all objects that satisfy <COMPLICATED CRITERION>", without describing how to search, and that's where datalog/query languages come into play
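As a rough illustration of the OR-as-union and AND-as-intersection idea above, here is a minimal sketch using `clojure.set`. The `people` data and the `matching` helper are made up for the example and do not come from the conversation.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sample data, only for illustration.
(def people
  #{{:name "Ada" :lang :clojure :remote? true}
    {:name "Bo"  :lang :java    :remote? true}
    {:name "Cy"  :lang :clojure :remote? false}})

(defn matching
  "Returns the set of items in coll that satisfy pred."
  [pred coll]
  (set (filter pred coll)))

(def clojurists (matching #(= :clojure (:lang %)) people))
(def remote     (matching :remote? people))

;; OR: anything found by either criterion.
(def either (set/union clojurists remote))

;; AND: only what both criteria found.
(def both (set/intersection clojurists remote))
```

This still spells out how to search, one criterion at a time, which is the gap a declarative query language is meant to close.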
Found this, and thought I would share… If you're dabbling with Dataflow: https://github.com/ngrunwald/datasplash
lovely stuff...
we actually use dataflow extensively with clojure
but we have our own fairly minimal wrapper. we're hoping one day to have time to open source it
that would be great.
@bfabry If you had an example of a simple pubsub source to bigquery sink, I would be eternally grateful.
lol, well in our repo that'd be something like (->> (io/read-resource "pubsub-url") (io/write-to-bigquery table-definition)) but that probably doesn't help you much
I will say there's some tension with how dataflow works and how clojure works, so you either end up having to AOT the world or have a few pieces of code that look like this
(defn clj-call-invoke
  [{:keys [full-name params ns-name fn-validation]} & args]
  (try
    (apply (var-get (find-var full-name)) (into (vec args) params))
    (catch Exception _
      (CljDoFnWithContext/synchronizedRequire ns-name)
      (try
        (apply (var-get (find-var full-name)) (into (vec args) params))
        (catch Exception e
          (if (= 'clojure-dataflow.pardo ns-name)
            (throw e)
            (throw (ex-info (str "Exception in " full-name) {:params params :data args} e))))))))
@bfabry out of curiosity, why do you need to aot the world?
because dataflow serializes dofn's and sends them over the wire to be executed. so when the dofn is executed either all of your functions need to be compiled and ready in the jar, or you need a piece of code like this that catches the exception the first time it happens and requires in all your clojure code
we do the latter, because I don't like AOT
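For readers following along, the catch-and-load idea can be sketched with nothing but core Clojure. This is a simplified variant and not the wrapper discussed above: it only retries when the var lookup itself fails, and `require-lock`, `resolve-lazily` and `call-by-name` are hypothetical names invented for the example.

```clojure
;; Hypothetical lock so concurrent callers do not load the same namespace at once.
(defonce require-lock (Object.))

(defn resolve-lazily
  "Finds the var named by the fully qualified symbol full-name.
  find-var throws IllegalArgumentException when the namespace has not
  been loaded yet; in that case the namespace is required and the
  lookup is retried once."
  [full-name]
  (try
    (find-var full-name)
    (catch IllegalArgumentException _
      (locking require-lock
        (require (symbol (namespace full-name))))
      (find-var full-name))))

(defn call-by-name
  "Calls the function named by full-name with args, loading its code on demand."
  [full-name & args]
  (apply (var-get (resolve-lazily full-name)) args))
```

A call such as `(call-by-name 'clojure.set/union #{1} #{2})` works whether or not `clojure.set` was loaded beforehand, which is roughly the situation a deserialized function finds itself in on a fresh worker.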
sorry, dofn? the reason i ask is because of my experience dealing with servlets. lots of clj servlet stuff aots everything, but you don't need to do that, you can just aot a dinky little clj file and leave the rest to clojure, by using :impl-ns appropriately.
like I said, you don't have to aot everything, you have to aot everything or have some points in your code that detect when the clojure code hasn't been loaded and load it
sorry i haven't used dataflow but it sounds like a similar situation. wherever there is a "container" that calls one of your methods.
so i guess my question is, what specific classes do you need to aot compile to make it work, if you do aot compile? e.g. with servlets you must aot compile a gen-class bit, but that's all you need to aot compile.
it's not a particularly similar situation
i mean a gen-class bit that extends HttpServlet. is there some class/interface like that in dataflow?
hmm, ok, i'll do some homework.
here's the line in datasplash where they end up having to do what I was talking about https://github.com/ngrunwald/datasplash/blob/master/src/datasplash/core.clj#L87
yikes. i'll have to actually think a little bit to figure that out.
i took a quick look at the dataflow docs and datasplash. it looks to me like the only reason to aot is because the container is looking for -main, no?
no
https://clojurians.slack.com/archives/google-cloud/p1485809943000713
heh. ok, more thinking. point being only that there is usually a midway point between aot everything, and complicated dynamic stuff. you can just aot compile a stub, and delegate everything to clojure.
sure, and that's what we do, although our "stub" is just written in java rather than aot'd clojure
doesn't get around the catch-and-load need though
interesting. time to learn dataflow.
@bfabry sorry to bother you, but looking at the docs, it seems that a pipeline must have a "main", no? the pipeline runner, as far as i can see, is effectively a container: when you go to run a pipeline, it looks for "main". the docs are somewhat less than clear on this.
a pipeline is an abstract description of the graph of operations you want performed upon a set of pcollections, it does not have a main function
ok, but then how does it get started? there must be an entry point, no?
we execute it by submitting it to the dataflow service via api
you can also run it locally using DirectPipelineRunner
ok, but i still do not see how this can work if we do not have a way of saying, in effect, "start here".
the runner must know what to do with the submitted pipeline.
i don't see how that can happen without a convention. if not "main", then there must be something else so the runner can decide where to start.
the runner must call some method, obviously.
it looks to me like a runner is a "container", just like a servlet container, or an AWS Lambda container, or whatever. am i missing sth?
our project certainly has a main function. it just doesn't really have anything to do with dataflow. it's the thing that constructs the execution pipeline and then submits it to the dataflow api. if you would like to learn dataflow I'd recommend you read the dataflow whitepaper, streaming 101 and 102, and dive in to the examples repository
will take a look, but only if you will take a look at JCL. Honestly, all this dataflow stuff has been around since about 1968.
we used to do it in cobol!
@mobileink: did you go dinosaur hunting too?
anyway, just a guess, the only thing you need to aot is the main fn.
@qqq i was raised on a dinosaur farm.
if you need a trexx, lemme know, i'll talk to some people who know some people.
there were no distributed data processing frameworks in 1968, let alone a unified batch and streaming one delivered as a managed service. like I said, you don't need to aot anything if you catch the exception and deal with code load manually, but you need to deal with code load somehow, because dataflow has a very different execution model
@bfabry: did i say 68? ok, 78. ;). my 1st programming job (i was a liberal arts major) was with EDS, and they took us all to cobol camp where we learned to do a "master file update", pure batch processing using JCL to control the "pipeline", and looking at dataflow, it's essentially the same thing, just a little fancier.
what goes 'round comes 'round!
the really cool thing is that boot is essentially JCL, done right.
for anybody interested in the history of programming: see http://www.mail-archive.com/ibm-main@bama.ua.edu/msg108364.html and http://www.drdobbs.com/architecture-and-design/a-personal-history-of-systems-and-comput/222700827
@bfabry serious question: if you've written your code in 100% clojure, and you do not aot anything, then how can the pipeline runner start your pipeline? it would have to know how to talk clojure, which seems unlikely, but even if it did know that it would still have to know where to start. can you explain how to run a pipeline with no aot at all?
https://clojurians.slack.com/archives/google-cloud/p1485811300000728
which is not related to how it is started. our start function, the main function, is defined in a namespace which we pass to clojure.main using the -m option
because clojure.main is aot'd and comes bundled in the clojure jar
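A minimal sketch of that arrangement, assuming a namespace called `example.launcher`: the namespace, the plain-data pipeline shape and `submit!` are all hypothetical stand-ins, and no real Dataflow API is used. The only point is that `-main` lives in ordinary, non-AOT'd Clojure source and is found by `clojure.main`.

```clojure
(ns example.launcher)

(defn build-pipeline
  "Returns a plain-data stand-in for a pipeline description (hypothetical shape)."
  [topic table]
  {:steps [[:read-pubsub topic]
           [:write-bigquery table]]})

(defn submit!
  "Stand-in for handing the description to a runner or service."
  [pipeline]
  (println "submitting" (count (:steps pipeline)) "steps")
  pipeline)

(defn -main
  "Entry point located by clojure.main; builds the description, then submits it."
  [& [topic table]]
  (submit! (build-pipeline topic table)))
```

With Clojure and the source directory on the classpath, something like `java -cp <classpath> clojure.main -m example.launcher my-topic my-table` would load the namespace from source and call `-main`, with no `gen-class` and no AOT step of your own.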
so, you always have sth aot compiled, even if you do not aot compile everything. is that correct?
sth?
something. dictionary geek here, sorry.
the java classes which call into clojure are compiled yes, and clojure.main (which comes compiled with clojure)
we could translate those java classes into clojure gen-class trickery and aot those if we wanted to as well
fwiw, not beating you up. i have what i think is a nice technique for dealing with this sort of situation, where a "container" kicks things off, thus requiring byte-code-on-disk.
which may apply here.
see the docs at https://github.com/migae/boot-ask/blob/master/README.adoc. still a WIP, but you'll get the idea.
in short: you can get the aot stuff but completely hide it, since it is always the same.
i.e. you can get rid of the java stuff, and in fact you can get rid of the gen-class stuff too.
I really doubt it
if you use boot.
heh, ok now i have to prove it. might take a few weeks, other stuff to do too.
please don't go out of your way. I'm perfectly happy having two 10 line long java classes. it doesn't concern me at all
oh, it's the challenge, makes it fun!
it's on my list anyway.
i don't suppose you could make the java bit publicly inspectable?
eh, sure I can give you the two most important ones
https://gist.github.com/bfabry/f89aa19eac563b83840839a6add02829
https://gist.github.com/bfabry/8df6cbf7a955484f388a44289896c480
thanks! will be in touch in a couple weeks.