data-science

Data science, data analysis, and machine learning in Clojure. See https://scicloj.github.io/pages/chat_streams/ for additional discussions.
niveauverleih 2020-05-21T08:05:44.214200Z

I am gathering some selling arguments for clojure with clj-python over plain python for evangelism. Is it faster, more stable, easier to put to production?

niveauverleih 2020-05-22T11:43:30.219400Z

Thanks all! Why does Chris Nuernberger say that one should use containers when using clojure and clj-python?

niveauverleih 2020-05-22T11:48:43.219600Z

@metasoarous I have to read up on the GIL. As for parallelization, doesn't python have a library for that? I know distributed computing is not the same as parallel computing, but doesn't Spark solve part of their speed problems?

niveauverleih 2020-05-22T11:52:19.219900Z

@daslu so the main take away from the long text are fun, easy to extend, high productivity, better REPL, right?

niveauverleih 2020-05-22T11:57:34.220100Z

@aaelony I'm afraid "one language to rule them all" will just make pythonistas suspicious. As for the package-management problems, could you please tell me more? I used pip before, and it worked.

2020-05-22T11:58:59.220300Z

@nick.romer I think that is a nice way to summarize it! I like the point it makes about the REPL in the last paragraph.

2020-05-22T11:59:40.220500Z

Because of this visibility advantage a common way to use Clojure is to model your problem as a transformation from datastructure to datastructure, testing each stage of this transformation in the repl and just letting the REPL printing show you the next move. 

niveauverleih 2020-05-22T11:59:50.220700Z

@metasoarous typing races: are you saying that one can type vectors faster in Clojure? (Sorry, my command of English is mediocre, I don't have access to hidden meanings).

niveauverleih 2020-05-22T12:15:20.221Z

@daslu I don't know these python functional pipelines that @metasoarous talks about. Can they be used for transforming datastructure to datastructure? Background info: I've just been accepted as a junior data scientist and will have to survive in python until I have gained enough credibility to start clojure evangelism. Worse, I'm a newbie to both clojure and python.

niveauverleih 2020-05-22T12:18:56.221200Z

... I've used transducers with core.async, so I have a vague idea what these transformations might look like.

2020-05-22T12:25:11.221400Z

I don't know so much about the different possibilities of doing Functional Programming in Python. When I have to, I like to compose things with toolz.curried.pipe as demonstrated here: https://toolz.readthedocs.io/en/latest/streaming-analytics.html but probably there are other options. Hoping your time with Python will not be so bad : ) .. till your time of evangelism comes.
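For anyone without toolz installed, here is a minimal stdlib-only sketch of the `pipe` idea. The `pipe` helper, the sample records, and the step functions are all made up for illustration. toolz's own `pipe` should behave the same way for a simple case like this, but this is not its implementation.

```python
from functools import reduce


def pipe(data, *funcs):
    """Pass data through each function in turn, left to right."""
    return reduce(lambda acc, f: f(acc), funcs, data)


# Hypothetical sample data: one dict per measurement.
records = [
    {"city": "Oslo", "temp_c": 4},
    {"city": "Cairo", "temp_c": 31},
    {"city": "Lima", "temp_c": 19},
]


def keep_warm(rows):
    """Keep only rows warmer than 10 degrees C."""
    return [r for r in rows if r["temp_c"] > 10]


def city_names(rows):
    """Project each row down to its city name."""
    return [r["city"] for r in rows]


# Each stage takes a data structure and returns a new one.
warm_cities = pipe(records, keep_warm, city_names, sorted)
print(warm_cities)  # ['Cairo', 'Lima']
```

Each step can be tried on its own at the prompt before adding the next one, which is the workflow described above.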

2020-05-22T12:28:22.221600Z

Yes, one can work with datastructures in Python in a somewhat functional fashion. But I think the Clojure REPL experience makes it clearer and simpler. Probably one reason for that is that things are Printed in the same notation they are Read. So the Read and Print parts of the Read-Eval-Print-Loop speak the same language. Does it make sense?

niveauverleih 2020-05-22T12:29:52.221900Z

Yes, that makes sense.

2020-05-22T12:31:10.222100Z

user=> ;; Evaluate something:
user=> (update {:x 9} :x inc)
{:x 10}
user=> ;; Take the printed result and pass it to be read for the next evaluation:
user=> (update {:x 10} :x inc)
{:x 11}
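For comparison, a small Python sketch of the same round trip, using only the standard library. The `Point` class is hypothetical. Plain dicts and lists print in a form Python can read back, but an object with the default repr does not, which is roughly the gap being described.

```python
import ast

# Plain data: the printed form is also valid input.
printed = repr({"x": 10})
read_back = ast.literal_eval(printed)


class Point:
    """Hypothetical class that keeps the default repr."""

    def __init__(self, x):
        self.x = x


# The default repr looks like <...Point object at 0x...>,
# which Python cannot read back in.
try:
    ast.literal_eval(repr(Point(10)))
    point_round_trips = True
except (ValueError, SyntaxError):
    point_round_trips = False

print(printed, read_back == {"x": 10}, point_round_trips)
```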

niveauverleih 2020-05-22T12:31:44.222300Z

That's clear. Different question. Are neanderthal and numpy overlapping in functionality? Would you mix them?

2020-05-22T12:35:44.222500Z

Afaik they overlap. Never tried to mix them.

niveauverleih 2020-05-22T12:41:34.222700Z

and thanks for the toolz link.

2020-05-22T17:29:11.233100Z

@chris441's tech.dataset has some level of interop with numpy matrices, but I don't think we get this with neanderthal.

2020-05-22T17:30:49.233300Z

toolz looks cool; I've used some things like that. I just hate that I have to go to a second hand library to get all the Functional Programming goodies. Basically every python program I've written since learning Clojure results in me implementing a few clojure.core functions. I'd rather not have to do that work if I don't have to.

2020-05-22T17:35:03.233500Z

And yes, Python is a fairly functional language as far as OOP languages go. In fact, I'd argue that at a core design level, it's actually functional first! The reason is that in Python, Objects & Methods are built using basic functional primitives. So it's not like Ruby, which, for all the hype over having lambdas, differs in that methods are not functions. It's possible to get to the function (lambda, really) for a given method, but it's an extra step. By contrast, in python, an object's "magic" o.__dict__ attribute holds its instance attributes, and the class's __dict__ holds the methods (as plain functions!) that the class knows about. Which is hella elegant in my opinion. For an OOP language, Python is pretty nice (again, IMHO).
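A quick sketch of the "methods are just functions" point. The `Greeter` class is hypothetical. One detail worth noting: instance attributes live in the instance's `__dict__`, while the methods are stored as plain functions in the class's `__dict__`.

```python
class Greeter:
    """Hypothetical class used only for this illustration."""

    def __init__(self, name):
        self.name = name

    def greet(self):
        return "hello, " + self.name


g = Greeter("ada")

# Instance attributes live in the instance's __dict__ ...
instance_attrs = g.__dict__  # {'name': 'ada'}

# ... while methods are plain functions stored on the class.
greet_fn = Greeter.__dict__["greet"]
is_plain_function = type(greet_fn).__name__ == "function"

# Calling the function with the instance as its first argument
# gives the same result as the method call g.greet().
same_result = greet_fn(g) == g.greet()

print(instance_attrs, is_plain_function, same_result)
```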

2020-05-22T17:39:36.233700Z

However... And this is a big "however": Functional programming is deeply limited in power (compared to its full potential) when you don't bake persistent/immutable data structures into the language. JS has this problem too; it has first class functions, and at least used to require you to use prototyping to build objects (which is also a functional pattern, albeit sometimes a messy one). But the lack of persistent data structures means you can't squeeze every bit of juice out of those FP patterns. And this isn't just academic: Perhaps the single strongest part of the JS ecosystem right now is React, which is fundamentally a Functional Reactive Programming paradigm. Time and time again I have found that vanilla JS React is significantly limited relative to ClojureScript+React due to this limitation. As with toolz in python, there are libraries for adding persistent data structures to JS. In fact, mori just rips off ClojureScript's persistent data structures as a JS lib! Which is great, but also not the same as being a pervasive and fundamental assumption in the language. In Clojure everything is built up around these ideas and so they carry much more heft and power.

2020-05-22T17:39:52.233900Z

Taking comments/questions in turn... <deep-breath>

2020-05-22T17:40:10.234100Z

Next up for package management (now let the real grilling begin).

2020-05-22T17:40:23.234300Z

Yes, pip works fine for the first few packages you install.

2020-05-22T17:42:17.234500Z

Until you end up needing to install a package that depends on a different version of a package than the one you already have installed. In python you cannot install more than one version of a package in a given python environment! Which, to be honest, is fucking lunacy. Ruby, JS, Java and by implication Clojure all dodge this bullet. Each project specifies which version it needs, but you can have multiple versions installed alongside each other.

2020-05-22T17:52:24.234700Z

Python does not allow this, and thus there have been scores of projects which try to patch over the fundamental failing, and none of them have really done a good job of it (again, IMHO).

2020-05-22T17:54:16.234900Z

In virtualenv you actually create separate environments for each project, and have to install things separately in each one of them. Managing and switching between these becomes a pain.

2020-05-22T17:56:38.235100Z

Pip has finally embraced this pipenv project that gives you something close to what we have had forever in a lot of languages, which is a single tool for specifying packages and versions in a file, and managing virtualenvs based on those: https://pypi.org/project/pipenv/

2020-05-22T17:58:00.235400Z

Hopefully that project is going well. It's very new, and I haven't looked at it since maybe 1.5yr ago. When I did, I had some issues with it, but it seems to be doing the right thing by copying "bundler, composer, npm, cargo, yarn, etc." (aka all the things that other languages have had for years, and Pythonistas are only now coming around to...).

2020-05-22T18:02:38.235600Z

To be fair, I've found Python libraries to be pretty good about not breaking by themselves, probably because of lived experience with the pains of package management in this kind of system. But also because of the Python etiquette for simplicity (as they see it, at least: "there should be one obvious right way to do things"; to be contrasted with the ruby philosophy of "monkeypatching goes vrrrrrrmmmm!"), I think there's a natural inclination towards some of the Clojure philosophy of simplicity, which may rub off in part as "try not to break things". In Ruby, any upgrade of any major Rails or Active whatever infrastructure meant a big refactor. Not as frequently the case in Python, in my experience. So credit where due.

2020-05-22T18:03:01.235800Z

However, and we all knew this was coming. Drumroll please!

2020-05-22T18:03:06.236Z

Py3k.

2020-05-22T18:03:50.236200Z

The language itself likes to break things! 🎉 (Actually I think it may be worse than Ruby in this ironically, though I really haven't used Ruby in years)

2020-05-22T18:09:51.236400Z

It has taken Python something like 15 years to transition to Python 3. People are still running Python 2! It's getting more and more uncommon, but contrast the situation with that in pretty much any other language.

2020-05-22T18:10:39.236600Z

A lot of Pythonistas hate this, and I think it will prime them nicely for Clojure, which in language and community has an exceptional ethic of not breaking things.

2020-05-22T18:12:43.236800Z

I have seen such an ethic in exactly 0 other languages I have spent time with. Everyone seems to think that by incrementing a version number it's safe to break things (semver). Wrong! It still causes pain! https://www.youtube.com/watch?v=oyLBGkS5ICk

2020-05-22T18:18:49.237100Z

Anyway, the part that pipenv (see above) doesn't solve is the rest of your system (native dependencies, etc). This is more in the realm of conda, which to some extent does a decent job of aiming at that problem, but also (to my knowledge) doesn't interoperate with pipenv 😂🔫

2020-05-22T18:19:33.237300Z

Clojure has a sort of unfair advantage here, which is that it's hosted on the JVM, and so there generally aren't as many system-level libraries and such needed.

2020-05-22T18:20:36.237500Z

This is all important context for understanding how Clojure could potentially help, and is where we dovetail with @chris441's point about containers.

2020-05-22T18:23:58.237700Z

Because Clojure also doesn't really solve the problem of native/system dependencies (again, because it mostly doesn't have the problem, except when doing things like this: trying to interop with python or low level computational libraries), there's a space here where, if we build good tooling (ideally as an extension of, or at least compatible with, the deps.edn config), we could potentially solve some of the combined problems of conda+pipenv 🤔 🏗️

2020-05-22T18:25:14.238Z

Imagine a world where pythonists would use this cool new tool for solving their combined conda+pipenv needs, that lets them write in python! 🌈 🦄

2020-05-22T18:27:16.238200Z

And where that tool just happens to be implemented in Clojure and comes preconfigured with libpython-clj so that curious Pythonistas can dabble in the divine art of true parallelization and immutable data! 🕍 🎆 🙏

2020-05-22T18:29:10.238400Z

I get that none of this is easy, but all of this brings me to my last point in response to @nick.romer's questions/comments: Parallelism!

2020-05-22T18:36:18.238600Z

Python does have some libraries for parallel computation, but they are nothing like the support we have in Clojure. A lot of this (to be honest) comes from being on the JVM. The JVM is awesome for threading. Clojure takes it to the next level by providing pervasive persistent & immutable data, which solves a lot of the problems one gets with place oriented (and Object Oriented) programming in particular (see https://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hickey/).

2020-05-22T18:40:06.238900Z

But it goes even further than that! Clojure provides state management primitives like atoms, refs, agents, futures, and abstract/custom dereffables (like Reagent's ratoms)!

2020-05-22T18:42:35.239100Z

It also comes with concurrency tooling like core.async, which, it's worth mentioning, is a testament to Clojure's macro capabilities as a lisp! 💪 With macros, lisps allow users (programmers, libraries) to extend the syntax of the language, and so we were able to copy design features from the Go programming language 🎉

2020-05-22T18:43:29.239300Z

What python has (the multiprocessing library) only allows you to spawn new processes from a parent python process.

2020-05-22T18:46:11.239500Z

This is different than being able to spawn new threads in that processes can't share data in memory like threads can. You're stuck with message passing which requires data serialization/deserialization, and can severely constrain the kinds of things you can do.
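A small sketch of what that serialization step implies, using `pickle` directly (multiprocessing pickles the objects it passes between processes by default). The payload is made up, and no processes are actually spawned here. It only shows the copy semantics and one kind of value that cannot be sent.

```python
import pickle

# Hypothetical payload a parent process might send to a worker.
payload = {"ids": [1, 2, 3]}

# Crossing a process boundary means serialize, then deserialize.
received = pickle.loads(pickle.dumps(payload))

# The receiver gets an equal copy, not the same object in memory,
# so mutating it does not affect the sender's data.
received["ids"].append(4)

# Some values cannot be sent at all, for example lambdas.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_picklable = False

print(payload, received, lambda_picklable)
```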

2020-05-22T18:46:32.239700Z

In python, the GIL means that only one thread can ever be running Python code at a time.

2020-05-22T18:47:05.239900Z

So you can share memory between "threads", but they're not really threads, because you can't get them to run in parallel.
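The shared-memory half of that is easy to see in a sketch. This does not demonstrate the GIL itself; it only shows that threads write to the same in-memory list without any serialization. The `worker` function and the squared values are made up for illustration.

```python
import threading

shared = []              # one list, visible to every thread
lock = threading.Lock()  # guards the list while appending


def worker(n):
    """Append n squared to the shared list."""
    with lock:
        shared.append(n * n)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All four results landed in the same list. The order depends on
# scheduling, so sort before printing.
print(sorted(shared))  # [0, 1, 4, 9]
```

On a standard CPython build the workers' Python code takes turns holding the GIL, so CPU-bound work like this gains nothing from the extra threads.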

2020-05-22T18:48:09.240100Z

Spark is basically just a layer of abstraction over lots of separate processes, which might be python or other (in fact it's JVM, so we win there as well because it's easier to interop there, and we have some very snappy Spark wrappers, or you can interact with the JVM code directly).

2020-05-22T18:49:29.240400Z

But you're still going to have the same constraints around processes passing messages, and will have the added overhead of... Spark as infrastructure.

2020-05-22T18:50:51.240600Z

Bottom line: There's a lot that Pythonistas stand to gain by working with the Clojure community, but the reverse is also true. We should be thinking about how our communities can benefit each other.

2020-05-22T18:52:03.240800Z

The idea of "one language to rule them all" is in some ways a sort of lisp pipe dream, but also one which Clojure has sort of come closest to with its hosted philosophy & design.

2020-05-22T18:52:25.241Z

But it's not really "one language to rule them all", it's "one language to connect them all" 🌈 🦄 ❇️

2020-05-22T18:56:58.241400Z

Oh; looked up and saw two more things to respond to:
> typing races: are you saying that one can type vectors faster in Clojure? (Sorry, my command of English is mediocre, I don't have access to hidden meanings).
Yes, not as in static typing but as in literally keyboard typing. Try it:

[1 2 3 4 5 6 7 8 9]
;vs
[1, 2, 3, 4, 5, 6, 7, 8, 9]

2020-05-22T18:58:00.241600Z

Obviously, it's a little bit silly calling out little things like commas, but that's exactly the point! Anyone who says "BuT thE COmMaS!?!" is a) focused on minutiae, and b) not seeing the whole picture.

2020-05-22T19:00:05.241800Z

I would take Clojure's parens over the commas and semicolons and syntactic conflation/complection of code blocks with data structures (looking at you { ... }) of other languages any day.

2020-05-22T19:05:24.242Z

Last thing: functional pipelines. @nick.romer You are already familiar with transducers, so you more or less know what I'm talking about. But also, transducers are the more decomplected (simpler/more-general/more-powerful) versions of a thing we have long done with the -> and ->> (and cond->/cond->>) macros, which more or less do the same thing but greedily. (-> x (f :arg) (g 1 2)) is the same as

(g (f x :arg) 1 2)
Meanwhile, (->> xs (map f) (filter g)) is the same as
(filter g
        (map f xs))

2020-05-22T19:07:08.242200Z

Basically just a nice way of taking (certain) deeply nested Clojure expressions and rewriting them as a sequence of operations, much like with transducer composition. I guess the above example looks better like:

(->> xs
     (map f)
     (filter g))

2020-05-22T19:08:06.242400Z

I actually usually end up writing threadf and threadl functions in python projects now, where I pass vectors of [fn, arg1, arg2], because I find it easier to read.
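A guess at what such helpers might look like, as a minimal sketch. The names `threadf` and `threadl` come from the message above, but the bodies, the `[fn, arg1, ...]` step format as written here, and the sample functions are assumptions, not the author's actual code.

```python
import operator


def threadf(x, *steps):
    """Like Clojure's ->: x becomes the first argument of each step."""
    for fn, *args in steps:
        x = fn(x, *args)
    return x


def threadl(x, *steps):
    """Like Clojure's ->>: x becomes the last argument of each step."""
    for fn, *args in steps:
        x = fn(*args, x)
    return x


def inc(n):
    return n + 1


def is_even(n):
    return n % 2 == 0


# Roughly (->> (range 6) (map inc) (filter even?)), realized as a list.
evens = threadl(range(6), [map, inc], [filter, is_even], [list])

# Roughly (-> 10 (- 3) (* 2)), i.e. (10 - 3) * 2.
result = threadf(10, [operator.sub, 3], [operator.mul, 2])

print(evens, result)  # [2, 4, 6] 14
```

Note that `map` and `filter` are lazy in Python 3, hence the final `[list]` step to realize the result.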

2020-05-22T19:10:14.242700Z

OK; Sorry for the long rant. Hope that was helpful.

👍 2
jsa-aerial 2020-05-22T19:59:51.242900Z

Wow, this thread is crazy long. Even so, I think a few points of clarification might help. Neanderthal is analogous to Numpy while tech.ml.dataset (TMD) covers more the Pandas, data.table, et al. space. So, Neanderthal is for real number crunching while TMD is for dataframe style slicing and dicing. Python doesn't really have any parallelism capabilities (due to the GIL issues) - it does have concurrency. Sort of. Anyone who has ever tried to do concurrency in Python knows it is just awful. There have been loads of libs over the decades trying to 'fix' this in various ways. Asyncio was supposed to be the final true way and was 'blessed' by Guido. But it is ludicrously complex and error prone. Trio is a lib that actually finally does (mostly) work for concurrent / async work. Thankfully. I threw away a bunch of brittle horrible asyncio code and replaced it with trio code which actually worked w/o random exceptions and other idiocies. The python2 vs python3 fiasco isn't as bad as the perl 5 / 6 mess, but it is closer than you might think. There is good evidence (and reasons) that py2 will be around more or less forever. IMO, Python is very non functional in almost all aspects. You can beat on it to some extent to use it in a broken functional manner, but it fights you all the way. This isn't really surprising as Guido is on record dissing functional programming. Indeed, he tried hard to get rid of the simple functional constructs that it does have but failed.

2020-05-22T20:11:00.243100Z

Thanks for the clarifications and additional context @jsa-aerial.

👍 2
2020-05-22T20:13:54.243300Z

My point about tech.ml.dataset is that IIRC libpython-clj does a bit of work to give you zero-copy mappings to/from tech.ml data structures and what I thought were numpy data structures. I could have gotten that wrong, and that mapping is more directly to pandas. In any case, your point is well taken that conceptually tech.ml.dataset & pandas serve the same role.

2020-05-22T20:17:38.243500Z

The only real way that python is functional is in embracing first class functions as values. Which in my book is sort of the minimal sufficient condition. And again, their OOP approach leans heavily on that assumption, hence my perspective on it. But you're absolutely right @jsa-aerial that this is pretty weak sauce relative to what you can do with first-class functions when you build around a core of immutable data structures & functions, and provide utilities for managing state separate from value (atoms, software transactional memory, etc).

2020-05-22T20:49:11.243700Z

BTW, not having commas is actually important in making code-editing easier. It is not just about aesthetics and space-saving. In languages that require commas, whenever I have to comment out some part of a big data structure (say, a long list, or some nested dict, or something), it becomes quite annoying. Lots of editing is required to take care of the commas. In clojure, you just comment out (`#_`) the relevant inner form, without any such bother.

✔️ 1
jsa-aerial 2020-05-22T21:16:04.244700Z

@metasoarous I don't think you are wrong about that bit (as Pandas sits on top of Numpy to various extents), but TMD is focused on the Pandas columnar data slicing and dicing stuff - Clojisr also uses it to map to/from R dataframe things (dplyr / data.table / et al.). TMD also tries (like Pandas on top of Numpy) to utilize efficient native memory.

✔️ 1
jsa-aerial 2020-05-22T21:19:38.245100Z

Re: Python func stuff: Yeah, I think that is a reasonable point Chris. And following that line of thought, (it hurts to say this but...) JS is much more functional than Python

✔️ 1
jsa-aerial 2020-05-22T21:25:35.245400Z

How do you quote in F-ing slack anyway??? Daniel: "BTW, not having commas is actually important in making code-editing easier." This is definitely a VERY big deal. This has made Saite code transformation (at the editor level) far more simple. Having extraneous, and totally irrelevant, syntax is an f-ing disaster.

jsa-aerial 2020-05-22T21:33:29.245600Z

Geez as long as I am on this, another thing that sucks like the tar pit from hell, is Python (and R just as much - maybe even worse) scoping 'rules'. Yeah, if it weren't such a geyser of bugs, this would be a LOL statement.
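The message doesn't say which scoping behaviors it has in mind. One commonly cited Python example, offered here as an illustration rather than as the specific complaint above, is late binding in closures:

```python
# Closures capture the variable, not its value at creation time,
# so every lambda sees the final value of i.
callbacks = [lambda: i for i in range(3)]
late_bound = [f() for f in callbacks]

# Common workaround: bind the current value as a default argument.
callbacks_fixed = [lambda i=i: i for i in range(3)]
early_bound = [f() for f in callbacks_fixed]

print(late_bound, early_bound)  # [2, 2, 2] [0, 1, 2]
```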

🎉 1
2020-05-22T22:00:56.245900Z

You can quote with > followed by the quote

👍 2
🆒 1
🙏 2
jumar 2020-05-28T05:27:51.251800Z

@metasoarous wow, you should write a blog post about this 😉

2020-05-28T18:01:41.252Z

Yeah... I kinda started thinking that about half way through

2020-05-28T18:04:29.252200Z

I'm going to assume that since this is a more or less public forum, it would be fine for me to use folks' handles or names (happy to link out to folks' Twitter accounts as well), but if you've contributed to this thread please let me know if you'd like yours to be elided.

niveauverleih 2020-05-28T18:52:48.252400Z

+1 for the blog post.

✔️ 1
chrisn 2020-05-29T14:01:24.253600Z

I don't have much to add here as @metasoarous really did sum things up really well.
1. I think containers are important because I don't have the time to dive into everyone's individual machine configuration and figure out why the thing does not work. If you come to me with an issue that seems to be hardware or machine config related then my first response will be to ask you to reproduce it in a container. You need containers anyway to move to production, and conda+docker gives you a completely reproducible pathway for a lot of things. We have put effort into making a container development flow as painless as possible; for instance, we have a container that mounts the local directory and runs as your logged in user so it reads/writes files as your logged in user.
2. We do have support for zerocopy to/from neanderthal. In my talk at around 14:38 (https://www.youtube.com/watch?v=vQPW16_jixs) I show zerocopy from neanderthal to numpy via the tech platform. I haven't kept up the neanderthal bindings, mainly because it is tough to dig through the class hierarchy that Dragan uses, but the bindings used in the talk are at https://github.com/techascent/tech.neanderthal. They are somewhat out of date.

🙏 1
2020-05-29T21:32:37.254300Z

Wonderful; Thanks @chris441!

2020-05-21T10:19:33.216Z

It is easier to combine with the JVM. I think the sweet spot is saying that the data transformation and IO are done in Clojure and the computational part in python.

2020-05-21T10:37:50.216100Z

libpython-clj's docs offer some thoughts on what clojure is about, with a pythonista reader in mind: https://github.com/clj-python/libpython-clj/blob/master/docs/new-to-clojure.md

2020-05-21T18:29:09.216400Z

A few big selling points to me:
• Clojure is a data-driven language, and so actually much better for basic data manipulation than python (functional pipelines etc are the shit)
• Clojure is much faster than python, generally speaking
• Clojure runs on the JVM, which has real threads and no GIL, making it a better choice for parallelization
• Moreover, RH lists parallelization/concurrency as one of his core motivations for Clojure, and so it has lots of built-in primitives like atoms, refs, agents, futures (not to mention libs like core.async), which greatly empower this sort of work.

2020-05-21T19:06:57.216600Z

No more "indentation errors" 😄

🔥 1
2020-05-21T19:07:32.216800Z

Also: "One language to rule them all..."

💯 1
2020-05-21T20:17:04.217200Z

Absolutely! And ClojureScript is a huge win here as well! A solid dynamic front end target is something python lacks, and which is gold for productizing data-science & data-viz.

2020-05-21T20:17:20.217400Z

No more commas...

2
2020-05-21T20:17:24.217600Z

Seriously; fuck commas

2020-05-21T20:18:54.217800Z

Here's the 🎫:
• Host typing races where clojurists and pythonists have to type vectors/lists of numbers
• let the naysayers eat their "BuT tHE paREnS?!?"
• ...
• profit

2020-05-21T21:41:23.218300Z

I am hoping that eventually, the management of sandboxed conda-like environments (or docker) will be driven by clojure and allow libpython-clj to be a "run anywhere" seamless thing

2020-05-21T21:55:29.218500Z

YES

2020-05-21T21:55:39.218700Z

Jesus... the package management system there is a nightmare

💯 1
2020-05-21T21:57:00.218900Z

If we cleaned that nightmare up for them, that could be a huge sell.

👍 2