Hello all. Anyone come across this issue loading libcublas?
Stack trace from the attempt to load the library as a resource:
java.lang.UnsatisfiedLinkError: /tmp/libJCublas2-0.8.0-linux-x86_64.so: libcublas.so.8.0: cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1086)
at jcuda.LibUtils.loadLibraryResource(LibUtils.java:260)
at jcuda.LibUtils.loadLibrary(LibUtils.java:158)
at jcuda.jcublas.JCublas2.initialize(JCublas2.java:81)
at jcuda.jcublas.JCublas2.<clinit>(JCublas2.java:66)
at uncomplicate.neanderthal.internal.device.cublas$eval51203.invokeStatic(cublas.clj:764)
I've got CUDA installed to /opt/cuda, and can see libcublas.so in my LD path.
Oh, wait. I have version 9 and this is looking for version 8!
ldconfig -p | rg libcublas
libcublas.so.9.0 (libc6,x86-64) => /opt/cuda/lib64/libcublas.so.9.0
libcublas.so (libc6,x86-64) => /opt/cuda/lib64/libcublas.so
@jcf You are right. It currently requires CUDA 8 (downgrade helps if you've already updated to 9).
It will soon be upgraded to CUDA 9.
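For anyone hitting the same error, here is a minimal sketch of the check done by eye above: compare the soname from the UnsatisfiedLinkError with what `ldconfig -p` lists. The helper name `available_sonames` is made up for illustration, and the listing is just the two lines pasted above.

```python
import re

def available_sonames(ldconfig_output, library):
    """Return the sonames of `library` found in `ldconfig -p` style output."""
    pattern = re.compile(r"^\s*(" + re.escape(library) + r"\.so[\w.]*)\s+\(")
    names = []
    for line in ldconfig_output.splitlines():
        match = pattern.match(line)
        if match:
            names.append(match.group(1))
    return names

# The two lines reported in the chat above.
listing = """\
libcublas.so.9.0 (libc6,x86-64) => /opt/cuda/lib64/libcublas.so.9.0
libcublas.so (libc6,x86-64) => /opt/cuda/lib64/libcublas.so
"""

found = available_sonames(listing, "libcublas")
required = "libcublas.so.8.0"  # the name from the UnsatisfiedLinkError
has_required = required in found
print(found, has_required)
```

The unversioned libcublas.so does not help here, because the native JCublas library was linked against the versioned name.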
@blueberry I don't know if you have seen it, but I am discussing autograd and neanderthal with the cortex guys: https://groups.google.com/forum/#!topic/clojure-cortex/ba4eVXT8DMM
I would be interested whether you think that a generalized buffer management DSL like Cortex builds has advantages over neanderthal's direct API mapping. One can do ahead-of-time optimizations and compilation with a data description (AST) of tensor operations. I don't see this as competing with neanderthal, because it provides the low-level primitives. But I might miss some subtleties.
And I am curious about your opinion on buffer-management in the special purpose low-level matrix formats, that are actually used by most higher-level frameworks.
I think the plurality of deep learning frameworks is obfuscating the fact that they all use these low-level APIs.
BLAS in form of MKL, OpenCL or Cuda and additional ones like cuDNN.
One interesting thing with that is that it is (was for me) impossible to find any comparison with any other implementation. Either performance-wise or by the ease of use. I am interested in Cortex and follow what's happening from time to time. I never saw any comparison of Cortex with any of the leading libs (TF, Caffe, Pytorch etc.), or even with dl4j, that says "in this and this example, Cortex achieved this speed compared to library X", or "we showed that cortex requires less configuration" or whatever. There are some blog posts, but these are more like "you can build this example model with cortex" and nothing more. So, the only thing interesting to me now is that it is a Clojure-centric library. That is awesome, but only if I can leverage it to build custom, or not-strictly-NN things with it. If it is aimed at users who need a packaged ready-to-use solution where you provide some configuration map, and the lib figures some of the built in stuff, and that's it, then why would I use it instead of TF? I would like to know, but until now I didn't find a compelling answer. Of course, there's nothing wrong with that, ThinkTopic earns money using Cortex, and they are happy with it...
The thing with that is that they use something akin to NDArray. This is basically an N-dimensional dense cube.
This is fine if you only need it for "standard" NNs. But I am not sure what kind of "generalized buffer management" that refers to. They reshape a hypercube from one dimension to the other, but the structure is always dense, without any other automatic property. How do you specify an (optimized) symmetric 2-D array there? What is the performance of linear algebra operations?
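To make the symmetric-matrix question concrete, here is a minimal sketch of what an "optimized" symmetric layout can look like: only one triangle is stored, so an n x n matrix needs n(n+1)/2 elements instead of n^2. The class and function names are hypothetical, and this is one common packed layout, not a description of how neanderthal or Cortex store anything.

```python
def packed_index(i, j):
    """Map (i, j) of a symmetric matrix to an index into packed
    lower-triangle storage, laid out row by row."""
    if j > i:
        i, j = j, i  # symmetry: a[i][j] == a[j][i]
    return i * (i + 1) // 2 + j

class PackedSymmetric:
    """Symmetric n x n matrix that stores only n * (n + 1) / 2 elements."""
    def __init__(self, n):
        self.n = n
        self.data = [0.0] * (n * (n + 1) // 2)

    def get(self, i, j):
        return self.data[packed_index(i, j)]

    def set(self, i, j, value):
        self.data[packed_index(i, j)] = value

s = PackedSymmetric(4)
s.set(1, 3, 2.5)  # writing (1, 3) also defines (3, 1)
print(len(s.data), s.get(3, 1))
```

A plain dense N-d array has no slot for this kind of structural information, which is the point of the question.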
Yes, I see this problem as well.
In your particular case, since you need to experiment with a novel NN method, how can you (re)use Cortex there? I have no idea.
A question I have is whether it is possible to transform (reshape) these arrays through the low-level APIs and how the OpenCL access to the buffers work in that context. Is it possible for example to directly transfer a triangular matrix into an equivalent dense one without unnecessary copying.
In neanderthal: of course, provided that you want a dense triangular matrix (TR)
(def a (fge 5 5 (range 1 26)))
(view-tr a)
that is, for tr->ge
(def a (ftr 5 (range 1 16)))
(view-ge a)
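What "view without copying" means can be sketched outside of neanderthal like this. It is only an illustration of the idea: the class names are hypothetical, the layout is row-major for simplicity, and none of this is neanderthal's actual implementation.

```python
class Dense:
    """Row-major dense matrix over a flat list."""
    def __init__(self, rows, cols, values):
        self.rows, self.cols = rows, cols
        self.buf = list(values)

    def get(self, i, j):
        return self.buf[i * self.cols + j]

class LowerTriangularView:
    """Reads the lower triangle of a square Dense matrix.
    It keeps a reference to the dense matrix, so nothing is copied."""
    def __init__(self, dense):
        assert dense.rows == dense.cols
        self.dense = dense

    def get(self, i, j):
        # Entries above the diagonal are treated as zero, never read.
        return self.dense.get(i, j) if j <= i else 0.0

a = Dense(5, 5, range(1, 26))
t = LowerTriangularView(a)
a.buf[5] = 100  # element (1, 0) of the dense matrix
print(t.get(1, 0), t.get(0, 1))
```

Because the view shares the buffer, a write through the dense matrix is visible through the triangular view.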
The question with the NDarray approach is: can it describe anything other than the general dense matrix (in the case of dim=2)?
As for your own opencl or cuda kernels: You get the raw buffer. Accessing the structure inside the kernel is completely up to you.
Of course, with either of the libraries, you can take the raw buffer structure and do whatever you please with its contents...
Yes. But the question is whether I am doing stupid memory copying or accesses from the low-level perspective. I can describe all kinds of high-level tensor operations, like reshaping for example. But if they result in copying of memory or inefficient access to the elements in tensor operations, then this is a clear problem that cannot be abstracted away.
I understand neanderthal as focusing on avoiding exactly this problem of all generalizing APIs for tensors, e.g. an ndarray lib.
I still think these operations should be supported, but it has to be possible to opt out. This is only possible if the higher-level abstractions are built out of the lower ones. That is why I think a stack on neanderthal might be a much better toolbox for optimization pipelines, than defining some high-level API which leaves the rest of the mapping to low-level primitives to external, opaque AOT compilation pipelines.
I understand that these pipelines represent significant engineering effort.
Yet my current experience tells me that this is not really that relevant for deep learning in large scale environments.
I have problems with the Python runtime as a deployment and data processing environment (slow, hacky and a lot of accumulated debt + lack of enterprise support). But not with pytorch.
This is a very important insight, I think. Up until pytorch I bought the common wisdom that TensorFlow, Theano or Mxnet are necessary as a high-level API interface to a middleware doing all the stuff for you.
Pytorch only provides autograd and the necessary low-level ops to execute tensor operations efficiently (i.e. without unnecessary copying), yet it proves competitive to almost all people I currently talk to, who train models basically all day.
The argument that industry needs larger-scale deployments involves the data-processing pipeline, deployment and parallelization.
The first two are a problem of Python, but the latter has been successfully tackled with pytorch.
I agree, and this is roughly the idea that I pursue with neanderthal: provide different layers that you can use directly by hand, or/and build higher-level magic on top of.
And each layer adds as little complexity as possible, while being as automatic as desired (but not more)
And I am yet to find a clean presentation of a simple tensor operation in these libraries. More often than not, the examples revolve exclusively around computation graphs. I'd like to see a nice description of how I can create some tensors and do tensor contraction without much fuss...
What do you mean with tensor contraction?
a tensor equivalent of matrix multiplication https://en.wikipedia.org/wiki/Tensor_contraction
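A minimal sketch of what that means in code, using plain nested lists: contracting the last axis of a 3-d tensor with the first axis of a matrix. With a 2-d tensor in place of `t` this reduces to ordinary matrix multiplication. The function name is made up for illustration.

```python
def contract_last_first(t, m):
    """Contract the last axis of a 3-d tensor t[a][b][c] with the
    first axis of a matrix m[c][d], giving r[a][b][d]."""
    return [[[sum(t_ab[c] * m[c][d] for c in range(len(m)))
              for d in range(len(m[0]))]
             for t_ab in t_a]
            for t_a in t]

t = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]          # shape 2 x 2 x 2
m = [[1, 0, 1],
     [0, 1, 1]]                 # shape 2 x 3
r = contract_last_first(t, m)   # shape 2 x 2 x 3
print(r)
```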
Something needed for deep-learning is efficient broadcasting, so (mini-)batches of data can be quickly sent through matrix multiplication. I am not sure how to do this best from a low-level perspective.
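One way to see the batching point: if the samples of a mini-batch are stacked as the rows of a matrix, the per-sample matrix-vector products collapse into a single matrix-matrix product, which is the operation BLAS GEMM covers. A small sketch with made-up numbers (the `matmul` helper is a naive stand-in, not a fast implementation):

```python
def matmul(a, b):
    """Naive product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

weights = [[1, 2],
           [3, 4],
           [5, 6]]              # 3 inputs -> 2 outputs
batch = [[1, 0, 0],
         [0, 1, 0],
         [1, 1, 1]]             # 3 samples, one per row

# One small product per sample ...
one_by_one = [matmul([x], weights)[0] for x in batch]
# ... versus the whole batch in a single product.
all_at_once = matmul(batch, weights)
print(all_at_once)
```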
I see.
That's the thing. All talk is about tensor this tensor that, but underneath the surface, everyone is working with graphs and matrices.
I do not claim (or even say) that they should do this with tensor contractions.
They probably do this quite optimally, especially with cuDNN.
I'm just not sure that it has to do that much with tensors proper
Or "tensor" there is just a convenient way to describe 4-dimensional layers of matrices, because images happen to be conveniently described by m x n x 3 cubes, stacked into 4-dim cube in memory.
Which I am sure on some level is equivalent to tensors, but I am not sure it helps in a general way. What if you need a 6-dim tensor, for whatever reason? Would any of those libraries help you? (A genuine question. I don't know the answer.)
Me neither.
I know that they have to use the low-level BLAS primitives for optimal performance.
I don't even claim that 6-dim tensor is a particularly useful thing...
I don't think they reimplement the actual matrix multiplications.
Well, it probably can be.
BLAS is vector/matrix all the way down. Nothing to do with tensors or ND-arrays
This kind of stacking can be helpful, e.g. to represent embeddings of matrices, e.g. for binary relations.
Yes.
So tensor contraction is best implemented with these primitives, but not besides them.
But I think that cuDNN is not BLAS, but a bunch of specific 4-dim ND operations optimized for images
Yes, I think so, too.
My OpenCL readup on convolutions basically yielded that they are far superior on their own hardware.
I mean cuDNN.
This is the reality.
The trouble is that general n-dim tensor contraction suffers from the curse of dimensionality - it's O(n^d)
So you have to use these kinds of supplied native bindings which represent your interface to the hardware.
That's correct
Ok, I am not sure about whether actual contraction is what people do in minibatch optimization.
For any kind of vectorized operation you have to use hardware-optimized primitives and that's it. We can daydream about the niceties of clojure, but one look at the benchmarks shows otherwise...
Maybe it is a contraction (or not), but the point is that it is not any kind of general contraction operation, nor does anyone use it in a general way.
It is a specialized thing for a specialized purpose (NNs optimized for images and similar signals).
Where Clojure comes in handy is composing those well-defined lower-level operations dynamically.
And interactively!
I agree. One of the objections is valid though: whether Clojure can actually be attractive for researchers.
Julia is a very strong contender, I would say.
It still blows anything else out of the water when it comes to data processing in my experience, but this is not so important for researchers.
My current pytorch script has one nested loop (3-levels) and a bit of model descriptions, written down without too much higher-level abstractions, because it basically runs from top to bottom like a shell script with global state.
I think Clojure fills a nice spot between being good enough for experimenting, while being easy enough to integrate in production.
I agree, but we also have to convince researchers, because they will fill in the gaps in the tooling over time.
They might not be good at the core engineering, but a community is important.
Sure, Julia might be great for algorithmic tinkering (but I don't see it as much different or better than Clojure there for my needs), but then: how do you use it in production?
Point taken, I agree absolutely. But for researchers this point is very unimportant.
At least superficially.
If the production environment provides them with better tools to organize their experiments, then it is important.
That's fine. But what can I do about that? Probably nothing much...
I still think that plotting in Clojure, for example, is not as easy as matplotlib or ggplot in R.
I use plotly, which is ok, but it requires a web view.
Yeah. Currently a huge empty spot.
I mean, there are a million options for basic plotting on the JVM
No, you can't. I am not proposing to implement anything yet. I just think that we need to give them an excuse to do Clojure. Some will like it, but it will need to have a plausible horizon for research as well.
But nothing automatic like ggplot or matplotlib
Yes, but I have none so far that is good enough for high quality paper plots.
Maybe plotly is.
I haven't tried hard enough yet, it is fairly well done and a long-term project with commercial backing.
ClojureScript is also an asset, I would say.
Yep.
But this is not obvious to people wanting to do optimization experiments.
They expect something matlab like to start with.
incanter tried it for R users, but I think it was a way too large chunk of work necessary to swallow at once and it has not yielded composable libraries.
I think ClojureScript is great for building a "presentation server" for these plots (possibly using plotly). Provide a good generic interface from the Clojure REPL and that's it. I think that's the way @hswick’s library works, and I think it is a good approach.
I think at least it would be important that there is a set of libraries that play well together in general. Friction is a show stopper in my experience. numpy for example standardized the basic linear algebra in Python so that it became reasonable to use for numerical optimization.
I still think that trying to provide an X-like experience to win over the users of X is something that will not work.
I agree.
But doing the things right that X did, might be crucial.
Because people who prefer the X experience will use X
I think composability is a key experience in Clojure.
This cannot be done if X is copied.
Maybe there is a better Clojure way. Provide the best Clojure experience, so people who prefer Clojure (us) can do things they need, not some imaginary users that we are trying to convert. That's at least how I look at it.
I create this for me and people who find this approach useful. Not for someone else who might prefer something else.
Well, I am not imaginary 😂
I agree.
You do not need to be converted 🙂
But I think the people in my environment are not stupid, they have reasons for their choices and they are flexible. If you can show something better, they prefer it.
A friend of mine, actually a pretty smart mathematician and Bayesian machine learning guy, already read SICP and is curious about Clojure in general.
Then show them something better! I agree that's the best strategy.
But I couldn't recommend doing the kind of things he does in Clojure.
However, in such cases I always remember this (alleged) quote of Henry Ford: "If I'd asked people what they need, they'd say a better horse cart."
The production kind of arguments against the researcher attitude also do not necessarily help, I think. Clojure might be better at deployment, but this often sounds as if researchers are no real programmers.
They might not be that good at systems engineering, but these kinds of real-world arguments are really not the way to win people over.
Providing low-level libraries and compositions on top of them is.
I understand, and agree, but as I've said, I don't see how I could do anything about that.
Yes
Sure, I just needed something experienced to reason with 🙂
someone
🙂
Oh, it is late 😂
Sorry
Thankfully, my main goal is not to win over new users for Clojure, but to create tools that I need and like. If some other people find them useful too, that's great, but I won't beat my head over it too much.
For the one tensor 3d network I used, I basically need to first do matrix multiplication along one axis and then along the other. So one has to be able to shift this view on the axis, I think, without doing stupid things.
I agree, this is a good approach.
In the longer run it is helpful if a few people work together though I think. So sharing some common problems to solve seems important to me.
I agree completely
If I pursue the autograd stuff in Clojure, which is strictly necessary for anything more I will do, I need to get these low-level memory ops right.
You do not plan to wrap cuDNN, do you?
Yes.
However, not a priority, since I do not need it for any Bayesian stuff, which is my main interest.
So it might be some time before I do it.
So, I do plan to connect to cudnn
And also provide a fast CPU-oriented implementation for tensors
But, realistically, that won't happen in 2017
Hey both, Don’t mean to butt in, but thought I’d say that I’m following this conversation with interest and nodding often. I long for tape based autograd in Clojure! I’m glad for the work that people are doing on Cortex, but while it’s essentially NN abstractions only, it’s not that interesting to me personally. To my shame, I haven’t made time to give Neanderthal a proper try yet, though I’m soon rewriting a system where I intend to and I expect the speed gains to be substantial. That thread you pointed out @whilo was very interesting, though I’ll have to re-read in the morning. I’ll definitely take a look at your clj-autograd library for interest too.
@chrjs Welcome, Chris 🙂
Also, the progress of Neanderthal recently has been staggering (even if I am not using it, I’ve seen the change logs). Thanks for your output Dragan.
@chrjs Hi 🙂
👋
I don't know how to exactly take all the parts in the thread apart, I wanted to reply this evening, but there are many arguments somehow interleaved and it is challenging to separate the reasonable ideas from false assumptions.
I just read the pro tensorflow blog post and it is not a lot better: https://hackernoon.com/ml-framework-design-reliability-composition-flexibility-9314f72d2c73
@blueberry Is it possible to take a 3d tensor, first apply matrix multiplications in parallel along one axis, then rotate (transpose) the result and multiply matrices along the other axis?
How problematic is the transpose?
In neanderthal?
Yes.
Neanderthal currently does not support tensors.
I know, but I mean with the primitives that BLAS (neanderthal) provides, is this doable efficiently?
I guess that this changes row major to column major mode at least.
You'd have to "simulate" it with a large matrix and its submatrices, and I think it is possible, but you'd have to investigate it in detail.
If it is possible to do, it will be efficient.
Ok.
The tensorflow post basically just emphasizes Python deployment problems, which are non-Problems on the JVM.
A jar-file you built today against native interfaces will work in 10 years if you can still get a compatible low-level library.
Transpose in neanderthal is O(1), so if you just change col/row without actually reordering elements in memory, it would not require any copying.
The latter is independent of the framework. It might be able to abstract it away, so does neanderthal or a compute graph description on top of it.
If you actually have to physically rearrange elements, you'd use (trans!), which is fast but not O(1)
I guess if you multiply once, transposing beforehand is slower.
(?)
Or is it efficient for both row and column-major modes?
It transparently supports both major modes, even mixing them, without penalty or rearrangements.
Which you can test by simply creating a column matrix and a row matrix and multiplying them transparently.
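The difference between an O(1) transpose and a physical rearrangement can be sketched with a strided view over a flat buffer. This is only an illustration of the general technique, with hypothetical names; it does not claim to show how neanderthal implements trans or trans!.

```python
class MatrixView:
    """A 2-d view over a flat buffer, described by shape and strides."""
    def __init__(self, buf, rows, cols, row_stride, col_stride):
        self.buf = buf
        self.rows, self.cols = rows, cols
        self.row_stride, self.col_stride = row_stride, col_stride

    def get(self, i, j):
        return self.buf[i * self.row_stride + j * self.col_stride]

    def transpose(self):
        # O(1): swap shape and strides, keep sharing the same buffer.
        return MatrixView(self.buf, self.cols, self.rows,
                          self.col_stride, self.row_stride)

    def transposed_copy(self):
        # O(rows * cols): physically reorder the elements into a
        # fresh row-major buffer of shape cols x rows.
        new_buf = [self.get(i, j)
                   for j in range(self.cols) for i in range(self.rows)]
        return MatrixView(new_buf, self.cols, self.rows, self.rows, 1)

buf = [1, 2, 3, 4, 5, 6]
a = MatrixView(buf, 2, 3, 3, 1)  # row-major 2 x 3: [[1, 2, 3], [4, 5, 6]]
t = a.transpose()                # 3 x 2 view, no elements moved
c = a.transposed_copy()          # 3 x 2 matrix with its own buffer
print(t.get(2, 1), c.buf)
```

Both `t` and `c` answer `get` identically; they differ only in whether memory was touched, which is why a routine that accepts either layout can skip the rearrangement.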
Python is crazy with respect to serialization. You have to pickle everything, which is subject to the exact versioning of things in your current runtime.
I see, ok.
Thanks.
Having been there, Python deployment can really be a pain, especially as it compares to Clojure.
I’m no great fan of the JVM in general, but not having to set up virtual envs everywhere and ensure consistency is a genuine win, as far as I’m concerned.
I agree that Hackernoon article complects python deployment problems with library problems.
> If it is possible to do, it will be efficient. That’s a very powerful feature in itself.
So I could do a matrix A, tensor B, matrix C multiply, A: k x l, B: l x m x n, C: n x o by first taking the m x n block l times and do elementwise matrix multiply and then the resulting l x m x o block and multiply it o times with A to get a k x m x o result. The crucial thing is that the intermediary result needs to be "transposed".
I know that I could unroll the tensor, but this will yield a block diagonal matrix, which is very sparse.
Not directly (yet), since there is no such thing as tensor B in neanderthal. You'd have to decide how you simulate that tensor B. I suppose as a big l x n matrix that you take m submatrices from (without copying). The submatrix function allows a stride, so I think you'd also be able to take the right kind of submatrices, but I'd have to work out the details to see whether this simulates that tensor B.
Will this big matrix take l x n amount of memory?
Well I am confused, it has to be larger than l x n, at least l x m x n
If it is dense - yes. But you also have other matrix types in neanderthal. Whether they can simulate that tensor B is something that you have to work out and see.
You probably mean l x no
I see.
Yes, l x m x n. My typo.
You'd decide whether it's l x (m n) or (l m) x n
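Here is a small sketch of that "one big matrix" idea, under loudly stated assumptions: everything is stored row-major in flat lists, and the operation wanted is R[k][m][o] = sum over l and n of A[k][l] * B[l][m][n] * C[n][o], which may or may not be exactly what was meant above. With that layout, B can be read as an l x (m n) matrix and the intermediate result as a (k m) x n matrix, so neither step copies or transposes anything. In a column-major layout the groupings would have to be chosen differently. The names and numbers are made up.

```python
def matmul_flat(a, b, rows, inner, cols):
    """Multiply row-major flat matrices: (rows x inner) by (inner x cols)."""
    return [sum(a[i * inner + p] * b[p * cols + j] for p in range(inner))
            for i in range(rows) for j in range(cols)]

K, L, M, N, O = 2, 3, 2, 2, 3
A = [1, 0, 2,
     0, 1, 1]                        # K x L
B = list(range(1, L * M * N + 1))    # L x M x N, row-major
C = [1, 0, 1,
     0, 1, 1]                        # N x O

# Step 1: read B as an L x (M*N) matrix. Same buffer, no copy.
AB = matmul_flat(A, B, K, L, M * N)  # K x (M*N)
# Step 2: read AB as a (K*M) x N matrix. Again the same buffer.
R = matmul_flat(AB, C, K * M, N, O)  # (K*M) x O, i.e. K x M x O

def reference(k, m, o):
    """Direct double sum, used only to check the two-step result."""
    return sum(A[k * L + l] * B[(l * M + m) * N + n] * C[n * O + o]
               for l in range(L) for n in range(N))

print(R)
```

The memory used for B is l * m * n elements either way; the choice between l x (m n) and (l m) x n only changes how the same buffer is indexed.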
@chrjs what have you done in python?
@blueberry right, that should work out for one side fine at least.
But other (sparser) combinations may be possible
Sparse multiply with dense blocks is probably not as efficient as the dense blocks themselves, right?
But it might also work for the other side, since submatrices can be sparse!
Sorry to ask you these n00b questions.
The sparser the matrix, the fewer cache hits, of course. But it might not be that much slower. The best way is to test with the dimensions you have
Ok
For matrices that fit into cache, the hit might be negligible...
(@whilo, I was writing machine learning systems in Python for a startup for a couple of years)
Ok, I have to digest that a bit.
@chrjs cool, what kind of systems?
I work for a startup that predicts the box office returns for films. We now moved to Clojure and a simulation-based methodology, but my personal interests still lie in (mostly Bayesian/generative models) ML.
Lots of things changed with the move to Clojure. I do miss the general data ecosystem from Python, but not much else.
Clojure is a much better platform for writing software systems in general I think.
You mean the scientific computing ecosystem?
Yeah.
Absolutely, Clojure is really difficult to beat atm. I tried other things a few times over the last few years, but they are all seriously a step backward.
I mean Julia for example.
I know, it’s ruined me for other languages.
:p
It is cool to run high-perf tight loops written in optimization code, but not much else.
Hehe, yes, that is my impression as well.
My bar for their frustrations is very low.
Heh.
I have not always made friends with this attitude, though 😂
I am going to sleep, but I will definitely be around. It seems like the Clojure ML/scientific computing in general scene is approaching a critical mass.
Soon we will maybe even be able to call it a community.
I think it is important to get a few key concepts usable enough for practical purposes and simple enough so they stay composable, then it could actually be interesting.
That is the dream!
Just solving some top-level problem is way too much effort and does not yield reusability.
To my mind, I’d rather have many composable libraries that each do one thing.
The biggest thing missing for autograd, besides the perf optimizations to not copy memory is convolutions for me. So if cuDNN was available, I should be able to hack some conv2d layer together.
Yes, me, too.
But for optimization the interfaces need to be efficient and performance needs to be considered upfront. Every bit that you lose will be painful for many potential users in the long run.
So for instance, autograd would not have to be part of a general tensor (or just vectors and matrices, sticking closer to the hardware) library. But I agree, the performance trade-offs of composability need to be weighed carefully.
That is the problem, you can establish something that is usable for 50% of people, but that will never work out for 90% of your audience. Usually the latter part is the one that would also contribute and make your library more attractive to more users, because they use it so heavily.
I mean, if we really want to attract people from ML to the Clojure ecosystem, we need to win a Kaggle competition with Neanderthal. That should do it.
But I think there are already some people who use Clojure but then reach for Python to do scientific computing.
That is the real goal. Provide good tools. Those who find them useful will use them.
Yes, keep people with Clojure, that have to go elsewhere because they must.
Just to point out that Neanderthal is only a vector/matrix/linear algebra library. We might win a Kaggle competition (or fail miserably) with an ML library built on top of neanderthal 😉 I agree that showing actual results is the right way to get attention, not general talk about how Clojure is a super nice language.
I know that, but since we are in the uncomplicate channel, I thought I should mention the library 😉
Agreed. But even easier is to keep people who now have to move elsewhere, but would like to stay.
🙂
The question is who are they and why are they leaving.
deep learning is obviously one field
I have some thoughts on that, but I gotta sleep. Talk to y’all soon.
I think they could build the deployment stuff themselves actually.
Ok, gn8 @chrjs
I also should go to bed.
Good night 🙂
Good night 🙂