@blueberry: Tonight I added mm, mv, and copy to my pure-Clojure implementation of the Blas interface. When this Blas implementation is complete, should it be a part of the Neanderthal library? Or would you prefer that I make it a separate library that depends on Neanderthal? https://github.com/ericlavigne/mlearn/blob/master/test/uncomplicate/neanderthal/jblas_test.clj
Separate library, at least until it matures a bit. For it to be included in Neanderthal, I think that it should at least:
1. Pass all tests
2. Be at least as fast as the other fast pure Java libraries
3. Be correct, which in my opinion is actually harder than 2, since floating-point numerical operations are tricky
4. Have a long-term committed maintainer
And a separate library is also easier for you to develop and improve until it stabilizes.
Okay, I'll make a separate library. I am not even aiming for 2, because I think that someone who cares about speed will be willing to install ATLAS or OpenCL. This implementation is intended to allow quick and easy setup for initial development, while allowing an easy switch to faster implementations later.
Sure. However, I advise you to check the speed anyway, since matrix multiplication is O(n^3). Naive solutions are extremely slow even for relatively small matrix sizes. I don't mean slow as in twice slower than normal, but orders of magnitude slower. That might easily mean waiting many seconds for your REPL experiments, or minutes for the tests to run.
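To make the O(n^3) concrete, here is a sketch of the naive algorithm over nested row vectors (a hypothetical representation, not your actual Blas implementation):
```
;; Naive matrix multiplication over nested row vectors (sketch).
;; Three nested traversals -- n rows, n columns, n terms per dot
;; product -- give O(n^3) boxed scalar operations.
(defn naive-mm [a b]
  (let [cols-b (apply mapv vector b)] ; transpose b once, O(n^2)
    (mapv (fn [row]
            (mapv (fn [col] (reduce + (map * row col))) cols-b))
          a)))
```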
Okay, I'll watch out for that. So far I'm seeing less than a factor-of-2 difference between test runtimes on the pure Clojure and native implementations. That is on extremely small matrices (such as 4x3) where I can do the calculations manually. I'll consider running the implementations side by side with comparative timings as I start using them for machine learning exercises.
Just out of curiosity: can you post the timings for 256x256?
Sure. I'll start writing that test now.
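Probably something like this (a sketch; assuming dge accepts a sequence of initial values, as in the Neanderthal docs):
```
;; Sketch of the 256x256 timing test.
(require '[uncomplicate.neanderthal.core :refer [mm]]
         '[uncomplicate.neanderthal.native :refer [dge]])

(let [a (dge 256 256 (repeatedly (* 256 256) rand))
      b (dge 256 256 (repeatedly (* 256 256) rand))]
  (time (mm a b)))
```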
Native takes 570ms. Clojure has been running over a minute and still not finished.
Clojure implementation took 8 minutes.
That's what I was talking about. 256x256 is small and you'll regularly have larger in ML problems 😞
Is this mostly an issue of optimizing iteration speed? Or are there tricks to avoid O(n^3)?
There are no tricks to avoid O(n^3). The best known theoretical bound is roughly O(n^2.37), but algorithms in that family come with many other performance losses, so practical implementations are all O(n^3). The difference is in optimizing cache hits (not easy at all) and using SIMD operations. The first can be implemented in Java, but the implementation becomes fairly tricky. Take into account that you'll have to support different orders (column-major vs row-major) and it becomes a formidable challenge.
Of course, using non-primitive data is also a no-go.
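Concretely, "primitive data" means something like flat double arrays and primitive loops (a sketch, assuming square row-major matrices):
```
;; Sketch: flat row-major double arrays with primitive loops.
;; Still O(n^3), but each step is an aget/aset on primitives
;; instead of seq traversal and boxed math.
(defn primitive-mm ^doubles [^doubles a ^doubles b ^long n]
  (let [c (double-array (* n n))]
    (dotimes [i n]
      (dotimes [j n]
        (aset c (+ (* i n) j)
              (loop [k 0 s 0.0]
                (if (< k n)
                  (recur (inc k)
                         (+ s (* (aget a (+ (* i n) k))
                                 (aget b (+ (* k n) j)))))
                  s)))))
    c))
```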
I'm using 3 seqs to represent indices in the 3 dimensions. That much is easy enough to change for faster iteration. Might also get a factor of 3 speedup from multithreading. My abstractions to treat blocks similarly to matrices also introduce some recalculation inefficiency.
Those few things should make a big difference in speed compared to the current version. Not sure it would get the speed into a reasonable range, though.
I hope 😉
Multithreading won't help you much, or even at all.
Clojure multithreading, at least.
It seems like matrix multiplication would work best if the first matrix were in row-major order and the second matrix were in column-major order. Maybe it would be worth copying the matrices into the preferred orders. The copying would be O(n^2), allow for faster data access, and reduce the implementation complexity of dealing with varying row/col orders.
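A sketch of that idea, building on the flat-array version above: transpose the second matrix up front (O(n^2)) so the inner loop walks both arrays sequentially:
```
;; Sketch: copy b into column-major (transposed row-major) order
;; so the k-loop reads both arrays sequentially -- far friendlier
;; to the cache than striding down b's columns.
(defn transposed-mm ^doubles [^doubles a ^doubles b ^long n]
  (let [bt (double-array (* n n))
        c  (double-array (* n n))]
    (dotimes [i n]                       ; O(n^2) transpose of b
      (dotimes [j n]
        (aset bt (+ (* j n) i) (aget b (+ (* i n) j)))))
    (dotimes [i n]
      (dotimes [j n]
        (aset c (+ (* i n) j)
              (loop [k 0 s 0.0]
                (if (< k n)
                  (recur (inc k)
                         (+ s (* (aget a (+ (* i n) k))
                                 (aget bt (+ (* j n) k)))))
                  s)))))
    c))
```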
I know I often sound like a grumpy grump in these talks about "why I don't support core.matrix" and why there is no pure Java implementation in Neanderthal. It would be easier for me to just say nothing and let people learn it the hard way: don't write your own matrix algorithms. Just don't, unless you are a PhD student or a researcher whose job is precisely that - writing matrix algorithms. Especially don't do it in Java 🙂
As Neanderthal's author, I am satisfied either way: it is good for me if you write it (someone might use Neanderthal with your implementation), but it is even better for me if you use Neanderthal's full power to achieve your main goal: implementing ML algorithms.
Looking at your benchmark comparison with Vectorz. If a 60x difference translates into my tests taking 10 minutes instead of 10 seconds, that could be very painful. Getting back to ML is sounding very appealing right now. 🙂
If you write a blog post about what you've discovered today, that would be super cool (and might save some time for other Clojurists in the future) 🙂
I don't really want to discourage someone from writing, for example, a Vectorz-backed implementation of Neanderthal's Blas interface. That would have helped me a lot last week, and I may still get around to doing it myself later.
Unfortunately, Vectorz does not have the features to cover a sizable part of BLAS. It would probably have helped you with basic explorations, but it would be far from a general solution.
Otherwise, I would have written a pure Java implementation backed by Vectorz, or some other pure Java library.
Your benchmark explanation mentioned that multiplication was the most complex operation. So far multiplication, addition, and dot are the only operations I've been using. Might either write a slow implementation of the rotX operators or just throw an "unimplemented" exception when they are used.
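The "unimplemented" route is a one-liner per operation, e.g. (hypothetical name and signature):
```
;; Hypothetical placeholder until the rotation ops are needed.
(defn rot! [x y c s]
  (throw (UnsupportedOperationException.
          "rot! is not implemented in the pure-Clojure engine yet")))
```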
For now, though, I'll enjoy ATLAS speed, and even consider installing OpenCL if my tests start taking more than a few seconds. 🙂
Depending on your system, OpenCL might already be there. You get it when you install GPU drivers on Linux/Windows, and it is already there on a Mac.
The speedup is dependent on your hardware. If you have an integrated Intel HD, you need not bother. If you are on a desktop with some recent GPU beast, then it is usually worth doing it.
I'm using a MacBook Pro, roughly one year old.
Require uncomplicate.clojurecl.core and uncomplicate.clojurecl.info and type (map info (devices (first (platforms)))) to see what hardware you have.
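Spelled out as a REPL session:
```
(require '[uncomplicate.clojurecl.core :refer [platforms devices]]
         '[uncomplicate.clojurecl.info :refer [info]])

;; one map of properties per OpenCL device on the first platform
(map info (devices (first (platforms))))
```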
Macs come with GPUs from all three major brands; it depends on the time of year when yours was manufactured.
but laptop GPUs are considerably slower than desktop ones
UnsatisfiedLinkError: no JOCL_0_2_0-apple-x86_64 in java.library.path
What? What version of Neanderthal do you use? Macs are supported in 0.6.2 (latest)
I'm on 0.5.0, which was the latest a couple of weeks ago when I started. I'll try upgrading. I think the Blas interface which I implemented changed a bit, though.
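The upgrade itself is just the dependency bump in project.clj (sketch; the Clojure version shown is an assumption):
```
;; project.clj
:dependencies [[org.clojure/clojure "1.8.0"]
               [uncomplicate/neanderthal "0.6.2"]]
```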
0.5.0 does not support OpenCL on macs out of the box
Or you might just use the example hello-world project with 0.6.2 to inspect the hardware.
After switching library versions, I just got an extraordinarily long stack trace trying to run that little code snippet you posted.
Actually, it looks like a long sequence of maps, which have stack traces among their values.
It is not a stack trace. It works. See here for details. https://github.com/uncomplicate/clojurecl/issues/11
Yep, Macs cannot answer all info questions, so you get the report in those cases.
See the latest answer by amherag
Also see the newest hello-world. It has an opencl1 namespace with an example.
In short, the Mac puts its CPU as the default device, and that device does not support OpenCL 1.2 for some reason. See which GPU(s) you have there and select them for the context.
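Something along these lines should pick the GPU explicitly (a sketch; assumes ClojureCL's devices accepts a device-type keyword):
```
;; Sketch: select the GPU(s) rather than the default CPU device.
(require '[uncomplicate.clojurecl.core
           :refer [platforms devices context command-queue]])

(def gpu (first (devices (first (platforms)) :gpu)))
(def ctx (context [gpu]))
(def queue (command-queue ctx gpu))
```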
Looks like more setup is required than what I've been doing. I've just been use-ing the opencl namespace and calling clge to create matrices, without specifying engines and contexts.
I need to take care of something else for the next few hours. I'll try this again tonight.
Cool. If you encounter any obstacles, open an issue on GitHub, and I'll try to help. amherag uses a Mac, so he might help with Mac caveats.
Would you expect a noticeable speedup on a MacBook using the GPU compared to native?
@ericlavigne: not orders of magnitude. Depending on the GPU, it might be slower, or, say, 5x faster. BUT, even like that, it might be very useful if you'd like to write your own ML kernels in OpenCL (not right away, of course), because then you get access to hardware parallelism.
Plus, if you have a GPU that is not already in the tuning database, it won't be as fast as it could be with tuning. In that case, you can tune CLBlast and it will be optimized for your machine in the next version.
Post the info you got, and I could tell you what to expect.
When you say "ML kernel in OpenCL", you mean using OpenCL directly instead of via Neanderthal? What would be the advantage of that?
I'm not at a computer anymore, but that info was many pages long. Not suitable for posting to Slack.
No. I mean using Neanderthal and ClojureCL to do 95% of the chores for you, and writing only a dozen lines of OpenCL C for the core stuff.
Instead of writing 20,000 lines in C.
And still getting the full speed.
Do you have an example of an ML technique that would benefit from that? I'm still early in studying this. Neural networks seem to be entirely based on matrix operations, where Neanderthal can handle 100% of the performance-intensive parts.
Regarding NNs: their operations can be described by matrix operations, and compared to pure Java that would be fast. However, when you know the characteristics of those operations, you can optimize further, and that's why all the deep learning beasts do not use plain CUDA cuBLAS, but cuDNN, which is Nvidia's super-messy implementation of DNNs at a really low level. On top of that, you often have operations that could be vectorized and/or parallelized, and are similar to matrix operations, but not quite. A simple example: calculating variance. It can be expressed through mv and mm, but is quite a bit faster if implemented directly.
To clarify, I am talking about multivariate variance. Univariate can be implemented with sum and dot.
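For the univariate case, that is just (a sketch using Neanderthal's core functions):
```
;; Univariate (population) variance from sum and dot:
;; E[x^2] - (E[x])^2.
(require '[uncomplicate.neanderthal.core :refer [dot sum dim]]
         '[uncomplicate.neanderthal.native :refer [dv]])

(defn variance [x]
  (let [n    (dim x)
        mean (/ (sum x) n)]
    (- (/ (dot x x) n) (* mean mean))))

(variance (dv [1 2 3 4 5])) ;=> 2.0
```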
I'm looking forward to that. Maybe in a few weeks. 🙂
More likely a few months 🙂 I'll help when I can.