uncomplicate

2016-05-28T06:48:00.000194Z

@blueberry: Tonight I added mm, mv, and copy to my pure-Clojure implementation of the Blas interface. When this Blas implementation is complete, should it be a part of the Neanderthal library? Or would you prefer that I make it a separate library that depends on Neanderthal? https://github.com/ericlavigne/mlearn/blob/master/test/uncomplicate/neanderthal/jblas_test.clj

👍 1
2016-05-28T07:19:56.000197Z

Separate library, at least until it matures a bit. For it to be included in Neanderthal, I think that it should at least:
1. Pass all tests
2. Be at least as fast as the other fast pure Java libraries
3. Be correct, which in my opinion is actually harder than 2, since floating-point numerical operations are tricky
4. Have a long-term committed maintainer

2016-05-28T07:28:32.000198Z

And a separate library is also easier for you to develop and improve until it stabilizes.

2016-05-28T14:54:37.000199Z

Okay, I'll make a separate library. I am not even aiming for 2, because I think that someone who cares about speed will be willing to install ATLAS or OpenCL. This implementation is intended to make initial setup quick and easy, while allowing an easy switch to faster implementations later.

2016-05-28T15:43:35.000200Z

Sure. However, I advise you to check the speed anyway, since matrix multiplication is O(n^3). Naive solutions are extremely slow even for relatively small matrix sizes. I don't mean slow as in twice slower than normal, but orders of magnitude slower. That might easily mean waiting many seconds for your REPL experiments, or minutes for the tests to run.
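
For illustration, here is a minimal sketch (hypothetical, not code from either library) of the naive triple loop over nested Clojure vectors. It does one multiply-add per (i, j, k) triple, so a 256x256 multiplication already means 256^3, about 16.8 million, boxed inner-loop steps:

```clojure
;; Hypothetical naive matrix multiplication over nested Clojure vectors.
;; a is m x k, b is k x n, both as vectors of row vectors.
(defn naive-mm [a b]
  (let [n (count (first b))]
    (mapv (fn [row]
            (mapv (fn [j]
                    ;; dot product of a row of a with a column of b:
                    ;; executed m * n times, k steps each => O(m*n*k)
                    (reduce + (map * row (map #(nth % j) b))))
                  (range n)))
          a)))
```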

2016-05-28T15:50:55.000201Z

Okay, I'll watch out for that. So far I'm seeing less than a factor-of-2 difference between test runtimes on the pure Clojure and native implementations. That is on extremely small matrices (such as 4x3), where I can do the calculations manually. I'll consider running the implementations side by side with comparative timings as I start using them for machine learning exercises.

2016-05-28T15:51:48.000202Z

Just out of curiosity: can you post the timings for 256x256?

2016-05-28T15:59:48.000203Z

Sure. I'll start writing that test now.

2016-05-28T16:14:41.000204Z

Native takes 570ms. Clojure has been running for over a minute and still hasn't finished.

2016-05-28T16:21:42.000205Z

Clojure implementation took 8 minutes.

2016-05-28T16:24:36.000206Z

That's what I was talking about. 256x256 is small, and you'll regularly see larger matrices in ML problems 😞

2016-05-28T16:27:51.000207Z

Is this mostly an issue of optimizing iteration speed? Or are there tricks to avoid O(n^3)?

2016-05-28T16:31:55.000208Z

There are no tricks to avoid O(n^3). The fastest theoretical algorithms are something like O(n^2.4), but they come with many other performance losses, so all practical implementations are O(n^3). The difference is in optimizing cache hits (not easy at all) and using SIMD operations. The first can be implemented in Java, but the implementation becomes fairly tricky. Take into account that you'll have to support different orderings (column-major vs row-major) and it becomes a formidable challenge.

2016-05-28T16:33:45.000210Z

Of course, using non-primitive data is also a no-go.
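
As a rough sketch of what those two points mean in practice (hypothetical code, not Neanderthal's): use flat primitive double arrays and an i-k-j loop order, so the inner loop walks both b and the result sequentially instead of striding across columns:

```clojure
;; Hypothetical sketch: flat row-major double arrays, i-k-j loop order.
;; The inner j loop reads b and writes c contiguously, which is far
;; friendlier to the CPU cache than the naive i-j-k ordering.
(defn primitive-mm ^doubles [^doubles a ^doubles b m k n]
  (let [m (long m) k (long k) n (long n)
        c (double-array (* m n))]
    (dotimes [i m]
      (dotimes [l k]
        (let [aik (aget a (+ (* i k) l))]
          (dotimes [j n]
            (aset c (+ (* i n) j)
                  (+ (aget c (+ (* i n) j))
                     (* aik (aget b (+ (* l n) j)))))))))
    c))
```

And even this leaves cache blocking and SIMD on the table, which is where the native libraries win.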

2016-05-28T16:41:10.000211Z

I'm using 3 seqs to represent the indices in the 3 dimensions. That much is easy enough to change for faster iteration. I might also get a factor-of-3 speedup from multithreading. My abstractions that treat blocks similarly to matrices also introduce some recalculation inefficiency.

2016-05-28T16:42:27.000212Z

Those few things should make a big difference in speed compared to the current version. I'm not sure they would get the speed into a reasonable range, though.

2016-05-28T16:43:40.000213Z

I hope 😉

2016-05-28T16:44:11.000214Z

Multithreading won't help you much, or even at all.

2016-05-28T16:45:14.000215Z

Clojure multithreading, at least.

2016-05-28T16:47:47.000216Z

It seems like matrix multiplication would work best if the first matrix were in row-major order and the second matrix were in column-major order. Maybe it would be worth copying the matrices into the preferred orders first. The copying would be O(n^2), allow for faster data access, and reduce the implementation complexity of dealing with varying row/column orders.
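
A sketch of that repacking idea (hypothetical, in the same flat-array style as above): transpose b once in O(n^2), after which every entry of the result is a dot product of two contiguous k-element ranges:

```clojure
;; Hypothetical sketch: copy b (k x n, row-major) into its transpose
;; (n x k, row-major), so each column of b becomes a contiguous row.
(defn transpose ^doubles [^doubles b k n]
  (let [k (long k) n (long n)
        bt (double-array (* k n))]
    (dotimes [l k]
      (dotimes [j n]
        (aset bt (+ (* j k) l) (aget b (+ (* l n) j)))))
    bt))

;; Then c[i][j] is a dot product of two contiguous ranges:
;; a, starting at i*k, and bt, starting at j*k, each k elements long.
```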

2016-05-28T17:17:29.000217Z

I know I often sound like a grumpy grump in these talks about "why I don't support core.matrix" and why there is no pure Java implementation in Neanderthal. It would be easier for me to just say nothing and let people learn it the hard way: don't write your own matrix algorithms. Just don't, unless you are a PhD student or a researcher whose precise job is writing matrix algorithms. Especially don't do it in Java 🙂

2016-05-28T17:19:32.000218Z

As Neanderthal's author, I am satisfied both ways: it is good for me if you write it (someone might use Neanderthal with your implementation), but it is even better for me if you use Neanderthal's full power to achieve your main goal: implementing ML algorithms.

2016-05-28T17:35:48.000221Z

Looking at your benchmark comparison with Vectorz. If a 60x difference translates into my tests taking 10 minutes instead of 10 seconds, that could be very painful. Getting back to ML is sounding very appealing right now. 🙂

2016-05-28T17:42:14.000222Z

If you write a blog post about what you've discovered today, that would be super cool (and might save some time for other Clojurists in the future) 🙂

2016-05-28T17:51:23.000224Z

I don't really want to discourage someone from writing, for example, a Vectorz-backed implementation of Neanderthal's Blas interface. That would have helped me a lot last week, and I may still get around to doing it myself later.

2016-05-28T18:07:41.000225Z

Unfortunately, Vectorz does not have the features to cover a sizable part of BLAS. It would probably have helped you with basic explorations, but it would be far from a general solution.

2016-05-28T18:10:44.000227Z

Otherwise, I would have written a pure Java implementation backed by Vectorz, or some other pure Java library.

2016-05-28T18:16:23.000228Z

Your benchmark explanation mentioned that multiplication was the most complex operation. So far, multiplication, addition, and dot are the only operations I've been using. I might either write a slow implementation of the rotX operators or just throw an "unimplemented" exception when they are used.
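
Something like this stub, probably (hypothetical; the real names and arities come from whichever Blas interface methods are left unimplemented):

```clojure
;; Hypothetical stub: fail loudly for unimplemented operations
;; instead of silently computing something wrong.
(defn rotg [& args]
  (throw (UnsupportedOperationException.
          "rotg is not implemented in this pure-Clojure engine yet.")))
```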

2016-05-28T18:18:29.000229Z

For now, though, I'll enjoy ATLAS speed, and even consider installing OpenCL if my tests start taking more than a few seconds. 🙂

2016-05-28T18:25:15.000230Z

Depending on your system, OpenCL might already be there. You get it when you install GPU drivers on Linux/Windows, and it is already there on the Mac.

2016-05-28T18:26:34.000231Z

The speedup is dependent on your hardware. If you have an integrated Intel HD, you need not bother. If you are on a desktop with some recent GPU beast, then it is usually worth doing.

2016-05-28T18:30:21.000233Z

I'm using a MacBook Pro, roughly one year old.

2016-05-28T18:31:35.000234Z

Require uncomplicate.clojurecl.core and uncomplicate.clojurecl.info, and type (map info (devices (first (platforms)))) to see what hardware you have.
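
That is, a REPL session along these lines (a sketch, assuming the 0.6.2-era ClojureCL namespaces):

```clojure
(require '[uncomplicate.clojurecl.core :refer [platforms devices]]
         '[uncomplicate.clojurecl.info :refer [info]])

;; One info map per OpenCL device on the first platform.
(map info (devices (first (platforms))))
```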

2016-05-28T18:32:20.000235Z

Macs come with all three major GPU brands; it depends on the time of year when yours was manufactured.

2016-05-28T18:32:41.000236Z

but laptop GPUs are considerably slower than desktop ones

2016-05-28T18:33:07.000237Z

UnsatisfiedLinkError: no JOCL_0_2_0-apple-x86_64 in java.library.path

2016-05-28T18:33:54.000238Z

What? What version of Neanderthal are you using? Macs are supported in 0.6.2 (the latest).

2016-05-28T18:35:11.000239Z

I'm on 0.5.0, which was the latest a couple of weeks ago when I started. I'll try upgrading. I think the Blas interface that I implemented changed a bit since then, though.

2016-05-28T18:36:11.000240Z

0.5.0 does not support OpenCL on macs out of the box

2016-05-28T18:37:11.000241Z

Or you might just use the example hello-world project with 0.6.2, just to inspect the hardware.

2016-05-28T18:37:49.000242Z

After switching the library version, I just got an extraordinarily long stack trace trying to run that little code snippet you posted.

2016-05-28T18:39:12.000243Z

Actually, it looks like a long sequence of maps, each of which has a stack trace as one of its keys.

2016-05-28T18:39:25.000244Z

It is not a stack trace. It works. See here for details: https://github.com/uncomplicate/clojurecl/issues/11

2016-05-28T18:40:42.000246Z

Yep, Macs cannot answer all info queries, so you get the report in those cases.

2016-05-28T18:49:24.000251Z

See the latest answer by amherag

2016-05-28T18:49:50.000252Z

Also see the newest hello-world. It has an opencl1 namespace with an example.

2016-05-28T18:50:59.000253Z

In short, the Mac puts its CPU as the default device, and the CPU does not support OpenCL 1.2 for some reason. See which GPU(s) you have there and select them for the context.
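
Roughly like this (a hypothetical sketch from memory of the ClojureCL API; the hello-world project shows the exact calls):

```clojure
;; Hypothetical: ask for GPU devices explicitly instead of the default
;; CPU device, which on the Mac lacks OpenCL 1.2 support.
(require '[uncomplicate.clojurecl.core :refer [platforms devices context]])

(def gpu (first (devices (first (platforms)) :gpu)))
(def ctx (context [gpu]))
```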

2016-05-28T18:53:55.000254Z

Looks like more setup is required than what I've been doing. I've just been use-ing the opencl namespace and calling clge to create matrices, without specifying engines and contexts.

2016-05-28T18:54:25.000255Z

I need to take care of something else for next few hours. I'll try this again tonight.

2016-05-28T19:00:28.000256Z

Cool. If you encounter any obstacles, open an issue on GitHub, and I'll try to help. amherag uses a Mac, so he might be able to help with Mac caveats.

2016-05-28T19:02:34.000257Z

Would you expect a noticeable speedup on the MacBook using the GPU, compared to native?

2016-05-28T19:12:25.000259Z

@ericlavigne: not orders of magnitude. Depending on the GPU, it might be slower, or, say, 5x faster. BUT, even so, it might be very useful if you'd like to write your own ML kernels in OpenCL (not right away, of course), because then you get access to hardware parallelism.

2016-05-28T19:16:13.000260Z

+ if you have a GPU that is not already in the tuning database, it won't be as fast as it could be with tuning. In that case, you can tune CLBlast, and it will be optimized for your machine in the next version.

2016-05-28T19:16:36.000261Z

Post the info you got, and I could tell you what to expect.

2016-05-28T19:17:13.000262Z

When you say "ML kernels in OpenCL", do you mean using OpenCL directly instead of via Neanderthal? What would be the advantage of that?

2016-05-28T19:17:59.000263Z

I'm not at my computer anymore, but that info was many pages long. Not suitable for posting to Slack.

2016-05-28T19:18:21.000264Z

No. I mean using Neanderthal and ClojureCL to do 95% of the chores for you, and writing only a dozen lines of OpenCL C for the core stuff.

2016-05-28T19:18:32.000265Z

Instead of writing 20,000 lines in C.

2016-05-28T19:18:42.000266Z

And still getting the full speed.

2016-05-28T19:28:18.000267Z

Do you have an example of an ML technique that would benefit from that? I'm still early in studying this. Neural networks seem to be entirely based on matrix operations, where Neanderthal can handle 100% of the performance-intensive parts.

2016-05-28T19:38:20.000268Z

https://github.com/uncomplicate/bayadera

2016-05-28T19:42:05.000270Z

Regarding NNs: their operations can be described by matrix operations, and compared to pure Java that would be fast. However, when you know the characteristics of those operations, you can optimize further. That's why all the deep learning beasts do not use CUDA's cuBLAS but cuDNN, which is NVIDIA's super-messy implementation of DNNs at a really low level. On top of that, you often have operations that could be vectorized and/or parallelized, and are similar to matrix operations, but not quite. A simple example: calculating variance. It can be expressed through mv and mm, but it is quite a bit faster if implemented directly.

2016-05-28T19:44:35.000272Z

To clarify, I am talking about multivariate variance. Univariate variance can be implemented with sum and dot.
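
Something like this (a quick sketch; I'm assuming the native dv constructor, and the exact namespaces depend on your Neanderthal version). Univariate population variance is E[x^2] - E[x]^2, which is one dot and one sum:

```clojure
(require '[uncomplicate.neanderthal.core :refer [dot sum dim]]
         '[uncomplicate.neanderthal.native :refer [dv]])

;; var(x) = (dot x x)/n - (sum(x)/n)^2
(defn variance [x]
  (let [n    (dim x)
        mean (/ (sum x) n)]
    (- (/ (dot x x) n) (* mean mean))))

(variance (dv 1 2 3 4)) ;=> 1.25
```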

2016-05-28T19:51:13.000274Z

I'm looking forward to that. Maybe in a few weeks. 🙂

2016-05-28T19:51:55.000275Z

More likely a few months 🙂 I'll help when I can.