uncomplicate

blueberry 2017-03-29T00:06:37.692478Z

create an empty vector, and use uncomplicate.neanderthal.real/entry! to set each value
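
something like this (a minimal sketch; `dv` allocates a zeroed vector):

```
(require '[uncomplicate.neanderthal.native :refer [dv]]
         '[uncomplicate.neanderthal.real :refer [entry!]])

;; allocate a zeroed 4-element double vector, then set each entry
(let [x (dv 4)]
  (dotimes [i 4]
    (entry! x i (inc i)))   ; x is now [1.0 2.0 3.0 4.0]
  x)
```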

qqq 2017-03-29T00:07:15.696707Z

that sounds very slow

qqq 2017-03-29T00:07:26.697962Z

I think much faster is (1) read into a java.nio ByteBuffer and (2) use dv directly on that
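
roughly this (a sketch, assuming the unpacked train-images-idx3-ubyte file in the working directory; as noted just below, `dv` taking a ByteBuffer is 0.8.0-only behavior):

```
(require '[uncomplicate.neanderthal.native :refer [dv]])
(import '[java.nio.file Files Paths]
        '[java.nio ByteBuffer ByteOrder])

;; read the whole file, skip the 16-byte IDX header, and widen
;; each unsigned pixel byte to a native-order double
(let [raw (Files/readAllBytes (Paths/get "train-images-idx3-ubyte"
                                         (make-array String 0)))
      n   (- (alength raw) 16)
      buf (doto (ByteBuffer/allocateDirect (* 8 n))
            (.order (ByteOrder/nativeOrder)))]
  (dotimes [i n]
    (.putDouble buf (* 8 i)
                (double (bit-and 0xff (aget raw (+ 16 i))))))
  (dv buf))   ; 0.8.0 only -- 0.9.0 drops buffer sources
```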

blueberry 2017-03-29T00:07:49.700487Z

and how fast can you read a file into a buffer?

qqq 2017-03-29T00:08:03.702096Z

probably faster than clojure looping through the bytes one by one 🙂

blueberry 2017-03-29T00:08:14.703269Z

btw your approach will work in 0.8.0, but not in 0.9.0

qqq 2017-03-29T00:08:19.703984Z

oh?

blueberry 2017-03-29T00:08:51.707548Z

you are not looping through the bytes, but through elements.

blueberry 2017-03-29T00:09:28.711789Z

is the MNIST dataset in binary format? if not, looping is negligible in comparison to parsing strings or something like that

qqq 2017-03-29T00:09:43.713453Z

the mnist training set is 50,000 * (28 x 28 images). You're telling me to call entry! 50,000 * 28 * 28 times?

qqq 2017-03-29T00:09:53.714728Z

It's binary

qqq 2017-03-29T00:10:03.715999Z

http://yann.lecun.com/exdb/mnist/

blueberry 2017-03-29T00:12:40.733585Z

but is that binary format compatible with intel's native LITTLE_ENDIAN floats/doubles?

blueberry 2017-03-29T00:13:09.736636Z

how much time do you expect reading these 50,000 images from the disk will take?

blueberry 2017-03-29T00:13:30.738941Z

are you loading all 50,000 images into one matrix/vector?

qqq 2017-03-29T00:13:34.739387Z

sounds like I should stfu and benchmark 🙂

blueberry 2017-03-29T00:14:16.743726Z

I guess that writing those bytes in neanderthal would take way less than 1 second

qqq 2017-03-29T00:14:16.743789Z

back when I was using Tensorflow, it would be either (1) set up an input queue which reads in batches or (2) read the whole thing, then slice the relevant section

qqq 2017-03-29T00:14:32.745597Z

can you point me at 0.9 docs? I'm just curious now how input is changing

qqq 2017-03-29T00:15:01.748592Z

Coming from an APL / Tensorflow background, it's hammered into my brain: "think in terms of vector / array / matrix ops -- if you have for-looping, you are doing something wrong."

blueberry 2017-03-29T00:16:12.756315Z

input stays almost the same. the only change relevant to you is that a buffer is not accepted as the source (to stop users from shooting themselves in the foot)

blueberry 2017-03-29T00:16:57.761051Z

that's because tf is a higher level library. practically, no one stops you from creating such a layer on top of neanderthal

blueberry 2017-03-29T00:18:09.768426Z

that vectorize everything approach is right. but it is for computation. when you are doing IO, then it may be convenient, but I doubt it helps much (or at all) for speeding things up.

blueberry 2017-03-29T00:20:11.780769Z

So, you have 50,000 * 28 * 28 numbers. That's about 40 million. If each entry! call is in the order of a nanosecond or three, I guess that you'll load all those numbers into the vector in less than 100 ms. or I could be wrong. set up a test loop and write a dummy number to benchmark.
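
e.g. (plain `time`, so only a ballpark; reusing the requires from the entry! sketch above):

```
(let [n (* 50000 28 28)
      x (dv n)]
  (time
   (dotimes [i n]
     (entry! x i 1.0))))
```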

blueberry 2017-03-29T00:20:29.782679Z

I guess that accessing 50,000 images from the disk will take more time.

qqq 2017-03-29T00:26:38.820626Z

another newb question: let A = an m * n matrix, and let b = a list of k indices. a common operation is A[b] ==> a k * n matrix, selecting k rows (or k cols). Does blas / neanderthal provide this "indexing operator", or am I expected to for-loop this by (1) creating a new matrix and then (2) copying over the rows / cols?

blueberry 2017-03-29T00:29:44.839776Z

You mean to create a submatrix, but with scattered regions? Not supported, on purpose, since it would produce a structure scattered in memory that would thrash the cache and kill performance. You can create a "proper" submatrix in O(1) or copy relevant columns to a destination matrix that you intend to work with very efficiently with copy!
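
for example (submatrix takes start-row, start-col, row-count, col-count; `col` gives an O(1) view of one column):

```
(require '[uncomplicate.neanderthal.core :refer [submatrix copy! col]]
         '[uncomplicate.neanderthal.native :refer [dge]])

(let [a (dge 4 6 (range 24))   ; 4x6 source matrix
      b (dge 4 3)]             ; destination for 3 gathered columns
  (submatrix a 0 1 4 3)        ; O(1) view of the contiguous cols 1-3
  ;; scattered columns 0, 2, and 5: one copy! per column
  (doseq [[src dst] [[0 0] [2 1] [5 2]]]
    (copy! (col a src) (col b dst)))
  b)
```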

qqq 2017-03-29T00:32:17.855981Z

okay, suppose these k rows/cols are not completely contiguous; suppose they're split across T contiguous regions. then I'm expected to write a for loop that makes T calls to copy!, right? [the main issue is whether I'm supposed to do this for loop -- since I thought GPUs had gather/scatter ops precisely to avoid the for loop in this case]

qqq 2017-03-29T00:32:47.858856Z

with a for loop, I'm copying this sequentially, whereas, in theory, this new matrix can be copied in parallel -- hence my wondering about scatter/gather ops and whether they exist in opencl/neanderthal

blueberry 2017-03-29T00:34:44.870540Z

calling such methods on the GPU is not a good idea at all, nor do GPU libraries do this. If you need to create such scattered structures, prepare them first on the CPU, and then send them to the GPU.

qqq 2017-03-29T00:35:06.872771Z

okay

blueberry 2017-03-29T00:35:29.874973Z

if you installed parallel atlas, each copy operation will be parallelized anyway

blueberry 2017-03-29T00:35:50.876989Z

and from 0.9.0 MKL is used, which is parallelized by default

blueberry 2017-03-29T00:36:54.883291Z

if you call these operations on the GPU, they will be asynchronously called, but the communication overhead alone would make it inefficient

blueberry 2017-03-29T00:38:36.893711Z

simply put, use the GPU only when you have really large amounts of data, or custom kernels, or the CPU was slow for you and you know what you are doing on the GPU. GPU is not a magic wand -> it requires lots of experience and learning to give you the speedups you're seeing reported by nvidia/google/etc.

qqq 2017-03-29T00:40:07.902761Z

I'm playing with deep learning. Most of it is just matrix multiplies and calculating auto derivatives. Tensorflow has this notion of a "computation graph", where I create nodes on the GPU with inputs / pre-defined ops (which are very limited but highly parallelizable); then to send data to the GPU, I use a "feed dict", and to get data back from the GPU, I use "sess.run ..." on a GPU variable.

blueberry 2017-03-29T00:40:39.906089Z

btw, I may implement some efficient bulk column/row copying operation in the future but for now it's not there.

qqq 2017-03-29T00:40:43.906457Z

It's not clear to me how this line is drawn in neanderthal. It seems like dv/sv are CPU, while clv are GPU

blueberry 2017-03-29T00:40:51.907236Z

yes

qqq 2017-03-29T00:41:45.912447Z

is the control logic running on the CPU side in neanderthal? does the GPU, after each op, query the CPU for what to do next? in Tensorflow, I define this 'computation graph' on the GPU, then once I feed in the input, the GPU crunches until it gets the output.

blueberry 2017-03-29T00:42:06.914455Z

you can do this because somebody implemented those operations on top of linear algebra libraries. neanderthal is not a DNN library 😉

blueberry 2017-03-29T00:42:45.918323Z

theoretically, it would be your job to implement those efficient GPU kernels in ClojureCL, just like the TF people did.

qqq 2017-03-29T00:43:13.920861Z

yes, I understand the TF people implemented lots of code I'll have to rewrite

qqq 2017-03-29T00:43:56.925145Z

it seems that in the TF world, I declare the computation graph once -- and the GPU knows what to do after that. here, in the neanderthal world, it seems like for every GPU op, the CPU has to tell the GPU "yo, dot product these two vectors"

blueberry 2017-03-29T00:44:05.926088Z

GPU is not querying anything. each neanderthal operation is asynchronously queued to the GPU through the OpenCL driver.

qqq 2017-03-29T00:44:56.931018Z

ah, that's what I was missing: the CPU doesn't wait for the GPU to finish until it needs the data

blueberry 2017-03-29T00:45:00.931369Z

General workflow is: load the data into the CPU memory, transfer it to the GPU, enqueue lots of operations, and when you have the final result, transfer it back
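
in code, roughly (a sketch using the 0.8/0.9-era namespaces and single-precision sv/clv; check the docs for your version):

```
(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.clojurecl.core :refer [with-default]]
         '[uncomplicate.neanderthal.core :refer [transfer! axpy! sum]]
         '[uncomplicate.neanderthal.native :refer [sv]]
         '[uncomplicate.neanderthal.opencl :refer [with-default-engine clv]])

(with-default                    ; default OpenCL platform/context/queue
  (with-default-engine
    (let [host-x (sv (range 1000))]
      (with-release [gpu-x (clv 1000)
                     gpu-y (clv 1000)]
        (transfer! host-x gpu-x) ; CPU -> GPU
        (transfer! host-x gpu-y)
        (axpy! 2.0 gpu-x gpu-y)  ; enqueued asynchronously on the GPU
        (sum gpu-y)))))          ; pulling the scalar back forces the sync
```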

qqq 2017-03-29T00:45:19.933460Z

this clarifies everything; thanks for your patience!

blueberry 2017-03-29T00:45:43.935869Z

I highly recommend following the OpenCL in Action book with ClojureCL's code (from the test folder).

blueberry 2017-03-29T00:46:07.938298Z

I doubt you can learn it on your own just by poking around.

qqq 2017-03-29T00:46:27.940344Z

given how much of your time I have wasted, I feel like in any world with justice I should give that book a try 🙂

blueberry 2017-03-29T00:46:42.941759Z

don't worry

blueberry 2017-03-29T00:46:59.943455Z

you can give back by helping other users

blueberry 2017-03-29T00:47:34.946925Z

write about your experience, so other users would be able to get up to speed more quickly

qqq 2017-03-29T16:38:54.099779Z

@blueberry: with regards to 0.9.0 not allowing providing a ByteBuffer, and having to use entry! on every entry -- one question -- was there a benchmark done via: (1) creating a new ByteBuffer and copying over the data vs (2) creating a new ByteBuffer and calling entry! for every entry? Intuitively (1) would be faster than (2), but no idea by how much.

qqq 2017-03-29T18:20:57.759205Z

@blueberry : actually, can we have the following feature in 0.9.0? efficient methods for (1) writing out a list of neanderthal matrices/vectors and (2) reading in a list of neanderthal matrices/vectors ("map of" may make more sense than "list of"). I don't care how long it takes me to create the matrix the first time around -- as long as later on I can efficiently serialize / deserialize it

blueberry 2017-03-29T18:58:19.370704Z

@qqq you can access a buffer with the block/buffer method and do whatever you like with it. The question is: how do you copy that data to the raw buffer? which method do you use and how fast is it? entry! is pretty efficient, and is way faster than your *source* data structure. I'd say it's almost as fast as buffer's native set methods, with the added benefit that entry! validates that what you are doing really makes sense.

blueberry 2017-03-29T19:06:05.497569Z

@qqq regarding 0.9.0, it's about to be released, so new features won't be added before 0.10.0-SNAPSHOT. As for the methods you suggest, it is not clear to me what would be the other side of that pipeline: write and read, but *where from and where to*. I guess that you need some standard serialization facilities. That is on the TODO list, as well as the transformers to/from standard formats (such as matrix market, or nist, or whatever is popular out there). I am not sure when I'll have time to think this through and commit to some design choice in that regard. For now, it is up to you how you'll do the serialization and deserialization, since you have 2 extremely fast options: (1) get the raw byte buffer with the buffer method and do whatever you like with it (but be careful not to fill it with garbage); or (2) set elements using real/entry! - still faster than your source can provide this data.
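
a sketch of option (1), assuming the block/buffer function mentioned above lives in uncomplicate.neanderthal.block (the file name is just an example; reading back reverses the steps):

```
(require '[uncomplicate.neanderthal.block :refer [buffer]]
         '[uncomplicate.neanderthal.native :refer [dv]])
(import '[java.nio.channels FileChannel]
        '[java.nio.file OpenOption Paths StandardOpenOption])

(let [x  (dv [1 2 3])
      ch (FileChannel/open
          (Paths/get "x.bin" (make-array String 0))
          (into-array OpenOption [StandardOpenOption/CREATE
                                  StandardOpenOption/WRITE]))]
  ;; .duplicate leaves the vector's own buffer position untouched
  (.write ch (.duplicate (buffer x)))
  (.close ch))
```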

qqq 2017-03-29T19:11:26.579409Z

@blueberry: I was unaware of the "get the raw byte buffer with the `buffer` method" option -- this is enough of a primitive to build everything else on top of; thanks!

blueberry 2017-03-29T19:22:19.742191Z

@qqq but benchmark first. I myself use entry! 😉

qqq 2017-03-29T19:30:22.864152Z

@blueberry: the real problem is, you haven't written a Tensorflow (or even better, APL) layer on top -- then I wouldn't have to worry about such low level details 🙂