http://neanderthal.uncomplicate.org/codox/uncomplicate.neanderthal.native.html#var-dv <-- that's what I want
create an empty vector, and use uncomplicate.neanderthal.real/entry! to set each value
that sounds very slow
I think much faster is (1) read into a java.nio.buffer and (2) use dv directly on that
and how fast can you read a file into a buffer?
probably faster than clojure looping through the bytes one by one 🙂
btw your approach will work in 0.8.0, but not in 0.9.0
oh?
you are not looping through the bytes, but through elements.
Is the MNIST dataset in binary format? If not, the looping is negligible in comparison to parsing strings or something like that
the mnist training set is 50,000 (28 x 28) images. You're telling me to call entry! 50,000 x 28 x 28 times?
It's binary
but is that binary format compatible with Intel's native LITTLE_ENDIAN floats/doubles?
how much time do you expect reading these 50,000 images from the disk to take?
are you loading all 50,000 images into one matrix/vector?
sounds like I should stfu and benchmark 🙂
I guess that writing those bytes into Neanderthal would take way less than 1 second
back when I was using Tensorflow, it would be either (1) set up an input queue which reads in batches or (2) read the whole thing, then slice the relevant section
can you point me at 0.9 docs? I'm just curious now how input is changing
Coming from an APL / Tensorflow background, it's hammered into my brain: "think in terms of vector / array / matrix ops -- if you are for-looping, you are doing something wrong."
input stays almost the same. the only change relevant to you is that a buffer is no longer accepted as the source (to stop users from shooting themselves in the foot)
that's because TF is a higher-level library. practically, no one stops you from creating such a layer on top of Neanderthal
that vectorize-everything approach is right, but it is for computation. when you are doing IO, it may be convenient, but I doubt it helps much (or at all) for speeding things up.
So, you have 50,000 x 28 x 28 numbers. That's less than 40 million. If each entry! call is in the order of a nanosecond or three, I guess that you'll load all those numbers into the vector in around 100 ms. or I could be wrong. set up a test loop and write a dummy number to benchmark.
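A minimal test-loop sketch along those lines (assuming Neanderthal's `native/dv` constructor and `real/entry!`, per the docs linked above; the count is one entry per MNIST pixel):

```clojure
(require '[uncomplicate.neanderthal.native :refer [dv]]
         '[uncomplicate.neanderthal.real :refer [entry!]])

(let [n (* 50000 28 28)          ; 39.2 million entries, one per pixel
      x (dv n)]                  ; zero-filled native double vector
  (time
   (dotimes [i n]
     (entry! x i 1.0))))         ; write a dummy value into every entry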
I guess that accessing 50,000 images from the disk will take more time.
another newb question: let A = m x n, and let b = a list of k indices. a common operation is A[b] ==> a k x n matrix, selecting k rows or k cols. does BLAS / Neanderthal provide this "indexing operator", or am I expected to for-loop this by (1) creating a new matrix and then (2) copying over the rows / cols?
You mean to create a submatrix, but with scattered regions? Not supported, on purpose, since it would produce a structure scattered in memory that would thrash the cache and kill performance. You can create a "proper" submatrix in O(1), or very efficiently copy the relevant columns to a destination matrix that you intend to work with, using copy!
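For example (a hedged sketch with made-up shapes, assuming `core/submatrix` and `core/copy!` with the `(submatrix a i j k l)` signature):

```clojure
(require '[uncomplicate.neanderthal.core :refer [copy! submatrix]]
         '[uncomplicate.neanderthal.native :refer [dge]])

;; gather two contiguous 10-row blocks of a 100x10 source
;; into one dense 20x10 destination
(let [a    (dge 100 10)
      dest (dge 20 10)]
  ;; rows 0-9 of a -> rows 0-9 of dest
  (copy! (submatrix a 0 0 10 10) (submatrix dest 0 0 10 10))
  ;; rows 50-59 of a -> rows 10-19 of dest
  (copy! (submatrix a 50 0 10 10) (submatrix dest 10 0 10 10))
  dest)
```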
okay, suppose these k rows/cols are not completely contiguous -- suppose they're split across T contiguous regions. then I'm expected to write a for loop that makes T calls to copy!, right? [the main issue is whether I'm supposed to write this for loop -- since I thought GPUs had gather/scatter ops precisely to avoid the for loop in this case]
with a for loop, I'm copying this sequentially, whereas, in theory, this new matrix could be copied in parallel -- hence my wondering about scatter/gather ops and whether they exist in OpenCL/Neanderthal
calling such methods on the GPU is not a good idea at all, nor do GPU libraries do this. If you need to create such scattered structures, prepare them first on the CPU, and then send them to the GPU.
okay
if you installed parallel ATLAS, each copy operation will be parallelized anyway
and from 0.9.0 MKL is used, which is parallelized by default
if you call these operations on the GPU, they will be queued asynchronously, but the communication overhead alone would make it inefficient
simply put, use the GPU only when you have really lots of data, or custom kernels, or the CPU was slow for you and you know what you are doing on the GPU. the GPU is not a magic wand -> it requires lots of experience and learning to get the speedups reported by nvidia/google/etc.
I'm playing with deep learning. Most of it is just matrix multiplies and calculating automatic derivatives. Tensorflow has this notion of a "computation graph", where I create nodes on the GPU with inputs / pre-defined ops (which are very limited but highly parallelizable); then to send data to the GPU, I use a "feed dict", and to get data back from the GPU, I use "sess.run ..." on a GPU variable.
btw, I may implement some efficient bulk column/row copying operation in the future but for now it's not there.
It's not clear to me how this line is drawn in Neanderthal. It seems like dv/sv are CPU, while clv is GPU
yes
is the control logic running on the CPU side in Neanderthal? does the GPU, after each op, query the CPU for what to do next? in Tensorflow, I define the 'computation graph' on the GPU; then once I feed in the input, the GPU crunches until it gets the output.
you can do this because somebody implemented those operations on top of linear algebra libraries. Neanderthal is not a DNN library 😉
theoretically, it would be your job to implement those efficient GPU kernels in ClojureCL, just like TF people did.
yes, I understand the TF people implemented lots of code I'll have to rewrite
it seems that in the TF world, I declare the computation graph once -- and the GPU knows what to do after that. here, in the Neanderthal world, it seems like for every GPU op, the CPU has to tell the GPU "yo, dot-product these two vectors"
GPU is not querying anything. each neanderthal operation is asynchronously queued to the GPU through the OpenCL driver.
ah, that's what I was missing: the CPU doesn't wait for the GPU to finish until it needs the data
General workflow is: load the data into CPU memory, transfer it to the GPU, enqueue lots of operations, and when you have the final result, transfer it back
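That workflow might look roughly like this (a hedged sketch assuming the 0.9-era OpenCL namespace with `clv`, `with-default-engine`, and `core/transfer!`; exact names may differ between versions):

```clojure
(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.neanderthal.core :refer [dot transfer!]]
         '[uncomplicate.neanderthal.native :refer [dv]]
         '[uncomplicate.neanderthal.opencl :refer [clv with-default-engine]])

(with-default-engine
  (with-release [gx (clv 3)              ; GPU vectors, released when done
                 gy (clv 3)]
    (transfer! (dv 1 2 3) gx)            ; CPU -> GPU
    (transfer! (dv 4 5 6) gy)
    ;; ops are enqueued asynchronously; reading the
    ;; scalar result back is what synchronizes
    (dot gx gy)))
```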
this clarifies everything; thanks for your patience!
I highly recommend following the OpenCL in Action book with ClojureCL's code (from the test folder).
I doubt you can learn it on your own just by poking around.
given how much of your time I have wasted, I feel like in any just world I should give that book a try 🙂
don't worry
you can give back by helping other users
write about your experience, so other users can get up to speed more quickly
@blueberry: with regard to 0.9.0 not allowing a ByteBuffer to be provided, and having to use entry! on every entry -- one question -- was a benchmark done comparing (1) creating a new ByteBuffer and copying over the data vs (2) creating a new ByteBuffer and calling entry! for every entry? Intuitively (1) would be faster than (2), but I have no idea by how much.
@blueberry: actually, can we have the following feature in 0.9.0? efficient methods for (1) writing out a list of Neanderthal matrices/vectors and (2) reading in a list of Neanderthal matrices/vectors. ("map of" may make more sense than "list of".) I don't care how long it takes me to create the matrix the first time around -- as long as later on I can efficiently serialize / deserialize it
@qqq you can access a buffer with the block/buffer method and do whatever you like with it. The question is: how do you copy that data to the raw buffer? which method do you use, and how fast is it? entry! is pretty efficient, and is way faster than your *source* data structure. I'd say it's almost as fast as the buffer's native set methods, with the added benefit that entry! validates that what you are doing really makes sense.
@qqq regarding 0.9.0, it's about to be released, so new features won't be added before 0.10.0-SNAPSHOT. As for the methods you suggest, it is not clear to me what the other side of that pipeline would be: write and read, but *where from and where to*. I guess that you need some standard serialization facilities. That is on the TODO list, as well as transformers to/from standard formats (such as Matrix Market, or NIST, or whatever is popular out there). I am not sure when I'll have time to think this through and commit to a design choice in that regard. For now, it is up to you how you'll do the serialization and deserialization, since you have 2 extremely fast options: (1) get the raw byte buffer with the buffer method and do whatever you like with it (but be careful not to fill it with garbage); or (2) set elements using real/entry! -- still faster than your source can provide the data.
@blueberry: I was unaware of the get-the-raw-byte-buffer-with-the-`buffer`-method option -- this is enough of a primitive to build everything else on top of; thanks!
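so something like this, presumably (a sketch assuming `block/buffer` returns the backing `java.nio.ByteBuffer` of a native vector, as mentioned above):

```clojure
(require '[uncomplicate.neanderthal.native :refer [dv]]
         '[uncomplicate.neanderthal.block :refer [buffer]])

(let [x (dv 4)
      ^java.nio.ByteBuffer buf (buffer x)]  ; raw backing buffer
  (.putDouble buf 0 42.0)                   ; write entry 0 directly;
  x)                                        ; 8 bytes per double, little-endian on Intel
```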
@qqq but benchmark first. I myself use entry!
😉
@blueberry: the real problem is, you haven't written a Tensorflow (or even better, APL) layer on top -- then I wouldn't have to worry about such low-level details 🙂