Papers on implementing RBM on GPU
I just read some papers on how to implement a Restricted Boltzmann Machine (RBM) on a GPU. In an RBM, the most computationally intensive part is the weight update stage. Using a GPU (CUDA, OpenCL, etc.) can speed this stage up 5-70 times, depending on the GPU and the algorithm used. The major bottleneck in these implementations is the communication between main memory and the GPU:
“Design and Analysis of BLAS, GPU, and Sparse Multithreaded Acceleration Methods for Restricted Boltzmann Machine Training” (2009) 12-43 times speedup achieved - PDF
“Large-scale Deep Unsupervised Learning using Graphics Processors” (2009) 12-70 times speedup achieved - PDF
The interesting part of the second paper is that they use a model with 100 million parameters. It does not discuss regularization, and I am afraid such a model could easily overfit.
“Neural Networks on GPUs: Restricted Boltzmann Machines” (2008) 66 times speedup achieved - PDF
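For orientation, here is a rough CPU sketch of the kind of weight update these papers accelerate. It assumes the common contrastive-divergence form of the update with momentum; the papers may differ in details. The names (`Matrix`, `updateWeights`, `lr`, `momentum`) are made up for illustration and do not come from any of the papers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix: element (r, c) lives at data[r * cols + c].
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// One contrastive-divergence-style weight update over a mini-batch (assumed form).
// v0, h0: data-phase activations (batch x nVisible, batch x nHidden).
// v1, h1: reconstruction-phase activations with the same shapes.
// w and velocity are nVisible x nHidden.
void updateWeights(Matrix& w, Matrix& velocity,
                   const Matrix& v0, const Matrix& h0,
                   const Matrix& v1, const Matrix& h1,
                   float lr, float momentum) {
    const std::size_t batch = v0.rows;
    for (std::size_t i = 0; i < w.rows; ++i) {
        for (std::size_t j = 0; j < w.cols; ++j) {
            // Difference of data and reconstruction correlations, averaged over the batch.
            float grad = 0.0f;
            for (std::size_t b = 0; b < batch; ++b)
                grad += v0.at(b, i) * h0.at(b, j) - v1.at(b, i) * h1.at(b, j);
            grad /= static_cast<float>(batch);
            // Momentum keeps a running velocity per weight.
            velocity.at(i, j) = momentum * velocity.at(i, j) + lr * grad;
            w.at(i, j) += velocity.at(i, j);
        }
    }
}
```

The triple loop is two matrix products plus an element-wise update, which is why this stage maps well onto a GPU.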
Tags: CUDA, GPU, Neural networks, OpenCL, RBM

July 30th, 2009 at 4:51 pm
I’m working on a GPU implementation myself. I’ve got horrible code up and about at http://wiki.github.com/IanCal/leonard.
It’s just a start, but over the next month or so it should become a usable library for others working with RBMs and C++. It’ll be heavily restructured soon to make it easy to create experiments.
July 30th, 2009 at 8:36 pm
Nice. I can only suggest to keep a pure C++ together with GPU implementation for testing purposes. It is hard to get it right. My current implementation is in C++ with some interface for future parallelization.
It seems you are using 8600GT like me. Do you observe 5-10 times speedup?
July 31st, 2009 at 4:53 pm
Thanks. Sorry, I’m not quite sure what you mean by:
“I can only suggest to keep a pure C++ together with GPU implementation for testing purposes. It is hard to get it right. ”
Do you mean have a CPU based version to check that the GPU version runs in the same way? I would like to do that.
“It seems you are using 8600GT like me. Do you observe 5-10 times speedup?”
I haven’t got a reference point with an optimised C++ program. Using the measure from the last paper, I get 279 MCUPS (for a single layer 512×512, batch size of 32), so that would suggest a 27 times speedup, if I’ve got my maths right.
That includes all file reading (though the file is written as a series of floats), conversion to column major format, transfer, weight updates (including momentum) and random number generation.
I’m not sure of the meaning of a few bits of the paper though. The update period is defined as “the time it takes for the implementation to complete a single batch of data”. What’s the batch size? Does it mean a single 512 vector? That’s what my calculation above assumes, btw.
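To make the ambiguity concrete, here is a small sketch of how a connection-updates-per-second figure could be computed. It assumes MCUPS means connections touched per update period, divided by that period, in millions; the paper’s exact definition may differ. The function name `mcups` and the `samplesPerUpdate` parameter are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Millions of connection updates per second (assumed definition).
// connections:      number of weights in the layer, e.g. 512 * 512
// samplesPerUpdate: 1 if an update period covers a single vector,
//                   or the batch size if it covers a whole mini-batch
// seconds:          measured time for one update period
double mcups(std::size_t connections, std::size_t samplesPerUpdate, double seconds) {
    const double updates = static_cast<double>(connections) *
                           static_cast<double>(samplesPerUpdate);
    return updates / seconds / 1e6;
}
```

With the same measured time, counting a whole batch of 32 instead of a single vector gives a figure 32 times larger, so the answer to the batch size question changes the implied speedup by the same factor.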
Raw performance figures: for training a 784×512×512×2048 network with 10 label units (not softmaxed though) I get a speed of roughly 900 samples per second. That’s for training all layers in sequence. As in, to train all layers in sequence over 50k training samples it takes just under a minute for everything. I get about 3k/s speeds for recognition.
I’m curious about their implementation. They seem to use a matrix multiplication and a matrix addition for the weight updates, as well as a transpose. The transpose worries me a lot, since there’s no need to store a transposed matrix in global memory. You can just change the access pattern (as I assume sgemm does).
Maybe my program is doing something wrong.
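The point above about changing the access pattern instead of storing a transpose can be sketched on the CPU like this. It is only an illustration of the indexing idea, not the paper’s kernel or anyone’s GPU code; the function name `multiplyTransposedA` is made up. BLAS-style `sgemm` interfaces take transpose flags for the same purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A^T * B without ever building A^T in memory.
// A is k x m, B is k x n, C is m x n, all stored row-major.
std::vector<float> multiplyTransposedA(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t k, std::size_t m, std::size_t n) {
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t p = 0; p < k; ++p) {
        for (std::size_t i = 0; i < m; ++i) {
            // Reading A(p, i) is the same as reading A^T(i, p):
            // only the index order changes, no copy is made.
            const float aT = a[p * m + i];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aT * b[p * n + j];
        }
    }
    return c;
}
```

On a GPU the same index swap applies, though the memory access pattern then decides whether reads are coalesced, so it still needs measuring.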
August 3rd, 2009 at 6:45 am
“Do you mean have a CPU based version to check that the GPU version runs in the same way? I would like to do that. ”
Yes, it is always better to have a “reference implementation”, and when measuring the speed-up this implementation has to be optimised.
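A minimal sketch of such a check, assuming both versions write their results (weights, for example) into plain float arrays. The names `maxAbsDiff` and `matchesReference` are hypothetical. A tolerance is needed because the GPU usually sums in a different order than the CPU, so the results will rarely be bit-identical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest element-wise absolute difference between two result arrays.
float maxAbsDiff(const std::vector<float>& reference,
                 const std::vector<float>& candidate) {
    assert(reference.size() == candidate.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < reference.size(); ++i)
        worst = std::max(worst, std::fabs(reference[i] - candidate[i]));
    return worst;
}

// True if the candidate (e.g. GPU output copied back to the host)
// agrees with the CPU reference within the given tolerance.
bool matchesReference(const std::vector<float>& reference,
                      const std::vector<float>& candidate,
                      float tolerance) {
    return maxAbsDiff(reference, candidate) <= tolerance;
}
```

Running both versions on the same input and the same random seed, then comparing after every few updates, helps locate the step where they start to diverge.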