Papers on implementing RBM in GPU

Just read some papers on how to implement a Restricted Boltzmann Machine (RBM) on a GPU. In an RBM, the most computationally intensive part is the weight update stage. Using a GPU (CUDA, OpenCL, etc.) can speed up this stage 5-70 times, depending on the GPU and the algorithm used. The major bottleneck in these implementations is the communication between main memory and the GPU:

“Design and Analysis of BLAS, GPU, and Sparse Multithreaded Acceleration Methods for Restricted Boltzmann Machine Training” (2009), 12-43 times speedup achieved - PDF

“Large-scale Deep Unsupervised Learning using Graphics Processors” (2009), 12-70 times speedup achieved - PDF

An interesting part of the second paper is that they use a model with 100 million parameters. No discussion of regularization is given, and I am afraid a model of this size could easily overfit.

“Neural Networks on GPUs: Restricted Boltzmann Machines” (2008), 66 times speedup achieved - PDF
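To make the "weight update stage" concrete, here is a minimal sketch of one contrastive divergence (CD-1) weight update in plain Python. It is not taken from any of the papers above: the function names are made up for illustration, biases are left out, and the reconstruction uses probabilities rather than samples, which is only one common variant.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hidden_probs(W, v):
    # P(h_j = 1 | v); W[i][j] links visible unit i to hidden unit j.
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(W[0]))]

def visible_probs(W, h):
    # P(v_i = 1 | h), using the same weights in the other direction.
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(W))]

def cd1_update(W, batch, lr, rng):
    # One CD-1 step over a mini-batch (biases omitted: an assumption
    # made to keep the sketch short). Returns a new weight matrix.
    n_vis, n_hid = len(W), len(W[0])
    grad = [[0.0] * n_hid for _ in range(n_vis)]
    for v0 in batch:
        p_h0 = hidden_probs(W, v0)                        # positive phase
        h0 = [1.0 if rng.random() < p else 0.0 for p in p_h0]
        v1 = visible_probs(W, h0)                         # reconstruction
        p_h1 = hidden_probs(W, v1)                        # negative phase
        for i in range(n_vis):
            for j in range(n_hid):
                grad[i][j] += v0[i] * p_h0[j] - v1[i] * p_h1[j]
    scale = lr / len(batch)
    return [[W[i][j] + scale * grad[i][j] for j in range(n_hid)]
            for i in range(n_vis)]

W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # 3 visible units, 2 hidden units
W = cd1_update(W, [[1.0, 0.0, 1.0]], lr=0.1, rng=random.Random(0))
```

The nested loops over the batch and over `i`, `j` amount to a difference of two matrix products, so this is the part that maps naturally onto a GPU matrix multiplication routine.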

4 Responses to “Papers on implementing RBM in GPU”

  1. Ian Calvert Says:

    I’m working on a GPU implementation myself. I’ve got horrible code up and about at http://wiki.github.com/IanCal/leonard.

    It’s just a start, but over the next month or so it should become a usable library for others working with RBMs and C++. It’ll be heavily restructured soon to make it easy to create experiments :) .

  2. zoo Says:

    Nice. I can only suggest to keep a pure C++ together with GPU implementation for testing purposes. It is hard to get it right. My current implementation is in C++ with some interface for future parallelization.
    It seems you are using 8600GT like me. Do you observe 5-10 times speedup?

  3. Ian Calvert Says:

    Thanks. Sorry, I’m not quite sure what you mean by:

    “I can only suggest to keep a pure C++ together with GPU implementation for testing purposes. It is hard to get it right. ”

    Do you mean have a CPU based version to check that the GPU version runs in the same way? I would like to do that.

    “It seems you are using 8600GT like me. Do you observe 5-10 times speedup?”

    I haven’t got a reference point with an optimised C++ program. Using the measure from the last paper, I get 279 MCUPS (for a single layer of 512×512, batch size of 32), so that would suggest a 27 times speedup. If I’ve got my maths right, that is :) That includes all file reading (though the file is written as a series of floats), conversion to column-major format, transfer, weight updates (including momentum) and random number generation.

    I’m not sure of the meaning of a few bits of the paper though. The update period is defined as “the time it takes for the implementation to complete a single batch of data”. What’s the batch size? Does it mean a single 512 vector? That’s what my calculation above assumes, btw.

    Raw performance figures: for training a 784×512×512×2048 network with 10 label units (not softmaxed though), I get a speed of roughly 900 samples per second. That’s for training all layers in sequence. As in, training all layers in sequence over 50k training samples takes just under a minute for everything. I get about 3k/s speeds for recognition.

    I’m curious about their implementation. They seem to use a matrix multiplication and a matrix addition for the weight updates, as well as a transpose. The transpose worries me a lot, since there’s no need to store a transposed matrix in global memory. You can just change the access pattern (as I assume sgemm does).

    Maybe my program is doing something wrong :)

  4. zoo Says:

    “Do you mean have a CPU based version to check that the GPU version runs in the same way? I would like to do that. ”
    Yes, it is always better to have a “reference implementation”, and when measuring the speed-up this implementation has to be optimised.
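The "reference implementation" idea from the comments can be sketched as follows. This is an illustration only, with hypothetical function names. A slow, obviously correct routine multiplies by the transposed weight matrix after building the transpose explicitly. A second routine gets the same product just by changing the access pattern, and a tolerance-based comparison checks that they agree. The tolerance matters once the candidate is a real GPU kernel, since its floating-point results rarely match the CPU bit for bit.

```python
import math

def matvec_transposed_reference(W, x):
    # Reference: materialise W^T, then take a plain row-by-row product.
    Wt = [list(col) for col in zip(*W)]        # Wt[j][i] == W[i][j]
    return [sum(w * xi for w, xi in zip(row, x)) for row in Wt]

def matvec_transposed_inplace(W, x):
    # Candidate: same product, no stored transpose, only a different
    # access pattern over the original matrix.
    n_rows, n_cols = len(W), len(W[0])
    out = [0.0] * n_cols
    for i in range(n_rows):
        xi = x[i]
        for j in range(n_cols):
            out[j] += W[i][j] * xi
    return out

def agrees(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    # Element-wise comparison with a tolerance; the default tolerances
    # are arbitrary choices for this sketch.
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate))

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, 0.5, -1.0]
ok = agrees(matvec_transposed_reference(W, x),
            matvec_transposed_inplace(W, x))
```

The same `agrees` check could be run on the output of each GPU stage against its CPU counterpart before any timing is done, so that a speed-up figure is only reported for code that gives matching results.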