Archive for the ‘OpenCL’ Category

Commercial Users of Functional Programming Videos

Just found a lot of videos from the “Commercial Users of Functional Programming” Edinburgh 2009

Lectures on Haskel, Scala, Erlang, F# etc.

Running Haskell Array Computations on a GPU

OpenCL Visual Profiler with Updated Drivers and SDK Code Samples

New Tool or Utility has just been posted to NVIDIA site.

This information is titled: OpenCL Visual Profiler with Updated Drivers
and SDK Code Samples

This information is for: General

The information description is: Leveraging the extensive performance
instrumentation in NVIDIA’s OpenCL
drivers and hardware performance signals designed into NVIDIA GPUs, the
OpenCL Visual Profiler provides developers with insight into performance
bottlenecks and opportunities for optimization. Key features include:
* Profiling of actual hardware signals, kernel efficiency,
and instruction issue rate
* Timing of memory copies between system memory and device memory
* Customizable graphs to help developers focus in on problem areas
* Basic auto-analysis to reveal warp serialization problems
* Easy import/export to CSV for custom analysis

This release also includes unified driver packages and an updated SDK
with support for both OpenCL and DirectCompute (be sure to follow the
instructions in the release notes to activate DirectCompute support).

Support for multi-GPU performance scaling has been added to most of the
OpenCL code samples, and several new code samples have been added as
well, including:
* oclMedianFilter
* oclFDTD3d
* oclRadixSort
* oclMersenneTwister
* oclSemirandomGenerator
* ImageProcessingCS
* nBodyCS
* OceanCS
* SimpleParticlesCS

Don”t miss the helpful OpenCL Best Practices Guide, designed to help
developers using OpenCL on the CUDA architecture implement high
performance parallel algorithms and understand best practices for GPU
Computing. Chapters on the following topics and more are included in
the guide:
* Heterogeneous Computing with OpenCL
* Performance Metrics
* Memory Optimizations
* NDRange Optimizations
* Instruction Optimizations
* Control Flow
* Performance Optimization Strategies

The drivers and SDK code samples in this release are compatible with
with the publicly available CUDA Toolkit 2.3 which is available at
www.nvidia.com/cuda.

Tags: , , ,

Papers on implementing RBM in GPU

Just read some papers on how to implement Restricted Boltzmann Machine on GPU. In RBM most computationally intensive part is weight update stage. Using GPU (CUDA, OpenCL, etc) can speed this stage 5-70 times depending on GPU and algorithm used. Major bottleneck in the implementations is the communication between main memory and the GPU unit:

“Design and Analysis of BLAS, GPU, and Sparse Multithreaded Acceleration Methods for Restricted Boltzmann Machine Training” (2009) 12-43 times speedup achieved  -  PDF

“Large-scale Deep Unsupervised Learning using Graphics Processors” (2009) 12-70 times speedup achieved -  PDF

Interesting part in the second paper is that they use a model with 100 million parameters. No discussion is given on regularization efforts and I am afraid this model can easily become overfitted.

“Neural Networks on GPUs: Restricted Boltzmann Machines” (2008) 66 times speedup achieved – PDF

Tags: , , , ,

OpenCL 1.0 Conformance Candidate Release

I just received a message from NVIDIA that they are releasing a OpenCL 1.0 Conformance
Candidate to GPU Computing registered developers:

The information description is:  We are pleased to announce the release of
our OpenCL 1.0 Conformance
Candidate to GPU Computing registered developers.  You now have access
to the OpenCL drivers we submitted this week to the Khronos OpenCL
working group.

The release also includes several OpenCL SDK code samples and
additional documentation to help you get started programming with
OpenCL.

Please submit bug reports (and feature/extension requests) using the
“Bug Report” link in top left when your are logged in.  You may also
ask questions and discuss this release and other OpenCL-related topics
in the OpenCL developer forums, here:

http://forums.nvidia.com/index.php?showforum=134

It is recommended that you follow the installation instructions in the
Release Notes for your platform. A driver update may be required, as
noted in the release notes.

Please review the release notes carefully after installation, as
this will allow for a much smoother introduction to the release.
While this release can be used on a wide variety of NVIDIA products,
only a subset were tested for this release.

NVIDIA OpenCL Jump Start Guide [PDF]

OpenCL 1.0 Specification

OpenCL at NVIDIA

Tags: , ,

Neural Information Processing Systems (NIPS) 2008 papers

Neural Information Processing Systems (NIPS) 2008 papers.
Some papers that attracted my attention:

The Recurrent Temporal Restricted Boltzmann
Machine
(PDF)

Supervised Bipartite Graph Inference (PDF)

Implicit Mixtures of Restricted Boltzmann Machines (PDF)

Tags: , ,

Restricted Boltzmann Machine – Short Tutorial

I have read a lot of papers on RBM and it seems to be difficult to grasp all the implementation details.
So, I want to share my experience with people facing the same problems. My tutorial is based on variant of RBM-s named Continuous Restricted Boltzmann Machine or CRBM for short. CRBM have very close implementation to original RBM with binomial neurons (0,1) as possible values of activation. At the end of the article I provide some code in Python. No guarantee is given that this implementation is correct so let me know of any bugs found.
First you may have a look at the original papers that describe theory behind RBM neural networks:

RBM application to Netflix challenge

Boltzmann Machine – Scholarpedia

Continuous RBM

What is a Restricted Boltzmann Machine

Restricted Boltzmann Machine is a stochastic neural network (that is a network of neurons where each neuron have some random behavior when activated). It consist of one layer of visible units (neurons) and one layer of hidden units. Units in each layer have no connections between them and are connected to all other units in other layer (fig.1). Connections between neurons are bidirectional and symmetric . This means that information flows in both directions during the training and during the usage of the network and that weights are the same in both directions.

Restricted Boltzmann Machine

fig.1 Restricted Boltzmann Machine

RBM Network works in the following way:
First the network is trained by using some data set and setting the neurons on visible layer to match data points in this data set.
After the network is trained we can use it on new unknown data to make classification of the data (this is known as unsupervised learning)

Data set

For purpose of this tutorial I will use artificially generated data set. It is 2D data that form a pattern shown in fig.2. I choose to use 500 data points for training. Because each data points is formed by to numbers between 0 and 1 neurons have to accept continuous values and I use Continuous RBM. I have tried different 2D patterns and this one seems to be relative difficult to be learned by CRBM.

Training Data

fig. 2 Training data

Learning Algorithm

Training a RBM is performed by algorithm known as “Contrastive Divergence Learning”.

More info on Contrastive Divergence
Let W be the matrix of IxJ (I – number of visible neurons, J – number of hidden neurons) that represents
weights between neurons. Each neuron input is provided by connections from all neurons in other layer.
Current neuron state S is formed by multiplication of each input by weight, summation over all inputs and application of this sum as a argument of nonlinear sigmoidal function:
Sj = F( Sum( Si x Wij + N(0,sigma)) ) – here Si are all neurons in given layer plus one bias neuron that stays constantly set at 1.0
N(0,1) is random number from normal distribution with mean 0.0 and standard deviation sigma (I use sigma=0.2).
This nonlinear function in my case is:
F = lo + (hi – lo)/(1.0 + exp(-A*Sj))

Where lo and hi are the lower and higher bound of input values (in my case 0,1), so it
becomes: F = 1.0/(1.0 + exp(-A*Sj))
A – is some parameter that is determined during the learning process.

Contrastive divergence is a value that is computed (actually matrix of values) and that is used to adjust values of W matrix. Changing W incrementally lead to training of W values.

Let W0 be the initial matrix of weights that are set to some random small values. I use N(0, 0.1) for this.
Let CD = <Si.Sj>0 – <Si.Sj>n  – contrastive divergence
Then on each step (epoch) W is updated to new value W’:
W’ = W + alpha*CD
Here alpha is some small step – “learning rate”. There exist more complex ways for W update that involve
some “momentum” and “cost” of update to avoid W values to become very large.

Contrastive Divergence explanation

There seems to be big confusion what exactly Contrastive Divergence means and how to implement it.
I have spend a lot of time to understand it.
First of all CD as shown in the formula above is a matrix of size IxJ. So this formula have to be
computed for each combination of I and J.
<…> is a average over each data point in the data set.

Si.Sj is just a multiplication of current activation (state) of neuron I and neuron J (obviously :) ). Where Si is the state of a visible neuron, and Sj is the state of a hidden neuron.
Indexes after <…> mean that average is taken after 0 or N-th reconstruction step performed.

fig. 2 Training Restricted Boltzmann Machine

fig. 2 Training Restricted Boltzmann Machine

How is the reconstruction performed?

  1. get one data point from data set.
  2. use values of this data point to set state of visible neurons Si
  3. compute Sj for each hidden neuron based on formula above and states of visible neurons Si
  4. now Si and Sj values can be used to compute (Si.Sj)0 – here (…) means just values not average
  5. on visible neurons compute Si using the Sj computed in step3. This is known as “reconstruction”
  6. compute state of hidden neurons Sj again using Si from 5 step.
  7. now use Si and Sj to compute (Si.Sj)1 (fig.3)
  8. repeating multiple times steps 5,6 and 7 compute (Si.Sj)n. Where n is small number and can increase with learning steps to achieve better accuracy.

The algorithm as a whole is:

  • For each epoch do:
    • For each data point in data set do:
      • CDpos =0, CDneg=0 (matrices)
      • perform steps 1…8
      • accumulate CDpos = CDpos + (Si.Sj)0
      • accumulate CDneg = CDneg + (Si.Sj)n
    • At the end compute average of CDpos and CDneg by dividing them by number of data points.
    • Compute CD = < Si.Sj >0 – < Si.Sj >n = CDpos – CDneg
    • Update weights and biases W’ = W + alpha*CD (biases are just weights to neurons that stay always 1.0)
    • Compute some “error function” like sum of squared difference between Si in 1) and Si in 5)
      e.g between data and reconstruction.
  • repeat for the next epoch until error is small or some fixed number of epoch.

The value of parameter A for visible units stay constant and for hidden unit is adjusted
by the same CD calculation but instead of formula above the following formula is used:
CD = (<Sj.Sj>0 – <Sj.Sj>n)/(Aj.Aj)

In my code i use n=20 initially and gradually increase it to 50.
Most of the steps in the algorithm can be performed by some simple matrix multiplication.

RBM usage

After the RBM learning process finishes it can be shown how well it performs on new data points.
I use set of 500 data points that are random and uniformly distributed in the 2D interval (0..1, 0..1).
Visible layer neurons states are set with values of each data point and steps 1) 2) 3) 5) are repeated
multiple times. Number of repetitions is the max number of “n” used in the learning process.

At the end of n-th step, neurons states in visible layer represent a data reconstruction. This is repeated for each data point. All the reconstruction points form a 2D image pattern that can be compared with initial image.

Python implementation of CRBM

At the end of this article I provide some implementation of Continuous Restricted Boltzmann Machine in Python. Most of the steps are performed by matrix operations using NumPy library. I used PyLab and SciPy to make some nice visualizations. Images of data that are shown (fig.4) are interpolated with Gaussian kernel in order to have some “feeling” of the density of data points and reconstructed data points.

RBM reconstruction

fig. 4 Training data and reconstruction.

Next image shows initial data set and reconstructed data superimposed. Initial data is in blue, reconstructed in red and green line connects each data point with reconstructed one.

Fig. 5 Training data (blue) and reconstruction (red) superimposed.

Fig. 5 Training data (blue) and reconstruction (red) superimposed.

Main difficulty I observed is to fine tune the set of learning parameters.

I used the following parameters and if somebody is able to find better values – let me know.
sigma=0.2
A=0.1 (on visible neurons)
Learning Rate W = 0.5
Learning Rate A = 0.5
Learning Cost = 0.00001
Learning Moment = 0.9

RBM architecture is 2 visible neurons for 2D data points and 8 hidden neurons. On some experiments with simple patterns 4 neurons are enough to reconstruct pattern successfully.

Python code for Restricted Boltzmann Machine

Tags: , , ,