Parallel Programming with Parallels
07 Apr 2013 - Nic Ho Chee
I've been a Microsoft Windows user for a while, and after some mishaps with traditional Windows laptops I ended up running Windows on a MacBook using Boot Camp. Recently some of my hardware was obsoleted on Windows 7, so I decided to go the whole hog and try OSX for my daily needs, with Windows running as a virtual machine through Parallels for any programming tasks needing Visual Studio. After using this setup for a few months I was surprised by how quickly it ran my compiled programs: I wasn't noticing the drop in responsiveness I had expected, so I thought I'd benchmark a known problem to get a sense of how fast virtualisation actually is...
My plan was to benchmark a parallelisable algorithm under a variety of implementations on Windows and OSX to get a feel for the difference in running times between the native(-ish) OSX and virtualised Windows builds. I settled on a simple Kohonen neural network implementation as it's eminently parallelisable and I'd be able to quickly scrape some data to run through it; the core update step that every version boils down to is sketched just after this list. I set out to test implementations using the following patterns/APIs:
- A naive implementation. A quick and dirty version with no concern for memory access costs or general optimisations.
- A naive optimised implementation. A slightly faster version which reduces memory copying and arranges operations to be more amenable to parallelisation.
- An implementation using the Parallel Patterns Library (PPL). This would only be available when running through Windows, but it looked like it would give instant bang for buck.
- An implementation using OpenMP. Unfortunately I'm running Visual Studio Express at home, which doesn't support OpenMP out of the box at present. There looks to be a way around this, but after past experience hacking about with Visual Studio installs I didn't want to modify mine just to get around it.
- An OpenCL implementation. I've been tinkering with GPGPU programming on desktops for a while, so I was interested in seeing what could be done on a laptop GPU. This turned out to be a pain: Parallels provides its own GPU driver and none of the Windows OpenCL code would run, so I ended up writing this part entirely under Xcode on OSX.
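For reference, here is what a single Kohonen training step boils down to, written against plain arrays rather than the classes shown later in the post. This is only a minimal sketch of the update rule quoted in SimpleTrainer.h below; the neighbourhood function is left as a parameter because each implementation supplies its own:

    #include <cstddef>

    // Minimal sketch of one Kohonen update: move each node's weights towards the
    // input vector, scaled by the learning rate and a neighbourhood function that
    // falls off with distance from the winning node.
    void KohonenStep(
        double* nodeWeights,          // nodeCount * weightCount weights, one row per node
        const double* input,          // weightCount input values, i.e. D(t)
        std::size_t nodeCount,
        std::size_t weightCount,
        std::size_t winner,           // index of the best matching node
        double learningRate,          // α(s)
        double (*neighbourhood)( std::size_t winner, std::size_t node ) ) // Θ(u, v, s)
    {
        for ( std::size_t n = 0; n < nodeCount; ++n )
        {
            double scale = learningRate * neighbourhood( winner, n );
            for ( std::size_t w = 0; w < weightCount; ++w )
            {
                double& weight = nodeWeights[n * weightCount + w];
                weight += scale * ( input[w] - weight );   // Wv(s+1) = Wv(s) + Θ·α·(D(t) - Wv(s))
            }
        }
    }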
The Results...
The table below shows the results, averaged over 5 runs, for 2D Kohonen networks of 10x10 and 100x100 nodes trained against a set of 320 groups of 7 greyscale values from 0x00 to 0xFF. All timings were captured with a small scoped PerformanceCounter class (a sketch of the idea follows the table) and are in milliseconds:
| | 10 x 10 nodes | 100 x 100 nodes |
| --- | --- | --- |
| Naive Implementation, Windows | 56.25306 ms | 4774.4 ms |
| Naive Optimised Implementation, Windows | 2.35398 ms | 168.0156 ms |
| Parallel Patterns Implementation, Windows | 4.26502 ms | 251.52 ms |
| Naive Implementation, Mac OSX | 71.952 ms | 7018.11 ms |
| Naive Optimised Implementation, Mac OSX | 1.8228 ms | 104.171 ms |
| OpenCL Implementation, Mac OSX | 23.0954 ms | 24.6672 ms |
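The timings come from a small RAII-style PerformanceCounter class which logs the time spent in the enclosing scope; its implementation isn't reproduced in this post, but the idea is roughly the following std::chrono-based stand-in (not the actual class, which takes file and function strings as seen in the listings below):

    #include <chrono>
    #include <iostream>

    // Rough stand-in for PerformanceCounter: starts a timer on construction and
    // prints the elapsed milliseconds for the enclosing scope on destruction.
    class ScopedTimer
    {
        const char* m_label;
        std::chrono::high_resolution_clock::time_point m_start;
    public:
        explicit ScopedTimer( const char* label ) :
            m_label( label ),
            m_start( std::chrono::high_resolution_clock::now() )
        {
        }
        ~ScopedTimer()
        {
            auto elapsed = std::chrono::high_resolution_clock::now() - m_start;
            double ms = std::chrono::duration<double, std::milli>( elapsed ).count();
            std::cout << m_label << ": " << ms << " ms" << std::endl;
        }
    };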
A few things jump out from this:
- The C++ code for the naive implementations is exactly the same on the virtualised Windows and native OSX builds, yet the memory-copy-heavy naive implementation runs faster under Parallels than under OSX! The optimised version, which shouldn't be memory-copy limited, is faster under OSX than under virtualised Windows. This suggests to me that OSX is doing a lot of work behind the scenes in terms of memory management and that Parallels doesn't follow this path; when that work isn't needed, because memory is used more sensibly, the overhead of virtualisation makes the Windows version slightly slower than the Mac OSX version. After seeing this I wouldn't be surprised to find that the same code on a native Windows system runs slightly faster than the OSX version.
- The Parallel Patterns Library version doesn't bring anything to the party when running through Parallels. I spent some time checking that I had implemented it correctly and can only surmise that, at least for this algorithm, there is little benefit to using the PPL under virtualisation. I need to run this on a comparable native Windows system to see whether that is an artefact of virtualisation or a side-effect of the chosen algorithm.
- OpenCL is fast! The 100x100 runs should be somewhere in the region of 100 times slower than the 10x10 runs; from the table the OpenCL version is only around 1.07 times slower (24.6672 ms vs 23.0954 ms). The setup and transfer overhead makes it much slower than the naive optimised versions at 10x10, but it is significantly faster at 100x100. It may be possible to hold the entire computation in GPU memory, in which case the 100x100 OpenCL version would be faster still (a rough sketch of that restructuring appears after the OpenCL listing at the end of this post). My main bugbear with setting up memory objects to transfer to the GPU is that it breaks object-oriented programming and leads programmers towards premature optimisation. I'm very impressed that even with a large amount of memory copying to and from the GPU it is still this capable, and it's going to lead to me pushing more work its way in the future. I'm very much looking forward to a certain new Big Blue Box which is reputed to have a stonking GPU and unified memory, as that would seem to remove the need for data transfer between CPU and GPU memory.
- After looking at these results I'm quite happy with the performance of Windows running virtualised under Parallels compared with the general speed of Mac OSX. I'm disappointed that I can't use OpenCL with Visual Studio under Parallels, but apart from that I have few complaints about the performance.
- This was my first experience of Xcode 4.x and I have to say I'm not that impressed. It feels clunky compared with other IDEs, and a few things, such as Schemes, seem a bit broken compared with their direct analogues elsewhere. I'll need to play with it for a while longer to see whether I get used to it.
How It Was Done
In overview, the training algorithms were split into a common base class and derived classes which carried out the various test permutations. The code was put together in a relatively quick-and-dirty fashion so I could prototype the implementations in a reasonable time; I would want cleaner implementations with more error checking before using this code for anything else. Here is the base training class, followed by the SimpleTrainer specialisation and a short usage sketch:
//
// Trainer.h
// Simple trainer base class.
//
#pragma once
#include "Network.h"
#include "Distance.h"
#include "WeightVector.h"
#include "TrainingSet.h"
#define D_NUM_TRAINING_ITERATIONS 1000
class Trainer
{
/// Reference to the node setup.
Network* m_network;
/// Current step in the training set.
unsigned int m_step;
/// Total number of iterations to run.
unsigned int m_iterations;
/// Get a neighbourhood scalar value based on distance to the winner.
virtual double
GetNeighbourhoodValue(
Node<Weight<double>> const& winner,
Node<Weight<double>> const& neighbour ) = 0;
/// Calculate a simple distance check from a given position.
virtual double
CalculateDistance(
WeightVector<Weight<double>> const& weights,
const int x,
const int y );
protected:
/// Get the current node setup.
Network*
GetNetwork( void );
/// Get the total number of iterations for this trainer
const unsigned int
GetIterations( void ) const;
/// Get a simple learning coefficient which decreases with some
/// function as the number of steps increases.
virtual const double
GetLearningCoefficient( void ) const;
public:
/// Constructor which takes the node setup and number of iterations
/// we're expecting to run.
Trainer(
Network* network,
const unsigned int iterations = D_NUM_TRAINING_ITERATIONS );
/// It's the destructor.
~Trainer( void );
/// Run a complete set of iterations as training steps.
virtual void
Train(
TrainingSet const* trainingSet ) = 0;
/// Run a single iteration of training.
virtual void
Iterate(
TrainingSet const* trainingSet ) = 0;
};
//
// SimpleTrainer.h
// Simple, very dumb training algorithm for a Kohonen network
// Wv(s + 1) = Wv(s) + Θ(u, v, s) α(s)(D(t) - Wv(s))
//
#pragma once
#include "Trainer.h"
#include "TrainingSet.h"
#include "PerformanceCounter.h"
/// Rough and ready Kohonen network trainer.
class SimpleTrainer : public Trainer
{
protected:
/// Get a simple neighbourhood scalar value based on distance to the winner.
virtual double
GetNeighbourhoodValue(
Node<Weight<double>> const& winner,
Node<Weight<double>> const& neighbour );
/// Calculate the winning node closest to the incoming set of weights.
virtual Node<Weight<double>>*
CalculateBestMatchingNode(
WeightVector<Weight<double>> const& weights );
/// Calculate the next step value after finding a winning node.
void CalculateIncrement(
const WeightVector<Weight<double>>& weights,
const Node<Weight<double>>& winner );
public:
/// Constructor, takes the node network and the number of iterations to run for.
SimpleTrainer(
Network* network,
const unsigned int iterations );
/// Run a complete set of iterations as training steps.
virtual void
Train(
TrainingSet const* trainingSet );
/// Run a single iteration of training.
virtual void
Iterate(
TrainingSet const* trainingSet );
};
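For completeness, a trainer is driven along the following lines. This is a hypothetical snippet rather than the actual benchmark harness; in particular the Network constructor arguments are assumed, as the Network, Node and TrainingSet classes aren't reproduced here:

    #include "SimpleTrainer.h"
    #include "TrainingSet.h"

    // Hypothetical driver: build a 10x10 network, then run the full set of
    // iterations over a previously loaded training set. Constructor signatures
    // are assumed, as the Network and TrainingSet implementations aren't shown.
    void RunBenchmark( TrainingSet const* trainingSet )
    {
        Network network( 10, 10 );                        // assumed: width x height of nodes
        SimpleTrainer trainer( &network, D_NUM_TRAINING_ITERATIONS );
        trainer.Train( trainingSet );                     // timings are logged by PerformanceCounter
    }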
The Parallel Patterns based implementation is pretty simple, and whilst there is some locking in there it doesn't look like that is the cause of its tardiness compared with the other implementations. There may be specific programming practices needed to get any benefit from it which aren't immediately obvious from the MSDN overview and API documentation, which is a shame, as you would hope the scheduler would handle this in a similar manner to GPGPU programming. One obvious experiment would be to remove the lock from the best-matching-node search entirely; a possible lock-free alternative is sketched after the listing below.
// ParallelTrainer.cpp
//
#include "stdafx.h"
#include "ParallelTrainer.h"
#include <ppl.h>
#include <atomic>
#include <thread>
#include <cfloat> // DBL_MAX used in CalculateBestMatchingNode
using namespace concurrency;
ParallelTrainer::ParallelTrainer(
Network* network,
const unsigned int iterations ) :
SimpleTrainer( network, iterations )
{
}
void
ParallelTrainer::Train(
TrainingSet const* trainingSet )
{
PerformanceCounter counter( __FILE__, __FUNCTION__, "" );
Trainer::Train( trainingSet );
for (unsigned int i = 0; i < GetIterations(); ++i)
{
Iterate( trainingSet );
}
}
void
ParallelTrainer::Iterate(
TrainingSet const* trainingSet )
{
PerformanceCounter counter( __FILE__, __FUNCTION__, "" );
unsigned int count = trainingSet->GetCount();
for (unsigned int i = 0; i < count; ++i)
{
WeightVector<Weight<double>> const& weights = *(trainingSet->GetWeights(i));
Node<Weight<double>>* node = CalculateBestMatchingNode( weights );
CalculateIncrement( weights, *node );
}
}
Node<Weight<double>>*
ParallelTrainer::CalculateBestMatchingNode(
WeightVector<Weight<double>> const& weights )
{
unsigned int nodeCount = GetNetwork()->GetCount();
double minDistance = DBL_MAX;
Node<Weight<double>>* winner = NULL;
std::atomic_flag lock = ATOMIC_FLAG_INIT;
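// Each work item computes its node's distance in parallel; the shared minimum
// and winner are protected by a simple atomic_flag spinlock.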
parallel_for((unsigned int) 0, nodeCount, [&](unsigned int i){
Node<Weight<double>>* node = GetNetwork()->GetNode( i );
double distance = node->CalculateDistance( weights );
while (lock.test_and_set(std::memory_order_acquire));
if ( distance < minDistance )
{
winner = node;
minDistance = distance;
}
lock.clear(std::memory_order_release);
});
return ( winner );
}
void
ParallelTrainer::CalculateIncrement(
const WeightVector<Weight<double>>& weights,
const Node<Weight<double>>& winner )
{
unsigned int nodeCount = GetNetwork()->GetCount();
parallel_for((unsigned int) 0, nodeCount, [&](unsigned int i){
Node<Weight<double>>* node = GetNetwork()->GetNode( i );
double scale = GetNeighbourhoodValue( winner, *node ) * GetLearningCoefficient();
const unsigned int count = node->GetWeights().Count();
for ( unsigned int j = 0; j < count; ++j )
{
double value = ( winner.GetWeights()[j]->Get() - node->GetWeights()[j]->Get() ) * scale;
// Add the increment to the existing weight rather than overwriting it, mirroring the += in the OpenCL kernel.
node->GetWeights()[j]->Set( node->GetWeights()[j]->Get() + value );
}
});
}
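The experiment referenced above would be to drop the spinlock and let each worker thread keep its own local minimum, combining them once at the end; the PPL's concurrency::combinable is designed for exactly this kind of per-thread reduction. The sketch below is a hypothetical variant of CalculateBestMatchingNode (it isn't part of the code I benchmarked) using the same Network interface as above, with <utility> added for std::pair:

    // Hypothetical lock-free best-matching-node search using a per-thread reduction.
    Node<Weight<double>>*
    ParallelTrainer::CalculateBestMatchingNodeCombinable(
        WeightVector<Weight<double>> const& weights )
    {
        unsigned int nodeCount = GetNetwork()->GetCount();
        // Each worker thread keeps its own (distance, index) pair, so no shared lock is needed.
        concurrency::combinable<std::pair<double, unsigned int>> best(
            [] { return std::make_pair( DBL_MAX, 0u ); } );
        concurrency::parallel_for( 0u, nodeCount, [&]( unsigned int i ) {
            double distance = GetNetwork()->GetNode( i )->CalculateDistance( weights );
            std::pair<double, unsigned int>& local = best.local();
            if ( distance < local.first )
                local = std::make_pair( distance, i );
        });
        // Combine the per-thread minima into a single winner once all work items have finished.
        std::pair<double, unsigned int> winner = best.combine(
            []( std::pair<double, unsigned int> a, std::pair<double, unsigned int> b ) {
                return ( a.first < b.first ) ? a : b;
            });
        return ( GetNetwork()->GetNode( winner.second ) );
    }

The trade-off is that combine() runs once per worker thread rather than once per node, so the serialised section shrinks from one lock acquisition per node to a handful of comparisons; whether that makes any measurable difference under Parallels is something I'd want to test rather than guess at.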
Finally, the OpenCL implementation is very much a C-like implementation, with data unpacked from the OO C++ structures into flat memory arrays to be copied into GPU memory. The header file:
//
// ClTrainer.h
// ParallelTest
//
#ifndef __ParallelTest__ClTrainer__
#define __ParallelTest__ClTrainer__
#pragma once
#include <iostream>
#include <OpenCL/opencl.h>
#include "SimpleTrainer.h"
/// OpenCL based Kohonen network trainer.
class ClTrainer : public SimpleTrainer
{
/// Identifies a specific OpenCL device.
cl_device_id m_deviceId;
/// An OpenCL context which groups commands, programs etc.
cl_context m_context;
/// An ordered set of commands to carry out for various program objects.
cl_command_queue m_commandQueue;
/// A given OpenCL program which can contain one or more kernels.
cl_program m_program;
/// A kernel function which is used to find the node with the smallest distance given some input weights.
cl_kernel m_distanceKernel;
/// A kernel function which is used to calculate the incremented state of the network.
cl_kernel m_incrementKernel;
/// Is everything initialised and ready to use?
bool m_initialised;
public:
/// Constructor, takes the node network and the number of iterations to run for.
ClTrainer(
Network* network,
const unsigned int iterations );
/// Destructor, tear down the clTrainer.
~ClTrainer( void );
/// Run a complete set of iterations as training steps.
virtual void
Train(
TrainingSet const* trainingSet );
/// Run a single iteration of training.
virtual void
Iterate(
TrainingSet const* trainingSet );
};
#endif /* defined(__ParallelTest__ClTrainer__) */
//
// ClTrainer.cpp
// ParallelTest
//
#include "ClTrainer.h"
#include <assert.h>
const char* ProgramKernels = "\n" \
"__kernel void kohonenDistance( \n" \
" __global double* weights, \n" \
" __global double* nodeWeights, \n" \
" __global double* distance, \n" \
" unsigned int weightCount) \n" \
"{ \n" \
" int nodeWeightId = get_global_id(0); \n" \
" double euclidDistance = 0.0; \n" \
" for (unsigned int i = 0; i < weightCount; ++i) \n" \
" { \n" \
" double a = weights[i] - nodeWeights[weightCount*nodeWeightId+i]; \n" \
" euclidDistance += (a * a); \n" \
" } \n" \
" euclidDistance = sqrt(euclidDistance); \n" \
" distance[nodeWeightId] = euclidDistance; \n" \
"} \n" \
"__kernel void kohonenIncrement( \n" \
" __global double* winner, \n" \
" __global double* nodeWeights, \n" \
" __global int* nodePositions, \n" \
" double learningCoefficient, \n" \
" int winnerX, \n" \
" int winnerY, \n" \
" unsigned int weightCount) \n" \
"{ \n" \
" int nodeWeightId = get_global_id(0); \n" \
" const double oneOverSix = 1.0/6.0; \n" \
" const double oneOverThree = 1.0/3.0; \n" \
" double x = winnerX - nodePositions[nodeWeightId*2]; \n" \
" double y = winnerY - nodePositions[nodeWeightId*2+1]; \n" \
" double euclidDistance = sqrt( x*x + y*y ); \n" \
" double scale = 0.0; \n" \
" if ( euclidDistance < 6.0 ) \n" \
" { \n" \
" euclidDistance *= oneOverSix; \n" \
" scale = 1.0 - pow( euclidDistance, oneOverThree ); \n" \
" } \n" \
" for (unsigned int i = 0; i < weightCount; ++i) \n" \
" { \n" \
" double a = winner[i] - nodeWeights[weightCount*nodeWeightId+i]; \n" \
" a *= scale * learningCoefficient; \n" \
" nodeWeights[weightCount*nodeWeightId+i] += a; \n" \
" } \n" \
"} \n" \
"\n";
ClTrainer::ClTrainer(
Network* network,
const unsigned int iterations ) :
SimpleTrainer( network, iterations )
{
int result = 0;
bool useGpu = true; //If it fails it will crash the GPU, test with the CPU before doing this!
m_initialised = false;
m_program = 0;
m_distanceKernel = 0;
m_incrementKernel = 0;
m_commandQueue = 0;
m_context = 0;
m_deviceId = 0;
// Get the required device id, we need this to setup the rest of the CL system.
result = clGetDeviceIDs(NULL, useGpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU, 1, &m_deviceId, NULL);
if (result == CL_SUCCESS)
{
m_context = clCreateContext(0, 1, &m_deviceId, NULL, NULL, &result);
if (m_context)
{
m_commandQueue = clCreateCommandQueue(m_context, m_deviceId, 0, &result);
m_program = clCreateProgramWithSource(m_context, 1, (const char**)&ProgramKernels, NULL, &result);
if (m_program)
{
result = clBuildProgram(m_program, 0, NULL, NULL, NULL, NULL);
if (result != CL_SUCCESS)
{
char error[4096];
size_t length;
clGetProgramBuildInfo(m_program, m_deviceId, CL_PROGRAM_BUILD_LOG, sizeof(error), error, &length);
std::cout << "Error, build failed: " << error;
}
else
{
m_distanceKernel = clCreateKernel(m_program, "kohonenDistance", &result);
m_incrementKernel = clCreateKernel(m_program, "kohonenIncrement", &result);
if (m_distanceKernel && m_incrementKernel)
m_initialised = true;
}
}
}
}
}
ClTrainer::~ClTrainer( void )
{
if (m_program)
clReleaseProgram(m_program);
if (m_distanceKernel)
clReleaseKernel(m_distanceKernel);
if (m_incrementKernel)
clReleaseKernel(m_incrementKernel);
if (m_commandQueue)
clReleaseCommandQueue(m_commandQueue);
if (m_context)
clReleaseContext(m_context);
if (m_deviceId)
clReleaseDevice(m_deviceId);
}
void
ClTrainer::Train(
TrainingSet const* trainingSet )
{
PerformanceCounter counter( __FILE__, __FUNCTION__, "" );
assert(m_initialised);
Trainer::Train( trainingSet );
for (unsigned int i = 0; i < GetIterations(); ++i)
{
Iterate( trainingSet );
}
}
void
ClTrainer::Iterate(
TrainingSet const* trainingSet )
{
PerformanceCounter counter( __FILE__, __FUNCTION__, "" );
unsigned int count = trainingSet->GetCount();
// Assumption that all trainingSet inputs have the same count... at least check we have elements in
// the training set and there are some weights there.
assert(count > 0 && trainingSet->GetWeights(0)->Count() > 0);
unsigned int nodeCount = GetNetwork()->GetCount();
unsigned int weightCount = static_cast<unsigned int>(trainingSet->GetWeights(0)->Count());
double* rawWeights = new double [weightCount];
double* rawNodeWeights = new double [weightCount * nodeCount];
double* rawDistance = new double [nodeCount];
int* rawNodePositions = new int [nodeCount * 2];
cl_mem weights = clCreateBuffer(m_context, CL_MEM_READ_ONLY, sizeof(double) * weightCount, NULL, NULL);
cl_mem nodeWeights = clCreateBuffer(m_context, CL_MEM_READ_ONLY, sizeof(double) * weightCount * nodeCount, NULL, NULL);
cl_mem distance = clCreateBuffer(m_context, CL_MEM_WRITE_ONLY, sizeof(double) * nodeCount, NULL, NULL);
cl_mem nodeRwWeights = clCreateBuffer(m_context, CL_MEM_READ_WRITE, sizeof(double) * weightCount * nodeCount, NULL, NULL);
cl_mem nodePositions = clCreateBuffer(m_context, CL_MEM_READ_ONLY, sizeof(int) * 2 * nodeCount, NULL, NULL);
// Store the node weights. We're going to reuse these in both kernels so only have to store these once
// and write back once computation is finished.
for (unsigned int i = 0; i < nodeCount; ++i)
{
Node<Weight<double>>* node = GetNetwork()->GetNode( i );
std::vector<Weight<double>*> const& a = node->GetWeights().GetWeights();
for (unsigned int j = 0; j < weightCount; ++j)
rawNodeWeights[i * weightCount + j] = a[j]->Get();
rawNodePositions[i*2] = node->GetX();
rawNodePositions[i*2+1] = node->GetY();
}
// The globalWorkSize controls how many work items are spawned. In this case we have
// a 1D weight set of N elements to "multiply" through a 2D weight set of N-by-number-of-nodes,
// which generates a set of distances with count equal to the number of nodes. The same work
// size is usable in both kernels. The local work-group size is left to the runtime (NULL in the
// clEnqueueNDRangeKernel calls below), as a fixed size of nodeCount can exceed the device's
// maximum work-group size for larger networks.
size_t globalWorkSize[1];
globalWorkSize[0] = nodeCount;
for (unsigned int i = 0; i < count; ++i)
{
std::vector<Weight<double>*> const& inputWeights = (trainingSet->GetWeights(i))->GetWeights();
// Run the first kernel.
for (unsigned int j = 0; j < weightCount; ++j)
rawWeights[j] = inputWeights[j]->Get();
int result = 0;
result = clEnqueueWriteBuffer(m_commandQueue, weights, CL_TRUE, 0, sizeof(double) * weightCount, rawWeights, 0, NULL, NULL);
result = clEnqueueWriteBuffer(m_commandQueue, nodeWeights, CL_TRUE, 0, sizeof(double) * weightCount * nodeCount, rawNodeWeights, 0, NULL, NULL);
result = 0;
result = clSetKernelArg(m_distanceKernel, 0, sizeof(cl_mem), &weights);
result |= clSetKernelArg(m_distanceKernel, 1, sizeof(cl_mem), &nodeWeights);
result |= clSetKernelArg(m_distanceKernel, 2, sizeof(cl_mem), &distance);
result |= clSetKernelArg(m_distanceKernel, 3, sizeof(unsigned int), &weightCount);
// Do the work.
result = clEnqueueNDRangeKernel(m_commandQueue, m_distanceKernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
// Get the data back.
result = clEnqueueReadBuffer(m_commandQueue, distance, CL_TRUE, 0, sizeof(double) * nodeCount, rawDistance, 0, NULL, NULL);
// Now we have a set of distances and need to find the smallest distance.
unsigned int winner = 0;
double minDistance = std::numeric_limits<double>::max();
for (unsigned int j = 0; j < nodeCount; ++j)
{
if (rawDistance[j] < minDistance)
{
minDistance = rawDistance[j];
winner = j;
}
}
// Prepare for running the second kernel.
Node<Weight<double>>* winningNode = GetNetwork()->GetNode( winner );
std::vector<Weight<double>*> const& winningWeights = winningNode->GetWeights().GetWeights();
int winnerX = winningNode->GetX();
int winnerY = winningNode->GetY();
double learningCoefficient = GetLearningCoefficient();
// Store the winning weights
for (unsigned int j = 0; j < weightCount; ++j)
rawWeights[j] = winningWeights[j]->Get();
result = clEnqueueWriteBuffer(m_commandQueue, weights, CL_TRUE, 0, sizeof(double) * weightCount, rawWeights, 0, NULL, NULL);
result = clEnqueueWriteBuffer(m_commandQueue, nodeRwWeights, CL_TRUE, 0, sizeof(double) * weightCount * nodeCount, rawNodeWeights, 0, NULL, NULL);
result = clEnqueueWriteBuffer(m_commandQueue, nodePositions, CL_TRUE, 0, sizeof(int) * 2 * nodeCount, rawNodePositions, 0, NULL, NULL);
result = 0;
result = clSetKernelArg(m_incrementKernel, 0, sizeof(cl_mem), &weights);
// Argument indices must match the kohonenIncrement signature above; the node weights argument
// is the read-write buffer we write to and read back from.
result |= clSetKernelArg(m_incrementKernel, 1, sizeof(cl_mem), &nodeRwWeights);
result |= clSetKernelArg(m_incrementKernel, 2, sizeof(cl_mem), &nodePositions);
result |= clSetKernelArg(m_incrementKernel, 3, sizeof(double), &learningCoefficient);
result |= clSetKernelArg(m_incrementKernel, 4, sizeof(int), &winnerX);
result |= clSetKernelArg(m_incrementKernel, 5, sizeof(int), &winnerY);
result |= clSetKernelArg(m_incrementKernel, 6, sizeof(unsigned int), &weightCount);
// Do the work.
result = clEnqueueNDRangeKernel(m_commandQueue, m_incrementKernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
// Get the data back.
result = clEnqueueReadBuffer(m_commandQueue, nodeRwWeights, CL_TRUE, 0, sizeof(double) * weightCount * nodeCount, rawNodeWeights, 0, NULL, NULL);
}
// Writeback the weights..
for (unsigned int i = 0; i < nodeCount; ++i)
{
Node<Weight<double>>* node = GetNetwork()->GetNode( i );
for (unsigned int j = 0; j < weightCount; ++j)
*(node->GetWeights()[j]) = rawNodeWeights[i * weightCount + j];
}
delete [] rawWeights;
delete [] rawNodeWeights;
delete [] rawDistance;
delete [] rawNodePositions;
clReleaseMemObject(weights);
clReleaseMemObject(nodeWeights);
clReleaseMemObject(distance);
clReleaseMemObject(nodeRwWeights);
clReleaseMemObject(nodePositions);
}
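As mentioned in the results section, the next step for the OpenCL version would be to keep the node weights resident in GPU memory for the whole training run instead of writing and reading them on every iteration. The function below is a rough, hypothetical restructuring of Train()/Iterate() along those lines (it isn't declared in ClTrainer.h above): the node weights and positions are uploaded once, the winning node's weights are copied device-to-device with clEnqueueCopyBuffer instead of being re-uploaded from the host, and only the distance array crosses back to the host each pass, with a single read of the trained weights at the very end. Error checking is omitted, as in the listings above.

    // Hypothetical ClTrainer::TrainResident - keeps the node weights resident on the GPU
    // for the whole training run. Buffer setup mirrors Iterate() above.
    void
    ClTrainer::TrainResident(
        TrainingSet const* trainingSet )
    {
        PerformanceCounter counter( __FILE__, __FUNCTION__, "" );
        assert(m_initialised && trainingSet->GetCount() > 0);
        unsigned int count = trainingSet->GetCount();
        unsigned int nodeCount = GetNetwork()->GetCount();
        unsigned int weightCount = static_cast<unsigned int>(trainingSet->GetWeights(0)->Count());
        double* rawWeights = new double [weightCount];
        double* rawNodeWeights = new double [weightCount * nodeCount];
        double* rawDistance = new double [nodeCount];
        int* rawNodePositions = new int [nodeCount * 2];
        cl_mem inputWeights = clCreateBuffer(m_context, CL_MEM_READ_ONLY, sizeof(double) * weightCount, NULL, NULL);
        cl_mem winnerWeights = clCreateBuffer(m_context, CL_MEM_READ_ONLY, sizeof(double) * weightCount, NULL, NULL);
        cl_mem distance = clCreateBuffer(m_context, CL_MEM_WRITE_ONLY, sizeof(double) * nodeCount, NULL, NULL);
        cl_mem nodeRwWeights = clCreateBuffer(m_context, CL_MEM_READ_WRITE, sizeof(double) * weightCount * nodeCount, NULL, NULL);
        cl_mem nodePositions = clCreateBuffer(m_context, CL_MEM_READ_ONLY, sizeof(int) * 2 * nodeCount, NULL, NULL);
        // Pack and upload the node weights and positions once for the whole run.
        for (unsigned int i = 0; i < nodeCount; ++i)
        {
            Node<Weight<double>>* node = GetNetwork()->GetNode( i );
            std::vector<Weight<double>*> const& a = node->GetWeights().GetWeights();
            for (unsigned int j = 0; j < weightCount; ++j)
                rawNodeWeights[i * weightCount + j] = a[j]->Get();
            rawNodePositions[i*2] = node->GetX();
            rawNodePositions[i*2+1] = node->GetY();
        }
        clEnqueueWriteBuffer(m_commandQueue, nodeRwWeights, CL_TRUE, 0, sizeof(double) * weightCount * nodeCount, rawNodeWeights, 0, NULL, NULL);
        clEnqueueWriteBuffer(m_commandQueue, nodePositions, CL_TRUE, 0, sizeof(int) * 2 * nodeCount, rawNodePositions, 0, NULL, NULL);
        size_t globalWorkSize[1] = { nodeCount };
        for (unsigned int iteration = 0; iteration < GetIterations(); ++iteration)
        {
            for (unsigned int i = 0; i < count; ++i)
            {
                // Upload only the current training input.
                std::vector<Weight<double>*> const& input = (trainingSet->GetWeights(i))->GetWeights();
                for (unsigned int j = 0; j < weightCount; ++j)
                    rawWeights[j] = input[j]->Get();
                clEnqueueWriteBuffer(m_commandQueue, inputWeights, CL_TRUE, 0, sizeof(double) * weightCount, rawWeights, 0, NULL, NULL);
                // The distance kernel runs against the resident node weights.
                clSetKernelArg(m_distanceKernel, 0, sizeof(cl_mem), &inputWeights);
                clSetKernelArg(m_distanceKernel, 1, sizeof(cl_mem), &nodeRwWeights);
                clSetKernelArg(m_distanceKernel, 2, sizeof(cl_mem), &distance);
                clSetKernelArg(m_distanceKernel, 3, sizeof(unsigned int), &weightCount);
                clEnqueueNDRangeKernel(m_commandQueue, m_distanceKernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
                // Only the distances come back to the host, to pick the winner.
                clEnqueueReadBuffer(m_commandQueue, distance, CL_TRUE, 0, sizeof(double) * nodeCount, rawDistance, 0, NULL, NULL);
                unsigned int winner = 0;
                for (unsigned int j = 1; j < nodeCount; ++j)
                    if (rawDistance[j] < rawDistance[winner])
                        winner = j;
                Node<Weight<double>>* winningNode = GetNetwork()->GetNode( winner );
                int winnerX = winningNode->GetX();
                int winnerY = winningNode->GetY();
                double learningCoefficient = GetLearningCoefficient();
                // Copy the winning node's GPU-side weights device-to-device instead of re-uploading from the host.
                clEnqueueCopyBuffer(m_commandQueue, nodeRwWeights, winnerWeights, sizeof(double) * weightCount * winner, 0, sizeof(double) * weightCount, 0, NULL, NULL);
                clSetKernelArg(m_incrementKernel, 0, sizeof(cl_mem), &winnerWeights);
                clSetKernelArg(m_incrementKernel, 1, sizeof(cl_mem), &nodeRwWeights);
                clSetKernelArg(m_incrementKernel, 2, sizeof(cl_mem), &nodePositions);
                clSetKernelArg(m_incrementKernel, 3, sizeof(double), &learningCoefficient);
                clSetKernelArg(m_incrementKernel, 4, sizeof(int), &winnerX);
                clSetKernelArg(m_incrementKernel, 5, sizeof(int), &winnerY);
                clSetKernelArg(m_incrementKernel, 6, sizeof(unsigned int), &weightCount);
                clEnqueueNDRangeKernel(m_commandQueue, m_incrementKernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
            }
        }
        // Read the trained weights back into the Network once, at the very end.
        clEnqueueReadBuffer(m_commandQueue, nodeRwWeights, CL_TRUE, 0, sizeof(double) * weightCount * nodeCount, rawNodeWeights, 0, NULL, NULL);
        for (unsigned int i = 0; i < nodeCount; ++i)
        {
            Node<Weight<double>>* node = GetNetwork()->GetNode( i );
            for (unsigned int j = 0; j < weightCount; ++j)
                *(node->GetWeights()[j]) = rawNodeWeights[i * weightCount + j];
        }
        delete [] rawWeights; delete [] rawNodeWeights; delete [] rawDistance; delete [] rawNodePositions;
        clReleaseMemObject(inputWeights); clReleaseMemObject(winnerWeights); clReleaseMemObject(distance);
        clReleaseMemObject(nodeRwWeights); clReleaseMemObject(nodePositions);
    }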