Parallel Programming with Parallels

I’ve been a Microsoft Windows user for a while, and after some mishaps with traditional Windows laptops I ended up running Windows on a MacBook via Boot Camp. Recently some of my hardware was obsoleted on Windows 7, so I decided to go the whole hog and try OSX for my daily needs, with Windows running as a virtual machine through Parallels for any programming tasks needing Visual Studio. After using this setup for a few months, I was surprised by how quickly it ran my compiled programs; I wasn’t noticing the drop in responsiveness that I was expecting, so I thought I’d benchmark a known problem to get a sense of how fast virtualisation actually is…

My initial idea was to benchmark a parallelisable algorithm under a variety of implementations on Windows and OSX, to get a feel for the difference in running times between the native(-ish) OSX and virtualised Windows implementations. I settled on a simple Kohonen neural network implementation, as it’s eminently parallelisable and I’d be able to quickly scrape some data to run through it. I planned to test implementations using the following patterns/APIs:

  • A naive implementation. A quick and dirty implementation which didn’t care about memory access costs or general optimisations.
  • A naive optimised implementation. A slightly faster implementation which would reduce memory copying and arrange operations to start to optimise for parallelisation.
  • An implementation using the Parallel Patterns API. This would only be available running through Windows, but it looked like it would give an instant bang for buck.
  • An implementation using OpenMP. Unfortunately I’m running Visual Studio Express at home, which doesn’t support OpenMP out of the box at present. It looks like there might be a way around this, but after a bit of experience hacking about in Visual Studio I didn’t want to modify my install.
  • An OpenCL implementation. I’ve been tinkering around with GPGPU programming for a bit on desktops so I was interested in seeing what could be done on a laptop GPU. This turned out to be a pain as Parallels has its own GPU driver and none of the Windows OpenCL code would run… I ended up writing this entirely under XCode for OSX.
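All of these implementations share the same underlying training step, so it’s worth stating up front. This is the standard textbook Kohonen/self-organising-map update rule rather than a transcription of my code: given an input vector x(t), find the best matching unit c, then move every node’s weight vector towards the input, scaled by the learning rate and a neighbourhood function centred on c:

```latex
w_i(t+1) = w_i(t) + \alpha(t)\, h_{ci}(t)\, \bigl( x(t) - w_i(t) \bigr),
\qquad
h_{ci}(t) = \exp\!\left( -\frac{d(c,i)^2}{2\,\sigma(t)^2} \right)
```

where α(t) is the decaying learning rate, d(c, i) is the grid distance between node i and the best matching unit, and σ(t) is the shrinking neighbourhood radius.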

The Results…

The table below shows the results, averaged over 5 runs, for 2D Kohonen networks with 10×10 and 100×100 nodes against a training set of 320 groups of 7 greyscale values from 0x00 to 0xFF. All timings are in milliseconds:

                                            10 x 10 nodes    100 x 100 nodes
  Naive Implementation, Windows             56.25306 ms      4774.4 ms
  Naive Optimised Implementation, Windows   2.35398 ms       168.0156 ms
  Parallel Patterns Implementation, Windows 4.26502 ms       251.52 ms
  Naive Implementation, Mac OSX             71.952 ms        7018.11 ms
  Naive Optimised Implementation, Mac OSX   1.8228 ms        104.171 ms
  OpenCL Implementation, Mac OSX            23.0954 ms       24.6672 ms

A few things jump out from this:

  • The C++ code for the naive implementations is exactly the same on the virtualised Windows and native OSX builds, yet the memory-copy-heavy naive implementation runs faster under Parallels than under OSX! The optimised version, which should be far less memory-copy limited, is faster under OSX than under virtualised Windows. This suggests to me that OSX is doing a lot of work behind the scenes in terms of memory management, and that Parallels doesn’t follow the same path. When that work isn’t needed, thanks to more optimal memory usage, the overhead of virtualisation makes the Windows version slightly slower than the OSX version. After seeing this I wouldn’t be surprised to find that the same code on a native Windows system is slightly faster than the OSX version.
  • The parallel patterns built version doesn’t bring anything to the party when running through Parallels. I spent some time looking at the code to check that I had implemented it correctly, and can only surmise that parallel patterns offer no benefit under virtualisation here. I need to run this on a comparable native Windows system to see whether this is an artefact of virtualisation or a side-effect of the chosen algorithm.
  • OpenCL is fast! The 100×100 node runs should be somewhere in the region of 100 times slower than the 10×10 runs; from the tests they come out at around 1.07 times slower. The initial setup makes it much slower than the naive optimised versions at 10×10, but significantly faster at 100×100. It may be possible to hold the entire computation in GPU memory, in which case the 100×100 OpenCL version would be faster still. My main bugbear with setting up memory objects to transfer to the GPU is that it breaks object oriented programming and leads programmers towards Premature Optimisation. I’m very impressed that even with a large amount of memory copying to and from the GPU it is still this capable, and it’s going to lead to me pushing more work its way in the future. I’m very much looking forward to a certain new Big Blue Box which is reputed to have a stonking GPU and unified memory, as that would seem to remove the need for data transfer between CPU and GPU memory.
  • After looking at the results of my tests I’m quite happy with the performance of Windows running virtualised on Parallels when compared with the general speed of Mac OSX. I’m disappointed that I can’t use OpenCL with Visual Studio under Parallels but apart from that I’m not too displeased with the performance.
  • This was my first experience running XCode 4.x and I have to say that I’m not that impressed with it. It seems clunky when compared with other IDEs, and there are a few things like Schemes which seem a bit broken when compared with their direct analogues. I’ll need to play around with it for a bit longer to see whether I get used to it.

How It Was Done

In overview, the training algorithms were split into a common base class and derived classes which carried out the various test permutation specialisations. The code was put together in a relatively quick-and-dirty fashion so I could prototype the various implementations in a decent time; I would need to look at cleaner implementations with more error checking if I went on to use this code for anything else. Here is the base training class:
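The original listing hasn’t survived, so here is a hypothetical reconstruction of the shape such a base class might take. The class and member names (`KohonenTrainerBase`, `TrainingSample`, `decay`, and so on) are my guesses, not the originals:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One group from the training set, e.g. 7 greyscale values scaled to [0, 1].
struct TrainingSample {
    std::vector<float> values;
};

// Common base: owns the node weights and declares the per-sample training
// pass that each specialisation (naive, optimised, PPL, OpenCL) overrides.
class KohonenTrainerBase {
public:
    KohonenTrainerBase(std::size_t width, std::size_t height, std::size_t dims)
        : width_(width), height_(height), dims_(dims),
          weights_(width * height * dims, 0.5f) {}
    virtual ~KohonenTrainerBase() = default;

    // A single training pass over one sample; supplied by derived classes.
    virtual void trainIteration(const TrainingSample& sample,
                                float learningRate, float radius) = 0;

    // Exponential decay of the learning rate / neighbourhood radius over the
    // course of training, shared by all implementations.
    static float decay(float initial, std::size_t iteration, std::size_t total) {
        return initial * std::exp(-static_cast<float>(iteration)
                                  / static_cast<float>(total));
    }

protected:
    std::size_t width_, height_, dims_;
    std::vector<float> weights_;  // node weights in a flat row-major layout
};
```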

…and here is the derived class for the naive implementation:
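Again the original listing is missing, so this is a self-contained sketch of what one naive training pass looks like: scan every node for the best matching unit, then update every node’s weights. All names here are illustrative rather than the originals:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One naive Kohonen training pass over a flat row-major weight array of
// width * height * dims floats.
void naiveTrainStep(std::vector<float>& weights,
                    std::size_t width, std::size_t height, std::size_t dims,
                    const std::vector<float>& sample,  // dims values
                    float learningRate, float radius) {
    // 1. Find the best matching unit: the node whose weight vector has the
    //    smallest squared distance to the sample.
    std::size_t bmuX = 0, bmuY = 0;
    float bestDist = 1e30f;
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const float* w = &weights[(y * width + x) * dims];
            float dist = 0.0f;
            for (std::size_t d = 0; d < dims; ++d) {
                float diff = w[d] - sample[d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; bmuX = x; bmuY = y; }
        }
    }
    // 2. Pull every node towards the sample, scaled by a Gaussian
    //    neighbourhood centred on the BMU.
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            float dx = static_cast<float>(x) - static_cast<float>(bmuX);
            float dy = static_cast<float>(y) - static_cast<float>(bmuY);
            float influence = std::exp(-(dx * dx + dy * dy)
                                       / (2.0f * radius * radius));
            float* w = &weights[(y * width + x) * dims];
            for (std::size_t d = 0; d < dims; ++d)
                w[d] += learningRate * influence * (sample[d] - w[d]);
        }
    }
}
```

The nested loops and repeated distance calculations are exactly the sort of thing the optimised version would tidy up.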

The Parallel Patterns based implementation is pretty simple, and whilst there is some locking in there, it doesn’t look like that is the cause of its tardiness compared to the other implementations. There may be specific programming practices needed to get any benefit from it which aren’t immediately obvious from the MSDN overview and API documentation, which is a shame, as you would hope a scheduler would handle this in a similar manner to GPGPU programming.
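The actual implementation would use `concurrency::parallel_for` from the Parallel Patterns Library’s `<ppl.h>`, which is Windows-only. As a portable sketch of the same idea, here is the weight-update loop split across threads by row bands using `std::thread`; each thread writes a disjoint set of rows, so the weight updates themselves need no locking. Names are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Parallelised neighbourhood update. With PPL this body would simply be:
//   concurrency::parallel_for(std::size_t(0), height, [&](std::size_t y) { ... });
void parallelUpdate(std::vector<float>& weights,
                    std::size_t width, std::size_t height, std::size_t dims,
                    const std::vector<float>& sample,
                    std::size_t bmuX, std::size_t bmuY,
                    float learningRate, float radius) {
    auto updateRows = [&](std::size_t yBegin, std::size_t yEnd) {
        for (std::size_t y = yBegin; y < yEnd; ++y) {
            for (std::size_t x = 0; x < width; ++x) {
                float dx = static_cast<float>(x) - static_cast<float>(bmuX);
                float dy = static_cast<float>(y) - static_cast<float>(bmuY);
                float influence = std::exp(-(dx * dx + dy * dy)
                                           / (2.0f * radius * radius));
                float* w = &weights[(y * width + x) * dims];
                for (std::size_t d = 0; d < dims; ++d)
                    w[d] += learningRate * influence * (sample[d] - w[d]);
            }
        }
    };
    std::size_t nThreads =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    std::size_t rowsPer = (height + nThreads - 1) / nThreads;
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < nThreads; ++t) {
        std::size_t yBegin = t * rowsPer;
        std::size_t yEnd = std::min(height, yBegin + rowsPer);
        if (yBegin >= yEnd) break;
        pool.emplace_back(updateRows, yBegin, yEnd);  // disjoint row bands
    }
    for (auto& th : pool) th.join();
}
```

Note that for a 10×10 network the per-iteration work is tiny, so thread (or PPL task) scheduling overhead can easily swallow any gain, which may be part of what the benchmark is showing.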

Finally the OpenCL implementation is very much a C-like implementation, with data unpacked from OO C++ structures into memory arrays to be dumped into GPU memory. The header file:
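The original header hasn’t survived either; this is a hypothetical sketch of the interface such a class might expose. The class and member names are my own, though the `cl_*` handle types are the real ones from the OpenCL API (on OSX, `<OpenCL/opencl.h>`):

```cpp
#include <OpenCL/opencl.h>
#include <cstddef>
#include <vector>

class OpenCLKohonenTrainer {
public:
    OpenCLKohonenTrainer(std::size_t width, std::size_t height, std::size_t dims);
    ~OpenCLKohonenTrainer();

    // Flattens the node weights and the sample into plain float arrays,
    // copies them to the device, runs the BMU-search and update kernels,
    // and reads the updated weights back.
    void trainIteration(const std::vector<float>& sample,
                        float learningRate, float radius);

private:
    std::size_t width_, height_, dims_;
    std::vector<float> weights_;      // host-side copy, flat layout

    cl_context       context_;
    cl_command_queue queue_;
    cl_program       program_;
    cl_kernel        updateKernel_;
    cl_mem           weightsBuffer_;  // device-side weights
    cl_mem           sampleBuffer_;   // device-side training sample
};
```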

…and the implementation of the header is shown below. Unpacking data to be sent on to the GPU completely breaks the object oriented nature of the program. The hope is that with a unified memory architecture, and some form of PS2-style DMA packet stitching, a machine of the future would be able to make use of GPGPU without having to unpack and repack memory in order to have access to a large number of parallel processes.
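The unpack-and-repack step being complained about looks something like the following (a sketch with illustrative names, not the original code): node objects are flattened into one contiguous float array so the whole lot can be shipped to the device in a single `clEnqueueWriteBuffer` call, then rebuilt after the results are read back:

```cpp
#include <cstddef>
#include <vector>

// An object-oriented node as the rest of the program would like to see it.
struct Node {
    std::vector<float> weights;
};

// Flatten every node's weights into one contiguous array, ready to be
// copied to GPU memory in a single transfer.
std::vector<float> flattenNodes(const std::vector<Node>& nodes,
                                std::size_t dims) {
    std::vector<float> flat;
    flat.reserve(nodes.size() * dims);
    for (const Node& n : nodes)
        flat.insert(flat.end(), n.weights.begin(), n.weights.begin() + dims);
    return flat;
}

// The inverse, run after reading the updated weights back from the device.
void unflattenNodes(const std::vector<float>& flat,
                    std::vector<Node>& nodes, std::size_t dims) {
    for (std::size_t i = 0; i < nodes.size(); ++i)
        nodes[i].weights.assign(flat.begin() + i * dims,
                                flat.begin() + (i + 1) * dims);
}
```

With unified memory the GPU could read the node data in place and both of these functions, and the copies they imply, would disappear.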