# Quest for speed

Last December, I experimented for a while with SPH (Smoothed Particle Hydrodynamics), using C# and XNA as usual. Of course, the more particles you can simulate the better, so I tried to optimize the different parts of the method for speed as much as possible, which even led me to use an SOA (Structure of Arrays) layout rather than the more natural AOS (Array of Structures) for my data, since I ended up being limited by the speed of memory accesses and cache misses. Anyway, I thought I had done my best and had no plans to multithread this code, when I ran into an explanation of Parallel.For and Parallel.ForEach on the internet and decided to give it a try.

# The original code

```csharp
private void MoveParticlesSOA(float deltaTime)
{
    Vector2[] positions = particleSystemSOA.Position;
    Vector2[] forces = particleSystemSOA.Force;
    float xMax = dimensions.X;
    float yMax = dimensions.Y;

    for (int p = 0; p < particleSystemSOA.NbParticles; p++)
    {
        Vector2 position = positions[p];

        if (position.X < 0)
            position.X = dimensionEpsilon;
        else if (position.X > xMax)
            position.X = xMax - dimensionEpsilon;

        if (position.Y < 0)
            position.Y = dimensionEpsilon;
        else if (position.Y > yMax)
            position.Y = yMax - dimensionEpsilon;

        positions[p] = position;
        particleSystemSOA.UpdateParticle(p, deltaTime);
        forces[p] = Vector2.Zero;
    }
}
```

For each particle, this code first makes sure the position stays inside a 2D rectangle (the if/else lines), then asks the particle system to move the particle, and finally resets the forces applied to the particle to zero for the next frame. Since there are potentially a lot of particles to deal with, how would we go about multithreading this function so that it can take advantage of 2, 4, or more cores?

# Parallel.For

This is what it looks like with Parallel.For, after I added some #if/#else/#endif directives to easily enable or disable multithreading:

```csharp
private void MoveParticlesSOA(float deltaTime)
{
    Vector2[] positions = particleSystemSOA.Position;
    Vector2[] forces = particleSystemSOA.Force;
    float xMax = dimensions.X;
    float yMax = dimensions.Y;

#if PARALLEL
    Parallel.For(0, particleSystemSOA.NbParticles, p =>
#else
    for (int p = 0; p < particleSystemSOA.NbParticles; p++)
#endif
    {
        Vector2 position = positions[p];

        if (position.X < 0)
            position.X = dimensionEpsilon;
        else if (position.X > xMax)
            position.X = xMax - dimensionEpsilon;

        if (position.Y < 0)
            position.Y = dimensionEpsilon;
        else if (position.Y > yMax)
            position.Y = yMax - dimensionEpsilon;

        positions[p] = position;
        particleSystemSOA.UpdateParticle(p, deltaTime);
        forces[p] = Vector2.Zero;
    }
#if PARALLEL
    );
#endif
}
```

Simple, isn't it? The 'for' line is replaced with Parallel.For, and what used to be the loop's body is now a delegate that can be called on several threads simultaneously for different particles. All the thread creation and work partitioning is taken care of automatically; it's pretty awesome if you ask me!
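To see the pattern outside of the SPH code, here's a minimal standalone sketch (the array and the values in it are made up for illustration). Each index is handed to exactly one invocation of the delegate, so writing to `values[i]` from inside the body is safe: no two iterations ever touch the same slot.

```csharp
using System;
using System.Threading.Tasks;

class ParallelForDemo
{
    static void Main()
    {
        float[] values = new float[8];

        // The delegate runs for every index in [0, values.Length),
        // potentially on several threads at once.
        Parallel.For(0, values.Length, i =>
        {
            values[i] = i * i;
        });

        Console.WriteLine(string.Join(", ", values));
    }
}
```

The independence between iterations is what makes this safe; as the next section shows, any memory shared *across* iterations is a different story.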

# Shared memory

Parallel.For and the other methods of the Task Parallel Library don't do anything special about shared memory: it's the programmer's job to protect shared data from data races and simultaneous accesses from different threads. Let's take another function from my SPH class as an example (sorry it's a bit long; there's no need to read it line by line, as I will explain the important parts right after the code):

```csharp
private void UpdateDensityAndPressureSOA()
{
    Vector2[] positions = particleSystemSOA.Position;
    float[] densities = particleSystemSOA.Density;
    float[] pressures = particleSystemSOA.Pressure;
    short[] gridIndex = particleSystemSOA.GridIndex;

    for (int p = 0; p < particleSystemSOA.NbParticles; p++)
    {
        densities[p] = 0f;
        neighborCount[p] = 0;
    }

#if PARALLEL
    ParallelOptions options = new ParallelOptions();
    options.MaxDegreeOfParallelism = -1;
    Parallel.For(0, particleSystemSOA.NbParticles, options, p =>
#else
    for (int p = 0; p < particleSystemSOA.NbParticles; p++)
#endif
    {
        int nbNeighbors;
        int[] neighbors = grid.GetNeighbors(particleSystemSOA, gridIndex[p], out nbNeighbors);  // NOT GOOD FOR MULTITHREADING

        for (int n = 0; n < nbNeighbors; n++)
        {
            int neighborIndex = neighbors[n];
            if (neighborIndex < p)
                continue;

            if (neighborIndex == p)
            {
                densities[p] += selfDensity;
                continue;
            }

            float deltaX = positions[p].X - positions[neighborIndex].X;
            float deltaY = positions[p].Y - positions[neighborIndex].Y;
            float r2 = deltaX * deltaX + deltaY * deltaY;

            if (r2 < h2)
            {
                float diff2 = h2 - r2;
                float density = poly6FactorMass * diff2 * diff2 * diff2;
                densities[p] += density;
                densities[neighborIndex] += density;

                float r = (float)Math.Sqrt(r2);

                if (neighborCount[p] < maxNeighbors)
                {
                    int tableIndex = p * maxNeighbors + neighborCount[p];
                    neighborTable[tableIndex] = (short)neighborIndex;
                    neighborDist[tableIndex] = r;
                    neighborCount[p]++;
                }

                if (neighborCount[neighborIndex] < maxNeighbors)
                {
                    int tableIndex = neighborIndex * maxNeighbors + neighborCount[neighborIndex];
                    neighborTable[tableIndex] = (short)p;
                    neighborDist[tableIndex] = r;
                    neighborCount[neighborIndex]++;
                }
            }
        }

        const float restDensity = 100f;
        const float gasConstant = 0.1f;

        pressures[p] = gasConstant * (densities[p] - restDensity);
    }
#if PARALLEL
    );
#endif
}
```

I won't go into the details of this function since they're not relevant to the current discussion; as the name indicates, it calculates the density and pressure of the fluid at each particle's position. But there are two interesting things we haven't seen yet that I want to discuss.

The first one is just after the first #if PARALLEL line: as you can see, options can be passed to Parallel.For, such as the maximum number of concurrent tasks you want the loop to use. Here I specified -1, which is the default and means all the available cores can be used, but you could pass 2, 4, and so on.
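For instance, capping the loop at two concurrent tasks would look like this (a hypothetical snippet, not from the simulation; `itemCount` is a made-up name):

```csharp
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 2;  // -1, the default, means no limit

Parallel.For(0, itemCount, options, i =>
{
    // process item i, with at most two iterations running at a time
});
```

Capping the degree of parallelism can be useful when the loop runs alongside other work (rendering, audio) that needs cores of its own.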

The second topic I need to talk about is the line with the NOT GOOD FOR MULTITHREADING comment. The line itself has no problem, but it calls a function of a grid that's used to find the neighbors of a particle (that is, other particles within a given radius of the current one) without having to test all the particles each time. Since a particle usually has a bunch of neighbors, the GetNeighbors function would initially store their indices in an array that was part of the grid object and return a reference to that array. Once this function can potentially be called by several threads at the same time, that approach doesn't work anymore: the threads all write to the same piece of memory simultaneously, and you can trust me when I tell you the SPH simulation became totally unstable as a result.

How do we fix that? Well, since I was only trying Parallel.For for the sake of trying it, I did the simplest thing I could think of, just to verify that it would fix the simulation and that there wasn't any other problem: GetNeighbors now allocates and returns a new array each time it is called, which means each thread gets its own piece of memory on every call (that's the annoying part, since one thread could perfectly well reuse the same array for all the particles it handles). That's a lot of allocations per frame, it sounds crazy, and I don't recommend doing it in real code, although as we're going to see in the next section, it wasn't so bad after all.
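To make the change concrete, here's a hedged sketch of what it amounts to; the real internals of GetNeighbors aren't shown in this post, so the helper `CollectNeighborsInto` and the capacity field are hypothetical names standing in for whatever the grid actually does:

```csharp
public int[] GetNeighbors(ParticleSystemSOA particles, short cellIndex, out int nbNeighbors)
{
    // Before (not thread-safe): indices were written into a single buffer owned
    // by the grid, and a reference to that buffer was returned, so concurrent
    // callers overwrote each other's results:
    //     nbNeighbors = CollectNeighborsInto(sharedBuffer, particles, cellIndex);
    //     return sharedBuffer;

    // After: allocate a fresh array on every call, so each thread gets its own
    // memory, at the cost of one short-lived allocation per particle per frame.
    int[] result = new int[maxNeighborCapacity];
    nbNeighbors = CollectNeighborsInto(result, particles, cellIndex);
    return result;
}
```

A cleaner route would be the Parallel.For overloads that take a localInit delegate: they give each worker thread its own state for the duration of the loop (here, a reusable neighbor buffer), which keeps the thread safety without the per-call allocations.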

# Performance

So far we've seen that Parallel.For is very easy to use, definitely much easier than managing a bunch of threads manually, juggling BackgroundWorker objects, or using the ThreadPool class (even if I haven't shown these methods in action, and I'm not saying they don't have valid use cases). But what about performance?

Unfortunately, I don't have numbers to give you since, as I mentioned before, my algorithm was already memory bound when running on one core, and processing power was not the issue. But let me say this: despite the memory bottleneck and all the extra allocations made in the GetNeighbors function, the multithreaded version of my SPH simulation still managed to run faster than the single-threaded one, which is a great sign in my opinion. It also shows that allocating memory is fast in C#, and that the garbage collector does a good job of quickly getting rid of short-lived objects, at least on PC.

I really encourage you to read a bit more documentation about the Task Parallel Library and to try it for yourself on some existing code that could benefit from multithreading; I think you'll be happily surprised.