Julia Performance
So, we now have basic, Lagrangian-only versions of GasCode written in Fortran, Python and Julia, which means we can take a look at their relative performance. For these comparisons, I'll use the spherical Sod problem on a 50x50 mesh.
This problem is being used simply because it's the hard-coded one, and I've not added a mesh reader to the Python or Fortran versions of the code yet. Input options are as follows:
# EoS
gamma = 1.4
# Artificial Viscosity
CL = 0.5
CQ = 0.75
# Mesh Extents
x0 = 0.0
x1 = 1.0
y0 = 0.0
y1 = 1.0
# Timestep, etc.
t0 = 0.0
tend = 0.205
dtinit = 0.0001
dtmax = 0.01
growth = 1.02
# Mesh resolution
nregions = 1
meshx = 50
meshy = 50
# Debug settings
debug_step_count = 0 # Set to zero to disable debugging
# Cutoffs
rhocutoff = 1.0e-6
I.e. basic monotonic Q, and no anti-hourglass. Using these settings, and running the Fortran code alongside the simplest (read: naive) versions of the Julia and Python codes, we get the following runtimes for just the Lagrangian hydro loop:
| Language | Runtime (s) |
| --- | --- |
| Fortran | 2.12 |
| Python | 1215.45 |
| Julia, first run | 5.85 |
| Julia, second run | 4.20 |
Note that Julia is included twice. This is because of a lovely quirk of Julia whereby it will perform a JIT compilation of everything the first time you run a script, so to get the best possible runtime, you have to run the calculation twice. In the same script. For no good reason...
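To make the pattern concrete, here's a minimal Julia sketch (a toy function, nothing to do with GasCode) showing the first-call compilation cost:

# Toy demonstration of Julia's first-call JIT overhead
f(x) = sum(sin, x)
data = rand(10^6)
@time f(data)   # first call: timing includes JIT compilation of f
@time f(data)   # second call: compiled code only, much faster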
Julia is quite a bit slower than Fortran, but considerably quicker than Python. And a lot of people pushing Julia will stop there, having demonstrated that Julia is vastly superior to Python. And in a world without Cython, that would be fine.
But we don't live in a world without Cython...
Comparing Cython and optimised Julia
If we only ever use Python and Julia for prototyping, then the previous comparison is perfectly fair. Writing naive Julia results in far faster code than writing naive Python. The thing is, though, what we really want is a language which gives us decent enough performance *without* having to go and re-write our code in Fortran or C++, and at the moment neither our Python version nor our Julia version is good enough:
| Language | Multiple of Fortran |
| --- | --- |
| Fortran | 1.00x |
| Python | 573.33x |
| Julia, first run | 2.76x |
| Julia, second run | 1.98x |
Obviously, the Python version is terrible, but even the ~2.8x slowdown of the Julia version compared to Fortran would be problematic on a remotely large problem. Note that whilst it's interesting to log the "fastest possible" Julia run, it's pretty meaningless really, since I have to run the problem twice in the same script to get it. Speeding code up in both Julia and Python involves two key concepts:
- Explicitly declaring variables
- Removing bounds checking from array access

In Cython's case, the kernel also needs to be compiled as an extension module, which means adding a small build script:
from setuptools import setup
from Cython.Build import cythonize
import numpy

setup(
    name='Hydrocode',
    ext_modules=cythonize("lagrangian_hydro.pyx"),
    include_dirs=[numpy.get_include()],
    zip_safe=False,
)
Obviously, whenever we modify the Lagrangian hydro kernel we'll need to re-run the build script, but since the whole process takes only about 6 seconds, it's not a big problem.
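For reference, with a setup script like the one above, the rebuild is just the standard Cython invocation:

python setup.py build_ext --inplace

The actual process of optimisation is very similar for both languages: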
For Julia, we use ::T assertions in our function parameter lists to define the data types, and apply the @inbounds macro to the various loops. For Cython, we use C-style declarations (including their special NumPy forms), and use @cython.boundscheck(False) and @cython.wraparound(False) to disable bounds checking. In both languages it is also potentially faster to use memory views rather than direct array access, and again this is simple to do.
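As a rough sketch of what this looks like in practice (the kernel below is purely illustrative, not the actual GasCode energy update), the Julia version of a typed, bounds-check-free loop comes out something like:

# Julia: type-annotated arguments plus @inbounds (illustrative kernel only)
function update_energy!(energy::Vector{Float64}, pressure::Vector{Float64},
                        divu::Vector{Float64}, mass::Vector{Float64}, dt::Float64)
    @inbounds for i in eachindex(energy)
        energy[i] -= dt * pressure[i] * divu[i] / mass[i]
    end
    return nothing
end

and the Cython equivalent, using typed memoryviews with the bounds-check and wraparound decorators:

# Cython: typed memoryviews, cdef loop index, bounds/wraparound checks disabled
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def update_energy(double[::1] energy, double[::1] pressure,
                  double[::1] divu, double[::1] mass, double dt):
    cdef Py_ssize_t i
    for i in range(energy.shape[0]):
        energy[i] -= dt * pressure[i] * divu[i] / mass[i]

With those changes in place, let's look at the runtimes: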
| Language | Runtime (s) | Multiple of Fortran |
| --- | --- | --- |
| Fortran | 2.12 | 1.00x |
| Cython | 3.26 | 1.54x |
| Julia | 5.51 | 2.60x |
Cython is now within a factor of two of the Fortran runtime, but Julia is still lagging somewhat behind. It's worth noting that the Julia runtime does occasionally drop to much lower values (I've seen it as low as 3.7s), but the 5.5s time is the most common; I've certainly not seen Julia match the Cython or Fortran versions.
There's a further elephant in Julia's room though. I mentioned that the first pass through the hydro loop entails a JIT compilation stage, which makes the overall execution slower. In fact, it's worse than that - there's a compilation stage when the script is first loaded. I've actually optimised the entire hydrocode run, including setup, so let's look at the entire runtime, using the Linux time command:
| Language | Runtime (s) | Multiple of Fortran |
| --- | --- | --- |
| Fortran | 2.14 | 1.00x |
| Cython | 3.42 | 1.60x |
| Julia | 9.99 | 4.67x |
Obviously, this extra overhead would matter less in a longer calculation, but the fact is that it's present every time we run the script, which is far from ideal...
Threading
We'll take a look at how easy and effective it is to add some thread-based parallelism to the programs. In Cython, we first have to modify the setup script so that it now looks like this:

from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy

ext_modules = [
    Extension(
        "lagrangian_hydro",
        ["lagrangian_hydro.pyx"],
        include_dirs=[numpy.get_include()],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
]

setup(
    name='lagrangian_hydro',
    ext_modules=cythonize(ext_modules),
)
This adds OpenMP support. To make use of it, we can then simply replace any use of range(...) with prange(..., nogil=True), noting that any variables used inside the loop must then be declared via cdef.
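Taking the illustrative kernel from earlier (again, not the real GasCode kernel), the threaded version looks something like this:

# Cython + OpenMP: prange releases the GIL and splits iterations across threads
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def update_energy(double[::1] energy, double[::1] pressure,
                  double[::1] divu, double[::1] mass, double dt):
    cdef Py_ssize_t i
    cdef Py_ssize_t n = energy.shape[0]
    for i in prange(n, nogil=True):
        energy[i] -= dt * pressure[i] * divu[i] / mass[i]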
For Julia, we first replace our Arrays with SharedArrays. That done, we can then look to add @distributed to our loops. Testing has, thus far, only shown an improvement when @distributed is applied solely to the outermost loop of the nodal force calculation in the momentum update.
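A minimal sketch of that pattern, with purely illustrative names and sizes (the real momentum update is considerably more involved), looks like:

# Julia: shared-memory array plus a distributed outer loop
using Distributed, SharedArrays
# addprocs(4)   # @distributed only spreads work if worker processes have been added

nnodes = 10_000                          # illustrative node count
force_x = SharedArray{Float64}(nnodes)   # visible to all worker processes

@sync @distributed for n in 1:nnodes
    force_x[n] = 0.0
    # ... accumulate the force contributions for node n here ...
end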
As an aside, the Julia community is utterly obnoxious. I've done various searches online for advice for slow performance of @distributed loops, and almost all of them have some senior-looking Julia developer making snide remarks about how the questioner's serial performance looks worse than C, and thus clearly they are stupid or something. Julia really seems to suffer from a senior development team who just can't cope with the idea that their precious language might not be perfect...
Interestingly, adding the SharedArrays back into the Julia script (as I'd tested these before) actually improved serial performance. This might be due to some other optimisation creeping back in, mind you. Also, adding OpenMP into Cython does introduce some serial overhead, so you'd likely want to maintain two different versions of the package. Anyway, let's look at the performance:
So, some takeaways:
- Julia's scaling is all over the damn place, and only one data point with n > 1 performs better than n = 1
- Julia never outperforms the original Cython serial code
- Cython scales pretty well up to n = 6. It actually ends up faster than Fortran, but that's a little unfair, as we have to run on 6 threads to achieve that whereas the Fortran code is serial