More on Performance

Let's scale things back and look at a simple example: brute-force calculation of pi. We can do this with the midpoint-rule approximation to the standard integral:

$$\pi = \int_0^1 \frac{4}{1+x^2}\,\mathrm{d}x \;\approx\; \frac{1}{n}\sum_{i=0}^{n-1} \frac{4}{1+x_i^2}, \qquad x_i = \frac{i+0.5}{n}$$

This is easy enough to code up in various languages:

C++

#include <cstdio>

static long nsteps = 10000000000;

int main() {
    long i;
    double x, pisum = 0.0;
    double mypi;
    double pi = 3.141592653589793;
    
    double step = 1.0/(double)nsteps;  
    
    for (i=0;i<nsteps;i++) {
        x = (i+0.5)*step;
        pisum = pisum + 4.0/(1.0+x*x);
    }
    
    mypi = pisum * step;
    printf("%20.16f\n", mypi);
    printf("%20.16e\n", mypi - pi);
}

Fortran

program calculate_pi
    use, intrinsic :: iso_fortran_env
    implicit none

    integer(kind=INT64), parameter :: nsteps = 10000000000_INT64
    integer(kind=INT64) :: i
    real(kind=REAL64) :: stepsize, x, pisum
    real(kind=REAL64) :: mypi, diff
    real(kind=REAL64) :: pi = 3.141592653589793_REAL64

    stepsize = 1.0_REAL64 / nsteps
    pisum = 0.0_REAL64

    do i = 0, nsteps-1
        x = (real(i, REAL64)+0.5_REAL64) * stepsize
        pisum = pisum + 4.0_REAL64/(1.0_REAL64+x*x)
    end do

    mypi = pisum*stepsize
    diff = mypi - pi

    print *, "NSteps =", nsteps
    print *, "Pi =", mypi
    print *, diff
end program


Python

def calcpi(nsteps):
    stepsize = 1/nsteps
    pisum = 0.0

    for i in range(nsteps):
        x = (i+0.5)*stepsize
        pisum += 4/(1+x**2)

    pi = stepsize * pisum

    return pi

Julia

function calcpi(nsteps::Integer)
    stepsize = 1.0/nsteps
    pisum = 0.0

    for i = 0:nsteps-1
        x = (i+0.5)stepsize
        pisum += 4/(1+x^2)
    end

    return stepsize*pisum
end

We can time these functions/programs using various methods, depending on the language (a minimal C++ timing sketch follows the list):
  • C++: Calls to std::chrono::system_clock::now()
  • Fortran: Calls to CPU_TIME()
  • Python: Calls to time.time()
  • Julia: @time macro
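
For concreteness, here is roughly what the C++ timing looks like. This is a minimal sketch, assuming the clock calls wrap just the integration loop; the exact harness behind the numbers below may differ:

#include <chrono>
#include <cstdio>

int main() {
    // n = 100,000,000 for the first set of runs
    const long nsteps = 100000000;
    const double step = 1.0 / (double)nsteps;
    double pisum = 0.0;

    // Wall-clock timestamps around just the integration loop
    auto t1 = std::chrono::system_clock::now();
    for (long i = 0; i < nsteps; i++) {
        double x = (i + 0.5) * step;
        pisum += 4.0 / (1.0 + x * x);
    }
    auto t2 = std::chrono::system_clock::now();

    // duration<double> gives the elapsed time in seconds as a double
    std::chrono::duration<double> elapsed = t2 - t1;
    printf("pi   = %20.16f\n", pisum * step);
    printf("time = %f s\n", elapsed.count());
}
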
Using n = 100,000,000 gives the following runtimes:


Language   Run 1 (s)        Run 2 (s)        Run 3 (s)        Average (s)
Python     12.36320853233   12.40506696701   12.34379291534   12.37068947156
Fortran     0.113455         0.114801         0.113585         0.113947
C++         0.137855         0.105894         0.0975554        0.1137681333333
Julia       0.091104         0.090891         0.091777         0.0912573333333

Unsurprisingly, Python is very much the outlier here. More interestingly, Julia appears to outperform both Fortran and C++. These are extremely short runtimes, though, so let's see whether that trend holds for larger n. Before we can increase n, we need to sort out the Python runtime. As usual, we'll do this by Cythonising the Python script, which makes it look like this:

Cython

def calcpi(num_steps):
    cdef long nsteps = num_steps
    cdef double stepsize = 1.0/nsteps
    cdef double pisum = 0.0
    cdef double x

    cdef long i
    for i in range(nsteps):
        x = (i+0.5)*stepsize
        pisum += 4/(1+x*x)

    pi = stepsize * pisum

    return pi

This drops the average runtime for the Python script down to 0.11s, in line with everything else. We'll now increase n to 10,000,000,000, which gives us the following runtimes:

Language         Run 1 (s)   Run 2 (s)   Run 3 (s)   Average (s)
Fortran          9.66525     9.68191     9.77899     9.70872
C++              9.75373     9.77261     9.85525     9.79386
Julia            8.86402     8.90834     8.98145     8.91793
Cython           9.58548     9.56131     9.59318     9.57999
Cython, prange   8.97424     8.96508     8.97478     8.97137

Plotted as a graph:
OK, so this is getting a little ahead of ourselves, because I've included a Cython result using a threaded loop construct before discussing threading. The point of that result, though, is that simply switching to a prange has improved the Cython performance, even on a single thread. This is almost certainly because I had to release the GIL inside the loop to enable prange, so we can say that holding the GIL costs around 0.6s in this example.

Adding Threading

So, as alluded to, we'll add threading to the problem. For Fortran, C++ and Cython this will be done via OpenMP (a sketch of the C++ version follows the list):
  • Fortran: !$OMP PARALLEL DO REDUCTION(+:pisum) PRIVATE(I, X)
  • C++: #pragma omp parallel for reduction(+:pisum) private(i, x)
  • Cython: for i in prange(nsteps, nogil=True):
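
For concreteness, here is how the C++ pragma slots into the loop from earlier. This is a minimal sketch, assuming the same midpoint-rule loop as the serial version; compile with -fopenmp for GCC/Clang:

#include <cstdio>

int main() {
    const long nsteps = 10000000000;
    const double step = 1.0 / (double)nsteps;
    double pisum = 0.0;
    long i;
    double x;

    // Each thread accumulates a private partial sum of 4/(1+x*x);
    // the reduction clause combines the partial sums after the loop.
    // (OpenMP privatises the loop index automatically, so private(i)
    // is redundant but harmless.)
    #pragma omp parallel for reduction(+:pisum) private(i, x)
    for (i = 0; i < nsteps; i++) {
        x = (i + 0.5) * step;
        pisum += 4.0 / (1.0 + x * x);
    }

    printf("%20.16f\n", pisum * step);
}
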
Julia is a little more complex. We'll be using the Distributed package, but we can't simply add the @distributed macro to the existing loop. Instead, we have to rewrite the loop so that, rather than accumulating into pisum inside the body, pisum is set to the result of the loop as a whole. Only then will it work correctly with @distributed:
pisum = @distributed (+) for i = 0:nsteps-1
    x = (i+0.5)stepsize
    4/(1+x^2)
end
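
Note that @distributed farms the iterations out to worker processes rather than threads, so the workers have to exist before the loop runs, for example by starting Julia with julia -p <n> or by calling addprocs from the Distributed package.
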
Rather than spam a big table, I'll just show a graph of the results for different numbers of threads:

So Cython consistently outperforms Fortran and C++ slightly, and Julia underperforms slightly. The first of these is something I've seen before: Cython does a better job of optimising the OpenMP pragmas than a naive programmer is likely to manage without a lot of tinkering. As for Julia, I suspect we're seeing that the Distributed approach, with its worker processes rather than threads, carries more overhead than OpenMP's shared-memory threading.

Anyway, we've finally found an example where Julia does outperform C++, as the Julia fanatics claim should happen. Just a shame it's on an example where Cython also outperforms C++...
