More on Performance
Let's scale things back, and look at a simple example - brute force calculation of Pi. We can do this using the following formula:
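\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \Delta x \sum_{i=0}^{n-1} \frac{4}{1+x_i^2},
\qquad x_i = (i+0.5)\,\Delta x, \quad \Delta x = \frac{1}{n}
\]
That is, we approximate the integral with the midpoint rule over n equal steps.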
This is easy enough to code up in various languages:
C++
#include <cstdio>

// number of rectangles in the midpoint-rule sum;
// long long so that 10^10 fits even where long is 32-bit
static long long nsteps = 10000000000LL;

int main() {
    long long i;
    double x, pisum = 0.0;
    double mypi;
    double pi = 3.141592653589793;       // reference value for the error check
    double step = 1.0 / (double)nsteps;  // width of each rectangle
    for (i = 0; i < nsteps; i++) {
        x = (i + 0.5) * step;            // midpoint of interval i
        pisum = pisum + 4.0 / (1.0 + x * x);
    }
    mypi = pisum * step;
    printf("%20.16f\n", mypi);       // the estimate
    printf("%20.16e\n", mypi - pi);  // the error
}
Fortran
program calculate_pi
    use, intrinsic :: iso_fortran_env
    implicit none
    integer(kind=INT64), parameter :: nsteps = 10000000000_INT64
    integer(kind=INT64) :: i
    real(kind=REAL64) :: stepsize, x, pisum
    real(kind=REAL64) :: mypi, diff
    real(kind=REAL64) :: pi = 3.141592653589793_REAL64  ! reference value

    stepsize = 1.0_REAL64 / nsteps  ! width of each rectangle
    pisum = 0.0_REAL64
    do i = 0, nsteps-1
        x = (real(i, REAL64) + 0.5_REAL64) * stepsize  ! midpoint of interval i
        pisum = pisum + 4.0_REAL64 / (1.0_REAL64 + x*x)
    end do
    mypi = pisum * stepsize
    diff = mypi - pi

    print *, "NSteps =", nsteps
    print *, "Pi =", mypi
    print *, diff
end program
Python
def calcpi(nsteps):
    stepsize = 1 / nsteps
    pisum = 0.0
    for i in range(nsteps):
        x = (i + 0.5) * stepsize  # midpoint of interval i
        pisum += 4 / (1 + x**2)
    pi = stepsize * pisum
    return pi
Julia
function calcpi(nsteps::Integer)
    stepsize = 1.0 / nsteps
    pisum = 0.0
    for i = 0:nsteps-1
        x = (i + 0.5) * stepsize  # midpoint of interval i
        pisum += 4 / (1 + x^2)
    end
    return stepsize * pisum
end
We can time these functions/programs using various methods, depending on the language:
- C++: Calls to std::chrono::system_clock::now()
- Fortran: Calls to CPU_TIME()
- Python: Calls to time.time() (see the sketch after this list)
- Julia: @time macro
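As a minimal sketch of the timing pattern, here it is for the Python version (calcpi is the function from the listing above; the three repeat runs mirror the tables below):

import time

def time_calcpi(nsteps, repeats=3):
    # time calcpi over several runs and report each runtime in seconds
    for run in range(repeats):
        t1 = time.time()
        calcpi(nsteps)
        t2 = time.time()
        print(f"Run {run + 1}: {t2 - t1:.6f} s")

time_calcpi(100_000_000)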
Using n = 100,000,000 gives the following runtimes:
Language | Run 1 (s) | Run 2 (s) | Run 3 (s) | Average (s) |
Python | 12.36320853233 | 12.40506696701 | 12.34379291534 | 12.37068947156 |
Fortran | 0.113455 | 0.114801 | 0.113585 | 0.113947 |
C++ | 0.137855 | 0.105894 | 0.0975554 | 0.1137681333333 |
Julia | 0.091104 | 0.090891 | 0.091777 | 0.0912573333333 |
Unsurprisingly, Python is very much the outlier here. More interestingly, Julia appears to outperform both Fortran and C++. These are extremely short runtimes, though, so let's see whether that trend holds for larger n. Before we can increase n, we need to sort out the Python runtime. As usual, we'll do this by Cythonising the Python script, which then looks like this:
Cython
def calcpi(num_steps):
    # long long so that nsteps up to 10^10 fits even where C long is 32-bit
    cdef long long nsteps = num_steps
    cdef double stepsize = 1.0 / nsteps
    cdef double pisum = 0.0
    cdef double x
    cdef long long i
    for i in range(nsteps):
        x = (i + 0.5) * stepsize  # midpoint of interval i
        pisum += 4 / (1 + x * x)
    pi = stepsize * pisum
    return pi
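To actually use this, the Cython source has to be compiled first. A rough sketch, assuming the code above is saved as calcpi_cy.pyx (the module name is my own; pyximport builds .pyx files on first import):

import pyximport
pyximport.install(language_level=3)  # compile .pyx extensions on import

from calcpi_cy import calcpi  # hypothetical module name for the listing above
print(calcpi(100_000_000))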
This drops the average runtime for the Python script down to 0.11s, in line with everything else. We'll now increase n to 10,000,000,000, which gives us the following runtimes:
Language | Run 1 (s) | Run 2 (s) | Run 3 (s) | Average (s) |
Fortran | 9.66525 | 9.68191 | 9.77899 | 9.70872 |
C++ | 9.75373 | 9.77261 | 9.85525 | 9.79386 |
Julia | 8.86402 | 8.90834 | 8.98145 | 8.91793 |
Cython | 9.58548 | 9.56131 | 9.59318 | 9.57999 |
Cython, prange | 8.97424 | 8.96508 | 8.97478 | 8.97137 |
Plotted as a graph:
[Figure: runtimes at n = 10,000,000,000 for each language]
OK, so this is getting a little ahead of ourselves, because I've included a Cython result using a threaded loop construct. The point of that result is that simply switching to prange improved the Cython performance, even on a single thread. This is almost certainly because I had to release the GIL inside the loop to enable prange, so we can say that the GIL costs around 0.6 s in this example.
Adding Threading
So, as alluded to, we'll add threading to the problem. For Fortran, C++ and Cython this will be done via OpenMP:
- Fortran: !$OMP PARALLEL DO REDUCTION(+:pisum) PRIVATE(I, X)
- C++: #pragma omp parallel for reduction(+:pisum) private(i, x)
- Cython: for i in prange(nsteps, nogil=True): (the full loop is sketched below)
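For reference, here's a rough sketch of the full threaded Cython function. It assumes the extension is compiled with OpenMP enabled (e.g. -fopenmp as both a compile and link flag; the build details are my assumption, not from the runs above):

from cython.parallel import prange

def calcpi(num_steps):
    cdef long long nsteps = num_steps
    cdef double stepsize = 1.0 / nsteps
    cdef double pisum = 0.0
    cdef double x
    cdef long long i
    # prange releases the GIL and parallelises the loop with OpenMP;
    # pisum is treated as a reduction variable because it is only
    # updated in place inside the loop body
    for i in prange(nsteps, nogil=True):
        x = (i + 0.5) * stepsize
        pisum += 4.0 / (1.0 + x * x)
    return stepsize * pisum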
Julia is a little more complex. We'll be using the Distributed package, but we can't simply slap the @distributed macro on the loop. Instead, we have to rewrite the loop so that, rather than updating pisum inside the body, pisum is set to the result of the loop's reduction. Only then will it work correctly with @distributed:
using Distributed  # provides @distributed; start workers with addprocs() or julia -p N

pisum = @distributed (+) for i = 0:nsteps-1
    x = (i + 0.5) * stepsize
    4 / (1 + x^2)  # the last expression is the value that gets (+)-reduced
end
Rather than spam a big table, I'll just show a graph of the results for different numbers of threads:
[Figure: runtime vs. number of threads for each language]
So Cython consistently outperforms Fortran and C++ slightly, and Julia slightly underperforms. The first of these is something I've seen before: Cython does a better job with the OpenMP pragmas than a naive programmer is likely to manage without a lot of tinkering. As for Julia, I'd guess we're seeing that its Distributed-based parallelism is less efficient than OpenMP threading.
Anyway, we've finally found an example where Julia does outperform C++, as the Julia fanatics claim should happen. It's just a shame it's an example where Cython also outperforms C++...