Benchmark NumPy with OpenBLAS and MKL libraries on an AMD Ryzen 3950X CPU

Shasha Feng
4 min read · Apr 12, 2020


AMD CPUs are famous for their affordability and performance. Yes, AMD! But with an AMD CPU, the first step toward data analysis or MD simulation (e.g., OpenMM) in Python is to install NumPy correctly. Here I show that on an AMD Ryzen 3950X CPU, the OpenBLAS version is almost 2.4 times as fast as the MKL version. So you should definitely install it with OpenBLAS!

Installation of OpenBLAS- and MKL-supported NumPy

In my case, the MKL version was installed automatically along with Anaconda, in the ‘base’ environment. To install OpenBLAS-supported NumPy, I used the following commands, which were originally posted in this post.

conda create --name openblas-np python=3.7.6
conda activate openblas-np
conda install numpy blas=*=openblas

Here I set up a conda environment ‘openblas-np’ and deliberately gave it Python 3.7.6, matching the version in the ‘base’ environment.
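To confirm which BLAS backend a given environment actually links against before benchmarking, you can inspect NumPy's build configuration programmatically. This is a minimal sketch; capturing the printed output is just one way to do it, since np.show_config() prints to stdout and returns None.

```python
# check_blas.py -- verify which BLAS backend NumPy was built against
import io
from contextlib import redirect_stdout

import numpy as np

# np.show_config() prints the build configuration and returns None,
# so capture stdout to inspect it as a string.
buf = io.StringIO()
with redirect_stdout(buf):
    np.show_config()
config = buf.getvalue().lower()

if "openblas" in config:
    print("NumPy is linked against OpenBLAS")
elif "mkl" in config:
    print("NumPy is linked against MKL")
else:
    print("Backend unclear -- inspect np.show_config() output directly")
```

Running this in each conda environment should tell you up front whether the install step above gave you the backend you intended.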

Run a NumPy matrix job in both environments

Then I ran a Python script in both environments. It builds two random 20,000 × 20,000 NumPy matrices of 64-bit floats, multiplies them, and computes the norm of the product.

# test_numpy.py
import numpy as np
import time

n = 20000
A = np.random.randn(n, n).astype('float64')
B = np.random.randn(n, n).astype('float64')
start_time = time.time()
nrm = np.linalg.norm(A @ B)
print(" took {} seconds ".format(time.time() - start_time))
print(" norm = ", nrm)
print(np.__config__.show())  # prints the BLAS/LAPACK build configuration (returns None)

Results

1. Running time & validation of NumPy configuration

An example of the output from OpenBLAS looks like this. Note that the NumPy configuration is printed in detail, too.

(openblas-np) mm@drm:~/Documents/test-numpy$ python run_numpy.py
took 24.554070234298706 seconds
norm = 2828377.19061757
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/sf/Applications/anaconda3/envs/openblas-np2/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/sf/Applications/anaconda3/envs/openblas-np2/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/sf/Applications/anaconda3/envs/openblas-np2/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/sf/Applications/anaconda3/envs/openblas-np2/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
None

An example of the output from the MKL version looks like this:

(base) mm@drm:~/Documents/test-numpy$ python run_numpy.py
took 59.25364327430725 seconds
norm = 2828410.031760463
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/sf/Applications/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/sf/Applications/anaconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/sf/Applications/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/sf/Applications/anaconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/sf/Applications/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/sf/Applications/anaconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/sf/Applications/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/sf/Applications/anaconda3/include']
None

So we validated the NumPy installations clearly through the printed configuration information. The MKL version takes ~59.25 seconds, while the OpenBLAS version takes ~24.55 seconds. I repeated this experiment three times in each case, and the standard deviation was very small.
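The repetition itself is easy to script. Here is a minimal sketch of how the timing loop and standard deviation can be computed; it uses a much smaller matrix size (n=500, an arbitrary choice) so it runs quickly, whereas the actual benchmark above used n=20000.

```python
# repeat the matmul-plus-norm benchmark and report mean / standard deviation
import time
import statistics

import numpy as np

def bench(n, repeats=3):
    """Time norm(A @ B) for random n x n float64 matrices, `repeats` times."""
    times = []
    for _ in range(repeats):
        A = np.random.randn(n, n).astype('float64')
        B = np.random.randn(n, n).astype('float64')
        start = time.time()
        np.linalg.norm(A @ B)
        times.append(time.time() - start)
    return times

# n=500 keeps this demo fast; the benchmark in this post used n=20000
times = bench(500)
print("mean = {:.4f} s, stdev = {:.4f} s".format(
    statistics.mean(times), statistics.stdev(times)))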

2. CPU thread utilization

I used a program called ‘htop’ to monitor CPU thread utilization. The Ryzen 3950X comes with 16 CPU cores and, with hyperthreading, 32 threads. While the test job was running, the CPU profile looked like this:

[Figure: OpenBLAS NumPy running the matrix norm calculation]
[Figure: MKL NumPy running the matrix norm calculation]

The MKL NumPy build uses only about half of the available threads, while OpenBLAS uses all of them. One plausible explanation is that Intel's own CPU lineup has no counterpart to the 16-core AMD 3950X, so MKL's threading is not tuned for such a chip. This under-utilization is likely an important reason why MKL performs so much worse than OpenBLAS here.
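Thread usage can also be capped explicitly for either backend via environment variables (OPENBLAS_NUM_THREADS for OpenBLAS, MKL_NUM_THREADS for MKL), which is handy for checking how much of the speedup really comes from thread count. A minimal sketch — the cap of 4 threads here is an arbitrary example, and the variables must be set before NumPy is imported:

```python
# cap BLAS thread usage -- these must be set BEFORE numpy is imported
import os

os.environ["OPENBLAS_NUM_THREADS"] = "4"  # honored by OpenBLAS-backed builds
os.environ["MKL_NUM_THREADS"] = "4"       # honored by MKL-backed builds

import numpy as np

# a small matmul just to exercise the (now thread-capped) BLAS
n = 1000
A = np.random.randn(n, n)
B = np.random.randn(n, n)
nrm = np.linalg.norm(A @ B)
print("norm =", nrm)
```

Re-running the benchmark with different caps (1, 4, 16, 32) would show directly how the running time scales with the number of BLAS threads.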

Conclusion

Yes, we should install NumPy with OpenBLAS library support on an AMD Ryzen 3950X CPU! From my benchmark, the performance gain seems to come mostly from better utilization of CPU threads. Also, it is always worthwhile to run a benchmark when you are unsure about a performance difference. 🙌 🙌 🙌

Postscript notes

My benchmark test was guided by two blog posts by Dr. Donald Kinghorn (1 and 2) from 2019. He first benchmarked the AMD 3900X, and later the 3960X as well.

There is also an undocumented trick in the Intel MKL library: setting MKL_DEBUG_CPU_TYPE=5. Once Intel MKL detects a non-Intel CPU, it automatically falls back to the code path for the SSE2 instruction set. By setting MKL_DEBUG_CPU_TYPE to 5, we tell MKL to use the AVX2 code path instead. But since the variable is undocumented, Intel can change it without any announcement, so it comes with risks. 😉
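In practice the variable just needs to be in the environment before MKL initializes. One way to do that from Python is sketched below; note this only has an effect on MKL-backed NumPy builds, and newer MKL releases have reportedly removed the switch entirely, so treat it as a historical workaround.

```python
# set the undocumented MKL switch BEFORE numpy (and thus MKL) is loaded
import os

# undocumented: asks MKL for the AVX2 code path even on non-Intel CPUs
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np

# small matmul to exercise the BLAS; only MKL builds are affected by the switch
n = 1000
A = np.random.randn(n, n)
nrm = np.linalg.norm(A @ A)
print("norm =", nrm)
```

Alternatively, `export MKL_DEBUG_CPU_TYPE=5` in the shell before launching Python achieves the same thing.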
