The setting

import numpy as np
import time

n = 10000
x = np.random.randn(n,n)
a = time.time(); x.dot(x); print time.time() - a
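Timing a single run like this can fluctuate noticeably between runs. A minimal sketch that simply repeats the multiply a few times makes the spread visible (the numbers in parentheses in the tables below are fluctuations of this kind):

import numpy as np
import time

n = 10000
x = np.random.randn(n, n)

# Repeat the multiply a few times; the spread of the printed times
# gives a rough idea of the run-to-run fluctuation.
for _ in range(3):
    a = time.time()
    x.dot(x)
    print(time.time() - a)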

The contestants

| CPU | Freq | Cores | L3 Cache | Date | Price | Passmark |
|---|---|---|---|---|---|---|
| Intel Core i5-4260U | 1.4 GHz | 2 | 3 MB | Q2-14 | 315 | 3548 |
| Intel Xeon E5-2643 v2 | 3.5 GHz | 2x6 = 12 | 2x25 MB | Q3-13 | 2x1552 | 2x11735 |
| AMD Opteron 8384 | 2.7 GHz | 8x4 = 32 | 8x6 MB | Q4-08 | 8x2149 | NA |
| AMD Opteron 8272 | 1.4 GHz | 2x8 = 16 | 2x6 MB | Q4-11 | 2x523 | 2x6748 |

MKL vs OpenBlas

Here are the running times in seconds; the numbers in parentheses are the rough run-to-run fluctuations. For the GPU result, the Tesla K80 is a dual GPU and only one of its two chips is used here, which makes it roughly equivalent to a Tesla K40. Also, this calculation is carried out in float64, which GPUs are comparatively slow at. For the non-MKL column, the Core i5 uses the default BLAS shipped with OS X El Capitan; the other machines use OpenBLAS. MKL gives more variable timings, but on the Intel CPUs it is slightly faster than the non-MKL libraries.

| CPU | Non-MKL | MKL |
|---|---|---|
| Intel Core i5-4260U | 43 | 32 |
| Intel Xeon E5-2643 v2 | 15.6 | 10.4 (3) |
| AMD Opteron 8384 | 15.4 (2) | 12.3 (1) |
| AMD Opteron 8272 | 17.3 | 22 (5) |
| Tesla K80 | 16.3 | |
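To check which BLAS a given numpy build actually links against (MKL, OpenBLAS, or the OS default), numpy can print its build configuration:

import numpy as np

# Lists the BLAS/LAPACK libraries numpy was built against,
# e.g. mkl_rt for MKL builds or openblas for OpenBLAS builds.
np.__config__.show()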

CPU vs GPU

To really see the power of the GPU, we switch to float32 instead.

| Matrix dim | CPU | GPU (TensorFlow) | GPU (scikit-cuda) |
|---|---|---|---|
| 10000 | 6.3 | 2.3 | 1.3 |
| 15000 | 17 | 6.8 | 3.7 |
| 20000 | 39 | 10.8 | 8.32 |
| 30000 | 122 | NA | 27.0 |

The GPU only provides a speedup of around 4-5x. The "GPU (TensorFlow)" column is computed with TensorFlow, which might not be very efficient; the "GPU (scikit-cuda)" column uses Scikit-cuda, a wrapper for PyCUDA. For the latter we also measured the breakdown of communication time between CPU and GPU: roughly 15% of the total time is spent copying data in and out of the GPU.

Tools for doing linear algebra on the GPU

  1. PyCUDA: the lowest level, a Python wrapper around CUDA (a minimal example follows this list)
  2. Scikit-cuda: a wrapper over PyCUDA that adds linear algebra routines
  3. CULA: provides LAPACK-style routines, including matrix multiplication, for CUDA
  4. NumbaPro / Accelerate: from Anaconda
  5. Theano / TensorFlow
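As a rough sketch of the lowest level in this list (assuming PyCUDA and a working CUDA install): arrays are copied to the device explicitly, elementwise arithmetic runs on the GPU, and .get() copies the result back. Matrix multiplication itself is not part of gpuarray; that is what scikit-cuda adds on top.

import numpy as np
import pycuda.autoinit              # creates a CUDA context on the default device
import pycuda.gpuarray as gpuarray

a = np.random.randn(1000, 1000).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)          # host -> device copy
b_gpu = 2 * a_gpu + 1               # elementwise arithmetic runs on the GPU
print(np.allclose(b_gpu.get(), 2 * a + 1))   # .get() copies device -> host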

Code

#### Skcuda + Pycuda

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
import numpy as np, time
import skcuda.linalg as culinalg
import skcuda
culinalg.init()

dim = 30000
rnd = np.random.RandomState(0)
a = rnd.rand(dim, dim).astype(np.float32)

# Copy the matrix from host memory to the GPU
start = time.time()
a_gpu = gpuarray.to_gpu(a)
print 'Copy in', time.time() - start

# Reference result on the CPU
start = time.time()
rescpu = np.dot(a, a)
print 'CPU:', time.time() - start

# Matrix multiplication on the GPU
start = time.time()
resgpu = culinalg.dot(a_gpu, a_gpu)
print skcuda.misc.sum(resgpu)  # forces the GPU computation to finish before timing
print 'GPU:', time.time() - start

# Copy the result back to host memory and check it against the CPU result
start = time.time()
resgpu = resgpu.get()
print 'Copy out', time.time() - start
print np.allclose(rescpu, resgpu)

#### Tensorflow

import numpy as np
import tensorflow as tf
import time

# 20000 x 20000 float32 matrix; the benchmark op is X times X-transpose
X = tf.constant(np.array(np.random.randn(20000,20000), dtype = np.float32), dtype = tf.float32)
Y = tf.matmul(X, tf.transpose(X))

# There are no tf.Variables here, so this init is effectively a no-op
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

a = time.time()
sess.run(Y)
print time.time() - a
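To verify the matmul is actually placed on the GPU rather than silently falling back to the CPU, the session can be created with device placement logging (part of the same graph-mode TensorFlow API used above):

# Same graph as above, but the session logs which device each op runs on
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(Y)   # the log should show MatMul assigned to a GPU device if one is available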

#### CPU (both OpenBLAS and MKL)

import numpy as np
import time
np.random.seed(1)
n = 30000
x = np.array(np.random.randn(n,n), dtype = np.float32)  # float32 to match the GPU runs
a = time.time(); x.dot(x); print time.time() - a