## DeepMind matrix multiplications on NVIDIA V100, Tesla T4, and a have a look at FBHHRBNRSSSHK — which isn’t me typing random letters!

Within the earlier publish, we realized the maths behind the Strassen algorithm and wrote a Python code to run-test it towards completely different matrices sizes. Furthermore, we’ve realized that the Holy Grail of linear algebra is an optimization algorithm for matrix multiplication. Usually, we’d consider a matrix multiplication code as three for-loops:

`def matmul(mat1, mat2, mat3):`

r""" Operate to multiply mat1 and mat2

returns mat3

Parameters

---------

mat1: np.array, matrix A

mat2: np.array, matrix B

mat3: np.array, empty matrix C Return

------

mat3: np.array, matmul between A & B

"""

for i in vary(mat1.form):

for j in vary(mat2.form):

mat3[i][j] = 0.0

for ok in vary(mat3.form):

mat3[i][j] += mat1[i][k]*mat2[k][j]

return mat3

Thus the computational complexity is *O(n³). *Strassen has improved this calculation, discovering the next relationships:

This algorithm is utilized to dam matrices and the full complexity is lowered to *O(n²·⁸⁰⁸). *Though 2.808 could seem a little or no enchancment, we noticed that for sq. matrices of dimension 4096 the usual numpy `matmult `

takes about 454.37 +/- 6.27 s, whereas Strassen takes 31.57 +/- 1.01, which is a distinction of about one order of magnitude.

We noticed that the matrix multiplication downside may be lowered to a tensor product, with the *tensoring* operation:

Fig.2 reviews precisely the matrix multiplication, expressed as *a triad*, particularly three parts. *The minimal variety of triads defines the minimal variety of operations to compute the product matrix.* This minimal quantity is the tensor’s *rank* *R(t)*. Engaged on the tensor rank is an environment friendly technique to discover new multiplication algorithms, as defined within the DeepMind papers.

DeepMind has proven in these years how mathematical issues, from theoretical to utility, may be tackled with a machine studying strategy, utilizing the reinforcement studying (RL) strategy. Additionally throughout this time they’ve tailored AlphaZero to seek out the very best technique for multiplying matrices, and the result’s AlphaTensor. I believe it’s price defining proper now what we will admire on this paper:

- DeepMind proved once more that RL could be a formidable ally for tackling sophisticated mathematical issues;
- AlphaTensor has discovered a suitable algorithm that may be even higher than Strassen’s one, to multiply 4 x 4 and 5 x 5 block matrices;
- Furthermore, AlphaTensor can discover an optimum answer for particular {hardware} necessities. As they present within the paper, there may be algorithms particularly tuned for TPUs and GPUs (V100 within the paper case);
- Though these will not be the very best outcomes, mathematicians can now have a totally new set of equations for the matrix multiplication downside that may be a brand new place to begin for locating new optimum options.

Fortunately for us, DeepMind gives on its GitHub the implementation for AlphaTensor, able to be examined and all written in JAX. The code is tailor-made for multiplying 4 x 4 block matrices, with a complete of 47 multiplications, already reported within the tensor notations, as you possibly can see right here.

The benchmark is run on a TPU and V100 GPU and so they’re testing matrices multiplications for the next sizes: 8192, 10240, 12288, 14336, 16384, 18432, 20480 for the normal `jax.numpy.dot`

multiplication, the Strassen algorithm and the AlphaTensor one.

I forked the code and added two tiny modifications:

- I’m printing the timings for every strategy
- Since we’re working the multiplication 10 occasions for every algorithm,
*I modified the**common quantity within the benchmark features to 10**, quite than maintaining it as 20*.

Having free entry to the Google Cloud Console (nonetheless counting on $300 credit) I’ve created a Digital Machine in GCP Compute Engine to check AlphaTensor, with the next specifics:

- Operating area
`europe-west4`

- Chosen GPU machine with 1 NVIDIA V100 GPU
- Computerized choice for the CPU platform
`n1-standard-4`

machine sort (4 vCPU and 15 GB RAM)- I modified the OS picture to: Working System
`Deep Studying on Linux`

and Model`Debian 10 primarily based Deep Studying VM for TensorFlow Enterprise 1.15 with CUDA 11.0 M100`

and disk dimension`50GB`

The full price is $1.94 per hour — so watch out and don’t let this machine run indefinitely.

As soon as the machine is created, you possibly can straight entry with SSH and obtain the repo with `git clone https://github.com/Steboss/alphatensor.git`

. You’ll must arrange the Python setting and set up `jax`

with `pip3 set up -r alphatensor/benchmarking/necessities.txt -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html`

. Lastly, you possibly can run the take a look at with `python alphatensor.benchmarking.run_gpu_benchmark`

.

Fig.3 reviews the efficiency time with respect to the matrix dimension for every algorithm. We will see that for *small matrices *dimension, 8192 and 10240, the Strassen achieves an *enchancment of about 6.5%* with respect to straightforward JAX, comparable with the AlphaTensor enchancment of about 8.5%. Glorious outcomes are achieved for *larger matrices* in order that for 18432 sq. matrices Strassen improves the calculation by *15%* (7.37 +/- 0.01) and AlphaTensor achieves an enchancment of the *16%* (7.31 +/- 0.01) in comparison with JAX (8.53 +/- 0.01).

One other take a look at I’ve accomplished was on Google Colab. On this case, we will depend on a Tesla T4 GPU. Though the algorithm has been examined on V100, it’s price investigating its transferability and evaluating the outcomes. Equally to the V100 take a look at, I’ve replicated these calculations on a Google Colab pocket book, eradicating these strains

As you possibly can see we’ve got extra variability within the outcomes, particularly for matrices of dimension 16384 we will see that each one the algorithms obtain the identical efficiency timings. This isn’t correct, as it could be on account of some downtime that we will’t handle on Google Colab. Tab.1 summarises all of the findings on Tesla T4:

Sizes 12288 and 16384 are difficult factors, the place we don’t have an actual enchancment with respect to JAX normal multiplication. On the opposite facet, we will see that we’ve got an enchancment for very giant matrices, at 18432 Strassen achieves a 20% speedup and Alphatensor 22%.

Just some days after DeepMind’s paper was printed, Manuel Kauers and Jakob Moosbauer wrote an important paper reply, proposing the FBHHRBNRSSSHK-Algorithm. This algorithm begins from DeepMind findings and improves the calculation over 4×4 matrices utilizing 47 multiplications, quite than 49 as AlphaTensor discovered, and 5×5 matrices to 95 multiplications (quite than the 96 proposed by AlphaTensor). That is nice information and it exhibits how people can collaborate productively with ML algorithms. After this reply, Kauers and Moosbaur printed a unbelievable mathematical paper known as “Flip Graphs for Matrix Multiplication”. On this paper the authors confirmed the method they discovered to additional enhance matrix multiplications. Particularly, the core a part of the method is to start out from recognized schemes for matrix multiplications and group them in a graph:

We outline a graph whose vertices are right matrix multiplication schemes and the place there’s an edge from one scheme to a different if the second may be obtained from the primary by some form of transformation. We take into account two transformations. One is named a flip and turns a given scheme to a distinct one with the identical variety of multiplications, and the opposite is named a discount and turns a given scheme to 1 with a smaller variety of multiplications

The navigation throughout all of the matrix multiplication schemes is finished by a random stroll. Then, the flip may be accomplished utilizing the next thought:

The thought is to subtract one thing from one rank-one tensor and add it to one of many others.

Fig.5 reviews the paper’s picture, the place the authors have depicted all of the recognized schemes and the way they’re interlinked with one another by flip and discount transformations. Thus, this isn’t the top of the story, but it surely’s one other nice place to begin, which can deliver increasingly environment friendly algorithms for matrix multiplication.

In the present day we’ve concluded the evaluation of the DeepMind paper “Discovering quicker matrix multiplication algorithms with reinforcement studying”. This paper has introduced quite a lot of curiosity within the matrix multiplication discipline and, clearly, a lot of questions.

We outlined 4 details that we will draw from the paper:

- RL is a formidable software that may assist us in tackling mathematical issues
- AlphaTensor has discovered new formulations for multiplying 5×5 and 4×4 matrices
- AlphaTensor can discover optimum options for particular {hardware} (e.g. GPU or TPU)
- It is a nice place to begin for brand new analysis on the matrix multiplication downside

Then, we run AlphaTensor on an NVIDIA V100 GPU and Tesla T4. Though some ups and downs, we will see that total AlphaTensor improves the calculation, with an enchancment of as much as 16% on V100 and 22% on Tesla T4 — though the algorithm will not be optimized for such a GPU.

Lastly, we noticed that this isn’t the top of the story, however a lovely begin for a brand new retailer. An instance is given by the FBHHRBNRSSSHK-Algorithm which proves how the DeepMind answer may be additional exploited with a pure mathematical formalism to seek out new and extra environment friendly matrix multiplication methods.

*Please, be at liberty to ship me an e mail for questions or feedback at: [email protected] or straight right here in Medium.*