This is no longer the case, given deep learning, though. Right? Effectively, the training process can be seen as a giant matrix multiplication in one step.
Deep learning mostly uses matrices with largest dimension <1000. The size of the matrix directly corresponds to the number of input and output units of a fully connected layer.
Even a 2000x2000 matrix is relatively small as far as non-O(N^3) matrix multiplication algorithms go. The asymptotically faster algorithms have larger fixed costs and do not map as nicely to current CPU/GPU architectures. Additionally, the people capable of writing high-performance matrix multiplication mostly haven't spent time on the non-N^3 algorithms. I believe every common implementation (MKL, cuBLAS, Eigen, and various other BLAS libraries) uses the N^3 algorithm, with optimizations focused more on computer architecture (cache and register blocking, limiting instruction dependencies, SIMD) than on algorithmic cleverness.
The only attempt that I know of to use a non-N^3 algorithm for practical matrix multiplication is a 2016 paper[0] called Strassen's Reloaded. Figure 5 shows that their implementation of Strassen's outperforms MKL on a single core at around a (2K, 2K, 2K) matrix multiplication problem (look at the upper right part of the figure). However, you're rarely multiplying on a single core. Figure 6 shows that on a 10-core system, Strassen's becomes marginally faster with square matrices of size 4K, and only becomes significantly faster at size 8K. Although these results are super cool in my opinion, they're mostly not applicable to deep learning in its current incarnation. Finally, Strassen's algorithm is one of the simplest non-N^3 matmul algorithms (known since 1969, complexity O(n^2.8)), and I believe it has much lower fixed costs than something more recent with complexity more like O(n^2.37).
[0] Strassen's Reloaded paper: https://www.cs.utexas.edu/~jianyu/papers/sc16.pdf
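For reference, Strassen's seven-multiply recursion is simple enough to sketch. This is a toy NumPy version for power-of-two sizes, nothing like the blocked, cache-aware implementation the paper benchmarks:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Toy Strassen multiplication for square matrices with
    power-of-two size; falls back to ordinary matmul below cutoff."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of the naive eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.default_rng(0).standard_normal((128, 128))
B = np.random.default_rng(1).standard_normal((128, 128))
print(np.allclose(strassen(A, B), A @ B))  # True
```

The extra additions and the recursion overhead are exactly the "fixed costs" that make this a loss at small sizes.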
Thanks for the links! I had never actually seen it quantified in terms of the matrix size you need. Just "large", which is always relative. :)
And I also hadn't looked closely enough to see if the amount of training data influenced matrix size. Makes sense that it would only influence a single dimension.
The typical neural net matrix multiplication is N_EXAMPLES X N_FEATURES_IN multiplied with N_FEATURES_IN X N_FEATURES_OUT.
The output feature count is completely independent of the data size, and the input feature count depends only on the dimensionality of the data (not the number of points), and that's only in the first layer of the network. Even with datasets with a huge number of examples, the net usually only trains on a small "minibatch" of examples at a time, typically somewhere between 16 and 1024. This minibatch size is the algorithmic N_EXAMPLES. Given these numbers, the typical neural net matrix multiplication is probably something like (32, 256) x (256, 128). This is not nearly large enough for non-N^3 matmul algorithms to accelerate things.
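Those shapes can be written out concretely (the sizes below are the hypothetical ones from above, not from any particular network):

```python
import numpy as np

# Hypothetical layer: minibatch of 32 examples, 256 inputs, 128 outputs.
batch, d_in, d_out = 32, 256, 128
X = np.random.default_rng(0).standard_normal((batch, d_in))  # activations
W = np.random.default_rng(1).standard_normal((d_in, d_out))  # weights

Y = X @ W          # (32, 256) x (256, 128) -> (32, 128)
print(Y.shape)     # (32, 128)
# Roughly 2 * 32 * 256 * 128 ~= 2.1 million FLOPs: tiny by BLAS standards.
```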
If deep learning were just a big matrix multiplication then it would not be deep, it would be linear. The (e.g. sigmoid) functions mixed between the layers make it deep. I'm not really sure what it is about the training that you think is just a matrix multiplication.
It's also worth noting that e.g. a convolution can be written with matrix multiplication (or rather some tensor products and contractions), but there would be much redundancy, since distant points cannot influence each other and the same operation is applied at each point of the convolution. This is much less general than matrix multiplication.
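As a concrete sketch of that rewriting (a hypothetical 1-D example, using the standard trick of stacking sliding windows into the rows of a matrix):

```python
import numpy as np

def conv1d_as_matmul(x, w):
    """Valid-mode 1-D correlation, expressed as a matrix multiply:
    each row of the patch matrix is one sliding window of x.
    Note the redundancy: adjacent rows overlap in k-1 entries."""
    k = len(w)
    patches = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
    return patches @ w  # (n-k+1, k) x (k,) -> (n-k+1,)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
print(conv1d_as_matmul(x, w))  # matches np.correlate(x, w, 'valid')
```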
If floating point values were real numbers, simple matrix multiplications would not produce nonlinearities.
However, researchers at OpenAI demonstrated that the use of subnormal floating point values and their discontinuity [0] provides sufficient nonlinearity to learn nonlinear associations.
[0] https://blog.openai.com/nonlinear-computation-in-linear-netw...
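The discontinuity is easy to see even in ordinary doubles: at subnormal scale, multiplication by a constant stops behaving linearly because results round to zero. A small demonstration (unrelated to the OpenAI code, which works with float32):

```python
# 5e-324 is the smallest positive subnormal double.
tiny = 5e-324
f = lambda x: 0.5 * x       # a nominally linear function

print(f(2 * tiny) == tiny)  # True: 0.5 * 1e-323 is exactly 5e-324
print(f(tiny) + f(tiny))    # 0.0: each half rounds to zero,
                            # so f(x) + f(x) != f(2 * x)
```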
Then, would it be better to store rationals as 2 ints with understood division, and irrationals as 2 ints as understood powers?
Why are we working with these numbers so imprecisely, and with such discontinuity? When I've done math work in school, I always worked with the primitives without approximating: 2^0.5 was handled as such unless an approximate answer was asked for.
I also ran across this in AutoCAD. Because it forces a fixed numeric precision, one cannot define an exact sqrt length: 1.414 doesn't reach all the way across the hypotenuse of a right triangle with legs 1 and 1.
That post completely disregards the fact that subnormals are usually orders of magnitude slower to work with for common hardware.
Apologies, I did not mean to imply it was only a matter of multiplication. Only that training has several steps that are a giant matrix multiplication. The more data to train, the larger the matrix.
Edit: also, it isn't the activation function that makes it deep, is it? Rather, it is the literal depth of the network. Right?
Right. But without an activation function, the matrix multiplications would just compose into a single matrix for the whole net. I guess the floating-point rounding brought up above is supposed to serve as an activation function for this purpose, though I haven't looked into it.
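That collapse is easy to check numerically: two linear layers applied in sequence equal one layer whose weight matrix is the product of the two (a quick sketch with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8))   # layer 1 weights
W2 = rng.standard_normal((8, 3))   # layer 2 weights
x = rng.standard_normal(4)

two_layers = (x @ W1) @ W2         # forward pass through both layers
one_layer = x @ (W1 @ W2)          # single composed matrix
print(np.allclose(two_layers, one_layer))  # True: no extra depth
```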
But can't you typically do the activation as a separate step? Where you first incorporate all of the inputs, then decide activation?
I'm not sure quite what you mean, but you can do the linear part of one layer all together, then the nonlinear part, etc.:
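For instance (a sketch with arbitrary shapes, using ReLU as the nonlinearity; not the original commenter's code):

```python
import numpy as np

def forward(X, weights):
    """One forward pass: each layer is a matmul (the linear part)
    followed by an elementwise nonlinearity (here, ReLU)."""
    for W in weights:
        X = X @ W                # linear part, whole batch at once
        X = np.maximum(X, 0.0)   # nonlinear part, a separate step
    return X

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 32)), rng.standard_normal((32, 8))]
out = forward(rng.standard_normal((4, 16)), weights)
print(out.shape)  # (4, 8)
```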
So my question is how large is the linear part? Granted, I was hoping it was big enough to bring these algorithms into consideration. Looks like I'm right in that they are larger than 2x2. However, they are still not 8kx8k. :(
They can be pretty big! I don't know what's normal these days, though.
This misses the point. Yes, there have to be nonlinear activations. However, almost all of the compute time is spent in matrix multiplications (or stencil operations for convolutions, which are a form of structured matrix multiplication). So when talking about runtimes, you're basically talking about matrix multiplication.
> the training process can be seen as a giant matrix multiplication in one step.
It can't be seen that way.
Deep learning becomes deep when you alternate linear matrix multiplications with non-continuous functions. After that you can't reduce the problem to one step. You also lose associativity, so you can't even apply matrix chain multiplication optimization.
I meant that as, "one of" the steps. Not that the entire thing was one step.
*nonlinear, not non-continuous.
Fully connected layers are matrix-vector multiplication (which turns into matrix-matrix if you do an entire batch at once). Convolutional layers are lots of tiny dot products. You can rewrite that as matrix-matrix multiplication.
This was my thought. Am I correct in saying the batch size influences the size of the matrix? Or is there a more natural limiting factor?
That's right. For a fully connected layer the matrix is of dimension (batch size) x (number of neurons in layer k) which gets multiplied by a matrix of dimension (number of neurons in layer k) x (number of neurons in layer k+1). People do this because going through the network with a batch is more efficient than going through the network separately for each element in the batch, mainly because multiplying two matrices is more efficient than doing n matrix-vector products. The batch size is limited because you get diminishing speedup out of increasing your batch size. Stochastic gradient descent also gets less improvement out of each data point per pass through the data set as you increase the batch size, so you need more passes. If you take the whole data set to be your batch size then you get ordinary gradient descent, which needs a lot more passes through the data set than stochastic gradient descent.
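The equivalence is easy to verify: one batched matmul gives the same result as doing a separate matrix-vector product per example, it's just much faster in practice (a sketch with made-up layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_k, n_k1 = 64, 256, 128
X = rng.standard_normal((batch, n_k))  # (batch size) x (neurons in layer k)
W = rng.standard_normal((n_k, n_k1))   # (layer k) x (layer k+1)

batched = X @ W                               # one matrix-matrix product
one_at_a_time = np.stack([x @ W for x in X])  # 64 matrix-vector products
print(np.allclose(batched, one_at_a_time))    # True, but batched is faster
```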
Deep learning is done in hardware so you're not optimising instructions but circuits.
Not sure I understand this. The circuit for addition is simpler than the circuit for multiplication.
That said, in circuits, it is far less likely to ever multiply that large of a matrix. Even if I were correct that many steps of large training can be cast as a matrix multiply.