The success of deep learning needs no further elaboration. Researchers have long tried to explain the effectiveness of neural networks from a mathematical perspective. However, because a network's structure can be viewed as a repeated composition of high-dimensional linear transformations and nonlinear transformations (such as the ReLU activation function), there is in fact no good mathematical tool for dissecting such a complex structure.
As a result, theoretical research on neural networks is often limited to studying approximation, optimization, generalization, and other observed phenomena of networks.
Setting theoretical limitations aside, one indisputable fact is that wider and deeper networks consistently perform better. From fully connected networks with only a few layers to large models with trillions of parameters, this rule holds throughout.
So how can this fact be understood theoretically? And what role does the activation function play in it?
Compared with width, depth is more challenging to study, because increasing the number of layers also means repeatedly composing nonlinear functions.
A typical question is: with the width of the model fixed, can a deeper model fit more data points than a shallower one?
Gai Kuo, a graduate student in applied mathematics at the Chinese Academy of Sciences, recently completed a piece of work that sits between neural network algorithm design and the explanation of an observed phenomenon.
"Because I come from a mathematics background, I wanted to produce some theoretical results. But the framework of theoretical research on neural networks at the time was already well established, and the remaining open problems were all very hard," he said. "As a result, I spent a long time reading the existing literature without finding an original entry point."
After a series of unsuccessful attempts, Gai Kuo returned to his original, intuitive idea: the width of a network is easier to analyze. For example, for a simple linear system $W x_i = y_i$, as the size of $W$ increases, the number of input-output pairs $(x_i, y_i)$ that can be satisfied simultaneously also grows linearly.
If depth could be shown to be equivalent to width, that is, if a two-layer network were equivalent to a single layer with one large matrix, then a solution of the large matrix equation could be found by Gaussian elimination. That solution would correspond to a solution of the two-layer neural network, which would show that increasing the depth of a network is as effective as increasing its width.
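To make the width argument concrete, here is a minimal sketch (an illustration, not code from the paper): a $d \times d$ matrix $W$ can generically be chosen to satisfy $W x_i = y_i$ for $d$ independent pairs at once, so the number of fittable pairs grows linearly with the width $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # width of the single linear layer

# d generic input-output pairs, stacked as the columns of X and Y
X = rng.standard_normal((d, d))
Y = rng.standard_normal((d, d))

# Solve W X = Y: with d generic pairs, X is invertible and W = Y X^{-1}
W = Y @ np.linalg.inv(X)

print("residual over", d, "pairs:", np.linalg.norm(W @ X - Y))  # ~0
```

Doubling $d$ doubles the number of pairs that can be fitted this way, which is the sense in which width buys capacity linearly.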
However, there are almost no tools for handling the composition of an element-wise nonlinear activation function with matrix multiplication, and such a composition does not have good optimization properties either.
For example, consider the equation $W_2\,\sigma(W_1 X) = Y$, where $\sigma$ is the ReLU or Sigmoid function applied element-wise. This equation is hard to solve: because the problem is non-convex, even optimization-based methods cannot guarantee an answer. Yet solving such an equation was a key step in his idea.
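To see the difficulty concretely, here is a minimal sketch (not from the paper) that attacks $W_2\,\mathrm{ReLU}(W_1 X) = Y$ by gradient descent on the squared residual; because the loss is non-convex in $(W_1, W_2)$, nothing guarantees the residual reaches zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
X = rng.standard_normal((d, d))
Y = rng.standard_normal((d, d))

W1 = 0.5 * rng.standard_normal((d, d))
W2 = 0.5 * rng.standard_normal((d, d))
lr = 1e-3

for step in range(20000):
    H = W1 @ X                  # pre-activation
    A = np.maximum(H, 0.0)      # element-wise ReLU
    R = W2 @ A - Y              # residual of W2 * ReLU(W1 X) - Y
    # gradients of the Frobenius loss ||R||_F^2
    gW2 = 2.0 * R @ A.T
    gH = (2.0 * W2.T @ R) * (H > 0)
    gW1 = gH @ X.T
    W1 -= lr * gW1
    W2 -= lr * gW2

print("final residual:", np.linalg.norm(W2 @ np.maximum(W1 @ X, 0.0) - Y))
```

Whether the residual approaches zero depends on the initialization and step size; nothing in the procedure guarantees it, which is exactly the lack of guarantee described above.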
Although the idea could not be pushed further at that point, the specific form of the problem was by then fairly clear. Gai Kuo said that if the class of allowed activation functions is relaxed, a solution of this equation can be found, for example, by replacing the element-wise activation function with the matrix exponential.
The advantage of doing so is that if two matrices commute, then the matrices obtained after applying the matrix exponential activation also commute.
To make the relevant matrices commute, an additional layer of network parameters has to be added. With commutativity in hand, the equation above becomes easy to solve: one can perform elimination on an equivalent large matrix and obtain a set of solutions for the three-layer function.
In this way, he realized his original idea under this special activation function.
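The commutation property is easy to check numerically. The sketch below (illustrative, not code from the paper) verifies with SciPy that commuting matrices $A$ and $B$ yield commuting exponentials, and that $\exp(A)\exp(B) = \exp(A + B)$ in that case.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

rng = np.random.default_rng(0)
d = 5
A = 0.3 * rng.standard_normal((d, d))
B = A @ A + 2.0 * A  # a polynomial in A, so A and B commute

print(np.linalg.norm(A @ B - B @ A))                          # ~0: A and B commute
print(np.linalg.norm(expm(A) @ expm(B) - expm(B) @ expm(A)))  # ~0: so do their exponentials
print(np.linalg.norm(expm(A) @ expm(B) - expm(A + B)))        # ~0: exp(A)exp(B) = exp(A+B)
```

This is the structure the argument relies on: once the matrices involved commute, products of matrix exponentials can be merged and manipulated much like scalars, which is what makes the elimination step possible.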
Specifically, after discussions with Zhang Shihua, Gai Kuo concluded that if they could find a simple and direct example showing that, in the presence of an activation function, deepening the network by one layer allows it to fit more data points, the result would be more meaningful.
To this end, they extended the network parameters to the complex domain and replaced the element-wise activation function with the matrix exponential, so that the three-layer neural network takes the form

$$f(X) = W_3 \exp(W_2 \exp(W_1 X)),$$

and they found a set of analytic solutions to

$$f(X_1) = Y_1, \qquad f(X_2) = Y_2,$$

where all matrices are $d \times d$ square matrices. This demonstrates the effectiveness of network depth, because a single-layer network $WX = Y$ can in general satisfy only one such pair: $W$ has only $d^2$ parameters, while each pair imposes $d^2$ constraints. Overall, they found a theoretically cleaner example that can help people better understand the effectiveness of network depth and of the nonlinear activation function.
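As a small numerical illustration of the single-layer baseline (a sketch under the setup above, not the paper's construction), the snippet below fits one generic matrix pair exactly with a linear layer and shows that the same weights generically miss a second pair; closing that gap is precisely what the three-layer matrix-exponential solution achieves.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential, used as the activation

rng = np.random.default_rng(0)
d = 4

def rand_c(shape):
    """A generic complex matrix, matching the complex-domain setting."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

X1, Y1 = rand_c((d, d)), rand_c((d, d))
X2, Y2 = rand_c((d, d)), rand_c((d, d))

# One linear layer: d^2 parameters, so it fits exactly one generic pair.
W = Y1 @ np.linalg.inv(X1)
print("pair 1 residual:", np.linalg.norm(W @ X1 - Y1))  # ~0
print("pair 2 residual:", np.linalg.norm(W @ X2 - Y2))  # generically far from 0

# The three-layer form studied in the paper; the paper supplies analytic
# W1, W2, W3 (not reproduced here) such that f(X1) = Y1 and f(X2) = Y2.
def f(W1, W2, W3, X):
    return W3 @ expm(W2 @ expm(W1 @ X))
```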
In experiments, they observed that although the theoretical result is stated for the matrix exponential activation function, similar behavior can also be seen via numerical optimization for element-wise ReLU or Sigmoid activation functions: a two-layer network can fit roughly twice as many data points as a single-layer one. This may inspire other researchers to look for more general conclusions.
Recently, the related paper, titled "Analytical Solution of a Three-Layer Network with a Matrix Exponential Activation Function," was posted on arXiv [1].
Gai Kuo said: "Thank you Teacher Zhang Shihua for their support and encouragement. When the subject has not progressed, Mr. Zhang did not give the paper on the paper.Published pressure, and did not urge the change of the topic.In the end, I found the solution.
References:
1. https://arxiv.org/pdf/2407.02540