Very recently I was talking with Eduardo, and he introduced me to the http://shader-playground.timjones.io/ website.
There you can write your shader code (vertex, fragment, etc.) and choose the compiler, the language and the shader stage you are writing for.
The most useful analyzer is the Radeon GPU Analyzer, because it compiles the code and shows the shader instructions it generates, the number of registers it uses, and so on.
This is our starting point.
Motivation
- Better understand the shader code we are writing.
- Squeeze every last bit of optimization out of our code 🙂.
The Inverse Transpose Matrix
Problem 1: calculate the inverse transpose matrix used to fix the normal in the light calculations.
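As a quick reminder of where the inverse transpose comes from (the symbols below are my own notation, not anything taken from the shaders): a normal n must stay perpendicular to every tangent t after the model transform M, and the matrix G that preserves that property is the inverse transpose of M (any scalar multiple also works, since the normal is normalized afterwards).

```latex
% n . t = 0 must still hold after the transform:
(Gn)^{T}(Mt) = n^{T} G^{T} M\, t = 0 \quad \text{whenever } n^{T} t = 0
% which is guaranteed when G^{T} M = I, therefore
G = (M^{T})^{-1} = (M^{-1})^{T}
```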
I always thought that using big structures, or grouping data into array-like types, was better in shaders, but take a look at the example below.
I implemented the inverse transpose matrix using the cofactor/determinant method.
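For reference, the identity behind all four variants below (again in my own notation) is that the inverse transpose of a 3x3 matrix is its cofactor matrix divided by its determinant; the division at the end of each function is just a cofactor expansion of that determinant.

```latex
M^{-T} = \frac{\operatorname{cof}(M)}{\det(M)},
\qquad
\det(M) = \sum_{k=1}^{3} m_{ik}\,\operatorname{cof}(M)_{ik}
\quad \text{(expansion along any row } i \text{, or analogously any column)}
```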
I wrote 4 shader algorithms:
```glsl
mat3 inverse_transpose_0(mat3 m) {
    mat3 result_a = mat3(m[2].z * m[1].y, m[1].z * m[2].x, m[2].y * m[1].x,
                         m[0].z * m[2].y, m[2].z * m[0].x, m[0].y * m[2].x,
                         m[1].z * m[0].y, m[0].z * m[1].x, m[1].y * m[0].x);
    mat3 result_b = mat3(m[1].z * m[2].y, m[2].z * m[1].x, m[1].y * m[2].x,
                         m[2].z * m[0].y, m[0].z * m[2].x, m[2].y * m[0].x,
                         m[0].z * m[1].y, m[1].z * m[0].x, m[0].y * m[1].x);
    mat3 result = result_a - result_b;
    return result / dot(m[0], result[0]); // result / determinant
}
```
```glsl
mat3 inverse_transpose_1(mat3 m) {
    float a0 = m[0][0], a1 = m[0][1], a2 = m[0][2];
    float b0 = m[1][0], b1 = m[1][1], b2 = m[1][2];
    float c0 = m[2][0], c1 = m[2][1], c2 = m[2][2];
    mat3 result_a = mat3(c2 * b1, b2 * c0, c1 * b0,
                         a2 * c1, c2 * a0, a1 * c0,
                         b2 * a1, a2 * b0, b1 * a0);
    mat3 result_b = mat3(b2 * c1, c2 * b0, b1 * c0,
                         c2 * a1, a2 * c0, c1 * a0,
                         a2 * b1, b2 * a0, a1 * b0);
    mat3 result = result_a - result_b;
    return result / dot(m[0], result[0]); // result / determinant
}
```
```glsl
mat3 inverse_transpose_2(mat3 m) {
    float a0 = m[0][0], a1 = m[0][1], a2 = m[0][2];
    float b0 = m[1][0], b1 = m[1][1], b2 = m[1][2];
    float c0 = m[2][0], c1 = m[2][1], c2 = m[2][2];
    vec3 _a0 = vec3(c2 * b1, b2 * c0, c1 * b0);
    vec3 _a1 = vec3(b2 * c1, c2 * b0, b1 * c0);
    vec3 _a = _a0 - _a1;
    vec3 _b0 = vec3(a2 * c1, c2 * a0, a1 * c0);
    vec3 _b1 = vec3(c2 * a1, a2 * c0, c1 * a0);
    vec3 _b = _b0 - _b1;
    vec3 _c0 = vec3(b2 * a1, a2 * b0, b1 * a0);
    vec3 _c1 = vec3(a2 * b1, b2 * a0, a1 * b0);
    vec3 _c = _c0 - _c1;
    return mat3(_a, _b, _c) / dot(m[0], _a); // result / determinant
}
```
```glsl
mat3 inverse_transpose_3(mat3 m) {
    float a00 = m[0][0], a01 = m[0][1], a02 = m[0][2];
    float a10 = m[1][0], a11 = m[1][1], a12 = m[1][2];
    float a20 = m[2][0], a21 = m[2][1], a22 = m[2][2];
    float b01 = a22 * a11 - a12 * a21;
    float b11 = a12 * a20 - a22 * a10;
    float b21 = a21 * a10 - a11 * a20;
    return mat3(b01, b11, b21,
                (a02 * a21 - a22 * a01), (a22 * a00 - a02 * a20), (a01 * a20 - a21 * a00),
                (a12 * a01 - a02 * a11), (a02 * a10 - a12 * a00), (a11 * a00 - a01 * a10))
           / (a00 * b01 + a01 * b11 + a02 * b21); // result / determinant
}
```
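For context, this is roughly the kind of minimal vertex shader I would paste into the playground to compare the variants. The attribute and uniform names (`aPosition`, `aNormal`, `uMVP`, `uLocalToWorld`) are placeholders of mine, not the exact harness behind the numbers below; the built-in `transpose(inverse(...))` expression corresponds to the `transpose(inverse(mat3()))` row of the table.

```glsl
#version 450

in vec3 aPosition;
in vec3 aNormal;

uniform mat4 uMVP;
uniform mat4 uLocalToWorld;

out vec3 vNormal;

void main() {
    // Swap this expression for a call to one of the
    // inverse_transpose_* variants above to compare the
    // generated instruction and register counts.
    mat3 normalMatrix = transpose(inverse(mat3(uLocalToWorld)));

    vNormal = normalize(normalMatrix * aNormal);
    gl_Position = uMVP * vec4(aPosition, 1.0);
}
```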
Which one do you think performs best after compilation?
Naively, I thought inverse_transpose_0 was the best implementation, because it works directly on the matrix data and subtracts all 9 (3x3) values in one line of code.
But I was wrong: it is one of the worst after compilation.
Take a look at the statistics:
| Name | Instructions | Registers |
|---|---|---|
| inverse_transpose_0 | 74 | 15 |
| inverse_transpose_1 | 74 | 15 |
| inverse_transpose_2 | 68 | 15 |
| inverse_transpose_3 | 68 | 13 |
| transpose(inverse(mat3())) | 73 | 14 |
| mat3(transpose(inverse())) | 108 | 18 |
Looking at the compiled matrix-manipulation code, I found that the shader ends up doing its operations on other primitive types, like vec2, vec3 or vec4, according to the instruction set of the video card.
So implementation 0 only deceives our eyes with fewer data structures.
Implementation 1 produces a similar instruction count with the Radeon compiler, but with other compilers it generates fewer instructions than implementation 0. Both use mat3 subtractions to calculate the inverse. It is better to copy the values into a separate set of floats than to index the matrix directly throughout the code.
Implementation 2 is a bit cleverer. It uses separate floats together with vec3 primitives and generates fewer instructions, but it is still not the best compiled code.
Finally we have implementation 3. We can forget about every kind of data grouping we considered before and store all values as plain floats. The compiler grouped the values automatically and used multiply-and-add instructions. It generates the best code, with the lowest register usage.
Cofactor calculations
The cofactor calculation caught my attention, because this situation comes up all the time when we implement any equation.
For example, in implementation 3 the original code was:
float b11 = - a22 * a10 + a12 * a20;
And the new one is:
float b11 = a12 * a20 - a22 * a10;
The original code generates one instruction more than the new one, presumably because the isolated negation cannot be folded into a single multiply-and-add instruction.
The Normal Transformation
Problem 2: calculate the normal after a transformation so that the light calculations in the fragment shader stay correct.
My usual thought: when I transform a normal by a 4x4 matrix, I can cast the normal vertex attribute to a vec4, do the multiplication and use the shader swizzle operator to get the final transformed normal.
A lot of the shaders I have written use this form to fix the normal:
vec3 N = normalize( uLocalToWorld_it * vec4( aNormal, 0.0) ).xyz;
And this is the new code I have:
vec3 N = normalize( mat3(uLocalToWorld_it) * aNormal );
The original code generates 2 instructions more than the new implementation.
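Putting the new form in context, a vertex shader using it would look roughly like the sketch below. The names `aPosition`, `aNormal` and `uMVP` are placeholders of mine, and `uLocalToWorld_it` is assumed to already hold the inverse transpose of the model matrix, as in the snippets above.

```glsl
#version 450

in vec3 aPosition;
in vec3 aNormal;

uniform mat4 uMVP;
uniform mat4 uLocalToWorld_it; // inverse transpose of the model matrix

out vec3 vNormal;

void main() {
    // Old form: cast the normal to vec4, multiply, then swizzle back.
    // vec3 N = normalize( uLocalToWorld_it * vec4( aNormal, 0.0) ).xyz;

    // New form: cast the matrix to mat3 instead; two instructions fewer.
    vec3 N = normalize(mat3(uLocalToWorld_it) * aNormal);

    vNormal = N;
    gl_Position = uMVP * vec4(aPosition, 1.0);
}
```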
Final Considerations
These considerations are not hard rules. They are conclusions I reached after analysing the compiled output of algorithms I developed on the fly.
I hope they help you think about your own code.
Let's go:
- When you use array indexing in several parts of the code, it is better to create a float, vec2, vec3 or vec4 as a temporary variable and use it wherever the algorithm needs it (see the sketch after this list).
- When you are doing calculations that are not already grouped into vec2, vec3 or vec4 structures, keep the code as simple as you can; there is no problem in using a lot of floats for that. Implementation 3 of the inverse transpose is the best because it generates fewer instructions and uses fewer registers.
- Try not to use negative signs on isolated elements of your equations. Every time you see a form like '-a + b', rewrite it as 'b - a'. It will generate fewer instructions.
- When you need to transform a vec3 that is a vector (not a point), converting the mat4 to a mat3 and doing the multiplication is better than converting the vec3 to a vec4, doing the multiplication and swizzling the result.
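To tie the first and third points together, here is a tiny, purely illustrative sketch (not taken from the shaders above): a cross product of the first two matrix columns, written the way the checklist suggests.

```glsl
// Illustrative only: cross product of the first two columns of m,
// written following the checklist above.
vec3 cross_of_columns(mat3 m) {
    // Read each value once into a local float instead of
    // indexing the matrix at every use.
    float a0 = m[0].x, a1 = m[0].y, a2 = m[0].z;
    float b0 = m[1].x, b1 = m[1].y, b2 = m[1].z;

    // Differences written as "b - a", never "-a + b", so the
    // compiler can fuse them into multiply-and-add instructions
    // (as discussed in the cofactor section).
    return vec3(a1 * b2 - a2 * b1,
                a2 * b0 - a0 * b2,
                a0 * b1 - a1 * b0);
}
```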
Thanks for reading.
Best Regards,
Alessandro Ribeiro