Some details about this implementation: it works on three-by-three matrices. Each matrix element is stored in a two-byte (16-bit) container, so the maximum value it can hold is 65,535 in decimal.
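To make the two-byte limit concrete, here is a minimal Python sketch. It is not the project's code: the helper names `pack_3x3` and `matmul_3x3` are hypothetical, and it assumes the container is unsigned, which is what the 65,535 ceiling implies. It shows that nine elements fit in 18 bytes and that a product of in-range elements can land far outside the two-byte range.

```python
import struct

U16_MAX = 2**16 - 1  # 65,535: largest value an unsigned two-byte container holds


def pack_3x3(m):
    """Pack a 3x3 matrix into 18 bytes (nine little-endian unsigned 16-bit values)."""
    flat = [v for row in m for v in row]
    # struct.pack raises struct.error if any value falls outside 0..65,535
    return struct.pack("<9H", *flat)


def matmul_3x3(a, b):
    """Plain 3x3 product on Python ints, so results are free to exceed 65,535."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
big = [[U16_MAX] * 3 for _ in range(3)]

packed = pack_3x3(big)               # fits: every element is exactly 65,535
product = matmul_3x3(big, identity)  # unchanged, still within range
square = matmul_3x3(big, big)        # each entry is 3 * 65,535**2, far out of range
```

Whether the actual implementation accumulates products in a wider type or lets them wrap around is not stated here, so treat the overflow behavior as something to check against the source.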
Important note: Please don’t expect peak performance without fine-tuning hyperparameters such as the number of threads, the kernel size, and the block sizes, unless you're running it on a Ryzen 7700(X).
I remember you, Mrs. Simmons, and your infernal multiplication quizzes. I remember you. Anyway. The storm dumps its load, takes a breath, and dumps some more. School is canceled, Christmas Eve ...