In the previous two posts I covered basic code vectorization and the integer algorithm, which is fast but not accurate enough. To describe ‘not enough’ more quantitatively we need some tools…
Parallel convolution – vertical
Implementing the vertical pass seems to be a piece of cake after the horizontal one: 4 registers hold the coefficients, 7 registers hold the pixel data from consecutive rows, and the rest is just the math. No in-register shifts are needed. The tricky part is how to traverse the image. The horizontal pass was trivial: go line by line. But now we have 2 options:
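- Going vertically (down) looks fine: only 1 new row (8 pixels) has to be loaded for each convolution and it works in-place, but it is cache inefficient because every load jumps a full image row ahead in memory.
- Going horizontally means reloading all 7 data registers for each convolution, and it can’t be implemented in-place at all because the rows above have to stay unmodified. The cache will love it, though.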
The first option definitely looks better, except for the cache. Let’s see both and take the faster one.
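To make the two traversal orders concrete, here is a minimal scalar sketch in C. It is not the vectorized code of this series: plain arrays of 8 floats stand in for the SIMD registers, and the names (`convolve7`, `vertical_down`, `vertical_across`), the float pixel type, the symmetric 7-tap kernel stored as 4 coefficients, the width being a multiple of 8 and the untouched 3-row borders are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

enum { RADIUS = 3, TAPS = 2 * RADIUS + 1, LANES = 8 };

/* One convolution step: 7 rows of 8 pixels in, 1 row of 8 pixels out.
   c[0] is the centre coefficient; the kernel is assumed symmetric,
   so 4 coefficients describe all 7 taps. */
static void convolve7(float win[TAPS][LANES], const float c[RADIUS + 1],
                      float out[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        float acc = c[0] * win[RADIUS][i];
        for (int k = 1; k <= RADIUS; ++k)
            acc += c[k] * (win[RADIUS - k][i] + win[RADIUS + k][i]);
        out[i] = acc;
    }
}

/* Option 1: walk down each 8-pixel-wide strip, in place.
   Requires height >= 7. Every load is a full image row away from the
   previous one, which is what makes this order hard on the cache. */
static void vertical_down(float *img, size_t width, size_t height,
                          const float c[RADIUS + 1])
{
    float win[TAPS][LANES], out[LANES];
    for (size_t x = 0; x + LANES <= width; x += LANES) {
        /* Prime the window with the first 7 rows of the strip. */
        for (size_t r = 0; r < TAPS; ++r)
            memcpy(win[r], img + r * width + x, sizeof win[r]);
        for (size_t y = RADIUS; ; ++y) {
            convolve7(win, c, out);
            /* Safe in place: the window still holds the original row y. */
            memcpy(img + y * width + x, out, sizeof out);
            if (y + RADIUS + 1 >= height)
                break;
            /* Slide the window down and load only one new row. */
            memmove(win[0], win[1], sizeof win[0] * (TAPS - 1));
            memcpy(win[TAPS - 1], img + (y + RADIUS + 1) * width + x,
                   sizeof win[0]);
        }
    }
}

/* Option 2: walk along each output row. All 7 window rows are reloaded
   for every step, and a separate destination is needed because the rows
   above must stay unmodified. Loads within a row are sequential. */
static void vertical_across(const float *src, float *dst, size_t width,
                            size_t height, const float c[RADIUS + 1])
{
    float win[TAPS][LANES], out[LANES];
    for (size_t y = RADIUS; y + RADIUS < height; ++y)
        for (size_t x = 0; x + LANES <= width; x += LANES) {
            for (size_t r = 0; r < TAPS; ++r)
                memcpy(win[r], src + (y - RADIUS + r) * width + x,
                       sizeof win[r]);
            convolve7(win, c, out);
            memcpy(dst + y * width + x, out, sizeof out);
        }
}
```

Both functions compute the same interior rows. The only differences are the loop order, the number of loads per convolution (1 row versus 7) and whether a second buffer is needed, which is exactly the trade-off in the two bullets above. In a real SIMD version the `memmove` in option 1 would typically disappear into register renaming in an unrolled loop.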