A few months ago I was working on speeding up some image processing code. It was quite interesting, especially the Gaussian filter. I think it’s a good read for anyone interested in C++ code optimization.

# So what does this Gaussian filter do?

The original code implements the filter as a separable (2 pass) convolution. It’s quite simple:

```cpp
template<int width, int height>
void gauss(Img<unsigned char, width, height>& img, const std::vector<float>& coeff)
{
    const int halfWindow = coeff.size() / 2;
    Img<float, width, height> tmp;

    // horizontal pass: img -> tmp
    for (int y=0; y<height; y++)
        for (int x=0; x<width; x++) {
            float val = 0;
            for (int i=-halfWindow; i<=halfWindow; i++)
                val += img.getPixel(x+i, y) * coeff[i+halfWindow];
            tmp.setPixel(x, y, val);
        }

    // vertical pass: tmp -> img
    for (int y=0; y<height; y++)
        for (int x=0; x<width; x++) {
            float val = 0;
            for (int i=-halfWindow; i<=halfWindow; i++)
                val += tmp.getPixel(x, y+i) * coeff[i+halfWindow];
            img.setPixel(x, y, std::round(val));
        }
}
```

`Img` is a small class template that holds the image data and has some convenience functions like `getPixel()` and `setPixel()` that handle trivial bounds checking. Though the input/output image is unsigned char, the internal calculations are performed on floats for precision. One other thing to note is that the API suggests an in-place filter, but it allocates a large temporary buffer and does 2 out-of-place passes. The function is called with at most 7 filter coefficients: 3-3 neighbors left/right in the horizontal pass, and 3-3 top/bottom in the vertical pass.

# In-place or out-of-place?

This depends on your image processing pipeline. If you need the input image for further processing, the out-of-place filter is the right choice since it does not alter the input image. On the other hand, if you don’t need the input any more, an in-place filter is more efficient because it doesn’t perform an unnecessary copy. From an implementation standpoint an in-place filter is more flexible: it’s usually enough to redirect the writes to a new buffer to make it out-of-place. The other way around is not that simple, especially if the filter involves a many-to-one mapping of pixels. Just look at the code above: in the horizontal pass each pixel needs 3 unprocessed neighbors to the right, so these must be stored somewhere before overwriting.
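To make the many-to-one problem concrete, here’s a hypothetical sketch (not the original code) of an in-place 1D convolution: the original values of the already-overwritten left neighbors are kept in a small history buffer.

```cpp
#include <algorithm>
#include <vector>

// Sketch: in-place 1D convolution with a "black border" policy.
// Pixels left of the current position were already overwritten by outputs,
// so the last `half` original values are saved before each write.
void convolveInPlace(std::vector<float>& line, const std::vector<float>& coeff) {
    const int half = (int)coeff.size() / 2;
    const int n = (int)line.size();
    std::vector<float> hist(half, 0.0f); // originals of line[i-half .. i-1]
    for (int i = 0; i < n; i++) {
        float val = 0.0f;
        for (int k = -half; k <= half; k++) {
            float p;
            if (i + k < 0 || i + k >= n) p = 0.0f;           // outside: black
            else if (k < 0)              p = hist[half + k]; // overwritten, use saved copy
            else                         p = line[i + k];    // still untouched
            val += p * coeff[k + half];
        }
        if (half > 0) {
            // slide the history window and save line[i] before overwriting it
            std::rotate(hist.begin(), hist.begin() + 1, hist.end());
            hist[half - 1] = line[i];
        }
        line[i] = val;
    }
}
```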

# Requirements for optimized version

- single threaded C++ code
- target CPU is Intel with SSE4.2 instruction set
- preferably in-place operation
- no significant degradation in precision (7 pixel window)
- in/out: 1024×1024 unsigned char

# Let’s get started

There are many ways to implement this filter; I simply opted for the same algorithm but with SSE and some other tricks. The primary way to gain performance with SSE is parallelism, as much parallelism as possible. The target CPU has 128 bit SSE registers, which can be divided into 4x floats (same arithmetic as above), 4x ints (getting rid of int-float conversions), or 8x short ints (some loss of precision). 16x chars are not an option since you can’t get the arithmetic right: there are simply no bits left to encode the multiplications.

For me it seems to be overkill to process 8 bit pixels with 32 bit float coeffs so I opted for 16 bit integer arithmetic: process 8 pixels in parallel.

To get the integer arithmetic right we should have a look at the convolution (5 wide window):

P_{out} = C_{1}*P_{-2} + C_{2}*P_{-1} + C_{3}*P_{0} + C_{4}*P_{+1} + C_{5}*P_{+2}

where the P’s denote pixel values (indices indicating neighbors), and the C’s denote filter coeffs. For the float version the P’s are in the range [0..255], the C’s in [0..1], and their sum is 1. For the integer math I will use the fixpoint notation x.y: x denotes integer bits, y fractional bits. The P’s are in 8.0 format, and to fit the result of the multiplication in 16 bits, the C’s must be stored in 0.8 format. So C has no integer bits, only fractional, but that’s OK because C < 1.0; the result will be in 8.8 format. This means the coeffs must be multiplied by 256 when generated and the result must be divided by 256. Of course we will use bit shifts instead of division. (Integer division is so painful to implement in silicon that CPUs have at most 1 such unit per core, and there’s no integer vector division in SSE.) Since the sum of the float coeffs is 1, the sum of the integer ones will be 256. The calculation will never overflow: 255*256 < 65535. Let’s see them for sigma=0.8 (sampled Gaussian kernel):

```
Float:  0.000441  0.021910  0.228311  0.498676  0.228311  0.021910  0.000441
Int:           0         6        58       128        58         6         0
```

We can clearly see that the integer version doesn’t really match the dynamic range of the float one, so there goes the precision requirement… I’ll refine the arithmetic later; for now it will do.
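For reference, the quantization step can be sketched as follows (the function name is mine; note that for this particular kernel plain rounding happens to sum to exactly 256, though in general the taps may need a small adjustment to preserve the sum):

```cpp
#include <cmath>
#include <vector>

// Sketch: quantize float Gaussian coefficients (summing to 1.0) into 0.8
// fixed point, i.e. scale by 256 and round. The small outer taps collapse
// to 0 -- the dynamic-range loss mentioned above.
std::vector<int> quantizeCoeffs(const std::vector<float>& coeff) {
    std::vector<int> out;
    for (float c : coeff)
        out.push_back((int)std::round(c * 256.0f));
    return out;
}
```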

# Parallel convolution – horizontal

How to calculate P_{out} in parallel? Let’s see the first 4 equations of a simplified case (window of 5 pixels):

P_{out0} = C_{1}*P_{-2} + C_{2}*P_{-1} + C_{3}*P_{0 } + C_{4}*P_{+1} + C_{5}*P_{+2}

P_{out1} = C_{1}*P_{-1} + C_{2}*P_{0 } + C_{3}*P_{+1} + C_{4}*P_{+2} + C_{5}*P_{+3}

P_{out2} = C_{1}*P_{0 } + C_{2}*P_{+1} + C_{3}*P_{+2} + C_{4}*P_{+3} + C_{5}*P_{+4}

P_{out3} = C_{1}*P_{+1} + C_{2}*P_{+2} + C_{3}*P_{+3} + C_{4}*P_{+4} + C_{5}*P_{+5}

Looking at the columns formed by the equations, it’s obvious this can be done in parallel. One set of vectors holds the coeffs (each element of a vector holding the same value), and one vector holds a row of pixels which is shifted by one pixel after each multiplication. This way we can easily process the line in 8 pixel blocks.
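Before reaching for intrinsics, the column-wise scheme can be sketched in plain scalar code (a hypothetical helper where an 8-element array stands in for one 8-lane register; `in` must point at the leftmost neighbor, e.g. P_{-2} for a 5-tap window):

```cpp
// Sketch: process 8 output pixels the way the SSE code will -- one
// multiply-accumulate per coefficient ("column"), with the input window
// shifted by one pixel each time. In SSE the inner loop over lanes
// becomes a single vector instruction.
void convolveBlock8(const int* in, int* out, const int* coeff, int taps) {
    for (int lane = 0; lane < 8; lane++)
        out[lane] = 0;
    for (int t = 0; t < taps; t++)            // one column of the equations
        for (int lane = 0; lane < 8; lane++)  // all 8 lanes at once in SSE
            out[lane] += coeff[t] * in[lane + t];
}
```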

There’s one more catch though… Working in parallel means we can’t branch and do different stuff on a per-pixel basis; the same operation is applied to the whole vector. The float implementation above has no branches, but `getPixel()` and `setPixel()` do! They handle out-of-image pixels specially: reads result in black pixels, writes get ignored. To overcome this problem we split each line into 3 parts: the leftmost 8 pixel block, the center blocks, and the rightmost 8 pixel block. The center blocks form the nominal case: they need 3-3 pixels on the left/right for the convolution, and it’s always available. The 2 side blocks are implemented specially to avoid accessing outside the line. This is branch elimination.

Now that everything is in place, here’s the first version of the horizontal pass.

# Horizontal pass in SSE

```cpp
template<int width, int height>
void hFilter(Img<unsigned char, width, height>& img, const std::vector<int>& coeff)
{
    const __m128i c17 = _mm_set1_epi16(coeff[0]);
    const __m128i c26 = _mm_set1_epi16(coeff[1]);
    const __m128i c35 = _mm_set1_epi16(coeff[2]);
    const __m128i c44 = _mm_set1_epi16(coeff[3]);

    // read/write pointers
    unsigned char* rptr = img.data;
    unsigned char* wptr = img.data;

    for (int y=0; y<height; y++) {
        __m128i left;  // contains 8x 16bit pixels P-3, P-2, P-1, P, P+1, P+2, P+3, P+4
        __m128i right; // contains 8x 16bit pixels P+5 ...
        __m128i out;   // output pixels (block aligned)

        // setup left at the beginning of a row
        left = _mm_cvtepu8_epi16( _mm_slli_si128( _mm_loadl_epi64((const __m128i*)rptr), 3 ));

        // leftmost and center blocks
        rptr += 5; // 5 pixels already in 'left'
        for (int x=0; x<width/8-1; x++) {
            right = _mm_cvtepu8_epi16( _mm_loadl_epi64((const __m128i*)rptr) );
            rptr += 8;

            out = _mm_mullo_epi16( _mm_alignr_epi8(right, left, 0), c17);
            out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left,  2), c26));
            out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left,  4), c35));
            out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left,  6), c44));
            out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left,  8), c35));
            out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left, 10), c26));
            out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left, 12), c17));
            left = right;

            _mm_storel_epi64((__m128i*)wptr,
                _mm_packus_epi16( _mm_srli_epi16(out, 8), _mm_setzero_si128()));
            wptr += 8;
        }
        rptr -= 5; // restore pointer to block boundary

        // rightmost block of row
        right = _mm_cvtepu8_epi16( _mm_loadl_epi64((const __m128i*)rptr) );
        right = _mm_srli_si128(right, 10); // fixup: 5 black pixels at the end
        rptr += 8;

        // same as above
        out = _mm_mullo_epi16( _mm_alignr_epi8(right, left, 0), c17);
        out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left,  2), c26));
        out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left,  4), c35));
        out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left,  6), c44));
        out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left,  8), c35));
        out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left, 10), c26));
        out = _mm_add_epi16( out, _mm_mullo_epi16( _mm_alignr_epi8(right, left, 12), c17));
        left = right;

        _mm_storel_epi64((__m128i*)wptr,
            _mm_packus_epi16( _mm_srli_epi16(out, 8), _mm_setzero_si128()));
        wptr += 8;
    }
}
```

Some notes:

- The filter is symmetric, so registers are needed for only 4 of the 7 coefficients.
- It’s in-place. 🙂
- `left` and `right` hold 16 consecutive 16 bit pixels starting from P_{-3}, as that’s the farthest left neighbor needed to calculate P_{out0}. The sliding window over the P’s is implemented with `_mm_alignr_epi8()`.
- The convolution loop is unrolled because the 3^{rd} argument of `_mm_alignr_epi8()` is an immediate value (it needs to be a compile time constant). The compiler would probably have unrolled a `for` loop just fine, but for me it was more convenient this way.
- For the leftmost 8 pixel block only `left` is calculated specially, so the extra code is small. For the rightmost block `right` is calculated specially, but since this can’t be included in the tight loop without branches, the whole convolution block is repeated. I usually fix that with a macro (yuck! 🙂).
- See the Intel Intrinsics Guide for a detailed description of the `_mm_*` functions.

That’s all for this post, the next one will be about the vertical pass and the measurement of performance and accuracy.