Fast Gaussian filter for greyscale images (with SSE) – part 3

In the previous two posts I covered basic code vectorization; the integer algorithm is fast, but not accurate enough. To describe ‘not accurate enough’ more quantitatively we need some tools…

Image comparison metrics

Visual inspection doesn’t reveal much difference between the images; both look as they should: blurred. To compare the images numerically, let’s define the error at a given pixel as the difference between the pixel’s value in the optimized and the reference image: err_{i,j} = opt_{i,j} - ref_{i,j}
Based on this I’ll use the following metrics:

  • Different pixel count – The most basic: the count (and percentage) of differing pixel pairs.
  • Maximum absolute error – MaxAbsError = \max_{i,j}(|err_{i,j}|), the highest grey value difference between corresponding pixels of the 2 images. This value shows the worst case error introduced by the integer arithmetic on a test image. A high error value doesn’t tell much about image similarity (even a single pixel can boost this value), but a low value is a strong guarantee that the images closely match.
  • Average absolute error – AvgAbsError = \frac{\sum |err_{i,j}|}{PixelCount}. It’s the average per-pixel grey value difference.
  • Average error – AvgError = \frac{\sum err_{i,j}}{PixelCount}. The same as above but without the absolute value. In this metric, if one pixel is brighter and another is darker than its reference pair, the differences cancel each other out. It describes well the overall perceived brightness difference between the two images.
  • Absolute difference image – Img_{i,j} = \frac{|err_{i,j}|}{MaxAbsError} \cdot 255. Contrary to the above scalar metrics this one is an image: each pixel is |err_{i,j}|, and the whole image is scaled to full intensity. This image describes the spatial distribution of the errors.

Visually the most interesting is the difference image:


Reference vs. Optimized (ver1).

How to interpret it? The black areas are where err_{i,j} is zero, and the white areas are where |err_{i,j}| equals MaxAbsError. Homogeneous areas of the image match exactly; parts with intensity variations differ. Especially the middle ‘Lena’ part, which is a natural image containing some noise: it barely has any identical pixels. The error correlates highly with the input image on the synthetic parts (it looks like edge detection), and is uncorrelated, uniform noise on the natural parts (nothing resembling Lena can be recognized). I don’t include the other test image as it looks identical to the center part. Now the scalar metrics:

TestPat: diffPixCount: 192145 (18%), avgErr:   -0.182804
         maxAbsErr: 3,               avgAbsErr: 0.201715

Man:     diffPixCount: 955199 (91%), avgErr:   -0.974318
         maxAbsErr: 2,               avgAbsErr: 0.974318   

The diffPixCount is low for the TestPat because of the homogeneous areas, and high for the Man because the whole image is noisy. There’s one more interesting point though: for the second image the avgErr and avgAbsErr are equal in absolute value, and very close to 1. This means all error introduced by the optimization has a negative value, and the optimized image is roughly 1 grey level darker than the reference. This is no coincidence.

As you remember, the fixed point convolution results in an 8.8 format number, which is simply shifted right by 8 bits to get the integer part. Numerically the value is truncated, whereas the reference code does proper rounding. Each truncation loses 0.5 LSB on average, so 2 passes end up roughly 1 LSB lower. What happens if we add proper rounding to the fixed point code too? Rounding in this case means adding 0.5 LSB (128 in .8 fixed point format) before the truncation.


Reference vs. Optimized (ver2).

TestPat: diffPixCount: 49781 (4%),   avgErr:    0.00657845
         maxAbsErr: 2,               avgAbsErr: 0.0475349
Man:     diffPixCount: 151291 (14%), avgErr:    0.0150023
         maxAbsErr: 1,               avgAbsErr: 0.144282

This is much better! The avgErr is around 0.01 grey levels, and the optimized code generates at most 2 grey levels of difference. We could stop here, since 1-2 levels of difference on ~10% of the pixels is acceptable by many standards, but it can be improved.

What contributes to the remaining error?

  • The integer filter coefficients have much lower resolution. For sigma=0.8 the center coefficient fits in 7 bits, compared to the 23 mantissa bits of a float. With increased sigma this gets even worse.
  • Both of the optimized versions keep the temporary data between passes as 8 bit unsigned integers. This introduces quantization noise compared to the reference implementation (float tmp).

Solutions to these issues go hand in hand: as you will see, there’s not much sense in implementing one without the other, as each forms an equally narrow bottleneck that prevents better results. So I will do both: improved arithmetic in this post, and higher resolution temporaries in the next.

A better fixed point arithmetic

The current code works with 0.8 format coefficients (all values below one) and 8.0 format pixel data. The result of the multiplication is 8.8, and as long as the sum of the coefficients is 256, the additions won’t overflow 16 bits. Doing 8 pixels per instruction is a major contributor to performance, so we don’t want to give that up. Looking around in the Intel Intrinsics Guide, there are other multiplication instructions:

  • _mm_mullo_epi16(): the one in the code right now. Each 16 bit by 16 bit multiplication creates a 32 bit intermediate result, of which the lowest 16 bits are kept.
  • _mm_mulhi_epu16(): the same as above, but the highest 16 bits are kept. If the coefficients are in 0.16 format and pixels are 8.8 (fractional bits are zeros), the result will be 8.24. With the lower 16 bits stripped we get 8.8, and the additions can be performed as before. This is good: the coefficient resolution is doubled at the expense of one more shift before the multiplication. The problem is the same as at the beginning of the post: discarding the low 16 bits effectively truncates the result, and in this case it can’t be fixed by adding a constant, because the CPU performs the multiplication and truncation in one step.
  • _mm_mulhrs_epi16(): the right one! The description in the Intrinsics Guide is a bit cryptic, but basically it’s the same as the previous intrinsic, with signed variables and rounding. In fixed point terms: 1.15 format coefficients and 9.7 format pixels form 10.22 intermediates, which are rounded and shifted to make a 9.7 result.

So with the 3rd version we can boost the coefficient resolution by 7 more bits at the expense of one additional shift per load, and still have accurate rounding. The tight loop doesn’t change much (I only include the simplest of the 3 versions):

for (int y=0; y<height-3; y++)
{
  // load 8 pixels, widen to 16 bit and shift into 9.7 format
  p7 = _mm_slli_epi16(_mm_cvtepu8_epi16(
         _mm_loadl_epi64((const __m128i*)rptr2)), 7);
  rptr2 += width;

  __m128i out =             _mm_mulhrs_epi16( p1, c17);
  out = _mm_add_epi16( out, _mm_mulhrs_epi16( p2, c26));
  out = _mm_add_epi16( out, _mm_mulhrs_epi16( p3, c35));
  out = _mm_add_epi16( out, _mm_mulhrs_epi16( p4, c44));
  out = _mm_add_epi16( out, _mm_mulhrs_epi16( p5, c35));
  out = _mm_add_epi16( out, _mm_mulhrs_epi16( p6, c26));
  out = _mm_add_epi16( out, _mm_mulhrs_epi16( p7, c17));

  // round (add 0.5 LSB = 64 in .7 format), shift back to integer, pack to 8 bit
  out = _mm_add_epi16( out, _mm_set1_epi16(64));
  _mm_storel_epi64((__m128i*)wptr2,
    _mm_packus_epi16(_mm_srli_epi16(out, 7), _mm_setzero_si128()));
  wptr2 += width;

  //shift regs
  p1=p2; p2=p3; p3=p4;
  p4=p5; p5=p6; p6=p7;
}

Now for the comparison:


Reference vs. Optimized (ver3).

TestPat: diffPixCount: 19493 (1%),   avgErr:   -0.00024128
         maxAbsErr: 1,               avgAbsErr: 0.01859
Man:     diffPixCount: 149818 (14%), avgErr:    0.00281906
         maxAbsErr: 1,               avgAbsErr: 0.142878

Some things definitely improved: the maximum pixel intensity difference is 1 for both images, and the avgErr decreased by an order of magnitude. AvgAbsErr and diffPixCount (which now carry mostly the same information: with maxAbsErr at 1, avgAbsErr is simply the fraction of differing pixels) didn’t change significantly.

How’s performance?

As one might expect, there’s not much difference. The more accurate code is only about 20% slower, though even that is hard to justify, as the accuracy improvement so far is minimal. As I mentioned in the middle of the post, this is only half of the work: to see the real improvement, the temporary image bit depth must be increased as well. (All 3 measured versions use 2 out-of-place passes.)

The next post will be about increasing the temporary resolution and tricks to fuse the 2 passes together.

