Fast Gaussian filter for greyscale images (with SSE) – part 2

Parallel convolution – vertical

Implementing the vertical pass seems to be a piece of cake after the horizontal one: 4 registers hold the coefficients, 7 registers hold pixel data from consecutive rows, and then we just do the math. No need for in-register shifts. The tricky part is how to traverse the image. The horizontal pass was trivial: go line by line. But now we have 2 options:

  • Going vertically (down) looks fine: only 1 new row (8 pixels) is loaded per convolution, and it can be done in-place, but it's cache inefficient.
  • Going horizontally means reloading all 7 data registers for each convolution, and it can't be implemented in-place at all. The cache will love it, though.

The first option definitely looks better, except for the cache. Let’s see both and take the faster one.

Measuring performance

To compare the implementations I'll measure execution time. Since it depends on the overall system load and on power management (CPU frequency scaling, throttling), the tests will be performed 100 times and the mean and stdev calculated. The lower the stdev, the more accurate the measurement. Additionally, all functions will be measured in one run, and results will be compared to each other only within that run. Algorithms tend to behave differently on different data sets, so I'll use 2 test pictures. One is a synthetic image with no noise, large homogeneous areas, and various test patterns. The other is a photo, which inherently contains noise in the low-order bits but no abrupt intensity variations. Both come from the misc. section of the USC-SIPI image database:

Test pattern

Man
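
For reference, here's a minimal sketch of the kind of timing harness I mean. It's not the actual benchmark code; the callable and the run count are placeholders:

#include <chrono>
#include <cmath>
#include <utility>
#include <vector>

// Run the function under test 'runs' times, return {mean, stdev} in milliseconds.
template<typename F>
std::pair<double, double> benchmark(F&& filter, int runs = 100)
{
  std::vector<double> times;
  times.reserve(runs);
  for (int i = 0; i < runs; i++)
  {
    auto t0 = std::chrono::high_resolution_clock::now();
    filter();
    auto t1 = std::chrono::high_resolution_clock::now();
    times.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
  }

  double mean = 0.0;
  for (double t : times) mean += t;
  mean /= runs;

  double var = 0.0;
  for (double t : times) var += (t - mean) * (t - mean);

  return { mean, std::sqrt(var / runs) };
}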

So how do the current implementations perform?
[Chart: execution times of RefChar, RefFloat and hFilter on the two test images]

Test platform: Win7 64 bit, Core i5-2410M
Compiler: TDM-GCC 4.9.2 64 bit, flags: -std=c++11 -march=corei7 -O3
  • RefChar is the reference implementation presented in the previous post. It's the slowest, and there's a significant difference between the 2 test images. I don't know why; performance counters and disassembly could reveal what's going on.
  • RefFloat is the same code but operating purely on floats (no conversions, no rounding). It's way faster than the previous one; I suspect GCC optimized a bit in the background.
  • hFilter is blazing fast compared to the others, but it's only half of the work.

Vertical pass with vertical traversal

template<int width, int height>
void vFilter_vert( Img<unsigned char, width, height>& img, 
                   const std::vector<int>& coeff)
{
  // skipping setup code: rptr, wptr, c17, c26, c35, c44

  for (int x=0; x<width/8; x++)
  {
    unsigned char* rptr2 = rptr;
    unsigned char* wptr2 = wptr;

    //setup for the 1st output row: p1-p3 (rows above the image) are zeroed
    __m128i p1 = _mm_setzero_si128();
    __m128i p2 = _mm_setzero_si128();
    __m128i p3 = _mm_setzero_si128();
    __m128i p4 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
    __m128i p5 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
    __m128i p6 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
    __m128i p7;

    for (int y=0; y<height-3; y++)
    {
      p7 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); 
      rptr2+=width;

      __m128i out =             _mm_mullo_epi16( p1, c17);
      out = _mm_add_epi16( out, _mm_mullo_epi16( p2, c26));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p3, c35));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p4, c44));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p5, c35));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p6, c26));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p7, c17));
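      // the coefficients are assumed to sum to 256, so >>8 normalizes the
      // weighted sum back to 8 bits; saturate-pack and store the low 8 bytes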
      _mm_storel_epi64((__m128i*)wptr2, 
        _mm_packus_epi16(_mm_srli_epi16(out, 8), _mm_setzero_si128()));
      wptr2 += width;

      //shift regs
      p1=p2; p2=p3; p3=p4;
      p4=p5; p5=p6; p6=p7;
    }
    
    // skipping last 3 hand unrolled rows
  }
}

Even though I removed some parts of the code to keep the listing short, we can see that it's indeed simpler than the horizontal pass. The inner loop takes an 8 pixel wide column and performs convolutions from top to bottom, while the outer loop walks over these columns from left to right. The inner loop is really minimal: apart from the convolution, only a vector load/store pair and some housekeeping is performed. Looks good, it's in-place, it's compact. Let's see the other version with the loops interchanged…

Vertical pass with horizontal traversal

template<int width, int height>
void vFilter_horiz( Img<unsigned char, width, height>& in, 
                    Img<unsigned char, width, height>& out, 
                    const std::vector<int>& coeff)
{
  // skipping setup code: rptr, wptr, c17, c26, c35, c44
  // skipping first 3 hand unrolled rows

  for (int y=0; y<height-6; y++)
  {
    for (int x=0; x<width/8; x++)
    {
      unsigned char* rptr2 = rptr;

      __m128i p1 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
      __m128i p2 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
      __m128i p3 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
      __m128i p4 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
      __m128i p5 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
      __m128i p6 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2)); rptr2+=width;
      __m128i p7 = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)rptr2));

      __m128i out =             _mm_mullo_epi16( p1, c17);
      out = _mm_add_epi16( out, _mm_mullo_epi16( p2, c26));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p3, c35));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p4, c44));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p5, c35));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p6, c26));
      out = _mm_add_epi16( out, _mm_mullo_epi16( p7, c17));
      _mm_storel_epi64((__m128i*)wptr, _mm_packus_epi16(_mm_srli_epi16(out, 8), _mm_setzero_si128()));
      wptr+=8;
      rptr+=8;
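      // after width/8 blocks, rptr and wptr have advanced by exactly one row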
    }
  }
  
  // skipping last 3 hand unrolled rows
}

I also skipped some code here (specializations of the tight loop for the top and bottom 3 rows) to keep the listing short. The first thing to note is that this version is out-of-place, which is necessary because the direction of traversal differs from the direction of the filter, so we can't carry the neighbors along in registers. The tight loop is similar to the previous version with one major difference: all convolution input registers (p1-p7) are reloaded in each iteration, leading to roughly 7 times as many reads overall. This looks worse than the previous version, so what do the numbers say?
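
To put rough numbers on the steady-state loops (boundary rows ignored): the vertical traversal loads 1 new row of 8 pixels and stores 8 pixels per output vector, roughly 16 bytes of traffic, while the horizontal traversal loads 7 × 8 = 56 bytes and stores 8, roughly 64 bytes per output vector.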

[Chart: execution times of the vertical pass with vertical vs. horizontal traversal]

Well, this is interesting. The results and our expectations didn't quite match. Judging from the C++ code alone, the horizontal traversal is less efficient, but its cache friendliness compensates so much that it's almost twice as fast as the vertical traversal. I'll write a separate post about this later. The lesson for now: never judge by looking at the code alone; instead: profile, profile, profile!

Using the 2 new vertical pass functions and the horizontal pass from the previous post, we can finally put together the optimized Gaussian filter. One variant is completely in-place; the other is built on 2 out-of-place passes but exposes the same in-place API as the reference implementation.
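
A rough sketch of how the two variants can be wired up (signatures are assumed to mirror the listings above; hFilter is the in-place horizontal pass from the previous post, while hFilter_oop and the scratch image are assumptions about the wrapper, not the actual code):

// Fully in-place variant: in-place horizontal pass, then the in-place
// vertical pass (vertical traversal) shown above.
template<int width, int height>
void gauss_inplace(Img<unsigned char, width, height>& img,
                   const std::vector<int>& coeff)
{
  hFilter(img, coeff);
  vFilter_vert(img, coeff);
}

// Variant built on two out-of-place passes, wrapped to keep the reference
// implementation's in-place API.
template<int width, int height>
void gauss_outofplace(Img<unsigned char, width, height>& img,
                      const std::vector<int>& coeff)
{
  static Img<unsigned char, width, height> tmp;   // scratch image (assumption)
  hFilter_oop(img, tmp, coeff);    // horizontal pass: img -> tmp
  vFilter_horiz(tmp, img, coeff);  // vertical pass:   tmp -> img
}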

[Chart: execution times of the reference implementation and the two optimized Gaussian filters]

As we can see, both are much faster than the reference implementation, but the difference between the 2 optimized versions is less pronounced. That's probably because the out-of-place passes touch twice as much memory. Again, performance counters could reveal the truth, but I won't dig that deep.

In the next post I’ll cover the accuracy of the optimized code and how to improve that.
