Type-erasure and lambda expressions

Type-erasure is a well-known and well-documented topic in C++. Just google ‘type-erasure C++’ and the first page of results will present articles covering the topic, from simple examples to advanced scenarios. Oh, and be sure to include C++ in the query, since type-erasure in Java means something completely different.

Many of these articles create some kind of Variant type that can hold objects whose types are defined and known up front. In this post I’m creating something that can hold and later invoke a capturing lambda expression (whose type is not specified by the standard). This is what std::function does (and much more), and I was wondering what magic might happen inside std::function. There are no new concepts in this post compared to those articles, but handling lambdas is something (I think) worth writing about.
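To give a rough idea of the technique, here is a minimal sketch of a wrapper that can store and later invoke a capturing lambda. The names IntCallback, Concept, Model and makeAdder are made up for this illustration, the signature is fixed to int(int), and this is not how any particular std::function implementation is written (real ones add small-buffer optimization, copying, and arbitrary signatures).

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical minimal wrapper for callables with the signature int(int).
class IntCallback {
  // Abstract interface: the only thing the wrapper knows about the stored callable.
  struct Concept {
    virtual ~Concept() = default;
    virtual int invoke(int arg) = 0;
  };

  // One Model<F> is instantiated per lambda type; F is hidden ("erased") behind Concept.
  template <typename F>
  struct Model final : Concept {
    explicit Model(F f) : fn(std::move(f)) {}
    int invoke(int arg) override { return fn(arg); }
    F fn;  // the lambda object itself, captures included
  };

  std::unique_ptr<Concept> impl_;

public:
  // Accepts any callable; its concrete type is only visible inside this constructor.
  template <typename F>
  IntCallback(F f) : impl_(std::make_unique<Model<F>>(std::move(f))) {}

  int operator()(int arg) {
    assert(impl_ && "calling a moved-from IntCallback");
    return impl_->invoke(arg);  // virtual dispatch to the right Model<F>
  }
};

// Builds a wrapper around a capturing lambda; the caller never sees the lambda's type.
inline IntCallback makeAdder(int offset) {
  return IntCallback([offset](int x) { return x + offset; });
}
```

With this in place, makeAdder(5) returns an IntCallback that adds 5 to its argument, even though the lambda's type never appears in the function's return type. The price is one heap allocation and one virtual call per invocation.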

Fast Gaussian filter for greyscale images (with SSE) – part 2

Parallel convolution – vertical

Implementing the vertical pass seems to be a piece of cake after the horizontal one: 4 registers hold the coefficients, 7 registers hold the pixel data from consecutive rows, and then we do the math. No need for in-register shifts. The tricky part is how to traverse the image. The horizontal pass was trivial: go line by line. But now we have 2 options:

  • Going vertically (down) looks fine: only 1 row (8 pixels) is loaded for each convolution and it works in-place, but it is cache inefficient.
  • Going horizontally means reloading all 7 data registers for each convolution, and it can’t be implemented in-place at all. The cache will love it, though.

The first option definitely looks better, except for the cache. Let’s see both and take the faster one.
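The two traversal orders can be sketched in plain scalar code. This is only an illustration, not the SSE implementation from the post: it assumes a float image in a flat row-major buffer, a 7-tap kernel, and border handling by replicating the first and last row. The function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed border rule for this sketch: replicate the first/last row.
inline int clampRow(int y, int height) {
  if (y < 0) return 0;
  if (y >= height) return height - 1;
  return y;
}

// Horizontal traversal: memory is walked row by row (cache friendly), but every
// output reads all 7 source rows again and a separate destination is required.
inline std::vector<float> verticalPassRowOrder(const std::vector<float>& src,
                                               int width, int height,
                                               const float (&k)[7]) {
  assert(width > 0 && height > 0);
  assert(src.size() == static_cast<std::size_t>(width) * height);
  std::vector<float> dst(src.size());
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x) {
      float sum = 0.0f;
      for (int i = 0; i < 7; ++i)
        sum += k[i] * src[clampRow(y + i - 3, height) * width + x];
      dst[y * width + x] = sum;
    }
  return dst;
}

// Vertical traversal: a 7-element window slides down one column, so only one new
// value is loaded per output and the result can overwrite the input. Consecutive
// accesses are a whole row apart, which is what hurts the cache.
inline void verticalPassColumnOrderInPlace(std::vector<float>& img,
                                           int width, int height,
                                           const float (&k)[7]) {
  assert(width > 0 && height > 0);
  assert(img.size() == static_cast<std::size_t>(width) * height);
  for (int x = 0; x < width; ++x) {
    float win[7];
    for (int i = 0; i < 7; ++i)  // prime the window for y = 0
      win[i] = img[clampRow(i - 3, height) * width + x];
    for (int y = 0; y < height; ++y) {
      float sum = 0.0f;
      for (int i = 0; i < 7; ++i) sum += k[i] * win[i];
      for (int i = 0; i < 6; ++i) win[i] = win[i + 1];     // drop the oldest row
      win[6] = img[clampRow(y + 4, height) * width + x];   // load the one new row
      img[y * width + x] = sum;  // safe: this row is already held in the window
    }
  }
}
```

Both functions compute the same result; they differ only in memory access pattern and in whether a second buffer is needed, which is exactly the trade-off described above.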

Fast Gaussian filter for greyscale images (with SSE) – part 1

A few months ago I was working on speeding up some image processing code. It was quite interesting, especially the Gaussian filter. I think it’s a good read for anyone interested in C++ code optimization.

So what does this Gaussian filter do?

The original code implements the filter as a separable (2 pass) convolution. It’s quite simple:

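#include <cmath>   // std::round
#include <vector>

// Separable Gaussian blur: a horizontal 1D pass followed by a vertical 1D pass.
// Img is the image class of the original code; getPixel is assumed to handle
// coordinates that fall outside the image.
template<int width, int height>
void gauss( Img<unsigned char, width, height>& img, 
            const std::vector<float>& coeff)
{
  // coeff is expected to have an odd number of elements (2*halfWindow + 1)
  const int halfWindow = static_cast<int>(coeff.size()/2);
  Img<float, width, height> tmp;

  // horizontal pass: img -> tmp
  for (int y=0; y<height; y++)
    for (int x=0; x<width; x++)
    {
      float val = 0;
      for (int i=-halfWindow; i<=halfWindow; i++)
        val += img.getPixel(x+i, y) * coeff[i+halfWindow];
      tmp.setPixel(x, y, val);
    }

  // vertical pass: tmp -> img
  for (int y=0; y<height; y++)
    for (int x=0; x<width; x++)
    {
      float val = 0;
      for (int i=-halfWindow; i<=halfWindow; i++)
        val += tmp.getPixel(x, y+i) * coeff[i+halfWindow];
      // round to nearest before converting back to an 8-bit pixel
      img.setPixel(x, y, static_cast<unsigned char>(std::round(val)));
    }
}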
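
The coeff vector is the 1D kernel shared by both passes. As a sketch of how such a kernel might be built, assuming it is a sampled Gaussian normalized to sum to 1 (the original code does not show this, and the helper name makeGaussianCoeffs and its parameters are made up here):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: samples exp(-i^2 / (2*sigma^2)) at i = -halfWindow..halfWindow
// and normalizes the samples so that they sum to 1 (keeps overall brightness).
inline std::vector<float> makeGaussianCoeffs(float sigma, int halfWindow) {
  assert(sigma > 0.0f && halfWindow >= 0);
  std::vector<float> coeff(2 * halfWindow + 1);
  float sum = 0.0f;
  for (int i = -halfWindow; i <= halfWindow; ++i) {
    const float w = std::exp(-static_cast<float>(i * i) / (2.0f * sigma * sigma));
    coeff[i + halfWindow] = w;
    sum += w;
  }
  for (float& c : coeff) c /= sum;  // normalize
  return coeff;
}
```

A call such as makeGaussianCoeffs(1.0f, 3) yields a symmetric 7-element kernel, so only 4 of its values are distinct.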
