Why do convolutional neural networks work so well for computer vision?

My friend Peter recently asked that my next post on convolutional neural networks address the question of exactly why they are so effective for computer vision as opposed to other architectures. Note that this is distinct from the question of how CNNs work. The latter question is about the computational procedure of a CNN; its answer is a description of the sequence of steps carried out by a computer in training and predicting with a CNN. The former asks for the theoretical justification for applying that sequence of steps to a particular domain. The answer I provide here will take the form of interpreting the how in a way that illuminates the why, so I’ll end up answering both questions to some degree, but the main focus is on unpacking the causal story of how the algorithm addresses the unique characteristics of a computer vision problem.

The difficulties of computer vision

To understand why CNNs are good at addressing the problems of computer vision, we should first lay out at least some of the major problems of computer vision, as they differ from those of other AI problems.

Difficulty 1: The raw data are structured as feature matrices or tensors rather than feature vectors.

It’s common parlance to describe image data as unstructured, so perhaps it seems odd to talk about how the data are structured. There’s a sense in which image data are unstructured, and we’ll get to that later, but images are arranged as two-dimensional arrays of pixels, each of which is assigned a tuple giving a value for each channel in the image data. That’s structure.

Think of the spreadsheet-type data structure found in Excel or a SQL database. Each row corresponds to one observation, and each column corresponds to either one of the raw features or a label. We can think of each observation as a (potentially many-to-many) mapping of label vectors onto feature vectors. We can actually represent grayscale images this way, with each feature column representing a single pixel of the image, but that just doesn’t correspond to what images are to us in the real world.

When we look at a photograph, we don’t look at a one-dimensional, possibly unordered, set of pixels, but a two-dimensional, necessarily ordered, matrix of pixels. There’s a vertical dimension and horizontal dimension, and both the horizontal and vertical orderings are important to how we interpret the photograph.

When we generalize the problem from grayscale to color images, the dimensionality increases yet again. Instead of each pixel taking a single numeric value, it takes one value for each color channel, e.g. a 3-tuple of numbers for the RGB color channels, and each observation is then a three-dimensional tensor of raw features that gets assigned a vector of labels.
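To make the shapes concrete, here is a tiny sketch (assuming NumPy, with a 28×28 image size chosen purely for illustration):

```python
import numpy as np

grayscale = np.zeros((28, 28))     # 2-D matrix: height x width, one value per pixel
rgb = np.zeros((28, 28, 3))        # 3-D tensor: height x width x 3 color channels

# Flattening the grayscale image into a "spreadsheet row" of 784 features is
# possible, but it throws away the row/column arrangement of the pixels:
flattened = grayscale.reshape(-1)  # shape (784,)
print(grayscale.shape, rgb.shape, flattened.shape)
```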

Difficulty 2: Predictive features are localized regions of pixels and their spatial relations to other localized regions of pixels.

In truth, this is a much richer, much more tightly constrained structure than spreadsheet data. Now, we have embedded relations between features into the structure of the data itself. Raw features that are in closer proximity to each other are more likely to be highly correlated with one another, which means that the predictive features of the image, e.g. the cat’s whiskers or the dog’s floppy ears, are localized regions within the raw feature tensor. The feature’s value is important as well as its spatial relations to other features. We have a natural way of relating raw features to one another imposed on us by the nature of what an image is.

Notice what’s not mentioned here: the specific location of predictive features within the raw feature space. We might find the cat’s whiskers in the top left corner of the image, or in the center, or anywhere else. Or, if we’re looking at the cat’s butt, we might not see its whiskers at all. Which predictive features are present and where they are located within our space of value assignments varies from observation to observation. This is a radical departure from the situation where Column A always represents gender, Column B always represents age, etc. And this is where the unstructured-ness of image data enters the picture. The predictive features can be anywhere in our data; an algorithm that learns them has to be able to pick them out no matter where they are. They can be oriented and scaled to any size, translated and twisted into any contortion, and yet we still have to be able to identify them as whiskers or ears or whatever they happen to be.

What does matter is how each feature relates to the other features. We know that the whiskers, if shown, should be in some positional relation to the ears, eyes, and nose. The tail should be on one end of the cat, and the head at the other end. And then there are second-order relations between the relations, as well as third-order relations between the second-order relations, and so on, up as far as we need to go to account for any arbitrarily high level of complexity we might find in a visual data set.

How the convolutional segment addresses these problems.

As I described here, a CNN is composed of a convolutional segment, a series of convolution and pooling layers, followed by a series of fully connected layers as seen in a vanilla MLP. These layers may be interspersed with regularization layers, but since the regularization layers do not contribute to the interpretation of CNNs as computer vision algorithms (they bring something different to the table), we won’t really go into those here. We’ll also set aside the vanilla segment, since MLPs exist all over the place, and a detailed look at them isn’t going to illuminate what makes CNNs different with respect to computer vision problems.
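To fix ideas, here is a minimal sketch of that overall shape written with tf.keras; the post doesn’t name a framework, so the library choice and every layer size below are illustrative assumptions, not a prescription:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Convolutional segment: alternating convolution and pooling layers
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Vanilla MLP segment: flatten, then fully connected layers
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # e.g. 10 output classes
])
model.summary()
```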

Convolutional layers

Each convolutional layer slides a mathematical object called a filter across the image, convolving the image with the filter at each step. A filter is simply a matrix of weights that is used to adjust channel values at each pixel in a localized region of the image. For example:

\begin{bmatrix}  2 & 10 & 60 \\  20 & 80 & 4 \\  30 & 10 & 3  \end{bmatrix}  *  \begin{bmatrix}  0 & -1 & 1 \\  -1 & 1 & -1 \\  1 & -1 & 0  \end{bmatrix}

The matrix on the left represents a 3×3 localized region of pixels in the image, and the matrix on the right represents a filter. Note that convolution is not normal matrix multiplication; it’s just a position-by-position sumproduct, or:

(2\times 0) + (10\times -1) + (60\times 1) + \cdots + (30\times 1) + (10\times -1) + (3\times 0) = 126

The entire 3×3 array of pixels then gets replaced with 126 at the position of the center pixel. But what is this actually doing? Looking at the filter, we see that the filter output is larger when pixels along the diagonal have larger values, but it gets penalized when the surrounding diagonals (where the -1s are) take larger values. Pixels in the same positions as the 0s make no contribution to the filter output. This filter is effectively picking out the feature of a diagonal line; its output is strongest when the input image has a strong diagonal with weak surrounding pixels.
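If it helps to see it concretely, the same sumproduct can be checked with a couple of lines of NumPy (assumed here purely for illustration):

```python
import numpy as np

region = np.array([[ 2, 10, 60],
                   [20, 80,  4],
                   [30, 10,  3]])
kernel = np.array([[ 0, -1,  1],
                   [-1,  1, -1],
                   [ 1, -1,  0]])

# Position-by-position product, then sum -- not matrix multiplication.
print(np.sum(region * kernel))  # 126
```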

Filters effectively pick out patterns of pixels in a localized region of the image, i.e. they pick out our features. But these features can potentially be located anywhere in the image; as noted earlier, that’s the sense in which we’re licensed to call the image data unstructured. To discover them no matter where they are, we slide the filter across the image, bit by bit, convolving and replacing values at each step. What we get out at the end is a modification of the original image that displays the filtered features more strongly than the original image did. What the neural network learns during training is the optimal weights to be used in filtering, which really means that it’s learning the optimal features that the convolved image should highlight. Here’s an example of how an edge detection filter transforms the image:

Pretty cool, right? By a very simple mathematical operation applied iteratively over the entire image, we’re able to isolate the basic shape of the object in the image and split it off from other things such as lighting, color, texture, etc.
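For readers who want the sliding mechanics spelled out, here is a minimal NumPy sketch (assuming a single-channel image, no padding, and a stride of one); the edge-detection-style kernel at the end is just one illustrative choice of weights, not the exact filter used in the figure above:

```python
import numpy as np

def apply_filter(image, kernel):
    """Slide the kernel across the image, taking a position-by-position
    sumproduct at each step (what CNN convolutional layers compute)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(region * kernel)
    return output

# A classic edge-detection-style (Laplacian-like) kernel, for illustration:
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])
```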

By using two-dimensional filters that slide across the two-dimensional image data, we’re able to address Difficulty 1 above. Since filters apply to an entire region of pixels, they allow us to treat entire regions of pixels as predictive features, which is one part of Difficulty 2.

Pooling Layers

Max pooling is the most common, so that’s what we’ll focus on here. Average pooling also exists, but it’s not different enough from max pooling to require separate treatment for our purposes. Max pooling layers are used to discard the weakest features and retain the strongest ones. It’s a pretty simple idea: each “pool” is an n\times n array of pixels, and all we do in max pooling is replace the entire pool with the maximum value across all pixels in the pool.

There are two assumptions here:

(1) Only the strongest features matter (this has been verified empirically; performance does not suffer from max pooling). In the illustration above, the orange 0, 1, and 4 have been dropped entirely; only the 6 is regarded as important enough to retain.

(2) The exact position of the main features isn’t important; what’s important is their rough position relative to each other. In the above illustration, we’ve moved the orange 6 from position (2,2) to position (1,1), the green 8 from (4,2) to (2,1), and so on for the others. The exact coordinates are not preserved in the transformation, and in fact neither are the exact relative positions of each main feature to the others. What is preserved is their rough or approximate relative positions. The orange 6 is still somewhere above the blue 3 and to the left of the green 8, but we’ve lost some information about the exact distances and relative directions. This, obviously, is the second part of Difficulty 2 outlined above.
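Here is a minimal sketch of 2×2 max pooling in NumPy (the array values below are arbitrary, not the ones from the illustration above): each non-overlapping 2×2 pool is replaced by its maximum value, halving both spatial dimensions.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 block with its maximum value."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop an odd trailing row/column if present
    pooled = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return pooled.max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [0, 2, 8, 5],
              [3, 1, 2, 7]])
print(max_pool_2x2(x))
# [[6 2]
#  [3 8]]
```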

The visual effect this has on the image is a downsampling, like this:

We can even see visually here that, despite having thrown out a lot of pixels, the image on the right is still recognizable as a representation of the same features as the image on the left, albeit a lower-resolution representation. The stuff that got thrown out really is irrelevant, even to humans!
