I’ve always been somewhat dissatisfied with the gap between good online resources about the math behind deep learning and good online resources on how to actually write your code. Good all-in-one tutorials on how to think about architecture decisions, parameter tuning, and the general art of modeling are rare and often dismissed as something that comes with experience.
But those of us who have experience can share our mental process with others. There are surely other data scientists who will approach things differently than I do, and there really is no one-size-fits-all answer, but I’ll do my best to outline a general thought process for designing a convolutional neural network, with particular attention paid to the meaning of various parameters and a qualitative description of how they can enhance a neural network in different situations. The importance of intuition that follows from experience can’t be overstated, but I also strongly believe that newbies will gain experience more quickly, and have better results along the way, by witnessing how others think about problems.
Here, we’re going to work in Keras with a TensorFlow backend, because that’s what I use in practice. But a lot of the discussion applies more broadly. The code snippets are verbose for transparency and explanatory reasons; see the Keras documentation for which arguments are optional. I’ll assume familiarity with a lot of basic concepts and vocabulary, since it’s very easy to find this kind of help online.
Data do not get truly pre-processed in Keras. Instead, preprocessing and image augmentation are performed as part of the model training process. Serving augmented images just in time reduces memory overhead, which is important to offset the memory TensorFlow uses during training, and especially important when training many networks in parallel on the same machine. Augmented images may be written to a directory if so desired, by passing a path as save_to_dir.
    from keras.preprocessing.image import ImageDataGenerator

    img_rows, img_cols = 150, 150
    train_dir = 'data/processed/train'
    test_dir = 'data/processed/test'
    augment_dir = None  # set to a directory path to save augmented images

    datagen = ImageDataGenerator(
        featurewise_center=False,
        samplewise_center=False,
        featurewise_std_normalization=False,
        samplewise_std_normalization=False,
        zca_whitening=False,
        zca_epsilon=1e-6,
        # augmentations
        rotation_range=0.,
        width_shift_range=0.,
        height_shift_range=0.,
        shear_range=0.,
        zoom_range=0.,
        channel_shift_range=0.,
        fill_mode='nearest',
        cval=0.,
        horizontal_flip=False,
        vertical_flip=False,
        rescale=1./255,
        preprocessing_function=None)

    # Each class is stored in a subdirectory of *_dir. Class names are taken
    # from directory names unless specified with the optional classes argument.
    train_generator = datagen.flow_from_directory(
        train_dir,
        target_size=(img_rows, img_cols),
        color_mode='rgb',
        batch_size=32,
        class_mode='categorical',
        seed=7,
        save_to_dir=augment_dir,
        save_prefix=None,
        save_format='png')

    test_generator = datagen.flow_from_directory(
        test_dir,
        target_size=(img_rows, img_cols),
        color_mode='rgb',
        batch_size=32,
        class_mode='categorical',
        seed=7,
        save_to_dir=augment_dir,
        save_prefix=None,
        save_format='png')
In raw images, we cannot be assured that the ranges of values along each feature are equal. Because the learning rate applies multiplicatively, unequal ranges cause unequal corrections for each feature, effectively weighting some features more than others during gradient descent. To control for this effect, we center and scale all features.
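To make the centering and scaling step concrete, here is a minimal pure-Python sketch that does not depend on Keras. The `standardize` helper and the two toy features are my own illustration (not part of the Keras API); they just show two features on very different scales ending up with mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev

def standardize(values):
    """Center a feature on zero and scale it to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature has no spread to scale; just center it.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Two hypothetical features on very different scales.
brightness = [0.0, 64.0, 128.0, 255.0]
ratio = [0.1, 0.2, 0.3, 0.4]

scaled_brightness = standardize(brightness)
scaled_ratio = standardize(ratio)
print(scaled_brightness)
print(scaled_ratio)
```

After this step, a single learning rate produces corrections of comparable size for both features.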
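ZCA whitening can be used at your discretion. It’s similar to PCA, except the new image is in the same feature space as the old. ZCA whitening can serve to highlight features and structures, making them easier to learn. Higher ZCA epsilon yields higher visibility of edges. Note that ZCA whitening, like the other featurewise options, relies on statistics computed from the data, so call datagen.fit on a sample of training images first if you turn it on.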
Augmentations create random rotations, translations, etc., of images. They usually improve performance, but the amount of improvement varies widely, so they are usually worth checking out.
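As a rough picture of what an augmentation does to the pixels, here is a small pure-Python sketch of a horizontal flip and a horizontal shift on a tiny grayscale image stored as nested lists. The helper names are hypothetical and this is not how Keras implements augmentation internally; the shift simply repeats the edge pixel, in the spirit of fill_mode='nearest'.

```python
import random

def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels):
    """Shift the image right, repeating the left edge pixel to fill the gap."""
    if pixels <= 0:
        return [list(row) for row in image]
    shifted = []
    for row in image:
        n = min(pixels, len(row))  # never shift further than the row width
        shifted.append([row[0]] * n + row[:len(row) - n])
    return shifted

def random_flip(image, rng):
    """Flip the image with probability 0.5, otherwise return a copy."""
    if rng.random() < 0.5:
        return horizontal_flip(image)
    return [list(row) for row in image]

image = [[1, 2, 3],
         [4, 5, 6]]

print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
print(shift_right(image, 1))   # [[1, 1, 2], [4, 4, 5]]
print(random_flip(image, random.Random(7)))
```

Because the transformations are drawn at random for every batch, the network rarely sees exactly the same image twice.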
A normal learning rate will have difficulty converging with channel values ranging from 0 to 255, so we rescale them to the range 0 to 1.
An additional preprocessing function that takes a rank 3 numpy tensor and returns a rank 3 numpy tensor can be used.
Creating the model architecture
CNNs are typically composed of a series of convolutional layers that learn complex features of images, followed by a series of fully connected layers that perform the classification. I like keeping these two segments conceptually separate, but it’s important to balance them against each other. You don’t want to waste all of your “overfitting resistance” on the convolutional segment, since then you’ll only overfit even worse when you get to the fully connected segment, but you don’t want to use convolutional layers too sparingly, or you’ll miss out on learning the more complex features of your data.
It’s a lot of guess and check: experimenting with adding and removing layers, and modifying parameters until you find the architecture that works right for your problem. A common approach, which I use, is to concentrate only on training error when developing my initial architecture, with the goal of overfitting. Then, I’ll cycle back and address the overfitting problem by tuning hyperparameters and introducing various regularization techniques. It takes some finesse and patience. Keep good records in your lab notebook.
    from keras.models import Sequential
    # Convolution2D is the older alias for Conv2D.
    from keras.layers import Convolution2D, MaxPooling2D
    from keras.layers import Activation, Dropout, Flatten, Dense
Convolutional layers should follow the following pattern: convolution > activation > optional pooling.
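    model = Sequential()

    # First convolutional layer: convolution > activation > pooling.
    # The TensorFlow backend defaults to channels-last: (rows, cols, channels).
    model.add(Convolution2D(filters=32,
                            kernel_size=3,
                            strides=(3, 3),  # strides > 1 downsample, even with 'same' padding
                            input_shape=(img_rows, img_cols, 3),
                            padding='same',
                            dilation_rate=(1, 1)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))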
The filters argument specifies the number of filters to use, which also dictates the depth (number of channels) of the output. Each filter acts as a node. We don’t have to assign the values of each filter; that’s what the neural net learns for us.
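To get a feel for how much the network has to learn per filter, here is a small sketch that counts the trainable parameters of a 2D convolutional layer. The `conv2d_param_count` helper is my own (hypothetical) name; the formula is the standard one, where each filter has one weight per kernel position per input channel, plus one bias.

```python
def conv2d_param_count(filters, kernel_size, in_channels, use_bias=True):
    """Number of trainable parameters in a square-kernel 2D convolution."""
    weights_per_filter = kernel_size * kernel_size * in_channels
    biases = filters if use_bias else 0
    return filters * weights_per_filter + biases

# 32 filters of size 3x3 applied to an RGB image (3 input channels).
print(conv2d_param_count(filters=32, kernel_size=3, in_channels=3))  # 896
```

Note that the count does not depend on the image size, only on the kernel size, the input depth, and the number of filters.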
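Kernel size is the size of each filter, and strides is the step size with which the kernel moves across the image. A stride of 1 visits every position, while larger strides skip positions and shrink the output.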
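When applying kernels larger than size 1 without padding, the image size will reduce after each convolution. To avoid this, set padding to ‘same’, which pads the edges so that, at a stride of 1, the output has the same height and width as the input.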
Dilation allows the learning algorithm to take in a broader space, in order to make connections between disparate parts of the image. A variety of regularization options are available that are not shown here.
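The interaction between kernel size, strides, padding, and dilation is easiest to see by computing the output size along one dimension. The sketch below follows the output-size rules TensorFlow documents for 'same' and 'valid' padding; the `conv_output_size` helper itself is a hypothetical name of mine, not a Keras function.

```python
import math

def conv_output_size(size, kernel_size, stride=1, padding="valid", dilation=1):
    """Output length along one spatial dimension of a convolution."""
    if padding == "same":
        # Padding is chosen so that only the stride shrinks the output.
        return math.ceil(size / stride)
    if padding == "valid":
        # A dilated kernel spans dilation * (kernel_size - 1) + 1 pixels.
        effective_kernel = dilation * (kernel_size - 1) + 1
        return math.ceil((size - effective_kernel + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")

print(conv_output_size(150, 3, stride=1, padding="valid"))              # 148
print(conv_output_size(150, 3, stride=1, padding="same"))               # 150
print(conv_output_size(150, 3, stride=3, padding="same"))               # 50
print(conv_output_size(150, 3, stride=1, padding="valid", dilation=2))  # 146
```

So 'same' padding only preserves the image size at a stride of 1, and dilation widens the area each kernel covers without adding weights.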
ReLU is used as a default activation function in order to aid with convergence. You can also apply the activation function right inside the convolutional layer, but I like to keep them separate for more transparency.
Pooling helps guard against overfitting, but causes some information loss. It is an optional step to close out each convolutional layer. It represents the idea that the absolute position of a feature is less important than its position relative to other features. Max pooling is common, and it can be thought of as choosing the most prominent feature in a given proximity. Increasing the pool size increases regularization, but loses more information.
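Here is a minimal pure-Python sketch of max pooling on a single feature map, to show the “most prominent feature in a given proximity” idea. It assumes non-overlapping windows (the stride equals the pool size) and drops any leftover rows or columns; `max_pool_2d` is a hypothetical helper, not the Keras layer.

```python
def max_pool_2d(image, pool_size=2):
    """Max pooling over non-overlapping pool_size x pool_size windows."""
    rows = len(image) // pool_size
    cols = len(image[0]) // pool_size
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            # Collect every value inside the current window.
            window = [image[r * pool_size + i][c * pool_size + j]
                      for i in range(pool_size)
                      for j in range(pool_size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

print(max_pool_2d(feature_map, pool_size=2))  # [[4, 2], [2, 7]]
print(max_pool_2d(feature_map, pool_size=4))  # [[7]]
```

The larger pool keeps only one value out of sixteen, which is the trade-off between stronger regularization and more information loss.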
How to build it
Any number of convolutional layers can be added to the network, with or without pooling. We continue adding convolutional layers until there are minimal gains in training accuracy; for many of the problems I’ve worked on, that has meant no more than two convolutional layers. The number of convolutional layers can be determined with the elbow method, some sort of significance test, or using your data science spidey sense to intuit whether an increase in accuracy is significant. If you’re in a context where computational speed matters a lot, you probably want to consider that, too.
Start with a large number of filters, then prune them later. This is the “overfit first, then address the overfitting later” philosophy.
Favor using a larger number of small kernels over a smaller number of large kernels. This is more computationally efficient while also making it easier for the network to pick up on features.
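A quick weight count illustrates the efficiency argument. The layer sizes below are made up for the comparison, and `conv_weight_count` is a hypothetical helper; the point is only that doubling the number of filters while shrinking the kernel from 5x5 to 3x3 still leaves fewer weights.

```python
def conv_weight_count(filters, kernel_size, in_channels):
    """Number of kernel weights (ignoring biases) in a 2D convolution."""
    return filters * kernel_size * kernel_size * in_channels

in_channels = 32  # assumed depth of the incoming feature maps

many_small = conv_weight_count(filters=64, kernel_size=3, in_channels=in_channels)
few_large = conv_weight_count(filters=32, kernel_size=5, in_channels=in_channels)

print(many_small)  # 18432
print(few_large)   # 25600
```

The multiply-adds per output position scale the same way, so the layer with more, smaller kernels is also cheaper to evaluate.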
Don’t be too afraid of information loss through pooling. Features in close proximity to one another within the image tend to be highly correlated, so pooling generally doesn’t cause a huge loss in accuracy, but the reduction in dimensionality means less computation.
Try not to drop the number of filters too much from one layer to the next. This is especially important in the convolutional segment of the network, since it’s effectively throwing away features that will never make it to the fully connected portion of the network where classification is done.
Fully Connected Sub-NN
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(units=64,
                    kernel_initializer='glorot_uniform',
                    bias_initializer='zeros',
                    kernel_regularizer=None,
                    bias_regularizer=None,
                    kernel_constraint=None,
                    bias_constraint=None))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
After adding all convolutional layers, the architecture of the network switches to the following pattern: fully connected layer > activation > optional dropout.
First, we have to reshape the features back into a single dimension with a Flatten layer. There are no parameters to set here; Keras will just do it. You can put in a dropout layer beforehand if you like.
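It helps to know how long the flattened vector is, because that determines how many weights the first fully connected layer needs. The sketch below assumes the shapes used in the snippets above (a 150x150 input, one ‘same’ convolution with stride 3 and 32 filters, then 2x2 max pooling, then 64 dense units); the helper names are hypothetical.

```python
import math

def flattened_size(height, width, channels):
    """Length of the vector produced by flattening a stack of feature maps."""
    return height * width * channels

def dense_param_count(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

# Assumed pipeline: 150 pixels -> 'same' conv with stride 3 -> 2x2 pooling.
side_after_conv = math.ceil(150 / 3)   # 50
side_after_pool = side_after_conv // 2  # 25

features = flattened_size(side_after_pool, side_after_pool, channels=32)
print(features)                         # 20000
print(dense_param_count(features, 64))  # 1280064
```

Even with a modest dense layer, most of the parameters in a small CNN tend to sit right after the flatten step, which is one reason dropout around it is popular.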
Fully connected layers
Number of units is the number of nodes, or width, of the layer. This is analogous to the number of filters in the convolutional layer. More nodes means you’re going to pick up on more complex features, but will also make you more prone to overfitting. The kernel initializer specifies how you initialize the weights for each node, e.g. by drawing them from some distribution (Glorot uniform is the Keras default). A number of initializers are available within Keras, and you can also define your own. I haven’t played with this too much, but I’ve read here and there that it can be good to look at. Bias initializer is basically the same thing, but for biases. Kernel and bias regularizers add penalties to the loss function that smooth it out, which can help both with overfitting and convergence. Kernel and bias constraints are basically just another form of regularization, except that we constrain the parameters in some way, rather than penalizing the loss function.
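The difference between a regularizer and a constraint can be sketched in a few lines of plain Python. Both helpers below are simplified illustrations with hypothetical names: `l2_penalty` computes the term an L2 regularizer would add to the loss, and `max_norm` rescales a weight vector in the spirit of a max-norm constraint. Neither is the exact Keras implementation.

```python
import math

def l2_penalty(weights, strength):
    """Extra loss term added by an L2 regularizer: strength * sum(w^2)."""
    return strength * sum(w * w for w in weights)

def max_norm(weights, max_value):
    """Rescale the weights so that their Euclidean norm is at most max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    scale = max_value / norm
    return [w * scale for w in weights]

weights = [3.0, 4.0]  # toy weight vector with norm 5

# Regularizer: the weights are untouched, but the loss grows with their size.
print(l2_penalty(weights, strength=0.01))

# Constraint: the loss is untouched, but the weights are pulled back in range.
print(max_norm(weights, max_value=2.5))  # [1.5, 2.0]
```

In practice the regularizer nudges weights down gradually through the gradient, while the constraint is enforced directly after each update.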
In a dropout layer, we randomly select some set of nodes to be removed from the network before feeding into the next layer. This is a very powerful hedge against overfitting. The parameter represents the probability that each node will be dropped out of the network.
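Here is a minimal sketch of what a dropout layer does to a vector of activations during training. It uses the common “inverted dropout” formulation, where the surviving activations are scaled up by 1 / (1 - rate) so that their expected sum is unchanged; the `dropout` function is a hypothetical pure-Python stand-in, not the Keras layer.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale up the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale
            for a in activations]

rng = random.Random(7)  # seeded so the example is repeatable
activations = [1.0] * 1000

dropped = dropout(activations, rate=0.5, rng=rng)
fraction_dropped = dropped.count(0.0) / len(dropped)
print(fraction_dropped)  # close to 0.5
```

At prediction time dropout is switched off and every node is used, which is why the scaling during training matters.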
How to build it
Much the same as we did in the convolutional segment of the network. Keep adding layers until you stop seeing real gains in accuracy. Overfit, then prune later.
As before, try not to reduce the number of nodes drastically from one layer to the next, but it’s not as huge of a deal here, since we’re very close to the output layer, and there’s not a whole lot of feature-learning left to be done. Just be judicious about how many nodes you include in each layer.
Dropout regularization is your friend. Adding a dropout layer can have a huge impact on generalization error, as can carefully tuning the dropout rate.
The Output Layer
There’s not a whole lot to say here. The output layer is another fully connected layer, and the number of nodes in the layer is the dimensionality of your output. For a binary classifier, this is going to be 1, and for a multi-class classifier with a softmax output, it’s the number of classes. In situations where you’re classifying the same data in multiple ways, you might also have a higher-dimensional output. I’ve personally had better success with N neural networks that have a single-dimensional output than I have with one network that has an N-dimensional output.
The activation applied to the output layer is especially important. For a binary classifier, a sigmoid activation is good. For a classifier with more levels, you’ll probably end up using softmax. These are by far the most common output activations for image classification; they’re tried and true.
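For reference, both output activations are short enough to write out by hand. This is a plain-Python sketch with numerically safe formulations, not the Keras implementation: sigmoid squashes a single score into a probability, and softmax turns a vector of scores into probabilities that sum to 1.

```python
import math

def sigmoid(x):
    """Map a single score to a probability in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative inputs, this form avoids overflow in exp.
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))              # 0.5
print(softmax([1.0, 2.0, 3.0]))  # largest probability goes to the last class
```

With a sigmoid output you threshold the single probability; with softmax you pick the class with the highest probability.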
At some point in the future when I’m feeling ambitious, I’ll go into more detail on selecting activation functions and tuning hyperparameters.