A Convolutional Neural Network, also known as a CNN or ConvNet, is a class of neural networks that specializes in processing data with a grid-like topology, such as an image. A digital image is a binary representation of visual data: it contains a series of pixels arranged in a grid, with pixel values indicating the brightness and color of each pixel.
The human brain processes a huge amount of information the moment we see an image. Each neuron works in its own receptive field and is connected to other neurons in a way that together they cover the entire visual field. Just as each neuron responds to stimuli only in the restricted region of the visual field called its receptive field in the biological vision system, each neuron in a CNN processes data only in its receptive field as well. The layers are arranged so as to detect simpler patterns (lines, curves, etc.) first and more complex patterns (faces, objects, etc.) further along. Using a CNN, one can enable sight to computers.
A CNN typically has three layers: a convolution layer, a pooling layer, and a fully connected layer.
The convolution layer is the core building block of the CNN. It carries the main portion of the network's computational load.
This layer performs a dot product between two matrices, where one matrix is the set of learnable parameters, otherwise known as the kernel, and the other matrix is the restricted portion of the receptive field. The kernel is spatially smaller than the image but extends through its full depth. This means that if the image is composed of three (RGB) channels, the kernel height and width will be spatially small, but its depth extends across all three channels.
During the forward pass, the kernel slides across the height and width of the image, producing the image representation of that receptive region. This yields a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position of the image. The sliding size of the kernel is called a stride.
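To make the sliding-kernel picture concrete, here is a minimal sketch in NumPy (with a made-up toy image and kernel) of the sliding dot product — technically cross-correlation — that deep-learning frameworks call convolution:

```python
import numpy as np

def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over `image`, recording a dot product at each position."""
    H, W = image.shape
    F = kernel.shape[0]                      # assume a square F x F kernel
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the current receptive region
            patch = image[i*stride:i*stride+F, j*stride:j*stride+F]
            activation_map[i, j] = np.sum(patch * kernel)
    return activation_map

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4 x 4 "image"
kernel = np.ones((2, 2))                           # toy 2 x 2 kernel
print(conv2d_single_channel(image, kernel).shape)  # (3, 3)
```

Each entry of the resulting activation map is the kernel's response at one spatial position; increasing the stride makes the map smaller.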
If we have an input of size W x W x D and Dout number of kernels with a spatial size of F, stride S, and amount of padding P, then the size of the output volume can be determined by the following formula:

Wout = (W − F + 2P)/S + 1

This yields an output volume of size Wout x Wout x Dout.
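The output-size formula (W − F + 2P)/S + 1 can be checked directly in code; the 227/11/4 numbers below are the classic AlexNet first-layer configuration, used here purely as an illustrative example:

```python
def conv_output_size(W, F, P, S):
    """Spatial output size of a convolution: (W - F + 2P)/S + 1."""
    assert (W - F + 2 * P) % S == 0, "kernel does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=227, F=11, P=0, S=4))  # 55
print(conv_output_size(W=28, F=5, P=2, S=1))    # 28: padding 2 preserves size for a 5x5 kernel
```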
Motivation behind Convolution
Convolution uses three main ideas that have motivated computer vision researchers: sparse interaction, parameter sharing, and equivariant representation. We will describe each of them in detail.
Trivial neural network layers use matrix multiplication by a matrix of parameters describing the interaction between the input and output units. This means that every output unit interacts with every input unit. However, convolutional neural networks have sparse interaction. This is achieved by making the kernel smaller than the input. For example, an image may have thousands or millions of pixels, but while processing it with the kernel we can detect meaningful information spanning just tens or hundreds of pixels. This means that we need to store fewer parameters, which not only reduces the memory requirement of the model but also improves its statistical efficiency.
If computing one feature at a spatial point (x1, y1) is useful, then it should also be useful at some other spatial point, say (x2, y2). In other words, for a single two-dimensional slice, i.e., for creating one activation map, neurons are constrained to use the same set of weights. In a traditional neural network, each element of the weight matrix is used once and never revisited, whereas a convolutional network has shared parameters: the weights applied to one input are the same as the weights applied elsewhere.
Due to parameter sharing, the layers of a convolutional neural network have a property of equivariance to translation: if we change the input in a certain way, the output changes in the same way.
The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps reduce the spatial size of the representation, which decreases the required amount of computation and number of weights. The pooling operation is processed on every slice of the representation individually.
There are several pooling functions, such as the average of the rectangular neighborhood, the L2 norm of the rectangular neighborhood, and a weighted average based on the distance from the central pixel. However, the most popular is max pooling, which reports the maximum output of the neighborhood.
If we have an activation map of size W x W x D, a pooling kernel of spatial size F, and stride S, then the size of the output volume can be determined by the following formula:

Wout = (W − F)/S + 1

This yields an output volume of size Wout x Wout x D.
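As a small illustration (in NumPy, with made-up values), max pooling simply reports the maximum of each neighborhood, here with a 2 x 2 kernel and stride 2, halving each spatial dimension:

```python
import numpy as np

def max_pool2d(x, F=2, S=2):
    """Max pooling: report the maximum of each F x F neighborhood, moving with stride S."""
    H, W = x.shape
    out = np.zeros(((H - F) // S + 1, (W - F) // S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*S:i*S+F, j*S:j*S+F].max()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [1., 2., 9., 8.],
              [0., 1., 7., 4.]])
print(max_pool2d(x))
# [[6. 5.]
#  [2. 9.]]
```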
In all cases, pooling provides some translation invariance, meaning that an object would be recognizable regardless of where it appears in the frame.
Neurons in this layer have full connectivity with all neurons in the preceding and succeeding layer, as seen in a regular FCNN. It can therefore be computed as usual by a matrix multiplication followed by a bias offset.
The FC layer helps to map the representation between input and output.
Since convolution is a linear operation and images are anything but linear, non-linearity layers are usually placed right after the convolution layer to introduce non-linearity into the activation map.
There are different types of nonlinear operations, the most popular ones are:
The sigmoid nonlinearity has the mathematical form σ(κ) = 1/(1 + e^−κ). It takes a real-valued number and "squashes" it into a range between 0 and 1.
However, a very undesirable property of the sigmoid is that its gradient becomes almost zero when the activation is at either tail. If the local gradient becomes very small, it effectively "kills" the gradient during backpropagation. Also, if the data coming into the neuron is always positive, the output of the sigmoid will be either all positive or all negative, resulting in a zig-zag dynamic of gradient updates for the weights.
Tanh squashes a real-valued number to the range [-1, 1]. Like sigmoid neurons, its activations saturate, but unlike sigmoid neurons its output is zero-centered.
The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function ƒ(κ) = max(0, κ). In other words, the activation is simply thresholded at zero.
Compared to sigmoid and tanh, ReLU is more reliable and accelerates convergence by around six times.
Unfortunately, a drawback is that ReLU units can be fragile during training. A large gradient flowing through a ReLU can update the weights in such a way that the neuron never activates again. However, this can be mitigated by setting a suitable learning rate.
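These behaviors are easy to verify numerically; a small pure-Python sketch of the three activations, including the sigmoid gradient σ(κ)(1 − σ(κ)), which nearly vanishes at the tails:

```python
import math

def sigmoid(k):
    return 1.0 / (1.0 + math.exp(-k))

def relu(k):
    return max(0.0, k)

print(round(sigmoid(0.0), 3))      # 0.5: squashed into (0, 1)
print(round(math.tanh(0.0), 3))    # 0.0: zero-centered
print(relu(-3.0), relu(3.0))       # 0.0 3.0: thresholded at zero

# Sigmoid gradient sigma * (1 - sigma) nearly vanishes at the tails:
for k in (0.0, 5.0, 10.0):
    s = sigmoid(k)
    print(k, s * (1 - s))
```

At κ = 10 the sigmoid gradient is already below 10⁻⁴, which is the saturation problem described above.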
Now that we understand the different components, we can build a convolutional neural network. We will use Fashion-MNIST, a dataset of Zalando article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28 x 28 grayscale image associated with a label from 10 classes. The dataset can be downloaded here.
Our convolutional neural network has the following architecture:
→ [CONV 1] → [BATCH NORM] → [ReLU] → [POOL 1]
→ [CONV 2] → [BATCH NORM] → [ReLU] → [POOL 2]
→ [FC LAYER] → [RESULT]
For both conv layers, we use a kernel of spatial size 5 x 5 with stride 1 and padding 2. For both pooling layers, we use max pooling with kernel size 2, stride 2, and zero padding.
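With these settings each conv layer preserves the spatial size and each pooling layer halves it; a quick arithmetic check of where the final layer's 32 * 7 * 7 input size comes from:

```python
def conv_out(W, F, P, S):   # (W - F + 2P)/S + 1
    return (W - F + 2 * P) // S + 1

def pool_out(W, F, S):      # (W - F)/S + 1
    return (W - F) // S + 1

W = 28                            # Fashion-MNIST images are 28 x 28
W = conv_out(W, F=5, P=2, S=1)    # conv 1: 28 (padding 2 keeps the size)
W = pool_out(W, F=2, S=2)         # pool 1: 14
W = conv_out(W, F=5, P=2, S=1)    # conv 2: 14
W = pool_out(W, F=2, S=2)         # pool 2: 7
print(W, 32 * W * W)              # 7 1568: the FC layer sees 32 * 7 * 7 features
```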
Snippet of code defining the convnet:
# Layer 1 definition
self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2)
self.batch1 = nn.BatchNorm2d(16)
self.relu1 = nn.ReLU()
self.pool1 = nn.MaxPool2d(kernel_size=2)  # default stride is equivalent to kernel_size
# Layer 2 definition
self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2)
self.batch2 = nn.BatchNorm2d(32)
self.relu2 = nn.ReLU()
self.pool2 = nn.MaxPool2d(kernel_size=2)
# Define the linear layer
self.fc = nn.Linear(32 * 7 * 7, 10)
# Define the network flow
# Conv 1
out = self.conv1(x)
out = self.batch1(out)
out = self.relu1(out)
# Max pool 1
out = self.pool1(out)
# Conv 2
out = self.conv2(out)
out = self.batch2(out)
out = self.relu2(out)
# Max pool 2
out = self.pool2(out)
out = out.view(out.size(0), -1)
# Linear layer
out = self.fc(out)
We also use batch normalization in our network, which saves us from improper initialization of the weight matrices by explicitly forcing the network toward a unit Gaussian distribution. The code for the network defined above is available here. We train using cross-entropy as the loss function and the Adam optimizer with a learning rate of 0.001. After training the model, we achieve 90% accuracy on the test dataset.
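Putting the pieces together, here is a minimal runnable sketch of one training step as described above (cross-entropy loss, Adam with learning rate 0.001); note the random batch below is only a stand-in for a real Fashion-MNIST DataLoader:

```python
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    """Same architecture as above: two conv/batch-norm/ReLU/max-pool stages + one FC layer."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(kernel_size=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(kernel_size=2))
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        out = self.layer2(self.layer1(x))
        return self.fc(out.view(out.size(0), -1))

model = ConvNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# One training step on a random batch (stand-in for a Fashion-MNIST DataLoader)
images = torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())
```

In a real run, the step above would be wrapped in a loop over epochs and DataLoader batches.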
Below are some applications of Convolutional Neural Networks in use today:
1. Object Detection: With CNNs, we now have sophisticated models like R-CNN, Fast R-CNN, and Faster R-CNN, which are the dominant pipeline for many object detection models deployed in autonomous vehicles, facial detection, and more.
2. Semantic Segmentation: In 2015, a group of researchers from Hong Kong developed a CNN-based deep parsing network to incorporate rich information into an image segmentation model. Researchers at UC Berkeley also built fully convolutional networks that improved upon state-of-the-art semantic segmentation.
3. Image Captioning: CNNs are used together with recurrent neural networks to write captions for images and videos. This can be used for many applications, such as activity recognition or describing videos and images for the visually impaired. YouTube deploys this heavily to make sense of the huge number of videos uploaded to the platform regularly.
1. Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016
2. Stanford University course CS231n: Convolutional Neural Networks for Visual Recognition, by Prof. Fei-Fei Li, Justin Johnson, and Serena Yeung