What are convolutional neural networks?
Convolutional neural networks are used in computer vision tasks, which employ convolutional layers to extract features from input data.
Convolutional neural networks (CNNs) are a class of deep neural networks commonly used in computer vision tasks such as image and video recognition, object detection and image segmentation.
Neural networks are machine learning models consisting of interconnected nodes that process information to make decisions, while deep neural networks have multiple hidden layers that enable them to learn complex representations for various tasks. They both mimic the structure and function of the human brain. Computer vision is a field of artificial intelligence (AI) that focuses on enabling machines to interpret and understand visual data from the world.
Although image and video recognition involves classifying or recognizing objects, scenes or actions in photos or videos, object detection involves locating certain things inside an image or video. Image segmentation involves dividing an image into meaningful segments or regions for further analysis or manipulation.
CNNs use several convolutional layers to automatically extract features from input data. The input data is subjected to filters by the convolutional layers, with the feature maps produced passed into further processing layers. Convolutional layers are the building blocks of CNNs that perform the operation of filtering and feature extraction on input data.
Filtering is the process of convolving an image with a filter to extract features, while feature extraction is the process of identifying relevant patterns or features from the convolved images.
Pooling layers, which downsample the output of the convolutional layers to lower the computational cost and increase the network’s capacity to generalize to new inputs, are frequently included in CNNs in addition to the convolutional layers. Additional typical layers include normalization layers, which help to lower overfitting and enhance the network’s performance, and fully connected layers, which are utilized for classification or prediction tasks.
Many applications, such as facial recognition, self-driving cars, medical image analysis and natural language processing (NLP), have extensively used CNNs. They have also been used to achieve state-of-the-art results in image classification tasks, such as the ImageNet challenge.
Convolutional neural networks work by extracting features from input data through convolutional layers and learning to classify the input data through fully connected layers.
The steps involved in the working of convolutional neural networks include the following:
- Input layer: The input layer — the first layer in a CNN — takes raw data as input, like an image or a video, and sends it to the next layer for processing.
- Convolutional layer: The feature extraction takes place in the convolutional layer. This layer applies a collection of filters or kernels to extract features like edges, corners and forms from the input data.
- ReLU layer: To provide non-linearity to the output and enhance the performance of the network, a rectified linear unit (ReLU) activation function is frequently implemented after each convolutional layer. ReLU outputs the input directly if it is positive and outputs zero if it is negative.
- Pooling layer: The convolutional layer’s feature maps are formed with a pooling layer, which reduces their dimensionality. Max-pooling is a commonly used technique where the maximum value in each patch of the feature map is taken as the output.
- Fully connected layer: The fully connected layer takes the flattened output of the pooling layer and applies a set of weights to produce the final output, which can be used for classification or prediction tasks.
Here is an illustration of how CNN would categorize pictures of cats and dogs:
- Step 1: The input layer receives 3-channel (RGB) images of a dog or cat and other raw image data. A 3-channel (RGB) is a standard format used to represent color images in neural networks, with each pixel being represented by three values representing the intensity of the red, green and blue color channels.
- Step 2: The convolutional layer applies a series of filters to the input image to extract features like edges, corners and forms.
- Step 3: Convolutional layer output becomes non-linear due to the ReLU layer.
- Step 4: By taking the maximum value in each feature map patch, the pooling layer lowers the dimensionality of the feature maps created by the convolutional layer.
- Step 5: Many convolutional and pooling layers are stacked to extract progressively complicated characteristics from the input image.
- Step 6: Flatten layer converts the previous layer’s output into a one-dimensional or 1D vector (a sequence of numbers arranged in a single row or column, each representing a feature or characteristic). Then, a fully connected layer receives the flattened output of the last pooling layer and applies a set of weights to produce the final output, identifying whether the image is of a cat or a dog.
The CNN is trained using a set of labeled images, with the weights of the filters and fully connected layers adjusted during training to minimize the error between the predicted and actual labels. Once trained, the convolutional neural network can accurately classify new, unseen images of cats and dogs.
Several types of convolutional neural networks exist, including traditional CNNs, recurrent neural networks, fully convolutional networks and spatial transformer networks — among others.
Traditional CNNs
Traditional CNNs, also known as “vanilla” CNNs, consist of a series of convolutional and pooling layers, followed by one or more fully connected layers. As mentioned, each convolutional layer in this network runs a series of convolutions with a collection of teachable filters to extract features from the input image.
The Lenet-5 architecture, one of the first effective CNNs for handwritten digit recognition, illustrates a conventional CNN. It has two sets of convolutional and pooling layers following two fully connected layers. CNNs’ efficiency in image identification was proved by the Lenet-5 architecture, which also made them more widely used in computer vision tasks.
Recurrent neural networks
Recurrent neural networks (RNNs) are a type of neural network that can process sequential data by keeping track of the context of prior inputs. Recurrent neural networks can handle inputs of varying lengths and produce outputs dependent on the previous inputs, unlike typical feedforward neural networks, which only process input data in a fixed order.
For instance, RNNs can be utilized in NLP activities like text generation or language translation. A recurrent neural network can be trained on pairs of sentences in two different languages to learn to translate between the two.
The RNN processes sentences one at a time, producing an output sentence depending on the input sentence and the preceding output at each step. The RNN can produce correct translations even for complex texts since it keeps track of past inputs and outputs.
Fully convolutional networks
Fully convolutional networks (FCNs) are a type of neural network architecture commonly used in computer vision tasks such as image segmentation, object detection and image classification. FCNs can be trained end-to-end using backpropagation to categorize or segment images.
Backpropagation is a training algorithm that computes the gradients of the loss function with respect to the weights of a neural network. A machine learning model’s ability to predict the anticipated output for a given input is measured by a loss function.
FCNs are solely based on convolutional layers, as they do not have any fully connected layers, making them more adaptable and computationally efficient than conventional convolutional neural networks. A network that accepts an input image and outputs the location and classification of objects within the image is an example of an FCN.
Spatial transformer network
A spatial transformer network (STN) is used in computer vision tasks to improve the spatial invariance of the features learned by the network. The ability of a neural network to recognize patterns or objects in an image independent of their geographical location, orientation or scale is known as spatial invariance.
A network that applies a learned spatial transformation to an input image before processing it further is an example of an STN. The transformation could be used to align objects within the image, correct for perspective distortion or perform other spatial changes to enhance the network’s performance on a specific job.
A transformation refers to any operation that modifies an image in some way, such as rotating, scaling or cropping. Alignment refers to the process of ensuring that objects within an image are centered, oriented or positioned in a consistent and meaningful way.
When objects in an image appear skewed or deformed due to the angle or distance from which the image was taken, perspective distortion occurs. Applying several mathematical transformations to the image, such as affine transformations, can be used to correct for perspective distortion. Affine transformations preserve parallel lines and ratios of distances between points to correct for perspective distortion or other spatial changes in an image.
Spatial changes refer to any modifications to the spatial structure of an image, such as flipping, rotating or translating the image. These changes can augment the training data or address specific challenges in the task, such as lighting, contrast or background variations.
CNNs are preferred in computer vision tasks due to their advantages, including translation invariance, parameter sharing, hierarchical representations, resilience to changes and end-to-end training.
Convolutional neural networks have several advantages, making them an attractive choice for various computer vision tasks. One of their main advantages is translation invariance, a feature of CNNs that allows them to recognize objects in an image regardless of their position. Convolutional layers are used to accomplish this by applying filters to the full input image so that the network can learn features that are translation-invariant.
The use of parameter sharing, in which the same set of parameters is shared across all areas of the input image, is another benefit of CNNs. As a result, the network has fewer parameters and can better generalize new data, which is crucial when working with huge data sets.
CNNs can also learn hierarchical representations of the input image, with the upper layers learning more complicated features like object pieces and forms, while the lower layers learn simpler elements like edges and textures. For challenging tasks like object detection and segmentation, this hierarchical model enables the network to learn characteristics at many levels of abstraction.
CNNs are suitable for real-world applications because they are resilient to changes in lighting, color and tiny distortions in the input image. Finally, convolutional neural networks can be trained end-to-end, allowing gradient descent to simultaneously optimize all of the network’s parameters for performance and faster convergence. Gradient descent is an optimization algorithm used to iteratively adjust model parameters by minimizing the loss function in the direction of the negative gradient.
CNNs have drawbacks such as lengthy training time, the need for large labeled data sets and susceptibility to overfitting. Network complexity can also affect performance. However, CNNs remain a widely used and effective tool in computer vision, including object detection and segmentation, despite limitations in tasks requiring contextual knowledge like NLP.
Convolutional neural networks have several drawbacks that can make using them in some machine learning applications difficult. For instance, CNN training can take a while, especially for large data sets, because CNNs are computationally expensive. Furthermore, creating the CNN architecture can be challenging and necessitates a thorough comprehension of the fundamental ideas of artificial neural networks.
Another drawback is that CNNs need a lot of labeled data to train effectively. This can be a serious constraint in the circumstances with little data available. CNNs are also not always successful at tasks that require more contextual knowledge, such as NLP, even if they are quite good at image recognition tasks.
The number and kind of layers employed in CNN design can affect performance. For instance, adding more layers might improve accuracy but increase network complexity and computing costs at the same time. Deep learning CNN architectures are also vulnerable to overfitting, which happens when a network gets overly specialized to the training data and performs poorly on new, untrained data.
Despite these drawbacks, CNNs are still a widely used and very effective tool for deep learning and machine learning algorithms in the field of artificial neural networks, including segmentation, object detection and image recognition. That said, CNNs will remain a key player in computer vision.