Image Classification

How do you identify a cat from a dog?? Huh, what a stupid question! you will ask. You would have seen cats and dogs all throughout your life, so this will seem like a dumb question to you. What if a small kid asks you this question? How are you going to teach them to identify cats and dogs? The easiest and most efficient way is to show them images of both. After that try to present new images of cats and dogs and if they can correctly identify new examples, you successfully taught them to identify cats and dogs.

Now comes the important question: How do you teach a computer to perform the same task? We do something similar to what we have done with kids. In machine learning, we refer to this as Image classification.

The Image Classification problem is the task of assigning an input image a label or class from a fixed set of categories. This means that the task is to analyze an input image and assign a label that categorizes the image from a possible set of labels.

For example: Consider the following set of labels:
Label = {cat, dog, elephant, tiger, mouse}

Now if we try out the following image to our image classification model

We should get the correct label from the label set – that is “dog”.

The image classification task can also predict a distribution over labels to indicate our confidence for a given image. The confidence levels indicate how sure the classifier is that an image belongs to a label.

For instance, an animal image classifier may return something like this:

94.8% dog
4.2% wolf
1% cat

How does image classification work?

How a human perceives an image is different from how a machine sees it. An image is made up of tiny pixels. In a computer, an image is represented as one large 3-dimensional array of numbers. The above cat image is 1023 pixels wide, 1280 pixels tall (the size is determined by the image resolution), and has three color channels Red, Green, and Blue (or RGB for short). Therefore, the image consists of 1023 x 1280 x 3 numbers or a total of 39,28,320 numbers. Each number is an integer that ranges from 0 (black) to 255 (white). Our task is to turn these numbers into a single label, such as “dog”.

Challenges

There are some challenges that are likely to be experienced while performing image classification tasks.

Viewpoint variation- A single instance of an object can be oriented in many ways. The different orientation of an object poses difficulty while training the image classification model. Here a dog toy is shot from different viewpoints. These images look different to a machine even though the object remains the same.

Scale variation- Same object in different sizes (real-world size or to the range in the image) may pose a problem during image classification.
Deformation- Sometimes the object of interest may be deformed.

Occlusion- The objects of interest can be occluded. Maybe the view could be obstructed by something and only parts of the object could be visible.
Illumination conditions- Images have drastic changes in different lighting.

*image is taken from CS231n: Deep Learning for Computer Visio*

Background clutter- The objects of interest may blend into their environment, making them hard to identify. Can you spot the cheetah in the below image?

Intra-class variation- The classes of interest can often be relatively broad, such as dogs. There are many different breeds of dogs, each differs in look, size, and shape.

A good image classification model must overcome all these challenges.

Data-driven approach

Unlike writing an algorithm for, for example, sorting a list or finding the sum of 2 numbers, the algorithm for identifying a dog in images is not obvious. We cannot specify each and every detail of an image directly in code. For example, consider a picture of a cat and a dog. Imagine trying to write a procedural program that tells the difference between them and then try to write a program to tell the difference between any cat and a dog. We have to change the whole program for each picture of cats and dogs. Also, the programs will be full of if-else statements. This is not a good method. So, what can we do?

Just like how we teach a kid, we are going to provide the computer with many examples (training sets) of each class. Then we develop a learning algorithm that goes through each example and learns about the appearance of each class. This approach is called the data-driven approach.

Image classification pipeline

The goal of an image classification model is to take matrixes of pixels that represent an image and assign a label to it.

Our complete pipeline can be formalized as follows:

Gather dataset (training dataset): The first and foremost step for any deep learning task is to gather the dataset. The dataset consists of N images for building an image classification model, each labeled with K different classes.
Learning (train the model): Using the training set, the model learns what every one of the classes looks like. When the model makes a mistake, it learns from it and improves itself.
Evaluation: Finally, we evaluate the quality of the classifier by asking it to predict labels for a new set of images (testing dataset) that it has never seen before. We will then compare predicted labels with the true labels of the images.

Reference

CS231n: Deep Learning for Computer Vision