Seeing and recognizing objects comes naturally to a person. For a computer, however, it is still an extremely difficult task. Computer vision and image understanding have a very wide range of applications, from supermarket barcode readers to **augmented reality**.

In practical terms, computer vision and image understanding covers tasks such as reading vehicle license plates, analyzing medical images, detecting manufacturing flaws, and recognizing faces. More formally, it is a set of technologies and algorithms that share one goal: to teach a computer to survey its surroundings with a certain degree of intelligence.

## Computer Vision and Image Understanding – the Basics

To make a computer recognize your dog in a photo, you need to master several skills: machine learning methods, image processing fundamentals, and the principles of parallel computing, plus a bit of mathematics such as linear algebra and geometry.

In practice, you don’t necessarily have to know all of them. There are easier ways to find your bearings among the methods of computer vision and image understanding.

## How Does a Computer ‘See’ an Object?

If we break the phrase ‘computer vision’ into its parts (computer + vision), its meaning becomes clear: ‘How does a computer see?’

To understand how a computer sees, we first need to figure out what it can see. Let’s take an image as the object to be recognized. An image is the result of projecting three-dimensional space onto a two-dimensional plane. This projection comes in two types: perspective projection and orthographic projection.
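
The difference between the two projection types can be illustrated with a minimal sketch. It assumes a simple pinhole camera looking along the z axis; the function names and the `focal_length` parameter are illustrative choices, not part of any particular library.

```python
def orthographic(point):
    """Orthographic projection: drop the depth coordinate."""
    x, y, z = point
    return (x, y)


def perspective(point, focal_length=1.0):
    """Perspective projection (pinhole model): divide x and y by the depth z."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# The same point at two different depths.
near = (2.0, 1.0, 2.0)
far = (2.0, 1.0, 4.0)

print(orthographic(near), orthographic(far))  # identical: depth is ignored
print(perspective(near), perspective(far))    # the far point lands closer to the center
```

Under perspective projection distant objects appear smaller, while orthographic projection keeps their size regardless of depth.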

## Computer Vision and Image Understanding Approaches

There are two fundamentally different ways to treat an image in computer vision and image understanding: (1) as a discrete (finite) data set and (2) as a function.

### Discrete Data Set

A binary image (only two colors, black and white) can be represented as a numeric matrix of size **n** by **m** whose entries are ones and zeros.

A color image needs three such arrays, one for each of the **RGB** channels. The image resolution is the number of rows and columns. Video is a sequence of such images (for example, 30 frames per second). Each pixel of the image represents a set of points in space.
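
As a small illustration of this representation, the sketch below stores a binary image, a three-channel color image, and a short "video" as plain nested lists. The pixel values and the helper `pixel` are made up for the example; real code would normally use an array library.

```python
# A 3 x 4 binary image: 1 and 0 stand for the two colors.
binary = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
rows, cols = len(binary), len(binary[0])  # the resolution: rows and columns

# A color image of the same size: one array per RGB channel (values 0..255).
red = [[255 * value for value in row] for row in binary]
green = [[0] * cols for _ in range(rows)]
blue = [[128] * cols for _ in range(rows)]


def pixel(r, c):
    """Return the (R, G, B) triple stored at row r, column c."""
    return (red[r][c], green[r][c], blue[r][c])


# One second of video at 30 frames per second is just a list of 30 images.
video = [binary for _ in range(30)]

print(rows, cols)
print(pixel(0, 1))
print(len(video))
```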

### Function Approach

Unfortunately, RGB is not always well suited for analyzing information. Experiments show that the geometric distance between two colors in RGB space is quite far from how similar a person perceives those colors to be. It can also be much more convenient for calculations to think of an image as a function.

#### Discrete Fourier Transform

The function can be expanded using the Discrete Fourier Transform (**DFT**). A digital photograph or other raster image is an array of numbers: brightness levels recorded by the sensor on a two-dimensional plane. Since, from a mathematical point of view, a thin lens performs a Fourier transform of an image placed in its focal plane, it is possible to create image processing algorithms that are analogous to image processing by a classical optical system.

As output we get a representation of our signal (an image) as a sum of sinusoids of different frequencies. As a rule, the signal (our image) is defined on a finite interval, so a cosine expansion, the discrete cosine transform (DCT), can be used to represent it. In practice, the image is decomposed into basis functions: two-dimensional sinusoids with different orientations and phases. To speed up the decomposition, the two-dimensional transform is split into one-dimensional transforms along rows and columns (for example, in JPEG compression).
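
To make the idea of expanding a signal into frequency components more concrete, here is a naive one-dimensional DFT written with the standard library only. It is a sketch for illustration (quadratic time, a single image row instead of a full 2D image); the names `dft`, `idft`, and `row` are assumptions of this example.

```python
import cmath
import math


def dft(signal):
    """Naive DFT: one complex coefficient per frequency k."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT: rebuild the samples from the coefficients."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


# One row of an image: constant brightness 10 plus a single cosine wave.
n = 8
row = [10 + 4 * math.cos(2 * math.pi * t / n) for t in range(n)]

spectrum = dft(row)
magnitudes = [round(abs(c), 6) for c in spectrum]
print(magnitudes)  # energy only at k = 0 (the mean) and k = 1, 7 (the cosine)

restored = [value.real for value in idft(spectrum)]
```

Only three coefficients are non-zero here, which is the intuition behind transform-based compression: a smooth signal is described by a few frequency components.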

Because a partial sum of the Fourier series differs from the signal at sharp boundaries, an error appears there: ringing, known as the Gibbs phenomenon.
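
The ringing can be observed numerically. The sketch below sums the first few terms of the Fourier series of a square wave that jumps between -1 and +1 and measures the highest value reached next to the jump. The function names and the sampling density are choices made for this example.

```python
import math


def square_wave_partial_sum(x, harmonics):
    """Sum of the first `harmonics` odd sine terms of a +/-1 square wave."""
    return (4 / math.pi) * sum(
        math.sin((2 * j - 1) * x) / (2 * j - 1) for j in range(1, harmonics + 1)
    )


def peak_value(harmonics, samples=2000):
    """Largest value of the partial sum on a grid over (0, pi)."""
    xs = (math.pi * i / samples for i in range(1, samples))
    return max(square_wave_partial_sum(x, harmonics) for x in xs)


peak_5 = peak_value(5)
peak_50 = peak_value(50)
print(peak_5, peak_50)  # both stay noticeably above the true level of 1.0
```

Adding more terms makes the ripple narrower but does not remove the overshoot, which is why sharp edges are the hardest part of an image for a truncated Fourier representation.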

#### Gaussian Pyramid

A pyramid is a type of multi-scale signal representation in which a signal or an image is repeatedly smoothed and subsampled. Historically, the pyramid representation is a predecessor of scale-space representation and multiresolution analysis. During processing, some images can be ruled out early based on an analysis of their small representations. The **Gaussian** Pyramid is used to improve pattern matching and in multi-scale image understanding methods.
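
A minimal sketch of the smooth-then-subsample loop is shown below. It assumes a small 3-tap binomial kernel as a stand-in for a true Gaussian and repeats edge pixels at the borders; both are simplifications chosen for brevity, and all names are illustrative.

```python
KERNEL = (0.25, 0.5, 0.25)  # small binomial approximation of a Gaussian


def blur_1d(values):
    """Smooth one row or column, repeating the edge samples."""
    n = len(values)
    return [
        KERNEL[0] * values[max(i - 1, 0)]
        + KERNEL[1] * values[i]
        + KERNEL[2] * values[min(i + 1, n - 1)]
        for i in range(n)
    ]


def blur(image):
    """Separable blur: rows first, then columns."""
    rows = [blur_1d(row) for row in image]
    cols = [blur_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def downsample(image):
    """Keep every second row and every second column."""
    return [row[::2] for row in image[::2]]


def gaussian_pyramid(image, levels):
    """Level 0 is the original; each next level is blurred and halved."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(blur(pyramid[-1])))
    return pyramid


image = [[float((r * 8 + c) % 7) for c in range(8)] for r in range(8)]
pyramid = gaussian_pyramid(image, levels=3)
sizes = [(len(level), len(level[0])) for level in pyramid]
print(sizes)  # [(8, 8), (4, 4), (2, 2)]
```

A coarse-to-fine search would first examine the smallest level and move to the larger levels only for candidates that look promising.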

#### Laplacian Pyramid

The **Laplacian Pyramid** is a variation of the Gaussian Pyramid. Each level stores not a scaled copy of the image but the image detail in a particular frequency band. This allows efficient compression and progressive data transfer: the upper (coarse) levels are sent first, and the finer detail is added afterwards.
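
The following sketch builds such a pyramid for a one-dimensional signal (one image row) to keep the code short. It assumes a 3-tap smoothing kernel and nearest-neighbor upsampling; these choices and all function names are illustrative only.

```python
def blur(signal):
    """Smooth with the kernel [1, 2, 1] / 4, repeating the edge samples."""
    n = len(signal)
    return [
        0.25 * signal[max(i - 1, 0)] + 0.5 * signal[i] + 0.25 * signal[min(i + 1, n - 1)]
        for i in range(n)
    ]


def downsample(signal):
    return signal[::2]


def upsample(signal, length):
    """Nearest-neighbor upsampling back to the given length."""
    return [signal[i // 2] for i in range(length)]


def laplacian_pyramid(signal, levels):
    """Detail bands from fine to coarse, followed by the coarsest copy."""
    bands = []
    current = signal
    for _ in range(levels - 1):
        smaller = downsample(blur(current))
        predicted = upsample(smaller, len(current))
        bands.append([a - b for a, b in zip(current, predicted)])
        current = smaller
    bands.append(current)
    return bands


def reconstruct(bands):
    """Start from the coarsest level and add the detail bands back."""
    current = bands[-1]
    for band in reversed(bands[:-1]):
        predicted = upsample(current, len(band))
        current = [d + p for d, p in zip(band, predicted)]
    return current


signal = [3.0, 5.0, 4.0, 8.0, 9.0, 7.0, 2.0, 1.0]
bands = laplacian_pyramid(signal, levels=3)
restored = reconstruct(bands)
print([len(band) for band in bands])  # [8, 4, 2]
```

For progressive transfer, the last (coarsest) element of `bands` would be sent first, and each detail band received afterwards sharpens the result.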

#### JPEG 2000 Algorithm

Instead of the DCT used in the JPEG algorithm, the JPEG 2000 algorithm uses a **wavelet transform**, which represents the signal as a superposition of basis functions with special properties, called wavelets. An image compressed with this algorithm and then reconstructed is smoother and sharper, and the file is smaller than a JPEG file of the same reconstructed quality. JPEG 2000 is free from the main disadvantage of JPEG: thanks to the use of wavelets, images restored after strong compression do not contain artifacts in the form of blocks of pixels.
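
To give a feel for what a wavelet transform does, here is one level of the Haar transform, the simplest wavelet. This is only an illustration: JPEG 2000 itself uses different, smoother wavelets, and the unnormalized averaging used below is a simplification chosen for readability.

```python
def haar_step(signal):
    """One level: pairwise averages (coarse part) and half-differences (detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_inverse(averages, details):
    """Rebuild the original samples from averages and details."""
    signal = []
    for avg, det in zip(averages, details):
        signal.extend([avg + det, avg - det])
    return signal


row = [8.0, 6.0, 7.0, 7.0, 2.0, 4.0, 9.0, 9.0]
averages, details = haar_step(row)
print(averages)  # [7.0, 7.0, 3.0, 9.0]
print(details)   # [1.0, 0.0, -1.0, 0.0]
```

Many detail values are zero or small in smooth regions, so they compress well, and because the basis functions are not tied to a fixed block grid, the blocky artifacts of JPEG do not arise.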

### Where Is My Dog?

To find our dog in the photo, we need **Machine Learning**. What is **ML**? The computer is given selected data (positive and negative examples) and, based on them, learns how to handle new data that was not included in the training set.

A specific object can be described by its features and characteristics. Once several objects of the same class have been analyzed, their common features can be extracted. When a new object is examined, it is compared with this general idea of the class to decide whether it belongs to the class we know. For example, suppose we need to recognize any dog.

Then we need a training set. The dogs in it must be varied: large and small, with different coat colors, and in different poses. The classifier is formed from the training set and contains a description of the features that, when applied to a new image, will recognize a dog or not with some probability (expressed as a percentage).

### Difference of Gaussians

There are many different approaches to image recognition in computer vision and image understanding. For example, you first need to select interesting points or regions in the image, something that stands out from the background: bright spots, transitions, and so on. One of the most common ways to do this is called **Difference of Gaussians** (DoG). By blurring the picture with different radii and comparing the results, you can find the most contrasting fragments. The areas around these fragments are the most interesting.
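
The blur-twice-and-subtract idea can be sketched on a single scanline of an image. The example below assumes Gaussian radii (`sigma_small`, `sigma_large`) of 1 and 2 pixels and clamps indices at the borders; these values and all names are choices made for the illustration.

```python
import math


def gaussian_kernel(sigma):
    """Sampled, normalized Gaussian covering about three sigmas each side."""
    radius = int(3 * sigma) + 1
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Convolve the signal with a Gaussian, repeating the edge samples."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at the borders
            acc += weight * signal[j]
        out.append(acc)
    return out


def difference_of_gaussians(signal, sigma_small=1.0, sigma_large=2.0):
    """Subtract the strongly blurred signal from the lightly blurred one."""
    narrow = blur(signal, sigma_small)
    wide = blur(signal, sigma_large)
    return [a - b for a, b in zip(narrow, wide)]


# A dark scanline with one bright spot three pixels wide.
scanline = [10.0] * 30
for i in (14, 15, 16):
    scanline[i] = 100.0

dog = difference_of_gaussians(scanline)
strongest = max(range(len(dog)), key=lambda i: abs(dog[i]))
print(strongest)  # 15, the center of the bright spot
```

In flat regions the two blurred versions agree and the difference is close to zero, so only contrasting structures produce a strong response.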

Next, these areas are described in numeric form. Each region is divided into small sections, the direction of the gradients in each section is determined, and vectors are obtained. The resulting data is recorded in descriptors.

So that identical descriptors are recognized as such regardless of rotation in the image plane, they are rotated so that the largest vectors point in the same direction. This is not always done, but it is needed if you want to find two identical objects that are rotated differently. Descriptors can be written in numeric form and also represented as points in a multidimensional space.
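
A heavily simplified descriptor of this kind can be sketched as a histogram of gradient directions, followed by a rotation step that moves the dominant direction to the first bin. The bin count, the central-difference gradient, and the function names are assumptions of this example rather than a specific published descriptor.

```python
import math


def orientation_histogram(patch, bins=8):
    """Accumulate gradient magnitudes into bins by gradient direction."""
    hist = [0.0] * bins
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = int(angle / (2 * math.pi) * bins + 0.5) % bins  # nearest bin
            hist[index] += magnitude
    return hist


def normalize_rotation(hist):
    """Rotate the histogram so that its largest bin comes first."""
    start = max(range(len(hist)), key=lambda i: hist[i])
    return hist[start:] + hist[:start]


# The same brightness ramp, once increasing along x and once along y.
horizontal_ramp = [[float(c) for c in range(5)] for r in range(5)]
vertical_ramp = [[float(r) for c in range(5)] for r in range(5)]

raw_h = orientation_histogram(horizontal_ramp)
raw_v = orientation_histogram(vertical_ramp)
print(raw_h == raw_v)                                          # False
print(normalize_rotation(raw_h) == normalize_rotation(raw_v))  # True
```

The raw histograms differ because the gradients point in different directions, but after the rotation step the two descriptors coincide.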

Next, the descriptors are grouped into clusters, and for each cluster we describe a region in space. When a descriptor falls into such a region, what matters is not which part of the image it came from but which cluster region it has fallen into. We can then compare images by determining how many descriptors of one image are in the same clusters as the descriptors of another image. Such clusters can be called **visual words**.
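
The comparison by visual words can be sketched as follows. The cluster centers in `vocabulary` and the two-dimensional descriptors are made-up values for illustration; in a real pipeline the centers would come from clustering many descriptors (for example, with k-means), and descriptors would have many more dimensions.

```python
import math
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Index of the cluster center closest to the descriptor."""
    return min(range(len(vocabulary)), key=lambda i: math.dist(descriptor, vocabulary[i]))


def bag_of_words(descriptors, vocabulary):
    """Count how many descriptors of an image fall into each cluster."""
    return Counter(nearest_word(d, vocabulary) for d in descriptors)


def shared_words(bag_a, bag_b):
    """Number of descriptors that land in the same clusters in both images."""
    return sum((bag_a & bag_b).values())


vocabulary = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # hypothetical cluster centers

image_a = [(0.5, 0.2), (9.0, 1.0), (9.5, -0.5), (1.0, 9.0)]
image_b = [(0.1, 0.4), (10.5, 0.3), (0.2, 0.1)]
image_c = [(0.3, 9.5), (-0.5, 10.2)]

bag_a = bag_of_words(image_a, vocabulary)
bag_b = bag_of_words(image_b, vocabulary)
bag_c = bag_of_words(image_c, vocabulary)

print(shared_words(bag_a, bag_b))  # 2
print(shared_words(bag_b, bag_c))  # 0
```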

### Similar Objects Understanding

To find not just identical pictures but images of similar objects, take a set of images that contain the object and a set of images that do not. Then extract descriptors from them and cluster them.

Next, we need to find out which clusters received the descriptors from the images that contained the object. If the descriptors from a new image fall into the same clusters, the desired object is probably present in it. However, matching descriptors do not guarantee that the objects containing them are identical.

One way to verify the match further is **geometric validation**, which compares the positions of the descriptors relative to each other.

### Dogs, Cats or People?

In an ideal world, **categorization** would be simple: an object either falls into a category or it does not, based on a set of properties common to all members of the category. These are the so-called rigid (ideal) categories. In practice, things are not so simple: if someone likes to feast on dandelions, that is no reason to put dandelions in the ‘food’ class.

Therefore it is customary to use so-called **natural categories**, in which each class is defined by its best prototype examples. When classifying an unknown object, we can then specify a degree (probability) of fit to the category. Fuzzy rules are much more flexible and better suited to real classification tasks in computer vision and image understanding.

For example, in psychology the term ‘canonical perspective’ refers to the reference instance of a class by which it is easiest for a person to identify an object. When you hear the class “cat”, some cute and fluffy Garfield comes to mind. One simplified model of human thinking is the **feedforward architecture**, in which there is no a priori information about the observed scene, so no feedback is required. A good classifier should provide equivalent capabilities.

We now arrive at the formulation of the **categorization** problem: determine whether an image contains an object (scene) of a given category. At present, the classical approach analyzes only local features of an image, without isolating and analyzing individual objects and their combinations.

## Image Categorizing Methods

The main methods for categorizing images, including machine-learning-based techniques, are described below.

### kNN – k-Nearest Neighbors Method

Formally, the k-nearest-neighbors method is based on the compactness hypothesis: if the distance metric between examples is chosen well, then similar examples fall into the same class more often than into different ones. To classify a new object, you find its most similar “neighbors” in the dataset and determine the class by voting. A notable feature of this approach is its laziness: the calculations begin only at the moment a test case is classified, and no model is built in advance from the training examples.
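
A minimal sketch of the method is shown below. It assumes Euclidean distance as the metric and uses made-up two-dimensional feature vectors; in an image task the vectors would be descriptors or other image features.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """Classify `query` by majority vote among its k nearest examples.

    `examples` is a list of (feature_vector, label) pairs. Nothing is
    trained in advance: all the work happens at query time.
    """
    neighbors = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


training = [
    ((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"), ((0.9, 1.3), "cat"),
    ((5.0, 5.0), "dog"), ((5.5, 4.8), "dog"), ((4.7, 5.2), "dog"),
]

print(knn_classify((1.1, 1.0), training))  # cat
print(knn_classify((5.1, 5.1), training))  # dog
```

The quality of the result depends almost entirely on the choice of distance metric and on the value of `k`.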

### Random Forest

The random forest is one of the most versatile and effective supervised learning algorithms, applicable to both classification and regression tasks. The idea of this method is to build an ensemble of decision trees that are trained independently of each other, in parallel.

The final classification of an object is decided by a vote among all the trees that make up the ensemble. The advantages of the algorithm include high prediction quality, the ability to efficiently process data with a large number of classes and features, a built-in estimate of the model’s generalization ability, and high parallelism and scalability.
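
The ensemble-and-vote idea can be sketched with a drastically simplified forest. The assumptions here are significant: each "tree" is only a one-level decision tree (a stump) trained on a bootstrap sample using one randomly chosen feature, whereas a real random forest grows deeper trees and picks a random feature subset at every split. All names and data are illustrative.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def train_stump(samples, rng):
    """Fit a one-level tree on one randomly chosen feature."""
    feature = rng.randrange(len(samples[0][0]))
    values = sorted({x[feature] for x, _ in samples})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in samples if x[feature] <= threshold]
        right = [y for x, y in samples if x[feature] > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the sample had a single value for this feature
        label = majority([y for _, y in samples])
        return lambda x: label
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label


def train_forest(samples, trees=25, seed=0):
    """Train each tree independently on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        forest.append(train_stump(bootstrap, rng))
    return forest


def predict(forest, x):
    """Final decision: majority vote of all trees."""
    return majority([tree(x) for tree in forest])


training = [
    ((1.0, 1.0), "cat"), ((2.0, 1.5), "cat"), ((1.5, 2.0), "cat"), ((2.0, 2.0), "cat"),
    ((6.0, 7.0), "dog"), ((7.0, 6.0), "dog"), ((6.5, 6.5), "dog"), ((7.0, 7.0), "dog"),
]
forest = train_forest(training)
print(predict(forest, (1.8, 1.2)), predict(forest, (6.2, 6.8)))
```

Because the trees do not depend on each other, the loop in `train_forest` could be run in parallel, which is where the scalability of the method comes from.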

### SVM – Support Vector Machine

The SVM classifier is a powerful supervised classification method. It works well with segmented input rasters, but it can also work with standard images. It is widely used by computer vision and image understanding researchers. For standard input images, the tool accepts multi-channel images of any bit depth and performs SVM classification pixel by pixel, based on the input training feature class file.

For segmented rasters (those whose key property is set to Segmented), the tool calculates the index image and the associated segment attributes from the segmented **RGB** raster. The attributes are calculated to create a classifier definition file, which is then used in a separate classification tool. Attributes for each segment can be calculated for any image supported by Esri.

### Neural Networks

Neural networks are one of the computer vision and image understanding methods used for recognition, classification, and restoration of images. To reduce the number of input neurons in the network, an image classification system usually includes a preprocessing step. One such preprocessing step for digital images is the wavelet transform.

The wavelet transform is currently a widely known method for analyzing images and extracting image characteristics such as shape and texture. Two types of neural networks are used for image processing: **Convolutional Neural Networks** (CNN) and **Deep Convolutional Neural Networks** (DCNN).

A typical use of a CNN is image classification: if there is a cat in the image, the network outputs “cat”; if there is a dog, it outputs “dog”. Such networks usually use a “scanner” that does not parse all the data at once. For example, if you have a 200 × 200 image, you will not process all 40 thousand pixels at once. Instead, the network reads a 20 × 20 square (usually starting from the upper left corner), then moves 1 pixel and reads a new square, and so on.
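
The scanning step can be sketched in a few lines. The first helper counts how many window positions fit along one axis, using the sizes from the example above. The second slides a small kernel over an image one pixel at a time, as a convolutional layer does (strictly speaking this is cross-correlation, without padding). The tiny image and the `edge_kernel` values are made up for illustration.

```python
def window_positions(image_size, window_size, stride=1):
    """Number of window placements along one axis (no padding)."""
    return (image_size - window_size) // stride + 1


def convolve_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_rows = len(image) - kh + 1
    out_cols = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


per_axis = window_positions(200, 20)
total = per_axis ** 2
print(per_axis, total)  # 181 positions per axis, 32761 windows in total

# A 4 x 4 image with a vertical edge and a 2 x 2 kernel that responds to it.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
response = convolve_valid(image, edge_kernel)
print(response)  # the middle column lights up where the edge is
```

In a trained network the kernel values are not chosen by hand as here; they are learned from the training data.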

This input data is then passed through convolutional layers, in which not all nodes are connected to each other. These layers tend to shrink with depth, often by powers of two: 32, 16, 8, 4, 2, 1. In practice, a feedforward neural network (FFNN) is attached to the end of the CNN for further data processing. Such networks are called deep (DCNN).

## Summing up Computer Vision and Image Understanding

We humans accumulate a huge amount of information; the learning process of our own neural network does not stop for a second. For a person, it is not particularly difficult to reconstruct perspective from a flat picture and imagine what it would all look like in three dimensions. For a computer, all of this is much harder, primarily because of the problem of accumulating experience.

Since there is currently no universal metric, we will keep researching in the field of computer vision and image understanding, and we will be happy to respond to your comments if something is interesting or remains unclear.