
Computer Vision Datasets to Use for Your Next Project

It’s no secret that AI technologies are on the rise nowadays, so it’s not surprising that there is a good deal of articles describing advances in computer vision. Hundreds of tech sites cover the latest computer vision algorithms capable of taking numerous businesses to the next level. Far less attention, however, is paid to what makes those algorithms powerful: the data itself. Successful computer vision projects are simply impossible without enough high-quality training data. That’s why we decided to devote this article to an irreplaceable source of useful training data – computer vision datasets.

In this article, we’ll not only explain what computer vision datasets are and why they are so important; we’ll also introduce a number of datasets for computer vision that may be of great help in your next project. Intrigued? Then let’s start without further ado.

What Are Computer Vision Datasets?

Computer vision datasets are curated collections of information organized to enable useful learning on a specific topic. In essence, they are the vital training data that forms the backbone of machine learning.

As you know, computer vision deals with the automatic extraction, analysis, and understanding of useful information from an image or a sequence of images by a machine. Computer vision datasets let us teach our machines what exactly to extract, analyze, and understand. In other words, datasets for computer vision help teach machines to perceive and respond the way humans do.

As a rule, existing computer vision datasets are off-the-shelf, already annotated sets available either for free or for purchase. The majority of them are application-specific. This means that they focus on a certain topic, for instance, faces, cars, etc. They serve as a good starting point for any project and are an integral part of an ML engineer’s toolkit.

Why Are Computer Vision Datasets Necessary?

It doesn’t matter whether you’re going to produce self-driving cars, employ drones, or automate your manufacturing process with robots. In any case, your mission is impossible without high-quality datasets for computer vision.

Any time you implement computer vision, choosing a proper algorithm is not enough. You have to train your model, test it, and teach it what it doesn’t know yet. In fact, the process resembles that of teaching a toddler. If you’re not a parent yet, we’ll explain our thought in a few words.

If you want to teach a kid what a cat is, you definitely won’t tell him: “My boy, a cat is a small domesticated carnivorous mammal with soft fur”. You’ll just show him a number of pictures with various cats repeating: “This is a cat, and this is a cat, and here is one more cat.” By seeing real-life examples of what the cat is, your kid will learn to recognize this animal.

The same is true of machines. A machine can’t properly interact with real objects unless it knows in detail what those objects are. Consequently, the more real-life examples you use during training, the more accurate the result will be.

In a word, practice makes perfect, as they say, and computer vision datasets are a unique source of such practice. Whether your interests lie in optical character recognition, image recognition, video recognition, or video tracking, datasets for computer vision are a must.

Helpful Computer Vision Datasets

Of course, no one says it’s impossible to create your own datasets for computer vision. However, this task can be tiresome and time-consuming even for a real pro. That’s why we decided to save you precious time and effort and compiled a list of useful computer vision datasets that might pique your interest. The best thing is that you’ll find a download link under every description.

The Cityscapes Dataset

Cityscapes is a large-scale dataset that focuses on semantic understanding of urban street scenes. It is one of the computer vision datasets helpful for assessing the performance of vision algorithms. Besides, it can be used to support research that aims to exploit large volumes of annotated data.

The Cityscapes Dataset includes a wide range of stereo video sequences recorded in street scenes from 50 different cities. With this dataset, you get 5,000 images with fine annotations and 20,000 images with coarse ones.

The best thing is that the dataset is absolutely free for non-commercial purposes, which include academic research, teaching, scientific publications, and personal experimentation.

ImageNet Dataset

ImageNet is, perhaps, one of the most popular datasets for computer vision. Its main aim is to provide researchers around the world with an easily accessible image database.

This dataset, organized according to the WordNet hierarchy, contains more than 100,000 synonym sets (synsets). The majority of the synsets (80,000+) are nouns. Each synset is illustrated by, on average, 1,000 images. Furthermore, the images for every concept are quality-controlled and human-annotated.
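Because of this WordNet organization, ImageNet labels its categories with WordNet IDs (“wnids”) such as `n02084071`: a part-of-speech letter followed by the synset’s numeric WordNet offset. As a rough sketch (the exact wnid used here is just an illustrative example, not a guaranteed mapping), such an ID can be parsed like this:

```python
# A minimal sketch of the ImageNet-style WordNet ID ("wnid") format:
# a part-of-speech letter ('n' for noun, 'v' for verb, etc.) followed
# by an 8-digit WordNet offset. The example wnid below is illustrative.

def parse_wnid(wnid: str) -> tuple[str, int]:
    """Split a wnid like 'n02084071' into (pos, offset)."""
    pos, offset = wnid[0], wnid[1:]
    if pos not in "nvars" or not offset.isdigit():
        raise ValueError(f"not a valid wnid: {wnid!r}")
    return pos, int(offset)

pos, offset = parse_wnid("n02084071")
print(pos, offset)  # n 2084071
```

This is handy when cross-referencing ImageNet categories against WordNet tools, which usually address synsets by part of speech and offset.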

Of course, it’s a real challenge to download such a huge amount of data. That’s why there are a few options for downloading to choose from. You can download image URLs, original images, features, object bounding boxes, or object attributes. Also, there’s a pleasant bonus – API integration.

COCO Dataset

Common Objects in Context (or simply COCO) is another outstanding representative of computer vision datasets. COCO is a large-scale object detection, segmentation, and captioning dataset. Its creators’ goal is “advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding”.

This dataset includes 80 object categories and 91 stuff categories. The images of complex everyday scenes show common objects in their natural context. The objects are labeled using per-instance segmentations to aid in precise object localization.

All in all, the dataset has more than 300K images with 5 captions per image. The greatest thing is that every object category COCO contains is easily recognizable by a 4-year-old.
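COCO distributes its annotations as plain JSON with `images`, `annotations`, and `categories` arrays, where each annotation references an image and a category by ID. Real annotation files are usually read with the `pycocotools` library, but the underlying structure is simple enough to sketch with a tiny hand-made record (the field names follow COCO’s published format; the specific IDs and values below are made up for illustration):

```python
import json

# A tiny, hand-made annotation dict in the COCO JSON layout
# ("images" / "annotations" / "categories"). Real files are large and
# usually loaded via pycocotools, but they are ordinary JSON underneath.
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg"}],
    "categories": [{"id": 18, "name": "dog"}, {"id": 17, "name": "cat"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 18, "bbox": [10, 20, 50, 40]},
        {"id": 11, "image_id": 1, "category_id": 18, "bbox": [80, 15, 30, 30]},
    ],
}

data = json.loads(json.dumps(coco))  # round-trip as if read from disk
names = {c["id"]: c["name"] for c in data["categories"]}
counts = {}
for ann in data["annotations"]:  # one annotation = one object instance
    name = names[ann["category_id"]]
    counts[name] = counts.get(name, 0) + 1
print(counts)  # {'dog': 2}
```

The same id-joining pattern (annotations → categories, annotations → images) is how most COCO tooling resolves labels and per-instance segmentations.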

Mapillary Vistas Dataset

Mapillary Vistas Dataset is one of the largest computer vision datasets for teaching machines to see. This diverse street-level imagery dataset comes with pixel-accurate, instance-specific human annotations for understanding street scenes around the world.

The dataset contains 25,000 high-resolution images: 18,000 for training, 2,000 for validation, and 5,000 for testing. The average resolution is about 9 megapixels. The images cover six continents: North and South America, Europe, Africa, Asia, and Oceania. Moreover, there’s high variability in weather conditions (sun, rain, snow, fog, haze) and capturing times (dawn, daylight, dusk, and even night). What’s more, the images are captured from various viewpoints, including road, sidewalk, and off-road.

Mapillary Vistas Dataset comes in two editions: Research and Commercial. In case your research is non-commercial, you can request the free version. If you’re doing commercial research, you’ll have to license the full dataset for training models. Also, you can download the sample to evaluate the dataset.

MPII Human Pose Dataset

MPII is one of the state-of-the-art datasets for computer vision that enables evaluation of articulated human pose estimation. This dataset contains about 25K images including more than 40K people with annotated body joints.

The images of the MPII Human Pose Dataset were systematically collected using an established taxonomy of everyday human activities. These activities include sports, transportation, self-care, occupation, music playing, dancing, and fishing and hunting, to name just a few. In total, the dataset covers 410 human activities. The images were extracted from YouTube videos, and every image has an appropriate activity label. Besides, the test set includes rich annotations covering body part occlusions and 3D torso and head orientations.

The Kinetics Human Action Video Dataset

Kinetics-600 is a large-scale, high-quality dataset aimed at helping the machine learning community to advance models for video understanding. It is an improved expanded version of Kinetics-400 that was released in 2017.

Kinetics-600 is a useful computer vision dataset of YouTube video URLs covering a diverse range of human-focused actions. It consists of around 500,000 video clips spanning 600 human action classes, with at least 600 clips per class. Each clip lasts approximately 10 seconds and is labeled with a single class. The actions cover both human-object interactions (for instance, playing instruments) and human-human interactions (for example, shaking hands and hugging).
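Since Kinetics ships as CSV lists of YouTube clips rather than video files, working with it starts with parsing those lists. A minimal sketch follows; the column names mirror the published release format, but treat them as an assumption, and the YouTube IDs below are placeholders:

```python
import csv
import io

# Kinetics is distributed as CSVs of YouTube clips. The header below
# follows the release format (an assumption here); IDs are placeholders.
sample = """label,youtube_id,time_start,time_end,split
playing guitar,abc123XYZ_0,12,22,train
shaking hands,def456UVW_1,4,14,val
"""

clips = list(csv.DictReader(io.StringIO(sample)))
for clip in clips:
    # Each clip is ~10 s of a longer video, addressed by video id
    # plus a start offset within that video.
    url = f"https://www.youtube.com/watch?v={clip['youtube_id']}&t={clip['time_start']}s"
    print(clip["label"], url)
```

From records like these, a downloader would fetch each video and trim it to the `[time_start, time_end]` window to produce the actual training clips.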

The best thing is that the files in the dataset are periodically updated, so they don’t contain links to deleted or non-public videos. The last update was on May 1, 2018.

The 20BN-Something-Something Dataset

The 20BN-Something-Something is one of the latest computer vision datasets released by TwentyBN. It’s a huge collection of densely-labeled video clips showing humans performing pre-defined basic actions with everyday objects. The dataset enables machine learning models to develop a fine-grained understanding of basic actions that occur in the physical world.

The total number of videos in the 20BN-Something-Something is 220,847. To be more precise, the training set contains 168,913 videos, the validation set includes 24,777 videos, and the test set has 27,157 videos. Every video in the training and validation sets has object annotations in addition to the video label. All in all, there are 318,572 annotations involving 30,408 unique objects. The videos have a height of 240 px.

All video data is provided as a TGZ archive split into parts of 1 GB max. The entire download size is 19.4 GB. You can use the dataset for free for any academic research; commercial use requires a license.
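Split TGZ downloads like this are reassembled by simple byte concatenation of the parts in order, after which the result is a normal gzipped tar. The sketch below demonstrates the idea end to end on a toy archive (the file names are invented for the demo):

```python
import io
import tarfile
import tempfile
from pathlib import Path

# Demonstrates reassembling a TGZ archive that was split into parts:
# concatenate the parts in order, then extract the joined bytes.
with tempfile.TemporaryDirectory() as tmpdir:
    tmp = Path(tmpdir)

    # Build a toy archive standing in for the real multi-GB download.
    (tmp / "clip_0001.txt").write_text("frames...")
    archive = tmp / "videos.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(tmp / "clip_0001.txt", arcname="clip_0001.txt")

    # Split it into small numbered parts, as the real download is.
    blob = archive.read_bytes()
    for n, i in enumerate(range(0, len(blob), 64)):
        (tmp / f"videos.tgz.{n:02d}").write_bytes(blob[i:i + 64])

    # Reassembly: concatenate the parts in order, then open as one tar.
    parts = sorted(tmp.glob("videos.tgz.*"))
    joined = b"".join(p.read_bytes() for p in parts)
    with tarfile.open(fileobj=io.BytesIO(joined), mode="r:gz") as tar:
        members = tar.getnames()

print(members)  # ['clip_0001.txt']
```

On the command line the equivalent is typically a `cat` of the parts piped into `tar`; the Python version above is just the same concatenation made explicit.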

The 20BN-Jester Dataset

The 20BN-Jester is another helpful dataset for computer vision from TwentyBN. This dataset represents a large collection of densely-labeled video clips. The video clips show people who perform predefined hand gestures in front of a laptop camera or webcam. The dataset makes it possible to train robust machine learning models to recognize human hand gestures.

The 20BN-Jester contains 148,092 videos: 118,562 in the training set, 14,787 in the validation set, and 14,743 in the test set. The hand gestures cover 27 categories, from drumming fingers and pushing the hand away to no gesture at all.

The video data comes as a TGZ archive split into parts of 1 GB max. Its download size is 22.8 GB. If you are carrying out academic research, the dataset is available free of charge. For commercial use, you have to request a commercial license.

Visual Genome Dataset

Visual Genome is the last item of our computer vision datasets list. However, it doesn’t mean that it deserves less attention than the others.

Visual Genome is a dataset released with the aim to connect language and vision using dense image annotations. The dataset includes 108,077 images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. In total, the dataset contains 5.4M region descriptions, 1.7M visual question answers, 3.8M object instances, 2.8M attributes, and 2.3M relationships.

Each object in the Visual Genome Dataset is delineated by a tight bounding box and canonicalized to a synset ID in WordNet. Descriptions and question answers are raw texts without any restrictions on length or vocabulary. The pairwise relationships can be actions, spatial relations, descriptive verbs, prepositions, comparatives, or prepositional phrases.
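In practice, these pairwise relationships are distributed as JSON and boil down to (subject, predicate, object) triples between boxed, synset-tagged regions. A minimal sketch of that structure follows (the field names are an assumption modeled on the published JSON dumps, and the record itself is hand-made for illustration):

```python
import json

# A tiny, hand-made record in the spirit of Visual Genome's relationship
# annotations. Field names are an assumption based on the published JSON
# dumps, not an authoritative schema; values are invented.
record = {
    "image_id": 1,
    "relationships": [
        {
            "predicate": "on",
            "subject": {"name": "cat", "synsets": ["cat.n.01"],
                        "x": 10, "y": 5, "w": 40, "h": 30},
            "object": {"name": "mat", "synsets": ["mat.n.01"],
                       "x": 0, "y": 30, "w": 120, "h": 20},
        }
    ],
}

rels = json.loads(json.dumps(record))["relationships"]  # as if read from disk
triples = [(r["subject"]["name"], r["predicate"], r["object"]["name"])
           for r in rels]
print(triples)  # [('cat', 'on', 'mat')]
```

Extracting triples like this is a common first step when using Visual Genome for scene-graph or vision-language work.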

There are two types of QA pairs associated with each image. The first type is freeform QAs, based on the entire image. The second type is region-based QAs, based on the selected regions of the image. All in all, each image has 6 different types of QAs: what, where, how, when, who, and why.

We hope you’ve found computer vision datasets worth your attention among those mentioned above. Don’t hesitate to download them and use them in your next project. Undoubtedly, they will do a great job and help you succeed.

Do you use any other computer vision datasets and want to recommend them to our readers? In such a case, you are welcome to share your preferences in the comments section below.

