
A Developer's Guide to Video Machine Learning & Video Deep Learning

Nowadays, video footage is widely used in applications such as inspection, surveillance, process management, and quality control, where assessing the footage requires significant manpower. Artificial Intelligence is used to reduce the workload of video analysis while increasing accuracy in tasks such as detection and classification.

RidgeRun has experience in the design, development, and optimization of computer vision for video applications, implementing machine learning and deep learning algorithms according to the capabilities of the hardware platform where the application must run.

What is Machine Learning for Video?

Machine Learning (ML) is a field of artificial intelligence where algorithms use statistics to find patterns in data, from small collections to massive amounts. Machine learning builds on the ability to use computers to probe data for structure, even when we have no theory of what that structure looks like. Data as a concept covers many things, including numbers, words, images, clicks, and more. Anything that can be stored digitally can also be fed into a machine learning algorithm.

 

The patterns that ML uncovers are used to perform very accurate analysis of data without human intervention. ML allows the user to feed a computer algorithm an immense amount of data and start an analysis process to obtain data-driven recommendations and decisions based solely on the input data.

How is Machine Learning Used for Video Recognition?

A video is an example of digital data: each video can be treated as a sequence of images (frames) that can be used to feed a machine learning algorithm. Some ML applications on video being developed nowadays include:
 

  • Cognitive systems involving assistive robots, self-driven cars, video segmentation, and object detection.

  • Metadata recognition, including content compliance and video metadata extraction.

  • Intelligent surveillance, which covers safe manufacturing, automatic alarming, and child care.
     

At RidgeRun, machine learning techniques have been applied to video-based applications that recognize people, animals, and multiple object types, supporting detection, classification, security, entertainment, and tracking tasks.
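As a minimal illustration of the frames-as-data idea above, the sketch below builds a toy clip with NumPy and turns it into the fixed-size feature vectors classic ML algorithms expect. The clip dimensions and the mean-pooling summary are illustrative assumptions, not a prescribed pipeline; in a real application the frames would come from a decoder such as OpenCV's VideoCapture.

```python
import numpy as np

# A toy "video": 16 grayscale frames of 32x32 pixels.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(16, 32, 32), dtype=np.uint8)

# Classic ML algorithms expect fixed-size feature vectors, so each frame
# (or the whole clip) is flattened or summarized before being fed in.
per_frame_features = video.reshape(16, -1)          # one vector per frame
clip_feature = per_frame_features.mean(axis=0)      # one vector per clip

print(per_frame_features.shape, clip_feature.shape)  # (16, 1024) (1024,)
```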

The Difference Between Video-Based AI and Machine Learning

Machine learning is the most common AI method used today and one of the fundamental disciplines of modern artificial intelligence.

Video-based AI is tied to another key discipline called Computer Vision, whose goal is to represent and interpret the visual elements of an image or video using Artificial Intelligence. Computer Vision relies heavily on machine learning techniques to accomplish this goal, and together they establish the basis of emerging applications such as self-driving vehicles and face recognition.

Traditional Computer Vision algorithms for video-based AI can be described by the following broad steps:

  1. Extract local high-dimensional visual features that describe a region of the video, sampled either densely or at a sparse set of interest points.

  2. Combine the extracted features into a fixed-size video-level description. A popular variant for this step is a bag-of-visual-words encoding of the features at the video level, where the visual words are derived using hierarchical or k-means clustering.

  3. Train a classifier, such as a Support Vector Machine (SVM) or Random Forest (RF), on the bag of visual words to perform the final prediction.

The major differences between approaches based on these steps come down to the design choices around combining spatio-temporal information.
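The three-step pipeline can be sketched with scikit-learn, using synthetic descriptors in place of real video features. The descriptor dimensionality, vocabulary size, and class means below are all illustrative assumptions; real pipelines would extract descriptors such as HOG or HOF from the frames.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Step 1 stand-in: four toy "videos", each yielding a variable number of
# 32-dimensional local descriptors (e.g. from interest points).
videos = [rng.normal(loc=c, size=(rng.integers(50, 80), 32))
          for c in (0.0, 0.0, 2.0, 2.0)]
labels = [0, 0, 1, 1]

# Step 2: build a visual vocabulary by k-means clustering all descriptors.
k = 8
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(videos))

def bovw_histogram(descriptors):
    """Quantize descriptors to visual words; return a normalized histogram."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

X = np.array([bovw_histogram(v) for v in videos])

# Step 3: train an SVM on the fixed-size video-level descriptions.
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```

Each video ends up as a k-dimensional histogram regardless of its length, which is exactly what makes step 2 a "fixed-size video-level description".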

The Benefits of Video-Based AI and Machine Learning

By using Machine Learning techniques on video-based AI recognition, developers are able to obtain these benefits:

 

  • The final products will have fewer defects thanks to computational imaging.

  • Reliability will be greater since the human factor is fully or partially removed (cameras and computers don't get tired as human eyes do).

  • Costs will be reduced: developers save time while reducing the probability of shipping defective products. Computer Vision approaches also help optimize time-to-market for video-based AI applications.

  • Machine Learning methods enable significant improvements in the customer experience delivered with products and services.

The Challenges of Video-Based AI and Machine Learning

The main challenges of video-based AI applications are mentioned below:

 

  • Computational Cost

  • Recognition tasks in video involve capturing spatio-temporal context across frames, and the captured spatial information must be compensated for camera movement. Even strong spatial object detection does not suffice, since the motion information carries finer details.

  • A high level of expertise in Computer Vision and machine learning is needed to design an application properly while delivering the desired benefits.

  • Designing architectures that can capture spatio-temporal information involves multiple options that are non-trivial and expensive to evaluate.

RidgeRun has vast experience in the development of Computer Vision for image and video-based AI applications that must run on embedded systems while considering energy, communication, and memory constraints. If you want to learn more, please contact us and visit our website.

What is Deep Learning for Video?

Deep learning is a concept whose main idea is to imitate how the human brain processes information, in order to apply it in decision-making, prediction, and recognition processes.

In both machine learning and deep learning, developers enable computers to identify trends and characteristics in data by learning from a specific data set. In machine learning, training data is used to build a model that a computer can use to classify data. Deep learning is a subset of machine learning, where data is fed into the deep neural network in order to learn what features are appropriate to determine the desired output. For video data both spatial and temporal information must be considered according to the application features.

How is Deep Learning Used for Video Recognition?

Deep neural networks for video tasks are among the most complex models, with substantially more parameters than comparable models for image tasks. Despite the success of deep learning architectures in image classification, progress in architectures for video-based AI has been slower.

There are two basic approaches for video-based AI tasks:

1. Single Stream Network:

This approach was initially proposed by Karpathy et al. (2014), who explored multiple ways to merge temporal information from consecutive frames using 2D convolutions.

In this approach the consecutive frames are presented as the input, and there are four different configurations that can be used: 

a. Single frame: a single-frame architecture is used, and information from all the frames is fused at the last stage.

b. Late fusion: two networks with shared parameters are used, fed with frames spaced 15 frames apart, and their predictions are combined at the end of the configuration.

c. Early fusion: the combination is performed in the first layer by convolving over 10 frames.

d. Slow fusion: fusion is performed at multiple stages, as a balance between early and late fusion. Multiple clips are sampled from the entire video and prediction scores are averaged from the sampled clips in order to perform the final predictions.
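The early and late fusion variants above amount to different ways of arranging frame data before and after a network. A small NumPy sketch of the data layout follows; the per-frame "network" is a toy stand-in, and the 15-frame spacing from the paper is reduced to fit a 10-frame clip.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 10, 8, 8, 3          # 10 RGB frames of a toy clip
frames = rng.random((T, H, W, C))

# Early fusion: stack the temporal dimension into the channel axis so a
# single 2D convolution sees all 10 frames at once (10 * 3 = 30 channels).
early_input = frames.transpose(1, 2, 0, 3).reshape(H, W, T * C)
print(early_input.shape)          # (8, 8, 30)

# Late fusion: run the same (hypothetical) per-frame network on two
# frames spaced apart and combine the class scores at the end.
def frame_net(frame):
    """Stand-in per-frame model: global average per channel as 'scores'."""
    return frame.mean(axis=(0, 1))

late_scores = (frame_net(frames[0]) + frame_net(frames[9])) / 2
print(late_scores.shape)          # (3,)
```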

Important issues that caused this approach to fail:

  • The learned spatiotemporal features did not capture motion well.

  • The dataset used was not diverse enough, making it difficult to learn such detailed features.

 

2. Two Stream Networks:

It is an approach proposed by Simonyan and Zisserman (2014) that tries to overcome the failures of the Single Stream Network approach. Instead of a single network handling both spatial and temporal context, the architecture is composed of two separate networks: a pre-trained network for spatial context, and a second network for motion context.

 

The input of the spatial net is a single video frame, while motion features are modeled as stacked optical flow vectors. The two streams are trained separately and combined using a Support Vector Machine (SVM). The final prediction is obtained as in the Single Stream Network approach, i.e., by averaging across sampled frames.
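The temporal stream's stacked-optical-flow input and the score-level fusion can be sketched as below. The flow values and class scores are fabricated for illustration, and simple averaging stands in for the SVM fusion used in the original work.

```python
import numpy as np

rng = np.random.default_rng(1)
L, H, W = 10, 8, 8
# Optical flow between consecutive frames: horizontal + vertical components.
flow = rng.standard_normal((L, H, W, 2))

# Stacked optical flow input for the temporal stream: 2L channels.
temporal_input = flow.transpose(1, 2, 0, 3).reshape(H, W, 2 * L)
print(temporal_input.shape)       # (8, 8, 20)

# Hypothetical per-stream class scores; fuse at the score level.
spatial_scores = np.array([0.2, 0.5, 0.3])
temporal_scores = np.array([0.1, 0.7, 0.2])
fused = (spatial_scores + temporal_scores) / 2
print(fused.argmax())             # class 1 wins after fusion
```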

This method improved the performance of the Single Stream method by explicitly capturing local temporal movement. However, there are some drawbacks to be considered:

  • The video-level predictions are obtained by averaging predictions over sampled clips, so long-range temporal information is still missing from the learned features.

  • This method requires pre-computing the optical flow vectors and storing them separately. Moreover, since the two streams are trained separately, true end-to-end training remains a long way off.

  • Training clips are sampled uniformly from videos, causing a false label assignment problem. The ground truth of each clip is assumed to be the same as the ground truth of the video, which may not be the case if an event of interest happens for a small duration within the entire video.

Several solutions have been proposed since 2014, based on both the Single Stream Network and the Two Stream Network architectures. These solutions address the drawbacks of the original approaches while trying to increase performance in video-based AI applications. Nowadays deep learning is the core concept in such solutions.

Pattern Recognition:

Pattern Recognition is a concept which aims to identify things or objects in an image or in a sequence of images such as a video.

 

There are two classifiers for pattern recognition:

  • Pixel-based classification: pixel-based algorithms do not take contextual information into consideration, where context can be understood as how pixels/objects relate to their environment.

  • Object-based classification: to overcome the limitation of pixel-based classification, Object-Based Image Analysis (OBIA) divides images into meaningful image-objects and evaluates their characteristics in spatial, spectral, and temporal aspects.
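The contrast between the two classifiers can be sketched with NumPy and SciPy: a per-pixel threshold ignores context, while connected-component labeling groups pixels into image-objects whose properties (here, size) can then be evaluated. The toy image and threshold are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

# Toy grayscale image: two bright blobs on a dark background.
img = np.zeros((8, 8))
img[1:3, 1:3] = 1.0
img[5:7, 4:7] = 1.0

# Pixel-based classification: threshold each pixel independently,
# with no notion of how pixels relate to their neighbors.
mask = img > 0.5

# Object-based analysis: group connected pixels into image-objects and
# evaluate per-object characteristics such as size.
labeled, n_objects = ndimage.label(mask)
sizes = ndimage.sum(mask, labeled, range(1, n_objects + 1))
print(n_objects, sizes)           # 2 objects, sizes [4. 6.]
```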

Object Detection:

 

Object detection is a problem composed of object localization and object classification. The objective of localization is to determine where objects are located in a given image, while classification consists of determining which category each object belongs to.

The pipeline of traditional object detection models can be mainly divided into three stages: informative region selection, feature extraction and classification.
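The three traditional stages can be sketched with a sliding window for region selection, mean intensity as a stand-in feature extractor, and a threshold standing in for a trained classifier; all of these are toy assumptions.

```python
import numpy as np

img = np.zeros((8, 8))
img[2:5, 3:6] = 1.0               # one bright "object"

def regions(image, win=3, stride=1):
    """Stage 1, informative region selection: exhaustive sliding windows."""
    H, W = image.shape
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            yield (y, x), image[y:y + win, x:x + win]

def feature(patch):
    """Stage 2, feature extraction: mean intensity as a toy descriptor."""
    return patch.mean()

def classify(f, thr=0.9):
    """Stage 3, classification: a threshold stands in for a trained SVM."""
    return f > thr

detections = [pos for pos, patch in regions(img) if classify(feature(patch))]
print(detections)                 # [(2, 3)] — the window covering the object
```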

How is Deep Learning Used For Video Stabilization?

In the past decade, several digital video stabilization techniques have been proposed to improve the visual quality of videos recorded by cameras that capture shaky content, due to movements caused by external factors such as camera placement in vehicles, hand-holding, windy conditions, and more. The main improvement comes from removing high-frequency camera movements.

Most of the proposed techniques deal with the stabilization problem from a global view, estimating and smoothing the camera path through offline computation. However, some techniques perform online video stabilization with a capture-compute-display loop for each incoming frame, in real time and with low latency. The real-time requirement demands camera motion estimation via affine transformations, homography, or mesh flow.

Digital video stabilization algorithms are usually based on feature extraction, frame-to-frame tracking, and optical flow calculations, followed by an algorithm such as Random Sample Consensus (RANSAC), which estimates the global motion in a video. The motion information can be represented with a homography matrix containing the 2D transformation between the pixels of consecutive frames. Motion estimation is considered the most computationally demanding step of the video stabilization process, because feature extraction must examine a very large number of pixels, depending on the frame dimensions.
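The estimate-then-smooth idea can be sketched on a one-dimensional camera path: a moving average removes the high-frequency shake, and the per-frame correction is the gap between the raw and smoothed paths. The shaky path below is synthetic; a real pipeline would estimate it from homographies or tracked features.

```python
import numpy as np

# Hypothetical per-frame camera translations: high-frequency shake on
# top of a slow intentional pan.
rng = np.random.default_rng(0)
n = 50
pan = np.linspace(0, 10, n)
shaky_path = pan + rng.normal(scale=0.8, size=n)

def smooth(path, radius=5):
    """Moving-average smoothing of the estimated camera path."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    return np.convolve(np.pad(path, radius, mode="edge"), kernel, mode="valid")

stable_path = smooth(shaky_path)
# Correction to warp each frame by: difference between smooth and raw paths.
correction = stable_path - shaky_path
print(stable_path.shape, correction.shape)
```

The smoothed path keeps the slow pan while suppressing the frame-to-frame jitter, which is exactly the "removing high-frequency camera movements" goal described above.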

With the addition of a high frame rate as a requirement, performing video stabilization processes for real time applications becomes very challenging.

Deep learning represents an approach to overcoming the main challenges of video stabilization for real-time applications, where the motion estimation step is accomplished using deep Convolutional Neural Networks (CNNs), which estimate the required affine transformation matrix directly from a given pair of video frames. Deep learning approaches can increase computational speed as well as improve the overall performance of the system when used on videos recorded with high-quality cameras.


What Developers Need to Know about Deep Learning and Computer Vision

Computer vision is a field of study that develops techniques to help computers 'see' and understand the content of digital images such as photographs and videos. Deep learning provides a subset of techniques that can be used to solve computer vision tasks. The relationship between computer vision and deep learning peaked with the use of convolutional neural networks (CNNs) and modified CNN architectures for computer vision tasks. Convolutional neural networks are a class of deep neural networks (DNNs) most commonly applied to analyzing visual imagery, due to their strong image pattern recognition capabilities.
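A single convolution step, the core operation of a CNN, can be shown in a few lines of NumPy. The hand-picked vertical-edge kernel below stands in for a filter a CNN would learn from data.

```python
import numpy as np

# Toy 6x6 image with a vertical edge at column 3.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# A 3x3 vertical-edge kernel (hand-picked stand-in for a learned filter).
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# Slide the kernel over the image, computing a dot product at each position.
out = np.zeros((4, 4))
for y in range(4):
    for x in range(4):
        out[y, x] = (img[y:y + 3, x:x + 3] * kernel).sum()

print(out[:, 1])   # strong response (3.0) where the kernel straddles the edge
```

A CNN stacks many such filters across several layers, learning the kernel values instead of hand-picking them, which is what gives it its pattern recognition power.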


Get Started with Video-Based AI: Contact RidgeRun

RidgeRun has vast experience developing software for embedded systems, focusing on embedded Linux and GStreamer, including Deep Learning and Computer Vision projects for video-based AI applications that customers can leverage in their products to reduce time-to-market.


Here are some projects developed by RidgeRun:
