Computer Vision will be Sharpened by Deep Learning


Wouldn’t it be nice if you could talk into your phone and say, “show me all pictures from my vacation in Hawaii,” or type your 2-year-old nephew’s name in the search box and get all of his pictures on the screen? Personally, I would love to have my photo album categorized, tagged, and stored without having to generate all that information myself, and then use a simple text or voice search to retrieve photographs of interest. The same goes for video – I would love to be able to simply say, “show me my graduation video” and have the video pop up on the screen.  There are systems from Facebook and Google that let you do something similar, but they require tagging the picture first. In essence, as you store the picture, you tag it with a person’s name or a location, and it can then be retrieved using those identifiers.

Enter deep learning, a key enabling technology for the future of computer vision. Deep learning is a set of algorithms that attempt to learn on multiple levels, corresponding to different levels of abstraction. The levels in these learned statistical models correspond to distinct levels of concepts, where higher-level concepts are defined in terms of lower-level ones.

In the context of computer vision, a deep learning algorithm makes multiple decisions as it scans and analyzes a picture. Suppose you want the computer to recognize a picture of several students in graduation gowns.  A deep learning-based system will analyze the picture at the 8×8-pixel level, for instance, then at 16×16 pixels, then at 32×32 pixels, and the process repeats until a meaningful result is identified.  At each level, some useful information is extracted and passed on to the next level. For example, at the lowest level the computer may learn whether or not the picture was taken in sunlight, the next level may learn that there is a human face in the image, the following level might determine whether the person is standing or sitting, and so on. Eventually, enough knowledge adds up for the system to output a message saying that it is a picture of students in graduation robes.
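The layered idea above can be sketched in a few lines of Python. This is a toy illustration, not a trained network: each stage simply pools the previous stage’s output, so later stages summarize progressively larger regions of the image, which is the essence of the level-by-level analysis just described.

```python
import numpy as np

def pool2x2(x):
    """Halve each dimension by taking the max of every 2x2 block,
    so each successive level summarizes a larger image region."""
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

# A toy 32x32 "image": a bright square on a dark background.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0

# Each pass stands in for one learned level of abstraction:
# the representation shrinks while each cell covers more of the image.
levels = [img]
while levels[-1].shape[0] > 4:
    levels.append(pool2x2(levels[-1]))

for i, lvl in enumerate(levels):
    print(f"level {i}: {lvl.shape[0]}x{lvl.shape[1]}")
```

A real deep network would apply learned filters at each stage rather than a fixed max; the point here is only the shrinking hierarchy of representations.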

Deep learning algorithms will enable computers to translate pictures into text or speech and vice versa. So if you present a picture to the computer, it can tell you that the picture is of “grandpa drinking beer,” or if you tell the computer to “find all of Joe’s pictures,” it will locate them for you in a collection of thousands of photographs.
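Once a model has turned each photo into text tags, the retrieval half of that scenario is just a text search. The sketch below makes the idea concrete; the filenames and tags are made-up placeholders for what a real captioning model would generate.

```python
# Toy illustration: machine-generated tags in, text search out.
# The tags below are invented; a trained network would produce them.
photo_tags = {
    "IMG_001.jpg": {"grandpa", "beer", "kitchen"},
    "IMG_002.jpg": {"joe", "beach", "sunset"},
    "IMG_003.jpg": {"joe", "graduation", "robe"},
}

def find_photos(query_word):
    """Return every photo whose generated tags contain the query word."""
    return sorted(name for name, tags in photo_tags.items()
                  if query_word.lower() in tags)

print(find_photos("Joe"))   # photos tagged "joe"
```

The hard part, of course, is generating accurate tags automatically; once that exists, search over thousands of photos is trivial.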

There are two types of deep learning algorithms – supervised and unsupervised. In the supervised mode, the desired output for each training example is given to the computer. In the unsupervised mode, no such answers are provided, and it is up to the algorithm to decide whether the patterns it finds are meaningful.
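The contrast can be shown on a deliberately tiny example. Below, the supervised learner is handed a label for every data point, while the unsupervised one (a few iterations of k-means) has to discover the same two groups on its own; both are simple stand-ins for real training.

```python
import numpy as np

data = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])   # two obvious clusters

# Supervised: the desired output (a label per example) is given,
# so "training" here just averages each labeled group.
labels = np.array([0, 0, 0, 1, 1, 1])
sup_centroids = np.array([data[labels == k].mean() for k in (0, 1)])

# Unsupervised: no labels. A few rounds of k-means let the algorithm
# decide the grouping by itself, starting from rough guesses.
centroids = np.array([data.min(), data.max()])
for _ in range(10):
    assign = np.abs(data[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([data[assign == k].mean() for k in (0, 1)])

print(sup_centroids, centroids)
```

On this data both approaches land on the same two group centers; the difference is only in whether the grouping was told to the algorithm or found by it.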

Google’s cat experiment, in which deep learning algorithms were used to recognize pictures of cats in YouTube videos, generated quite a bit of excitement in the computer vision world. Using 16,000 processors and more than 10 million images taken from YouTube videos, Google trained computers to find pictures of cats. The algorithm used unsupervised training that ran over a period of three days.   The system achieved a success rate of 81.7% in identifying human faces and 74.8% in identifying pictures of cats.

Chip maker Nvidia is also investing heavily in deep learning. The company has already released the Titan X, a GPU aimed at deep learning applications. The 2015 Nvidia GPU Technology Conference focused heavily on deep learning and included keynotes on the topic from Nvidia, Google, and Baidu. The nature of the computation in deep learning algorithms makes them perfect candidates to run on a graphics processing unit (GPU): when run on GPUs, training and execution time can be cut from days on an x86 server to hours. GPU chip makers therefore see this as a huge growth opportunity.
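The reason the workload suits GPUs so well is worth making concrete: the heart of a network layer is one large, regular matrix multiply over a batch of inputs, and every output element can be computed independently of the others. A rough sketch (in NumPy, so it runs on a CPU; the shapes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

batch = rng.standard_normal((256, 1024))    # 256 inputs, 1024 features each
weights = rng.standard_normal((1024, 512))  # one fully connected layer

# One layer's forward pass: a single dense matrix multiply plus a
# nonlinearity. Each of the 256*512 outputs is independent of the rest,
# which is why the work maps so well onto thousands of GPU cores.
activations = np.maximum(batch @ weights, 0.0)   # ReLU

print(activations.shape)
```

Training repeats operations like this millions of times, which is why moving them from a general-purpose CPU to a massively parallel GPU compresses days of work into hours.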

Deep learning algorithms have a wide range of applications. Consider security, for example. A large mall or casino can generate video from hundreds of security cameras. It would not be efficient to have a security guard monitor each of these feeds; a deep learning system, however, could analyze the feeds and output text messages describing a problem for a guard to follow up on.
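A pipeline like that could be wired up as sketched below. Everything here is an invented placeholder – the frame data and the `detect_events` stub stand in for real video and a real trained detector – and is included only to show the feeds-in, text-alerts-out flow.

```python
# Illustrative pipeline: many camera feeds in, short text alerts out.
# detect_events is a stand-in for a real trained detector.
def detect_events(frame):
    """Pretend detector: flags any frame whose motion score is high."""
    return ["unusual motion detected"] if frame["motion"] > 0.8 else []

feeds = {
    "cam_lobby":   [{"motion": 0.1}, {"motion": 0.9}],
    "cam_parking": [{"motion": 0.2}, {"motion": 0.3}],
}

alerts = []
for camera, frames in feeds.items():
    for t, frame in enumerate(frames):
        for event in detect_events(frame):
            alerts.append(f"{camera} frame {t}: {event}")

print(alerts)   # only the flagged frame produces an alert
```

One guard reading a short list of alerts scales to hundreds of cameras in a way that one guard per screen never could.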

Although promising, it will be a while before a computer can recognize your grandpa’s beer-drinking picture or output pertinent text, for that matter.  In addition to Google, Microsoft, Nvidia, and other big names, numerous startups are emerging in this space. However, accuracy and reliability are still too low for most commercial applications.  As algorithms improve and hardware capacity increases, we will start seeing better results, leading to new applications and greater adoption of deep learning algorithms.
