Deep learning models have succeeded at a variety of tasks that require human intelligence and are already deployed at commercial scale. These models largely rely on standard gradient descent to optimize parameters w so that the model maps an input X to an output y = f(X;w). The optimization procedure minimizes a loss that measures the difference between the model output and the actual output. For example, in a cancer detection setting, X is an MRI image and y is the presence or absence of cancer. Three key ingredients hint at the reason behind deep learning’s power. (1) Deep architectures are well suited to breaking down complex functions into a composition of simpler, more abstract parts. (2) Although the loss is a non-convex function of w, standard gradient descent methods reach local minima that are not too far from the global minimum. (3) The architecture is suited for execution on parallel computing hardware (e.g., GPUs), which makes the optimization viable over hundreds of millions of observations (X,y).
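To make the optimization loop concrete, the following is a minimal sketch of gradient descent on a toy problem, not a deep network. It assumes a one-layer logistic model for f(X;w), a cross-entropy loss, and a four-point dataset invented purely for illustration; the names `predict`, `loss`, and `gradient_descent`, the learning rate, and the step count are likewise illustrative choices. The loss in this toy case is convex, whereas the loss of a deep network is not, but the update rule w ← w − lr · ∇loss is the same.

```python
import math

def predict(x, w):
    """Model output f(x; w): logistic function of a linear score (assumed model)."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-score))

def loss(X, y, w):
    """Mean cross-entropy between model outputs and actual labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(xi, w)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def gradient_descent(X, y, lr=0.5, steps=200):
    """Repeatedly step the parameters against the gradient of the loss."""
    w = [0.0] * len(X[0])
    history = [loss(X, y, w)]
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(xi, w) - yi  # d(loss)/d(score) for cross-entropy
            for j, xij in enumerate(xi):
                grad[j] += err * xij / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
        history.append(loss(X, y, w))
    return w, history

# Hypothetical toy data: each row is (bias term, one feature); label is 1 for large feature values.
X = [[1.0, 0.1], [1.0, 0.4], [1.0, 0.6], [1.0, 0.9]]
y = [0, 0, 1, 1]

w, history = gradient_descent(X, y)
print("initial loss:", history[0])
print("final loss:  ", history[-1])
```

In a deep network, the gradient would be obtained by backpropagation through many layers and computed on mini-batches rather than on the full dataset, but the loop has the same shape: evaluate the loss, compute its gradient with respect to w, and take a small step downhill.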
Computer vision tasks, where the input X is a high-dimensional image or video, are particularly suited to deep learning. Recent advances in deep architectures, e.g., inception modules, attention networks, adversarial networks, and deep reinforcement learning (DeepRL), have opened up entirely new applications. However, the breakneck progress in replacing human tasks with deep learning comes with caveats. These models tend to be hard to interpret, do not establish causal relationships between input X and output y, and may inadvertently mimic not just human actions but also human biases and stereotypes. In this tutorial, we provide an intuitive explanation of deep learning methods in computer vision as well as their limitations.