Lectures and Readings : Visual Computing Systems

Lecture 1: Throughput Architecture Review

Required Reading:

The Compute Architecture of Intel Processor Graphics Gen9, Intel Technical Whitepaper

The reading for this class is not an academic paper, but a whitepaper from Intel describing the architectural organization of a recent GPU. Its marketing name is HD Graphics 530 (or larger numbers).

I'd like you to read the whitepaper, focusing on the description of the processor in Sections 5.3-5.5. Then, given your knowledge of the concepts discussed in lecture (such as superscalar execution, multi-core, multi-threading, etc.), I'd like you to describe the organization of the processor (using terms from the lecture, not Intel terms). For example, what is the basic processor building block? How many hardware threads does it support? What type of SIMD instructions are executed by those threads? Does it have superscalar execution capabilities? How many times is this block replicated for additional parallelism?

Consider your favorite data-parallel language, such as the GLSL/HLSL shading languages, CUDA, OpenCL, ISPC, or just an OpenMP #pragma omp parallel for. Can you think through how an embarrassingly parallel for loop can be mapped to this architecture? (You don't need to write this down, but you could.)
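One way to start thinking about the mapping exercise is to write down, for a toy machine, where each independent loop iteration could run. The sketch below is only an illustration under stated assumptions: the Machine parameters (cores, threads_per_core, simd_width), the Slot record, and the static round-robin assignment are all hypothetical and are not taken from the Intel or NVIDIA whitepapers. Real hardware assigns work to thread contexts dynamically.

```python
from typing import NamedTuple


class Machine(NamedTuple):
    """Hypothetical throughput processor (numbers are NOT from any whitepaper)."""
    cores: int             # replicated processor building blocks
    threads_per_core: int  # hardware thread contexts per core
    simd_width: int        # lanes processed by one SIMD instruction


class Slot(NamedTuple):
    core: int
    thread: int
    lane: int
    batch: int  # which round of work this iteration falls into


def iterations_in_flight(m: Machine) -> int:
    """Number of loop iterations the machine can hold at once."""
    return m.cores * m.threads_per_core * m.simd_width


def map_iteration(i: int, m: Machine) -> Slot:
    """Assign independent loop iteration i to a (core, thread, lane) slot."""
    lane = i % m.simd_width        # consecutive iterations share one SIMD instruction
    group = i // m.simd_width      # index of the SIMD-wide group of iterations
    contexts = m.cores * m.threads_per_core
    context = group % contexts     # hardware thread context that runs this group
    batch = group // contexts      # contexts are reused once all are occupied
    core = context % m.cores       # spread groups across cores first
    thread = context // m.cores    # then fill additional threads on each core
    return Slot(core, thread, lane, batch)


if __name__ == "__main__":
    toy = Machine(cores=2, threads_per_core=2, simd_width=4)
    for i in range(iterations_in_flight(toy)):
        print(i, map_iteration(i, toy))
```

Filling in the three parameters with the numbers you extract from the whitepaper, and comparing the result with an AVX-capable CPU, is one possible way to organize your answer.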

I also encourage you to read NVIDIA's V100 (Volta) Architecture whitepaper, linked in the "further reading" below. Can you put the organization of this GPU in correspondence with the organization of the Intel GPU? You could make a table contrasting the features of a modern AVX-capable Intel CPU, Intel Integrated Graphics (Gen9), NVIDIA GPUs, etc.

Further Reading:

NVIDIA Tesla V100 Whitepaper, 2017
NVIDIA Tegra X1 Whitepaper
The Rise of Mobile Visual Computing Systems, Fatahalian, IEEE Mobile Computing 2016
Scalability! But at What COST?, McSherry, Isard, and Murray. HotOS 2015 (The arguments in this paper are very consistent with the way we think about performance in the visual computing domain.)

Lecture 2: The Digital Camera Processing Pipeline

Required Reading:

Burst Photography for High Dynamic Range and Low-light Imaging on Mobile Cameras, Hasinoff et al. SIGGRAPH Asia 2016

Further Reading:

The Stanford CS448A course notes are a very good reference for camera image processing pipeline algorithms and issues.
The interactive demos on the Stanford CS178 course site are very well done.
Clarkvision.com has some very interesting material on cameras.
Demosaicking: Color Filter Array Interpolation, Gunturk et al. IEEE Signal Processing Magazine, 2005
A Non-Local Algorithm for Image Denoising, Buades et al. CVPR 2005
A Gentle Introduction to Bilateral Filtering and its Applications, Paris et al. SIGGRAPH 2008 Course Notes
A Fast Approximation of the Bilateral Filter using a Signal Processing Approach, Paris and Durand. MIT technical report 2006 (extends their ECCV 2006 paper)

Lecture 3: Modern Smartphone Camera Processing (such as in the Pixel 2 Phone)

Required Reading:

Synthetic Depth-of-Field with a Single-Camera Mobile Phone, Wadhwa et al. SIGGRAPH 2018

Further Reading:

The Laplacian Pyramid as a Compact Image Code. Burt and Adelson, IEEE Transactions on Communications 1983.
Pyramid Methods in Image Processing. Adelson et al. 1984
Local Laplacian Filters: Edge-aware Image Processing with a Laplacian Pyramid. Paris et al. SIGGRAPH 2011
Exposure Fusion. Mertens et al. Computer Graphics and Applications, 2007
Fast Local Laplacian Filters: Theory and Applications. Aubry et al. TOG 2014
A Non-Local Algorithm for Image Denoising, Buades et al. CVPR 2005

Lecture 4: Efficiently Scheduling Image Processing Algorithms on Parallel Hardware

Required Reading:

Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. Ragan-Kelley, Adams, et al. PLDI 2013 (or read the selected chapters in the Ragan-Kelley thesis below)

Further Reading:

Decoupling Algorithms from the Organization of Computation for High Performance Image Processing (please read Chapters 1, 4, 5, and 6.1), Ragan-Kelley (MIT Ph.D. thesis, 2014)
Automatically Scheduling Halide Image Processing Pipelines, Mullapudi et al. SIGGRAPH 2016
Differentiable Programming for Image Processing and Deep Learning in Halide. T. Li et al. SIGGRAPH 2018
Parallel Associative Reductions in Halide. P. Suriana et al. CGO 2017
Halide Language Website (contains documentation and many tutorials)
Check out this YouTube video on scheduling

Lecture 5: Specialized Hardware for Image Processing

Required Reading:

The Frankencamera: An Experimental Platform for Computational Photography. A. Adams et al. SIGGRAPH 2010

Further Reading:

Darkroom: Compiling High-Level Image Processing Code into Hardware Pipelines. Hegarty et al. SIGGRAPH 2014
Rigel: Flexible Multi-Rate Image Processing Hardware, Hegarty et al. SIGGRAPH 2016.
Programming Heterogeneous Systems from an Image Processing DSL. Pu et al. TACO 2017

Lecture 6: Lossy Image (JPG) and Video (H.264) Compression

Required Reading:

There is no required reading for this lecture.

Further Reading:

Overview of the H.264/AVC Video Coding Standard
Learning Binary Residual Representations for Domain-specific Video Streaming. Tsai et al. AAAI 2018
Real-Time Adaptive Image Compression. Rippel and Bourdev. 2017

Lecture 7: The Light Field and Capture for VR

Further Reading:

Stanford CS231n: Convolutional Neural Networks for Visual Recognition. If you haven't taken CS231n, I recommend that you read through the lecture notes of modules 1 and 2 for a very nice explanation of key topics.
An Introduction to Different Types of Convolutions in Deep Learning, by Paul-Louis Pröve (a nice little tutorial)
Neural Networks and Deep Learning, Nielsen, 2016 (a free online book)
Deep Residual Learning for Image Recognition. K. He et al. CVPR 2016 (the ResNet paper)
Visualizing and Understanding Convolutional Networks, Zeiler and Fergus, ECCV 2014
Facebook Tensor Comprehensions (arXiv paper is here)
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. Han et al. ICLR 2016
Progressive Neural Architecture Search. Liu et al. ECCV 2018

Lecture 9: Parallel DNN Training

Required Reading:

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. Goyal et al. 2017

Further Reading:

ImageNet Training in Minutes. You et al. 2018
Scaling Distributed Machine Learning with the Parameter Server, Li et al. OSDI 2014
Deep Gradient Compression. Lin et al. ICLR 2018

Lecture 10: Hardware Accelerators for Deep Neural Network Evaluation

Required Reading:

In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi et al. ISCA 2017

Further Reading:

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. Parashar et al. ISCA 2017
EIE: Efficient Inference Engine on Compressed Deep Neural Network, Han et al. ISCA 2016
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design, Rhu et al. MICRO 2016
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, Chen et al. ISCA 2016

Lecture 11: Imposing Structure on DNN Topologies

Required Reading:

Inferring and Executing Programs for Visual Reasoning. J. Johnson et al. ICCV 2017

Further Reading:

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. J. Johnson et al. CVPR 2017
Neural Module Networks, Andreas et al. CVPR 2016
HydraNets: Specialized Dynamic Architectures for Efficient Inference. Mullapudi et al. CVPR 2018
Deep Bilateral Learning for Real-Time Image Enhancement. Gharbi et al. SIGGRAPH 2017

Lecture 12: Efficient Inference on Video Streams

Required Reading:

NoScope: Optimizing Neural Network Queries over Video at Scale. D. Kang et al. 2017

Further Reading:

Mainstream: Dynamic Stem-Sharing for Multi-Tenant Video Processing. Jiang et al. USENIX 2018
Deep Feature Flow for Video Recognition. Zhu et al. CVPR 2017
Clockwork Convnets for Video Semantic Segmentation. E. Shelhamer et al. 2016

Lecture 13: Video Processing at Cloud Scale

Required Reading:

Scanner: Efficient Video Analysis at Scale. Poms et al. SIGGRAPH 2018 (Github code here)

Further Reading:

SVE: Distributed Video Processing at Facebook Scale. Huang et al. SOSP 2017
Chameleon: Scalable Adaptation of Video Analytics. Jiang et al. SIGCOMM 2018
Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads. Fouladi et al. NSDI 2017

Lecture 14: Real-Time 3D Graphics Pipeline Architecture

Required Reading:

Please read the Fouladi et al. paper listed under the recommended readings from last class.

Further Reading:

A Trip Down the LOL Graphics Pipeline. A nice introductory blog post for Riot Games that illustrates all the different rendering passes used to construct a League of Legends scene. Note how each of these passes draws geometry under different graphics pipeline state configurations.
A Trip Down the Graphics Pipeline. A much more detailed blog post by Fabian Giesen describing the Direct3D 10-class pipeline
The Design of the OpenGL Graphics Interface. M. Segal and K. Akeley. [unpublished 1994]
The Direct3D 10 System. D. Blythe. SIGGRAPH 2006

Lecture 15: Data Access in the Graphics Pipeline (Efficient Texture Mapping and Depth Buffering)

There is no required reading for this lecture.

Further Reading:

Pyramidal Parametrics. L. Williams, Computer Graphics 1983
Texture on Demand. D. Peachey. Pixar Technical Memo #217. 1990
The Design and Analysis of a Cache Architecture for Texture Mapping. Z. S. Hakura and A. Gupta, ISCA 1997
Prefetching in a Texture Cache Architecture. H. Igehy et al. Graphics Hardware 1998
Cardinality-Constrained Texture Filtering. J. Manson and S. Schaefer. SIGGRAPH 2013.
Parameterization-Aware MIP-Mapping. J. Manson and S. Schaefer. Computer Graphics Forum. 2012.

Further Reading:

Pomegranate: A Fully Scalable Graphics Architecture. M. Eldridge et al. SIGGRAPH 2000
Life of a Triangle - NVIDIA's Logical Pipeline. C. Kubisch (NVIDIA GameWorks Blog, 2015)
Fast Tessellated Rendering on Fermi GF100. T. Purcell (High Performance Graphics Hot3D talk)
A Sorting Classification of Parallel Rendering. S. Molnar et al. IEEE Computer Graphics and Applications, 1994.

Lecture 17: Domain-Specific Languages for Shading

Required Reading:

A Language for Shading and Lighting Calculations. P. Hanrahan and J. Lawson. SIGGRAPH 1990
Cg: A System for Programming Graphics Hardware in a C-like Language. W. R. Mark et al. SIGGRAPH 2003

Further Reading:

Slang: Language Mechanisms for Extensible Real-time Shading Systems. Y. He, K. Fatahalian, T. Foley. SIGGRAPH 2018
Shader Components: Modular and High Performance Shader Development. Y. He et al. SIGGRAPH 2017
A Real-Time Procedural Shading System for Programmable Graphics Hardware. K. Proudfoot et al. SIGGRAPH 2001
Shade Trees. R. Cook. SIGGRAPH 1984
An Image Synthesizer. K. Perlin. SIGGRAPH 1985