# Paper Note: DeepFace

Taigman, Yaniv, et al. "Deepface: Closing the gap to human-level performance in face verification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

Facebook often suggests tagging somebody in a photo, and amazingly it is almost always right. DeepFace, published by Facebook in 2014, is the system behind this: deep face recognition deployed in the real world.

- Face recognition pipeline: Detection -> Alignment -> Representation -> Recognition

This work contributed to Alignment, Representation and Recognition.

### Alignment

They propose a 3D-model-based alignment method. Although 3D-model-based methods had fallen out of favor, the authors argue they are the right approach because faces are 3D objects. The alignment pipeline is illustrated in the figure below.

### Representation

They propose a novel CNN architecture that takes the raw pixels of aligned faces as input.

- C1 + M1: convolution and max-pooling
- C2: convolution (no max-pooling)
- L4, L5, L6: locally connected layers
- F7, F8: fully connected layers; features are extracted at F7

Locally connected layers are like normal conv layers, except that every location in the feature map learns its own set of filters. Since the input faces are aligned, different regions of the image have different local statistics, which are captured by these location-specific filters.
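To make the difference from a shared-weight convolution concrete, here is a minimal numpy sketch of a locally connected layer for a single-channel input with "valid" padding; the shapes and filter bank are illustrative, not the paper's actual layer sizes.

```python
import numpy as np

def locally_connected_2d(x, filters, k):
    """Locally connected layer: like a conv, but each output
    location has its OWN k x k filter (no weight sharing).
    x: (H, W) single-channel input
    filters: (H-k+1, W-k+1, k, k), one filter per output location
    """
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i+k, j:j+k]
            out[i, j] = np.sum(patch * filters[i, j])
    return out

# Toy example: a 5x5 aligned "face" input with 3x3 filters.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))
filters = rng.normal(size=(3, 3, 3, 3))
y = locally_connected_2d(x, filters, k=3)
print(y.shape)  # (3, 3)
```

Note the cost of dropping weight sharing: the parameter count grows with the spatial size of the feature map, which is only affordable because alignment makes each location semantically stable.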

### Recognition

Several metrics are tested:

1. an unsupervised metric: the inner product of the two feature vectors
2. the weighted $\chi^2$ distance
3. a Siamese network fine-tuning the last 2 layers
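The first two metrics above are simple enough to sketch directly. In this illustrative snippet the feature vectors and the $\chi^2$ weights are random stand-ins (in the paper, the weights are learned discriminatively), and the epsilon guard is my own addition:

```python
import numpy as np

def inner_product_similarity(f1, f2):
    # Unsupervised metric: inner product of L2-normalized features.
    f1 = f1 / np.linalg.norm(f1)
    f2 = f2 / np.linalg.norm(f2)
    return float(f1 @ f2)

def weighted_chi2_distance(f1, f2, w):
    # chi^2(f1, f2) = sum_i w_i * (f1_i - f2_i)^2 / (f1_i + f2_i)
    eps = 1e-12  # guard against division by zero on empty bins
    return float(np.sum(w * (f1 - f2) ** 2 / (f1 + f2 + eps)))

rng = np.random.default_rng(0)
f1 = rng.random(16)   # stand-ins for F7 features (non-negative, as after ReLU)
f2 = rng.random(16)
w = np.ones(16)       # uniform weights here; learned in the paper
print(inner_product_similarity(f1, f1))   # 1.0 for identical features
print(weighted_chi2_distance(f1, f1, w))  # 0.0 for identical features
```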

## Results

On the LFW (Labeled Faces in the Wild) dataset, DeepFace reaches 97.35% accuracy; human performance is 97.53%.

On the YTF (YouTube Faces) dataset, it reaches 91.4%.

# Paper Note: Faster RCNN

Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.

Fast RCNN did well on object detection, and Faster RCNN does better. The bottleneck of Fast RCNN is region proposal: Selective Search, an external region proposal algorithm, generates hundreds of proposals that are then fed into Fast RCNN. Faster RCNN merges the two steps into one network with a novel RPN, improving speed greatly.

## RPN (Region Proposal Networks)

RPN is a multitask network attached on top of the CNN. Its input is the conv feature map, and it outputs the rectangular coordinates of region proposals together with their objectness scores. The RPN can be trained jointly with the original Fast RCNN, reducing training time.
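To make the two outputs concrete, here is a minimal numpy sketch of an RPN head: per feature-map location, it produces 2 objectness scores and 4 box-regression deltas for each of k anchors. The weights are random placeholders, and the paper's 3x3 sliding conv and softmax are omitted for brevity (the per-location projections are written as 1x1-conv-style matmuls).

```python
import numpy as np

def rpn_head(feat, w_cls, w_reg, num_anchors):
    """Sketch of an RPN head.
    feat: (H, W, C) conv feature map from the backbone
    returns objectness scores (H, W, k, 2) and box deltas (H, W, k, 4)
    for k anchors per location.
    """
    h, w, c = feat.shape
    flat = feat.reshape(-1, c)
    scores = (flat @ w_cls).reshape(h, w, num_anchors, 2)
    deltas = (flat @ w_reg).reshape(h, w, num_anchors, 4)
    return scores, deltas

rng = np.random.default_rng(0)
k = 9  # 3 scales x 3 aspect ratios, as in the paper
feat = rng.normal(size=(4, 4, 32))
scores, deltas = rpn_head(feat, rng.normal(size=(32, 2 * k)),
                          rng.normal(size=(32, 4 * k)), k)
print(scores.shape, deltas.shape)  # (4, 4, 9, 2) (4, 4, 9, 4)
```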

Here are some sample results.

## Quantitative Results

Using VGG-16, Faster RCNN is 10x faster than Selective Search + Fast RCNN.

On the PASCAL VOC 2012 test set, its mAP also beats Fast RCNN:

# Paper Note: Deep Compression

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." International Conference on Learning Representations. 2016.

After we finish training a NN, we get a model file containing millions of parameters: weights, biases, etc. The model file is 240 MB for AlexNet and 552 MB for VGG-16. This makes it difficult to store models on different platforms, especially embedded systems, whose storage is very limited.

Deep Compression introduces a 3-stage pipeline to reduce the model size:

1. pruning
2. trained quantization
3. Huffman coding

Finally the size is reduced by 35x for AlexNet (6.9 MB) and 49x for VGG-16 (11.3 MB).

## Pruning

First we train a model as usual, then remove the connections whose weights are below a threshold, and re-train the model with the remaining sparse connections.
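The prune step itself is just magnitude thresholding. A minimal numpy sketch, with an arbitrary threshold chosen for illustration:

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    # Zero out connections whose absolute weight is below the threshold.
    # The surviving mask is kept so pruned weights stay zero during the
    # sparse re-training step.
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
pruned, mask = prune_by_magnitude(w, threshold=1.0)
sparsity = 1.0 - mask.mean()
print(f"sparsity: {sparsity:.2f}")
```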

Pruning leads to 9x to 16x size reduction.

## Trained Quantization

In a trained model, we quantize the weights into several bins (via k-means). Instead of storing the original weights, we only keep the bin centroid values (a small codebook) and, for each weight, an integer index saying which bin it belongs to.
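A toy sketch of the quantization step: 1-D k-means over the weights with the linear initialization the paper describes. The bin count and iteration count here are arbitrary, and the paper additionally fine-tunes the centroids by gradient descent, which is skipped here.

```python
import numpy as np

def quantize_weights(w, n_bins, n_iter=20):
    """Weight-sharing quantization: cluster weights into n_bins values
    with 1-D k-means; store only the codebook plus one integer index
    per weight."""
    flat = w.ravel()
    # Linear initialization over the weight range.
    codebook = np.linspace(flat.min(), flat.max(), n_bins)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid...
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # ...then move each centroid to the mean of its bin.
        for b in range(n_bins):
            if np.any(idx == b):
                codebook[b] = flat[idx == b].mean()
    return codebook, idx.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
codebook, idx = quantize_weights(w, n_bins=4)
recon = codebook[idx]  # dequantized weights used at inference time
print(codebook.shape, idx.shape)
```

With 4 bins, each index needs only 2 bits instead of a 32-bit float, which is where the size reduction comes from.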

## Huffman Coding

Quantized weights and sparse matrix indices are then encoded with Huffman coding, which assigns shorter bit strings to more frequent values.
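Since the quantized indices are heavily skewed (most weights fall in bins near zero), Huffman coding pays off. A small self-contained sketch over a hypothetical index stream:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code from a list of symbols (e.g. quantization
    bin indices). More frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Skewed index stream, like post-quantization weights clustered near zero.
stream = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
code = huffman_code(stream)
bits = sum(len(code[s]) for s in stream)
print(bits, "bits vs", 2 * len(stream), "bits fixed-width")
```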

# Paper Note: Two-Stream Convolutional Networks for Action Recognition in Videos

Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos." Advances in Neural Information Processing Systems. 2014.

This work does human action recognition with a two-stream CNN. Previously the task was done best with hand-crafted features; earlier CNN attempts that treated every frame as an independent image were about 20% less accurate than the hand-crafted state-of-the-art trajectory-based method on the UCF-101 dataset.

To make CNNs perform better, this work takes temporal information into account, not only spatial appearance. They introduce a two-stream CNN: the first stream is spatial, the second is temporal, and their outputs are fused at a late stage to do classification.

## Spatial Stream ConvNet

Many actions are strongly associated with particular objects, so the static appearance of an action is a useful clue. This CNN is trained as a standard image classifier: a single frame of a video is the input, and the action class is the output. The great part is that pre-trained image-classification models can be reused.

## Temporal: Optical Flow ConvNet

To represent the motion of a video as input to CNN, they use optical flows.

The optical flow is split into two channels: one for the x-direction, the other for the y-direction. An example frame of optical flow is shown in the figure above, panels (d) and (e).

While the optical flow representation samples the displacement vector at the same image location across frames, they also introduce a trajectory-based variant that samples the displacements along the motion trajectories, although in their experiments plain optical flow stacking performed slightly better.

The input stacks the flow of L consecutive frames, so it has dimension (w, h, 2L); the factor 2 is for the x and y channels.
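A minimal sketch of building that input volume, using random arrays in place of real flow fields (L = 10 and 224x224 frames match the paper's setup; the alternating x/y channel order is one reasonable layout):

```python
import numpy as np

def stack_flow(flows_x, flows_y):
    """Stack L optical-flow frames into one (h, w, 2L) input volume.
    Channels alternate d^x_t, d^y_t for t = 1..L, reflecting the
    horizontal/vertical split of the flow."""
    channels = []
    for fx, fy in zip(flows_x, flows_y):
        channels.append(fx)
        channels.append(fy)
    return np.stack(channels, axis=-1)

L, h, w = 10, 224, 224  # 10 flow frames, as used in the paper
rng = np.random.default_rng(0)
flows_x = [rng.normal(size=(h, w)) for _ in range(L)]
flows_y = [rng.normal(size=(h, w)) for _ in range(L)]
volume = stack_flow(flows_x, flows_y)
print(volume.shape)  # (224, 224, 20)
```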

## Fusion

1. averaging the softmax scores of the two streams
2. training a multi-class linear SVM on the stacked softmax scores as features
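The averaging variant is one line; here is a sketch with hypothetical softmax outputs for a 3-class problem (the SVM variant would instead train on the concatenated scores):

```python
import numpy as np

def fuse_by_averaging(spatial_probs, temporal_probs):
    # Late fusion: average the two streams' softmax outputs,
    # then pick the highest-scoring class.
    fused = (spatial_probs + temporal_probs) / 2.0
    return fused, int(np.argmax(fused))

spatial = np.array([0.7, 0.2, 0.1])   # hypothetical softmax outputs
temporal = np.array([0.4, 0.5, 0.1])
fused, pred = fuse_by_averaging(spatial, temporal)
print(fused, pred)  # class 0 wins after averaging
```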

## Evaluation

Two datasets are used: UCF-101 and HMDB-51.

The fused result of 88.0% is as good as the state-of-the-art 87.9%.

Good.

# Paper Note: A Bayesian Hierarchical Model for Learning Natural Scene Categories

AMMAI paper study note.

Fei-Fei, Li, and Pietro Perona. "A Bayesian hierarchical model for learning natural scene categories." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 2. IEEE, 2005.

This is almost the most frustrating paper in AMMAI so far. I should have studied probability & statistics and random processes more carefully. If you understand Latent Dirichlet Allocation (LDA), you can read this easily; if, like me, you don't, it will be really hard. I will try to explain it now.

Problem definition: scene classification in images (though the method can be adapted to many other problems).

Previously, images were classified directly from features such as BoW histograms built from patches, that is:

visual word -> scene category

This paper finds that inserting an intermediate representation (themes) in the middle improves performance. We first assign the visual words to themes, and then classify an image based on its themes:

visual word -> theme -> scene category

Everything technical here is some probabilistic distribution, too complicated to cover fully in a blog, so we just go through some key ideas. We want the probability of category $c$ given an image $x$ (made up of patches, i.e. visual words) and parameters $\theta, \beta, \eta$:

$p(c|x, \theta, \beta, \eta) \propto p(x|c, \theta, \beta)p(c|\eta) \propto p(x|c,\theta,\beta)$

By Bayes' theorem, we just need $p(x|c,\theta,\beta)$. The parameters are learned with another complicated method, Variational Message Passing (VMP), which is based on the EM algorithm. We skip the details and just note that after learning the optimal $\theta, \beta$ (the paper does not discuss learning $\eta$), we can evaluate $p(x|c,\theta,\beta)$.

The probability mentioned above is derived from a set of Bayesian hierarchical models: for example, the probability of a theme given the class, the probability of a visual word given a theme, and the probability of a class given the parameter $\eta$, etc. We skip the details here.

Let’s see a result:

For the forest image, the histogram of themes is shown in the upper left, and the visual words from the top themes in the upper right. We can see observable patterns of visual words within the themes.

# Paper Note: Nonlinear Dimensionality Reduction by Locally Linear Embedding

Roweis, Sam T., and Lawrence K. Saul. “Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.

This paper introduces a novel method to reduce dimensionality nonlinearly, called LLE (locally linear embedding).

The counterparts are PCA, MDS.

3 Steps:

1. In the data space (dimension $D$), find the $K$ nearest neighbors of every point $X_i$.
2. Find the weights that best reconstruct each $X_i$ from its $K$ neighbors. Technically, we find the weights $W$ that minimize the error function $\Sigma_i|X_i-\Sigma_j{W_{ij}X_j}|^2$ (with $\Sigma_j W_{ij} = 1$).
3. In the new space (dimension $d < D$), find a new vector $Y_i$ for every $X_i$ by minimizing the embedding cost $\Sigma_i|Y_i-\Sigma_j{W_{ij}Y_j}|^2$ with the weights fixed, which reduces to a sparse eigenvalue problem solvable with linear algebra.

Notice that none of the three steps involves an iterative optimization with local minima; each has a closed-form solution, so the method finds the global minimum, unlike methods such as auto-encoders.
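The three steps above can be sketched in a few dozen lines of numpy. This is a bare-bones illustration (brute-force neighbor search, a simple trace-based regularizer for the local Gram matrix, toy data on a 1-D curve in 3-D), not a production implementation:

```python
import numpy as np

def lle(X, n_neighbors, n_components, reg=1e-3):
    """Minimal LLE sketch. X: (N, D) data; returns (N, d) embedding."""
    n = X.shape[0]
    # Step 1: K nearest neighbors by Euclidean distance.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]
    # Step 2: reconstruction weights minimizing |X_i - sum_j W_ij X_j|^2
    # subject to sum_j W_ij = 1, solved per point from the local Gram matrix.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                         # centered neighbors
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(n_neighbors)  # regularize
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs[i]] = w / w.sum()
    # Step 3: bottom eigenvectors of M = (I - W)^T (I - W),
    # skipping the constant eigenvector.
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:n_components + 1]

# Noisy points near a 1-D curve embedded in 3-D.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3, 40))
X = np.c_[np.cos(t), np.sin(t), t] + 0.01 * rng.normal(size=(40, 3))
Y = lle(X, n_neighbors=6, n_components=1)
print(Y.shape)  # (40, 1)
```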

This method can be widely used in applications such as image recognition, text classification, or other multimedia applications.

# Paper Note: Online Dictionary Learning for Sparse Coding

Mairal, Julien, et al. "Online dictionary learning for sparse coding." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.

This paper introduces a new online algorithm of dictionary learning for sparse coding.

There are several previous algorithms:

1. Second-order batch procedures. Drawbacks:
   1. they access the whole dataset at every iteration, so the dataset can't be large
   2. they can't handle dynamic (streaming) datasets
2. First-order stochastic gradient descent methods. Drawbacks:
   1. they require careful learning-rate tuning

The proposed algorithm addresses these issues. It is based on stochastic approximation, processing one sample at a time, but exploits the specific structure of the problem to solve it efficiently.

The algorithm iteratively solves 2 problems:

1. fix $D$ and solve for $\alpha$ (the "sparse coding" part) with the LARS algorithm
2. update the dictionary $D$ by block-coordinate descent on its columns
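The alternation above can be sketched as follows. For simplicity this sketch uses ISTA as the sparse-coding solver instead of LARS, and the problem sizes, regularization strength, and iteration counts are all arbitrary; the dictionary update does follow the block-coordinate form using the accumulated statistics $A = \sum \alpha\alpha^T$, $B = \sum x\alpha^T$.

```python
import numpy as np

def ista(D, x, lam, n_iter=100):
    # Sparse coding step: min_a 0.5*|x - D a|^2 + lam*|a|_1,
    # solved by ISTA (a simple stand-in for LARS).
    a = np.zeros(D.shape[1])
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    for _ in range(n_iter):
        g = a - D.T @ (D @ a - x) / L
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return a

def dictionary_update(D, A, B, n_iter=5):
    # Block-coordinate descent on the columns of D using the
    # accumulated statistics A = sum a a^T, B = sum x a^T.
    for _ in range(n_iter):
        for j in range(D.shape[1]):
            if A[j, j] < 1e-12:
                continue  # atom never used yet
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)  # project to unit ball
    return D

# Online loop over a toy stream of signals.
rng = np.random.default_rng(0)
m, k, lam = 8, 12, 0.1
D = rng.normal(size=(m, k))
D /= np.linalg.norm(D, axis=0)
A, B = np.zeros((k, k)), np.zeros((m, k))
for _ in range(50):
    x = rng.normal(size=m)      # one sample at a time
    a = ista(D, x, lam)
    A += np.outer(a, a)
    B += np.outer(x, a)
    D = dictionary_update(D, A, B)
cols = np.linalg.norm(D, axis=0)
print("max column norm:", cols.max())
```

Note how the per-sample memory cost is just the fixed-size matrices A and B, which is what makes the algorithm online: the raw samples never need to be stored.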

Additionally, the algorithm can be further optimized in the following ways:

1. speeding up the case where the dataset is of fixed size
2. mini-batches improve convergence speed
3. purging the dictionary of unused atoms

In conclusion, the proposed algorithm has the following advantages:

1. faster than the batch alternatives
2. does not require learning-rate tuning like SGD methods