Paper Note: DeepFace

Taigman, Yaniv, et al. "Deepface: Closing the gap to human-level performance in face verification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

Facebook likes to suggest whom I should tag in a photo. Amazingly, it is almost always right! DeepFace was published by Facebook in 2014. DeepFace in the real world, wow.

  • Face Recognition: Detection -> Alignment -> Representation -> Recognition

This work contributed to Alignment, Representation and Recognition.


They proposed a 3D-model-based alignment method. Although 3D-model-based methods had fallen out of favor, the authors argue that it is the right approach because faces are 3D objects. The alignment pipeline is shown in the figure below.

[Figure: the alignment pipeline]



They proposed a novel CNN architecture that takes the raw pixels of the aligned face images as input.

[Figure: the network architecture]

C1 + M2: conv and max-pooling

C3: conv (no max-pooling)

L4, L5, L6: locally connected layers.

F7, F8: FC layers. Features are extracted at F7.

Locally connected layers are like normal conv layers, but every location in the feature map learns a different set of filters. Because the input faces are aligned, different regions of the image have different local statistics, which these location-specific filters can learn.
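To make the difference concrete, here is a minimal NumPy sketch of a single-channel conv layer next to a locally connected layer. The function names and the tiny 6x6 input are my own illustration, not the paper's implementation (which has many channels and much larger filters banks).

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Standard conv layer: ONE kernel is shared by all output locations."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def locally_connected2d(x, kernels):
    """Locally connected layer: kernels[i, j] is a separate filter per location."""
    out_h, out_w, kh, kw = kernels.shape
    assert x.shape == (out_h + kh - 1, out_w + kw - 1)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernels[i, j])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
shared = rng.normal(size=(3, 3))

# If every location happens to use the same filter, the layer is a plain convolution.
tiled = np.broadcast_to(shared, (4, 4, 3, 3))
conv_out = conv2d_valid(x, shared)
local_out = locally_connected2d(x, tiled)

n_conv_params = shared.size        # 9 weights, shared everywhere
n_local_params = 4 * 4 * 3 * 3     # 144 weights, one 3x3 filter per output location
```

The parameter count grows with the size of the output map, which is the price paid for location-specific filters.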


Several metrics are tested:

  1. unsupervised metric: similarity = inner product of the two normalized feature vectors
  2. weighted χ² distance
  3. Siamese network, fine-tuning only the last 2 layers
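The first two metrics are simple enough to sketch in NumPy. The weighted χ² form used here is sum_i w_i (f1_i - f2_i)^2 / (f1_i + f2_i); in the paper the weights w are learned with a linear SVM, whereas the uniform weights below are only a placeholder assumption, and the function names are mine.

```python
import numpy as np

def inner_product_similarity(f1, f2):
    """Unsupervised metric: inner product of the L2-normalized feature vectors."""
    f1 = f1 / np.linalg.norm(f1)
    f2 = f2 / np.linalg.norm(f2)
    return float(f1 @ f2)

def weighted_chi2_distance(f1, f2, w, eps=1e-12):
    """Weighted chi-squared distance for non-negative features.

    eps only guards against division by zero when both entries are 0.
    """
    return float(np.sum(w * (f1 - f2) ** 2 / (f1 + f2 + eps)))

# Toy non-negative features (e.g. post-ReLU activations).
f_a = np.array([1.0, 0.0])
f_b = np.array([0.0, 1.0])
w = np.ones(2)  # placeholder: uniform weights instead of learned ones

sim_same = inner_product_similarity(f_a, f_a)
sim_diff = inner_product_similarity(f_a, f_b)
dist_same = weighted_chi2_distance(f_a, f_a, w)
dist_diff = weighted_chi2_distance(f_a, f_b, w)
```

A pair is then declared "same person" by thresholding the similarity (or distance).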



On the LFW (Labeled Faces in the Wild) dataset, it reached 97.35% accuracy. Human performance is 97.53%.

[Figure: results on LFW]


On the YTF (YouTube Faces) dataset, it reached 91.4% accuracy.

[Figure: results on YTF]







Paper Note: Faster RCNN

Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.


Fast RCNN did well on object detection, and Faster RCNN does better. The bottleneck of Fast RCNN is region proposal: Selective Search, a separate region proposal algorithm, generates about 2000 proposals per image and feeds them into Fast RCNN. Faster RCNN combines the two steps into one network with a novel RPN, which greatly improves speed.

[Figure: Faster RCNN overview]


RPN (Region Proposal Networks)

RPN is a multitask network attached after the CNN. Its input is the CNN conv feature map, and its outputs are the rectangular coordinates of region proposals and their objectness scores. RPN can be trained together with the original Fast RCNN, which reduces training time.

[Figure: the Region Proposal Network]
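One detail the note above skips: in the paper, the RPN does not predict boxes from scratch but relative to a fixed set of reference boxes ("anchors") placed at every feature-map location. Below is a rough NumPy sketch of anchor generation. The 3 scales and 3 aspect ratios follow the defaults reported in the paper; the function name, the feature-map size, and the half-pixel centre convention are assumptions of mine.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2) in image pixels."""
    base = []
    for s in scales:
        for r in ratios:
            # Keep the area at s*s; r is height / width.
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                  # (k, 4), centred at 0

    # Centre of each feature-map cell in image coordinates (assumed convention).
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                           # each (feat_h, feat_w)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)

    return (centers + base[None]).reshape(-1, 4)

anchors = make_anchors(feat_h=38, feat_w=50)               # k = 9 anchors per cell
areas = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
```

For each anchor, the RPN then outputs an objectness score and 4 box-regression offsets.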


Here are some sample results.

[Figure: sample detection results]


Quantitative Results

Using VGG-16, Faster RCNN is about 10x faster than Selective Search + Fast RCNN.

[Figure: timing comparison]

On the PASCAL VOC 2012 test set, its mAP also beats that of Fast RCNN:

[Figure: mAP on the PASCAL VOC 2012 test set]





Paper Note: Deep Compression

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015).


After we finish training a NN, we get a model file containing millions of parameters: weights, biases, etc. The model file is 240 MB for AlexNet and 552 MB for VGG-16. This makes it difficult to store models on some platforms, especially embedded systems, whose storage is very limited.

Deep Compression introduces a 3-stage pipeline to reduce the model size:

  1. pruning
  2. trained quantization
  3. Huffman coding

In the end, the size is reduced by 35x for AlexNet (to 6.9 MB) and by 49x for VGG-16 (to 11.3 MB).



Pruning

First we train a model, then remove the connections whose weights are too small (below a threshold), and re-train the model with the remaining sparse connections.

Pruning reduces the number of parameters by 9x to 13x.
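A minimal sketch of magnitude-based pruning on one weight matrix, in NumPy. The 90% sparsity target, the random weights and the single fake gradient step are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero out connections with |w| < threshold; return pruned weights and the mask."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))

# Assumed target: keep roughly the 10% largest-magnitude connections.
threshold = np.quantile(np.abs(W), 0.9)
W_pruned, mask = prune_by_magnitude(W, threshold)
compression = W.size / mask.sum()      # ratio of original to surviving connections

# Re-training with sparse connections: mask the gradient so that
# pruned connections stay at zero (one fake SGD step shown here).
grad = rng.normal(size=W.shape)
W_pruned -= 0.01 * (grad * mask)
```

In practice the surviving weights are then stored in a sparse format, which is where the "sparse matrix indices" mentioned later come from.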

Trained Quantization

In a trained model, we quantize the weights into several bins, so that all weights in a bin share one value. Instead of storing the original weights, we keep only the shared value of each bin and, for every weight, the integer index of the bin it belongs to.
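Here is a small sketch of the weight-sharing idea using 1-D k-means with linearly spaced initial centroids. The function name, the bit width and the iteration count are my own choices; the paper additionally fine-tunes the shared values with gradients (the "trained" part), which is not shown here.

```python
import numpy as np

def quantize_weights(weights, n_bits=4, n_iter=20):
    """Cluster weights into 2**n_bits bins; return (codebook, per-weight bin index)."""
    k = 2 ** n_bits
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)   # linear initialization
    for _ in range(n_iter):
        # Assign every weight to its nearest centroid.
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its members (skip empty bins).
        for c in range(k):
            members = flat[idx == c]
            if members.size > 0:
                centroids[c] = members.mean()
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids, idx.reshape(weights.shape).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
codebook, indices = quantize_weights(W, n_bits=4)

# Only `codebook` (16 floats) and `indices` (4 bits per weight) need to be stored.
W_hat = codebook[indices]
mean_abs_error = np.abs(W - W_hat).mean()
```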

Huffman Coding

Quantized weights and sparse matrix indices are encoded using Huffman coding.
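Huffman coding gives frequent symbols shorter codes. A small self-contained sketch on a toy list of bin indices (the data and the function name are made up for illustration):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix-free code {symbol: bitstring} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie_breaker, {symbol: code_so_far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)     # two least frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Toy quantized indices with a skewed distribution.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
codes = huffman_code(indices)
encoded_bits = sum(len(codes[s]) for s in indices)
fixed_bits = 2 * len(indices)               # 2 bits per symbol without Huffman coding
```

The more skewed the distribution of indices, the more bits this saves over a fixed-length code.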







Paper Note: Two-Stream Convolutional Networks for Action Recognition in Videos

Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos." Advances in Neural Information Processing Systems. 2014.


This work does human action recognition with a 2-stream CNN. Previously, the job was done best with hand-crafted features. Earlier CNN attempts, which treated every frame as an image, were 20% less accurate than the hand-crafted state-of-the-art trajectory-based method (on the UCF-101 dataset).

To get better performance out of CNNs, this work takes temporal information into consideration, not only the spatial part. They introduce a 2-stream CNN: one stream is spatial, the other is temporal, and the two are combined by late fusion for classification.

[Figure: the two-stream architecture]


Spatial Stream ConvNet

Many actions are strongly associated with particular objects, so the static appearance of a frame is a useful clue. This CNN is trained as a standard image classifier: a single video frame is the input, and the action is the output. The great part is that pre-trained CNN models can be reused.

Temporal: Optical Flow ConvNet

To represent the motion of a video as input to a CNN, they use optical flow.

[Figure: optical flow between consecutive frames]

Each optical flow field is split into two channels: one for the x-direction, the other for the y-direction. An example is shown in panels (d) and (e) of the figure above.

The optical flow representation samples the displacement vector at the same location across multiple frames. A trajectory-based representation, which samples it along the motion trajectories instead, is also introduced, but in the paper's experiments plain optical flow stacking works slightly better.

The input stacks the flow fields of L consecutive frames along the channel axis, so it has dimensions (w, h, 2L). The factor 2 is for x and y.

[Figure: stacking optical flow across frames]
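A quick NumPy sketch of how L flow fields can be stacked into one input volume, interleaving x and y channels. The random arrays stand in for real optical flow, and note that NumPy's usual layout is (h, w, channels) rather than (w, h, channels).

```python
import numpy as np

def stack_optical_flow(flows):
    """flows: sequence of L arrays of shape (h, w, 2) -> one array of shape (h, w, 2L)."""
    channels = []
    for flow in flows:
        channels.append(flow[..., 0])   # horizontal displacement (x)
        channels.append(flow[..., 1])   # vertical displacement (y)
    return np.stack(channels, axis=-1)

L, h, w = 10, 224, 224
rng = np.random.default_rng(0)
flows = [rng.normal(size=(h, w, 2)) for _ in range(L)]   # placeholder flow fields

volume = stack_optical_flow(flows)      # input to the temporal ConvNet
```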



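The two streams are combined by late fusion of their softmax scores. Two methods are tried:

  1. averaging
  2. a multi-class linear SVM trained on the softmax scores as features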
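The averaging variant is trivial to sketch; the scores below are made-up numbers for 2 videos and 3 classes. The SVM variant would instead train a classifier on the stacked scores and is not shown.

```python
import numpy as np

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion: average the per-class softmax scores of the two streams."""
    fused = (spatial_scores + temporal_scores) / 2.0
    return fused, np.argmax(fused, axis=-1)

# Hypothetical softmax outputs, one row per video.
spatial = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3]])
temporal = np.array([[0.2, 0.7, 0.1],
                     [0.1, 0.3, 0.6]])

fused, pred = fuse_by_averaging(spatial, temporal)
```

In this toy case the temporal stream overrules the spatial stream for both videos.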



[Figure: evaluation results]

2 datasets are used, UCF-101 and HMDB-51.

On UCF-101, the accuracy of 88% is as good as the state-of-the-art 87.9%.








Paper Note: A Bayesian Hierarchical Model for Learning Natural Scene Categories

AMMAI paper study note

Fei-Fei, Li, and Pietro Perona. “A bayesian hierarchical model for learning natural scene categories." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 2. IEEE, 2005.


This is probably the most frustrating paper in AMMAI so far. I should have studied Probability & Statistics and random processes more carefully.


If you understand Latent Dirichlet Allocation (LDA), you can easily read this. If you don't, like me, this will be really hard. I will try to explain it anyway.


Problem definition: scene classification in images (but this method can be adapted to many other problems).
Previously, images were classified directly from features such as BoW generated from patches. That is:

visual word -> scene category

Now we find that an extra intermediate representation (in other words, themes) improves performance. So we first assign the visual words to themes, and then classify an image based on its themes. That is:

visual word -> theme -> scene category


Everything technical here is a probability distribution, which is too complicated to explain fully in a blog post. We just go through some key ideas.

We want the probability of category c given an image x (made up of patches, that is, visual words) and the parameters \theta, \beta, \eta.

p(c|x, \theta, \beta, \eta) \propto p(x|c, \theta, \beta) p(c|\eta) \propto p(x|c, \theta, \beta)

The first step is Bayes' theorem, and the second holds when the class prior p(c|\eta) is uniform, so we just need p(x|c, \theta, \beta). The parameters are learnt with another complicated method: Variational Message Passing (VMP). It is based on the EM algorithm. We don't explain the details, but just note that after learning the optimal \theta, \beta (the authors didn't mention \eta), we can compute p(x|c, \theta, \beta).

The probability mentioned above is derived from a Bayesian hierarchical model that chains several distributions: for example, the probability of a theme given the class, the probability of a visual word given the theme, the probability of the class given the parameter \eta, etc. We skip the details here.
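To get a feel for "visual word -> theme -> scene category", here is a heavily simplified sketch. It replaces the full hierarchical model with point estimates: one fixed theme distribution per class and one word distribution per theme, so p(word | c) = sum_k p(theme k | c) p(word | theme k). The real model treats the theme mixture as a random variable and needs variational inference; the arrays, names and numbers below are all hypothetical.

```python
import numpy as np

def log_likelihood(words, theme_given_class, word_given_theme):
    """log p(x | c) for every class c, under the point-estimate simplification.

    words:             (N,) integer visual-word indices of the image patches
    theme_given_class: (C, K) rows sum to 1
    word_given_theme:  (K, V) rows sum to 1
    """
    word_given_class = theme_given_class @ word_given_theme   # (C, V)
    return np.log(word_given_class[:, words]).sum(axis=1)     # (C,)

def classify(words, theme_given_class, word_given_theme):
    """Pick the class with the highest likelihood (uniform class prior assumed)."""
    return int(np.argmax(log_likelihood(words, theme_given_class, word_given_theme)))

# Toy setup: 2 classes, 2 themes, 3 visual words.
theme_given_class = np.array([[0.9, 0.1],
                              [0.1, 0.9]])
word_given_theme = np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.2, 0.7]])

pred_a = classify(np.array([0, 0, 1, 0]), theme_given_class, word_given_theme)
pred_b = classify(np.array([2, 2, 1]), theme_given_class, word_given_theme)
```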


Let's see a result:

[Figure: themes and visual words for a forest image]

For the forest image, the histogram of themes is shown in the upper left, and the visual words from the top themes are shown in the upper right. There are observable patterns of visual words within each theme.



Paper Note: Nonlinear Dimensionality Reduction by Locally Linear Embedding

Roweis, Sam T., and Lawrence K. Saul. “Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.


This paper introduces a novel method to reduce dimensionality nonlinearly, called LLE (locally linear embedding).

Its counterparts are PCA and MDS.

[Figure: overview of LLE]

3 Steps:

  1. In data space (dim D), find the K nearest neighbors of every data point X_i.
  2. Find the weights that best reconstruct X_i from its K neighbors. Technically, we find the weights W_{ij} that minimize the error function \sum_i |X_i - \sum_j W_{ij} X_j|^2, where W_{ij} = 0 unless X_j is a neighbor of X_i and each row satisfies \sum_j W_{ij} = 1.
  3. In the new space (dim d), find a new vector Y_i for every X_i. They are found by minimizing the embedding error function \sum_i |Y_i - \sum_j W_{ij} Y_j|^2 with W fixed, using linear algebra (an eigenvalue problem).
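The three steps fit in a short NumPy sketch. This is a dense, unoptimized version for small datasets; the regularization term `reg`, the function name and the toy spiral data are my own assumptions, and it presumes there are no duplicate points.

```python
import numpy as np

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    """Locally linear embedding of X (n, D) into (n, n_components)."""
    n = X.shape[0]

    # Step 1: K nearest neighbours by Euclidean distance (column 0 is the point itself).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    neighbors = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]

    # Step 2: reconstruction weights, each row constrained to sum to 1.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                     # neighbours centred on X_i
        G = Z @ Z.T                                    # local Gram matrix (K, K)
        G += reg * np.trace(G) * np.eye(n_neighbors)   # regularize (needed when K > D)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()

    # Step 3: bottom eigenvectors of M = (I - W)^T (I - W).
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    eigvals, eigvecs = np.linalg.eigh(M)               # eigenvalues in ascending order
    # Skip the first eigenvector (constant vector, eigenvalue ~ 0).
    return eigvecs[:, 1:n_components + 1], W

# Toy data: a 2-D sheet rolled up into a spiral in 3-D.
rng = np.random.default_rng(0)
t = rng.uniform(0, 3 * np.pi, size=200)
X = np.column_stack([t * np.cos(t), rng.uniform(0, 5, size=200), t * np.sin(t)])

Y, W = lle(X, n_neighbors=10, n_components=2)
```

Steps 2 and 3 are a linear solve and an eigendecomposition respectively, which is what "using linear algebra" boils down to.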


Notice that the three steps don't involve iterative optimization, so the method finds the global minimum instead of a local minimum as in other methods such as auto-encoders.

This method can be widely used in applications such as image recognition, text classification, or other multimedia applications.



Paper Note: Online Dictionary Learning for Sparse Coding

This paper introduces a new online algorithm of dictionary learning for sparse coding.

There are several previous algorithms:

  1. second-order batch procedures. Drawbacks:
    1. they access the whole dataset at every iteration -> the dataset can't be large
    2. they can't handle dynamic datasets
  2. first-order stochastic gradient descent methods. Drawbacks:
    1. they require learning-rate tuning

The proposed algorithm addresses these issues. It is based on stochastic approximations, processing 1 sample at a time, but exploits the specific structure of the problem to solve it efficiently.


The algorithm alternates between 2 steps:

  1. solve for the sparse code \alpha (the "sparse coding" part) with the LARS algorithm
  2. update the dictionary D using block-coordinate descent
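The two alternating steps can be sketched in NumPy as follows. Important caveat: for the sparse coding step I swapped in ISTA (iterative soft-thresholding) instead of LARS, simply because it is shorter to write. The matrices A and B accumulate the statistics \alpha \alpha^T and x \alpha^T over past samples; the value of lam, the iteration counts and the random data are arbitrary assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam, n_iter=100):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA (stand-in for LARS)."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - step * (D.T @ (D @ a - x)), step * lam)
    return a

def update_dictionary(D, A, B):
    """One pass of block-coordinate descent over the columns (atoms) of D."""
    for j in range(D.shape[1]):
        if A[j, j] < 1e-10:          # atom never used so far: leave it unchanged
            continue
        u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
        D[:, j] = u / max(np.linalg.norm(u), 1.0)      # project onto the unit ball
    return D

def online_dictionary_learning(X, n_atoms, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    D = rng.normal(size=(m, n_atoms))
    D /= np.linalg.norm(D, axis=0)                     # unit-norm initial atoms
    A = np.zeros((n_atoms, n_atoms))
    B = np.zeros((m, n_atoms))
    for x in X:                                        # one sample at a time
        a = sparse_code_ista(x, D, lam)                # step 1: sparse coding
        A += np.outer(a, a)
        B += np.outer(x, a)
        D = update_dictionary(D, A, B)                 # step 2: dictionary update
    return D

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))                         # 200 toy samples of dim 16
D = online_dictionary_learning(X, n_atoms=32)
```

Notice there is no learning rate anywhere: the dictionary update uses only the accumulated A and B, never the past samples themselves.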


In addition, the algorithm can be further optimized in the following aspects:

  1. speed-up when the dataset has a fixed size
  2. mini-batches improve convergence speed
  3. purging the dictionary of unused atoms


In conclusion, the proposed algorithm has the following advantages:

  1. faster than the batch alternatives
  2. does not require learning-rate tuning like SGD methods









