Paper Note: Two-Stream Convolutional Networks for Action Recognition in Videos

Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos.” Advances in Neural Information Processing Systems. 2014.

 

The goal is human action recognition in videos with a two-stream CNN. Previously, the best results came from hand-crafted features. Earlier CNN attempts, which treated each frame as an image, were 20% less accurate than the hand-crafted state-of-the-art trajectory-based method (on the UCF-101 dataset).

To get better performance from CNNs, this work takes temporal information into account, not only the spatial part. They introduce the two-stream CNN: one stream handles spatial information, the other handles temporal information, and the two are combined by late fusion for classification.

[Figure: Screenshot 2016-04-21 10.35.48 AM.png]

 

Spatial Stream ConvNet

Many actions are strongly associated with particular objects, so the static appearance of a frame is a useful clue. This CNN is trained as a standard image classifier: a single video frame is the input and the action is the output. The great part is that pre-trained CNN models can be reused.

Temporal: Optical Flow ConvNet

To represent the motion in a video as input to the CNN, they use optical flow.

[Figure: Screenshot 2016-04-21 10.58.44 AM.png]

The optical flow is split into two components, one for the x-direction and one for the y-direction. Panels (d) and (e) of the figure above show an example of these two components for one frame.

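The optical flow representation samples the displacement vector at the same location across multiple frames. The paper also introduces a trajectory-based representation, which instead samples the displacement along the motion trajectory. In the paper's experiments the difference between the two is marginal, with plain optical flow stacking performing slightly better.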

The input stacks the flow channels of all L frames, so it has dimensions (w, h, 2L). The factor of 2 accounts for the x and y components.
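The stacking described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `stack_optical_flow`, the toy sizes, and the random flow values are made up here, and the flow fields are assumed to be already computed. NumPy conventionally puts height before width, so the result is (h, w, 2L) rather than (w, h, 2L).

```python
import numpy as np


def stack_optical_flow(flows):
    """Stack L flow fields of shape (h, w, 2) into one (h, w, 2L) volume.

    Channels are interleaved as x_1, y_1, x_2, y_2, ..., x_L, y_L.
    (Hypothetical helper, not code from the paper.)
    """
    flows = np.asarray(flows, dtype=np.float32)  # shape (L, h, w, 2)
    if flows.ndim != 4 or flows.shape[-1] != 2:
        raise ValueError("expected L flow fields, each of shape (h, w, 2)")
    num_frames, h, w, _ = flows.shape
    # Move the frame axis next to the x/y axis, then merge the two:
    # (L, h, w, 2) -> (h, w, L, 2) -> (h, w, 2L)
    return flows.transpose(1, 2, 0, 3).reshape(h, w, 2 * num_frames)


# Toy input: random values stand in for real optical flow.
L, h, w = 3, 4, 6
rng = np.random.default_rng(0)
flows = rng.standard_normal((L, h, w, 2)).astype(np.float32)

volume = stack_optical_flow(flows)
print(volume.shape)
```

With this layout, channel 2k holds the x component of frame k and channel 2k + 1 holds its y component (counting from zero).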

[Figure: Screenshot 2016-04-21 11.03.11 AM.png]

 

Fusion

  1. Averaging the softmax scores of the two streams
  2. Training a multi-class linear SVM on the softmax scores as features
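The first option, averaging, might look like the sketch below. The helper names `softmax` and `fuse_by_averaging` and the three-class logits are invented for illustration; in practice the scores would come from the two trained networks. The SVM option is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion: average the per-class softmax scores of both streams."""
    if len(spatial_scores) != len(temporal_scores):
        raise ValueError("both streams must score the same set of classes")
    return [(s + t) / 2.0 for s, t in zip(spatial_scores, temporal_scores)]


# Made-up logits for a 3-class problem (hypothetical values).
spatial_scores = softmax([2.0, 1.0, 0.1])   # spatial stream favors class 0
temporal_scores = softmax([0.5, 2.5, 0.3])  # temporal stream favors class 1

fused = fuse_by_averaging(spatial_scores, temporal_scores)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

In this toy case the two streams disagree, and the more confident temporal stream decides the fused prediction.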

 

Evaluation

[Figure: Screenshot 2016-04-21 11.14.08 AM.png]

Two datasets are used, UCF-101 and HMDB-51.

On UCF-101, the two-stream model reaches 88%, on par with the state-of-the-art 87.9%.

Good.

 

 

 

 

 
