Paper Note: Sequence to Sequence – Video to Text

Venugopalan, Subhashini, et al. "Sequence to sequence – video to text." Proceedings of the IEEE International Conference on Computer Vision. 2015.

GOAL: To generate a caption sentence describing a video.

The model combines a CNN with LSTM (long short-term memory) networks. The CNN extracts a feature vector for every frame. Two stacked LSTMs are used: the first models the video frame sequence, and the second models the output word sequence.
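To make the two-LSTM layout concrete, here is a minimal pure-Python sketch of how I read it: the first LSTM consumes per-frame CNN features, and the second LSTM consumes the first LSTM's hidden state together with the previous word. Everything specific here is an assumption for illustration and not the authors' code: the toy vocabulary, the sizes, the random untrained weights, the one-hot word input (instead of a learned embedding), zero padding for the missing input in each stage, and greedy decoding.

```python
import math
import random

VOCAB = ["<BOS>", "<EOS>", "a", "man", "is", "cooking"]  # toy vocabulary (assumption)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def one_hot(k, n):
    return [1.0 if j == k else 0.0 for j in range(n)]

class LSTMCell:
    """Plain LSTM cell: input, forget and output gates plus a candidate update."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        cols = n_in + n_hid + 1  # each gate sees [x; h; bias]
        self.W = [[[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                   for _ in range(n_hid)] for _ in range(4)]

    def step(self, x, h, c):
        z = x + h + [1.0]  # list concatenation: input, previous hidden, bias term
        i, f, o, g = (matvec(W, z) for W in self.W)
        i = [sigmoid(a) for a in i]
        f = [sigmoid(a) for a in f]
        o = [sigmoid(a) for a in o]
        g = [math.tanh(a) for a in g]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

def describe_video(frame_feats, lstm1, lstm2, W_out, max_len=5):
    n_vocab, n_feat = len(VOCAB), len(frame_feats[0])
    h1, c1 = [0.0] * lstm1.n_hid, [0.0] * lstm1.n_hid
    h2, c2 = [0.0] * lstm2.n_hid, [0.0] * lstm2.n_hid
    # Stage 1: LSTM 1 reads the frame features; LSTM 2 gets a zero "no word" input.
    for feat in frame_feats:
        h1, c1 = lstm1.step(feat, h1, c1)
        h2, c2 = lstm2.step([0.0] * n_vocab + h1, h2, c2)
    # Stage 2: LSTM 1 gets zero "no frame" input; LSTM 2 emits one word per step.
    words = []
    prev = one_hot(VOCAB.index("<BOS>"), n_vocab)
    for _ in range(max_len):
        h1, c1 = lstm1.step([0.0] * n_feat, h1, c1)
        h2, c2 = lstm2.step(prev + h1, h2, c2)
        scores = matvec(W_out, h2)
        k = max(range(n_vocab), key=scores.__getitem__)  # greedy choice (assumption)
        if VOCAB[k] == "<EOS>":
            break
        words.append(VOCAB[k])
        prev = one_hot(k, n_vocab)
    return words

# Demo with random weights and random stand-ins for CNN features.
rng = random.Random(0)
N_FEAT, N_HID = 8, 4
lstm1 = LSTMCell(N_FEAT, N_HID, rng)
lstm2 = LSTMCell(len(VOCAB) + N_HID, N_HID, rng)
W_out = [[rng.uniform(-0.1, 0.1) for _ in range(N_HID)] for _ in VOCAB]
frames = [[rng.random() for _ in range(N_FEAT)] for _ in range(3)]
words = describe_video(frames, lstm1, lstm2, W_out)
print(words)
```

With untrained weights the printed words are meaningless; the point is only the data flow, where the same two LSTMs run through a frame-reading stage and then a word-generating stage.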

In the CNN part, the input can be RGB frames, optical flow, or both. CaffeNet and VGG-16 are used in the experiments.
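When both RGB and optical flow are used, the two streams have to be combined somehow. One simple option, sketched below, is a weighted average of the per-word probabilities predicted by an RGB model and a flow model at each decoding step. The function name `fuse_word_probs`, the weight `alpha`, and the example numbers are hypothetical and chosen only for illustration.

```python
def fuse_word_probs(p_rgb, p_flow, alpha=0.5):
    """Weighted average of two word distributions over the same vocabulary.

    alpha weights the RGB model and (1 - alpha) weights the flow model.
    """
    if len(p_rgb) != len(p_flow):
        raise ValueError("both distributions must cover the same vocabulary")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_rgb, p_flow)]

# Toy distributions over a two-word vocabulary (made-up numbers).
p_rgb = [0.7, 0.3]
p_flow = [0.2, 0.8]
fused = fuse_word_probs(p_rgb, p_flow, alpha=0.5)
best = max(range(len(fused)), key=fused.__getitem__)  # index of the chosen word
```

Because both inputs sum to one and the weights sum to one, the fused vector is again a valid probability distribution.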




The experiments use three datasets:

  1. Microsoft Video Description corpus (MSVD)
  2. MPII Movie Description Corpus (MPII-MD)
  3. Montreal Video Annotation Dataset (M-VAD)




