Venugopalan, Subhashini, et al. "Sequence to Sequence - Video to Text." Proceedings of the IEEE International Conference on Computer Vision. 2015.
GOAL: To generate a natural-language sentence describing the content of a video.
The model combines a CNN with LSTM (long short-term memory) networks. The CNN extracts a feature vector for every frame. Two LSTMs are stacked: the first models the sequence of video frames (encoding), and the second models the sequence of output words (decoding); see the sketch below.
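The following is a minimal PyTorch sketch of this stacked two-LSTM encoder-decoder. The layer sizes, vocabulary size, and zero-padding conventions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    """Stacked two-LSTM encoder-decoder in the spirit of S2VT (sizes assumed)."""
    def __init__(self, feat_dim=4096, hidden=500, vocab_size=10000):
        super().__init__()
        self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)    # models the frame sequence
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True)  # models the word sequence
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, T, feat_dim) CNN frame features; captions: (B, L) word ids.
        B, T, _ = feats.shape
        # Encoding stage: LSTM1 reads the frame features; LSTM2 sees LSTM1's
        # outputs concatenated with zero "word" inputs.
        h1, s1 = self.lstm1(feats)
        zeros_w = torch.zeros(B, T, self.embed.embedding_dim)
        _, s2 = self.lstm2(torch.cat([h1, zeros_w], dim=2))
        # Decoding stage: LSTM1 receives zero frame input (its state carries
        # over); LSTM2 sees LSTM1's outputs concatenated with word embeddings.
        L = captions.size(1)
        zeros_f = torch.zeros(B, L, feats.size(2))
        h1_dec, _ = self.lstm1(zeros_f, s1)
        h2, _ = self.lstm2(torch.cat([h1_dec, self.embed(captions)], dim=2), s2)
        return self.out(h2)  # (B, L, vocab_size) word logits
```

In this sketch the decoder is fed ground-truth words (teacher forcing), as during training; at test time the model would instead feed back its own predicted words one step at a time.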
In the CNN part, the input can be RGB frames and/or optical flow images. CaffeNet and VGG-16 are used as feature extractors in the experiments.
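As an illustration of the per-frame feature step, the sketch below extracts 4096-d fc7 activations with a pretrained VGG-16 from torchvision. Frame sampling and the optical-flow branch are omitted, and the weights enum assumes a recent torchvision version.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained VGG-16, truncated after fc7 to yield 4096-d frame features.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = vgg.classifier[:5]  # keep through fc7 + ReLU, drop the final classifier
vgg.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    """frames: a list of PIL images sampled from the video."""
    batch = torch.stack([preprocess(f) for f in frames])
    return vgg(batch)  # (num_frames, 4096)
```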
They evaluate on three datasets:
- Microsoft Video Description corpus (MSVD)
- MPII Movie Description Corpus (MPII-MD)
- Montreal Video Annotation Dataset (M-VAD)