video understanding note
DatSet
- UCF101: 101 classes, 13320 clips, site
- Kinetics-600: 600 classes, 392623 train, 30001 val, 72925 test, download file
C3D
2015 年 Learning Spatiotemporal Features with 3D Convolutional Networks
on UCF101
Conv3D: d x 3 x 3, 当 d = 3 时, 即 3 x 3 x 3, 准确率最高
类似于 VGG16, Conv 层可作为 video feature embedding
Pseudo-3D Convolution
2017 年 Learning Spatio-Temporal Representation With Pseudo-3D Residual Networks
pytorch code: https://github.com/qijiezhao/pseudo-3d-pytorch
I3D
Two-Stream Inflated 3D ConvNet
2017 年 Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Non-local Neural Networks
a non-local operation computes the response at a position as as weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, spacetime, applicable for image, sequence and video problems.
发现 kaiming 的文章对于 multi-label classification 问题很喜欢用 sigmoid per category 而不是 softmax,在 Mask R-CNN 中对每一个 class 设置一个 sigmoid,精度会高一些。
参考 code