video understanding note

DatSet

UCF101: 101 classes, 13320 clips， site
Kinetics-600: 600 classes, 392623 train, 30001 val, 72925 test, download file

C3D

2015 年 Learning Spatiotemporal Features with 3D Convolutional Networks

on UCF101

Conv3D: d x 3 x 3, 当 d = 3 时, 即 3 x 3 x 3, 准确率最高

类似于 VGG16, Conv 层可作为 video feature embedding

Pseudo-3D Convolution

2017 年 Learning Spatio-Temporal Representation With Pseudo-3D Residual Networks

pytorch code: https://github.com/qijiezhao/pseudo-3d-pytorch

I3D

Two-Stream Inflated 3D ConvNet

2017 年 Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Non-local Neural Networks

a non-local operation computes the response at a position as as weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, spacetime, applicable for image, sequence and video problems.

发现 kaiming 的文章对于 multi-label classification 问题很喜欢用 sigmoid per category 而不是 softmax，在 Mask R-CNN 中对每一个 class 设置一个 sigmoid，精度会高一些。

参考 code