[News] MMAction v0.2.0 Released

Mar 22, 2020

We recently released MMAction v0.2.0, in which we implement several of the most popular video recognition algorithms and evaluate them on the Kinetics-400 dataset. These algorithms include TSN[1], I3D[2], SlowOnly[3], SlowFast[3], R(2+1)D[4] and CSN[5]. In this post, we give a brief introduction to the implemented algorithms and share some interesting findings.

The version of Kinetics-400 we use contains 240,436 training videos and 19,796 testing videos. If you cannot reproduce our testing results due to dataset misalignment, please submit a request at get validation data.

TSN[1]

For Temporal Segment Networks, we release a 3seg-TSN with a ResNet50 backbone. During TSN training, ImageNet pretraining is important and brings a gain of around 2% in Top-1 accuracy. The number of segments is also an important factor: we additionally trained an 8seg-TSN with a ResNet50 backbone, whose Top-1 accuracy is around 71.6%.

| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| -------- | ---------- | -------- | ----- | ----- | ----- | -------- |
| RGB      | ImageNet   | ResNet50 | 3seg  | 70.6  | 89.4  | model    |
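
The key idea of TSN is segment-based sampling: each video is divided into a fixed number of segments, one snippet is drawn from each segment, and the per-snippet predictions are averaged into a video-level prediction (the segmental consensus). The sketch below illustrates this sampling scheme only; the function name and details are illustrative and may differ from the actual MMAction data pipeline.

```python
import numpy as np

def sample_tsn_indices(num_frames, num_segments=3, snippet_len=1, train=True):
    """Divide a video into equal segments and sample one snippet per segment.

    Mirrors the segment-based sampling described in the TSN paper [1];
    the exact frame-sampling code in MMAction may differ in detail.
    """
    seg_len = num_frames // num_segments
    indices = []
    for seg in range(num_segments):
        start = seg * seg_len
        if train:
            # random snippet position inside the segment during training
            offset = np.random.randint(0, max(seg_len - snippet_len + 1, 1))
        else:
            # deterministic centre snippet for testing
            offset = max((seg_len - snippet_len) // 2, 0)
        indices.extend(range(start + offset, start + offset + snippet_len))
    return indices

# e.g. a 300-frame video with 3 segments -> one frame index per segment;
# the per-snippet predictions are then averaged (the segmental consensus)
print(sample_tsn_indices(300, num_segments=3, train=False))
```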

I3D[2]

We release I3D models with two backbones. Models with the Inception-V1 backbone are converted from the kinetics_i3d repo, while models with the ResNet50 backbone are trained by ourselves. We also finetune the Kinetics-400-trained I3D on UCF101 and HMDB51 to report transfer-learning results.

Kinetics

| Modality   | Pretrained | Backbone     | Input | Top-1 | Top-5 | Download |
| ---------- | ---------- | ------------ | ----- | ----- | ----- | -------- |
| RGB        | ImageNet   | Inception-V1 | 64x1  | 71.1  | 89.3  | model*   |
| RGB        | ImageNet   | ResNet50     | 32x2  | 72.9  | 90.8  | model    |
| Flow       | ImageNet   | Inception-V1 | 64x1  | 63.4  | 84.9  | model*   |
| Two-Stream | ImageNet   | Inception-V1 | 64x1  | 74.2  | 91.3  | /        |

Transfer Learning

| Modality   | Pretrained | Backbone | Input | UCF101 | HMDB51 | Download (split1) |
| ---------- | ---------- | -------- | ----- | ------ | ------ | ----------------- |
| RGB        | Kinetics   | I3D      | 64x1  | 94.8   | 72.6   | UCF101 / HMDB51   |
| Flow       | Kinetics   | I3D      | 64x1  | 96.6   | 79.2   | UCF101 / HMDB51   |
| Two-Stream | Kinetics   | I3D      | 64x1  | 97.8   | 80.8   | /                 |
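
I3D obtains its ImageNet initialisation by inflating 2D kernels along the temporal axis, as described in [2]. The sketch below shows the idea for a single convolution; it illustrates the bootstrapping trick only and is not the exact MMAction conversion code.

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    """Inflate a 2D convolution pretrained on ImageNet into a 3D convolution.

    The 2D kernel is repeated `time_kernel` times along the temporal axis and
    divided by `time_kernel`, so the inflated network initially produces the
    same activations on a "boring" video of repeated frames, as in I3D [2].
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, h, w) -> (out, in, t, h, w), averaged over time
        weight_3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
        conv3d.weight.copy_(weight_3d / time_kernel)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# e.g. inflating the first ResNet50 convolution (7x7 -> 3x7x7)
conv3d = inflate_conv2d_to_3d(nn.Conv2d(3, 64, 7, stride=2, padding=3))
```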

SlowOnly & SlowFast[3]

We reimplement the SlowOnly and SlowFast algorithms from [3]. The models we release perform better than those in the original repo, by about 1% in Top-1 accuracy. During training, we found that ImageNet pretraining is also important for 3D ConvNets (both SlowOnly and SlowFast), which differs from the conclusion in [3].

| Model    | Modality | Pretrained | Backbone  | Input | Top-1 | Top-5 | Download |
| -------- | -------- | ---------- | --------- | ----- | ----- | ----- | -------- |
| SlowOnly | RGB      | None       | ResNet50  | 4x16  | 72.9  | 90.9  | model    |
| SlowOnly | RGB      | ImageNet   | ResNet50  | 4x16  | 73.8  | 90.9  | model    |
| SlowOnly | RGB      | None       | ResNet50  | 8x8   | 74.8  | 91.9  | model    |
| SlowOnly | RGB      | ImageNet   | ResNet50  | 8x8   | 75.7  | 92.2  | model    |
| SlowOnly | RGB      | None       | ResNet101 | 8x8   | 76.5  | 92.7  | model    |
| SlowOnly | RGB      | ImageNet   | ResNet101 | 8x8   | 76.8  | 92.8  | model    |
| SlowFast | RGB      | None       | ResNet50  | 4x16  | 75.4  | 92.1  | model    |
| SlowFast | RGB      | ImageNet   | ResNet50  | 4x16  | 75.9  | 92.3  | model    |
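
As a reminder of the architecture, SlowFast combines a slow pathway (few frames, many channels) with a fast pathway (many frames, few channels) whose features are fused into the slow one through lateral connections [3]. The toy module below only illustrates this structure; the class name, channel widths and kernel sizes are made up for the example and do not correspond to the released models.

```python
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    """A toy two-pathway stem illustrating the SlowFast idea [3] (not the released model)."""

    def __init__(self, alpha=8, beta=1 / 8, channels=64):
        super().__init__()
        fast_ch = int(channels * beta)
        self.alpha = alpha
        self.slow_conv = nn.Conv3d(3, channels, (1, 7, 7),
                                   stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast_conv = nn.Conv3d(3, fast_ch, (5, 7, 7),
                                   stride=(1, 2, 2), padding=(2, 3, 3))
        # lateral connection: time-strided conv compresses fast features
        # to the slow pathway's temporal resolution before fusion
        self.lateral = nn.Conv3d(fast_ch, 2 * fast_ch, (5, 1, 1),
                                 stride=(alpha, 1, 1), padding=(2, 0, 0))

    def forward(self, frames):  # frames: (N, C, T, H, W) at the fast frame rate
        slow = self.slow_conv(frames[:, :, :: self.alpha])  # keep every alpha-th frame
        fast = self.fast_conv(frames)
        return torch.cat([slow, self.lateral(fast)], dim=1)  # fused slow features

x = torch.randn(1, 3, 32, 224, 224)  # 32 frames at the fast rate; slow path keeps 4
print(TinySlowFast()(x).shape)       # torch.Size([1, 80, 4, 112, 112])
```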

R(2+1)D & CSN[4,5]

For R(2+1)D, we release models with a ResNet-34 backbone, trained with various inputs and pretraining settings (None or IG-65M [6]). Our performance is slightly better than VMZ, since we use larger images as input (224x224 instead of 112x112). For CSN, we convert models from VMZ and evaluate them on our testing data.

| Model   | Modality | Pretrained | Backbone  | Input | Top-1 | Top-5 | Download |
| ------- | -------- | ---------- | --------- | ----- | ----- | ----- | -------- |
| R(2+1)D | RGB      | None       | ResNet34  | 8x8   | 63.7  | 85.9  | model    |
| R(2+1)D | RGB      | IG-65M     | ResNet34  | 8x8   | 74.4  | 91.7  | model    |
| R(2+1)D | RGB      | None       | ResNet34  | 32x2  | 71.8  | 90.4  | model    |
| R(2+1)D | RGB      | IG-65M     | ResNet34  | 32x2  | 80.3  | 94.7  | model    |
| irCSN   | RGB      | IG-65M     | irCSN-152 | 32x2  | 82.6  | 95.7  | model*   |
| ipCSN   | RGB      | IG-65M     | ipCSN-152 | 32x2  | 82.7  | 95.6  | model*   |
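
R(2+1)D factorises each 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, with the intermediate width chosen so that the parameter count roughly matches the full 3D kernel [4]. The block below is a minimal sketch of this decomposition; the helper name is illustrative and the released MMAction models may organise the layers differently.

```python
import torch
import torch.nn as nn

def r2plus1d_block(in_ch, out_ch, spatial_k=3, temporal_k=3):
    """Factorise a 3D conv into a 2D spatial conv plus a 1D temporal conv, as in [4]."""
    # intermediate width roughly matching the parameter count of a full 3D conv
    mid_ch = (temporal_k * spatial_k * spatial_k * in_ch * out_ch) // (
        spatial_k * spatial_k * in_ch + temporal_k * out_ch
    )
    return nn.Sequential(
        # 1 x k x k spatial convolution
        nn.Conv3d(in_ch, mid_ch, (1, spatial_k, spatial_k),
                  padding=(0, spatial_k // 2, spatial_k // 2), bias=False),
        nn.BatchNorm3d(mid_ch),
        nn.ReLU(inplace=True),
        # k x 1 x 1 temporal convolution
        nn.Conv3d(mid_ch, out_ch, (temporal_k, 1, 1),
                  padding=(temporal_k // 2, 0, 0), bias=False),
    )

x = torch.randn(1, 64, 8, 56, 56)        # (N, C, T, H, W)
print(r2plus1d_block(64, 64)(x).shape)   # torch.Size([1, 64, 8, 56, 56])
```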

References

[1] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016, October). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision (pp. 20-36). Springer, Cham.

[2] Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299-6308).

[3] Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 6202-6211).

[4] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 6450-6459).

[5] Tran, D., Wang, H., Torresani, L., & Feiszli, M. (2019). Video classification with channel-separated convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5552-5561).

[6] Ghadiyaram, D., Tran, D., & Mahajan, D. (2019). Large-scale weakly-supervised pre-training for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 12046-12055).