Let's review another method for video classification, titled SlowFast Networks for Video Recognition, published at ICCV this year. You can find the implementation at https://github.com/facebookresearch/SlowFast.
The main concept is to discard the optical flow input by developing two pathways: a Slow Path and a Fast Path. The Slow Path is used to capture spatial semantics, while the Fast Path is used to capture motion at a fine temporal resolution. The Fast Path can be made very lightweight by reducing its number of channels.
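To get a feel for why shrinking the channels makes the Fast Path so cheap, here is a rough back-of-the-envelope sketch. It assumes the paper's default hyper-parameters (a frame-rate ratio of alpha = 8 and a channel ratio of beta = 1/8) and a simplified cost model where a convolution layer's compute scales with the product of its input and output channels; the real per-pathway cost depends on the full architecture, so treat this only as an intuition aid.

```python
# Rough cost model (assumption, not the paper's exact FLOP accounting):
# a conv layer's cost scales with channels_in * channels_out, so shrinking
# channels by beta cuts per-frame cost by ~beta^2, while the Fast Path
# processes alpha times more frames than the Slow Path.
alpha = 8        # Fast Path sees 8x more frames (paper's default)
beta = 1 / 8     # Fast Path uses 1/8 of the channels (paper's default)

# Fast Path cost relative to the Slow Path, per comparable layer:
relative_cost = alpha * beta ** 2
print(relative_cost)  # a small fraction of the Slow Path's cost
```

Under this crude model the Fast Path adds only about an eighth of the Slow Path's compute despite seeing eight times as many frames, which is the intuition behind calling it "lightweight".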
In my opinion, the main contribution of this work is reducing complexity by dropping the optical flow modality. As we know, producing optical flow is not straightforward and takes a long time to compute. So, what if we could use the RGB modality alone to capture this motion? Can we?
For instance, in an action like waving hands, the hands do not change their identity as "hands" over the span of the waving action, and a person stays in the "person" category even while walking or running. Categorical semantics such as texture, color, and other appearance information can be captured by the Slow Path.
As for the implementation, the Slow Path first performs sparse frame sampling (e.g., taking one frame out of every 16), while the Fast Path uses a denser frame rate. Lateral connections are then used to fuse the two pathways. The evaluation is carried out on several large datasets: Kinetics-400, Kinetics-600, Charades, and AVA. An illustration of the network can be seen in the figure below.
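The two sampling rates above can be sketched in a few lines. This is a hedged illustration, not code from the SlowFast repo: `sample_pathways` is a hypothetical helper, and the temporal stride tau = 16 with speed ratio alpha = 8 follows the defaults described in the paper.

```python
def sample_pathways(num_frames, tau=16, alpha=8):
    """Return frame indices for the Slow and Fast pathways.

    Slow Path: one frame every `tau` frames (sparse sampling).
    Fast Path: `alpha` times denser, i.e. one frame every tau // alpha.
    (tau=16 and alpha=8 are the defaults reported in the paper.)
    """
    slow = list(range(0, num_frames, tau))
    fast = list(range(0, num_frames, tau // alpha))
    return slow, fast

# For a 64-frame clip, the Slow Path sees 4 frames and the
# Fast Path sees 32, i.e. alpha = 8 times as many.
slow, fast = sample_pathways(64)
```

Every Slow-Path frame is also seen by the Fast Path, which is what makes the time-strided lateral connections straightforward: Fast features can be subsampled by alpha and fused into the Slow stream at matching time steps.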