Carlos Antônio Caetano Júnior
Carlos Caetano received a PhD in Computer Science from the Universidade Federal de Minas Gerais (UFMG). He carried out part of his doctoral studies at the Centre de Recherche INRIA Sophia Antipolis - Méditerranée, France, as a researcher in the STARS team (under the supervision of Dr. François Brémond). He received his B.Sc. and M.Sc. degrees in Information Systems and Computer Science from the Pontifícia Universidade Católica de Minas Gerais (PUC Minas) and the Universidade Federal de Minas Gerais (UFMG), respectively. His research interests include computer vision, smart surveillance and machine learning applications, with a focus on visual pattern recognition.
In this dissertation, we propose four different representations based on motion information for activity recognition. The first is a spatiotemporal local feature descriptor that extracts a robust set of statistical measures to describe motion patterns. This descriptor measures meaningful properties of co-occurrence matrices and captures local space-time characteristics of the motion through the neighboring optical flow magnitude and orientation. The second is a novel compact mid-level representation based on co-occurrence matrices of codewords. This representation expresses the distribution of features at a given offset over feature codewords from a pre-computed codebook and encodes global structures in various local region-based features. The third is a novel temporal stream for two-stream convolutional networks that employs images computed from the optical flow magnitude and orientation to learn motion in a richer manner. The method applies simple non-linear transformations to the vertical and horizontal components of the optical flow to generate input images for the temporal stream. Finally, the fourth is a novel skeleton image representation to be used as input to convolutional neural networks (CNNs). The proposed approach encodes the temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Moreover, the representation has the advantage of combining the use of reference joints and a tree-structured skeleton, incorporating different spatial relationships between the joints and preserving important spatial relations.
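The first and third representations both start from the magnitude and orientation of the optical flow field. As a minimal illustration (not the dissertation's actual implementation), the conversion from horizontal and vertical flow components into per-pixel magnitude and orientation can be sketched as follows; the function name and the NumPy-array flow field are assumptions for the example:

```python
import numpy as np

def flow_magnitude_orientation(u, v):
    """Convert horizontal (u) and vertical (v) optical flow components
    into per-pixel magnitude and orientation (degrees in [0, 360))."""
    magnitude = np.hypot(u, v)                          # Euclidean norm per pixel
    orientation = np.degrees(np.arctan2(v, u)) % 360.0  # angle of the flow vector
    return magnitude, orientation

# Example: a 1x1 "flow field" pointing diagonally (3 right, 4 down)
u = np.array([[3.0]])
v = np.array([[4.0]])
mag, ori = flow_magnitude_orientation(u, v)
# mag[0, 0] == 5.0; ori[0, 0] is approximately 53.13 degrees
```

In practice, the magnitude and orientation maps would be computed from a dense flow estimator and then fed into the descriptor (first representation) or rendered as input images for the temporal stream (third representation).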
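The second representation counts how often pairs of codewords co-occur at a given spatial offset. A minimal sketch of such a count, assuming a 2-D grid of codeword labels already assigned from a pre-computed codebook (the function name and grid layout are assumptions for the example):

```python
import numpy as np

def codeword_cooccurrence(labels, offset, n_codewords):
    """Count how often codeword i co-occurs with codeword j at the
    given (row, col) offset over a 2-D grid of codeword labels."""
    dr, dc = offset
    rows, cols = labels.shape
    cooc = np.zeros((n_codewords, n_codewords), dtype=np.int64)
    # Iterate only over positions where both (r, c) and (r+dr, c+dc) are in bounds
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            cooc[labels[r, c], labels[r + dr, c + dc]] += 1
    return cooc

labels = np.array([[0, 1],
                   [1, 0]])
m = codeword_cooccurrence(labels, (0, 1), 2)
# Horizontal neighbors: (0, 1) and (1, 0), so m[0, 1] == 1 and m[1, 0] == 1
```

The resulting matrix (one per offset) can then be vectorized to form the mid-level representation described above.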
The experimental evaluations carried out on well-known, challenging activity recognition datasets (KTH, UCF Sports, HMDB51, UCF101, NTU RGB+D 60 and NTU RGB+D 120) demonstrated that the proposed representations achieve accuracy comparable to or better than the state of the art, indicating the suitability of our approaches as video representations.