The proposed invention aims to encode contextual information for video analysis and understanding, namely the spatial and temporal relations among the objects and the main agent in a scene.