We model the density over the target object's location as a Gaussian function whose mean is predicted from the human appearance and the action being performed.
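To make the Gaussian target model concrete, here is a minimal sketch. The relative box encoding in `encode_relative`, the function names, and the default `sigma` are assumptions for illustration and are not taken from the text above; only the idea of a Gaussian centred on a predicted mean comes from the description.

```python
import math

def encode_relative(human_box, object_box):
    """Encode object_box relative to human_box; boxes are (x1, y1, x2, y2).

    Assumed encoding (not from the text): centre offsets scaled by the human
    box size, plus log ratios of width and height.
    """
    hx1, hy1, hx2, hy2 = human_box
    ox1, oy1, ox2, oy2 = object_box
    hw, hh = hx2 - hx1, hy2 - hy1
    ow, oh = ox2 - ox1, oy2 - oy1
    hcx, hcy = hx1 + 0.5 * hw, hy1 + 0.5 * hh
    ocx, ocy = ox1 + 0.5 * ow, oy1 + 0.5 * oh
    return ((ocx - hcx) / hw, (ocy - hcy) / hh,
            math.log(ow / hw), math.log(oh / hh))

def target_compatibility(human_box, object_box, predicted_mean, sigma=0.3):
    """Unnormalised Gaussian score of object_box around predicted_mean.

    predicted_mean stands in for the network output for one (human, action)
    pair; sigma is a hypothetical hyperparameter.
    """
    b = encode_relative(human_box, object_box)
    sq_dist = sum((bi - mi) ** 2 for bi, mi in zip(b, predicted_mean))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

An object whose encoded location matches the predicted mean scores 1, and the score decays as the object moves away from where the human's pose and action suggest the target should be.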
Interaction Recognition: To improve the model's expressive power, we additionally use the appearance of the target object, which gives a new branch, the interaction branch.
3.2. Multi-task Training
We treat learning human-object interactions as a multi-task learning problem, and all three branches are trained jointly. The loss is defined as follows: our overall loss is the sum of all losses in our model, including (1) the classification and regression loss for the object detection branch, (2) the action classification and target localization loss for the human-centric branch, and (3) the action classification loss of the interaction branch.
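The loss described above can be sketched as a plain sum of five scalar terms. The function and argument names are hypothetical; the unweighted sum follows the description, and any per-term weighting would be an extra assumption.

```python
def overall_loss(det_cls, det_reg, human_action, human_target, interaction_action):
    """Sum the per-branch losses (all arguments are scalar loss values).

    det_cls, det_reg        : object detection branch (classification, regression)
    human_action            : human-centric branch, action classification
    human_target            : human-centric branch, target localization
    interaction_action      : interaction branch, action classification
    """
    # Unweighted sum, as stated in the text; no loss weights are assumed.
    return det_cls + det_reg + human_action + human_target + interaction_action
```

Because the total is a single scalar, one backward pass updates all three branches and the shared features together, which is what joint training means here.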
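3.3. Cascaded Inference
At inference time, we use a cascade to reduce the time complexity. The key point is that the relevant processing is applied only to the human bounding boxes. The implementation runs at ∼135 ms on a typical image on a single Nvidia M40 GPU.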
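A rough sketch of why the cascade is cheap: the expensive branch runs once per human box, and pairing with objects is only arithmetic on already-computed values. Everything here is an assumption for illustration: the `human_branch` and `compatibility` callables, the dictionary layout, and the product used as the pair score. The interaction branch score is left out for brevity.

```python
def cascaded_inference(detections, human_branch, compatibility):
    """Return one best (human, action, object) triplet per human and action.

    detections   : list of dicts with keys 'box', 'label', 'score'
    human_branch : hypothetical callable, box -> list of
                   (action, action_score, predicted_mean)
    compatibility: hypothetical callable,
                   (human_box, object_box, predicted_mean) -> float
    """
    humans = [d for d in detections if d["label"] == "person"]
    triplets = []
    for h in humans:
        # The costly network branch is evaluated once per human box.
        for action, s_action, mean in human_branch(h["box"]):
            best = None
            # Pairing is cheap: it only reuses detection scores and the mean.
            for o in detections:
                if o is h:
                    continue
                score = (h["score"] * o["score"] * s_action
                         * compatibility(h["box"], o["box"], mean))
                if best is None or score > best[0]:
                    best = (score, o)
            if best is not None:
                triplets.append({"human": h, "action": action,
                                 "object": best[1], "score": best[0]})
    return triplets
```

With n detected boxes of which k are humans, the branch is called k times rather than once per candidate pair, which is the source of the saving.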