Visual understanding is the abstracting of high-dimensional visual signals like images and videos. Many problems are involved in this process, ranging from depth prediction and vision-language correspondence to classification and object grounding, which include tasks defined along spatial and temporal axes and tasks defined along coarse to fine granularity, like object grounding. In light of…
