
Review: A Survey on Natural Language Video Localization

기술 공간 2022. 8. 23. 00:51

This is a review of a survey paper on the task called NLVL (Natural Language Video Localization).

 

I. INTRODUCTION

 

- NLVL is the task of finding, given a video and a query sentence, only the segment of the video that is relevant to the content of the query sentence.

- NLVL == moment retrieval == moment localization == video grounding

- NLVL !=  Temporal Action Localization, Video Captioning, Temporal Action Proposals

- application : 

        - automatically cut a compilation from many videos

        - locate the movie episode you want to watch most

        - extract the most helpful segment from hours of surveillance footage

 

- challenges : 

        - hard to align semantic information across the visual and textual modalities

        - hard to obtain the final timestamps from the alignment relationship

 

- evaluation aspects : 

        - Precision : there is still a big gap between SOTA and industrial needs

        - Efficiency : accuracy has to be kept while the computation cost stays low

        - Robustness : can a model designed for short videos be easily extended to longer videos?

 


II. PIPELINE

 

source : https://arxiv.org/abs/2104.00234

- 1. FEATURE EXTRACTION

        - Video Feature Extraction

                - type 1 : based on the clip level : input multiple frames, output one feature vector that captures dynamic info  ex. C3D, I3D

                - type 2 : based on the frame level : feature vector outputs from one frame  ex. VGG, Faster-RCNN

( clip == smallest unit of video features; it is the smallest unit, but it feels more like a small chunk )

( frame == special case of the clip; literally a single image, an extremely thin clip )

 

        - Text Feature Extraction

                - type 1 : extract whole sentence feature

                - type 2 : extract the features of words separately and do more processing
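To make the clip vs frame distinction from the video part above concrete, here is a minimal sketch in plain Python. `split_into_clips` and `toy_encoder` are hypothetical names made up for this note, and the encoder is just an average standing in for a real backbone such as C3D or VGG; it is not how those networks work internally.

```python
from typing import List, Sequence

Frame = List[float]  # assumption: a frame flattened to a list of pixel values


def split_into_clips(frames: Sequence[Frame], clip_len: int) -> List[List[Frame]]:
    """Group consecutive frames into non-overlapping clips; leftover frames are dropped."""
    num_clips = len(frames) // clip_len
    return [list(frames[i * clip_len:(i + 1) * clip_len]) for i in range(num_clips)]


def toy_encoder(clip: Sequence[Frame]) -> List[float]:
    """Stand-in for a real backbone: average the frames of a clip into one vector."""
    dim = len(clip[0])
    return [sum(frame[d] for frame in clip) / len(clip) for d in range(dim)]


# 8 toy frames with 2 "pixels" each
video = [[float(t), float(2 * t)] for t in range(8)]

# clip level: several frames in, one feature vector out
clip_feats = [toy_encoder(c) for c in split_into_clips(video, clip_len=4)]

# frame level: the same code with clip_len=1, i.e. a frame is a clip of length 1
frame_feats = [toy_encoder(c) for c in split_into_clips(video, clip_len=1)]

print(len(clip_feats), clip_feats[0])    # 2 [1.5, 3.0]
print(len(frame_feats), frame_feats[3])  # 8 [3.0, 6.0]
```

The only difference between the two types here is `clip_len`, which matches the note that a frame is a special case of a clip.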

 

- 2. CROSS-MODAL INTERACTION

        - role : to allow video features and text features to perceive each other

        - unique design by researchers

        - main method : attention mechanism, GNN
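As a rough illustration of how an attention mechanism lets the two modalities "perceive each other", below is a minimal sketch of scaled dot-product cross-attention in plain Python. It is a simplification I am assuming for this note: there are no learned projections, the keys double as values, and `cross_attend` is a hypothetical helper name, not something taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], keys: List[Vector]) -> List[Vector]:
    """For each query vector, return a softmax-weighted average of the key vectors."""
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return attended


# toy features: 3 clips and 2 words, both already in the same 2-d space (assumption)
clip_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_feats = [[2.0, 0.0], [0.0, 2.0]]

text_aware_clips = cross_attend(clip_feats, word_feats)   # each clip looks at the words
video_aware_words = cross_attend(word_feats, clip_feats)  # each word looks at the clips

print(text_aware_clips[0])
print(video_aware_words[0])
```

Running attention in both directions gives one text-aware vector per clip and one video-aware vector per word, which is the kind of input a localization stage could then work on.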

 

- 3. LOCALIZATION POLICIES; read the paper to understand the details

        - to generate prediction bounds

 

        - (1) Proposal-Based Methods : 

        - (2) Dense Locators : 

        - (3) Single-Shot : 

        - (4) Reinforcement Learning : 

        - (5) Boundary Aware : 

 

III. CATEGORIES OF NLVL; read the paper to understand the details

- supervised methods

        - proposal-based methods : 

        - dense locator : 

        - single-shot : 

        - reinforcement learning : 

        - boundary aware : 

 

- weakly-supervised methods

 

 

IV. DATASETS AND EVALUATION

- datasets

        - DiDeMo : 

        - TACoS : 

        - Charades-STA : 

        - ActivityNet Captions : 

 

- evaluation

        - R@n, IoU=m == R(n,m)

source : https://arxiv.org/abs/2104.00234

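        - r(n,m,q_i) : 1 if any of the returned predictions has IoU > m, otherwise 0

          ( n : number of timespan predictions. Is this the total length of the predictions, or the number of predicted boxes? My unverified guess is the number of predicted boxes: video length varies a lot in the real world, so building a model that keeps the total predicted length constant would be pointless. If a data point has a very long video with a fairly long gt, a fixed total predicted length would struggle to handle that kind of variety. m : margin, i.e. the IoU threshold. q_i : i-th query )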
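The "R@n, IoU=m" metric above can be sketched in a few lines of Python. This sketch assumes two things that these notes do not fully settle: that n counts the top-n ranked predicted segments (my guess above), and that the final score is the average of r(n,m,q_i) over all queries. The function names and the toy timestamps are made up for illustration.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def r(n: int, m: float, ranked_preds: Sequence[Span], gt: Span) -> int:
    """1 if any of the top-n predictions has IoU > m with the ground truth, else 0."""
    return int(any(temporal_iou(p, gt) > m for p in ranked_preds[:n]))


def recall_at_n(n: int, m: float,
                all_preds: Sequence[Sequence[Span]],
                all_gts: Sequence[Span]) -> float:
    """Average of r(n, m, q_i) over all queries (assumed aggregation)."""
    hits = [r(n, m, preds, gt) for preds, gt in zip(all_preds, all_gts)]
    return sum(hits) / len(hits)


# toy data: 2 queries, each with 2 ranked predictions
preds: List[List[Span]] = [
    [(0.0, 10.0), (4.0, 9.0)],   # query 1
    [(0.0, 5.0), (20.0, 30.0)],  # query 2
]
gts: List[Span] = [(5.0, 10.0), (0.0, 4.0)]

print(recall_at_n(1, 0.5, preds, gts))  # 0.5
print(recall_at_n(2, 0.5, preds, gts))  # 1.0
print(recall_at_n(2, 0.7, preds, gts))  # 0.5
```

In the toy data the first prediction for query 1 has IoU exactly 0.5, which does not count because the condition is strictly IoU > m; the second prediction (IoU about 0.67) is what rescues it at n = 2.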

 

V. PERFORMANCE COMPARISON

 

 

VI. CONCLUSION AND PERSPECTIVES