[Paper Review Study] MMDetection: Open MMLab Detection Toolbox and Benchmark

Imjjun 2023. 3. 23. 19:08

Written by 임정준 (16th cohort)

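MMDetection is an object detection toolbox that covers object detection, instance segmentation, and related modules. It grew out of the codebase of the MMDet team, winner of the COCO Challenge, and has since incorporated many other well-known networks. The toolbox provides benchmarks for its models and a flexible toolkit that makes new research easier.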

Introduction

Object detection and instance segmentation are fundamental computer vision tasks, and their pipelines are structurally more complex than those for classification. MMDetection was therefore developed as a unified framework, aiming to provide a high-quality codebase and a unified benchmark.

1) Modular design

Modules can be freely combined and decomposed, so a customized object detection framework can be built with little effort.
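To make the modular idea concrete, here is a minimal sketch of a registry-plus-config pattern. It is an illustration only: `REGISTRY`, `register`, `build`, `ToyBackbone`, and `ToyNeck` are hypothetical names invented for this sketch and are not MMDetection's actual API.

```python
# Hypothetical sketch of a registry/config pattern; not MMDetection's real API.
REGISTRY = {}

def register(cls):
    """Class decorator that records a component under its class name."""
    REGISTRY[cls.__name__] = cls
    return cls

def build(cfg):
    """Instantiate a component from a dict whose 'type' key names the class."""
    args = dict(cfg)                 # copy so the caller's config is not mutated
    cls = REGISTRY[args.pop("type")]
    return cls(**args)

@register
class ToyBackbone:
    def __init__(self, depth):
        self.depth = depth

@register
class ToyNeck:
    def __init__(self, out_channels):
        self.out_channels = out_channels

# Swapping a component only requires editing the config dict.
backbone = build({"type": "ToyBackbone", "depth": 50})
neck = build({"type": "ToyNeck", "out_channels": 256})
print(type(backbone).__name__, backbone.depth, type(neck).__name__, neck.out_channels)
```

The point of the pattern is that the training script never names a concrete class; it only reads the config, so components can be combined or replaced without touching the code.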

2) Support of multiple frameworks out of box

A wide range of detection frameworks is supported out of the box.

3) High efficiency

All basic bbox and mask operations run on the GPU, so training is faster than or comparable to Detectron, maskrcnn-benchmark, and SimpleDet.

4) State-of-the-art

At the time the paper was published, the codebase had been developed by the winning team of the COCO Challenge 2018.

Supported Frameworks

The modules covered in the paper are listed below. MMDetection has since added many newer models, which makes it relatively easy to build and train a customized framework.

The link below gives an overview that includes recently added models such as YOLOX, TOOD, and Mask2Former.

https://github.com/open-mmlab/mmdetection#benchmark-and-model-zoo

2.1 Single-stage Methods
- SSD: Classic one-stage method, 2015
- RetinaNet: High performance single-stage detector, 2017
- GHM: Gradient Harmonizing mechanism, 2019
- FCOS: Fully convolutional anchor-free detector, 2019
- FSAF: Feature selective anchor-free module, 2019
2.2 Two-stage Methods
- Fast R-CNN
- Faster R-CNN
- R-FCN: Fully convolutional object detector, 2016
- Mask R-CNN
- Grid R-CNN: Grid guided localization mechanism, 2018
- Mask Scoring R-CNN: Predicting the mask IoU, 2019
- Double-Head R-CNN: different heads for classification & localization, 2019
2.3 Multi-stage Methods
- Cascade R-CNN: multi-stage detection method, 2017
- Hybrid Task Cascade: multi-stage multi-branch object detection & instance segmentation method (1st place in COCO Challenge 2018), 2019
2.4 General Modules and Methods
- Mixed Precision Training: training using half-precision floating-point (FP16) numbers, 2018
- Soft NMS: alternative to NMS, 2017
- OHEM: online sampling methods, 2016
- DCN: deformable convolution & deformable RoI pooling, 2017
- DCNv2: modulated DCN, 2018
- Train from Scratch: training from random initialization, 2018
- ScratchDet: another exploration on TFS, 2018
- M2Det: new feature pyramid network, 2018
- GCNet: global context block, 2019
- Generalized Attention: generalized attention formulation, 2019
- SyncBN: synchronized batch normalization across GPUs
- Group Normalization
- Weight Standardization: standardizing the weights for micro-batch training, 2019
- HRNet: new backbone for high-resolution representations, 2019
- Guided Anchoring: anchoring scheme that predicts sparse & arbitrary-shaped anchors, 2019
- Libra R-CNN: new framework towards balanced learning, 2019

Architecture

3.1. Model Representation

Each model has its own architecture, but most are composed of the components shown in the image above.

Backbone

Transforms an image into feature maps, e.g. ResNet-50 with its last fully connected (FC) layer removed.

Neck

Connects the backbone and the heads, refining and reconfiguring the raw feature maps. Feature Pyramid Network (FPN) is a typical example.

DenseHead (AnchorHead/AnchorFreeHead)

Operates on dense locations of the feature maps. RPNHead, RetinaHead, and FCOSHead belong here.

RoIExtractor

Extracts RoI-wise features from the feature maps; SingleRoIExtractor is an example.

RoIHead (BBoxHead/MaskHead)

Takes RoI features as input and makes RoI-wise, task-specific predictions such as bbox classification/regression and mask prediction.
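The five components above can be written down as a nested config. The sketch below uses a plain Python dict; the key names are simplified to match the component names in this post, and the concrete values (`depth=50`, `out_channels=256`, the `"RoIHead"` type string) are assumptions for illustration, so they may not match any released MMDetection config.

```python
# Illustrative two-stage detector config as a plain dict. Key names mirror the
# components described above; concrete values are assumptions for this sketch.
model = dict(
    type="FasterRCNN",
    backbone=dict(type="ResNet", depth=50),            # image -> feature maps
    neck=dict(type="FPN", out_channels=256),           # refine / reconfigure features
    dense_head=dict(type="RPNHead"),                   # dense predictions on feature maps
    roi_extractor=dict(type="SingleRoIExtractor"),     # feature maps -> RoI features
    roi_head=dict(type="RoIHead",                      # RoI-wise predictions
                  bbox_head=dict(type="BBoxHead"),
                  mask_head=None),                     # no mask branch in this sketch
)

def component_types(cfg):
    """Return {component name: type string} for each top-level sub-dict with a 'type'."""
    return {k: v["type"] for k, v in cfg.items()
            if isinstance(v, dict) and "type" in v}

print(component_types(model))
```

Turning this into a mask-predicting variant would only mean filling in `mask_head`, which is the kind of change the component split is designed to make cheap.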

Training Pipeline

Training in MMDetection is organized around epochs and iterations, and more specifically around the stages before_train_epoch, before_train_iter, after_train_iter, and after_train_epoch. Hooks registered for these stages run their methods at the corresponding time points.
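A hook-driven loop of this kind can be sketched in a few lines. Only the four stage names come from the description above; `Hook`, `LoggingHook`, and `ToyRunner` are hypothetical classes written for this example and do not reproduce the real runner.

```python
# Minimal hook-driven training loop (hypothetical classes, illustration only).
class Hook:
    def before_train_epoch(self, runner): pass
    def before_train_iter(self, runner): pass
    def after_train_iter(self, runner): pass
    def after_train_epoch(self, runner): pass

class LoggingHook(Hook):
    """Records which stage fired, to make the call order visible."""
    def __init__(self):
        self.events = []
    def before_train_epoch(self, runner):
        self.events.append(("before_train_epoch", runner.epoch))
    def after_train_iter(self, runner):
        self.events.append(("after_train_iter", runner.iter))
    def after_train_epoch(self, runner):
        self.events.append(("after_train_epoch", runner.epoch))

class ToyRunner:
    def __init__(self, hooks):
        self.hooks, self.epoch, self.iter = hooks, 0, 0
    def call_hook(self, stage):
        for hook in self.hooks:              # hooks run in registration order
            getattr(hook, stage)(self)
    def train(self, num_epochs, iters_per_epoch):
        for epoch in range(num_epochs):
            self.epoch = epoch
            self.call_hook("before_train_epoch")
            for _ in range(iters_per_epoch):
                self.call_hook("before_train_iter")
                # forward / backward / optimizer step would happen here
                self.call_hook("after_train_iter")
                self.iter += 1
            self.call_hook("after_train_epoch")

log = LoggingHook()
ToyRunner([log]).train(num_epochs=1, iters_per_epoch=2)
print(log.events)
```

Things like logging, checkpointing, or learning-rate updates can then live in separate hooks instead of being hard-coded into the loop.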

Experiment

Dataset

All experiments are benchmarked on MS COCO 2017, using its train split for training and its val split for evaluation.

Implementation details

i) Images are resized to a maximum scale of 1333×800 without changing the aspect ratio

ii) 8 V100 GPUs with a total batch size of 16 (2 images per GPU) for training; a single V100 GPU for inference

iii) Training schedule: same as Detectron
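The resize rule in item i) can be sketched as follows. This assumes the aspect ratio is preserved and that the longer edge is capped at 1333 and the shorter edge at 800; the function name `rescale_size` and the rounding convention are choices made for this sketch.

```python
def rescale_size(width, height, max_long=1333, max_short=800):
    """Largest size that fits within max_long x max_short at the original aspect ratio."""
    scale = min(max_long / max(width, height), max_short / min(width, height))
    return round(width * scale), round(height * scale), scale

# A 4:3 image is limited by the short edge, a very wide one by the long edge.
print(rescale_size(640, 480))
print(rescale_size(1000, 300))
```

Whichever edge hits its cap first determines the scale factor, so no resized image exceeds 1333 on its long side or 800 on its short side.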

Evaluation metrics

Standard COCO-style evaluation: detection results are computed over multiple IoU thresholds from 0.5 to 0.95

* RPN results are measured with Average Recall (AR)
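To show what "multiple IoU thresholds" means in practice, the sketch below computes the IoU of a predicted and a ground-truth box and lists the thresholds at which the prediction would count as a match. It assumes the usual COCO step of 0.05 between thresholds, and it deliberately stops short of a full AP or AR computation; the helper names are made up for this example.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# Ten thresholds: 0.50, 0.55, ..., 0.95 (step of 0.05 assumed).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def matched_thresholds(pred, gt):
    """Thresholds at which `pred` would count as a match for `gt`."""
    iou = box_iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if iou >= t]

pred, gt = (0, 0, 10, 10), (0, 0, 10, 8)
print(box_iou(pred, gt), matched_thresholds(pred, gt))
```

A box that is only roughly aligned passes the loose thresholds and fails the strict ones, which is why averaging over the whole range rewards precise localization.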

The graph below shows benchmark results for the models in MMDetection.

The next graph shows inference speed for different GPU types.

The table below compares the same models across codebases; MMDetection's memory usage compares fairly well with the other codebases.

Mixed precision training, which Detectron does not support, saves GPU memory while keeping computation efficient. In the Faster R-CNN experiments in particular, the table below shows that the memory savings grow with larger batch sizes.

The table below shows the mixed precision training results by codebase; Detectron does not yet support mixed precision training.

Simpler frameworks such as RetinaNet showed even larger memory savings.
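A rough way to see where the memory savings come from is to compare the storage of one activation tensor in FP32 and FP16. The feature-map shape below is a made-up example, and real savings are smaller than the 2x shown here because some tensors are typically kept in FP32.

```python
import struct

BYTES_FP32 = struct.calcsize("f")   # IEEE 754 single precision
BYTES_FP16 = struct.calcsize("e")   # IEEE 754 half precision

def tensor_mebibytes(shape, bytes_per_value):
    """Storage for a dense tensor of the given shape, in MiB."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value / 2**20

# Hypothetical feature map: batch 2, 256 channels, 200 x 336 spatial size.
shape = (2, 256, 200, 336)
print(tensor_mebibytes(shape, BYTES_FP32), tensor_mebibytes(shape, BYTES_FP16))
```

Because activation memory grows with the batch dimension, the absolute saving from halving the element size also grows with batch size, which is consistent with the trend described above.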

In the multi-node scalability experiment below, training was run on 8, 16, 32, and 64 GPUs, with the base learning rate scaled linearly with the batch size.
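The linear scaling of the learning rate can be written as a one-line rule. The base value of 0.02 for a total batch size of 16 is an assumption used for this sketch (it is a common default for this kind of setup), not a number taken from this post.

```python
def scaled_lr(total_batch_size, base_lr=0.02, base_batch_size=16):
    """Linear scaling rule: learning rate proportional to total batch size."""
    return base_lr * total_batch_size / base_batch_size

for gpus in (8, 16, 32, 64):
    batch = gpus * 2                 # 2 images per GPU, as in the setup above
    print(gpus, batch, scaled_lr(batch))
```

Doubling the number of GPUs doubles the total batch size and therefore the learning rate, so the 64-GPU run uses eight times the base rate under this assumption.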

Extensive Studies

Regression Losses

The table above reports the loss experiments on Faster R-CNN with ResNet-50-FPN. For the L1-family losses, changing the loss weight gave only limited gains, whereas for IoU Loss, adjusting the loss weight reduced the loss to as low as 30.0.
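For reference, the losses being compared can be sketched on scalar inputs. These are textbook forms written for illustration: the `-log(IoU)` variant of IoU loss is one common choice (a `1 - IoU` form also exists), and the function names and default `beta` are assumptions rather than the settings used in the paper's table.

```python
import math

def l1_loss(diff):
    """Absolute error on one regression target."""
    return abs(diff)

def smooth_l1_loss(diff, beta=1.0):
    """Quadratic for |diff| < beta, linear otherwise."""
    d = abs(diff)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def iou_loss(iou, eps=1e-6):
    """-log(IoU) form, clamped to avoid log(0)."""
    return -math.log(max(iou, eps))

def weighted(loss_value, loss_weight):
    """The loss weight simply rescales the term before it is summed with others."""
    return loss_weight * loss_value

for diff in (0.05, 0.5, 2.0):
    print(diff, l1_loss(diff), smooth_l1_loss(diff), smooth_l1_loss(diff, beta=1 / 9))
print(weighted(iou_loss(0.5), 10.0))
```

For small errors Smooth L1 produces a much smaller value than L1, so the regression term contributes less to the total loss unless the loss weight is raised, which is why the loss weight is tuned per loss type.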

Normalization layers

FrozenBN: uses the fixed statistics of the pretrained backbone

SyncBN (Synchronized BN): BN with statistics synchronized across multiple GPUs

GN (Group Normalization): used when memory is limited

The results in the table above were obtained with Mask R-CNN to answer two questions: (1) how much do the normalization layers differ from one another, and (2) where in the detector should normalization layers be added?

SyncBN outperformed plain FrozenBN, and with GN, heads with more convolution layers achieved higher performance.
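To show how GN differs from batch-based normalization, here is a pure-Python sketch of group normalization for a single sample. It omits the learnable scale and shift, and the nested-list data layout is a simplification chosen for this example.

```python
import math

def group_norm(channels, num_groups, eps=1e-5):
    """Normalize one sample. `channels` is a list of C lists, each holding the
    flattened spatial values of one channel; C must be divisible by num_groups."""
    per_group = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * per_group:(g + 1) * per_group]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in ch] for ch in group])
    return out

sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]  # C=4, H*W=2
normed = group_norm(sample, num_groups=2)
print(normed)
```

The statistics are computed inside one sample, so the result does not depend on how many images share a GPU; that is the property that makes GN attractive when only small per-GPU batches fit in memory.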

Training scales

Input image scale was also studied using Mask R-CNN with ResNet-50-FPN and a 2x lr schedule: training with 1333×[640:960] performed about 0.4% to 0.5% better than 1333×[640:800].
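The notation 1333×[640:960] can be read as "long edge capped at 1333, short edge drawn from the range 640 to 960". The sketch below assumes that reading and a uniform draw over the range; `sample_train_scale` is a hypothetical helper, not a library function.

```python
import random

def sample_train_scale(short_range=(640, 800), long_edge=1333, rng=random):
    """Draw the short edge uniformly from short_range (inclusive); long edge is fixed."""
    short_edge = rng.randint(*short_range)
    return long_edge, short_edge

rng = random.Random(0)               # seeded so repeated runs give the same draws
scales = [sample_train_scale((640, 960), rng=rng) for _ in range(5)]
print(scales)
```

A wider range exposes the model to more object sizes during training, which is a plausible reason for the gain reported above.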

Other Hyper-parameters

Beyond the experiments above, additional experiments varied smoothl1_beta, allowed_border (when set to inf, no anchors are excluded), and neg_pos_ub (which governs the sampling of positive and negative anchors). Settings of 1/9, inf, and 3 respectively gave roughly a 2% improvement over the common setting.
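The two anchor-related parameters can be illustrated with small helpers. The semantics below are my reading of the parameter names (`neg_pos_ub` as an upper bound on the negative-to-positive ratio, `allowed_border` as the number of pixels an anchor may extend past the image), so treat them as assumptions; `cap_negatives` and `inside_image` are hypothetical functions, not the library's implementation.

```python
import math

def cap_negatives(num_pos, num_expected_neg, neg_pos_ub=3):
    """Limit negatives to at most neg_pos_ub times the positives (at least one positive assumed)."""
    upper = neg_pos_ub * max(1, num_pos)
    return min(num_expected_neg, upper)

def inside_image(anchor, img_w, img_h, allowed_border=0):
    """True if the anchor (x1, y1, x2, y2) exceeds the image by at most allowed_border pixels."""
    x1, y1, x2, y2 = anchor
    return (x1 >= -allowed_border and y1 >= -allowed_border and
            x2 <= img_w + allowed_border and y2 <= img_h + allowed_border)

print(cap_negatives(num_pos=10, num_expected_neg=246))
print(inside_image((-5, 0, 20, 20), 100, 100))
print(inside_image((-5, 0, 20, 20), 100, 100, allowed_border=math.inf))
```

With `allowed_border=math.inf` every comparison passes, which matches the note above that no anchors are excluded in that setting.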