Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection with Single Point Supervision


Xinyi Ying, Li Liu, Yingqian Wang, Ruojing Li, Nuo Chen, Zaiping Lin, Weidong Sheng, Shilin Zhou


Contact: yingxinyi18@nudt.edu.cn, dreamliu2010@gmail.com, wangyingqian16@nudt.edu.cn, liruojing@nudt.edu.cn, chennuo97@nudt.edu.cn, linzaiping@nudt.edu.cn, shengweidong1111@sohu.com, slzhou@nudt.edu.cn


Abstract


Training a convolutional neural network (CNN) to detect infrared small targets in a fully supervised manner has attracted considerable research interest in recent years, but it is highly labor-intensive because a large number of per-pixel annotations are required. To handle this problem, we make the first attempt to achieve infrared small target detection with point-level supervision. Interestingly, during the training phase supervised by point labels, we discover that CNNs first learn to segment a cluster of pixels near the targets, and then gradually converge to predict groundtruth point labels. Motivated by this "mapping degeneration" phenomenon, we propose a label evolution framework named label evolution with single point supervision (LESPS) to progressively expand the point label by leveraging the intermediate predictions of CNNs. In this way, the network predictions can finally approximate the updated pseudo labels, and a pixel-level target mask can be obtained to train CNNs in an end-to-end manner. We conduct extensive experiments with insightful visualizations to validate the effectiveness of our method. Experimental results show that CNNs equipped with LESPS can recover the target masks from the corresponding point labels, and can achieve over 70% and 95% of their fully supervised performance in terms of pixel-level intersection over union (IoU) and object-level probability of detection (Pd), respectively. Code is available at https://github.com/XinyiYing/LESPS.



Presentation






The Mapping Degeneration Phenomenon


Fig. 1: Illustrations of mapping degeneration under point supervision. CNNs always tend to segment a cluster of pixels near the targets with low confidence at the early stage, and then gradually learn to predict groundtruth point labels with high confidence.



Reasons for Mapping Degeneration:

1. Special imaging mechanism of infrared systems. Targets only have intensity information without structure and texture details, resulting in highly similar pixels within the target region.
2. High local contrast of infrared small targets. Pixels within the target region are much brighter or darker with high contrast against the local background clutter.
3. Easy-to-hard learning property of CNNs. CNNs always tend to learn simple mappings first, and then converge to difficult ones. Compared with region-to-point mapping, region-to-region mapping is easier, and thus tends to be the intermediate result of region-to-point mapping.

In conclusion, the unique characteristics of infrared small targets result in extended mapping regions beyond point labels, and CNNs contribute to the mapping degeneration process.


Fig. 2: Quantitative and qualitative illustrations of mapping degeneration in CNNs.



Fig. 3: Analyses of mapping degeneration with respect to different characteristics of targets (i.e., (a) intensity, (b) size, (c) shape, and (d) local background clutter) and point labels (i.e., (e) number and (f) location). We visualize the zoomed-in target regions of input images with GT point labels (i.e., red dots in images) and the corresponding CNN predictions (at the epoch reaching maximum IoU).




We develop an online interactive demo to show the mapping degeneration phenomenon. You can adjust the epoch number by dragging the sliding blocks.



[Interactive demo: three examples, each with an Image panel, a Feature panel, and an Epoch Number slider (1-300)]








The Label Evolution Framework



Fig. 4: Illustrations of Label Evolution with Single Point Supervision (LESPS). During training, intermediate predictions of CNNs are used to progressively expand point labels to mask labels. Black arrows represent each round of label updates.
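One round of the label update illustrated in Fig. 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function name `evolve_label`, the fixed `threshold`, and the `radius` neighborhood are hypothetical choices made here for clarity: a pixel joins the label when it lies next to the current label and the intermediate prediction scores it highly.

```python
def evolve_label(pred, label, threshold=0.5, radius=1):
    """Return an expanded copy of `label` (a 0/1 grid).

    Assumed rule (hypothetical, for illustration only): a pixel is added when
    it lies within `radius` (Chebyshev distance) of a currently labeled pixel
    and its predicted score is at least `threshold`. Existing label pixels
    are always kept, so the label can only grow.
    """
    h, w = len(label), len(label[0])
    new_label = [row[:] for row in label]  # do not modify the input label
    for y in range(h):
        for x in range(w):
            if label[y][x] != 1:
                continue
            # Visit the neighborhood of each currently labeled pixel.
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    if pred[ny][nx] >= threshold:
                        new_label[ny][nx] = 1
    return new_label


# Toy example: a single point label at the centre of a 5x5 patch.
point = [[0] * 5 for _ in range(5)]
point[2][2] = 1
pred = [
    [0.0, 0.1, 0.1, 0.1, 0.0],
    [0.1, 0.6, 0.7, 0.2, 0.0],
    [0.1, 0.7, 0.9, 0.6, 0.1],
    [0.0, 0.2, 0.6, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
]
round1 = evolve_label(pred, point)   # point label grows into a small mask
round2 = evolve_label(pred, round1)  # no further growth for this fixed pred
```

In this toy case the prediction is held fixed, so the label stops growing after the first round. During real training the prediction changes between rounds, which is what lets the label keep evolving.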



Table 1: Average results achieved by DNAnet with (w/) and without (w/o) LESPS under centroid and coarse point supervision, together with full supervision.


Fig. 5: Quantitative and qualitative results of evolved target masks.



Fig. 6: PA (P) and IoU (I) results of LESPS with respect to (a) initial evolution epoch, (b) Tb and (c) k of evolution threshold, and (d) evolution frequency.
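The four hyperparameters studied in Fig. 6 can be pictured with a small sketch. Everything below is an assumption made for illustration: the function names, the default values, the `ratio` parameter, and in particular the form of the threshold, which is only one plausible way to combine a base value Tb with a slope k. The exact rule used by LESPS is not stated on this page.

```python
def is_update_epoch(epoch, start_epoch, frequency):
    """True when labels would be updated at `epoch` (1-indexed).

    `start_epoch` plays the role of the initial evolution epoch and
    `frequency` the number of epochs between two label updates.
    """
    return epoch >= start_epoch and (epoch - start_epoch) % frequency == 0


def evolution_threshold(max_pred, label_area, image_area,
                        t_b=0.5, k=0.5, ratio=0.0015):
    """Hypothetical adaptive threshold (an assumption, not the paper's rule).

    Starts at t_b * max_pred for a very small label and rises as the label
    area approaches ratio * image_area; k controls how steeply it rises.
    """
    growth = min(1.0, label_area / (image_area * ratio))
    return max_pred * (t_b + k * (1.0 - t_b) * growth)


# Epochs at which labels would be updated in a 20-epoch run.
update_epochs = [e for e in range(1, 21) if is_update_epoch(e, 5, 5)]

# Threshold for a fresh point label versus a label that has grown large.
t_small = evolution_threshold(1.0, 0, 10000)
t_large = evolution_threshold(1.0, 30, 10000)
```

Under these assumptions, a threshold that rises with label size makes early expansion easy and later expansion harder, which would limit runaway growth of the pseudo label.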




Quantitative Results



Table 2: IoU (×10²), Pd (×10²) and Fa (×10⁶) values of different methods achieved on NUAA-SIRST, NUDT-SIRST and IRSTD-1K. “CNN Full”, “CNN Centroid”, and “CNN Coarse” represent CNN-based methods under full supervision, centroid point supervision, and coarse point supervision, respectively. “+” represents CNN-based methods equipped with LESPS.
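For readers unfamiliar with the two metrics defined in the abstract, the following sketch computes pixel-level IoU and object-level Pd on binary masks. It is a simplified illustration, not the evaluation code behind Table 2. The helper names are hypothetical, and the matching rule for Pd (a groundtruth target counts as detected if any predicted pixel overlaps it) is an assumption; published evaluations may use a stricter criterion. Fa is omitted because its definition is not given on this page.

```python
def pixel_iou(pred, gt):
    """Pixel-level intersection over union between two binary grids."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention chosen here: two empty masks agree perfectly.
    return inter / union if union else 1.0


def connected_components(mask):
    """8-connected components of a binary grid, each as a set of (y, x)."""
    h, w = len(mask), len(mask[0])
    seen = set()
    components = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or (y, x) in seen:
                continue
            stack, comp = [(y, x)], set()
            seen.add((y, x))
            while stack:  # iterative flood fill
                cy, cx = stack.pop()
                comp.add((cy, cx))
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            components.append(comp)
    return components


def probability_of_detection(pred, gt):
    """Fraction of groundtruth targets hit by at least one predicted pixel."""
    targets = connected_components(gt)
    if not targets:
        return 1.0  # convention chosen here: nothing to miss
    hits = sum(1 for t in targets if any(pred[y][x] for y, x in t))
    return hits / len(targets)


# Two groundtruth targets; the prediction covers most of the first,
# misses the second, and contains one false-alarm pixel.
gt = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
pred = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
iou = pixel_iou(pred, gt)
pd_value = probability_of_detection(pred, gt)
```

The example shows why the two numbers answer different questions: IoU penalizes every missing or spurious pixel, while Pd only asks whether each target was found at all.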



Table 3: Average IoU (×10²), Pd (×10²) and Fa (×10⁶) values on NUAA-SIRST, NUDT-SIRST and IRSTD-1K of DNAnet trained with pseudo labels generated by different LCM-based methods and by LESPS under centroid and coarse point supervision.




Qualitative Results



Fig. 7: Visualizations of regressed labels during training and network predictions during inference with centroid and coarse point supervision.



Fig. 8: Visual detection results of different methods achieved on NUAA-SIRST, NUDT-SIRST and IRSTD-1K. Correctly detected targets and false alarms are highlighted by red and orange circles, respectively.




Materials