Synthesizing Training Data for Object Detection in Indoor Scenes



Abstract

Detection of objects in cluttered indoor environments is one of the key enabling functionalities for service robots. The best-performing object detection approaches in computer vision exploit deep Convolutional Neural Networks (CNNs) to simultaneously detect and categorize the objects of interest in cluttered scenes. Training such models typically requires large amounts of annotated training data, which is time consuming and costly to obtain. In this work we explore the use of synthetically generated composite images for training state-of-the-art object detectors, especially for object instance detection. We superimpose 2D images of textured object models onto images of real environments at a variety of locations and scales. Our experiments evaluate different superimposition strategies, ranging from purely image-based blending to depth- and semantics-informed positioning of the object models in real scenes. We demonstrate the effectiveness of these object detector training strategies on two publicly available datasets, GMU-Kitchens and Washington RGB-D Scenes v2. One observation is that augmenting some hand-labeled training data with synthetic examples carefully composited onto scenes yields object detectors with performance comparable to detectors trained on much more hand-labeled data. Broadly, this work charts new opportunities for training detectors for new objects by exploiting existing object model repositories, either in a purely automatic fashion or with only a very small number of human-annotated examples.
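The core compositing step can be sketched in a few lines. The snippet below is a minimal illustration, not the released code: it assumes an RGBA object crop whose alpha channel is the object mask, and uses a Gaussian-feathered alpha blend as a simple stand-in for the blending strategies compared in the paper (see the code release and README for the actual procedures).

import numpy as np
import cv2

def composite(background, obj_rgba, center, scale, feather=5):
    """Paste a scaled object crop onto a background scene with a feathered mask.

    background : HxWx3 uint8 RGB scene image
    obj_rgba   : hxwx4 uint8 object crop; the alpha channel is the object mask
    center     : (x, y) pixel where the object center is placed
    scale      : resize factor applied to the crop
    feather    : Gaussian blur radius (pixels) used to soften the paste seam
    """
    obj = cv2.resize(obj_rgba, None, fx=scale, fy=scale,
                     interpolation=cv2.INTER_LINEAR)
    h, w = obj.shape[:2]
    x0 = int(center[0] - w / 2)
    y0 = int(center[1] - h / 2)
    # Clip the paste region to the image bounds.
    x1, y1 = max(x0, 0), max(y0, 0)
    x2 = min(x0 + w, background.shape[1])
    y2 = min(y0 + h, background.shape[0])
    if x2 <= x1 or y2 <= y1:
        return background, None
    crop = obj[y1 - y0:y2 - y0, x1 - x0:x2 - x0]
    alpha = crop[:, :, 3].astype(np.float32) / 255.0
    if feather > 0:
        k = 2 * feather + 1
        alpha = cv2.GaussianBlur(alpha, (k, k), 0)
    alpha = alpha[:, :, None]
    out = background.astype(np.float32).copy()
    region = out[y1:y2, x1:x2]
    out[y1:y2, x1:x2] = alpha * crop[:, :, :3] + (1.0 - alpha) * region
    bbox = (x1, y1, x2, y2)  # ground-truth box for the pasted instance
    return out.astype(np.uint8), bbox

The returned bounding box can be written out as the annotation for the synthesized frame, which is how composite images provide detector training labels without manual effort.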

Paper

Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, Jana Kosecka
Synthesizing Training Data for Object Detection in Indoor Scenes [pdf]
Robotics: Science and Systems (RSS 2017)

Video Slides



Code

synthesizing_project.zip
For details, please see the README file included in the code zip.
For any questions, please email ggeorgak@gmu.edu.

Data

Preprocessed NYU kitchen scenes with semantic segmentation that were used as background scenes in the paper:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11
part12 part13 part14 part15 part16 part17 part18 part19 part20 part21 part22
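As a rough illustration of the semantics-informed positioning, the sketch below samples candidate placement pixels from per-pixel semantic label maps such as those linked above. It is not part of the released code; the label IDs used for supporting surfaces and the availability of an aligned depth map per frame are assumptions made for the example.

import numpy as np

SUPPORT_LABELS = {7, 12}  # placeholder IDs for counter/table-style classes

def sample_placement(label_map, depth, n_samples=1, rng=None):
    """Sample pixel locations on supporting surfaces and read their depth.

    label_map : HxW integer semantic label image
    depth     : HxW depth map at the same resolution
    Returns a list of ((x, y), depth_value) candidates, or [] if no surface pixels.
    """
    rng = rng or np.random.default_rng()
    mask = np.isin(label_map, list(SUPPORT_LABELS)) & (depth > 0)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return []
    idx = rng.choice(len(xs), size=min(n_samples, len(xs)), replace=False)
    return [((int(xs[i]), int(ys[i])), float(depth[ys[i], xs[i]])) for i in idx]

The sampled depth value can then drive the scale of the pasted crop (closer support surfaces yield larger objects), in the spirit of the depth-informed scaling described in the paper.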

Output synthesized scenes used for training the detectors (SP-BL-SS, see paper for details):
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11

Acknowledgments

We acknowledge support from NSF NRI grant 1527208. Some of the experiments were run on ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University, VA. (URL: http://orc.gmu.edu).