Project Overview
This system automates training-data generation for object detection models: a natural-language description of the target objects drives both detection and segmentation. It pairs Grounding DINO's zero-shot object detection with the Segment Anything Model (SAM) for precise segmentation, producing rotated bounding boxes and masks suitable for computer vision datasets.
Core Features
- Text-prompt-based object detection using Grounding DINO
- High-precision segmentation masks with Segment Anything Model (SAM)
- Automatic generation of rotated bounding boxes, including for objects that extend past the image boundaries
- Dual output formats: ImageNet XML and custom Cartel JSON
- Synthetic data generation through background overlay
- Dockerized deployment with CUDA GPU support
- Interactive Gradio web interface for real-time labeling
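As an illustration of the ImageNet XML output listed above, here is a minimal sketch of writing one annotation with Python's standard library. ImageNet annotations follow the Pascal VOC layout (`annotation` > `object` > `bndbox`); the function name `build_annotation_xml`, the choice of fields, and the clamping behavior are assumptions made for this example, not the project's actual writer, and the custom Cartel JSON format is not covered here.

```python
import xml.etree.ElementTree as ET


def build_annotation_xml(filename, width, height, objects):
    """Build an ImageNet/Pascal VOC-style annotation string.

    `objects` is a list of (label, xmin, ymin, xmax, ymax) tuples in pixels.
    Hypothetical helper for illustration only.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename

    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"  # assumption: RGB input

    for label, xmin, ymin, xmax, ymax in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        # Assumption: clamp to the image so partially visible objects
        # still produce coordinates inside the frame.
        ET.SubElement(box, "xmin").text = str(max(0, int(xmin)))
        ET.SubElement(box, "ymin").text = str(max(0, int(ymin)))
        ET.SubElement(box, "xmax").text = str(min(width, int(xmax)))
        ET.SubElement(box, "ymax").text = str(min(height, int(ymax)))

    return ET.tostring(root, encoding="unicode")
```

Calling `build_annotation_xml("belt_001.jpg", 640, 480, [("parcel", -5, 10, 700, 200)])` returns a single XML string with one `object` element whose box is clamped to the 640x480 frame. This format stores axis-aligned boxes only, so a rotated box would need extra fields that this sketch does not attempt to guess.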
Technical Implementation
The pipeline runs each image through Grounding DINO for detection and passes the resulting bounding boxes to SAM for segmentation. It then generates rotated bounding boxes using a custom min_in_image_area_rect algorithm, which handles objects that extend beyond the image boundaries. That capability is critical for logistics and conveyor belt applications.
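To make the rotated-box step concrete, the sketch below computes a plain minimum-area rotated rectangle around a set of points (for example, mask contour pixels) in pure Python. This is the standard unconstrained technique, not the project's min_in_image_area_rect, whose boundary handling is not described here; the function names `convex_hull` and `min_area_rect` are chosen for this example. It relies on the known property that a minimum-area enclosing rectangle has one side collinear with an edge of the convex hull.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect(points):
    """Return (area, corners) of the smallest rotated rectangle around points."""
    hull = convex_hull(points)
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear points")

    best = None
    for i in range(len(hull)):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % len(hull)]
        length = math.hypot(x1 - x0, y1 - y0)
        ux, uy = (x1 - x0) / length, (y1 - y0) / length  # along the hull edge
        vx, vy = -uy, ux                                 # perpendicular to it
        # Project every hull vertex onto the edge-aligned axes.
        us = [x * ux + y * uy for x, y in hull]
        vs = [x * vx + y * vy for x, y in hull]
        area = (max(us) - min(us)) * (max(vs) - min(vs))
        if best is None or area < best[0]:
            extents = ((min(us), min(vs)), (max(us), min(vs)),
                       (max(us), max(vs)), (min(us), max(vs)))
            # Map the rectangle corners back to image coordinates.
            corners = [(u * ux + v * vx, u * uy + v * vy) for u, v in extents]
            best = (area, corners)
    return best
```

For a diamond-shaped point set such as `[(0, 1), (1, 0), (2, 1), (1, 2)]`, the rotated rectangle is tighter than the axis-aligned box around the same points. An object cut off by the image border only exposes part of its outline, so a rectangle fitted to the visible pixels alone can misjudge the object's orientation and extent, which is presumably the case the custom algorithm is designed to address.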