YOLO v7 training

6D object pose estimation with MegaPose requires 3 inputs:

  1. Object CAD model (with texture)
  2. RGB image
  3. Region of interest (bounding box) where the object is located in the RGB image

To get this bounding box, we used YOLO v7 (one of the faster real-time object detectors available in 2024) to detect our wood block.
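As a minimal sketch of how a detection can become the region of interest, the function below converts a box from the normalized (x_center, y_center, width, height) format used in YOLO label files into pixel corner coordinates. The function name is hypothetical, and it is an assumption here that the pose estimator expects pixel corners (x1, y1, x2, y2); check the expected format before using it.

```python
def yolo_to_xyxy(box, img_w, img_h):
    """Convert a normalized YOLO box (xc, yc, w, h) to pixel corners (x1, y1, x2, y2)."""
    xc, yc, w, h = box
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    # Clamp to the image so the region of interest never leaves the frame.
    x1, x2 = max(0.0, x1), min(float(img_w), x2)
    y1, y2 = max(0.0, y1), min(float(img_h), y2)
    return x1, y1, x2, y2


# A box centered in a 640x480 image, covering half of each dimension.
roi = yolo_to_xyxy((0.5, 0.5, 0.5, 0.5), img_w=640, img_h=480)
print(roi)
```

Clamping is a design choice: a box predicted partly outside the frame is cut at the image border instead of being passed on with negative coordinates.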

The dataset used to train the YOLO v7 model is composed of images gathered from 3 different sources:

  1. Generated synthetic images, where the block texture is a common wood texture (available online)
  2. Generated synthetic images, where the object has its own real texture (collected from real images)
  3. Captured real images, where the labeling (ground truth for the bounding boxes) was done manually

The dataset distribution is shown in the following table:

              Total    Synthetic images       Synthetic images    Real images
                       (synthetic texture)    (real texture)
Train         55105    51840                  3072                193
Validation    13774    12960                  768                 46
Total         68879    64800                  3840                239
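The table can be checked with a few lines of arithmetic. The sketch below only re-adds the counts given above; the variable names are illustrative. It shows that the split is roughly 80% train and 20% validation, and that real images make up well under 1% of the dataset.

```python
# Image counts copied from the table:
# (synthetic texture, real texture, real images)
splits = {
    "train": (51840, 3072, 193),
    "validation": (12960, 768, 46),
}

# Row totals and the overall total.
totals = {name: sum(counts) for name, counts in splits.items()}
grand_total = sum(totals.values())

# Share of the data used for training, and share of real images overall.
train_fraction = totals["train"] / grand_total
real_share = sum(counts[2] for counts in splits.values()) / grand_total

print(totals, grand_total)
print(f"train fraction: {train_fraction:.3f}")
print(f"real image share: {real_share:.2%}")
```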

Here is an example of each one of the 3 types of images:

  • Synthetic image; object with a common online (synthetic) wood texture:


 

  • Synthetic image; object with its own real texture:

 

  • Real image, with real object:

 

Training started from the publicly available, pre-trained yolov7-tiny weights, so this step is a fine-tuning of YOLOv7. The following parameters were selected:

  • epochs: 600
  • batch size: 32 
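To put these parameters in perspective, the sketch below estimates how many optimizer iterations they imply for the training split. It assumes the last, partial batch of each epoch is kept, and it uses the reported training time of "more than 80 hours" as a lower bound, so the time per iteration is also only a lower bound.

```python
import math

train_images = 55105   # training split size from the dataset table
batch_size = 32
epochs = 600
training_hours = 80    # reported as "more than 80 hours", so a lower bound

# Assumption: the final partial batch of each epoch is not dropped.
iters_per_epoch = math.ceil(train_images / batch_size)
total_iters = iters_per_epoch * epochs

# Lower bounds on the time spent per iteration and per epoch.
seconds_per_iter = training_hours * 3600 / total_iters
minutes_per_epoch = training_hours * 60 / epochs

print(iters_per_epoch, total_iters)
print(f"{seconds_per_iter:.3f} s per iteration, {minutes_per_epoch:.1f} min per epoch")
```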

After more than 80 hours of training, the evolution of the metrics throughout training is shown below:


These graphs raise several questions about why the trained model has difficulty converging:

  1. Does the mixed dataset influence the training procedure? Would a homogeneous dataset stabilize the training curves?
  2. Too many epochs? It seems that, for this number of images and batch size, 400 epochs would be enough.
  3. Is the batch size too small? Should I test 128 images per batch instead of 32?
  4. Should the dataset contain more or fewer images?
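One way to approach the question about the number of epochs is to find the last epoch at which the validation metric still improved by a meaningful margin. The sketch below is a simple plateau check, not part of YOLOv7 itself; the function name, the `min_delta` threshold and the example values are all hypothetical, and it assumes the per-epoch validation mAP has been read from the training log into a list.

```python
def last_improvement_epoch(metric_per_epoch, min_delta=0.001):
    """Return the 1-based epoch of the last improvement larger than min_delta."""
    best = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        # Count the epoch only if it beats the best value so far by min_delta.
        if value > best + min_delta:
            best = value
            best_epoch = epoch
    return best_epoch


# Hypothetical validation mAP values, for illustration only
# (these are NOT the results of the training run described here).
example_map = [0.10, 0.30, 0.45, 0.50, 0.505, 0.5052, 0.5049, 0.5051]
print(last_improvement_epoch(example_map))
```

If the returned epoch is far below the total number of epochs, the remaining epochs added little, which would support training for fewer epochs next time.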
