Object Detection from Scratch: Part 2 - Dataset, Labels, and the Reality of Training Data

April 15, 202620 min readNew

Part 2 examines the dataset behind the MTG detector: splits, YOLO labels, class design, annotation noise, and why data quality sets the real performance ceiling.

Object Detection from Scratch: Part 2 - Dataset, Labels, and the Reality of Training Data
React to this article

In machine learning, the dataset is not "input." It is the first version of the system's truth model.

That is why I like the way this repo documents its data. The files are not hidden behind a vague training command. The project explains where the data comes from, how the splits work, and what the labels actually mean in practice.

The Dataset Source

The project uses a Roboflow Universe dataset for Magic: The Gathering card regions. According to the repo documentation, the working version contains thousands of annotated images split into train, validation, and test subsets.

The important detail is not just the count. It is that the annotations are region-level, not card-level. The model is being taught to understand a structured object.

The Split Strategy

The repo explains the classic three-way split clearly:

  • train: where the model learns
  • valid: where training progress is monitored
  • test: where final evaluation stays honest
Loading diagram...

This matters because a detector that performs well on images it has effectively memorized is not useful. The split is one of the first safeguards against self-deception.

The Seven Classes

The class set is practical and downstream-aware:

IDClassWhy it exists
0artUsed for visual matching
1cardCaptures the full card boundary
2descriptionPreserves rules text region
3mana-costSmall, semantically useful detail
4powerAnother small but important region
5tagsType line / category signal
6titlePrimary OCR target

This class design is stronger than a generic "card" detector because it aligns with downstream product needs. The title region exists because OCR needs a clean crop. The art region exists because print disambiguation needs one. The classes reflect system requirements, not academic neatness.

What a YOLO Label Actually Stores

The file format is minimal:

class_id center_x center_y width height

All coordinates are normalized to [0, 1], which keeps the labels resolution-independent.

6 0.5524 0.9418 0.7480 0.1164
1 0.4949 0.5015 0.9754 0.9970

That simplicity is one of YOLO's practical strengths. The same simplicity is also unforgiving. If the label is sloppy, the model will faithfully learn sloppy targets.

Annotation Quality Is the Hidden Bottleneck

One of the most revealing details in the repo is the performance gap between relaxed and strict metrics:

  • strong mAP50
  • meaningfully lower mAP50-95

That pattern is common when the detector is finding roughly the right region but cannot consistently achieve near-pixel-perfect overlap. For large boxes, that is often acceptable. For small boxes like mana-cost and power, a few bad pixels matter a lot.

Loading diagram...

This is why the repo repeatedly returns to annotation quality in the metrics and training documents. Bigger models do not magically fix inconsistent boxes.

Data Quality Checks Matter More Than People Admit

The exploration scripts in this repo are not filler. They generate sample grids, class distributions, and quality checks. That is the right instinct.

Before tuning augmentations or scaling to cloud GPUs, an engineer should be able to answer:

  • Are labels missing?
  • Are some classes underrepresented?
  • Are small boxes consistently too loose or too tight?
  • Do train and validation samples look like the same world?

That is basic engineering hygiene. Without it, parameter tuning becomes expensive superstition.

Annotation Noise and Small Objects

The hardest classes in this project are not surprising. Small semantic regions are always more fragile:

  • mana-cost
  • power
  • title
  • tags

These classes compress a lot of meaning into a small area. Any mismatch in labeling, cropping, camera angle, or resolution hurts disproportionately.

This is also why the training strategy later leans on augmentations like mosaic, multi-scale training, and geometric transforms. Those choices are trying to help the model survive real-world variance without pretending the labels are perfect.

The Correction Workflow Is Part of the Architecture

One of my favorite details in the project is the explicit correction loop:

  • download test images
  • export predictions
  • review in Label Studio
  • merge corrected labels
  • retrain

That is the right way to think about quality improvement. The model is not just trained once. The dataset and the detector can improve together.

Conclusion

A detector learns whatever definition of reality the dataset encodes. If the labels are thoughtful, consistent, and close to the downstream task, the model has a chance. If the labels are noisy, even a strong architecture will plateau early.

This repo gets the important part right: it treats the dataset as a first-class engineering artifact.

In the next article, we move from the data itself to the choices that shape training: transfer learning, augmentation, optimization, and the experiments that pushed this detector to its best result.

Further Reading

Object Detection from ScratchPart 2 of 2
Series Progress2 / 2
Arthur CostaA

Arthur Costa

Senior Full-Stack Engineer & Tech Lead

Senior Full-Stack Engineer with 8+ years in React, TypeScript, and Node.js. Expert in performance optimization and leading engineering teams.

View all articles →