Researcher · 2024

Deep Dive into Vision Transformers

Interpretability research on ViT sensitivity to data quality

PyTorch · YOLOv5 · Grad-CAM · Computer Vision

The Problem

Vision Transformers show strong performance on benchmark datasets, but their sensitivity to data quality in real-world scenarios is poorly understood. How much does data quality actually matter, and where do ViTs focus their attention under different conditions?

Approach

I used YOLOv5 to generate facial and body crop annotations for 15,000 Kaggle images, creating datasets of varying quality. I then ran systematic ablation studies on augmentation techniques and dataset scaling. To understand what the models learn, I applied NMF-based deep feature factorization and Grad-CAM to visualize attention patterns under each data condition.
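
The crop-generation step can be sketched with the standard Ultralytics torch.hub interface; the paths, confidence threshold, and helper name below are illustrative, not the project's exact pipeline:

```python
import torch
from pathlib import Path
from PIL import Image

# Load a pretrained YOLOv5 model via torch.hub (standard Ultralytics entry point).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.5  # hypothetical confidence threshold; tune per dataset

def extract_person_crops(image_path: Path, out_dir: Path) -> None:
    """Detect people in one image and save each detection as a crop annotation."""
    results = model(str(image_path))
    detections = results.pandas().xyxy[0]               # one row per detection
    people = detections[detections["name"] == "person"]
    img = Image.open(image_path).convert("RGB")
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, row in people.iterrows():
        box = (int(row.xmin), int(row.ymin), int(row.xmax), int(row.ymax))
        img.crop(box).save(out_dir / f"{image_path.stem}_person_{i}.jpg")
```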
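
The deep feature factorization step amounts to running NMF over one layer's activations so that each factor becomes a spatial "concept" heatmap; this is a minimal sketch assuming activations captured via a forward hook, with the concept count as a placeholder:

```python
import torch
from sklearn.decomposition import NMF

def deep_feature_factorization(activations: torch.Tensor, n_concepts: int = 4):
    """Factor one image's activations (C, H, W) into spatial concept heatmaps.

    NMF requires non-negative input, so activations are clamped at zero
    (post-ReLU features already satisfy this).
    """
    c, h, w = activations.shape
    flat = activations.clamp(min=0).reshape(c, h * w).cpu().numpy()
    nmf = NMF(n_components=n_concepts, init="nndsvd", max_iter=500)
    channel_weights = nmf.fit_transform(flat)                 # (C, n_concepts)
    concept_maps = nmf.components_.reshape(n_concepts, h, w)  # per-concept heatmaps
    return channel_weights, concept_maps
```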
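
Grad-CAM on a ViT needs a token-to-grid reshape before a heatmap can be computed, since transformer blocks emit token sequences rather than spatial maps. A sketch using the pytorch-grad-cam package; the DeiT checkpoint, target layer, and placeholder inputs are assumptions, not the project's exact configuration:

```python
import numpy as np
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

def vit_reshape_transform(tensor, height=14, width=14):
    """Drop the class token and reshape (batch, tokens, C) -> (batch, C, H, W)."""
    result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)

# Assumed model: DeiT-base from torch.hub; any ViT with a 14x14 patch grid works.
model = torch.hub.load("facebookresearch/deit:main", "deit_base_patch16_224",
                       pretrained=True)
target_layers = [model.blocks[-1].norm1]  # last block, per the library's ViT example

cam = GradCAM(model=model, target_layers=target_layers,
              reshape_transform=vit_reshape_transform)

input_tensor = torch.randn(1, 3, 224, 224)            # placeholder; use a normalized image
rgb_image = np.float32(np.random.rand(224, 224, 3))   # placeholder RGB in [0, 1]

grayscale_cam = cam(input_tensor=input_tensor)[0]     # heatmap in [0, 1]
overlay = show_cam_on_image(rgb_image, grayscale_cam, use_rgb=True)
```

Comparing these overlays across the clean and degraded datasets is what surfaces the attention shifts reported below.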

Results

  • 18% precision gain from high-quality annotations vs. raw data
  • 5% accuracy improvement for each additional 1,000 high-fidelity samples in the scaling study
  • Attention visualizations revealing how data quality shifts ViT focus regions