Deep Dive into Vision Transformers
Interpretability research on ViT sensitivity to data quality
The Problem
Vision Transformers show strong performance on benchmark datasets, but their sensitivity to data quality in real-world scenarios is poorly understood. How much does data quality actually matter, and where do ViTs focus their attention under different data conditions?
Approach
I used YOLOv5 to generate face and body crop annotations for 15,000 Kaggle images, producing datasets of varying quality. I then ran systematic ablation studies over augmentation techniques and dataset scale. To understand what the models learn, I applied deep feature factorization (NMF over intermediate activations) and Grad-CAM to visualize where ViT attention falls under each data condition.
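For reference, here is a minimal sketch of the Grad-CAM step applied to a ViT. It uses the open-source pytorch-grad-cam package with a pretrained timm ViT; the model name, target layer, image path, and normalization constants are illustrative assumptions, not details from the original project.

```python
import numpy as np
import timm
import torch
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Illustrative ViT; the project's own checkpoint and labels are not shown here.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

def reshape_transform(tensor, height=14, width=14):
    # Drop the CLS token and reshape patch tokens into a 2D grid so
    # Grad-CAM can treat them like a CNN feature map.
    result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)  # (B, C, H, W)

# Grad-CAM over the first normalization layer of the last transformer block.
cam = GradCAM(
    model=model,
    target_layers=[model.blocks[-1].norm1],
    reshape_transform=reshape_transform,
)

# Load and preprocess one crop (path is a placeholder).
img = Image.open("sample_crop.jpg").convert("RGB").resize((224, 224))
rgb = np.asarray(img, dtype=np.float32) / 255.0
normed = (rgb - 0.5) / 0.5  # assumed mean/std of 0.5 for this ViT variant
input_tensor = torch.from_numpy(normed).permute(2, 0, 1).unsqueeze(0)

# Heatmap for the model's top predicted class, overlaid on the input image.
pred = model(input_tensor).argmax(dim=1).item()
grayscale_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(pred)])[0]
overlay = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)
Image.fromarray(overlay).save("attention_overlay.jpg")
```

Running the same overlay across crops of differing annotation quality is one way to compare how the highlighted regions shift with data conditions.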
Results
- 18% precision boost from high-quality annotations vs. raw data
- 5% accuracy improvement per additional 1,000 high-fidelity samples
- Attention visualizations revealing how data quality shifts the regions ViTs focus on