Deep Dive into Vision Transformers
Interpretability research on ViT sensitivity to data quality
The Problem
Vision Transformers show strong performance on benchmark datasets, but their sensitivity to data quality in real-world scenarios is poorly understood. How much does data quality actually matter, and where do ViTs focus their attention under different data conditions?
Approach
I used YOLOv5 to generate face and body crop annotations for 15,000 Kaggle images, producing datasets of varying quality. I then ran systematic ablation studies over augmentation techniques and dataset scale. To understand what the models learn, I applied deep feature factorization (NMF over intermediate activations) and Grad-CAM to visualize where ViT attention falls under each data condition.
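For reference, here is a minimal sketch of the Grad-CAM step applied to a ViT. It uses the open-source pytorch-grad-cam package with a pretrained timm ViT; the model name, target layer, image path, and normalization constants are illustrative assumptions, not details from the original project.

```python
import numpy as np
import timm
import torch
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Illustrative ViT; the project's own checkpoint and labels are not shown here.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

def reshape_transform(tensor, height=14, width=14):
    # Drop the CLS token and reshape patch tokens into a 2D grid so
    # Grad-CAM can treat them like a CNN feature map.
    result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)  # (B, C, H, W)

# Grad-CAM over the first normalization layer of the last transformer block.
cam = GradCAM(
    model=model,
    target_layers=[model.blocks[-1].norm1],
    reshape_transform=reshape_transform,
)

# Load and preprocess one crop (path is a placeholder).
img = Image.open("sample_crop.jpg").convert("RGB").resize((224, 224))
rgb = np.asarray(img, dtype=np.float32) / 255.0
normed = (rgb - 0.5) / 0.5  # assumed mean/std of 0.5 for this ViT variant
input_tensor = torch.from_numpy(normed).permute(2, 0, 1).unsqueeze(0)

# Heatmap for the model's top predicted class, overlaid on the input image.
pred = model(input_tensor).argmax(dim=1).item()
grayscale_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(pred)])[0]
overlay = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)
Image.fromarray(overlay).save("attention_overlay.jpg")
```

Running the same overlay across crops of differing annotation quality is one way to compare how the highlighted regions shift with data conditions.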
Results
- 18% precision boost from high-quality annotations vs. raw data
- 5% accuracy improvement per additional 1,000 high-fidelity samples
- Attention visualizations revealing how data quality shifts the regions ViTs focus on