Studying distribution shifts in fully self-supervised ViTs with test-time fine-tuning

This project was undertaken in collaboration with a classmate, Aditya Mehrotra, as part of the course CSC2529: Computational Imaging at the University of Toronto during my undergraduate degree. The project was supervised by Prof. Aviad Levis.

In this project, we aimed to study distribution shifts in pretrained vision transformers (ViTs) when fine-tuned on a downstream task. We focused on fully self-supervised ViTs, which are pretrained on large-scale datasets using self-supervised learning. The downstream classification head in our study was a non-parametric KNN classifier built from the model's training set. Because this head has no trainable weights, we could concentrate our efforts to mitigate the distribution shift on the pretrained backbone. It also meant that, at evaluation time, we could assess whether the learned representations captured the causally relevant features of the input data.
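To make the role of the KNN head concrete, here is a minimal sketch of classifying a query embedding by majority vote over its nearest training embeddings. The cosine-similarity metric, the value of k, the function names, and the toy 2-D features are all assumptions for illustration; they are not taken from the project, where the embeddings would come from the frozen ViT backbone.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(train_feats, train_labels, query, k=3):
    """Predict a label by majority vote over the k most similar training features.

    Hypothetical helper: the metric (cosine) and the unweighted vote are
    assumptions, not the project's documented configuration.
    """
    scored = [
        (cosine_similarity(feat, query), label)
        for feat, label in zip(train_feats, train_labels)
    ]
    # Highest similarity first, then keep the k nearest neighbours.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for backbone embeddings of the training set.
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = [0, 0, 1, 1]

pred_a = knn_predict(train_feats, train_labels, [1.0, 0.05], k=3)
pred_b = knn_predict(train_feats, train_labels, [0.05, 1.0], k=3)
```

Since the head only stores features and labels, any change in accuracy under shift can be attributed to the backbone's representations rather than to a learned classifier.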

We ran experiments with the DINO self-supervised learning algorithm on the CIFAR-10/CIFAR-10.1 and Camelyon17 datasets. We observed that the ViT's learned representations were not robust to the distribution shifts exhibited in these datasets. We then attempted to fine-tune the ViT backbone with self-supervised learning on the test dataset, updating the representations without access to the test labels. In addition to naive fine-tuning, we used LoRA and Elastic Weight Consolidation (EWC) to improve the adaptation of the model's representations to the test distribution. Even with these enhancements, the model did not adapt effectively to the test distribution. We hypothesize that this is due to the lack of large datasets representative of the test distribution, which is likely to be the case in practice as well.
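For readers unfamiliar with the two regularised fine-tuning techniques named above, the sketch below shows their core arithmetic in plain Python. It follows the standard published formulations, not the project's code: Elastic Weight Consolidation adds a penalty (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2 that anchors parameters to their pretrained values, and LoRA replaces a frozen weight W with W + (alpha / r) * B A, where only the low-rank factors B and A are trained. All names, shapes, and numbers here are hypothetical.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    `anchor_params` are the pretrained values and `fisher` holds the diagonal
    Fisher information estimates; both are illustrative inputs.
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def matmul(X, Y):
    """Plain-Python matrix product of X (n x m) and Y (m x p)."""
    return [
        [sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def lora_effective_weight(W, B, A, alpha):
    """LoRA effective weight: W + (alpha / r) * (B @ A), where r = rank = len(A)."""
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# EWC: only the second parameter moved, and it has high Fisher weight.
penalty = ewc_penalty([1.0, 2.0], [1.0, 1.0], [0.5, 2.0], lam=1.0)

# LoRA: a rank-1 update to a 2x2 frozen weight.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape 2 x r
A = [[0.5, 0.0]]     # shape r x 2
W_adapted = lora_effective_weight(W, B, A, alpha=2.0)

# With B initialised to zero (the usual LoRA starting point), W is unchanged.
W_initial = lora_effective_weight(W, [[0.0], [0.0]], A, alpha=2.0)
```

Both techniques limit how far the backbone can move from its pretrained state: EWC through a soft penalty on every parameter, LoRA by confining updates to a low-rank subspace.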

Link to the project report.