Note that this perspective provides an alternative to the conventional view of local neighborhoods as hyperspheres or balls of a certain radius in space. This is made possible in NNK by taking into account the relative positions of nodes $j$ and $k$ via the metric on $(j, k)$, in addition to the metrics on $(i, j)$ and $(i, k)$ that were previously used for the selection and weighting of neighbors $j, k$ of node $i$.
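To make this concrete, below is a minimal sketch of how NNK interpolation weights can be obtained for a single query node from kernel values alone; the pairwise block $K_S$ among candidate neighbors is exactly where the $(j, k)$ terms enter. The function name `nnk_weights`, the regularization constant, and the final normalization are our own illustrative choices, not the official implementation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import nnls

def nnk_weights(K_S, K_Si, reg=1e-6):
    """Illustrative NNK weights for one query node i (sketch, not the official code).

    K_S  : (k, k) kernel matrix among the k candidate neighbors of i
    K_Si : (k,)   kernel values between i and each candidate
    Solves  min_{theta >= 0}  0.5 * theta^T K_S theta - theta^T K_Si
    by rewriting it as a non-negative least-squares problem.
    """
    k = len(K_Si)
    L = cholesky(K_S + reg * np.eye(k), lower=True)   # K_S = L @ L.T (small jitter for stability)
    c = solve_triangular(L, K_Si, lower=True)         # c = L^{-1} K_Si
    theta, _ = nnls(L.T, c)                           # argmin ||L.T theta - c||, theta >= 0
    return theta / max(theta.sum(), 1e-12)            # normalize to interpolation weights
```

Candidates whose contribution is already explained by the selected neighbors receive zero weight, so the effective neighborhood adapts to the local geometry rather than being a fixed-radius ball.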
Application of NNK to classification is also interesting from a recent theoretical point of view: interpolative classifiers, namely classifiers with zero training error, are capable of generalizing well to test data, contrary to the conventional wisdom (interpolation ⇒ poor generalization). Note that overparameterized neural network models today are increasingly trained to zero training error and can be considered to fall under this category of interpolative classifiers. In the experiments that follow, we use a normalized cosine kernel (the cosine kernel shifted and scaled to take values between $0$ and $1$) for both NNK and KNN.
The table below lists Top-1 linear, weighted KNN, and NNK classification accuracy on ImageNet using a fixed backbone architecture, ResNet-50, trained with different self-SL strategies. The evaluation protocol follows the standard setup where performance is evaluated on the validation set using the labeled training set. The KNN and NNK evaluations were run in the self-SL framework VISSL with officially released model weights, setting the number of neighbors $k$ to $50$; a sketch of this evaluation is given after the table. The linear evaluation results are as reported in the corresponding self-SL papers and were not validated by us.
| Method | Linear (%) | KNN (%) | NNK (%) |
|---|---|---|---|
| Supervised | 76.1 | 74.5 | 75.4 |
| MoCo-v2 | 71.1 | 60.3 | 64.9 |
| DeepClusterV2 | 75.2 | 65.8 | 70.7 |
| SwAV | 75.3 | 63.2 | 68.7 |
| DINO | 75.3 | 65.6 | 71.1 |
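For concreteness, the block below sketches the weighted KNN side of this evaluation, assuming train and validation features have already been extracted from the frozen backbone; it reuses `normalized_cosine_kernel` from above. Batching over validation samples and other engineering details handled by the VISSL benchmark code are omitted.

```python
import numpy as np

def weighted_knn_predict(train_feats, train_labels, val_feats, k=50, num_classes=1000):
    """Sketch of the weighted KNN evaluation: similarity-weighted vote over top-k neighbors."""
    K = normalized_cosine_kernel(val_feats, train_feats)   # (n_val, n_train) similarities
    topk_idx = np.argsort(-K, axis=1)[:, :k]               # k most similar training samples
    preds = np.empty(len(val_feats), dtype=np.int64)
    for i, idx in enumerate(topk_idx):
        votes = np.zeros(num_classes)
        np.add.at(votes, train_labels[idx], K[i, idx])     # accumulate kernel-weighted class votes
        preds[i] = votes.argmax()
    return preds
```

The NNK counterpart replaces the raw kernel weights in the vote with the output of `nnk_weights`; a sketch of that variant is given further below.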
To evaluate the robustness of NNK relative to KNN, we use a recently introduced self-SL model, DINO (self-distillation with no labels), as our baseline and compare Top-1 accuracy on ImageNet for different values of $k$. As can be seen in the figure below, NNK not only consistently outperforms a weighted KNN classifier but does so in a robust manner. Further, unlike KNN, whose performance decreases as $k$ increases, NNK improves with $k$. This can be explained by the fact that NNK accommodates new neighbors only if they belong to a new direction in space that improves its interpolation, whereas KNN simply interpolates with all $k$ neighbors. In this setup, the NNK classifier achieves performance on par with, if not better than, the linear classifier, with the small ViT model achieving ImageNet Top-1 accuracy of 79.8%, the best performance by a non-parametric classifier in conjunction with self-SL models.
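As an illustration of why larger $k$ does not hurt NNK, here is a hypothetical NNK prediction loop built from the two sketches above: candidates lying in directions already covered by selected neighbors receive zero weight and simply drop out of the vote, so increasing $k$ only adds candidates rather than diluting the interpolation.

```python
import numpy as np

def nnk_predict(train_feats, train_labels, val_feats, k=50, num_classes=1000):
    """Sketch of NNK classification: KNN candidates, NNK weights, weighted vote.

    Reuses normalized_cosine_kernel and nnk_weights from the earlier sketches.
    """
    K_val = normalized_cosine_kernel(val_feats, train_feats)
    preds = np.empty(len(val_feats), dtype=np.int64)
    for i, row in enumerate(K_val):
        idx = np.argsort(-row)[:k]                                        # candidate neighbors
        K_S = normalized_cosine_kernel(train_feats[idx], train_feats[idx])
        theta = nnk_weights(K_S, row[idx])                                # sparse, non-negative weights
        votes = np.zeros(num_classes)
        np.add.at(votes, train_labels[idx], theta)                        # redundant neighbors contribute 0
        preds[i] = votes.argmax()
    return preds

# Sweeping k, e.g. k in (10, 20, 50, 100), and measuring (nnk_predict(...) == val_labels).mean()
# reproduces the kind of robustness comparison discussed above.
```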
Several machine learning methods leverage the idea of a local neighborhood, using constructions such as KNN and ε-neighborhoods, to better design and evaluate pattern recognition models. However, the choice of parameters such as $k$ is often made experimentally, e.g., via cross-validation, leading to local neighborhoods without a clear interpretation. The NNK formulation allows us to overcome these shortcomings, resulting in a superior framework that is adaptive to the local distribution of samples. As shown in this post and our related works, NNK exhibits robust and superior performance relative to standard local methods. Ultimately, our goal is for the NNK framework to encourage revisiting interpretable and robust neighborhood methods in machine learning; for example, consider model distillation or clustering, where NNK could be used to enforce better local consistency regularization.
Source code for the experiments in this post is available at github.com/shekkizh/VISSL_NNK_Benchmark.