Why do transformers have outliers?
Modern Machine Learning models are trained with a large number of parameters, often too large, and this overparameterization is very useful during training as it creates a vast search space for the model to encode rich representations from data...