Transformer attention is typically multi-head to:
Answer options
A. Reduce model parameters
B. Capture different relations using different projection subspaces
C. Remove positional info
D. Enforce Gaussian priors
Correct answer: Capture different relations using different projection subspaces
Explanation
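The correct answer is B: capture different relations using different projection subspaces. Each head applies its own learned query, key, and value projections, typically into a subspace of dimension d_model / h for h heads, and computes its own attention weights there. Different heads can therefore attend to different kinds of relations between tokens, and their outputs are concatenated and passed through a final linear projection. The other options do not hold: with the usual per-head dimension of d_model / h, the parameter count is about the same as for single-head attention at full dimension; positional information is supplied separately (for example by positional encodings) and is not removed by using multiple heads; and multi-head attention imposes no Gaussian prior.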
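To make the idea of separate projection subspaces concrete, here is a minimal NumPy sketch of multi-head self-attention. It is an illustration under stated assumptions, not a reference implementation: the function names, the tiny sizes (5 tokens, d_model = 8, 2 heads), and the random matrices standing in for learned weights are all hypothetical, and masking, biases, and batching are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Each head sees its own d_head-sized slice of the projected Q, K, V.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                               # (heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


# Assumption: random matrices stand in for learned projection weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5): one attention matrix per head
```

In this sketch each head gets its own (5, 5) attention matrix, which is what lets different heads weight token pairs differently. The four weight matrices hold 4 * d_model^2 values regardless of num_heads, which is why option A does not describe the purpose of multiple heads.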