Question:
Transformer - Multi-Head Attention
Author: Christian N

Answer:
1) Concatenate all the attention heads.
2) Multiply by a weight matrix W^O that was trained jointly with the model.
3) The result is the Z matrix, which captures information from all the attention heads. We can send this forward to the FFNN (see the sketch below).
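A minimal NumPy sketch of these three steps. The head outputs, dimensions, and the matrix `W_O` here are illustrative stand-ins (random numbers, not trained weights):

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
seq_len, d_head, num_heads = 4, 8, 2
d_model = num_heads * d_head  # width after concatenating the heads

rng = np.random.default_rng(0)

# Stand-ins for the per-head attention outputs Z_i (each seq_len x d_head).
head_outputs = [rng.standard_normal((seq_len, d_head)) for _ in range(num_heads)]

# 1) Concatenate all the attention heads along the feature axis.
Z_concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)

# 2) Multiply by W^O; in a real model this matrix is learned jointly
#    with the rest of the network (random here as a placeholder).
W_O = rng.standard_normal((d_model, d_model))
Z = Z_concat @ W_O  # (seq_len, d_model)

# 3) Z mixes information from all heads and is what gets sent to the FFNN.
print(Z.shape)  # (4, 16)
```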