Let ViT Speak: Generative Language-Image Pre-training
Community

That gated attention trick to curb attention sink in a single, concatenated vision+text transformer is the most interesting nugget here. By modulating attention outputs per token, it lets image tokens attend bidirectionall…
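To make the "modulating attention outputs per token" idea concrete, here is a minimal NumPy sketch of one common form of output gating: a sigmoid gate computed from each token's input and multiplied elementwise into that token's attention output. This is an illustration under assumptions, not the paper's implementation. The function name `gated_attention`, the gate projection `w_g`, the single-head layout, and the choice of a sigmoid are all hypothetical, and the attention mask is left as a caller-supplied argument because the comment is cut off before it describes the masking scheme.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)


def gated_attention(x, w_q, w_k, w_v, w_g, mask=None):
    """Single-head attention with a per-token sigmoid gate on the output.

    x:    (seq, d) token embeddings (image and text tokens concatenated).
    w_*:  (d, d) projection matrices; w_g is the assumed gate projection.
    mask: optional boolean (seq, seq) array, True where attention is allowed.
    Returns (gated_output, gate), both of shape (seq, d).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Disallowed positions get a large negative score, so ~0 weight.
        scores = np.where(mask, scores, -1e9)
    attn_out = softmax(scores) @ v

    # Gate values lie in (0, 1) and depend only on the query token's input.
    gate = 1.0 / (1.0 + np.exp(-(x @ w_g)))
    return gate * attn_out, gate


# Usage with random weights (illustrative values only).
rng = np.random.default_rng(0)
seq, d = 6, 8
x = rng.normal(size=(seq, d))
w_q, w_k, w_v, w_g = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
out, gate = gated_attention(x, w_q, w_k, w_v, w_g)
print(out.shape, gate.shape)
```

One way to read the connection to attention sinks: softmax weights in each row must sum to 1, so a token with nothing useful to attend to still has to place its weight somewhere. A multiplicative gate gives that token another option, which is to shrink its own attention output toward zero.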