Oftentimes, machine learning (ML) model developers begin their design with a generic backbone model that is trained at scale and whose capabilities are transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, and GPT-3 (sometimes also referred to as "foundation models"), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot, or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for many downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large-scale models.
In computer vision, pioneering work has shown the effectiveness of single-encoder models pre-trained for image classification to capture generic visual representations that are effective for other downstream tasks. More recently, contrastive dual-encoder (CLIP, ALIGN, Florence) and generative encoder-decoder (SimVLM) approaches trained on web-scale noisy image-text pairs have been explored. Dual-encoder models exhibit remarkable zero-shot image classification capabilities but are less effective for joint vision-language understanding. On the other hand, encoder-decoder methods are good at image captioning and visual question answering but cannot perform retrieval-style tasks.
In "CoCa: Contrastive Captioners are Image-Text Foundation Models", we present a unified vision backbone model called Contrastive Captioner (CoCa). Our model is a novel encoder-decoder approach that simultaneously produces aligned unimodal image and text embeddings and joint multimodal representations, making it flexible enough to be directly applicable to all types of downstream tasks. Specifically, CoCa achieves state-of-the-art results on a series of vision and vision-language tasks spanning visual recognition, cross-modal alignment, and multimodal understanding. Furthermore, it learns highly generic representations, so that it can perform as well as or better than fully fine-tuned models with zero-shot learning or frozen encoders.
Overview of Contrastive Captioners (CoCa) compared to single-encoder, dual-encoder and encoder-decoder models.
Method
We propose CoCa, a unified training framework that combines contrastive loss and captioning loss on a single training data stream consisting of image annotations and noisy image-text pairs, effectively merging single-encoder, dual-encoder and encoder-decoder paradigms.
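As a rough illustration of how such a combined objective could look in code, the sketch below computes a CLIP-style in-batch contrastive loss and an autoregressive captioning loss on one batch and sums them. The function name, the temperature, and the weights `w_con` / `w_cap` are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def combined_loss(image_emb, text_emb, caption_logits, caption_tokens,
                  temperature=0.07, w_con=1.0, w_cap=2.0):
    """Hypothetical sketch of a contrastive + captioning objective.

    image_emb, text_emb:  [batch, dim] L2-normalized unimodal embeddings.
    caption_logits:       [batch, seq_len, vocab] multimodal decoder outputs.
    caption_tokens:       [batch, seq_len] target token ids of the paired text.
    """
    # Contrastive part: symmetric cross-entropy over in-batch negatives,
    # as in dual-encoder models.
    sim = image_emb @ text_emb.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_con = 0.5 * (F.cross_entropy(sim, targets) +
                      F.cross_entropy(sim.t(), targets))

    # Captioning part: standard next-token cross-entropy from the
    # multimodal decoder.
    loss_cap = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_tokens.reshape(-1))

    return w_con * loss_con + w_cap * loss_cap
```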
To this end, we present a novel encoder-decoder architecture where the encoder is a vision transformer (ViT), and the text decoder transformer is decoupled into two parts, a unimodal text decoder and a multimodal text decoder. We skip cross-attention in unimodal decoder layers to encode text-only representations for the contrastive loss, and cascade multimodal decoder layers with cross-attention to image encoder outputs to learn multimodal image-text representations for the captioning loss. This design maximizes the model's flexibility and universality in accommodating a wide spectrum of tasks, and at the same time, it can be efficiently trained with a single forward and backward propagation for both training objectives, resulting in minimal computational overhead. Thus, the model can be trained end-to-end from scratch with training costs comparable to a naïve encoder-decoder model.
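The simplified sketch below (our own mock-up, not the released implementation) shows how such a decoupled decoder can produce both the unimodal embeddings and the captioning logits in a single forward pass. The layer counts, pooling choices, stand-in image encoder, and omitted causal masking are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class CoCaSketch(nn.Module):
    """Simplified mock-up of a decoupled text decoder on top of an image encoder."""

    def __init__(self, vocab_size=64000, dim=768, n_uni=6, n_multi=6, heads=12):
        super().__init__()
        self.image_encoder = nn.Identity()  # stand-in for a ViT returning [B, patches, dim]
        self.token_emb = nn.Embedding(vocab_size, dim)
        # Unimodal decoder layers: self-attention only, no cross-attention.
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(n_uni)])
        # Multimodal decoder layers: cross-attend to image encoder outputs.
        self.multimodal = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, batch_first=True)
             for _ in range(n_multi)])
        self.to_logits = nn.Linear(dim, vocab_size)

    def forward(self, images, text_tokens):
        # Causal masking is omitted here to keep the sketch short.
        img_feats = self.image_encoder(images)     # [B, patches, dim]
        image_emb = img_feats.mean(dim=1)          # pooled image embedding (contrastive)

        txt = self.token_emb(text_tokens)
        for layer in self.unimodal:                # text-only representations
            txt = layer(txt)
        text_emb = txt[:, -1]                      # e.g., last-token embedding (contrastive)

        multi = txt
        for layer in self.multimodal:              # fuse with image features
            multi = layer(multi, img_feats)
        caption_logits = self.to_logits(multi)     # inputs to the captioning loss

        return image_emb, text_emb, caption_logits
```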
Illustration of forward propagation used by CoCa for both contrastive and captioning losses.
Benchmark Results
The CoCa model can be directly fine-tuned on many tasks with minimal adaptation. By doing so, our model achieves a series of state-of-the-art results on popular vision and multimodal benchmarks, including (1) visual recognition: ImageNet, Kinetics-400/600/700, and MiT; (2) cross-modal alignment: MS-COCO, Flickr30K, and MSR-VTT; and (3) multimodal understanding: VQA, SNLI-VE, NLVR2, and NoCaps.
Comparison of CoCa with other image-text backbone models (without task-specific customization) and multiple state-of-the-art task-specialized models.
It is noteworthy that CoCa attains these results as a single model adapted to all tasks while often being lighter than prior top-performing specialized models. For example, CoCa obtains 91.0% ImageNet top-1 accuracy while using less than half the parameters of prior state-of-the-art models. In addition, CoCa also achieves strong generative capability, producing high-quality image captions.
Image classification scaling performance comparing fine-tuned ImageNet top-1 accuracy versus model size.
Text captions generated by CoCa with NoCaps images as input.
Zero-Shot Performance
Besides achieving excellent performance with fine-tuning, CoCa also outperforms previous state-of-the-art models on zero-shot learning tasks, including image classification and cross-modal retrieval. CoCa obtains 86.3% zero-shot accuracy on ImageNet while also robustly outperforming prior models on challenging variant benchmarks, such as ImageNet-A, ImageNet-R, ImageNet-V2, and ImageNet-Sketch. As shown in the figure below, CoCa obtains better zero-shot accuracy with smaller model sizes compared to prior methods.
Image classification scaling performance comparing zero-shot ImageNet top-1 accuracy versus model size.
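To make the zero-shot classification setup concrete, here is a minimal sketch that scores an image against class-name prompts using the aligned unimodal embeddings. The `encode_image` / `encode_text` methods, the tokenizer interface, and the prompt template are assumptions for illustration, not CoCa's released API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, image, class_names):
    """Pick the class whose text prompt best matches the image embedding."""
    prompts = [f"a photo of a {name}." for name in class_names]
    text_emb = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)    # [C, dim]
    image_emb = F.normalize(model.encode_image(image.unsqueeze(0)), dim=-1)  # [1, dim]

    # Cosine similarity between the image and every class prompt.
    probs = (image_emb @ text_emb.t()).softmax(dim=-1)
    return class_names[probs.argmax().item()]
```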
Frozen Encoder Representation
One particularly exciting observation is that CoCa achieves results comparable to the best fine-tuned models using only a frozen visual encoder, in which features extracted after model training are used to train a classifier, rather than the more computationally intensive effort of fine-tuning the model. On ImageNet, a frozen CoCa encoder with a learned classification head obtains 90.6% top-1 accuracy, which is better than the fully fine-tuned performance of existing backbone models (90.1%). We also find this setup to work extremely well for video recognition. We feed sampled video frames into the frozen CoCa image encoder individually, and fuse the output features by attentional pooling before applying a learned classifier. This simple approach using a frozen CoCa image encoder achieves video action recognition top-1 accuracy of 88.0% on the Kinetics-400 dataset and demonstrates that CoCa learns a highly generic visual representation with the combined training objectives.
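Below is a minimal sketch of that frozen-encoder video setup, assuming single-query attentional pooling over per-frame features and a linear classifier as the only trained components; the shapes, module choices, and encoder interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrozenEncoderVideoClassifier(nn.Module):
    """Frozen image encoder + attentional pooling over frames + linear head."""

    def __init__(self, frozen_encoder, dim=768, heads=12, num_classes=400):
        super().__init__()
        self.encoder = frozen_encoder.eval()
        for p in self.encoder.parameters():        # keep the encoder frozen
            p.requires_grad_(False)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned pooling query
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, frames):                     # frames: [B, T, C, H, W]
        b, t = frames.shape[:2]
        with torch.no_grad():                      # encode each sampled frame independently
            feats = self.encoder(frames.flatten(0, 1))   # assumed to return [B*T, dim]
        feats = feats.view(b, t, -1)               # [B, T, dim]
        query = self.query.expand(b, -1, -1)       # one pooled token per clip
        pooled, _ = self.pool(query, feats, feats) # attentional pooling over frames
        return self.classifier(pooled.squeeze(1))  # [B, num_classes]
```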
Comparison of the frozen CoCa visual encoder with (multiple) best-performing fine-tuned models.
Conclusion
We present Contrastive Captioner (CoCa), a novel pre-training paradigm for image-text backbone models. This simple method is widely applicable to many types of vision and vision-language downstream tasks, and obtains state-of-the-art performance with minimal or even no task-specific adaptation.
Acknowledgements
We would like to thank our co-authors Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu who have been involved in all aspects of the project. We also would like to thank Yi-Ting Chen, Kaifeng Chen, Ye Xia, Zhen Li, Chao Jia, Yinfei Yang, Zhengdong Zhang, Wei Han, Yuan Cao, Tao Zhu, Futang Peng, Soham Ghosh, Zihang Dai, Xin Li, Anelia Angelova, Jason Baldridge, Izhak Shafran, Shengyang Dai, Abhijit Ogale, Zhifeng Chen, Claire Cui, Paul Natsev, and Tom Duerig for helpful discussions, Andrew Dai for help with contrastive models, Christopher Fifty and Bowen Zhang for help with video models, Yuanzhong Xu for help with model scaling, Lucas Beyer for help with data preparation, Andy Zeng for help with the MSR-VTT evaluation, Hieu Pham and Simon Kornblith for help with zero-shot evaluations, Erica Moreira and Victor Gomes for help with resource coordination, Liangliang Cao for proofreading, Tom Small for creating the animations used in this blogpost, and others in the Google Brain team for support throughout this project.