Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE [31] and SLIP [64] suggest that these approaches can be effectively combined, but notably their results use small (<20M examples) pre-training datasets and do not reflect the large-scale regime (>100M samples) commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) [38] and contrastive language-image pre-training (CLIP) [68], provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.
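As a rough sketch of what such a combination typically looks like (the exact formulation and weighting are not stated here; the trade-off weight $\lambda$, temperature $\tau$, and the symbols below are illustrative assumptions), the training objective can be written as a weighted sum of the CLIP contrastive loss and the MAE reconstruction loss:

$$
\mathcal{L} = \mathcal{L}_{\mathrm{CLIP}} + \lambda\,\mathcal{L}_{\mathrm{MAE}},
\qquad
\mathcal{L}_{\mathrm{CLIP}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\big(\mathrm{sim}(x_i, t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(x_i, t_j)/\tau\big)},
\qquad
\mathcal{L}_{\mathrm{MAE}} = \frac{1}{|M|}\sum_{p \in M}\lVert \hat{v}_p - v_p \rVert_2^2,
$$

where $x_i$ and $t_i$ are the embeddings of the $i$-th image and its paired text, $\mathrm{sim}$ is cosine similarity, $\tau$ is a learned temperature (only one direction of CLIP's symmetric contrastive term is shown for brevity), $M$ is the set of masked image patches, and $\hat{v}_p$ is the decoder's reconstruction of the pixels $v_p$ of patch $p$.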