In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished by pre-training language models on extensive unlabeled text corpora.
This pre-training formulation does not make assumptions about the input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results by pre-quantizing images into discrete integer codes (represented as natural numbers) and modeling them autoregressively (i.e., predicting sequences one token at a time). In these approaches, a convolutional neural network (CNN) is trained to encode an image into discrete tokens, each corresponding to a small patch of the image. A second-stage CNN or Transformer is then trained to model the distribution of the encoded latent variables. After training, the second stage can also be used to autoregressively generate an image. But while such models have achieved strong performance for image generation, few studies have evaluated the learned representation for downstream discriminative tasks (such as image classification).
In “Vector-Quantized Image Modeling with Improved VQGAN”, we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks. In the first stage, an image quantization model, called VQGAN, encodes an image into lower-dimensional discrete latent codes. Then a Transformer model is trained to model the quantized latent codes of an image. This approach, which we call Vector-quantized Image Modeling (VIM), can be used for both image generation and unsupervised image representation learning. We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding.
Vector-Quantized Image Modeling with ViT-VQGAN
One recent, commonly used model that quantizes images into integer tokens is the Vector-quantized Variational AutoEncoder (VQVAE), a CNN-based auto-encoder whose latent space is a matrix of discrete learnable variables, trained end-to-end. VQGAN is an improved version of this model that introduces an adversarial loss to promote high-quality reconstruction. VQGAN uses transformer-like elements in the form of non-local attention blocks, which allow it to capture distant interactions using fewer layers.
In our work, we propose taking this approach one step further by replacing both the CNN encoder and decoder with ViT. In addition, we introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens. Specifically, we reduced the encoder output from a 768-dimension vector to a 32- or 8-dimension vector per code, which we found encourages the decoder to better utilize the token outputs, improving model capacity and efficiency.
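The projected lookup described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the ViT-VQGAN implementation: the function names, the random weights, the codebook size of 16, and the use of squared Euclidean distance for the lookup are all illustrative choices. Only the 768-to-32 dimension reduction comes from the text.

```python
import random

def project(vec, weight):
    """Linear projection: each row of `weight` has the same length as `vec`."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def nearest_code(latent, codebook):
    """Index of the codebook entry closest to `latent` (squared L2 distance)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(latent, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def quantize(encoder_outputs, weight, codebook):
    """Map each encoder output vector to one integer token."""
    return [nearest_code(project(v, weight), codebook) for v in encoder_outputs]

rng = random.Random(0)
ENC_DIM, CODE_DIM = 768, 32   # dimensions quoted in the text
NUM_CODES = 16                # hypothetical codebook size, kept small for the demo

# Random stand-ins for the learned projection, codebook, and encoder outputs.
weight = [[rng.gauss(0, 0.05) for _ in range(ENC_DIM)] for _ in range(CODE_DIM)]
codebook = [[rng.gauss(0, 1) for _ in range(CODE_DIM)] for _ in range(NUM_CODES)]
encoder_outputs = [[rng.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(4)]

tokens = quantize(encoder_outputs, weight, codebook)  # one integer per patch
```

Because the nearest-neighbour search runs in the 32-dimensional space rather than the 768-dimensional one, each lookup compares far fewer coordinates per codebook entry.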
With our trained ViT-VQGAN, images are encoded into discrete tokens represented by integers, each of which encompasses an 8×8 patch of the input image. Using these tokens, we train a decoder-only Transformer to predict a sequence of image tokens autoregressively. This two-stage model, VIM, is able to perform unconditioned image generation by simply sampling token-by-token from the output softmax distribution of the Transformer model.
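As a rough sketch of what token-by-token sampling looks like, the snippet below draws each token from a softmax over the model's logits and feeds it back in as context. The `toy_logits` function is a hypothetical stand-in for the trained Transformer, and the vocabulary size of 8 is chosen only to keep the demo small. The helper `tokens_per_image` assumes a square image divided into non-overlapping square patches.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_sequence(next_token_logits, seq_len, rng, prefix=()):
    """Sample `seq_len` tokens one at a time, conditioning on all earlier tokens."""
    tokens = list(prefix)
    for _ in range(seq_len):
        probs = softmax(next_token_logits(tokens))
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens[len(prefix):]

def tokens_per_image(image_size, patch_size):
    """Number of tokens when each token covers one patch_size x patch_size patch."""
    side = image_size // patch_size
    return side * side

def toy_logits(tokens, vocab_size=8):
    # Hypothetical stand-in for the Transformer: favours repeating the last token.
    last = tokens[-1] if tokens else 0
    return [2.0 if i == last else 0.0 for i in range(vocab_size)]

sample = sample_sequence(toy_logits, seq_len=16, rng=random.Random(0))
grid = tokens_per_image(256, 8)  # 32 x 32 = 1024 tokens for a 256x256 image
```

At the 256×256 input resolution used in the experiments, 8×8 patches give a sequence of 1024 tokens per image, which is the length the second-stage Transformer has to model.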
VIM is also capable of performing class-conditioned generation, such as synthesizing a specific image of a given class (e.g., a dog or a cat). We extend the unconditional generation to class-conditioned generation by prepending a class-ID token before the image tokens during both training and sampling.
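One way to picture the class-ID prefix is sketched below. Placing class tokens after the image codebook in a shared vocabulary (`codebook_size + class_id`) is an assumption made for this illustration, as are the function names and the tiny codebook size of 16; the text only states that a class-ID token is prepended.

```python
def class_token(class_id, codebook_size):
    """Hypothetical vocabulary layout: class tokens follow the image codes."""
    return codebook_size + class_id

def build_training_sequence(class_id, image_tokens, codebook_size):
    """Prepend the class-ID token to the image token sequence."""
    return [class_token(class_id, codebook_size)] + list(image_tokens)

def next_token_pairs(sequence):
    """(context, target) pairs for autoregressive training.

    The class token only ever appears as context, never as a target.
    """
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

seq = build_training_sequence(class_id=3, image_tokens=[5, 1, 7], codebook_size=16)
pairs = next_token_pairs(seq)
```

At sampling time the same prefix would be supplied as the initial context, so every generated image token is conditioned on the chosen class.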
|Uncurated set of dog samples from class-conditioned image generation trained on ImageNet. Conditioned classes: Irish terrier, Norfolk terrier, Norwich terrier, Yorkshire terrier, wire-haired fox terrier, Lakeland terrier.|
To test the image understanding capabilities of VIM, we also fine-tune a linear projection layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Similar to ImageGPT, we take a layer output at a specific block, average over the sequence of token features (frozen), and insert a softmax layer (learnable) projecting the averaged features to class logits. This allows us to capture intermediate features that provide more information useful for representation learning.
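The probing setup above can be sketched as follows: token features from a frozen block are averaged, and only a linear layer with a softmax on top is updated. This is a minimal sketch, assuming plain SGD on a cross-entropy loss; the function names, learning rate, and toy two-class data are illustrative and not taken from the paper.

```python
import math

def mean_pool(token_features):
    """Average a sequence of per-token feature vectors into one vector."""
    n = len(token_features)
    return [sum(col) / n for col in zip(*token_features)]

def linear_logits(features, weights, bias):
    """One logit per class: weights[c] . features + bias[c]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probe_step(token_features, label, weights, bias, lr=0.1):
    """One SGD step on the linear layer only; the features stay frozen.

    Returns the cross-entropy loss measured before the update.
    """
    pooled = mean_pool(token_features)
    probs = softmax(linear_logits(pooled, weights, bias))
    loss = -math.log(probs[label])
    for c in range(len(weights)):
        grad = probs[c] - (1.0 if c == label else 0.0)
        for j in range(len(pooled)):
            weights[c][j] -= lr * grad * pooled[j]
        bias[c] -= lr * grad
    return loss

# Toy example: two tokens with 2-dim features, two classes.
features = [[1.0, 0.0], [1.0, 0.0]]
weights = [[0.0, 0.0], [0.0, 0.0]]
bias = [0.0, 0.0]
losses = [probe_step(features, 0, weights, bias) for _ in range(5)]
```

Since the pooled features never change, the probe measures how linearly separable the pretrained representation already is, rather than how well the whole model can be adapted.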
We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 CloudTPUv4 cores. All models are trained with an input image resolution of 256×256. On top of the pre-learned ViT-VQGAN image quantizer, we train Transformer models for unconditional and class-conditioned image synthesis and compare with previous work.
We measure the performance of our proposed methods for class-conditioned image synthesis and unsupervised representation learning on the widely used ImageNet benchmark. In the table below we demonstrate the class-conditioned image synthesis performance measured by the Fréchet Inception Distance (FID). Compared to prior work, VIM improves the FID to 3.07 (lower is better), a relative improvement of 58.6% over the VQGAN model (FID 7.35). VIM also improves the capacity for image understanding, as indicated by the Inception Score (IS), which goes from 188.6 to 227.4, a 20.6% improvement relative to VQGAN.
|Fréchet Inception Distance (FID) comparison between different models for class-conditional image synthesis and Inception Score (IS) for image understanding, both on ImageNet with resolution 256×256. The acceptance rate shows results filtered by a ResNet-101 classification model, similar to the process in VQGAN.|
After training a generative model, we test the learned image representations by fine-tuning a linear layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Our model outperforms previous generative models on the image understanding task, improving classification accuracy through linear probing (i.e., training a single linear classification layer while keeping the rest of the model frozen) from 60.3% (iGPT-L) to 73.2%. These results showcase VIM’s strong generation results as well as its image representation learning abilities.
We propose Vector-quantized Image Modeling (VIM), which pretrains a Transformer to predict image tokens autoregressively, where discrete image tokens are produced from improved ViT-VQGAN image quantizers. With our proposed improvements on image quantization, we demonstrate superior results on both image generation and understanding. We hope our results can inspire future work towards more unified approaches for image generation and understanding.
We would like to thank Xin Li, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu for the preparation of the VIM paper. We thank Wei Han, Yuan Cao, Jiquan Ngiam, Vijay Vasudevan, Zhifeng Chen, and Claire Cui for helpful discussions and feedback, and others on the Google Research and Brain Team for support throughout this project.