Next Visual Granularity Generation

Yikai Wang1, Zhouxia Wang1, Zhonghua Wu2, Qingyi Tao2, Kang Liao1, Chen Change Loy1
1: S-Lab, Nanyang Technological University; 2: SenseTime Research
Construction of the visual granularity sequence for a 256×256 image and next visual granularity generation in the 16×16 latent space. Top to bottom: number of unique tokens, structure map, generated image.
Our model can generate diverse and high-fidelity images.
The generated images align well with their generated binary structure maps.
We can reuse structures from reference images (wallaby, flamingo) to generate new ones (rabbits, heron).

Abstract

  • We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity (a toy construction is sketched after this list).
  • Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels.
  • We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 → 3.03, 2.57 → 2.44, 2.09 → 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
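
The sketch below is a toy illustration of the decomposition described above: a continuous latent is re-quantized with an increasing number of unique tokens per stage, so every stage keeps the same spatial resolution while its granularity grows. The k-means quantizer, the stage sizes (1, 2, 4, ..., 256), and all tensor shapes are our own illustrative assumptions, not the released implementation.

```python
# Hedged sketch: build a visual granularity sequence by re-quantizing a latent
# with more and more unique tokens per stage. K-means is a stand-in quantizer;
# the stage sizes and shapes are illustrative, not the paper's actual setup.
import numpy as np
from sklearn.cluster import KMeans

def granularity_sequence(latent, stages=(1, 2, 4, 16, 64, 256)):
    """latent: (H, W, C) continuous latent; returns one (structure, content)
    pair per stage, all at the same H x W resolution."""
    h, w, c = latent.shape
    flat = latent.reshape(-1, c)
    sequence = []
    for k in stages:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
        structure = km.labels_.reshape(h, w)   # where each token is placed
        content = km.cluster_centers_          # the k unique tokens
        sequence.append((structure, content))
    return sequence

latent = np.random.randn(16, 16, 8).astype(np.float32)
for structure, content in granularity_sequence(latent):
    print(content.shape[0], "unique tokens, structure map", structure.shape)
```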

Results

Generated 256×256 samples of our NVG-d24 trained on ImageNet.
Top: We show several representative examples to illustrate the iterative generation process.
Middle: The generated binary structure maps align well with the final images.
Bottom: Our NVG-d24 model can generate diverse and high-quality images.

Method

Structure and Content Decomposition

We quantize the latent into a visual granularity sequence. Each stage is represented by a content and structure pair: the content is the set of unique tokens, and the structure is the arrangement of those tokens in the latent space.
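
A minimal sketch of this split, assuming each stage is stored as a map of codebook indices; the codebook size and variable names are illustrative, not the released code.

```python
# Hedged sketch of the content/structure split: np.unique recovers the unique
# tokens (content) and their per-position arrangement (structure), and
# indexing content by structure reconstructs the original stage.
import numpy as np

def decompose(index_map):
    """index_map: (H, W) codebook indices for one granularity stage."""
    content, structure = np.unique(index_map, return_inverse=True)
    structure = structure.reshape(index_map.shape)   # arrangement of tokens
    return content, structure                        # (K,), (H, W)

def recompose(content, structure):
    """Invert the split: place each unique token back at its positions."""
    return content[structure]

index_map = np.random.randint(0, 4096, size=(16, 16))
content, structure = decompose(index_map)
assert np.array_equal(recompose(content, structure), index_map)
```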

Structure Embedding

We propose a compact, hierarchical representation to encode the overall structure.
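
The exact embedding design is not detailed on this page; the sketch below shows one plausible reading, in which the structure map is embedded and average-pooled over a pyramid of scales to form a short multi-scale code. The layer sizes, the pooling scheme, and the module itself are our assumptions.

```python
# Hypothetical hierarchical structure embedding (not the released design):
# embed the token arrangement, then pool it at several scales so both the
# coarse layout and the fine arrangement end up in one compact sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalStructureEmbedding(nn.Module):
    def __init__(self, num_tokens=4096, dim=256, scales=(1, 2, 4, 8, 16)):
        super().__init__()
        self.scales = scales
        self.token_embed = nn.Embedding(num_tokens, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, structure):                 # structure: (B, H, W) indices
        x = self.token_embed(structure)           # (B, H, W, D)
        x = x.permute(0, 3, 1, 2)                 # (B, D, H, W)
        levels = []
        for s in self.scales:
            pooled = F.adaptive_avg_pool2d(x, s)  # (B, D, s, s)
            levels.append(pooled.flatten(2).transpose(1, 2))  # (B, s*s, D)
        return self.proj(torch.cat(levels, dim=1))  # compact multi-scale code

emb = HierarchicalStructureEmbedding()
out = emb(torch.randint(0, 4096, (2, 16, 16)))
print(out.shape)   # (2, 1+4+16+64+256, 256)
```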

Generation Pipeline

Illustration of our generation pipeline. At each stage, we first generate the structure, then generate the content based on that structure. Both structure and content generation are guided by the input text, the current canvas, and the current hierarchical structure. "−" denotes the minus (subtraction) operator.
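
To make the stage-wise loop concrete, here is a control-flow-only sketch. The callables, their signatures, and the way tokens are written back onto the canvas are hypothetical placeholders; only the ordering (structure first, then content, both conditioned on the input, the canvas, and the hierarchical structure) follows the description above.

```python
# Control-flow sketch of the stage-wise generation loop; module interfaces are
# hypothetical stand-ins, not the released API.
import torch

def generate(structure_model, content_model, decoder, cond,
             num_stages=6, latent_hw=16, dim=8):
    canvas = torch.zeros(1, latent_hw, latent_hw, dim)   # start from an empty canvas
    hier = []                                            # hierarchical structure built so far
    for _ in range(num_stages):
        structure = structure_model(cond, canvas, hier)          # (1, H, W) token arrangement
        content = content_model(cond, canvas, hier, structure)   # (K, dim) unique tokens
        hier.append(structure)
        canvas = content[structure]                              # place tokens back on the canvas
    return decoder(canvas)

# Toy stand-ins that only exercise the control flow.
toy_structure = lambda cond, canvas, hier: torch.randint(0, 4, (1, 16, 16))
toy_content = lambda cond, canvas, hier, structure: torch.randn(4, 8)
toy_decoder = lambda canvas: canvas
out = generate(toy_structure, toy_content, toy_decoder, cond="a wallaby in the grass")
print(out.shape)   # torch.Size([1, 16, 16, 8])
```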