Structure and Content Decomposition
We quantize the latent into a sequence of visual granularities. Each stage is represented by a content-structure pair: the content is the set of unique tokens, and the structure is the arrangement of each token in the latent space.
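As a rough illustration of this decomposition, the sketch below splits one stage's grid of token ids into content (the unique tokens) and structure (an index map recording where each token sits). The function names `decompose` and `recompose`, the nested-list grid, and the first-appearance ordering of the content are assumptions made for this example, not the encoding used in the paper.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]


def decompose(token_map: Grid) -> Tuple[List[int], Grid]:
    """Split a 2D grid of token ids into (content, structure).

    content:   unique token ids, in order of first appearance (an assumption).
    structure: grid of the same shape holding, for every position,
               the index of its token inside `content`.
    """
    content: List[int] = []
    index_of: Dict[int, int] = {}
    structure: Grid = []
    for row in token_map:
        structure_row: List[int] = []
        for token in row:
            if token not in index_of:
                index_of[token] = len(content)
                content.append(token)
            structure_row.append(index_of[token])
        structure.append(structure_row)
    return content, structure


def recompose(content: List[int], structure: Grid) -> Grid:
    """Rebuild the token grid by looking each structure index up in content."""
    return [[content[i] for i in row] for row in structure]
```

For example, `decompose([[7, 7, 3], [3, 9, 7]])` yields the content `[7, 3, 9]` and the structure `[[0, 0, 1], [1, 2, 0]]`, and `recompose` restores the original grid from the pair.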
Structure Embedding
We propose a compact, hierarchical representation to encode the overall structure.
Generation Pipeline
Illustration of our generation pipeline. At each stage, we first generate the structure, then generate the content conditioned on that structure. Both structure generation and content generation are guided by the input text, the current canvas, and the current hierarchical structure. In the figure, "-" denotes the minus operator.
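The control flow described in this caption can be sketched as a simple loop. Everything named here is hypothetical: `gen_structure`, `gen_content`, and `update_canvas` are placeholder callables standing in for the actual models, and how the canvas is updated after each stage is an assumption, since the caption does not specify it.

```python
from typing import Any, Callable, List, Tuple


def generate(
    text: str,
    num_stages: int,
    gen_structure: Callable[[str, Any, List[Any], int], Any],
    gen_content: Callable[[str, Any, List[Any], Any, int], Any],
    update_canvas: Callable[[Any, Any, Any, int], Any],
    init_canvas: Any = None,
) -> Tuple[Any, List[Any]]:
    """Structure-first, stage-by-stage generation loop (illustrative only)."""
    canvas = init_canvas
    hierarchy: List[Any] = []  # structures produced by earlier stages
    for stage in range(num_stages):
        # 1) Structure, guided by the text, the canvas, and the hierarchy so far.
        structure = gen_structure(text, canvas, list(hierarchy), stage)
        # 2) Content, guided by the same inputs plus the new structure.
        content = gen_content(text, canvas, list(hierarchy), structure, stage)
        # 3) Fold the new (content, structure) pair into the canvas (assumed step).
        canvas = update_canvas(canvas, content, structure, stage)
        hierarchy.append(structure)
    return canvas, hierarchy
```

Passing the three steps in as callables keeps the sketch independent of any particular model; only the ordering (structure before content, both conditioned on text, canvas, and hierarchy) is taken from the caption.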