Unified Multi-Image Composition with Sequence Modeling
Skywork
A comprehensive data curation pipeline specifically tailored for multi-image composition, constructing 215K high-quality examples with multi-stage filtering to ensure semantic coherence, visual compatibility, and composition quality.
A novel unified sequence structure that concatenates noisy latent variables of target output with all reference images, enabling simultaneous training on single-image editing and multi-image composition while maintaining architectural simplicity.
Integration of trajectory mapping and distribution matching into distillation, producing high-fidelity results in just 8 inference steps, a 12.5× speedup over standard synthesis samplers.
We propose a comprehensive data curation pipeline specifically tailored for multi-image composition. Recognizing that data quality outweighs quantity for this delicate task, we construct a high-quality dataset of 215K multi-image composition examples with a focus on challenging human-object interaction (HOI) scenarios. Our pipeline employs multi-stage filtering to ensure semantic coherence, visual compatibility, and composition quality, demonstrating that a carefully curated, moderately sized dataset is sufficient to train a state-of-the-art model.
We introduce a novel sequence modeling paradigm for multi-image composition. Specifically, we concatenate the noisy latent variables of the target output image with the latents of all reference images along the sequence dimension to form a unified long sequence. This formulation enables our model to simultaneously train on single-image editing and multi-image composition tasks while maintaining architectural simplicity. The unified sequence structure naturally accommodates variable numbers of input images and arbitrary output resolutions within a flexible pixel budget.
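The unified sequence described above can be illustrated with a small layout sketch. It only computes where each image's tokens would sit in the concatenated sequence; it is not the model's actual implementation. The 16-pixel token side, the `Segment` type, and the function names are assumptions made for illustration and do not come from the source.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: one token covers a 16x16 pixel area (for example, 8x VAE
# downsampling followed by 2x2 patchification). The source does not state
# these factors.
PIXELS_PER_TOKEN_SIDE = 16


@dataclass
class Segment:
    role: str    # "target" (noisy latents) or "reference" (reference-image latents)
    start: int   # offset of the first token in the unified sequence
    length: int  # number of tokens contributed by this image


def token_count(height: int, width: int, side: int = PIXELS_PER_TOKEN_SIDE) -> int:
    """Number of latent tokens for an image of the given pixel size."""
    if height % side or width % side:
        raise ValueError(f"{height}x{width} is not divisible by {side}")
    return (height // side) * (width // side)


def build_layout(
    target_hw: Tuple[int, int],
    reference_hws: List[Tuple[int, int]],
) -> Tuple[List[Segment], int]:
    """Place the target first, then every reference, along one sequence axis."""
    images = [("target", target_hw)] + [("reference", hw) for hw in reference_hws]
    segments: List[Segment] = []
    offset = 0
    for role, (height, width) in images:
        length = token_count(height, width)
        segments.append(Segment(role, offset, length))
        offset += length
    return segments, offset


if __name__ == "__main__":
    # Single-image editing: one reference image.
    _, edit_len = build_layout((1024, 1024), [(1024, 1024)])
    # Multi-image composition: three references of different sizes.
    layout, compose_len = build_layout(
        (1024, 768), [(512, 512), (768, 512), (512, 768)]
    )
    print(edit_len, compose_len)
    for seg in layout:
        print(seg)
```

Both tasks go through the same function and differ only in the number of reference segments, which is the sense in which one sequence format covers editing and composition. How the pixel budget is enforced is not described in the source, so the sketch leaves it out.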
We compare the performance of Skywork UniPic 3.0 with state-of-the-art models on image editing tasks.
Columns Extract through Overall report ImgEdit-Bench scores; G_SC, G_PQ, and G_O report GEdit-Bench scores. Higher is better (↑) on both benchmarks.

| Model | Extract | Style | BG | Add | Remove | Replace | Adjust | Compose | Action | Overall | G_SC | G_PQ | G_O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-Image-Edit | 3.47 | 4.80 | 4.32 | 4.26 | 3.87 | 4.58 | 4.45 | 3.91 | 4.59 | 4.25 | 8.18 | 7.87 | 7.68 |
| Qwen-Image-Edit-2509 | 3.51 | 4.84 | 4.36 | 4.43 | 4.29 | 4.66 | 4.42 | 3.72 | 4.58 | 4.31 | 8.12 | 8.01 | 7.61 |
| Nano Banana | 3.89 | 4.20 | 4.32 | 4.33 | 4.39 | 4.55 | 4.36 | 3.42 | 4.48 | 4.22 | 7.43 | 8.14 | 7.20 |
| Seedream 4.0 | 2.96 | 4.76 | 4.22 | 4.47 | 4.25 | 4.42 | 4.31 | 3.11 | 4.45 | 4.11 | 8.24 | 7.86 | 7.66 |
| UniPic 2.0 | 1.86 | 4.53 | 4.73 | 4.48 | 4.00 | 4.73 | 4.18 | 3.82 | 4.22 | 4.06 | 7.63 | 7.17 | 7.10 |
| UniPic 3.0 | 3.31 | 4.97 | 4.35 | 4.45 | 4.46 | 4.71 | 4.44 | 3.77 | 4.69 | 4.35 | 8.12 | 7.79 | 7.55 |
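As a sanity check on the table, the reported ImgEdit-Bench Overall values are numerically consistent with an unweighted mean of the nine per-category scores. The source does not state how Overall is defined, so this is an observation about the numbers rather than a confirmed definition. The sketch below copies the rows from the table; the names `IMGEDIT_ROWS`, `category_mean`, and `check_overall` are illustrative.

```python
# Rows copied from the ImgEdit-Bench columns of the table:
# nine per-category scores (Extract ... Action), then the reported Overall.
IMGEDIT_ROWS = {
    "Qwen-Image-Edit": ([3.47, 4.80, 4.32, 4.26, 3.87, 4.58, 4.45, 3.91, 4.59], 4.25),
    "Qwen-Image-Edit-2509": ([3.51, 4.84, 4.36, 4.43, 4.29, 4.66, 4.42, 3.72, 4.58], 4.31),
    "Nano Banana": ([3.89, 4.20, 4.32, 4.33, 4.39, 4.55, 4.36, 3.42, 4.48], 4.22),
    "Seedream 4.0": ([2.96, 4.76, 4.22, 4.47, 4.25, 4.42, 4.31, 3.11, 4.45], 4.11),
    "UniPic 2.0": ([1.86, 4.53, 4.73, 4.48, 4.00, 4.73, 4.18, 3.82, 4.22], 4.06),
    "UniPic 3.0": ([3.31, 4.97, 4.35, 4.45, 4.46, 4.71, 4.44, 3.77, 4.69], 4.35),
}


def category_mean(scores):
    """Unweighted mean of the per-category scores."""
    return sum(scores) / len(scores)


def check_overall(rows, tolerance=0.005):
    """Map each model to (mean, reported, matches within rounding tolerance)."""
    results = {}
    for model, (scores, reported) in rows.items():
        mean = category_mean(scores)
        # Small epsilon absorbs floating-point error at the rounding boundary.
        results[model] = (mean, reported, abs(mean - reported) <= tolerance + 1e-9)
    return results


if __name__ == "__main__":
    for model, (mean, reported, ok) in check_overall(IMGEDIT_ROWS).items():
        print(f"{model:22s} mean={mean:.4f} reported={reported:.2f} match={ok}")
```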
*Note: Images shown here are in JPEG format with compression for display purposes. For detailed results, please refer to the paper and model outputs.*
We are grateful to the community for their open exploration and contributions to the field of unified multimodal models.