Skywork UniPic 3.0

Unified Multi-Image Composition with Sequence Modeling

Skywork

[Figure: Skywork UniPic 3.0 teaser]

Key Capabilities

High-Quality Data Curation

A comprehensive data curation pipeline specifically tailored for multi-image composition, constructing 215K high-quality examples with multi-stage filtering to ensure semantic coherence, visual compatibility, and composition quality.

Sequence Modeling Paradigm

A novel unified sequence structure that concatenates the noisy latents of the target output with the latents of all reference images, enabling simultaneous training on single-image editing and multi-image composition while maintaining architectural simplicity.

Fast Inference

Pioneers the integration of trajectory mapping and distribution matching into distillation, producing high-fidelity results in just 8 inference steps, a 12.5× speedup over standard synthesis samplers.

Abstract

The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community's strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the category most sought after by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports 1 to 6 input images at arbitrary resolutions, as well as arbitrary output resolutions (within a total pixel budget of 1024×1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5× speedup over standard synthesis sampling. Skywork UniPic 3.0 achieves state-of-the-art performance on single-image editing benchmarks and surpasses both Nano-Banana and Seedream 4.0 on the multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm.

Data Pipeline

We propose a comprehensive data curation pipeline specifically tailored for multi-image composition. Recognizing that data quality outweighs quantity for this delicate task, we construct a high-quality dataset of 215K multi-image composition examples with a focus on challenging HOI scenarios. Our pipeline employs multi-stage filtering to ensure semantic coherence, visual compatibility, and composition quality, demonstrating that a carefully curated, moderately sized dataset is sufficient to train a state-of-the-art model.

[Figure: Data pipeline]
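
To make the idea of multi-stage filtering concrete, here is a minimal sketch of a staged filter over candidate samples. The three criteria (semantic coherence, visual compatibility, composition quality) come from the description above; the `Candidate` fields, the score ranges, and every threshold are hypothetical placeholders, not values or scoring models from the report.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    sample_id: str
    semantic_coherence: float     # hypothetical score in [0, 1]
    visual_compatibility: float   # hypothetical score in [0, 1]
    composition_quality: float    # hypothetical score in [0, 1]


Stage = Tuple[str, Callable[[Candidate], bool]]

# Thresholds are illustrative placeholders, not values from the report.
STAGES: List[Stage] = [
    ("semantic coherence", lambda c: c.semantic_coherence >= 0.8),
    ("visual compatibility", lambda c: c.visual_compatibility >= 0.7),
    ("composition quality", lambda c: c.composition_quality >= 0.75),
]


def run_filters(candidates, stages=STAGES):
    """Apply each stage in order; return survivors and per-stage survivor counts."""
    kept = list(candidates)
    counts = []
    for name, keep in stages:
        kept = [c for c in kept if keep(c)]
        counts.append((name, len(kept)))
    return kept, counts
```

Running the stages in sequence means a sample is only scored by later (possibly more expensive) checks if it survived the earlier ones, and the per-stage counts show where most candidates are rejected.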

Model Overview

We introduce a novel sequence modeling paradigm for multi-image composition. Specifically, we concatenate the noisy latent variables of the target output image with the latents of all reference images along the sequence dimension to form a unified long sequence. This formulation enables our model to simultaneously train on single-image editing and multi-image composition tasks while maintaining architectural simplicity. The unified sequence structure naturally accommodates variable numbers of input images and arbitrary output resolutions within a flexible pixel budget.
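
The sequence layout described above can be sketched without any tensor library by tracking how many tokens each image contributes and where its segment starts. This is only an illustration: the 16×16 pixels-per-token patch size and the `Segment`, `num_tokens`, and `build_layout` names are assumptions, not details taken from the report; only the target-plus-references concatenation and the 1 to 6 input range come from the text.

```python
from dataclasses import dataclass

# Assumption (not from the report): each latent token covers a 16x16 pixel
# patch, e.g. an 8x VAE downsampling followed by 2x2 patchification.
PIXELS_PER_TOKEN_SIDE = 16


@dataclass
class Segment:
    role: str    # "target" (noisy latents) or "reference" (reference latents)
    start: int   # index of the first token in the unified sequence
    length: int  # number of tokens contributed by this image


def num_tokens(width: int, height: int) -> int:
    """Tokens produced by one image under the assumed patch size."""
    return (width // PIXELS_PER_TOKEN_SIDE) * (height // PIXELS_PER_TOKEN_SIDE)


def build_layout(target_size, reference_sizes):
    """Place the target first, then every reference, along one sequence axis."""
    if not 1 <= len(reference_sizes) <= 6:
        raise ValueError("expected between 1 and 6 reference images")
    images = [("target", target_size)] + [("reference", s) for s in reference_sizes]
    segments, offset = [], 0
    for role, (width, height) in images:
        length = num_tokens(width, height)
        segments.append(Segment(role, offset, length))
        offset += length
    return segments


layout = build_layout((1024, 1024), [(512, 512), (768, 512)])
total = sum(s.length for s in layout)  # 4096 + 1024 + 1536 = 6656 tokens
```

With a single reference image the same function describes single-image editing, which is the sense in which one sequence format covers both tasks.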

[Figure: Model pipeline]
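
As a rough illustration of the pixel budget, the sketch below scales a requested output size so that its area does not exceed 1024×1024 while approximately preserving the aspect ratio. The budget value comes from the abstract; the `fit_to_budget` helper, the snapping of each side to a multiple of 16, and the choice to leave smaller requests unchanged are assumptions made for this example, not documented behavior of the model.

```python
import math

MAX_PIXELS = 1024 * 1024  # total pixel budget stated in the abstract
MULTIPLE = 16             # assumption: sides are snapped down to multiples of 16


def fit_to_budget(width: int, height: int) -> tuple[int, int]:
    """Shrink (width, height) to fit the budget, roughly keeping the aspect ratio."""
    # Never upscale: requests already inside the budget keep scale 1.0.
    scale = min(1.0, math.sqrt(MAX_PIXELS / (width * height)))
    new_w = max(MULTIPLE, int(width * scale) // MULTIPLE * MULTIPLE)
    new_h = max(MULTIPLE, int(height * scale) // MULTIPLE * MULTIPLE)
    return new_w, new_h


print(fit_to_budget(2048, 1024))  # a 2:1 request is reduced to (1440, 720)
```

Rounding both sides down keeps the resulting area at or below the budget, at the cost of a slightly smaller image than the budget strictly allows.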

Performance Comparison

We compare Skywork UniPic 3.0 with state-of-the-art models on image editing tasks, using ImgEdit-Bench and GEdit-Bench (higher is better on both).

                      ImgEdit-Bench ↑                                                                GEdit-Bench ↑
Model                 Extract  Style  BG    Add   Remove  Replace  Adjust  Compose  Action  Overall  G_SC  G_PQ  G_O
Qwen-Image-Edit       3.47     4.80   4.32  4.26  3.87    4.58     4.45    3.91     4.59    4.25     8.18  7.87  7.68
Qwen-Image-Edit-2509  3.51     4.84   4.36  4.43  4.29    4.66     4.42    3.72     4.58    4.31     8.12  8.01  7.61
Nano Banana           3.89     4.20   4.32  4.33  4.39    4.55     4.36    3.42     4.48    4.22     7.43  8.14  7.20
Seedream 4.0          2.96     4.76   4.22  4.47  4.25    4.42     4.31    3.11     4.45    4.11     8.24  7.86  7.66
UniPic 2.0            1.86     4.53   4.73  4.48  4.00    4.73     4.18    3.82     4.22    4.06     7.63  7.17  7.10
UniPic 3.0            3.31     4.97   4.35  4.45  4.46    4.71     4.44    3.77     4.69    4.35     8.12  7.79  7.55

Multi-image Composition

*Note: Images shown here are in JPEG format with compression for display purposes. For detailed results, please refer to the paper and model outputs.

Visual Comparison

[Figure: UniPic 3.0 visual comparison (three images)]

Qualitative results in 8 steps

[Figure: UniPic 3.0 few-step results]

Acknowledgement

We are grateful to the community for its open exploration and contributions to the field of unified multimodal models.
