Exploring StableVITON for Virtual Try-On: A Hands-On Review

April 3, 2024
Tayyab Bilal


Virtual try-on technology represents a significant advancement in the digital fashion industry, offering a bridge between the online shopping experience and the physical act of trying on clothes. It allows consumers to visualize how garments look on their bodies without a physical fitting, leveraging digital imaging and artificial intelligence to superimpose clothing items onto the user's digital avatar or image in a realistic manner. Potential uses in the fashion industry are extensive, including online shopping, personalized recommendations, and virtual fashion shows. For consumers, the benefits include convenience, an enhanced shopping experience, and a reduced need for physical trials, leading to fewer returns and greater satisfaction.

In today's blog, we will test and evaluate the results from StableVITON, an advanced framework for image-based virtual try-on. Building upon the capabilities of pre-trained diffusion models, it promises to offer high-quality, realistic clothing simulations on arbitrary person images. We're particularly interested in evaluating if this technology is ready to become a final product that customers would find beneficial.


StableVITON distinguishes itself from existing virtual try-on solutions with several innovative features:

Semantic Correspondence Learning

It analyzes and learns the relationships between clothing items and body shape within the hidden (latent) representation space of the diffusion model. This leads to more accurate virtual clothing transfer onto different body images.

Zero Cross-Attention Blocks

These are core elements of the architecture allowing StableVITON to maintain clothing details while still using the power of the pre-trained diffusion model for image generation. These blocks are designed to preserve clothing details by learning semantic correspondences, while also leveraging the inherent knowledge of the pre-trained model in the image warping process, resulting in high-fidelity images.
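To make the "zero" part concrete, here is a minimal NumPy sketch of the idea as we understand it: a cross-attention layer whose output projection is initialized to zero, so that at the start of training the block contributes nothing and the pre-trained diffusion model's behavior is preserved. The shapes, weight names, and single-head form are illustrative assumptions, not the repository's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def zero_cross_attention(unet_feat, cloth_feat, w_q, w_k, w_v, w_out):
    """Cross-attend UNet features (queries) to clothing features (keys/values).

    Because w_out is zero-initialized, the block is an identity mapping over
    the residual connection before training -- the pre-trained model's
    generation ability is untouched at initialization.
    """
    q = unet_feat @ w_q                                      # (n_q, d)
    k = cloth_feat @ w_k                                     # (n_k, d)
    v = cloth_feat @ w_v                                     # (n_k, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_q, n_k)
    return unet_feat + attn @ v @ w_out                      # residual + zero-init projection

rng = np.random.default_rng(0)
d = 8
unet_feat = rng.standard_normal((16, d))    # flattened spatial UNet tokens
cloth_feat = rng.standard_normal((16, d))   # flattened clothing latent tokens
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
w_out = np.zeros((d, d))                    # the "zero" in zero cross-attention

out = zero_cross_attention(unet_feat, cloth_feat, w_q, w_k, w_v, w_out)
print(np.allclose(out, unet_feat))  # True: identity at initialization
```

As training proceeds, w_out moves away from zero and the clothing features begin to steer generation, which is how detail preservation and the pre-trained prior can coexist.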

Attention Total Variation Loss

A novel loss function proposed to achieve sharper attention maps, leading to crisper preservation of garment details such as patterns and textures.
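Our reading of this loss can be sketched in a few lines of NumPy: for each query position on the person, take the attention-weighted mean coordinate over the clothing keys, then penalize the L1 difference between the centers of neighboring queries so attention varies smoothly across the garment. The grid size and normalization are assumptions for illustration only.

```python
import numpy as np

def attention_tv_loss(attn, h, w):
    """Total-variation penalty on attention 'centers' (illustrative sketch).

    attn: (h*w, h*w) attention map from person-query positions to
    clothing-key positions. Each query's center is the attention-weighted
    mean coordinate of the keys; the loss sums absolute differences between
    centers of adjacent queries.
    """
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1)   # (h*w, 2) key coordinates
    centers = (attn @ coords).reshape(h, w, 2)             # per-query attention centers
    return np.abs(np.diff(centers, axis=0)).sum() + np.abs(np.diff(centers, axis=1)).sum()

# Identity attention (query i attends exactly to key i) yields a smooth,
# grid-like layout of centers.
loss = attention_tv_loss(np.eye(16), 4, 4)
print(round(loss, 6))  # 8.0 for a 4x4 grid with coordinates in [0, 1]
```

Note that in this toy form, perfectly uniform attention also scores zero, so in practice a term like this would be combined with other training losses rather than used alone.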

Efficient Utilization of Conditions

StableVITON employs three conditions—agnostic map, agnostic mask, and dense pose—alongside the clothing feature map as inputs to the model's attention mechanism, ensuring detailed and accurate alignment of clothing on the person's image.
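The shape bookkeeping behind this conditioning can be illustrated with a toy example. Assuming a 768x1024 input downsampled 8x into a latent grid, and assuming the three conditions are concatenated channel-wise with the noisy latent (while the clothing features enter separately through the attention blocks), the UNet input looks like this. All channel counts here are assumptions for the sketch, not verified against the repository.

```python
import numpy as np

# Illustrative latent grid: a 768x1024 image downsampled 8x gives 96x128.
h, w = 128, 96
z_t           = np.zeros((4, h, w))  # noisy image latent
agnostic_map  = np.zeros((4, h, w))  # latent of the person with clothing removed
agnostic_mask = np.zeros((1, h, w))  # resized binary mask of the removed region
dense_pose    = np.zeros((4, h, w))  # latent of the DensePose rendering

# Conditions are stacked channel-wise with the noisy latent to form the UNet
# input; the clothing feature map is instead injected via cross-attention.
unet_input = np.concatenate([z_t, agnostic_map, agnostic_mask, dense_pose], axis=0)
print(unet_input.shape)  # (13, 128, 96)
```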

In contrast to other virtual try-on solutions that may struggle with preserving fine clothing details or require extensive manual tuning, StableVITON claims to offer an end-to-end solution that preserves clothing details and generates high-quality images even in the presence of complex backgrounds.

The main goal of StableVITON is to solve the issue of realistic virtual try-on of clothing. It aims to generate highly accurate images where:

  • Clothing details are preserved from the original garment.
  • Clothing fits the new person's body shape convincingly.

Running StableVITON

Installation and Setup

System Requirements

While specific system requirements are not formally documented for StableVITON, our testing environment was equipped with robust hardware to ensure optimal performance. This included an NVIDIA RTX 3090 GPU, a 12th Gen Intel® Core™ i5-12400 CPU, and 64GB of RAM. It's recommended to use a similarly capable setup or consult the documentation for minimum requirements.

Installation Process

The process to install StableVITON is straightforward. Here's a step-by-step guide to get you started:

  • Prerequisites: Ensure that CUDA and cuDNN are installed on your system. These are essential for leveraging GPU acceleration, crucial for running StableVITON efficiently.
  • Clone the Repository: Begin by cloning the StableVITON repository to your local machine. This can be done via your preferred Git client or directly through the command line.
  • Installing the required packages: Inside the repository, you'll find detailed installation instructions, which direct you to install the required packages in a separate conda environment. Following these steps is crucial for a successful setup.
  • Downloading the weights: The pre-trained weights are also available to download from a Dropbox link.

During our installation, we encountered no significant issues.

Data Preprocessing

The lack of comprehensive documentation in the repository posed a significant challenge. While the authors outlined the expected input data structure for inference, they did not provide clear instructions or identify the specific dependencies required to replicate their results. References to VITON-HD and DensePose were made for mask generation and pose estimation, respectively, but the exact models needed for accurate replication were omitted.

Despite these challenges, we were able to proceed with inference. Using Facebook's Detectron2, we generated the necessary DensePose estimations. Additionally, the OOTDiffusion repository (https://github.com/levihsu/OOTDiffusion) provided the inference script needed to generate agnostic masks.

The authors referred to the zalando-hd-resized dataset as a template for structuring custom data. Since this dataset is linked in the separate VITON-HD repository, we decided to use the cloth images it already provides to test against our own person images.

The following images and masks were essential for inference:

  • image: The original image of the person.
  • image-densepose: DensePose estimation (generated with Facebook's Detectron2).
  • agnostic: Agnostic image (generated by OOTDiffusion).
  • agnostic-mask: Agnostic mask (generated by OOTDiffusion).
  • cloth: The original image of the clothing item.
  • cloth_mask: Mask for the clothing item.

We manually reframed the original images to center the subject and resized them to 768x1024 with a 3:4 aspect ratio. This standardization was crucial as the model requires uniform image sizes, including the clothing images.
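The reframing step can be automated. Below is a minimal Pillow sketch that center-crops an image to the 3:4 aspect ratio and resizes it to 768x1024; the function name and crop policy (largest centered window) are our own choices, not part of the StableVITON tooling.

```python
from PIL import Image

def center_crop_resize(img, target=(768, 1024)):
    """Center-crop to the target aspect ratio (3:4 here), then resize.

    Mirrors the manual reframing described above: the subject stays centered
    and every input image (person and cloth alike) ends up the same size.
    """
    tw, th = target
    w, h = img.size
    if w * th > h * tw:                 # too wide -> trim left/right
        new_w = h * tw // th
        left = (w - new_w) // 2
        box = (left, 0, left + new_w, h)
    else:                               # too tall -> trim top/bottom
        new_h = w * th // tw
        top = (h - new_h) // 2
        box = (0, top, w, top + new_h)
    return img.crop(box).resize(target, Image.LANCZOS)

sample = Image.new("RGB", (1000, 1500))  # dummy stand-in for a person photo
print(center_crop_resize(sample).size)   # (768, 1024)
```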

Generating Agnostic Images and Masks

To generate the necessary agnostic images and masks, we employed the "run_ootd" script from OOTDiffusion. Modifications were made to this script to output images in the specified dimensions of 768x1024.

DensePose Pose Estimation

For pose estimation, the 'densepose_rcnn_R_50_FPN_s1x' model was run in 'dp_segm' mode within the DensePose project. Using the 'apply_net' script (link here: https://github.com/facebookresearch/detectron2/tree/main/projects/DensePose), we obtained the DensePose results. A minor script modification was necessary to ensure the DensePose was generated against a black background.

Data structure

The structure was provided as follows in the project repo, but an additional file named "test_pairs.txt" was required to run the inference. The final directory structure should look like this:

test/
|-- image
|-- image-densepose
|-- agnostic
|-- agnostic-mask
|-- cloth
|-- cloth_mask
|-- test_pairs.txt

In the "test" directory, we created a text file named "test_pairs.txt". This file specifies the image pairs for virtual try-on. Each line should list the filenames of the human image and the corresponding cloth image, separated by a space.


human_image_1.jpg cloth_image_1.jpg
human_image_2.jpg cloth_image_1.jpg
human_image_1.jpg cloth_image_2.jpg
human_image_2.jpg cloth_image_2.jpg
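Rather than writing this file by hand, the cross product of person and cloth images can be generated with a few lines of Python. This is our own convenience helper, not part of the StableVITON repository, and the file names in the demo are placeholders.

```python
import os
import tempfile

def write_test_pairs(test_dir):
    """Pair every person image with every cloth image for 'unpaired' try-on.

    Assumes the 'image' and 'cloth' sub-directories from the layout shown
    above; writes one 'person cloth' pair per line to test_pairs.txt.
    """
    people = sorted(os.listdir(os.path.join(test_dir, "image")))
    cloths = sorted(os.listdir(os.path.join(test_dir, "cloth")))
    pairs = [f"{p} {c}" for p in people for c in cloths]
    with open(os.path.join(test_dir, "test_pairs.txt"), "w") as f:
        f.write("\n".join(pairs) + "\n")
    return pairs

# Tiny demo layout mirroring the structure described above.
root = tempfile.mkdtemp()
for sub, names in [("image", ["human_1.jpg"]), ("cloth", ["cloth_1.jpg", "cloth_2.jpg"])]:
    os.makedirs(os.path.join(root, sub))
    for name in names:
        open(os.path.join(root, sub, name), "w").close()
print(write_test_pairs(root))  # ['human_1.jpg cloth_1.jpg', 'human_1.jpg cloth_2.jpg']
```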



Following the repository's instructions, we successfully generated results using the 'unpaired' mode. Each generation took approximately 9 seconds. Our test comprised 12 generations in total, drawn from combinations of 6 different clothing items and 3 input images.

Here are our observations:

Cloth Fitting

Impressive Fit and Detail

StableVITON delivers impressive garment fit on the virtual model. Notably, it retains details like hair and skin at the garment's edges, showcasing its ability to handle complex image regions. Additionally, the clothing exhibits realistic wrinkles and folds, adding a layer of realism.

Versatility in Fit Representation

Interestingly, when comparing the generated images for the crop top and t-shirt, StableVITON appears to understand the inherent style of each garment. The virtual try-on reflects a tighter fit for the crop top and the dress while generating a looser drape for the t-shirt, demonstrating the model's ability to adapt to different clothing types.

Color inconsistency and loss of details in the garment itself

Color Consistency Issues

We observed inconsistencies in the color of a single garment across different generations. For example, the gray t-shirt exhibited variations in shade, at times appearing lighter gray or even black.

Texture Detail Limitations

While StableVITON retains the basic texture of the clothing, it struggles to fully reproduce finer details, particularly in the case of text or graphics printed on the garment. 

Faulty generation of faces, beards and glasses in most cases

We consistently encountered problems with the generation of faces, beards, and occasionally glasses by the pre-trained diffusion model. These facial features often appeared distorted or malformed, rendering many results unusable.  This significant limitation indicates that the framework, in its current form, is not suitable for deployment in a customer-facing product.

Faulty generation of skin tone and other details on the arms

Since the final image is produced by the pre-trained diffusion model, we encountered occasional issues with inconsistent skin tone when generating arms. Additionally, finer details like watches or tattoos on the original person's image were always lost during the generation process.


StableVITON demonstrates considerable promise in the realm of virtual try-on. Its ability to accurately transfer garment fit, generate compelling textures, and adapt to different clothing styles showcases significant technical advancement. However, the noted inconsistencies and shortcomings highlight critical areas for improvement before this technology can deliver reliable results in a customer-facing product.

Specifically, the frequent distortion of facial features, loss of finer garment details, and color instability detract from the overall realism and trustworthiness of the virtual try-on experience. For customers, these flaws undermine confidence in making purchasing decisions based on the generated images.

While StableVITON marks a noteworthy step forward, addressing these limitations is crucial before it can be considered a market-ready solution.