UIClip: A Data-driven Model for Assessing User Interface Design

Carnegie Mellon University
Apple
ACM UIST 2024

Abstract

User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by (i) assigning a numerical score that represents a UI design’s relevance and quality and (ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: (i) UI code generation, (ii) UI design tips generation, and (iii) quality-aware UI example search.

Dataset Generation

We collected over 2.3 million UI screenshots, each paired with natural language text that includes a caption, design quality, and design defects. Since it is prohibitively costly and time-consuming to collect enough human-annotated data to train deep learning models, the majority of our data (over 99.9%) is synthetically generated, and a smaller portion is human-rated by professional designers. We refer to our synthetically-generated dataset as JitterWeb and our human-rated dataset as BetterApp.

UIClip: Training and Inference

We used the JitterWeb and BetterApp datasets to train a computational model, UIClip, that assesses UI designs from screenshots. While our datasets could be used to train various models, such as large vision-language models (LVLMs), we adopted the CLIP architecture due to its efficiency and ability to produce numerical scores, aligning with our goal of design assessment. UIClip takes two inputs: an image (e.g., a screenshot of a UI) and a corresponding textual description. The model outputs a single numerical value representing a combined assessment of the UI's design relevance and quality.

During training, UIClip was initialized from the pre-trained CLIP B/32 model and fine-tuned in four stages to learn domain-specific UI features. In the first stage, we used the JitterWeb dataset, training UIClip using the standard CLIP objective, which aligns paired images and text in a shared embedding space. This allowed the model to learn how UI screenshots relate to their descriptions. In the second stage, we introduced a pairwise contrastive learning objective to distinguish good designs from bad ones by comparing jittered and non-jittered UIs. These two stages were then repeated with the BetterApp dataset, incorporating human-rated comparisons and further refining UIClip's design assessment capabilities.
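To make the pairwise objective concrete, below is a minimal PyTorch sketch of one way such a loss could look: for the same description, a clean screenshot should score higher than its jittered (defect-injected) counterpart by a margin. The encoder names, normalization, and margin value are illustrative assumptions, not the exact loss used to train UIClip.

import torch
import torch.nn.functional as F

def pairwise_design_loss(image_encoder, text_encoder,
                         clean_imgs, jittered_imgs, captions, margin=0.1):
    """Illustrative margin ranking loss between clean and jittered screenshots."""
    txt = F.normalize(text_encoder(captions), dim=-1)        # (B, D) caption embeddings
    good = F.normalize(image_encoder(clean_imgs), dim=-1)    # (B, D) clean UI embeddings
    bad = F.normalize(image_encoder(jittered_imgs), dim=-1)  # (B, D) jittered UI embeddings

    score_good = (good * txt).sum(dim=-1)  # cosine score of each clean UI vs. its caption
    score_bad = (bad * txt).sum(dim=-1)    # cosine score of each jittered UI vs. the same caption

    # Encourage score_good to exceed score_bad by at least `margin`.
    target = torch.ones_like(score_good)
    return F.margin_ranking_loss(score_good, score_bad, target, margin=margin)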

For inference, UIClip applies a sliding window strategy to handle varying UI dimensions, where the input image is resized and split into 224x224 pixel windows. The model encodes each window separately and averages their embeddings, ensuring that the full UI is considered. The final score is computed as the dot product between the image embedding and the text embedding, allowing the model to assess the quality of the UI based on its visual appearance and the provided description. Additionally, UIClip can suggest design improvements by identifying specific design flaws (e.g., poor contrast or bad alignment) from the JitterWeb dataset.
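The sketch below illustrates this inference procedure in PyTorch. The exact resizing policy and window layout are assumptions, and image_encoder / text_encoder stand in for UIClip's CLIP-style encoders.

import torch
import torch.nn.functional as F

WIN = 224  # CLIP ViT-B/32 input resolution

def score_ui(image_encoder, text_encoder, screenshot, description):
    """screenshot: (3, H, W) tensor; description: tokenized text batch of size 1."""
    _, h, w = screenshot.shape
    # Resize so both sides are multiples of the window size (an illustrative choice).
    new_h = max(WIN, round(h / WIN) * WIN)
    new_w = max(WIN, round(w / WIN) * WIN)
    img = F.interpolate(screenshot[None], size=(new_h, new_w), mode="bilinear")[0]

    # Split the resized screenshot into non-overlapping 224x224 windows and encode each one.
    windows = [img[:, y:y + WIN, x:x + WIN]
               for y in range(0, new_h, WIN)
               for x in range(0, new_w, WIN)]
    win_emb = image_encoder(torch.stack(windows))        # (N, D) window embeddings
    img_emb = F.normalize(win_emb.mean(dim=0), dim=-1)   # average windows, then normalize

    txt_emb = F.normalize(text_encoder(description)[0], dim=-1)
    return (img_emb @ txt_emb).item()                    # combined relevance + quality score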

Experiment Results on UI Design Assessment

We conducted a comprehensive evaluation of UIClip across three key aspects of design assessment: UI design quality assessment, design suggestion generation, and design relevance. UIClip's performance was compared against state-of-the-art models, including large vision-language models (LVLMs) and various CLIP-based baselines. Our results highlight the model's effectiveness in these tasks, with UIClip outperforming much larger models in several cases.

(i) Design Quality: For the task of design quality assessment, we evaluated how accurately the models identified the "preferred" UI from a pair of examples. UIClip, particularly the variant trained with jittered web pairs from the JitterWeb dataset, achieved the highest accuracy (75.12%) across both synthetic and human-rated datasets. This performance was notably better than the base CLIP model, which struggled with jittered websites and exhibited erroneous associations between certain design defects and better design quality. Interestingly, human-rated examples from BetterApp proved more challenging, as UIClip's accuracy slightly dropped to 73.88%, likely due to the complexity and subjectivity of human annotations. Larger LVLMs like GPT-4V and Gemini-1.0-Pro underperformed on design quality assessment, with GPT-4V showing accuracy as low as 51.58%, partially due to refusals to respond to certain examples. These findings reinforce the effectiveness of UIClip’s training approach, which combines paired examples and contrastive learning to assess relative UI design quality.
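For reference, this pairwise metric reduces, for score-based models such as UIClip, to the fraction of pairs in which the preferred UI receives the higher score; the snippet below is an illustrative computation, where score_ui(screenshot, description) is any scoring callable such as the one sketched earlier.

def pairwise_accuracy(pairs, score_ui):
    """pairs: iterable of (description, preferred_screenshot, other_screenshot)."""
    correct, total = 0, 0
    for description, preferred, other in pairs:
        if score_ui(preferred, description) > score_ui(other, description):
            correct += 1
        total += 1
    return correct / total if total else 0.0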

(ii) Design Suggestions: To evaluate design suggestions, we compared model-generated outputs to CRAP (Contrast, Repetition, Alignment, Proximity) principles selected by human designers in the BetterApp dataset. Despite being a challenging task, UIClip consistently outperformed baseline models, achieving the best F1 scores in both the raw and choice-adjusted metrics. UIClip’s training on jittered websites enabled it to effectively detect design defects and generate suggestions, while LVLMs often over-generated suggestions, leading to inflated recall scores. When adjusting for wrong design choices (choice-adjusted F1), UIClip variants maintained high performance, with the full UIClip variant achieving the best results after adjusting for incorrect reasoning. CLIP models trained on alternative datasets, such as Screen2Words, did not perform well due to the absence of training data containing relevant design defects.

(iii) Design Relevance: For design relevance, we assessed each model’s ability to retrieve the correct UI based on textual descriptions, using mean reciprocal rank (MRR) as the evaluation metric. UIClip, pretrained on JitterWeb with the default CLIP objective, achieved the highest MRR scores (0.3851 on BetterApp and 0.4085 on JitterWeb). This result indicates superior performance in retrieving relevant examples based on the design quality described in the captions. Interestingly, UIClip models trained with the pairwise contrastive objective, while excelling in design comparison tasks, performed worse in relevance retrieval. This result suggests that task-specific training objectives play a crucial role in determining model success in different areas of design assessment. Despite this, UIClip's overall performance demonstrates the utility of using tailored datasets like JitterWeb and BetterApp for improving design understanding and relevance across multiple applications.
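For completeness, the snippet below shows a standard mean reciprocal rank computation of the kind used for this retrieval evaluation; the candidate-pool construction is an assumption, and score_ui(screenshot, description) is again a generic scoring callable.

def mean_reciprocal_rank(queries, score_ui):
    """queries: iterable of (description, target_screenshot, candidates);
    the target screenshot is assumed to be included among the candidates."""
    reciprocal_ranks = []
    for description, target, candidates in queries:
        ranked = sorted(candidates, key=lambda s: score_ui(s, description), reverse=True)
        rank = ranked.index(target) + 1  # 1-indexed rank of the correct UI
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)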

Example Use Case 1: Improving UI Code Generation

A short slide animation showing an overview of the UI code generation tool.

We developed a web application that allows users to generate rendered UI screenshots from a natural language description of a UI design. Users can input their descriptions, which are then formulated into prompts and fed into external large language models (LLMs) like OpenAI GPT-3.5 or Mixtral. These LLMs generate corresponding web code (HTML/CSS), which is rendered as UI screenshots. The key step in this process is the use of UIClip to rank these rendered screenshots based on their design quality.

In our interface, users receive multiple generated UI outputs (typically, we sample n = 5 different outputs), and each is rendered into a screenshot by programmatically controlling a browser. If any external resources (e.g., images) are required, placeholders are inserted to complete the rendering process. These screenshots are then fed into UIClip, which scores them against the user's input description. The outputs are ranked based on UIClip's scores, and the results are displayed to the user in descending order of design quality.

This example demonstrates how UIClip can enhance the output of generative models by ranking generated designs, similar to best-of-n sampling techniques. While this method is simple and does not require access to the model's internal weights, it can be computationally intensive during inference, as it requires multiple candidate solutions to be generated and evaluated. For more advanced integration, UIClip could be used as a filtering mechanism during the training of generative models or as a reward model in reinforcement learning fine-tuning approaches. These additional uses could reduce computational costs and improve the quality of the final outputs, but we leave such optimizations for future work.
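A minimal sketch of this best-of-n loop is shown below. generate_html (the LLM call) and render_screenshot (headless-browser rendering) are hypothetical helpers; only the UIClip-based ranking reflects the procedure described above.

def generate_and_rank(description, score_ui, generate_html, render_screenshot, n=5):
    """Sample n UI candidates, render them, and rank them by UIClip score."""
    candidates = []
    for _ in range(n):
        html = generate_html(description)     # sample one HTML/CSS candidate from the LLM
        screenshot = render_screenshot(html)  # render it with a programmatically controlled browser
        candidates.append((score_ui(screenshot, description), html, screenshot))

    # Display results to the user in descending order of UIClip score.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates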

Example Use Case 2: UI Design Tips

A short slide animation showing an overview of the UI design tips tool.

We developed a tool that enables users to upload screenshots of UI designs to generate helpful design tips, utilizing our model's design suggestion generation capabilities. Based on the user input (e.g., the screenshot of an app UI), the system generates actionable design tips to improve the overall UI quality. For instance, our system may suggest changes such as improving the readability of text or adjusting color choices to enhance contrast. While this tool provides useful feedback, it currently offers general suggestions about the entire UI, without pinpointing the exact areas that triggered specific recommendations. This limitation arises from our current approach, which pairs a complete screenshot with a set of design suggestions, without attaching spatial information to particular UI elements. Future improvements could address this by associating design tips with specific regions of the UI, for example by sliding a smaller window across the screenshot and linking the generated suggestions to particular areas. Additionally, collecting data that includes both design feedback and the location of design flaws would enable the model to provide more targeted suggestions. These enhancements would make the system even more precise and useful for users, and we leave these improvements to future work.
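One possible way to surface such tips with a CLIP-style scorer is sketched below: each candidate defect phrase is appended to the description, and defects whose augmented caption scores above a threshold are returned as tips. The defect list, prompt phrasing, and threshold are illustrative assumptions, not the exact suggestion-generation procedure.

# Hypothetical defect vocabulary; the real system draws on defects from JitterWeb.
CANDIDATE_DEFECTS = [
    "text is hard to read",
    "poor color contrast",
    "elements are misaligned",
    "cluttered layout with poor proximity",
]

def design_tips(screenshot, description, score_ui, threshold=0.25):
    """Return defect phrases that the scorer associates with the screenshot."""
    tips = []
    for defect in CANDIDATE_DEFECTS:
        augmented = f"{description}. design flaw: {defect}"  # assumed prompt format
        if score_ui(screenshot, augmented) > threshold:
            tips.append(defect)
    return tips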

Example Use Case 3: UI Example Retrieval

A short slide animation showing an overview of the UI example retrieval tool.

We built a web application that contains a search box where the user enters their query. The figure above shows examples of the screens retrieved for a set of queries, indexed by UIClip and the vanilla CLIP model. Our tool uses a procedure similar to our UI relevance evaluation, where model-computed embeddings are used to retrieve and sort screenshots based on the user's query. UIClip's score takes into account both the relevance and quality of retrieved examples, and we incorporate a negative prompt that biases the query vector away from simple or ambiguous designs.
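The sketch below illustrates this retrieval step under stated assumptions: the negative-prompt text, its weight, and the pre-computed gallery embeddings are placeholders, and only the idea of biasing the query vector away from a negative prompt is taken from the description above.

import torch
import torch.nn.functional as F

def retrieve(query, gallery_embeddings, text_encoder,
             negative_prompt="a simple, ambiguous ui", alpha=0.5, k=10):
    """gallery_embeddings: (N, D) tensor of pre-computed, normalized UI screenshot embeddings."""
    q = F.normalize(text_encoder([query])[0], dim=-1)
    neg = F.normalize(text_encoder([negative_prompt])[0], dim=-1)
    biased = F.normalize(q - alpha * neg, dim=-1)  # push the query away from the negative prompt

    scores = gallery_embeddings @ biased           # similarity of each indexed UI to the biased query
    top = torch.topk(scores, k=min(k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()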

Reference

@inproceedings{wu2024uiclip,
        title={UIClip: A Data-driven Model for Assessing User Interface Design},
        author={Wu, Jason and Peng, Yi-Hao and Li, Amanda Xin Yue and Swearngin, Amanda and Bigham, Jeffrey and Nichols, Jeffrey},
        booktitle={Proceedings of the ACM Symposium on User Interface Software and Technology (UIST)},
        year={2024}
}

Acknowledgements

This work was funded in part by an NSF Graduate Research Fellowship.

This webpage template was inspired by and modified from the BlobGAN project.