UICoder: Finetuning LLMs to Generate UI Code with Automated Feedback
Source: https://machinelearning.apple.com/research/uicoder (Apple Machine Learning Research)
TL;DR
- UICoder investigates finetuning LLMs to generate UI code using automated feedback from compilers and multimodal models.
- The workflow starts with an existing LLM, which self-generates a large synthetic dataset; automated tools then filter, score, and deduplicate it into a refined, high-quality dataset.
- The original LLM is then finetuned on this refined dataset to produce improved models.
- The approach was applied to several open-source LLMs and compared against baseline models using automated metrics and human preferences.
- Evaluations show the finetuned models outperform all other downloadable baselines and approach the performance of larger proprietary models.
Context and background
Large language models still struggle to consistently produce UI code that compiles and yields visually relevant designs, and existing approaches to improving generation rely on expensive human feedback or on distilling proprietary models. To address this, the work explores automated feedback, from compilers and multimodal models, to guide LLMs toward high-quality UI code. The method starts with an existing LLM and iteratively produces improved models: the original model self-generates a large synthetic dataset, automated tools aggressively filter, score, and de-duplicate that data into a refined, higher-quality dataset, and the original LLM is then finetuned on the refined data. The authors applied the approach to several open-source LLMs and compared the resulting models to baselines using both automated metrics and human preferences; the finetuned models outperform all other downloadable baselines and approach the performance of larger proprietary models. The work has been discussed alongside contemporary research on evaluating LLMs via pairwise preferences over model responses, a data signal used to guide feedback for model improvement, and it sits within a broader research program spanning Speech and Natural Language Processing and Human-Computer Interaction. The paper was accepted at the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) 2024.
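As a concrete illustration of the compiler half of that feedback loop, here is a minimal sketch of a compile check, assuming the generated UI code is SwiftUI (as in the published UICoder work) and that the Swift compiler is invoked directly; the sample snippet, the `swiftc -typecheck` call, and the pass/fail criterion are illustrative rather than taken from the paper.

```python
# Minimal sketch of a compiler-based feedback signal for generated UI code.
# Assumption: the code is SwiftUI and `swiftc` is available on PATH (macOS with
# Xcode tools); the real pipeline's tooling may differ.
import subprocess
import tempfile
from pathlib import Path

def compiles(swift_source: str, timeout_s: int = 60) -> bool:
    """Return True if the generated Swift/SwiftUI source type-checks."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "Generated.swift"
        src.write_text(swift_source)
        # -typecheck parses and type-checks without producing a binary; a
        # non-zero exit code marks the sample as one to filter out.
        result = subprocess.run(
            ["swiftc", "-typecheck", str(src)],
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0

sample = """
import SwiftUI

struct GreetingView: View {
    var body: some View {
        Text("Hello, UICoder!")
            .padding()
    }
}
"""
print("keep sample" if compiles(sample) else "drop sample")
```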
What’s new
- The method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset with the original model, then applying automated tools to aggressively filter, score, and de-duplicate the data into a refined, higher-quality dataset.
- The original LLM is improved by finetuning on this refined dataset.
- This approach was applied to several open-source LLMs and evaluated against baseline models using both automated metrics and human preferences.
- The evaluations showed that the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.
Why it matters (impact for developers/enterprises)
For developers and organizations building UI-intensive applications, UICoder represents a pathway to more reliable UI code generation without relying on costly human feedback. By leveraging automated feedback signals from compilers and multimodal systems, teams may achieve higher quality UI code that compiles and aligns more closely with visual designs, potentially reducing development time and iteration cycles. The approach also demonstrates how open-source LLMs can be improved through self-generated data and automated quality controls, making state-of-the-art UI code generation more accessible to a broader base of developers and teams.
Technical details
- Starting point: an existing LLM serves as the baseline model.
- Synthetic data generation: the baseline model self-generates a large synthetic dataset reflecting UI code tasks.
- Automated filtering, scoring, and de-duplication: the generated data is aggressively filtered, scored, and deduplicated using automated tools (including compilers and multimodal models) to form a refined, higher-quality dataset (a filtering sketch follows this list).
- Fine-tuning: the original LLM is fine-tuned on the refined dataset to produce improved models (a finetuning sketch also follows this list).
- Evaluation: the improved models are evaluated against baseline models using automated metrics and human preferences to assess UI code quality, compilation success, and visual relevance.
- Scope: the approach was applied to several open-source LLMs, illustrating that automated feedback-driven finetuning can yield meaningful gains across different architectures.
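To make the filtering stage above more concrete, here is a minimal sketch of how the three automated signals might be combined: a compile check gates each sample, a multimodal relevance score (for example, a CLIP-style similarity between a rendered screenshot and the prompt) ranks the survivors, and near-duplicates are dropped. The helper names, the use of a CLIP-style scorer, and the thresholds are assumptions for illustration; this summary only states that compilers and multimodal models filter, score, and de-duplicate the data.

```python
# Illustrative filter -> score -> de-duplicate stage over generated samples.
# The three helpers below are stand-ins for real tooling and are assumptions.
from dataclasses import dataclass
from difflib import SequenceMatcher

def compiles(code: str) -> bool:
    """Stand-in for the compiler check sketched earlier in the post."""
    return True  # placeholder

def render_screenshot(code: str) -> bytes:
    """Stand-in for rendering the compiled UI to an image."""
    return b""  # placeholder

def clip_similarity(image: bytes, text: str) -> float:
    """Stand-in for a CLIP-style image-text relevance score in [0, 1]."""
    return 1.0  # placeholder

@dataclass
class Sample:
    prompt: str         # natural-language UI description
    code: str           # LLM-generated UI code
    score: float = 0.0  # multimodal relevance score, filled in below

def near_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    """Cheap textual near-duplicate check; a real pipeline might use embeddings."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def refine(samples: list[Sample], min_score: float = 0.25) -> list[Sample]:
    kept: list[Sample] = []
    for s in samples:
        if not compiles(s.code):           # 1) drop anything that fails to compile
            continue
        image = render_screenshot(s.code)  # 2) render the UI ...
        s.score = clip_similarity(image, s.prompt)  # ... and score prompt relevance
        if s.score < min_score:
            continue
        if any(near_duplicate(s.code, k.code) for k in kept):  # 3) de-duplicate
            continue
        kept.append(s)
    # Highest-scoring samples first, ready to become finetuning data.
    return sorted(kept, key=lambda s: s.score, reverse=True)
```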
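The finetuning step itself is ordinary supervised training on the refined prompt/code pairs. The sketch below uses the Hugging Face `transformers` Trainer as one plausible way to do it; the base model name, prompt format, and hyperparameters are placeholders, since this summary does not specify the training stack.

```python
# Supervised finetuning sketch on the refined (prompt, code) pairs.
# Model name, prompt format, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "bigcode/starcoderbase"  # placeholder open-source code LLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# `refined` would come from the filter/score/de-duplicate stage sketched above.
refined = [
    {"prompt": "A login screen with two text fields and a button",
     "code": "import SwiftUI\nstruct LoginView: View { var body: some View { Text(\"...\") } }"},
]
dataset = Dataset.from_list(refined)

def to_features(example):
    # Concatenate the description and the code so the model learns to emit UI
    # code conditioned on a natural-language prompt.
    text = f"// Description: {example['prompt']}\n{example['code']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="uicoder-finetune",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```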
Key takeaways
- Automated feedback can guide LLMs toward higher-quality UI code without heavy reliance on human annotations.
- A self-generated synthetic data pipeline, combined with automated filtering and deduplication, can produce richer training material for finetuning.
- Finetuning the original LLM on refined data can yield models that outperform downloadable baselines and approach large proprietary models.
- The methodology integrates compilers and multimodal evaluation tools as feedback signals to improve code quality and visual fidelity.
- The work contributes to the broader understanding of how automatic signals and self-generated data can accelerate progress in UI code generation.
FAQ
- How is automated feedback used in UICoder?
  Automated feedback comes from compilers and multimodal models that filter, score, and deduplicate a self-generated synthetic dataset used to fine-tune the base LLM.
- Which models were evaluated?
  The approach was applied to several open-source LLMs and compared against baseline models using automated metrics and human preferences.
- What were the key outcomes?
  The finetuned models outperformed all other downloadable baselines and approached the performance of larger proprietary models.
- Where can I read more about UICoder?
  The detailed work is published by Apple’s machine learning research program at https://machinelearning.apple.com/research/uicoder.