Motiff aims to become a leading design tool in the AI era by focusing on two main areas: first, using AI to create innovative features that assist designers and their teams; second, ensuring that the AI technologies behind these features are robust enough to make the product truly effective.
Large language models have been evolving rapidly, demonstrating stronger learning capabilities and greater generalization. These advances open fresh perspectives on AI applications, and the Motiff team is actively exploring them.
Reflecting on the past year, we have two key insights on the impact of large language models in AI product development:
In practice, we have tried using general-purpose large models to tackle product challenges, but they often fall short in the specialized UI domain. Because Motiff's focus is on strengthening product capabilities rather than building general models, this gap led us to develop a specialized large model of our own. The maturity of domain models in fields such as healthcare and law gives us confidence that a UI-specific model is achievable.
These efforts confirmed the need for self-developed models. For instance, using a domain model, the Motiff team validated AI-generated design drafts with just 200 examples. MLLM by Motiff reduces costs and improves the efficiency of innovation in UI design.
We're excited to share our innovative progress in this area.
Multimodal large models like LLaVA and GPT-4V/4o have advanced rapidly, integrating diverse data types (text, images, video) for improved understanding and accuracy. This progress is the result of collaboration between academia and industry. Initially focused on richer input and output modalities, multimodal technology is now expanding into specialized domains, as exemplified by Microsoft's LLaVA-Med for biomedicine.
Despite the growth and availability of open-source models like LLaVA, applying multimodal models to specialized fields remains challenging, with many areas still unexplored. This context sets the stage for Motiff's self-developed UI multimodal large models (MLLM by Motiff).
The chart below illustrates the rapid development of multimodal large language models from 2022 to 2024.
Training general multimodal models typically involves three stages: unimodal pre-training of the visual encoder and the language model, alignment training in which a connector learns to map visual features into the language model's representation space, and instruction fine-tuning on multimodal instruction-following data.
MLLM by Motiff aims to innovate in UI design by leveraging these advancements.
Training a multimodal large model (MLLM) specifically for the UI domain from scratch involves challenges such as limited domain-specific data and high costs. Instead of starting anew, we can adapt existing multimodal models to fit UI design needs.
The key question, then, is which of these training stages to adapt for the UI domain.
Our experience shows that focusing on the later training stages yields better domain-specific performance, so our strategy prioritizes optimizing them. Currently, our efforts concentrate on the latter two stages; exploring how unimodal domain adaptation affects the final multimodal model remains an interesting avenue for future research at Motiff.
MLLM by Motiff follows the classic approach of integrating pre-trained expert models: a pre-trained Vision Encoder is linked to a Large Language Model (LLM) through a Connector. As illustrated, an image first passes through the Vision Encoder and then the Vision-Language Connector, which converts it into visual tokens the LLM can process. Combined with text tokens, these visual tokens allow the LLM to generate a complete text response, enabling rich UI design interactions.
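Conceptually, the pipeline can be sketched as follows. This is a minimal PyTorch-style illustration with placeholder module names and dimensions (the actual encoder, connector design, and LLM interface are not spelled out here); it only shows how visual tokens produced by the connector are concatenated with text tokens before reaching the LLM.

```python
import torch
import torch.nn as nn

class UIMultimodalModel(nn.Module):
    """Minimal sketch of the Vision Encoder -> Connector -> LLM pipeline.
    Module choices and dimensions are illustrative placeholders only."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a ViT that outputs patch features
        self.connector = nn.Sequential(               # a simple MLP projector is one common choice
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                                # decoder-only LM; an HF-style interface
                                                      # accepting inputs_embeds is assumed here

    def forward(self, screenshot, text_embeddings):
        # 1. Encode the UI screenshot into patch-level visual features.
        patch_features = self.vision_encoder(screenshot)      # (B, N_patches, vision_dim)
        # 2. Project visual features into the LLM's embedding space ("visual tokens").
        visual_tokens = self.connector(patch_features)         # (B, N_patches, llm_dim)
        # 3. Concatenate visual tokens with text token embeddings and let the LLM
        #    generate the text response autoregressively.
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.llm(inputs_embeds=inputs)
```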
High-quality UI domain data, especially for mobile platforms, is scarce. To address this, we used methods like manual annotation, pseudo-labeling, and domain knowledge distillation to gather high-quality UI data, categorized as follows:
Type 1: UI Screenshot Captions
This common multimodal training data is used during alignment and instruction fine-tuning.
Unlike natural scene images, UI screenshots contain more details and require more reasoning.
Through prompt engineering, we have generated descriptions similar to the following. These descriptions introduce each UI screenshot module by module, from top to bottom, covering layout styles, component names, key UI elements, and module functionality, and close with a comprehensive evaluation of the overall page design.
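For illustration, a caption-generation prompt along the following lines could elicit such module-by-module descriptions; the exact wording is a hypothetical sketch rather than the production prompt.

```python
# Hypothetical prompt template for generating module-by-module UI captions.
# The wording is illustrative; the production prompts are not published.
CAPTION_PROMPT = """You are a senior UI designer. Describe the attached UI screenshot
module by module, from top to bottom. For each module, cover:
- the layout style (e.g. grid, list, card)
- the component names (navigation bar, tab bar, button, input field, ...)
- the key UI elements and their text
- the module's function on this page
Finish with a brief evaluation of the overall page design."""

def build_caption_request(screenshot_path: str) -> dict:
    """Package the prompt and the screenshot into a generic multimodal request."""
    return {"image": screenshot_path, "prompt": CAPTION_PROMPT}
```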
Type 2: UI Screenshot Structured Captions
Influenced by Meta researchers' 2023 paper "LIMA: Less Is More for Alignment", we have moved away from the "more is better" approach.
Instead, when training the MLLM by Motiff, we incorporated a batch of high-quality, knowledge-intensive UI data that enables precise localization and a comprehensive understanding of every element on a UI interface.
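As a hypothetical example of what such knowledge-intensive data can look like, a screenshot might be paired with a machine-readable description of its elements; the schema below is purely illustrative, not the actual annotation format.

```python
# Hypothetical structured caption for one UI screenshot; the schema is illustrative.
structured_caption = {
    "screen": {"width": 1080, "height": 2340, "platform": "mobile"},
    "elements": [
        {
            "type": "navigation_bar",
            "bbox": [0, 0, 1080, 180],            # pixel coordinates: x1, y1, x2, y2
            "children": [
                {"type": "icon", "name": "back_arrow", "bbox": [24, 60, 84, 120]},
                {"type": "text", "content": "Order Details", "bbox": [400, 55, 680, 125]},
            ],
        },
        {
            "type": "button",
            "text": "Confirm Payment",
            "bbox": [60, 2100, 1020, 2240],
            "function": "submits the current order for payment",
        },
    ],
}
```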
Type 3: UI Instruction Tuning Data
We leveraged successful experience from general domains to collect and construct a rich set of UI instruction-following data. It mainly covers tasks such as UI interface description, question answering over UI interfaces, pixel-level interface element localization, fine-grained interface element description, and UI interaction guidance.
For data generation, we introduced several expert models, such as a component recognition model, an icon recognition model, and an OCR model. By combining the outputs of these expert models with our structured descriptions and feeding them into a private LLM, we can generate high-quality training data that is much closer to real user scenarios.
Existing large-model solutions for the UI domain, such as Apple's Ferret-UI and Google's ScreenAI, typically rely on classifiers to generate static icon descriptions. In contrast, our approach merges the results of the icon recognition expert model with the detailed structured text and feeds both into the private LLM. This integration allows the same icon to be described with different meanings depending on its context, improving the accuracy and contextual relevance of the descriptions.
As illustrated in the image below, the upper section shows data generated by ScreenAI, while the three sections beneath it show data generated by our method.
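The sketch below illustrates the general idea of this data-generation step. The function and model names (`describe_icon_in_context`, `private_llm_generate`) are hypothetical stand-ins and the prompt wording is an assumption; the point is simply that expert-model outputs and the structured screen description are merged into one request to the LLM.

```python
def describe_icon_in_context(icon_label: str, ocr_text: list[str],
                             structured_caption: dict, private_llm_generate) -> str:
    """Combine expert-model outputs with the structured screen description so the
    same icon (e.g. a magnifier) can be described differently per context.
    All names here are illustrative; the internal pipeline is not published."""
    prompt = (
        "You are annotating UI training data.\n"
        f"Icon recognized by the icon expert model: {icon_label}\n"
        f"Nearby text from the OCR expert model: {', '.join(ocr_text)}\n"
        f"Structured description of the screen: {structured_caption}\n"
        "Describe what tapping this icon does on THIS screen, in one sentence."
    )
    return private_llm_generate(prompt)

# Example: the same "magnifier" icon would yield different descriptions on a
# shopping page ("search for products") and on a map page ("search for a location").
```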
In addition to the three types of UI-related data mentioned above, we found that tasks such as chart Q&A, document Q&A, and OCR can also improve the understanding of UI interfaces.
Finally, to maintain general capabilities, we also included general-domain data such as natural scene image descriptions, natural scene image question answering, and generic text-based instructions.
In summary, we have collected tens of millions of multimodal training samples, including screenshots of common apps on the market and a large number of web screenshots, infusing the MLLM by Motiff with extensive UI expertise.
In its foundational design choices, MLLM by Motiff focuses on the unique requirements of UI scenarios. Unlike general natural scenes, UI interfaces contain a large number of fine-grained elements, so we employ a visual encoder that supports high-resolution inputs.
This high-resolution processing capability enables the visual encoder to capture more details, significantly enhancing the model's ability to perceive the complex details of UI interfaces, thereby reducing the risk of blurring and misclassification caused by low-resolution images.
Through this series of optimizations, the model's accuracy and detail-handling capability when processing UI interfaces have been significantly improved.
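One common way to feed high-resolution screenshots to a fixed-resolution encoder is to split the image into tiles and encode each tile separately, alongside a downscaled global view. The sketch below shows that general technique as an illustration; it is not a description of the exact encoder used in MLLM by Motiff.

```python
from PIL import Image

def tile_screenshot(image: Image.Image, tile_size: int = 336) -> list[Image.Image]:
    """Split a high-resolution UI screenshot into fixed-size tiles so that a
    standard vision encoder (e.g. one trained at 336x336) can see fine details
    such as small icons and text. Illustrative only."""
    tiles = []
    width, height = image.size
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            tiles.append(image.crop(box))
    # A downscaled global view is often kept alongside the tiles for overall layout.
    tiles.append(image.resize((tile_size, tile_size)))
    return tiles
```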
As previously mentioned, our domain migration training is currently applied in two stages:
Stage 1: Alignment Training — Introducing UI domain knowledge during the alignment training of visual models and large language models.
In this stage, we introduced two types of UI-related data: UI interfaces paired with natural language descriptions, and UI interfaces paired with structured descriptions. The former is similar to descriptions of natural scene images, while the latter is unique to UI interfaces. For training stability, we trained only the connector at this stage and froze both the visual encoder and the large language model.
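In PyTorch terms, this "train only the connector" setup amounts to freezing the other two modules, roughly as in the sketch below (using the placeholder model structure from the earlier architecture sketch, not the actual training code).

```python
def prepare_stage1_training(model):
    """Stage 1 (alignment): freeze the vision encoder and the LLM,
    and update only the connector. Illustrative sketch."""
    for param in model.vision_encoder.parameters():
        param.requires_grad = False
    for param in model.llm.parameters():
        param.requires_grad = False
    for param in model.connector.parameters():
        param.requires_grad = True
    # Only the connector's parameters are handed to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(prepare_stage1_training(model), lr=1e-3)
```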
Stage 2: Domain-Specific Instruction Fine-Tuning — Introduce UI Domain Knowledge through End-to-End Training of MLLM
In this stage, we trained on all task data, including general-domain text data, general-domain multimodal data, and UI-domain multimodal data. The goal was to enhance domain knowledge while preserving general abilities.
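A simple way to mix these three data sources during fine-tuning is weighted sampling; the sketch below shows the general pattern, with arbitrary placeholder ratios rather than the actual data mixture.

```python
import random

def sample_training_batch(general_text, general_multimodal, ui_multimodal,
                          weights=(0.2, 0.3, 0.5), batch_size=32):
    """Draw a mixed batch from the three data pools used in stage 2.
    The mixture weights here are illustrative placeholders."""
    pools = [general_text, general_multimodal, ui_multimodal]
    batch = []
    for _ in range(batch_size):
        pool = random.choices(pools, weights=weights, k=1)[0]
        batch.append(random.choice(pool))
    return batch
```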
We conducted a comprehensive evaluation of the MLLM by Motiff, comparing it with state-of-the-art (SOTA) models for interface-related tasks. The evaluation covered five common UI interface scenarios:
ScreenQA [2] is a benchmark dataset for screen understanding proposed by Google DeepMind in 2022. It evaluates a model's understanding capabilities through question-answer pairs based on screenshots. The evaluation split includes 8,419 manually annotated Q&A pairs, covering 3,489 screenshots from the Rico dataset.
As one of the most representative datasets available, ScreenQA not only provides rich visual information but also involves various elements and interaction methods within user interfaces.
Therefore, evaluating on the ScreenQA dataset effectively tests the model's overall ability to understand and answer interface-related questions.
Screen2Words [3] is a screen summarization task designed specifically for mobile UI interfaces, proposed by researchers from the University of Toronto and Google Research. Its main purpose is to evaluate the model's ability to understand and describe the important content and abstract functions of a screen.
The dataset contains screenshots from various application scenarios, along with corresponding textual descriptions. These descriptions include both explicit content of the interface (such as text and images) and abstract functions of the interface (such as the purpose of buttons and the main theme of the page).
By evaluating the Screen2Words dataset, we can gain deeper insights into the model's performance in generating natural language descriptions and inferring the functions of the interface.
The RefExp [4] task evaluates the model's ability to precisely locate interface components. This task requires the model to accurately find the referenced component on the screen based on a given referring expression.
The evaluation dataset provides screenshots of mobile UI interfaces along with corresponding natural language descriptions that point to a specific interface element (such as a button, icon, input box, etc.).
The model needs to recognize and locate these elements within the screen image, which not only tests the model's understanding of natural language but also examines its capability in pixel-level visual parsing and precise localization.
The RefExp task has practical applications in voice control systems, such as smart assistants that can locate specific buttons or options on the screen based on the user's verbal instructions.
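For context, one common way to score this kind of grounding task is to check whether the predicted click point (or predicted box center) falls inside the ground-truth element's bounding box; the sketch below shows that generic check, not the exact evaluation protocol used for RefExp here.

```python
def point_in_box(pred_x: float, pred_y: float,
                 box: tuple[float, float, float, float]) -> bool:
    """Return True if the predicted point lies inside the ground-truth box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return x1 <= pred_x <= x2 and y1 <= pred_y <= y2

def refexp_accuracy(predictions, ground_truth_boxes):
    """Fraction of referring expressions whose predicted point hits the target element."""
    hits = sum(point_in_box(x, y, box)
               for (x, y), box in zip(predictions, ground_truth_boxes))
    return hits / len(ground_truth_boxes)
```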
The Widget Captioning [5] task aims to evaluate the model's ability to generate natural language descriptions, specifically for various components within an interface. This task requires the model to produce brief and accurate descriptions of different UI components (such as buttons, icons, etc.).
The dataset includes common interface components from various applications along with their corresponding descriptive text. These descriptions need to precisely cover the visual characteristics of the components as well as reflect their functions and purposes.
This task helps test the model's ability to understand and generate semantically appropriate natural language descriptions, which is particularly valuable for practical applications in screen readers and assistive technologies.
The Mobile App Tasks with Iterative Feedback (MoTIF) [7] dataset is specifically designed to evaluate the model's ability to execute natural language instructions within mobile applications.
This task not only involves understanding natural language instructions but also requires the model to perform corresponding actions on the screen, such as clicking, typing, swiping, etc. These actions lead to changes in the interface state, thereby assessing the model's capability in dynamic interaction and feedback handling.
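As an illustration of what such an action space can look like, the schema below is a hypothetical sketch of how click, type, and swipe actions might be represented; it is not MoTIF's actual action format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UIAction:
    """Hypothetical schema for an agent action on a mobile screen; MoTIF defines
    its own action representation, and this sketch is only an illustration."""
    kind: str                        # "click", "type", or "swipe"
    x: Optional[float] = None        # normalized tap coordinates for "click"
    y: Optional[float] = None
    text: Optional[str] = None       # text to enter for "type"
    direction: Optional[str] = None  # "up"/"down"/"left"/"right" for "swipe"

# Example: an instruction like "open the search tab and type 'running shoes'"
# might decode into:
# [UIAction("click", x=0.5, y=0.95), UIAction("type", text="running shoes")]
```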
Having introduced each evaluation dataset in detail, we now present the evaluation results of the MLLM by Motiff on each of them.
From the results, it is evident that in these five UI-related metrics, the general large language model (GPT-4) is noticeably weaker than the domain-specific models (Ferret-UI, ScreenAI, and MLLM by Motiff).
Additionally, the MLLM by Motiff significantly outperforms Apple's Ferret-UI model on these metrics, with its overall capabilities approaching those of Google's ScreenAI model, even surpassing ScreenAI in certain aspects.
Evaluation Results:
Overall, MLLM by Motiff outperforms Apple's Ferret-UI and closely matches Google's ScreenAI, even surpassing it in some aspects.
The Motiff team is committed to leading in AI-related products, with the UI multimodal large language model being a pivotal step toward this goal.
The MLLM by Motiff allows us to quickly implement AI capabilities, integrate them into products, and gather user feedback. This feedback loop enhances the development of smarter, more efficient UI design tools in the AI era.
Human creativity arises from cognition and understanding. In the AI era, user interface creation will begin with large language models that fully comprehend these interfaces.
Looking forward, the Motiff team aims to utilize this model to make AI design tools more intelligent and efficient, enabling "unbounded creativity for designers".
[1] Yin S, Fu C, Zhao S, et al. A survey on multimodal large language models[J]. arXiv preprint arXiv:2306.13549, 2023.
[2] Hsiao Y C, Zubach F, Wang M. ScreenQA: Large-scale question-answer pairs over mobile app screenshots[J]. arXiv preprint arXiv:2209.08199, 2022.
[3] Wang B, Li G, Zhou X, et al. Screen2Words: Automatic mobile UI summarization with multimodal learning[C]//The 34th Annual ACM Symposium on User Interface Software and Technology. 2021: 498-510.
[4] Bai C, Zang X, Xu Y, et al. UIBert: Learning generic multimodal representations for UI understanding[J]. arXiv preprint arXiv:2107.13731, 2021.
[5] Li Y, Li G, He L, et al. Widget Captioning: Generating natural language description for mobile user interface elements[J]. arXiv preprint arXiv:2010.04295, 2020.
[6] Burns A, Arsan D, Agrawal S, et al. A dataset for interactive vision-language navigation with unknown command feasibility[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 312-328.
[7] Burns A, Arsan D, Agrawal S, et al. Mobile App Tasks with Iterative Feedback (MoTIF): Addressing task feasibility in interactive visual environments[J]. arXiv preprint arXiv:2104.08560, 2021.