
Published as a conference paper at ICLR 2025

LIGHTWEIGHT NEURAL APP CONTROL

Filippos Christianos1*, Georgios Papoudakis1*, Thomas Coste1*, Jianye Hao1,2, Jun Wang3, Kun Shao1

1Huawei Noah's Ark Lab, 2Tianjin University, 3AI Centre, University College London

filippos.christianos@huawei.com, georgios.papoudakis1@huawei.com, thomas.coste@huawei.com, haojianye@huawei.com, jun.wang@cs.ucl.ac.uk, shaokun2@huawei.com

ABSTRACT

This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt-engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.

1 INTRODUCTION

Smartphone application agents, commonly known as app agents, are expanding the potential applications of artificial intelligence to smartphones and other mobile devices. Such agents could allow users to accomplish a range of tasks, from scheduling appointments and sending messages to purchasing items and booking flights, with minimal effort. Fundamentally, app agents observe user instructions and progressively interact with the smartphone's user interface (by clicking, scrolling, inputting text, etc.) to accomplish the task. However, due to the limited computational resources of smartphones, these agents must be optimised for efficiency, employing lightweight models with minimal memory usage and fast processing speeds.

Recent advancements have leveraged foundation models to develop app agents that understand natural language instructions and execute complex user commands within the smartphone's interface (e.g., Rawles et al., 2024; Bai et al., 2024; Wang et al., 2024b;a). While foundation models enable sophisticated capabilities, relying on them for every action introduces significant drawbacks. Their substantial size and computational complexity make them resource-intensive and impractical for constant use on mobile devices. Alternatively, querying server-hosted foundation models, such as GPT-4o or Gemini, for each task can be prohibitively expensive due to the operational costs of running large models, making this approach impractical for everyday applications. For example, a state-of-the-art GPT-4o-based app agent (e.g., Rawles et al., 2024) may require one to two minutes to run and cost approximately $1.00 per task on average, based on tasks from the evaluated datasets.

To address these limitations, we propose a gated architecture that combines a lightweight transformer network with a small fine-tuned VLM. The task description and the smartphone state are first processed by a compact model (roughly 500 million parameters), which effectively handles most actions. For actions that require natural language understanding, such as composing a text message or querying a search engine, a VLM is invoked to generate the necessary text. This hybrid approach reduces computational demands and improves responsiveness, resulting in significantly faster execution times (30 times faster, down to 3 seconds per task on average) and improved accuracy.

* First authors with equal contribution. Corresponding author.

In the proposed architecture (Lightweight Multi-modal App Control, or LiMAC), the initial processing stage is managed by an Action Transformer (AcT), which is primarily responsible for determining the type of action required to fulfil a user's command. AcT first predicts the action type, such as clicking, inputting text, or scrolling, based on the current state of the smartphone's interface and the task description. For most action types, such as clicks and scrolls, AcT autonomously executes the task. For predicting the targets of the click action, we employ a contrastive objective between the outputs of AcT and the embeddings of each user interface (UI) element. The specific approaches for predicting action types and handling click actions are detailed in Sections 3.3 and 3.5, respectively.

However, when the action type predicted by AcT is input-text or open-app, both of which require deeper prior knowledge and an understanding of natural language nuances, LiMAC passes the selected action type and the user's goal to a fine-tuned VLM to generate the appropriate textual content. This division of labour allows AcT to handle straightforward interactions while leveraging the VLM's advanced capabilities for more complex text generation tasks, ensuring that the system remains both resource-efficient and capable of sophisticated responses. The process of integrating and fine-tuning the VLM in the app agent domain is detailed in Section 3.4.

In summary, the primary contributions of this work are as follows:

- We propose LiMAC, an architecture for app agents that balances efficiency and natural language understanding by combining a lightweight transformer with a fine-tuned VLM.

- We also introduce AcT, a submodule of LiMAC, which is designed to efficiently predict action types and UI element interactions, featuring a novel contrastive objective for click prediction.

- We fine-tune and evaluate two open-source vision-language models (VLMs) specifically for handling text-based actions. Our fine-tuned VLMs achieve performance comparable to or exceeding GPT-4o methods while having 2B parameters or fewer.

- We present experimental results demonstrating that LiMAC improves both task execution time and accuracy (up to 30 times faster and 40% higher accuracy) compared to GPT-4o-based and fine-tuned VLM app agents.

2 TECHNICAL PRELIMINARIES

2.1 PROBLEM FORMULATION

We model phone interaction as a sequential decision-making process. Each task consists of a given goal $g$ that should be completed during an episode. At each timestep $t$ of the episode, the phone's internal state is denoted by $s_t$, while $o_t$ represents an observation of this state, including screen captures and UI element trees. The set of visible UI elements on the screen at timestep $t$ is defined as $I_t$, with $o_{t,i}$ representing the $i$-th UI element at timestep $t$, where $i \in I_t$. Each UI element $i$ is represented by three components: the image that corresponds to the UI element, denoted $o^{img}_{t,i}$; the text that corresponds to the UI element, $o^{txt}_{t,i}$; and the related attributes of the UI element, such as whether it is clickable or not, denoted $o^{attr}_{t,i}$. Therefore, the representation of each UI element can be written as:

$$o_{t,i} = [o^{img}_{t,i},\ o^{txt}_{t,i},\ o^{attr}_{t,i}]. \quad (1)$$

The agent interacts with the phone through actions, denoted as $a_t$ at timestep $t$. Each action is characterised by two components: its type $a^{type}_t \in A^{type}$ (e.g., click, scroll-down, input-text) and its specifications $a^{spec}_t \in A^{spec}$. The specifications vary based on the action type: for clicks, $a^{spec}_t$ might represent the targeted UI element; for typing actions, it would contain the text to be input. Thus, an action can be represented as the tuple $a_t = (a^{type}_t, a^{spec}_t)$. This formulation allows for a flexible representation of diverse actions while maintaining a consistent structure.
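The formulation above can be pictured as two small record types. The following is a minimal sketch under assumptions: the class names, field names, and the `action_is_correct` helper are hypothetical and chosen only for illustration; they do not come from the paper's code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    """One UI element o_{t,i} = [o_img, o_txt, o_attr]."""
    image_ref: str                 # hypothetical handle to the element's image crop
    text: str                      # text shown by or describing the element
    attributes: Tuple[str, ...]    # e.g. ("clickable",)


@dataclass(frozen=True)
class Action:
    """One action a_t = (a_type, a_spec)."""
    action_type: str               # e.g. "click", "scroll-down", "input-text"
    spec: Optional[object] = None  # target element index, text, or None


def action_is_correct(predicted: Action, target: Action) -> bool:
    """Strict match: both the type and the specification must agree."""
    return (predicted.action_type == target.action_type
            and predicted.spec == target.spec)
```

Representing an empty specification as `None` keeps the tuple structure uniform across action types, which mirrors the consistent structure described in the text.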

In this work, the main goal is to learn a model that will maximise action prediction accuracy, which corresponds to correctly predicting both the action type and the action specifications. To achieve this, we train AcT, which predicts $a^{type}_t$. If the predicted action type is click, AcT also predicts $a^{spec}_t$ in the form of UI element targets. We focus on click targets because they are among the most difficult and common actions to predict, and AcT's architecture easily accommodates predicting them with a contrastive learning approach (see Section 3.5). For actions that require natural language specifications (e.g., input-text), we use a VLM fine-tuned on the same dataset.

2.2 SEQUENCE MODELLING WITH TRANSFORMERS

Transformers (Vaswani et al., 2017) have demonstrated exceptional effectiveness in modelling and generating sequential data across a wide range of domains. They excel in various sequence modelling tasks, including those related to language, video processing, and decision-making (Chen et al., 2021). Regardless of the specific application, transformers begin by converting the input into a sequence of vectors. For text, this involves tokenising the input, with each token represented by an embedding vector. In the case of images, the input is typically divided into patches, where each patch is similarly represented by a vector, analogous to the tokenisation process in text. These embeddings, which map tokens or patches to vectors, can either be learned during the model's training or sourced from pre-trained models (e.g., Devlin et al., 2018). The embeddings are fed through several multi-head self-attention layers, which are designed to capture dependencies and contextual relationships between different embeddings in the input sequence. These self-attention mechanisms allow the model to focus on relevant parts of the sequence when processing each embedding, enabling it to handle long-range dependencies more effectively. After passing through multiple layers, each consisting of self-attention and feed-forward components, the final activations from the transformer's last hidden layer are passed through a linear (fully connected) layer. This layer is typically tasked with mapping the learned representations to the output space, whether for classification, prediction, or another specific task. The entire model is trained end-to-end, with backpropagation adjusting both the self-attention layers and the final linear layer to optimise performance on the desired task.

3 THE LIGHTWEIGHT MULTI-MODAL APP CONTROL FRAMEWORK

Our methodology processes the user's goal $g$ and the phone's state at time $t$, utilising AcT, to determine the action type $a^{type}_t$. If the predicted action type is either input-text or open-app, then $g$, $o_t$, and $a^{type}_t$ are passed to a fine-tuned VLM, which is responsible for determining the specific action $a^{spec}_t$. For actions involving clicks, AcT handles the prediction directly but employs a different training objective that contrasts UI element embeddings to determine the most likely interaction target. Accordingly, this section is divided into three parts: predicting the action type, predicting specific actions for text input and app launching, and predicting clicks using our novel approach for interaction with UI elements. The full architecture of LiMAC is presented in Figure 2.
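The gating described in this paragraph can be sketched as a small dispatch function. This is an illustrative sketch only: `select_action` and its three callable parameters are hypothetical stand-ins for AcT's type head, AcT's click head, and the fine-tuned VLM, and are not the authors' interfaces.

```python
# Action types whose specification is free text and is delegated to the VLM.
TEXT_ACTIONS = {"input-text", "open-app"}


def select_action(goal, observation, predict_type, predict_click_target, generate_text):
    """Return (action_type, action_spec) following the gated flow.

    predict_type, predict_click_target and generate_text are caller-supplied
    callables (hypothetical stand-ins for the AcT heads and the VLM).
    """
    action_type = predict_type(goal, observation)
    if action_type in TEXT_ACTIONS:
        # Only these actions pay the cost of auto-regressive text generation.
        return action_type, generate_text(goal, observation, action_type)
    if action_type == "click":
        return action_type, predict_click_target(goal, observation)
    # Remaining action types carry no specification in this sketch.
    return action_type, None
```

Because the expensive generator is reached only through the first branch, the average cost per step depends on how often text actions occur in the data.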

We refer to our method as lightweight because it uses fewer parameters on average during inference than baselines and, as we will show in Section 4.3, has faster inference speeds. The AcT module has only 520M parameters, and the additional VLM component is called for less than 15% of actions in our datasets. LiMAC also selects actions more efficiently than a single VLM, as AcT does not require auto-regressive generation. While our approach has a higher memory footprint than solely using VLMs, due to loading both the AcT module and a VLM, its low parameter count remains within the capacity of modern devices (Li et al., 2024b; Laskaridis et al., 2024).

3.1 MODEL INPUTS

AcT, the model responsible for predicting the action type (and later the click target, as seen in Section 3.5), is built on top of a typical transformer architecture. However, unlike standard transformers, where tokens represent words or characters, our tokens are pretrained embeddings that are mapped to the hidden dimension of the transformer. These tokens represent three key components: the user's goal $g$, the UI elements on the phone's screen $o_{t,i}$, and the possible actions. By using these pretrained embeddings as input, we allow the model to effectively capture the relationships between the user's intent, the current state of the interface, and the set of available actions. We encode each key component (UI elements, actions, and goal) into embeddings that can be processed by the transformer. Below, we describe the encoding process for each type of input.

Goal: We encode the user's textual goal $g$ using a sentence encoder, resulting in the embedding $e_g = f_{txt}(g)$. This embedding captures the user's intent and serves as the first token to the transformer.


Figure 1: Illustration of AcT. Each UI element is encoded separately into a vector $e_{t,i}$ using pretrained embedding models. The embeddings are then fed into the transformer's input sequence $x_t$ along with those of the previous timesteps in the episode. The prediction of the transformer is decoded to produce the next action, which consists of $a^{type}_t$ and $a^{spec}_t$.

UI Elements: The observed representation of each UI element $o_{t,i}$ at time $t$ is transformed into a vector $e^{ui}_{t,i}$ through several embedding functions. First, the text component is encoded using a sentence encoder (e.g., BERT), $e^{txt}_{t,i} = f_{txt}(o^{txt}_{t,i})$, and the image is encoded using a fine-tuned CLIP visual encoder (Radford et al., 2021), $e^{img}_{t,i} = f_{img}(o^{img}_{t,i})$. Additionally, any other attributes (e.g., clickable, editable, nested) are encoded into $e^{attr}_{t,i} = f_{attr}(o^{attr}_{t,i})$. The final embedding for each UI element is the concatenation of these vectors, $e^{ui}_{t,i} = [e^{attr}_{t,i}; e^{txt}_{t,i}; e^{img}_{t,i}]$. We also add a positional encoding $p_i \in \mathbb{R}^d$ to represent the order or nesting of UI elements: $e^{ui}_{t,i} \leftarrow e^{ui}_{t,i} + p_i$. This process is illustrated in Figure 1. To adapt the visual encoder $f_{img}$ to app control datasets, we fine-tune CLIP with the standard contrastive learning objective (Radford et al., 2021), minimising the InfoNCE loss (Oord et al., 2018) over the screenshots of the observations and the related UI trees so that the image and text representations of UI elements are aligned. Similar methods of representing each UI element as an embedding for the transformer have been suggested by Li et al. (2020) and Rawles et al. (2023), with the key distinction that our approach additionally fine-tunes the vision encoder to better adapt it for app control tasks.
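As a rough illustration of the concatenate-then-add-position step, the sketch below works on plain Python lists. The function name `embed_ui_element` is hypothetical, the three input vectors stand in for the outputs of the attribute, text, and image encoders, and any projection to the transformer's hidden dimension is deliberately left out.

```python
def embed_ui_element(attr_vec, txt_vec, img_vec, pos_vec):
    """Concatenate [attr; txt; img] and add a positional encoding element-wise.

    All arguments are sequences of floats; pos_vec must have the same length
    as the concatenated vector (an assumption of this sketch).
    """
    concat = list(attr_vec) + list(txt_vec) + list(img_vec)
    if len(concat) != len(pos_vec):
        raise ValueError("positional encoding must match the concatenated size")
    return [c + p for c, p in zip(concat, pos_vec)]
```

For example, `embed_ui_element([1.0], [2.0, 3.0], [4.0], [0.5] * 4)` yields a single four-dimensional vector for that element.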

Actions: Each action is represented using two embeddings: the action type embedding, which is mapped to its corresponding learnable embedding $e^{type}$, and, for actions requiring a specification (e.g., the target of a click action), the specification embedding $e^{spec}$. Depending on the action type, the action specification embedding is computed differently (e.g., a sentence embedding for the textual action, learnable embeddings mapped to the UI element's id for click targets, or a special token for empty specifications). Each action contributes two tokens to the transformer's input sequence, clearly separating action types from their parameters.

Positional Embeddings: To represent temporal information, we also add a learnable positional encoding $p_t$ for all the embeddings in a timestep.

3.2 CONSTRUCTING THE INPUT SEQUENCE

After generating the goal, UI element, and action embeddings, we organise them into a sequence representing the entire episode. Each episode in the dataset is encoded as a sequence of embeddings $x$, which is fed into the transformer. The sequence starts with the goal embedding $e_g$, followed by the UI element embeddings $e^{ui}_{0,i}$ at timestep 0. Once all UI elements are encoded, a special end marker $e^{end}$ is added. The action type $e^{type}_0$ and specification $e^{spec}_0$ embeddings for timestep 0 are then appended. This process repeats for each subsequent timestep: encoding UI elements, appending $e^{end}$, and adding the action embeddings. For an episode with $H$ timesteps, the final sequence is:

$$x = \left[e_g;\ e^{ui}_{0,0};\ \dots;\ e^{ui}_{0,n};\ e^{end};\ e^{type}_0;\ e^{spec}_0;\ \dots;\ e^{ui}_{H-1,0};\ \dots;\ e^{ui}_{H-1,n};\ e^{end};\ e^{type}_{H-1};\ e^{spec}_{H-1}\right]$$


Figure 2: The architecture of LiMAC. The history of observations and actions $\{o_t, a_{t-1}, o_{t-1}, \dots\}$ and the goal $g$ are processed into a vector $x$ and passed to AcT. The image observation $o^{img}_t$ with the bounding boxes and the goal $g$ are passed as inputs to the VLM. The VLM is only called if an action that requires text completion is selected, based on the action type output of AcT. The action is finally selected based on the protocol described in Sections 3.3 to 3.5.

During training, the full sequence is fed into the transformer. For inference at timestep $t$, the sequence up to the $t$-th observation is processed, with the hidden state $h_t$ (up to $e^{end}$) used to predict the action.
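The ordering of tokens in the episode sequence, and the truncation used at inference, can be sketched with symbolic tokens in place of embedding vectors. The function names and the `"<end>"` string are hypothetical placeholders for illustration.

```python
END = "<end>"  # placeholder for the special end marker e_end


def build_sequence(goal_token, timesteps):
    """Training-time layout: goal, then per timestep UI tokens, END, type, spec.

    timesteps is a list of (ui_tokens, action_type_token, action_spec_token).
    """
    seq = [goal_token]
    for ui_tokens, a_type, a_spec in timesteps:
        seq.extend(ui_tokens)
        seq.append(END)
        seq.extend([a_type, a_spec])
    return seq


def inference_prefix(goal_token, past_timesteps, current_ui_tokens):
    """Inference-time layout: full past timesteps, then the current
    observation up to and including END, where the action is predicted."""
    seq = build_sequence(goal_token, past_timesteps)
    seq.extend(current_ui_tokens)
    seq.append(END)
    return seq
```

In this layout the position of the last `END` token is where the hidden state used for action prediction would be read.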

3.3 ACTION TYPE PREDICTION

In our pipeline, the prediction of the next action begins with determining the action type. Predicting the action type $a^{type}_t$ can be framed as a classification problem, where we identify a combined eleven distinct action types (see Appendix A), of which a subset appears in each individual dataset used in this work. These action types represent various possible interactions, such as click, open-app, scroll-down, input-text, or other essential commands. We implement the action type prediction with a specialised head. The action type head, denoted as $f_{type}$, transforms the final hidden state $h_t$ of the transformer (after the $e^{end}$ token) into a probability distribution over the possible action types, $p(a^{type}_t \mid h_t) = f_{type}(h_t)$. The learning objective for this task is to minimise the cross-entropy loss between the predicted and actual action types. Given a dataset $D$, the cross-entropy loss for action type prediction is defined as:

$$L_{type} = -\mathbb{E}_{a^{type},\, x \sim D}\left[\log p(a^{type} \mid h)\right] \quad (2)$$

Here, $h$ is the transformer's final hidden state before action prediction, and the expectation averages the loss over all observations in the dataset. This loss function ensures that the model is trained to correctly classify the action type based on the sequence of embeddings from previous steps.
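To make the loss in Equation (2) concrete, here is a small pure-Python sketch of softmax cross-entropy averaged over a batch. The function names are hypothetical, and the logits stand in for the pre-softmax outputs of the action type head.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def type_loss(batch_logits, batch_targets):
    """Mean negative log-likelihood of the target action type index."""
    losses = [-math.log(softmax(logits)[target])
              for logits, target in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)
```

With eleven action types, an uninformed head that outputs equal logits incurs a loss of $\log 11$, which gives a useful reference point when reading training curves.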

3.4 LEVERAGING FINE-TUNED VLMS FOR TEXT GENERATION IN ACTION EXECUTION

As described in the previous section, our agent first predicts the action type. Among the eleven action types, two specifically require textual specifications: i) the input-text action, where the specification is the text to be entered into a text box, and ii) the open-app action, where the specification is the name of the application to be opened. For these actions, we rely on fine-tuning a VLM using an app control dataset. The dataset provides action data in a dictionary-like format, such as: {"action-type":"open-app","app-name":"Chrome"}, with one key corresponding to the action type and another to the action specification. The VLM is trained to generate the correct sequence of tokens that corresponds to the successful completion of each action, optimising for the likelihood of generating the proper tokens based on the observation at each timestep.

During inference, after predicting the action type, AcT guides the VLM to start its response with this action type. For instance, if AcT predicts input-text as the action type, the VLM is forced to begin its response with the token pattern: {"action-type":"input-text","text":. The model then completes the specification, producing $a^{spec}_t$, the textual content needed for the action. The full action selection pipeline is presented in Figure 2.
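The forced-prefix idea can be sketched with plain string handling. This is only an illustration: `forced_prefix` and `complete_action` are hypothetical helper names, the mapping uses the two keys quoted in this section, and the generated tail is supplied by hand rather than by a real VLM.

```python
import json

# Specification key per text action, taken from the examples in this section.
SPEC_KEY = {"input-text": "text", "open-app": "app-name"}


def forced_prefix(action_type):
    """Build the string the generator is forced to start its response with."""
    key = SPEC_KEY[action_type]
    return '{"action-type":' + json.dumps(action_type) + ',' + json.dumps(key) + ':'


def complete_action(action_type, generated_tail):
    """Join the forced prefix with the generated continuation and parse it."""
    return json.loads(forced_prefix(action_type) + generated_tail)
```

For instance, `complete_action("open-app", '"Chrome"}')` parses to the dictionary form shown earlier in this section. Fixing the prefix means the generator only has to produce the specification, not re-decide the action type.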

3.5 EFFICIENT CLICK TARGETING USING CONTRASTIVE OBJECTIVES WITH ACT

Having covered how action specifications are generated for textual actions, we now turn to the case of click actions, where the specification is the UI element to interact with. To predict the correct UI element for a click action, we employ a contrastive learning approach that operates over the entire episode, using cosine similarity and a learnable temperature parameter. Since the number of UI elements varies across timesteps and episodes, a contrastive method is better suited than classification, which can suffer from class imbalance and limitations when handling more UI elements in test episodes than seen during training. Let $h^{type}_t$ be the transformer's last hidden state up to embedding $e^{type}_t$, and $f_{target}$ be an affine transformation that projects the hidden states to an embedding space. Simultaneously, the hidden states of the transformer corresponding to the UI element embeddings, denoted as $h^{ui}$, are also projected into the same embedding space:

$$q^{type}_t = f_{target}(h^{type}_t) \quad \text{and} \quad p^{ui} = f_{target}(h^{ui}) \quad (3)$$

Assuming the embedding space lies in $\mathbb{R}^D$, the query embedding $q^{type}_t$ has dimensions $1 \times D$, while the matrix $p^{ui}$, representing all UI elements, has dimensions $K \times D$, where $K$ is the total number of UI elements in the episode. The goal is to train the model such that $q^{type}_t$ aligns closely with the correct UI element's embedding at timestep $t$, using cosine similarity as the alignment measure. To achieve this, we adopt contrastive training techniques with the InfoNCE loss (Oord et al., 2018). We first compute the similarity matrix between the query embedding $q^{type}_t$ and all UI element embeddings, scaling the similarity by a learnable parameter $\tau$ (e.g., Radford et al., 2021). The scaled cosine similarity matrix is defined as:

$$S = \frac{q\, p^{\top}}{\lVert p \rVert_r \cdot \tau} \quad (4)$$

where $\lVert p \rVert_r$ is the L2 norm of each row of $p$. For simplicity, we drop the superscripts in this equation. The InfoNCE loss for UI element selection across the episode is computed as:

$$-\log \frac{\exp(S_+)}{\sum_{i=1}^{K} \exp(S_i)} \quad (5)$$

Here, $S_+$ is the scaled similarity between the transformer's output and the correct UI element for the click action, and $S_i$ is the scaled similarity between the output and the $i$-th UI element, so the sum in the denominator runs over all $K$ UI elements in the episode. During inference, for each action requiring a target element, the UI element with the highest similarity is selected. This contrastive approach enables AcT to effectively learn which UI elements to interact with during a click action by treating all other UI elements in the episode as negative examples. The use of cosine similarity focuses on the directional alignment of the embeddings, while the learnable temperature $\tau$ adjusts the sharpness of the similarity distribution during training, allowing for more flexible and precise UI element selection.
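A compact, pure-Python sketch of the click-target objective follows. It is illustrative rather than the authors' implementation: the function names are hypothetical, the similarity normalises both the query and each UI embedding (a full cosine similarity, which is an assumption of this sketch), the temperature is applied by division, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def scaled_similarities(query, ui_embeddings, tau):
    """One temperature-scaled similarity per UI element in the episode."""
    return [cosine(query, p) / tau for p in ui_embeddings]


def info_nce(query, ui_embeddings, positive_index, tau):
    """Negative log-softmax of the positive element's scaled similarity."""
    s = scaled_similarities(query, ui_embeddings, tau)
    m = max(s)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(x - m) for x in s))
    return log_denominator - s[positive_index]


def select_target(query, ui_embeddings, tau=1.0):
    """Inference: index of the UI element with the highest similarity."""
    s = scaled_similarities(query, ui_embeddings, tau)
    return max(range(len(s)), key=s.__getitem__)
```

Because the candidate set is simply whatever embeddings are passed in, the same functions work for episodes with any number of UI elements, which is the property that motivates the contrastive formulation over a fixed-size classifier.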

4 EXPERIMENTS

4.1 EXPERIMENTAL SETUP

Datasets: Our experiments focus on two open-source mobile phone control datasets, AndroidControl (Li et al., 2024a) and Android-in-the-Wild (AitW) (Rawles et al., 2023). Both contain extensive human demonstrations of mobile phone navigation across a wide variety of tasks. In AndroidControl, every episode is defined by a specific goal, accompanied by a sequence of observations and actions. Each observation includes a screenshot from the phone and its corresponding UI tree. Conversely, observations in AitW lack the UI tree. As a result, it is necessary to extract the UI tree using an OCR system that identifies all the UI elements and provides a brief description of each. More details on the goal format, observation space, and action space for each dataset can be found in Appendix A.

GPT-4o Baselines: We compare our approach against four prompt-engineering baselines that use GPT-4o to generate actions in the evaluation dataset. First, we evaluate two baselines proposed by Rawles et al. (2024): the text-based T3A and the multi-modal M3A. In T3A, the observation is represented as a list of UI elements, while M3A includes screenshots of the observation. Additionally, we evaluate two variants of the SeeAct agent (Zheng et al., 2024), adapted for mobile app control tasks (Rawles et al., 2024): SeeAct$_{choice}$ and SeeAct$_{ann}$, which use the UI tree text and screenshots of the observations, respectively, to determine the correct action. More details about the prompt-engineering baselines are presented in Appendix B.

Table 1: Comparison of models in terms of average inference time and overall accuracy on the AitW and AndroidControl datasets. The table presents the size of each model, the average inference time (in seconds, lower is better), and the overall accuracy (higher is better) for both datasets.

| Model | Size | Avg Inf. Time (s) | Overall (AitW) | Overall (AndCtrl) |
|---|---|---|---|---|
| SeeAct$_{choice}$ | unk | 9.81 | 37.7 | 29.9 |
| SeeAct$_{ann}$ | unk | 9.76 | 42.5 | 35.5 |
| T3A | unk | 4.87 | 26.9 | 53.1 |
| M3A | unk | 10.64 | 35.6 | 57.5 |
| Florence2 | 820M | 0.50 | 70.8 | 57.0 |
| LiMAC with Florence2 (ours) | +520M | 0.34 | 72.2 | 63.1 |
| Qwen2-VL | 2B | 3.03 | 51.0 | 52.2 |
| LiMAC with Qwen2-VL (ours) | +520M | 0.63 | 70.9 | 62.5 |

Vision Language Models (VLMs): We fine-tune two VLMs for our experiments. The first, Florence2 (Xiao et al., 2024), is an 820M-parameter VLM that takes as input an annotated screenshot with numbered bounding boxes, along with the task goal in natural language. Florence2 is trained to maximise the log-likelihood of the correct action tokens from the dataset. Similarly, we fine-tune Qwen2-VL (Bai et al., 2023), a 2B-parameter VLM, using LoRA adapters (Hu et al., 2021). Qwen2-VL follows the same pipeline as Florence2, taking the annotated screenshot and goal as inputs, with supervision provided by the correct action. In most of our experiments, these fine-tuned VLMs are tested in conjunction with AcT (forming LiMAC).

4.2 EVALUATION PIPELINE

We evaluate on the test set of two datasets, using the same process for all models, with only the observation format and model calling differing. For each timestep, we call the model with the relevant observation format to generate an action. VLMs are trained to return actions in a specific format, while pre-trained models use a detailed prompt with the observation, as in Rawles et al. (2024). One can calculate strict accuracy by directly comparing returned actions to the ground truth. However, in this work we relax this metric for a more practical assessment, where a UI element is deemed correct if its bounding box is within the target element, as described by Li et al. (2024a). For input-text actions, correctness is determined by a Jaccard index score of at least 0.5, reflecting the functional equivalence of similar inputs in search bars. We report the relaxed accuracy metrics in Tables 1 and 2.
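The relaxed checks described above can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: the Jaccard index is computed over lowercase, whitespace-separated word sets, bounding boxes are `(left, top, right, bottom)` tuples, and "within" is read as full containment. All function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index over lowercase word sets (tokenisation is an assumption)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def text_matches(predicted, target, threshold=0.5):
    """Relaxed input-text check: Jaccard index of at least the threshold."""
    return jaccard(predicted, target) >= threshold


def box_within(inner, outer):
    """True if box `inner` lies entirely inside box `outer`.

    Boxes are (left, top, right, bottom) tuples, an assumed convention.
    """
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])
```

Under these assumptions, a predicted query of "cheap flights to Rome" against a target of "flights to Rome" shares three of four distinct words and would count as correct.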

In Section 4.5 we also evaluate the models' action-type and click-target accuracy. Action-type accuracy reflects how well the model predicts the correct type for an action, regardless of specifications such as the text content or target element. Click-target accuracy measures how accurately the model predicts the correct target for click actions when the action type is known. Computing the click-target accuracy requires rerunning a full evaluation over the dataset, where the output of the model is constrained to predict the click action and specify the target element. It should be noted that higher overall accuracy can still be achieved with lower action-type and/or click-target accuracy. This is because click-target accuracy is calculated separately, and because not all action types are equally advantageous for overall accuracy. Indeed, as defined in Section 2.1, an action is represented as $a_t = (a^{type}_t, a^{spec}_t)$, where both $a^{type}_t$ and $a^{spec}_t$ must be predicted correctly for a successful timestep. Actions which always have a null $a^{spec}_t$, like wait, are easier to predict correctly than those which have a complicated $a^{spec}_t$ that may be incorrectly predicted, like input-text.


Table 2: Performance comparison of various model configurations using different combinations of modules across the AitW and AndroidControl datasets. Integrating AcT with baseline methods through LiMAC improves accuracy and reduces inference time (and cost). Not all pairings are shown here for conciseness; the full list can be found in Table 6.

| Framework | Type module | Click module | Text module | Avg Inf. Time (s) | Overall (AitW) | Overall (AndCtrl) |
|---|---|---|---|---|---|---|
| T3A only | T3A | T3A | T3A | 4.87 | 26.9 | 53.1 |
| LiMAC (ours) | AcT | T3A | T3A | 4.03 | 42.7 | 65.4 |
| LiMAC (ours) | AcT | AcT | T3A | 1.04 | 69.8 | 63.2 |
| M3A only | M3A | M3A | M3A | 10.64 | 35.6 | 57.5 |
| LiMAC (ours) | AcT | M3A | M3A | 8.40 | 52.6 | 66.8 |
| LiMAC (ours) | AcT | AcT | M3A | 1.87 | 70.0 | 62.5 |
| Florence only | Florence2 | Florence2 | Florence2 | 0.50 | 70.8 | 57.0 |
| LiMAC (ours) | AcT | Florence2 | Florence2 | 0.72 | 71.6 | 61.1 |
| LiMAC (ours) | AcT | AcT | Florence2 | 0.34 | 72.2 | 63.1 |
| Qwen only | Qwen2-VL | Qwen2-VL | Qwen2-VL | 3.03 | 51.0 | 52.2 |
| LiMAC (ours) | AcT | Qwen2-VL | Qwen2-VL | 2.64 | 55.7 | 59.1 |
| LiMAC (ours) | AcT | AcT | Qwen2-VL | 0.63 | 70.9 | 62.5 |
| LiMAC (ours) | AcT | M3A | T3A | 7.57 | 52.4 | 67.4 |

4.3 MEASURING END-TO-END ACCURACY

In this section, we present the total action accuracy of our method, as well as the baselines. Table 1 presents the action prediction accuracy on AitW and AndroidControl. On both datasets, we observe that LiMAC consistently outperforms Florence2, Qwen2-VL, and GPT-4o-based baselines with respect to action prediction accuracy, demonstrating superior generalisation to the held-out test set. LiMAC's higher accuracy on AitW than on AndroidControl can be attributed to the closer alignment between the training and test sets, as the test set includes the same set of instructions but applied to mobile devices with varying characteristics, such as screen size and Android version. Additionally, we observe a significant performance drop on AitW in text-based baselines like T3A and image-text-based models like M3A and SeeAct. The absence of original UI trees in the AitW dataset can explain this decline. Since UI trees must be extracted from images using an annotation tool, inaccuracies are often introduced, which diminishes the performance of models that rely on text-based output conditioning. This underscores a key advantage of LiMAC, which remains robust even when UI trees are imprecise or completely missing (as seen in Table 4), with minimal impact on overall performance.

4.4 COMBINING DIFFERENT MODULES

LiMAC is a modular architecture that enables the integration of different modules for tasks such as predicting the action type, identifying the target element in click actions, and generating text for open-app and input-text. In this architecture, we primarily use AcT to predict both the action type and the target element for click actions. However, alternative modules can be employed for these predictions as well. In Table 2, we present combinations of different models, excluding SeeAct due to its low overall accuracy, and compare their performance across two datasets.

In the Android Control dataset, we observe that using M3A for predicting the target elements in click actions improves performance over using Ac T alone. This demonstrates that GPT-4o is highly effective at identifying the correct target element when the prompt specifies that the action is click. This of courses comes at the cost of calling GPT-4o, which significantly increases the inference time. The highest overall accuracy is achieved when Li MAC is used to predict the action type, M3A is applied for target element prediction, and T3A is used for text generation. In the Ait W dataset,

Published as a conference paper at ICLR 2025

Table 3: Action-type, click-target, and text accuracies across module combinations on the Ait W and Android Control datasets. Li MAC achieves the best action-type accuracy in both datasets and the best click-target accuracy in Ait W, while our fine-tuned Florence2 excels at text prediction.

| Framework | Type module | Click module | Text module | Action Type (Ait W) | Action Type (And Ctrl) | Click Target (Ait W) | Click Target (And Ctrl) | Text (Ait W) | Text (And Ctrl) |
|---|---|---|---|---|---|---|---|---|---|
| See Act only | See Actchoice | See Actchoice | See Actchoice | 67.1 | 66.8 | 36.9 | 48.5 | 69.4 | 67.1 |
| See Act only | See Actann | See Actann | See Actann | 68.2 | 66.8 | 44.7 | 55.7 | 66.0 | 61.8 |
| T3A only | T3A | T3A | T3A | 56.2 | 67.7 | 33.5 | 71.1 | 66.5 | 78.4 |
| M3A only | M3A | M3A | M3A | 63.8 | 69.8 | 48.3 | 77.1 | 67.3 | 74.3 |
| Qwen only | Qwen2-VL | Qwen2-VL | Qwen2-VL | 81.7 | 70.7 | 53.2 | 55.2 | 70.5 | 75.7 |
| Li MAC (ours) | Ac T | Qwen2-VL | Qwen2-VL | 86.9 | 82.3 | 53.2 | 55.2 | 70.5 | 75.7 |
| Li MAC (ours) | Ac T | Ac T | Qwen2-VL | 86.9 | 82.3 | 77.4 | 65.4 | 70.5 | 75.7 |
| Florence only | Florence2 | Florence2 | Florence2 | 86.4 | 79.6 | 76.2 | 62.0 | 84.2 | 77.5 |
| Li MAC (ours) | Ac T | Florence2 | Florence2 | 86.9 | 82.3 | 76.2 | 62.0 | 84.2 | 77.5 |
| Li MAC (ours) | Ac T | Ac T | Florence2 | 86.9 | 82.3 | 77.4 | 65.4 | 84.2 | 77.5 |

Table 4: Evaluation of three ablated versions of Li MAC using different types of input, on Android Control. For actions that require text completion, we use the fine-tuned Florence2.

| Model | Size | Action Type | Click Target | Overall |
|---|---|---|---|---|
| Li MAC | 520M | 82.3 | 65.4 | 63.1 |
| Li MAC (no CLIP FT) | 520M | 81.9 | 62.3 | 60.0 |
| Li MAC (no img) | 433M | 82.4 | 54.9 | 56.0 |
| Li MAC (no txt) | 410M | 83.2 | 65.7 | 63.0 |

Li MAC combined with Florence for text generation yields the highest accuracy. This outcome is expected, as both M3A and T3A show significantly lower accuracy in this dataset (see Table 1).

4.5 ABLATION STUDIES

Table 3 presents the action-type, click-target, and text accuracies for various module combinations across the two datasets. The results show that Li MAC, particularly the Ac T, achieves the best performance in action-type prediction. In the Android Control dataset, M3A and T3A perform well in click-target and text prediction but struggle with action-type accuracy, and they underperform in the automatically annotated Ait W dataset. Overall, Ac T within Li MAC excels at click-target predictions while being significantly smaller. Finally, our Florence fine-tune excels at text prediction, significantly outperforming GPT-4o baselines in Ait W and remaining competitive in Android Control.

Lastly, we present three ablation studies to further explore Ac T design choices. A core feature of Ac T is its ability to process each UI element as a distinct embedding within the transformer, created by concatenating the image, text, and attribute embeddings of the corresponding UI element. To assess the impact of the image and text modalities, as well as the CLIP fine-tuning on Li MAC's performance, we compare it to three ablated versions: one that excludes the image component, another that omits the UI text in the embedding process, and one that uses the original CLIP for encoding the image embeddings instead of the fine-tuned version. The evaluation metrics for these comparisons in the Android Control dataset and using Florence2 for text completion are shown in Table 4. The results demonstrate that removing image embeddings significantly reduces click-target accuracy (65.4 to 54.9) and overall accuracy (63.1 to 56.0), while action-type accuracy is essentially unchanged, highlighting the crucial role of visual information in Ac T. In contrast, omitting the text embeddings has only a slight effect on performance, suggesting that Ac T can function effectively using only screenshots of observations without accessing the UI tree. Additionally, we observe that fine-tuning CLIP (see Section 3.1) is an important factor in improving the overall accuracy of Li MAC.
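As a rough illustration of the per-element embedding and its ablations, the sketch below concatenates image, text, and attribute vectors for one UI element. The function name, the flags, and the toy dimensions are assumptions made for this example, and modelling an ablation by leaving a vector out is a simplification; see Section 3.1 for the actual encoders.

```python
from typing import List, Sequence


def element_embedding(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    attr_emb: Sequence[float],
    use_image: bool = True,
    use_text: bool = True,
) -> List[float]:
    """Concatenate the per-modality vectors of one UI element.

    Dropping a modality (as in the "no img" / "no txt" ablations) is modelled
    here by simply leaving its vector out; this is an assumption of the sketch.
    """
    parts: List[Sequence[float]] = []
    if use_image:
        parts.append(image_emb)
    if use_text:
        parts.append(text_emb)
    parts.append(attr_emb)  # the attribute embedding is always kept
    return [value for part in parts for value in part]


# Toy vectors (dimensions are arbitrary).
img, txt, attr = [0.1, 0.2, 0.3], [0.4, 0.5], [1.0]
full = element_embedding(img, txt, attr)
no_img = element_embedding(img, txt, attr, use_image=False)
no_txt = element_embedding(img, txt, attr, use_text=False)
print(len(full), len(no_img), len(no_txt))
```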

These findings underscore the importance of visual features and the benefits of fine-tuning pre-trained models like CLIP in our context. The minimal impact of removing text embeddings indicates that Li MAC is robust even when textual information is limited or unavailable, which is advantageous in scenarios where UI trees are inaccessible or incomplete. Future work could explore integrating other modalities or further optimising the embedding process to enhance performance.

5 RELATED WORK ON APP CONTROL

Though graphical user interface (GUI) control mainly started with web-based datasets and foundation model agents (Shi et al., 2017; Liu et al., 2018; Yao et al., 2022a; Deng et al., 2023; Furuta et al., 2023; Gur et al., 2023; Zheng et al., 2024), there has recently been a significant focus on mobile phone control. This can be seen both in the rapid development of Android navigation datasets, environments, and benchmarks (Rawles et al., 2023; 2024; Li et al., 2024a; Chen et al., 2024), and in that of mobile control agents (Yang et al., 2023; Wang et al., 2024b;a; Wen et al., 2023; Hong et al., 2024; Rawles et al., 2024; Li et al., 2024a; Bai et al., 2024; Wang et al., 2024c). Though many agents are published with their own specific evaluation data, popular datasets such as Android-in-the-Wild (Rawles et al., 2023) or Android Control (Li et al., 2024a) are often used as benchmarks. Agents developed for this task can be divided into two clear input types: text-based, using UI accessibility tree or XML information to describe the screen, or image-based. Image-based agents require vision models, which are capable of directly processing image inputs, and are usually backed by VLMs. On the other hand, text-based agents are backed by classical LLMs. Image-based agents also often take a combination of text and image as input to the model. Many mobile control agents propose intricate prompting methods backed by off-the-shelf, often proprietary, LLMs such as GPT-4 (Rawles et al., 2024; Yang et al., 2023; Wang et al., 2024b;a; Wen et al., 2023; Zheng et al., 2024). Although this requires little to no training, it can be both slow and expensive. Moreover, these models cannot be further tailored and trained for specific tasks. As such, another approach is to build mobile control agents around the fine-tuning of foundation models on Android control datasets such as Ait W or Android Control.
Firstly, both Ait W and Android Control present results for a fine-tuned LLM on their dataset, alongside the dataset itself. For example, Li et al. (2024a) train various PaLM 2 (Anil et al., 2023) models on their dataset. However, these models are proprietary and reportedly quite large, with the base PaLM 2 model reported to have over 300B parameters. CogAgent (Hong et al., 2024) also performs fine-tuning on an 18B-large VLM. Bai et al. (2024) propose a different approach, called DigiRL, using RL to train their 1.3B VLM. This achieves strong performance but has limitations such as data-gathering cost and simulation difficulty, leading to the model only being adept on a small subset of Ait W.

6 CONCLUSION

In summary, we propose Li MAC, a lightweight framework designed to address app control tasks. Li MAC extracts UI elements from each phone screenshot and encodes them using specialised vision and text modules. These UI element encodings are then passed as embeddings to Ac T, which predicts the type and specifications of the next action. Ac T focuses on two key aspects of actions: the action type and the target element when the predicted action is click. For actions requiring text generation, Li MAC uses a fine-tuned VLM to ensure successful completion. We compare Li MAC against six baselines supported by state-of-the-art foundation models and evaluate them on two open-source datasets. Our results show that Li MAC can outperform the baselines while requiring significantly less computational time for both training and inference. This demonstrates that Li MAC is capable of handling task completion on devices with limited computational capabilities.

One of the main limitations of the proposed method is the limited training data. Li MAC is trained on just 13K and 18K episodes for Android Control and Ait W, respectively. The absence of any pretraining further hinders the model's ability to improve performance on more complex tasks. In the future, we aim to enhance the model's performance by incorporating online learning techniques, such as reinforcement learning. After the initial training stage presented in this work, Li MAC could interact with an Android emulator to generate additional data. By using a suitable reward function, or even leveraging GPT-4 to evaluate the generated trajectories and assign rewards (Bai et al., 2024), we could fine-tune Li MAC to improve the completion rate of tasks. An important focus for future work will be to develop error handling and recovery mechanisms to enable high success rates and robustness in online interactions. Another area of future research could address the safety of such models when handling sensitive data, such as credit card information and personal identifiers. It is essential to design foundation models with robust security protocols to protect against data breaches, especially when interacting with mobile phones containing sensitive information.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62422605, 92370132).

REFERENCES

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. arXiv preprint arXiv:2406.11896, 2024.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.

Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, and Kun Shao. Spa-bench: A comprehensive benchmark for smartphone agent evaluation. 2024.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 2021.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2023.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, and Izzeddin Gur. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854, 2023.

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14281-14290, 2024.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, and Hamed Haddadi. Melting point: Mobile evaluation of language transformers. 2024.

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on computer control agents. arXiv preprint arXiv:2406.03679, 2024a.

Xiang Li, Zhenyan Lu, Dongqi Cai, Xiao Ma, and Mengwei Xu. Large language models on mobile devices: Measurements, analysis, and insights. In Proceedings of the Workshop on Edge and Mobile Foundation Models, pp. 1-6, 2024b.

Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile ui action sequences. In 58th Annual Meeting of the Association for Computational Linguistics, 2020.

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations (ICLR), 2018.

Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 5, 2017.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, 2021.

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36, 2023.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024.

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pp. 3135-3144. PMLR, 2017.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 2014.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. arXiv preprint arXiv:2406.01014, 2024a.

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024b.

Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, and Kun Shao. Distrl: An asynchronous distributed reinforcement learning framework for on-device control agents. 2024c.

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Autodroid: Llm-powered task automation in android. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pp. 543-557, 2023.

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4818-4829, 2024.

Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744-20757, 2022a.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022b.

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.

A DATASET FORMAT

We use the Android Control and Ait W datasets. While we use the full Android Control dataset, for Ait W we only select a few episodes for each unique instruction, due to the sheer size of the dataset and the repetitive nature of its instructions. The Ait W dataset is divided into five categories of tasks, of which we use the "Google Apps", "Install", and "Web Shopping" splits, since the other two contain single-step and Q&A tasks. We process the episodic data present in these datasets into refined, step-wise datapoints which can be used for training and evaluation. Each datapoint is composed of the high-level goal for the task, an observation of the current screen, and the correct action. Details are given below.

Goal: The goal is always a raw text string describing the high-level instruction for the episode to which the datapoint belongs.

Observation: The exact form of the observation depends on the type of model it is used for. Text-based approaches such as T3A need a textual observation. For Android Control, we use the provided accessibility UI trees, which we further process into a list of UI elements containing information such as the element type, description, and attributes (clickable, editable, selected, etc.). As in Li et al. (2024a), we filter these to retain only important elements, namely those that contain text or have critical attributes. For Ait W, OCR representations of text and icons are given in the dataset, but no comprehensive UI trees are provided. Therefore, to obtain the final representation, each element must be identified and converted using a similar procedure to that for Android Control. Vision models such as Qwen2-VL and Florence2 expect an image-based observation. This observation consists of the current phone screenshot along with an overlay of the UI element bounding boxes and their index. Finally, some models, such as M3A and ours, use a mixture of observations, both text and image-based. In particular, our model expects a text-based list of UI elements similar to the one described above, as well as a list of cropped images. The list of cropped images corresponds to each of the UI elements in the text-based observation and is used by our model as described in Section 3.1.
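To make the text-based observation concrete, the following sketch filters a list of UI elements down to those that contain text or carry a critical attribute. The `UIElement` fields and the choice of which attributes count as critical are assumptions for illustration; the exact filter follows Li et al. (2024a) and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    # Hypothetical schema; field names are not taken from either dataset.
    element_type: str
    description: str = ""
    text: str = ""
    clickable: bool = False
    editable: bool = False
    selected: bool = False
    bbox: Tuple[int, int, int, int] = (0, 0, 0, 0)  # (left, top, right, bottom)


def is_important(element: UIElement) -> bool:
    """Keep elements with text/description or a critical attribute (assumed set)."""
    has_text = bool(element.text.strip() or element.description.strip())
    return has_text or element.clickable or element.editable


def filter_elements(elements: List[UIElement]) -> List[UIElement]:
    return [element for element in elements if is_important(element)]


raw = [
    UIElement("TextView", text="Search"),
    UIElement("FrameLayout"),                  # no text, not interactive
    UIElement("ImageButton", clickable=True),
]
kept = filter_elements(raw)
for index, element in enumerate(kept):
    # The index is what a target-element prediction would refer to.
    print(index, element.element_type, element.text or element.description)
```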

Action: Action grounding is a crucial part of mobile phone control, so as in previous works (Zheng et al., 2024; Yang et al., 2023; Wang et al., 2024b;a) and both the datasets we use, we define a fixed action space, seen in Table 5. Of this action space, open-app, wait, and long-press do not feature in Ait W, while navigate-home does not feature in Android Control. Information for most actions is sourced directly from the datasets, with only the action name at times varying. The only exceptions to this are the click and long-press actions, which require a target element, rather than x-y coordinates. For these, we select the best matching candidate from the observation list of UI elements. The action takes a specific JSON format we expect the models to match to facilitate parsing, which is simply a dictionary with the action type and an action specification (see Table 5). An example would be: {"action-type":"open-app","app-name":"Chrome"}.
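One plausible way to pick the best matching candidate for a click is to choose the smallest bounding box that contains the click point. The rule below is an assumption for illustration only: the text does not specify the matching criterion, and the function name, box format, and `target-element` key are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def match_click_to_element(x: int, y: int, boxes: List[Box]) -> Optional[int]:
    """Return the index of the smallest box containing (x, y), or None."""
    best_index, best_area = None, None
    for index, (left, top, right, bottom) in enumerate(boxes):
        if left <= x <= right and top <= y <= bottom:
            area = (right - left) * (bottom - top)
            if best_area is None or area < best_area:
                best_index, best_area = index, area
    return best_index


# Toy layout: a wide toolbar, a button inside it, and an unrelated element.
boxes = [(0, 0, 1080, 200), (40, 60, 300, 140), (500, 500, 600, 600)]
target = match_click_to_element(100, 100, boxes)
action = {"action-type": "click", "target-element": target}  # key name assumed
print(action)
```

Preferring the smallest enclosing box favours the innermost element when containers and their children overlap, which is a common heuristic rather than a documented part of the pipeline.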

Table 5: Agent action space, along with the relevant datasets.

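| Action type | Action specification | Ait W | And Ctrl |
|---|---|---|---|
| open-app | <app-name> | no | yes |
| click | <target-element> | yes | yes |
| long-press | <target-element> | no | yes |
| input-text | <text> | yes | yes |
| scroll-{up/down/left/right} | - | yes | yes |
| navigate-home | - | yes | no |
| navigate-back | - | yes | yes |
| wait | - | no | yes |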
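The action format and Table 5 can be checked programmatically. The sketch below parses a model's JSON output and validates it against the action space; the `ACTION_SPEC` and `UNAVAILABLE` mappings are hypothetical helpers, the key names other than `action-type` and `app-name` are assumptions derived from the table's placeholders, and the scroll actions are expanded into four separate types.

```python
import json
from typing import Any, Dict

# Required specification key per action type (None means no specification).
ACTION_SPEC = {
    "open-app": "app-name",
    "click": "target-element",       # assumed key name
    "long-press": "target-element",  # assumed key name
    "input-text": "text",            # assumed key name
    "scroll-up": None,
    "scroll-down": None,
    "scroll-left": None,
    "scroll-right": None,
    "navigate-home": None,
    "navigate-back": None,
    "wait": None,
}

# Action types stated in the text as absent from each dataset.
UNAVAILABLE = {
    "Ait W": {"open-app", "wait", "long-press"},
    "Android Control": {"navigate-home"},
}


def parse_action(raw: str, dataset: str) -> Dict[str, Any]:
    """Parse a JSON action string and check it against the action space."""
    action = json.loads(raw)
    action_type = action.get("action-type")
    if action_type not in ACTION_SPEC:
        raise ValueError(f"unknown action type: {action_type!r}")
    if action_type in UNAVAILABLE[dataset]:
        raise ValueError(f"{action_type} does not occur in {dataset}")
    spec_key = ACTION_SPEC[action_type]
    if spec_key is not None and spec_key not in action:
        raise ValueError(f"{action_type} requires '{spec_key}'")
    return action


print(parse_action('{"action-type":"open-app","app-name":"Chrome"}', "Android Control"))
```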

B PROMPT ENGINEERING BASELINES

We evaluate four prompt engineering methods leveraging GPT-4o to generate actions. First, we assess two baselines proposed by Rawles et al. (2024): a text-based method (T3A) and a multimodal approach (M3A). In both methods, GPT-4o generates a summary of the previous timestep by reflecting on prior actions, the current observation, and previous observations and actions. GPT-4o then generates a proposed action in a ReAct-like (Yao et al., 2022b) fashion using a detailed prompt that includes task guidelines, action space descriptions, previously generated summaries, and the current observation. In T3A, the observation is represented as a list of UI elements, while in M3A, it also includes two screenshots: one of the original image and another with UI element bounding boxes.

The final two prompt-engineering baselines are See Actchoice and See Actann (Zheng et al., 2024). In both methods, GPT-4o is prompted with the current task and a screenshot from the observation to generate a high-level description of the proposed action. This proposal is then passed to GPT-4o for determining the final action, including both the action type and its specifications in the appropriate format. In See Actchoice, a multiple-choice list of textual UI element choices is appended to the prompt to allow GPT-4o to predict the action specifications, such as the target element in click actions. In See Actann, the observation's screenshot is annotated with bounding boxes and labels for each UI element. We base our implementation on the See Act agent in Rawles et al. (2024), which is adapted to app control tasks.

C IMPLEMENTATION DETAILS

Ac T is a compact transformer based on the GPT-2 architecture. The transformer consists of 24 layers and 16 heads per layer. The hidden dimension of the transformer is 1024. We apply a dropout rate of 0.3 (Srivastava et al., 2014) during training across all layers. The AdamW optimiser (Loshchilov et al., 2017) is used in all experiments, with a learning rate of $3 \times 10^{-4}$ specifically for Ac T. The functions $f_{\text{type}}$ and $f_{\text{target}}$ are implemented as two-layer fully connected networks, each with a hidden size of 4096 and a dropout rate of 0.3. We use a batch size of 1 with gradient accumulation set to 32.

We fine-tune Florence2 for 10 epochs, starting with an initial learning rate of $10^{-6}$, which is gradually reduced to zero during training. The batch size is set to 2, with gradient accumulation configured to 8. For Qwen2-VL, we employ LoRA with a dimensionality of 64, beginning with an initial learning rate of $10^{-4}$, also gradually decreasing to zero throughout training. The batch size for Qwen2-VL is 1, with gradient accumulation similarly set to 8. We fine-tune Qwen2-VL for 3 epochs.
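The hyperparameters above can be collected into a small configuration sketch. The `TrainConfig` class is hypothetical, fields the text does not state are left as `None`, the LoRA dimensionality is stored as a rank, and the linear decay is only one possible reading of a learning rate that is gradually reduced to zero (the schedule shape is not specified).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainConfig:
    name: str
    learning_rate: float
    batch_size: int
    grad_accumulation: int
    epochs: Optional[int] = None     # None where the text gives no value
    lora_rank: Optional[int] = None  # interpreting "dimensionality" as rank

    @property
    def effective_batch_size(self) -> int:
        # Number of samples contributing to each optimiser step.
        return self.batch_size * self.grad_accumulation


CONFIGS = [
    TrainConfig("Ac T", 3e-4, batch_size=1, grad_accumulation=32),
    TrainConfig("Florence2", 1e-6, batch_size=2, grad_accumulation=8, epochs=10),
    TrainConfig("Qwen2-VL", 1e-4, batch_size=1, grad_accumulation=8, epochs=3, lora_rank=64),
]


def linear_decay(initial_lr: float, step: int, total_steps: int) -> float:
    """Assumed schedule: linear decay from initial_lr to zero."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)


for cfg in CONFIGS:
    print(cfg.name, cfg.effective_batch_size, linear_decay(cfg.learning_rate, 50, 100))
```

Under these settings the effective batch sizes are 32 for Ac T, 16 for Florence2, and 8 for Qwen2-VL.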

D ADDITIONAL STUDIES

This section presents additional evaluation results for Li MAC.

D.1 EXTENDED SUCCESS RATE TABLE

In Table 6, we provide a full set of evaluation metrics for the baseline models, as well as for various combinations of Li MAC with other methods. These combinations are used to predict the target element in click actions or generate text for specific actions, such as open-app and input-text. In all the experiments involving Li MAC, Ac T is employed to predict the action type, while different combinations of methods are used to predict the action specifications, such as the target element or text generation. This approach allows us to isolate the impact of each combination on performance while maintaining a consistent action type prediction. The table extends the results already presented in Tables 1 to 3, offering insights into the strengths and potential trade-offs of each combination in different scenarios.

D.2 CONFUSION MATRIX

Figure 3 shows the confusion matrix for action type prediction using Li MAC on the Android Control dataset. The results indicate that actions like open-app and input-text are generally easier to predict compared to other actions. One of the most frequently mispredicted actions is wait, which is unsurprising given that it can be challenging, even for humans, to determine when this action is required. Additionally, actions such as long-press and swipe in any direction are often misclassified, likely due to their relatively low occurrence in the training dataset compared to other actions.

Table 6: Comprehensive table of accuracy results for different modules. All rows which have Ac T for the action type module fall under our Li MAC framework.

| Type module | Click module | Text module | Action Type (Ait W) | Action Type (And Ctrl) | Click Target (Ait W) | Click Target (And Ctrl) | Text (Ait W) | Text (And Ctrl) | Total (Ait W) | Total (And Ctrl) |
|---|---|---|---|---|---|---|---|---|---|---|
| Ac T | Ac T | Florence2 | 86.9 | 82.3 | 77.4 | 65.4 | 84.2 | 77.5 | 72.2 | 63.1 |
| Ac T | Florence2 | Florence2 | 86.9 | 82.3 | 76.2 | 62.0 | 84.2 | 77.5 | 71.6 | 61.1 |
| Ac T | Ac T | Qwen2-VL | 86.9 | 82.3 | 77.4 | 65.4 | 70.5 | 75.7 | 70.9 | 62.5 |
| Ac T | Qwen2-VL | Qwen2-VL | 86.9 | 82.3 | 53.2 | 55.2 | 70.5 | 75.7 | 55.7 | 59.1 |
| Ac T | Ac T | T3A | 85.3 | 81.7 | 77.6 | 65.4 | 66.5 | 78.4 | 69.8 | 63.2 |
| Ac T | T3A | T3A | 85.3 | 81.7 | 33.5 | 71.1 | 66.5 | 78.4 | 42.7 | 65.4 |
| Ac T | M3A | T3A | 85.3 | 81.7 | 48.3 | 77.1 | 66.5 | 78.4 | 52.4 | 67.4 |
| Ac T | Ac T | M3A | 85.3 | 81.7 | 77.6 | 65.4 | 67.3 | 74.3 | 70.0 | 62.5 |
| Ac T | T3A | M3A | 85.3 | 81.7 | 33.5 | 71.1 | 67.3 | 74.3 | 43.0 | 64.7 |
| Ac T | M3A | M3A | 85.3 | 81.7 | 48.3 | 77.1 | 67.3 | 74.3 | 52.6 | 66.8 |
| Ac T | Ac T | See Actchoice | 85.3 | 81.7 | 77.6 | 65.4 | 69.4 | 67.1 | 70.5 | 62.0 |
| Ac T | See Actchoice | See Actchoice | 85.3 | 81.7 | 36.9 | 48.5 | 69.4 | 67.1 | 45.7 | 53.7 |
| Ac T | Ac T | See Actann | 85.3 | 81.7 | 77.6 | 65.4 | 66.0 | 61.8 | 70.0 | 61.1 |
| Ac T | See Actann | See Actann | 85.3 | 81.7 | 44.7 | 55.7 | 66.0 | 61.8 | 49.2 | 61.6 |
| Florence2 | Florence2 | Florence2 | 86.4 | 79.6 | 76.2 | 62.0 | 84.2 | 77.5 | 70.8 | 57.0 |
| Qwen2-VL | Qwen2-VL | Qwen2-VL | 81.7 | 70.7 | 53.2 | 55.2 | 70.5 | 75.7 | 51.0 | 52.2 |
| T3A | T3A | T3A | 56.2 | 67.7 | 33.5 | 71.1 | 66.5 | 78.4 | 26.9 | 53.1 |
| T3A | M3A | T3A | 56.2 | 67.7 | 48.3 | 77.1 | 66.5 | 78.4 | 30.9 | 55.2 |
| M3A | T3A | T3A | 63.8 | 69.8 | 33.5 | 71.1 | 66.5 | 78.4 | 27.0 | 53.5 |
| M3A | M3A | T3A | 63.8 | 69.8 | 48.3 | 77.1 | 66.5 | 78.4 | 35.8 | 57.7 |
| See Actchoice | See Actchoice | See Actchoice | 67.1 | 66.8 | 36.9 | 48.5 | 69.4 | 67.1 | 29.5 | 38.9 |
| See Actann | See Actann | See Actann | 68.2 | 66.8 | 44.7 | 55.7 | 66.0 | 61.8 | 34.3 | 45.7 |

[Heatmap: "Confusion Matrix of Action Type Predictions Li MAC". Axes: Actual Action Type and Predicted Action Type, over ten action types (legible tick labels include Scroll Down and Swipe Right). Cell counts, one matrix row per line:]

160 75 1 2 7 0 16 0 1 0
81 1970 1 20 41 33 115 8 3 7
0 3 5 0 1 0 0 0 0 0
1 14 0 226 0 7 2 0 0 0
5 14 1 0 265 0 0 0 0 0
3 32 0 1 0 73 4 2 0 1
12 75 0 2 0 4 307 4 0 5
1 21 0 1 0 1 6 35 0 0
1 5 0 0 0 0 5 0 4 0
1 8 0 2 0 0 1 1 0 5

Figure 3: Confusion matrix for action type selection for Li MAC in Android Control.
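As a sanity check on Figure 3, overall accuracy can be computed as the trace of the confusion matrix divided by the total count. The sketch below uses the cell counts from the figure; the function names are hypothetical, and the per-class recall helper assumes that rows index the actual class and columns the predicted class. Class names are omitted because the code relies only on indices.

```python
from typing import List

# Cell counts from Figure 3, one list per matrix row.
CONFUSION: List[List[int]] = [
    [160, 75, 1, 2, 7, 0, 16, 0, 1, 0],
    [81, 1970, 1, 20, 41, 33, 115, 8, 3, 7],
    [0, 3, 5, 0, 1, 0, 0, 0, 0, 0],
    [1, 14, 0, 226, 0, 7, 2, 0, 0, 0],
    [5, 14, 1, 0, 265, 0, 0, 0, 0, 0],
    [3, 32, 0, 1, 0, 73, 4, 2, 0, 1],
    [12, 75, 0, 2, 0, 4, 307, 4, 0, 5],
    [1, 21, 0, 1, 0, 1, 6, 35, 0, 0],
    [1, 5, 0, 0, 0, 0, 5, 0, 4, 0],
    [1, 8, 0, 2, 0, 0, 1, 1, 0, 5],
]


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal (independent of row/column orientation)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix: List[List[int]]) -> List[float]:
    """Diagonal entry over row sum; meaningful if rows index the actual class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


accuracy = overall_accuracy(CONFUSION)
print(f"{accuracy:.3f}")
```

With these counts the ratio is 3050/3708, or about 0.823, in line with the 82.3 action-type accuracy reported for Li MAC on Android Control.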

D.3 FAILURE ANALYSIS

We also examine the failure patterns of Li MAC, using Florence2 as the VLM, across the two datasets studied. Figure 4 displays the frequency of these failures, categorised by the type of failure in predicting either the action type or the action specifications. Specifically, within the action specifications, failures occur in two areas: incorrect prediction of the click target and inaccurate generation of input text by the VLM. In both datasets, the most common type of failure is misclassification of the action type, closely followed by failures in predicting the click target. These findings underscore the key challenges that research on app control should address.

[Bar chart: "Frequency of Failure Types". x-axis: Type of Failure (Action Type, Click Target, Input Text); series: Ait W, Android Control.]

Figure 4: Relative frequency of different types of action prediction errors in the two datasets.

D.4 UI ELEMENTS SCALABILITY

In this section, we assess how the number of UI elements in an observation impacts the success and failure rates of action prediction. Figure 5 displays the number of successful and unsuccessful action predictions made by Li MAC, categorised by the number of UI elements. The results are grouped into bins of ten on the x-axis. The number of UI elements extends up to 150 in Ait W and up to 290 in Android Control. However, for clarity, only bins containing more than five samples are included in the figures. Overall, the data suggests that the rate of failed action predictions increases slightly as the number of UI elements grows. This trend is expected since accurately predicting the target of click actions becomes more challenging with more UI elements present.
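The binning described above can be sketched as follows. The helper name, the input format (pairs of element count and correctness), and the toy data are hypothetical; only the bin width of ten and the rule of keeping bins with more than five samples come from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rate_by_bin(
    samples: Iterable[Tuple[int, bool]], bin_width: int = 10, min_samples: int = 6
) -> Dict[int, float]:
    """Map each bin's lower edge to its failure rate.

    `samples` holds (number_of_ui_elements, prediction_was_correct) pairs.
    Bins with fewer than `min_samples` entries are dropped, so only bins
    with more than five samples are kept by default.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [correct, incorrect]
    for n_elements, correct in samples:
        start = (n_elements // bin_width) * bin_width
        counts[start][0 if correct else 1] += 1
    return {
        start: incorrect / (correct + incorrect)
        for start, (correct, incorrect) in sorted(counts.items())
        if correct + incorrect >= min_samples
    }


# Toy data: eight samples in bin 10-19, six in bin 40-49, two in bin 120-129.
toy = (
    [(12, True)] * 6 + [(15, False)] * 2
    + [(43, True)] * 3 + [(47, False)] * 3
    + [(125, True)] * 2
)
print(failure_rate_by_bin(toy))
```

In this toy example the sparsely populated 120-129 bin is dropped, leaving failure rates of 0.25 and 0.5 for the two remaining bins.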

[Two bar-chart panels: "Ait W Predictions" and "Android Control Predictions". x-axis: Number of UI Elements; y-axis: Number of Predictions; series: Correct, Incorrect.]

Figure 5: Number of successful and failed predictions of actions with respect to the number of UI elements in the observation, for the two datasets.

E CASE STUDIES

Some sample episodes from Android Control, including agent predictions, are provided in Figures 6 and 7. These are provided for illustration purposes, as well as to further explain relaxed accuracies and an example failure. Figure 6 presents both an instance of a relaxed target element in the third timestep and a failed input-text action in the final timestep. Figure 7 shows a relaxed input-text action in the fourth timestep and an otherwise successful episode. Further details are provided in the figure captions.

Figure 6: Relaxed target element in yellow (timestep 3) and failed action in red (final timestep). The target element of the click in timestep 3 is considered correct under our relaxed accuracy because its bounding box is almost identical to the correct element, and clicking either would have the same effect (opening the text bar). In the final timestep, the agent inputs text "Detroit" rather than "Las Vegas", a clear confusion between the origin and destination of the trip stated in the goal, leading to an incorrect prediction.

Figure 7: Relaxed input-text in yellow (timestep 4) and overall successful episode. Timestep 4 is considered correct under our relaxed input-text textual component because it is simply the singular form of the correct text, leading to a Jaccard index greater than 0.5 and presumably the same search results. The episode terminates successfully, with all timesteps being considered correct under our evaluation metrics.
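The relaxed text criterion mentioned in the Figure 7 caption can be illustrated with a Jaccard index over token sets. This is a sketch under assumptions: the word-level, lower-cased tokenisation and the strict `> 0.5` threshold are choices made for this example, the function names are hypothetical, and the query strings are invented rather than taken from the dataset.

```python
def jaccard_index(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def relaxed_text_match(predicted: str, target: str, threshold: float = 0.5) -> bool:
    """Accept the prediction if its token set overlaps enough with the target."""
    predicted_tokens = set(predicted.lower().split())  # assumed tokenisation
    target_tokens = set(target.lower().split())
    return jaccard_index(predicted_tokens, target_tokens) > threshold


# Invented example: the prediction uses a singular form of one word.
predicted = "red running shoe for men"
target = "red running shoes for men"
score = jaccard_index(set(predicted.split()), set(target.split()))
print(score, relaxed_text_match(predicted, target))
```

Here four of the six distinct words are shared, giving an index of about 0.67, so the prediction is accepted; a completely different string such as the failed input in Figure 6 would score 0 and be rejected.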