Vocaela-2-500M-1024R2: A Tiny Mighty GUI Agent Model, Optimized for Efficiency

TL;DR: A compact 500M-parameter Vision-Language Model (VLM) designed for low-level GUI agents. Given a screenshot and a concise instruction (e.g., “click the submit button”), it outputs structured JSON actions with precise screen coordinates. Despite its small size, it performs well on grounding and low-level GUI control, rivaling much larger models while running efficiently on laptops and even mobile devices.

Compared to its predecessor Vocaela-500M, this version primarily optimizes for efficiency, reducing CPU inference latency (via LlamaCpp) by 50%-70%. In addition, instruction-following robustness is improved.

Model description

A growing number of models can now operate computer and mobile GUIs on behalf of users. However, most are massive and impractical for everyday devices like laptops or phones. While many GUI agent models chase higher autonomy, the former Vocaela-500M model and this update explore a different path: a smaller, efficient model focused on precise low-level control.

Given a screenshot and an explicit instruction such as “click the submit button,” it produces structured JSON actions with screen coordinates, as in the example below. By narrowing the scope, we maximize efficiency, achieving smooth performance on laptops and even mobile devices.
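
For instance (illustrative coordinate values; the full schema is given in the Action space section below), a click instruction yields a reply of the form:

<Action>[{"action": "click", "coordinate": [0.62, 0.31]}]</Action>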

Despite its compact 500M parameters, it performs surprisingly well on grounding and GUI control tasks. This represents a step toward scaling GUI agent models downward toward lightweight, practical deployment.

Key changes compared to the predecessor Vocaela-500M

The main changes target the vision tower. Vocaela-500M was fine-tuned directly from SmolVLM2-500M without any architectural changes; this version introduces three changes on the vision side to improve efficiency:

  • Replaces the original Siglip-512 vision encoder with Siglip-256, so each tile (sub-image) fed to the encoder shrinks from 512 to 256 pixels per side and is processed more efficiently.
  • Reduces the longer side of the input image from 2048 to 1024 pixels to cut overall vision-side compute. This is the meaning of the 1024 suffix in the model name; it implies some loss of vision quality on fine-grained details.
  • Reduces the pixel shuffle factor from 4 to 2, lowering image-token compression. This is the meaning of the R2 suffix in the model name; it increases the number of visual tokens for the same image size, partially compensating for the loss caused by the smaller image.

With these carefully calibrated settings, the average number of tiles (sub-images) stays the same as in Vocaela-500M, while each tile is processed more efficiently. The number of visual tokens passed to the language model also remains unchanged, because the effects of the smaller tile size (512 → 256) and the smaller pixel shuffle factor (4 → 2) largely cancel each other out. Altogether, end-to-end latency drops to roughly 30%-50% of the original, measured as wall-clock time running via LlamaCpp on CPU.
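
As a back-of-the-envelope check (a sketch assuming the 16-pixel patch size implied by the SigLIP-base encoders, and that pixel shuffle reduces the token count by the square of its factor), the per-tile visual token count indeed stays the same:

def visual_tokens_per_tile(tile_side, patch_size=16, pixel_shuffle_factor=2):
    # Number of encoder patches per tile; pixel shuffle then merges
    # factor x factor neighboring patches into a single visual token.
    patches = (tile_side // patch_size) ** 2
    return patches // (pixel_shuffle_factor ** 2)

print(visual_tokens_per_tile(512, pixel_shuffle_factor=4))  # Vocaela-500M: 64 tokens per tile
print(visual_tokens_per_tile(256, pixel_shuffle_factor=2))  # Vocaela-2-500M-1024R2: 64 tokens per tile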

Beyond efficiency improvements, instruction-following robustness is enhanced in this version. Additionally, a minor update to the action space is introduced to better support mobile swipe actions.

Action space

The following table lists the default action schema used during training. Users may extend or redefine it via system prompts. Compared to the predecessor Vocaela-500M, the only change is splitting the mobile swipe action into two actions: general_swipe and element_swipe.

| Scope | Action | Parameters | Parameters' Values | Example | Meaning |
|---|---|---|---|---|---|
| Common | type | text | string, the text to type in | {"action": "type", "text": "example"} | Type the specified text |
| Common | click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "click", "coordinate": [0.1,0.5]} | Click using mouse or tap using finger at specified position |
| Desktop Only | mouse_move | coordinate | [x,y], scaled [0, 1), position to move to | {"action": "mouse_move", "coordinate": [0.1,0.5]} | Move mouse to specified position |
| Desktop Only | drag | coordinate, coordinate2 | [x,y], scaled [0, 1), start (coordinate) and end (coordinate2) positions to drag | {"action": "drag", "coordinate": [0.1,0.5], "coordinate2": [0.2,0.6]} | Drag mouse (click left button and hold) from specified start position to end position |
| Desktop Only | right_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "right_click", "coordinate": [0.1,0.5]} | Click right mouse button at specified position |
| Desktop Only | middle_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "middle_click", "coordinate": [0.1,0.5]} | Click middle mouse button at specified position |
| Desktop Only | double_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "double_click", "coordinate": [0.1,0.5]} | Double-click left mouse button at specified position |
| Desktop Only | scroll | scroll_direction | enum: {'up', 'down'} | {"action": "scroll", "scroll_direction": "up"} | Scroll mouse wheel in specified direction |
| Desktop Only | press_key | key, presses | key: string, single key to press; presses: integer, number of times to press | {"action": "press_key", "key": "enter"} | Press a single key |
| Desktop Only | hotkey | hotkeys | list of strings, combination of keys to press, e.g., ['ctrl', 'c'] | {"action": "hotkey", "hotkeys": ["ctrl", "c"]} | Press a hotkey combination, e.g., Ctrl+C |
| Mobile Only | long_press | coordinate, time | coordinate: [x,y], scaled [0, 1), position to press on; time: seconds to hold | {"action": "long_press", "coordinate": [0.1,0.5], "time": 5} | Press at specified position and hold for specified time (s) |
| Mobile Only | general_swipe | swipe_direction, swipe_from | swipe_direction: direction to swipe towards, enum {'up', 'down', 'left', 'right'}; swipe_from: general area to swipe from, enum {'top', 'bottom', 'left', 'right', 'center', 'top_left', 'top_right', 'bottom_left', 'bottom_right'} | {"action": "general_swipe", "swipe_direction": "up"} | Swipe from specified start area towards specified direction |
| Mobile Only | element_swipe | swipe_direction, coordinate | swipe_direction: direction to swipe towards, enum {'up', 'down', 'left', 'right'}; coordinate: [x,y], scaled [0, 1), precise position to swipe from | {"action": "element_swipe", "swipe_direction": "up", "coordinate": [0.1,0.5]} | Swipe from an exact location towards specified direction |
| Mobile Only | system_button | button | string, system button to press, enum: {'back', 'home', 'menu', 'enter'} | {"action": "system_button", "button": "home"} | Press the specified system button |
| Mobile Only | open | text | string, name of app to open | {"action": "open", "text": "Google Chrome"} | Open the specified app |

See the System messages section below for examples of how to instruct the model with this action space.

How to use

The model is used in the same way as SmolVLM2-500M and Vocaela-500M. The example below shows how to load the model and processor, construct multimodal messages, and run inference. For system messages, refer to the System messages section. For a complete runnable example, see the simple demo vocaela-500m-demo; that repo also includes examples of running via LlamaCpp.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "vocaela/Vocaela-2-500M-1024R2"
processor = AutoProcessor.from_pretrained(model_path)
torch_dtype = torch.float16 # use torch.bfloat16 if your device supports it
device = 'cuda' # use 'cpu' to run inference on CPU
_attn_implementation = 'sdpa' # use "flash_attention_2" if it is available in your environment
model = AutoModelForImageTextToText.from_pretrained(model_path, torch_dtype=torch_dtype, _attn_implementation=_attn_implementation).to(device)

# Ensure the 'content' field of every message is a list, even for a single item; otherwise apply_chat_template silently produces a wrong result without raising an exception.
messages = [
    {
      "role": "system",
      "content": [
        { "type": "text", "text": "<SYSTEM_MESSAGE>"}, # please reference section [System messages](#system-messages) for choices of using message for computer use or mobile use.
      ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "<image full path>"},
            {"type": "text", "text": "Click the ..."},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch_dtype)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
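
As described in the System messages section, the model wraps a JSON array of actions in <Action>...</Action> tags. A minimal parsing sketch (illustrative, not part of the official demo):

import json
import re

def parse_actions(reply: str):
    """Extract the JSON action list from a reply containing <Action>[...]</Action>."""
    # Take the last match: the decoded text may also echo the prompt, which itself
    # contains an <Action>...</Action> placeholder in the format instructions.
    matches = re.findall(r"<Action>(.*?)</Action>", reply, re.DOTALL)
    if not matches:
        return []
    try:
        return json.loads(matches[-1])
    except json.JSONDecodeError:
        return []

# e.g., parse_actions(generated_texts[0]) might return
# [{"action": "click", "coordinate": [0.62, 0.31]}]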

Evaluation: CPU Latency

We evaluate inference latency via LlamaCpp on CPUs, measured as end-to-end wall-clock time per request. The same Q8_0-quantized GGUF is used for all models. The evaluation set consists of 100 random samples drawn from the quality evaluation datasets described below.
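
For illustration only (not the exact harness used for the numbers below), average per-request wall-clock latency can be measured with a simple timing loop; run_request is a hypothetical callable that sends one screenshot plus instruction to the local LlamaCpp server and waits for the full response:

import time

def average_latency_seconds(samples, run_request):
    """Average end-to-end wall-clock seconds per request.
    samples: iterable of (image_path, instruction) pairs.
    run_request: user-supplied callable performing one full inference."""
    timings = []
    for image_path, instruction in samples:
        start = time.perf_counter()
        run_request(image_path, instruction)  # one end-to-end request
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)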

Three system configurations are used:

  • Dell 7400 laptop with Intel Core i7 8665U 4-Core (released in 2019), Windows.
  • Workstation with AMD Ryzen 7 7700X 8-Core (released in 2022), Ubuntu.
  • MacBook Air laptop with M1 8-Core (released in 2020), macOS.

For a fair comparison, LlamaServer is configured with -t 4 to limit the number of threads to 4. For the MacBook Air M1, which has an on-chip GPU, we report numbers both with and without GPU offloading.

Qwen3-VL-2B-Instruct (GGUF-Q8_0) is evaluated as a reference. This 2B model is often considered especially suitable for edge devices, yet it turns out to be far from practical as a local GUI agent model on everyday consumer devices. The compact Vocaela-500M is much faster, although its absolute latency is still challenging for realistic use. Vocaela-2-500M-1024R2 further cuts latency by 50%-70%.

| System Config | Model | Avg Latency per Request (s), Lower Is Better |
|---|---|---|
| Intel Core i7 8665U, Windows | Qwen3-VL-2B-Instruct (GGUF-Q8_0) | 804.2 |
| | Vocaela-500M (GGUF-Q8_0) | 81 |
| | Vocaela-2-500M-1024R2 (GGUF-Q8_0) | 31 |
| AMD Ryzen 7 7700X, Ubuntu | Qwen3-VL-2B-Instruct (GGUF-Q8_0) | 118.4 |
| | Vocaela-500M (GGUF-Q8_0) | 19.2 |
| | Vocaela-2-500M-1024R2 (GGUF-Q8_0) | 6.4 |
| MacBook M1, macOS | Qwen3-VL-2B-Instruct (GGUF-Q8_0) | 15.6 (w/ GPU), 34.0 (w/o GPU) |
| | Vocaela-500M (GGUF-Q8_0) | 3.4 (w/ GPU), 5.8 (w/o GPU) |
| | Vocaela-2-500M-1024R2 (GGUF-Q8_0) | 1.67 (w/ GPU), 3.7 (w/o GPU) |

Evaluation: Quality

The quality evaluation protocols are the same as those used for Vocaela-500M. For simplicity, we only compare against Vocaela-500M; for numbers on other related models, please refer to the previous model card.

We evaluated the model on two levels of tasks:

  • Grounding: the model is asked to directly output the screen coordinate (x, y) of a referenced GUI element. Related work has tended to improve this low-level capability by scaling up model size; Vocaela-500M and Vocaela-2-500M-1024R2 show that a tiny model can still perform remarkably well on grounding.

  • Low-level GUI agent task: the model is asked to execute low-level GUI instructions such as "click the submit button", "type 'diet food' in the search box", "scroll up the page", or "open Chrome". Although this is not the popular highly autonomous agentic setting, it keeps the model a self-contained "agent" model rather than only a "grounding" model.

Grounding evaluation

Screenspot-V2

On Screenspot-V2, there is a noticeable performance drop. We attribute this primarily to two vision-related changes:

  • The input image size was reduced from longest-edge 2048 to 1024, which makes resolving fine-grained details harder.
  • The vision encoder was changed to Siglip-256, which is weaker than Siglip-512 in base-model evaluations (see the SigLIP paper).

| Model | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Overall |
|---|---|---|---|---|---|---|---|
| Vocaela-500M | 95.9 | 73.9 | 95.4 | 75.7 | 91.0 | 75.4 | 85.8 |
| Vocaela-2-500M-1024R2 | 92.1 | 71.1 | 90.7 | 71.4 | 85.9 | 75.9 | 82.4 |

Showdown

On Showdown, the new model is almost on par with the predecessor.

| Model | Acc |
|---|---|
| Vocaela-500M | 52.1 |
| Vocaela-2-500M-1024R2 | 51.9 |

Low-level agent evaluation

Following the convention of related work, we report three metrics; a minimal scoring sketch follows the list.

  • Type: accuracy of predicting the action type, e.g., 'click', 'type', etc.
  • Grounding: accuracy of the predicted coordinate for actions that require outputting a screen coordinate, such as 'click'.
  • SR: step success rate. For all metrics, higher is better.
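
A minimal per-step scoring sketch, under the common convention (an assumption here, not the exact evaluation code) that a predicted coordinate counts as correct when it falls inside the ground-truth element's bounding box, and that a step succeeds when both the action type and any required coordinate are correct:

def point_in_box(point, box):
    """point = [x, y]; box = (x1, y1, x2, y2), all in the same scaled [0, 1) coordinates."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def score_step(pred, gold, gold_box=None):
    """Return (type_correct, grounding_correct, step_success) for one predicted action."""
    type_ok = pred.get("action") == gold.get("action")
    grounding_ok = True
    if gold_box is not None:
        grounding_ok = "coordinate" in pred and point_in_box(pred["coordinate"], gold_box)
    return type_ok, grounding_ok, type_ok and grounding_ok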

On all these datasets, Vocaela-2-500M-1024R2 is on par with or slightly better than the previous version.

AndroidControl-Low

| Model | Type | Grounding | SR |
|---|---|---|---|
| Vocaela-500M | 83.98 | 81.52 | 69.68 |
| Vocaela-2-500M-1024R2 | 86.39 | 81.41 | 72.74 |

GUI-Act-Web

| Model | Type | Grounding | SR |
|---|---|---|---|
| Vocaela-500M | 90.28 | 79.71 | 80.43 |
| Vocaela-2-500M-1024R2 | 92.41 | 83.47 | 81.21 |

OmniAct-Web

| Model | Type | Grounding | SR |
|---|---|---|---|
| Vocaela-500M | 88.16 | 72.42 | 67.13 |
| Vocaela-2-500M-1024R2 | 91.73 | 71.91 | 67.41 |

OmniAct-Desktop

| Model | Type | Grounding | SR |
|---|---|---|---|
| Vocaela-500M | 89.23 | 83.05 | 79.12 |
| Vocaela-2-500M-1024R2 | 92.59 | 84.07 | 80.98 |

Training strategy

Starting from SmolVLM2-500M, we replace the vision encoder with google/siglip-base-patch16-256 and adjust the connector layer's dimensions accordingly. The pixel shuffle factor is reduced to 2 and the longest edge size is reduced to 1024. After these architectural modifications, the model is trained in four stages:

  • Connector warm-up: Both the text tower and the vision encoder are frozen, leaving only the connector layer trainable. This stage uses approximately 256K samples of general vision-language instruction-tuning data.

  • SFT Stage 1: Approximately 9M examples consisting of a mixture of general VLM instruction data, GUI grounding data, GUI instruction data, and GUI navigation data. Compared to Vocaela-500M, an additional 2M general VLM instruction samples are added in this stage.

  • SFT Stage 2: Approximately 1.2M examples selectively sampled from Stage 1, with rebalanced distributions over GUI action types, grounding, and navigation data.

  • RFT: GRPO-based reinforcement fine-tuning using approximately 40K selected examples from SFT Stage 2, with adjusted distributions over action types, grounding, and navigation data.

Limitations

  • Not suitable for high-resolution images. The model is designed primarily for efficiency. Although there is no hard constraint, keeping the image's longest edge under 2048 pixels is recommended.
  • Not suitable for high-level agentic tasks. The model is designed for executing low-level GUI instructions. It lacks reasoning capability.
  • Loss of general-purpose capabilities.
  • No video input support.

System messages

The system messages below were used during training and are therefore recommended for inference.

System message for computer use

Vocaela_Computer_Use_System_Message = """You are an assistant trained to navigate the computer screen. 
Given a task instruction, a screen observation, and an action history sequence, 
output the next actions and wait for the next observation. 

## Allowed ACTION_TYPEs and parameters:
1. `PRESS_KEY`: Press one specified key. Two parameters: `key`, string, the single key to press; `presses`, integer, the number of times to press the key (default is 1).
2. `TYPE`: Type a string into an element. Parameter: `text`, string, the text to type.
3. `MOUSE_MOVE`: Move the mouse cursor to a specified position. Parameter: `coordinate`, formatted as [x,y], the position to move the cursor to.
4. `CLICK`: Click left mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to click on.
5. `DRAG`: Drag the cursor with the left mouse button pressed, start and end positions are specified. Two parameters: `coordinate`, formatted as [x,y], the start position to drag from; `coordinate2`, formatted as [x2,y2], the end position to drag to.
6. `RIGHT_CLICK`: Click right mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to right click on.
7. `MIDDLE_CLICK`: Click middle mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to middle click on.
8. `DOUBLE_CLICK`: Click left mouse button twice on an element. Parameter: `coordinate`, formatted as [x,y], the position to double click on.
9. `SCROLL`: Scroll the screen (via mouse wheel). Parameter: `scroll_direction`, the direction (`up`/`down`/`left`/`right`) to scroll.
10. `HOTKEY`: Press a combination of keys simultaneously. Parameter: `hotkeys`, list of strings, the keys to press together.
11. `ANSWER`: Answer a specific question. Required parameter: `text`, string, the answer text.

* NOTE *: The `coordinate` and `coordinate2` parameters (formatted as [x,y]) are the relative coordinates on the screenshot scaled to range of 0-1, [0,0] is the top-left corner and [1,1] is the bottom-right corner.

## Format your response as
<Action>the next actions</Action>

`The next actions` can be one or multiple actions. Format `the next actions` as a JSON array of objects as below, each object is an action:
[{"action": "<ACTION_TYPE>", "key": "<key>", "presses": <presses>, "hotkeys": ["<hotkeys>"], "text": "<text>", "coordinate": [x,y], "coordinate2": [x2,y2], "scroll_direction": "<scroll_direction>"}]

If a parameter is not applicable, don't include it in the JSON object.
"""

System message for mobile phone use

Compared to the former Vocaela-500M, this version splits swipe into general_swipe and element_swipe.

Vocaela_Mobile_Use_System_Message = """You are an assistant trained to navigate the mobile phone. 
Given a task instruction, a screen observation, and an action history sequence, 
output the next actions and wait for the next observation. 

## Allowed ACTION_TYPEs and parameters:
1. `PRESS_KEY`: Press one specific key. Supports adb's `keyevent` such as 'volume_up', 'volume_down', 'power', 'camera', 'clear', etc. Parameter: `key`, string, the single key to press
2. `CLICK`: Click/tap on the screen. Parameter: `coordinate`, formatted as [x,y], the position to click on.
3. `LONG_PRESS`: Long press on the screen. Two parameters: `coordinate`, formatted as [x,y], the position to long press on; `time`, duration in seconds to long press.
4. `GENERAL_SWIPE`: General swipe on a screen area. Two parameters: `swipe_from`, the start area to swipe from, only allowed value in {'top', 'bottom', 'left', 'right', 'center', `top_left`, `top_right`, `bottom_left`, `bottom_right`}; `swipe_direction`, the direction (`up`/`down`/`left`/`right`) to swipe towards.
5. `ELEMENT_SWIPE`: Accurate swipe on a specified UI element. Two parameters: `coordinate`, formatted as [x,y], the precise start position to swipe from; `swipe_direction`, the direction (`up`/`down`/`left`/`right`) to swipe towards.
6. `TYPE`: Type a string into an element. Parameter: `text`, string, the text to type.
7. `SYSTEM_BUTTON`: Press a system button. Parameter: `button`, the system button to press, allowed button values: 'Back', 'Home', 'Menu', 'Enter'.
8. `OPEN`: Open an app. Parameter: `text`, string, the app name to open.
9. `ANSWER`: Answer a specific question. Parameter: `text`, string, the answer text.


* NOTE *: `coordinate` parameter (formatted as [x,y]) is relative coordinate on the screenshot scaled to range of 0-1, [0,0] is the top-left corner and [1,1] is the bottom-right corner.

## Format your response as
<Action>the next actions</Action>

`The next actions` can be one or multiple actions. Format `the next actions` as a JSON array of objects as below, each object is an action:
[{"action": "<ACTION_TYPE>", "key": "<key>", "text": "<text>", "coordinate": [x,y], "swipe_from": "<swipe_from>", "swipe_direction": "<swipe_direction>", "time": <time>, "button": "<button>"}]

If a parameter is not applicable, don't include it in the JSON object.
"""

Special tokens & chat template

The base model SmolVLM2-500M does not provide special tokens to identify the user or assistant role. To accurately mask user-turn messages during SFT, two existing special tokens are used to mark the beginning and end of an assistant message: <|reserved_special_token_50|> for the beginning and <|reserved_special_token_51|> for the end. Consequently, if you look at the chat_template.jinja file in the model folder, you will find that the chat template adds the prefix token <|reserved_special_token_50|> for inference:

<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:<|reserved_special_token_50|>' }}{% endif %}
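
For example, the messages built in the How to use section render (before the processor expands <image> into actual image tokens) roughly as:

<|im_start|>System: <SYSTEM_MESSAGE><end_of_utterance>
User:<image>Click the ...<end_of_utterance>
Assistant:<|reserved_special_token_50|>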

For a normal generation, if you configure the tokenizer to keep special tokens during decoding, a completed sequence ends with two successive special tokens, <|reserved_special_token_51|><end_of_utterance>, where <end_of_utterance> is the default end token of the base model and <|reserved_special_token_51|> is introduced by our training process.
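
A small sketch of inspecting that raw ending, reusing generated_ids and processor from the How to use section (the endswith check holds only when generation completed normally rather than being cut off by max_new_tokens):

raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(raw_text.endswith("<|reserved_special_token_51|><end_of_utterance>"))  # True for a normally completed generation
# Strip the markers if you want plain text while keeping skip_special_tokens=False:
for token in ("<|reserved_special_token_50|>", "<|reserved_special_token_51|>", "<end_of_utterance>"):
    raw_text = raw_text.replace(token, "")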

License

This model is made available under the CC BY-NC-SA 4.0 license. To comply with the license, you may use, modify, and share the model or derivative works for non-commercial purposes only. Any derivative works must be shared under the same license.

We adopt the CC BY-NC-SA 4.0 license because portions of the training data are released under the same license.

Please see the full license here.

Acknowledgements
