UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents

Jiwen Zhang¹,²*, Yaqi Yu²*, Minghui Liao², Wentao Li², Jihao Wu², Zhongyu Wei¹

¹Fudan University ²Huawei Inc.

This work presents UI-Hawk, a visual GUI agent specifically designed to process the screen streams encountered during GUI navigation. UI-Hawk incorporates a history-aware visual encoder and an efficient resampler to handle screen sequences. To acquire a better understanding of screen streams, we define four fundamental tasks: UI grounding, UI referring, screen question answering, and screen summarization. We develop an automated data curation method to generate the corresponding training data for UI-Hawk. We also created FunUI, a benchmark that quantitatively evaluates the fundamental screen understanding ability of MLLMs. Extensive experiments on FunUI and GUI navigation benchmarks consistently validate that screen stream understanding is not only beneficial but also essential for GUI navigation.

📣 Update

[2025-10-03] The FunUI benchmark is open-sourced on Hugging Face!
[2025-08-20] Our paper has been accepted as a long paper at the main conference of EMNLP 2025!
[2024-10-10] Our project page UI-Hawk is now live!
[2024-08-30] Our paper is online; you can access the preprint or get the PDF directly here!

Model Architecture

UI-Hawk is an MLLM-based GUI agent equipped with screen stream understanding capabilities. It is built upon TextHawk. To exploit screen sequences, UI-Hawk incorporates a history-aware visual encoder, which explicitly models the temporal dependencies between images via scalable position embeddings. We use an efficient resampler with a compression ratio of 16, enabling UI-Hawk to handle multiple steps of historical screens. This architecture lets UI-Hawk perceive the fine-grained details involved in the entire navigation process.
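To make the effect of the compression ratio concrete, the following is a minimal back-of-the-envelope sketch, not the UI-Hawk implementation. It only assumes that a ratio of 16 means roughly 16 input visual tokens are merged into one output token. The function names and the per-screen token count of 1024 are hypothetical values chosen for illustration.

```python
import math


def compressed_tokens(num_visual_tokens: int, ratio: int = 16) -> int:
    """Tokens left after resampling, assuming `ratio` inputs map to 1 output."""
    return math.ceil(num_visual_tokens / ratio)


def history_token_budget(num_screens: int, tokens_per_screen: int, ratio: int = 16) -> int:
    """Total visual tokens the language model sees for a sequence of screens."""
    return num_screens * compressed_tokens(tokens_per_screen, ratio)


if __name__ == "__main__":
    tokens_per_screen = 1024  # hypothetical encoder output size, not from the paper
    # One screen: 1024 / 16 = 64 tokens after compression.
    print(compressed_tokens(tokens_per_screen))
    # Five screens (e.g. current screen plus four history screens): 5 * 64 = 320.
    print(history_token_budget(5, tokens_per_screen))
```

Under these assumptions, five compressed screens cost fewer tokens than a single uncompressed one, which is why a high compression ratio makes multi-step screen history affordable.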

FunUI Benchmark

We introduce FunUI, a bilingual benchmark for evaluating the fundamental screen understanding capabilities of MLLMs. Concretely, FunUI has the following three characteristics:

Bilingual: FunUI comprises 2,150 Chinese screens and 9,347 English screens from Android devices, annotated with 14k and 18k samples, respectively.
Comprehensive: FunUI covers several evaluation dimensions of screen understanding: UI grounding and UI referring tasks assess the regional localization and identification abilities of models, while screen question answering and screen summarization tasks require a more integrated analysis of screen contents.
Diverse: FunUI covers various types of question-answer pairs, including grounding and referring questions about 120+ icons and widgets, and complex questions about element relations, attributes, arithmetic, and so on.

Experiment Results

UI-Hawk is a bilingual model with advanced screen understanding capabilities, and it achieves a new state of the art (SOTA) on episodic GUI navigation.

Model	Tool	Information	Shopping	Media	Social	Multi-Apps	Overall	ClickAcc
GPT-4V	10.6	9.8	11.2	7.6	5.0	11.2	9.2	3.4
CogAgent	12.9	10.0	14.2	10.5	9.0	8.4	10.3	7.5
SeeClick	6.8	6.4	5.8	7.2	8.1	5.5	6.5	6.5
OdysseyAgent	81.5	63.6	62.2	72.5	72.5	68.8	70.8	43.8
UI-Hawk	88.2	70.9	66.8	82.4	81.4	80.1	79.4	76.3

Table 2: Sequential navigation performance on the GUI-Odyssey+ dataset.
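For readers who want to quantify the gap, the short sketch below copies the OdysseyAgent and UI-Hawk rows from the table above and computes the per-column difference in percentage points. The variable and function names are illustrative only; the numbers are taken directly from the table.

```python
COLUMNS = ["Tool", "Information", "Shopping", "Media", "Social",
           "Multi-Apps", "Overall", "ClickAcc"]

# Rows copied from the table above.
SCORES = {
    "OdysseyAgent": [81.5, 63.6, 62.2, 72.5, 72.5, 68.8, 70.8, 43.8],
    "UI-Hawk": [88.2, 70.9, 66.8, 82.4, 81.4, 80.1, 79.4, 76.3],
}


def gains(model: str = "UI-Hawk", baseline: str = "OdysseyAgent") -> dict:
    """Per-column difference (model minus baseline), rounded to one decimal."""
    return {
        col: round(m - b, 1)
        for col, m, b in zip(COLUMNS, SCORES[model], SCORES[baseline])
    }


if __name__ == "__main__":
    for column, delta in gains().items():
        print(f"{column:12s} +{delta}")
```

The differences are 8.6 points on Overall and 32.5 points on ClickAcc, so the largest margin over the strongest baseline in the table is in click accuracy.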

Citation

If you find our work helpful, please consider citing our paper.

@article{202408.2137,
    title = {UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents},
    author = {Jiwen Zhang and Yaqi Yu and Minghui Liao and Wentao Li and Jihao Wu and Zhongyu Wei},
    doi = {10.20944/preprints202408.2137.v1},
    url = {https://doi.org/10.20944/preprints202408.2137.v1},
    year = 2024,
    month = {August},
    publisher = {Preprints},
    journal = {Preprints}
}
