【GUI-Agent】阿里通义MAI-UI 代码阅读（2）--- 实现-北京尧图网络科技有限公司

MAI-UI 的两类核心Agent如下本篇会介绍这两类AgentAgent文件任务输出协议MAIGroundingAgentsrc/mai_grounding_agent.pyUI 元素定位单步grounding_think./grounding_think{coordinate:[x,y]}坐标基于 SCALE_FACT0R999 归一化MAIUINavigationAgentsrc/mai_navigation_agent.py多步移动端GUI导航支持ask_user与mcp_call.tool_call{json}/tool_call多轮带历史截图0x01 工程实现特色MAI-UI 工程实现的三个特色如下。1.1 特色1特色1三套系统提示词对应三种Agent形态grounding / 纯导航 / ask_user MCP 增强导航src/prompt.py同时维护MAI_MOBILE_SYS_PROMPT_GROUNDING 一单步元素定位MAI_MOBILE_SYS_PROMPT 一标准多步导航MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 一在导航动作集里叠加两个特殊工具ask_userquestion模型主动反问用户、把任务“打回去mcp_calltoolargs调外部MCP工具如高德导航补全设备端做不到的能力意义这是Agent-User Interaction MCP Augmentation 范式在代码层面的真实落点一不是新API、就是同一个模型在不同system prompt下解锁不同动作集。新增交互类工具的正确姿势就是改prompt.pyparse_tagged_text的 schema而不是另起一个Agent类。1.2 特色2特色2是归一化坐标空间SCALE_FACTOR 999 XML标签输出协议而非function-calling。src/mai_grounding_agent.py 与 src/mai_naivigation_agent.py 都硬编码为 SCALE_FACTOR999模型永远输出[0999]区间整数由客户端按当前截图WH反归一化。输出不是OpenAI function-calling而是裸文本里的 XML 标签Grounding:grounding_think.../grounding_think{coordinate:[x,y]}Navigation:.tool_call{json}/tool_call兼容 thinking 模型的解析器parse_grounding_response、parse_tagged_text错误统一抛 ValueError。意义跨分辨率泛化同一个模型同一个权重无缝服务任意手机分辨率不需要在 prompt里写屏幕尺寸协议无关于推理后端一VLLM0.11.0、HFtransformers 本地推理、DashScope都能用因为只解析纯文本不依赖任何后端的tool-call结构代价解析鲁棒性必须由客户端自己保证所以两个parser都做了容错显式异常1.3 特色 3特色 3无状态服务端客户端自管TrajMemory每步把历史截图重塞回 messagesBaseAgent 持有 traj_memoryTrajMemory每个 TrajStep 同时存 screenshot: Image 和 screenshot_bytesbytes渲染vs序列化双用MAIUINaivigationAgent._build_messages() 按 runtime_conf[history_n] 把最近 N 步的“截图模型回复“重组成多轮user/assistant对话再发给vLLM一一一vLLM 端零会话状态。save_traj()/load_traj()走bytes可被序列化/回放/做评测离线分析。stept的请求体每步独立、无状态如下意义可回放、可评测、可断点续跑—save_traj出dict、load_traj直接灌回离线replay不需要真机/模拟器横向扩展友好一VLLM可以集群水平扩因为没有会话粘性这正契合 scaling parallel environments up to 512的训l练形态在推理侧的对应做法代价每步N张图都要重传带宽与 prefill 成本随 history_n线性增长调小 history_n是常见的省 token 技巧。1.4 小结MAI-UI的工程独到之处不是模型本身而是这套客户端契约分辨率无关的999坐标空间 XML标签协议与后端解耦无状态多轮重放与历史长度解耦三档 prompt解锁的grounding/导航/ask_userMCP三种形态一一一后续任何二次开发都沿着这四条线走而不是去改模型契约。0x02 提示词2.1 提示词代码以下是提示词代码。MAI_MOBILE_SYS_PROMPTMAI_MOBILE_SYS_PROMPT You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format For each function call, return the thinking process in thinking /thinking tags, and a json object with function name and arguments within tool_call/tool_call XML tags: thinking ... /thinking tool_call {name: mobile_use, arguments: args-json-object} /tool_call ## Action Space {action: click, coordinate: [x, y]} {action: long_press, coordinate: [x, y]} {action: type, text: } {action: swipe, direction: up or down or left or right, coordinate: [x, y]} # coordinate is optional. Use the coordinate if you want to swipe a specific UI element. {action: open, text: app_name} {action: drag, start_coordinate: [x1, y1], end_coordinate: [x2, y2]} {action: system_button, button: button_name} # Options: back, home, menu, enter {action: wait} {action: terminate, status: success or fail} {action: answer, text: xxx} # Use escape characters \\, \\, and \\n in text part to ensure we can parse the text in normal python string format. ## Note - Write a small plan and finally summarize your next action (with its target element) in one sentence in thinking/thinking part. - Available Apps: [Camera,Chrome,Clock,Contacts,Dialer,Files,Settings,Markor,Tasks,Simple Draw Pro,Simple Gallery Pro,Simple SMS Messenger,Audio Recorder,Pro Expense,Broccoli APP,OSMand,VLC,Joplin,Retro Music,OpenTracks,Simple Calendar Pro]. You should use the open action to open the app as possible as you can, because it is the fast way to open the app. - You must follow the Action Space strictly, and return the correct json object within thinking /thinking and tool_call/tool_call XML tags. .strip()MAI_MOBILE_SYS_PROMPT_NO_THINKINGMAI_MOBILE_SYS_PROMPT_NO_THINKING You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format For each function call, return a json object with function name and arguments within tool_call/tool_call XML tags: tool_call {name: mobile_use, arguments: args-json-object} /tool_call ## Action Space {action: click, coordinate: [x, y]} {action: long_press, coordinate: [x, y]} {action: type, text: } {action: swipe, direction: up or down or left or right, coordinate: [x, y]} # coordinate is optional. Use the coordinate if you want to swipe a specific UI element. {action: open, text: app_name} {action: drag, start_coordinate: [x1, y1], end_coordinate: [x2, y2]} {action: system_button, button: button_name} # Options: back, home, menu, enter {action: wait} {action: terminate, status: success or fail} {action: answer, text: xxx} # Use escape characters \\, \\, and \\n in text part to ensure we can parse the text in normal python string format. ## Note - Available Apps: [Camera,Chrome,Clock,Contacts,Dialer,Files,Settings,Markor,Tasks,Simple Draw Pro,Simple Gallery Pro,Simple SMS Messenger,Audio Recorder,Pro Expense,Broccoli APP,OSMand,VLC,Joplin,Retro Music,OpenTracks,Simple Calendar Pro]. You should use the open action to open the app as possible as you can, because it is the fast way to open the app. - You must follow the Action Space strictly, and return the correct json object within thinking /thinking and tool_call/tool_call XML tags. .strip()MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP# Placeholder prompts for future features MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP Template( You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format For each function call, return the thinking process in thinking /thinking tags, and a json object with function name and arguments within tool_call/tool_call XML tags: thinking ... /thinking tool_call {name: mobile_use, arguments: args-json-object} /tool_call ## Action Space {action: click, coordinate: [x, y]} {action: long_press, coordinate: [x, y]} {action: type, text: } {action: swipe, direction: up or down or left or right, coordinate: [x, y]} # coordinate is optional. Use the coordinate if you want to swipe a specific UI element. {action: open, text: app_name} {action: drag, start_coordinate: [x1, y1], end_coordinate: [x2, y2]} {action: system_button, button: button_name} # Options: back, home, menu, enter {action: wait} {action: terminate, status: success or fail} {action: answer, text: xxx} # Use escape characters \\, \\, and \\n in text part to ensure we can parse the text in normal python string format. {action: ask_user, text: xxx} # you can ask user for more information to complete the task. {action: double_click, coordinate: [x, y]} {% if tools -%} ## MCP Tools You are also provided with MCP tools, you can use them to complete the task. {{ tools }} If you want to use MCP tools, you must output as the following format: thinking ... /thinking tool_call {name: function-name, arguments: args-json-object} /tool_call {% endif -%} ## Note - Available Apps: [Contacts, Settings, Clock, Maps, Chrome, Calendar, files, Gallery, Taodian, Mattermost, Mastodon, Mail, SMS, Camera]. - Write a small plan and finally summarize your next action (with its target element) in one sentence in thinking/thinking part. .strip() )MAI_MOBILE_SYS_PROMPT_GROUNDINGMAI_MOBILE_SYS_PROMPT_GROUNDING You are a GUI grounding agent. ## Task Given a screenshot and the users grounding instruction. Your task is to accurately locate a UI element based on the users instructions. First, you should carefully examine the screenshot and analyze the users instructions, translate the users instruction into a effective reasoning process, and then provide the final coordinate. ## Output Format Return a json object with a reasoning process in grounding_think/grounding_think tags, a [x,y] format coordinate within answer/answer XML tags: grounding_think.../grounding_think answer {coordinate: [x,y]} /answer .strip()2.2 移动系统提示词差异一览只有MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP模板支持 MCP 工具集成且通过 Jinja2 条件语法实现动态插入其余提示词版本均不包含 MCP 功能。提示词 ID核心用途思考标签操作空间特殊功能MAI_MOBILE_SYS_PROMPT标准 GUI 代理必须点击/长按/输入/滑动等全功能无MAI_MOBILE_SYS_PROMPT_NO_THINKING快速响应无思考标签同上省略思考直接返回 JSONMAI_MOBILE_SYS_PROMPT_ASK_USER_MCP模板化用户询问可选同上ask_user、double_click、Jinja2 模板、MCP 工具集成MAI_MOBILE_SYS_PROMPT_GROUNDING纯定位专用仅元素识别输出 [x,y] 坐标无操作命令2.3 工具集成差异MCP 功能只在MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP模板层集成其余版本需外部桥接。集成位置仅MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP内置 MCP 工具调用入口通过 Jinja2 模板动态注入。其余版本无 MCP 工具入口需外部调用。提示词层差异标准版无 MCP 占位符纯 JSON 输出。MCP 版模板内预留{{mcp_tools}}变量运行时注入具体工具描述。运行时差异标准版LLM 输出传统动作 JSON由外部框架手动转发至 MCP。MCP 版渲染后提示词包含完整 MCP 工具 JSONLLM 可直接调用。条件性集成仅 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP使用 Jinja2 模板语法{%if tools -%}...{%endif -%}实现动态集成独立## MCP Tools区域存放 MCP 工具描述通过{{tools}}变量动态插入可用工具信息输出格式与标准移动操作不同内直接嵌入 MCP 函数调用0x03 输出3.1 输出格式区别非 MCP 版本MAI_MOBILE_SYS_PROMPT统一格式所有操作通过mobile_use函数调用固定结构GUI 操作封装在arguments字段示例thinking.../thinking tool_call {name:mobile_use,arguments:args-json-object} /tool_callMCP 版本MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP双重格式支持标准 GUI 操作和 MCP 工具调用工具特定格式MCP 工具调用使用实际函数名作为name示例thinking.../thinking tool_call {name:function-name,arguments:args-json-object} /tool_call下面代码把LLM的输出转换为结构化输出def parse_action_to_structure_output(text: str) - Dict[str, Any]: Parse model output text into structured action format. Args: text: Raw model output containing thinking and tool_call tags. Returns: Dictionary with keys: - thinking: The models reasoning process - action_json: Parsed action with normalized coordinates Note: Coordinates are normalized to [0, 1] range by dividing by SCALE_FACTOR. text text.strip() results parse_tagged_text(text) thinking results[thinking] tool_call results[tool_call] action tool_call[arguments] # Normalize coordinates from SCALE_FACTOR range to [0, 1] if coordinate in action: coordinates action[coordinate] if len(coordinates) 2: point_x, point_y coordinates elif len(coordinates) 4: x1, y1, x2, y2 coordinates

【GUI-Agent】阿里通义MAI-UI 代码阅读（2）--- 实现

相关新闻

AI那么牛，为啥做不出来新产品颠覆抖音、淘宝

正版黄金投资软件怎么下载（具体步骤大全）

一文看懂6大AI Agent架构！零基础也能轻松读懂

最新新闻

C语言10-位运算与内存管理

收藏！500+AI Agent实战项目，小白也能快速上手大模型应用

兰亭妙微原创作品 血型分析系统UI设计

About-SwiftUI：学 SwiftUI，这一个仓库够了

Voohu：共模电感在EMI滤波器中的阻抗匹配与宽频带选型策略

【AlphaFold3】本地超算部署AlphaFold3 可接蛋白-蛋白预测 蛋白-小分子预测 突变分析 保守性分析 有序性分析 互作用力分析

日新闻

如何在PC上免费畅玩Nintendo Switch游戏：Ryujinx模拟器终极指南

【Netty源码解读和权威指南】第54篇：Netty在Elasticsearch中的应用——分布式搜索引擎的网络通信

Qwen2.5-Turbo百万上下文实战指南：百炼平台长文本处理全解析

周新闻

深入解析P89LPC932A1 CCU模块：输入捕获与PWM实战指南

进化博弈论解析AI代理欺骗行为与风险管控

SCF5250 FlashMedia接口与DMA控制器配置实战：实现嵌入式存储高效数据传输

月新闻

兰亭妙微原创作品血型分析系统UI设计

【AlphaFold3】本地超算部署AlphaFold3 可接蛋白-蛋白预测蛋白-小分子预测突变分析保守性分析有序性分析互作用力分析