【GUI-Agent】阿里通义MAI-UI 代码阅读(2)--- 实现 MAI-UI 的两类核心Agent如下本篇会介绍这两类AgentAgent文件任务输出协议MAIGroundingAgentsrc/mai_grounding_agent.pyUI 元素定位单步grounding_think./grounding_think{coordinate:[x,y]}坐标基于 SCALE_FACT0R999 归一化MAIUINavigationAgentsrc/mai_navigation_agent.py多步移动端GUI导航支持ask_user与mcp_call.tool_call{json}/tool_call多轮带历史截图0x01 工程实现特色MAI-UI 工程实现的三个特色如下。1.1 特色1特色1三套系统提示词对应三种Agent形态grounding / 纯导航 / ask_user MCP 增强导航src/prompt.py同时维护MAI_MOBILE_SYS_PROMPT_GROUNDING 一 单步元素定位MAI_MOBILE_SYS_PROMPT 一 标准多步导航MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 一 在导航动作集里叠加两个特殊工具ask_userquestion模型主动反问用户、把任务“打回去mcp_calltoolargs调外部MCP工具如高德导航补全设备端做不到的能力意义这是Agent-User Interaction MCP Augmentation 范式在代码层面的真实落点一不是新API、就是同一个模型在不同system prompt下解锁不同动作集。新增交互类工具的正确姿势就是改prompt.pyparse_tagged_text的 schema而不是另起一个Agent类。1.2 特色2特色2是归一化坐标空间SCALE_FACTOR 999 XML标签输出协议而非function-calling。src/mai_grounding_agent.py 与 src/mai_naivigation_agent.py 都硬编码为 SCALE_FACTOR999模型永远输出[0999]区间整数由客户端按当前截图WH反归一化。输出不是OpenAI function-calling而是裸文本里的 XML 标签Grounding:grounding_think.../grounding_think{coordinate:[x,y]}Navigation:.tool_call{json}/tool_call兼容 thinking 模型的解析器parse_grounding_response、parse_tagged_text错误统一抛 ValueError。意义跨分辨率泛化同一个模型同一个权重无缝服务任意手机分辨率不需要在 prompt里写屏幕尺寸协议无关于推理后端一VLLM0.11.0、HFtransformers 本地推理、DashScope都能用因为只解析纯文本不依赖任何后端的tool-call结构代价解析鲁棒性必须由客户端自己保证所以两个parser都做了容错显式异常1.3 特色 3特色 3无状态服务端 客户端自管TrajMemory每步把历史截图重塞回 messagesBaseAgent 持有 traj_memoryTrajMemory每个 TrajStep 同时存 screenshot: Image 和 screenshot_bytesbytes渲染vs序列化双用MAIUINaivigationAgent._build_messages() 按 runtime_conf[history_n] 把最近 N 步的“截图模型回复“重组成多轮user/assistant对话再发给vLLM一一一vLLM 端零会话状态。save_traj()/load_traj()走bytes可被序列化/回放/做评测离线分析。stept的请求体每步独立、无状态如下意义可回放、可评测、可断点续跑—save_traj出dict、load_traj直接灌回离线replay不需要真机/模拟器横向扩展友好一VLLM可以集群水平扩因为没有会话粘性这正契合 scaling parallel environments up to 512的训l练形态在推理侧的对应做法代价每步N张图都要重传带宽与 prefill 成本随 history_n线性增长调小 history_n是常见的省 token 技巧。1.4 小结MAI-UI的工程独到之处不是模型本身而是这套客户端契约分辨率无关的999坐标空间 XML标签协议与后端解耦 无状态多轮重放与历史长度解耦 三档 prompt解锁的grounding/导航/ask_userMCP三种形态一一一后续任何二次开发都沿着这四条线走而不是去改模型契约。0x02 提示词2.1 提示词代码以下是提示词代码。MAI_MOBILE_SYS_PROMPTMAI_MOBILE_SYS_PROMPT You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format For each function call, return the thinking process in thinking /thinking tags, and a json object with function name and arguments within tool_call/tool_call XML tags: thinking ... /thinking tool_call {name: mobile_use, arguments: args-json-object} /tool_call ## Action Space {action: click, coordinate: [x, y]} {action: long_press, coordinate: [x, y]} {action: type, text: } {action: swipe, direction: up or down or left or right, coordinate: [x, y]} # coordinate is optional. Use the coordinate if you want to swipe a specific UI element. {action: open, text: app_name} {action: drag, start_coordinate: [x1, y1], end_coordinate: [x2, y2]} {action: system_button, button: button_name} # Options: back, home, menu, enter {action: wait} {action: terminate, status: success or fail} {action: answer, text: xxx} # Use escape characters \\, \\, and \\n in text part to ensure we can parse the text in normal python string format. ## Note - Write a small plan and finally summarize your next action (with its target element) in one sentence in thinking/thinking part. - Available Apps: [Camera,Chrome,Clock,Contacts,Dialer,Files,Settings,Markor,Tasks,Simple Draw Pro,Simple Gallery Pro,Simple SMS Messenger,Audio Recorder,Pro Expense,Broccoli APP,OSMand,VLC,Joplin,Retro Music,OpenTracks,Simple Calendar Pro]. You should use the open action to open the app as possible as you can, because it is the fast way to open the app. - You must follow the Action Space strictly, and return the correct json object within thinking /thinking and tool_call/tool_call XML tags. .strip()MAI_MOBILE_SYS_PROMPT_NO_THINKINGMAI_MOBILE_SYS_PROMPT_NO_THINKING You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format For each function call, return a json object with function name and arguments within tool_call/tool_call XML tags: tool_call {name: mobile_use, arguments: args-json-object} /tool_call ## Action Space {action: click, coordinate: [x, y]} {action: long_press, coordinate: [x, y]} {action: type, text: } {action: swipe, direction: up or down or left or right, coordinate: [x, y]} # coordinate is optional. Use the coordinate if you want to swipe a specific UI element. {action: open, text: app_name} {action: drag, start_coordinate: [x1, y1], end_coordinate: [x2, y2]} {action: system_button, button: button_name} # Options: back, home, menu, enter {action: wait} {action: terminate, status: success or fail} {action: answer, text: xxx} # Use escape characters \\, \\, and \\n in text part to ensure we can parse the text in normal python string format. ## Note - Available Apps: [Camera,Chrome,Clock,Contacts,Dialer,Files,Settings,Markor,Tasks,Simple Draw Pro,Simple Gallery Pro,Simple SMS Messenger,Audio Recorder,Pro Expense,Broccoli APP,OSMand,VLC,Joplin,Retro Music,OpenTracks,Simple Calendar Pro]. You should use the open action to open the app as possible as you can, because it is the fast way to open the app. - You must follow the Action Space strictly, and return the correct json object within thinking /thinking and tool_call/tool_call XML tags. .strip()MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP# Placeholder prompts for future features MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP Template( You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format For each function call, return the thinking process in thinking /thinking tags, and a json object with function name and arguments within tool_call/tool_call XML tags: thinking ... /thinking tool_call {name: mobile_use, arguments: args-json-object} /tool_call ## Action Space {action: click, coordinate: [x, y]} {action: long_press, coordinate: [x, y]} {action: type, text: } {action: swipe, direction: up or down or left or right, coordinate: [x, y]} # coordinate is optional. Use the coordinate if you want to swipe a specific UI element. {action: open, text: app_name} {action: drag, start_coordinate: [x1, y1], end_coordinate: [x2, y2]} {action: system_button, button: button_name} # Options: back, home, menu, enter {action: wait} {action: terminate, status: success or fail} {action: answer, text: xxx} # Use escape characters \\, \\, and \\n in text part to ensure we can parse the text in normal python string format. {action: ask_user, text: xxx} # you can ask user for more information to complete the task. {action: double_click, coordinate: [x, y]} {% if tools -%} ## MCP Tools You are also provided with MCP tools, you can use them to complete the task. {{ tools }} If you want to use MCP tools, you must output as the following format: thinking ... /thinking tool_call {name: function-name, arguments: args-json-object} /tool_call {% endif -%} ## Note - Available Apps: [Contacts, Settings, Clock, Maps, Chrome, Calendar, files, Gallery, Taodian, Mattermost, Mastodon, Mail, SMS, Camera]. - Write a small plan and finally summarize your next action (with its target element) in one sentence in thinking/thinking part. .strip() )MAI_MOBILE_SYS_PROMPT_GROUNDINGMAI_MOBILE_SYS_PROMPT_GROUNDING You are a GUI grounding agent. ## Task Given a screenshot and the users grounding instruction. Your task is to accurately locate a UI element based on the users instructions. First, you should carefully examine the screenshot and analyze the users instructions, translate the users instruction into a effective reasoning process, and then provide the final coordinate. ## Output Format Return a json object with a reasoning process in grounding_think/grounding_think tags, a [x,y] format coordinate within answer/answer XML tags: grounding_think.../grounding_think answer {coordinate: [x,y]} /answer .strip()2.2 移动系统提示词差异一览只有MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP模板支持 MCP 工具集成且通过 Jinja2 条件语法实现动态插入其余提示词版本均不包含 MCP 功能。提示词 ID核心用途思考标签操作空间特殊功能MAI_MOBILE_SYS_PROMPT标准 GUI 代理 必须点击/长按/输入/滑动等全功能无MAI_MOBILE_SYS_PROMPT_NO_THINKING快速响应无思考标签同上省略思考直接返回 JSONMAI_MOBILE_SYS_PROMPT_ASK_USER_MCP模板化用户询问可选同上ask_user、double_click、Jinja2 模板、MCP 工具集成MAI_MOBILE_SYS_PROMPT_GROUNDING纯定位专用仅元素识别输出 [x,y] 坐标无操作命令2.3 工具集成差异MCP 功能只在MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP模板层集成其余版本需外部桥接。集成位置仅MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP内置 MCP 工具调用入口通过 Jinja2 模板动态注入。其余版本无 MCP 工具入口需外部调用。提示词层差异标准版无 MCP 占位符纯 JSON 输出。MCP 版模板内预留{{mcp_tools}}变量运行时注入具体工具描述。运行时差异标准版LLM 输出传统动作 JSON由外部框架手动转发至 MCP。MCP 版渲染后提示词包含完整 MCP 工具 JSONLLM 可直接调用。条件性集成仅 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP使用 Jinja2 模板语法{%if tools -%}...{%endif -%}实现动态集成独立## MCP Tools区域存放 MCP 工具描述通过{{tools}}变量动态插入可用工具信息输出格式与标准移动操作不同 内直接嵌入 MCP 函数调用0x03 输出3.1 输出格式区别非 MCP 版本MAI_MOBILE_SYS_PROMPT统一格式所有操作通过mobile_use函数调用固定结构GUI 操作封装在arguments字段示例thinking.../thinking tool_call {name:mobile_use,arguments:args-json-object} /tool_callMCP 版本MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP双重格式支持标准 GUI 操作和 MCP 工具调用工具特定格式MCP 工具调用使用实际函数名作为name示例thinking.../thinking tool_call {name:function-name,arguments:args-json-object} /tool_call下面代码把LLM的输出转换为结构化输出def parse_action_to_structure_output(text: str) - Dict[str, Any]: Parse model output text into structured action format. Args: text: Raw model output containing thinking and tool_call tags. Returns: Dictionary with keys: - thinking: The models reasoning process - action_json: Parsed action with normalized coordinates Note: Coordinates are normalized to [0, 1] range by dividing by SCALE_FACTOR. text text.strip() results parse_tagged_text(text) thinking results[thinking] tool_call results[tool_call] action tool_call[arguments] # Normalize coordinates from SCALE_FACTOR range to [0, 1] if coordinate in action: coordinates action[coordinate] if len(coordinates) 2: point_x, point_y coordinates elif len(coordinates) 4: x1, y1, x2, y2 coordinates