Abstract
Embodied intelligence is a strategic technology in the current wave of technological change and industrial transformation and is a focal area of international competition. Mobile manipulators, with strong mobility, planning, and execution capabilities, are a preferred hardware platform for embodied intelligence. As a cross-domain, multi-scene, multi-function autonomous embodied platform, embodied-intelligence mobile manipulator systems are expected to drive the next-generation development of information technology and artificial intelligence.
Introduction
The journal China Engineering Science published a 2024 article by Professor Zhang Tao's team at Tsinghua University titled "Research on the Development of Mobile Manipulator Systems Based on Embodied Intelligence." The article summarizes the status of development, analyzes challenges, and proposes common key technologies and recommendations to support the development of mobile manipulators under the embodied-intelligence trend in China.
Artificial intelligence (AI) is a strategic technology driving a new round of scientific and industrial transformation. With rapid breakthroughs in general AI techniques, mobile manipulators that combine solid technical foundations and multi-scene applicability have emerged as an effective platform for embodied intelligence, prompting renewed global research and development efforts.
Embodied-intelligence mobile manipulator systems aim to build robots with autonomous environmental perception, deep cognitive understanding, fluent human-robot interaction, reliable decision making, and natural motion and manipulation planning. By providing a cross-domain, multi-scene, multi-function autonomous embodied platform, they can upgrade traditional mobile manipulators and guide future industry development. With brain-like architectures capable of perception, understanding, and decision making, these robots can autonomously interpret and execute high-level human commands, moving toward general intelligence.
Compared with traditional mobile robots, embodied-intelligence mobile manipulators can perform complex tasks that usually require human intelligence. As the underlying technologies mature, they have the potential to bring transformative societal impacts. Application domains include civil sectors such as services, dining, healthcare, smart home, and unmanned delivery; industrial sectors such as smart factories and intelligent manufacturing; and military applications such as dismounted operations. Most current R&D remains at the laboratory prototype stage. Systems designed for specific scenarios have made progress, but overall technology is not yet mature for broad industrialization or commercialization. Academic research focuses on environment perception, motion control, path planning, and coordination between chassis and arm, while embodied-intelligence research and mobile-manipulator engineering are evolving in parallel. This article clarifies development needs, summarizes the current status, analyzes challenges, and outlines key technologies and recommendations for future research.
Current Status of Mobile Manipulator Systems Based on Embodied Intelligence
Mobile manipulators are robots that combine mobility and manipulation, typically composed of a mobile base, a manipulator arm, and an end effector, and may evolve toward humanoid forms. Their morphology and mobile manipulation capabilities make them the form closest to human structure and an ideal hardware platform for embodied intelligence. Mobile manipulator technology has a long history and a relatively mature technical system. As an important realization path for general AI, embodied intelligence has seen recent breakthroughs, expanding the application prospects of mobile manipulators.
Mobile manipulator technologies
Mobile manipulators must perceive, navigate, and control their motion in unknown environments; their core technologies are perception, navigation, control, and dexterous manipulation. Recent advances let robots exploit multimodal data more accurately and robustly for environment perception, motion control, and path planning. As deep learning and reinforcement learning mature, robot control and multimodal perception will continue to advance, further improving perception, planning, and control capabilities.
Current developments in perception, navigation, and control are as follows. 1) Perception: mobile manipulators use sensors such as cameras, LiDAR, ultrasound, infrared, IMUs, and encoders to determine position, pose, and motion state. Multi-sensor fusion improves accuracy and robustness and supports high-precision, real-time environment perception. The perception layer also handles mapping. Simultaneous localization and mapping (SLAM) solutions typically use either LiDAR-based point-cloud mapping, which yields high-precision maps and good robustness under poor lighting but lacks rich visual detail, or vision-based methods, which provide richer environmental features but are more sensitive to lighting and algorithm complexity. 2) Navigation: given a goal and a perception map, algorithms generate discrete waypoints or continuous desired trajectories. Navigation includes global path planning and local obstacle avoidance: building a world model, computing collision-free trajectories from start to goal, and executing motion while avoiding obstacles. 3) Control: beyond perception and navigation, mobile manipulators need precise control to reach planned waypoints. Control methods include kinematic and dynamic modeling and controllers based on global linearization, approximate linearization, and Lyapunov theory. Classic control strategies include computed torque control, robust control, sliding mode control, adaptive control, neural-network control, fuzzy control, active disturbance rejection control, and compliance control.
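As a concrete illustration of the global path-planning step, the following minimal Python sketch runs A* search on a 2D occupancy grid. The grid, start, and goal are assumed to come from the perception and mapping layer described above; the code is an illustrative sketch rather than a production planner.

```python
# Minimal A* global planner on a 2D occupancy grid (0 = free, 1 = occupied).
# Illustrative only; a real system plans on maps produced by the SLAM layer.
import heapq

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    def h(cell):  # Manhattan-distance heuristic to the goal
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
    open_set = [(h(start), 0, start, None)]
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, cell, parent = heapq.heappop(open_set)
        if cell in came_from:          # already expanded with a better cost
            continue
        came_from[cell] = parent
        if cell == goal:               # reconstruct waypoints from goal to start
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc)), ng, (nr, nc), cell))
    return None  # no collision-free path exists

grid = [[0, 0, 0, 1],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(astar(grid, (0, 0), (2, 3)))  # e.g. [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3)]
```

The returned waypoints would then be tracked by one of the controllers listed above, with local obstacle avoidance refining the path online.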
In dexterous manipulation, traditional end effectors such as simple grippers lack agility. Mobile manipulators are evolving toward more dexterous and general end effectors such as hand-like structures and compliant capture mechanisms to improve generality in grasping, tool use, and handling deformable objects, enabling tasks such as assembly, welding, and transport in industrial settings, and cooking or home cleaning in domestic settings. Computer vision enhances interaction between manipulators and the environment, improving visual tracking, object recognition, mobile grasping, and human-robot interaction. Manipulator control and dexterous-manipulation techniques enable high-precision, high-performance task execution. Robots often rely on image data complemented by tactile sensors to enhance manipulation and interaction. Machine learning algorithms convert camera pixels into object categories, poses, velocities, and human expressions or gestures, supporting intelligent applications.
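The following self-contained sketch illustrates one link in this vision-for-manipulation chain: back-projecting a detected object's pixel location and measured depth into a 3D grasp point with the pinhole camera model. The intrinsics, detection values, and hand-eye transform are illustrative placeholders, not calibrated quantities.

```python
# Back-project a detected pixel plus depth into a 3D grasp point (camera frame),
# then express it in the robot base frame. All numeric values are illustrative.
import numpy as np

def pixel_to_camera_point(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth (m) into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical detector output: object center at pixel (320, 260), depth 0.85 m.
grasp_point_cam = pixel_to_camera_point(320, 260, 0.85, fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# Transform into the base frame with an assumed hand-eye calibration (identity rotation here).
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.10, 0.0, 1.20]   # camera assumed 0.1 m forward, 1.2 m above the base
grasp_point_base = (T_base_cam @ np.append(grasp_point_cam, 1.0))[:3]
print(grasp_point_base)                  # target pose the arm planner would be given
```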
Improved perception, mobility, grasping, and dexterity expand application prospects. In the civil sector, mobile manipulators appear in healthcare logistics, intelligent factories, transportation and logistics, laboratory assistants, home services, dining, and hospitality, supporting smart-city efforts. In healthcare, they can deliver medicine and tools, assist patient care, and support diagnostics. In manufacturing, they enable functions beyond fixed arms, increasing autonomy. Laboratory assistants can support chemical experiments and data-driven lab workflows. In home services, tasks include delivery, door operation, and waste removal. In dining and hospitality, they assist with food delivery, dish collection, and item transport. In military domains, they aid reconnaissance, obstacle crossing, ammunition transport, and logistics support. These application demands provide drivers for industrialization.
Embodied-intelligence mobile manipulators
The concept of embodied intelligence, first proposed in 1950, refers to agents that interact with the environment and possess autonomous planning, decision making, action, and execution capabilities, whether physical robots or simulated agents. Embodied intelligence combines perception, cognition, reasoning, and action with a body capable of executing tasks. Conceptually, an embodied-intelligence system can be thought of as having components analogous to "cortex" for cognition and reasoning, "cerebellum" for flexible and coordinated control and dexterous skills, and "brainstem" for energy allocation, basic sensing, and signal processing. Implementation relies on computer vision, multimodal perception fusion, natural language processing, causal inference, and navigation and planning technologies. Unlike offline intelligence, embodied intelligence requires an integrated "brain" capable of perception, understanding, and decision making, and a robot "body" capable of stable, safe, and natural movement. Robots must also be able to learn online and update both "brain" and "body" during execution of high-level human commands and environment interaction.
Large language models (LLMs) such as ChatGPT renewed interest in embodied intelligence by demonstrating significant improvements in personalized responses, translation, language and image understanding after training on massive datasets. Integrated large-scale models can serve as a cognitive core, enabling mobile manipulators to reason and understand. Several commercial-scale multimodal and reasoning-capable models are available internationally, including GPT-4, DALL·E 3, Gemini, WizardMath, PaLM-E, and others. These models demonstrate capabilities in commonsense reasoning, code completion, knowledge transfer, language understanding, and image and point-cloud reasoning. Multimodal models can associate text with images and point clouds, enabling pixel-level segmentation and zero-shot classification on point clouds. Some models can generate step-by-step instructions from high-level human commands by combining current sensory inputs with knowledge to decompose tasks, for example producing subtasks for tidying a room based on an image.
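The task-decomposition pattern described above can be sketched as follows. Here, call_llm is a placeholder for whatever model interface a system uses (GPT-4, PaLM-E, or similar), and the canned reply only illustrates the expected output format; this is an assumed workflow, not a specific system's API.

```python
# Hedged sketch of LLM-based task decomposition: a high-level command plus a
# textual scene description are sent to a language model, which returns an
# ordered list of subtasks. call_llm() is a placeholder for a real model call.
import json

def call_llm(prompt: str) -> str:
    # Placeholder: a real system would query GPT-4, PaLM-E, or a similar model here.
    return json.dumps(["go to the table", "pick up the cup", "place the cup in the sink"])

def decompose_command(command: str, scene_description: str) -> list[str]:
    prompt = (
        "You control a mobile manipulator.\n"
        f"Scene: {scene_description}\n"
        f"Command: {command}\n"
        "Reply with a JSON list of short, ordered subtasks."
    )
    return json.loads(call_llm(prompt))

subtasks = decompose_command("tidy the room",
                             "a cup and a plate are on the table; the sink is empty")
print(subtasks)  # ['go to the table', 'pick up the cup', 'place the cup in the sink']
```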
The major advancement of embodied-intelligence mobile manipulators is the inclusion of a reasoning "brain" that enables world understanding and intelligent decision making. Current embodied-intelligence development focuses on the "brain" layer. With the rapid evolution of multimodal large models, mobile manipulators will gain stronger reasoning, perception, cognition, and decision-making abilities, supporting more general autonomous intelligence through self-learning, self-adaptation, and self-optimization using multimodal data to select optimal actions and solutions.
Physical robot systems and platforms with perception, navigation, and manipulation are emerging. Examples include end-to-end operation models such as GR-1, Stanford's Mobile ALOHA, DeepMind's RT-2 and AutoRT, industrial mobile collaborative robots from manufacturers like KUKA, and integrated systems from several companies providing autonomous navigation, path planning, and interface integration. Some research platforms and commercial systems demonstrate autonomous charging, localization, intelligent path planning, and collaborative capabilities. Systems like Spot from Boston Dynamics provide agile mobility and extensibility for manipulation and sensing. Overall, embodied-intelligence mobile manipulators combine sensing and motion, enabling proactive perception and flexible execution. They can understand human language, perceive and interpret environments, decompose tasks, recognize objects while moving, interact with physical environments, and complete assigned tasks. Development trends include diversified forms, comprehensive functions, general-purpose task handling, autonomous behavior, and more natural human interaction. While platform industrialization is accelerating, understanding of external environments and high-level human commands still needs improvement; human-issued concrete instructions remain common. Deeper integration of embodied intelligence and mobile-manipulator systems will continue to drive the industry.
Key Technologies for Embodied-Intelligence Mobile Manipulators
Key technologies include multimodal perception, world cognition and understanding, intelligent autonomous decision making, and joint motion and manipulation planning; together, these determine the overall capability of the system.
Multimodal perception
Multimodal perception enables higher autonomy, efficiency, and generality by providing rich, stable, and accurate environmental data. Indoor motion-perception data are often multi-source, heterogeneous, and dynamic, and robots must handle lighting changes, partial observations, occlusion recovery, and inference. Robots can use multi-view images, LiDAR, and other modalities to achieve local stereo reconstruction. For outdoor or high-noise environments, fusing image, LiDAR, thermal imaging, and GNSS data improves robustness. Object detection and segmentation create spatial mappings between the perceived and actual environments, supporting unified multimodal fusion and enabling immediate perception and virtual reconstruction of local spaces and objects that feed the cognition and planning system.
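One simple instance of such fusion is projecting LiDAR points into a segmented camera image so that each point inherits a semantic label, yielding a locally reconstructed, labeled point cloud. The sketch below assumes illustrative intrinsics, an identity extrinsic transform, and a synthetic segmentation map rather than calibrated sensor data.

```python
# Fuse LiDAR geometry with image semantics: project each LiDAR point into the
# camera image and attach the semantic label at that pixel. Values are illustrative.
import numpy as np

def label_points(points_lidar, T_cam_lidar, K, seg_map):
    """Return (point, label) pairs for LiDAR points that project inside the image."""
    h, w = seg_map.shape
    labeled = []
    for p in points_lidar:
        p_cam = (T_cam_lidar @ np.append(p, 1.0))[:3]   # LiDAR frame -> camera frame
        if p_cam[2] <= 0:                               # behind the camera
            continue
        u, v, _ = K @ (p_cam / p_cam[2])                # pinhole projection
        u, v = int(round(u)), int(round(v))
        if 0 <= u < w and 0 <= v < h:
            labeled.append((p, seg_map[v, u]))          # fuse geometric + semantic data
    return labeled

K = np.array([[600.0, 0, 320.0], [0, 600.0, 240.0], [0, 0, 1.0]])
T_cam_lidar = np.eye(4)                    # identity extrinsics for illustration
seg_map = np.zeros((480, 640), dtype=int)
seg_map[200:300, 250:400] = 1              # synthetic label 1 (e.g. "table") in one region
points = np.array([[0.3, 0.0, 2.0], [2.0, 1.5, 1.0]])
print(label_points(points, T_cam_lidar, K, seg_map))
```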
World cognition and understanding
Compared with traditional mobile manipulators, embodied-intelligence systems must autonomously perceive, reason about, and plan tasks. Two main approaches are used: 1) deep-learning-based large models trained on environment perception data to form empirical cognitive representations of the perceived world; and 2) physics-based modeling and simulation of object behavior, deformation, and tool use to construct common physical models of the world. Systems combine perception and understanding to analyze high-level human commands, decompose tasks, and form upper-level planners that can independently interpret human instructions using natural perception data.
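As an illustration (not the method proposed in the article), one common form such a cognitive representation can take is a small scene graph of objects, states, and spatial relations extracted from perception; this is the kind of structure an upper-level planner could consume when decomposing a task.

```python
# Minimal scene-graph representation: objects with states plus relation triples.
# Illustrative data structure, not a specific system's internal format.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    category: str
    state: dict = field(default_factory=dict)      # e.g. {"graspable": True}

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (subject, predicate, object) triples

    def add(self, obj: SceneObject):
        self.objects[obj.name] = obj

    def relate(self, subj: str, predicate: str, obj: str):
        self.relations.append((subj, predicate, obj))

    def query(self, predicate: str):
        return [r for r in self.relations if r[1] == predicate]

graph = SceneGraph()
graph.add(SceneObject("cup_1", "cup", {"graspable": True}))
graph.add(SceneObject("table_1", "table"))
graph.relate("cup_1", "on", "table_1")
print(graph.query("on"))  # [('cup_1', 'on', 'table_1')] -- input for task decomposition
```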
Intelligent autonomous decision making
Robust, safe, and optimal decision systems are essential for stable interaction with environments and humans. Robots can generate decisions via autonomous generation, human-robot interaction, or inter-robot collaboration. Decision systems must map and align local environment perception with object understanding, align human and robot values during interaction, and convert human commands into executable robot instructions. After receiving a high-level human command, systems fuse perception data and object understanding to map local spaces and associate object types, uses, and manipulation methods. The decision module then generates step-by-step instruction sets, decomposes tasks, and aligns decisions with human values through online interaction and learning. Finally, it translates these high-level steps into executable robot instructions, supporting social navigation, object-centric navigation, human-robot and multi-robot collaboration.
Two prevalent decision-making approaches are: 1) using LLMs as the core to parse human commands and propose plans via encoding/decoding analysis, and 2) analyzing human behavior and physical-world models to infer optimal plans from current perceptions. Both have advantages and drawbacks; recent rapid advances in LLMs have increased attention on the first approach.
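Under the first approach, a key practical step is grounding each model-proposed subtask against the robot's actual skill set before execution. The sketch below uses hypothetical skill names and subtasks to show this filtering; it is an assumed pattern, not a real robot API.

```python
# Ground LLM-proposed subtasks against a known skill set; steps outside the set
# are rejected and would trigger re-planning or a clarifying query to the human.
KNOWN_SKILLS = {"navigate_to", "pick", "place", "open_gripper", "close_gripper"}

def ground_plan(subtasks):
    """Map (skill, argument) subtasks onto executable instructions; collect failures."""
    executable, rejected = [], []
    for skill, arg in subtasks:
        if skill in KNOWN_SKILLS:
            executable.append({"skill": skill, "target": arg})
        else:
            rejected.append((skill, arg))
    return executable, rejected

plan = [("navigate_to", "table"), ("pick", "cup"), ("fold", "shirt"), ("place", "sink")]
ok, bad = ground_plan(plan)
print(ok)   # instructions the low-level planner can execute
print(bad)  # [('fold', 'shirt')] -- outside the robot's skill set, needs clarification
```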
Joint motion and manipulation planning
Although base navigation and manipulator planning have matured independently, coordinated motion and manipulation planning for embodied-intelligence systems remains immature. Embodied robots now perform autonomous perception, decision making, multi-robot cooperation, and human-robot interaction, which impose higher requirements on the coordination between navigation motion and manipulation. Single-purpose navigation or manipulation controllers cannot meet demands for dexterity, efficiency, coherence, stability, and safety. Joint planning must coordinate base and arm motion, support multi-robot and human-robot collaboration, and operate under constrained spatial conditions. Typical tasks include navigating complex indoor/outdoor environments, grasping, searching, transporting, interactive manipulation, local mapping and localization, multi-joint path planning, cooperative interactive manipulation, and tool understanding and use. In shared social spaces, robots must infer human future motion in real time and plan safe, efficient, socially compliant paths to be accepted as collaborators, emphasizing human-aware social navigation. Human-robot interaction and integration technologies will therefore become research priorities.
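A deliberately simplified illustration of base-arm coordination (not the joint optimization called for above) is a policy that drives the base until the target enters the arm's reachable workspace and only then invokes the arm planner. The reach radius, step size, and coordinates below are assumed values; real joint planning would optimize base and arm motion together.

```python
# Simplified base-arm coordination: move the base until the grasp target is
# within the arm's assumed reach, then hand control to the arm planner.
import numpy as np

ARM_REACH = 0.8   # assumed reachable radius of the arm (m)
STEP = 0.2        # base translation per planning cycle (m)

def plan_step(base_xy, target_xy):
    """Return ('base', new_base_xy) or ('arm', target_xy) for one planning cycle."""
    offset = np.asarray(target_xy) - np.asarray(base_xy)
    dist = np.linalg.norm(offset)
    if dist <= ARM_REACH:
        return "arm", tuple(target_xy)        # target reachable: plan the grasp
    new_base = np.asarray(base_xy) + STEP * offset / dist
    return "base", tuple(new_base)            # otherwise drive toward the target

base, target = (0.0, 0.0), (2.0, 1.0)
for _ in range(12):
    mode, value = plan_step(base, target)
    if mode == "arm":
        print("grasp planning at", value)
        break
    base = value
```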
Challenges
As demands for robot autonomy and intelligence increase, limitations of offline intelligence become more apparent. Despite progress in multimodal perception, AI, human-robot interaction, natural language processing, and motion planning, key embodied-intelligence technologies require further advancement. Main challenges are described below.
Perception
1) Insufficient autonomous perception. Embodied-intelligence robots must not only perform local sensing at a given position but also autonomously determine required perception levels based on high-level human commands and plan motions to enrich their understanding. This raises higher expectations for autonomous perception. 2) Weak interactive perception. Robots must use high-level human commands to guide multi-granularity exploration of local environments and enrich perception data. Current autonomous perception under constrained conditions remains immature, challenging accuracy, immediacy, and effectiveness of human-robot interactive perception. 3) Slow multimodal fusion and local environment reconstruction. Robots require rapid multimodal fusion and 3D mapping to generate complete local maps; current methods do not guarantee real-time responses and autonomous replanning efficiency.
Cognition and understanding
1) Incomplete understanding of object morphology, function, use, and interaction. Robots must infer object affordances and usage from perception data. LLMs provide general understanding but still differ from human experience, making it difficult to establish object relationships and fuse information based on object state, impacting feasibility of decision making and execution. 2) Insufficient understanding of high-level human commands. For complex tasks like room tidying, warehouse transport, or rescue, current LLM-based systems struggle to fuse perception data and produce reasonable decompositions, and lack sufficient interactive learning and correction mechanisms.
Decision making
1) Weak autonomous intelligent decision capability. Robots need to fuse perception and world understanding, generate reasonable and feasible plans aligned with human values, and map human plans to executable robot instruction sequences. This requires human-like value standards, aligned understanding of physical and conceptual objects, and decomposition capabilities. Current methods often rely on LLMs with added priors or posterior correction. LLMs can produce hallucinations or illogical plans that are hard to execute. Fine-grained image and point-cloud segmentation and feature extraction methods can be computationally expensive and slow (e.g., SAM), limiting real-time navigation and decision making. Recent models like LLaMA and LLaVA have improved via pretraining and fine-tuning, but often produce overly generic instructions that are not directly executable.
Joint motion and manipulation planning
The combination of navigation and manipulator planning still suffers from inflexibility, instability, and suboptimal spatial paths. 1) In complex indoor/outdoor scenes with strong spatial and temporal constraints, existing algorithms can produce discontinuous, unnatural, unstable, or unsafe motions. 2) In dynamic environments, local-perception-based navigation and manipulation planning can be slow and inefficient, and multi-agent dynamics introduce further challenges. 3) Interaction with other dynamic agents or humans in public spaces raises stringent safety requirements. Human common-sense understanding of behavior is difficult to replicate; current robots rely on preset interaction models and limited sets of modeled human behaviors. Evaluating the effectiveness of robot behavior in human-robot systems is difficult because of system complexity and the criticality of human safety, so developing safer and more effective human-robot interaction evaluation platforms is essential.
Generic simulation platforms
The field lacks universal, reliable, multi-interface, multi-scenario simulation platforms, slowing R&D and causing deployment issues in compatibility, stability, accuracy, and generalization. Fast-paced algorithm and hardware updates and high system development costs require robust simulation environments for testing and validation. Currently, academia and industry lack a comprehensive robotics simulation platform with general interfaces, realistic physics engines, and multi-scenario coverage, which impedes development and is a key bottleneck for embodied-intelligence mobile manipulators.
Recommendations
1. Support sustained development and industry ecosystem formation
China should aim to build end-to-end research and deployment chains covering high-quality data, key frontier technologies, testing platforms, development ecosystems, and technology transfer. Collaboration between high-level R&D institutions, technology firms, and interdisciplinary organizations can concentrate strategic resources, form effective cooperation mechanisms, and create complementary "industry-academia-research-application" models. Guided by national strategic needs and policy, prioritize research into foundational, critical, and deployment technologies to build cross-domain industrial advantages for smart societies and smart cities.
2. Emphasize original breakthroughs in key technical areas
Prioritize breakthroughs in common key technologies such as environment perception, cognition and understanding, intelligent decision making, and joint motion and manipulation planning. Industry authorities should consider issuing research programs for common embodied-intelligence technologies, strengthen attention to autonomy, generality, dexterity, and safety in mobile manipulator systems, and accelerate the integration of AI capabilities into mobile manipulators by building complete hardware-software interaction platforms and advancing systems with autonomous perception, decision making, natural interaction, and safe execution.
3. Strengthen education and talent cultivation in intelligent science and robotics
Promote university-industry collaboration for key technical research and engineering integration, pool innovation resources, and deepen cooperation between research institutes and upstream/downstream enterprises. Build an innovation and industrial chain for embodied-intelligence mobile manipulators. Enhance academic discipline layouts, develop intelligent science and technology degree programs, establish specialized robotics programs, and expand graduate student recruitment in intelligent robotics. Improve curricula with interdisciplinary integration across mathematics, physics, biology, and computing, and increase practical experimentation and deployment in teaching to boost capability in deploying AI on physical mobile manipulators.
4. Encourage construction of multi-scenario validation platforms
Collaborate with healthcare, education, fire and rescue, transportation, home service, and industrial stakeholders to provide application scenarios and demonstration validation platforms for embodied-intelligence mobile manipulators. Build general-purpose test bases that cover multiple scenarios to supply research data and deployment experience. Leverage China's trade and cooperation initiatives such as the Belt and Road to promote related technologies and products internationally and expand global influence.
5. Coordinate harmonious development with human society
Given autonomous, intelligent, and interactive characteristics, development of embodied-intelligence mobile manipulators requires ethical and legal safeguards to ensure safe R&D. Ethically, decision-making processes should be transparent, with clear reasoning paths to allow human intervention and reduce risks. Legally, consider potential social impacts: clarify legal responsibility for autonomous decisions and motion planning, address workforce displacement issues through legal measures, and document system modules to identify responsible parties for incident attribution and legal enforcement.