Voice-Guided Companion Robot for Indoor Elderly Care

The growing phenomenon of “empty nest” families presents a significant societal challenge: elderly individuals often face prolonged isolation and a lack of daily companionship, and this sustained loneliness can have profound negative effects on both mental and physical well-being. As robotics technology advances, the companion robot has emerged as a promising tool for providing interaction and mitigating the sense of solitude for the elderly. However, many existing solutions suffer from cumbersome operation or limited applicability, often amounting to repurposed smart speakers with minimal mobility or platforms that require complex remote control. To create a truly supportive and intuitive companion, we have developed a system centered on an indoor voice positioning and navigation core. This design allows the robot to respond naturally to vocal summons, navigate autonomously through a home environment, and position itself near the user, lowering the interaction barrier and enhancing its utility as a practical companion robot for elderly care.

The operational paradigm is straightforward: when an elderly user calls out, the robot acquires both the voice command and the spatial location of the sound source. It then plans and executes a path to navigate to the user’s side. This process integrates several key technologies: voice command recognition, sound source localization, simultaneous localization and mapping (SLAM), and autonomous navigation with obstacle avoidance. The system is built upon the Robot Operating System (ROS) framework, which facilitates modular software development and robust communication between functional nodes. The hardware platform is a custom-designed mobile robot based on a Raspberry Pi 4B serving as the main computational core.

System Overview and Hardware Architecture

The companion robot is designed with a modular, layered hardware architecture to simplify debugging and future functional expansion. The Raspberry Pi 4B acts as the central controller, coordinating all subsystems. The primary modules include the Motion Control Module, Video Communication Module, Voice Control Module, and the Power System. Inter-module communication is primarily established via serial ports (UART/USB) from the main controller.

The robot’s mobility is provided by a Mecanum wheel holonomic drive chassis, enabling omnidirectional movement crucial for maneuvering in tight, cluttered home spaces. A 2D LiDAR sensor is mounted for environmental scanning and mapping. For voice interaction, a commercially available six-microphone circular array is employed, providing high-quality audio pickup with beamforming, noise suppression, and de-reverberation capabilities. A speaker provides audio feedback. The physical structure was modeled in SolidWorks and fabricated to house all components securely. A summary of the core hardware components is presented in Table 1.

Table 1. Core hardware components.

| Module | Core Component | Primary Function |
| --- | --- | --- |
| Main Controller | Raspberry Pi 4B | Central computation, ROS master, node coordination. |
| Motion Control | Mecanum chassis, motor drivers, LiDAR | Omnidirectional movement, environment perception, SLAM. |
| Voice Control | 6-mic array, audio codec | Sound source localization, voice command acquisition, audio playback. |
| Video Communication | Camera module | Enables video calls between the elderly user and family members. |
| Power System | Lithium battery pack, voltage regulators | Provides stable power to all electronic subsystems. |

Kinematic Model of the Mecanum Drive

Accurate kinematic modeling is essential for precise navigation control. The Mecanum wheel chassis allows three degrees of freedom in the plane: translation along the robot’s X-axis (forward/backward), translation along its Y-axis (left/right strafing), and rotation about its Z-axis (yaw). The robot’s geometric center is defined as the control point. Let $V_a$, $V_b$, $V_c$, $V_d$ represent the linear velocities of the front-right, rear-right, rear-left, and front-left wheels, respectively, so that $V_a$, $V_b$ are the right-side wheels and $V_c$, $V_d$ the left-side wheels. The robot’s body-frame velocities are $V_x$ (surge), $V_y$ (sway, positive to the left), and $\omega$ (yaw rate, positive clockwise). The parameters $a = W/2$ and $b = L/2$ denote half the track width and half the wheelbase, respectively.

The contribution of each motion component to the individual wheel speeds is derived from geometric principles. For pure translation along X+:
$$V_a = V_b = V_c = V_d = +V_x$$
For pure translation along Y+ (strafing left, from the robot’s perspective):
$$V_a = V_c = +V_y, \quad V_b = V_d = -V_y$$
For pure clockwise rotation ($\omega$):
$$V_a = V_b = -\omega(a+b), \quad V_c = V_d = +\omega(a+b)$$
Superimposing these three independent motions yields the inverse kinematic model that computes required wheel speeds from desired robot velocities:
$$
\begin{aligned}
V_a &= V_x + V_y - \omega(a+b) \\
V_b &= V_x - V_y - \omega(a+b) \\
V_c &= V_x + V_y + \omega(a+b) \\
V_d &= V_x - V_y + \omega(a+b)
\end{aligned}
$$
This model is fundamental for the low-level motor controller, allowing the companion robot to execute any commanded velocity vector $(V_x, V_y, \omega)$ accurately.
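
As an illustration, the following minimal Python sketch applies this mapping in the way the `motor_driver` node might; the wheel ordering follows the $V_a$ through $V_d$ labels above, and the chassis half-dimensions used as defaults are placeholders rather than the robot's actual geometry.

```python
# Sketch of the inverse kinematic model above, as the motor_driver node might
# apply it (illustrative, not the actual firmware). a and b are half the track
# width and half the wheelbase in metres; the values here are placeholders.

def inverse_kinematics(v_x, v_y, omega, a=0.10, b=0.12):
    """Map a commanded body velocity (v_x, v_y, omega) to the four
    Mecanum wheel linear velocities (V_a, V_b, V_c, V_d)."""
    k = a + b
    v_a = v_x + v_y - omega * k
    v_b = v_x - v_y - omega * k
    v_c = v_x + v_y + omega * k
    v_d = v_x - v_y + omega * k
    return v_a, v_b, v_c, v_d

# Pure rotation: the two sides of the chassis spin in opposite senses,
# turning the robot in place.
print(inverse_kinematics(0.0, 0.0, 0.5))
```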

Software Architecture and Voice Information Acquisition

The software stack is organized within the ROS framework. Key functionalities are packaged into discrete nodes that communicate via topics and services. The core navigation stack includes nodes for LiDAR SLAM (e.g., Cartographer), global path planning (A*), local path planning and obstacle avoidance (DWA), and motor control. A custom voice control package (`voice_control`) manages the interaction with the microphone array SDK.

This package handles audio stream processing, sends data to the speech recognition engine, and publishes the recognized text command as a ROS message. Crucially, it also interfaces with the microphone array’s directional functions. Upon detecting a sound source, the array’s Direction of Arrival (DOA) estimation is used to set the “primary microphone” direction, enhancing the signal-to-noise ratio for the voice activity coming from that bearing and improving recognition accuracy for the navigation command.
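
A minimal sketch of such a node is shown below. The `get_recognized_text()` and `get_doa_degrees()` helpers stand in for whatever calls the microphone-array SDK actually exposes, and the topic names are illustrative, not necessarily those used on the robot.

```python
#!/usr/bin/env python
# Sketch of a voice_control-style node: publishes the recognized command text
# and the sound-source bearing (DOA). The two helpers below are placeholders
# for the microphone-array SDK; topic names are illustrative.
import rospy
from std_msgs.msg import String, Float32

def get_recognized_text():
    """Placeholder for the SDK's ASR result (None when nothing was heard)."""
    return None

def get_doa_degrees():
    """Placeholder for the SDK's direction-of-arrival estimate in degrees."""
    return 0.0

def main():
    rospy.init_node('voice_control')
    cmd_pub = rospy.Publisher('/voice/command', String, queue_size=10)
    doa_pub = rospy.Publisher('/voice/doa_deg', Float32, queue_size=10)
    rate = rospy.Rate(10)  # poll the array at 10 Hz
    while not rospy.is_shutdown():
        text = get_recognized_text()
        if text:
            cmd_pub.publish(String(data=text))                # e.g. "come here"
            doa_pub.publish(Float32(data=get_doa_degrees()))  # azimuth of the speaker
        rate.sleep()

if __name__ == '__main__':
    main()
```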

| ROS Package/Node | Function |
| --- | --- |
| `voice_control` | Acquires audio, performs ASR, publishes commands and DOA data. |
| `slam_node` (Cartographer) | Processes LiDAR/odometry data to build and maintain a 2D occupancy grid map. |
| `move_base` (Navigation Stack) | Integrates the global planner (A*), local planner (DWA), and costmaps for navigation. |
| `motor_driver` | Subscribes to `cmd_vel`, implements the inverse kinematic model, controls motors via PID. |
| `main_coordinator` | A finite-state machine node that orchestrates the sequence from voice command to navigation completion. |

Sound Source Localization Algorithm

The companion robot determines the user’s location through sound source localization with the microphone array. The core principle is Time Difference of Arrival (TDOA): when a sound event occurs, the acoustic wavefront reaches each microphone at slightly different times because of their spatial separation. By estimating these time delays, the direction, and potentially the distance, to the source can be calculated.

Consider a simplified 2D model with three microphones $M_1$, $M_2$, $M_3$ arranged linearly or in a triangle, with the source at distance $r_1$ from $M_1$. Let $c$ be the speed of sound. The time delays between $M_1$ and the other microphones are:
$$\tau_{12} = \frac{r_2 - r_1}{c}, \quad \tau_{13} = \frac{r_3 - r_1}{c}$$
From geometry, the distances $r_2$ and $r_3$ can be related to $r_1$, the microphone spacing $d$, and the angle of arrival $\theta_1$ relative to the array baseline. For a linear array with $M_1$, $M_2$ spaced by $d$ and $M_1$, $M_3$ spaced by $2d$:
$$
\begin{aligned}
r_2^2 &= r_1^2 + d^2 + 2 r_1 d \cos\theta_1 \\
r_3^2 &= r_1^2 + (2d)^2 + 4 r_1 d \cos\theta_1
\end{aligned}
$$
Given the estimated $\tau_{12}$ and $\tau_{13}$, one can solve for $r_1$ and $\cos\theta_1$, thereby estimating both range and bearing. In practice, with a circular array, more sophisticated generalized cross-correlation (GCC) methods are used for robust TDOA estimation in noisy, reverberant environments. The final output is an angle (azimuth) which, when combined with the robot’s map and its own known position, can be used to set a goal coordinate for navigation. The system’s localization performance was evaluated via simulation, with key parameters and results shown in Table 2.
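
For illustration, the sketch below implements GCC-PHAT delay estimation for a single microphone pair and a simple far-field bearing conversion, assuming NumPy; it is a stand-in for, not a copy of, the array's built-in DOA routine, and the demo signal parameters are invented.

```python
# Sketch of GCC-PHAT time-delay estimation between two microphone channels.
# Given the delay tau and microphone spacing d, a far-field bearing follows
# from cos(theta) = c * tau / d. Sampling rate and test signal are invented.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Return the estimated delay (s) of `sig` relative to `ref`."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                 # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

def bearing_from_tdoa(tau, d, c=343.0):
    """Far-field angle of arrival (rad) from delay tau and spacing d."""
    return np.arccos(np.clip(c * tau / d, -1.0, 1.0))

if __name__ == '__main__':
    fs, d = 16000, 0.05                    # 16 kHz sampling, 5 cm spacing
    t = np.arange(0, 0.1, 1.0 / fs)
    ref = np.sin(2 * np.pi * 500 * t)      # synthetic 500 Hz tone
    sig = np.roll(ref, 2)                  # copy delayed by two samples
    tau = gcc_phat(sig, ref, fs, max_tau=d / 343.0)
    print(tau, np.degrees(bearing_from_tdoa(tau, d)))  # estimated delay, bearing
```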

Table 2. Sound source localization simulation parameters and results.

| Parameter | Symbol/Expression | Value |
| --- | --- | --- |
| Path loss exponent | $n$ | 2.4 |
| Signal frequency | $f_c$ | 2000 Hz |
| Sampling period | $T$ | 1 ms |
| Number of microphone pairs | | 3 |
| Simulation duration | $T_{sim}$ | 1000 s |
| Average positioning error (10 m × 10 m area) | | < 8% |

Indoor Positioning and Navigation Stack

Once a voice command like “robot, come here” is recognized and a rough direction is established, the companion robot must navigate safely to the user’s vicinity. This requires a precise map of the environment and real-time localization within it.

Mapping and Localization: We employ the Cartographer SLAM algorithm for building a consistent 2D occupancy grid map. Cartographer uses a graph-based optimization approach. The frontend creates local submaps by matching incoming LiDAR scans with a probabilistic grid. The backend performs loop closure detection; when the robot revisits a known area, constraints are added to a pose graph, which is then optimized to minimize global error, effectively reducing cumulative drift. The process can be modeled as finding the maximum likelihood set of poses $\Xi = \{\xi_i\}$ and the map $m$ given sensor observations $z_{1:t}$ and odometry data $u_{1:t}$:
$$\Xi^*, m^* = \arg\max_{\Xi, m} P(\Xi, m | z_{1:t}, u_{1:t})$$
This yields a highly accurate map (typically with ~5 cm resolution) essential for reliable navigation of the companion robot.
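
As a toy illustration of the graph-optimization idea behind the backend (not Cartographer's own implementation), the sketch below optimizes a one-dimensional pose chain with a single loop-closure constraint, assuming NumPy and SciPy are available; the odometry values are invented for the example.

```python
# Toy illustration of the pose-graph idea behind the Cartographer backend
# (not Cartographer's own code): 1-D poses x_0..x_4 linked by noisy odometry
# constraints, plus one loop-closure constraint on x_4 - x_0. Optimizing the
# graph spreads the accumulated drift over the whole trajectory.
import numpy as np
from scipy.optimize import least_squares

odom = [1.1, 1.0, 0.9, 1.05]          # measured step lengths (true steps: 1.0)
loop = 4.0                             # loop closure: x_4 - x_0 should be 4.0

def residuals(x):
    r = [(x[i + 1] - x[i]) - odom[i] for i in range(len(odom))]  # odometry edges
    r.append((x[-1] - x[0]) - loop)                              # loop-closure edge
    r.append(x[0])                                               # anchor x_0 at 0
    return r

x0 = np.concatenate(([0.0], np.cumsum(odom)))   # dead-reckoned initial guess
sol = least_squares(residuals, x0)
print("before optimization:", x0)
print("after  optimization:", sol.x)
```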

Path Planning: Navigation is split into global and local planning. The global planner uses the A* search algorithm on the occupancy grid to find the shortest feasible path from the robot’s current location to the user-defined goal. A* evaluates nodes using a cost function:
$$f(n) = g(n) + h(n)$$
where $g(n)$ is the exact cost from the start node to node $n$, and $h(n)$ is a heuristic (often Euclidean distance) estimating the cost from $n$ to the goal. The algorithm efficiently expands the most promising nodes to find an optimal path.
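
For illustration, the sketch below runs a compact A* search on a small occupancy grid with a Euclidean heuristic; the grid contents and four-connected neighborhood are assumptions for the example, and on the robot the global planner runs inside `move_base` rather than as standalone code like this.

```python
# Compact A* sketch on a 2-D occupancy grid (0 = free, 1 = occupied), using
# Euclidean distance as the heuristic h(n). Grid contents are illustrative.
import heapq, math

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    open_set = [(h(start), 0.0, start, None)]   # entries: (f, g, node, parent)
    came_from, g_cost = {}, {start: 0.0}
    while open_set:
        f, g, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue                            # already expanded with lower cost
        came_from[node] = parent
        if node == goal:                        # reconstruct the path
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (node[0] + dr, node[1] + dc)
            if 0 <= nb[0] < rows and 0 <= nb[1] < cols and grid[nb[0]][nb[1]] == 0:
                ng = g + 1.0
                if ng < g_cost.get(nb, float('inf')):
                    g_cost[nb] = ng
                    heapq.heappush(open_set, (ng + h(nb), ng, nb, node))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))
```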

However, the global path is static. The Dynamic Window Approach (DWA) local planner handles dynamic obstacle avoidance. It samples feasible velocity commands $(v, \omega)$ within the robot’s dynamic constraints (acceleration limits, current speed). For each candidate $(v, \omega)$, it simulates the robot’s trajectory a short time into the future. Each trajectory is scored based on its alignment with the global path, distance to obstacles, and forward progress. The highest-scoring velocity command is selected and sent to the motor controller. The robot’s omnidirectional capability significantly enhances DWA’s effectiveness, as it can strafe away from obstacles without reorienting first. The trajectory simulation for a sampled velocity $(v_x, v_y)$ over a time step $\Delta t$, given current heading $\theta_t$, projects displacement in the world frame as:
$$
\begin{aligned}
\Delta x &= (v_x \cos\theta_t - v_y \sin\theta_t)\,\Delta t \\
\Delta y &= (v_x \sin\theta_t + v_y \cos\theta_t)\,\Delta t
\end{aligned}
$$
where the lateral component $v_y$ acts along the body y-axis, which is oriented at $\theta_t + \pi/2$ in the world frame.
This iterative simulation allows the companion robot to anticipate and avoid collisions proactively.
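
The sketch below illustrates this sample-simulate-score loop for a holonomic base in a highly simplified form; the velocity limits, scoring weights, goal, and obstacle positions are invented for the example, and rotation sampling is omitted for brevity.

```python
# Simplified sample-simulate-score loop in the spirit of DWA for a holonomic
# base (rotation sampling omitted for brevity). Velocity limits, weights, the
# goal, and the obstacle list are invented for the example.
import math

def simulate(vx, vy, theta, dt=0.1, horizon=1.5):
    """Roll out a constant-velocity trajectory in the world frame."""
    x = y = 0.0
    traj = []
    for _ in range(int(round(horizon / dt))):
        x += (vx * math.cos(theta) - vy * math.sin(theta)) * dt
        y += (vx * math.sin(theta) + vy * math.cos(theta)) * dt
        traj.append((x, y))
    return traj

def score(traj, goal, obstacles, w_goal=1.0, w_clear=0.5):
    """Reward progress toward the goal and clearance from obstacles."""
    end = traj[-1]
    goal_cost = math.hypot(goal[0] - end[0], goal[1] - end[1])
    clearance = min(math.hypot(p[0] - o[0], p[1] - o[1])
                    for p in traj for o in obstacles) if obstacles else 1.0
    return -w_goal * goal_cost + w_clear * min(clearance, 1.0)

def best_velocity(theta, goal, obstacles, v_max=0.3, step=0.1):
    """Pick the sampled (vx, vy) whose simulated trajectory scores highest."""
    n = int(round(v_max / step))
    samples = [(i * step, j * step)
               for i in range(-n, n + 1) for j in range(-n, n + 1)]
    return max(samples,
               key=lambda v: score(simulate(v[0], v[1], theta), goal, obstacles))

print(best_velocity(theta=0.0, goal=(1.0, 0.5), obstacles=[(0.4, 0.0)]))
```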

System Integration and Experimental Validation

The complete system was integrated and tested in a real home-like environment to validate the functionality and reliability of the voice-guided companion robot. The test scenario involved placing the robot in a cluttered room and issuing a vocal summons from a specific location, with obstacles deliberately placed in the potential path.

Upon receiving the voice command “come here,” the robot’s `voice_control` node published the command text and the estimated sound source direction. The `main_coordinator` node translated this directional data into a goal coordinate on the existing map. The `move_base` navigation stack then engaged: the global planner (A*) computed an initial path, and the local planner (DWA) executed the movement while dynamically avoiding the placed obstacles. Real-time visualization in RViz confirmed successful path planning and obstacle circumvention. The robot consistently navigated to a point near the user, demonstrating the practical viability of the system. The integration of voice positioning with robust SLAM-based navigation proved effective, making the companion robot responsive and user-friendly.
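
A sketch of the goal-dispatch step in the coordinator is shown below; the two-metre standoff, the `map` frame, and the way the robot's pose is obtained are assumptions for illustration, while the `move_base` action interface is the standard ROS one.

```python
#!/usr/bin/env python
# Sketch of the coordinator step that turns a DOA bearing into a navigation
# goal and hands it to move_base via actionlib. The 2 m standoff, the 'map'
# frame, and the way the robot pose is supplied are assumptions for
# illustration; the action interface itself is the standard move_base one.
import math
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def send_goal_from_doa(doa_rad, robot_x, robot_y, robot_yaw, standoff=2.0):
    """Project a goal `standoff` metres along the sound-source bearing."""
    bearing = robot_yaw + doa_rad                    # DOA is measured relative to the robot
    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = 'map'
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = robot_x + standoff * math.cos(bearing)
    goal.target_pose.pose.position.y = robot_y + standoff * math.sin(bearing)
    goal.target_pose.pose.orientation.z = math.sin(bearing / 2.0)   # yaw-only quaternion,
    goal.target_pose.pose.orientation.w = math.cos(bearing / 2.0)   # so the robot faces the user
    client = actionlib.SimpleActionClient('move_base', MoveBaseAction)
    client.wait_for_server()
    client.send_goal(goal)
    client.wait_for_result()
    return client.get_state()

if __name__ == '__main__':
    rospy.init_node('goal_from_doa_demo')
    send_goal_from_doa(math.radians(45.0), robot_x=0.0, robot_y=0.0, robot_yaw=0.0)
```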

In conclusion, this work presents the design and implementation of an elderly companion robot with an intuitive voice positioning and navigation system. By addressing the critical issue of complex operation, the robot allows users to interact with it naturally through speech, summoning it to their location effortlessly. The modular design based on Raspberry Pi and ROS, combined with precise Mecanum wheel kinematics, robust sound source localization, and advanced Cartographer-A*-DWA navigation, results in a functional and adaptable platform. This companion robot not only fulfills a core need for accessible companionship but also provides a foundation for integrating additional assistive features such as medication reminders, fall detection, or enhanced social interaction capabilities, further solidifying its role as a comprehensive tool for supporting independent living for the elderly.
