In recent years, the commercialization of humanoid robots has accelerated: the market continues to expand, and application scenarios have extended from specialized tasks such as firefighting and high-risk inspections to manufacturing collaboration and home services. This diversification increases the environmental complexity that humanoid robots face. Frequent lighting changes in homes and mechanical occlusion or oil contamination in industrial settings place two core requirements on the front-end image processing stage of visual perception: real-time performance and processing accuracy in complex environments. In industrial collaboration, for example, a humanoid robot must identify obstacles within an extremely short time to avoid collisions, and high latency drastically increases risk; likewise, under sudden lighting changes or partial occlusion of a target, insufficient image processing accuracy degrades subsequent recognition and compromises navigation and interaction safety. Internationally, applications in robotics, smart manufacturing, and edge intelligence are pushing edge-side visual front-ends toward low latency, configurability, and evolvability; in particular, the image signal processing (ISP) and preprocessing pipeline between the camera sensor and back-end recognition has become a key factor in system reliability and interaction safety.
Current mainstream image processing solutions struggle to meet both requirements simultaneously. ARM-based solutions, while backed by a mature software ecosystem that allows rapid integration of multiple algorithms, are constrained by the von Neumann architecture: frequent data exchange with memory drives power consumption up under high load, and serial execution leads to high image processing latency, making them ill-suited to dynamic scenarios. Meanwhile, international engineering practice for camera front-end preprocessing emphasizes predictable end-to-end latency and customizable data paths that reduce the overhead between sensor input and the vision algorithms. FPGA-based solutions exploit hardware-level parallelism to reduce latency effectively, but existing FPGA designs often focus on a single function, lack the multi-function integration required by the complex scenarios of humanoid robots, and therefore cannot directly support the full workflow of a visual front-end.
Addressing these pain points, this study designs an integrated image processing system based on Robei EDA, with FPGA as the hardware core, integrating functions such as brightness adjustment, semantic extraction, and gesture recognition preprocessing. It solves issues like poor real-time performance and insufficient adaptability in traditional solutions, providing more efficient support for the visual front-end of humanoid robots.
The overall architecture of the integrated image processing system is illustrated in the figure below, consisting of the host computer, FPGA hardware layer, and USB-UART communication link. This architecture aims to address the two core pain points of “poor real-time performance” and “single functionality”: through hardware-software co-design, it leverages FPGA’s parallel computing advantages to reduce processing latency and enhances adaptability to complex scenarios through multi-module integration.

The host computer is deployed on the control terminal of the humanoid robot (or an external computer), building a Web front-end control interface and back-end service based on the Python Flask framework. The front-end includes real-time display areas, parameter configuration areas, and functional mode areas, supporting flexible parameter adjustments for scenarios such as home lighting changes and industrial occlusions. The back-end integrates parameter calculation and image post-processing extension functions, compensating for the limitation of traditional FPGA solutions that only cover single functions. The FPGA hardware layer is deployed in the robot’s visual processing unit, with Xilinx Artix-7 as the core. Using Robei EDA for top-level design, it integrates UART reception, FIFO buffering, serial-to-parallel conversion, image enhancement, and UART transmission modules, cascaded according to data flow. Hardware-level parallel processing replaces ARM-based serial operations, reducing latency caused by frequent memory interactions. The host computer and FPGA are connected via a USB-UART chip linking USB ports and FPGA’s RX/TX pins, forming a hardware-software communication link.
During system operation, the host computer front-end receives user instructions (processing mode, parameter levels, etc.), and the back-end generates function enable and level parameters accordingly, sending them to the FPGA via UART protocol. The FPGA parses control parameters through the UART reception module and loads them into the image enhancement module to determine processing strategies. Simultaneously, the host computer sends raw image data in batches. The FPGA writes it into the FIFO buffer via the UART reception module to alleviate rate differences between serial reception and parallel processing, avoiding overflow or waiting. The data is then reconstructed into parallel format through serial-to-parallel conversion and input into the image enhancement module for processing such as brightness adjustment and skin tone ROI extraction. The processed results are converted back to serial data via parallel-to-serial conversion and returned to the host computer by the UART transmission module.
After receiving the returned data, the host computer decides whether to enable back-end recognition functions based on instructions, performing contour detection and convex hull defect analysis on skin tone ROI regions to achieve 0–5 digital gesture recognition, thus providing a non-contact interaction interface for the humanoid robot. Ultimately, the system implements functions such as parameter visualization configuration, real-time image processing, result display, and gesture recognition extension, meeting real-time requirements while considering accuracy needs in multi-scenarios, supporting visual preprocessing for humanoid robots in dynamic environments.
The host computer is the interactive control and data post-processing core of the integrated image processing system. Its central design goal is to help resolve the "single functionality" pain point by providing a convenient operation interface and a reliable data transmission link, complementing traditional solutions in data interaction and scenario adaptability. It supplies precise control instructions and stable raw image data to the FPGA hardware layer, ensuring that hardware processing starts and proceeds efficiently. The host software is built as a modular Python architecture around three core functions: parameter configuration, data interaction, and result visualization. Each functional module operates independently yet cooperates with the others to form a complete interaction and data chain.
The parameter configuration module is the core for the host computer to control FPGA image processing strategies, supporting automatic and manual mode switching, allowing flexible adjustment of processing parameters based on the humanoid robot’s environment to ensure image enhancement effects adapt to scenario needs. The function mapping table for control parameters, including function enable parameters (cfg_mode) and parameter level parameters (cfg_param), is summarized below:
| Bit Position | Function Enable Parameter (cfg_mode) | Parameter Level Parameter (cfg_param) |
|---|---|---|
| 7 | Traffic Light Recognition Control | Brightness Level |
| 6 | – | – |
| 5 | Skin Tone ROI Enable | Contrast Level |
| 4 | Green Screen Matting Enable | – |
| 3 | Color Temperature Mode Control | Color Temperature Intensity Level |
| 2 | – | ROI Sensitivity |
| 1 | Contrast Enable | – |
| 0 | Brightness Enable | – |
In automatic mode, the module generates control parameters dynamically from image quality analysis: it computes the grayscale mean of the raw image to characterize its brightness distribution, combines this with the contrast standard deviation and the proportion of skin tone pixels, and derives the parameters and function enables according to preset rules, so that the adjusted parameters feed directly into subsequent recognition and processing. In manual mode, the module provides a visual parameter adjustment interface: users set the brightness level, contrast level, and skin tone sensitivity, and enable functions such as brightness adjustment, green screen matting, and gesture recognition via sliders and switches on the front-end interface. After receiving the instructions, the back-end assembles the control parameters according to the FPGA's bit allocation rules, ensuring the parameter format matches the FPGA control logic exactly and avoiding parsing deviations.
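The bit-packing step for the function-enable byte can be illustrated with a short sketch. The Python fragment below assembles cfg_mode from front-end switches following the bit positions in the table above; the helper name and default arguments are illustrative assumptions rather than the actual back-end code, and cfg_param would be packed analogously.

```python
def pack_cfg_mode(brightness_en=False, contrast_en=False, color_temp_en=False,
                  green_screen_en=False, skin_roi_en=False, traffic_light_en=False):
    """Assemble the cfg_mode function-enable byte per the bit map above (hypothetical helper)."""
    mode = 0
    mode |= int(brightness_en)    << 0  # bit 0: brightness enable
    mode |= int(contrast_en)      << 1  # bit 1: contrast enable
    mode |= int(color_temp_en)    << 3  # bit 3: color temperature mode control
    mode |= int(green_screen_en)  << 4  # bit 4: green screen matting enable
    mode |= int(skin_roi_en)      << 5  # bit 5: skin tone ROI enable
    mode |= int(traffic_light_en) << 7  # bit 7: traffic light recognition control
    return mode

# Example: enable brightness adjustment and skin tone ROI extraction.
# pack_cfg_mode(brightness_en=True, skin_roi_en=True) == 0b00100001
```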
The image transmission and data communication module is the data bridge between the host computer and the FPGA hardware layer. Based on the serial communication protocol, it implements the full data path: raw image decomposition, transmission of image data and control parameters, and reception of processed images. The module configures the communication parameters during serial port initialization, ensuring they strictly match the FPGA's UART transceiver. When transmitting a raw image, it decomposes the BMP image into minimum transmission units of a 2-byte pixel index plus 3 bytes of RGB data and sends them in batches of 1024 pixels, avoiding serial buffer overflow from an oversized single transfer. After each batch, the module checks the acknowledgment returned by the FPGA; if the number of received bytes does not equal the number of sent bytes, a retransmission is triggered so that no data is lost. The module also monitors transmission progress in real time, pushing the current pixel count, total pixel count, and transfer rate to the front-end via Socket.IO. When the FPGA returns processed image data, the module first verifies the data format and then parses it into pixel arrays, providing reliable input for subsequent image post-processing.
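The framing and batching logic can be sketched as follows. This is a minimal pyserial-based illustration of the 2-byte index + 3-byte RGB framing and the length-checked retransmission described above; the function name, big-endian index order, and ack handling are simplifying assumptions, not the deployed module.

```python
import serial  # pyserial

PIXELS_PER_BATCH = 1024
BYTES_PER_PIXEL = 5  # 2-byte pixel index + 3-byte RGB

def send_image(port, pixels):
    """Send (index, r, g, b) tuples in 1024-pixel batches and verify the returned length.

    `pixels` is a list of (index, r, g, b); retransmitting when the returned
    byte count is short is a simplified stand-in for the module's retry logic.
    """
    for start in range(0, len(pixels), PIXELS_PER_BATCH):
        batch = pixels[start:start + PIXELS_PER_BATCH]
        payload = bytearray()
        for idx, r, g, b in batch:
            payload += idx.to_bytes(2, "big")   # 2-byte pixel index (byte order assumed)
            payload += bytes((r, g, b))          # 3-byte RGB
        while True:
            port.write(payload)
            ack = port.read(len(payload))        # acknowledgment bytes from the FPGA
            if len(ack) == len(payload):
                break                            # lengths match: batch accepted

# Usage sketch: parameters must match the FPGA UART (921600 bps, 8N1).
# ser = serial.Serial("/dev/ttyUSB0", 921600, timeout=1)
# send_image(ser, [(0, 255, 0, 0), (1, 0, 255, 0)])
```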
The result visualization and interaction module is the user's operation and feedback window on the host computer. Built on a Web front-end framework, it makes the processing pipeline monitorable, the results viewable, and the history traceable, lowering the barrier to operation. The interface is divided into three functional areas: the real-time display area uses a dual-window comparison layout that shows the original and processed images side by side and supports zooming for detail inspection; the progress feedback area updates transmission progress, the currently active processing functions, and system status in real time, and raises prompts when anomalies occur; the history area automatically stores processing logs from the past 24 hours, including processing time, parameters used, and image quality comparison data, so users can review results across scenarios. In addition, the module supports a real-time preview of the parameter configuration: after the user adjusts a parameter, the front-end immediately displays the corresponding encoded format, helping users understand the current control logic and avoid misoperation.
The FPGA preprocessing IP adopts a modular hardware architecture. Its primary goal is to resolve the "poor real-time performance" pain point, while multi-module integration also breaks through the "single functionality" limitation, focusing on the real-time preprocessing needs of visual data for humanoid robots. The functional modules are connected as a pipeline, securing processing efficiency at the architectural level. The UART reception module and FIFO buffer module serve as the data entry point and handle serial data exchange with the host computer. The FIFO, implemented with an 8-bit data width and a depth of 1024, absorbs the rate difference between the UART's byte-level serial stream and the subsequent pixel-level parallel processing, guaranteeing continuous input and removing rate bottlenecks for the downstream parallel stages, which directly supports the real-time requirement.
After stable output from the FIFO buffer, data sequentially enters image format conversion, basic image enhancement, semantic region extraction, and invalid region removal modules. The integration of these modules breaks through the limitation of existing FPGA solutions focusing only on single functions, covering multi-scenario needs of humanoid robots. Simultaneously, seamless data transfer between modules through standard hardware interfaces further strengthens real-time advantages. Ultimately, output preprocessed image data with complete retention of target regions and lightweight invalid regions reduces computational overhead for subsequent recognition functions, forming a closed loop of “real-time processing, efficient output.”
Data reception and buffering form the foundational data entry point of the FPGA preprocessing IP and consist of the UART reception module cascaded with the FIFO buffer module; their core function is to provide a stable raw data stream for the subsequent serial-to-parallel conversion and image preprocessing. The UART reception module is designed for a 50 MHz system clock and a 921600 bps baud rate and receives the serial data transmitted by the host computer over USB-UART: a UART state machine parses the serial frames and automatically separates the first 2 bytes of control parameters from the subsequent image data, so parameters and data are parsed without deviation. The FIFO buffer module is a synchronous FIFO with an 8-bit data width and a depth of 1024; it absorbs the rate difference between the byte-level serial stream received by the UART and the 5-byte-per-pixel parallel stream produced by serial-to-parallel conversion, temporarily storing the image data output by the UART reception module and releasing it at the pace of the downstream modules, avoiding overflow or data loss and keeping the data chain continuous.
Image format conversion converts RGB color space to YCbCr, with the core of separating image brightness information Y component and color information Cb, Cr components, where Y represents pixel brightness, and Cb and Cr represent offsets of blue and red relative to grayscale, respectively. This separation allows the system to independently process brightness and color, simplifying computational flow and improving processing efficiency. This module uses ITU-R BT.601 standard for conversion, suitable for standard-definition image scenarios, with color space mapping more aligned with image characteristics collected by humanoid robot vision. The original formulas are as follows:
$$Y = 0.299R + 0.587G + 0.114B$$
$$Cb = 128 - 0.1687R - 0.3313G + 0.5B$$
$$Cr = 128 + 0.5R - 0.4187G - 0.0813B$$
Taking the Y component as an example of the FPGA hardware derivation: to suit the FPGA's integer arithmetic, a float-to-fixed-point strategy is adopted, multiplying every coefficient in the formula by 256 and rounding, so the computation can be carried out with hardware multipliers and shifts. The Y coefficients 0.299, 0.587, and 0.114 become 77, 150, and 29 after scaling, giving the fixed-point formula:
$$Y_1 = 77R + 150G + 29B$$
The calculation result needs to be right-shifted by 8 bits to restore accuracy, i.e.:
$$Y_2 = (77R + 150G + 29B) \gg 8$$
The Cb and Cr components are derived in the same way. The hardware implementation uses a three-stage pipeline: the first stage uses three multipliers to compute the products of the RGB channels and their coefficients in parallel; the second stage accumulates the results and adds the offsets; the third stage performs the right shift and outputs the final values. Pixel index synchronization is maintained by index delay registers that keep the converted data aligned with its index timing. In this way floating-point arithmetic is avoided and throughput is improved through pipelining.
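A bit-exact software model of this fixed-point conversion is useful for cross-checking hardware output. In the sketch below, the Y coefficients (77, 150, 29) come from the derivation above; the Cb/Cr constants are the analogous ×256 roundings and are an assumption about the exact hardware values.

```python
def rgb_to_ycbcr_fixed(r, g, b):
    """Fixed-point BT.601 RGB->YCbCr mirroring the x256 coefficient scaling.

    Y coefficients are from the text; the Cb/Cr constants (43, 85, 128, 107, 21)
    are assumed roundings of the floating-point coefficients times 256.
    """
    y  = ( 77 * r + 150 * g +  29 * b) >> 8
    cb = ((128 << 8) - 43 * r -  85 * g + 128 * b) >> 8
    cr = ((128 << 8) + 128 * r - 107 * g -  21 * b) >> 8
    clip = lambda v: max(0, min(255, v))   # clamp to the 8-bit range, as hardware would
    return clip(y), clip(cb), clip(cr)

# Example: a pure white pixel maps to (Y, Cb, Cr) == (255, 128, 128).
# print(rgb_to_ycbcr_fixed(255, 255, 255))
```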
The basic image enhancement function is implemented based on component separation characteristics of YCbCr color space, performing quantitative adjustment on three types of image quality parameters: brightness, contrast, and color temperature. Brightness and contrast adjustments only act on Y component to avoid affecting color performance; color temperature adjustment achieves tone shift by directionally modifying Cb, Cr components. All three support dynamic switching via function enable and parameter levels configured by the host computer, adapting to different lighting scenarios of humanoid robots, such as brightness compensation in low-light environments and contrast enhancement in industrial scenarios.
The core principle of brightness adjustment is to change the absolute value of the Y component to control the overall lightness of the image, avoiding the color distortion that brightness-color coupling in RGB space would cause. The adjustment logic applies a preset offset to the Y component according to the level set by the control parameters:
$$Y_{adj} = Y_{in} + \Delta Y$$
where \(Y_{in}\) is the original Y component, \(\Delta Y\) is the offset corresponding to the level. Simultaneously, to avoid data exceeding the 8-bit valid range (0-255) after offset, overflow truncation logic is added: if the calculation result is below 0, set to 0; if above 255, set to 255, ensuring validity of brightness information after adjustment.
Contrast adjustment amplifies or reduces the differences in the Y component through a gain coefficient, enhancing image detail. Unlike the overall offset of brightness adjustment, contrast adjustment acts on the differences between pixels, but it must avoid crushing dark regions or overexposing bright regions with a single gain. The adjustment therefore uses a segmented-gain strategy: a brightness baseline divides the Y component into dark and bright segments, and each segment receives the gain coefficient corresponding to the configured level (less than 1 for low contrast, greater than 1 for high contrast). Dark pixels are multiplied by the gain directly; bright pixels first have their deviation from the baseline computed, the gain is applied only to that deviation, and the baseline is then added back, preventing overflow in the bright segment.
The core principle of color temperature adjustment is to use the chroma characteristics of the Cb and Cr components to shift the tone in a controlled direction. Cb reflects the deviation of blue from gray and Cr the deviation of red from gray; adjusting them separately gives precise control over warm and cool tones without disturbing the stability of the Y component. According to the color temperature level, the adjustment selects its target: in cool mode only the Cb component receives a positive offset; in warm mode only the Cr component does; in disabled mode both are left unchanged. The offset amplitude follows the principle of visual naturalness, producing a perceptible tone change without exceeding the range the human eye accepts as a natural color temperature, thereby avoiding color distortion from excessive offsets.
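The three enhancement paths above reduce to a few lines of per-pixel logic. The sketch below follows the behavior described in the text (offset with clamping, segmented gain about a baseline, and Cb/Cr offsets); the baseline of 128, the offset of 12, and the gain values are illustrative assumptions, not the hardware constants.

```python
def clamp8(v):
    """Truncate to the 8-bit range 0-255, matching the overflow logic described above."""
    return max(0, min(255, int(v)))

def adjust_brightness(y, delta):
    """Add a level-dependent offset to Y and clamp (delta may be negative)."""
    return clamp8(y + delta)

def adjust_contrast(y, gain, baseline=128):
    """Segmented gain as described in the text: dark pixels are scaled directly,
    bright pixels are scaled about the baseline to avoid overflow.
    The baseline and gain values are illustrative."""
    if y < baseline:
        return clamp8(y * gain)
    return clamp8(baseline + (y - baseline) * gain)

def adjust_color_temperature(cb, cr, mode, offset=12):
    """Cool mode offsets Cb (toward blue), warm mode offsets Cr (toward red)."""
    if mode == "cool":
        cb = clamp8(cb + offset)
    elif mode == "warm":
        cr = clamp8(cr + offset)
    return cb, cr
```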
Semantic region extraction is the core preprocessing link before target recognition in robot vision. Relying on the separation of brightness and chroma in YCbCr space, it screens precisely for the key targets a humanoid robot cares about while identifying and marking invalid background. Through a logic of feature-threshold judgment plus pixel index marking, the preprocessing stage can distinguish semantic regions such as valid targets and invalid background, providing the basis for subsequent invalid region removal.
Green screen matting exploits the distinctive chroma signature of a green screen in YCbCr space: the Cb and Cr values of green pixels fall within a fixed range. By computing the color distance between each pixel and a "standard green," the green screen background and foreground targets can be separated accurately. Baseline Cb and Cr values for standard green are set, the differences between the input pixel's Cb, Cr and these baselines are computed, and the color distance is obtained as a sum of squares, which avoids sign issues from negative differences. According to the color distance, pixels are divided into a core green screen area, a feathering transition area, and a non-green-screen area. The core green screen area is marked as background; the non-green-screen area is treated as foreground and left untouched; the feathering transition area receives a dynamically computed transparency that blends foreground and background smoothly, avoiding hard matting edges.
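A per-pixel sketch of this classification follows. The baseline chroma values for "standard green" and the squared-distance thresholds below are illustrative assumptions; the text does not give the deployed constants.

```python
# Illustrative baseline chroma for "standard green" and squared-distance thresholds.
GREEN_CB, GREEN_CR = 85, 75
CORE_THRESH, FEATHER_THRESH = 400, 1600

def classify_green_screen(cb, cr):
    """Return (label, alpha) for one pixel based on squared chroma distance."""
    dist2 = (cb - GREEN_CB) ** 2 + (cr - GREEN_CR) ** 2
    if dist2 <= CORE_THRESH:
        return "background", 0.0             # core green screen area
    if dist2 <= FEATHER_THRESH:
        # feathering zone: transparency grows linearly with distance from core
        alpha = (dist2 - CORE_THRESH) / (FEATHER_THRESH - CORE_THRESH)
        return "feather", alpha
    return "foreground", 1.0                  # non-green-screen area, kept as-is
```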
Skin tone ROI extraction relies on the stable clustering of skin tones in YCbCr space: regardless of ethnicity or slight lighting changes, skin tone Cb and Cr values concentrate in specific intervals and are only weakly affected by Y. These clustering intervals serve as base thresholds, and the skin tone sensitivity level from the control signal adjusts the threshold range dynamically: low sensitivity levels correspond to narrow intervals, high levels to wide intervals. Input pixels whose Cb and Cr values fall within the dynamic thresholds are marked as "skin tone ROI"; other pixels retain their original index. It should be noted that skin tone ROI extraction based on fixed threshold intervals and hand-tuned rules may still produce false or missed detections in complex scenarios (strong light or shadow, white balance drift, skin-colored background clutter, etc.). Machine learning could later be introduced for threshold self-adaptation, for example using unsupervised clustering or probabilistic models to model the skin chroma distribution online and update the Cb/Cr threshold boundaries dynamically, improving cross-scenario robustness and generalization.
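The threshold test can be written compactly, as below. The base intervals are a commonly cited YCbCr skin-chroma cluster and the per-level widening step is an assumption; neither is stated in the text as the deployed value.

```python
# Commonly cited skin-chroma cluster in YCbCr (approximate); the base thresholds
# and the per-level widening actually used by the hardware are assumptions.
SKIN_CB = (77, 127)
SKIN_CR = (133, 173)

def is_skin_pixel(cb, cr, sensitivity_level=0, widen_per_level=5):
    """Widen the base Cb/Cr interval by the sensitivity level, then test the pixel."""
    w = sensitivity_level * widen_per_level
    return (SKIN_CB[0] - w <= cb <= SKIN_CB[1] + w and
            SKIN_CR[0] - w <= cr <= SKIN_CR[1] + w)
```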
Traffic light ROI extraction targets the color features of red, yellow, and green signal lights, using their distinct distributions in YCbCr space and a multi-dimensional threshold test for precise recognition. Dedicated YCbCr baseline threshold intervals are defined for each of the three colors, and the interval widths are adjusted according to the traffic light recognition level from the control side: the standard level uses narrow intervals, the relaxed level wide intervals. An input pixel must satisfy the Y, Cb, and Cr threshold conditions of one color simultaneously to be marked as "traffic light ROI"; pixels meeting no color's conditions are treated as non-traffic-light regions.
Invalid region removal is a post-processing extension of semantic region extraction. It targets the non-target regions marked during semantic extraction, such as green screen background and meaningless scene elements, and applies a combination of compression and blurring that reduces the information redundancy and downstream computational cost of invalid regions while keeping target regions intact. This stage connects directly to the semantic extraction flow and, driven by the semantic index, outputs image data in which target regions are preserved precisely and invalid regions are lightweight, providing efficient input for subsequent vision tasks.
Invalid region blurring uses a cascade of downsampling compression, Gaussian blurring to erase detail, and upsampling to restore size, achieving multi-stage lightweighting of the information in invalid regions. Lanczos resampling is first applied to the invalid regions to reduce their resolution; the algorithm is built on the windowed sinc interpolation kernel:
$$L(x) = \frac{a \sin(\pi x)\, \sin(\pi x / a)}{\pi^2 x^2}$$
where the window parameter a is usually taken as 3, balancing anti-aliasing quality against computational cost. Because this kernel performs band-limited interpolation of the original pixels, compressing the invalid regions to 50% of their original resolution avoids aliasing artifacts from high-frequency content and achieves a first round of spatial lightweighting. A Gaussian blur filter is then applied to the downsampled invalid regions, using a weight kernel that follows the two-dimensional normal distribution:
$$G(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}}$$
where σ is the standard deviation of the Gaussian (the blur radius), with a typical value of 3; larger σ gives stronger edge blurring. The Gaussian kernel is convolved with the pixel matrix; because the kernel weights decay exponentially with the Euclidean distance from a pixel to the kernel center, the blur further weakens edge contrast, fine texture, and other details in the invalid regions, retaining only large-scale color and brightness trends. On the FPGA, a 3×3 convolution first requires three rows of pixel data in parallel, which is obtained with the two-level FIFO line-buffer structure shown in the block diagram. The blurred invalid regions are then restored to their original size by bilinear interpolation, which takes the values of the 4 nearest neighboring pixels around each target position and weights them linearly, with weights inversely proportional to their distance from the target pixel; the formula simplifies to:
$$P(x,y) = \sum_{i=0}^{1} \sum_{j=0}^{1} w_{i,j} \cdot p_{i,j}$$
where \(w_{i,j}\) are the weighting coefficients and \(p_{i,j}\) the neighborhood pixel values. The fine detail lost during downsampling and blurring cannot be recovered in this step, so the invalid regions end up in a low-detail, smoothly varying lightweight state while matching the original image size, which both reduces information redundancy and blends naturally with the retained target regions. Based on the invalid-region mask, the fully retained target pixels and the compressed-and-blurred invalid pixels are then reassembled: pixels marked as "target region" keep their original values, while pixels marked as "invalid region" are replaced with the compressed and blurred values. To quantify the lightweighting effect, assume the original image resolution is \(W \cdot H\) and the invalid background is downsampled by a ratio r before reassembly; its effective sampling points drop from \(W \cdot H\) to \((rW) \cdot (rH) = r^2 WH\). The equivalent sampling-point compression ratio of the invalid background is therefore:
$$CR_{bg} = \frac{WH}{r^2 WH} = \frac{1}{r^2}$$
With r set to 0.5, \(CR_{bg} = 4\): the background sampling points are reduced by roughly 75%, so the equivalent background information is about one quarter of the original. The reassembled image thus preserves the target region information intact while the lightweighted invalid regions cut both data volume and computational complexity.
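The full downsample-blur-upsample-reassemble chain maps naturally onto a few OpenCV calls, as in the sketch below. The ratio r = 0.5 and σ = 3 follow the text; processing the whole frame and masking afterwards, the Lanczos window of 4 used by cv2.INTER_LANCZOS4 (the text uses a = 3), and the function name are simplifying assumptions.

```python
import cv2

def lighten_invalid_regions(image, target_mask, r=0.5, sigma=3):
    """Downsample, blur, and upsample the frame, then keep original pixels where
    `target_mask` is nonzero (target region) and blurred pixels elsewhere.

    `image` is an HxWx3 uint8 array; `target_mask` is HxW (nonzero = target).
    """
    h, w = image.shape[:2]
    small = cv2.resize(image, (int(w * r), int(h * r)),
                       interpolation=cv2.INTER_LANCZOS4)      # Lanczos downsample
    blurred = cv2.GaussianBlur(small, (0, 0), sigmaX=sigma)   # erase fine detail
    restored = cv2.resize(blurred, (w, h),
                          interpolation=cv2.INTER_LINEAR)     # bilinear upsample
    keep = target_mask.astype(bool)
    out = restored.copy()
    out[keep] = image[keep]                                    # reassemble by mask
    return out
```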
Gesture recognition is the core extension function of the system's post-processing IP. Starting from the skin tone ROI data output by the FPGA, it performs image feature analysis and pattern recognition in Python on the host computer to determine 0-5 finger digital gestures, enriching the system's interaction capability and providing a non-contact interaction interface for humanoid robots. The sub-module is designed to balance recognition accuracy and real-time performance: because it relies on skin tone data already preprocessed by the FPGA hardware layer, it avoids the redundant computation of full-image traversal and narrows the effective computation range to the skin tone regions, significantly improving efficiency. The algorithm is also tuned for complex conditions such as lighting changes and hand posture shifts, using a four-stage pipeline of ROI cropping, contour detection, convex hull and defect analysis, and gesture determination, with each stage optimized for hand features and scene interference. The implementation is as follows:
Skin tone ROI cropping and region optimization: the FPGA outputs the skin tone ROI as discrete pixel coordinates, so the hand's minimum bounding box must first be computed and then expanded to retain edge features. The ROI coordinates are traversed to extract the horizontal and vertical extremes, forming the initial bounding box; to account for incomplete skin tone marking, the box is expanded by 30 pixels in every direction, and the expanded region becomes the effective computation range for gesture recognition. This step shrinks the computation range from the full image to the hand region, reducing the cost of the subsequent algorithms while avoiding finger misjudgment caused by edge cropping.
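A minimal sketch of this cropping step, assuming the ROI arrives as an (N, 2) array of (row, col) coordinates; the function name and argument layout are illustrative.

```python
import numpy as np

def crop_hand_roi(mask, roi_coords, margin=30):
    """Bounding box of the skin-tone coordinates, expanded by `margin` pixels
    and clipped to the image; `mask` is the HxW skin mask, `roi_coords`
    an (N, 2) array of (row, col) skin pixels."""
    rows, cols = roi_coords[:, 0], roi_coords[:, 1]
    h, w = mask.shape[:2]
    top    = max(rows.min() - margin, 0)
    bottom = min(rows.max() + margin, h - 1)
    left   = max(cols.min() - margin, 0)
    right  = min(cols.max() + margin, w - 1)
    return mask[top:bottom + 1, left:right + 1], (top, left)
```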
Hand contour detection and screening: the contour is the core geometric feature of hand posture, so the main hand contour must be extracted and screened from the optimized skin tone mask. cv2.findContours is called to extract all contours, retaining the contour hierarchy while reducing redundant point storage; contours are then screened by area to exclude noise: all contour areas are computed, the largest contour is selected as the hand body, and a minimum area threshold eliminates small noise contours. If no contour exceeds the threshold, the frame is judged to contain no hand, a prompt is pushed to the front-end, and the flow terminates to avoid wasted computation.
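The extraction and screening step might look like the following; the minimum area threshold is an illustrative value, not the deployed one.

```python
import cv2

MIN_HAND_AREA = 3000  # illustrative noise threshold, not the deployed value

def find_hand_contour(skin_mask):
    """Return the largest contour above the area threshold, or None if no hand is found."""
    # RETR_TREE keeps the hierarchy; CHAIN_APPROX_SIMPLE reduces redundant points.
    contours, _ = cv2.findContours(skin_mask, cv2.RETR_TREE,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    if cv2.contourArea(largest) < MIN_HAND_AREA:
        return None
    return largest
```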
Convex hull construction and convexity defect analysis: the convex hull is the minimum convex polygon enclosing the hand contour, and its difference from the contour directly reflects the finger count, making it the core of gesture determination. cv2.convexHull is called to generate the convex hull of the main hand contour; its underlying implementation is based on the Graham scan, which locates the reference point with the minimum y coordinate, sorts the contour points by polar angle about that point, and eliminates concave points with a stack scan, yielding the minimum convex polygon that wraps the hand contour. After the hull is generated, the area ratio of the contour to the hull is computed to assist gesture determination, and cv2.convexityDefects is called to extract the convexity defects. Valid defects are screened with two conditions: the defect angle must be ≤ 90° (see the formula below), which excludes contour edge fluctuations, and the defect depth must satisfy d > 20, which filters noise. The screened valid defects are marked in the ROI image for front-end visual verification and debugging.
$$\theta = \arccos\left(\frac{b^2 + c^2 - a^2}{2bc}\right) \times \frac{180}{\pi}$$
where a, b, and c are the side lengths of the triangle formed by the defect's start point, end point, and deepest point.
Gesture determination rules: based on the number of valid convexity defects and the contour-to-hull area ratio, determination rules for 0-5 finger gestures are established (the mapping is given in the table below). During determination, the number of valid defects is counted first and the area ratio is then used to separate easily confused gestures; if the number of valid defects n > 4, the gesture is judged unknown, a prompt is pushed to the front-end, and the image and parameters are recorded as samples for later algorithm optimization.
| Number of Valid Defects | Contour-Convex Hull Area Ratio | Determined Gesture | Determination Explanation |
|---|---|---|---|
| 0 | > 0.9 | Fist | No obvious concavity |
| 0 | ≤ 0.9 | 1 Finger | Concavity not obvious |
| 1 | – | 2 Fingers | 1 valid concavity |
| 2 | – | 3 Fingers | 2 valid concavities |
| 3 | – | 4 Fingers | 3 valid concavities |
| 4 | – | 5 Fingers | 4 valid concavities |
| > 4 | – | Unknown Gesture | Too many defects |
After determination, cv2.putText annotates the result on the ROI image, the marked full image is returned for front-end display, and the result is output for generating interaction commands for the humanoid robot.
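The defect-angle test and the count-to-gesture mapping of the table above can be condensed into the sketch below. The 90° and 20-pixel thresholds follow the text; the fixed-point depth conversion (OpenCV reports depth ×256) and the simplified tie-break for the 0-defect case are assumptions about details the text leaves open.

```python
import cv2
import numpy as np

def count_valid_defects(contour, angle_max=90.0, depth_min=20.0):
    """Count convexity defects whose angle <= 90 degrees and depth > 20 px."""
    hull_idx = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull_idx)
    if defects is None:
        return 0
    valid = 0
    for s, e, f, d in defects[:, 0]:
        start, end, far = contour[s][0], contour[e][0], contour[f][0]
        a = np.linalg.norm(end - start)
        b = np.linalg.norm(far - start)
        c = np.linalg.norm(end - far)
        angle = np.degrees(np.arccos((b**2 + c**2 - a**2) / (2 * b * c)))
        if angle <= angle_max and d / 256.0 > depth_min:  # OpenCV depth is fixed-point /256
            valid += 1
    return valid

def classify_gesture(contour):
    """Map the defect count (plus area ratio for the 0-defect case) to a gesture label."""
    n = count_valid_defects(contour)
    if n == 0:
        ratio = cv2.contourArea(contour) / cv2.contourArea(cv2.convexHull(contour))
        return "fist" if ratio > 0.9 else "1 finger"
    if n <= 4:
        return f"{n + 1} fingers"
    return "unknown"
```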
The system uses the Robei EDA tool for Verilog coding to realize the IP design and application functions, with a Xilinx Artix-7 chip as the core and a 50 MHz active crystal oscillator. The host computer is a PC with an Intel i7-14650HX CPU and 16 GB of memory, running the back-end recognition and interactive control on the Python 3.13 ecosystem. The system's operating interface is a Web page organized around parameter configuration, process monitoring, and result viewing, with a modular layout that keeps functions focused and operation intuitive; details are given in the host computer system and functional design chapter.
After the executable file generated by synthesis and compilation is downloaded to the board's main control chip, the host computer and the board are connected with a USB data cable. Running the Python back-end and opening the front-end web page then allows all system functions to be configured and run, demonstrated as follows: the basic image enhancement function quantitatively adjusts the three image quality parameters of brightness, contrast, and color temperature, adapting to the different lighting scenarios of humanoid robots, such as brightness compensation in low-light environments and contrast enhancement in industrial settings. Gesture recognition, as the core extension function of the system's post-processing IP, performs image feature analysis and pattern recognition on the host computer to determine 0-5 finger digital gestures, providing a non-contact interaction interface for humanoid robots.
To quantify the processing performance of the core modules, a comparative test was used: two control systems were built, one a pass-through system with a direct internal connection that only transmits data and contains no image enhancement or semantic extraction modules, the other the complete system. Both systems were run through 200 repeated tests, the total time for transmitting and processing a single frame was recorded, and the average throughput was computed; the difference between the two yields the actual processing speed of the core functional modules. The measured result is stable at 70 MB/s, i.e., about 560 Mbps, meeting the design expectation for real-time operation. Each test transmits 1,470,000 bytes, and the OpenCV software algorithms are compared against this paper's FPGA hardware path; part of the test data is given in the table below (a minimal sketch of the timing method follows the table).
| Operator Name | OpenCV (MB/s) | This Paper (MB/s) |
|---|---|---|
| RGB to YCbCr | 200.02 | 402.34 |
| Brightness/Contrast Adjustment | 1915.11 | 2507.64 |
| Color Temperature Adjustment | 1093.67 | 2342.13 |
| Semantic Region Extraction | 24.09 | 98.27 |
| Invalid Region Removal | 70.25 | 369.02 |
| Overall | 13.97 | 70.07 |
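As referenced above, the timing method can be sketched as follows: average the per-frame time of the pass-through and the full system over repeated runs, then isolate the core modules' speed from the difference. The byte count and run count come from the text; the callables and helper names are hypothetical stand-ins for the actual test harness.

```python
import time

BYTES_PER_FRAME = 1_470_000  # single-frame payload used in the tests
RUNS = 200

def mean_frame_time(run_once):
    """Average wall-clock time of `run_once` (a hypothetical callable that sends
    one frame through either the pass-through or the full system)."""
    start = time.perf_counter()
    for _ in range(RUNS):
        run_once()
    return (time.perf_counter() - start) / RUNS

def core_module_speed(t_full, t_passthrough):
    """Processing speed of the core modules in MB/s, isolated by subtracting
    the transmission-only time from the full-system time."""
    return BYTES_PER_FRAME / (t_full - t_passthrough) / 1e6
```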
Compared with the OpenCV implementation, this paper's FPGA preprocessing system achieves higher processing speed for the same functions, with an average speedup of 5.01×. Semantic region extraction and invalid region removal show the most pronounced gains, while simple operations such as brightness and contrast adjustment improve relatively less. In 2018, Liu Xiang et al. reported image convolution experiments on an ARM Cortex-A9 processor: on a 667 MHz Cortex-A9 platform, Sobel edge detection with the OpenCV library on a 640×480 grayscale image took about 91.5 ms per frame, corresponding to roughly 10.9 fps, and a 3×3 morphological closing at the same resolution took about 27.5 ms per frame, roughly 36.3 fps. These results are representative of the typical speed of two-dimensional convolution and morphological operators implemented in OpenCV software on general ARM platforms and are consistent with the conclusions above. At comparable resolutions, even when serially completing a more complex multi-stage image preprocessing flow within a single chip, this paper's FPGA solution still maintains lower frame latency and higher overall throughput, reflecting the advantage of dedicated parallel hardware for such convolution and morphological operators.
To evaluate the back-end gesture recognition algorithm, a Python test harness was built and an offline image set was used for evaluation (200 images in total, covering digital gestures 1-5). MediaPipe and this paper's OpenCV-based back-end gesture recognition algorithm were run on the same batch of images, the classification results and single-frame processing times were recorded, and per-category accuracy, overall accuracy, and average response time were computed.
| Test Item | MediaPipe | This Paper |
|---|---|---|
| Gesture 1 Accuracy | 100% | 100% |
| Gesture 2 Accuracy | 99.5% | 92.3% |
| Gesture 3 Accuracy | 71.3% | 99.2% |
| Gesture 4 Accuracy | 100% | 81.4% |
| Gesture 5 Accuracy | 100% | 100% |
| Overall Accuracy | 94.5% | 91.8% |
| Response Delay | 35.1 ms/img | 13.8 ms/img |
The table shows that both methods achieve a 100% recognition rate on gestures 1 and 5, with overall accuracies of 91.8% for this paper's method and 94.5% for MediaPipe. In average response time, this paper's method takes 13.8 ms per image, significantly lower than MediaPipe's 35.1 ms per image, a 2.54× speedup that gives it better real-time performance. At the category level, each method holds an advantage on particular gestures, indicating that different feature representations differ in their sensitivity to posture changes and occlusion; the discrimination strategy for easily confused categories can be refined in future work.
In terms of hardware resources, the complete preprocessing system was synthesized and deployed on a Xilinx Artix-7 FPGA (XC7A50T), occupying about 13,961 LUTs (42.8% of the total) and about 15,220 flip-flops (23.3% of the total). While meeting the real-time processing requirement, the chip retains ample logic headroom for subsequent function expansion and algorithm upgrades. Although the current system achieves the expected functions and basic performance targets, there is still room for optimization: processing speed drops slightly at high resolutions (e.g., 4K), the parallel efficiency of the core modules can be improved, and the accuracy of semantic region extraction fluctuates under extreme lighting, so the algorithms' robustness needs further work to cope with more complex practical scenarios.
In conclusion, the FPGA-PC collaborative image preprocessing and post-processing system designed in this paper integrates four core modules: RGB-to-YCbCr color space conversion; quantitative adjustment of brightness, contrast, and color temperature; semantic region extraction; and invalid region removal. It effectively addresses the insufficient real-time performance of traditional image preprocessing systems and their inability to meet accuracy requirements across multiple scenarios. Experimental results show that FPGA hardware parallelism keeps the core modules' latency low while PC-side collaboration provides lightweight post-processing: the measured processing speed of the core functional modules is stable at 70 MB/s (560 Mbps), meeting the real-time requirement for 1080P-resolution images at 60 fps, and the system strikes a good balance among processing efficiency, recognition accuracy, and hardware resource consumption. The system can supply high-quality, lightweight image input for subsequent vision tasks and has broad application prospects in intelligent transportation, human-computer interaction, and public safety, particularly for enhancing the visual capabilities of humanoid robots in dynamic environments.
