====== Object Detection, Sensor Fusion, Mapping, and Positioning ======
{{:en:iot-open:czapka_b.png?50| Bachelors (1st level) classification icon }}

===== Object Detection =====

**Object detection** is the fundamental perception function that allows an autonomous vehicle to identify and localize relevant entities in its surroundings. It converts raw sensor inputs into structured semantic and geometric information, forming the basis for higher-level tasks such as tracking, prediction, and planning. By maintaining awareness of all objects within its operational environment, the vehicle can make safe and contextually appropriate decisions.

Detected objects may include:
  * Dynamic agents such as other vehicles, cyclists, and pedestrians.
  * Static or quasi-static structures such as curbs, barriers, and traffic signs.
  * Unexpected obstacles such as debris, animals, or construction equipment.

Each detection typically includes a semantic label, a spatial bounding box (2D or 3D), a confidence score, and sometimes velocity or orientation information. Accurate detection underpins all subsequent stages of autonomous behavior; any missed or false detection may lead to unsafe or inefficient decisions downstream.

Object detection relies on a combination of complementary sensors, each contributing distinct types of information and requiring specialized algorithms.

===== Camera-Based Detection =====

Cameras provide dense visual data with rich color and texture, essential for semantic understanding. Typical camera-based detection methods include:
  * **Feature-based algorithms** such as ''Histogram of Oriented Gradients (HOG)'', ''Scale-Invariant Feature Transform (SIFT)'', and ''Speeded-Up Robust Features (SURF)'', used in early lane and pedestrian detection systems.
  * **Stereo vision** for depth estimation and obstacle localization by triangulating disparity between image pairs.
  * **Optical flow analysis** for detecting motion and estimating object trajectories in video sequences.
  * **Classical machine learning approaches** like ''Support Vector Machines (SVM)'' or ''AdaBoost'' combined with handcrafted features for real-time pedestrian detection.
  * **Modern convolutional and transformer-based networks** (covered in Section 5.2), which directly infer 2D and 3D bounding boxes from images.

Cameras are indispensable for interpreting traffic lights, signs, lane markings, and human gestures, but their performance can degrade under low illumination, glare, or adverse weather conditions.

===== LiDAR-Based Detection =====

LiDAR (Light Detection and Ranging) measures distances by timing laser pulse returns, producing dense 3D point clouds. LiDAR-based object detection methods focus on geometric reasoning:
  * **Point clustering** approaches such as ''Euclidean Cluster Extraction'' and ''Region Growing'' group nearby points into potential objects (a minimal sketch follows at the end of this section).
  * **Model-based fitting**, including ''RANSAC'' for detecting planes, poles, or cylindrical objects.
  * **Voxelization methods**, which discretize 3D space into volumetric grids for feature extraction.
  * **Projection methods**, where point clouds are projected into 2D range or intensity images for segmentation.
  * **Shape fitting and contour tracking**, which match clusters to known templates such as vehicle rectangles or pedestrian ellipsoids.

LiDAR’s precise geometry enables accurate distance and shape estimation, but sparse returns or partial occlusions can challenge classification performance.
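To make the point-clustering idea concrete, the following minimal Python sketch groups 3D points whose mutual distance falls below a threshold, in the spirit of Euclidean Cluster Extraction. It is only an illustration: the brute-force neighbor search, the example point coordinates, and the ''tolerance'' and ''min_size'' values are assumptions, and a production pipeline would add ground removal and a KD-tree for speed.

<code python>
from collections import deque

def euclidean_clusters(points, tolerance=0.5, min_size=3):
    """Group 3D points into clusters: two points belong to the same cluster
    if they are connected by a chain of neighbors closer than `tolerance`
    (metres). Brute-force O(n^2) search, kept simple for clarity."""
    n = len(points)
    visited = [False] * n
    clusters = []
    for seed in range(n):
        if visited[seed]:
            continue
        # Breadth-first expansion from the seed point
        queue, cluster = deque([seed]), []
        visited[seed] = True
        while queue:
            i = queue.popleft()
            cluster.append(i)
            xi, yi, zi = points[i]
            for j in range(n):
                if visited[j]:
                    continue
                xj, yj, zj = points[j]
                if (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2 <= tolerance ** 2:
                    visited[j] = True
                    queue.append(j)
        if len(cluster) >= min_size:          # discard sparse noise returns
            clusters.append([points[i] for i in cluster])
    return clusters

# Illustrative point cloud: two separated groups plus one stray return
cloud = [(0.0, 0.0, 0.0), (0.2, 0.1, 0.0), (0.1, 0.3, 0.1),
         (5.0, 5.0, 0.0), (5.2, 5.1, 0.1), (5.1, 4.9, 0.0),
         (20.0, 0.0, 0.0)]
print(len(euclidean_clusters(cloud, tolerance=0.5)))   # -> 2 clusters
</code>

Each resulting cluster can then be passed to shape fitting or classification, as described above; the stray single return is rejected by the ''min_size'' filter.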
===== Radar-Based Detection =====

Radar (Radio Detection and Ranging) provides long-range distance and velocity information using radio waves. Its unique Doppler measurements are invaluable for tracking motion, even in fog, dust, or darkness. Typical radar-based detection techniques include:
  * **Constant False Alarm Rate (CFAR)** processing, which statistically distinguishes true reflections from noise in range–Doppler maps.
  * **Range–Doppler clustering** to group radar detections corresponding to individual objects.
  * **Kalman and particle filtering** for multi-target tracking using velocity estimates.
  * **Micro-Doppler analysis**, which identifies signatures of specific motion types such as walking pedestrians or spinning wheels.
  * **Radar–vision association**, where radar detections are correlated with camera-based bounding boxes for class refinement.

Radar systems are especially important for early hazard detection and collision avoidance, as they function effectively through adverse weather and poor visibility.

===== Ultrasonic and Sonar-Based Detection =====

Ultrasonic and sonar sensors detect objects through acoustic wave reflections and are particularly useful in environments where optical or electromagnetic sensing is limited. They are integral not only to ground vehicles for close-range detection but also to **surface** and **underwater autonomous vehicles** for navigation, obstacle avoidance, and terrain mapping.

For **ground vehicles**, ultrasonic sensors operate at short ranges (typically below 5 meters) and are used for parking assistance, blind-spot detection, and proximity monitoring. Common methods include:
  * **Time-of-Flight (ToF)** ranging, where the distance to an obstacle is computed from the delay between transmitted and received sound waves (a minimal sketch follows at the end of this section).
  * **Amplitude and phase analysis**, used to infer surface hardness or material properties.
  * **Triangulation and beamforming**, which estimate object direction and shape using multiple transducers.
  * **Echo profiling**, classifying obstacles based on their acoustic return signatures.

For **surface and underwater autonomous vehicles**, sonar systems extend these principles over much longer ranges and through acoustically dense media. Typical sonar-based detection methods include:
  * **Multibeam echo sounding**, producing 3D point clouds of the seafloor or underwater structures.
  * **Side-scan sonar**, generating high-resolution acoustic images for obstacle and object identification along a vehicle’s trajectory.
  * **Forward-looking sonar (FLS)**, providing real-time obstacle avoidance for surface or submersible vehicles.
  * **Synthetic Aperture Sonar (SAS)**, achieving enhanced spatial resolution by coherently combining multiple sonar pings as the vehicle moves.
  * **Doppler Velocity Log (DVL)** integration, allowing both obstacle detection and precise motion estimation.

These acoustic systems are essential in domains where electromagnetic sensing (e.g., camera, LiDAR, radar) is unreliable — such as murky water, turbid environments, or beneath the ocean surface. Although sonar has lower spatial resolution than optical systems and is affected by multipath and scattering effects, it offers unmatched robustness in low-visibility conditions. As with other sensors, regular calibration, signal filtering, and environmental adaptation are necessary to maintain detection accuracy across varying salinity, temperature, and depth profiles.
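As a concrete illustration of Time-of-Flight ranging, the short Python sketch below converts an echo delay into a distance for an in-air ultrasonic sensor, adjusting the speed of sound for air temperature. The speed-of-sound approximation is standard physics; the echo delay value itself is an illustrative assumption.

<code python>
def speed_of_sound_air(temp_c: float) -> float:
    """Approximate speed of sound in dry air (m/s) as a function of
    temperature in degrees Celsius (about 331.3 m/s at 0 °C, +0.606 m/s per °C)."""
    return 331.3 + 0.606 * temp_c

def tof_distance(echo_delay_s: float, temp_c: float = 20.0) -> float:
    """Distance to the reflecting obstacle in metres.
    The pulse travels to the obstacle and back, hence the division by 2."""
    return speed_of_sound_air(temp_c) * echo_delay_s / 2.0

# Illustrative reading: an echo received 5.8 ms after transmission at 20 °C
print(round(tof_distance(5.8e-3, temp_c=20.0), 3))   # ~0.996 m
</code>

The same principle applies underwater, except that the sound speed is roughly 1500 m/s and additionally depends on salinity and depth, which is why the environmental adaptation mentioned above is needed.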
Object detection outputs can be represented in different coordinate systems and abstraction levels:
  * **2D detection** localizes objects in image coordinates, suitable for semantic reasoning.
  * **3D detection** estimates object dimensions, position, and orientation in the world frame, enabling motion planning and distance-based safety checks.
  * **Bird’s-Eye-View (BEV) detection** projects all sensor data onto a unified ground-plane map, simplifying spatial reasoning across multiple sensors.

Multi-modal sensor fusion in the BEV representation has become the leading approach for 3D object detection. Hybrid systems combine these paradigms—for example, camera-based semantic labeling enhanced with LiDAR-derived 3D geometry—to achieve both contextual awareness and metric accuracy.

==== Detection Pipeline and Data Flow ====

A standard object detection pipeline in an autonomous vehicle proceeds through the following stages:
  - **Data acquisition and preprocessing** — raw sensor data are collected, filtered, timestamped, and synchronized.
  - **Feature extraction and representation** — relevant geometric or visual cues are computed from each modality.
  - **Object hypothesis generation** — candidate detections are proposed based on motion, clustering, or shape priors.
  - **Classification and refinement** — hypotheses are validated, labeled, and refined based on fused sensory evidence.
  - **Post-processing and temporal association** — duplicate detections are merged, and tracking ensures temporal consistency.

The pipeline operates continuously in real time (typically 10–30 Hz) with deterministic latency to meet safety and control requirements.

===== Sensor Fusion =====

No single sensor technology can capture all aspects of a complex driving scene across diverse weather, lighting, and traffic conditions. Therefore, data from multiple sensors are fused (combined) to obtain a more complete, accurate, and reliable understanding of the environment than any single sensor could provide alone. Each sensor modality has distinct advantages and weaknesses:
  * **Cameras** provide high-resolution color and texture information, essential for recognizing traffic lights, signs, and object appearance, but are sensitive to lighting and weather.
  * **LiDAR** delivers precise 3D geometry and range data, allowing accurate distance estimation and shape reconstruction, yet is affected by rain, fog, and reflective surfaces.
  * **Radar** measures object velocity and distance robustly, even in poor visibility, but has coarse angular resolution and may struggle with small or static objects.
  * **GNSS** provides global position but suffers from signal blockage and reflection, e.g., in urban canyons, tunnels, and under tree canopies.
  * **IMU** provides motion estimation but is prone to drift and accumulated error.

By fusing these complementary data sources, the perception system can achieve redundancy, increased accuracy, and fault tolerance — key factors for functional safety (ISO 26262). Sensor fusion pursues two goals: **complementarity**, where different sensors contribute unique, non-overlapping information, and **redundancy**, where overlapping sensors confirm each other’s measurements and improve reliability. When multiple sensor modalities are used, both goals can be achieved; a minimal redundancy-fusion sketch follows below.
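To illustrate the redundancy goal, the following minimal Python sketch fuses two independent range measurements of the same obstacle (say, one from radar and one from LiDAR) by inverse-variance weighting; the fused estimate has a lower variance than either input. The measurement values and standard deviations are illustrative assumptions.

<code python>
def fuse_measurements(z1, var1, z2, var2):
    """Inverse-variance (minimum-variance) fusion of two independent
    measurements z1, z2 of the same quantity with variances var1, var2."""
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    fused = (w1 * z1 + w2 * z2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)       # always smaller than min(var1, var2)
    return fused, fused_var

# Illustrative readings: radar range 25.4 m (sigma 0.5 m), LiDAR range 25.1 m (sigma 0.1 m)
distance, variance = fuse_measurements(25.4, 0.5 ** 2, 25.1, 0.1 ** 2)
print(round(distance, 2), round(variance ** 0.5, 3))   # ~25.11 m, sigma ~0.098 m
</code>

This weighting is essentially the scalar Kalman filter measurement update without a motion model, which connects it to the probabilistic fusion techniques discussed later in this section.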
Accurate fusion depends critically on spatial and temporal alignment among sensors:
  * **Extrinsic calibration** determines the rigid-body transformations between sensors (translation and rotation). It is typically estimated through target-based calibration (e.g., checkerboards or reflective spheres) or self-calibration using environmental features.
  * **Intrinsic calibration** corrects sensor-specific distortions, such as lens aberration or LiDAR beam misalignment.
  * **Temporal synchronization** ensures that all sensor measurements correspond to the same physical moment, using hardware triggers, shared clocks, or interpolation.

Calibration errors lead to spatial inconsistencies that can degrade detection accuracy or cause false positives. Therefore, calibration is treated as part of the functional safety chain and is regularly verified in maintenance and validation routines.

Fusion can occur at different stages in the perception pipeline, commonly divided into three levels:
  * **Data-level** fusion combines raw signals from sensors before any interpretation, providing the richest input but also the heaviest computational load.
  * **Feature-level** fusion merges processed outputs such as detected edges, motion vectors, or depth maps, balancing detail with efficiency.
  * **Decision-level** fusion integrates conclusions drawn independently by different sensors, producing a final decision that benefits from multiple perspectives.

The mathematical basis of sensor fusion lies in probabilistic state estimation and Bayesian inference. Typical formulations represent the system state as a probability distribution updated by sensor measurements. Common techniques include:
  * **Kalman Filter (KF)** and its nonlinear extensions, the **Extended Kalman Filter (EKF)** and **Unscented Kalman Filter (UKF)**, which maintain a Gaussian estimate of state uncertainty and iteratively update it as new sensor data arrive (a minimal sketch follows at the end of this section).
  * **Particle Filter (PF)**, which uses a set of weighted samples to approximate arbitrary non-Gaussian distributions.
  * **Bayesian Networks** and **Factor Graphs**, which represent dependencies between sensors and system variables as nodes and edges, enabling large-scale optimization.
  * **Deep Learning–based Fusion**, where neural networks implicitly learn statistical relationships between sensor modalities through backpropagation rather than explicit probabilistic modeling.

==== Learning-Based Fusion Approaches ====

Deep learning has significantly advanced sensor fusion. Neural architectures learn optimal fusion weights and correlations automatically, often outperforming hand-designed algorithms. For example:
  * **BEVFusion** fuses LiDAR and camera features into a top-down BEV representation for 3D detection.
  * **TransFusion** uses transformer-based attention to align modalities dynamically.
  * **DeepFusion** and **PointPainting** project LiDAR points into the image plane, enriching them with semantic color features.

End-to-end fusion networks can jointly optimize detection, segmentation, and motion estimation tasks, enhancing both accuracy and robustness. However, deep fusion models require large multimodal datasets for training and careful validation to ensure generalization and interpretability.
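As a concrete instance of the Kalman filter family listed above, the Python sketch below tracks the 1D position and velocity of a detected object from noisy position measurements (for example, ranges already fused from radar and LiDAR). It is a minimal constant-velocity model; the noise parameters and measurement values are illustrative assumptions, and a production tracker would use multidimensional states plus gating and data association.

<code python>
import numpy as np

dt = 0.1                                   # update period (s), i.e. 10 Hz
F = np.array([[1.0, dt], [0.0, 1.0]])      # constant-velocity motion model
H = np.array([[1.0, 0.0]])                 # only position is measured
Q = np.diag([0.01, 0.1])                   # process noise covariance (assumed)
R = np.array([[0.25]])                     # measurement noise covariance (sigma = 0.5 m)

x = np.array([[0.0], [0.0]])               # initial state: position, velocity
P = np.diag([10.0, 10.0])                  # initial uncertainty

def kf_step(x, P, z):
    """One predict-update cycle of a linear Kalman filter."""
    # Predict: propagate state and uncertainty through the motion model
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: correct the prediction with the measurement z
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Illustrative position measurements of an object slowly moving away from the ego vehicle
for z in [10.1, 10.3, 10.4, 10.7, 10.8]:
    x, P = kf_step(x, P, np.array([[z]]))

print(float(x[0, 0]), float(x[1, 0]))      # estimated position and velocity
</code>

The same predict-update structure underlies the EKF and UKF; they differ only in how the nonlinear motion and measurement models are linearized or sampled.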
===== Mapping =====

Some form of world model is a fundamental prerequisite for autonomous navigation, since the tasks a vehicle executes are often defined, directly or indirectly, in terms of the spatial configuration of the environment and of partial goal positions. The world model thus represents a reference frame relative to which goals are defined. When discussing an environmental model, it is also necessary to consider a representation suitable for the specific problem being solved.

The problem of representing and constructing an environmental model from imprecise sensory data must be viewed from several perspectives:
  - **Representation.** The perceived environment must be stored in a suitable data structure corresponding to the complexity of the environment (e.g., empty halls, furnished rooms). When working with raw, unprocessed sensory data, a compact data representation is required.
  - **Uncertainty in data.** For reliable use of the constructed model, it is desirable to maintain a description of uncertainty in the data, which arises from the processing of noisy sensory information. Methods for data fusion can be advantageously used here to reduce overall uncertainty. Data can be merged from different sources or from different time intervals during task execution.
  - **Computational and data efficiency.** The data structures used must be efficiently updatable, typically in real time, even in relatively large environments and at various levels of resolution. It is almost always necessary to find an appropriate compromise between method efficiency (model quality) and computational and data demands.
  - **Adaptability.** The representation of the environmental model should be as application-independent as possible, which is, however, a very demanding task. In real situations, the environmental representation is often tailored directly to the task for which the model is used, so that the existence of the map enables or significantly increases the efficiency of solving the selected problem.

The ways of representing environment maps can be divided according to their level of abstraction into the following types:
  - **Sensor-based** maps work directly with raw sensory data or only with relatively simple processing of them. Raw sensor data are used at the level of reactive feedback in subsystems for collision avoidance and resolution, where fast responses are required (David, 1996); thus, they are mainly applied at the motion control level. A typical example of a sensor-based map is the **occupancy grid**: a two-dimensional array of cells, where each cell represents a square area of the real world and stores the probability that an obstacle exists in that region (a minimal update sketch follows after this list).
  - **Geometric** maps describe objects in the environment using geometric primitives in a Cartesian coordinate system. Geometric primitives include segments (lines), polygons, arc sections, splines, or other curves. Additional structures such as visibility graphs or rectangular decompositions are often built on top of this description to support robot path planning. Geometric maps can also be used for robot localization or exported into CAD tools, where further corrections create a complete CAD model of the environment for non-robotic applications.
  - **Topological** maps achieve a high level of abstraction. They do not store information about the absolute coordinates of objects but rather record the **topological relationships** between objects in the environment. This representation focuses on connectivity and adjacency between places rather than precise geometric positions.
  - **Symbolic** maps contain information that the robot typically cannot directly obtain through its sensors — usually abstract data that allow a certain degree of **natural-language communication** with the robot. These maps are commonly implemented in a **declarative language** such as //Prolog// or //LISP//.
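To make the occupancy grid idea concrete, the following minimal Python sketch keeps a small 2D grid in log-odds form and marks the cell containing each measured obstacle endpoint as more likely occupied. Grid size, resolution, and the sensor-model probabilities are illustrative assumptions; a real mapper would also ray-trace the free cells along each beam and account for the vehicle pose estimate.

<code python>
import math

RESOLUTION = 0.1                                # metres per cell (assumed)
GRID = [[0.0] * 100 for _ in range(100)]        # log-odds, 0.0 means unknown (p = 0.5)

L_OCC = math.log(0.7 / 0.3)                     # log-odds increment for an "occupied" observation
L_MIN, L_MAX = -4.0, 4.0                        # clamping keeps the map responsive to change

def update_cell(grid, x_m, y_m, l_update):
    """Add a log-odds increment to the cell containing world point (x_m, y_m)."""
    i, j = int(y_m / RESOLUTION), int(x_m / RESOLUTION)
    if 0 <= i < len(grid) and 0 <= j < len(grid[0]):
        grid[i][j] = max(L_MIN, min(L_MAX, grid[i][j] + l_update))

def occupancy_probability(l):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))

# Three consecutive scans observe an obstacle at roughly the same spot
for (x, y) in [(2.04, 3.16), (2.01, 3.12), (2.07, 3.18)]:
    update_cell(GRID, x, y, L_OCC)

cell = GRID[int(3.16 / RESOLUTION)][int(2.04 / RESOLUTION)]
print(round(occupancy_probability(cell), 2))    # ~0.93: confidence grows with repeated hits
</code>

Storing log-odds rather than probabilities keeps the update a simple addition and avoids numerical saturation, which is the usual design choice in occupancy-grid mapping.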
The environmental model can be created in many ways, from manual data collection to fully automatic mapping procedures; the difference in time efficiency, however, is significant. During the construction of an environmental model — mapping — several basic tasks must be solved. One of them is the choice of a suitable movement trajectory that ensures the vehicle's sensor system acquires a sufficient amount of data from the entire mapped space. This issue is addressed by planning algorithms for exploration and space coverage. The choice of a suitable trajectory can also be replaced by remote teleoperation, where the operator decides which parts of the mapped area the robot should visit.

Another necessary condition for consistent mapping is the ability to localize. By its nature, the combined mapping and global localization problem can be described as a “chicken and egg” problem. Localization and mapping are intertwined because inaccuracies in the vehicle's motion propagate into errors in the sensory measurements used as input to mapping algorithms. As soon as the robot moves, the estimate of its current position becomes affected by noise caused by motion inaccuracy. The perceived absolute position of objects inserted into the map is then influenced both by measurement noise and by the error in the estimated current position of the robot. The mapping error caused by inaccurate localization can further reduce the accuracy of localization methods that depend on the map. In this way, both tasks influence each other. In general, the error of the estimated robot trajectory is strongly correlated with errors in the constructed map — that is, an accurate map cannot be built without sufficiently precise, reliable, and robust localization, and vice versa.

===== Positioning =====

An essential component of spatial orientation and environmental understanding for an autonomous vehicle is **localization** within that environment. **Localization** is defined as the process of estimating the vehicle’s current coordinates (position) based on information from available sensors. Suitable sensors for localization are those that provide either information about the vehicle's relative position with respect to the environment or information about its own motion within the environment.

To obtain information about ego motion, **inertial** and **odometric sensors** are used. These provide an approximate initial estimate of the position. Inertial sensors infer position from measured acceleration and rotation, while odometric sensors directly measure the length of the trajectory traversed by the vehicle. However, both measurement principles suffer from significant measurement errors (a dead-reckoning sketch illustrating this follows below). As the primary sensors for accurate position determination, **laser range finders (scanners)** are commonly used. These measure the direct distance to obstacles relative to which the robot orients itself. In recent years, **camera-based systems** have also begun to emerge as viable alternatives or complements to laser range sensors.
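The Python sketch below illustrates relative position estimation by odometric dead reckoning: the planar pose is integrated from per-step distance and heading-change increments. Because every increment carries a small error, the pose estimate drifts over time, which is why odometry alone only yields an approximate estimate. The increment values and error magnitudes are illustrative assumptions.

<code python>
import math
import random

def dead_reckon(pose, distance, dtheta):
    """Integrate one odometric increment (travelled distance and heading change)
    into the planar pose (x, y, theta)."""
    x, y, theta = pose
    theta += dtheta
    x += distance * math.cos(theta)
    y += distance * math.sin(theta)
    return (x, y, theta)

random.seed(1)
true_pose = (0.0, 0.0, 0.0)
est_pose = (0.0, 0.0, 0.0)

# Drive straight in 100 steps of 0.1 m; the encoders report each step with small noise
for _ in range(100):
    true_pose = dead_reckon(true_pose, 0.1, 0.0)
    noisy_d = 0.1 + random.gauss(0.0, 0.002)        # ~2 mm distance noise per step
    noisy_dtheta = random.gauss(0.0, 0.005)         # ~0.005 rad heading noise per step
    est_pose = dead_reckon(est_pose, noisy_d, noisy_dtheta)

error = math.hypot(est_pose[0] - true_pose[0], est_pose[1] - true_pose[1])
print(round(error, 3))   # accumulated drift after only 10 m of travel
</code>

Correcting this accumulated drift against a map or external references is exactly the role of the localization strategies described next.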
==== Reference Frames for Localization ====

The vehicle's coordinates in the working environment can be referenced either to an existing **environment model (map)** or to a chosen **reference frame**; the reference is often, for example, the vehicle's initial position.

If the vehicle's initial position is known, or if only **relative coordinates** with respect to the starting point are important, we speak of the **position tracking problem** (also called **continuous localization**). These localization tasks correct odometric errors accumulated during movement, thus refining the vehicle's current position estimate. In the opposite case — when the initial position is unknown and we need to determine the vehicle's **absolute coordinates** within the world model — we speak of the **global localization problem**.

One of the fundamental differences between **global localization** and **position tracking** lies in the requirement for a **predefined world model**. While global localization requires such a model, position tracking does not necessarily depend on it — it can, for example, utilize a model being gradually constructed during movement.

==== The “Kidnapped Robot Problem” ====

The most general form of the localization task is the so-called **“kidnapped robot problem”**, which generalizes both position tracking and global localization. In this task, the vehicle must not only continuously track its position but also detect situations in which it has been **suddenly displaced** to an unknown location — without the event being directly observable by its sensors.

The method must therefore be capable of detecting, during continuous localization, that the vehicle is located in a completely different position than expected, and at that moment, switch to a **different localization strategy**. This new strategy typically employs **global localization algorithms**, which, after some time, re-establish the vehicle's true position in the environment. Once the correct position has been determined, localization can continue using a less computationally demanding **position-tracking strategy**; a minimal sketch of this detection-and-switch logic follows below.
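As an illustration of how such a strategy switch can be organized, the Python sketch below performs simple grid-based (histogram) localization in a one-dimensional corridor map and falls back to a uniform belief (global relocalization) whenever the observation likelihood drops below a threshold, i.e., when the measurements contradict the currently tracked position. The map, sensor model, motion model, threshold, and observation sequence are all illustrative assumptions; real systems typically use particle filters over 2D or 3D maps.

<code python>
# Minimal 1D histogram localization with a kidnapping check.
# World: a circular corridor of N cells; some cells contain a door the sensor can detect.
WORLD = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # 1 = door, 0 = plain wall (assumed map)
N = len(WORLD)
P_HIT, P_MISS = 0.8, 0.2                  # sensor model: P(correct reading), P(wrong reading)
LIKELIHOOD_THRESHOLD = 0.3                # below this, the observation contradicts the belief

def move(belief):
    """Shift the belief one cell forward (circular corridor), with slight motion blur."""
    shifted = [0.0] * N
    for i, p in enumerate(belief):
        shifted[(i + 1) % N] += 0.9 * p            # intended motion
        shifted[i] += 0.05 * p                     # undershoot
        shifted[(i + 2) % N] += 0.05 * p           # overshoot
    return shifted

def sense(belief, observation):
    """Bayes update for a door / no-door observation.
    Returns the updated belief and the observation likelihood under the prior belief."""
    weights = [P_HIT if WORLD[i] == observation else P_MISS for i in range(N)]
    unnormalized = [w * p for w, p in zip(weights, belief)]
    likelihood = sum(unnormalized)                 # P(observation | current belief)
    belief = [u / likelihood for u in unnormalized]
    return belief, likelihood

belief = [1.0 / N] * N                             # global localization: start uniform
for observation in [1, 0, 0, 1, 0]:                # illustrative sensor readings
    belief = move(belief)                          # position tracking: predict motion
    belief, likelihood = sense(belief, observation)
    if likelihood < LIKELIHOOD_THRESHOLD:
        # The readings no longer fit the tracked position: assume a "kidnapping"
        # and fall back to global localization by resetting the belief.
        belief = [1.0 / N] * N
        belief, _ = sense(belief, observation)

print(max(range(N), key=lambda i: belief[i]))      # most likely cell index
</code>

Monitoring the observation likelihood (or an equivalent consistency measure) is one common way to trigger the switch from cheap position tracking back to full global localization.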