====== Simulation & Formal Methods ======

{{:en:iot-open:czapka_m.png?50| Masters (2nd level) classification icon }}

====== Why Simulation Needs Formalism ======

Simulation is indispensable in autonomous-vehicle validation because it lets us probe safety-critical behavior without exposing the public to risk, but simulation alone is only as persuasive as its predictive value. A simulator that cannot anticipate how the real system behaves, whether through poor modeling, missing variability, or unmeasured assumptions, does not provide credible evidence for a safety case. This is why we pair simulation with formal methods: a discipline for specifying scenarios and safety properties with mathematical precision, generating test cases systematically, and measuring how closely simulated outcomes match track or road trials. In our program, the digital twin of the vehicle and its operating environment acts as the concrete “world model,” while formal specifications direct the exploration of that world to the places where safety margins are most likely to fail.

Treating the digital twin as a live feedback loop is central to maintaining predictive value over time. The twin ingests logs and environmental data from the physical shuttle, updates maps and vehicle parameters, and feeds those data back into the simulator so that new tests reflect actual wear, calibration drift, and environmental change. This continuous synchronization turns simulation into an ongoing assurance activity rather than a one-off milestone.

Building such twins is non-trivial. Our workflow constructs environment twins from aerial photogrammetry with RTK-supported georeferencing, then processes point clouds into assets capable of driving a modern simulator. The resulting model can be used across many AVs and studies, amortizing the cost of data collection and asset creation while preserving the fidelity needed for planning, perception, and control validation.

====== From Scenarios to Properties: Making Requirements Executable ======

Formal methods begin by making requirements executable. We express test intent as a distribution over concrete scenes using the SCENIC language, which provides geometric and probabilistic constructs to describe traffic, occlusions, placements, and behaviors. A SCENIC program defines a scenario whose parameters are sampled to generate test cases; each case yields a simulation trace against which temporal properties (our safety requirements) are monitored. This tight loop, implemented with the VERIFAI toolkit, supports falsification (actively searching for violations), guided sampling, and clustering of outcomes for test selection.

In practice, the pipeline unfolds as follows. We first assemble the photorealistic simulated world and dynamics models from HD maps and 3D meshes. We then formalize scenarios in SCENIC and define safety properties as monitorable metrics, often using robust semantics of Metric Temporal Logic (MTL), which provide not just a pass/fail verdict but a quantitative margin to violation. VERIFAI searches the parameter space, records safe and error tables, and quantifies “how strongly” a property held or failed; these scores guide which cases deserve promotion to track tests. This process transforms vague test ideas (“test passing pedestrians”) into a concrete population of parameterized scenes with measurable, comparable outcomes.
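To make this loop concrete, the sketch below emulates in plain Python what the SCENIC/VERIFAI tooling automates at scale: sample scenario parameters, run a simulation, score the resulting trace under the robust semantics of a simple “always keep a minimum clearance” property, and sort the outcomes into safe and error tables ordered by margin. It is a minimal illustration under assumed names and numbers; in particular, run_simulation, the parameter ranges, and the 2 m clearance threshold are stand-ins for the real simulator interface and project limits, not the project's actual code.

<code python>
import random

# Illustrative safety property: "always keep at least MIN_CLEARANCE metres of
# separation from the pedestrian".  Under robust (quantitative) semantics, the
# score of this "always" property over a finite trace is the worst-case margin
# min_t (distance(t) - MIN_CLEARANCE): positive means the property held with
# that much room to spare, negative means it was violated by that amount.
MIN_CLEARANCE = 2.0  # metres; an assumed threshold, not a project value

def robustness(distance_trace):
    """Quantitative margin of 'always distance >= MIN_CLEARANCE' on one trace."""
    return min(d - MIN_CLEARANCE for d in distance_trace)

def run_simulation(params):
    """Toy stand-in for one simulator run.

    In the real pipeline a SCENIC program concretises the scene and the
    simulator logs full traces; here a closed-form model fabricates a
    clearance trace purely so the loop is runnable end to end.
    """
    ego_speed = params["ego_speed"]      # m/s
    ped_offset = params["ped_offset"]    # lateral offset of the pedestrian, m
    closest = max(0.0, ped_offset + 4.0 - 0.3 * ego_speed)
    return [closest + 0.1 * t for t in range(20)]  # clearance over 20 steps

def falsify(num_samples=200, seed=0):
    """Random search over scenario parameters, recording safe/error tables."""
    rng = random.Random(seed)
    safe_table, error_table = [], []
    for _ in range(num_samples):
        params = {
            "ego_speed": rng.uniform(2.0, 12.0),
            "ped_offset": rng.uniform(0.0, 3.0),
        }
        rho = robustness(run_simulation(params))
        (safe_table if rho >= 0 else error_table).append((rho, params))
    # Runs closest to the safe/unsafe boundary (|rho| near zero) are the most
    # informative candidates for promotion to track testing.
    safe_table.sort(key=lambda entry: entry[0])
    error_table.sort(key=lambda entry: entry[0], reverse=True)
    return safe_table, error_table

if __name__ == "__main__":
    safe, errors = falsify()
    print(f"{len(safe)} safe runs, {len(errors)} violating runs")
    for rho, params in errors[:3]:
        print(f"violation with margin {rho:.2f} m at {params}")
</code>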
Our project also leverages scenario distribution over maps: using OpenDRIVE networks of the TalTech campus, SCENIC instantiates the same behavioral narrative (say, overtaking a slow or stopped vehicle) at diverse locations, ensuring that lane geometry, curbside clutter, and occlusions vary meaningfully while the safety property remains constant. The result is a family of tests that stress the same planning and perception obligations under different geometric and environmental embeddings.

====== Selection, Execution, and Measuring the Sim-to-Real Gap ======

A formal pipeline is only convincing if simulated insights transfer to the track. After falsification, we select representative safe/unsafe cases through visualization or clustering of the safe/error tables and implement them on a closed course with controllable agents. Notably, the same SCENIC parameters (starting pose, start time, velocities) that drove agents in simulation also drive the hardware actors on the track, subject to the physical limitations of the test equipment. This parity enables apples-to-apples comparisons between simulated and real traces.

We then quantify the sim-to-real gap using time-series metrics such as dynamic time warping and the Skorokhod distance to compare trajectories, first-detection times, and minimum-distance profiles. In published results, trajectories for the same test were qualitatively similar but showed measurable differences in separation minima and timing; moreover, even identical simulations can diverge when the autonomy stack is non-deterministic, a reality that the methodology surfaces rather than hides. Measuring this variance is a virtue: tests with lower variance are more reproducible on track, while highly variable tests reveal sensitivity in planning, perception, or prediction that merits redesign or tighter ODD limits.

This formal sim-to-track pipeline does more than label outcomes; it helps diagnose causes. By replaying logged runs through the autonomy stack’s visualization tools, we can attribute unsafe behavior to perception misses, unstable planning decisions, or mispredictions, and then target those subsystems in subsequent formal campaigns. In one case set, the dominant failure mode was oscillatory planning around a pedestrian, discovered and characterized through this exact loop of scenario specification, falsification, track execution, and trace analysis.

====== Multi-Fidelity Workflows and Continuous Assurance ======

Exhaustive testing is infeasible, so we combine multiple fidelity levels to balance breadth with realism. Low-fidelity (LF) platforms sweep large scenario grids quickly to map where safety margins begin to tighten; high-fidelity (HF) platforms (e.g., LGSVL/Unity integrated with Autoware) replay the most informative LF cases with photorealistic sensors and closed-loop control. Logging is harmonized so that KPIs and traces are comparable across levels, and optimization or tuning derived from LF sweeps is verified under HF realism before any track time is spent. In extensive experiments, thousands of LF runs revealed broad patterns, but only HF replays uncovered subtle interactions that flipped outcomes: evidence that fidelity matters exactly where the safety case will later be challenged.
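As one concrete example of the trace comparisons described in this section, the sketch below implements a basic dynamic time warping (DTW) distance over 2-D position traces, the kind of time-series metric used above to compare simulated and track trajectories. It is a minimal textbook version under assumed inputs: the Skorokhod distance used alongside DTW in the project is not reproduced here, and the example trajectories are fabricated rather than real logs.

<code python>
import math

def dtw_distance(traj_a, traj_b):
    """Basic dynamic time warping (DTW) distance between two 2-D trajectories.

    traj_a, traj_b: sequences of (x, y) positions in metres, sampled over time.
    DTW finds the cheapest monotone alignment between the two time series and
    sums the point-to-point distances along it, so traces that follow the same
    path but differ mainly in timing still score close together.
    """
    n, m = len(traj_a), len(traj_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(traj_a[i - 1], traj_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat a traj_b point
                                    cost[i][j - 1],      # repeat a traj_a point
                                    cost[i - 1][j - 1])  # advance both traces
    return cost[n][m]

if __name__ == "__main__":
    # Fabricated example: a simulated path and a "track" path that follows the
    # same route slightly later and slightly wider (not real project logs).
    simulated = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3), (3.0, 0.6), (4.0, 1.0)]
    track = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.2), (2.5, 0.5), (3.5, 0.9), (4.2, 1.2)]
    print(f"DTW distance: {dtw_distance(simulated, track):.2f}")
</code>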
This workflow sits within a DOE-driven V&V suite that treats the digital twin and scenario engine as programmable assets. Scenario definitions, vehicle models, and evaluation logic are versioned; control-loop delays, time-to-collision (TTC) profiles, and collision metrics are computed consistently per run; and the same infrastructure can be extended downward into hardware-in-the-loop experiments on low-level control paths, testing actuator-path integrity under identical scene conditions. In our project platform, the simulator co-runs with Autoware, accepts parameterized scenarios through a public interface, and emits validation reports that roll up from frame-level signal checks to mission-level success, closing the traceability chain from formal property to system outcome.

Just as important as capability is honesty about limits. Our own survey and case study argue for explicit attention to abstraction choices, modeling assumptions, and convergence questions for AI-based components. The literature and our results stress that simulation’s value depends on calibrated models, careful measurement of non-determinism, and disciplined mapping to the real world; formal methods help precisely because they make these assumptions visible, testable, and comparable over time. The digital-twin perspective then turns those measurements into an engine for continuous improvement, updating the twin as the physical system and environment evolve.
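As a closing illustration of the frame-level checks that such validation reports roll up, the sketch below computes a time-to-collision (TTC) profile from logged range and closing-speed samples and reduces it to a per-run verdict. The log layout, the 2 s threshold, and the sample values are assumptions for illustration only, not the project's actual report schema or safety limits.

<code python>
import math

def ttc_profile(ranges, closing_speeds):
    """Frame-level time-to-collision (TTC) profile.

    ranges: distance to the lead object per frame, in metres.
    closing_speeds: rate at which that distance shrinks, in m/s
                    (values <= 0 mean the gap is constant or opening).
    TTC is range / closing speed while closing, infinite otherwise.
    """
    return [r / v if v > 1e-6 else math.inf
            for r, v in zip(ranges, closing_speeds)]

def rollup(ranges, closing_speeds, ttc_threshold_s=2.0):
    """Roll frame-level TTC checks up to a single per-run verdict."""
    profile = ttc_profile(ranges, closing_speeds)
    min_ttc = min(profile)
    return {
        "min_ttc_s": min_ttc,
        "frames_below_threshold": sum(t < ttc_threshold_s for t in profile),
        "run_passed": min_ttc >= ttc_threshold_s,
    }

if __name__ == "__main__":
    # Fabricated log of a braking encounter (not project data).
    ranges = [25.0, 20.0, 15.5, 11.5, 8.5, 7.0, 6.8, 6.8]
    closing = [5.0, 5.0, 4.5, 4.0, 3.0, 1.5, 0.2, 0.0]
    print(rollup(ranges, closing))
</code>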