[rahulrazdan][✓ rahulrazdan, 2025-06-16]
The fundamental characteristics of DBE systems are problematic in safety-critical systems. However, the IT sector has been a key megatrend that has transformed the world over the last 50 years. In the process, it has developed large ecosystems around semiconductors, operating systems, communications, and application software. Using these ecosystems is now critical to nearly every product’s success, so mixed-domain safety-critical products are a reality.

Mixed-domain structures can be classified into three broad paradigms, each of which has very different V&V requirements: Mechanical Replacement (big PBE, small DBE), Electronics Adjacent (separate PBE and DBE), and Autonomy (big DBE, small PBE).

Drive-by-wire functionality is an example of the Mechanical Replacement paradigm, where the original mechanical functionality is implemented by electronic components (HW/SW). In their initial configurations, these mixed electronic/mechanical systems were physically separated as independent subsystems. In this configuration, the V&V process looked very similar to the traditional mechanical verification process. Regulations were updated to include the idea of electronics failure with standards such as ISO 26262 and SOTIF (see Table I).

TABLE I DIFFERENCES BETWEEN SOTIF AND ISO 26262

| Aspect | ISO 26262 | SOTIF |
|---|---|---|
| Focus | System faults and malfunctions | Hazards due to functional insufficiencies |
| Applicability | All safety-critical systems | Primarily ADAS and autonomous systems |
| Hazard Source | Hardware and software failure | Limitations in functionality, unknown scenarios |
| Methods | Fault avoidance and control | Scenario-based testing |
The paradigm of separate physical subsystems has the advantage of V&V simplification and safety, but the large disadvantage of component skew and material cost. Thus, a major trend has been to build underlying computational fabrics with networking and to separate functionality virtually. From a V&V perspective, this means that the virtual backbone which maintains this separation (e.g., an RTOS) must be verified to a very high standard.

Infotainment systems are an example of Electronics Adjacent integration. Generally, there is an independent IT infrastructure working alongside the safety-critical infrastructure, and from a V&V perspective, the two can be validated separately. However, the presence of infotainment systems enables very powerful communication technologies (5G, Bluetooth, etc.) through which the cyber-physical system can be impacted by external third parties. From a safety perspective, the simplest method for maintaining safety would be to physically separate these systems. However, this is not typically done because a connection is required to provide “over-the-air” updates to the device. Thus, the V&V capability must again verify that the virtual safeguards against malicious intent are robust.

Finally, the last level of integration is in the context of autonomy. In autonomy, the DBE processes of sensing, perception, location services, and path planning envelop the traditional mechanical PBE functionality. As Figure 5 shows, the execution paradigm consists of four layers of functionality. The inner core, layer 4, is the world of physics, which has all the nice PBE properties. Layer 3 consists of the traditional actuation and edge sensing functionality, which maintains nice PBE properties. Layer 2 is a combination of software and AI which operates in the DBE-AI world. Finally, the outer layer, the design-of-experiment V&V layer, has the unique challenge of testing a system with fundamentally PBE properties but doing so through a layer dominated by DBE-AI functions.
Fig. 5. Conceptual Layers in Cyber-Physical Systems
V. AUTONOMY V&V CURRENT APPROACHES

For safety-critical systems, the evolution of V&V has been closely linked to regulatory standards frameworks such as ISO 26262. Key elements of this framework include:

1) System Design Process: A structured development assurance approach for complex systems, incorporating safety certification within the integrated development process.
2) Formalization: The formal definition of system operating conditions, functionalities, expected behaviors, risks, and hazards that must be mitigated.
3) Lifecycle Management: The management of components, systems, and development processes throughout their lifecycle.

The primary objective was to meticulously and formally define the system design, anticipate expected behaviors and potential issues, and comprehend the impact over the product's lifespan.

With the advent of conventional software paradigms, safety-critical V&V adapted by preserving the original system design approach while integrating software as system components. These software components maintained the same overall structure of fault analysis, lifecycle management, and hazard analysis within system design. However, certain aspects required extension. For instance, in the airborne domain, standard DO-178C, which addresses “Software Considerations in Airborne Systems and Equipment Certification,” updated the concept of hazard from physical failure mechanisms to functional defects, acknowledging that software does not degrade due to physical processes. Lifecycle management concepts were also revised to reflect traditional software development practices. Design Assurance Levels (DALs) were incorporated, allowing the integration of software components into system design, functional allocation, performance specification, and the V&V process, akin to SOTIF in the automotive industry.

TABLE II CONTRAST OF CONVENTIONAL AND MACHINE LEARNING ALGORITHMS
| Conventional Algorithm | ML Algorithms | Comment |
|---|---|---|
| Logical Theory | No Theory | In conventional algorithms, one needs a theory of operation to implement the solution. ML algorithms can often “work” without a clear understanding of exactly why they work. |
| Analyzable | Not Analyzable | Conventional algorithms are encoded in a way one can see and analyze the software code. Most validation and verification methodologies rely on this ability to find errors. ML algorithms offer no such ability, and this leaves a large gap in validation. |
| Causal | Correlation | Conventional algorithms have built-in causality, and ML algorithms discover correlations. The difference is important if one wants to reason at a higher level. |
| Deterministic | Non-Deterministic | Conventional algorithms are deterministic in nature, and ML algorithms are fundamentally probabilistic in nature. |
| Known Computational Complexity | Unknown Computational Complexity | Given the analyzable nature of conventional algorithms, one can build a model for computational complexity, that is, how long the algorithm will take to run. For ML techniques, no generic method exists to evaluate computational complexity. |
Moving beyond conventional software, AI has built a “learning” paradigm. In this paradigm, there is a period of training in which the AI machine “learns” from data to build its own rules. Here, learning is defined on top of traditional optimization algorithms which try to minimize some notion of error. This is effectively data-driven software development. However, as Table II above shows, there are profound differences between AI software and conventional software. These differences have generated three “elephant in the room” issues: AI Component Validation, AI Specification, and Intelligent Scaling.
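To make the phrase “learning is defined on top of traditional optimization algorithms” concrete, the following is a minimal sketch, not taken from the paper, of training as error minimization. It fits a two-parameter linear model by gradient descent on mean squared error. The function names, the learning rate, and the training data are illustrative assumptions.

```python
def mean_squared_error(samples, w, b):
    """Average squared gap between the prediction w*x + b and the label y."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


def train_linear(samples, lr=0.05, epochs=2000):
    """Learn w and b from data by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical training data produced by the hidden rule y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = train_linear(data)
print(f"learned w={w:.4f}, b={b:.4f}, error={mean_squared_error(data, w, b):.2e}")
```

The “rule” the program ends up with (the values of w and b) is never written by a developer. It emerges from the data and the optimizer, which is why the usual code-centric V&V techniques have little to hold on to.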
A. AI COMPONENT VALIDATION
Both the automotive and airborne spaces have reacted to AI by viewing it as “specialized software” in standards such as ISO 8800 [14] and [13]. This approach has the great utility of leveraging all the past work in generic mechanical safety and in software validation. However, one must now handle the fact that the “code” is generated from data rather than written as conventional program code. In the world of V&V, this difference manifests itself in three significant aspects: coverage analysis, code reviews, and version control.

TABLE III

| V&V Technique | Software | AI/ML |
|---|---|---|
| Coverage Analysis | Code structure provides basis of coverage | No structure |
| Code Reviews | Crowd-source expert knowledge | No code to review |
| Version Control | Careful construction/release | Very difficult with data |
These differences generate an enormous issue for intelligent test generation and any argument for completeness. This is an area of active research, and two threads have emerged:

1) Training Set Validation: Since the final referenced component is very hard to analyze, one approach is to examine the training set and the ODD to find interesting tests which may expose the cracks between them [16].
2) Robustness to Noise: Either through simulation or using formal methods [17], the approach is to assert various higher-level properties and use these to test the component. An example in object recognition might be to assert the property that an object should be recognized independent of orientation.

Overall, developing robust methods for AI component validation remains an active and unsolved research topic, even for “fixed” function AI components, that is, AI components whose function changes only under active version control. Of course, many AI applications prefer a model where the AI component is constantly morphing. Validating the morphing situation is a topic of future research.
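The orientation example from the robustness thread can be sketched as a property check. This is a minimal illustration under stated assumptions, not a method from the cited references: images are small 2D grids, only 90-degree rotations are considered, and both classifiers (`robust_classifier`, `brittle_classifier`) are hypothetical stand-ins for an AI component.

```python
def rotate90(image):
    """Rotate a 2D grid (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def orientations(image):
    """Yield the image in its four 90-degree orientations."""
    current = image
    for _ in range(4):
        yield current
        current = rotate90(current)


def check_orientation_invariance(classify, images):
    """Return the images whose label changes under some rotation."""
    failures = []
    for image in images:
        labels = {classify(view) for view in orientations(image)}
        if len(labels) > 1:
            failures.append(image)
    return failures


def robust_classifier(image):
    """Hypothetical component: decides from the number of lit pixels."""
    return "object" if sum(map(sum, image)) >= 3 else "empty"


def brittle_classifier(image):
    """Hypothetical component: only looks at the top row of the image."""
    return "object" if any(image[0]) else "empty"


bar = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
ell = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
test_images = [bar, ell]

print(len(check_orientation_invariance(robust_classifier, test_images)))
print(len(check_orientation_invariance(brittle_classifier, test_images)))
```

The check needs no ground-truth label for any image. It only asserts that the answer must not change under the transformation, which is what makes property-based testing attractive when the component itself cannot be analyzed.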
B. AI SPECIFICATION
Even for well-defined systems where system-level abstractions are available, AI/ML components significantly increase the difficulty of intelligent test generation. With a golden specification, however, one can follow a structured process to make significant progress in validation and even gate the AI results with conventional safeguards. Unfortunately, one of the most compelling uses of AI is to employ it in situations where the specification of the system is not well defined or not viable using conventional programming. In these Specification-Less ML (SLML) situations, not only is building interesting tests difficult, but evaluating the correctness of the results creates further difficulty. Further, most of the major systems (perception, location services, path planning, etc.) in autonomous vehicles fall into this category of system function and AI usage. To date, there have been two approaches to attack the lack-of-specification problem: Anti-Spec and AI-Driver.

1) Anti-Spec

In these situations, the only approach left is to specify correctness through an anti-spec. The simplest anti-spec is to avoid accidents. Based on some initial work by Intel, there is a standard, IEEE 2846, “Assumptions for Models in Safety-Related Automated Vehicle Behavior” [18], which establishes a framework for defining a minimum set of assumptions regarding the reasonably foreseeable behaviors of other road users. For each scenario, it specifies assumptions about the kinematic properties of other road users, including their speed, acceleration, and possible maneuvers. Challenges include an argument for completeness, a specification for the machinery for checking against the standard, and the connection to a liability governance framework.

2) AI-Driver

While IEEE 2846 comes from a bottom-up technology perspective, Koopman/Widen [19] have proposed the concept of defining an AI driver which must replicate all the competencies of a human driver in a complex, real-world environment.
Key points of Koopman’s AI driver concept include:

a) Full Driving Capability: The AI driver must handle the entire driving task, including perception (sensing the environment), decision-making (planning and responding to scenarios), and control (executing physical movements like steering and braking). It must also account for nuances like social driving norms and unexpected events.
b) Safety Assurance: Koopman stresses that AVs need rigorous safety standards, similar to those in industries like aviation. This includes identifying potential failures, managing risks, and ensuring safe operation even in the face of unforeseen events.
c) Human Equivalence: The AI driver must meet or exceed the performance of a competent human driver. This involves adhering to traffic laws, responding to edge cases (rare or unusual driving scenarios), and maintaining situational awareness at all times.
d) Ethical and Legal Responsibility: An AI driver must operate within ethical and legal frameworks, including handling situations that involve moral decisions or liability concerns.
e) Testing and Validation: Koopman emphasizes the importance of robust testing, simulation, and on-road trials to validate AI driver systems. This includes covering edge cases and long-tail risks, and ensuring that systems generalize across diverse driving conditions.

Overall, it is a very ambitious endeavor, and there are significant challenges to building this specification of a reasonable driver. First, the idea of a “reasonable” driver is not even well encoded on the human side. Rather, this definition of “reasonableness” is built over a long history of legal distillation, and of course, the human standard is built on the understanding of humans by other humans. Second, the complexity of such a standard would be very high, and it is not clear whether it is doable. Finally, it may take a long period of legal distillation to reach some level of closure on a human-like “AI-Driver.”

Currently, the state of the art for specification is relatively poor for both ADAS and AV. ADAS systems, which are widely deployed, have massive divergences in behavior and completeness. When a customer buys ADAS, it is not entirely clear what they are getting. Tests by industry groups such as AAA, Consumer Reports, and IIHS have shown the significant shortcomings of existing solutions [20]. In 2024, IIHS introduced a ratings program to evaluate the safeguards of partial driving automation systems. Out of 14 systems tested, only one received an acceptable rating, highlighting the need for improved measures to prevent misuse and ensure driver engagement [21]. Today, there is only one non-process-oriented regulation in the marketplace: the NHTSA regulation on AEB [22].
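The anti-spec idea from subsection B.1 can be illustrated with a small checker: given assumed kinematic bounds on road users, flag any state that the assumptions say could lead to a collision. The sketch below uses the longitudinal safe-distance formula from the published Responsibility-Sensitive Safety (RSS) model as a stand-in. It is not text from IEEE 2846, and every parameter value is an illustrative assumption rather than a value from the standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KinematicAssumptions:
    """Assumed bounds on behavior (illustrative values, SI units)."""
    response_time_s: float = 1.0   # rear vehicle reaction delay
    max_accel_rear: float = 3.0    # worst-case acceleration during the delay
    min_brake_rear: float = 4.0    # braking the rear vehicle is sure to apply
    max_brake_front: float = 8.0   # hardest braking assumed for the front vehicle


def min_safe_gap(v_rear, v_front, a):
    """RSS-style minimum longitudinal gap in meters for same-direction traffic."""
    v_after_delay = v_rear + a.response_time_s * a.max_accel_rear
    gap = (
        v_rear * a.response_time_s
        + 0.5 * a.max_accel_rear * a.response_time_s ** 2
        + v_after_delay ** 2 / (2 * a.min_brake_rear)
        - v_front ** 2 / (2 * a.max_brake_front)
    )
    return max(0.0, gap)


def violates_anti_spec(gap_m, v_rear, v_front, a=KinematicAssumptions()):
    """True if the observed gap is smaller than the assumed safe minimum."""
    return gap_m < min_safe_gap(v_rear, v_front, a)


# Both vehicles at 20 m/s: a 30 m gap is flagged, an 80 m gap is not.
print(min_safe_gap(20.0, 20.0, KinematicAssumptions()))
print(violates_anti_spec(30.0, 20.0, 20.0))
print(violates_anti_spec(80.0, 20.0, 20.0))
```

A checker of this kind says nothing about what the vehicle should do. It only marks states to avoid, and its verdict is only as good as the assumed bounds, which is where the completeness and liability challenges noted above enter.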
C. INTELLIGENT TEST GENERATION
Recognizing the importance of intelligent scenarios for testing, three major styles of intelligent test generation are currently active: physical testing, real-world seeding, and virtual testing.

1) Physical Testing

Typically, scaling physical testing is the most expensive way to verify functionality. However, Tesla has built a flow in which its existing fleet acts as a large distributed testbed. Using this fleet, Tesla's approach to autonomous driving uses a sophisticated data pipeline and deep learning system designed to process vast amounts of sensor data efficiently [23]. In this flow, the scenario under construction is the one driven by the driver, and the criterion for correctness is the driver's corrective action. Behind the scenes, the MaVV flow can be managed by large databases and supercomputers (Dojo) [24]. By employing this methodology, Tesla knows that its scenarios are always valid. However, there are challenges with this approach. First, the real world moves very slowly in terms of new unique situations. Second, by definition the scenarios seen are very much tied to the market presence of Tesla, so they are not predictive of new situations. Finally, the process of capturing data, discerning an error, and building corrective action is non-trivial. At the extreme, this process is akin to taking crash logs from broken computers, diagnosing them, and building the fixes.

2) Real-World Seeding

Another line of test generation is to use physical situations as a seed for further virtual testing. Pegasus, the seminal project initiated in Germany, took such an approach. The project emphasized a scenario-based testing methodology which used observed data from real-world conditions as a base [25]. A similar effort comes from the University of Warwick, with a focus on test environments, safety analysis, scenario-based testing, and safe AI. One of the contributions from Warwick is the Safety Pool Scenario Database [26].
Databases and seeding methods, especially of interesting situations, offer some value, but their completeness is not clear. Further, databases of tests are very susceptible to being over-optimized against by AI algorithms.

3) Virtual Testing

Another important contribution was ASAM OpenSCENARIO 2.0 [27], a domain-specific language designed to enhance the development, testing, and validation of Advanced Driver-Assistance Systems (ADAS) and Automated Driving Systems (ADS). A high-level language allows for a symbolic, higher-level description of the scenario, with an ability to grow in complexity through rules of composition. Underneath the symbolic apparatus is pseudo-random test generation, which can scale the scenario generation process. The randomness also offers a chance to expose “unknown-unknown” errors.

Beyond component validation, solutions have been proposed specifically for autonomous systems, such as UL 4600, “Standard for Safety for the Evaluation of Autonomous Products” [28]. Similar to ISO 26262/SOTIF, UL 4600 focuses on safety risks across the full lifecycle of the product and introduces a structured “safety case” approach. The crux of this methodology is to document and justify how autonomous systems meet safety goals. It also emphasizes the importance of identifying and validating against a wide range of real-world scenarios, including edge cases and rare events. There is also a focus on including human-machine interactions. UL 4600 is a good step forward, but in the end, it is a process standard and does not offer any advice on exactly how to solve the “elephants in the room” for AI validation.

Overall, nearly all the standards and current regulations are process-centric. They focus on the product developer making an argument and obtaining approval, either through self-certification or from an explicit regulator. This methodology has the Achilles heel that the product owner does not have a method to get past the critical issues, nor does the regulator have a way to assess completeness.

All of these techniques have moved the state of the art forward, but there remains a very fundamental issue. For both physical and virtual execution, how does one scale sufficiently to reasonably explore the ODD? Further, when performing virtual execution, what level of abstraction is appropriate? Is it better to have abstract models or highly detailed physics-based models? Typically, the answer depends on the nature of the verification. If so, how do these abstraction levels connect to each other? A key missing piece is an ability to split the problem into manageable pieces and then recompose the result. This capability has not been developed for cyber-physical systems but has been developed for semiconductor designs.
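The virtual testing idea in subsection C.3, a symbolic scenario that is composed from smaller pieces and then expanded by pseudo-random generation, can be sketched in a few lines. The code below is plain Python and does not use ASAM OpenSCENARIO syntax. The scenario names, parameters, and ranges are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterRange:
    """A symbolic parameter: any value in [low, high] is a valid instance."""
    low: float
    high: float

    def sample(self, rng):
        return rng.uniform(self.low, self.high)


# Abstract scenarios described by ranges rather than concrete values.
CUT_IN = {
    "ego_speed_mps": ParameterRange(10.0, 35.0),
    "cut_in_gap_m": ParameterRange(5.0, 40.0),
}
WEATHER = {
    "rain_rate_mm_per_h": ParameterRange(0.0, 25.0),
}


def compose(*abstract_scenarios):
    """Combine abstract scenarios; on a name clash the later one wins."""
    merged = {}
    for scenario in abstract_scenarios:
        merged.update(scenario)
    return merged


def generate(abstract_scenario, count, seed):
    """Expand one abstract scenario into `count` concrete test cases."""
    rng = random.Random(seed)  # seeded so that every test is reproducible
    return [
        {name: rng_range.sample(rng) for name, rng_range in abstract_scenario.items()}
        for _ in range(count)
    ]


rainy_cut_in = compose(CUT_IN, WEATHER)
concrete_tests = generate(rainy_cut_in, count=3, seed=42)
for test in concrete_tests:
    print(test)
```

Seeding the generator keeps each randomly produced test reproducible, which matters once a failing scenario has to be replayed. Uniform sampling is the simplest possible choice here and does nothing to answer the scaling question raised above, since the number of samples needed to explore the ODD grows quickly with each composed parameter.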