===== Barriers (instruction synchronization / data memory / data synchronization / one-way barrier) =====
| + | |||
| + | Many processors today can execute instructions out of the programmer-defined order. This is done to improve performance, | ||
| + | |||
| + | The barrier instructions enforce instruction order between operations. This does not matter whether the processor has a single core or multiple cores; these barrier instructions ensure that the data is stored before the next operation with them, that the previous instruction result is stored before the next instruction is executed, and that the second core (if available) accesses the newest data. ARM has implemented special barrier instructions to do that: ''< | ||
| + | |||
| + | Since the instructions are prefetched and decoded ahead of time, those earlier fetched and executed instructions might not yet reflect the newest state. The ''< | ||
| + | ''< | ||
| + | ''< | ||
| + | ''< | ||
| + | ''< | ||
| + | |||
| + | The code example ensures that the control register settings are stored before the following instructions are fetched. The setting enables the Memory Management Unit and updates new address translation rules. The following instructions, | ||
| + | Sometimes the instructions may sequentially require access to memory, and situations where previous instructions store data in memory and the next instruction is intended to use the most recently stored data. Of course, this example is to explain the data synchronisation barrier. Such an example is unlikely to occur in real life. Let's imagine a situation where the instruction computes a result and must store it in memory. The following instruction must use the result stored in memory by the previous instruction. | ||
| + | |||
| + | The first problem may occur when the processor uses special buffers to maximise data throughput to memory. This buffer collects multiple data chunks and, in a single cycle, writes them to memory. The buffer is not visible to the following instructions so that they may use the old data rather than the newest. The data synchronisation barrier instruction will solve this problem and ensure that the data are stored in memory rather than left hanging in the buffer.\\ | ||
| + | **Core 0:** \\ | ||
| + | ''< | ||
| + | ''< | ||
| + | ''< | ||
| + | |||
| + | **Core 1:**\\ | ||
| + | ''< | ||
| + | ''< | ||
| + | ''< | ||
| + | In the example, the data memory barrier will ensure that the data is stored before the function status flag. The DMB and DSB barriers are used to ensure that data are available in the cache memory. And the purpose of the shared domain is to specify the scope of cache consistency for all inner processor units that can access memory. These instructions and parameters are primarily used for cache maintenance and memory barrier operations. The argument that determines which memory accesses are ordered by the memory barrier and the Shareability domain over which the instruction must operate. This scope effectively defines which Observers the ordering imposed by the barriers extends to. | ||
| + | |||
| + | In this example, the barrier takes the parameter; the domain “ISH” (inner sharable) is used with barriers meant for inner sharable cores. That means the processors (in this area) can access these shared caches, but the hardware units in other areas of the system (such as DMA devices, GPUs, etc.) cannot. The “''< | ||
| + | These domain have their own options. These options are available in the table below. | ||
| + | <table tab_label> | ||
| + | < | ||
| + | ^ Option ^ Order access (before-after) ^ Shareability domain ^ | ||
| + | | OSH | Any-Any | Outer shareable | | ||
| + | | OSHLD | Load-Load, Load-Store | Outer shareable | | ||
| + | | OSHST | Store-Store | Outer shareable | | ||
| + | | NSH | Any-Any | Non-shareable | | ||
| + | | NSHLD | Load-Load, Load-Store | Non-shareable | | ||
| + | | NSHST | Store-Store | Non-shareable | | ||
| + | | ISH | Any-Any | Inner shareable | | ||
| + | | ISHLD | Load-Load, Load-Store | Inner shareable | | ||
| + | | ISHST | Store-Store | Inner shareable | | ||
| + | | SY | Any-Any | Full system | | ||
| + | | ST | Store-Store | Full system | | ||
| + | | LD | Load-Load, Load-Store | Full system | | ||
| + | </ | ||
| + | |||
| + | Order access specifies which classes of accesses the barrier operates on. A " | ||
| + | |||
| + | All these barriers are practical with high-level programming languages, where unsafe optimisation and specific memory ordering may occur. In most scenarios, there is no need to pay special attention to memory barriers in single-processor systems. Although the CPU supports out-of-order and predictive execution, in general, it ensures that the final execution result meets the programmer' | ||
| + | * Share data between multiple CPU cores. Under the weak consistency memory model, a CPU's disordered memory access order may cause contention for access. | ||
| + | * Perform operations related to peripherals, | ||
| + | * Modifying the memory management strategy, such as context switching, requesting page faults, and modifying page tables. | ||
| + | In short, the purpose of using memory barrier instructions is to ensure the CPU executes the program' | ||
===== Conditional instructions =====
| + | Meanwhile, speculative instruction execution consumes power. If the speculation gave the correct result, the power wasn’t wasted; otherwise, it was. Power consumption must be taken into account when designing the program code. Not only is power consumption essential, but so is data safety. The Cortex-A76 on the Raspberry Pi 5 has a very advanced branch predictor, but mispredictions still cause wasted instructions, | ||
| + | |||
| + | The use of the conditional select (''< | ||
| + | ''< | ||
| + | ''< | ||
| + | |||
| + | This conditional instruction writes the value of the first source register to the destination register if the condition is TRUE. If the condition is FALSE, it writes the value of the second source register to the destination register. So, if the ''< | ||
| + | |||
| + | Other conditional instructions can be used similarly: | ||
| + | |||
| + | {{: | ||
| + | |||
| + | These conditional instructions are helpful in branchless conditional checks. Taking into account that these instructions can also be executed speculatively, | ||
===== Power saving =====
| + | |||
| + | Some special instructions are meant to put the processor into sleep modes and wait for an event to occur. The processor can be woken up by an interrupt or by an event. In these modes, the code may be explicitly created to initialise interrupts and events, and to handle them. After that, the processor may be put into sleep mode and remain asleep unless an event or interrupt occurs. The following code example can be used only in bare-metal mode – without an OS. | ||
| + | < | ||
| + | < | ||
| + | < | ||
| + | .global idle_loop | ||
| + | idle_loop: | ||
| + | 1: WFI @ Wait For Interrupt, core goes to low-power | ||
| + | B | ||
| + | </ | ||
| + | </ | ||
| + | < | ||
| + | The example only waits for interrupts to occur. To wait for events and interrupts, the ''< | ||
| + | |||
| + | On a Raspberry Pi 5 running Linux, it is not observable whether the CPU enters these modes, because the OS generates many events between CPU cores and also handles many interrupts from communication interfaces and other Raspberry Pi components. | ||
| + | Another way to save more energy while running the OS on the Raspberry Pi is to reduce the CPU clock frequency. There is a scheme called dynamic voltage and frequency scaling (DVFS), the same technique used in laptops, that reduces power consumption and thereby increases battery life. On the internet, there is a paper named “Cooling a Raspberry Pi Device ”. The paper includes one chapter explaining how to reduce the CPU clock frequency. The Linux OS exposes CPU frequency scaling through sysfs, e.g.: | ||
| + | * ”/ | ||
| + | * “/ | ||
| + | |||
| + | It is possible to use syscalls in assembler to open and write specific values into them. | ||
| + | < | ||
| + | < | ||
| + | < | ||
| + | .global _start | ||
| + | .section .text | ||
| + | _start: | ||
| + | @ openat(AT_FDCWD, | ||
| + | mov x0, #-100 @ AT_FDCWD | ||
| + | ldr x1, =gov_path | ||
| + | mov x2, #1 @ O_WRONLY | ||
| + | mov x3, #0 @ mode (unused) | ||
| + | mov x8, #56 @ sys_openat | ||
| + | svc #0 | ||
| + | mov x19, x0 @ save fd | ||
| + | |||
| + | @ write(fd, " | ||
| + | mov x0, x19 | ||
| + | ldr x1, =gov_value | ||
| + | mov x2, #10 @ length of " | ||
| + | mov x8, #64 @ sys_write | ||
| + | svc #0 | ||
| + | |||
| + | @ close(fd) | ||
| + | mov x0, x19 | ||
| + | mov x8, #57 @ sys_close | ||
| + | svc #0 | ||
| + | |||
| + | @ exit(0) | ||
| + | mov x0, #0 | ||
| + | mov x8, #93 @ sys_exit | ||
| + | svc #0 | ||
| + | |||
| + | .section .rodata | ||
| + | gov_path: | ||
| + | .asciz "/ | ||
| + | gov_value: | ||
| + | .asciz " | ||
| + | </ | ||
| + | </ | ||
| + | Similar things can be done with CPU frequencies, | ||
| + | |||