Advanced Assembly Programming

ARM processors can do far more than toggle a GPIO pin or push data through a communication interface. Serious data processing takes real computational effort, and it is not only a matter of ADD or SUB instructions: performing the same operation on many data elements requires knowledge of what the hardware offers and some experience with this style of programming. A few building blocks repeat across different tasks: setting up counters or pointers to variables, walking through memory with post-indexed loads and stores, performing vector operations, returning the result of a function, and so on. There are also some basic rules: arguments are passed to a function in registers X0..X7, and the return value comes back in X0. If the stack is used, it must stay 16-byte aligned at all times, and the LDP/STP instructions are the natural choice for saving and restoring registers when entering or leaving a function. An essential part of such code is tail handling: when an algorithm processes 4 bytes in parallel and the caller provides, for example, 15 bytes, the first 12 bytes can be handled in parallel blocks of 4, but the remaining 3 cannot, and forgetting about them leads to incorrect results or out-of-bounds accesses.
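
As a minimal sketch of these rules (not part of the example below; the function name my_func and the choice of X19/X20 are only for illustration), a function that needs the stack could save and restore registers like this, keeping SP 16-byte aligned the whole time:
my_func:
    STP     X29, X30, [SP, #-32]!   // push frame pointer and link register, SP stays 16-byte aligned
    STP     X19, X20, [SP, #16]     // save two callee-saved registers as a pair
    MOV     X29, SP                 // establish the frame pointer
    // ... function body: arguments arrive in X0..X7 ...
    LDP     X19, X20, [SP, #16]     // restore the callee-saved registers
    LDP     X29, X30, [SP], #32     // pop the frame and recover the return address
    RET                             // return; the result, if any, is in X0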

First, let's take a look at the complete example code. The following example demonstrates a function that adds two float arrays: result[i] = a[i] + b[i], where the pointers result, a and b and the number of elements are passed as arguments.

Listing 1: Add multiple elements in parallel
// X0 = result, X1 = a, X2 = b, X3 = n (the number of floats)
.global vadd_f32
vadd_f32:
    // 16 floats are processed per loop iteration → 64 bytes per array per loop
    // Loop trip count = n / 16, the remainder is handled by the tail loop
    MOV     X9, #16
    UDIV    X8, X3, X9            // X8 = blocks of 16
    MSUB    X9, X8, X9, X3        // X9 = n - 16*blocks (tail elements)
    CBZ     X8, .tail

.loop:
    // Load 16 floats from a and b (4 vectors of 4 floats each)
    LD1     {V0.4S, V1.4S, V2.4S, V3.4S}, [X1], #64   // a
    LD1     {V4.4S, V5.4S, V6.4S, V7.4S}, [X2], #64   // b
    // Add the vectors
    FADD    V0.4S, V0.4S, V4.4S
    FADD    V1.4S, V1.4S, V5.4S
    FADD    V2.4S, V2.4S, V6.4S
    FADD    V3.4S, V3.4S, V7.4S
    ST1     {V0.4S, V1.4S, V2.4S, V3.4S}, [X0], #64   // Store the result
    SUBS    X8, X8, #1
    B.NE    .loop

.tail:
    // Handle the remaining n % 16 scalars one by one
    CBZ     X9, .done

.tail_loop:
    LDR     S0, [X1], #4
    LDR     S1, [X2], #4
    FADD    S0, S0, S1
    STR     S0, [X0], #4
    SUBS    X9, X9, #1
    B.NE    .tail_loop

.done:
    RET
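
For reference, a hypothetical call site could look like the sketch below; the labels result_buf, array_a and array_b are assumptions and do not appear in Listing 1:
    // Hypothetical caller: the first four arguments go in X0..X3
    LDR     X0, =result_buf        // X0 = address of the result array
    LDR     X1, =array_a           // X1 = address of the first input array
    LDR     X2, =array_b           // X2 = address of the second input array
    MOV     X3, #15                // X3 = number of floats (15 forces the tail loop to run)
    BL      vadd_f32               // call the function; X30 receives the return address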

Listing 1 contains new instructions and a slightly different style of labels. A label that begins with a dot, such as ".loop", follows the naming style the GNU assembler uses for internal labels; strictly speaking, only names that begin with ".L" are true local labels that the assembler keeps out of the symbol table, and a compiler that generates assembly emits unique names of that form, for example ".L23" for the 23rd label it creates. Another way to branch to a local label is a numeric label such as "1:", which was used in earlier examples and may be defined many times: "B.EQ 1b" jumps back to the nearest preceding "1:", while "B.EQ 1f" jumps forward to the next one. Such labels are best kept close to their branches, in tiny loops, otherwise the code becomes hard to follow. The instruction LD1 {V0.4S, V1.4S, V2.4S, V3.4S}, [X1], #64 loads 64 bytes in total (16 float variables in this example); because of the post-indexed addressing, X1 is increased by 64 after the instruction executes. LD1 loads single-element structures from memory and is one of the Advanced SIMD (NEON) instructions.
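
As a small sketch of numeric labels (not part of Listing 1; it assumes X4 holds a pointer and X5 a non-zero count), a tiny loop can be written like this:
    MOV     X6, #0                 // value to store
1:
    STR     X6, [X4], #8           // store zero and post-increment the pointer
    SUBS    X5, X5, #1             // count down
    B.NE    1b                     // branch back to the nearest preceding "1:" label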

The curly brackets {V0.4S, V1.4S, V2.4S, V3.4S} name the NEON registers (V0..V3) that will be loaded, each with four single-precision floating-point numbers (4 bytes, 32 bits, each). In one respect these registers behave like general-purpose registers: writing a scalar value into one of them, for example a single float through S0, zeroes the remaining, more significant bits of the 128-bit register. To work with all the data a register can hold, it must be accessed as a vector through its Vx name and an arrangement specifier, so that it is treated as several independent values rather than a single one. Operations on vector registers are a bit complex; to make them easier to understand, the instructions can be divided into several groups: one group loads and stores vector register values, another performs arithmetic operations on vector registers.
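
A short sketch of the difference between the scalar and the vector view of the same register (it assumes X1 points to valid floats and S1/V1 already hold data):
    LDR     S0, [X1]               // scalar load: bits 32..127 of V0 become zero
    FADD    S0, S0, S1             // scalar add, one float only
    LD1     {V0.4S}, [X1]          // vector load: fills all four 32-bit lanes of V0
    FADD    V0.4S, V0.4S, V1.4S    // vector add, four floats at once, lane by lane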

NEON (SIMD – Single Instruction Multiple Data)

NEON is a powerful extension of the ARM architecture. It performs a single instruction on multiple variables, or data elements, at once. In the previous example, four floating-point values were loaded into each vector register, such as V0. One NEON register can hold up to 16 bytes (128 bits). The floating-point values are loaded from the memory address held in a general-purpose register, such as X1, and with post-indexed addressing that register is then incremented by 64 or by another value.
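
The post-index amount can be given as an immediate equal to the number of bytes transferred, or as a general-purpose register when a different stride is needed; a short sketch, assuming X2 holds the desired step in bytes:
    LD1     {V0.4S}, [X1], #16     // load 16 bytes, then X1 = X1 + 16
    LD1     {V0.4S}, [X1], X2      // load 16 bytes, then X1 = X1 + X2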

In Table 1, F0 is the float located at the address held in X1, F1 the next one in memory, and so on. All four vector registers were filled with one single LD1 instruction.

Table 1: Variable mapping (floats in memory order on the left, resulting register contents on the right)
F0  F1  F2  F3      V0.4S = {F0, F1, F2, F3}
F4  F5  F6  F7      V1.4S = {F4, F5, F6, F7}
F8  F9  F10 F11     V2.4S = {F8, F9, F10, F11}
F12 F13 F14 F15     V3.4S = {F12, F13, F14, F15}

In AArch64, NEON and floating-point operations share the same set of 32 registers, V0..V31, each 128 bits wide. Note that in the instruction LD1 {V0.4S, V1.4S, V2.4S, V3.4S} the suffix "4S" specifies the arrangement: four single-precision (32-bit) values per register. The individual values inside a vector register are called elements (or lanes) in the NEON architecture documentation.
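
A short sketch of the possible arrangements of one 128-bit register (it only illustrates the syntax and assumes V0 and V1 already hold data):
    ADD     V0.16B, V0.16B, V1.16B   // sixteen 8-bit integer elements
    ADD     V0.8H,  V0.8H,  V1.8H    // eight 16-bit integer elements
    ADD     V0.4S,  V0.4S,  V1.4S    // four 32-bit integer elements
    ADD     V0.2D,  V0.2D,  V1.2D    // two 64-bit integer elements
    FADD    V0.4S,  V0.4S,  V1.4S    // four single-precision floating-point elements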
