Advanced Assembly Programming

ARM processors are capable of much more than toggling a GPIO pin or sending data through communication interfaces. Serious data processing is not simply a matter of ADD or SUB instructions: performing multiple operations on multiple data sets requires knowledge of the hardware's capabilities and experience with this style of programming. The following small patterns recur across different tasks: setting up counters or pointers to variables, walking through memory with post-indexed loads and stores, performing vector operations, returning the result of a function, and so on. A few basic rules apply: arguments are passed to a function in registers X0..X7, and the return value is placed in X0. If the stack is used, it must stay 16-byte aligned at all times, and the LDP/STP instructions are the best choice for saving and restoring registers when entering or exiting a function. An essential part of such code is tail handling: when an algorithm processes 4 bytes in parallel and the caller provides, for example, 15 bytes, the parallel operation covers only the first 12; the remaining 3 bytes must be handled separately, or the result will be wrong.
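As a minimal sketch of these calling rules, a function that calls another function can save and restore its frame with a single STP/LDP pair while keeping the stack 16-byte aligned (the names my_func and helper_func are hypothetical):

.global my_func
my_func:
    STP     X29, X30, [SP, #-16]!   @ save frame pointer and link register; SP stays 16-byte aligned
    MOV     X29, SP                 @ establish the new frame pointer
    BL      helper_func             @ arguments travel in X0..X7; the result comes back in X0
    LDP     X29, X30, [SP], #16     @ restore FP and LR with one instruction
    RET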

First, let's take a look at the complete example code. The following example demonstrates a function which adds two float arrays: result[i] = a[i] + b[i], where a and b are passed as arguments, as well as the size of the array.

Listing 1: Add multiple elements in parallel
@ X0=result, X1=a, X2=b, X3=n (the number of floats)
.global vadd_f32
vadd_f32:
    @ Convert element count to bytes that will be processed
    @ 16 floats per loop → 64 bytes per array per loop
    @ Build loop trip count = n / 16
    MOV     X9, #16
    UDIV    X8, X3, X9            @ X8 = blocks of 16
    CBZ     X8, .tail

.loop:
    @ Load 16 floats from a and b (4x vectors)
    LD1     {V0.4S, V1.4S, V2.4S, V3.4S}, [X1], #64   @ a
    LD1     {V4.4S, V5.4S, V6.4S, V7.4S}, [X2], #64   @ b
    @ Add vectors
    FADD    V0.4S, V0.4S, V4.4S
    FADD    V1.4S, V1.4S, V5.4S
    FADD    V2.4S, V2.4S, V6.4S
    FADD    V3.4S, V3.4S, V7.4S
    ST1     {V0.4S, V1.4S, V2.4S, V3.4S}, [X0], #64     @ Store the result
    SUBS    X8, X8, #1
    B.NE    .loop

.tail:
    @ Handle remaining n % 16 scalars
    AND     X9, X3, #15            @ X9 = n % 16 (X8 has counted down to 0 here, so it cannot be reused)
    CBZ     X9, .done

.tail_loop:
    LDR     S0, [X1], #4
    LDR     S1, [X2], #4
    FADD    S0, S0, S1
    STR     S0, [X0], #4
    SUBS    X9, X9, #1
    B.NE    .tail_loop

.done:
    RET

This example contains new instructions and a slightly different style of labels. A label whose name begins with a dot (by GNU convention, especially the .L prefix) is a local symbol: the assembler keeps it out of the object file's symbol table, so it does not clash with or become visible to other modules. Note that a named label such as .loop can still only be defined once per file. For reusable local labels there is another mechanism: a numeric label such as “1:”, which was used in earlier examples, may be defined many times. A branch written as B.EQ 1b jumps back to the nearest “1:” above it, and B.EQ 1f jumps forward to the nearest “1:” below it. These are best used in tiny loops, where their meaning stays obvious. The instruction LD1 {V0.4S, V1.4S, V2.4S, V3.4S}, [X1], #64 loads 64 bytes in total (16 float variables in this example); after the instruction executes, register X1 is increased by 64. LD1 loads element structures from memory and is one of the Advanced SIMD (NEON) instructions.
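A minimal sketch of a numeric local label in use (the counter value is arbitrary):

    MOV     X4, #8          @ hypothetical loop counter
1:
    SUBS    X4, X4, #1      @ decrement and update the flags
    B.NE    1b              @ branch back to the nearest “1:” above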

The curly brackets {V0.4S, V1.4S, V2.4S, V3.4S} indicate which NEON registers (V0..V3) are loaded, each with four single-precision floating-point numbers (4 bytes – 32 bits each). When accessed through their scalar names (B0/H0/S0/D0), these registers behave like scalar registers: writing a narrow value zeroes out the remaining most significant bits. To use all the data held in such a register, a different access method is needed: treat it as a vector through the Vx name, meaning it contains multiple independent values rather than a single one. Operations with vector registers are somewhat complex; to make them easier to understand, the instructions can be divided into several groups: one group loads and stores vector register values, and another performs arithmetic operations on them.

NEON (SIMD – Single Instruction Multiple Data)

NEON is a powerful extension to ARM processors, capable of executing a single instruction on multiple data elements at once. In the previous example, several floating-point variables were loaded into a single vector register, V0. One NEON register can hold up to 16 bytes (128 bits). The floating-point values are loaded from the memory pointed to by a general-purpose register, such as X1, after which that register may be incremented by 64 or any other value.

In the table below, F0 denotes the float variable pointed to by X1. Four vectors were loaded with one single LD1 instruction.

Table 1: Variable mapping
F0 F1 F2 F3 V0.4S = {F0, F1, F2, F3}
F4 F5 F6 F7 V1.4S = {F4, F5, F6, F7}
F8 F9 F10 F11 V2.4S = {F8, F9, F10, F11}
F12 F13 F14 F15 V3.4S = {F12, F13, F14, F15}

In AArch64, NEON and floating-point operations share the same set of 32 registers: V0..V31, where each is 128 bits wide. Note that in instruction LD1 {V0.4S, V1.4S, V2.4S, V3.4S}, [X1], #64 the postfix “S” identifies the type of variable. These are named as elements in the NEON architecture documentation.

  • V0.2D can hold two elements: two double-precision floating-point variables or two 64-bit integers, where the letter d indicates doubleword
  • V3.4S can hold four elements: four single-precision floating-point variables or four 32-bit integers, where the letter s indicates word
  • V16.8H can hold eight elements: eight 16-bit integers, where the letter h indicates a halfword
  • V31.16B can hold 16 elements: sixteen 8-bit integers, where the letter b indicates byte

NEON allows the CPU to perform operations on multiple data items with a single instruction. Standard scalar instructions handle only one value in one register, but NEON registers are specially designed to hold multiple values. This is how Single Instruction, Multiple Data (SIMD) operations work: one instruction operating on several elements in parallel. The technique is widely used for image, signal, and multimedia processing, as well as for accelerating math-heavy tasks such as machine learning, encryption, and other compute-intensive workloads. The processor used in the Raspberry Pi 5 supports the NEON and floating-point (FP) instruction sets.

It is important to specify both the register and how to interpret it when working with NEON instructions. Elements in a vector are ordered from the least significant bit upward: element 0 occupies the least significant bits of the register.
FADD V0.4S, V1.4S, V2.4S @ Add 4 single-precision floats from v1 and v2, store in v0.
ADD V0.8H, V1.8H, V2.8H @ Add 8 halfwords (16-bit integers) elementwise.
MUL V0.16B, V1.16B, V2.16B @ Multiply 16 byte-sized elements elementwise.

Elementwise operations are performed by lane, where each element occupies its own lane, as in ADD V0.8H, V1.8H, V2.8H. This instruction specifies eight distinct 16-bit variables; it operates on integers, and the result is computed lane by lane from the eight 16-bit integers. The picture below illustrates the operation of ADD V0.8H, V1.8H, V2.8H: it shows which element of vector V1 is added to which element of V2 and how the result appears in vector V0.

In the image, it is also visible how the elements are added together in separate lanes (eight lanes). The number of lanes depends on the element arrangement named in the operation. Another example, with only four lanes involved, would be this instruction: ADD V0.4S, V1.4S, V2.4S

NEON can also perform operations that narrow the destination register. The plain ADD mnemonic does not accept mixed element sizes, so narrowing is a separate instruction:
ADDHN V0.4H, V1.4S, V2.4S @ Add, then keep the high half of each 32-bit result as a 16-bit element.

And another example, widening the destination register with the signed widening add:
SADDL V0.4S, V1.4H, V2.4H @ Sign-extend the 16-bit elements and add into 32-bit lanes.

Narrowing and widening forms let one instruction work with mixed element widths, producing, for example, four 16-bit results from 32-bit sources or four 32-bit results from 16-bit sources. SIMD instructions also allow a single element of a vector to be used as a scalar operand, so that a whole vector is, for example, multiplied by one scalar value:
FMLA V0.4S, V1.4S, V2.S[1] @ v0 += v1 * (scalar from v2 element nr. 1)
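When the same lane is needed as a scalar in several operations, it can also be broadcast into a full vector first with DUP; a small sketch:

    DUP     V3.4S, V2.S[1]          @ copy lane 1 of V2 into all four lanes of V3
    FMLA    V0.4S, V1.4S, V3.4S     @ same effect as multiplying by the scalar from V2 lane 1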

Data preparation for NEON is also an essential part. Note that in the previous code example, the values were loaded into the vector registers with the LD1 instruction. NEON has special LDn and STn instructions, where n is 1 to 4 (e.g., LD1, ST4, ST2, LD3). Each load (or store) variant behaves differently and must be chosen according to the data layout. Below, the load instructions are explained; the store instructions work similarly but move data from a register into memory.

The LD1 instruction:

  • Loads a single-element structure to a vector register
  • Loads multiple single-element structures into a vector register
  • With postfix ‘R’, like LD1R, loads a single-element structure into all lanes of the vector register

To load a single value into one lane of a register, the instruction must name the exact element into which the value is loaded, as in this example: LD1 {V0.S}[3], [X1]. The V0 register here is divided into four lanes, each capable of holding either a single-precision floating-point number or a 32-bit integer. The ‘[3]’ after the curly brackets selects the exact lane (element) into which the data is loaded; in this case, element number 3 is chosen.

The instruction LD1R {V0.4S}, [X1] loads one 32-bit value (integer or single-precision floating-point) into each of the four vector lanes, so all vector elements contain the same value.

The instruction LD1 {V0.4S, V1.4S}, [X1] loads eight 32-bit values into the lanes of V0 first and then V1. The data are loaded sequentially, in the form x0, x1, x2, x3 and so on; x here stands for any chosen data type, and the data itself lie in sequential order. The LD2, LD3 and LD4 instructions behave differently: the data loaded into the vector registers are de-interleaved. The LD2 instruction loads two-element structures into two vector registers; exactly two vector registers must always be named. The structure in memory may be in the order x0, y0, x1, y1, x2, y2, …, and with LD2, the x parts of the structure are loaded into the first named register and the y parts into the second. This instruction is used, for example, to process audio data, where x may be the left audio channel and y the right:
LD2 {V0.4S, V1.4S}, [X1]
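Once the channels are de-interleaved, each can be processed on its own and written back interleaved with the matching ST2 instruction; a sketch assuming X1 points to four stereo sample pairs:

    LD2     {V0.4S, V1.4S}, [X1]    @ V0 = left samples, V1 = right samples
    FADD    V0.4S, V0.4S, V0.4S     @ for example, double the left channel only
    ST2     {V0.4S, V1.4S}, [X1]    @ re-interleave the channels and store them back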

The LD3 instruction treats memory data as a three-element structure, such as an image's RGB data. For this instruction, three vector registers must be named. Assuming the data in memory are r0, g0, b0, r1, g1, b1, r2, g2, b2, …, the r (red) values are loaded into the first named register, g into the second, and b into the third, last named register.
LD3 {V0.4S, V1.4S, V2.4S}, [X1]
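As a sketch of per-channel processing, the green channel of 16 pixels could be cleared and the pixels stored back interleaved with ST3 (the one-byte-per-channel layout is an assumption):

    LD3     {V0.16B, V1.16B, V2.16B}, [X1]  @ V0 = red, V1 = green, V2 = blue for 16 pixels
    MOVI    V1.16B, #0                      @ clear the green channel
    ST3     {V0.16B, V1.16B, V2.16B}, [X1]  @ write the pixels back interleaved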

The LD4 instruction is also used to load interleaved data into four vector registers.
LD4 {V0.4S, V1.4S, V2.4S, V3.4S}, [X1]

The LD4R instruction takes four values from memory and fills every lane of the first vector with the first value, every lane of the second with the second, and so on. Such instructions are useful for applying several different scalar coefficients to the same data. LD3R and LD2R work the same way with three and two values. Note that in LDn, the ‘n’ identifies the number of vector registers loaded, which always equals the number of elements in each structure.
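A small sketch of LD4R in use, assuming X1 points to four single-precision coefficients c0..c3 stored one after another:

    LD4R    {V4.4S, V5.4S, V6.4S, V7.4S}, [X1]  @ V4 = {c0,c0,c0,c0}, V5 = {c1,c1,c1,c1}, ...
    FMUL    V0.4S, V0.4S, V4.4S                 @ scale a data vector by c0 in every lane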

en/multiasm/paarm/chapter_5_10.1764885247.txt.gz · Last modified: 2025/12/04 21:54 by eriks.klavins
CC Attribution-Share Alike 4.0 International