ARM processors are capable of far more than toggling a GPIO pin or sending data through communication interfaces. Real data processing demands serious computational work, and it is not simply a matter of ADD or SUB instructions. Performing the same operation on many data elements requires knowledge of what the hardware offers and some experience with this style of programming. A few small building blocks repeat across different tasks: setting up counters or pointers to variables, walking through memory with post-indexed loads and stores, performing vector operations, returning the result of a function, and so on. There are some basic rules: arguments are passed to a function in registers X0..X7, and the return value comes back in X0. If the stack is used, the stack pointer must be kept 16-byte aligned at all times, and the LDP/STP instructions are the natural choice for saving and restoring registers when entering or leaving a function. An essential part of such code is tail handling: if an algorithm processes 4 bytes in parallel and the caller provides, say, 15 bytes, then 12 bytes can be handled in three parallel steps, but the remaining 3 bytes do not form a full block and must be processed separately, otherwise the code reads or writes past the end of the data.
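Before the full example, a minimal sketch of how these rules look in practice may help. The function name below is made up purely for illustration and is not part of the example that follows: it receives its two integer arguments in X0 and X1, saves X29 (the frame pointer) and X30 (the link register) with a single STP, which keeps SP 16-byte aligned, and returns the sum in X0. Such a trivial leaf function does not strictly need a stack frame, but it shows the usual prologue/epilogue pattern.
<codeblock code_label>
<caption>Function prologue and epilogue with STP/LDP</caption>
<code>
// Hypothetical function, for illustration only:
// X0 and X1 hold the two integer arguments, the sum is returned in X0.
.global add_two
add_two:
STP X29, X30, [SP, #-16]! // save FP and LR; SP stays 16-byte aligned
MOV X29, SP               // establish the frame pointer
ADD X0, X0, X1            // function body: the result goes to X0
LDP X29, X30, [SP], #16   // restore FP and LR, release the frame
RET
</code>
</codeblock>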
First, let's take a look at the complete example code. The following example demonstrates a function that adds two float arrays element by element, result[i] = a[i] + b[i], where a and b are passed as arguments together with the size of the arrays.
<codeblock code_label>
<caption>Add multiple elements in parallel</caption>
<code>
// X0 = result, X1 = a, X2 = b, X3 = n (the number of floats)
.global vadd_f32
vadd_f32:
// 16 floats (64 bytes per array) are processed per loop iteration.
// Build the loop trip count: n / 16.
MOV X9, #16
UDIV X8, X3, X9 // X8 = number of 16-float blocks
CBZ X8, .tail // fewer than 16 floats: skip the vector loop
.loop:
// Load 16 floats from a and from b (4 vectors each)
LD1 {V0.4S, V1.4S, V2.4S, V3.4S}, [X1], #64 // a
LD1 {V4.4S, V5.4S, V6.4S, V7.4S}, [X2], #64 // b
// Add the vectors
FADD V0.4S, V0.4S, V4.4S
FADD V1.4S, V1.4S, V5.4S
FADD V2.4S, V2.4S, V6.4S
FADD V3.4S, V3.4S, V7.4S
ST1 {V0.4S, V1.4S, V2.4S, V3.4S}, [X0], #64 // store the result
SUBS X8, X8, #1
B.NE .loop
.tail:
// Handle the remaining n % 16 scalars.
// 16 is a power of two, so the remainder is just the low four bits of n.
AND X9, X3, #15 // X9 = n % 16
CBZ X9, .done
.tail_loop:
LDR S0, [X1], #4
LDR S1, [X2], #4
FADD S0, S0, S1
STR S0, [X0], #4
SUBS X9, X9, #1
B.NE .tail_loop
.done:
RET
</code>
</codeblock>
This example introduces new instructions and a slightly different style of labels. The leading dot is a common convention for labels that are internal to a routine; in the GNU assembler, names that begin with “.L” (for example “.Lloop”) are treated as local symbols and are not written to the object file's symbol table, which is why compiler-generated assembly is full of unique names such as “.L23”. You can confirm this by inspecting the assembly a compiler emits for a loop. Note, however, that a named label such as “.loop” may still only be defined once per file. The other way to mark a local branch target is a numeric label such as “1:”, which was used in earlier examples and can be defined as many times as needed: “B.EQ 1b” jumps back to the nearest preceding “1:”, and “B.EQ 1f” jumps forward to the next one. Because the reader has to find the matching definition by eye, such branches are best kept very short and used only in tiny loops.
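As a small illustration of numeric labels, the scalar tail loop from the example above could also be written with a reusable label. This sketch is equivalent to the “.tail_loop” part and assumes the same register assignments (X0 = result, X1 = a, X2 = b, X9 = remaining element count, assumed non-zero).
<codeblock code_label>
<caption>The same tail loop written with a numeric label</caption>
<code>
// X0 = result, X1 = a, X2 = b, X9 = number of remaining floats (non-zero)
1:
LDR S0, [X1], #4  // next element of a
LDR S1, [X2], #4  // next element of b
FADD S0, S0, S1
STR S0, [X0], #4  // store one float of the result
SUBS X9, X9, #1
B.NE 1b           // branch back to the nearest preceding "1:"
</code>
</codeblock>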
The instruction ''LD1 {V0.4S, V1.4S, V2.4S, V3.4S}, [X1], #64'' loads four 128-bit vector registers, 16 single-precision floats in total, from the address held in X1, and the post-index form then advances X1 by 64 bytes, so the pointer is already positioned at the next block for the following iteration. ''ST1'' works the same way in the opposite direction: it stores the four registers and advances the destination pointer.
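The same post-indexed addressing works for scalar registers as well. The fragment below is only a sketch of the addressing mode itself; the registers are chosen arbitrarily for illustration.
<codeblock code_label>
<caption>Post-indexed addressing advances the pointer automatically</caption>
<code>
// Assume X1 points at an array of 32-bit values (registers chosen arbitrarily).
LDR W5, [X1], #4 // W5 = value at X1, then X1 = X1 + 4 (post-index)
LDR W6, [X1]     // W6 reads the next element, no extra ADD needed
</code>
</codeblock>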