| en:multiasm:paarm:chapter_5_10 [2025/12/04 15:38] – eriks.klavins | en:multiasm:paarm:chapter_5_10 [2025/12/04 21:55] (current) – [NEON (SIMD – Single Instruction Multiple Data)] eriks.klavins |
|---|
| ====== Advanced Assembly Programming ====== | ====== Advanced Assembly Programming ====== |
| |
| ARM processors are more capable than just toggling a GPIO pin or sending data through communication interfaces. Data processing requires more computational power than can be imagined, and it is not simply about ''<fc #800000>ADD</fc>'' or ''<fc #800000>SUB</fc>'' instructions. Performing multiple operations on multiple data sets requires knowledge of hardware possibilities and experience in such programming. The following small examples may repeat across different tasks: setting up counters or pointers to variables, browsing memory with post-indexed loads and stores, performing vector operations, returning the result of a function, and so on. There exist some basic rules: arguments to the function are passed through registers ''<fc #008000>X0</fc>''..''<fc #008000>X7</fc>'', and the return value is in register ''<fc #008000>X0</fc>''. If used, the stack must be 16-byte aligned each time, and ''<fc #800000>LDP</fc>''/'<fc #800000>STP</fc>'' instructions are the best choice for working with the stack when entering or exiting a function. An essential part of the code is tail handling: when the algorithm is designed to operate in parallel with 4 bytes and the user provides, for example, 15 bytes; in parallel, the operation can be executed on 4 bytes, parallel operations on the remaining 3 bytes cannot be performed, which may lead to an error. | ARM processors are more capable than just toggling a GPIO pin or sending data through communication interfaces. Data processing requires more computational power than can be imagined, and it is not simply about ''<fc #800000>ADD</fc>'' or ''<fc #800000>SUB</fc>'' instructions. Performing multiple operations on multiple data sets requires knowledge of hardware possibilities and experience in such programming. 
The small examples that follow show patterns that repeat across different tasks: setting up counters or pointers to variables, browsing memory with post-indexed loads and stores, performing vector operations, returning the result of a function, and so on. Some basic rules apply: arguments to the function are passed through registers ''<fc #008000>X0</fc>''..''<fc #008000>X7</fc>'', and the return value is in register ''<fc #008000>X0</fc>''. If the stack is used, it must always be kept 16-byte aligned, and the ''<fc #800000>LDP</fc>''/''<fc #800000>STP</fc>'' instructions are the best choice for working with the stack when entering or exiting a function. An essential part of the code is tail handling: when the algorithm is designed to process 4 bytes in parallel and the user provides, for example, 15 bytes, three full groups of 4 bytes can be processed in parallel, but the remaining 3 bytes cannot. They must be handled separately; otherwise, the result may be incorrect or the code may access memory beyond the end of the data. |
| |
| First, let's take a look at the complete example code. The following example will demonstrate a function which adds two float arrays: ''result[i]=a[i] b [i]'', where a and b are passed as arguments, as well as the size of the array. | First, let's take a look at the complete example code. The following example demonstrates a function that adds two float arrays: ''result[i] = a[i] + b[i]'', where ''a'' and ''b'', as well as the size of the array, are passed as arguments. |
| <codeblock code_label> | <codeblock code_label> |
| <caption>Multiply multiple elements in parallel</caption> | <caption>Multiply multiple elements in parallel</caption> |
| |
| This example contains new instructions and a bit different type of labels. Note that the dot before the label makes it act like a local label. Like “.loop” label can be used in many places in the code. The compiler translates this label to something unique, like “loop23,” if it's the 23rd (maybe) loop the compiler has found in the code. To be sure, the compiler code must be investigated. There is another way to branch to a local label: a numeric label like “1:” which was used in earlier examples. In the code, it is possible to identify a jump back to B.EQ 1b to the label “1:” or a jump forward to B.EQ 1f to the next local label “1:”. Such branches must be very short; it's best to use them in tiny loops. | This example contains new instructions and a slightly different type of label. Note that the dot before the label makes it act like a local label. A label like “.loop” can therefore be used in many places in the code. The compiler translates this label to something unique, like “loop23”, if it is, say, the 23rd loop the compiler has found in the code. To be sure, the compiler code must be investigated. There is another way to branch to a local label: a numeric label like “1:”, which was used in earlier examples. In the code, it is possible to identify a backward jump, ''B.EQ 1b'', to the previous label “1:”, or a forward jump, ''B.EQ 1f'', to the next local label “1:”. Such branches must be very short; it is best to use them in tiny loops. |
| the instruction ''LD1 {V0.4S, V1.4S, V2.4S, <fc #008000>V3</fc>.4S}, [<fc #008000>X1</fc>], <fc #ffa500>#64</fc>'' | The instruction ''<fc #800000>LD1</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>, <fc #008000>V3</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>], <fc #ffa500>#64</fc>'' loads 64 bytes in total (16 float variables in this example). After the instruction executes, the register ''<fc #008000>X1</fc>'' is increased by 64. The ''<fc #800000>LD1</fc>'' instruction loads element structures from memory. It is one of the Advanced SIMD (NEON) instructions. |
| | |
| | The curly brackets ''{<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>, <fc #008000>V3</fc>.<fc #808000>4S</fc>}'' list the NEON registers (''<fc #008000>V0</fc>''..''<fc #008000>V3</fc>''), each of which is loaded with four single-precision floating-point numbers (each 4 bytes, i.e. 32 bits). These registers operate like general-purpose registers; writing one byte to one of those registers will zero out the rest of the most significant bits. To preserve all the data in these registers, a different access method must be used: treat them as vectors using the Vx register. This means the register is treated as though it contains multiple independent values rather than a single value. Operations with vector registers are somewhat complex; to make them easier to understand, let's divide the instructions into groups. One group loads and stores vector register values, and another performs arithmetic operations on vector registers. |
| | |
| | |
| | ===== NEON (SIMD – Single Instruction Multiple Data) ===== |
| | |
| | NEON is a powerful extension to ARM processors. It can apply a single instruction to multiple variables or data items. In the previous example, several floating-point variables were loaded into a single vector register, ''<fc #008000>V0</fc>''. One NEON register can hold up to 16 bytes (128 bits). These floating-point values are loaded from the memory pointed to by a general-purpose register, such as ''[<fc #008000>X1</fc>]'', after which that register may be incremented by 64 or any other value. |
| | |
| | Float variable F0 is the one pointed to by the ''[<fc #008000>X1</fc>]'' register. Four vectors were loaded with one single “''<fc #800000>LD1</fc>''” instruction. |
| | <table tab_label> |
| | <caption>Variable mapping</caption> |
| | | F0 | F1 | F2 | F3 | ''V0.4S = {F0, F1, F2, F3}'' | |
| | | F4 | F5 | F6 | F7 | ''V1.4S = {F4, F5, F6, F7}'' | |
| | | F8 | F9 | F10 | F11 | ''V2.4S = {F8, F9, F10, F11}'' | |
| | | F12 | F13 | F14 | F15 | ''V3.4S = {F12, F13, F14, F15}'' | |
| | </table> |
| | |
| | In AArch64, NEON and floating-point operations share the same set of 32 registers, ''<fc #008000>V0</fc>''..''<fc #008000>V31</fc>'', each of which is 128 bits wide. Note that in the instruction ''<fc #800000>LD1</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>, <fc #008000>V3</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>], <fc #ffa500>#64</fc>'' the suffix “<fc #808000>S</fc>” identifies the type of the variables, which are called elements in the NEON architecture documentation. |
| | * <fc #008000>V0</fc>.<fc #808000>2D</fc> can hold two elements: two double-precision floating-point variables or two 64-bit integers, where the letter d indicates doubleword |
| | * <fc #008000>V3</fc>.<fc #808000>4S</fc> can hold four elements: four single-precision floating-point variables or four 32-bit integers, where the letter s indicates word |
| | * <fc #008000>V16</fc>.<fc #808000>8H</fc> can hold eight elements: eight 16-bit integers, where the letter h indicates a halfword |
| | * <fc #008000>V31</fc>.<fc #808000>16B</fc> can hold 16 elements: sixteen 8-bit integers, where the letter b indicates byte |
| | |
| | {{:en:multiasm:paarm:2025-12-04_17_56_45-.png|}} |
| | |
| | NEON allows the CPU to perform operations on multiple data items with a single instruction. Standard scalar instructions handle only one scalar value per register, but NEON registers are designed to hold multiple values. This is how Single Instruction, Multiple Data (SIMD) operations work: one instruction operates on several elements in parallel. This technique is widely used for image, signal, and multimedia processing, as well as for accelerating math-heavy tasks such as encryption, machine learning and other artificial intelligence (AI) tasks. The processor used in the Raspberry Pi 5 supports the NEON and floating-point (FP) instruction sets. |
| | |
| | It is important to specify both the register and how to interpret it when working with NEON instructions. Elements in a vector are ordered starting from the least significant bits; that is, element 0 occupies the least significant bits of the register.\\ |
| | ''<fc #800000>FADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc> <fc #6495ed>@ Add 4 single-precision floats from v1 and v2, store in v0.</fc>''\\ |
| | ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc> <fc #6495ed>@ Add 8 halfwords (16-bit integers) elementwise.</fc>''\\ |
| | ''<fc #800000>MUL</fc> <fc #008000>V0</fc>.<fc #808000>16B</fc>, <fc #008000>V1</fc>.<fc #808000>16B</fc>, <fc #008000>V2</fc>.<fc #808000>16B</fc> <fc #6495ed>@ Multiply 16 bytes (8-bit integers) elementwise.</fc>'' |
| | |
| | Elementwise operations are performed by lane, where each element occupies its own lane. Take, for example, ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>''. The instruction specifies that there are eight distinct 16-bit variables. It operates on integers, and the result is computed from eight pairs of 16-bit integers. The picture below illustrates this instruction: it shows which element of vector ''<fc #008000>V1</fc>'' is added to which element of vector ''<fc #008000>V2</fc>'' and where the result is placed in vector ''<fc #008000>V0</fc>''. |
| | |
| | {{:en:multiasm:paarm:2025-12-04_18_45_46-pictures_for_the_book.pptx_-_powerpoint.jpg|}} |
| | |
| | In the image, it is also visible how the elements are added together in separate lanes (eight lanes). The lanes used for operations depend on the vector register elements used in the operation. Another example with only four lanes involved would be from this instruction: ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>'' |
| | |
| | {{:en:multiasm:paarm:neon_simd2.jpg|}} |
| | |
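| | NEON can also perform operations that narrow the elements of the destination register. Note that the plain ''<fc #800000>ADD</fc>'' instruction requires the same arrangement for all three operands, so the form below is schematic; the A64 instruction set provides dedicated narrowing instructions for this, such as ''<fc #800000>ADDHN</fc>'', which keeps the upper half of each sum.\\ |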
| | ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>4H</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>'' |
| | |
| | {{:en:multiasm:paarm:neon_simd_stoh.jpg|}} |
| | |
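| | Another example widens the elements of the destination register. The form below is schematic, because the plain ''<fc #800000>ADD</fc>'' instruction requires the same arrangement for all three operands; in the A64 instruction set, widening addition is performed by dedicated instructions such as ''<fc #800000>UADDL</fc>'' and ''<fc #800000>SADDL</fc>'' (unsigned and signed add long):\\ |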
| | ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4H</fc>, <fc #008000>V2</fc>.<fc #808000>4H</fc>'' |
| | {{:en:multiasm:paarm:neon_simd_htos.jpg|}} |
| | |
| | The ''<fc #800000>ADD</fc>'' instruction is designed to operate on integers, and the result is computed from eight 16-bit integers, four 32-bit integers and so on, depending on the arrangement specified. SIMD instructions also allow taking a single element from a vector and using it as a scalar value. In this way, for example, a vector can be multiplied by a scalar value:\\ |
| | ''<fc #800000>FMLA</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>[<fc #ffa500>1</fc>] <fc #6495ed>@ v0 += v1 * (scalar from v2 element nr1)</fc>'' |
| | |
| | {{:en:multiasm:paarm:neon_fmla.jpg|}} |
| | |
| | |
| | Data preparation is also an essential part of working with NEON. Note that in the previous code example, the values were loaded into the vector registers with the LD1 instruction. NEON has special LDn and STn instructions, where n is 1 to 4 (e.g., LD1, ST4, ST2, LD3). Each of the load (or store) variants behaves differently and must be chosen according to the layout of the data. Only the load instructions are explained here, because the store instructions work similarly but store data from registers into memory. |
| | |
| | The LD1 instruction: |
| | * Loads a single-element structure to a vector register |
| | * Loads multiple single-element structures into one to four vector registers |
| | * With postfix ‘R’, like LD1R, loads a single-element structure into all lanes of the vector register |
| | |
| | To load a single value into the register, the instruction must specify the exact element (lane) into which the value will be loaded, as in this example: ''<fc #800000>LD1</fc> {<fc #008000>V0</fc>.<fc #808000>S</fc>}[<fc #ffa500>3</fc>], [<fc #008000>X1</fc>]''. The ''<fc #008000>V0</fc>'' register here is divided into four lanes, each capable of holding either a single-precision floating-point number or a 32-bit integer. The ‘''[<fc #ffa500>3</fc>]''’ after the curly brackets selects the exact lane (element) into which the data are loaded; in this case, element number 3 is chosen. |
| | |
| | The instruction ''<fc #800000>LD1R</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' loads one 32-bit value (integer or single-precision floating-point) from memory and replicates it into each of the four vector lanes. All vector elements will contain the same value. |
| | |
| | The instruction ''<fc #800000>LD1</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' loads eight 32-bit values, filling first the lanes of ''<fc #008000>V0</fc>'' and then those of ''<fc #008000>V1</fc>''. All the data are loaded sequentially in the form x0, x1, x2, x3 and so on. Note that x here stands for any chosen data type and that the data themselves are in sequential order. The ''<fc #800000>LD2</fc>'', ''<fc #800000>LD3</fc>'' and ''<fc #800000>LD4</fc>'' instructions perform similar operations, but they treat the data in memory as interleaved and separate (de-interleave) them into different vector registers. The ''<fc #800000>LD2</fc>'' instruction loads structures of two elements into two vector registers. Two vector registers must always be specified. The structures in memory may be ordered as x0, y0, x1, y1, x2, y2, …; using the LD2 instruction, the x parts of the structures are loaded into the first specified register and the y parts into the second. This instruction is used, for example, to process audio data, where x may be the left audio channel and y the right:\\ |
| | ''<fc #800000>LD2 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' |
| | |
| | {{:en:multiasm:paarm:neon_ld2.jpg|}} |
| | |
| | The ''<fc #800000>LD3</fc>'' instruction takes memory data as a three-element structure, such as an image's RGB data. For this instruction, three vector registers must be identified. Assuming the data in memory: r0, g0, b0, r1, g1, b1, r2, g2, b2,…, the r (red) channel will be loaded in the first identified register, the ‘g’ in the second and ‘b’ in the third, the last identified register.\\ |
| | ''<fc #800000>LD3 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' |
| | |
| | {{:en:multiasm:paarm:neon_ld3.jpg|}} |
| | |
| | |
| | The ''<fc #800000>LD4</fc>'' instruction is also used to load interleaved data into four vector registers.\\ |
| | ''<fc #800000>LD4 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>, <fc #008000>V3</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' |
| | |
| | {{:en:multiasm:paarm:neon_ld4.jpg|}} |
| | |
| | The ''<fc #800000>LD4R</fc>'' instruction takes four values from memory and fills all lanes of the first vector register with the first value, all lanes of the second with the second value, and so on. Such instructions can be used to perform arithmetic operations on the same data with several different scalar values. The ''<fc #800000>LD3R</fc>'' and ''<fc #800000>LD2R</fc>'' instructions work similarly. Note that in ''<fc #800000>//LDn//</fc>'', the ‘n’ identifies the number of vector registers loaded with the structure elements. Remember that the number of elements in a structure equals the number of vector registers used by the instruction. |
| |
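The tail handling mentioned at the start of this chapter can be illustrated with a minimal sketch. It is written in portable C rather than in assembly so that it can be compiled and checked on any machine; it is only a scalar model of the idea, not the chapter's NEON code. The function name ''add_arrays'' and its parameters are hypothetical and chosen for illustration. The main loop processes full groups of four elements, mirroring the four lanes of a ''.4S'' vector register, and a separate tail loop processes whatever is left.

```c
#include <stddef.h>

/* Hypothetical helper (not taken from the chapter): result[i] = a[i] + b[i].
 * Scalar model of a 4-lane SIMD loop followed by a tail loop. */
void add_arrays(float *result, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Main loop: entered only while a full group of 4 elements remains.
     * A NEON version could handle each group with one FADD on .4S operands. */
    for (; i + 4 <= n; i += 4) {
        result[i + 0] = a[i + 0] + b[i + 0];  /* lane 0 */
        result[i + 1] = a[i + 1] + b[i + 1];  /* lane 1 */
        result[i + 2] = a[i + 2] + b[i + 2];  /* lane 2 */
        result[i + 3] = a[i + 3] + b[i + 3];  /* lane 3 */
    }

    /* Tail loop: 0 to 3 leftover elements, processed one at a time,
     * so nothing is read or written past element n - 1. */
    for (; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}
```

With 15 elements, the main loop runs three times (12 elements) and the tail loop handles the remaining 3. Without the tail loop, those 3 elements would either be skipped or a fourth full group would access memory beyond the end of the arrays.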