Differences

--- en:multiasm:paarm:chapter_5_10 [2025/12/04 16:01] – [NEON (SIMD – Single Instruction Multiple Data)] eriks.klavins
+++ en:multiasm:paarm:chapter_5_10 [2025/12/04 21:55] (current) – [NEON (SIMD – Single Instruction Multiple Data)] eriks.klavins
@@ Line 69: / Line 69: @@
 In AArch64, NEON and floating-point operations share the same set of 32 registers: ''<fc #008000>V0</fc>''..''<fc #008000>V31</fc>'', where each is 128 bits wide. Note that in instruction ''<fc #800000>LD1    </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>, <fc #008000>V3</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>], <fc #ffa500>#64</fc>'' the postfix “<fc #808000>S</fc>” identifies the type of variable. These are named as elements in the NEON architecture documentation.
-  * <fc #008000>V0</fc>.<fc #808000>2D</fc> can hold two elements: two double-precision floating-point variables or two 64-bit integers, where d indicates doubleword
+  * <fc #008000>V0</fc>.<fc #808000>2D</fc> can hold two elements: two double-precision floating-point variables or two 64-bit integers, where the letter d indicates doubleword
-  * <fc #008000>V3</fc>.<fc #808000>4S</fc> can hold four elements: four single-precision floating-point variables or four 32-bit integers, where s indicates word
+  * <fc #008000>V3</fc>.<fc #808000>4S</fc> can hold four elements: four single-precision floating-point variables or four 32-bit integers, where the letter s indicates word
-  * <fc #008000>V16</fc>.<fc #808000>8H</fc> can hold eight elements: eight 16-bit integers, where h indicates a halfword
+  * <fc #008000>V16</fc>.<fc #808000>8H</fc> can hold eight elements: eight 16-bit integers, where the letter h indicates a halfword
-  * <fc #008000>V31</fc>.<fc #808000>16B</fc> can hold 16 elements: sixteen 8-bit integers, where b indicates byte
+  * <fc #008000>V31</fc>.<fc #808000>16B</fc> can hold 16 elements: sixteen 8-bit integers, where the letter b indicates byte
+{{:en:multiasm:paarm:2025-12-04_17_56_45-.png|}}
+NEON allows the CPU to perform operations on multiple data items with a single instruction. Standard scalar instructions handle only one value per register, whereas NEON registers are designed to hold multiple values. This is how Single Instruction, Multiple Data (SIMD) operations work: one instruction operates on several elements in parallel. This technique is widely used for image, signal, and multimedia processing, as well as for accelerating math-heavy tasks such as encryption, machine learning and other Artificial Intelligence (AI) tasks. The processor used in the Raspberry Pi 5 supports the NEON and floating-point (FP) instruction sets.
+It is important to specify both the register and how to interpret its contents when working with NEON instructions. Elements in a vector are ordered from the least significant bit; that is, element 0 occupies the least significant bits of the register.\\
+''<fc #800000>FADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc> <fc #6495ed>@ Add 4 single-precision floats from v1 and v2, store in v0.</fc>''\\
+''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc> <fc #6495ed>@ Add 8 halfwords (16-bit integers) elementwise.</fc>''\\
+''<fc #800000>MUL</fc> <fc #008000>V0</fc>.<fc #808000>16B</fc>, <fc #008000>V1</fc>.<fc #808000>16B</fc>, <fc #008000>V2</fc>.<fc #808000>16B</fc> <fc #6495ed>@ Multiply 16 bytes (8-bit integers) elementwise.</fc>''
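The element ordering can be modelled in a few lines of Python. This is only a sketch of the register layout, not NEON code: the helper name `split_lanes` is hypothetical, and the 128-bit register is represented as a plain integer.

```python
def split_lanes(reg128, lane_bits):
    """Split a 128-bit register value into lanes; element 0 is the least significant."""
    mask = (1 << lane_bits) - 1
    return [(reg128 >> (i * lane_bits)) & mask for i in range(128 // lane_bits)]

# Register image whose bytes are 0x00 (lowest) .. 0x0F (highest)
reg = int.from_bytes(bytes(range(16)), "little")

print([hex(x) for x in split_lanes(reg, 32)])  # viewed as four S lanes
print([hex(x) for x in split_lanes(reg, 16)])  # viewed as eight H lanes
print(split_lanes(reg, 8))                     # viewed as sixteen B lanes
```

The same 128 bits yield four, eight or sixteen elements depending only on the arrangement specifier (4S, 8H, 16B) chosen in the instruction.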
+Elementwise operations are performed by lane, where each element occupies its own lane. Take, for example, ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>''. The instruction specifies eight distinct 16-bit elements per register. The ''<fc #800000>ADD</fc>'' instruction operates on integers, so the result is computed from eight pairs of 16-bit integers. The picture below shows which element of vector ''<fc #008000>V1</fc>'' is added to which element of vector ''<fc #008000>V2</fc>'' and where the result is placed in vector ''<fc #008000>V0</fc>''.
+The image illustrates the operation of the instruction ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>''.
+{{:en:multiasm:paarm:2025-12-04_18_45_46-pictures_for_the_book.pptx_-_powerpoint.jpg|}}
+The image also shows how the elements are added together in separate lanes (eight lanes). The number of lanes depends on the element arrangement of the vector registers used in the operation. An example with only four lanes is the instruction ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>''
+{{:en:multiasm:paarm:neon_simd2.jpg|}}
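Lane-wise integer addition can be sketched as follows. This is a scalar Python model under the assumption that each register is represented as a list of unsigned lane values; `add_lanes` is a hypothetical helper, not an existing API.

```python
def add_lanes(a, b, lane_bits):
    """Lane-wise integer add: each lane wraps on its own, no carry crosses into the next lane."""
    mask = (1 << lane_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]

# Eight 16-bit lanes, as in ADD V0.8H, V1.8H, V2.8H
v1 = [1, 2, 3, 4, 5, 6, 7, 0xFFFF]
v2 = [10, 20, 30, 40, 50, 60, 70, 1]
v0_8h = add_lanes(v1, v2, 16)
print(v0_8h)  # the last lane wraps around to 0; the other lanes are unaffected

# Four 32-bit lanes, as in ADD V0.4S, V1.4S, V2.4S
v0_4s = add_lanes([1, 2, 3, 0xFFFF], [10, 20, 30, 1], 32)
print(v0_4s)  # 0xFFFF + 1 fits in a 32-bit lane, so nothing wraps here
```

The only difference between the two calls is the lane width, which mirrors how the arrangement specifier alone decides how many independent additions one instruction performs.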
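+NEON can also perform operations that narrow the result into a destination with smaller elements. A plain ''<fc #800000>ADD</fc>'' requires the same arrangement for all three operands, so narrowing is done with dedicated instructions. For example, ''<fc #800000>ADDHN</fc>'' adds the 32-bit elements and keeps only the upper 16 bits of each sum:\\
+''<fc #800000>ADDHN</fc> <fc #008000>V0</fc>.<fc #808000>4H</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>''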
+{{:en:multiasm:paarm:neon_simd_stoh.jpg|}}
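+Another example widens the destination register. Here, too, a dedicated instruction is needed, such as ''<fc #800000>UADDL</fc>'' (unsigned) or ''<fc #800000>SADDL</fc>'' (signed), which extends the 16-bit elements to 32 bits before adding:\\
+''<fc #800000>UADDL</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4H</fc>, <fc #008000>V2</fc>.<fc #808000>4H</fc>''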
+{{:en:multiasm:paarm:neon_simd_htos.jpg|}}
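As a rough illustration of narrowing and widening, the sketch below models two AArch64 instructions that perform such additions: ADDHN (add, keep the high half of each sum) and UADDL (unsigned add long). It is a scalar Python model with registers represented as lists of unsigned lane values; the function names simply mirror the mnemonics and are not an existing API.

```python
def addhn(a, b, src_bits=32):
    """Narrowing add: add src_bits-wide lanes, keep the upper half of each (wrapped) sum."""
    mask = (1 << src_bits) - 1
    return [((x + y) & mask) >> (src_bits // 2) for x, y in zip(a, b)]

def uaddl(a, b, src_bits=16):
    """Widening add: zero-extend src_bits-wide lanes, so the sum cannot overflow."""
    mask = (1 << src_bits) - 1
    return [(x & mask) + (y & mask) for x, y in zip(a, b)]

# Four 32-bit lanes in, four 16-bit lanes out
narrow = addhn([0x00010000, 0x12345678, 0xFFFF0000, 0x0000FFFF],
               [0x00020000, 0x00000001, 0x00010000, 0x00000001])
print([hex(x) for x in narrow])

# Four 16-bit lanes in, four 32-bit lanes out
wide = uaddl([0xFFFF, 1, 2, 3], [0xFFFF, 1, 2, 3])
print([hex(x) for x in wide])
```

Narrowing discards the low half of each sum, whereas widening preserves every bit of the result at the cost of using wider destination lanes.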
+The arrangement specifier determines how many elements take part in the operation. The ''<fc #800000>ADD</fc>'' instruction is designed to operate on integers, and the result is computed from eight 16-bit integers, four 32-bit integers and so on. SIMD instructions also allow taking only one element from a vector and using it as a scalar value. In this way, for example, a whole vector can be multiplied by a single scalar value:\\
+''<fc #800000>FMLA</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>S</fc>[<fc #ffa500>1</fc>] <fc #6495ed>@ v0 += v1 * (scalar from element 1 of v2)</fc>''
+{{:en:multiasm:paarm:neon_fmla.jpg|}}
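The multiply-accumulate by element can be modelled as below. This is a simplified sketch: `fmla_by_element` is a hypothetical helper, and Python floats are double precision, whereas the real instruction performs a fused single-precision operation, so rounding may differ for values that are not exactly representable.

```python
def fmla_by_element(acc, vn, vm, index):
    """Model of v0 += v1 * v2[index]: one lane of vm is used as a scalar for every lane."""
    scalar = vm[index]
    return [a + n * scalar for a, n in zip(acc, vn)]

v0 = [1.0, 1.0, 1.0, 1.0]      # accumulator
v1 = [1.0, 2.0, 3.0, 4.0]      # vector operand
v2 = [10.0, 0.5, 20.0, 30.0]   # only element 1 (0.5) is used

v0 = fmla_by_element(v0, v1, v2, 1)
print(v0)  # [1.5, 2.0, 2.5, 3.0]
```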
+Data preparation is also an essential part of working with NEON. Note that in the previous code example, the values were loaded into the vector registers with the LD1 instruction. NEON has special LDn and STn instructions, where n is 1 to 4 (e.g., LD1, ST4, ST2, LD3). Each load (or store) variant behaves differently and must be chosen according to the layout of the data. Only the load instructions are explained here, as the store instructions work similarly but store data from registers into memory.
+The LD1 instruction:
+  * Loads a single-element structure into one lane of a vector register
+  * Loads multiple single-element structures into one to four vector registers
+  * With the postfix ‘R’, as in LD1R, loads a single-element structure and replicates it into all lanes of the vector register
+To load a single value into the register, the instruction must name the exact element that receives the value, as in this example: ''<fc #800000>LD1</fc> {<fc #008000>V0</fc>.<fc #808000>S</fc>}[<fc #ffa500>3</fc>], [<fc #008000>X1</fc>]''. The ''<fc #008000>V0</fc>'' register here is treated as four lanes, each capable of holding either a single-precision floating-point number or a 32-bit integer. The index ‘''[<fc #ffa500>3</fc>]''’ after the curly brackets selects the exact lane (element) into which the data is loaded; in this case, element number 3 is chosen. The other lanes keep their previous contents.
+The instruction ''<fc #800000>LD1R</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' loads one 32-bit value (integer or single-precision floating-point) and replicates it into each of the four vector lanes. All vector elements will contain the same value.
+The instruction ''<fc #800000>LD1</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' loads eight 32-bit values, filling the lanes of ''<fc #008000>V0</fc>'' first and then those of ''<fc #008000>V1</fc>''. All the data are loaded sequentially in the form x0, x1, x2, x3 and so on. Note that x here stands for any chosen data type and that the data themselves are in sequential order. The ''<fc #800000>LD2</fc>'', ''<fc #800000>LD3</fc>'' and ''<fc #800000>LD4</fc>'' instructions work similarly, but they treat the data in memory as interleaved structures and de-interleave them into the vector registers. The ''<fc #800000>LD2</fc>'' instruction loads two-element structures into two vector registers, so two vector registers must always be specified. The structures in memory may be in the order x0, y0, x1, y1, x2, y2, …; using the LD2 instruction, the x part of each structure is loaded into the first specified register and the y part into the second. This instruction is used, for example, to process audio data, where x may be the left audio channel and y the right:\\
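The difference between loading into one lane and replicating into all lanes can be sketched as follows. This is a simplified Python model: memory is represented as a list of 32-bit elements and the address as an index into it, which is an assumption made for illustration; `ld1_lane` and `ld1r` are hypothetical helper names.

```python
def ld1_lane(reg, lane, memory, addr):
    """Model of a single-lane LD1: replace one lane, leave the other lanes unchanged."""
    out = list(reg)
    out[lane] = memory[addr]
    return out

def ld1r(memory, addr, lanes=4):
    """Model of LD1R: load one element and replicate it into every lane."""
    return [memory[addr]] * lanes

memory = [7, 8, 9]
v0 = [0, 0, 0, 0]

lane_result = ld1_lane(v0, 3, memory, 0)
print(lane_result)   # [0, 0, 0, 7]

replicated = ld1r(memory, 1)
print(replicated)    # [8, 8, 8, 8]
```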
+''<fc #800000>LD2    </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]''
+{{:en:multiasm:paarm:neon_ld2.jpg|}}
+The ''<fc #800000>LD3</fc>'' instruction treats the data in memory as three-element structures, such as an image's RGB data. For this instruction, three vector registers must be specified. Assuming the data in memory are r0, g0, b0, r1, g1, b1, r2, g2, b2, …, the ‘r’ (red) channel will be loaded into the first specified register, the ‘g’ (green) channel into the second and the ‘b’ (blue) channel into the third.\\
+''<fc #800000>LD3    </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]''
+{{:en:multiasm:paarm:neon_ld3.jpg|}}
+The ''<fc #800000>LD4</fc>'' instruction is also used to load interleaved data into four vector registers.\\
+''<fc #800000>LD4    </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>, <fc #008000>V3</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]''
+{{:en:multiasm:paarm:neon_ld4.jpg|}}
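The de-interleaving performed by LD2, LD3 and LD4 can be sketched with list slicing. This is a scalar Python model, assuming memory is a flat list of elements and each register holds four lanes; `ldn` is a hypothetical helper name.

```python
def ldn(memory, n, lanes=4):
    """Model of LDn: element k of every n-element structure goes to register k."""
    return [memory[k:n * lanes:n] for k in range(n)]

# LD2: interleaved stereo samples (left, right, left, right, ...)
audio = ["L0", "R0", "L1", "R1", "L2", "R2", "L3", "R3"]
left, right = ldn(audio, 2)
print(left)    # ['L0', 'L1', 'L2', 'L3']
print(right)   # ['R0', 'R1', 'R2', 'R3']

# LD3: interleaved RGB pixels, numbered 0..11 in memory order
rgb = list(range(12))
r, g, b = ldn(rgb, 3)
print(r, g, b)
```

After such a load, each register holds one channel only, so a single vector instruction can process four samples of the same channel at once.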
+The ''<fc #800000>LD4R</fc>'' instruction takes four values from memory and fills all lanes of the first vector with the first value, all lanes of the second vector with the second value, and so on. Such instructions can be used to perform arithmetic operations on the same vector with several different scalar values. The ''<fc #800000>LD3R</fc>'' and ''<fc #800000>LD2R</fc>'' instructions work similarly. Note that in ''<fc #800000>//LDn//</fc>'', the ‘n’ identifies the number of vector registers loaded with the structure elements. Remember that the number of elements in a structure is the same as the number of vector registers used by the instruction.
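The replicating variants can be modelled in the same spirit. This is only a sketch under the same assumptions as before (memory as a flat list of elements, four lanes per register); `ldnr` is a hypothetical helper name.

```python
def ldnr(memory, n, lanes=4):
    """Model of LDnR: element k of one n-element structure fills every lane of register k."""
    return [[memory[k]] * lanes for k in range(n)]

coefficients = [1, 2, 3, 4]
regs = ldnr(coefficients, 4)   # as with LD4R and four 4S registers
for reg in regs:
    print(reg)
```

Each resulting register then acts as a broadcast scalar that can be combined lane by lane with another vector.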

en/multiasm/paarm/chapter_5_10.1764864072.txt.gz · Last modified: 2025/12/04 16:01 by eriks.klavins