Vectorisation

The Scalable Vector Extension (SVE) was introduced as an optional extension to ARMv8.2-A, and the Scalable Matrix Extension (SME) followed later with ARMv9.2-A. Both are built on the same idea of scalable, implementation-defined register sizes; the difference is that SVE instructions operate on one-dimensional data (vectors), whereas SME instructions operate on two-dimensional data (matrices).

Unlike NEON, where every vector register has a fixed width of 128 bits, SVE is scalable: the vector length is chosen by the hardware implementation and can range from 128 to 2048 bits, in multiples of 128 bits. The main advantage over NEON is that the programmer no longer needs to hard-code the vector length, so the same program code runs unchanged on SVE machines with different register sizes. This makes SVE code forward scalable.
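The arithmetic behind the vector length can be sketched in plain C. This is only a model of the rule stated above, not SVE code; the function names sve_vl_is_valid and sve_lanes are hypothetical. On real hardware, the lane count is read at run time with instructions such as CNTW (number of 32-bit lanes).

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the rule from the text: a vector length is accepted when it is
   a multiple of 128 bits in the range 128..2048 bits. */
bool sve_vl_is_valid(unsigned vl_bits)
{
    return vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0;
}

/* Number of lanes in one Z register for a given element size in bits
   (8, 16, 32 or 64). */
unsigned sve_lanes(unsigned vl_bits, unsigned elem_bits)
{
    assert(sve_vl_is_valid(vl_bits));
    assert(elem_bits == 8 || elem_bits == 16 ||
           elem_bits == 32 || elem_bits == 64);
    return vl_bits / elem_bits;
}
```

For example, a 128-bit implementation processes four 32-bit elements per instruction, while a 512-bit implementation processes sixteen with the same program code.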

SVE adds its own registers alongside the NEON ones: the vector registers Z0..Z31 (whose lowest 128 bits overlap the NEON registers V0..V31), the predicate registers P0..P15 and the first-fault register FFR. The Zn registers hold the data, and a Pn register acts as a predicate, that is, a lane mask that enables or disables individual lanes during an operation: 1 means the lane is enabled, 0 means it is disabled. A predicate register holds one bit for every byte of the vector register, so its size depends on the implemented vector length; for elements wider than one byte, only the lowest bit of each element's group of predicate bits decides whether the lane is active. This mechanism replaces the separate “loop remainder” handling that was required with NEON, because the final, partial iteration simply runs with some lanes disabled. SVE programming is similar to NEON programming, but most data-processing instructions take an additional predicate register, as in this example:
ADD Z0.S, P0/M, Z0.S, Z1.S    // predicated ADD: destination is also the first source (Z0 = Z0 + Z1 in active lanes)

The predicate register P0 is used as a lane mask. Because the elements in this example are 32 bits wide (.S), each lane is controlled by every fourth predicate bit: bit 0 controls lane 0, bit 4 controls lane 1, and so on. The predicate defines which lanes are active for all Zn operands of the instruction. The suffix “/M” on P0 selects merging predication: active lanes are written with the result of the operation, while inactive lanes of the destination register remain unchanged.
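The effect of merging predication, and the way it removes the loop remainder, can be modelled in scalar C. This is a minimal sketch and not SVE code: the predicate is simplified to one bool per lane, the parameter lanes stands in for the hardware vector length, and the names model_whilelt, model_add_merge and model_vla_add are hypothetical. WHILELT is the SVE instruction that builds this kind of predicate from a loop counter and a limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LANES 64   /* 2048-bit vector / 32-bit elements */

/* Model of WHILELT: lane i is active while (base + i) < n. */
void model_whilelt(bool pg[], size_t lanes, size_t base, size_t n)
{
    for (size_t i = 0; i < lanes; i++)
        pg[i] = (base + i) < n;
}

/* Model of a merging predicated add (ADD Zdn.S, Pg/M, Zdn.S, Zm.S):
   active lanes receive dst + src, inactive lanes keep their old value
   and are not accessed at all. */
void model_add_merge(uint32_t dst[], const bool pg[],
                     const uint32_t src[], size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        if (pg[i])
            dst[i] += src[i];
}

/* a[i] += b[i] for i in [0, n), one "vector" of `lanes` elements per
   iteration. The last iteration runs with a partial predicate, so no
   separate scalar remainder loop is needed. */
void model_vla_add(uint32_t a[], const uint32_t b[], size_t n, size_t lanes)
{
    bool pg[MAX_LANES];

    assert(lanes > 0 && lanes <= MAX_LANES);
    for (size_t base = 0; base < n; base += lanes) {
        model_whilelt(pg, lanes, base, n);
        model_add_merge(a + base, pg, b + base, lanes);
    }
}
```

Calling model_vla_add with n = 7 and lanes = 4 runs two iterations: the first with all four lanes active, the second with only three. The same call with lanes = 16 finishes in one iteration, which mirrors how one SVE binary adapts to different register sizes.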

The Scalable Matrix Extension (SME) builds on the Scalable Vector Extension (SVE) by adding architectural support for matrix-oriented operations. It targets large, dense numerical computations, such as those found in machine learning (ML), signal processing, graphics and scientific workloads. Where SVE processes vectors of elements, SME adds mechanisms for handling two-dimensional data efficiently, so that matrices can be multiplied, accumulated and transformed directly by the hardware. Like SVE, SME is scalable.
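As background that goes beyond the text above: the central SME operation is an outer product of two vectors that is accumulated into a two-dimensional tile (for example, the FMOPA instruction accumulating into the ZA storage). A matrix product can then be built as a sum of such outer products. The scalar C model below is only a sketch of that idea; the function names and the row-major layout are assumptions made for illustration, not SME code.

```c
#include <assert.h>
#include <stddef.h>

/* Model of an outer-product accumulate on a dim x dim row-major tile:
   tile[r][c] += col[r] * row[c]. */
void model_outer_product_acc(float *tile, const float *col,
                             const float *row, size_t dim)
{
    assert(dim > 0);
    for (size_t r = 0; r < dim; r++)
        for (size_t c = 0; c < dim; c++)
            tile[r * dim + c] += col[r] * row[c];
}

/* C = A * B for dim x dim row-major matrices, computed as the sum over k
   of (column k of A) outer (row k of B). `at` holds A transposed, so that
   column k of A is the contiguous row k of `at`. */
void model_tile_matmul(float *c, const float *at, const float *b, size_t dim)
{
    assert(dim > 0);
    for (size_t i = 0; i < dim * dim; i++)
        c[i] = 0.0f;                      /* clear the accumulator tile */
    for (size_t k = 0; k < dim; k++)
        model_outer_product_acc(c, at + k * dim, b + k * dim, dim);
}
```

Each call to model_outer_product_acc updates the whole tile from just two vectors, which is why this formulation suits hardware that holds a complete two-dimensional accumulator.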

Keep in mind that SME was introduced with ARMv9.2-A, whereas the Raspberry Pi 5 uses the Broadcom BCM2712, whose ARM Cortex-A76 cores implement the 64-bit ARMv8.2-A architecture. Although the Raspberry Pi 5 processor offers some optional v8.3/v8.4 extensions for performance and security, it unfortunately includes neither the Scalable Vector Extension (SVE) nor SME. Even so, it is good to know these new architectural concepts and possibilities. Extrapolating from SVE and SME, a future ARM architecture might introduce a scalable extension that can handle 3D matrices, and in the far future (maybe not so far…), n-dimensional matrix operations might be performed by hardware within a few clock cycles.

en/multiasm/paarm/chapter_5_11.txt · Last modified: 2025/12/04 22:02 by eriks.klavins