=== Authors ===

IOT-OPEN.EU Reloaded Consortium partners proudly present the 2nd edition of the Introduction to the IoT book. The complete list of contributors is listed below.

**ITT Group**
  * Raivo Sell, Ph. D., ING-PAED IGIP

**Riga Technical University**
  * Agris Nikitenko, Ph. D., Eng.
  * Karlis Berkolds, M. sc., Eng.
  * Eriks Klavins, M. sc., Eng.

**Silesian University of Technology**
  * Piotr Czekalski, Ph. D., Eng.
  * Krzysztof Tokarz, Ph. D., Eng.
  * Godlove Suila Kuaban, Ph. D., Eng.

**Western Norway University of Applied Sciences**
  * Add data

=== Preface ===

This book is intended to provide comprehensive information about low-level programming in assembler language for a wide range of computer systems, including:
  * embedded systems - constrained devices,
  * mobile and mobile-like devices - based on ARM architecture,
  * PCs and servers - based on Intel and AMD x86 architecture, in particular its 64-bit (x86_64) variant.

The book also contains an introductory chapter on general computer architectures and their components.

While low-level assembler programming was once considered outdated, it is now back, because of the need for efficient and compact code that saves energy and resources, thus promoting a green approach to programming. It is also essential for cutting-edge algorithms that require high performance while using constrained resources. Assembler is the base for all other programming languages and hardware platforms, and so is needed when developing new hardware, compilers and tools. It is crucial, in particular, in the context of the announced switch from chip manufacturing in Asia towards EU-based technologies.
We (the Authors) assume that persons willing to study this content possess some general knowledge of IT technologies, e.g., understand what an embedded system is, know the essential components of computer systems, such as RAM, CPU, DMA, I/O, and interrupts, and know the general concept of software development with programming languages, preferably C/C++ (but this is not obligatory).

{{ :en:multiasm:curriculum:logo_400.png?200 | MultiASM Project Logo}}

=== Project Information ===

This content was implemented under the following project:
  * Cooperation Partnerships in higher education, 2023, MultiASM: A novel approach for energy-efficient, high performance and compact programming for next-generation EU software engineers: 2023-1-PL01-KA220-HED-000152401.

**Consortium Partners**\\
  * Silesian University of Technology, Gliwice, Poland (Coordinator),
  * Riga Technical University, Riga, Latvia,
  * Western Norway University of Applied Sciences, Forde, Norway,
  * ITT Group, Tallinn, Estonia.

{{ :en:multiasm:curriculum:pasek_z_logami_1000px.jpg?400 | Consortium Partners' Logos}}

**Erasmus+ Disclaimer**\\
This project has been funded with support from the European Commission.\\
This publication reflects the views only of the author, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

**Copyright Notice**\\
This content was created by the MultiASM Consortium 2023–2026.\\
The content is copyrighted and distributed under the CC BY-NC [[https://en.wikipedia.org/wiki/Creative_Commons_license|Creative Commons Licence]] and is free for non-commercial use.
{{ en:iot-open:ccbync.png?100 |CC BY-NC}}
In case of commercial use, please get in touch with a MultiASM Consortium representative.

====== Introduction ======
The old, somewhat abandoned assembler programming comes to light again: "machine code to machine code and assembler to the assembler." In the context of rebasing electronics manufacturing, particularly developing new CPUs in Europe, engineers and software developers need to familiarise themselves with this niche, yet still omnipresent, technology: every piece of code you write, compile, and execute goes directly or indirectly to the machine code.\\
Besides developing new products and the related need for compilers and tools, the assembler language is essential for generating compact, rapid implementations of algorithms: it gives software developers a powerful tool of absolute control over the hardware, mainly the CPU. Assembler programming applies to selected, specific groups of tasks and algorithms.\\
Using pure assembler to implement, e.g., a user interface, is possible but does not make sense.\\
Nowadays, assembler programming is commonly integrated with high-level languages, and it is the part of the application's code responsible for rapid and efficient data processing without higher-level language overheads. This applies even to applications that do not run directly on the hardware but rather use virtual environments and frameworks (either interpreted or hybrid) such as Java, .NET languages, and Python.\\
It is a rule of thumb that the simpler and more constrained the device is, the closer the developer is to the hardware. An excellent example of this rule is development for an ESP32 chip: it has 2 cores that can easily handle Python apps, but when it comes to its energy-saving modes, when only the ultra-low-power coprocessor is running, the only programming language available is assembler; it is compact enough to run in this very constrained environment, drawing microamperes of current and fitting in dozens of bytes.
This book is divided into four main chapters:
  * The first chapter introduces students to computer platforms and their architectures. Assembler programming is very close to the hardware, and understanding those concepts is necessary to write code successfully and effectively.
  * The second chapter discusses programming in assembler for constrained devices that usually do not have operating systems but are bare-metal programmed. As it is impossible to review all assemblers and platforms, the one discussed in detail in this chapter is the most popular: AVR (ATMEL) microcontrollers (e.g., the Arduino Uno development board).
  * The third chapter discusses assembler programming for the future of most devices: ARM architecture. This architecture is perhaps the most popular worldwide, as it applies to mobile phones, edge and fog-class devices, and recently to a growing number of implementations in notebooks, desktops, and even workstations (e.g., Apple's M series). Practical examples use the Raspberry Pi version 5.
  * The last, fourth chapter describes in-depth assembler programming for PCs, in its 64-bit version, for Intel and AMD CPUs, including their CISC capabilities, such as SIMD instruction sets and their applications. Practical examples present (among others) the merging of assembler code with .NET frontends under Windows.

The following chapters present the contents of the coursebook:
  * [[en:multiasm:cs]]
  * [[en:multiasm:paiot]]
  * [[en:multiasm:paarm]]
  * [[en:multiasm:papc]]

====== Computer Architectures ======
Assembler programming is a very low-level technique for writing software. It uses the hardware directly, relying on the architecture of the processor and computer, with the possibility of influencing every single element and every single bit in hundreds of registers which control the behaviour of the computer and all its units.
That gives the programming engineer much more control over the hardware, but also requires great carefulness while writing programs. Modification of a single bit can completely change the computer's behaviour. This is why we begin our book on assembler with a chapter about the architecture of computers. It is essential to understand how the computer is designed, how it operates, how it executes programs and performs calculations. It is worth mentioning that this is important not only for programming in assembler. Knowledge of computer and processor architecture is useful for everyone who writes software in any language. With this knowledge, programming engineers can write applications that are more efficient in terms of memory use, execution time and energy consumption.

You may notice that in this book we often use the term //processor// for elements sometimes called by other names. Historically, the processor or microprocessor is the element that represents the central processing unit (CPU) only and also requires memory and peripherals to form a fully functional computer. If the integrated circuit contains all the mentioned elements, it is called a one-chip computer or microcontroller. A more advanced microcontroller for use in embedded systems is called an embedded processor. An embedded processor which contains other modules, for example a wireless networking radio module, is called a System on Chip (SoC). From the perspective of an assembler programmer, differentiating between these elements is not so important, and in current literature all such elements are often called the processor. That is why we decided to use the common name "processor", although other names can also appear in the text.

===== Overall View on Computer Architecture: Processor, Memory, IO, Buses =====
The computers which we use every day are all designed based on the same general idea of cooperation of three base elements: processor, memory and peripheral devices.
Their names represent their functions in the system. The memory stores the data and program code, the processor manipulates the data by executing programs, and peripherals are used to keep contact with the user, the environment and other systems. To exchange information, these elements are connected by interconnections named buses. The generic block schematic diagram of an exemplary computer is shown in Fig. {{ref>compblock}}.
{{ :en:multiasm:cs:block_schematic_computer.png?600 |Block schematic of computer}} Block schematic of computer
==== Processor ====
It is often called "the brain" of the computer. Although it does not think, the processor is the element which controls all other units of the computer: it manages everything in the machine. Every hardware part of the computer is controlled, more or less directly, by the main processor. Even if a device has its own processor - for example, the keyboard - it works under the control of the main one.

The processor handles events. We can say that synchronous events are those which the processor handles periodically. The processor never stops (as long as it has power), even when you see nothing special happening on the screen. On a PC running an operating system without a graphical user interface - for example, plain Linux, or a command box in Windows - even if you see only "C:\>", the processor is working. In this situation, it executes the main loop of the system, waiting for asynchronous events. Such an asynchronous event occurs when the user presses a key or moves the mouse, when the sound card finishes playing a sound, or when the hard disk finishes transferring data. The processor handles all of these events by executing programs - or, if you prefer, procedures.

The processor is characterised by its main parameters, including the frequency of operation and its class. Other features will be explained further in this chapter. The frequency is very important information which tells the user how many instructions can be executed in a unit of time. To obtain the real number of instructions per second, it must be combined with the average number of clock pulses required to execute an instruction. Older processors needed a few or even a dozen clock pulses for a single instruction. Modern machines, thanks to parallel execution, can achieve the impressive result of a few instructions per single cycle. The class of the processor tells what the number of bits in the data word is.
It defines the size of the data arguments which the processor can process with a single arithmetic, logic or other operation. The still popular 8-bit machines have a data length of 8 bits, while the most sophisticated ones can use 32-bit or 64-bit arguments. The class of the processor determines the size of its internal data registers, the size of instruction arguments and the number of lines of the data bus. Some modern 64-bit processors have additional registers much longer than the ones in the main set. For example, in x64 architecture there are ZMM registers of 512 bits in length.

==== Memory ====
Memory is the element of the computer that stores data and programs. It is visible to the processor as a sequence of data words, where every word has its own address. Addressing allows the processor to access simple and complex variables and to read instructions for execution. Although it is intuitive that the size of a memory word should correspond to the class of the processor, this is not always true. For example, in PC computers, independently of the processor class, the memory is always organised as a sequence of bytes, and its size is given in megabytes or gigabytes. The byte is historically assumed to be 8 bits of information and is used as the base unit to express the size of data in the world of computers. The size of the memory installed in the computer does not have to correspond to the size of the address space - the maximal size of the memory which is addressable by the processor. In modern machines, matching them would be impossible or hardly achievable; for example, for the x64 architecture the theoretical address space is 2^64 bytes (16 exabytes). Even the address space supported in hardware by current processors is as big as 2^48 bytes, which equals 256 terabytes. On the opposite side, in constrained devices the size of the physical memory can be bigger than the address space supported by the processor.
To enable access to more memory than the addressing space of the processor, or to support flexible placement of programs in a big address space, the paging mechanism is used. It is a hardware support unit for mapping the address used by the processor onto the physical memory installed in the computer. To learn more about paging, please refer to https://connormcgarr.github.io/paging/.

==== Peripherals ====
Peripherals are also called input-output (I/O) devices. There is a variety of units belonging to this group: timers, communication ports, general-purpose inputs and outputs, displays, network controllers, video and audio modules, data storage controllers and many others. The main function of peripheral modules is to exchange information between the computer and the user, collect information from the environment, send and receive data to and from elements of the computer not connected directly to the processor, and exchange data with other computers and systems. Some of them are used to connect the processor to a functional unit of the same computer. For example, the hard disk drive is not connected directly to the processor; it uses a specialised controller which is the peripheral module operating as an interface between the centre of the computer and the hard drive. Another example is the display. It uses a graphics card which plays the role of a peripheral interface between the computer and the monitor.

==== Buses ====
The processor, memory and peripherals exchange information using interconnections called buses. Although you can find in the literature and on the internet a variety of kinds of buses and their names, at the very low level there are three buses which connect the processor, memory and peripherals.

**Address bus** delivers the address generated by the processor to memory or peripherals. This address specifies the one, and only one, memory cell or peripheral register that the processor wants to access.
The address bus is used not only to address the data which the processor wants to transfer to or from memory or a peripheral. It also addresses the instruction which the processor fetches and later executes - instructions are also stored in the memory. The address bus is one-directional: the address is generated by the processor and delivered to the other units. If there is a DMA controller in the computer, in some circumstances it can also generate addresses instead of the processor; refer to the chapter with the DMA description. The number of lines in the address bus is fixed for a given processor and determines the size of the addressing space the processor can access. For example, if the address bus of some processor has 16 lines, it can generate up to 2^16 = 65536 different addresses.

**Data bus** is used to exchange data between the processor and memory or peripherals. The processor can read data from memory or peripherals, or write data to these units, having previously sent their address over the address bus. As data can be read or written, the data bus is bi-directional. In systems with a DMA controller, the data bus is utilised to exchange data between memory and peripherals directly; refer to the chapter with the DMA description. The number of bits of the data bus usually corresponds to the class of the processor. It means that an 8-bit class processor has 8 lines of the data bus.

**Control bus** is formed by lines mainly used for synchronisation between the elements of the computer. In the minimal implementation, it includes the read and write lines. The read line (#RD) informs the other elements that the processor wants to read data from the unit. In such a situation, the element, e.g. memory, puts the data from the addressed cell on the data bus. An active write signal (#WR) informs the element that the data present on the data bus should be stored at the specified address.
The control bus can also include other signals specific to the system, e.g. interrupt signals, DMA control lines, clock pulses, signals distinguishing memory and peripheral access, signals activating chosen modules and others.

===== Von Neumann vs Harvard Architectures, Mixed Architectures =====
The classical architecture of computers uses a single address bus and a single data bus to connect the processor, memory and peripherals. This architecture is called von Neumann or Princeton architecture, and we show it in Fig. {{ref>vonneumann}}. Additionally, in this architecture the memory contains both the code of programs and the data the programs use. This architecture suffers from the impossibility of accessing an instruction of the program and the data to be processed at the same time, called the von Neumann bottleneck. The answer to this issue is the Harvard architecture, where program memory is separated from data memory and they are connected to the processor with two pairs of address and data buses. Of course, the processor must support such a type of architecture. The Harvard architecture is shown in Fig. {{ref>harvard}}.
{{ :en:multiasm:cs:block_schematic_vonneumann.png?600 |Block schematic of von Neumann architecture computer}} Block schematic of von Neumann architecture computer
{{ :en:multiasm:cs:block_schematic_harvard.png?600 |Block schematic of Harvard architecture computer}} Block schematic of Harvard architecture computer
Harvard architecture is very often used in one-chip computers. It does not suffer from the von Neumann bottleneck and additionally makes it possible to implement different lengths for the data word and the instruction word. For example, the AVR 8-bit class family of microcontrollers has a 16-bit program word. PIC microcontrollers, also of the 8-bit class, have a 13 or 14-bit instruction word length. In modern microcontrollers, the program is usually stored in internal flash reprogrammable memory, and data in internal static RAM. All interconnections, including address and data buses, are implemented internally, making the implementation of the Harvard architecture easier than in a computer based on a microprocessor. In several mature microcontrollers, the program and data memory are separated but connected to the processor unit with a single set of buses. This is named mixed architecture. It benefits from an enlarged total address space but still suffers from the von Neumann bottleneck. This approach can be found in the 8051 family of microcontrollers. The schematic diagram of the mixed architecture is presented in Fig. {{ref>mixed}}.
{{ :en:multiasm:cs:block_schematic_mixed.png?600 |Block schematic of mixed architecture computer}} Block schematic of mixed architecture computer
===== CISC, RISC =====
It is not only the whole computer that can have different architectures; this also applies to processors. There are two main internal architectures of processors: CISC and RISC. CISC means Complex Instruction Set Computer, while RISC stands for Reduced Instruction Set Computer. The naming can be a little confusing because it suggests that the instruction set itself is complex or reduced. In fact, we can find CISC and RISC processors with similar numbers of implemented instructions. The difference lies rather in the complexity of the instructions, not only of the instruction set. Complex instructions mean that a typical single instruction of a CISC processor does more during its execution than a typical RISC instruction. It also uses more sophisticated addressing. Consequently, implementing some algorithm requires fewer CISC instructions than RISC instructions. Of course, this comes at the price of a more sophisticated construction of the CISC processor, which influences the average execution time of a single instruction. As a result, the overall execution time can be similar, but in CISC more effort is made by the processor, while in RISC more work is done by the compiler (or the assembler programmer).

Additionally, there are differences in the general-purpose registers. In CISC, the number of registers is smaller than in RISC, and they are specialised: not all operations can be performed with any register. For example, in some CISC processors arithmetic calculations can be done only with the use of a special register called the accumulator, while addressing of a table element or using a pointer can be done only with a special index (or base) register. In RISC, almost all registers can be used for any purpose, such as the mentioned calculations or addressing.
In CISC processors, instructions usually have two arguments, in the form:

  operation arg1, arg2          ; Example: arg1 = arg1 + arg2

In such a situation //arg1// is one of the sources and also the destination - the place for the result. This destroys the original //arg1// value. In many RISC processors, three-argument instructions are present:

  operation arg1, arg2, arg3    ; Example: arg3 = arg1 + arg2

In such an approach, two arguments are the sources and the third one is the destination - the original arguments are preserved and can be used for further calculations. Table {{ref>ciscrisc}} summarises the differences between CISC and RISC processors.

^ Feature ^ CISC ^ RISC ^
| Instructions | Complex | Simple |
| Registers | Specialised | Universal |
| Number of registers | Smaller | Larger |
| Calculations | With accumulator | With any register |
| Addressing modes | Complex | Simple |
| Non-destroying instructions | No | Yes |
| Examples of processors | 8086, 8051 | AVR, ARM |
CISC and RISC features
===== Components of Processor: Registers, ALU, Bus Control, Instruction Decoder =====
From our perspective, the processor is the electronic integrated circuit that controls the other elements of the computer. Its main ability is to execute instructions. When we go into the details of the instruction set, you will see that some of the instructions perform calculations or process data, but some of them do not. This suggests that the processor is composed of two main units. One of them is responsible for instruction execution, while the second performs data processing. The first one is called the control unit or instruction processor; the second one is named the execution unit or data processor. We can see them in Fig {{ref>procblock}}.
{{ :en:multiasm:cs:block_schematic_processor.png?600 |Units of the processor}} Units of the processor
==== Control unit ====
The function of the control unit, also known as the instruction processor, is to fetch, decode and execute instructions. It also generates signals to the execution unit if the instruction being executed requires it. It is a synchronous and sequential unit. Synchronous means that it changes state synchronously with the clock signal. Sequential means that the next state depends on the states of the inputs and the current internal state. As inputs, we can consider not only physical signals from other units of the computer but also the code of the instruction. To ensure that the computer behaves the same every time it is powered on, the control unit is set to a known state at the beginning of operation by the RESET signal. A typical control unit contains some essential elements:
  * Instruction register (IR).
  * Instruction decoder.
  * Program counter (PC)/instruction pointer (IP).
  * Stack pointer (SP).
  * Bus interface unit.
  * Interrupt controller.
Elements of the control unit are shown in Fig {{ref>controlunit}}.
{{ :en:multiasm:cs:control_unit.png?600 |Elements of the control unit}} Elements of the control unit
The control unit executes instructions in a few steps:
  * Generates the address of the instruction.
  * Fetches the instruction code from memory.
  * Decodes the instruction.
  * Generates signals to the execution unit or executes the instruction internally.
In detail, the process looks as follows:
  - The control unit takes the address of the instruction to be executed from a special register known as the Instruction Pointer or Program Counter and sends it to the memory via the address bus. It also generates signals on the control bus to synchronise the memory with the processor.
  - The memory takes the code of the instruction from the provided address and sends it to the processor using the data bus.
  - The processor stores the instruction code in the instruction register and, based on the bit pattern, interprets what to do next.
  - If the instruction requires the execution unit to operate, the control unit generates signals to control it. In cooperation with the execution unit, it can also read data from or write data to the memory.
The control unit works according to the clock signal generator cycles, known as main clock cycles. With every clock cycle, some internal operations are performed. One such operation is reading or writing the memory, which sometimes requires more than a single clock cycle. A single memory access is known as a machine cycle. As instruction execution sometimes requires more than one memory access and other actions, the execution of the whole instruction is named the instruction cycle. Summarising: one instruction execution requires one instruction cycle, composed of several machine cycles, each composed of a few main clock cycles. Modern advanced processors are designed in such a way that they are able to execute a single instruction (sometimes even more than one) every single clock cycle. This requires a more complex design of the control unit, many execution units and other advanced techniques which make it possible to process more than one instruction at a time.
The control unit also accepts input signals from peripherals, enabling the interrupt and direct memory access mechanisms. For a proper return from an interrupt subroutine, the control unit uses a special register called the stack pointer. Interrupts and direct memory access mechanisms will be explained in detail in further chapters. Although the stack pointer is not implemented in all modern processors, every processor has some mechanism for storing the return address.

==== Execution unit ====
The execution unit, also known as the data processor, executes instructions. Typically, it is composed of a few essential elements:
  * Arithmetic logic unit (ALU).
  * Accumulator and set of registers.
  * Flags register.
  * Temporary register.
The arithmetic logic unit (ALU) is the element that performs logical and arithmetical calculations. It uses data coming from registers, the accumulator or memory. Data coming from memory for arithmetic and logic instructions is stored in the temporary register. The result of calculations is stored back in the accumulator, another register or memory. In some legacy CISC processors, the only possible place for storing the result is the accumulator. Besides the result, the ALU also returns some additional information about the calculations. It modifies the bits in the flags register, which comprises flags indicating the results of arithmetic and logical operations. For example, if the result of an addition operation is too large to be stored in the resulting argument, the carry flag is set to inform about such a situation. Typically, the flags register includes:
  * Carry flag, set in case a carry or borrow occurs.
  * Sign flag, indicating whether the result is negative.
  * Zero flag, set in case the result is zero.
  * Auxiliary carry flag, used in BCD operations.
  * Parity flag, which indicates whether the result has an even number of ones.
The flags are used as conditions for decision-making instructions (like //if// statements in high-level languages). The flags register can also implement some control flags to enable or disable processor functionalities. An example of such a flag is the Interrupt Enable flag of the 8086 microprocessor. As we mentioned in the chapter about CISC and RISC processors, in the former there are specialised registers, including the accumulator. A typical CISC execution unit is shown in Fig {{ref>CISCexeunit}}.
{{ :en:multiasm:cs:CISC_execution_unit.png?600 |Elements of the CISC execution unit}} Elements of the CISC execution unit
A typical RISC execution unit does not have a specialised accumulator register. It implements the set of registers as shown in Fig {{ref>RISCexeunit}}.
{{ :en:multiasm:cs:RISC_execution_unit.png?600 |Elements of the RISC execution unit}} Elements of the RISC execution unit
===== Processor Taxonomies, SISD, SIMD, MIMD =====
As we already know, the processor executes instructions which process data. We can consider two streams flowing through the processor: a stream of instructions which passes through the control unit, and a stream of data processed by the execution unit. In 1966, Michael Flynn proposed a taxonomy to classify different processor architectures. Flynn's classification is based on the number of concurrent instruction (or control) streams and data streams available in the architecture. The taxonomies proposed by Flynn are presented in Table {{ref>taxonomies}}.

| Basic taxonomies |^ Data streams ||
| ::: | ::: | **Single** | **Multiple** |
^ Instruction streams | **Single** | SISD | SIMD |
| ::: | **Multiple** | MISD | MIMD |
Processor taxonomies by Flynn
==== SISD ====
Single Instruction Single Data processor is a classical processor with a single control unit and a single execution unit. It can fetch a single instruction in one cycle and perform a single calculation. Mature PC computers based on 8086, 80286 or 80386 processors, or some modern small-scale microcontrollers like AVR, are examples of such an architecture.

==== SIMD ====
Single Instruction Multiple Data is an architecture in which one instruction stream performs calculations on multiple data streams. Good examples of implementations of such an architecture are all vector instruction sets (also called SIMD instructions) like MMX, SSE, AVX and 3DNow! in x64 Intel and AMD processors. Modern ARM processors also implement SIMD instructions which perform vectorised operations.

==== MIMD ====
Multiple Instruction Multiple Data is an architecture in which many control units operate on multiple streams of data. These are all architectures with more than one processor (multi-processor architectures). MIMD architectures include multi-core processors like modern x64, ARM and even ESP32 SoCs. This also includes supercomputers and distributed systems, using common shared memory or even a distributed but shared memory.

==== MISD ====
Multiple Instruction Single Data. At first glance, such machines seem illogical, but these are machines where the certainty of correct calculations is crucial and required for the security of the system operation. Such an approach can be found in applications like space shuttle computers.

==== SIMT ====
Single Instruction Multiple Threads. Originally defined as a subset of SIMD. In modern constructions, Nvidia uses this execution model in their G80 architecture ((https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf)), where multiple independent threads execute concurrently using a single instruction.

===== Memory, Types and Their Functions =====
The computer cannot work without memory.
The processor fetches instructions from the memory, and data is also stored there. In this chapter, we will discuss memory types, technologies and their properties. The overall view of computer memory can be represented as the memory hierarchy triangle shown in Fig. {{ref>memtriangle}}, where the available size decreases towards the top while the access time shortens significantly. As the figure shows, the fastest memory in the computer is the set of internal registers; next come the cache memory, operational memory (RAM) and disk drive; the slowest, but largest, is the memory available via the computer network - usually the Internet.
{{ :en:multiasm:cs:memory_triangle.png?400 |Memory hierarchy triangle}} Memory hierarchy triangle
In the table {{ref>memhierarchy}} you can find the comparison of different memory types with their estimated size and access time. ^ Memory type ^ Average size ^ Access time ^ | Registers | few kilobytes | <1 ns | | Cache | few megabytes | <10 ns | | RAM | few gigabytes | <100 ns | | Disk | few terabytes | <100 us | | Network | unlimited | few seconds, but can be unpredictable |
Parameters of memory in the hierarchy
==== Address space ==== Before discussing the details of memory, we must mention the term //address space//. The processor can access instructions of the program and data to process. The question is: how many instructions and data elements can it reach? The possible number of elements (usually bytes) represents the address space available for the program and for data. We know from the chapter about von Neumann and Harvard architectures that these spaces can overlap or be separate. The size of the program address space is determined by the number of bits of the Instruction Pointer register. In many 8-bit microprocessors, the size of the IP is 16 bits. This means that the size of the address space, expressed as the number of different addresses the processor can reach, is equal to 2^16 = 65536 (64k). This is too small for bigger machines like PC computers, so there the IP is 64 bits wide (although the real number of bits used is limited to 48). The size of the possible data address space is determined by the addressing modes and the size of the index registers used for indirect addressing. In 8-bit microprocessors, it is sometimes possible to join two 8-bit registers to achieve an address space of the same 64k size as for the program. In PC computers, program and data address spaces overlap, so the address space for data is the same as for programs. ==== Program memory ==== In this kind of memory, the instructions of the programs are stored. In many computers (like PCs), the same physical memory is used to store the program and the data. This is not the case in most microcontrollers, where the program is stored in non-volatile memory, usually Flash EEPROM. Even if the memory for programs and data is common, the code area is often protected from writing to prevent malware from interfering with the original, safe software. ==== Data memory ==== In data memory, all kinds of data can be stored.
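The address-space arithmetic above is easy to verify. The following short sketch (Python, purely illustrative - no real processor is involved) computes the number of distinct addresses reachable for a given Instruction Pointer width:

```python
# Address-space size implied by the width of the Instruction Pointer.
# Illustrative only: real processors may reserve parts of the space.

def address_space_size(bits: int) -> int:
    """Number of distinct addresses reachable with a `bits`-wide address."""
    return 2 ** bits

print(address_space_size(16))   # 65536 addresses (64k) for a 16-bit IP
print(address_space_size(48))   # address bits actually used on today's x86-64
```

For a 16-bit IP this gives exactly the 65536 (64k) addresses mentioned above; for the 48 bits used on current x86-64 processors it gives 256 terabytes of addressable space.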
Here we can find numbers, tables, pictures, audio samples and everything else that is processed by the software. Usually, the data memory is implemented as RAM, with the possibility of reading and writing. RAM is a volatile memory, so its content is lost when the computer is powered off. ==== Memory technologies ==== There are different technologies for creating memory elements. The technology determines whether memory is volatile or non-volatile, the physical size (area) of a single bit, the power consumption of the memory array, and how fast we can access the data. Based on data retention, we can divide memories into two groups: * non-volatile, * volatile. Non-volatile memories are implemented as: * ROM * PROM * EPROM * EEPROM * Flash EEPROM * FRAM Volatile memories have two main technologies: * Static RAM * Dynamic RAM ==== Non-volatile memories ==== Non-volatile memories are used to store data which should be preserved while the power is off. Their main application is to store the code of the program, constant data tables, and parameters of the computer (e.g. network credentials). The hard disk drive (or SSD) is also a non-volatile memory, although not directly accessible by the processor via the address and data buses. ==== ROM ==== ROM - Read Only Memory is the kind of memory that is programmed by the chip manufacturer during the production of the chip. This memory has fixed content which can't be changed in any way. It used to be popular as program memory in microcontrollers because of the low price of a single chip, but since any error in the software required replacing the whole chip and software updates were impossible, it was replaced with technologies which allow reprogramming the memory content. Currently, ROM is sometimes used to store the serial number of a chip. ==== PROM ==== PROM - Programmable Read Only Memory is the kind of memory that can be programmed by the user but can't be reprogrammed afterwards.
It is sometimes called OTP - One-Time Programmable memory. Some microcontrollers implement such a memory, but because software updates are impossible, it is a choice only for simple devices with a short product lifetime. OTP technology is also used in some protection mechanisms to guard against unauthorised software changes or device reconfiguration. ==== EPROM ==== EPROM - Erasable Programmable Read Only Memory is memory which can be erased and programmed many times. The data is erased by shining an intense ultraviolet light through a special glass window into the memory chip for a few minutes. After erasing, the chip can be programmed again. This kind of memory was very popular years ago, but due to the complicated erase procedure requiring a UV light source, it was replaced with EEPROM. ==== EEPROM ==== EEPROM - Electrically Erasable Programmable Read Only Memory is a type of non-volatile memory that can be erased and reprogrammed many times with electric signals only. It replaced EPROM memories due to its ease of use and the possibility of reprogramming without removing the chip from the device. It is designed as an array of MOSFET transistors with so-called floating gates, which can be charged or discharged and keep a stable state for a long time (at least 10 years). In EEPROM memory, every byte can be individually reprogrammed. ==== Flash EEPROM ==== Flash EEPROM is a variant of EEPROM memory. While in EEPROM a single byte can be erased and reprogrammed, in the Flash version erasing is possible only in larger blocks. This reduces the area occupied by the memory, making it possible to create denser memories with larger capacity. Flash memories can be realised as part of a larger chip, making it possible to design microcontrollers with built-in reprogrammable memory for the firmware. This enables firmware updates without the need for any additional tools or removing the processor from the operating device.
It also enables remote updates of the firmware. ==== FRAM ==== FRAM - Ferroelectric Random Access Memory is the type of memory where information is stored by changing the polarisation of a ferroelectric material. The main advantage is that the power efficiency, access time, and density are comparable to DRAM, but without the need for refreshing and with data retention while the power is off. The main drawback is the price, which limits the popularity of FRAM applications. Currently, due to their high reliability, FRAMs are used mainly in specialised applications like data loggers, medical equipment, and automotive "black boxes". ==== Volatile memories ==== Volatile memories are used for temporary data storage while the computer system is running. Their content is lost after the power is off. They store the data being processed by the program, serving as operational memory (known as RAM), cache memory, and data buffers. ==== SRAM ==== SRAM - Static Random Access Memory is the memory used as operational memory in smaller computer systems and as the cache memory in bigger machines. Its main benefit is very high operational speed. The access time can be as short as a few nanoseconds, making it the first choice when speed is required: in cache memory or as the processor's registers. The drawbacks are the area and power consumption. Every bit of static memory is implemented as a flip-flop composed of six transistors. Due to its construction, the flip-flop always consumes some power. That is the reason why the capacity of SRAM memory is usually limited to a few hundred kilobytes, up to a few megabytes. ==== DRAM ==== DRAM - Dynamic Random Access Memory is the memory used as operational memory in bigger computer systems. Its main benefit is very high density. The information is stored as a charge in a capacitor formed under the MOSFET transistor.
Because every bit is implemented with a single transistor, it is possible to fit a few gigabytes in a small area. Another benefit is the reduced power required to store the data in comparison to static memory. But there are also drawbacks. DRAM memory is slower than SRAM: the addressing method and the complex internal construction prolong the access time. Additionally, because the capacitors which store the data discharge in a short time, DRAM memory requires refreshing - each cell typically about 16 times per second. Refreshing is done automatically, but it additionally slows the access time. ===== Peripherals ===== Peripherals, or peripheral devices, also known as Input-Output devices, enable the computer to remain in contact with the external environment or expand the computer's functionality. Peripheral devices enhance the computer's capability by making it possible to enter information into a computer for storage or processing and to deliver the processed data to a user, another computer, or a device controlled by the computer. Internal peripherals are connected directly to the address, data, and control buses of the computer. External peripherals can be connected to the computer via USB or a similar connection. The USB controller is itself a peripheral device, so every external peripheral (e.g. a mouse) is connected to the processor via an internal peripheral. In this book, we mainly consider internal peripherals directly connected to the address, data, and control buses. ==== Types of peripherals ==== There is a variety of peripherals which can be connected to the computer.
The most commonly used are: * parallel input/output ports * serial communication ports * timers/counters * analogue to digital converters * digital to analogue converters * interrupt controllers * DMA controllers * displays * keyboards * sensors * actuators ==== Addressing of I/O devices ==== From the assembler programmer's perspective, a peripheral device is represented as a set of registers available in the I/O address space. Registers of peripherals are used to control their behaviour, including the mode of operation, parameters, configuration, speed of transmission, etc. Registers are also data exchange points where the processor can store data to be transmitted to the user or an external computer, or read the data coming from the user or another system. The size of the I/O address space is usually smaller than the size of the program or data address space. The method of accessing peripherals depends on the design of the processor. There are two methods of I/O addressing implementation: separate or memory-mapped I/O. ==== Separate I/O address space ==== A separate I/O address space is accessed independently of the program or data memory. In the processor, it is implemented with the use of separate control bus lines to read or write I/O devices. Separate control lines usually mean that the processor also implements different instructions to access memory and I/O devices. It also means that a peripheral and a byte in memory can share the same address, and only the type of instruction used distinguishes the final destination. A separate I/O address space is shown schematically in Fig {{ref>separateio}}. Reading the data from the memory is activated with the #MEMRD signal, writing with the #MEMWR signal. If the processor needs to read or write a register of a peripheral device, it uses the #IORD or #IOWR lines respectively. The "#" symbol before a signal name means that the active state of the line is LOW while the idle state is HIGH.
{{ :en:multiasm:cs:separate_io.png?600 |Separate I/O address space}} Separate I/O address space
==== Memory-mapped I/O address space ==== In this approach, the processor doesn't implement separate control signals to access the peripherals. It uses a #RD signal to read the data from, and a #WR signal to write the data to, a common address space. It also doesn't have different instructions to access registers in the I/O address space. Every such register is visible like any other memory location. This means it is solely the responsibility of the software to distinguish between I/O registers and data in the memory. The memory-mapped I/O address space is shown schematically in Fig {{ref>memorymappedio}}.
{{ :en:multiasm:cs:memory_mapped_io.png?600 |Memory-mapped I/O address space}} Memory-mapped I/O address space
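The idea of memory-mapped I/O can be illustrated with a toy bus model. The sketch below (Python, with an invented UART_TX register address - no real hardware is modelled) shows how an address decoder routes a write either to RAM or to a peripheral, while the software uses the very same write operation for both:

```python
# Toy model of memory-mapped I/O: writes to the (hypothetical) UART_TX
# address reach a "device" instead of the RAM array. The address alone
# selects the destination, exactly as in memory-mapped I/O.

UART_TX = 0xFF00                        # invented peripheral register address

class Bus:
    def __init__(self, ram_size: int = 0x10000):
        self.ram = bytearray(ram_size)
        self.uart_output = []           # bytes the "device" received

    def write(self, address: int, value: int) -> None:
        if address == UART_TX:          # the address decoder picks the device
            self.uart_output.append(value)
        else:                           # otherwise an ordinary memory write
            self.ram[address] = value

    def read(self, address: int) -> int:
        return self.ram[address]

bus = Bus()
bus.write(0x0100, 42)                   # plain data write to RAM
bus.write(UART_TX, ord('A'))            # same operation, but hits the peripheral
print(bus.read(0x0100))                 # 42
print(bus.uart_output)                  # [65]
```

Note that the software itself must know that 0xFF00 is a register and not data - the processor makes no distinction, which is exactly the point made above.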
===== Instruction Execution Process ===== As we already mentioned, instructions are executed by the processor in a few steps. In the literature, you can find descriptions with three, four, or five stages of instruction execution; everything depends on the level of detail one considers. The three-stage description comprises fetch, decode and execute steps. The four-stage model adds a data read step between decode and execute. The five-stage version adds another final step for writing the result back, and sometimes reverses the order of data read and execution. It is worth remembering that even a simple fetch step can be divided into a set of smaller actions which must be performed by the processor. The real execution of instructions depends on the processor's architecture, implementation and complexity. Considering the five-stage model, we can describe the general model of instruction execution: - Fetching the instruction: * The processor addresses the instruction by sending the content of the IP register over the address bus. * The processor reads the code of the instruction from program memory over the data bus. * The processor stores the code of the instruction in the instruction register. * The processor prepares the IP (usually by incrementing it) to point to the next instruction in the stream. - Instruction decoding: * A simple instruction can be interpreted directly by the instruction decoder logic. * More complex instructions are processed in several steps using microcode. * Results are provided to the execution unit. - Data reading (if the instruction requires reading data): * The processor calculates the address of the data in the data memory. * The processor sends this address over the address bus to the memory. * The processor reads the data from memory and stores it in the accumulator or a general-purpose register. - Instruction execution: * The execution unit performs the actions encoded in the instruction (simple or with microcode).
* In modern processors there are multiple execution units, which can execute different instructions simultaneously. - Writing back the result: * The processor writes the result of the calculations into a register or memory. ==== Instruction encoding ==== From the perspective of the processor, instructions are binary codes, unambiguously determining the activities the processor is to perform. Instructions can be encoded using a fixed or variable number of bits. A fixed number of bits makes the construction of the instruction decoder simpler, because the choice of a specific behaviour or function of the execution unit is encoded with bits which are always at the same position in the instruction. On the other hand, if the designer plans to expand the instruction set with new instructions in the future, some spare bits must be reserved in the instruction word, which makes the program code larger than strictly necessary. Fixed-length instructions are often implemented in RISC machines. For example, in the ARM architecture instructions have 32 bits; in AVR, instructions are encoded using 16 bits. A variable number of bits makes the instruction decoder more complex. Based on the content of the first part of the instruction (usually a byte), it must be able to decide what the length of the whole instruction is. In such an approach, instructions can be as short as one byte, or much longer. An example of a processor with variable instruction length is the 8086 and all further processors from the x86 and x64 families. Here the instructions, including all possible constant arguments, can be up to 15 bytes long. Although in the computer world information is very often encoded in bytes or multiples of bytes, there are processors with instructions encoded in other numbers of bits. Examples include PIC microcontrollers with instruction lengths of 12 or 14 bits.
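The extra work a variable-length decoder must do can be sketched in a few lines. The example below uses a completely made-up three-instruction encoding (not any real ISA): the first byte is the opcode, and the decoder must consult it to learn how many operand bytes follow before it can find the next instruction - the same problem an x86 decoder faces on a much larger scale:

```python
# Sketch of a variable-length instruction decoder for an invented ISA.
# The opcode byte determines how many operand bytes belong to the instruction.

OPCODES = {                  # opcode byte -> (mnemonic, number of operand bytes)
    0x01: ("nop", 0),
    0x02: ("copy_imm", 2),   # register number, 8-bit immediate
    0x03: ("jump", 2),       # 16-bit destination address
}

def decode(stream: bytes):
    """Split a byte stream into (mnemonic, operand bytes) tuples."""
    i, out = 0, []
    while i < len(stream):
        mnemonic, n = OPCODES[stream[i]]
        out.append((mnemonic, stream[i + 1:i + 1 + n]))
        i += 1 + n           # variable step: 1 byte for nop, 3 for the others
    return out

program = bytes([0x01, 0x02, 0x00, 0x05, 0x03, 0x00, 0x80])
for mnemonic, operands in decode(program):
    print(mnemonic, operands.hex())   # nop / copy_imm 0005 / jump 0080
```

A fixed-length decoder would simply step by a constant amount; here the step depends on the opcode, which is why the decoding hardware is more complex.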
===== Modern Processors: Pipeline, Superscalar, Branch Prediction, Hyperthreading ===== Modern processors have a very complex design and include many units responsible mainly for shortening the execution time of the software. ==== Pipeline ==== As described in the previous chapter, executing a single instruction requires many actions which must be performed by the processor. We saw that each step, or even substep, can be performed by a separate logical unit. This feature has been used by designers of modern processors to create processors in which instructions are executed in a pipeline. A pipeline is a collection of logical units that execute many instructions at the same time - each of them at a different stage of execution. If the instructions arrive in a continuous stream, the pipeline allows the program to execute faster than on a processor without a pipeline. Note that the pipeline does not reduce the execution time of a single instruction; it increases the throughput of the instruction stream. A simple pipeline is implemented in AVR microcontrollers. It has two stages, which means that while one instruction is executed, another one is fetched, as shown in Fig {{ref>pipelineavr}}.
{{ :en:multiasm:cs:pipeline_AVR.png?600 |Simple 2-stage pipeline in AVR microcontroller}} Simple 2-stage pipeline in AVR microcontroller
The early 8086 processor executed instructions in four steps. This allowed the implementation of a 4-stage pipeline, as shown in Fig. {{ref>pipeline8086}}
{{ :en:multiasm:cs:pipeline_8086.png?600 |4-stage pipeline in 8086 microprocessor}} 4-stage pipeline in 8086 microprocessor
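The throughput gain of pipelining can be estimated with simple arithmetic: with a k-stage pipeline and no stalls, n instructions finish in k + n - 1 cycles instead of k * n. A back-of-the-envelope sketch (idealised - real pipelines stall on branches and memory):

```python
# Idealised pipeline throughput: n instructions on a k-stage machine.
# Without a pipeline each instruction takes k cycles; with one, a new
# instruction completes every cycle once the pipeline is full.

def cycles(n_instructions: int, stages: int, pipelined: bool) -> int:
    if pipelined:
        return stages + n_instructions - 1   # fill time + one per cycle
    return stages * n_instructions           # strictly sequential

n = 1000
print(cycles(n, 4, pipelined=False))  # 4000 cycles without a pipeline
print(cycles(n, 4, pipelined=True))   # 1003 cycles with a 4-stage pipeline
```

Note that a single instruction still needs the full 4 cycles in both cases; only the stream as a whole finishes sooner, which is exactly the throughput-versus-latency point made above.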
Modern processors implement longer pipelines. For example, the Pentium III used a 10-stage pipeline, the Pentium 4 a 20-stage one, and the Pentium 4 Prescott even a 31-stage pipeline. Does a longer pipeline mean faster program execution? Everything has benefits and drawbacks. The undoubted benefit of a longer pipeline is more instructions executed at the same time, which gives higher instruction throughput. But a problem appears when branch instructions come. When a conditional jump appears in the instruction stream, the processor must choose which way the instruction stream should go. Should the jump be taken or not? The answer is usually based on the result of a preceding instruction and is known only when the branch instruction is close to the end of the pipeline. In such a situation, in modern processors, the branch prediction unit guesses what to do with the branch. If it misses, the pipeline content is invalidated and the pipeline starts operation from the beginning. This causes stalls in the program execution. The longer the pipeline, the bigger the number of instructions to invalidate. That's why Intel decided to return to shorter pipelines. In modern microarchitectures, the length of the pipeline varies between 12 and 20 stages. ==== Superscalar ==== A superscalar processor increases the speed of program execution because it can execute more than one instruction during a clock cycle. This is realised by simultaneously dispatching instructions to different execution units on the processor. The superscalar processor doesn't implement two or more independent pipelines; rather, decoded instructions are sent for further processing to the chosen execution unit, as shown in Fig. {{ref>superscalar}}.
{{ :en:multiasm:cs:superscalar.png?600 |Superscalar architecture of pipelined processor}} Superscalar architecture of pipelined processor
In the x86 family, the first processor with two execution paths was the Pentium, with its U and V pipelines. Modern x64 processors like the i7 implement six execution units. Not all execution units have the same functionality; for example, in the i7 processor, every execution unit has different capabilities, as presented in table {{ref>executionunits}}. ^ Execution unit ^ Functionality ^ | 0 | Integer calculations, Floating point multiplication, SSE multiplication, divide | | 1 | Integer calculations, Floating point addition, SSE addition | | 2 | Address generation, load | | 3 | Address generation, store | | 4 | Data store | | 5 | Integer calculations, Branch, SSE addition |
Execution units of i7 processor
The real path of instruction processing is much more complex. Additional techniques are implemented to achieve better performance, e.g. out-of-order execution and register renaming. They are performed automatically by the processor, and the assembler programmer does not influence their behaviour. ==== Branch prediction ==== As mentioned, the pipeline can suffer invalidation if a conditional branch is not properly predicted. The branch prediction unit is used to guess the outcome of conditional branch instructions. It helps to reduce delays in program execution by predicting the path the program will take. Prediction is based on historical data and program execution patterns. There are many methods of predicting branches. In general, the processor implements a buffer with the addresses of the last few branch instructions, with a history register for every branch. Based on this history, the branch prediction unit can guess whether the branch should be taken. ==== Hyperthreading ==== Hyper-Threading Technology is Intel's approach to simultaneous multithreading, which allows the operating system to execute more than one thread on a single physical core. For each physical core, the operating system defines two logical processor cores and shares the load between them when possible. Hyperthreading technology uses the superscalar architecture to increase the number of instructions that operate in parallel in the pipeline on separate data. With Hyper-Threading, one physical core appears to the operating system as two separate processors. The logical processors share the execution resources, including the execution engine, caches, and system bus interface. Only the elements that store the architectural state of the processor are duplicated, including the registers essential for code execution. ===== Fundamentals of Addressing Modes ===== An addressing mode is the way in which the argument of an instruction is specified.
The addressing mode defines a rule for interpreting the address field of the instruction before the operand is reached. Addressing modes are used in instructions which operate on data and in instructions which change the program flow. ==== Data addressing ==== Instructions which reach the data have the possibility of specifying the data placement. The data is an argument of the instruction, sometimes called an operand. An operand can be one of the following: register, immediate, direct memory, or indirect memory. As at this point the reader doesn't know any real assembler instructions, we will use the hypothetical instruction //copy// that copies the data from the source operand to the destination operand. The order of the operands will be similar to high-level languages, where the left operand is the destination and the right operand is the source. Copying data from //a// to //b// will be done with an instruction as in the following example: copy b, a **A register operand** is used where the data which the processor wants to reach is stored, or is intended to be stored, in a register. If we assume that //a// and //b// are both registers, named //R0// and //R1//, the instruction for copying data from //R0// to //R1// will look as in the following example and as shown in Fig.{{ref>addrregister}}. copy R1, R0
{{ :en:multiasm:cs:addressing_register.png?600 |Illustration of addressing registers}} Illustration of addressing registers
**An immediate operand** is a constant or the result of a constant expression. The assembler encodes immediate values into the instruction at assembly time. There can be only one operand of this type in an instruction, and it is always in the source position of the operand list. Immediate operands are used to initialise registers or variables, or as numbers for comparison. An immediate operand, as it is encoded in the instruction, is placed in code memory, not in data memory, and can't be modified during software execution. The instruction which initialises register //R1// with the constant (immediate) value of //5// looks like this: copy R1, 5
{{ :en:multiasm:cs:addressing_immediate.png?600 |Illustration of immediate addressing mode}} Illustration of immediate addressing mode
**A direct memory operand** specifies the data at a given address. The address can be given in numerical form or as the name of a previously defined variable. It is equivalent to a static variable definition in high-level languages. If we assume that //var// represents the address of the variable, the instruction which copies data from the variable to //R1// can look like this: copy R1, var
{{ :en:multiasm:cs:addressing_direct.png?600 |Illustration of direct addressing mode}} Illustration of direct addressing mode
**An indirect memory operand** is accessed by specifying the name of a register whose value represents the address of the memory location to reach. We can compare indirect addressing to a pointer in high-level languages, where the variable does not store the value but points to the memory location where the value is stored. Indirect addressing can also be used to access elements of a table in a loop, where we use an index value which changes every loop iteration rather than a single address. Different assemblers have different notations for indirect addressing: some use brackets, some square brackets, and others the //@// symbol. Even different assembler programs for the same processor can differ. In the following example, we assume the use of square brackets. The instruction which copies the data from the memory location addressed by the content of the //R0// register to the //R1// register would look like this: copy R1, [R0]
{{ :en:multiasm:cs:addressing_indirect.png?600 |Illustration of indirect addressing mode}} Illustration of indirect addressing mode
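The four operand kinds introduced so far can be mimicked with a tiny machine model. The sketch below (Python; the register names, memory size and helper functions are all invented for illustration, mirroring the hypothetical //copy// instruction) shows how each mode locates its data:

```python
# Toy machine model for the hypothetical `copy` instruction, illustrating
# immediate, register, direct and indirect operands.

memory = bytearray(256)            # tiny data memory
regs = {"R0": 0, "R1": 0}          # two invented registers

def copy_imm(dst, value):          # copy R1, 5        (immediate)
    regs[dst] = value

def copy_reg(dst, src):            # copy R1, R0       (register)
    regs[dst] = regs[src]

def copy_direct(dst, address):     # copy R1, var      (direct memory)
    regs[dst] = memory[address]

def copy_indirect(dst, ptr_reg):   # copy R1, [R0]     (indirect memory)
    regs[dst] = memory[regs[ptr_reg]]

memory[0x20] = 99                  # a "variable" stored at address 0x20
copy_imm("R0", 0x20)               # R0 now holds that address
copy_indirect("R1", "R0")          # R1 <- memory[R0], like a pointer dereference
print(regs["R1"])                  # 99
```

Note how the indirect variant needs two lookups (register, then memory) - exactly the pointer analogy described above.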
**Variations of indirect addressing**. The indirect addressing mode can have many variations where the final address doesn't have to be the content of a single register, but rather the sum of a constant value and one or more registers. Some variants implement automatic incrementation (similar to the "++" operator) or decrementation ("--") of the index register before or after instruction execution to make processing tables faster. For example, accessing elements of the table whose base address is named //table//, where the register //R0// holds the index of the byte which we want to copy from the table to //R1//, could look like this: copy R1, table[R0]
{{ :en:multiasm:cs:addressing_index.png?600 |Illustration of indirect index addressing mode}} Illustration of indirect index addressing mode
Addressing mode with pre-decrementation (decrementing before instruction execution) could look like this: copy R1, table[--R0] Addressing mode with post-incrementation (incrementing after instruction execution) could look like this: copy R1, table[R0++] ==== Program control flow destination addressing ==== The operand of a jump, branch, or function call instruction addresses the destination of the program control flow. The result of these instructions is a change of the Instruction Pointer content. Jump instructions should be avoided in structured or object-oriented high-level languages, but they are rather common in assembler programming. Our examples will use the hypothetical //jump// instruction with a single operand: the destination address. **Direct addressing** of the destination is similar to direct data addressing. It specifies the destination address as a constant value, usually represented by a name. In assembler, we define the names of addresses in code as //labels//. In the following example, the code will jump to the label named //destin//: jump destin
{{ :en:multiasm:cs:jump_direct.png?600 |Illustration of addressing in direct jump}} Illustration of addressing in direct jump
**Indirect addressing** of the destination uses the content of the register as the address where the program will jump. In the following example, the processor will jump to the destination address which is stored in //R0//: jump [R0]
{{ :en:multiasm:cs:jump_indirect.png?600 |Illustration of addressing in indirect jump}} Illustration of addressing in indirect jump
==== Absolute and Relative addressing ==== In all previous examples, the addresses were specified as the values which represent the **absolute** memory location. The resulting address (even calculated as the sum of some values) was the memory location counted from the beginning of the memory - address "0". It is presented in Fig{{ref>addrabsolute}}.
{{ :en:multiasm:cs:addressing_absolute.png?600 |Illustration of absolute addressing of the variable}} Illustration of absolute addressing of the variable
Absolute addressing is simple and doesn't require any additional calculations by the processor. It is often used in embedded systems, where the software is installed and configured by the designer and the location of programs does not change. Absolute addressing is very hard to use in general-purpose operating systems like Linux or Windows, where the user can start a variety of different programs, and their placement in the memory differs every time they're loaded and executed. Much more useful is **relative addressing**, where an operand is specified as the difference between the target memory location and some known value which can be easily modified and accessed. Often the operands are provided relative to the Instruction Pointer, which allows the program to be loaded at any address in the address space: the distance between the currently executed instruction and the location of the data it wants to reach is always the same. This is the default addressing mode in the Windows operating system on x64 machines. It is illustrated in Fig{{ref>addrrelative}}.
{{ :en:multiasm:cs:addressing_relative.png?600 |Illustration of IP relative addressing of the variable}} Illustration of IP relative addressing of the variable
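The position-independence argument can be made concrete with one line of arithmetic: the instruction stores only a displacement, fixed at assembly time, and the effective address is computed from wherever the code actually runs. A minimal sketch (the load addresses are invented):

```python
# IP-relative addressing in miniature: the encoded displacement is constant,
# so the same code works wherever the loader places it.

def effective_address(ip: int, displacement: int) -> int:
    """Effective address = current instruction pointer + encoded displacement."""
    return ip + displacement

displacement = 0x40                                  # fixed at assembly time
print(hex(effective_address(0x1000, displacement)))  # 0x1040 if loaded at 0x1000
print(hex(effective_address(0x7000, displacement)))  # 0x7040 if loaded at 0x7000
```

The displacement never changes; only the IP does, so no relocation of the code is needed when the program is loaded at a different base address.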
Relative addressing is also implemented in many jump, branch or loop instructions. ===== Fundamentals of Data Encoding, Big Endian, Little Endian ===== The processor can work with different types of data. These include integers of different sizes, floating point numbers, texts, structures and even single bits. All these data are stored in the memory as a single byte or multiple bytes. ==== Integers ==== Integer data types can be 8, 16, 32 or 64 bits long. If the encoded number is unsigned, it is stored in natural binary representation, while if the value is signed, the representation is two's complement. A natural binary number has the value zero if all of its bits are equal to zero. If all of its bits are equal to one, the value can be calculated with the expression {{ :en:multiasm:cs:equation_binary.png?200 |}}, where n is the number of bits in the number. In two's complement representation, the most significant bit (MSB) represents the sign of the number. Zero means a non-negative number, one represents a negative value. The table {{ref>binarynumbers}} shows the integer data types with their ranges. ^ Number of bits ^ Minimum value (hexadecimal) ^ Maximum value (hexadecimal) ^ Minimum value (decimal) ^ Maximum value (decimal) ^ | 8 | 0x00 | 0xFF | 0 | 255 | | 8 signed | 0x80 | 0x7F | -128 | 127 | | 16 | 0x0000 | 0xFFFF | 0 | 65 535 | | 16 signed | 0x8000 | 0x7FFF | -32 768 | 32 767 | | 32 | 0x0000 0000 | 0xFFFF FFFF | 0 | 4 294 967 295 | | 32 signed | 0x8000 0000 | 0x7FFF FFFF | -2 147 483 648 | 2 147 483 647 | | 64 | 0x0000 0000 0000 0000 | 0xFFFF FFFF FFFF FFFF | 0 | 18 446 744 073 709 551 615 | | 64 signed | 0x8000 0000 0000 0000 | 0x7FFF FFFF FFFF FFFF | -9 223 372 036 854 775 808 | 9 223 372 036 854 775 807 |
Integer binary numbers
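The ranges from the table can be verified on a host machine with C's fixed-width integer types; the sketch also demonstrates the two's complement rule of negation (invert all bits, then add one):

<code c>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Unsigned 8-bit: all bits set gives 2^8 - 1 = 255. */
    uint8_t umax = 0xFF;

    /* Signed 8-bit (two's complement): 0x80 is -128, 0x7F is 127. */
    int8_t smin = (int8_t)0x80;
    int8_t smax = 0x7F;

    /* Negating a value in two's complement: invert all bits, add 1. */
    int8_t minus_five = (int8_t)(~5 + 1);

    printf("%d %d %d %d\n", umax, smin, smax, minus_five);
    return 0;
}
</code>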
==== Floating point ==== Integer calculations do not always cover all mathematical requirements of an algorithm. To represent real numbers, floating point encoding is used. A floating point number represents a value //A// composed of three fields: * Sign bit * Exponent (E) * Mantissa (M) fulfilling the equation {{ :en:multiasm:cs:equation_floating.png?200 |}} There are two main sizes of floating point values. A single precision number is encoded in 32 bits; a double precision number is encoded in 64 bits. They are presented in Fig{{ref>realtypes}}.
{{ :en:multiasm:cs:floating_numbers.png?600 |Illustration of single and double precision real numbers}} Illustration of single and double precision real numbers
The Table{{ref>realnumbers}} shows the number of bits of the exponent and mantissa for single and double precision floating point numbers. It also presents the smallest and largest values which can be stored using these formats (they are absolute values and can be positive or negative depending on the sign bit).
^ Precision ^ Exponent ^ Mantissa ^ The smallest ^ The largest ^
| Single (32 bit) | 8 bits | 23 bits | {{ :en:multiasm:cs:min_fp_32.png?105 }} | {{ :en:multiasm:cs:max_fp_32.png?100 }} |
| Double (64 bit) | 11 bits | 52 bits | {{ :en:multiasm:cs:min_fp_64.png?120 }} | {{ :en:multiasm:cs:max_fp_64.png?100 }} |
Floating point numbers
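How the three fields are packed can be inspected on a host machine in C, assuming the host uses IEEE 754 single precision (virtually all modern machines do). The example value -6.25 equals -1.5625 × 2², so its stored exponent is the real exponent 2 plus the bias 127:

<code c>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Reinterpret the bits of an IEEE 754 single-precision number. */
    float a = -6.25f;                 /* -6.25 = -1.5625 * 2^2 */
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);   /* safe type punning */

    uint32_t sign     = bits >> 31;            /* 1 bit            */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, bias 127 */
    uint32_t mantissa = bits & 0x7FFFFF;       /* 23 bits, hidden leading 1 */

    printf("sign=%u exponent=%u (real %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}
</code>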
The most common representation of real numbers on computers is standardised in the document IEEE Standard 754. It introduces two modifications which make the calculations easier for computers: * The biased exponent * The normalised mantissa A biased exponent means that a constant bias value (127 for single and 1023 for double precision) is added to the real exponent. As a result, the stored exponent is always non-negative, which makes comparing numbers easier. The normalised mantissa is adjusted to have exactly one bit of value "1" to the left of the binary point, which requires an appropriate exponent adjustment. ==== Texts ==== Texts are represented as a series of characters. In modern operating systems, texts are encoded using Unicode, which is capable of encoding not only the 26 basic Latin letters but also language-specific characters of many different languages. In simpler computers, such as embedded systems, 8-bit ASCII codes are often used. Every byte of the text representation in memory then contains the ASCII code of a single character. It is quite common in assembler programs to use the zero value (NULL) as the end character of the string, similar to the C/C++ null-terminated string convention. ==== Endianness ==== Data encoded in memory must be compatible with the processor. Memory chips are usually organised as a sequence of bytes, which means that every byte can be individually addressed. For processors wider than 8 bits, the question of byte order within larger data types arises. There are two possibilities: - Little Endian - the low-order byte is stored at the lower address in memory. - Big Endian - the high-order byte is stored at the lower address in memory. These two methods for a 32-bit class processor are shown in Fig{{ref>littlebigendian}}
{{ :en:multiasm:cs:little_big_endian.png?600 |Illustration of Little and Big Endian data placement in the memory}} Illustration of Little and Big Endian data placement in the memory
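A program can detect the endianness of the machine it runs on by storing a multi-byte value and inspecting its individual bytes in memory, as this C sketch shows:

<code c>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Store a 32-bit value and look at its individual bytes in memory. */
    uint32_t value = 0x11223344;
    uint8_t bytes[4];
    memcpy(bytes, &value, sizeof bytes);

    if (bytes[0] == 0x44)   /* low-order byte at the lowest address */
        printf("little endian: %02X %02X %02X %02X\n",
               bytes[0], bytes[1], bytes[2], bytes[3]);
    else                    /* high-order byte at the lowest address */
        printf("big endian: %02X %02X %02X %02X\n",
               bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}
</code>

On a typical x86 or ARM (little-endian) machine the first byte printed is 44.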
===== Interrupt Controller, Interrupts ===== An interrupt is a request to the processor to temporarily suspend the currently executing code in order to handle the event that caused the interrupt. If the request is accepted, the processor saves its state and executes a function named an interrupt handler or interrupt service routine (ISR). Interrupts are usually signalled by peripheral devices when they have some data to process. Often, peripheral devices do not send an interrupt signal directly to the processor; instead, an interrupt controller in the system collects requests from the various peripheral devices. The interrupt controller prioritizes the peripherals to ensure that the more important requests are handled first. From a hardware perspective, an interrupt can be signalled with a signal level or a signal change. * Level triggered - a stable low or high level signals the interrupt. If the interrupt signal is still active when the interrupt handler finishes, the interrupt is signalled again. * Edge triggered - the interrupt is signalled only when there is a change on the interrupt input: the falling or rising edge of the interrupt signal. The interrupt signal arrives asynchronously, which means that it can come during the execution of an instruction. Usually, the processor finishes this instruction and then calls the interrupt handler. To be able to handle interrupts, the processor must implement a mechanism for storing the address of the next instruction to be executed in the interrupted code. Some implementations use the stack, while others use a special register to store the return address. The latter approach requires software support if interrupts can be nested (if an interrupt can be accepted while already in another ISR). After finishing the interrupt subroutine, the processor uses the return address to pass program control back to the interrupted code. The Fig. {{ref>interrupt}} shows how an interrupt works with stack use.
The processor executes the program. When an interrupt arrives, the processor saves the return address on the stack and jumps to the interrupt handler. With the //return// instruction, the processor returns to the program, taking the address of the next instruction to execute from the stack.
{{ :en:multiasm:cs:interrupt_handling.png?600 |Illustration of interrupt handling with stack use}} Illustration of interrupt handling with stack use
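The stack-based hand-off shown in the figure can be modelled on a host machine in plain C. All names here are illustrative; on real hardware the ISR is entered by the processor itself, not by an ordinary function call.

<code c>
#include <stdio.h>

/* Host-side model: the "stack" stores the index of the next instruction
 * to resume at after the handler returns. */
static int stack[8];
static int sp = 0;
static int counter = 0;

static void isr_timer(void) { counter++; }   /* pretend interrupt handler */

int main(void) {
    int pc;                          /* model of the program counter */
    for (pc = 0; pc < 5; pc++) {
        if (pc == 2) {               /* interrupt arrives after instruction 2 */
            stack[sp++] = pc + 1;    /* push the return address */
            isr_timer();             /* "jump" to the handler */
            pc = stack[--sp] - 1;    /* pop the return address, resume */
        }
    }
    printf("resumed correctly, ISR ran %d time(s)\n", counter);
    return 0;
}
</code>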
==== Recognising the interrupt source ==== To properly handle interrupts, the processor must recognise the source of the interrupt: different code should be executed when the interrupt is signalled by a network controller than when its source is a timer. The information on the interrupt source is provided to the processor by the interrupt controller or directly by the peripheral. We can distinguish three main methods of calling the proper ISR for an incoming interrupt. * Fixed – typical for microcontrollers. Every interrupt handler has its own fixed starting address. * Vectored – typical for microprocessors. There is one interrupt pin; the peripheral sends the address of the handler over the data bus. * Indexed – typical for microprocessors. The peripheral sends an interrupt number, which is used as an index into a table of handlers' addresses. ==== Maskable and non-maskable interrupts ==== Interrupts can be enabled or disabled. Disabling interrupts is often used in time-critical code to ensure the shortest possible execution time. Interrupts which can be disabled are named maskable interrupts. They can be disabled with the corresponding flag in the control register. In microcontrollers, there are separate bits for different interrupts. An interrupt which cannot be disabled is named a non-maskable interrupt. Such interrupts are implemented for critical situations: * memory failure * power down * critical hardware errors In microprocessors, there is a separate non-maskable interrupt input – NMI. ==== Software and internal interrupts ==== In some processors, it is possible to signal an interrupt by executing special instructions. These are named software interrupts and can be used to test interrupt handlers. In some operating systems (DOS, Linux), software interrupts are used to implement the mechanism of calling system functions. Another group of interrupts, signalled by the processor itself, are internal interrupts.
They aren't signalled with a special instruction but occur in specific situations during normal program execution. * Exceptions – generated by abnormal behaviour of some instructions (e.g. dividing by 0). * Trap – a transfer of control to a handler routine intentionally initiated by the programmer (e.g. for debugging). * Fault – generated when errors are detected in the program (memory protection violation, invalid operation code). ===== DMA ===== Direct memory access (DMA) is a mechanism for fast data transfer between peripherals and memory. In some implementations, it is also possible to transfer data between two peripherals or from memory to memory. DMA operates without the activity of the processor; no software is executed during the DMA transfer. It must be supported by the processor and peripheral hardware, and there must be a DMA controller present in the system. The controller plays the main role in transferring the data. A DMA controller is a specialised unit which controls the data transfer process. It implements several channels, each containing an address register, used to address the memory location, and a counter, which specifies how many cycles should be performed. The address register and counter must be programmed by the processor; this is usually done in the system startup procedure. The system with an inactive DMA controller is presented in Fig.{{ref>DMAinactive}}.
{{ :en:multiasm:cs:DMA_inactive.png?600 |System with inactive DMA controller}} System with inactive DMA controller
The process of data transfer is done in several steps. Let us consider the situation when a peripheral has some data to transfer: * The peripheral signals the request to transfer data (DREQ). * The DMA controller forwards the request to the processor (HOLD). * The processor accepts the DMA cycle (HLDA) and disconnects from the buses. * The DMA controller generates the address on the address bus and sends the acknowledge signal to the peripheral (DACK). * The peripheral sends the data over the data bus. * The DMA controller generates a write signal to store the data in memory. * The DMA controller updates the address register and the counter. * If the counter reaches zero, the data transfer stops. Everything is done without any action of the processor; no program is fetched and executed. Because everything is done by hardware, each transfer can be completed in one memory access cycle, much faster than by the processor. Data transfer by the processor is significantly slower because a single cycle requires executing at least four instructions and two data transfers: one from the peripheral to the processor and another from the processor to memory. The system with an active DMA controller is presented in Fig.{{ref>DMAactive}}.
{{ :en:multiasm:cs:DMA_active.png?600 |System performing DMA transfer}} System performing DMA transfer
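The address-register-plus-counter mechanism described above can be sketched as a host-side C model. The structure and function names are illustrative, not a real controller's programming interface:

<code c>
#include <stdint.h>
#include <stdio.h>

/* Host-side model of one DMA channel: an address register and a
 * down-counter, programmed once by the "processor" and then advanced
 * by "hardware" one byte per cycle. */
typedef struct {
    uint8_t *address;   /* current memory address */
    uint16_t counter;   /* remaining transfer cycles */
} dma_channel_t;

/* One DMA cycle: store the byte from the "peripheral", then update
 * the address register and the counter. */
static void dma_cycle(dma_channel_t *ch, uint8_t data_from_peripheral) {
    *ch->address = data_from_peripheral;
    ch->address++;
    ch->counter--;
}

int main(void) {
    uint8_t memory[4] = {0};
    dma_channel_t ch = { memory, sizeof memory };  /* programmed at startup */

    uint8_t sample = 0x10;
    while (ch.counter > 0)          /* transfer stops when the counter hits 0 */
        dma_cycle(&ch, sample++);

    printf("%02X %02X %02X %02X\n",
           memory[0], memory[1], memory[2], memory[3]);
    return 0;
}
</code>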
DMA transfer can be done in several modes: * Single - one transfer at a time * Block (burst) - a block of data at a time * On-demand - as long as the I/O device accepts transfers * Cycle stealing - DMA and CPU cycles alternate, one at a time * Transparent - DMA transfers only when the CPU is not using the buses ====== Programming in Assembler for IoT and Embedded Systems ====== Assembler is a low-level language that directly reflects processor instructions. It is used for programming microcontrollers and embedded devices, allowing for maximum control over the hardware of IoT systems. Programming in assembler requires knowledge of the processor architecture, which makes it more complex than high-level languages, but it offers greater efficiency. It is often used in critical applications where speed and efficiency matter. It also enables code optimization for energy consumption. In IoT and embedded systems, assembler enables efficient use of hardware resources, offering direct access to the hardware. It is possible to use assembler code in environments like Arduino or Raspberry Pi. Assembler programs are fast and efficient, which is crucial in systems with limited resources. Direct control of processor registers and hardware components allows for the creation of advanced applications. Although many libraries use assembler functions, it is often necessary to write code independently. Programming in assembler requires a deep understanding of the processor architecture, which improves system design and optimization. ===== Evolution of the Hardware ===== The development of computer hardware is crucial for technology, particularly in the context of the Internet of Things (IoT) and embedded systems. The beginnings date back to the 1940s, when the first electronic computers, such as ENIAC, were created. In the 1970s, the first microprocessors, such as the Intel 4004, appeared, revolutionizing the computer industry. Processors are advanced computing units in computers, smartphones, and other devices.
Their main tasks are performing computational operations, managing memory, and controlling system components. Microprocessors are a subset of processors designed to work in more specialized devices, such as televisions or industrial equipment. They are small, efficient, and energy-saving, making them ideal for Internet of Things (IoT) applications. Microprocessors became the foundation for the development of microcontrollers, which are the heart of embedded systems. Microcontrollers differ from processors and microprocessors in applications and functionality: they are small, standalone units with built-in memory, clocks, and interfaces. Microcontrollers are complex integrated circuits designed to act as the brain of electronic devices. The central processing unit (CPU) plays a key role in their operation, controlling the microcontroller's functions and performing arithmetic and logical operations. Inside the microcontroller are registers, communication interface controllers, timers, and memory, all of which are connected by a common bus. Modern microcontrollers have built-in functions such as analog-to-digital converters (ADC) and communication interfaces (UART, SPI, I2C). There are several types of microcontrollers, including PIC (Microchip Technology), AVR (Atmel), ESP8266/ESP32 (Espressif), and TI MSP430 (Texas Instruments). ===== Specific Elements of AVR Architecture ===== AVR is an extension of the idea presented in Vegard Wollan and Alf-Egil Bogen's thesis. Together with Gaute Myklebust, they patented the architecture, and in 1996, Atmel Norway was established as the AVR microcontroller design center. In 2016, Microchip acquired Atmel. The AVR architecture is a popular choice for microcontrollers due to its efficiency and versatility.
Here are some specific elements that define the AVR architecture: * RISC Architecture: AVR microcontrollers are based on the RISC (Reduced Instruction Set Computing) architecture, which allows most instructions to execute in a single clock cycle. This results in high performance and low power consumption. * Harvard Architecture: AVR utilizes the Harvard architecture, which features separate memory spaces for program code and data. This separation allows simultaneous access to both memories, increasing processing speed. * On-chip Memory: AVR microcontrollers feature built-in flash memory for storing program code, as well as SRAM and EEPROM for data storage. This integration simplifies design and improves reliability. * General-purpose Registers: The AVR CPU contains 32 x 8-bit general-purpose registers, facilitating efficient data manipulation and quick access during operations. * Interrupts: AVR microcontrollers support internal and external interrupt sources. * Real Time Counter (RTC): Timer/Counter Type 2 (TC2) is a general-purpose, dual-channel, 8-bit timer/counter module which can be clocked from an external 32.768 kHz watch crystal. * I/O Capabilities: AVR microcontrollers offer various programmable I/O lines, allowing flexible interfacing with external devices and peripherals. * Watchdog Timer: A programmable watchdog timer enhances system reliability by resetting the microcontroller in the event of software failure. * Clock Oscillator: The built-in RC oscillator provides the necessary clock signals for the microcontroller's operation.
{{:en:multiasm:piot:architecture.gif?400|}} Other Architectures: PIC * Manufacturer: Microchip Technology * Architecture: RISC * PIC16: 8-bit microcontrollers, various versions with different peripheral sets, low power consumption * PIC18: 8-bit microcontrollers with advanced features such as CAN FD * PIC32: 32-bit microcontrollers, high performance, wide range of applications ESP8266/ESP32 * Manufacturer: Espressif Systems * Architecture: Tensilica Xtensa (extended version based on RISC) * ESP8266: Single-core processor, 80 MHz (overclockable to 160 MHz), 32 KB instruction RAM, 80 KB user RAM, Wi-Fi * ESP32: Dual-core processor, 240 MHz, 520 KB SRAM, Wi-Fi, Bluetooth, more GPIO TI MSP430 * Manufacturer: Texas Instruments * Architecture: RISC * MSP430: 16-bit microcontrollers, ultra-low power consumption, advanced analog peripherals, fast switching between modes ===== Registers ===== Registers are a key element of AVR microcontrollers. There are various types of registers, including general-purpose, special-purpose, and status registers. General-purpose registers are used to store temporary data. Special registers control microcontroller functions, such as timers or the ADC. Status registers store information about the processor state, such as the carry or zero flags. Each register has a specific function and is accessible through particular assembler instructions. Registers allow quick access to data and control over the processor. AVR CPU General Purpose Working Registers (R0-R31) **R0-R15:** Basic general-purpose registers used for various arithmetic and logical operations. **R16-R31:** General-purpose registers that can additionally be used with immediate operands; instructions such as LDI, ANDI, or ORI work only on these registers. {{:en:multiasm:piot:avr_general_purpose_register.png?400|}} **The X-register, Y-register, and Z-register:** address pointers for indirect addressing. R26-R31 have additional functions as 16-bit address pointers for indirectly addressing the data space; they are defined as the X, Y, and Z registers.
**Other registers:** RAMPX, RAMPY, RAMPZ: Registers concatenated with the X-, Y-, and Z-registers, enabling indirect addressing of the entire data space on MCUs with more than 64 KB of data space, and constant data fetch on MCUs with more than 64 KB of program space. RAMPD: Register concatenated with the Z-register, enabling direct addressing of the whole data space on MCUs with more than 64 KB of data space. EIND: Register concatenated with the Z-register, enabling indirect jump and call to the entire program space on MCUs with more than 64K words (128 KB) of program space. ===== Data Types and Encoding ===== In assembler, various data types, such as bytes, words, and double words, are used. Bytes are the smallest data unit and have 8 bits. Words have 16 bits, and double words have 32 bits. Data types are used to store integers, floating-point numbers, and characters. Data encoding involves representing information in binary form. Data can be written in binary, decimal, or hexadecimal notation. Character encoding is done using standards such as ASCII. Understanding data types and encoding is crucial for effective programming in assembler. This course will discuss various data types and encoding techniques. Practical examples will demonstrate how to encode and decode data in assembly language. AVR microcontrollers are classified as 8-bit devices, but the instruction word is 16-bit. **Data Types in AVR Assembler** * Single bits: Can be used for logical operations and control, e.g., setting or clearing bits in registers. * Bytes, 8-bit values: The basic data type in AVR, used to store small integers, ASCII characters, etc. * Words, 16-bit values: Consist of two bytes, used to store larger integers or memory addresses. * Double Words, 32-bit values: Consist of four bytes, used to store even larger integers or addresses. With additional libraries, such as those used with the C language, 64-bit data can also be utilized; such data require extra handling.
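The relationship between these types can be sketched in C on a host machine; composing wider values from bytes mirrors what an 8-bit AVR must do across several registers:

<code c>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* A 16-bit word is built from two 8-bit bytes ... */
    uint8_t  low  = 0x34, high = 0x12;
    uint16_t word = (uint16_t)((high << 8) | low);

    /* ... and a 32-bit double word from two 16-bit words. */
    uint32_t dword = ((uint32_t)0xABCD << 16) | word;

    printf("word=0x%04X dword=0x%08X\n", word, (unsigned)dword);
    return 0;
}
</code>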
**Encoding** * Binary Encoding: Data is stored in binary format, the most fundamental form of data representation in microcontrollers. * ASCII Encoding: Characters are often stored using ASCII encoding, where a unique 7-bit or 8-bit code represents each character. * BCD (Binary-Coded Decimal): Sometimes used for applications requiring precise decimal representation, such as digital clocks or calculators. Each decimal digit is represented by its binary equivalent. * Endianness: AVR microcontrollers typically use little-endian format, where the least significant byte is stored at the lowest memory address. ===== Addressing Modes ===== Addressing modes define how the processor accesses data. There are 15 different addressing modes, such as: Direct Addressing, Indirect Addressing, Indirect with Displacement, Immediate Addressing, Register Addressing, Relative Addressing, Indirect I/O Addressing, and Stack Addressing. Details: **1. Direct Single Register Addressing** {{:en:multiasm:piot:adr01.png?400|}} The operand is contained in the destination register (Rd). **2. Direct Register Addressing, Two Registers** {{:en:multiasm:piot:adr02.png?400|}} Operands are contained in the source register (Rr) and destination register (Rd). The result is stored in the destination register (Rd). **3. I/O Direct Addressing** {{:en:multiasm:piot:adr03.png?400|}} Operand address A is contained in the instruction word. Rr/Rd specifies the destination or source register. **4. Data Direct Addressing** {{:en:multiasm:piot:adr04.png?400|}} A 16-bit Data Address is contained in the 16 LSBs of a two-word instruction. Rd/Rr specify the destination or source register. The LDS instruction uses the RAMPD register to access memory above 64 KB. **5. Data Indirect Addressing** {{:en:multiasm:piot:adr05.png?400|}} The operand address is the contents of the X-, Y-, or Z-pointer. Data Indirect Addressing is called Register Indirect Addressing in AVR devices without SRAM. **6. 
Data Indirect Addressing with Pre-decrement** {{:en:multiasm:piot:adr06.png?400|}} The X-, Y-, or Z-pointer is decremented before the operation. The operand address is the decremented contents of the X-, Y-, or Z-pointer. **7. Data Indirect Addressing with Post-increment** {{:en:multiasm:piot:adr07.png?400|}} The X-, Y-, or Z-pointer is incremented after the operation. The operand address is the content of the X-, Y-, or Z-pointer before incrementing. **8. Data Indirect with Displacement** {{:en:multiasm:piot:adr08.png?400|}} The operand address is the result of the displacement q contained in the instruction word added to the Y- or Z-pointer. Rd/Rr specify the destination or source register. **9. Program Memory Constant Addressing** {{:en:multiasm:piot:adr09.png?400|}} The constant byte address is specified by the Z-pointer contents. The 15 MSbs select the word address. For LPM, the LSb selects the low byte if cleared (LSb == 0) or the high byte if set (LSb == 1). For SPM, the LSb should be cleared. If ELPM is used, the RAMPZ register is used to extend the Z-register. **10. Program Memory Addressing with Post-increment** The constant byte address is specified by the Z-pointer contents. The 15 MSbs select the word address. The LSb selects the low byte if cleared (LSb == 0) or the high byte if set (LSb == 1). If ELPM Z+ is used, the RAMPZ register is used to extend the Z-register. **11. Store Program Memory** {{:en:multiasm:piot:adr11.png?400|}} The Z-pointer is incremented by 2 after the operation. The constant byte address is specified by the Z-pointer contents before incrementing. The 15 MSbs select the word address and the LSb should be left cleared. **12. Direct Program Memory Addressing** {{:en:multiasm:piot:adr12.png?400|}} Program execution continues at the immediate address contained in the instruction word. **13.
Indirect Program Memory Addressing** {{:en:multiasm:piot:adr13.png?400|}} Program execution continues at the address contained in the Z-register (i.e., the PC is loaded with the contents of the Z-register). **14. Extended Indirect Program Memory Addressing** {{:en:multiasm:piot:adr14.png?400|}} Program execution continues at the address contained in the EIND register and the Z-register (i.e., the PC is loaded with the contents of EIND and the Z-register). **15. Relative Program Memory Addressing** {{:en:multiasm:piot:adr15.png?400|}} Program execution continues at the address PC + k + 1. The relative address k is from -2048 to 2047. ===== Instruction Encoding ===== Assembler instructions have a specific format that defines the operation and its arguments. The instruction format can vary depending on the processor architecture. Instruction encoding involves representing operations in binary form. Instructions can vary in length, from one to several bytes. Instruction encoding is crucial for the processor's efficient execution of a program. Instruction encoding for AVR microcontrollers involves converting assembler instructions into the appropriate machine code that the processor can directly execute. Each assembler instruction is represented by a specific set of bits that define the operation and its operands. **Instruction Format:** * AVR instructions are encoded in 16-bit or 32-bit machine words. * 16-bit instructions are most commonly used, but some more complex instructions require 32-bit encoding. **Bit Allocation:** * Opcode: Specifies the type of operation (e.g., addition, subtraction, shift). * Registers: Specification of source and destination registers. * Constants: Immediate values used in instructions. * Addresses: Memory addresses used in load and store instructions. ===== Instruction Set ===== The assembler instruction set includes arithmetic, logical, control, and input/output operations.
Arithmetic instructions include addition, subtraction, multiplication, and division. Logical instructions include operations such as AND, OR, XOR, and NOT. Control instructions include jumps, loops, and conditional branches. Input/output instructions operate on ports and registers. Understanding the instruction set is crucial for effective programming in assembler. This course will discuss the most commonly used instructions and their application in assembler code. Practical examples will show how to use instructions in AVR programming. The instruction set allows full control over the processor and its functions. **Basic commands**
^ Data Transfer ^^
|ldi | load immediate|
|mov | copy register|
|movw | copy register pair|
^ Logical Instructions ^^
|and | logical AND|
|andi | logical AND with immediate|
|or | logical OR|
|ori | logical OR with immediate|
|eor | exclusive OR|
|com | one's complement|
|neg | two's complement|
^ Arithmetic Instructions ^^
|add | add without carry|
|adc | add with carry|
|adiw | add immediate to word|
|sub | subtract without carry|
|subi | subtract immediate|
|sbc | subtract with carry|
|sbci | subtract immediate with carry|
|sbiw | subtract immediate from word|
|inc | increment|
|dec | decrement|
|mul | multiply unsigned//(1)//|
|muls | multiply signed//(1)//|
|mulsu | multiply signed with unsigned//(1)//|
|fmul | fractional multiply unsigned//(1)//|
|fmuls | fractional multiply signed//(1)//|
|fmulsu | fractional multiply signed with unsigned//(1)//|
//(1) Not all devices support these instructions//
^ Bit Shifts ^^
|lsl | logical shift left|
|lsr | logical shift right|
|rol | rotate left through carry|
|ror | rotate right through carry|
|asr | arithmetic shift right|
^ Bit Manipulation ^^
|sbr | set bit(s) in register|
|cbr | clear bit(s) in register|
|ser | set register|
|clr | clear register|
|swap | swap nibbles|
===== Best Practices on Structural Programming ===== Modular Programming: Modular programming involves dividing code into
smaller, independent modules. Modules can be reused multiple times and are easy to test. Modular programming increases the readability and understandability of the code. Each module should have a clearly defined function and interface. Modules should be independent of each other, which facilitates their modification and development. ===== 3 Levels of Programming: C++, Libraries, Assembler ===== Programming AVR microcontrollers can be divided into three levels: C++, libraries, and assembler. Each of these levels offers distinct benefits and is utilized according to the project's specific requirements. * **C++** is a high-level language that allows programmers to write code in a more abstract and understandable way. Using C++ for AVR enables the use of advanced features such as object-oriented programming, inheritance, and polymorphism. This makes the code more modular and easier to maintain. Compilers like AVR-GCC convert C++ code into machine code that can be executed by the microcontroller. * **Libraries** are sets of predefined functions and procedures that facilitate programming AVR microcontrollers. An example is the AVR Libc library, which offers functions for I/O handling, memory management, and mathematical operations. Using libraries allows for rapid application development without the need to write code from scratch. Libraries are particularly useful in projects that require frequent use of standard functions. * **Assembler** is a low-level language that allows direct programming of the AVR microcontroller. Writing code in assembler gives full control over the hardware and allows for performance optimization. However, programming in assembler requires a deep understanding of the microcontroller's architecture and is more complex than programming in C++. Assembler is often used in critical applications where every clock cycle counts. The choice of programming level depends on the project's specifics. 
C++ is ideal for creating complex applications, libraries facilitate rapid prototyping, and assembler provides maximum control and performance. Each of these levels has its place in AVR microcontroller programming. ===== Software Tools for AVR ===== Programming in assembler for AVR microcontrollers requires appropriate tools that facilitate code creation, debugging, and optimization. Here are some popular software tools for AVR: * Microchip Studio (formerly Atmel Studio) is an integrated development environment (IDE) for AVR and SAM microcontrollers. It allows writing, compiling, and debugging code in assembler and C/C++. Microchip Studio offers support for a wide range of AVR devices and integration with programming and debugging tools. * AVR Toolchain is a set of tools that includes a compiler, assembler, linker, and standard and mathematical libraries. These tools are based on GNU projects and are available for Windows, Linux, and macOS. AVR Toolchain is ideal for developers who prefer working with command-line tools. * AVRDUDE (AVR Downloader/Uploader) is a tool for programming AVR microcontrollers. It allows uploading code to the microcontroller using various programmers. AVRDUDE is often used in conjunction with other tools, such as Microchip Studio, to provide a complete programming cycle. ===== Hardware Debugging ===== Hardware debugging of AVR microcontrollers is a crucial element of the programming process, enabling precise testing and diagnosis of code issues. Here are some popular methods and tools for hardware debugging for AVR: * debugWIRE is a debugging interface developed by Atmel (now Microchip Technology) for AVR microcontrollers. It enables debugging using a single pin (RESET) and is particularly useful in small microcontrollers that lack many pins. debugWIRE allows setting breakpoints, tracking code execution, and monitoring registers. 
* JTAG (Joint Test Action Group) is a standard debugging interface that enables full debugging of AVR microcontrollers. JTAG offers advanced features, including code execution tracking, setting breakpoints, and monitoring registers and memory. It is a more advanced tool than debugWIRE but requires more pins. * PDI (Program and Debug Interface) is a debugging interface used in some AVR microcontrollers, such as the AVR XMEGA family. PDI enables programming and debugging of the microcontroller using two pins. It is a compromise between the simplicity of debugWIRE and the advanced capabilities of JTAG. Debugging Tools * Microchip Studio: An integrated development environment (IDE) for AVR microcontrollers that supports debugging using debugWIRE, JTAG, and PDI. * AVR-GDB: A version of the classic GDB debugger adapted for AVR microcontrollers. It allows debugging code at the assembler and C/C++ level. * AVRDUDE: A tool for programming AVR microcontrollers that can be used in conjunction with hardware debuggers. Simulators * Simulators, such as SimulAVR, allow testing and debugging code without the need for physical hardware. Simulators are useful for verifying code functionality and identifying errors at an early stage. Practical Tips * Setting fuses: Before starting debugging, the microcontroller fuses should be properly set to enable the use of debugWIRE or JTAG. * Disconnecting external reset sources: In the case of debugWIRE, all external reset sources should be disconnected to avoid interference. ===== Energy Efficiency ===== There are five sleep modes to select from: * Idle Mode * Power-Down * Power-Save * Standby * Extended Standby To enter any of the sleep modes, the Sleep Enable bit in the Sleep Mode Control Register (SMCR.SE) must be written to '1' and a SLEEP instruction must be executed. The Sleep Mode Select bits (SMCR.SM[2:0]) select which sleep mode (Idle, Power-Down, Power-Save, Standby, or Extended Standby) will be activated by the SLEEP instruction.
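The SMCR bit arithmetic can be modelled on a host machine. This C sketch assumes the ATmega328-style register layout (SM[2:0] in bits 3:1, SE in bit 0) and only demonstrates composing the register value; on real hardware the write would target the actual SMCR I/O register before executing SLEEP.

<code c>
#include <stdint.h>
#include <stdio.h>

/* Host-side model of the SMCR layout (ATmega328-style assumed). */
#define SE_BIT        0x01   /* Sleep Enable, bit 0         */
#define SM_POWER_DOWN 0x2    /* SM[2:0] = '010' (Power-Down) */

int main(void) {
    uint8_t smcr = 0;

    /* Select Power-Down mode and set Sleep Enable, as done just
     * before executing the SLEEP instruction. */
    smcr = (uint8_t)((SM_POWER_DOWN << 1) | SE_BIT);
    printf("SMCR = 0x%02X\n", smcr);   /* 0x05 */

    /* Good practice: clear SE right after waking up. */
    smcr &= (uint8_t)~SE_BIT;
    printf("SMCR = 0x%02X\n", smcr);   /* 0x04 */
    return 0;
}
</code>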
The sleep options shown below are those of the ATmega328PB.

{{:en:multiasm:piot:sleep_modes.png?400|}}

**Idle Mode**
When the SM[2:0] bits are set to '000', the SLEEP instruction puts the MCU into Idle mode, stopping the CPU but allowing peripherals like SPI, USART, Analog Comparator, 2-wire Serial Interface, Timer/Counters, Watchdog, and interrupts to continue operating. This mode halts the CPU and Flash clocks but keeps other clocks running. The MCU can wake up from both external and internal interrupts.

**Power-Down Mode**
When the SM[2:0] bits are set to '010', the SLEEP instruction puts the MCU into Power-Down mode, stopping the external oscillator. Only external interrupts, the 2-wire Serial Interface address watch, and the Watchdog (if enabled) can wake the MCU. This mode halts all generated clocks, allowing only asynchronous modules to operate.

**Power-Save Mode**
When the SM[2:0] bits are set to '011', the SLEEP instruction puts the MCU into Power-Save mode, similar to Power-Down but with Timer/Counter2 running if enabled. The device can wake up from Timer Overflow or Output Compare events from Timer/Counter2.

**Standby Mode**
When the SM[2:0] bits are set to '110' and an external clock option is selected, the SLEEP instruction puts the MCU into Standby mode, similar to Power-Down but with the oscillator running. The device wakes up in six clock cycles.

**Extended Standby Mode**
When the SM[2:0] bits are set to '111' and an external clock option is selected, the SLEEP instruction puts the MCU into Extended Standby mode, similar to Power-Save but with the oscillator running. The device wakes up in six clock cycles.

AVR® 8-bit microcontrollers include several sleep modes to save power. The AVR device can also lower power consumption by shutting down the clock to selected peripherals via a register setting. That register is called the Power Reduction Register (PRR).
{{:en:multiasm:piot:sleep_modes_details.png?400|}}

The PRR provides a runtime method to stop the clock of selected individual peripherals. The current state of the peripheral is frozen, and its I/O registers cannot be read or written. Resources used by the peripheral remain committed while its clock is stopped; hence, the peripheral should, in most cases, be disabled before stopping the clock. Waking up a module, which is done by clearing its bit in PRR, puts the module into the same state as before the shutdown. PRR clock shutdown can be used in Idle mode and Active mode to significantly reduce the overall power consumption. In all other sleep modes, the clock is already stopped.

**Tips to Minimize Power Consumption**
* Analog-to-Digital Converter (ADC): disable the ADC before entering sleep modes to save power.
* Analog Comparator: disable the Analog Comparator in Idle and ADC Noise Reduction modes if not used.
* Brown-Out Detector (BOD): turn off the BOD if not needed, as it consumes power in all sleep modes.
* Internal Voltage Reference: disable it if not needed by the ADC, Analog Comparator, or BOD.
* Watchdog Timer: turn off the Watchdog Timer if not needed, as it consumes power in all sleep modes.
* Port Pins: configure port pins to minimize power usage, ensuring no pins drive resistive loads.
* On-chip Debug System: disable the on-chip debug system if not needed, as it consumes power in sleep modes.

====== Programming in Assembler for Mobiles and ARM ======

Let's take a look at the mobile devices market. Mobile phones, tablets, and other devices are built on the ARM processor architecture. For example, take the Snapdragon SoC (System-on-Chip) designed by Qualcomm – this chip integrates a CPU based on the ARM architecture. Of course, such a chip may also have a graphics processing unit (GPU) and even a digital signal processor (DSP) for faster signal processing. Similarly, Apple A18 processors are based on the ARM architecture.
The only difference between all these mobile devices is the ARM version on which the processor is designed. ARM stands for Advanced RISC Machine. Today, ARM has multiple processor architecture series, including Cortex-M, Cortex-R, Cortex-A, and Cortex-X, as well as other series. Cortex-X series processors are built for performance, to be used in smartphones and laptops. The series currently supports up to 14 cores, but this value changes over time. The Cortex-A series is made for devices designed to execute complex computation tasks. These processors are made to run workloads power-efficiently, increasing battery life. Similarly to the Cortex-X series, these processors may also have up to 14 cores. The Cortex-R series is made for real-time operations, where reaction to events is crucial. These are made specifically to be used in time-sensitive and safety-critical environments. The architecture itself is very similar to the Cortex-A processors, with some exceptions that will not be discussed here. The Cortex-M series is for microcontrollers, where low power consumption combined with sufficient computational power is required. Many wearable, IoT, and embedded devices contain ARM microcontrollers. Mobile devices tend to use Cortex-A series processors; to learn this architecture, we will use the Raspberry Pi. For the Cortex-M series, we chose an ST-manufactured microcontroller, the STM32H743VIT6, to work with the assembler programming language. The microcontroller was chosen for a reason – its processor core is the ARMv7-M-based Cortex-M7, one of the most powerful processor architectures among microcontrollers.
===== Overview of ARM and Mobile Device Architecture =====
Now you may realize that there is more than one assembly language type. The most widely used assembly language types are ARM, MIPS, and x86.
New architectures will come, and they will replace the old ones, just because they may have reduced power consumption or silicon die size. Another push for new CPU architectures is malware that uses architectural features against users to steal their data. Attacks such as “Spectre” and “Meltdown” exploit vulnerabilities in modern CPU designs. The exploited mechanisms were designed as CPU features – speculative execution and branch prediction. So, this means that in the future, there will be new architectures or newer versions of them, and alongside them, new malware.
===== ARM Assembly Language Specifics =====
CODE FORMAT and IMAGES MISSING
Before starting programming, it's good to know how to add comments to our newly created code, as well as the syntax specifics. This mainly depends on the selected compiler. All code examples provided will be created for the GNU ARM assembler ([[https://sourceware.org/binutils/docs-2.26/as/|GNU as documentation]]). The symbol ‘@’ starts a comment that extends to the end of the line. Some other assembly languages use ‘;’ for comments, but in ARM assembly language, the ‘;’ character separates statements placed on one line. The suggestion is to avoid using the ‘;’ character; we will not place multiple statements on a single line. Multiline comments are created in the same way as in the C/C++ programming languages, using the ‘/*’ and ‘*/’ character combinations. Now, let's compare the Cortex-M and Cortex-A series. The ARMv8-A processor series supports three instruction sets: A32, T32, and A64. A32 and T32 are the Arm and Thumb instruction sets, respectively. These instruction sets are used to execute instructions in the AArch32 Execution state. The A64 instruction set is used when executing in the AArch64 Execution state; note that there are no 64-bit wide instructions – the name refers to the execution state, not to the size of the instructions in memory.
ARM Cortex-M processors based on ARMv7 can operate with two instruction types, Thumb and ARM. The Thumb instructions are 16 bits wide and have some restrictions, like limited access to the registers. Many microcontrollers are currently based on ARMv7 processors, but new ones based on ARMv8 and ARMv8.1 are already available. The ARMv8-M series processors work only with Thumb instructions; there is no ARM state to execute 32-bit ARM instructions, so these processors do not support the A32 instruction set. The feature was removed, likely because there was little need for the ARM state – most compilers target the Thumb instruction set anyway. Looking at the ARMv8-A processors, the instruction sets differ in some respects. The AArch32 execution state can execute programs designed for ARMv7 processors. This means many A32 or T32 instructions on ARMv8 are identical to the ARMv7 ARM or THUMB instructions, respectively. On ARMv7 processors, the THUMB instruction set has some restrictions, like reduced access to the general-purpose registers, whereas the ARM instruction set can access all the registers. Similar access restrictions exist on ARMv8 processors. Let's look at the machine code produced from the assembler instructions to determine why a particular instruction set has some restrictions on an ARMv8 processor. We will take just one instruction, one of the most common, the MOV instruction. More than one machine code exists for the MOV instruction because of the different addressing modes. We will take just one variant, which copies a value stored in one register to another.
**A64**: The machine code for the MOV instruction is given in the figure above. The bit values presented are fixed for proper operation identification. The ‘sf’ bit identifies the data encoding variant: 32-bit (sf=0) or 64-bit (sf=1). [ISA_A64 p.646] The ‘Rm’ and ‘Rd’ bit fields identify the exact register number from which the data will be copied and the destination register, respectively.
Each is 5 bits wide, so all 32 CPU registers can be addressed. The ‘sf’ bit only identifies the number of bits to be copied between registers: 32 or 64 bits of data. The ‘opc’ bit field identifies the operation variant (the addressing mode for this instruction), and the ‘N’ bit is mainly used for bitwise shift operations; for others, like this one, the bit has no meaning. There are instructions where the ‘N’ bit identifies some instruction options, but then it is used together with the ‘imm6’ bits.
**A32**: The figure above shows the same instruction, but the register address is smaller than in the A64 instruction. The bit fields ‘Rd’ and ‘Rm’ are 4 bits wide, so only 16 CPU registers can be addressed using this instruction in A32. Other bit fields are present as well: ‘cond’ is used for conditional instruction execution, the ‘S’ bit identifies whether the status register must be updated, the ‘stype’ bit field gives the type of shift to be applied to the second source register, and finally, the ‘imm5’ bit field identifies the shift amount, 0 to 31.
**T32**: The Thumb instruction set has multiple machine codes for this one operation. In the T1 THUMB encoding, the D bit and the Rd field together identify the destination register. The source and destination registers can now be addressed through only four bits: only 16 general-purpose registers are accessible. Even fewer registers can be accessed in the following machine code. The OP bit field identifies the shift type, and imm5 identifies its amount: the Rm register value is shifted by imm5 bits and the result is stored in Rd. Notice that only three bits are used to address the general-purpose registers – only eight registers are accessible. Finally, the last machine code for this instruction is a 16-bit wide instruction, but still, the Rd and Rm fields are four bits wide.
This instruction has more shift operations available than the previous one, but instead of one imm5 field identifying the shift amount, the field is divided into two parts: imm3 and imm2. Both parts are combined for the same purpose: to identify the shift amount. The different machine codes for T32 instructions give the ability to choose the most suitable one, but the code must stay consistent with one machine code type. Switching between machine code types in the processor is still possible, but compiling code that mixes machine code types would be even more complicated than learning assembler. Summarising these instruction sets: A32 was best suited for ARMv7 processors, with only 16 general-purpose registers available. ARMv8 provides 31 addressable general-purpose registers, prompting ARM to introduce the A64 instruction set, where 32 registers can be addressed. This is why the A64 instruction set is used in the following sections.
**Instruction options**
All assembly language types use similar mnemonics for arithmetic operations (some may require additional suffixes to identify certain options for the instruction). A32 assembly instructions have specific suffixes to make commands execute conditionally; the four most significant bits of many instructions give this ability. Unfortunately, there is no such option for A64, but there are special conditional instructions that we are going to describe in the following subsection. We looked at a straightforward instruction and its exact machine code in the previous section. Examining the machine codes of each instruction is a perfect way to learn all the available options and all the restrictions. To help understand and read the instruction set documentation, there will be another example, for the ADD instruction in the A64 instruction set. The ADD instruction: let's first look at the assembler instruction, which adds two registers together and stores the result in a third register.
ADD X0, X1, X2 @ X0 = X1 + X2
We need to look at the instruction set documentation to determine the possible options for this instruction. In the documentation, we can find three main variants of the ADD instruction. In addition, for data manipulation instructions, the ‘S’ suffix can be added to update the status flags in the processor Status Register.
1. The ADD and ADDS instructions with extended registers:
ADD X3, X4, W5, UXTW @ X3 = X4 + W5
ADDS X3, X4, W5, UXTW @ X3 = X4 + W5
The machine-code fields of the assembler instruction ADDS X3, X4, W5, UXTW map as follows:
Rd = X3 @ points to the register where the result will be stored
Rn = X4 @ points to the first of the provided operands
Rm = W5 @ points to the second of the provided operands, which will be extended to 64 bits
We already know that the ‘sf’ bit identifies the length of the data (32 or 64 bits). The main difference between these two instructions is in the ‘S’ bit; the same letter appears in the name of the instruction. The ‘S’ bit tells the processor to update the status bits after instruction execution. These status bits are crucial for conditions. The 30th ‘op’ bit and the ‘opt’ bits are fixed and not used for this instruction. The three option bits (13th to 15th) extend the operation. These bits are used to extend the second source (Rm) operand. This is handy when the source operands differ in length, for example when the first operand is 16 bits wide and the second is 8 bits wide. The second register must be extended to maintain the data alignment. With three option bits, there are 8 different ways to extend the second source operand. The table below explains all these options; let's look at the options only, as the exact bit values are irrelevant for learning the assembler.
* UXTB/SXTB – unsigned/signed byte (8-bit) is extended to word (32-bit)
* UXTH/SXTH – unsigned/signed halfword (16-bit) is extended to word (32-bit)
* UXTW/SXTW – unsigned/signed word (32-bit) is extended to double word (64-bit)
* UXTX/SXTX or LSL – unsigned/signed double word (64-bit) is extended to double word (64-bit); there is no real use for such an extension with the unsigned data type.
For the UXTX option, the LSL shift is preferred if the ‘imm3’ bits are set from 0 to 4. Other ranges are reserved and unavailable because the result can be unpredictable. Moreover, this shift is only available if the ‘Rd’ or the ‘Rn’ operand equals ‘11111’, which is the stack pointer (SP). In all other cases, the UXTX extension will be used. In conclusion, this instruction type is handy when the operands are of different lengths, but that's not all. The shift applied to the second operand allows us to multiply it by 2, 4, 8 or 16, but it works only if the destination register is 64 bits wide (the Xn registers are used). The shift amount is restricted to at most 4, even though the ‘imm3’ field could encode larger values. Also, SXTB/H/W/X are used when the second operand can store negative integers.
ADDS X3, X4, W5, SXTW #2 /* sign-extend the W5 register to 64 bits and then shift it
left by 2 (LSL), which makes a multiplication by 4;
add the multiplied value to X4 and store the result in the X3 register:
X3 = X4 + (W5*4) */
ADD X3, X4, W5, UXTB #1 /* take the lowest byte of W5 (W5[7:0]),
zero-extend it to 64 bits, shift left by 1 (multiply by 2),
add to X4 and store in X3 */
ADD X7, X8, W9, SXTH /* take W9[15:0], sign-extend to 64 bits without shifting */
/* add to X8 and store in X7 */
/* X7 = X8 + W9[15:0] */
2. The ADDS (ADD) instructions with an immediate value: in the machine code, it is possible to identify the maximum value that can be added to the register. The ‘imm12’ bits restrict the value to the range 0 to 4095.
Besides that, the ‘sh’ bit allows shifting the immediate value left (LSL) by 12 bits. Examples:
ADD W0, W1, #100 @ basic 32-bit ADD
@ adds 100 to W1, stores in W0; no shift is performed
ADD X0, X1, #4095 @ basic 64-bit ADD
@ adds 4095 to X1, stores in X0
ADD X2, X3, #1, LSL #12 @ 64-bit ADD with shifted immediate (LSL #12)
@ adds 4096 to X3 (1 << 12 = 4096)
@ stores the result in X2
ADD X4, SP, #256 @ using SP as the base register
@ adds 256 to SP; useful for frame setup or stack management
ADD SP, SP, #64 @ writing the result to SP (stack pointer arithmetic)
@ adds 64 to SP and writes the result back to SP
ADD W5, W6, #2, LSL #12 @ 32-bit ADD with shifted immediate
@ adds 8192 to W6 and stores the result in W5 (2 << 12 = 8192)
ADDS X7, X8, #42 @ ADDS (immediate) – flag-setting
@ adds 42 to X8, stores the result in X7 and updates the condition flags (NZCV)
ADDS X9, X10, #3, LSL #12 @ ADDS with shifted immediate
@ adds 12288 to X10, stores the result in X9 (3 << 12 = 12288)
@ updates the condition flags
ADDS X11, SP, #512 @ ADDS with SP as the base
@ adds 512 to SP, stores the result in X11 and updates the condition flags
3. The ADDS (ADD) instruction with a shifted register: the final ADD instruction type adds two registers together, where one of them can be shifted; the shift can be chosen between LSL (Logical Shift Left), LSR (Logical Shift Right) and ASR (Arithmetic Shift Right). The fourth shift option is not available. The value of the ‘imm6’ field identifies the number of bits by which the ‘Rm’ register is shifted before it is added to the ‘Rn’ register. Similar options are available for other arithmetic instructions like SUB, as well as for other instructions like LDR. The instruction set documentation gives the necessary information to determine the possibilities of the instructions and the restrictions on their usage.
===== Registers =====
The CPU registers are the closest place to the processor to store data. However, not all registers are meant to store data.
There might be more specialised registers to configure the processor or its additional modules. Microcontrollers have integrated peripherals to perform communication tasks, convert analogue voltage signals into digital values and vice versa, and much more. Each of them has dedicated registers to store configuration and data. These registers are not located near the CPU but somewhere in the memory map, primarily in a specified address region. The information about these peripheral module registers can be found in the reference manuals and/or programmers' manuals. Some microcontrollers keep all this information in their datasheets. This section will focus on the CPU and the registers closest to the ARMv8 CPU.
==== CPU registers ====
AArch64 provides 31 general-purpose registers (R0..R30), each 64 bits wide. A 64-bit general-purpose register is named X0 to X30, and its 32-bit view is named W0 to W30, as in the picture above. The name ‘Rn’ is used to describe architectural registers. IMAGE Note that by accessing the W0 or W1 register, we cannot access the most significant 32 bits of the corresponding 64-bit register. Also, when a W register is written, the top 32 bits (the most significant bits of the 64-bit register) are zeroed. There are no registers actually named R0 or R1, so to access the full 64-bit result, we address it with X0 or X1 (and so on); similarly, W0, W1 and so on are the 32-bit names used to address the general-purpose registers. These examples perform single 32-bit arithmetic operations: IMAGE These examples perform single 64-bit arithmetic operations: IMAGE Special registers like the Stack Pointer (SP), Link Register (LR), and Program Counter (PC) are available. The Link Register (LR) is stored in the X30 register. The Stack Pointer (SP) and the Program Counter (PC) are no longer available as regular general-purpose registers.
Using the SP register with a limited set of data processing instructions is still possible through the WSP register name. Unlike in ARMv7, the PC register is no longer accessible through data processing instructions. The PC register can be read with the ‘ADR’ instruction, i.e. ‘ADR X17, .’ – the dot ‘.’ means “here”, and register X17 will be written with the exact PC value. Some branch instructions and some load/store operations implicitly use the value of the PC register. Note that the PC and SP are general-purpose registers in the A32 and T32 instruction sets, but this is not the case in the A64 instruction set. You may have noticed that A64 instructions can address 32 registers in total, while there are only 31 general-purpose registers. When reading register number 31, the result will be either the current stack pointer or the zero register, depending on the instruction. Writing to R31 will not have any effect. Register 31 is called the Zero Register. The Zero Registers, XZR and WZR (64 or 32 bits wide), always read as zero and ignore writes.
^ ^ General-purpose registers ^ Dedicated registers ^ Notes ^
^ Architectural name | R0, R1, R2, R3, .., R29, R30 | SP, ZR | These names are mainly used in documentation about the architecture and instruction sets |
^ 64-bit | X0, X1, X2, X3, .., X29, X30 | SP, XZR | The ‘X’ stands for extended word; all 64 bits are used |
^ 32-bit | W0, W1, W2, W3, .., W29, W30 | WSP, WZR | The ‘W’ stands for word; only the bottom (least significant) 32 bits are used |
IMAGE ARMv8 has an additional set of 32 registers for floating-point and vector operations, similar to the general-purpose registers. These registers are 128 bits wide and, like the general-purpose registers, can be accessed in several ways. The letters for these registers identify byte (Bx), half-word (Hx), single-word (Sx), double-word (Dx) and quad-word (Qx) access.
IMAGE Similar instructions are performed with these examples, which perform single 32-bit floating-point arithmetic operations: IMAGE (code example) These examples perform single 64-bit floating-point arithmetic operations, and the last instruction is meant for 128-bit floating-point multiplication: IMAGE (code example) These registers operate like the general-purpose registers: writing one byte to one of them will zero out the remaining, most significant bits. To preserve all the data in these registers, a different access method must be used: the registers can be treated as vectors using the Vx register names. This means a register is treated as though it contains multiple independent values instead of a single value. Operations with vector registers are a bit complex; to make them easier to understand, let's divide the instructions that operate on vector registers into several groups. One group is meant to load and store the values of vector registers, and another group performs arithmetic operations on vector registers. The first group works with loads: the load instructions (e.g., LD1, LD1R, LDAP1, LDAPUR) load data into a vector register by identifying the result register of the instruction as a vector register. There are several load instructions to perform this operation. --Subsection NOT FINISHED --
==== CPU Configuration ====
For the Raspberry Pi 5, the ARM Cortex-A76 processor has four CPU cores. Each core has its own stack pointers, status registers and other registers. Before looking at the CPU registers, some specifics must be explained. The core has several execution levels, named EL0, EL1, EL2 and EL3. In the datasheets, these execution levels are called Exception Levels – the levels at which the processor resources are managed. EL0 is the lowest level; all user applications are executed at this level.
EL1 is meant for operating systems, EL2 for a hypervisor application that controls the resources for the OSes at the lower Exception levels, and EL3 for the secure monitor. The CPU's general-purpose registers are independent of the Exception levels, but it is essential to understand at which Exception Level the code executes. IMAGE IMAGE We will look only at the AArch64 registers to narrow down the number of registers. There are a lot of registers dedicated to the CPU. Specialised registers will again be left aside to narrow the amount of information, and only those registers meant for program execution will be reviewed. As there is more than one core, each core must have a dedicated status register. All registers that store some status on AArch64 CPU cores are collected in the table below. Don't be confused by the number of listed status registers: the registers with a green background are the ones relevant to programming, because only those registers store the actual instruction execution status. TABLE
===== Data Types and Encoding =====
Many instructions have additional suffixes to identify the length of the data used. As one example of multiple data lengths, the load/store instructions are used in this section. TABLE
==== Big/little-endian ====
ARMv8.0 added a new processor feature relevant here. In this section, the main focus is on endianness: the feature named “FEAT_MixedEnd” of ARMv8 processors allows programmers to control the endianness of data accesses. This means that ARMv8 implements both little-endian and big-endian operation. For the Raspberry Pi 5 running in AArch64, the higher Exception levels control the endianness of the lower Exception levels. For example, code running at Exception Level EL1 can control the endianness of Exception Level EL0.
Note that if a Linux OS is already running on the Raspberry Pi, then the kernel's EL1 endianness should not be changed, because the OS runs at the EL1 layer, and no OS can switch its endianness at runtime. As both modes are supported in ARMv8, there might be a need to figure out what endianness settings apply to EL0. ID_AA64MMFR0_EL1 is an AArch64 Memory Model Feature register that holds information about endianness. In the register ID_AA64MMFR0_EL1, bits [11:8] indicate the endianness features for the whole CPU, and the BigEndEL0 bits indicate the support for mixed-endian at Exception Level EL0. If the value is nonzero, then the SCTLR_EL1.E0E register bit field is controlled through the higher Exception Levels. Assuming that our designed code will be executed at Exception Level EL0, then ... <-- {not finished}
===== Addressing Modes =====
ARMv8 has simple addressing modes based on load and store principles. ARM cores perform Arithmetic Logic Unit (ALU) operations only on registers. The only supported memory operations are the load (which reads data from memory into registers) and the store (which writes data from registers to memory). Using the LDR/STR instructions and their available options, all data must be loaded into the general-purpose registers before operations can be performed on it. Multiple addressing modes can be used for the load and store instructions: IMAGE
=== Register addressing ===
A register holds the address of the data in memory: the data is loaded from, or stored to, the memory location given in the register.
LDR R0, [R1] @ fill register R0 with the data located at the address stored in the R1 register
STR R1, [R2] @ store the content of register R1 into memory at the location given in the R2 register
=== Pre-indexed addressing ===
An offset to the base register is added before the memory access.
The address is calculated by adding two registers together: [R1, R2] results in R1+R2. The register R1 is the base register, and R2 holds the offset. The offset value can also be negative, but the final, calculated address must be a positive value.
LDR R0, [R1, R2] @ address pointed to by R1 + R2
STR R2, [R4, R3] @ address pointed to by R4 + R3
The LDR instruction machine code also allows a shift operation on the offset register while maintaining the same addressing mode.
LDR R0, [R1, R2, LSL #2] @ address is R1 + (R2*4)
=== Pre-indexed addressing with write back ===
ARM treats this as a separate addressing mode because the register that points to the address in memory is modified as well. The offset is added to the base register before the access is performed.
LDR R0, [R1, #32]! @ read R0 from the address pointed to by R1+32, then R1 = R1 + 32
The exclamation mark after the closing bracket indicates that the value of register R1 must be increased by 32; after that, the memory access is performed.
=== Post-index with write back ===
Like the previous addressing mode, this one is also documented separately. As the name says, ‘post’ means that the register value is increased after the memory access is performed.
LDR R0, [R1], #32 @ read R0 from the address pointed to by R1, then R1 = R1 + 32
=== Other addressing modes ===
There are some other addressing modes available. Some are used just for function or procedure calls, others combine the previously described addressing modes, and so on. This subsection introduces some of the other available addressing modes.
* Register to register, also called the register direct addressing mode, is used to copy data from one register to another. No memory access is performed with such operations.
* Literal addressing, alternatively the immediate addressing mode, is used to encode the data directly in the instruction.
* PC-relative addressing
=== Complex addressing modes ===
This is not a real addressing mode, but some instructions give the possibility of making the addressing a bit more complex.
===== Basic Instructions and Operations =====
===== Procedures and Functions Call Standards =====
===== Compatibility for Compilers and OSes =====
===== Peripheral Management in RPi =====
===== Advanced Assembly Programming =====
===== Vectorisation =====
===== Assembly Code Optimisation =====
===== Energy Efficient Coding: Sleep Modes, Hibernate =====
====== Programming in Assembler for PC ======
===== Evolution of x86 Processors =====
The evolution of x86 processors began with the Intel 8086 microprocessor, introduced in 1978. It is worth noting that the first IBM machines, the IBM PC and IBM XT, used the cheaper version, the 8088, with an 8-bit data bus. The 8088 was software-compatible with the 8086, but its 8-bit external data bus allowed for a reduction in the cost of the whole computer. IBM later used the 8086 processors in the personal computers named PS/2. Its successor, the 80286, was used in the IBM AT personal computer, the extended version of the XT, and in the IBM PS/2 286 machine; the AT became the standard architecture for a variety of personal computers designed by many vendors. This started the history of personal computers based on the x86 family of processors. In this section, we will briefly describe the most important features of, and differences between, the models of x86 processors.
==== 8086 ====
The 8086 is a 16-bit processor, which means it uses 16-bit registers. With the use of segmentation, operating in the so-called real addressing mode with a 20-bit address bus, it can access up to 1 MB of memory. The base clock frequency of this model is 5 - 10 MHz. It implements a three-stage instruction pipeline (loosely pipelined), which allows the execution of up to three instructions at the same time.
In parallel with the processor, Intel designed a whole set of supporting integrated circuits, which made it possible to build a complete computer. One of these chips is the 8087 - a math coprocessor, known now as the Floating Point Unit.

==== 80186 ====
This model includes additional hardware units and is designed to reduce the number of integrated circuits required to build the computer. Its clock generator operates at 6 - 20 MHz. The 80186 implements a few additional instructions. It is considered a faster version of the 8086.

==== 80286 ====
The 80286 has an extended, 24-bit address bus and can theoretically use 16 MB of memory. In this model, Intel tried to introduce protected memory management with a theoretical address space of 1 GB. Due to problems with compatibility and efficiency of execution of software written for the 8086, the extended features were used by niche operating systems only, with IBM OS/2 as the most recognised.

==== 80386 ====
It is the first 32-bit processor designed by Intel for personal computers. It implements protected and virtual memory management, which was successfully used in operating systems, including Microsoft Windows. Architectural elements of the processor (registers, buses) are extended to 32 bits, which allows addressing up to 4 GB of memory. The 80386 extends the instruction pipeline with three stages, allowing for the execution of up to six instructions simultaneously. Although no internal cache was implemented, this chip enables connecting to external cache memory. Intel uses the IA-32 name for this architecture and instruction set. The original name 80386 was later replaced with i386.

==== i486 ====
It is an improved version of the i386 processor. Intel combined in one chip the main CPU, the FPU (except in the i486SX version) and a memory controller, including 8 or 16 kB of cache memory. The cache is 4-way associative, common for instructions and data.
It is a tightly pipelined processor which implements a five-stage instruction pipeline where every stage operates in one clock cycle. The clock frequency is between 16 - 100 MHz.

Tightly pipelined means that all stages of the pipeline perform their duties within the same time period. Loosely pipelined implies that some kind of buffer is used between pipeline stages to decouple the units and allow them to work more independently.

Set-associative is the method of placing memory regions in the cache. It represents a compromise between the complexity of the cache controller design and flexibility. In general, more "ways" (e.g., 4-way over 2-way) means more flexibility at the cost of complexity.

==== Pentium ====
Pentium is the successor of the i486 processor. It is still a 32-bit processor but implements a dual integer pipeline, which makes it the first superscalar processor in the x86 family. It can operate with clock frequencies ranging from 60 to 300 MHz. The improved microarchitecture also includes separate caches for instructions and data, and a branch prediction unit, which helps to reduce the cost of pipeline invalidation for conditional jump instructions. The cache memory is 8 kB for instructions and 8 kB for data, both 2-way associative.

==== Pentium MMX ====
Pentium MMX (MultiMedia Extension) is the first processor which implements SIMD instructions. It uses the FPU's physical registers for 64-bit MMX vector operations. The clock speed is 120 - 233 MHz. Intel also decided to improve the cache memory compared to the Pentium: both the instruction and data caches are twice as big (16 kB), and they are 4-way associative.

SIMD stands for Single Instruction Multiple Data. Such instructions allow for performing the same operation on more than one data element in a single instruction. Intel introduced such instructions in the Pentium MMX and continued them in further processors as the SSE and AVX instructions, naming them vector instructions.
==== Pentium Pro ====
Pentium Pro implements a new architecture (P6) with many innovative units, organised in a 14-stage pipeline, which enhances the overall performance. The advanced instruction decoder generates micro-operations - RISC-like translations of the x86 instructions. It can produce up to two micro-operations representing simple x86 instructions and up to six micro-operations from the microcode sequencer, which stores microcodes for complex x86 instructions. Micro-operations are stored in a buffer, called the reorder buffer or instruction pool, and are assigned by the reservation station to a chosen execution unit. In the Pentium Pro, there are six execution units. An important feature of the P6 architecture is that instructions can be executed in an out-of-order manner, with flexibility of physical register use known as register renaming. All these techniques allow for executing more than one instruction per clock cycle. Additionally, Pentium Pro implements a new, extended paging unit, which allows addressing 64 GB of memory. The instruction and data L1 caches are 8 kB in size each. The physical chip also includes the L2 cache assembled as a separate silicon die. This made the processor too expensive for the consumer market, so it was mainly implemented in servers and supercomputers.

==== Pentium II ====
Pentium II is a processor based on the experience gathered by Intel in the development of the previous Pentium Pro processor and the MMX extension to the Pentium. It combines the P6 architecture with SIMD instructions, operating at a maximum of 450 MHz. The L1 cache size is increased to 32 kB (16 kB data + 16 kB instructions). Intel decided to exclude the L2 cache from the processor's enclosure and assemble it as a separate chip on a single PCB. As a result, the Pentium II has the form of a PCB module, not an integrated circuit as previous models.
Although offering slightly worse performance, this approach made it much cheaper than the Pentium Pro, and the implementation of multimedia instructions made it more attractive for the consumer computer market.

==== Pentium III ====
Pentium III is very similar to the Pentium II. The main enhancement is the addition of the Streaming SIMD Extensions (SSE) instruction set to accelerate SIMD floating-point calculations. Due to improvements in the production process, it was also possible to increase the clock frequency to the range of 400 MHz to 1.4 GHz.

==== Pentium 4 ====
Pentium 4 is the last 32-bit processor developed by Intel. Some late models also implement the 64-bit enhancement. It is based on the NetBurst architecture, which was developed as an improvement to the P6 architecture. An important modification is the movement of the instruction cache from the input to the output of the instruction decoder. As a result, the cache, named the trace cache, stores micro-operations instead of instructions. To increase the market impact, Intel decided to increase the number of pipeline stages, using the term "hyperpipelining" to describe the strategy of creating a very deep pipeline. A deep pipeline could lead to higher clock speeds, and Intel used it to build its marketing strategy. The Pentium 4's pipeline in the initial model is significantly deeper than that of its predecessors, having 20 stages. The Pentium 4 Prescott even has a pipeline of 31 stages. The operating frequency ranges from 1.3 GHz to 3.8 GHz. Intel also implemented Hyper-Threading technology in the Pentium 4 HT version to enable two virtual (logical) cores in one physical processor, which share the workload between them when possible. With the Pentium 4, Intel returned to a single-chip package for both the processor core and the L2 cache. Pentium 4 extends the instruction set with SSE2 instructions, and Pentium 4 Prescott with SSE3.
The NetBurst architecture suffered from high heat emission, causing problems with heat dissipation and cooling.

==== AMD Opteron ====
Opteron is the first processor that supported the 64-bit instruction set architecture, known at the beginning as AMD64 and in general as x86-64. Its versions include the initial one-core and later multi-core processors. The AMD Opteron implements the x86, MMX, SSE and SSE2 instructions known from Intel processors, and also the AMD multimedia extension known as 3DNow!.

==== Pentium D ====
Pentium D is a multicore 64-bit processor based on the NetBurst architecture known from the Pentium 4. Each unit implements two processor cores.

==== Core Processors ====
* Pentium Dual-Core. After facing problems with heat dissipation in processors based on the NetBurst microarchitecture, Intel designed the Core microarchitecture, derived from P6. One of its first implementations is the Pentium Dual-Core. After some time, Intel changed the name of this processor line back to Pentium to avoid confusion with the Core and Core 2 processors. There is a vast range of Core processor models with different cache sizes and numbers of cores, offering lower or higher performance. From the perspective of this book, we can think of them as modern, advanced and efficient 64-bit processors, implementing all instructions which we consider. There are many internet sources where additional information can be found. One of them is the Intel website((https://www.intel.com/content/www/us/en/products/details/processors.html)), and another commonly used is Wikipedia((https://en.wikipedia.org/wiki/X86)).
* Core. All Intel Core processors are based on the Core microarchitecture. Intel uses different naming schemas for these processors. Initially, the names represented the number of physical processor cores in one chip; Core Duo has two physical cores, while Core Quad has four.
* Core 2. An improved version of the Core microarchitecture. The naming schema is similar to the Core processors.
* Core i3, i5, i7, i9. Intel changed the naming. Since then, there has been no strict information about the number of cores inside the chip; the name rather represents the overall processor's performance.
* Core X and Core Ultra X. The newest naming (at the time of writing this book, introduced in 2023) drops the "i" letter and introduces the "Ultra" versions of high-performance processors. There exist Core 3 - Core 9 processors as well as Core Ultra 3 - Core Ultra 9.

==== The most important processors ====
^ Model ^ Year ^ Class ^ Address bus ^ Max memory ^ Clock freq. ^ Architecture ^
| 8086 | 1978 | 16-bit | 20 bits | 1 MB | 5 - 10 MHz | x86-16 |
| 80186 | 1982 | 16-bit | 20 bits | 1 MB | 6 - 20 MHz | x86-16 |
| 80286 | 1982 | 16-bit | 24 bits | 16 MB | 4 - 25 MHz | x86-16 |
| 80386 | 1985 | 32-bit | 32 bits | 4 GB | 12.5 - 40 MHz | IA-32 |
| i486 | 1989 | 32-bit | 32 bits | 4 GB | 16 - 100 MHz | IA-32 |
| Pentium | 1993 | 32-bit | 32 bits | 4 GB | 60 - 200 MHz | IA-32 |
| Pentium MMX | 1996 | 32-bit | 32 bits | 4 GB | 120 - 233 MHz | IA-32 |
| Pentium Pro | 1995 | 32-bit | 36 bits | 64 GB | 150 - 200 MHz | IA-32 |
| Pentium II | 1997 | 32-bit | 36 bits | 64 GB | 233 - 450 MHz | IA-32 |
| Pentium III | 1999 | 32-bit | 36 bits | 64 GB | 400 MHz - 1.4 GHz | IA-32 |
| Pentium 4 | 2000 | 32-bit | 36 bits | 64 GB | 1.3 GHz - 3.8 GHz | IA-32 |
| AMD Opteron | 2003 | 64-bit | 40 bits | 1 TB | 1.4 GHz - 3.5 GHz | x86-64 |
| Pentium 4 Prescott | 2004 | 64-bit* | 36 bits | 64 GB | 1.3 GHz - 3.8 GHz | IA-32, Intel64* |
| Pentium D | 2005 | 64-bit | | | 2.66 GHz - 3.73 GHz | Intel64 |
| Pentium Dual-Core | 2007 | 64-bit | | | 1.3 GHz - 3.4 GHz | Intel64 |
| Core 2 | 2006 | 64-bit | 36 bits | 64 GB | 1.06 GHz - 3.5 GHz | Intel64 |
* in some models

===== Specific Elements of the x86 and x64 Architectures =====
The x86 architecture was created in the late 1970s.
The technology available at that time didn't allow for implementing advanced integrated circuits containing millions of transistors in one silicon die. The 8086, the first processor in the whole x86 family, required additional supporting elements to operate. This forced compromises between efficiency, computational capabilities, silicon size and cost. This is the reason why Intel invented some specific elements of the architecture. One of them is the segmentation mechanism, which extends the addressing space from the 64 kB typically available to 16-bit processors to 1 MB. In this chapter, we present some specific features of the x86 and x64 architectures.

==== Segmented addressing in real mode ====
The 8086 can address the memory in so-called real mode only. In this mode, the address is calculated from two 16-bit elements: segment and offset. The 8086 implements four special registers to store the segment part of the address: CS, DS, ES, and SS. During program execution, all addresses are calculated relative to one of these registers. The program is divided into three segments containing its main elements. The code segment contains processor instructions and their immediate operands; instruction addresses are relative to the CS register. The data segment is related to the DS register. It contains data allocated by the program. The stack segment contains the program stack and is related to the SS register. If needed, it is possible to use an extra segment, related to the ES register. By default, it is used by string instructions.
{{ :en:multiasm:cs:Real_segments.png?600 |Illustration of assignment of segments and segment registers in real mode}} Segments and segment registers in real mode
Although the 8086 processor has only four segment registers, there can be many segments defined in the program. The limitation is that the processor can access only four of them at the same time, as presented in Fig {{ref>realsegments}}. To access other segments, it must change the content of a segment register. The address, which consists of two elements, the segment and the offset, is named a logical address. Both numbers which form a logical address are 16-bit values. So, how to calculate a 20-bit address with two 16-bit values? It is done in the following way. The segment part, taken always from the chosen segment register, is shifted four bit positions to the left. The four bits on the right side are filled with zeros, forming a 20-bit value. The offset value is added to the result of the shift. The result of the calculation is named the linear address. It is presented in Fig {{ref>realcalc}}. In the 8086 processor, the linear address equals the physical address, which is provided via the address bus to the memory of the computer.
{{ :en:multiasm:cs:real_calculation.png?600 |Illustration of address calculation in real mode}} Address calculation in real mode
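The segment:offset calculation described above can be sketched in C (an illustrative model, not production code; the function name is ours):

```c
#include <stdint.h>

/* Real-mode linear address: the 16-bit segment is shifted left by 4
   (appending four zero bits), then the 16-bit offset is added.
   The result is a 20-bit value. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset) {
    return ((uint32_t)segment << 4) + offset;
}
```

Note that different segment:offset pairs can name the same linear address: 0x1000:0x0000 and 0x0FFF:0x0010 both yield 0x10000.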
The segment in the memory is called a physical segment. The maximum size of a single physical segment can't exceed 64 kB (65536 B), and it can start only at an address evenly divisible by 16. In the program, we define logical segments, which can be smaller than 64 kB, can overlap, or can even start at the same address.

==== Segmented addressing in protected mode ====
With the introduction of processors capable of addressing larger memory spaces, the real addressing mode was replaced with descriptor-based addressing. We will briefly describe this mechanism, taking the 80386 as an example. The 80386 processor is a 32-bit machine built according to the IA-32 architecture. Using 32 bits, it is possible to address 4 GB of linear address space. Segmentation in 32-bit processors can be used to implement protection mechanisms. They prevent access to segments created by other processes, ensuring that processes can operate simultaneously without interfering. Every segment is described by its descriptor. The descriptor holds important information about the segment, including the starting address, limit (size of the segment), and attributes. As it is possible to define many segments at the same time, descriptors are stored in memory in descriptor tables. IA-32 processors contain two additional segment registers, FS and GS, which can be used to access two extra data segments (Fig {{ref>ia32segments}}), but all segment registers are still 16 bits in size.
{{ :en:multiasm:cs:IA32_segments.png?600 |Illustration of assignment of segments and segment registers in protected mode}} Segments and segment registers in protected mode
The segment register in protected mode holds the segment selector, which is an index into the table of descriptors. The table is stored in memory, created and managed by the operating system. Each segment register has an additional part that is hidden from direct access. To speed up operation, the descriptor is loaded from the table into the hidden part automatically each time the segment register is modified. It is schematically presented in Fig {{ref>ia32addr}}.
{{ :en:multiasm:cs:IA32_addressing.png?600 |Illustration of addressing with descriptors in protected mode}} Address calculation in protected mode
Although segmentation allows for advanced memory management and the implementation of memory protection, none of the popular operating systems, including Windows, Linux or macOS, ever used it. In Windows, all segment registers, via descriptors, point to the zero base address and the maximal limit, resulting in the flat memory model. In this approach, segmentation de facto does not function. Memory protection is implemented at the paging level. The flat memory model is shown in Fig {{ref>ia32flat}}.
{{ :en:multiasm:cs:IA32_flat.png?600 |Illustration of flat memory addressing mode}} Flat memory addressing
==== Paging ====
Paging is a mechanism for translating linear addresses to physical addresses. Operating systems use it extensively to implement memory protection and virtual memory mechanisms. The virtual memory mechanism allows the operating system to use the entire address space with less physical memory installed in the computer. Paging is supported by the memory management unit using a structure of tables stored in memory. In 32-bit mode, the tables are organised in a two-level structure, with a single page directory table and a set of page tables. The pages can be 4 kB or 4 MB in size. The 1-bit information about the page size (named PS) is stored in the page directory entry.

* 4 kB pages

Each entry in the page directory holds a pointer to a page table, and every page table stores pointers to pages. The linear address is divided into three parts. The first (highest) part is a 10-bit index of the entry in the page directory table, the middle 10 bits form an index of the entry in the page table, and finally, the last (least significant) 12 bits are the offset within the page. This mechanism is shown in Fig {{ref>paging324kB}}.
{{ :en:multiasm:cs:paging_32_4kB.png?600 |Illustration of paging in 32-bit mode with 4kB pages}} Paging in 32-bit mode with 4kB pages
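The three-part split of a 32-bit linear address can be illustrated in C with simple shifts and masks (a sketch; the helper names are ours):

```c
#include <stdint.h>

/* 32-bit paging with 4 kB pages:
   bits 31..22 - index into the page directory (10 bits),
   bits 21..12 - index into the page table     (10 bits),
   bits 11..0  - offset within the 4 kB page   (12 bits). */
uint32_t pde_index(uint32_t linear)   { return (linear >> 22) & 0x3FF; }
uint32_t pte_index(uint32_t linear)   { return (linear >> 12) & 0x3FF; }
uint32_t page_offset(uint32_t linear) { return linear & 0xFFF; }
```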
* 4 MB pages

For pages of 4 MB in size, the page table level is omitted. The page directory holds the direct pointer to the page in memory. In this situation, the entry in the page directory is indexed with 10 bits, and the offset within the page is 22 bits long. It is shown in Fig {{ref>paging324MB}}.
{{ :en:multiasm:cs:paging_32_4MB.png?600 |Illustration of paging in 32-bit mode with 4MB pages}} Paging in 32-bit mode with 4MB pages
In 64-bit mode, the tables are organised in a four- or five-level structure. In the 4-level structure, the highest level is a single page map level 4 table; next, there are page directory pointer tables, page directory tables, and page tables. The linear address is divided into six parts: the sign-extended upper bits, four 9-bit table indexes, and the 12-bit offset within the page. Each table is indexed with 9 bits and can store up to 512 entries. It is shown in Fig {{ref>paging644kB}}. The pages can be 4 kB, 2 MB, or 1 GB in size. For 2 MB pages, the page table level is omitted, while for 1 GB pages, the page directory level is additionally omitted.
{{ :en:multiasm:cs:paging_64_4kB.png?600 |Illustration of paging in 64-bit mode with 4kB pages}} Paging in 64-bit mode with 4kB pages
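Under the same conventions, the 4-level split can be sketched in C; each table level consumes 9 bits above the 12-bit page offset (the helper names are ours, with level 3 standing for the page map level 4 table):

```c
#include <stdint.h>

/* 64-bit 4-level paging with 4 kB pages: the offset uses bits 11..0,
   and each table level above it is indexed with 9 bits.
   level 0 = page table, 1 = page directory,
   2 = page directory pointer table, 3 = page map level 4 table. */
uint64_t pt_index(uint64_t linear, int level) {
    return (linear >> (12 + 9 * level)) & 0x1FF;
}
uint64_t page_offset64(uint64_t linear) { return linear & 0xFFF; }
```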
As currently built computers have significantly less physical memory than can theoretically be addressed with a 64-bit linear address, producers decided to limit the usable address space. That is why the 4-level paging mechanism recognises only 48 bits, leaving the upper bits unused; in 5-level paging, 57 bits are recognised. The most significant, unused part of the address must replicate the value of the highest recognised bit of the address. As the most significant bit of a number represents its sign, duplicating this bit is known as sign extension. Sign-extended addresses with 48 or 57 recognised bits are known as canonical addresses, while others are non-canonical. It is presented in Fig {{ref>canonical}}. In current machines, only canonical addresses are valid.
{{ :en:multiasm:cs:canonical.png?600 |Illustration of canonical addresses in 64-bit mode}} Canonical addresses in 64-bit mode
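A canonical-address check for the 48-bit case can be written in C as a sign-extension test (a sketch under the stated 48-bit assumption; the function name is ours, and the signed right shift relies on the arithmetic shift used by common compilers):

```c
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical (48 recognised bits) when bits 63..47 all
   equal bit 47, i.e. the value is unchanged by sign-extending its
   low 48 bits. */
bool is_canonical48(uint64_t addr) {
    int64_t sext = (int64_t)(addr << 16) >> 16; /* sign-extend bit 47 */
    return (uint64_t)sext == addr;
}
```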
==== Addressing in x64 processors ====
Because operating system producers did not use segmentation, AMD and Intel decided to abandon it in 64-bit processors. For backwards compatibility, modern processors can run in 32-bit mode with segmentation, but the newest versions of operating systems use 64-bit addressing, named long mode and referred to as x64. The only possible addressing is a flat memory model, with segment registers and descriptors set to the zero address as the beginning of the linear memory space. In 64-bit mode, some instructions can use the segment registers FS and GS as base registers; operating systems' kernels also use them. The 64-bit mode theoretically allows the processor to address a vast memory of 16 exabytes. As it is not expected that such a big memory will be installed in currently built computers, the processors limit the available address space, not only at the paging level but also physically, having a limited number of address bus lines.

===== Register Set =====

==== x86 registers ====
The 8086 processor has seven general-purpose registers for data storage, data manipulation and indirect addressing. Four registers: AX, BX, CX and DX can be accessed in two 8-bit halves. In the register name, the lower half is denoted by the letter "L" and the upper half by the letter "H" in place of "X". General-purpose registers are presented in Fig {{ref>gpregistresx86}}.
{{ :en:multiasm:cs:GP_registers_x86.png?600 |Illustration of general purpose registers in x86 16-bit processor}} General purpose registers in 16-bit 8086 processor
They have some special functions, as listed below.
* AX - Accumulator, used for general data manipulation; some instructions work only with the accumulator.
* BX - Base register, used for indirect addressing in base addressing mode.
* CX - Counter register, used for iteration counting in loops and repeated instructions.
* DX - Data register, used as an extension of the accumulator and for I/O address space addressing.
* SI - Source index, used for indirect index addressing of source data tables or strings.
* DI - Destination index, used for indirect index addressing of destination data tables or strings.
* BP - Base pointer, often used for accessing data on the stack without the need to modify the stack pointer.

Some registers have special purposes.
* SP - Stack pointer, used for stack manipulation; automatically handles storing and retrieving the return address of procedures during calls and returns.
* IP - Instruction pointer, points to the next instruction to be executed; sometimes called the program counter.

These registers are presented in Fig {{ref>spregistresx86}}.
{{ :en:multiasm:cs:special_registers_x86.png?600 |Illustration of stack pointer and instruction pointer registers in x86 16-bit processor}} Stack pointer and instruction pointer in 16-bit 8086 processor
Every processor has a register containing bits that inform the software about the state of the ALU, i.e., the result of the last arithmetic or logic operation, and that control the behaviour of the CPU. In x86, it is called the Flags register; it is shown in Fig {{ref>flagsx86}}.
{{ :en:multiasm:cs:flags_register_x86.png?500 |Illustration of the flags register in x86 16-bit processor}} Flags register in 16-bit 8086 processor
The meaning of the informational flags is as follows:
* CF - Carry flag, set if an operation generates a carry or borrow; for example, if the result of an addition is larger than 16 bits. Used for unsigned arguments.
* PF - Parity flag, set if the result of an operation contains an even number of bits equal to "1".
* AF - Auxiliary Carry flag, similar to CF but informs about the carry or borrow from the 3rd to the 4th bit. It is used for BCD calculations.
* ZF - Zero flag, set if the result of an operation is zero.
* SF - Sign flag, a copy of the most significant bit of the result.
* OF - Overflow flag, set if the result of an operation is too large or too small to fit in the destination operand. Used for signed arguments.

Control flags allow modification of the processor's behaviour:
* TF - Trap flag, used for debugging to execute a program one instruction at a time.
* IF - Interrupt flag, if set, interrupts are enabled.
* DF - Direction flag, determines the direction of string operations. If cleared, string operations work from lower to higher addresses; if set, they operate in the opposite direction.

The 8086, being a 16-bit processor, can address memory in real mode only (please refer to section "Segmented addressing in real mode"). It has four segment registers:
* CS - Code segment register, used together with IP to access instructions in the code segment.
* DS - Data segment register, used to access the default data segment.
* ES - Extra segment register, used to access an additional data segment.
* SS - Stack segment register, used together with SP or BP to access elements on the stack.

Segment registers are shown in Fig {{ref>segmentregsx86}}.
{{ :en:multiasm:cs:segment_registers_x86.png?200 |Illustration of the segment registers in x86 16-bit processor}} Segment registers in 16-bit 8086 processor
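The informational flags for a 16-bit addition can be modelled in C (an illustrative sketch of the definitions above, not how the hardware is implemented; the function names are ours):

```c
#include <stdint.h>

/* Flags produced by a 16-bit addition, following the definitions above. */
int add16_cf(uint16_t a, uint16_t b) {          /* unsigned carry out  */
    return (uint16_t)(a + b) < a;
}
int add16_zf(uint16_t a, uint16_t b) {          /* result is zero      */
    return (uint16_t)(a + b) == 0;
}
int add16_sf(uint16_t a, uint16_t b) {          /* copy of bit 15      */
    return ((uint16_t)(a + b) >> 15) & 1;
}
int add16_of(uint16_t a, uint16_t b) {          /* signed overflow: same-sign  */
    uint16_t r = (uint16_t)(a + b);             /* operands, different-sign    */
    return ((uint16_t)(~(a ^ b) & (a ^ r)) >> 15) & 1; /* result          */
}
```

For example, 0x7FFF + 1 sets OF and SF but not CF, while 0xFFFF + 1 sets CF and ZF but not OF.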
==== IA-32 registers ====
Starting from the 80386 processor, the general-purpose registers, stack pointer, instruction pointer, and flags register were extended to 32 bits. Intel added an extra 16 bits to each of them, leaving the possibility to access the lower half as in previous models. For example, the accumulator can be accessed as the 32-bit EAX, the 16-bit AX, and the 8-bit AH and AL. The resulting register set is shown in Fig {{ref>GPregsIA32}}.
{{ :en:multiasm:cs:GP_registers_IA32.png?600 |Illustration of the register set in IA-32 32-bit processor}} Register set in 32-bit IA-32 processors
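The aliasing of EAX, AX, AH and AL can be demonstrated in C with a union (a sketch that assumes a little-endian host, which matches x86 itself; the type name is ours):

```c
#include <stdint.h>

/* Model of the EAX register and its sub-registers. On a little-endian
   host, the first byte of the union is the least significant byte,
   so AL overlays the low byte of AX and AH the next one. */
typedef union {
    uint32_t eax;                   /* full 32-bit register     */
    uint16_t ax;                    /* low 16 bits              */
    struct { uint8_t al, ah; } b;   /* low and high bytes of AX */
} Reg32;
```

Writing 0x12345678 to the `eax` member makes `ax` read 0x5678, `al` 0x78 and `ah` 0x56 on such a host.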
Similarly, the IP and flags registers were extended to 32-bit size (fig.{{ref>eflagsIA32}}). The additional bits in the Eflags register mainly handle the virtual memory mechanism and memory protection. They are used primarily by the operating system software, so we will not focus on the details here.
{{ :en:multiasm:cs:eflags_register_IA32.png?600 |Illustration of the Eflags register in IA-32 32-bit processor}} Eflags register in 32-bit IA-32 processors
In 32-bit machines, two additional segment registers were added: FS and GS (fig.{{ref>segregsIA32}}). It is worth noticing that all segment registers are still 16 bits long, but there are hidden parts, not available directly to the programmer, used for storage of descriptors (refer to the section "Segmented addressing in protected mode").
{{ :en:multiasm:cs:segment_registers_IA32.png?200 |Illustration of segment registers in IA-32 32-bit processor}} Segment registers in 32-bit IA-32 processors
==== x64 registers ==== The 64-bit extension to the IA-32 architecture was proposed by AMD. Registers were extended to 64 bits, and additionally, eight new general-purpose registers were added. They obtained the names R8 - R15. All registers (including already known registers) can be accessed as 64-bit, 32-bit lower-order half, 16-bit lower-order quarter (word) and 8-bit lower-order byte. Registers are presented in fig.{{ref>GPregsx64}} and in fig.{{ref>newregsx64}}. The Eflags register is extended to 64 bits and named Rflags. The upper 32 bits of the Rflags are reserved.
{{ :en:multiasm:cs:GP_registers_x64.png?600 |Illustration of the register set in x64 64-bit processor}} Register set in 64-bit x64 processors
{{ :en:multiasm:cs:new_registers_x64.png?600 |Illustration of new registers in x64 64-bit processor}} New registers in 64-bit x64 processors
==== FPU registers ==== The floating point unit, sometimes referred to as x87, works using eight 80-bit data registers, as presented in figure {{ref>fpudataregs}}. They are organised as a stack, so data items are pushed into the top register. Calculations are done using the top of the stack and another register, usually the next one. Registers are named ST(0) to ST(7), where ST(0) is the top of the stack; register ST(7) is the bottom. The ST name is another way to refer to ST(0). The registers contain the sign bit, exponent part and significand part, encoded according to the IEEE754 standard. Please refer to the section about data encoding and to the section about the FPU description for details of floating-point instructions.
{{ :en:multiasm:cs:fpu_data_registers.png?600 |Illustration of the register set in FPU coprocessor}} Register set in x87 Floating Point Unit coprocessor
There are also control and status registers in the Floating Point Unit as presented in figure {{ref>fpucontrolregs}}.
{{ :en:multiasm:cs:fpu_control_registers.png?600 |Illustration of the control registers FPU coprocessor}} Control registers in x87 Floating Point Unit coprocessor
The FPU instruction pointer, data pointer and last opcode registers are used for cooperation with the main processor. The Tag register (figure {{ref>fputagreg}}) holds 2-bit information about the state of each data register. It helps to manage the stack organisation of the data registers.
{{ :en:multiasm:cs:fpu_tag_register.png?400 |Illustration of the TAG register in FPU coprocessor}} TAG register in x87 Floating Point Unit coprocessor
The Status Word register contains information about the FPU's current state, including the top of the stack, exception flags, and condition codes (figure {{ref>fpustatereg}}). The latter inform about the result of the last operation, similarly to the flags in the Rflags register.
{{ :en:multiasm:cs:fpu_sw_register.png?500 |Illustration of the Status Word register in FPU coprocessor}} Status Word register in x87 Floating Point Unit coprocessor
The Control Word register (figure {{ref>fpucontrolreg}}) makes it possible to enable or disable exceptions, and to control the precision of calculations and the rounding of their results. The infinity bit is not meaningful in new processors.
{{ :en:multiasm:cs:fpu_cw_register.png?500 |Illustration of the Control Word register in FPU coprocessor}} Control Word register in x87 Floating Point Unit coprocessor
==== MMX registers ====
MMX instructions are executed with the use of 64-bit packed data. Intel decided to map the MMX registers onto the existing FPU registers. The MMX unit has 8 registers named MM0 - MM7, as presented in figure {{ref>mmxregs}}. Because they are physically the same as the FPU registers, it is not possible to freely mix MMX and FPU instructions.
{{ :en:multiasm:cs:mmx_registers.png?600 |Illustration of the MMX registers}} MMX registers
==== XMM registers ==== SSE instructions are executed with the use of 128-bit packed data. To perform calculations, new 128-bit registers were introduced. The SSE unit has 8 registers named XMM0 - XMM7, as presented in figure {{ref>xmmregs}}. Because these registers are separate from all previous ones, there are no conflicts between SSE and other instructions.
{{ :en:multiasm:cs:xmm_registers.png?500 |Illustration of the 128-bit XMM registers}} 128-bit XMM registers
To control the behaviour and to provide information about the status of the SSE unit, the MXCSR register is implemented (figure {{ref>mxcsrreg}}). It is a 32-bit register whose bits have functions similar to those of the FPU's Control Word and Status Word registers concatenated together.
{{ :en:multiasm:cs:mxcsr_register.png?500 |Illustration of the MXCSR SSE control register}} MXCSR control register for SSE unit
==== YMM registers ==== The YMM registers are the 256-bit extension of the XMM registers. They were introduced to implement the AVX instruction set. Additionally, in 64-bit processors, the number of available YMM registers was increased to 16 (figure {{ref>ymmregs}}).
{{ :en:multiasm:cs:ymm_registers.png?600 |Illustration of the 256-bit YMM registers}} 256-bit YMM registers
==== ZMM registers ==== ZMM registers are the further extension of the XMM registers to 512 bits. In 64-bit machines, 32 such registers are available (figure {{ref>zmmregs}}).
{{ :en:multiasm:cs:zmm_registers.png?800 |Illustration of the 512-bit ZMM registers}} 512-bit ZMM registers
The XMM registers are physically the lower halves of the YMM registers, which in turn are the lower halves of the ZMM registers. Anything written to an XMM register therefore appears in the lower part of the corresponding YMM and ZMM registers, as presented in figure {{ref>xyzmmregs}}.
{{ :en:multiasm:cs:xyzmm_register.png?800 |Illustration of the relation between XMM, YMM and ZMM registers}} The relation between XMM, YMM and ZMM registers
Together with the AVX-512 extension and the ZMM registers, eight opmask registers were introduced to x64 processors (figure {{ref>opmaskregs}}). They can be used, e.g., to provide conditional execution or to mask selected elements of vectors in AVX-512 instructions.
{{ :en:multiasm:cs:opmask_registers.png?400 |Illustration of the opmask registers}} Opmask registers
==== Additional registers ==== In the x64 architecture, there are additional registers used for control purposes. The CR0, CR2, CR3, CR4, and CR8 registers control the operating mode of the processor, the virtual and protected memory mechanisms, and paging. The debug registers, named DR0 through DR7, control the debugging of the processor's operations and software. The memory type range registers (MTRRs) can be used, for example, to mark the address space used for memory-mapped I/O as non-cacheable. Different processors have different sets of model-specific registers (MSRs). There are also machine check registers (MCRs). These registers are used to control and report on processor performance and to detect and report hardware errors. Most of the described registers are not accessible to an application program and are controlled by the operating system, so they will not be described in this book. For more details, please refer to the Intel documentation ((https://www.intel.com/content/www/us/en/content-details/851056/intel-64-and-ia-32-architectures-software-developer-s-manual-volume-1-basic-architecture.html?wapkw=253665)).
===== Data Types and Encoding =====
==== Fundamental data types ==== Fundamental data types cover the types of data elements that are stored in memory as 8-bit (Byte), 16-bit (Word), 32-bit (Doubleword), 64-bit (Quadword) or 128-bit (Double quadword) entities, as shown in figure {{ref>fundamentaldata}}. Many instructions allow processing data of these types without special interpretation; it is up to the programmer to interpret the inputs and results of instruction execution. The byte order in data types larger than a single byte is little-endian: the lower (least significant) byte is stored at the lowest address of the data, and this address represents the address of the data as a whole.
{{ :en:multiasm:cs:fundamental_data_types.png?700 |Illustration of fundamental data types}} Fundamental data types
The data types which can be used in x64 architecture processors can be integers or floating-point numbers. Integers are processed by the main CPU as single values (scalars). They can also be packed into vectors and processed with specific SIMD instructions, including MMX and, partially, SSE and AVX instructions. Integers can be interpreted as unsigned or signed. They can also represent a pointer to a variable or an address within the code. Scalar real numbers are processed by the FPU. SSE and AVX instructions support calculations with scalars or vectors composed of real values. All possible variants of data types are stored within the fundamental data types. Further in this chapter, we describe all possible scalar and packed integer and floating-point values.
==== Integers ==== Integers are numbers without a fractional part. In the x64 architecture, it is possible to define data of various sizes, all of them based on bytes. A single byte forms the smallest possible information item stored in memory; even if not all bits are effectively used, the smallest element which can be stored is a byte. Two bytes form a word, so in the x64 architecture the Word data type means 16 bits. Two words form the Doubleword data type (32 bits), and four words form the Quadword data type (64 bits). With the use of the large registers in modern processors, a few instructions can also use the Double Quadword data type, containing 128 bits (sometimes called Octal Word).
==== Integer scalar data types ==== Integer data types can be one, two, four or eight bytes in length. Unsigned integers are encoded in natural binary code. It means that the minimum value is 0 (zero), and the maximum value is formed with "1" bits at all available positions. The x64 architecture supports the unsigned integer data types shown in table {{ref>uinttypes}}.
^ Name ^ Number of bits ^ Minimum value ^ Maximum value ^ Minimum value (hex) ^ Maximum value (hex) ^ | Byte | 8 | 0 | 255 | 0x00 | 0xFF | | Word | 16 | 0 | 65535 | 0x0000 | 0xFFFF | | Doubleword | 32 | 0 | 4294967295 | 0x00000000 | 0xFFFFFFFF | | Quadword | 64 | 0 | 18446744073709551615 | 0x0000000000000000 | 0xFFFFFFFFFFFFFFFF |
Unsigned integer data types
Signed integers are encoded in 2's complement binary code. The highest bit of the value is the sign bit: if it is zero, the number is non-negative; if it is one, the number is negative. It means that the minimum value is encoded with the highest bit equal to 1 and all other bits equal to zero. The maximum value is formed with a "0" bit at the highest position and "1" bits at all other positions. The x64 architecture supports the signed integer data types shown in table {{ref>sinttypes}}. ^ Name ^ Number of bits ^ Minimum value ^ Maximum value ^ Minimum value (hex) ^ Maximum value (hex) ^ | Signed Byte | 8 | -128 | 127 | 0x80 | 0x7F | | Signed Word | 16 | -32768 | 32767 | 0x8000 | 0x7FFF | | Signed Doubleword | 32 | -2147483648 | 2147483647 | 0x80000000 | 0x7FFFFFFF | | Signed Quadword | 64 | -9223372036854775808 | 9223372036854775807 | 0x8000000000000000 | 0x7FFFFFFFFFFFFFFF |
Signed integer data types
==== Integer vector data types ==== Vector data types were introduced with the SIMD instructions of the MMX extension and were continued in the SSE and AVX extensions. They form packed data types containing multiple elements of the same size. The elements can be considered signed or unsigned depending on the algorithm and the instructions used. The 64-bit packed integer data type contains eight Bytes, four Words or two Doublewords, as shown in figure {{ref>packedint64}}.
{{ :en:multiasm:cs:packed_integers_64.png?600 |Illustration of 64-bit packed integer data types}} 64-bit packed integer data types
The 128-bit packed integer data type contains sixteen Bytes, eight Words, four Doublewords or two Quadwords as shown in figure {{ref>packedint128}}.
{{ :en:multiasm:cs:packed_integers_128.png?750 |Illustration of 128-bit packed integer data types}} 128-bit packed integer data types
The 256-bit packed integer data type contains thirty-two Bytes, sixteen Words, eight Doublewords, four Quadwords or two Double Quadwords as shown in figure {{ref>packedint256}}.
{{ :en:multiasm:cs:packed_integers_256.png?800 |Illustration of 256-bit packed integer data types}} 256-bit packed integer data types
The 512-bit packed integer data type contains sixty-four Bytes, thirty-two Words, sixteen Doublewords, eight Quadwords or four Double Quadwords, as shown in figure {{ref>packedint512}}. Double Quadwords are not used as operands; they appear only as the results of some operations.
{{ :en:multiasm:cs:packed_integers_512.png?800 |Illustration of 512-bit packed integer data types}} 512-bit packed integer data types
==== Floating point values ==== Floating point values store data encoded for calculations on real numbers. Depending on the precision required by the algorithm, we can use different data sizes. Scalar data types are supported by the FPU (Floating Point Unit), offering single precision, double precision or double extended precision real numbers. In C/C++ compilers, they are referred to as the float, double and long double data types, respectively. Vector (packed) floating-point data types can be processed by many SSE and AVX instructions, offering fast vector, matrix or artificial intelligence calculations. Vector units can process the half precision, single precision and double precision formats. The 16-bit Brain Float format was introduced to improve the efficiency of dot product calculations in AI training and inference algorithms. Floating point data types are shown in figure {{ref>floattypes}} and described in table {{ref>tablefloattypes}}. The table shows the number of bits used. In reality, the effective mantissa is one bit longer, because the highest bit, representing the integer part, is always "1", so there is no need to store it (except for the Double extended data format, where the integer bit is present).
{{ :en:multiasm:cs:floating_types.png?700 |Illustration of floating point data types}} Floating point data types in x64 architecture
^ Name ^ Bits ^ Mantissa bits ^ Exponent bits ^ Min value ^ Max value ^ | Double extended | 80 | 64 | 15 | {{:en:multiasm:cs:float_ep_min.png?125}} | {{:en:multiasm:cs:float_ep_max.png?125}} | | Double precision | 64 | 52 | 11 | {{:en:multiasm:cs:float_dp_min.png?120}} | {{:en:multiasm:cs:float_dp_max.png?115}} | | Single precision | 32 | 23 | 8 | {{:en:multiasm:cs:float_sp_min.png?120}} | {{:en:multiasm:cs:float_sp_max.png?110}} | | Half precision | 16 | 10 | 5 | {{:en:multiasm:cs:float_hp_min.png?110}} | {{:en:multiasm:cs:float_hp_max.png?100}} | | Brain Float | 16 | 7 | 8 | {{:en:multiasm:cs:float_bf_min.png?120}} | {{:en:multiasm:cs:float_bf_max.png?110}} |
Floating point data types
==== Floating point vector data types ==== Floating point vectors are formed with single or double precision packed data formats. They are processed by SSE or AVX instructions in a SIMD fashion. A 128-bit packed data format can store four single-precision or two double-precision elements. A 256-bit packed data format can store eight single-precision or four double-precision values. A 512-bit packed data format can store sixteen single-precision or eight double-precision values. These packed data types are shown in figure {{ref>packedfloattypes}}. Instructions operating on 16-bit half-precision values or Brain Floats can process twice as many elements simultaneously in comparison to single-precision data. It is worth mentioning that some instructions operate on a single floating-point value, using only the lowest elements of the operands.
{{ :en:multiasm:cs:packed_floats.png?800 |Illustration of packed floating point data types}} Packed floating point data types in x64 architecture
==== Bit field data type ==== A bit field is a data type whose size is given by the number of bits it occupies. A bit field can start at any bit position within a fundamental data type and can be up to 32 bits long. MASM supports it with the RECORD data type. The bit field type is shown in figure {{ref>bitfieldtype}}.
{{ :en:multiasm:cs:bit_field_type.png?300 |Illustration of bit field data type}} The bit field data type
==== Pointers ==== Pointers store the address of a memory location containing data of interest. They can point to data or to an instruction. If segmentation is enabled, pointers can be near or far. A far pointer contains the logical address (formed from the segment and offset parts); a near pointer contains the offset only. The offset can be 16, 32 or 64 bits long. The segment selector is always stored as a 16-bit number. The possible pointer types are shown in figure {{ref>pointertypes}}.
{{ :en:multiasm:cs:pointers_types.png?600 |Illustration of near and far pointers types}} The near and far pointers types
The offset is often the result of complex addressing mode calculations and is then called an effective address.
===== Addressing Modes in Instructions ===== The addressing mode specifies how the processor reaches the data in memory. The x86 architecture implements immediate, direct and indirect memory addressing. Indirect addressing can use one or two registers and a constant to calculate the final address; the addressing mode takes its name from the registers used. The 32-bit mode makes the choice of the register for addressing more flexible and enhances addressing with the possibility of scaling: multiplying one register by a small constant. In 64-bit mode, addressing relative to the instruction pointer was added for easy relocation of programs in memory. In this chapter, we will focus on the details of all addressing modes in 16, 32 and 64-bit processors. In each addressing mode, we use simple examples with the mov instruction, which copies data from the source operand to the destination operand. The order of the operands is similar to that of high-level languages: the left operand is the destination, the right operand is the source, as in the following example:
  mov destination, source
==== Immediate addressing ==== The immediate argument is a constant encoded as part of the instruction. This means that the value is stored in the code section of the program and can't be modified during program execution. In 16-bit mode, the size of the constant can be 8 or 16 bits; in 32- and 64-bit mode, it can be up to 32 bits. The use of the immediate operand depends on the instruction. It can be, for example, a numerical constant, an offset of an address or an additional displacement used for effective address calculation, a value which specifies the number of iterations in shift instructions, a mask for choosing the elements of a vector to operate on, or a modifier which influences the instruction behaviour.
Please refer to the specific instruction documentation for a detailed description. Examples of instructions using immediate addressing:
  mov ax, 10   ; move the constant value of 10 to the accumulator AX
  mov ebx, 35  ; move the constant 35 to the EBX register
  mov rcx, 5   ; move the constant 5 to the RCX register
==== Direct addressing ==== In direct addressing mode, the data is reached by specifying the target offset (displacement) as a constant. The processor uses this offset together with the appropriate segment register to access the byte in memory. The displacement can be specified in the instruction as a number, a previously defined constant or a constant expression. With segmentation enabled, it is possible to use a segment prefix to select the segment register we want to use. Example instructions which use numbers or a variable name as the displacement are shown in the following code and presented in figure {{ref>directx86}}.
  ; copy one byte from the BL register to memory address 0800h in the data segment
  mov ds:[0800h], bl
  ; copy two bytes from memory at address 0600h to the AX accumulator
  mov ax, ds:[0600h]
  ; copy eight bytes from the variable in memory (previously defined) to RAX
  mov rax, variable
{{ :en:multiasm:cs:addressing_direct_x86.png?600 |Illustration of direct addressing mode in x86 processor}} Direct addressing mode in x86 architecture
==== x64 RIP-relative direct addressing ==== 64-bit processors have some specific addressing modes. The default mode for direct addressing mode in the x64 architecture is addressing relative to the RIP register. If there is no base or index register in the instruction, the address is encoded as a 32-bit signed value. This value represents the distance between the data byte in memory and the address of the next instruction (current value of RIP). In figure {{ref>directRIPrelativex64}}, the RIP relative addressing where the variable is moved to AL is shown.
{{ :en:multiasm:cs:direct_RIP_relative_x64.png?600 |Illustration of direct RIP-relative addressing mode in x64 processor}} Direct RIP-relative addressing mode in x64 architecture
==== x64 32-bit direct addressing mode ==== In this mode, the instruction holds a 32-bit signed value, which is sign-extended to 64 bits by the processor. This limits the addressing space to two areas: the first region starts at address 0 and covers 2 GB of memory; the second is the 2 GB at the very end of the entire address space. As 2 GB of memory is insufficient for modern operating systems, this addressing mode is not supported by Windows compilers. MASM can use this mode in conjunction with a base or index register.
==== x64 64-bit direct addressing ==== In this mode, the address is a 64-bit unsigned value. As instruction arguments are, in general, at most 32 bits long, this addressing mode can be used only with a specific version of the MOV instruction and only with the accumulator (AL, AX, EAX, RAX). The MASM assembler does not use this mode.
==== Indirect addressing ==== In the x86 architecture, it is possible to use one or two registers in one instruction to calculate the effective address, which is the final offset within the current segment (or section). When two registers are used, one of them is called the base register and the second one the index register. In 16-bit processors, the base registers can be BX and BP only, while the index registers can be SI or DI. If BP is used, the processor automatically chooses the stack segment by default. For BX used as the base register, or for instructions with an index register only, the processor accesses the data segment by default. The 32-bit architecture makes the choice of registers much more flexible, and any of the eight registers (including the stack pointer) can be used as the base register. Here, the stack segment is chosen if the base register is EBP or ESP. The index register can be any of the general-purpose registers, excluding the stack pointer. Additionally, the index register can be scaled by a factor of 1, 2, 4 or 8.
In the 64-bit architecture, which introduces eight additional registers, any of the sixteen general-purpose registers can be the base or index register (excluding the stack pointer, which can be the base only). In the following sections, we will show examples of all possible addressing modes.
==== Base addressing ==== Base addressing mode uses the base register only: the base register is not scaled, no index register is used, and no additional displacement is added. In figure {{ref>basex86}}, the use of BX as the base register is shown to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_base_x86.png?600 |Illustration of indirect base addressing mode in x86 processor}} Indirect base addressing mode in x86 architecture
The code below shows other examples of base addressing.
  ; copy one byte from the data segment in memory at the address from the BX register to AL
  mov al, [bx]
  ; copy two bytes from the data segment in memory at the address from the EBX register to AX
  mov ax, [ebx]
  ; copy eight bytes from memory at the address from RBX to RAX (no segmentation in x64)
  mov rax, [rbx]
==== Base addressing with displacement ==== Base addressing mode with displacement uses the base register with an additional constant added, so the final effective address is the sum of the content of the base register and the constant. It can be interpreted as follows: the base register holds the address of a data table, and the constant is the offset of a byte within the table. In figure {{ref>basedispx86}}, the use of BX as the base register with an additional displacement is shown to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_base_disp_x86.png?600 |Illustration of indirect base addressing mode with displacement in x86 processor}} Indirect base addressing mode with displacement in x86 architecture
Some code examples are shown below; number represents a previously defined constant. MASM accepts various syntaxes of the base plus displacement address notation.
  ; copy one byte from the data segment in memory at the address from the BX register
  ; increased by "number" to AL
  mov al, [bx] + number
  mov al, [bx + number]
  mov al, [bx] number
  mov al, number [bx]
  mov al, number + [bx]
  mov al, [number + bx]
==== Index addressing with displacement ==== Index addressing mode with displacement uses the index register with an additional constant added, so the final effective address is the sum of the content of the index register and the constant. It can be interpreted as follows: the constant is the address of a data table, and the index register is the offset of a byte within the table. In figure {{ref>indexdispx86}}, the use of DI as the index register with an additional displacement is shown to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_index_disp_x86.png?600 |Illustration of indirect index addressing mode with displacement in x86 processor}} Indirect index addressing mode with displacement in x86 architecture
Some code examples are shown below; table represents a previously defined table in memory. MASM accepts various syntaxes of the index plus displacement address notation.
  ; copy one byte from the data segment in memory at the address calculated
  ; as the address of the table increased by the index from DI to AL
  mov al, [di] + table
  mov al, [di + table]
  mov al, [di] table
  mov al, table [di]
  mov al, table + [di]
  mov al, [table + di]
==== Base Indexed addressing ==== In this addressing mode, a combination of two registers is used. In a 16-bit processor, the base register is BX or BP, and the index register is SI or DI. In a 32- or 64-bit processor, any register can be the base, and all except the stack pointer can be the index. The final effective address is the sum of the contents of the two registers. In figure {{ref>baseindexx86}}, the use of BP as the base and DI as the index register to transfer data from memory to AL is shown.
{{ :en:multiasm:cs:addressing_base_index_x86.png?600 |Illustration of indirect base plus index addressing mode in x86 processor}} Indirect base plus index addressing mode in x86 architecture
The MASM assembler accepts different notations of the base + index register combination, as shown in the code below. In x86, the order of the registers written in the instruction is irrelevant.
  ; copy one byte from the data segment in memory at the address calculated
  ; as the sum of the base (BX) register and the index (SI) register to AL
  mov al, [bx] + [si]
  mov al, [bx + si]
  mov al, [bx] [si]
  mov al, [si] [bx]
In x86, there are only four possible combinations of base and index registers, as depicted below.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base register and index register to AL
  mov al, [bx] + [si] ; data segment
  mov al, [bx] + [di] ; data segment
  mov al, [bp] + [si] ; stack segment
  mov al, [bp] + [di] ; stack segment
In 32- or 64-bit processors, the first register used in the instruction is the base register, and the second is the index register. While segmentation is enabled, the use of EBP or ESP as the base register determines the segment register choice. Notice that it is possible to use the same register as both base and index in one instruction.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base register and index register to AL
  mov al, [eax] + [esi] ; data segment
  mov al, [ebx] + [edi] ; data segment
  mov al, [ecx] + [esi] ; data segment
  mov al, [edx] + [edi] ; data segment
  mov al, [esi] + [esi] ; data segment
  mov al, [edi] + [edi] ; data segment
  mov al, [ebp] + [esi] ; stack segment
  mov al, [esp] + [edi] ; stack segment
==== Base Indexed addressing with displacement ==== In this addressing mode, a combination of two registers with an additional constant is used. In a 16-bit processor, the base register is BX or BP, and the index register is SI or DI. The constant can be encoded as an 8- or 16-bit value. In such a processor, this is the most complex mode available.
In a 32- or 64-bit processor, any register can be the base, and all except the stack pointer can be the index. The constant is up to a 32-bit signed value. The final effective address is the sum of the contents of two registers and the displacement. In figure {{ref>baseindexdispx86}}, the use of BX as the base and SI as the index register with displacement to transfer data from memory to AL is shown. It can be interpreted as the constant is the address of the table of structures, the base register holds the offset of the structure in a table, and the index register keeps the offset of an element within the structure.
{{ :en:multiasm:cs:addressing_base_index_disp_x86.png?600 |Illustration of indirect base plus index plus displacement addressing mode in x86 processor}} Indirect base plus index plus displacement addressing mode in x86 architecture
The MASM assembler accepts different notations of the base + index + displacement combination, as shown in the code below. In x86, the order of the registers written in the instruction is irrelevant.
  ; copy one byte from the data segment in memory at the address calculated
  ; as the sum of the base (BX) register, the index (SI) register and a displacement to AL
  mov al, [bx] + [si] + table
  mov al, [bx + si] + table
  mov al, [bx] [si] + table
  mov al, table [si] [bx]
  mov al, table [si] + [bx]
  mov al, table [si + bx]
In 32- or 64-bit processors, the first register used in the instruction is the base register, and the second is the index register. While segmentation is enabled, the use of EBP or ESP as the base register determines the segment register choice. The displacement can be placed at any position in the address argument expression. Some examples are shown below.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base, index and displacement (table) to AL
  mov al, [eax] + [esi] + table ; data segment
  mov al, table + [ebx] + [edi] ; data segment
  mov al, table [ecx] [esi] ; data segment
  mov al, [edx] [edi] + table ; data segment
  mov al, table + [ebp] + [esi] ; stack segment
  mov al, [esp] + [edi] + table ; stack segment
==== Index addressing with scaling ==== Index addressing mode with scaling uses the index register multiplied by a simple constant of 1, 2, 4 or 8. This addressing mode is available in 32- and 64-bit processors and can use any general-purpose register except the stack pointer. In figure {{ref>indexscaleIA32}}, the use of EBX as the index register with a scaling factor of 2 is shown to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_index_scale_IA32.png?600 |Illustration of indirect index addressing mode with scaling in IA32 processor}} Indirect index addressing mode with scaling in IA32 architecture
Because no base register is used in these instructions, if segmentation is enabled, the data segment is always chosen.
  ; copy one byte from the data segment in memory at the address calculated
  ; as the multiplication of the index register by a constant to AL
  mov al, [eax * 2] ; data segment
  mov al, [ebx * 4] ; data segment
  mov al, [ecx * 8] ; data segment
  mov al, [edx * 2] ; data segment
  mov al, [esi * 4] ; data segment
  mov al, [edi * 8] ; data segment
  mov al, [ebp * 2] ; data segment
==== Base Indexed addressing with scaling ==== Base indexed addressing mode with scaling uses the sum of the base register and the content of the index register multiplied by a simple constant of 1, 2, 4 or 8. This addressing mode is available in 32- and 64-bit processors and can use any general-purpose register as the base and any general-purpose register except the stack pointer as the index. Figure {{ref>baseindexscaleIA32}} presents the use of the EDI register as the base and EAX as the index register with a scaling factor of 4 to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_base_index_scale_IA32.png?600 |Illustration of indirect base index addressing mode with scaling in IA32 processor}} Indirect base index addressing mode with scaling in IA32 architecture
The scaled register is assumed to be the index; the other one is the base (even if it is not written first in the instruction). While segmentation is enabled, the use of EBP or ESP as the base register determines the segment register choice.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base register and the scaled index register to AL
  mov al, [eax] + [esi * 2] ; data segment
  mov al, [ebx] + [edi * 4] ; data segment
  mov al, [ecx] + [eax * 8] ; data segment
  mov al, [edx] + [ecx * 2] ; data segment
  mov al, [esi] + [edx * 4] ; data segment
  mov al, [edi] + [edi * 8] ; data segment
  mov al, [ebp] + [esi * 2] ; stack segment
  mov al, [esp] + [edi * 4] ; stack segment
==== Base Indexed addressing with displacement and scaling ==== Base indexed addressing mode with displacement and scaling uses the sum of the base register, the content of the index register multiplied by a simple constant of 1, 2, 4 or 8, and an additional constant. This addressing mode is available in 32- and 64-bit processors and can use any general-purpose register as the base and any general-purpose register except the stack pointer as the index. The displacement can be up to a 32-bit signed value, even in a 64-bit processor. Figure {{ref>baseindexdispscaleIA32}} presents the use of the EDI register as the base and EAX as the index register with a scaling factor of 4 to transfer data from memory to AL. The interpretation is similar to the base indexed addressing mode with displacement in the 16-bit x86 machine: the constant is the address of the beginning of a table of structures, the base register holds the offset of the structure in the table, and the index register keeps the scaled number of the element within the structure, pointing to the chosen byte.
{{ :en:multiasm:cs:addressing_base_index_disp_scale_IA32.png?600 |Illustration of indirect base index addressing mode with displacement and scaling in IA32 processor}} Indirect base index addressing mode with displacement and scaling in IA32 architecture
As in base indexed mode with scaling without displacement, the scaled register is assumed to be the index, and the other one is the base (even if it is not written first in the instruction). While segmentation is enabled, the use of EBP or ESP as the base register determines the segment register choice. The displacement can be placed at any position in the instruction.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base, scaled index and displacement (table) to AL
  mov al, [eax] + [esi * 2] + table ; data segment
  mov al, table + [ebx] + [edi * 4] ; data segment
  mov al, table + [ebp] + [esi * 2] ; stack segment
  mov al, [esp] + [edi * 4] + table ; stack segment
==== Summary for indirect addressing ==== In 16-bit processors, the base registers can be BX and BP only. The first one is used to access the data segment; the second one automatically chooses the stack segment by default. The additional offset can be absent or can be encoded as an 8- or 16-bit signed value. The schematic of the x86 effective address calculation for indirect address generation is shown in figure {{ref>effectivex86}}.
{{ :en:multiasm:cs:effective_x86.png?600 |Illustration of possible combination of base and index registers in effective address calculation for indirect addressing mode in x86 processor}} Possible combination of registers in effective address calculation for indirect addressing mode in x86 architecture
The 32-bit architecture makes the choice of registers much more flexible, and any of the eight registers (including the stack pointer) can be used as the base register. Here, the stack segment is chosen if the base register is EBP or ESP. The index register can be any of the general-purpose registers except the stack pointer, and it can be scaled by a factor of 1, 2, 4 or 8. The additional displacement can be absent or encoded as an 8-, 16- or 32-bit signed value. In the 64-bit architecture, any of the sixteen general-purpose registers can be the base register. As in 32-bit processors, the index register cannot be the stack pointer and can be scaled by 1, 2, 4 or 8. The additional displacement can be absent or encoded as an 8-, 16- or 32-bit signed number. In figure {{ref>effectiveIA32}}, the schematic of the effective address calculation and the use of registers for indirect addressing in the IA32 architecture is shown.
{{ :en:multiasm:cs:effective_IA32.png?600 |Illustration of possible combination of base and index registers in indirect addressing mode in IA32 processor}} Possible combination of registers in indirect addressing mode in IA32 architecture
===== Principles of Instructions Encoding ===== ===== Instruction Set of x86 - Essentials ===== ===== Procedures, Functions and Calls in Windows and Linux ===== ===== Compatibility with HLL Compilers (C++, C#) and Operating Systems ===== ===== FPU ===== ===== MMX ===== ===== Macros ===== ===== Code Examples ===== ===== Energy Efficient Programming ===== ===== Interrupts ===== ===== Optimisation =====