=== Authors ===

IOT-OPEN.EU Reloaded Consortium partners proudly present the 2nd edition of the Introduction to the IoT book. The complete list of contributors is listed below.

**ITT Group**
  * Raivo Sell, Ph. D., ING-PAED IGIP

**Riga Technical University**
  * Agris Nikitenko, Ph. D., Eng.
  * Karlis Berkolds, M. sc., Eng.
  * Eriks Klavins, M. sc., Eng.

**Silesian University of Technology**
  * Piotr Czekalski, Ph. D., Eng.
  * Krzysztof Tokarz, Ph. D., Eng.
  * Godlove Suila Kuaban, Ph. D., Eng.

**Western Norway University of Applied Sciences**
  * Add data

=== Preface ===

This book is intended to provide comprehensive information about low-level programming in assembler language for a wide range of computer systems, including:
  * embedded systems - constrained devices,
  * mobile and mobile-like devices - based on ARM architecture,
  * PCs and servers - based on Intel and AMD x86 architecture, in particular its 64-bit (x86_64) variant.

The book also contains an introductory chapter on general computer architectures and their components.

While low-level assembler programming was once considered outdated, it is now back, because of the need for efficient and compact code that saves energy and resources, thus promoting a green approach to programming. It is also essential for cutting-edge algorithms that require high performance while using constrained resources. Assembler is the base for all other programming languages and hardware platforms, and so is needed when developing new hardware, compilers and tools. It is crucial, in particular, in the context of the announced switch from chip manufacturing in Asia towards EU-based technologies.
We (the Authors) assume that persons willing to study this content possess some general knowledge of IT technologies, e.g., understand what an embedded system is, know the essential components of computer systems, such as RAM, CPU, DMA, I/O, and interrupts, and know the general concept of software development with programming languages, preferably C/C++ (but this is not obligatory).

{{ :en:multiasm:curriculum:logo_400.png?200 | MultiASM Project Logo}}

=== Project Information ===

This content was implemented under the following project:
  * Cooperation Partnerships in higher education, 2023, MultiASM: A novel approach for energy-efficient, high performance and compact programming for next-generation EU software engineers: 2023-1-PL01-KA220-HED-000152401.

**Consortium Partners**\\
  * Silesian University of Technology, Gliwice, Poland (Coordinator),
  * Riga Technical University, Riga, Latvia,
  * Western Norway University of Applied Sciences, Forde, Norway,
  * ITT Group, Tallinn, Estonia.

{{ :en:multiasm:curriculum:pasek_z_logami_1000px.jpg?400 | Consortium Partners' Logos}}

**Erasmus+ Disclaimer**\\
This project has been funded with support from the European Commission.\\
This publication reflects the views only of the author, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

**Copyright Notice**\\
This content was created by the MultiASM Consortium 2023–2026.\\
The content is copyrighted and distributed under the CC BY-NC [[https://en.wikipedia.org/wiki/Creative_Commons_license|Creative Commons Licence]] and is free for non-commercial use.
{{ en:iot-open:ccbync.png?100 |CC BY-NC}}
In case of commercial use, please get in touch with a MultiASM Consortium representative.

====== Introduction ======
The old, somewhat abandoned assembler programming comes to light again: "machine code to machine code and assembler to the assembler." In the context of rebasing electronics manufacturing, particularly developing new CPUs in Europe, engineers and software developers need to familiarise themselves with this niche, yet still omnipresent, technology: every piece of code you write, compile, and execute goes directly or indirectly to the machine code.\\
Besides developing new products and the related need for compilers and tools, the assembler language is essential for generating compact, rapid implementations of algorithms: it gives software developers a powerful tool of absolute control over the hardware, mainly the CPU. Assembler programming applies to selected, specific groups of tasks and algorithms.\\
Using pure assembler to implement, e.g., a user interface, is possible but does not make sense.\\
Nowadays, assembler programming is commonly integrated with high-level languages, and it is the part of the application's code responsible for rapid and efficient data processing without higher-level language overheads. This applies even to applications that do not run directly on the hardware but rather use virtual environments and frameworks (either interpreted or hybrid) such as Java, .NET languages, and Python.\\
It is a rule of thumb that the simpler and more constrained the device is, the closer the developer is to the hardware. An excellent example of this rule is development for an ESP32 chip: it has 2 cores that can easily handle Python apps, but when it comes to its energy-saving modes, when only the ultra-low-power coprocessor is running, the only programming language available is assembler; it is compact enough to run in this very constrained environment, drawing microamperes of current and fitting in dozens of bytes.
This book is divided into four main chapters:
  * The first chapter introduces students to computer platforms and their architectures. Assembler programming is very close to the hardware, and understanding those concepts is necessary to write code successfully and effectively.
  * The second chapter discusses programming in assembler for constrained devices that usually do not have operating systems but are bare-metal programmed. As it is impossible to review all assemblers and platforms, the one discussed in detail in this chapter is the most popular: AVR (ATMEL) microcontrollers (e.g., the Arduino Uno development board).
  * The third chapter discusses assembler programming for the future of most devices: ARM architecture. This architecture is perhaps the most popular worldwide, as it applies to mobile phones, edge and fog-class devices, and recently to a growing number of implementations in notebooks, desktops, and even workstations (e.g., Apple's M series). Practical examples use the Raspberry Pi version 5.
  * The last, fourth chapter describes in-depth assembler programming for PCs, in its 64-bit version, for Intel and AMD CPUs, including their CISC capabilities, such as SIMD instruction sets and their applications. Practical examples present (among others) the merging of assembler code with .NET frontends under Windows.

The following chapters present the contents of the coursebook:
  * [[en:multiasm:cs]]
  * [[en:multiasm:paiot]]
  * [[en:multiasm:paarm]]
  * [[en:multiasm:papc]]

====== Computer Architectures ======
Assembler programming is a very low-level technique for writing software. It uses the hardware directly, relying on the architecture of the processor and computer, with the possibility of influencing every single element and every single bit in hundreds of registers which control the behaviour of the computer and all its units.
That gives the programming engineer much more control over the hardware, but also requires great carefulness while writing programs. Modification of a single bit can completely change the computer's behaviour. This is why we begin our book on assembler with a chapter about the architecture of computers. It is essential to understand how the computer is designed, how it operates, how it executes programs and performs calculations. It is worth mentioning that this is important not only for programming in assembler. Knowledge of computer and processor architecture is useful for everyone who writes software in any language. With this knowledge, programming engineers can write applications that are more efficient in terms of memory use, execution time and energy consumption.

You may notice that in this book we often use the term //processor// for elements sometimes called by other names. Historically, the processor or microprocessor is the element that represents the central processing unit (CPU) only and also requires memory and peripherals to form a fully functional computer. If the integrated circuit contains all the mentioned elements, it is called a one-chip computer or microcontroller. A more advanced microcontroller for use in embedded systems is called an embedded processor. An embedded processor which contains other modules, for example a wireless networking radio module, is called a System on Chip (SoC). From the perspective of an assembler programmer, differentiating between these elements is not so important, and in current literature all such elements are often called the processor. That is why we decided to use the common name "processor", although other names can also appear in the text.

===== Overall View on Computer Architecture: Processor, Memory, IO, Buses =====
The computers which we use every day are all designed based on the same general idea of cooperation of three base elements: processor, memory and peripheral devices.
Their names represent their functions in the system. The memory stores the data and program code, the processor manipulates the data by executing programs, and peripherals are used to keep contact with the user, the environment and other systems. To exchange information, these elements are connected by interconnections named buses. The generic block schematic diagram of an exemplary computer is shown in Fig. {{ref>compblock}}.
{{ :en:multiasm:cs:block_schematic_computer.png?600 |Block schematic of computer}} Block schematic of computer
==== Processor ====
It is often called "the brain" of the computer. Although it does not think, the processor is the element which controls all other units of the computer: it manages everything in the machine. Every hardware part of the computer is controlled, more or less directly, by the main processor. Even if a device has its own processor - for example, the keyboard - it works under the control of the main one.

The processor handles events. We can say that synchronous events are those which the processor handles periodically. The processor never stops (as long as it has power), even when you see nothing special happening on the screen. On a PC running an operating system without a graphical user interface - for example, plain Linux, or a command box in Windows - even if you see only "C:\>", the processor is working. In this situation, it executes the main loop of the system, waiting for asynchronous events. Such an asynchronous event occurs when the user presses a key or moves the mouse, when the sound card finishes playing a sound, or when the hard disk finishes transferring data. The processor handles all of these events by executing programs - or, if you prefer, procedures.

The processor is characterised by its main parameters, including the frequency of operation and its class. Other features will be explained further in this chapter. The frequency is very important information which tells the user how many instructions can be executed in a unit of time. To obtain the real number of instructions per second, it must be combined with the average number of clock pulses required to execute an instruction. Older processors needed a few or even a dozen clock pulses for a single instruction. Modern machines, thanks to parallel execution, can achieve the impressive result of a few instructions per single cycle. The class of the processor tells what the number of bits in the data word is.
It defines the size of the data arguments which the processor can process with a single arithmetic, logic or other operation. The still popular 8-bit machines have a data length of 8 bits, while the most sophisticated ones can use 32-bit or 64-bit arguments. The class of the processor determines the size of its internal data registers, the size of instruction arguments and the number of lines of the data bus. Some modern 64-bit processors have additional registers much longer than the ones in the main set. For example, in x64 architecture there are ZMM registers of 512 bits in length.

==== Memory ====
Memory is the element of the computer that stores data and programs. It is visible to the processor as a sequence of data words, where every word has its own address. Addressing allows the processor to access simple and complex variables and to read instructions for execution. Although it is intuitive that the size of a memory word should correspond to the class of the processor, this is not always true. For example, in PC computers, independently of the processor class, the memory is always organised as a sequence of bytes, and its size is given in megabytes or gigabytes. The byte is historically assumed to be 8 bits of information and is used as the base unit to express the size of data in the world of computers. The size of the memory installed in the computer does not have to correspond to the size of the address space - the maximal size of the memory which is addressable by the processor. In modern machines, matching them would be impossible or hardly achievable; for example, for the x64 architecture the theoretical address space is 2^64 bytes (16 exabytes). Even the address space supported in hardware by current processors is as big as 2^48 bytes, which equals 256 terabytes. On the opposite side, in constrained devices the size of the physical memory can be bigger than the address space supported by the processor.
To enable access to more memory than the addressing space of the processor, or to support flexible placement of programs in a big address space, the paging mechanism is used. It is a hardware support unit for mapping the address used by the processor onto the physical memory installed in the computer. To learn more about paging, please refer to https://connormcgarr.github.io/paging/.

==== Peripherals ====
Peripherals are also called input-output (I/O) devices. There is a variety of units belonging to this group: timers, communication ports, general-purpose inputs and outputs, displays, network controllers, video and audio modules, data storage controllers and many others. The main function of peripheral modules is to exchange information between the computer and the user, collect information from the environment, send and receive data to and from elements of the computer not connected directly to the processor, and exchange data with other computers and systems. Some of them are used to connect the processor to a functional unit of the same computer. For example, the hard disk drive is not connected directly to the processor; it uses a specialised controller which is the peripheral module operating as an interface between the centre of the computer and the hard drive. Another example is the display. It uses a graphics card which plays the role of a peripheral interface between the computer and the monitor.

==== Buses ====
The processor, memory and peripherals exchange information using interconnections called buses. Although you can find in the literature and on the internet a variety of kinds of buses and their names, at the very low level there are three buses which connect the processor, memory and peripherals.

**Address bus** delivers the address generated by the processor to memory or peripherals. This address specifies the one, and only one, memory cell or peripheral register that the processor wants to access.
The address bus is used not only to address the data which the processor wants to transfer to or from memory or a peripheral. It also addresses the instruction which the processor fetches and later executes - instructions are also stored in the memory. The address bus is one-directional: the address is generated by the processor and delivered to the other units. If there is a DMA controller in the computer, in some circumstances it can also generate addresses instead of the processor; refer to the chapter with the DMA description. The number of lines in the address bus is fixed for a given processor and determines the size of the addressing space the processor can access. For example, if the address bus of some processor has 16 lines, it can generate up to 2^16 = 65536 different addresses.

**Data bus** is used to exchange data between the processor and memory or peripherals. The processor can read data from memory or peripherals, or write data to these units, having previously sent their address over the address bus. As data can be read or written, the data bus is bi-directional. In systems with a DMA controller, the data bus is utilised to exchange data between memory and peripherals directly; refer to the chapter with the DMA description. The number of bits of the data bus usually corresponds to the class of the processor. It means that an 8-bit class processor has 8 lines of the data bus.

**Control bus** is formed by lines mainly used for synchronisation between the elements of the computer. In the minimal implementation, it includes the read and write lines. The read line (#RD) informs the other elements that the processor wants to read data from the unit. In such a situation, the element, e.g. memory, puts the data from the addressed cell on the data bus. An active write signal (#WR) informs the element that the data present on the data bus should be stored at the specified address.
The control bus can also include other signals specific to the system, e.g. interrupt signals, DMA control lines, clock pulses, signals distinguishing memory and peripheral access, signals activating chosen modules and others.

===== Von Neumann vs Harvard Architectures, Mixed Architectures =====
The classical architecture of computers uses a single address bus and a single data bus to connect the processor, memory and peripherals. This architecture is called von Neumann or Princeton architecture, and we show it in Fig. {{ref>vonneumann}}. Additionally, in this architecture the memory contains both the code of programs and the data the programs use. This architecture suffers from the impossibility of accessing an instruction of the program and the data to be processed at the same time, called the von Neumann bottleneck. The answer to this issue is the Harvard architecture, where program memory is separated from data memory and they are connected to the processor with two pairs of address and data buses. Of course, the processor must support such a type of architecture. The Harvard architecture is shown in Fig. {{ref>harvard}}.
{{ :en:multiasm:cs:block_schematic_vonneumann.png?600 |Block schematic of von Neumann architecture computer}} Block schematic of von Neumann architecture computer
{{ :en:multiasm:cs:block_schematic_harvard.png?600 |Block schematic of Harvard architecture computer}} Block schematic of Harvard architecture computer
Harvard architecture is very often used in one-chip computers. It does not suffer from the von Neumann bottleneck and additionally makes it possible to implement different lengths for the data word and the instruction word. For example, the AVR 8-bit class family of microcontrollers has a 16-bit program word. PIC microcontrollers, also of the 8-bit class, have a 13 or 14-bit instruction word length. In modern microcontrollers, the program is usually stored in internal flash reprogrammable memory, and data in internal static RAM. All interconnections, including address and data buses, are implemented internally, making the implementation of the Harvard architecture easier than in a computer based on a microprocessor. In several mature microcontrollers, the program and data memory are separated but connected to the processor unit with a single set of buses. This is named mixed architecture. It benefits from an enlarged total address space but still suffers from the von Neumann bottleneck. This approach can be found in the 8051 family of microcontrollers. The schematic diagram of the mixed architecture is presented in Fig. {{ref>mixed}}.
{{ :en:multiasm:cs:block_schematic_mixed.png?600 |Block schematic of mixed architecture computer}} Block schematic of mixed architecture computer
===== CISC, RISC =====
It is not only the whole computer that can have different architectures; this also applies to processors. There are two main internal architectures of processors: CISC and RISC. CISC means Complex Instruction Set Computer, while RISC stands for Reduced Instruction Set Computer. The naming can be a little confusing because it suggests that the instruction set itself is complex or reduced. In fact, we can find CISC and RISC processors with similar numbers of implemented instructions. The difference lies rather in the complexity of the instructions, not only of the instruction set. Complex instructions mean that a typical single instruction of a CISC processor does more during its execution than a typical RISC instruction. It also uses more sophisticated addressing. Consequently, implementing some algorithm requires fewer CISC instructions than RISC instructions. Of course, this comes at the price of a more sophisticated construction of the CISC processor, which influences the average execution time of a single instruction. As a result, the overall execution time can be similar, but in CISC more effort is made by the processor, while in RISC more work is done by the compiler (or the assembler programmer).

Additionally, there are differences in the general-purpose registers. In CISC, the number of registers is smaller than in RISC, and they are specialised: not all operations can be performed with any register. For example, in some CISC processors arithmetic calculations can be done only with the use of a special register called the accumulator, while addressing of a table element or using a pointer can be done only with a special index (or base) register. In RISC, almost all registers can be used for any purpose, such as the mentioned calculations or addressing.
In CISC processors, instructions usually have two arguments, in the form:

  operation arg1, arg2          ; Example: arg1 = arg1 + arg2

In such a situation //arg1// is one of the sources and also the destination - the place for the result. This destroys the original //arg1// value. In many RISC processors, three-argument instructions are present:

  operation arg1, arg2, arg3    ; Example: arg3 = arg1 + arg2

In such an approach, two arguments are the sources and the third one is the destination - the original arguments are preserved and can be used for further calculations. Table {{ref>ciscrisc}} summarises the differences between CISC and RISC processors.

^ Feature ^ CISC ^ RISC ^
| Instructions | Complex | Simple |
| Registers | Specialised | Universal |
| Number of registers | Smaller | Larger |
| Calculations | With accumulator | With any register |
| Addressing modes | Complex | Simple |
| Non-destroying instructions | No | Yes |
| Examples of processors | 8086, 8051 | AVR, ARM |
CISC and RISC features
===== Components of Processor: Registers, ALU, Bus Control, Instruction Decoder =====
From our perspective, the processor is the electronic integrated circuit that controls the other elements of the computer. Its main ability is to execute instructions. When we go into the details of the instruction set, you will see that some of the instructions perform calculations or process data, but some of them do not. This suggests that the processor is composed of two main units. One of them is responsible for instruction execution, while the second performs data processing. The first one is called the control unit or instruction processor; the second one is named the execution unit or data processor. We can see them in Fig {{ref>procblock}}.
{{ :en:multiasm:cs:block_schematic_processor.png?600 |Units of the processor}} Units of the processor
==== Control unit ====
The function of the control unit, also known as the instruction processor, is to fetch, decode and execute instructions. It also generates signals to the execution unit if the instruction being executed requires it. It is a synchronous and sequential unit. Synchronous means that it changes state synchronously with the clock signal. Sequential means that the next state depends on the states of the inputs and the current internal state. As inputs, we can consider not only physical signals from other units of the computer but also the code of the instruction. To ensure that the computer behaves the same every time it is powered on, the control unit is set to a known state at the beginning of operation by the RESET signal. A typical control unit contains some essential elements:
  * Instruction register (IR).
  * Instruction decoder.
  * Program counter (PC)/instruction pointer (IP).
  * Stack pointer (SP).
  * Bus interface unit.
  * Interrupt controller.
Elements of the control unit are shown in Fig {{ref>controlunit}}.
{{ :en:multiasm:cs:control_unit.png?600 |Elements of the control unit}} Elements of the control unit
The control unit executes instructions in a few steps:
  * Generates the address of the instruction.
  * Fetches the instruction code from memory.
  * Decodes the instruction.
  * Generates signals to the execution unit or executes the instruction internally.
In detail, the process looks as follows:
  - The control unit takes the address of the instruction to be executed from a special register known as the Instruction Pointer or Program Counter and sends it to the memory via the address bus. It also generates signals on the control bus to synchronise the memory with the processor.
  - The memory takes the code of the instruction from the provided address and sends it to the processor using the data bus.
  - The processor stores the instruction code in the instruction register and, based on the bit pattern, interprets what to do next.
  - If the instruction requires the execution unit to operate, the control unit generates signals to control it. In cooperation with the execution unit, it can also read data from or write data to the memory.
The control unit works according to the clock signal generator cycles, known as main clock cycles. With every clock cycle, some internal operations are performed. One such operation is reading or writing the memory, which sometimes requires more than a single clock cycle. A single memory access is known as a machine cycle. As instruction execution sometimes requires more than one memory access and other actions, the execution of the whole instruction is named the instruction cycle. Summarising: one instruction execution requires one instruction cycle, composed of several machine cycles, each composed of a few main clock cycles. Modern advanced processors are designed in such a way that they are able to execute a single instruction (sometimes even more than one) every single clock cycle. This requires a more complex design of the control unit, many execution units and other advanced techniques which make it possible to process more than one instruction at a time.
The control unit also accepts input signals from peripherals, enabling the interrupt and direct memory access mechanisms. For a proper return from an interrupt subroutine, the control unit uses a special register called the stack pointer. Interrupts and direct memory access mechanisms will be explained in detail in further chapters. Although the stack pointer is not implemented in all modern processors, every processor has some mechanism for storing the return address.

==== Execution unit ====
The execution unit, also known as the data processor, executes instructions. Typically, it is composed of a few essential elements:
  * Arithmetic logic unit (ALU).
  * Accumulator and set of registers.
  * Flags register.
  * Temporary register.
The arithmetic logic unit (ALU) is the element that performs logical and arithmetical calculations. It uses data coming from registers, the accumulator or memory. Data coming from memory for arithmetic and logic instructions is stored in the temporary register. The result of calculations is stored back in the accumulator, another register or memory. In some legacy CISC processors, the only possible place for storing the result is the accumulator. Besides the result, the ALU also returns some additional information about the calculations. It modifies the bits in the flags register, which comprises flags indicating the results of arithmetic and logical operations. For example, if the result of an addition operation is too large to be stored in the resulting argument, the carry flag is set to inform about such a situation. Typically, the flags register includes:
  * Carry flag, set in case a carry or borrow occurs.
  * Sign flag, indicating whether the result is negative.
  * Zero flag, set in case the result is zero.
  * Auxiliary carry flag, used in BCD operations.
  * Parity flag, which indicates whether the result has an even number of ones.
The flags are used as conditions for decision-making instructions (like //if// statements in high-level languages). The flags register can also implement some control flags to enable or disable processor functionalities. An example of such a flag is the Interrupt Enable flag of the 8086 microprocessor. As we mentioned in the chapter about CISC and RISC processors, in the former there are specialised registers, including the accumulator. A typical CISC execution unit is shown in Fig {{ref>CISCexeunit}}.
{{ :en:multiasm:cs:CISC_execution_unit.png?600 |Elements of the CISC execution unit}} Elements of the CISC execution unit
A typical RISC execution unit does not have a specialised accumulator register. It implements the set of registers as shown in Fig {{ref>RISCexeunit}}.
{{ :en:multiasm:cs:RISC_execution_unit.png?600 |Elements of the RISC execution unit}} Elements of the RISC execution unit
===== Processor Taxonomies, SISD, SIMD, MIMD =====
As we already know, the processor executes instructions which process data. We can consider two streams flowing through the processor: a stream of instructions which passes through the control unit, and a stream of data processed by the execution unit. In 1966, Michael Flynn proposed a taxonomy to classify different processor architectures. Flynn's classification is based on the number of concurrent instruction (or control) streams and data streams available in the architecture. The taxonomies proposed by Flynn are presented in Table {{ref>taxonomies}}.

| Basic taxonomies |^ Data streams ||
| ::: | ::: | **Single** | **Multiple** |
^ Instruction streams | **Single** | SISD | SIMD |
| ::: | **Multiple** | MISD | MIMD |
Processor taxonomies by Flynn
==== SISD ====
Single Instruction Single Data processor is a classical processor with a single control unit and a single execution unit. It can fetch a single instruction in one cycle and perform a single calculation. Mature PC computers based on 8086, 80286 or 80386 processors, or some modern small-scale microcontrollers like AVR, are examples of such an architecture.

==== SIMD ====
Single Instruction Multiple Data is an architecture in which one instruction stream performs calculations on multiple data streams. Good examples of implementations of such an architecture are all vector instruction sets (also called SIMD instructions) like MMX, SSE, AVX and 3DNow! in x64 Intel and AMD processors. Modern ARM processors also implement SIMD instructions which perform vectorised operations.

==== MIMD ====
Multiple Instruction Multiple Data is an architecture in which many control units operate on multiple streams of data. These are all architectures with more than one processor (multi-processor architectures). MIMD architectures include multi-core processors like modern x64, ARM and even ESP32 SoCs. This also includes supercomputers and distributed systems, using common shared memory or even a distributed but shared memory.

==== MISD ====
Multiple Instruction Single Data. At first glance, such machines seem illogical, but these are machines where the certainty of correct calculations is crucial and required for the security of the system operation. Such an approach can be found in applications like space shuttle computers.

==== SIMT ====
Single Instruction Multiple Threads. Originally defined as a subset of SIMD. In modern constructions, Nvidia uses this execution model in their G80 architecture ((https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf)), where multiple independent threads execute concurrently using a single instruction.

===== Memory, Types and Their Functions =====
The computer cannot work without memory.
The processor fetches instructions from the memory, and data is also stored there. In this chapter, we will discuss memory types, technologies and their properties. The overall view of computer memory can be represented as the memory hierarchy triangle shown in Fig. {{ref>memtriangle}}, where the available size decreases towards the top while the access time shortens significantly. As the figure shows, the fastest memory in the computer is the set of internal registers; next come the cache memory, operational memory (RAM) and disk drive; the slowest, but largest, is the memory available via the computer network - usually the Internet.
{{ :en:multiasm:cs:memory_triangle.png?400 |Memory hierarchy triangle}} Memory hierarchy triangle
In the table {{ref>memhierarchy}} you can find the comparison of different memory types with their estimated size and access time. ^ Memory type ^ Average size ^ Access time ^ | Registers | few kilobytes | <1 ns | | Cache | few megabytes | <10 ns | | RAM | few gigabytes | <100 ns | | Disk | few terabytes | <100 us | | Network | unlimited | few seconds, but can be unpredictable |
Parameters of memory in the hierarchy
==== Address space ==== Before discussing the details of memory, we must mention the term //address space//. The processor can access instructions of the program and data to process. The question is: how many instructions and data elements can it reach? The possible number of elements (usually bytes) represents the address space available for the program and for data. We know from the chapter about von Neumann and Harvard architectures that these spaces can overlap or be separate. The size of the program address space is determined by the number of bits of the Instruction Pointer register. In many 8-bit microprocessors, the size of the IP is 16 bits. This means that the size of the address space, expressed as the number of different addresses the processor can reach, is equal to 2^16 = 65536 (64k). This is too small for bigger machines like PC computers, so there the IP is 64 bits wide (although the real number of bits used is limited to 48). The size of the possible data address space is determined by the addressing modes and the size of the index registers used for indirect addressing. In 8-bit microprocessors, it is sometimes possible to join two 8-bit registers to achieve an address space of the same 64k size as for the program. In PC computers, program and data address spaces overlap, so the address space for data is the same as for programs. ==== Program memory ==== In this kind of memory, the instructions of the programs are stored. In many computers (like PCs), the same physical memory is used to store the program and the data. This is not the case in most microcontrollers, where the program is stored in non-volatile memory, usually Flash EEPROM. Even if the memory for programs and data is common, the code area is often protected from writing to prevent malware from interfering with the original, safe software. ==== Data memory ==== In data memory, all kinds of data can be stored.
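The address-space arithmetic above is easy to verify. The following short sketch (Python, purely illustrative - no real processor is involved) computes the number of distinct addresses reachable for a given Instruction Pointer width:

```python
# Address-space size implied by the width of the Instruction Pointer.
# Illustrative only: real processors may reserve parts of the space.

def address_space_size(bits: int) -> int:
    """Number of distinct addresses reachable with a `bits`-wide address."""
    return 2 ** bits

print(address_space_size(16))   # 65536 addresses (64k) for a 16-bit IP
print(address_space_size(48))   # address bits actually used on today's x86-64
```

For a 16-bit IP this gives exactly the 65536 (64k) addresses mentioned above; for the 48 bits used on current x86-64 processors it gives 256 terabytes of addressable space.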
Here we can find numbers, tables, pictures, audio samples and everything else that is processed by the software. Usually, the data memory is implemented as RAM, with the possibility of reading and writing. RAM is a volatile memory, so its content is lost when the computer is powered off. ==== Memory technologies ==== There are different technologies for creating memory elements. The technology determines whether memory is volatile or non-volatile, the physical size (area) of a single bit, the power consumption of the memory array, and how fast we can access the data. Based on data retention, we can divide memories into two groups: * non-volatile, * volatile. Non-volatile memories are implemented as: * ROM * PROM * EPROM * EEPROM * Flash EEPROM * FRAM Volatile memories have two main technologies: * Static RAM * Dynamic RAM ==== Non-volatile memories ==== Non-volatile memories are used to store data which should be preserved while the power is off. Their main application is to store the code of the program, constant data tables, and parameters of the computer (e.g. network credentials). The hard disk drive (or SSD) is also a non-volatile memory, although not directly accessible by the processor via the address and data buses. ==== ROM ==== ROM - Read Only Memory is the kind of memory that is programmed by the chip manufacturer during the production of the chip. This memory has fixed content which can't be changed in any way. It used to be popular as program memory in microcontrollers because of the low price of a single chip, but since any error in the software required replacing the whole chip and software updates were impossible, it was replaced with technologies which allow reprogramming the memory content. Currently, ROM is sometimes used to store the serial number of a chip. ==== PROM ==== PROM - Programmable Read Only Memory is the kind of memory that can be programmed by the user but can't be reprogrammed afterwards.
It is sometimes called OTP - One-Time Programmable memory. Some microcontrollers implement such a memory, but because software updates are impossible, it is a choice only for simple devices with a short product lifetime. OTP technology is also used in some protection mechanisms to guard against unauthorised software changes or device reconfiguration. ==== EPROM ==== EPROM - Erasable Programmable Read Only Memory is memory which can be erased and programmed many times. The data is erased by shining an intense ultraviolet light through a special glass window into the memory chip for a few minutes. After erasing, the chip can be programmed again. This kind of memory was very popular years ago, but due to the complicated erase procedure requiring a UV light source, it was replaced with EEPROM. ==== EEPROM ==== EEPROM - Electrically Erasable Programmable Read Only Memory is a type of non-volatile memory that can be erased and reprogrammed many times with electric signals only. It replaced EPROM memories due to its ease of use and the possibility of reprogramming without removing the chip from the device. It is designed as an array of MOSFET transistors with so-called floating gates, which can be charged or discharged and keep a stable state for a long time (at least 10 years). In EEPROM memory, every byte can be individually reprogrammed. ==== Flash EEPROM ==== Flash EEPROM is a variant of EEPROM memory. While in EEPROM a single byte can be erased and reprogrammed, in the Flash version erasing is possible only in larger blocks. This reduces the area occupied by the memory, making it possible to create denser memories with larger capacity. Flash memories can be realised as part of a larger chip, making it possible to design microcontrollers with built-in reprogrammable memory for the firmware. This enables firmware updates without the need for any additional tools or removing the processor from the operating device.
It also enables remote updates of the firmware. ==== FRAM ==== FRAM - Ferroelectric Random Access Memory is the type of memory where information is stored by changing the polarisation of a ferroelectric material. The main advantage is that the power efficiency, access time, and density are comparable to DRAM, but without the need for refreshing and with data retention while the power is off. The main drawback is the price, which limits the popularity of FRAM applications. Currently, due to their high reliability, FRAMs are used mainly in specialised applications like data loggers, medical equipment, and automotive "black boxes". ==== Volatile memories ==== Volatile memories are used for temporary data storage while the computer system is running. Their content is lost after the power is off. They store the data being processed by the program, serving as operational memory (known as RAM), cache memory, and data buffers. ==== SRAM ==== SRAM - Static Random Access Memory is the memory used as operational memory in smaller computer systems and as the cache memory in bigger machines. Its main benefit is very high operational speed. The access time can be as short as a few nanoseconds, making it the first choice when speed is required: in cache memory or as the processor's registers. The drawbacks are the area and power consumption. Every bit of static memory is implemented as a flip-flop composed of six transistors. Due to its construction, the flip-flop always consumes some power. That is the reason why the capacity of SRAM memory is usually limited to a few hundred kilobytes, up to a few megabytes. ==== DRAM ==== DRAM - Dynamic Random Access Memory is the memory used as operational memory in bigger computer systems. Its main benefit is very high density. The information is stored as a charge in a capacitor formed under the MOSFET transistor.
Because every bit is implemented with a single transistor, it is possible to fit a few gigabytes in a small area. Another benefit is the reduced power required to store the data in comparison to static memory. But there are also drawbacks. DRAM memory is slower than SRAM: the addressing method and the complex internal construction prolong the access time. Additionally, because the capacitors which store the data discharge in a short time, DRAM memory requires refreshing - each cell typically about 16 times per second. Refreshing is done automatically, but it additionally slows the access time. ===== Peripherals ===== Peripherals, or peripheral devices, also known as Input-Output devices, enable the computer to remain in contact with the external environment or expand the computer's functionality. Peripheral devices enhance the computer's capability by making it possible to enter information into a computer for storage or processing and to deliver the processed data to a user, another computer, or a device controlled by the computer. Internal peripherals are connected directly to the address, data, and control buses of the computer. External peripherals can be connected to the computer via USB or a similar connection. The USB controller is itself a peripheral device, so every external peripheral (e.g. a mouse) is connected to the processor via an internal peripheral. In this book, we mainly consider internal peripherals directly connected to the address, data, and control buses. ==== Types of peripherals ==== There is a variety of peripherals which can be connected to the computer.
The most commonly used are: * parallel input/output ports * serial communication ports * timers/counters * analogue to digital converters * digital to analogue converters * interrupt controllers * DMA controllers * displays * keyboards * sensors * actuators ==== Addressing of I/O devices ==== From the assembler programmer's perspective, a peripheral device is represented as a set of registers available in the I/O address space. Registers of peripherals are used to control their behaviour, including the mode of operation, parameters, configuration, speed of transmission, etc. Registers are also data exchange points where the processor can store data to be transmitted to the user or an external computer, or read the data coming from the user or another system. The size of the I/O address space is usually smaller than the size of the program or data address space. The method of accessing peripherals depends on the design of the processor. There are two methods of I/O addressing implementation: separate or memory-mapped I/O. ==== Separate I/O address space ==== A separate I/O address space is accessed independently of the program or data memory. In the processor, it is implemented with the use of separate control bus lines to read or write I/O devices. Separate control lines usually mean that the processor also implements different instructions to access memory and I/O devices. It also means that a peripheral and a byte in memory can share the same address, and only the type of instruction used distinguishes the final destination. A separate I/O address space is shown schematically in Fig {{ref>separateio}}. Reading the data from the memory is activated with the #MEMRD signal, writing with the #MEMWR signal. If the processor needs to read or write a register of a peripheral device, it uses the #IORD or #IOWR lines respectively. The "#" symbol before a signal name means that the active state of the line is LOW while the idle state is HIGH.
{{ :en:multiasm:cs:separate_io.png?600 |Separate I/O address space}} Separate I/O address space
==== Memory-mapped I/O address space ==== In this approach, the processor doesn't implement separate control signals to access the peripherals. It uses a #RD signal to read the data from, and a #WR signal to write the data to, a common address space. It also doesn't have different instructions to access registers in the I/O address space. Every such register is visible like any other memory location. This means it is solely the responsibility of the software to distinguish between I/O registers and data in the memory. The memory-mapped I/O address space is shown schematically in Fig {{ref>memorymappedio}}.
{{ :en:multiasm:cs:memory_mapped_io.png?600 |Memory-mapped I/O address space}} Memory-mapped I/O address space
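The idea of memory-mapped I/O can be illustrated with a toy bus model. The sketch below (Python, with an invented UART_TX register address - no real hardware is modelled) shows how an address decoder routes a write either to RAM or to a peripheral, while the software uses the very same write operation for both:

```python
# Toy model of memory-mapped I/O: writes to the (hypothetical) UART_TX
# address reach a "device" instead of the RAM array. The address alone
# selects the destination, exactly as in memory-mapped I/O.

UART_TX = 0xFF00                        # invented peripheral register address

class Bus:
    def __init__(self, ram_size: int = 0x10000):
        self.ram = bytearray(ram_size)
        self.uart_output = []           # bytes the "device" received

    def write(self, address: int, value: int) -> None:
        if address == UART_TX:          # the address decoder picks the device
            self.uart_output.append(value)
        else:                           # otherwise an ordinary memory write
            self.ram[address] = value

    def read(self, address: int) -> int:
        return self.ram[address]

bus = Bus()
bus.write(0x0100, 42)                   # plain data write to RAM
bus.write(UART_TX, ord('A'))            # same operation, but hits the peripheral
print(bus.read(0x0100))                 # 42
print(bus.uart_output)                  # [65]
```

Note that the software itself must know that 0xFF00 is a register and not data - the processor makes no distinction, which is exactly the point made above.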
===== Instruction Execution Process ===== As we already mentioned, instructions are executed by the processor in a few steps. In the literature, you can find descriptions with three, four, or five stages of instruction execution; everything depends on the level of detail one considers. The three-stage description comprises fetch, decode and execute steps. The four-stage model adds a data read step between decode and execute. The five-stage version adds another final step for writing the result back, and sometimes reverses the order of data read and execution. It is worth remembering that even a simple fetch step can be divided into a set of smaller actions which must be performed by the processor. The real execution of instructions depends on the processor's architecture, implementation and complexity. Considering the five-stage model, we can describe the general model of instruction execution: - Fetching the instruction: * The processor addresses the instruction by sending the content of the IP register over the address bus. * The processor reads the code of the instruction from program memory over the data bus. * The processor stores the code of the instruction in the instruction register. * The processor prepares the IP (usually by incrementing it) to point to the next instruction in the stream. - Instruction decoding: * A simple instruction can be interpreted directly by the instruction decoder logic. * More complex instructions are processed in several steps using microcode. * Results are provided to the execution unit. - Data reading (if the instruction requires reading data): * The processor calculates the address of the data in the data memory. * The processor sends this address over the address bus to the memory. * The processor reads the data from memory and stores it in the accumulator or a general-purpose register. - Instruction execution: * The execution unit performs the actions encoded in the instruction (simple or with microcode).
* In modern processors there are multiple execution units, which can execute different instructions simultaneously. - Writing back the result: * The processor writes the result of the calculations into a register or memory. ==== Instruction encoding ==== From the perspective of the processor, instructions are binary codes, unambiguously determining the activities the processor is to perform. Instructions can be encoded using a fixed or variable number of bits. A fixed number of bits makes the construction of the instruction decoder simpler, because the choice of a specific behaviour or function of the execution unit is encoded with bits which are always at the same position in the instruction. On the other hand, if the designer plans to expand the instruction set with new instructions in the future, some spare bits must be reserved in the instruction word, which makes the program code larger than strictly necessary. Fixed-length instructions are often implemented in RISC machines. For example, in the ARM architecture instructions have 32 bits; in AVR, instructions are encoded using 16 bits. A variable number of bits makes the instruction decoder more complex. Based on the content of the first part of the instruction (usually a byte), it must be able to decide what the length of the whole instruction is. In such an approach, instructions can be as short as one byte, or much longer. An example of a processor with variable instruction length is the 8086 and all further processors from the x86 and x64 families. Here the instructions, including all possible constant arguments, can be up to 15 bytes long. Although in the computer world information is very often encoded in bytes or multiples of bytes, there are processors with instructions encoded in other numbers of bits. Examples include PIC microcontrollers with instruction lengths of 12 or 14 bits.
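The extra work a variable-length decoder must do can be sketched in a few lines. The example below uses a completely made-up three-instruction encoding (not any real ISA): the first byte is the opcode, and the decoder must consult it to learn how many operand bytes follow before it can find the next instruction - the same problem an x86 decoder faces on a much larger scale:

```python
# Sketch of a variable-length instruction decoder for an invented ISA.
# The opcode byte determines how many operand bytes belong to the instruction.

OPCODES = {                  # opcode byte -> (mnemonic, number of operand bytes)
    0x01: ("nop", 0),
    0x02: ("copy_imm", 2),   # register number, 8-bit immediate
    0x03: ("jump", 2),       # 16-bit destination address
}

def decode(stream: bytes):
    """Split a byte stream into (mnemonic, operand bytes) tuples."""
    i, out = 0, []
    while i < len(stream):
        mnemonic, n = OPCODES[stream[i]]
        out.append((mnemonic, stream[i + 1:i + 1 + n]))
        i += 1 + n           # variable step: 1 byte for nop, 3 for the others
    return out

program = bytes([0x01, 0x02, 0x00, 0x05, 0x03, 0x00, 0x80])
for mnemonic, operands in decode(program):
    print(mnemonic, operands.hex())   # nop / copy_imm 0005 / jump 0080
```

A fixed-length decoder would simply step by a constant amount; here the step depends on the opcode, which is why the decoding hardware is more complex.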
===== Modern Processors: Pipeline, Superscalar, Branch Prediction, Hyperthreading ===== Modern processors have a very complex design and include many units responsible mainly for shortening the execution time of the software. ==== Pipeline ==== As described in the previous chapter, executing a single instruction requires many actions which must be performed by the processor. We saw that each step, or even substep, can be performed by a separate logical unit. This feature has been used by designers of modern processors to create processors in which instructions are executed in a pipeline. A pipeline is a collection of logical units that execute many instructions at the same time - each of them at a different stage of execution. If the instructions arrive in a continuous stream, the pipeline allows the program to execute faster than on a processor without a pipeline. Note that the pipeline does not reduce the execution time of a single instruction; it increases the throughput of the instruction stream. A simple pipeline is implemented in AVR microcontrollers. It has two stages, which means that while one instruction is executed, another one is fetched, as shown in Fig {{ref>pipelineavr}}.
{{ :en:multiasm:cs:pipeline_AVR.png?600 |Simple 2-stage pipeline in AVR microcontroller}} Simple 2-stage pipeline in AVR microcontroller
The early 8086 processor executed instructions in four steps. This allowed the implementation of a 4-stage pipeline, as shown in Fig. {{ref>pipeline8086}}
{{ :en:multiasm:cs:pipeline_8086.png?600 |4-stage pipeline in 8086 microprocessor}} 4-stage pipeline in 8086 microprocessor
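The throughput gain of pipelining can be estimated with simple arithmetic: with a k-stage pipeline and no stalls, n instructions finish in k + n - 1 cycles instead of k * n. A back-of-the-envelope sketch (idealised - real pipelines stall on branches and memory):

```python
# Idealised pipeline throughput: n instructions on a k-stage machine.
# Without a pipeline each instruction takes k cycles; with one, a new
# instruction completes every cycle once the pipeline is full.

def cycles(n_instructions: int, stages: int, pipelined: bool) -> int:
    if pipelined:
        return stages + n_instructions - 1   # fill time + one per cycle
    return stages * n_instructions           # strictly sequential

n = 1000
print(cycles(n, 4, pipelined=False))  # 4000 cycles without a pipeline
print(cycles(n, 4, pipelined=True))   # 1003 cycles with a 4-stage pipeline
```

Note that a single instruction still needs the full 4 cycles in both cases; only the stream as a whole finishes sooner, which is exactly the throughput-versus-latency point made above.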
Modern processors implement longer pipelines. For example, the Pentium III used a 10-stage pipeline, the Pentium 4 a 20-stage one, and the Pentium 4 Prescott even a 31-stage pipeline. Does a longer pipeline mean faster program execution? Everything has benefits and drawbacks. The undoubted benefit of a longer pipeline is more instructions executed at the same time, which gives higher instruction throughput. But a problem appears when branch instructions come. When a conditional jump appears in the instruction stream, the processor must choose which way the instruction stream should go. Should the jump be taken or not? The answer is usually based on the result of a preceding instruction and is known only when the branch instruction is close to the end of the pipeline. In such a situation, in modern processors, the branch prediction unit guesses what to do with the branch. If it misses, the pipeline content is invalidated and the pipeline starts operation from the beginning. This causes stalls in the program execution. The longer the pipeline, the bigger the number of instructions to invalidate. That's why Intel decided to return to shorter pipelines. In modern microarchitectures, the length of the pipeline varies between 12 and 20 stages. ==== Superscalar ==== A superscalar processor increases the speed of program execution because it can execute more than one instruction during a clock cycle. This is realised by simultaneously dispatching instructions to different execution units on the processor. The superscalar processor doesn't implement two or more independent pipelines; rather, decoded instructions are sent for further processing to the chosen execution unit, as shown in Fig. {{ref>superscalar}}.
{{ :en:multiasm:cs:superscalar.png?600 |Superscalar architecture of pipelined processor}} Superscalar architecture of pipelined processor
In the x86 family, the first processor with two execution paths was the Pentium, with its U and V pipelines. Modern x64 processors like the i7 implement six execution units. Not all execution units have the same functionality; for example, in the i7 processor, every execution unit has different capabilities, as presented in table {{ref>executionunits}}. ^ Execution unit ^ Functionality ^ | 0 | Integer calculations, Floating point multiplication, SSE multiplication, divide | | 1 | Integer calculations, Floating point addition, SSE addition | | 2 | Address generation, load | | 3 | Address generation, store | | 4 | Data store | | 5 | Integer calculations, Branch, SSE addition |
Execution units of i7 processor
The real path of instruction processing is much more complex. Additional techniques are implemented to achieve better performance, e.g. out-of-order execution and register renaming. They are performed automatically by the processor, and the assembler programmer does not influence their behaviour. ==== Branch prediction ==== As mentioned, the pipeline can suffer invalidation if a conditional branch is not properly predicted. The branch prediction unit is used to guess the outcome of conditional branch instructions. It helps to reduce delays in program execution by predicting the path the program will take. Prediction is based on historical data and program execution patterns. There are many methods of predicting branches. In general, the processor implements a buffer with the addresses of the last few branch instructions, with a history register for every branch. Based on this history, the branch prediction unit can guess whether the branch should be taken. ==== Hyperthreading ==== Hyper-Threading Technology is Intel's approach to simultaneous multithreading, which allows the operating system to execute more than one thread on a single physical core. For each physical core, the operating system defines two logical processor cores and shares the load between them when possible. Hyperthreading technology uses the superscalar architecture to increase the number of instructions that operate in parallel in the pipeline on separate data. With Hyper-Threading, one physical core appears to the operating system as two separate processors. The logical processors share the execution resources, including the execution engine, caches, and system bus interface. Only the elements that store the architectural state of the processor are duplicated, including the registers essential for code execution. ===== Fundamentals of Addressing Modes ===== An addressing mode is the way in which the argument of an instruction is specified.
The addressing mode defines a rule for interpreting the address field of the instruction before the operand is reached. Addressing modes are used in instructions which operate on data and in instructions which change the program flow. ==== Data addressing ==== Instructions which reach the data have the possibility of specifying the data placement. The data is an argument of the instruction, sometimes called an operand. An operand can be one of the following: register, immediate, direct memory, or indirect memory. As at this point the reader doesn't know any real assembler instructions, we will use the hypothetical instruction //copy// that copies the data from the source operand to the destination operand. The order of the operands will be similar to high-level languages, where the left operand is the destination and the right operand is the source. Copying data from //a// to //b// will be done with an instruction as in the following example: copy b, a **A register operand** is used where the data which the processor wants to reach is stored, or is intended to be stored, in a register. If we assume that //a// and //b// are both registers, named //R0// and //R1//, the instruction for copying data from //R0// to //R1// will look as in the following example and as shown in Fig.{{ref>addrregister}}. copy R1, R0
{{ :en:multiasm:cs:addressing_register.png?600 |Illustration of addressing registers}} Illustration of addressing registers
**An immediate operand** is a constant or the result of a constant expression. The assembler encodes immediate values into the instruction at assembly time. There can be only one operand of this type in an instruction, and it is always in the source position of the operand list. Immediate operands are used to initialise registers or variables, or as numbers for comparison. An immediate operand, as it is encoded in the instruction, is placed in code memory, not in data memory, and can't be modified during software execution. The instruction which initialises register //R1// with the constant (immediate) value of //5// looks like this: copy R1, 5
{{ :en:multiasm:cs:addressing_immediate.png?600 |Illustration of immediate addressing mode}} Illustration of immediate addressing mode
**A direct memory operand** specifies the data at a given address. The address can be given in numerical form or as the name of a previously defined variable. It is equivalent to a static variable definition in high-level languages. If we assume that //var// represents the address of the variable, the instruction which copies data from the variable to //R1// can look like this: copy R1, var
{{ :en:multiasm:cs:addressing_direct.png?600 |Illustration of direct addressing mode}} Illustration of direct addressing mode
**An indirect memory operand** is accessed by specifying the name of a register whose value represents the address of the memory location to reach. We can compare indirect addressing to a pointer in high-level languages, where the variable does not store the value but points to the memory location where the value is stored. Indirect addressing can also be used to access elements of a table in a loop, where we use an index value which changes every loop iteration rather than a single address. Different assemblers have different notations for indirect addressing: some use brackets, some square brackets, and others the //@// symbol. Even different assembler programs for the same processor can differ. In the following example, we assume the use of square brackets. The instruction which copies the data from the memory location addressed by the content of the //R0// register to the //R1// register would look like this: copy R1, [R0]
{{ :en:multiasm:cs:addressing_indirect.png?600 |Illustration of indirect addressing mode}} Illustration of indirect addressing mode
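The four operand kinds introduced so far can be mimicked with a tiny machine model. The sketch below (Python; the register names, memory size and helper functions are all invented for illustration, mirroring the hypothetical //copy// instruction) shows how each mode locates its data:

```python
# Toy machine model for the hypothetical `copy` instruction, illustrating
# immediate, register, direct and indirect operands.

memory = bytearray(256)            # tiny data memory
regs = {"R0": 0, "R1": 0}          # two invented registers

def copy_imm(dst, value):          # copy R1, 5        (immediate)
    regs[dst] = value

def copy_reg(dst, src):            # copy R1, R0       (register)
    regs[dst] = regs[src]

def copy_direct(dst, address):     # copy R1, var      (direct memory)
    regs[dst] = memory[address]

def copy_indirect(dst, ptr_reg):   # copy R1, [R0]     (indirect memory)
    regs[dst] = memory[regs[ptr_reg]]

memory[0x20] = 99                  # a "variable" stored at address 0x20
copy_imm("R0", 0x20)               # R0 now holds that address
copy_indirect("R1", "R0")          # R1 <- memory[R0], like a pointer dereference
print(regs["R1"])                  # 99
```

Note how the indirect variant needs two lookups (register, then memory) - exactly the pointer analogy described above.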
**Variations of indirect addressing**. The indirect addressing mode can have many variations where the final address doesn't have to be the content of a single register, but rather the sum of a constant value and one or more registers. Some variants implement automatic incrementation (similar to the "++" operator) or decrementation ("--") of the index register before or after instruction execution to make processing tables faster. For example, accessing elements of the table whose base address is named //table//, where the register //R0// holds the index of the byte which we want to copy from the table to //R1//, could look like this: copy R1, table[R0]
{{ :en:multiasm:cs:addressing_index.png?600 |Illustration of indirect index addressing mode}} Illustration of indirect index addressing mode
Addressing mode with pre-decrementation (decrementing before instruction execution) could look like this: copy R1, table[--R0] Addressing mode with post-incrementation (incrementing after instruction execution) could look like this: copy R1, table[R0++] ==== Program control flow destination addressing ==== The operand of a jump, branch, or function call instruction addresses the destination of the program control flow. The result of these instructions is a change of the Instruction Pointer content. Jump instructions should be avoided in structured or object-oriented high-level languages, but they are rather common in assembler programming. Our examples will use the hypothetical //jump// instruction with a single operand: the destination address. **Direct addressing** of the destination is similar to direct data addressing. It specifies the destination address as a constant value, usually represented by a name. In assembler, we define the names of addresses in code as //labels//. In the following example, the code will jump to the label named //destin//: jump destin
{{ :en:multiasm:cs:jump_direct.png?600 |Illustration of addressing in direct jump}} Illustration of addressing in direct jump
**Indirect addressing** of the destination uses the content of the register as the address where the program will jump. In the following example, the processor will jump to the destination address which is stored in //R0//: jump [R0]
{{ :en:multiasm:cs:jump_indirect.png?600 |Illustration of addressing in indirect jump}} Illustration of addressing in indirect jump
==== Absolute and Relative addressing ==== In all previous examples, the addresses were specified as the values which represent the **absolute** memory location. The resulting address (even calculated as the sum of some values) was the memory location counted from the beginning of the memory - address "0". It is presented in Fig{{ref>addrabsolute}}.
{{ :en:multiasm:cs:addressing_absolute.png?600 |Illustration of absolute addressing of the variable}} Illustration of absolute addressing of the variable
Absolute addressing is simple and doesn't require any additional calculations by the processor. It is often used in embedded systems, where the software is installed and configured by the designer and the location of programs does not change. Absolute addressing is very hard to use in general-purpose operating systems like Linux or Windows, where the user can start a variety of different programs, and their placement in the memory differs every time they're loaded and executed. Much more useful is **relative addressing**, where an operand is specified as the difference between the target memory location and some known value which can be easily modified and accessed. Often the operands are provided relative to the Instruction Pointer, which allows the program to be loaded at any address in the address space: the distance between the currently executed instruction and the location of the data it wants to reach is always the same. This is the default addressing mode in the Windows operating system on x64 machines. It is illustrated in Fig{{ref>addrrelative}}.
{{ :en:multiasm:cs:addressing_relative.png?600 |Illustration of IP relative addressing of the variable}} Illustration of IP relative addressing of the variable
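The position-independence argument can be made concrete with one line of arithmetic: the instruction stores only a displacement, fixed at assembly time, and the effective address is computed from wherever the code actually runs. A minimal sketch (the load addresses are invented):

```python
# IP-relative addressing in miniature: the encoded displacement is constant,
# so the same code works wherever the loader places it.

def effective_address(ip: int, displacement: int) -> int:
    """Effective address = current instruction pointer + encoded displacement."""
    return ip + displacement

displacement = 0x40                                  # fixed at assembly time
print(hex(effective_address(0x1000, displacement)))  # 0x1040 if loaded at 0x1000
print(hex(effective_address(0x7000, displacement)))  # 0x7040 if loaded at 0x7000
```

The displacement never changes; only the IP does, so no relocation of the code is needed when the program is loaded at a different base address.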
Relative addressing is also implemented in many jump, branch or loop instructions. ===== Fundamentals of Data Encoding, Big Endian, Little Endian ===== The processor can work with different types of data. These include integers of different sizes, floating point numbers, texts, structures and even single bits. All these data are stored in the memory as a single byte or multiple bytes. ==== Integers ==== Integer data types can be 8, 16, 32 or 64 bits long. If the encoded number is unsigned, it is stored in natural binary representation, while if the value is signed, the representation is two's complement. A natural binary number has the value zero if all of its bits are equal to zero. If all of its bits are equal to one, the value can be calculated with the expression {{ :en:multiasm:cs:equation_binary.png?200 |}}, where n is the number of bits in the number. In two's complement representation, the most significant bit (MSB) represents the sign of the number. Zero means a non-negative number, one represents a negative value. The table {{ref>binarynumbers}} shows the integer data types with their ranges. ^ Number of bits ^ Minimum value (hexadecimal) ^ Maximum value (hexadecimal) ^ Minimum value (decimal) ^ Maximum value (decimal) ^ | 8 | 0x00 | 0xFF | 0 | 255 | | 8 signed | 0x80 | 0x7F | -128 | 127 | | 16 | 0x0000 | 0xFFFF | 0 | 65 535 | | 16 signed | 0x8000 | 0x7FFF | -32 768 | 32 767 | | 32 | 0x0000 0000 | 0xFFFF FFFF | 0 | 4 294 967 295 | | 32 signed | 0x8000 0000 | 0x7FFF FFFF | -2 147 483 648 | 2 147 483 647 | | 64 | 0x0000 0000 0000 0000 | 0xFFFF FFFF FFFF FFFF | 0 | 18 446 744 073 709 551 615 | | 64 signed | 0x8000 0000 0000 0000 | 0x7FFF FFFF FFFF FFFF | -9 223 372 036 854 775 808 | 9 223 372 036 854 775 807 |
Integer binary numbers
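The ranges from the table can be verified on a host machine with C's fixed-width integer types; the sketch also demonstrates the two's complement rule of negation (invert all bits, then add one):

<code c>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Unsigned 8-bit: all bits set gives 2^8 - 1 = 255. */
    uint8_t umax = 0xFF;

    /* Signed 8-bit (two's complement): 0x80 is -128, 0x7F is 127. */
    int8_t smin = (int8_t)0x80;
    int8_t smax = 0x7F;

    /* Negating a value in two's complement: invert all bits, add 1. */
    int8_t minus_five = (int8_t)(~5 + 1);

    printf("%d %d %d %d\n", umax, smin, smax, minus_five);
    return 0;
}
</code>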
==== Floating point ==== Integer calculations do not always cover all mathematical requirements of an algorithm. To represent real numbers, floating point encoding is used. A floating point number represents a value //A// composed of three fields: * Sign bit * Exponent (E) * Mantissa (M) fulfilling the equation {{ :en:multiasm:cs:equation_floating.png?200 |}} There are two main sizes of floating point values. A single precision number is encoded in 32 bits; a double precision number is encoded in 64 bits. They are presented in Fig{{ref>realtypes}}.
{{ :en:multiasm:cs:floating_numbers.png?600 |Illustration of single and double precision real numbers}} Illustration of single and double precision real numbers
The Table{{ref>realnumbers}} shows the number of bits of the exponent and mantissa for single and double precision floating point numbers. It also presents the smallest and largest values which can be stored using these formats (they are absolute values and can be positive or negative depending on the sign bit).
^ Precision ^ Exponent ^ Mantissa ^ The smallest ^ The largest ^
| Single (32 bit) | 8 bits | 23 bits | {{ :en:multiasm:cs:min_fp_32.png?105 }} | {{ :en:multiasm:cs:max_fp_32.png?100 }} |
| Double (64 bit) | 11 bits | 52 bits | {{ :en:multiasm:cs:min_fp_64.png?120 }} | {{ :en:multiasm:cs:max_fp_64.png?100 }} |
Floating point numbers
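How the three fields are packed can be inspected on a host machine in C, assuming the host uses IEEE 754 single precision (virtually all modern machines do). The example value -6.25 equals -1.5625 × 2², so its stored exponent is the real exponent 2 plus the bias 127:

<code c>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Reinterpret the bits of an IEEE 754 single-precision number. */
    float a = -6.25f;                 /* -6.25 = -1.5625 * 2^2 */
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);   /* safe type punning */

    uint32_t sign     = bits >> 31;            /* 1 bit            */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, bias 127 */
    uint32_t mantissa = bits & 0x7FFFFF;       /* 23 bits, hidden leading 1 */

    printf("sign=%u exponent=%u (real %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}
</code>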
The most common representation of real numbers on computers is standardised in the document IEEE Standard 754. It introduces two modifications which make the calculations easier for computers: * The biased exponent * The normalised mantissa A biased exponent means that a constant bias value (127 for single and 1023 for double precision) is added to the real exponent. As a result, the stored exponent is always non-negative, which makes comparing numbers easier. The normalised mantissa is adjusted to have exactly one bit of value "1" to the left of the binary point, which requires an appropriate exponent adjustment. ==== Texts ==== Texts are represented as a series of characters. In modern operating systems, texts are encoded using Unicode, which is capable of encoding not only the 26 basic Latin letters but also language-specific characters of many different languages. In simpler computers, such as embedded systems, 8-bit ASCII codes are often used. Every byte of the text representation in memory then contains the ASCII code of a single character. It is quite common in assembler programs to use the zero value (NULL) as the end character of the string, similar to the C/C++ null-terminated string convention. ==== Endianness ==== Data encoded in memory must be compatible with the processor. Memory chips are usually organised as a sequence of bytes, which means that every byte can be individually addressed. For processors wider than 8 bits, the question of byte order within larger data types arises. There are two possibilities: - Little Endian - the low-order byte is stored at the lower address in memory. - Big Endian - the high-order byte is stored at the lower address in memory. These two methods for a 32-bit class processor are shown in Fig{{ref>littlebigendian}}
{{ :en:multiasm:cs:little_big_endian.png?600 |Illustration of Little and Big Endian data placement in the memory}} Illustration of Little and Big Endian data placement in the memory
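A program can detect the endianness of the machine it runs on by storing a multi-byte value and inspecting its individual bytes in memory, as this C sketch shows:

<code c>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Store a 32-bit value and look at its individual bytes in memory. */
    uint32_t value = 0x11223344;
    uint8_t bytes[4];
    memcpy(bytes, &value, sizeof bytes);

    if (bytes[0] == 0x44)   /* low-order byte at the lowest address */
        printf("little endian: %02X %02X %02X %02X\n",
               bytes[0], bytes[1], bytes[2], bytes[3]);
    else                    /* high-order byte at the lowest address */
        printf("big endian: %02X %02X %02X %02X\n",
               bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}
</code>

On a typical x86 or ARM (little-endian) machine the first byte printed is 44.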
===== Interrupt Controller, Interrupts ===== An interrupt is a request to the processor to temporarily suspend the currently executing code in order to handle the event that caused the interrupt. If the request is accepted, the processor saves its state and executes a function named an interrupt handler or interrupt service routine (ISR). Interrupts are usually signalled by peripheral devices when they have some data to process. Often, peripheral devices do not send an interrupt signal directly to the processor; instead, an interrupt controller in the system collects requests from the various peripheral devices. The interrupt controller prioritizes the peripherals to ensure that the more important requests are handled first. From a hardware perspective, an interrupt can be signalled with a signal level or a signal change. * Level triggered - a stable low or high level signals the interrupt. If the interrupt signal is still active when the interrupt handler finishes, the interrupt is signalled again. * Edge triggered - the interrupt is signalled only when there is a change on the interrupt input: the falling or rising edge of the interrupt signal. The interrupt signal arrives asynchronously, which means that it can come during the execution of an instruction. Usually, the processor finishes this instruction and then calls the interrupt handler. To be able to handle interrupts, the processor must implement a mechanism for storing the address of the next instruction to be executed in the interrupted code. Some implementations use the stack, while others use a special register to store the return address. The latter approach requires software support if interrupts can be nested (if an interrupt can be accepted while already in another ISR). After finishing the interrupt subroutine, the processor uses the return address to pass program control back to the interrupted code. The Fig. {{ref>interrupt}} shows how an interrupt works with stack use.
The processor executes the program. When an interrupt arrives, the processor saves the return address on the stack and jumps to the interrupt handler. With the //return// instruction, the processor returns to the program, taking the address of the next instruction to execute from the stack.
{{ :en:multiasm:cs:interrupt_handling.png?600 |Illustration of interrupt handling with stack use}} Illustration of interrupt handling with stack use
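The stack-based hand-off shown in the figure can be modelled on a host machine in plain C. All names here are illustrative; on real hardware the ISR is entered by the processor itself, not by an ordinary function call.

<code c>
#include <stdio.h>

/* Host-side model: the "stack" stores the index of the next instruction
 * to resume at after the handler returns. */
static int stack[8];
static int sp = 0;
static int counter = 0;

static void isr_timer(void) { counter++; }   /* pretend interrupt handler */

int main(void) {
    int pc;                          /* model of the program counter */
    for (pc = 0; pc < 5; pc++) {
        if (pc == 2) {               /* interrupt arrives after instruction 2 */
            stack[sp++] = pc + 1;    /* push the return address */
            isr_timer();             /* "jump" to the handler */
            pc = stack[--sp] - 1;    /* pop the return address, resume */
        }
    }
    printf("resumed correctly, ISR ran %d time(s)\n", counter);
    return 0;
}
</code>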
==== Recognising the interrupt source ==== To properly handle interrupts, the processor must recognise the source of the interrupt: different code should be executed when the interrupt is signalled by a network controller than when its source is a timer. The information on the interrupt source is provided to the processor by the interrupt controller or directly by the peripheral. We can distinguish three main methods of calling the proper ISR for an incoming interrupt. * Fixed – typical for microcontrollers. Every interrupt handler has its own fixed starting address. * Vectored – typical for microprocessors. There is one interrupt pin; the peripheral sends the address of the handler over the data bus. * Indexed – typical for microprocessors. The peripheral sends an interrupt number, which is used as an index into a table of handlers' addresses. ==== Maskable and non-maskable interrupts ==== Interrupts can be enabled or disabled. Disabling interrupts is often used in time-critical code to ensure the shortest possible execution time. Interrupts which can be disabled are named maskable interrupts. They can be disabled with the corresponding flag in the control register. In microcontrollers, there are separate bits for different interrupts. An interrupt which cannot be disabled is named a non-maskable interrupt. Such interrupts are implemented for critical situations: * memory failure * power down * critical hardware errors In microprocessors, there is a separate non-maskable interrupt input – NMI. ==== Software and internal interrupts ==== In some processors, it is possible to signal an interrupt by executing special instructions. These are named software interrupts and can be used to test interrupt handlers. In some operating systems (DOS, Linux), software interrupts are used to implement the mechanism of calling system functions. Another group of interrupts, signalled by the processor itself, are internal interrupts.
They aren't signalled with a special instruction but occur in specific situations during normal program execution. * Exceptions – generated by abnormal behaviour of some instructions (e.g. dividing by 0). * Trap – a transfer of control to a handler routine intentionally initiated by the programmer (e.g. for debugging). * Fault – generated when errors are detected in the program (memory protection violation, invalid operation code). ===== DMA ===== Direct memory access (DMA) is a mechanism for fast data transfer between peripherals and memory. In some implementations, it is also possible to transfer data between two peripherals or from memory to memory. DMA operates without the activity of the processor; no software is executed during the DMA transfer. It must be supported by the processor and peripheral hardware, and there must be a DMA controller present in the system. The controller plays the main role in transferring the data. A DMA controller is a specialised unit which controls the data transfer process. It implements several channels, each containing an address register, used to address the memory location, and a counter, which specifies how many cycles should be performed. The address register and counter must be programmed by the processor; this is usually done in the system startup procedure. The system with an inactive DMA controller is presented in Fig.{{ref>DMAinactive}}.
{{ :en:multiasm:cs:DMA_inactive.png?600 |System with inactive DMA controller}} System with inactive DMA controller
The process of data transfer is done in several steps. Let us consider the situation when a peripheral has some data to transfer: * The peripheral signals the request to transfer data (DREQ). * The DMA controller forwards the request to the processor (HOLD). * The processor accepts the DMA cycle (HLDA) and disconnects from the buses. * The DMA controller generates the address on the address bus and sends the acknowledge signal to the peripheral (DACK). * The peripheral sends the data over the data bus. * The DMA controller generates a write signal to store the data in memory. * The DMA controller updates the address register and the counter. * If the counter reaches zero, the data transfer stops. Everything is done without any action of the processor; no program is fetched and executed. Because everything is done by hardware, each transfer can be completed in one memory access cycle, much faster than by the processor. Data transfer by the processor is significantly slower because a single cycle requires executing at least four instructions and two data transfers: one from the peripheral to the processor and another from the processor to memory. The system with an active DMA controller is presented in Fig.{{ref>DMAactive}}.
{{ :en:multiasm:cs:DMA_active.png?600 |System performing DMA transfer}} System performing DMA transfer
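The address-register-plus-counter mechanism described above can be sketched as a host-side C model. The structure and function names are illustrative, not a real controller's programming interface:

<code c>
#include <stdint.h>
#include <stdio.h>

/* Host-side model of one DMA channel: an address register and a
 * down-counter, programmed once by the "processor" and then advanced
 * by "hardware" one byte per cycle. */
typedef struct {
    uint8_t *address;   /* current memory address */
    uint16_t counter;   /* remaining transfer cycles */
} dma_channel_t;

/* One DMA cycle: store the byte from the "peripheral", then update
 * the address register and the counter. */
static void dma_cycle(dma_channel_t *ch, uint8_t data_from_peripheral) {
    *ch->address = data_from_peripheral;
    ch->address++;
    ch->counter--;
}

int main(void) {
    uint8_t memory[4] = {0};
    dma_channel_t ch = { memory, sizeof memory };  /* programmed at startup */

    uint8_t sample = 0x10;
    while (ch.counter > 0)          /* transfer stops when the counter hits 0 */
        dma_cycle(&ch, sample++);

    printf("%02X %02X %02X %02X\n",
           memory[0], memory[1], memory[2], memory[3]);
    return 0;
}
</code>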
DMA transfer can be done in several modes: * Single - one transfer at a time * Block (burst) - a block of data at a time * On-demand - as long as the I/O device accepts transfers * Cycle stealing - DMA and CPU cycles alternate, one at a time * Transparent - DMA transfers only when the CPU is not using the buses ====== Programming in Assembler for IoT and Embedded Systems ====== Assembler is a low-level language that directly reflects processor instructions. It is used for programming microcontrollers and embedded devices, allowing for maximum control over the hardware of IoT systems. Programming in assembler requires knowledge of the processor architecture, which makes it more complex than high-level languages, but it offers greater efficiency. It is often used in critical applications where speed and efficiency matter. It also enables code optimization for energy consumption. In IoT and embedded systems, assembler enables efficient use of hardware resources, offering direct access to the hardware. It is possible to use assembler code in environments like Arduino or Raspberry Pi. Assembler programs are fast and efficient, which is crucial in systems with limited resources. Direct control of processor registers and hardware components allows for the creation of advanced applications. Although many libraries use assembler functions, it is often necessary to write code independently. Programming in assembler requires a deep understanding of the processor architecture, which improves system design and optimization. ===== Evolution of the Hardware ===== The development of computer hardware is crucial for technology, particularly in the context of the Internet of Things (IoT) and embedded systems. The beginnings date back to the 1940s, when the first electronic computers, such as ENIAC, were created. In the 1970s, the first microprocessors, such as the Intel 4004, appeared, revolutionizing the computer industry. Processors are advanced computing units in computers, smartphones, and other devices.
Their main tasks are performing computational operations, managing memory, and controlling system components. Microprocessors are a subset of processors designed to work in more specialized devices, such as televisions or industrial equipment. They are small, efficient, and energy-saving, making them ideal for Internet of Things (IoT) applications. Microprocessors became the foundation for the development of microcontrollers, which are the heart of embedded systems. Microcontrollers differ from processors and microprocessors in applications and functionality: they are small, standalone units with built-in memory, clocks, and interfaces. Microcontrollers are complex integrated circuits designed to act as the brain of electronic devices. The central processing unit (CPU) plays a key role in their operation, controlling the microcontroller's functions and performing arithmetic and logical operations. Inside the microcontroller are registers, communication interface controllers, timers, and memory, all of which are connected by a common bus. Modern microcontrollers have built-in functions such as analog-to-digital converters (ADC) and communication interfaces (UART, SPI, I2C). There are several types of microcontrollers, including PIC (Microchip Technology), AVR (Atmel), ESP8266/ESP32 (Espressif), and TI MSP430 (Texas Instruments). ===== Specific Elements of AVR Architecture ===== AVR is an extension of the idea presented in Vegard Wollan and Alf-Egil Bogen's thesis. Together with Gaute Myklebust, they patented the architecture, and in 1996, Atmel Norway was established as the AVR microcontroller design center. In 2016, Microchip acquired Atmel. The AVR architecture is a popular choice for microcontrollers due to its efficiency and versatility.
Here are some specific elements that define the AVR architecture: * RISC Architecture: AVR microcontrollers are based on the RISC (Reduced Instruction Set Computing) architecture, which allows most instructions to execute in a single clock cycle. This results in high performance and low power consumption. * Harvard Architecture: AVR utilizes the Harvard architecture, which features separate memory spaces for program code and data. This separation allows simultaneous access to both memories, increasing processing speed. * On-chip Memory: AVR microcontrollers feature built-in flash memory for storing program code, as well as SRAM and EEPROM for data storage. This integration simplifies design and improves reliability. * General-purpose Registers: The AVR CPU contains 32 x 8-bit general-purpose registers, facilitating efficient data manipulation and quick access during operations. * Interrupts: AVR microcontrollers support internal and external interrupt sources. * Real Time Counter (RTC): Timer/Counter Type 2 (TC2) is a general-purpose, dual-channel, 8-bit timer/counter module which can be clocked from an external 32.768 kHz watch crystal. * I/O Capabilities: AVR microcontrollers offer various programmable I/O lines, allowing flexible interfacing with external devices and peripherals. * Watchdog Timer: A programmable watchdog timer enhances system reliability by resetting the microcontroller in the event of software failure. * Clock Oscillator: The built-in RC oscillator provides the necessary clock signals for the microcontroller's operation.
{{:en:multiasm:piot:architecture.gif?400|}} Other Architectures: PIC * Manufacturer: Microchip Technology * Architecture: RISC * PIC16: 8-bit microcontrollers, various versions with different peripheral sets, low power consumption * PIC18: 8-bit microcontrollers with advanced features such as CAN FD * PIC32: 32-bit microcontrollers, high performance, wide range of applications ESP8266/ESP32 * Manufacturer: Espressif Systems * Architecture: Tensilica Xtensa (extended version based on RISC) * ESP8266: Single-core processor, 80 MHz (overclockable to 160 MHz), 32 KB instruction RAM, 80 KB user RAM, Wi-Fi * ESP32: Dual-core processor, 240 MHz, 520 KB SRAM, Wi-Fi, Bluetooth, more GPIO TI MSP430 * Manufacturer: Texas Instruments * Architecture: RISC * MSP430: 16-bit microcontrollers, ultra-low power consumption, advanced analog peripherals, fast switching between modes ===== Registers ===== Registers are a key element of AVR microcontrollers. There are various types of registers, including general-purpose, special-purpose, and status registers. General-purpose registers are used to store temporary data. Special registers control microcontroller functions, such as timers or the ADC. Status registers store information about the processor state, such as the carry or zero flags. Each register has a specific function and is accessible through particular assembler instructions. Registers allow quick access to data and control over the processor. AVR CPU General Purpose Working Registers (R0-R31) **R0-R15:** Basic general-purpose registers used for various arithmetic and logical operations. **R16-R31:** General-purpose registers that can additionally be used with immediate operands; instructions such as LDI, ANDI, or ORI work only on these registers. {{:en:multiasm:piot:avr_general_purpose_register.png?400|}} **The X-register, Y-register, and Z-register:** address pointers for indirect addressing. R26-R31 have additional functions as 16-bit address pointers for indirectly addressing the data space; they are defined as the X, Y, and Z registers.
**Other registers:** RAMPX, RAMPY, RAMPZ: Registers concatenated with the X-, Y-, and Z-registers, enabling indirect addressing of the entire data space on MCUs with more than 64 KB of data space, and constant data fetch on MCUs with more than 64 KB of program space. RAMPD: Register concatenated with the Z-register, enabling direct addressing of the whole data space on MCUs with more than 64 KB of data space. EIND: Register concatenated with the Z-register, enabling indirect jump and call to the entire program space on MCUs with more than 64K words (128 KB) of program space. ===== Data Types and Encoding ===== In assembler, various data types, such as bytes, words, and double words, are used. Bytes are the smallest data unit and have 8 bits. Words have 16 bits, and double words have 32 bits. Data types are used to store integers, floating-point numbers, and characters. Data encoding involves representing information in binary form. Data can be written in binary, decimal, or hexadecimal notation. Character encoding is done using standards such as ASCII. Understanding data types and encoding is crucial for effective programming in assembler. This course will discuss various data types and encoding techniques. Practical examples will demonstrate how to encode and decode data in assembly language. AVR microcontrollers are classified as 8-bit devices, but the instruction word is 16-bit. **Data Types in AVR Assembler** * Single bits: Can be used for logical operations and control, e.g., setting or clearing bits in registers. * Bytes, 8-bit values: The basic data type in AVR, used to store small integers, ASCII characters, etc. * Words, 16-bit values: Consist of two bytes, used to store larger integers or memory addresses. * Double Words, 32-bit values: Consist of four bytes, used to store even larger integers or addresses. With additional libraries, such as those used with the C language, 64-bit data can also be utilized; such data require extra handling.
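The relationship between these types can be sketched in C on a host machine; composing wider values from bytes mirrors what an 8-bit AVR must do across several registers:

<code c>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* A 16-bit word is built from two 8-bit bytes ... */
    uint8_t  low  = 0x34, high = 0x12;
    uint16_t word = (uint16_t)((high << 8) | low);

    /* ... and a 32-bit double word from two 16-bit words. */
    uint32_t dword = ((uint32_t)0xABCD << 16) | word;

    printf("word=0x%04X dword=0x%08X\n", word, (unsigned)dword);
    return 0;
}
</code>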
**Encoding** * Binary Encoding: Data is stored in binary format, the most fundamental form of data representation in microcontrollers. * ASCII Encoding: Characters are often stored using ASCII encoding, where a unique 7-bit or 8-bit code represents each character. * BCD (Binary-Coded Decimal): Sometimes used for applications requiring precise decimal representation, such as digital clocks or calculators. Each decimal digit is represented by its binary equivalent. * Endianness: AVR microcontrollers typically use little-endian format, where the least significant byte is stored at the lowest memory address. ===== Addressing Modes ===== Addressing modes define how the processor accesses data. There are 15 different addressing modes, such as: Direct Addressing, Indirect Addressing, Indirect with Displacement, Immediate Addressing, Register Addressing, Relative Addressing, Indirect I/O Addressing, and Stack Addressing. Details: **1. Direct Single Register Addressing** {{:en:multiasm:piot:adr01.png?400|}} The operand is contained in the destination register (Rd). **2. Direct Register Addressing, Two Registers** {{:en:multiasm:piot:adr02.png?400|}} Operands are contained in the source register (Rr) and destination register (Rd). The result is stored in the destination register (Rd). **3. I/O Direct Addressing** {{:en:multiasm:piot:adr03.png?400|}} Operand address A is contained in the instruction word. Rr/Rd specifies the destination or source register. **4. Data Direct Addressing** {{:en:multiasm:piot:adr04.png?400|}} A 16-bit Data Address is contained in the 16 LSBs of a two-word instruction. Rd/Rr specify the destination or source register. The LDS instruction uses the RAMPD register to access memory above 64 KB. **5. Data Indirect Addressing** {{:en:multiasm:piot:adr05.png?400|}} The operand address is the contents of the X-, Y-, or Z-pointer. Data Indirect Addressing is called Register Indirect Addressing in AVR devices without SRAM. **6. 
Data Indirect Addressing with Pre-decrement** {{:en:multiasm:piot:adr06.png?400|}} The X-, Y-, or Z-pointer is decremented before the operation. The operand address is the decremented contents of the X-, Y-, or Z-pointer. **7. Data Indirect Addressing with Post-increment** {{:en:multiasm:piot:adr07.png?400|}} The X-, Y-, or Z-pointer is incremented after the operation. The operand address is the content of the X-, Y-, or Z-pointer before incrementing. **8. Data Indirect with Displacement** {{:en:multiasm:piot:adr08.png?400|}} The operand address is the result of the displacement q contained in the instruction word added to the Y- or Z-pointer. Rd/Rr specify the destination or source register. **9. Program Memory Constant Addressing** {{:en:multiasm:piot:adr09.png?400|}} The constant byte address is specified by the Z-pointer contents. The 15 MSbs select the word address. For LPM, the LSb selects the low byte if cleared (LSb == 0) or the high byte if set (LSb == 1). For SPM, the LSb should be cleared. If ELPM is used, the RAMPZ register is used to extend the Z-register. **10. Program Memory Addressing with Post-increment** The constant byte address is specified by the Z-pointer contents. The 15 MSbs select the word address. The LSb selects the low byte if cleared (LSb == 0) or the high byte if set (LSb == 1). If ELPM Z+ is used, the RAMPZ register is used to extend the Z-register. **11. Store Program Memory** {{:en:multiasm:piot:adr11.png?400|}} The Z-pointer is incremented by 2 after the operation. The constant byte address is specified by the Z-pointer contents before incrementing. The 15 MSbs select the word address and the LSb should be left cleared. **12. Direct Program Memory Addressing** {{:en:multiasm:piot:adr12.png?400|}} Program execution continues at the immediate address contained in the instruction word. **13.
Indirect Program Memory Addressing** {{:en:multiasm:piot:adr13.png?400|}} Program execution continues at the address contained in the Z-register (i.e., the PC is loaded with the contents of the Z-register). **14. Extended Indirect Program Memory Addressing** {{:en:multiasm:piot:adr14.png?400|}} Program execution continues at the address contained in the EIND register and the Z-register (i.e., the PC is loaded with the contents of EIND and the Z-register). **15. Relative Program Memory Addressing** {{:en:multiasm:piot:adr15.png?400|}} Program execution continues at the address PC + k + 1. The relative address k is from -2048 to 2047. ===== Instruction Encoding ===== Assembler instructions have a specific format that defines the operation and its arguments. The instruction format can vary depending on the processor architecture. Instruction encoding involves representing operations in binary form. Instructions can vary in length, from one to several bytes. Instruction encoding is crucial for the processor's efficient execution of a program. Instruction encoding for AVR microcontrollers involves converting assembler instructions into the appropriate machine code that the processor can directly execute. Each assembler instruction is represented by a specific set of bits that define the operation and its operands. **Instruction Format:** * AVR instructions are encoded in 16-bit or 32-bit machine words. * 16-bit instructions are most commonly used, but some more complex instructions require 32-bit encoding. **Bit Allocation:** * Opcode: Specifies the type of operation (e.g., addition, subtraction, shift). * Registers: Specification of source and destination registers. * Constants: Immediate values used in instructions. * Addresses: Memory addresses used in load and store instructions. ===== Instruction Set ===== The assembler instruction set includes arithmetic, logical, control, and input/output operations.
Arithmetic instructions include addition, subtraction, multiplication, and division. Logical instructions include operations such as AND, OR, XOR, and NOT. Control instructions include jumps, loops, and conditional branches. Input/output instructions operate on ports and registers. Understanding the instruction set is crucial for effective programming in assembler. This course will discuss the most commonly used instructions and their application in assembler code. Practical examples will show how to use instructions in AVR programming. The instruction set allows full control over the processor and its functions. **Basic commands**
^ Data Transfer ^^
|ldi | load immediate|
|mov | copy register|
|movw | copy register pair|
^ Logical Instructions ^^
|and | logical AND|
|andi | logical AND with immediate|
|or | logical OR|
|ori | logical OR with immediate|
|eor | exclusive OR|
|com | one's complement|
|neg | two's complement|
^ Arithmetic Instructions ^^
|add | add without carry|
|adc | add with carry|
|adiw | add immediate to word|
|sub | subtract without carry|
|subi | subtract immediate|
|sbc | subtract with carry|
|sbci | subtract immediate with carry|
|sbiw | subtract immediate from word|
|inc | increment|
|dec | decrement|
|mul | multiply unsigned//(1)//|
|muls | multiply signed//(1)//|
|mulsu | multiply signed with unsigned//(1)//|
|fmul | fractional multiply unsigned//(1)//|
|fmuls | fractional multiply signed//(1)//|
|fmulsu | fractional multiply signed with unsigned//(1)//|
//(1) Not all devices support these instructions//
^ Bit Shifts ^^
|lsl | logical shift left|
|lsr | logical shift right|
|rol | rotate left through carry|
|ror | rotate right through carry|
|asr | arithmetic shift right|
^ Bit Manipulation ^^
|sbr | set bit(s) in register|
|cbr | clear bit(s) in register|
|ser | set register|
|clr | clear register|
|swap | swap nibbles|
===== Best Practices on Structural Programming ===== Modular Programming: Modular programming involves dividing code into
smaller, independent modules. Modules can be reused multiple times and are easy to test. Modular programming increases the readability and understandability of the code. Each module should have a clearly defined function and interface. Modules should be independent of each other, which facilitates their modification and development. ===== 3 Levels of Programming: C++, Libraries, Assembler ===== Programming AVR microcontrollers can be divided into three levels: C++, libraries, and assembler. Each of these levels offers distinct benefits and is utilized according to the project's specific requirements. * **C++** is a high-level language that allows programmers to write code in a more abstract and understandable way. Using C++ for AVR enables the use of advanced features such as object-oriented programming, inheritance, and polymorphism. This makes the code more modular and easier to maintain. Compilers like AVR-GCC convert C++ code into machine code that can be executed by the microcontroller. * **Libraries** are sets of predefined functions and procedures that facilitate programming AVR microcontrollers. An example is the AVR Libc library, which offers functions for I/O handling, memory management, and mathematical operations. Using libraries allows for rapid application development without the need to write code from scratch. Libraries are particularly useful in projects that require frequent use of standard functions. * **Assembler** is a low-level language that allows direct programming of the AVR microcontroller. Writing code in assembler gives full control over the hardware and allows for performance optimization. However, programming in assembler requires a deep understanding of the microcontroller's architecture and is more complex than programming in C++. Assembler is often used in critical applications where every clock cycle counts. The choice of programming level depends on the project's specifics. 
C++ is ideal for creating complex applications, libraries facilitate rapid prototyping, and assembler provides maximum control and performance. Each of these levels has its place in AVR microcontroller programming. ===== Software Tools for AVR ===== Programming in assembler for AVR microcontrollers requires appropriate tools that facilitate code creation, debugging, and optimization. Here are some popular software tools for AVR: * Microchip Studio (formerly Atmel Studio) is an integrated development environment (IDE) for AVR and SAM microcontrollers. It allows writing, compiling, and debugging code in assembler and C/C++. Microchip Studio offers support for a wide range of AVR devices and integration with programming and debugging tools. * AVR Toolchain is a set of tools that includes a compiler, assembler, linker, and standard and mathematical libraries. These tools are based on GNU projects and are available for Windows, Linux, and macOS. AVR Toolchain is ideal for developers who prefer working with command-line tools. * AVRDUDE (AVR Downloader/Uploader) is a tool for programming AVR microcontrollers. It allows uploading code to the microcontroller using various programmers. AVRDUDE is often used in conjunction with other tools, such as Microchip Studio, to provide a complete programming cycle. ===== Hardware Debugging ===== Hardware debugging of AVR microcontrollers is a crucial element of the programming process, enabling precise testing and diagnosis of code issues. Here are some popular methods and tools for hardware debugging for AVR: * debugWIRE is a debugging interface developed by Atmel (now Microchip Technology) for AVR microcontrollers. It enables debugging using a single pin (RESET) and is particularly useful in small microcontrollers that lack many pins. debugWIRE allows setting breakpoints, tracking code execution, and monitoring registers. 
* JTAG (Joint Test Action Group) is a standard debugging interface that enables full debugging of AVR microcontrollers. JTAG offers advanced features, including code execution tracking, setting breakpoints, and monitoring registers and memory. It is a more advanced tool than debugWIRE but requires more pins. * PDI (Program and Debug Interface) is a debugging interface used in some AVR microcontrollers, such as the AVR XMEGA family. PDI enables programming and debugging of the microcontroller using two pins. It is a compromise between the simplicity of debugWIRE and the advanced capabilities of JTAG. Debugging Tools * Microchip Studio: An integrated development environment (IDE) for AVR microcontrollers that supports debugging using debugWIRE, JTAG, and PDI. * AVR-GDB: A version of the classic GDB debugger adapted for AVR microcontrollers. It allows debugging code at the assembler and C/C++ level. * AVRDUDE: A tool for programming AVR microcontrollers that can be used in conjunction with hardware debuggers. Simulators * Simulators, such as SimulAVR, allow testing and debugging code without the need for physical hardware. Simulators are useful for verifying code functionality and identifying errors at an early stage. Practical Tips * Setting fuses: Before starting debugging, the microcontroller fuses should be properly set to enable the use of debugWIRE or JTAG. * Disconnecting external reset sources: In the case of debugWIRE, all external reset sources should be disconnected to avoid interference. ===== Energy Efficiency ===== There are five sleep modes to select from: * Idle Mode * Power-Down * Power-Save * Standby * Extended Standby To enter any of the sleep modes, the Sleep Enable bit in the Sleep Mode Control Register (SMCR.SE) must be written to '1' and a SLEEP instruction must be executed. The Sleep Mode Select bits (SMCR.SM[2:0]) select which sleep mode (Idle, Power-Down, Power-Save, Standby, or Extended Standby) will be activated by the SLEEP instruction.
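The SMCR bit arithmetic can be modelled on a host machine. This C sketch assumes the ATmega328-style register layout (SM[2:0] in bits 3:1, SE in bit 0) and only demonstrates composing the register value; on real hardware the write would target the actual SMCR I/O register before executing SLEEP.

<code c>
#include <stdint.h>
#include <stdio.h>

/* Host-side model of the SMCR layout (ATmega328-style assumed). */
#define SE_BIT        0x01   /* Sleep Enable, bit 0         */
#define SM_POWER_DOWN 0x2    /* SM[2:0] = '010' (Power-Down) */

int main(void) {
    uint8_t smcr = 0;

    /* Select Power-Down mode and set Sleep Enable, as done just
     * before executing the SLEEP instruction. */
    smcr = (uint8_t)((SM_POWER_DOWN << 1) | SE_BIT);
    printf("SMCR = 0x%02X\n", smcr);   /* 0x05 */

    /* Good practice: clear SE right after waking up. */
    smcr &= (uint8_t)~SE_BIT;
    printf("SMCR = 0x%02X\n", smcr);   /* 0x04 */
    return 0;
}
</code>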
The sleep options shown below are those of the ATmega328PB.

{{:en:multiasm:piot:sleep_modes.png?400|}}

**Idle Mode**
When the SM[2:0] bits are set to '000', the SLEEP instruction puts the MCU into Idle mode, stopping the CPU but allowing peripherals like SPI, USART, Analog Comparator, 2-wire Serial Interface, Timer/Counters, Watchdog, and interrupts to continue operating. This mode halts the CPU and Flash clocks but keeps other clocks running. The MCU can wake up from both external and internal interrupts.

**Power-Down Mode**
When the SM[2:0] bits are set to '010', the SLEEP instruction puts the MCU into Power-Down mode, stopping the external oscillator. Only external interrupts, the 2-wire Serial Interface address watch, and the Watchdog (if enabled) can wake the MCU. This mode halts all generated clocks, allowing only asynchronous modules to operate.

**Power-Save Mode**
When the SM[2:0] bits are set to '011', the SLEEP instruction puts the MCU into Power-Save mode, similar to Power-Down but with Timer/Counter2 running if enabled. The device can wake up from Timer Overflow or Output Compare events from Timer/Counter2.

**Standby Mode**
When the SM[2:0] bits are set to '110' and an external clock option is selected, the SLEEP instruction puts the MCU into Standby mode, similar to Power-Down but with the oscillator running. The device wakes up in six clock cycles.

**Extended Standby Mode**
When the SM[2:0] bits are set to '111' and an external clock option is selected, the SLEEP instruction puts the MCU into Extended Standby mode, similar to Power-Save but with the oscillator running. The device wakes up in six clock cycles.

AVR® 8-bit microcontrollers include several sleep modes to save power. The AVR device can also lower power consumption by shutting down the clock to selected peripherals via a register setting. That register is called the Power Reduction Register (PRR).
{{:en:multiasm:piot:sleep_modes_details.png?400|}}

The PRR provides a runtime method to stop the clock of selected individual peripherals. The current state of the peripheral is frozen, and its I/O registers cannot be read or written. Resources used by the peripheral remain committed while its clock is stopped; hence, the peripheral should, in most cases, be disabled before stopping the clock. Waking up a module, which is done by clearing its bit in PRR, puts the module into the same state as before the shutdown. PRR clock shutdown can be used in Idle mode and Active mode to significantly reduce the overall power consumption. In all other sleep modes, the clock is already stopped.

**Tips to Minimize Power Consumption**
* Analog-to-Digital Converter (ADC): disable the ADC before entering sleep modes to save power.
* Analog Comparator: disable the Analog Comparator in Idle and ADC Noise Reduction modes if not used.
* Brown-Out Detector (BOD): turn off the BOD if not needed, as it consumes power in all sleep modes.
* Internal Voltage Reference: disable it if not needed by the ADC, Analog Comparator, or BOD.
* Watchdog Timer: turn off the Watchdog Timer if not needed, as it consumes power in all sleep modes.
* Port Pins: configure port pins to minimize power usage, ensuring no pins drive resistive loads.
* On-chip Debug System: disable the on-chip debug system if not needed, as it consumes power in sleep modes.

====== Programming in Assembler for Mobiles and ARM ======

Let's take a look at the mobile devices market. Mobile phones, tablets, and other devices are built on the ARM processor architecture. For example, take the Snapdragon SoC (System-on-Chip) designed by Qualcomm – this chip integrates a CPU based on the ARM architecture. Of course, such a chip may also have a graphics processing unit (GPU) and even a digital signal processor (DSP) for faster signal processing. Similarly, Apple A18 processors are based on the ARM architecture.
The only difference between all these mobile devices is the ARM version on which the processor is designed. ARM stands for Advanced RISC Machine. Today, ARM has multiple processor architecture series, including Cortex-M, Cortex-R, Cortex-A, and Cortex-X, as well as other series. Cortex-X series processors are built for performance, to be used in smartphones and laptops. The series currently supports up to 14 cores, but this value changes over time. The Cortex-A series is made for devices designed to execute complex computation tasks. These processors are made to run workloads power-efficiently, increasing battery life. Similarly to the Cortex-X series, these processors may also have up to 14 cores. The Cortex-R series is made for real-time operations, where reaction to events is crucial. These are made specifically to be used in time-sensitive and safety-critical environments. The architecture itself is very similar to the Cortex-A processors, with some exceptions that will not be discussed here. The Cortex-M series is for microcontrollers, where low power consumption combined with sufficient computational power is required. Many wearable, IoT, and embedded devices contain ARM microcontrollers. Mobile devices tend to use Cortex-A series processors; to learn this architecture, we will use the Raspberry Pi. For the Cortex-M series, we chose an ST-manufactured microcontroller, the STM32H743VIT6, to work with the assembler programming language. The microcontroller was chosen for a reason – its processor core is the ARMv7-M-based Cortex-M7, one of the most powerful processor architectures among microcontrollers.
===== Overview of ARM and Mobile Device Architecture =====
Now you may realize that there is more than one assembly language type. The most widely used assembly language types are ARM, MIPS, and x86.
New architectures will come, and they will replace the old ones, just because they may have reduced power consumption or silicon die size. Another push for new CPU architectures is malware that uses architectural features against users to steal their data. Attacks such as “Spectre” and “Meltdown” exploit vulnerabilities in modern CPU designs. The exploited mechanisms were designed as CPU features – speculative execution and branch prediction. So, this means that in the future, there will be new architectures or newer versions of them, and alongside them, new malware.
===== ARM Assembly Language Specifics =====
CODE FORMAT and IMAGES MISSING
Before starting programming, it's good to know how to add comments to our newly created code, as well as the syntax specifics. This mainly depends on the selected compiler. All code examples provided will be created for the GNU ARM assembler ([[https://sourceware.org/binutils/docs-2.26/as/|GNU as documentation]]). The symbol ‘@’ starts a comment that extends to the end of the line. Some other assembly languages use ‘;’ for comments, but in ARM assembly language, the ‘;’ character separates statements placed on one line. The suggestion is to avoid using the ‘;’ character; we will not place multiple statements on a single line. Multiline comments are created in the same way as in the C/C++ programming languages, using the ‘/*’ and ‘*/’ character combinations. Now, let's compare the Cortex-M and Cortex-A series. The ARMv8-A processor series supports three instruction sets: A32, T32, and A64. A32 and T32 are the Arm and Thumb instruction sets, respectively. These instruction sets are used to execute instructions in the AArch32 Execution state. The A64 instruction set is used when executing in the AArch64 Execution state; note that there are no 64-bit wide instructions – the name refers to the execution state, not to the size of the instructions in memory.
ARM Cortex-M processors based on ARMv7 can operate with two instruction types, Thumb and ARM. The Thumb instructions are 16 bits wide and have some restrictions, like limited access to the registers. Many microcontrollers are currently based on ARMv7 processors, but new ones based on ARMv8 and ARMv8.1 are already available. The ARMv8-M series processors work only with Thumb instructions; there is no ARM state to execute 32-bit ARM instructions, so these processors do not support the A32 instruction set. The feature was removed, likely because there was little need for the ARM state – most compilers target the Thumb instruction set anyway. Looking at the ARMv8-A processors, the instruction sets differ in some respects. The AArch32 execution state can execute programs designed for ARMv7 processors. This means many A32 or T32 instructions on ARMv8 are identical to the ARMv7 ARM or THUMB instructions, respectively. On ARMv7 processors, the THUMB instruction set has some restrictions, like reduced access to the general-purpose registers, whereas the ARM instruction set can access all the registers. Similar access restrictions exist on ARMv8 processors. Let's look at the machine code produced from the assembler instructions to determine why a particular instruction set has some restrictions on an ARMv8 processor. We will take just one instruction, one of the most common, the MOV instruction. More than one machine code exists for the MOV instruction because of the different addressing modes. We will take just one variant, which copies a value stored in one register to another.
**A64**: The machine code for the MOV instruction is given in the figure above. The bit values presented are fixed for proper operation identification. The ‘sf’ bit identifies the data encoding variant: 32-bit (sf=0) or 64-bit (sf=1). [ISA_A64 p.646] The ‘Rm’ and ‘Rd’ bit fields identify the exact register number from which the data will be copied and the destination register, respectively.
Each is 5 bits wide, so all 32 CPU registers can be addressed. The ‘sf’ bit only identifies the number of bits to be copied between registers: 32 or 64 bits of data. The ‘opc’ bit field identifies the operation variant (the addressing mode for this instruction), and the ‘N’ bit is mainly used for bitwise shift operations; for others, like this one, the bit has no meaning. There are instructions where the ‘N’ bit identifies some instruction options, but then it is used together with the ‘imm6’ bits.
**A32**: The figure above shows the same instruction, but the register address is smaller than in the A64 instruction. The bit fields ‘Rd’ and ‘Rm’ are 4 bits wide, so only 16 CPU registers can be addressed using this instruction in A32. Other bit fields are present as well: ‘cond’ is used for conditional instruction execution, the ‘S’ bit identifies whether the status register must be updated, the ‘stype’ bit field gives the type of shift to be applied to the second source register, and finally, the ‘imm5’ bit field identifies the shift amount, 0 to 31.
**T32**: The Thumb instruction set has multiple machine codes for this one operation. In the T1 THUMB encoding, the D bit and the Rd field together identify the destination register. The source and destination registers can now be addressed through only four bits: only 16 general-purpose registers are accessible. Even fewer registers can be accessed in the following machine code. The OP bit field identifies the shift type, and imm5 identifies its amount: the Rm register value is shifted by imm5 bits and the result is stored in Rd. Notice that only three bits are used to address the general-purpose registers – only eight registers are accessible. Finally, the last machine code for this instruction is a 16-bit wide instruction, but still, the Rd and Rm fields are four bits wide.
This instruction has more shift operations available than the previous one, but instead of one imm5 field identifying the shift amount, the field is divided into two parts: imm3 and imm2. Both parts are combined for the same purpose: to identify the shift amount. The different machine codes for T32 instructions give the ability to choose the most suitable one, but the code must stay consistent with one machine code type. Switching between machine code types in the processor is still possible, but compiling code that mixes machine code types would be even more complicated than learning assembler. Summarising these instruction sets: A32 was best suited for ARMv7 processors, with only 16 general-purpose registers available. ARMv8 provides 31 addressable general-purpose registers, prompting ARM to introduce the A64 instruction set, where 32 registers can be addressed. This is why the A64 instruction set is used in the following sections.
**Instruction options**
All assembly language types use similar mnemonics for arithmetic operations (some may require additional suffixes to identify certain options for the instruction). A32 assembly instructions have specific suffixes to make commands execute conditionally; the four most significant bits of many instructions give this ability. Unfortunately, there is no such option for A64, but there are special conditional instructions that we are going to describe in the following subsection. We looked at a straightforward instruction and its exact machine code in the previous section. Examining the machine codes of each instruction is a perfect way to learn all the available options and all the restrictions. To help understand and read the instruction set documentation, there will be another example, for the ADD instruction in the A64 instruction set. The ADD instruction: let's first look at the assembler instruction, which adds two registers together and stores the result in a third register.
ADD X0, X1, X2 @ X0 = X1 + X2
We need to look at the instruction set documentation to determine the possible options for this instruction. In the documentation, we can find three main variants of the ADD instruction. In addition, for data manipulation instructions, the ‘S’ suffix can be added to update the status flags in the processor Status Register.
1. The ADD and ADDS instructions with extended registers:
ADD X3, X4, W5, UXTW @ X3 = X4 + W5
ADDS X3, X4, W5, UXTW @ X3 = X4 + W5
The machine-code fields of the assembler instruction ADDS X3, X4, W5, UXTW map as follows:
Rd = X3 @ points to the register where the result will be stored
Rn = X4 @ points to the first of the provided operands
Rm = W5 @ points to the second of the provided operands, which will be extended to 64 bits
We already know that the ‘sf’ bit identifies the length of the data (32 or 64 bits). The main difference between these two instructions is in the ‘S’ bit; the same letter appears in the name of the instruction. The ‘S’ bit tells the processor to update the status bits after instruction execution. These status bits are crucial for conditions. The 30th ‘op’ bit and the ‘opt’ bits are fixed and not used for this instruction. The three option bits (13th to 15th) extend the operation. These bits are used to extend the second source (Rm) operand. This is handy when the source operands differ in length, for example when the first operand is 16 bits wide and the second is 8 bits wide. The second register must be extended to maintain the data alignment. With three option bits, there are 8 different ways to extend the second source operand. The table below explains all these options; let's look at the options only, as the exact bit values are irrelevant for learning the assembler.
* UXTB/SXTB – unsigned/signed byte (8-bit) is extended to word (32-bit)
* UXTH/SXTH – unsigned/signed halfword (16-bit) is extended to word (32-bit)
* UXTW/SXTW – unsigned/signed word (32-bit) is extended to double word (64-bit)
* UXTX/SXTX or LSL – unsigned/signed double word (64-bit) is extended to double word (64-bit); there is no real use for such an extension with the unsigned data type.
For the UXTX option, the LSL shift is preferred if the ‘imm3’ bits are set from 0 to 4. Other ranges are reserved and unavailable because the result can be unpredictable. Moreover, this shift is only available if the ‘Rd’ or the ‘Rn’ operand equals ‘11111’, which is the stack pointer (SP). In all other cases, the UXTX extension will be used. In conclusion, this instruction type is handy when the operands are of different lengths, but that's not all. The shift applied to the second operand allows us to multiply it by 2, 4, 8 or 16, but it works only if the destination register is 64 bits wide (the Xn registers are used). The shift amount is restricted to at most 4, even though the ‘imm3’ field could encode larger values. Also, SXTB/H/W/X are used when the second operand can store negative integers.
ADDS X3, X4, W5, SXTW #2 /* sign-extend the W5 register to 64 bits and then shift it
left by 2 (LSL), which makes a multiplication by 4;
add the multiplied value to X4 and store the result in the X3 register:
X3 = X4 + (W5*4) */
ADD X3, X4, W5, UXTB #1 /* take the lowest byte of W5 (W5[7:0]),
zero-extend it to 64 bits, shift left by 1 (multiply by 2),
add to X4 and store in X3 */
ADD X7, X8, W9, SXTH /* take W9[15:0], sign-extend to 64 bits without shifting */
/* add to X8 and store in X7 */
/* X7 = X8 + W9[15:0] */
2. The ADDS (ADD) instructions with an immediate value: in the machine code, it is possible to identify the maximum value that can be added to the register. The ‘imm12’ bits restrict the value to the range 0 to 4095.
Besides that, the ‘sh’ bit allows shifting the immediate value left (LSL) by 12 bits. Examples:
ADD W0, W1, #100 @ basic 32-bit ADD
@ adds 100 to W1, stores in W0; no shift is performed
ADD X0, X1, #4095 @ basic 64-bit ADD
@ adds 4095 to X1, stores in X0
ADD X2, X3, #1, LSL #12 @ 64-bit ADD with shifted immediate (LSL #12)
@ adds 4096 to X3 (1 << 12 = 4096)
@ stores the result in X2
ADD X4, SP, #256 @ using SP as the base register
@ adds 256 to SP; useful for frame setup or stack management
ADD SP, SP, #64 @ writing the result to SP (stack pointer arithmetic)
@ adds 64 to SP and writes the result back to SP
ADD W5, W6, #2, LSL #12 @ 32-bit ADD with shifted immediate
@ adds 8192 to W6 and stores the result in W5 (2 << 12 = 8192)
ADDS X7, X8, #42 @ ADDS (immediate) – flag-setting
@ adds 42 to X8, stores the result in X7 and updates the condition flags (NZCV)
ADDS X9, X10, #3, LSL #12 @ ADDS with shifted immediate
@ adds 12288 to X10, stores the result in X9 (3 << 12 = 12288)
@ updates the condition flags
ADDS X11, SP, #512 @ ADDS with SP as the base
@ adds 512 to SP, stores the result in X11 and updates the condition flags
3. The ADDS (ADD) instruction with a shifted register: the final ADD instruction type adds two registers together, where one of them can be shifted; the shift can be chosen between LSL (Logical Shift Left), LSR (Logical Shift Right) and ASR (Arithmetic Shift Right). The fourth shift option is not available. The value of the ‘imm6’ field identifies the number of bits by which the ‘Rm’ register is shifted before it is added to the ‘Rn’ register. Similar options are available for other arithmetic instructions like SUB, as well as for other instructions like LDR. The instruction set documentation gives the necessary information to determine the possibilities of the instructions and the restrictions on their usage.
===== Registers =====
The CPU registers are the closest place to the processor to store data. However, not all registers are meant to store data.
There might be more specialised registers to configure the processor or its additional modules. Microcontrollers have integrated peripherals to perform communication tasks, convert analogue voltage signals into digital values and vice versa, and much more. Each of them has dedicated registers to store configuration and data. These registers are not located near the CPU but somewhere in the memory map, primarily in a specified address region. The information about these peripheral module registers can be found in the reference manuals and/or programmers' manuals. Some microcontrollers keep all this information in their datasheets. This section will focus on the CPU and the registers closest to the ARMv8 CPU.
==== CPU registers ====
AArch64 provides 31 general-purpose registers (R0..R30), each 64 bits wide. A 64-bit general-purpose register is named X0 to X30, and its 32-bit view is named W0 to W30, as in the picture above. The name ‘Rn’ is used to describe architectural registers. IMAGE Note that by accessing the W0 or W1 register, we cannot access the most significant 32 bits of the corresponding 64-bit register. Also, when a W register is written, the top 32 bits (the most significant bits of the 64-bit register) are zeroed. There are no registers actually named R0 or R1, so to access the full 64-bit result, we address it with X0 or X1 (and so on); similarly, W0, W1 and so on are the 32-bit names used to address the general-purpose registers. These examples perform single 32-bit arithmetic operations: IMAGE These examples perform single 64-bit arithmetic operations: IMAGE Special registers like the Stack Pointer (SP), Link Register (LR), and Program Counter (PC) are available. The Link Register (LR) is stored in the X30 register. The Stack Pointer (SP) and the Program Counter (PC) are no longer available as regular general-purpose registers.
Using the SP register with a limited set of data processing instructions is still possible through the WSP register name. Unlike in ARMv7, the PC register is no longer accessible through data processing instructions. The PC register can be read with the ‘ADR’ instruction, i.e. ‘ADR X17, .’ – the dot ‘.’ means “here”, and register X17 will be written with the exact PC value. Some branch instructions and some load/store operations implicitly use the value of the PC register. Note that the PC and SP are general-purpose registers in the A32 and T32 instruction sets, but this is not the case in the A64 instruction set. You may have noticed that A64 instructions can address 32 registers in total, while there are only 31 general-purpose registers. When reading register number 31, the result will be either the current stack pointer or the zero register, depending on the instruction. Writing to R31 will not have any effect. Register 31 is called the Zero Register. The Zero Registers, XZR and WZR (64 or 32 bits wide), always read as zero and ignore writes.
^ ^ General-purpose registers ^ Dedicated registers ^ Notes ^
^ Architectural name | R0, R1, R2, R3, .., R29, R30 | SP, ZR | These names are mainly used in documentation about the architecture and instruction sets |
^ 64-bit | X0, X1, X2, X3, .., X29, X30 | SP, XZR | The ‘X’ stands for extended word; all 64 bits are used |
^ 32-bit | W0, W1, W2, W3, .., W29, W30 | WSP, WZR | The ‘W’ stands for word; only the bottom (least significant) 32 bits are used |
IMAGE ARMv8 has an additional set of 32 registers for floating-point and vector operations, similar to the general-purpose registers. These registers are 128 bits wide and, like the general-purpose registers, can be accessed in several ways. The letters for these registers identify byte (Bx), half-word (Hx), single-word (Sx), double-word (Dx) and quad-word (Qx) access.
IMAGE Similar instructions are performed with these examples, which perform single 32-bit floating-point arithmetic operations: IMAGE (code example) These examples perform single 64-bit floating-point arithmetic operations, and the last instruction is meant for 128-bit floating-point multiplication: IMAGE (code example) These registers operate like the general-purpose registers: writing one byte to one of them will zero out the remaining, most significant bits. To preserve all the data in these registers, a different access method must be used: the registers can be treated as vectors using the Vx register names. This means a register is treated as though it contains multiple independent values instead of a single value. Operations with vector registers are a bit complex; to make them easier to understand, let's divide the instructions that operate on vector registers into several groups. One group is meant to load and store the values of vector registers, and another group performs arithmetic operations on vector registers. The first group works with loads: the load instructions (e.g., LD1, LD1R, LDAP1, LDAPUR) load data into a vector register by identifying the result register of the instruction as a vector register. There are several load instructions to perform this operation. --Subsection NOT FINISHED --
==== CPU Configuration ====
For the Raspberry Pi 5, the ARM Cortex-A76 processor has four CPU cores. Each core has its own stack pointers, status registers and other registers. Before looking at the CPU registers, some specifics must be explained. The core has several execution levels, named EL0, EL1, EL2 and EL3. In the datasheets, these execution levels are called Exception Levels – the levels at which the processor resources are managed. EL0 is the lowest level; all user applications are executed at this level.
EL1 is meant for operating systems, EL2 for a hypervisor application that controls the resources for the OSes at the lower Exception levels, and EL3 for the secure monitor. The CPU's general-purpose registers are independent of the Exception levels, but it is essential to understand at which Exception Level the code executes. IMAGE IMAGE We will look only at the AArch64 registers to narrow down the number of registers. There are a lot of registers dedicated to the CPU. Specialised registers will again be left aside to narrow the amount of information, and only those registers meant for program execution will be reviewed. As there is more than one core, each core must have a dedicated status register. All registers that store some status on AArch64 CPU cores are collected in the table below. Don't be confused by the number of listed status registers: the registers with a green background are the ones relevant to programming, because only those registers store the actual instruction execution status. TABLE
===== Data Types and Encoding =====
Many instructions have additional suffixes to identify the length of the data used. As one example of multiple data lengths, the load/store instructions are used in this section. TABLE
==== Big/little-endian ====
ARMv8.0 added a new processor feature relevant here. In this section, the main focus is on endianness: the feature named “FEAT_MixedEnd” of ARMv8 processors allows programmers to control the endianness of data accesses. This means that ARMv8 implements both little-endian and big-endian operation. For the Raspberry Pi 5 running in AArch64, the higher Exception levels control the endianness of the lower Exception levels. For example, code running at Exception Level EL1 can control the endianness of Exception Level EL0.
Note that if a Linux OS is already running on the Raspberry Pi, then the kernel's EL1 endianness should not be changed, because the OS runs at the EL1 layer, and no OS can switch its endianness at runtime. As both modes are supported in ARMv8, there might be a need to figure out what endianness settings apply to EL0. ID_AA64MMFR0_EL1 is an AArch64 Memory Model Feature register that holds information about endianness. In the register ID_AA64MMFR0_EL1, bits [11:8] indicate the endianness features for the whole CPU, and the BigEndEL0 bits indicate the support for mixed-endian at Exception Level EL0. If the value is nonzero, then the SCTLR_EL1.E0E register bit field is controlled through the higher Exception Levels. Assuming that our designed code will be executed at Exception Level EL0, then ... <-- {not finished}
===== Addressing Modes =====
ARMv8 has simple addressing modes based on load and store principles. ARM cores perform Arithmetic Logic Unit (ALU) operations only on registers. The only supported memory operations are the load (which reads data from memory into registers) and the store (which writes data from registers to memory). Using the LDR/STR instructions and their available options, all data must be loaded into the general-purpose registers before operations can be performed on it. Multiple addressing modes can be used for the load and store instructions: IMAGE
=== Register addressing ===
A register holds the address of the data in memory: the data is loaded from, or stored to, the memory location given in the register.
LDR R0, [R1] @ fill register R0 with the data located at the address stored in the R1 register
STR R1, [R2] @ store the content of register R1 into memory at the location given in the R2 register
=== Pre-indexed addressing ===
An offset to the base register is added before the memory access.
The address is calculated by adding two registers together: [R1, R2] results in R1+R2. The register R1 is the base register, and R2 holds the offset. The offset value can also be negative, but the final, calculated address must be a positive value.
LDR R0, [R1, R2] @ address pointed to by R1 + R2
STR R2, [R4, R3] @ address pointed to by R4 + R3
The LDR instruction machine code also allows a shift operation on the offset register while maintaining the same addressing mode.
LDR R0, [R1, R2, LSL #2] @ address is R1 + (R2*4)
=== Pre-indexed addressing with write back ===
ARM treats this as a separate addressing mode because the register that points to the address in memory is modified as well. The offset is added to the base register before the access is performed.
LDR R0, [R1, #32]! @ read R0 from the address pointed to by R1+32, then R1 = R1 + 32
The exclamation mark after the closing bracket indicates that the value of register R1 must be increased by 32; after that, the memory access is performed.
=== Post-index with write back ===
Like the previous addressing mode, this one is also documented separately. As the name says, ‘post’ means that the register value is increased after the memory access is performed.
LDR R0, [R1], #32 @ read R0 from the address pointed to by R1, then R1 = R1 + 32
=== Other addressing modes ===
There are some other addressing modes available. Some are used just for function or procedure calls, others combine the previously described addressing modes, and so on. This subsection introduces some of the other available addressing modes.
* Register to register, also called the register direct addressing mode, is used to copy data from one register to another. No memory access is performed with such operations.
* Literal addressing, alternatively the immediate addressing mode, is used to encode the data directly in the instruction.
* PC-relative addressing
=== Complex addressing modes ===
This is not a real addressing mode, but some instructions give the possibility of making the addressing a bit more complex.
===== Basic Instructions and Operations =====
===== Procedures and Functions Call Standards =====
===== Compatibility for Compilers and OSes =====
===== Peripheral Management in RPi =====
===== Advanced Assembly Programming =====
===== Vectorisation =====
===== Assembly Code Optimisation =====
===== Energy Efficient Coding: Sleep Modes, Hibernate =====
====== Programming in Assembler for PC ======
===== Evolution of x86 Processors =====
The evolution of x86 processors began with the Intel 8086 microprocessor, introduced in 1978. It is worth noting that the first IBM machines, the IBM PC and IBM XT, used the cheaper version, the 8088, with an 8-bit data bus. The 8088 was software-compatible with the 8086, but its 8-bit external data bus allowed for a reduction in the cost of the whole computer. IBM later used the 8086 processors in the personal computers named PS/2. Its successor, the 80286, was used in the IBM AT personal computer, the extended version of the XT, and in the IBM PS/2 286 machine; the AT became the standard architecture for a variety of personal computers designed by many vendors. This started the history of personal computers based on the x86 family of processors. In this section, we will briefly describe the most important features of, and differences between, the models of x86 processors.
==== 8086 ====
The 8086 is a 16-bit processor, which means it uses 16-bit registers. With the use of segmentation, operating in the so-called real addressing mode with a 20-bit address bus, it can access up to 1 MB of memory. The base clock frequency of this model is 5 - 10 MHz. It implements a three-stage instruction pipeline (loosely pipelined), which allows the execution of up to three instructions at the same time.
In parallel with the processor, Intel designed a whole set of supporting integrated circuits, which made it possible to build a complete computer. One of these chips is the 8087 - a math coprocessor, known now as the Floating Point Unit.

==== 80186 ====
This model includes additional hardware units and is designed to reduce the number of integrated circuits required to build the computer. Its clock generator operates at 6 - 20 MHz. The 80186 implements a few additional instructions. It is considered a faster version of the 8086.

==== 80286 ====
The 80286 has an extended, 24-bit address bus and can theoretically use 16 MB of memory. In this model, Intel tried to introduce protected memory management with a theoretical address space of 1 GB. Due to problems with compatibility and efficiency of execution of software written for the 8086, the extended features were used by niche operating systems only, with IBM OS/2 as the most recognised.

==== 80386 ====
It is the first 32-bit processor designed by Intel for personal computers. It implements protected and virtual memory management, which was successfully used in operating systems, including Microsoft Windows. Architectural elements of the processor (registers, buses) are extended to 32 bits, which allows addressing up to 4 GB of memory. The 80386 extends the instruction pipeline with three stages, allowing for the execution of up to six instructions simultaneously. Although no internal cache was implemented, this chip enables connecting to external cache memory. Intel uses the IA-32 name for this architecture and instruction set. The original name 80386 was later replaced with i386.

==== i486 ====
It is an improved version of the i386 processor. Intel combined in one chip the main CPU, the FPU (except in the i486SX version) and a memory controller, including 8 or 16 kB of cache memory. The cache is 4-way associative, common for instructions and data.
It is a tightly pipelined processor which implements a five-stage instruction pipeline where every stage operates in one clock cycle. The clock frequency is between 16 - 100 MHz.

Tightly pipelined means that all stages of the pipeline perform their duties within the same time period. Loosely pipelined implies that some kind of buffer is used between pipeline stages to decouple the units and allow them to work more independently.

Set-associative is the method of placing memory regions in the cache. It represents a compromise between the complexity of the cache controller design and flexibility. In general, more "ways" (e.g., 4-way over 2-way) means more flexibility at the cost of complexity.

==== Pentium ====
Pentium is the successor of the i486 processor. It is still a 32-bit processor but implements a dual integer pipeline, which makes it the first superscalar processor in the x86 family. It can operate with clock frequencies ranging from 60 to 300 MHz. The improved microarchitecture also includes separate caches for instructions and data, and a branch prediction unit, which helps to reduce the cost of pipeline invalidation for conditional jump instructions. The cache memory is 8 kB for instructions and 8 kB for data, both 2-way associative.

==== Pentium MMX ====
Pentium MMX (MultiMedia Extension) is the first processor which implements SIMD instructions. It uses the FPU's physical registers for 64-bit MMX vector operations. The clock speed is 120 - 233 MHz. Intel also decided to improve the cache memory compared to the Pentium: both the instruction and data caches are twice as big (16 kB), and they are 4-way associative.

SIMD stands for Single Instruction Multiple Data. Such instructions allow for performing the same operation on more than one data element in a single instruction. Intel introduced such instructions in the Pentium MMX and continued them in further processors as the SSE and AVX instructions, naming them vector instructions.
==== Pentium Pro ====
Pentium Pro implements a new architecture (P6) with many innovative units, organised in a 14-stage pipeline, which enhances the overall performance. The advanced instruction decoder generates micro-operations - RISC-like translations of the x86 instructions. It can produce up to two micro-operations representing simple x86 instructions and up to six micro-operations from the microcode sequencer, which stores microcodes for complex x86 instructions. Micro-operations are stored in a buffer, called the reorder buffer or instruction pool, and are assigned by the reservation station to a chosen execution unit. In the Pentium Pro, there are six execution units. An important feature of the P6 architecture is that instructions can be executed in an out-of-order manner, with flexibility of physical register use known as register renaming. All these techniques allow for executing more than one instruction per clock cycle. Additionally, Pentium Pro implements a new, extended paging unit, which allows addressing 64 GB of memory. The instruction and data L1 caches are 8 kB in size each. The physical chip also includes the L2 cache assembled as a separate silicon die. This made the processor too expensive for the consumer market, so it was mainly implemented in servers and supercomputers.

==== Pentium II ====
Pentium II is a processor based on the experience gathered by Intel in the development of the previous Pentium Pro processor and the MMX extension to the Pentium. It combines the P6 architecture with SIMD instructions, operating at a maximum of 450 MHz. The L1 cache size is increased to 32 kB (16 kB data + 16 kB instructions). Intel decided to exclude the L2 cache from the processor's enclosure and assemble it as a separate chip on a single PCB. As a result, the Pentium II has the form of a PCB module, not an integrated circuit as previous models.
Although offering slightly worse performance, this approach made it much cheaper than the Pentium Pro, and the implementation of multimedia instructions made it more attractive for the consumer computer market.

==== Pentium III ====
Pentium III is very similar to the Pentium II. The main enhancement is the addition of the Streaming SIMD Extensions (SSE) instruction set to accelerate SIMD floating-point calculations. Due to improvements in the production process, it was also possible to increase the clock frequency to the range of 400 MHz to 1.4 GHz.

==== Pentium 4 ====
Pentium 4 is the last 32-bit processor developed by Intel. Some late models also implement the 64-bit enhancement. It is based on the NetBurst architecture, which was developed as an improvement to the P6 architecture. An important modification is the movement of the instruction cache from the input to the output of the instruction decoder. As a result, the cache, named the trace cache, stores micro-operations instead of instructions. To increase the market impact, Intel decided to increase the number of pipeline stages, using the term "hyperpipelining" to describe the strategy of creating a very deep pipeline. A deep pipeline could lead to higher clock speeds, and Intel used it to build its marketing strategy. The Pentium 4's pipeline in the initial model is significantly deeper than that of its predecessors, having 20 stages. The Pentium 4 Prescott even has a pipeline of 31 stages. The operating frequency ranges from 1.3 GHz to 3.8 GHz. Intel also implemented Hyper-Threading technology in the Pentium 4 HT version to enable two virtual (logical) cores in one physical processor, which share the workload between them when possible. With the Pentium 4, Intel returned to a single-chip package for both the processor core and the L2 cache. Pentium 4 extends the instruction set with SSE2 instructions, and Pentium 4 Prescott with SSE3.
The NetBurst architecture suffered from high heat emission, causing problems with heat dissipation and cooling.

==== AMD Opteron ====
Opteron is the first processor that supported the 64-bit instruction set architecture, known at the beginning as AMD64 and in general as x86-64. Its versions include the initial one-core and later multi-core processors. The AMD Opteron implements the x86, MMX, SSE and SSE2 instructions known from Intel processors, and also the AMD multimedia extension known as 3DNow!.

==== Pentium D ====
Pentium D is a multicore 64-bit processor based on the NetBurst architecture known from the Pentium 4. Each unit implements two processor cores.

==== Core Processors ====
* Pentium Dual-Core. After facing problems with heat dissipation in processors based on the NetBurst microarchitecture, Intel designed the Core microarchitecture, derived from P6. One of its first implementations is the Pentium Dual-Core. After some time, Intel changed the name of this processor line back to Pentium to avoid confusion with the Core and Core 2 processors. There is a vast range of Core processor models with different cache sizes and numbers of cores, offering lower or higher performance. From the perspective of this book, we can think of them as modern, advanced and efficient 64-bit processors, implementing all instructions which we consider. There are many internet sources where additional information can be found. One of them is the Intel website((https://www.intel.com/content/www/us/en/products/details/processors.html)), and another commonly used is Wikipedia((https://en.wikipedia.org/wiki/X86)).
* Core. All Intel Core processors are based on the Core microarchitecture. Intel uses different naming schemas for these processors. Initially, the names represented the number of physical processor cores in one chip; Core Duo has two physical cores, while Core Quad has four.
* Core 2. An improved version of the Core microarchitecture. The naming schema is similar to the Core processors.
* Core i3, i5, i7, i9. Intel changed the naming. Since then, there has been no strict information about the number of cores inside the chip; the name rather represents the overall processor's performance.
* Core X and Core Ultra X. The newest naming (at the time of writing this book, introduced in 2023) drops the "i" letter and introduces the "Ultra" versions of high-performance processors. There exist Core 3 - Core 9 processors as well as Core Ultra 3 - Core Ultra 9.

==== The most important processors ====
^ Model ^ Year ^ Class ^ Address bus ^ Max memory ^ Clock freq. ^ Architecture ^
| 8086 | 1978 | 16-bit | 20 bits | 1 MB | 5 - 10 MHz | x86-16 |
| 80186 | 1982 | 16-bit | 20 bits | 1 MB | 6 - 20 MHz | x86-16 |
| 80286 | 1982 | 16-bit | 24 bits | 16 MB | 4 - 25 MHz | x86-16 |
| 80386 | 1985 | 32-bit | 32 bits | 4 GB | 12.5 - 40 MHz | IA-32 |
| i486 | 1989 | 32-bit | 32 bits | 4 GB | 16 - 100 MHz | IA-32 |
| Pentium | 1993 | 32-bit | 32 bits | 4 GB | 60 - 200 MHz | IA-32 |
| Pentium MMX | 1996 | 32-bit | 32 bits | 4 GB | 120 - 233 MHz | IA-32 |
| Pentium Pro | 1995 | 32-bit | 36 bits | 64 GB | 150 - 200 MHz | IA-32 |
| Pentium II | 1997 | 32-bit | 36 bits | 64 GB | 233 - 450 MHz | IA-32 |
| Pentium III | 1999 | 32-bit | 36 bits | 64 GB | 400 MHz - 1.4 GHz | IA-32 |
| Pentium 4 | 2000 | 32-bit | 36 bits | 64 GB | 1.3 GHz - 3.8 GHz | IA-32 |
| AMD Opteron | 2003 | 64-bit | 40 bits | 1 TB | 1.4 GHz - 3.5 GHz | x86-64 |
| Pentium 4 Prescott | 2004 | 64-bit* | 36 bits | 64 GB | 1.3 GHz - 3.8 GHz | IA-32, Intel64* |
| Pentium D | 2005 | 64-bit | | | 2.66 GHz - 3.73 GHz | Intel64 |
| Pentium Dual-Core | 2007 | 64-bit | | | 1.3 GHz - 3.4 GHz | Intel64 |
| Core 2 | 2006 | 64-bit | 36 bits | 64 GB | 1.06 GHz - 3.5 GHz | Intel64 |
* in some models

===== Specific Elements of the x86 and x64 Architectures =====
The x86 architecture was created in the late 1970s.
The technology available at that time didn't allow for implementing advanced integrated circuits containing millions of transistors in one silicon die. The 8086, the first processor in the whole x86 family, required additional supporting elements to operate. This forced compromises between efficiency, computational capabilities, silicon size and cost. This is the reason why Intel invented some specific elements of the architecture. One of them is the segmentation mechanism, which extends the addressing space from the 64 kB typically available to 16-bit processors to 1 MB. In this chapter, we present some specific features of the x86 and x64 architectures.

==== Segmented addressing in real mode ====
The 8086 can address the memory in so-called real mode only. In this mode, the address is calculated from two 16-bit elements: segment and offset. The 8086 implements four special registers to store the segment part of the address: CS, DS, ES, and SS. During program execution, all addresses are calculated relative to one of these registers. The program is divided into three segments containing its main elements. The code segment contains processor instructions and their immediate operands; instruction addresses are relative to the CS register. The data segment is related to the DS register. It contains data allocated by the program. The stack segment contains the program stack and is related to the SS register. If needed, it is possible to use an extra segment, related to the ES register. By default, it is used by string instructions.
{{ :en:multiasm:cs:Real_segments.png?600 |Illustration of assignment of segments and segment registers in real mode}} Segments and segment registers in real mode
Although the 8086 processor has only four segment registers, there can be many segments defined in the program. The limitation is that the processor can access only four of them at the same time, as presented in Fig {{ref>realsegments}}. To access other segments, it must change the content of a segment register. The address, which consists of two elements, the segment and the offset, is named a logical address. Both numbers which form a logical address are 16-bit values. So, how to calculate a 20-bit address with two 16-bit values? It is done in the following way. The segment part, taken always from the chosen segment register, is shifted four bit positions to the left. The four bits on the right side are filled with zeros, forming a 20-bit value. The offset value is added to the result of the shift. The result of the calculation is named the linear address. It is presented in Fig {{ref>realcalc}}. In the 8086 processor, the linear address equals the physical address, which is provided via the address bus to the memory of the computer.
{{ :en:multiasm:cs:real_calculation.png?600 |Illustration of address calculation in real mode}} Address calculation in real mode
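The segment:offset calculation described above can be sketched in C (an illustrative model, not production code; the function name is ours):

```c
#include <stdint.h>

/* Real-mode linear address: the 16-bit segment is shifted left by 4
   (appending four zero bits), then the 16-bit offset is added.
   The result is a 20-bit value. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset) {
    return ((uint32_t)segment << 4) + offset;
}
```

Note that different segment:offset pairs can name the same linear address: 0x1000:0x0000 and 0x0FFF:0x0010 both yield 0x10000.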
The segment in the memory is called a physical segment. The maximum size of a single physical segment can't exceed 64 kB (65536 B), and it can start only at an address evenly divisible by 16. In the program, we define logical segments, which can be smaller than 64 kB, can overlap, or can even start at the same address.

==== Segmented addressing in protected mode ====
With the introduction of processors capable of addressing larger memory spaces, the real addressing mode was replaced with descriptor-based addressing. We will briefly describe this mechanism, taking the 80386 as an example. The 80386 processor is a 32-bit machine built according to the IA-32 architecture. Using 32 bits, it is possible to address 4 GB of linear address space. Segmentation in 32-bit processors can be used to implement protection mechanisms. They prevent access to segments created by other processes, ensuring that processes can operate simultaneously without interfering. Every segment is described by its descriptor. The descriptor holds important information about the segment, including the starting address, limit (size of the segment), and attributes. As it is possible to define many segments at the same time, descriptors are stored in memory in descriptor tables. IA-32 processors contain two additional segment registers, FS and GS, which can be used to access two extra data segments (Fig {{ref>ia32segments}}), but all segment registers are still 16 bits in size.
{{ :en:multiasm:cs:IA32_segments.png?600 |Illustration of assignment of segments and segment registers in protected mode}} Segments and segment registers in protected mode
The segment register in protected mode holds the segment selector, which is an index into the table of descriptors. The table is stored in memory, created and managed by the operating system. Each segment register has an additional part that is hidden from direct access. To speed up operation, the descriptor is loaded from the table into the hidden part automatically each time the segment register is modified. It is schematically presented in Fig {{ref>ia32addr}}.
{{ :en:multiasm:cs:IA32_addressing.png?600 |Illustration of addressing with descriptors in protected mode}} Address calculation in protected mode
Although segmentation allows for advanced memory management and the implementation of memory protection, none of the popular operating systems, including Windows, Linux or macOS, ever used it. In Windows, all segment registers, via descriptors, point to the zero base address and the maximal limit, resulting in the flat memory model. In this approach, segmentation de facto does not function. Memory protection is implemented at the paging level. The flat memory model is shown in Fig {{ref>ia32flat}}.
{{ :en:multiasm:cs:IA32_flat.png?600 |Illustration of flat memory addressing mode}} Flat memory addressing
==== Paging ====
Paging is a mechanism for translating linear addresses to physical addresses. Operating systems use it extensively to implement memory protection and virtual memory mechanisms. The virtual memory mechanism allows the operating system to use the entire address space with less physical memory installed in the computer. Paging is supported by the memory management unit using a structure of tables stored in memory. In 32-bit mode, the tables are organised in a two-level structure, with a single page directory table and a set of page tables. The pages can be 4 kB or 4 MB in size. The 1-bit information about the page size (named PS) is stored in the page directory entry.

* 4 kB pages

Each entry in the page directory holds a pointer to a page table, and every page table stores pointers to pages. The linear address is divided into three parts. The first (highest) part is a 10-bit index of the entry in the page directory table, the middle 10 bits form an index of the entry in the page table, and finally, the last (least significant) 12 bits are the offset within the page. This mechanism is shown in Fig {{ref>paging324kB}}.
{{ :en:multiasm:cs:paging_32_4kB.png?600 |Illustration of paging in 32-bit mode with 4kB pages}} Paging in 32-bit mode with 4kB pages
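The three-part split of a 32-bit linear address can be illustrated in C with simple shifts and masks (a sketch; the helper names are ours):

```c
#include <stdint.h>

/* 32-bit paging with 4 kB pages:
   bits 31..22 - index into the page directory (10 bits),
   bits 21..12 - index into the page table     (10 bits),
   bits 11..0  - offset within the 4 kB page   (12 bits). */
uint32_t pde_index(uint32_t linear)   { return (linear >> 22) & 0x3FF; }
uint32_t pte_index(uint32_t linear)   { return (linear >> 12) & 0x3FF; }
uint32_t page_offset(uint32_t linear) { return linear & 0xFFF; }
```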
* 4 MB pages

For pages of 4 MB in size, the page table level is omitted. The page directory holds the direct pointer to the page in memory. In this situation, the entry in the page directory is indexed with 10 bits, and the offset within the page is 22 bits long. It is shown in Fig {{ref>paging324MB}}.
{{ :en:multiasm:cs:paging_32_4MB.png?600 |Illustration of paging in 32-bit mode with 4MB pages}} Paging in 32-bit mode with 4MB pages
In 64-bit mode, the tables are organised in a four- or five-level structure. In the 4-level structure, the highest level is a single page map level 4 table; next, there are page directory pointer tables, page directory tables, and page tables. The linear address is divided into six parts: the sign-extended upper bits, four 9-bit table indexes, and the 12-bit offset within the page. Each table is indexed with 9 bits and can store up to 512 entries. It is shown in Fig {{ref>paging644kB}}. The pages can be 4 kB, 2 MB, or 1 GB in size. For 2 MB pages, the page table level is omitted, while for 1 GB pages, the page directory level is additionally omitted.
{{ :en:multiasm:cs:paging_64_4kB.png?600 |Illustration of paging in 64-bit mode with 4kB pages}} Paging in 64-bit mode with 4kB pages
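Under the same conventions, the 4-level split can be sketched in C; each table level consumes 9 bits above the 12-bit page offset (the helper names are ours, with level 3 standing for the page map level 4 table):

```c
#include <stdint.h>

/* 64-bit 4-level paging with 4 kB pages: the offset uses bits 11..0,
   and each table level above it is indexed with 9 bits.
   level 0 = page table, 1 = page directory,
   2 = page directory pointer table, 3 = page map level 4 table. */
uint64_t pt_index(uint64_t linear, int level) {
    return (linear >> (12 + 9 * level)) & 0x1FF;
}
uint64_t page_offset64(uint64_t linear) { return linear & 0xFFF; }
```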
As currently built computers have significantly less physical memory than can theoretically be addressed with a 64-bit linear address, producers decided to limit the usable address space. That is why the 4-level paging mechanism recognises only 48 bits, leaving the upper bits unused; in 5-level paging, 57 bits are recognised. The most significant, unused part of the address must replicate the value of the highest recognised bit of the address. As the most significant bit of a number represents its sign, duplicating this bit is known as sign extension. Sign-extended addresses with 48 or 57 recognised bits are known as canonical addresses, while others are non-canonical. It is presented in Fig {{ref>canonical}}. In current machines, only canonical addresses are valid.
{{ :en:multiasm:cs:canonical.png?600 |Illustration of canonical addresses in 64-bit mode}} Canonical addresses in 64-bit mode
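A canonical-address check for the 48-bit case can be written in C as a sign-extension test (a sketch under the stated 48-bit assumption; the function name is ours, and the signed right shift relies on the arithmetic shift used by common compilers):

```c
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical (48 recognised bits) when bits 63..47 all
   equal bit 47, i.e. the value is unchanged by sign-extending its
   low 48 bits. */
bool is_canonical48(uint64_t addr) {
    int64_t sext = (int64_t)(addr << 16) >> 16; /* sign-extend bit 47 */
    return (uint64_t)sext == addr;
}
```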
==== Addressing in x64 processors ====
Because operating system producers did not use segmentation, AMD and Intel decided to abandon it in 64-bit processors. For backwards compatibility, modern processors can run in 32-bit mode with segmentation, but the newest versions of operating systems use 64-bit addressing, named long mode and referred to as x64. The only possible addressing is a flat memory model, with segment registers and descriptors set to the zero address as the beginning of the linear memory space. In 64-bit mode, some instructions can use the segment registers FS and GS as base registers; operating systems' kernels also use them. The 64-bit mode theoretically allows the processor to address a vast memory of 16 exabytes. As it is not expected that such a big memory will be installed in currently built computers, the processors limit the available address space, not only at the paging level but also physically, having a limited number of address bus lines.

===== Register Set =====

==== x86 registers ====
The 8086 processor has seven general-purpose registers for data storage, data manipulation and indirect addressing. Four registers: AX, BX, CX and DX can be accessed in two 8-bit halves. In the register name, the lower half is denoted by the letter "L" and the upper half by the letter "H" in place of "X". General-purpose registers are presented in Fig {{ref>gpregistresx86}}.
{{ :en:multiasm:cs:GP_registers_x86.png?600 |Illustration of general purpose registers in x86 16-bit processor}} General purpose registers in 16-bit 8086 processor
They have some special functions, as listed below.
* AX - Accumulator, used for general data manipulation; some instructions work only with the accumulator.
* BX - Base register, used for indirect addressing in base addressing mode.
* CX - Counter register, used for iteration counting in loops and repeated instructions.
* DX - Data register, used as an extension of the accumulator and for I/O address space addressing.
* SI - Source index, used for indirect index addressing of source data tables or strings.
* DI - Destination index, used for indirect index addressing of destination data tables or strings.
* BP - Base pointer, often used for accessing data on the stack without the need to modify the stack pointer.

Some registers have special purposes.
* SP - Stack pointer, used for stack manipulation; automatically handles storing and retrieving the return address of procedures during calls and returns.
* IP - Instruction pointer, points to the next instruction to be executed; sometimes called the program counter.

These registers are presented in Fig {{ref>spregistresx86}}.
{{ :en:multiasm:cs:special_registers_x86.png?600 |Illustration of stack pointer and instruction pointer registers in x86 16-bit processor}} Stack pointer and instruction pointer in 16-bit 8086 processor
Every processor has a register containing bits that inform the software about the state of the ALU, i.e., the result of the last arithmetic or logic operation, and that control the behaviour of the CPU. In x86, it is called the Flags register; it is shown in Fig {{ref>flagsx86}}.
{{ :en:multiasm:cs:flags_register_x86.png?500 |Illustration of the flags register in x86 16-bit processor}} Flags register in 16-bit 8086 processor
The meaning of the informational flags is as follows:
* CF - Carry flag, set if an operation generates a carry or borrow; for example, if the result of an addition is larger than 16 bits. Used for unsigned arguments.
* PF - Parity flag, set if the result of an operation contains an even number of bits equal to "1".
* AF - Auxiliary Carry flag, similar to CF but informs about the carry or borrow from the 3rd to the 4th bit. It is used for BCD calculations.
* ZF - Zero flag, set if the result of an operation is zero.
* SF - Sign flag, a copy of the most significant bit of the result.
* OF - Overflow flag, set if the result of an operation is too large or too small to fit in the destination operand. Used for signed arguments.

Control flags allow modification of the processor's behaviour:
* TF - Trap flag, used for debugging to execute a program one instruction at a time.
* IF - Interrupt flag, if set, interrupts are enabled.
* DF - Direction flag, determines the direction of string operations. If cleared, string operations work from lower to higher addresses; if set, they operate in the opposite direction.

The 8086, being a 16-bit processor, can address memory in real mode only (please refer to section "Segmented addressing in real mode"). It has four segment registers:
* CS - Code segment register, used together with IP to access instructions in the code segment.
* DS - Data segment register, used to access the default data segment.
* ES - Extra segment register, used to access an additional data segment.
* SS - Stack segment register, used together with SP or BP to access elements on the stack.

Segment registers are shown in Fig {{ref>segmentregsx86}}.
{{ :en:multiasm:cs:segment_registers_x86.png?200 |Illustration of the segment registers in x86 16-bit processor}} Segment registers in 16-bit 8086 processor
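The informational flags for a 16-bit addition can be modelled in C (an illustrative sketch of the definitions above, not how the hardware is implemented; the function names are ours):

```c
#include <stdint.h>

/* Flags produced by a 16-bit addition, following the definitions above. */
int add16_cf(uint16_t a, uint16_t b) {          /* unsigned carry out  */
    return (uint16_t)(a + b) < a;
}
int add16_zf(uint16_t a, uint16_t b) {          /* result is zero      */
    return (uint16_t)(a + b) == 0;
}
int add16_sf(uint16_t a, uint16_t b) {          /* copy of bit 15      */
    return ((uint16_t)(a + b) >> 15) & 1;
}
int add16_of(uint16_t a, uint16_t b) {          /* signed overflow: same-sign  */
    uint16_t r = (uint16_t)(a + b);             /* operands, different-sign    */
    return ((uint16_t)(~(a ^ b) & (a ^ r)) >> 15) & 1; /* result          */
}
```

For example, 0x7FFF + 1 sets OF and SF but not CF, while 0xFFFF + 1 sets CF and ZF but not OF.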
==== IA-32 registers ====
Starting from the 80386 processor, the general-purpose registers, stack pointer, instruction pointer, and flags register were extended to 32 bits. Intel added an extra 16 bits to each of them, leaving the possibility to access the lower half as in previous models. For example, the accumulator can be accessed as the 32-bit EAX, the 16-bit AX, and the 8-bit AH and AL. The resulting register set is shown in Fig {{ref>GPregsIA32}}.
{{ :en:multiasm:cs:GP_registers_IA32.png?600 |Illustration of the register set in IA-32 32-bit processor}} Register set in 32-bit IA-32 processors
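The aliasing of EAX, AX, AH and AL can be demonstrated in C with a union (a sketch that assumes a little-endian host, which matches x86 itself; the type name is ours):

```c
#include <stdint.h>

/* Model of the EAX register and its sub-registers. On a little-endian
   host, the first byte of the union is the least significant byte,
   so AL overlays the low byte of AX and AH the next one. */
typedef union {
    uint32_t eax;                   /* full 32-bit register     */
    uint16_t ax;                    /* low 16 bits              */
    struct { uint8_t al, ah; } b;   /* low and high bytes of AX */
} Reg32;
```

Writing 0x12345678 to the `eax` member makes `ax` read 0x5678, `al` 0x78 and `ah` 0x56 on such a host.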
Similarly, the IP and flags registers were extended to 32-bit size (fig.{{ref>eflagsIA32}}). The additional bits in the Eflags register mainly handle the virtual memory mechanism and memory protection. They are used primarily by the operating system software, so we will not focus on the details here.
{{ :en:multiasm:cs:eflags_register_IA32.png?600 |Illustration of the Eflags register in IA-32 32-bit processor}} Eflags register in 32-bit IA-32 processors
In 32-bit machines, two additional segment registers were added: FS and GS (fig.{{ref>segregsIA32}}). It is worth noticing that all segment registers are still 16 bits long, but there are hidden parts, not available directly to the programmer, used for storage of descriptors (refer to the section "Segmented addressing in protected mode").
{{ :en:multiasm:cs:segment_registers_IA32.png?200 |Illustration of segment registers in IA-32 32-bit processor}} Segment registers in 32-bit IA-32 processors
==== x64 registers ==== The 64-bit extension to the IA-32 architecture was proposed by AMD. Registers were extended to 64 bits, and additionally, eight new general-purpose registers were added. They obtained the names R8 - R15. All registers (including already known registers) can be accessed as 64-bit, 32-bit lower-order half, 16-bit lower-order quarter (word) and 8-bit lower-order byte. Registers are presented in fig.{{ref>GPregsx64}} and in fig.{{ref>newregsx64}}. The Eflags register is extended to 64 bits and named Rflags. The upper 32 bits of the Rflags are reserved.
{{ :en:multiasm:cs:GP_registers_x64.png?600 |Illustration of the register set in x64 64-bit processor}} Register set in 64-bit x64 processors
{{ :en:multiasm:cs:new_registers_x64.png?600 |Illustration of new registers in x64 64-bit processor}} New registers in 64-bit x64 processors
==== FPU registers ==== The floating point unit, sometimes referred to as x87, works using eight 80-bit data registers, as presented in figure {{ref>fpudataregs}}. They are organised as a stack, so data items are pushed into the top register. Calculations are done using the top of the stack and another register, usually the next one. Registers are named ST(0) to ST(7), where ST(0) is the top of the stack; register ST(7) is the bottom. The ST name is another way to refer to ST(0). The registers contain the sign bit, exponent part and significand part, encoded according to the IEEE754 standard. Please refer to the section about data encoding and to the section about the FPU description for details of floating-point instructions.
{{ :en:multiasm:cs:fpu_data_registers.png?600 |Illustration of the register set in FPU coprocessor}} Register set in x87 Floating Point Unit coprocessor
There are also control and status registers in the Floating Point Unit as presented in figure {{ref>fpucontrolregs}}.
{{ :en:multiasm:cs:fpu_control_registers.png?600 |Illustration of the control registers FPU coprocessor}} Control registers in x87 Floating Point Unit coprocessor
The FPU instruction pointer, data pointer and last opcode registers are used for cooperation with the main processor. The Tag register (figure {{ref>fputagreg}}) holds 2-bit information about the state of each data register. It helps to manage the stack organisation of the data registers.
{{ :en:multiasm:cs:fpu_tag_register.png?400 |Illustration of the TAG register in FPU coprocessor}} TAG register in x87 Floating Point Unit coprocessor
The Status Word register contains information about the FPU's current state, including the top of the stack, exception flags, and condition codes (figure {{ref>fpustatereg}}). The latter inform about the result of the last operation, similarly to the flags in the Rflags register.
{{ :en:multiasm:cs:fpu_sw_register.png?500 |Illustration of the Status Word register in FPU coprocessor}} Status Word register in x87 Floating Point Unit coprocessor
The Control Word register (figure {{ref>fpucontrolreg}}) makes it possible to enable or disable exceptions, and to control the precision of calculations and the rounding of their results. The infinity bit is not meaningful in new processors.
{{ :en:multiasm:cs:fpu_cw_register.png?500 |Illustration of the Control Word register in FPU coprocessor}} Control Word register in x87 Floating Point Unit coprocessor
==== MMX registers ====
MMX instructions are executed with the use of 64-bit packed data. Intel decided to map the MMX registers onto the existing FPU registers. The MMX unit has 8 registers named MM0 - MM7, as presented in figure {{ref>mmxregs}}. Because they are physically the same as the FPU registers, it is not possible to freely mix MMX and FPU instructions.
{{ :en:multiasm:cs:mmx_registers.png?600 |Illustration of the MMX registers}} MMX registers
==== XMM registers ==== SSE instructions are executed with the use of 128-bit packed data. To perform calculations, new 128-bit registers were introduced. The SSE unit has 8 registers named XMM0 - XMM7, as presented in figure {{ref>xmmregs}}. Because these registers are separate from all previous ones, there are no conflicts between SSE and other instructions.
{{ :en:multiasm:cs:xmm_registers.png?500 |Illustration of the 128-bit XMM registers}} 128-bit XMM registers
To control the behaviour and to provide information about the status of the SSE unit, the MXCSR register is implemented (figure {{ref>mxcsrreg}}). It is a 32-bit register whose bits have functions similar to those of the FPU's Control Word and Status Word registers concatenated together.
{{ :en:multiasm:cs:mxcsr_register.png?500 |Illustration of the MXCSR SSE control register}} MXCSR control register for SSE unit
==== YMM registers ==== The YMM registers are the 256-bit extension of the XMM registers. They were introduced to implement the AVX instruction set. Additionally, in 64-bit processors, the number of available YMM registers was increased to 16 (figure {{ref>ymmregs}}).
{{ :en:multiasm:cs:ymm_registers.png?600 |Illustration of the 256-bit YMM registers}} 256-bit YMM registers
==== ZMM registers ==== ZMM registers are the further extension of the XMM registers to 512 bits. In 64-bit machines, 32 such registers are available (figure {{ref>zmmregs}}).
{{ :en:multiasm:cs:zmm_registers.png?800 |Illustration of the 512-bit ZMM registers}} 512-bit ZMM registers
The XMM registers are physically the lower halves of the YMM registers, which in turn are the lower halves of the ZMM registers. Anything written to an XMM register therefore appears in the lower part of the corresponding YMM and ZMM registers, as presented in figure {{ref>xyzmmregs}}.
{{ :en:multiasm:cs:xyzmm_register.png?800 |Illustration of the relation between XMM, YMM and ZMM registers}} The relation between XMM, YMM and ZMM registers
Together with the AVX-512 extension and the ZMM registers, eight opmask registers were introduced to x64 processors (figure {{ref>opmaskregs}}). They can be used, e.g., to provide conditional execution or to mask selected elements of vectors in AVX-512 instructions.
{{ :en:multiasm:cs:opmask_registers.png?400 |Illustration of the opmask registers}} Opmask registers
==== Additional registers ==== In the x64 architecture, there are additional registers used for control purposes. The CR0, CR2, CR3, CR4, and CR8 registers control the operating mode of the processor, the virtual and protected memory mechanisms, and paging. The debug registers, named DR0 through DR7, control the debugging of the processor's operations and software. The memory type range registers (MTRRs) can be used, for example, to mark the address space used for memory-mapped I/O as non-cacheable. Different processors have different sets of model-specific registers (MSRs). There are also machine check registers (MCRs). These registers are used to control and report on processor performance and to detect and report hardware errors. Most of the described registers are not accessible to an application program and are controlled by the operating system, so they will not be described in this book. For more details, please refer to the Intel documentation ((https://www.intel.com/content/www/us/en/content-details/851056/intel-64-and-ia-32-architectures-software-developer-s-manual-volume-1-basic-architecture.html?wapkw=253665)).
===== Data Types and Encoding =====
==== Fundamental data types ==== Fundamental data types cover the types of data elements that are stored in memory as 8-bit (Byte), 16-bit (Word), 32-bit (Doubleword), 64-bit (Quadword) or 128-bit (Double quadword) entities, as shown in figure {{ref>fundamentaldata}}. Many instructions allow processing data of these types without special interpretation; it is up to the programmer to interpret the inputs and results of instruction execution. The byte order in data types larger than a single byte is little-endian: the lower (least significant) byte is stored at the lowest address of the data, and this address represents the address of the data as a whole.
{{ :en:multiasm:cs:fundamental_data_types.png?700 |Illustration of fundamental data types}} Fundamental data types
The data types which can be used in x64 architecture processors can be integers or floating-point numbers. Integers are processed by the main CPU as single values (scalars). They can also be packed into vectors and processed with specific SIMD instructions, including MMX and, partially, SSE and AVX instructions. Integers can be interpreted as unsigned or signed. They can also represent a pointer to a variable or an address within the code. Scalar real numbers are processed by the FPU. SSE and AVX instructions support calculations with scalars or vectors composed of real values. All possible variants of data types are stored within the fundamental data types. Further in this chapter, we describe all possible scalar and packed integer and floating-point values.
==== Integers ==== Integers are numbers without a fractional part. In the x64 architecture, it is possible to define data of various sizes, all of them based on bytes. A single byte forms the smallest possible information item stored in memory; even if not all bits are effectively used, the smallest element which can be stored is a byte. Two bytes form a word, so in the x64 architecture the Word data type means 16 bits. Two words form the Doubleword data type (32 bits), and four words form the Quadword data type (64 bits). With the use of the large registers in modern processors, a few instructions can also use the Double Quadword data type, containing 128 bits (sometimes called Octal Word).
==== Integer scalar data types ==== Integer data types can be one, two, four or eight bytes in length. Unsigned integers are encoded in natural binary code. It means that the minimum value is 0 (zero), and the maximum value is formed with "1" bits at all available positions. The x64 architecture supports the unsigned integer data types shown in table {{ref>uinttypes}}.
^ Name ^ Number of bits ^ Minimum value ^ Maximum value ^ Minimum value (hex) ^ Maximum value (hex) ^ | Byte | 8 | 0 | 255 | 0x00 | 0xFF | | Word | 16 | 0 | 65535 | 0x0000 | 0xFFFF | | Doubleword | 32 | 0 | 4294967295 | 0x00000000 | 0xFFFFFFFF | | Quadword | 64 | 0 | 18446744073709551615 | 0x0000000000000000 | 0xFFFFFFFFFFFFFFFF |
Unsigned integer data types
Signed integers are encoded in 2's complement binary code. The highest bit of the value is the sign bit: if it is zero, the number is non-negative; if it is one, the number is negative. It means that the minimum value is encoded with the highest bit equal to 1 and all other bits equal to zero. The maximum value is formed with a "0" bit at the highest position and "1" bits at all other positions. The x64 architecture supports the signed integer data types shown in table {{ref>sinttypes}}. ^ Name ^ Number of bits ^ Minimum value ^ Maximum value ^ Minimum value (hex) ^ Maximum value (hex) ^ | Signed Byte | 8 | -128 | 127 | 0x80 | 0x7F | | Signed Word | 16 | -32768 | 32767 | 0x8000 | 0x7FFF | | Signed Doubleword | 32 | -2147483648 | 2147483647 | 0x80000000 | 0x7FFFFFFF | | Signed Quadword | 64 | -9223372036854775808 | 9223372036854775807 | 0x8000000000000000 | 0x7FFFFFFFFFFFFFFF |
Signed integer data types
==== Integer vector data types ==== Vector data types were introduced with the SIMD instructions of the MMX extension and were continued in the SSE and AVX extensions. They form packed data types containing multiple elements of the same size. The elements can be considered signed or unsigned depending on the algorithm and the instructions used. The 64-bit packed integer data type contains eight Bytes, four Words or two Doublewords, as shown in figure {{ref>packedint64}}.
{{ :en:multiasm:cs:packed_integers_64.png?600 |Illustration of 64-bit packed integer data types}} 64-bit packed integer data types
The 128-bit packed integer data type contains sixteen Bytes, eight Words, four Doublewords or two Quadwords as shown in figure {{ref>packedint128}}.
{{ :en:multiasm:cs:packed_integers_128.png?750 |Illustration of 128-bit packed integer data types}} 128-bit packed integer data types
The 256-bit packed integer data type contains thirty-two Bytes, sixteen Words, eight Doublewords, four Quadwords or two Double Quadwords as shown in figure {{ref>packedint256}}.
{{ :en:multiasm:cs:packed_integers_256.png?800 |Illustration of 256-bit packed integer data types}} 256-bit packed integer data types
The 512-bit packed integer data type contains sixty-four Bytes, thirty-two Words, sixteen Doublewords, eight Quadwords or four Double Quadwords, as shown in figure {{ref>packedint512}}. Double Quadwords are not used as operands; they appear only as the results of some operations.
{{ :en:multiasm:cs:packed_integers_512.png?800 |Illustration of 512-bit packed integer data types}} 512-bit packed integer data types
==== Floating point values ==== Floating point values store data encoded for calculations on real numbers. Depending on the precision required by the algorithm, we can use different data sizes. Scalar data types are supported by the FPU (Floating Point Unit), offering single precision, double precision or double extended precision real numbers. In C/C++ compilers, they are referred to as the float, double and long double data types, respectively. Vector (packed) floating-point data types can be processed by many SSE and AVX instructions, offering fast vector, matrix or artificial intelligence calculations. Vector units can process the half precision, single precision and double precision formats. The 16-bit Brain Float format was introduced to improve the efficiency of dot product calculations in AI training and inference algorithms. Floating point data types are shown in figure {{ref>floattypes}} and described in table {{ref>tablefloattypes}}. The table shows the number of bits used. In reality, the effective mantissa is one bit longer, because the highest bit, representing the integer part, is always "1", so there is no need to store it (except for the Double extended data format, where the integer bit is present).
{{ :en:multiasm:cs:floating_types.png?700 |Illustration of floating point data types}} Floating point data types in x64 architecture
^ Name ^ Bits ^ Mantissa bits ^ Exponent bits ^ Min value ^ Max value ^ | Double extended | 80 | 64 | 15 | {{:en:multiasm:cs:float_ep_min.png?125}} | {{:en:multiasm:cs:float_ep_max.png?125}} | | Double precision | 64 | 52 | 11 | {{:en:multiasm:cs:float_dp_min.png?120}} | {{:en:multiasm:cs:float_dp_max.png?115}} | | Single precision | 32 | 23 | 8 | {{:en:multiasm:cs:float_sp_min.png?120}} | {{:en:multiasm:cs:float_sp_max.png?110}} | | Half precision | 16 | 10 | 5 | {{:en:multiasm:cs:float_hp_min.png?110}} | {{:en:multiasm:cs:float_hp_max.png?100}} | | Brain Float | 16 | 7 | 8 | {{:en:multiasm:cs:float_bf_min.png?120}} | {{:en:multiasm:cs:float_bf_max.png?110}} |
Floating point data types
==== Floating point vector data types ==== Floating point vectors are formed with single or double precision packed data formats. They are processed by SSE or AVX instructions in a SIMD fashion. A 128-bit packed data format can store four single-precision or two double-precision elements. A 256-bit packed data format can store eight single-precision or four double-precision values. A 512-bit packed data format can store sixteen single-precision or eight double-precision values. These packed data types are shown in figure {{ref>packedfloattypes}}. Instructions operating on 16-bit half-precision values or Brain Floats can process twice as many elements simultaneously in comparison to single-precision data. It is worth mentioning that some instructions operate on a single floating-point value, using only the lowest elements of the operands.
{{ :en:multiasm:cs:packed_floats.png?800 |Illustration of packed floating point data types}} Packed floating point data types in x64 architecture
==== Bit field data type ==== A bit field is a data type whose size is given by the number of bits it occupies. A bit field can start at any bit position within a fundamental data type and can be up to 32 bits long. MASM supports it with the RECORD data type. The bit field type is shown in figure {{ref>bitfieldtype}}.
{{ :en:multiasm:cs:bit_field_type.png?300 |Illustration of bit field data type}} The bit field data type
==== Pointers ==== Pointers store the address of a memory location containing data of interest. They can point to data or to an instruction. If segmentation is enabled, pointers can be near or far. A far pointer contains the logical address (formed from the segment and offset parts); a near pointer contains the offset only. The offset can be 16, 32 or 64 bits long. The segment selector is always stored as a 16-bit number. The possible pointer types are shown in figure {{ref>pointertypes}}.
{{ :en:multiasm:cs:pointers_types.png?600 |Illustration of near and far pointers types}} The near and far pointers types
The offset is often the result of complex addressing mode calculations and is then called an effective address.
===== Addressing Modes in Instructions ===== The addressing mode specifies how the processor reaches the data in memory. The x86 architecture implements immediate, direct and indirect memory addressing. Indirect addressing can use one or two registers and a constant to calculate the final address; the addressing mode takes its name from the registers used. The 32-bit mode makes the choice of the register for addressing more flexible and enhances addressing with the possibility of scaling: multiplying one register by a small constant. In 64-bit mode, addressing relative to the instruction pointer was added for easy relocation of programs in memory. In this chapter, we will focus on the details of all addressing modes in 16, 32 and 64-bit processors. In each addressing mode, we use simple examples with the mov instruction, which copies data from the source operand to the destination operand. The order of the operands is similar to that of high-level languages: the left operand is the destination, the right operand is the source, as in the following example:
  mov destination, source
==== Immediate addressing ==== The immediate argument is a constant encoded as part of the instruction. This means that the value is stored in the code section of the program and can't be modified during program execution. In 16-bit mode, the size of the constant can be 8 or 16 bits; in 32- and 64-bit mode, it can be up to 32 bits. The use of the immediate operand depends on the instruction. It can be, for example, a numerical constant, an offset of an address or an additional displacement used for effective address calculation, a value which specifies the number of iterations in shift instructions, a mask for choosing the elements of a vector to operate on, or a modifier which influences the instruction behaviour.
Please refer to the specific instruction documentation for a detailed description. Examples of instructions using immediate addressing:
  mov ax, 10   ; move the constant value of 10 to the accumulator AX
  mov ebx, 35  ; move the constant 35 to the EBX register
  mov rcx, 5   ; move the constant 5 to the RCX register
==== Direct addressing ==== In direct addressing mode, the data is reached by specifying the target offset (displacement) as a constant. The processor uses this offset together with the appropriate segment register to access the byte in memory. The displacement can be specified in the instruction as a number, a previously defined constant or a constant expression. With segmentation enabled, it is possible to use a segment prefix to select the segment register we want to use. Example instructions which use numbers or a variable name as the displacement are shown in the following code and presented in figure {{ref>directx86}}.
  ; copy one byte from the BL register to memory address 0800h in the data segment
  mov ds:[0800h], bl
  ; copy two bytes from memory at address 0600h to the AX accumulator
  mov ax, ds:[0600h]
  ; copy eight bytes from the variable in memory (previously defined) to RAX
  mov rax, variable
{{ :en:multiasm:cs:addressing_direct_x86.png?600 |Illustration of direct addressing mode in x86 processor}} Direct addressing mode in x86 architecture
==== x64 RIP-relative direct addressing ==== 64-bit processors have some specific addressing modes. The default mode for direct addressing mode in the x64 architecture is addressing relative to the RIP register. If there is no base or index register in the instruction, the address is encoded as a 32-bit signed value. This value represents the distance between the data byte in memory and the address of the next instruction (current value of RIP). In figure {{ref>directRIPrelativex64}}, the RIP relative addressing where the variable is moved to AL is shown.
{{ :en:multiasm:cs:direct_RIP_relative_x64.png?600 |Illustration of direct RIP-relative addressing mode in x64 processor}} Direct RIP-relative addressing mode in x64 architecture
==== x64 32-bit direct addressing mode ==== In this mode, the instruction holds a 32-bit signed value, which is sign-extended to 64 bits by the processor. This limits the addressing space to two areas: the first region starts at address 0 and covers 2 GB of memory; the second is the 2 GB at the very end of the entire address space. As 2 GB of memory is insufficient for modern operating systems, this addressing mode is not supported by Windows compilers. MASM can use this mode in conjunction with a base or index register.
==== x64 64-bit direct addressing ==== In this mode, the address is a 64-bit unsigned value. As instruction arguments are, in general, at most 32 bits long, this addressing mode can be used only with a specific version of the MOV instruction and only with the accumulator (AL, AX, EAX, RAX). The MASM assembler does not use this mode.
==== Indirect addressing ==== In the x86 architecture, it is possible to use one or two registers in one instruction to calculate the effective address, which is the final offset within the current segment (or section). When two registers are used, one of them is called the base register and the second one the index register. In 16-bit processors, the base registers can be BX and BP only, while the index registers can be SI or DI. If BP is used, the processor automatically chooses the stack segment by default. For BX used as the base register, or for instructions with an index register only, the processor accesses the data segment by default. The 32-bit architecture makes the choice of registers much more flexible, and any of the eight registers (including the stack pointer) can be used as the base register. Here, the stack segment is chosen if the base register is EBP or ESP. The index register can be any of the general-purpose registers, excluding the stack pointer. Additionally, the index register can be scaled by a factor of 1, 2, 4 or 8.
In the 64-bit architecture, which introduces eight additional registers, any of the sixteen general-purpose registers can be the base or index register (excluding the stack pointer, which can be the base only). In the following sections, we will show examples of all possible addressing modes.
==== Base addressing ==== Base addressing mode uses the base register only: the base register is not scaled, no index register is used, and no additional displacement is added. In figure {{ref>basex86}}, the use of BX as the base register is shown to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_base_x86.png?600 |Illustration of indirect base addressing mode in x86 processor}} Indirect base addressing mode in x86 architecture
The code below shows other examples of base addressing.
  ; copy one byte from the data segment in memory at the address from the BX register to AL
  mov al, [bx]
  ; copy two bytes from the data segment in memory at the address from the EBX register to AX
  mov ax, [ebx]
  ; copy eight bytes from memory at the address from RBX to RAX (no segmentation in x64)
  mov rax, [rbx]
==== Base addressing with displacement ==== Base addressing mode with displacement uses the base register with an additional constant added, so the final effective address is the sum of the content of the base register and the constant. It can be interpreted as follows: the base register holds the address of a data table, and the constant is the offset of a byte within the table. In figure {{ref>basedispx86}}, the use of BX as the base register with an additional displacement is shown to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_base_disp_x86.png?600 |Illustration of indirect base addressing mode with displacement in x86 processor}} Indirect base addressing mode with displacement in x86 architecture
Some code examples are shown below; number represents a previously defined constant. MASM accepts various syntaxes of the base plus displacement address notation.
  ; copy one byte from the data segment in memory at the address from the BX register
  ; increased by "number" to AL
  mov al, [bx] + number
  mov al, [bx + number]
  mov al, [bx] number
  mov al, number [bx]
  mov al, number + [bx]
  mov al, [number + bx]
==== Index addressing with displacement ==== Index addressing mode with displacement uses the index register with an additional constant added, so the final effective address is the sum of the content of the index register and the constant. It can be interpreted as follows: the constant is the address of a data table, and the index register is the offset of a byte within the table. In figure {{ref>indexdispx86}}, the use of DI as the index register with an additional displacement is shown to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_index_disp_x86.png?600 |Illustration of indirect index addressing mode with displacement in x86 processor}} Indirect index addressing mode with displacement in x86 architecture
Some code examples are shown below; table represents a previously defined table in memory. MASM accepts various syntaxes of the index plus displacement address notation.
  ; copy one byte from the data segment in memory at the address calculated
  ; as the address of the table increased by the index from DI to AL
  mov al, [di] + table
  mov al, [di + table]
  mov al, [di] table
  mov al, table [di]
  mov al, table + [di]
  mov al, [table + di]
==== Base Indexed addressing ==== In this addressing mode, a combination of two registers is used. In a 16-bit processor, the base register is BX or BP, and the index register is SI or DI. In a 32- or 64-bit processor, any register can be the base, and all except the stack pointer can be the index. The final effective address is the sum of the contents of the two registers. In figure {{ref>baseindexx86}}, the use of BP as the base and DI as the index register to transfer data from memory to AL is shown.
{{ :en:multiasm:cs:addressing_base_index_x86.png?600 |Illustration of indirect base plus index addressing mode in x86 processor}} Indirect base plus index addressing mode in x86 architecture
The MASM assembler accepts different notations of the base + index register combination, as shown in the code below. In x86, the order of the registers written in the instruction is irrelevant.
  ; copy one byte from the data segment in memory at the address calculated
  ; as the sum of the base (BX) register and the index (SI) register to AL
  mov al, [bx] + [si]
  mov al, [bx + si]
  mov al, [bx] [si]
  mov al, [si] [bx]
In x86, there are only four possible combinations of base and index registers, as depicted below.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base register and index register to AL
  mov al, [bx] + [si] ; data segment
  mov al, [bx] + [di] ; data segment
  mov al, [bp] + [si] ; stack segment
  mov al, [bp] + [di] ; stack segment
In 32- or 64-bit processors, the first register used in the instruction is the base register, and the second is the index register. While segmentation is enabled, the use of EBP or ESP as the base register determines the segment register choice. Notice that it is possible to use the same register as both base and index in one instruction.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base register and index register to AL
  mov al, [eax] + [esi] ; data segment
  mov al, [ebx] + [edi] ; data segment
  mov al, [ecx] + [esi] ; data segment
  mov al, [edx] + [edi] ; data segment
  mov al, [esi] + [esi] ; data segment
  mov al, [edi] + [edi] ; data segment
  mov al, [ebp] + [esi] ; stack segment
  mov al, [esp] + [edi] ; stack segment
==== Base Indexed addressing with displacement ==== In this addressing mode, a combination of two registers with an additional constant is used. In a 16-bit processor, the base register is BX or BP, and the index register is SI or DI. The constant can be encoded as an 8- or 16-bit value. In such a processor, this is the most complex mode available.
In a 32- or 64-bit processor, any register can be the base, and all except the stack pointer can be the index. The constant is up to a 32-bit signed value. The final effective address is the sum of the contents of two registers and the displacement. In figure {{ref>baseindexdispx86}}, the use of BX as the base and SI as the index register with displacement to transfer data from memory to AL is shown. It can be interpreted as the constant is the address of the table of structures, the base register holds the offset of the structure in a table, and the index register keeps the offset of an element within the structure.
{{ :en:multiasm:cs:addressing_base_index_disp_x86.png?600 |Illustration of indirect base plus index plus displacement addressing mode in x86 processor}} Indirect base plus index plus displacement addressing mode in x86 architecture
The MASM assembler accepts different notations of the base + index + displacement combination, as shown in the code below. In x86, the order of the registers written in the instruction is irrelevant.
  ; copy one byte from the data segment in memory at the address calculated
  ; as the sum of the base (BX) register, the index (SI) register and a displacement to AL
  mov al, [bx] + [si] + table
  mov al, [bx + si] + table
  mov al, [bx] [si] + table
  mov al, table [si] [bx]
  mov al, table [si] + [bx]
  mov al, table [si + bx]
In 32- or 64-bit processors, the first register used in the instruction is the base register, and the second is the index register. While segmentation is enabled, the use of EBP or ESP as the base register determines the segment register choice. The displacement can be placed at any position in the address argument expression. Some examples are shown below.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base, index and displacement (table) to AL
  mov al, [eax] + [esi] + table ; data segment
  mov al, table + [ebx] + [edi] ; data segment
  mov al, table [ecx] [esi] ; data segment
  mov al, [edx] [edi] + table ; data segment
  mov al, table + [ebp] + [esi] ; stack segment
  mov al, [esp] + [edi] + table ; stack segment
==== Index addressing with scaling ==== Index addressing mode with scaling uses the index register multiplied by a simple constant of 1, 2, 4 or 8. This addressing mode is available in 32- and 64-bit processors and can use any general-purpose register except the stack pointer. In figure {{ref>indexscaleIA32}}, the use of EBX as the index register with a scaling factor of 2 is shown to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_index_scale_IA32.png?600 |Illustration of indirect index addressing mode with scaling in IA32 processor}} Indirect index addressing mode with scaling in IA32 architecture
Because no base register is used in these instructions, if segmentation is enabled, the data segment is always chosen.
  ; copy one byte from the data segment in memory at the address calculated
  ; as the multiplication of the index register by a constant to AL
  mov al, [eax * 2] ; data segment
  mov al, [ebx * 4] ; data segment
  mov al, [ecx * 8] ; data segment
  mov al, [edx * 2] ; data segment
  mov al, [esi * 4] ; data segment
  mov al, [edi * 8] ; data segment
  mov al, [ebp * 2] ; data segment
==== Base Indexed addressing with scaling ==== Base indexed addressing mode with scaling uses the sum of the base register and the content of the index register multiplied by a simple constant of 1, 2, 4 or 8. This addressing mode is available in 32- and 64-bit processors and can use any general-purpose register as the base and any general-purpose register except the stack pointer as the index. Figure {{ref>baseindexscaleIA32}} presents the use of the EDI register as the base and EAX as the index register with a scaling factor of 4 to transfer data from memory to AL.
{{ :en:multiasm:cs:addressing_base_index_scale_IA32.png?600 |Illustration of indirect base index addressing mode with scaling in IA32 processor}} Indirect base index addressing mode with scaling in IA32 architecture
The scaled register is assumed to be the index; the other one is the base (even if it is not written first in the instruction). While segmentation is enabled, the use of EBP or ESP as the base register determines the segment register choice.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base register and the scaled index register to AL
  mov al, [eax] + [esi * 2] ; data segment
  mov al, [ebx] + [edi * 4] ; data segment
  mov al, [ecx] + [eax * 8] ; data segment
  mov al, [edx] + [ecx * 2] ; data segment
  mov al, [esi] + [edx * 4] ; data segment
  mov al, [edi] + [edi * 8] ; data segment
  mov al, [ebp] + [esi * 2] ; stack segment
  mov al, [esp] + [edi * 4] ; stack segment
==== Base Indexed addressing with displacement and scaling ==== Base indexed addressing mode with displacement and scaling uses the sum of the base register, the content of the index register multiplied by a simple constant of 1, 2, 4 or 8, and an additional constant. This addressing mode is available in 32- and 64-bit processors and can use any general-purpose register as the base and any general-purpose register except the stack pointer as the index. The displacement can be up to a 32-bit signed value, even in a 64-bit processor. Figure {{ref>baseindexdispscaleIA32}} presents the use of the EDI register as the base and EAX as the index register with a scaling factor of 4 to transfer data from memory to AL. The interpretation is similar to the base indexed addressing mode with displacement in the 16-bit x86 machine: the constant is the address of the beginning of a table of structures, the base register holds the offset of the structure in the table, and the index register keeps the scaled number of the element within the structure, pointing to the chosen byte.
{{ :en:multiasm:cs:addressing_base_index_disp_scale_IA32.png?600 |Illustration of indirect base index addressing mode with displacement and scaling in IA32 processor}} Indirect base index addressing mode with displacement and scaling in IA32 architecture
As in base indexed mode with scaling without displacement, the scaled register is assumed to be the index, and the other one is the base (even if it is not written first in the instruction). While segmentation is enabled, the use of EBP or ESP as the base register determines the segment register choice. The displacement can be placed at any position in the instruction.
  ; copy one byte from the data or stack segment in memory at the address calculated
  ; as the sum of the base, scaled index and displacement (table) to AL
  mov al, [eax] + [esi * 2] + table ; data segment
  mov al, table + [ebx] + [edi * 4] ; data segment
  mov al, table + [ebp] + [esi * 2] ; stack segment
  mov al, [esp] + [edi * 4] + table ; stack segment
==== Summary for indirect addressing ==== In 16-bit processors, the base registers can be BX and BP only. The first one is used to access the data segment; the second one automatically chooses the stack segment by default. The additional offset can be absent or can be encoded as an 8- or 16-bit signed value. The schematic of the x86 effective address calculation for indirect address generation is shown in figure {{ref>effectivex86}}.
{{ :en:multiasm:cs:effective_x86.png?600 |Illustration of possible combination of base and index registers in effective address calculation for indirect addressing mode in x86 processor}} Possible combination of registers in effective address calculation for indirect addressing mode in x86 architecture
The 32-bit architecture makes the choice of registers much more flexible, and any of the eight registers (including the stack pointer) can be used as the base register. Here, the stack segment is chosen if the base register is EBP or ESP. The index register can be any of the general-purpose registers except the stack pointer, and it can be scaled by a factor of 1, 2, 4 or 8. The additional displacement can be absent or encoded as an 8-, 16- or 32-bit signed value. In the 64-bit architecture, any of the sixteen general-purpose registers can be the base register. As in 32-bit processors, the index register cannot be the stack pointer and can be scaled by 1, 2, 4 or 8. The additional displacement can be absent or encoded as an 8-, 16- or 32-bit signed number. In figure {{ref>effectiveIA32}}, the schematic of the effective address calculation and the use of registers for indirect addressing in the IA32 architecture is shown.
{{ :en:multiasm:cs:effective_IA32.png?600 |Illustration of possible combination of base and index registers in indirect addressing mode in IA32 processor}} Possible combination of registers in indirect addressing mode in IA32 architecture
===== Principles of Instructions Encoding ===== ===== Instruction Set of x86 - Essentials ===== ===== Procedures, Functions and Calls in Windows and Linux ===== ===== Compatibility with HLL Compilers (C++, C#) and Operating Systems ===== ===== FPU ===== ===== MMX ===== ===== Macros ===== ===== Code Examples ===== ===== Energy Efficient Programming ===== ===== Interrupts ===== ===== Optimisation =====