The Raspberry Pi 5 uses a 64-bit Armv8.2-A CPU (Cortex-A76), so it supports the AArch64 instruction set. Any assembler that targets AArch64 can therefore be used.
The Raspberry Pi 5 OS configuration allows the operating system to start in 32-bit mode. To avoid unnecessary problems, the OS must run in 64-bit mode. This can be verified in the /boot/firmware/config.txt file, where the parameter “arm_64bit” must be set to 1 (64-bit mode). If this parameter is missing, it can be added at the end of the file.
There are two main assembler options: the GNU Assembler (GAS), usually installed together with GCC, and the LLVM assembler used by Clang. Clang itself was originally not meant to assemble code, but LLVM includes an integrated assembler that supports most LLVM targets, including ARM and ARM64. Both can be installed on Raspberry Pi OS:
sudo apt install build-essential
sudo apt install clang
sudo apt install llvm
Either way, the assembly code must be saved in a file with the extension “.S” or “.s” before it can be assembled into an executable binary file. To make it easier to navigate between programs, it is recommended to create a new folder for each code project. The folder will then contain the assembly source file, the object file, and the executable binary file.
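The 64-bit check described above can be sketched as a small shell function. This is a minimal sketch, not an official tool: the function name arm_64bit_setting is hypothetical, and it assumes a plain "arm_64bit=0" or "arm_64bit=1" line, ignoring commented-out lines and any conditional sections in config.txt.

```shell
# Hypothetical helper: report the arm_64bit setting found in a config file.
# Prints "1", "0", or "unset". Lines starting with "#" are ignored.
arm_64bit_setting() {
    config_file="${1:-/boot/firmware/config.txt}"
    # Keep the value of the last uncommented arm_64bit=... line, if any.
    value=$(sed -n 's/^[[:space:]]*arm_64bit[[:space:]]*=[[:space:]]*\([01]\).*$/\1/p' "$config_file" | tail -n 1)
    if [ -n "$value" ]; then
        echo "$value"
    else
        echo "unset"
    fi
}
```

Calling arm_64bit_setting without arguments reads /boot/firmware/config.txt. On a running system, "uname -m" gives a second check: a 64-bit kernel reports aarch64.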
All examples are assembled with GAS (the GNU Assembler). Assuming that the assembly code already exists in a program.s file, it can be assembled with “as -o program.o program.s” and then linked with “ld -o program program.o”. The first step creates program.o, an object file that collects the machine code, text messages, constants, etc. in one single file. The linker then resolves the symbols in the object file and turns it into an executable binary.
Sometimes it is advantageous to build the code on another PC, because larger projects build faster there. For small projects, it may be easier to build directly on Raspberry Pi OS, which removes the need for a connection between the PC and the Raspberry Pi.
Here is a basic assembly program, saved as hello.s, that prints a text message:
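The two build steps, on the Raspberry Pi or on another PC, can be sketched as a helper that prints the commands to run. The function name asm_build_cmds and its optional tool prefix are hypothetical. The prefix aarch64-linux-gnu- is an assumption about how the cross tools are named on the PC (on Debian-based systems they typically come from the binutils-aarch64-linux-gnu package).

```shell
# Hypothetical helper: print the assemble and link commands for one source file.
# $1 = source file (e.g. hello.s), $2 = optional tool prefix for cross builds.
# Pipe the output to "sh" to actually run the commands.
asm_build_cmds() {
    src="$1"
    prefix="${2:-}"
    base="${src%.*}"    # strip the .s / .S extension
    echo "${prefix}as -o ${base}.o ${src}"
    echo "${prefix}ld -o ${base} ${base}.o"
}

asm_build_cmds program.s                       # native build on the Raspberry Pi
asm_build_cmds program.s aarch64-linux-gnu-    # cross build on a PC (assumed prefix)
```

Printing the commands instead of running them keeps the sketch safe to try on any machine. The resulting binary still has to be copied to the Raspberry Pi when it is built elsewhere.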
.global _start
_start:
MOV X0, #1 // file descriptor 1 = stdout
LDR X1, =msg // address of string
MOV X2, #14 // string length in bytes, including the trailing \n
MOV X8, #64 // write syscall
SVC #0
MOV X8, #93 // exit syscall
MOV X0, #0
SVC #0
msg: .ascii "Text Message!\n"
Build it by running “as -o hello.o hello.s” and then “ld -o hello hello.o” on the Linux command line. The program can then be executed with the “./hello” command, which should print “Text Message!” in the terminal. This example uses Linux system calls.
How the created code gets called depends on how it is built and how it runs, because this can be done at several different levels. At the first level, the code is written and executed as a user-space program under an OS such as Linux. This is the most common and at the same time the easiest way to create and execute assembly code. On Raspberry Pi OS, the assembly code is built into an ELF executable.
The commands “as -o program.o program.s” and “ld -o program program.o” create a Linux executable file. This file follows the ELF format (Executable and Linkable Format). The program can then be executed as an ordinary Linux program (i.e. ‘./program’). At this moment, the shell (bash) asks the Linux kernel to run it using the execve() system call. The kernel reads the ELF header of the program and loads the code and data into memory. The kernel also sets up the stack, places any arguments passed to the program on it, and gives the registers their initial values. The last step for the kernel is to set the program counter to the address of the _start symbol, the entry point of the code it has just loaded. At this point, the program begins executing. Much like an ordinary function, it can receive arguments and hand back a result, although it is entered without a return address.
In a pure assembly program, the label “_start” must be present. It marks the entry point where Linux starts executing the program. Let's take the previous example and break it down.
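The length passed in X2 has to match the number of bytes in the string. The message with its trailing newline is 14 bytes long, so a length of 13 prints the text without the line break. A minimal shell sketch for counting the bytes, where byte_len is a hypothetical helper name:

```shell
# Hypothetical helper: count the bytes of a string after escape sequences
# such as \n have been expanded (printf '%b' performs the expansion).
byte_len() {
    printf '%b' "$1" | wc -c | tr -d ' '
}

byte_len 'Text Message!\n'    # prints 14
```

After running the program, "echo $?" in the same shell shows the exit code that the program passed to the exit system call.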
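The ELF header that the kernel reads can also be inspected by hand. The sketch below, with the hypothetical name elf_info, looks only at the magic bytes, the class byte (offset 4) and the 16-bit machine field (offset 18, value 183 for AArch64). It assumes a little-endian file of at least 20 bytes and is not a full ELF parser.

```shell
# Hypothetical helper: describe a file by a few ELF header fields.
elf_info() {
    file="$1"
    # Bytes 0-3 hold the magic number 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$file" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 1
    fi
    # Byte 4 (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
    class=$(od -An -tu1 -j4 -N1 "$file" | tr -d ' \n')
    # Bytes 18-19 (e_machine), little-endian: 183 = AArch64.
    set -- $(od -An -tu1 -j18 -N2 "$file")
    machine=$(( $1 + $2 * 256 ))
    case "$class" in
        1) bits="32-bit" ;;
        2) bits="64-bit" ;;
        *) bits="unknown class" ;;
    esac
    if [ "$machine" -eq 183 ]; then
        arch="AArch64"
    else
        arch="machine $machine"
    fi
    echo "ELF $bits $arch"
}
```

On the Raspberry Pi, "elf_info ./program" should report a 64-bit AArch64 file if the build succeeded. The standard "readelf -h ./program" command shows the complete header, including the entry point address that corresponds to _start.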
.global _start // make the _start label visible to the linker
_start: // the point where the program begins
MOV X0, #1 // first argument: file descriptor 1 (stdout)
LDR X1, =msg // second argument: address of the string
MOV X2, #14 // third argument: string length in bytes, including \n
MOV X8, #64 // syscall number for write
SVC #0 // make the syscall
MOV X8, #93 // syscall number for exit
MOV X0, #0 // argument: exit code 0 (success)
SVC #0 // end this program's execution and exit
msg: .ascii "Text Message!\n"
Another rule is a proper exit from the program. In Linux, the code cannot use the RET instruction at the end of ‘_start’. There is no return address, because this code is the first thing executed after the program is loaded into memory, so it must end by telling the kernel to terminate the process. This is also done through a system call:
MOV X8, #93 // syscall number for exit
MOV X0, #0 // return code
SVC #0 // make the syscall
After these instructions are executed, the kernel stops the process and frees its memory, including the stack. Control returns to the OS shell, which receives the return code.
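The return code placed in X0 before the exit system call is what the shell sees as the status of the finished command. A minimal sketch of how to observe it: report_status is a hypothetical helper, and "sh -c 'exit N'" stands in for an assembled program such as ./hello.

```shell
# Hypothetical helper: run a command and print the exit status
# that the kernel handed back to the shell.
report_status() {
    status=0
    "$@" || status=$?
    echo "exit status: $status"
}

report_status sh -c 'exit 0'    # stands in for a program exiting with X0 = 0
report_status sh -c 'exit 3'    # stands in for a program exiting with X0 = 3
```

On the Raspberry Pi, "report_status ./hello" applies the same check to the assembled program.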
At the second level, the code can be bare-metal, with no dependence on an OS. This type of code is much more complex because it relies heavily on knowledge of the hardware, its address space, and related details. On the other hand, bare-metal code can run faster and with more predictable timing than the same code under an OS. That is because the OS schedules multiple tasks, and a program can be suspended for some time while other programs run. Code of this kind is present on all devices with processors, whether or not the device uses an OS. The bootloader is a bare-metal program that prepares the hardware for the OS. If there is no OS, the program is designed to perform a specific task on the hardware.
A special program runs immediately after a system reset. Take the bootloader on the Raspberry Pi as an example. Before the bootloader is loaded into memory, startup code sets the CPU state, initialises memory and sets SP to a high address. Caches and memory management units (MMUs) are turned off by default, but the startup code can be edited to enable them. At the end of the startup code, the program counter is set to a specified address where the main program (i.e., the bootloader) starts. The bootloader checks the hardware, then initialises and prepares it to work with the operating system. The bootloader and the operating system can be replaced with a bare-metal program. In that case the program code runs directly on the processor, and there is no system-call layer or stack setup unless the program code provides it.
The following example blinks an LED connected to Raspberry Pi GPIO17 (pin 11 on the physical board). In theory, the example should work on older Raspberry Pi versions. On the Raspberry Pi 5, the GPIO sits behind a dedicated chip (RP1), which makes the task much harder: PCIe must be initialised, the address space must be known, and much more. Unfortunately, there is no comprehensive documentation for the Raspberry Pi 5, including its address space and other details.
.equ PERIPH_BASE, 0x40000000 // RP1 peripheral base (board-specific, verify for the target)
.equ GPIO_BASE, PERIPH_BASE + 0x0D0000 // GPIO controller base
.equ GPFSEL0, 0x00 // Function select register 0
.equ GPFSEL1, 0x04 // Function select register 1
.equ GPSET0, 0x1C // Pin output set register
.equ GPCLR0, 0x28 // Pin output clear register
// the next directive marks what follows as program code
.section .text
.global _start
_start:
// Initialise stack pointer
LDR X0, =stack_top
MOV SP, X0
// Configure GPIO17 as output
LDR X1, =GPIO_BASE
LDR W2, [X1, #GPFSEL1] // Each pin takes 3 bits
BIC W2, W2, #(0x7 << 21) // Clear bits for GPIO17
ORR W2, W2, #(0x1 << 21) // Set bits to 001 (GPIO17 is output)
STR W2, [X1, #GPFSEL1]
blink_loop:
// Set GPIO17 high
MOV W3, #(1 << 17)
STR W3, [X1, #GPSET0]
// Simple delay loop
MOV X4, #0x200000 // set counter value
delay1:
SUBS X4, X4, #1 // decrease register value
B.NE delay1 // if not equal to zero, repeat
// Set GPIO17 low
STR W3, [X1, #GPCLR0]
// Delay again
MOV X4, #0x200000
delay2:
SUBS X4, X4, #1
B.NE delay2
B blink_loop // repeat
// Reserve space for the stack; it grows downwards, so stack_top follows the space
.balign 16
.space 4096
stack_top:
This code must be loaded by the bootloader as “kernel8.img” at address 0x80000. These commands build the program and prepare it to replace the OS:
as blink.s -o blink.o
ld blink.o -Ttext=0x80000 -o blink.elf
objcopy -O binary blink.elf kernel8.img
After that, the code keeps executing until the power is switched off, the processor is reset, or an unexpected exception occurs. Here, the code is responsible for everything, including setting up the stack pointer, enabling caches, handling interrupts, and working with devices directly at hardware addresses. This requires reviewing all hardware-related documentation. In return, such programs tend to be faster and more predictable than programs running under an OS. This is a typical way to design programs such as device firmware, an RTOS, or a bootloader.
The same code can be adjusted to work as a kernel module or a driver. In that case, the code requires much more editing and a study of the OS-related documentation. The kernel must know how often this program should execute, and the code may also need to work with mutexes. This is better implemented in C, as it relies on many kernel interfaces of the Linux OS.
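The GPFSEL1 offset and the shift of 21 used for GPIO17 in the blink example follow from simple arithmetic, assuming the layout the code relies on: ten pins per 32-bit function select register, 3 bits per pin, and registers spaced 4 bytes apart. A minimal shell sketch of that calculation, where gpfsel_field is a hypothetical helper name:

```shell
# Hypothetical helper: locate the function-select field for a GPIO pin,
# assuming ten pins per 32-bit GPFSEL register and 3 bits per pin.
gpfsel_field() {
    pin="$1"
    reg=$(( pin / 10 ))                 # which GPFSELn register
    offset=$(( reg * 4 ))               # byte offset from the GPIO base
    shift_bits=$(( (pin % 10) * 3 ))    # position of the 3-bit field
    printf 'GPFSEL%d offset=0x%02X shift=%d\n' "$reg" "$offset" "$shift_bits"
}

gpfsel_field 17    # prints: GPFSEL1 offset=0x04 shift=21
```

The same helper gives the values for any other pin, which is useful when the LED is moved to a different GPIO and the constants in the assembly code have to be adapted.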