I want to start this blog with something that's easier to digest and maybe relevant to the theme of this blog: Assembly. It's 2022 and you may be wondering: can I still program in assembly nowadays? Well, it turns out you can. Assembly is very much alive, not dead. People telling you otherwise don't know what they're talking about. Look at this TIOBE index of programming languages.
It's 8th in the index. Quite an achievement for what many consider a dead language 🤷. Not even PHP can beat its popularity.
Assembly is not really a programming language like C or Go. In short, it's close to a one-to-one mapping between machine language and a textual format. By that definition, it's the lowest level of language and it's inherently non-portable across machine architectures.
Writing assembly on a modern system is possible and fairly easy if you're a developer. You need at least two things: an assembler and a linker. An assembler is a program that translates the textual representation into machine code, stored as object code. A linker links the object code with platform libraries to form a full program.
Let's start with the easiest platform to work with and explain: Linux. Fire up your computer, virtual machine, or Docker container to start this journey. I assume that if you've made it here, you don't have a problem installing gcc on your machine, which also pulls in the assembler and linker. But anyway, I'll still give you a reference for Ubuntu/Debian and Fedora/Red Hat.

    # Red Hat based distros (including Fedora)
    sudo dnf install gcc
    # Debian/Ubuntu
    sudo apt install gcc

I didn't cover how to install it on distros like Arch, because if you're using Arch Linux, then you basically know what you're doing.
You can check by invoking the assembler and the linker with --version:

    $ as --version
    GNU assembler (GNU Binutils for Ubuntu) 2.38
    Copyright (C) 2022 Free Software Foundation, Inc.
    This program is free software; you may redistribute it under the terms of
    the GNU General Public License version 3 or later.
    This program has absolutely no warranty.
    This assembler was configured for a target of `x86_64-linux-gnu'.
    $ ld --version
    GNU ld (GNU Binutils for Ubuntu) 2.38
    Copyright (C) 2022 Free Software Foundation, Inc.
    This program is free software; you may redistribute it under the terms of
    the GNU General Public License version 3 or (at your option) a later version.
    This program has absolutely no warranty.
You're practically set. Now, how do we start writing Hello World in assembly? Well, let's break down what constitutes a hello world program.
- You'd need somewhere to store the string.
- You'd need a place and a way to print that string to the screen.
- And last but not least, you'd need a way to exit the program.
The last point is particularly important because, as a high-level language programmer, you may never think about it. A program should quit, duh... But we're dealing with the machine here, so it needs to be told to quit and return to the operating system. Let's break it down one by one.
Prerequisite and Entry Point.
So let's fire up your favourite text editor and write these lines of code to a file, e.g. hello.s:

    .code64
    .section .rodata
    .section .text
    .global _start
    _start:
I'll explain as we go.
_start is what we call the entry point: the code that the operating system executes first after loading the program into memory.
.code64 is a directive. It's like a keyword, and it tells the assembler that we want to generate 64-bit code for the x86_64 architecture.
A place to save "Hello World" information
Linux uses ELF as its executable format.
Just kidding, not that ELF but this ELF: Executable and Linkable Format.
In short, an ELF file consists of several sections, and those sections are defined within the assembly file. The place for static information like "Hello, World!" is a section called .rodata, or Read Only Data. Pretty self-explanatory. That's what .section .rodata in our previous code means. So let's type in our string below the .section .rodata line. And because we're dealing with ASCII characters, we use the .ascii directive:

    .section .rodata
    msg:
        .ascii "Hello, World!\n"
        .set msglen, (. - msg)
Another thing we'll need is the length of the string. We can either hard-code the length or let the assembler compute it with a directive. The .set directive assigns a value to a symbol. This symbol exists at assembly time only. The last line sets msglen to the current position (.) minus the position of msg, which gives the length of msg in bytes. Pretty neat. We're done defining a place for "Hello, World!".
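If the `. - msg` trick feels a bit magical, here is a minimal Python sketch of what the assembler is roughly doing. It models the section as a growing byte buffer; the names `section` and `dot` are made up for this illustration and are not anything the assembler exposes.

```python
# Model the .rodata section as a growing buffer of bytes.
section = bytearray()

# The label `msg` simply remembers the offset where the string starts.
msg = len(section)

# .ascii emits the raw bytes of the string (no terminating NUL byte).
section += b"Hello, World!\n"

# `.` is the location counter: the offset right after the last emitted byte.
dot = len(section)

# .set msglen, (. - msg)
msglen = dot - msg

print(msglen)  # 14
```

The length is 14 because the newline counts as one byte and .ascii does not append a terminating zero.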
A place and a way to write the string.
In Linux, the output terminal is represented by stdout, or standard output, with file descriptor number 1. You can look it up using the command man stdout, or from this page, in the 5th paragraph.
On program startup, the integer file descriptors associated with the streams stdin, stdout, and stderr are 0, 1, and 2, respectively.
What we should do is just write to that "file". To do that, we can use the write system call. But how do we invoke write? Every system call has a number, and it turns out write has 1 as its system call number on x86-64 Linux. You can see that on Filippo's system call table. The C signature for this system call is as follows:
ssize_t write(int fd, const void *buf, size_t count);
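To get a feel for that signature before dropping down to registers, here is a small Python sketch. `os.write` is a thin wrapper over the same system call: it takes a file descriptor and a bytes object and returns the number of bytes written. The helper name `write_message` is just something I made up for this example.

```python
import os

STDOUT_FD = 1  # file descriptor of standard output


def write_message(fd: int, data: bytes) -> int:
    """Write data to fd and return how many bytes were written."""
    # os.write(fd, data) mirrors write(fd, buf, count);
    # the count is implied by len(data).
    return os.write(fd, data)


written = write_message(STDOUT_FD, b"Hello, World!\n")
```

The return value plays the role of the `ssize_t` result in the C signature.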
To invoke the system call we use the SYSCALL instruction. But before that, we need to put the system call number in the rax register. In the AT&T syntax used by the GNU Assembler, it looks like this:

    mov $1, %rax
A constant in AT&T syntax is prefixed by $ and a register name is prefixed by %. So this means: move the constant 1 into register rax. That's our first instruction, and we put it below the _start: label.
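Alright, and lastly, how do we pass arguments to the system call? Well, that's what an Application Binary Interface (ABI) is designed to specify: it defines how functions are called. Wikipedia says that the 1st to 6th parameters are passed in RDI, RSI, RDX, RCX, R8, and R9 respectively. (For system calls, the kernel takes the 4th parameter in R10 instead of RCX, but write only needs the first three.) So we pass them like this: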
- fd is the file descriptor. We pass stdout, which is 1. That goes to rdi.
- buf is the address of the data. We pass msg. That goes to rsi. This uses a special instruction, lea, which allows us to load an address into a register.
- And lastly, we pass the length, msglen. That goes to rdx, the register holding the 3rd parameter. It's a constant, so we prefix it with a dollar sign.
And then we invoke syscall:
    mov $1, %rdi
    lea msg, %rsi
    mov $msglen, %rdx
    syscall
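As a way to keep the register assignments straight, here is a tiny Python sketch that maps a system call number and its arguments to the registers used by the Linux x86-64 system call convention (number in rax, arguments in rdi, rsi, rdx, r10, r8, r9). The function name `syscall_registers` is made up for this illustration; it only builds a table and does not execute anything.

```python
# Argument registers for Linux x86-64 system calls, in order.
SYSCALL_ARG_REGS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")


def syscall_registers(number, *args):
    """Return which register holds which value for a system call."""
    if len(args) > len(SYSCALL_ARG_REGS):
        raise ValueError("a system call takes at most 6 arguments")
    regs = {"rax": number}  # the system call number always goes in rax
    regs.update(zip(SYSCALL_ARG_REGS, args))
    return regs


# write(1, msg, msglen)
print(syscall_registers(1, 1, "msg", "msglen"))
# {'rax': 1, 'rdi': 1, 'rsi': 'msg', 'rdx': 'msglen'}
```

Reading the dictionary top to bottom gives the same four loads as the assembly above.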
That'll write the message to standard output. In the same way, we call exit, which has system call number 60, with an exit code of 0 (success) to terminate the program.

    mov $60, %rax
    mov $0, %rdi
    syscall
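The value we put in rdi is what the parent process (usually your shell) sees as the exit status. A quick way to watch that round trip without any assembly is this Python sketch, which starts a child interpreter that exits with a chosen code. The helper name `exit_code_of_child` is my own, and it assumes `sys.executable` points to a working Python interpreter.

```python
import subprocess
import sys


def exit_code_of_child(code: int) -> int:
    """Run a child process that exits with `code`; return what the parent observes."""
    # os._exit() terminates the child immediately with the given status,
    # skipping Python's normal cleanup, much like a bare exit system call.
    child = subprocess.run(
        [sys.executable, "-c", f"import os; os._exit({code})"]
    )
    return child.returncode


print(exit_code_of_child(0))   # 0
print(exit_code_of_child(42))  # 42
```

In a shell, the same value is available as `$?` right after the program finishes.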
So, there, we've finished the source code. Let's name it hello.s:

    .code64
    .section .rodata
    msg:
        .ascii "Hello, World!\n"
        .set msglen, (. - msg)
    .section .text
    .global _start
    _start:
        mov $1, %rax
        mov $1, %rdi
        lea msg, %rsi
        mov $msglen, %rdx
        syscall
        mov $60, %rax
        mov $0, %rdi
        syscall
And then we assemble and link using the tools we have installed.
    as -o hello.o -s hello.s
    ld -o hello hello.o

You can execute the binary:

    $ ./hello
    Hello, World!
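Since Linux uses ELF, you can peek at the start of an executable to see that format for yourself. Below is a minimal Python sketch that reads a few fields from a 64-bit little-endian ELF header: the magic bytes, the machine type, and the entry point address (where _start ends up). The function name `parse_elf_header` and the fake header are made up for this example, and the sketch ignores everything else in the file.

```python
import struct


def parse_elf_header(data: bytes) -> dict:
    """Extract the machine type and entry point from an ELF64 header."""
    if len(data) < 32 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class = data[4]  # 1 = 32-bit, 2 = 64-bit
    ei_data = data[5]   # 1 = little-endian, 2 = big-endian
    if ei_class != 2 or ei_data != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    (e_machine,) = struct.unpack_from("<H", data, 18)  # 62 means x86-64
    (e_entry,) = struct.unpack_from("<Q", data, 24)    # address of the entry point
    return {"machine": e_machine, "entry": e_entry}


# A hand-built header for demonstration: 16 identification bytes,
# then type, machine, version, and entry point.
ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
fake_header = struct.pack("<16sHHIQ", ident, 2, 62, 1, 0x401000)

info = parse_elf_header(fake_header)
print(info["machine"], hex(info["entry"]))  # 62 0x401000
```

To try it on a real file, read the first bytes with `open("hello", "rb").read(64)` and pass them to the function.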
On macOS, all of those tools are included with Xcode and the Command Line Tools for Xcode, which install LLVM-based tooling. Open a terminal and test whether you have them installed.
Note: This is only tested on macOS with an Intel chip. I didn't test this on Apple Silicon.
    $ as --version
    Apple clang version 13.1.6 (clang-1322.214.171.124.3)
    Target: x86_64-apple-darwin21.4.0
    Thread model: posix
    InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
    $ ld --help
    ld64: For information on command line options please use 'man ld'.
If you see those messages, it means that you're ready to write code. Open up your favourite text editor and type these commands.
The code and differences with Linux
The code is similar to Linux in the way you invoke syscall and such, with some changes in syntax and system call numbers. Let's see:

    .code64
    .global _main
    .static_data
    msg:
        .ascii "Hello, World!\n"
        .set msglen, (. - msg)
    .text
    _main:
        mov $0x2000004, %rax
        mov $1, %rdi
        lea msg(%rip), %rsi
        mov $msglen, %rdx
        syscall
        mov $0x2000001, %rax
        xor %rdi, %rdi
        syscall
- The first thing we notice is the entry point. Instead of _start, we have _main.
- Our string lives in a section marked with .static_data instead of .section .rodata. It's a difference between the GNU assembler and Apple's LLVM assembler. Another difference is that the text section is simply marked as .text.
- msg should be loaded with a location relative to RIP (the instruction pointer register).
- The write system call is numbered 0x2000004 instead of 1. The real system call number is 4, as shown here github.com/opensource-apple/xnu/blob/master.. but you'd need to add 0x2000000 to signify it's a BSD system call instead of a Mach one.
- The exit system call is numbered 1 and prefixed with 0x2000000 in the same way, which gives 0x2000001.
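The 0x2000000 prefix is easier to remember once you see it as a "class" number shifted into the top byte. Here is a small Python sketch of that arithmetic. The constant names mirror what I recall from the XNU headers (class 2 for BSD/Unix calls, shifted left by 24 bits), so treat the names as illustrative rather than authoritative; the resulting numbers match the ones used in the code above.

```python
SYSCALL_CLASS_SHIFT = 24  # the class lives in the bits above the call number
SYSCALL_CLASS_UNIX = 2    # BSD system calls (as opposed to Mach traps)


def bsd_syscall(number: int) -> int:
    """Combine the BSD class with a system call number for macOS on x86-64."""
    return (SYSCALL_CLASS_UNIX << SYSCALL_CLASS_SHIFT) | number


print(hex(bsd_syscall(4)))  # 0x2000004 (write)
print(hex(bsd_syscall(1)))  # 0x2000001 (exit)
```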
Assembling and Executing
Assembling and executing is similar to Linux with some additional parameters:
    as -arch x86_64 hello.s -o hello.o
    ld -arch x86_64 \
        -L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib \
        -lSystem -o hello hello.o
    ./hello
The first command is similar; we just add -arch x86_64 to signify our target architecture. This is because the LLVM toolchain is a cross compiler by default and can target multiple architectures.
For the last one, other than the architecture, we add the library directory and the library we want to link with: -L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem means that we want to link to libSystem, which is mandatory on macOS.
That's how you do assembly in 2022 on full-blown 64-bit modern operating systems. I purposely left Windows out because Windows is a little unique in this regard. Stay tuned and I'll write about writing a Hello World in Windows with Assembly.