Hello World in x86_64 Assembly on Linux and macOS
Because even in 2022, assembly programming is still relevant
I want to start this blog with something that's easy to digest and relevant to the theme of this blog: assembly. It's 2022 and you might be wondering: can I still program in assembly nowadays? Well, it turns out you can. Assembly is very much alive. People telling you otherwise don't know what they're talking about. Look at the TIOBE index of programming languages.
It's 8th in the index. Quite an achievement for something many consider a dead language 🤷. Not even PHP can beat its popularity.
Assembly is not really a programming language like C or Go. In short, it's close to a one-to-one mapping between machine language and a textual format. By that definition, it's the lowest-level language and it's inherently non-portable across machines.
Writing assembly on a modern system is possible and fairly easy if you're a developer. You need at least two things: an assembler and a linker. The assembler is a program that translates the textual representation into machine code, stored in an object file. The linker combines object files with platform libraries to form a complete program.
Linux
Let's start with the easiest platform to work with and explain: Linux. Fire up your computer, virtual machine, or Docker container to start this journey. I assume that if you've come this far, you'll have no problem installing gcc on your machine (it pulls in binutils, which provides the assembler and the linker). But anyway, I'll still give you a reference for Ubuntu/Debian and Fedora/Red Hat.
# Red Hat Based Distro (including Fedora)
sudo dnf install gcc
# Debian/Ubuntu
sudo apt install gcc
I didn't cover distros like Arch, because if you're using Arch Linux, you basically know what you're doing.
You can check the installation by invoking the as and ld commands.
$ as --version
GNU assembler (GNU Binutils for Ubuntu) 2.38
Copyright (C) 2022 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-linux-gnu'.
$ ld --version
GNU ld (GNU Binutils for Ubuntu) 2.38
Copyright (C) 2022 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
You're practically set. Now, how do we start writing Hello World in assembly? Well, let's break down what constitutes a hello world program.
1. You'd need somewhere to store the string "Hello, World!".
2. You'd need a place and a way to print that string to the screen.
3. And last but not least, you'd need a way to exit the program.
Number (3) is particularly important because as a high-level language programmer you may never think about it. A program should quit, duh... But we're dealing with the machine here, and it needs to be told to quit and return to the operating system. Let's break it down one by one.
Prerequisite and Entry Point.
So let's fire up your favourite text editor and write these lines of code to a file, e.g. hello.s
.code64
.section .rodata
.section .text
.global _start
_start:
I'll explain as we go. _start is what we call the entry point: the code the operating system executes first after loading the program into memory. .code64 is a directive. It's like a keyword, and it tells the assembler to generate 64-bit (x86_64) code.
A place to save "Hello World" information
Linux uses ELF as its executable format. Not the elf of folklore, but this ELF: Executable and Linkable Format.
In short, an ELF file consists of several sections, and those sections are defined within the assembly file. The place to put static information like "Hello, World!" is a section called .rodata, or Read Only Data. Pretty self-explanatory. That's what the .rodata in our previous code means. So let's type in our string below the .section .rodata line. And because we're dealing with ASCII characters, we use the .ascii directive.
.section .rodata
msg: .ascii "Hello, World!\n"
.set msglen, (. - msg)
Another thing that we'll need is the length of the string. We can either hard-code the length or use a directive. The .set directive assigns a value to a symbol. The value is computed at assembly time, so it takes no space in the program. The last line sets msglen to the current position (written as a single dot) minus the position of msg, which gives the length of msg in bytes. Pretty neat. We're done defining a place for "Hello, World!".
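If you're coming from C, a rough analogue of the msg and msglen pair might look like the sketch below. The names MSGLEN and hello_length are made up for illustration and are not part of the assembly program. One difference worth noticing: .ascii stores exactly the bytes you typed, while C appends a terminating NUL byte to a string literal, hence the minus one.

```c
#include <assert.h>
#include <stddef.h>

/* Same text as the .ascii directive in hello.s. */
static const char msg[] = "Hello, World!\n";

/* sizeof counts the implicit '\0' that C appends, so subtract 1.
 * Like .set, this is evaluated before the program ever runs. */
enum { MSGLEN = sizeof msg - 1 };

/* 13 visible characters plus the newline. */
static_assert(MSGLEN == 14, "unexpected message length");

/* Hypothetical helper so the value can be inspected at run time. */
size_t hello_length(void) {
    return MSGLEN;
}
```

As in the assembly version, changing the text updates the length automatically, which is the whole point of not hard-coding it.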
A place and a way to write the string.
In Linux, the terminal output is represented by stdout, or standard output, which has file descriptor number 1. You can look it up with the command man stdout, or in the 5th paragraph of this page:
On program startup, the integer file descriptors associated with the streams stdin, stdout, and stderr are 0, 1, and 2, respectively.
All we have to do is write to that "file". To do that, we can use the write system call. But how do we invoke write? Every system call has a number, and it turns out write has 1 as its system call number. You can see it in Filippo's system call table. The C signature for this system call is as follows:
ssize_t write(int fd, const void *buf, size_t count);
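To get a feel for that signature before dropping down to assembly, here is a minimal C sketch that goes through the libc wrapper for write. The helper name put_string is hypothetical. It loops because write is allowed to write fewer bytes than requested.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write the whole string `s` to file descriptor `fd`.
 * Returns the number of bytes written, or -1 if write() failed. */
ssize_t put_string(int fd, const char *s) {
    size_t left = strlen(s);
    ssize_t total = 0;

    while (left > 0) {
        ssize_t n = write(fd, s, left);   /* fd, buf, count */
        if (n <= 0) {
            return -1;                    /* error, or no progress */
        }
        s += n;
        left -= (size_t)n;
        total += n;
    }
    return total;
}
```

Calling put_string(STDOUT_FILENO, "Hello, World!\n") should print the greeting, since STDOUT_FILENO is the file descriptor 1 mentioned above.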
To invoke the system call we use the SYSCALL instruction. But before that, we need to put the system call number in the rax register. In the AT&T syntax used by the GNU Assembler, it looks like this:
mov $1, %rax
A constant in AT&T syntax is prefixed with $ and a register name is prefixed with %. So this means: move the constant 1 into register rax. That's our first instruction, and we put it below the _start label in the .text section.
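Alright, and lastly, how do we pass arguments to the system call? Well, that's what an Application Binary Interface (ABI) is designed to specify: it defines how functions are called. Wikipedia says that the 1st to 6th parameters of a function are passed in RDI, RSI, RDX, RCX, R8, and R9 respectively. System calls use the same registers, except that the 4th parameter goes in R10 instead of RCX, because the SYSCALL instruction overwrites RCX. write takes only three parameters, so we pass them like this: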
- fd is the file descriptor. We pass stdout, which is 1. That goes to rdi.
- buf is the address of the data. We pass msg. That goes to rsi. This uses a special instruction, lea, which lets us load an address into a register.
- And lastly, we pass the length msglen to rdx, as it's the register holding the 3rd parameter. It's a constant, so we prefix it with a dollar sign.
And then we execute the SYSCALL instruction.
mov $1, %rdi
lea msg, %rsi
mov $msglen, %rdx
syscall
That'll write the message to standard output. In the same way, we call exit, which has syscall number 60, with an exit code of 0 (success) to terminate the program.
mov $60, %rax
mov $0, %rdi
syscall
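If you'd like to see the same register assignment from the C side, the sketch below issues the write system call by hand with GCC/Clang inline assembly. The helper name raw_write is hypothetical, and the inline-assembly path only applies on x86_64 Linux. On anything else it falls back to the ordinary libc write so the code still compiles.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: issue the write system call "by hand".
 * On x86_64 Linux a failure comes back as a negative errno value;
 * the fallback path returns -1 like the libc wrapper does. */
long raw_write(int fd, const void *buf, size_t count) {
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                 /* result comes back in rax      */
        : "a"(1L),                  /* rax: system call number, write */
          "D"((long)fd),            /* rdi: 1st argument, fd          */
          "S"(buf),                 /* rsi: 2nd argument, buf         */
          "d"(count)                /* rdx: 3rd argument, count       */
        : "rcx", "r11", "memory");  /* syscall overwrites rcx and r11 */
    return ret;
#else
    return (long)write(fd, buf, count);
#endif
}
```

The clobber list is the interesting part: because the syscall instruction overwrites rcx and r11, the compiler has to be told not to keep anything important in them.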
So there, we've finished the source code. Let's name it hello.s
.code64
.section .rodata
msg: .ascii "Hello, World!\n"
.set msglen, (. - msg)
.section .text
.global _start
_start:
mov $1, %rax
mov $1, %rdi
lea msg, %rsi
mov $msglen, %rdx
syscall
mov $60, %rax
mov $0, %rdi
syscall
And then we assemble and link using the tools we have installed.
as -o hello.o -s hello.s
ld -o hello hello.o
You can execute the binary:
./hello
Hello, World!
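The exit code we put in rdi is what the parent process gets to see. A small C sketch of that relationship follows, assuming a POSIX system. The helper name exit_status_of_child is made up, and it uses libc's _exit wrapper rather than the raw system call number 60.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that terminates immediately with
 * `code`, then return the exit status the parent observes.
 * Returns -1 if fork/waitpid fails or the child did not exit normally. */
int exit_status_of_child(int code) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        _exit(code);              /* child: plays the role of our exit syscall */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A shell does essentially the same thing after running ./hello, which is why echo $? prints the exit code right afterwards.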
macOS
On macOS, all of those tools are included with Xcode and the Command Line Tools for Xcode, which install LLVM-based tooling. Open a terminal and check that you have them installed.
Note: This was only tested on macOS with an Intel chip. I didn't test it on Apple Silicon.
$ as --version
Apple clang version 13.1.6 (clang-1316.0.21.2.3)
Target: x86_64-apple-darwin21.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
$ ld --help
ld64: For information on command line options please use 'man ld'.
If you see those messages, you're ready to write code. Open up your favourite text editor and type in the code from the next section.
The code and differences with Linux
The code is similar to the Linux version in the way you make syscalls and such, with some changes in syntax and syscall numbers. Let's see:
.code64
.global _main
.static_data
msg: .ascii "Hello, World!\n"
.set msglen, (. - msg)
.text
_main:
mov $0x2000004, %rax
mov $1, %rdi
lea msg(%rip), %rsi
mov $msglen, %rdx
syscall
mov $0x2000001, %rax
xor %rdi, %rdi
syscall
- The first thing we notice is the entry point. Instead of _start, we have _main.
- The data section holding the string is marked with .static_data. This is a difference between the GNU assembler and Apple's LLVM assembler. Another difference is that the text section is simply marked as .text instead of .section .text.
- The address of msg should be loaded relative to RIP (the instruction pointer register).
- The write system call is numbered 0x2000004 instead of 1. The real system call number is 4, as shown here: github.com/opensource-apple/xnu/blob/master.. but you need to add 0x2000000 to signify that it's a BSD system call instead of a Mach one.
- The exit system call is numbered 1, with the same 0x2000000 added, which gives 0x2000001.
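The 0x2000000 part is easier to remember once you see it as a class number shifted into the upper bits. The sketch below reproduces the arithmetic in C. The macro and function names are chosen for illustration, on the assumption that the BSD (Unix) class is 2 and the class lives above bit 24, which is what the numbers in this post imply.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the system call class sits above bit 24,
 * and the BSD (Unix) class is number 2. */
#define SYSCALL_CLASS_SHIFT 24
#define SYSCALL_CLASS_UNIX  2u
#define BSD_SYSCALL(n) ((SYSCALL_CLASS_UNIX << SYSCALL_CLASS_SHIFT) | (n))

/* The two values used in the macOS listing. */
static_assert(BSD_SYSCALL(4u) == 0x2000004u, "write");
static_assert(BSD_SYSCALL(1u) == 0x2000001u, "exit");

/* Hypothetical run-time version of the same computation. */
uint32_t bsd_syscall(uint32_t number) {
    return BSD_SYSCALL(number);
}
```

So bsd_syscall(4) gives the value loaded into rax for write, and bsd_syscall(1) the one for exit.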
Assembling and Executing
Assembling and executing are similar to Linux, with some additional parameters:
as -arch x86_64 hello.s -o hello.o
ld -arch x86_64 \
-L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib \
-lSystem -o hello hello.o
./hello
The first command is similar; we just add -arch x86_64 to specify our target architecture. This is because LLVM is a cross compiler by default and can target multiple architectures.
In the last one, besides the architecture, we add the library directory and the library we want to link with: -L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem means that we want to link against libSystem, which is mandatory on macOS.
Wrap Up
That's how you do assembly in 2022 on full-blown 64-bit modern operating systems. I purposely left Windows out because Windows is a little unique in this regard. Stay tuned and I'll write about writing a Hello World on Windows in assembly.