Hello World in x86_64 Assembly on Linux and macOS

Because even in 2022, assembly programming is still relevant

I want to start this blog with something that's easy to digest and relevant to its theme: assembly. It's 2022 and you may be wondering: can I still program in assembly nowadays? Well, it turns out you can. Assembly is very much alive. People telling you otherwise don't know what they're talking about. Look at this TIOBE index of programming languages.

[Image: screenshot of the TIOBE index, taken 2022-05-02]

It's 8th in the index. Quite an achievement for something many consider a dead language 🤷. Not even PHP can beat its popularity.

Assembly is not really a programming language like C or Go. In short, it's close to a one-to-one mapping between machine instructions and their textual form. By that definition, it's the lowest-level language you can write, and it's inherently non-portable across machine architectures.

Writing assembly on a modern system is possible and fairly easy if you're a developer. You need at least two things: an assembler and a linker. An assembler is a program that translates the textual representation into machine code, also known as object code. A linker combines that object code with platform libraries to form a complete program.

Linux

Let's start with the platform that's easiest to work with and explain: Linux. Fire up your computer, virtual machine, or Docker container to start this journey. I assume that if you've come this far, you'll have no problem installing gcc on your machine, which also pulls in the assembler and linker from GNU Binutils. But anyway, here's a reference for Ubuntu/Debian and Fedora/Red Hat.

# Red Hat Based Distro (including Fedora)
sudo dnf install gcc

# Debian/Ubuntu
sudo apt install gcc

I didn't cover distros like Arch, because if you're using Arch Linux, you basically know what you're doing.

You can check the installation by invoking the as and ld commands.

$ as --version

GNU assembler (GNU Binutils for Ubuntu) 2.38
Copyright (C) 2022 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-linux-gnu'.

$ ld --version

GNU ld (GNU Binutils for Ubuntu) 2.38
Copyright (C) 2022 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.

You're practically set. Now, how do we start writing Hello World in assembly? Well, let's break down what constitutes a hello world program.

  1. You'd need somewhere to save information about the string "Hello, World!".
  2. You'd need a place and a way to print that string to the screen.
  3. And last but not least, you'd need a way to exit the program.

Number (3) is particularly important because as a high-level language programmer you may never think about it. A program should quit, duh... But we're dealing with the machine here, and it needs to be told to quit and return control to the operating system. Let's go through them one by one.

Prerequisite and Entry Point.

So fire up your favourite text editor and write these lines of code to a file, e.g. hello.s

.code64
.section .rodata

.section .text
.global _start 
_start:

I'll explain as we go. _start is what we call the entry point: the code the operating system executes first after loading the program into memory. .code64 is a directive. Directives are like keywords for the assembler, and this one tells it to generate 64-bit (x86_64) code.

A place to save "Hello World" information

Linux uses ELF as its executable format.

[Image: Galadriel, Legolas, and Elrond from The Lord of the Rings]

Just kidding, not that ELF but this ELF: Executable and Linkable Format.

[Image: diagram of the ELF file layout]

In short, an ELF file consists of several sections, and those sections are defined within the assembly file. The place for static information like "Hello, World!" is a section called .rodata, or Read-Only Data. Pretty self-explanatory. That's what the .rodata in our previous code means. So let's type in our string below the .section .rodata line. And because we're dealing with ASCII characters, we use the .ascii directive.

.section .rodata
msg: .ascii "Hello, World!\n"
.set msglen, (. - msg)

Another thing we'll need is the length of the string. We can either hard-code the length or use a directive. The .set directive assigns a value to a symbol. This symbol exists at assembly time only. The last line sets msglen to the current position (.) minus the position of msg, which yields the length of the message in bytes. Pretty neat. We're done defining a place for "Hello, World!".
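
If you're more at home in C, the (. - msg) trick is roughly the assembler's counterpart of sizeof on a string literal. Here is a minimal sketch of that idea; the names msg and msg_len are mine and only mirror the assembly, they aren't part of hello.s.

```c
#include <stddef.h>

/* Same bytes as the .ascii directive produces. C appends a
   terminating '\0' to the literal, which .ascii does not. */
static const char msg[] = "Hello, World!\n";

/* sizeof counts that trailing '\0', so subtract 1 to get the value
   the assembler computes for (. - msg). */
static size_t msg_len(void) {
    return sizeof(msg) - 1;
}
```

Calling msg_len() gives 14: thirteen visible characters plus the newline.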

A place and a way to write the string.

In Linux, the terminal output is represented by stdout, or standard output, which has file descriptor number 1. You can look it up with the command man stdout, or in the 5th paragraph of this page.

On program startup, the integer file descriptors associated with the streams stdin, stdout, and stderr are 0, 1, and 2, respectively.

What we should do is just write to that "file". To do that, we can use the write system call. But how do we invoke write? Every system call has a number, and it turns out write has system call number 1. You can see the numbers in Filippo's system call table. The C signature for this system call is as follows:

ssize_t write(int fd, const void *buf, size_t count);
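
Before dropping down to assembly, it may help to see the same call made from C. This is a minimal sketch assuming a POSIX system; the helper name say_hello is made up for illustration.

```c
#include <unistd.h>   /* write, STDOUT_FILENO */

/* Writes the greeting to the given file descriptor and returns the
   number of bytes written, or -1 on error. Pass STDOUT_FILENO (1)
   to print to the terminal. */
static ssize_t say_hello(int fd) {
    static const char msg[] = "Hello, World!\n";
    return write(fd, msg, sizeof(msg) - 1);
}
```

Calling say_hello(STDOUT_FILENO) from main prints the greeting. The three arguments to write here are the same three values we are about to load into registers by hand.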

To invoke the system call we use the SYSCALL instruction. But before that, we need to put the system call number in the rax register. In the AT&T syntax used by the GNU Assembler, it looks like this:

mov $1, %rax

A constant in AT&T syntax is prefixed by $ and a register name is prefixed by %. So this means: move the constant 1 into register rax. That's our first instruction, and it goes below the _start: label in .section .text

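Alright, and lastly, we need to know how to pass arguments to the system call. Well, that's what an Application Binary Interface (ABI) is designed to do: it specifies how functions are called. Wikipedia says that the 1st to 6th parameters of a function are passed in RDI, RSI, RDX, RCX, R8, and R9 respectively. System calls follow the same order, except that the 4th parameter goes in R10 instead of RCX, because the SYSCALL instruction overwrites RCX. write takes only three parameters, so the difference doesn't matter here. We pass them like this: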

  1. fd is the file descriptor. We pass stdout, which is 1. That goes to rdi.
  2. buf is the address of the data. We pass msg. That goes to rsi. This uses the lea (load effective address) instruction, which loads an address into a register.
  3. And lastly, we pass the length msglen in rdx, as it's the register holding the 3rd parameter. It's a constant, so we prefix it with a dollar sign.

And then we execute the SYSCALL instruction.

mov $1, %rdi 
lea msg, %rsi 
mov $msglen, %rdx
syscall
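
If you'd like to experiment with these register assignments without leaving C, GCC and Clang inline assembly can express them directly. The following is only a sketch under a few assumptions: the function name raw_write is mine, the inline assembly path is taken only on x86_64 Linux, and everywhere else it falls back to the ordinary libc write.

```c
#include <stddef.h>
#include <unistd.h>

/* Issues write(fd, buf, count) with the raw SYSCALL instruction on
   x86_64 Linux. Note that the raw system call reports failure as a
   negative errno value, while the libc fallback returns -1. */
static long raw_write(int fd, const void *buf, size_t count) {
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                 /* result comes back in rax         */
        : "a"(1L),                  /* rax = 1, the write syscall number */
          "D"((long)fd),            /* rdi = 1st argument               */
          "S"(buf),                 /* rsi = 2nd argument               */
          "d"(count)                /* rdx = 3rd argument               */
        : "rcx", "r11", "memory");  /* SYSCALL overwrites rcx and r11   */
    return ret;
#else
    return (long)write(fd, buf, count);
#endif
}
```

For example, raw_write(1, "hi\n", 3) prints hi and returns 3. The constraint letters a, D, S, and d are how the compiler is told to use rax, rdi, rsi, and rdx.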

That'll write the message to standard output. In the same way, we call exit, which has syscall number 60, with an exit code of 0 (success) to terminate the program.

mov $60, %rax
mov $0, %rdi 
syscall
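
To convince yourself that the value placed in rdi really becomes the process's exit status, here is a small C sketch that issues the same exit system call from a child process and reads the status back in the parent. It assumes Linux with glibc (for syscall() and SYS_exit); the function name exit_status_of_child is made up for this example.

```c
#define _GNU_SOURCE       /* needed for syscall() on glibc */
#include <sys/syscall.h>  /* SYS_exit */
#include <sys/wait.h>     /* waitpid, WIFEXITED, WEXITSTATUS */
#include <unistd.h>       /* fork, syscall, _exit */

/* Forks a child that terminates via the raw exit system call and
   returns the exit status the parent observes, or -1 on failure. */
static int exit_status_of_child(int code) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        syscall(SYS_exit, code);  /* same call as: mov $60, %rax ... syscall */
        _exit(127);               /* not reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

exit_status_of_child(0) returns 0, and exit_status_of_child(42) returns 42. It's the same number a shell shows with echo $? after running a program.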

So, there, we've finished the source code. Here's the complete hello.s:

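.code64
.section .rodata
msg: .ascii "Hello, World!\n"
.set msglen, (. - msg)        # length of msg in bytes

.section .text
.global _start
_start:
  mov $1, %rax                # system call number 1: write
  mov $1, %rdi                # 1st argument: file descriptor 1 (stdout)
  lea msg, %rsi               # 2nd argument: address of the message
  mov $msglen, %rdx           # 3rd argument: number of bytes to write
  syscall

  mov $60, %rax               # system call number 60: exit
  mov $0, %rdi                # 1st argument: exit code 0 (success)
  syscall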

And then we assemble and link using the tools we have installed.

as -o hello.o hello.s

ld -o hello hello.o

You can execute the binary:

./hello

Hello, World!

macOS

On macOS, all of those tools are included in Xcode and the Command Line Tools for Xcode, which install LLVM-based tooling. Open a terminal and check whether you have them installed.

Note: This is only tested on macOS with an Intel chip. I didn't test this on Apple Silicon.

$ as --version
Apple clang version 13.1.6 (clang-1316.0.21.2.3)
Target: x86_64-apple-darwin21.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

$ ld --help
ld64: For information on command line options please use 'man ld'.

If you see those messages, you're ready to write code. Open up your favourite text editor and type in the following code.

The code and differences with Linux

The code is similar to the Linux version in how you invoke syscall and such, with some changes in syntax and syscall numbers. Let's see:

.code64
.global _main
.static_data
msg: .ascii "Hello, World!\n"
.set msglen, (. - msg)
.text
_main:
    mov $0x2000004, %rax 
    mov $1, %rdi 
    lea msg(%rip), %rsi 
    mov $msglen, %rdx
    syscall 
    mov $0x2000001, %rax 
    xor %rdi, %rdi 
    syscall

  1. The first thing we notice is the entry point. Instead of _start, we have _main.
  2. The data section is marked with .static_data. This is a difference between the GNU assembler and Apple's LLVM assembler. Strictly speaking, .static_data places the string in the writable __DATA segment; the closer counterpart of .rodata on macOS is .const. Another difference is that the text section is simply marked as .text instead of .section .text.
  3. The address of msg has to be loaded relative to RIP (the instruction pointer register).
  4. The write system call is numbered 0x2000004 instead of 1. The real system call number is 4, as shown here github.com/opensource-apple/xnu/blob/master.. but you need to add 0x2000000 to signify that it's a BSD system call instead of a Mach one.
  5. The exit system call is numbered 1, which becomes 0x2000001 once 0x2000000 is added.
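
The arithmetic behind those two numbers can be sketched in C. As far as I can tell, 0x2000000 is the system call class (2 for BSD) shifted into the top byte of a 32-bit value. The enum and function names below are illustrative and are not taken from Apple's headers.

```c
#include <stdint.h>

/* Illustrative names, not Apple's. */
enum {
    SYSCALL_CLASS_SHIFT_BITS = 24, /* the class lives in the top byte   */
    SYSCALL_CLASS_BSD        = 2   /* class 2 selects BSD system calls  */
};

/* Combines the BSD class with a system call number, so 4 (write)
   becomes 0x2000004 and 1 (exit) becomes 0x2000001. */
static uint32_t bsd_syscall(uint32_t number) {
    return ((uint32_t)SYSCALL_CLASS_BSD << SYSCALL_CLASS_SHIFT_BITS) | number;
}
```

So bsd_syscall(4) and bsd_syscall(1) produce exactly the two constants loaded into rax in the macOS listing.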

Assembly and Executing

Assembling and executing are similar to Linux, with some additional parameters:

as -arch x86_64 hello.s -o hello.o
ld -arch x86_64 \
-L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib \
-lSystem -o hello hello.o

./hello

The first command is similar; we just add -arch x86_64 to specify our target architecture. This is because LLVM is a cross-compiler by default and can target multiple architectures.

In the second command, besides the architecture, we add the library directory and the library we want to link with: -L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem means that we want to link against libSystem, which is mandatory on macOS.

Wrap Up

That's how you do assembly in 2022 on full-blown, modern 64-bit operating systems. I purposely left Windows out because Windows is a little unique in this regard. Stay tuned and I'll write about writing a Hello World in Windows with assembly.