I've been working on a kernel and OS as a hobby project for a while, and I feel like a big hurdle most people hit in kernel development is getting from writing code to actually running the binary. I want to walk through how I got my kernel booting here, since I think it's pretty interesting to work at such a low level in assembly.

Overview

When a computer first boots up, it launches what's called the 'BIOS', which handles things like detecting and setting up RAM, detecting some disks, and providing an API for reading from them easily. Once all that setup has happened, it looks at the boot disk and loads the first sector (512 bytes) off it into memory at a fixed location. This sector is called the "boot sector" and its main job is to load your kernel into memory. Luckily, loading a kernel is a pretty common thing to do, so I use an existing bootloader, GRUB, which implements the Multiboot specification. A Multiboot bootloader does all the hard work of loading your kernel's ELF binary into memory and jumping to it. In my kernel, the file boot.asm is the starting point. Multiboot requires a header to be defined in your binary so you can tell the bootloader exactly how to load your kernel. Mine looks like this (NASM assembly):

; Declare constants for the multiboot header.
MBALIGN  equ  1 << 0            ; align loaded modules on page boundaries
MEMINFO  equ  1 << 1            ; provide memory map
FLAGS    equ  MBALIGN | MEMINFO ; this is the Multiboot 'flag' field
MAGIC    equ  0x1BADB002        ; 'magic number' lets bootloader find the header
CHECKSUM equ -(MAGIC + FLAGS)   ; checksum of above, to prove we are multiboot


section .multiboot
align 8
	dd MAGIC
	dd FLAGS
	dd CHECKSUM
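
The CHECKSUM line is easier to trust once you see the arithmetic: the three 32-bit fields have to sum to zero modulo 2^32. Here is a minimal C sketch of that rule, not code from the kernel itself. The helper names mb_checksum and mb_header_valid are made up for illustration.

```c
#include <stdint.h>

/* Values copied from the NASM header above. */
#define MB_MAGIC   0x1BADB002u
#define MB_ALIGN   (1u << 0)
#define MB_MEMINFO (1u << 1)
#define MB_FLAGS   (MB_ALIGN | MB_MEMINFO)

/* Same computation as the assembler's -(MAGIC + FLAGS): the 32-bit
 * two's complement negation. Unsigned arithmetic wraps modulo 2^32. */
static uint32_t mb_checksum(uint32_t magic, uint32_t flags)
{
    return (uint32_t)(0u - (magic + flags));
}

/* Hypothetical helper: the check a Multiboot 1 loader applies to a
 * candidate header. Returns 1 when the magic matches and the three
 * fields sum to zero (mod 2^32), otherwise 0. */
static int mb_header_valid(uint32_t magic, uint32_t flags, uint32_t checksum)
{
    return magic == MB_MAGIC && (uint32_t)(magic + flags + checksum) == 0u;
}
```

With the flags used here (MBALIGN | MEMINFO = 3), the checksum works out to 0xE4524FFB.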

This is a very simple, barebones multiboot header that does just enough to work. Once the bootloader has loaded the binary into memory, it jumps to the entry symbol defined in your binary in 32-bit protected mode, with some special values loaded into the registers. In Multiboot 1, EAX=0x2BADB002 and EBX holds a 32-bit pointer to the Multiboot information structure, which has things like the memory map, the kernel command line, and any loaded modules. The first code my kernel runs sets up the stack and copies those values somewhere else (since eax and ebx are so commonly used, I don't want to lose them):

;; statically allocate 256 bytes of "boot stack"
BOOT_STACK_SIZE equ 256
boot_stack:
	times BOOT_STACK_SIZE db 0


;; 32 bit code, in the section .init
[bits 32]
align 8
section .init

global _start ;; entrypoint
_start:
	;; load the boot stack
	mov esp, boot_stack + BOOT_STACK_SIZE
	mov ebp, esp

	;; move the info that GRUB passes into the kernel into
	;; arguments that we will use when calling kmain later
	mov edi, ebx
	mov esi, eax
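
Putting the info pointer in edi and the magic in esi lines up with the first two integer arguments of the System V AMD64 calling convention, so the C side can receive them as ordinary parameters. Below is a rough sketch of what the receiving side might check first. The struct is only the first three fields of the Multiboot 1 information structure, and the function name mb_basic_memory_kib is made up for illustration. The real kmain signature is not shown in this post.

```c
#include <stddef.h>
#include <stdint.h>

#define MB_BOOTLOADER_MAGIC 0x2BADB002u /* value a Multiboot 1 loader leaves in EAX */
#define MB_INFO_MEMORY      (1u << 0)   /* flags bit 0: mem_lower/mem_upper are valid */

/* Only the head of the Multiboot 1 information structure;
 * the real structure continues with more fields after these. */
struct mb_info_head {
    uint32_t flags;
    uint32_t mem_lower; /* KiB of memory starting at address 0 */
    uint32_t mem_upper; /* KiB of memory starting at 1 MiB */
};

/* Hypothetical helper: returns lower + upper memory in KiB, or -1 if
 * the magic is wrong or the loader did not provide the memory fields. */
static int64_t mb_basic_memory_kib(const struct mb_info_head *info, uint32_t magic)
{
    if (magic != MB_BOOTLOADER_MAGIC || info == NULL)
        return -1;
    if ((info->flags & MB_INFO_MEMORY) == 0)
        return -1;
    return (int64_t)info->mem_lower + (int64_t)info->mem_upper;
}
```

Checking the magic before touching the pointer is the cheap way to notice that the kernel was started by something other than a Multiboot loader.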

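Then, I enable the physical address extension (PAE) and page size extension (PSE) bits in cr4. PAE is a prerequisite for long mode, and the paging format it enables lets me address more than 4 GB of memory and use large pages (2 MB, or 1 GB if the hardware supports it).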

	; enable PAE and PSE
	mov eax, cr4
	or eax, (CR4_PAE + CR4_PSE)
	mov cr4, eax
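
The CR4_PAE and CR4_PSE constants aren't defined in the snippet. Assuming they match the architectural bit positions (PSE is bit 4 of CR4, PAE is bit 5), the read-modify-write above amounts to the following. The function name is made up, and it works on a plain integer rather than the real register.

```c
#include <stdint.h>

/* Architectural bit positions in CR4; assumed to be what the
 * post's CR4_PSE / CR4_PAE constants expand to. */
#define CR4_PSE (1u << 4) /* page size extension */
#define CR4_PAE (1u << 5) /* physical address extension */

/* Mirrors "mov eax, cr4 / or eax, ... / mov cr4, eax" on an ordinary
 * value: every bit that was already set is preserved. */
static uint32_t cr4_enable_pae_pse(uint32_t cr4)
{
    return cr4 | CR4_PAE | CR4_PSE;
}
```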

Then, since my kernel is 64-bit, I want to enable 64-bit mode as soon as possible. Unfortunately, this is a little more complicated than I'd like. Let's look at the code first, and I'll explain it afterwards:

	; enable long mode and the NX bit
	mov ecx, MSR_EFER
	rdmsr
	or eax, (EFER_LM | EFER_NX)
	wrmsr

	; set cr3 to a pointer to pml4
	mov eax, boot_p4
	mov cr3, eax

	; enable paging
	mov eax, cr0
	or eax, CR0_PAGING
	mov cr0, eax
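
The constants in that snippet aren't shown either. Assuming they follow the architectural definitions, the sketch below lists them and spells out the condition the three steps build towards: long mode becomes active only once protected mode, paging, PAE and EFER.LME are all set. The names EFER_LME and EFER_NXE are the architectural names for what the post calls EFER_LM and EFER_NX. The function is a made-up model operating on plain integers.

```c
#include <stdint.h>

#define MSR_EFER 0xC0000080u /* MSR number of the extended feature enable register */
#define EFER_LME (1u << 8)   /* long mode enable */
#define EFER_NXE (1u << 11)  /* no-execute enable */
#define CR0_PE   (1u << 0)   /* protected mode (already set by the bootloader) */
#define CR0_PG   (1u << 31)  /* paging */
#define CR4_PAE  (1u << 5)   /* physical address extension */

/* Hypothetical model: returns 1 if this combination of register
 * values would put the CPU in long mode, otherwise 0. */
static int long_mode_would_activate(uint32_t cr0, uint32_t cr4, uint32_t efer_low)
{
    return (cr0 & CR0_PE) != 0
        && (cr0 & CR0_PG) != 0
        && (cr4 & CR4_PAE) != 0
        && (efer_low & EFER_LME) != 0;
}
```

This is why the order matters: cr3 has to point at valid tables before CR0_PAGING is set, because that final write is what flips the switch.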

First, I enable long mode by setting bits in EFER, a model-specific register on the CPU. The bits I set are long mode enable (64-bit features) and the NX bit, which lets the kernel mark pages as non-executable later on, for example in user processes. I then load cr3, the page table base register, with the address of a statically allocated page table. Long mode only actually becomes active once paging is switched on in cr0, which is the last step. NASM allows really cool "meta-programming" stuff, so I just define the complicated page table like this:

; paging structures
align PAGE_SIZE
[global boot_p4]
boot_p4:
	dq (boot_p3 + PG_PRESENT + PG_WRITABLE)
	times 271 dq 0
	;; include the high mapping for p3 (mapped with large pages)
	dq (high_p3 + PG_PRESENT + PG_WRITABLE)
	times 239 dq 0

boot_p3:
	dq (boot_p2 + PG_PRESENT + PG_WRITABLE)
	times 511 dq 0

boot_p2:
	dq (boot_p1 + PG_PRESENT + PG_WRITABLE)
	times 511 dq 0


;; ID map the first 512 pages (2 MB) of memory
boot_p1:
	;; pg starts at zero
	%assign pg 0
	;; repeat 512 times
	%rep 512
		;; store the mapping to the page
		dq (pg + PG_PRESENT + PG_WRITABLE)
		;; pg += 4096 (small page size)
		%assign pg pg+PAGE_SIZE
	%endrep


high_p3:
	dq (high_p2 + PG_PRESENT + PG_WRITABLE)
	times 511 dq 0

high_p2:
	;; pg starts at zero, like above. We fill in the entries statically
	%assign pg 0
	%rep 512
		dq (pg + PG_PRESENT + PG_BIG + PG_WRITABLE)
		;; pg += 2 MB (large page size, which most systems support)
		%assign pg pg+PAGE_SIZE*512
	%endrep

Now, that looks pretty complex, but really all it does is fill out data with increasing numbers. Let's look at where boot_p1 is defined. First I define a variable pg which starts at zero. I then loop 512 times (since there are 512 entries in a page table) and identity map the first 512 pages, which covers the first 2 MB of memory. I identity map these pages because once we switch paging on, every memory access goes through the virtual address space, and the pointers we already hold (the stack, the instruction pointer) need to keep pointing at the same memory. It can be done another way, but this is honestly the easiest. Everything else in that code is setting up pointers where they need to be with various bits set (the present bit, mostly) so everything works out. Because all of this is assembled statically, I can just use boot_p4 as the top-level page table.
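
If the NASM preprocessor loops are hard to read, here is the same table-filling logic as a C sketch. It is only an illustration of what the %rep blocks emit. The flag values are the architectural bit positions (present = bit 0, writable = bit 1, page size = bit 7) and are assumed to match the post's PG_* constants, and the function names are made up.

```c
#include <stdint.h>

#define PAGE_SIZE   0x1000ull
#define ENTRIES     512
#define PG_PRESENT  (1ull << 0)
#define PG_WRITABLE (1ull << 1)
#define PG_BIG      (1ull << 7) /* entry maps a large page instead of a table */

/* Equivalent of the boot_p1 loop: entry i maps the 4 KB page at i * 4 KB. */
static void fill_identity_p1(uint64_t p1[ENTRIES])
{
    uint64_t pg = 0;
    for (int i = 0; i < ENTRIES; i++) {
        p1[i] = pg | PG_PRESENT | PG_WRITABLE;
        pg += PAGE_SIZE;
    }
}

/* Equivalent of the high_p2 loop: entry i maps the 2 MB page at i * 2 MB. */
static void fill_large_p2(uint64_t p2[ENTRIES])
{
    uint64_t pg = 0;
    for (int i = 0; i < ENTRIES; i++) {
        p2[i] = pg | PG_PRESENT | PG_BIG | PG_WRITABLE;
        pg += PAGE_SIZE * 512; /* 2 MB step */
    }
}
```

The difference is that the NASM version runs at assembly time, so the finished tables are already sitting in the binary when the CPU needs them and no code has to run to build them.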

But what's the high_p3 symbol? This is a concept I stole from Linux and many other existing kernels. It maps the same physical memory, starting at address 0, but instead of identity mapping it, it maps it at the virtual address 0xffff880000000000 (using 2 MB large pages, so the 512 entries in high_p2 cover the first 1 GB). That address sits in the upper half of the virtual address space, which starts at 0xffff800000000000, and it lets the kernel always have physical memory mapped at the same fixed offset. All user memory lives in the lower half, and all kernel memory lives up here. (Interestingly, keeping kernel memory mapped like this leaves my kernel susceptible to Meltdown/Spectre...)
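
The "times 271 dq 0" padding in boot_p4 comes straight out of how a 64-bit virtual address is split into table indices, 9 bits per level above the 12-bit page offset. A small C sketch of that split (the function names are made up for illustration):

```c
#include <stdint.h>

/* Each paging level is indexed by 9 bits of the virtual address. */
static unsigned pml4_index(uint64_t va) { return (unsigned)((va >> 39) & 0x1FF); }
static unsigned pdpt_index(uint64_t va) { return (unsigned)((va >> 30) & 0x1FF); }
static unsigned pd_index(uint64_t va)   { return (unsigned)((va >> 21) & 0x1FF); }
static unsigned pt_index(uint64_t va)   { return (unsigned)((va >> 12) & 0x1FF); }
```

For 0xffff880000000000 the top-level index is 272, so the high_p3 entry has to be entry 272 of boot_p4: one entry for boot_p3, 271 empty ones, then high_p3, then 239 more to fill out the 512.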

So, we've got the page tables defined statically and loaded. What's next? Well, we want to jump to our main kernel function. All the kernel code outside of this init code lives in the high half, so we need a way to jump to it:

  ; leave protected mode and enter long mode
  lgdt [gdtr]
  mov ax, 0x10
  mov ss, ax
  mov ax, 0x0
  mov ds, ax
  mov es, ax
  mov fs, ax
  mov gs, ax

  jmp 0x08:.trampoline
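
The 0x08 and 0x10 in that snippet are segment selectors, which pack three fields into 16 bits: a descriptor index, a table bit (GDT or LDT) and a requested privilege level. A small C sketch of the decoding, with made-up names:

```c
#include <stdint.h>

struct selector_fields {
    unsigned index; /* which descriptor in the table */
    unsigned table; /* 0 = GDT, 1 = LDT */
    unsigned rpl;   /* requested privilege level, 0 = kernel */
};

/* Hypothetical helper: splits a 16-bit segment selector into its fields. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = (unsigned)(sel >> 3);
    f.table = (unsigned)((sel >> 2) & 1u);
    f.rpl   = (unsigned)(sel & 3u);
    return f;
}
```

So 0x08 selects GDT entry 1 and 0x10 selects GDT entry 2, both at privilege level 0. Entry 0 is the mandatory null descriptor.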

So here, I load a GDT (which I'm not going to go into, since it's a bit of a backwards-compatibility mess) that will work in 64-bit mode, set up the segment registers (ss gets selector 0x10, the rest are zeroed) and jump to a label called .trampoline using a far jump. The 0x08 means to load a certain code segment descriptor from the GDT (which specifies that we run in kernel mode). Then, in the same section, linked to low memory, .trampoline is defined as 64-bit code that just jumps to a 64-bit address:

; some 64-bit code in the lower half used to jump to the higher half
[bits 64]
.trampoline:
  ; enter the higher half now that we loaded up that half (somewhat)
  mov rax, qword .next
  jmp rax

We need this trampoline because you cannot jump directly to code located at or above 0xffff880000000000 from 32-bit code (that address doesn't fit in 32 bits :D). The .next label gets us to our kmain, which is defined somewhere in C/C++:

; the higher-half code
[bits 64]
[section .init_high]
.next:
  ; re-load the GDTR with a virtual base address
  mov rax, [gdtr + 2]
  mov rbx, KERNEL_VMA
  add rax, rbx
  mov [gdtr + 2], rax
  mov rax, gdtr + KERNEL_VMA
  lgdt [rax]

  ;; setup a 64bit stack (allocated statically somewhere)
  mov rbp, 0
  mov rsp, qword stack + STACK_SIZE

  ; clear the RFLAGS register
  push 0x0
  popf

  ; call the kernel!
  call kmain
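
The "gdtr + 2" offsets make more sense with the layout of the descriptor that lgdt reads: a 16-bit limit followed directly by the base address, with no padding. Here is a C sketch of that layout and of the rebasing step. The struct and function names are made up, and __attribute__((packed)) is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Operand of lgdt in 64-bit mode: 2-byte limit, then 8-byte base. */
struct __attribute__((packed)) gdtr64 {
    uint16_t limit; /* size of the GDT in bytes, minus one */
    uint64_t base;  /* address of the first descriptor */
};

/* The base really does live at byte offset 2, hence [gdtr + 2]. */
_Static_assert(offsetof(struct gdtr64, base) == 2, "base must follow the limit directly");
_Static_assert(sizeof(struct gdtr64) == 10, "descriptor is 10 bytes in 64-bit mode");

/* Mirrors the assembly: add the kernel's virtual offset to the base so
 * the GDT is referenced through the high-half mapping from now on. */
static void gdtr_rebase(struct gdtr64 *g, uint64_t offset)
{
    g->base += offset;
}
```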

So that's most of the assembly in the kernel. Almost everything else is written in C or C++ with some inline assembly. The only other thing I struggled with was the linker script. Linker scripts let you define where sections and symbols should end up in memory, and they're important when you want to do this kind of complicated high-half/low-half addressing. Here's what the important parts of the kernel.ld file look like:

OUTPUT_FORMAT(elf64-x86-64)
ENTRY(_start) /* define the entry symbol */

PAGE_SIZE  = 0x1000;
KERNEL_VMA = 0xffff880000000000;

SECTIONS
{
	. = 1M;
	_virt_start = . + KERNEL_VMA;

	.init : {
		*(.multiboot)
		*(.init)
		/* ... */
	}

	. += KERNEL_VMA;

	high_kern_start = .;
	.text ALIGN(PAGE_SIZE) : AT(ADDR(.text) - KERNEL_VMA) {
		*(.init_high)
		*(.text*)
		/* ... */
	}
    /* ... more sections ... */
}

The above code looks kind of complicated, but it really isn't. When linking object files together, the linker picks addresses for the various sections using this script. Multiboot kernels are conventionally loaded at the 1 MB mark, so I set up the .init section to be loaded there, with the .multiboot section first (the spec requires the header to sit within the first 8 KiB of the image). Then I add KERNEL_VMA=0xffff880000000000 to the location counter and define .text. The AT(...) directive says that the section should be loaded in low memory, but its symbols should have high addresses. So some symbol foo in high memory at address 0xffff880000512020 would be loaded into physical memory at address 0x0000000000512020, but every time you reference it you would get the high-half address.
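
The relationship between the address a symbol is linked at and the address it is loaded at is a single constant offset, which is exactly what the AT(ADDR(.text) - KERNEL_VMA) expression encodes. A minimal C sketch of that arithmetic follows. The helper names are made up and this is not code from the kernel.

```c
#include <stdint.h>

#define KERNEL_VMA 0xffff880000000000ull /* same value as in kernel.ld */

/* Load (physical) address of something linked at high-half address v. */
static uint64_t virt_to_phys(uint64_t v)
{
    return v - KERNEL_VMA;
}

/* High-half address through which physical address p is visible. */
static uint64_t phys_to_virt(uint64_t p)
{
    return p + KERNEL_VMA;
}
```

For example, a symbol linked at 0xffff880000100000 sits at physical address 0x100000, the 1 MB mark. This only works because the boot page tables map that physical range at KERNEL_VMA.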

Then everything can be built with only a couple of commands:

$ nasm -f elf64 boot.asm -o boot.o
$ ld -m elf_x86_64 boot.o -T kernel.ld -o kernel.elf

Then you put it in a disk image with GRUB and it should boot!