Last time, we talked about page tables, which help translate virtual memory addresses to physical addresses. We also looked at why most architectures structure them as sparse, tree-like data structures: other representations would use too much space to store the translation table. In the x86-64 architecture, the page tables are split into four levels.
It's important to note that every process has its own page table. In WeensyOS, the kernel code contains a global array ptable whose slots contain the addresses of the top-level (L4) page table for each process. This way, the kernel can locate and modify the page tables for every process.
Each L4 page table contains 512 8-byte entries that can either be empty or contain the address of an L3 page table. A page table has 512 entries because dividing the page size of 4096 bytes by 8 bytes per entry yields 512.
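In code, this structure is compact. Here is a simplified sketch, loosely based on the WeensyOS headers (the member list and the NPROC value are illustrative, not the exact sources):

    #include <cstdint>

    // One page table occupies exactly one 4096-byte page: 512 entries × 8 bytes.
    struct x86_64_pagetable {
        uint64_t entry[512];
    };

    constexpr int NPROC = 16;          // illustrative process limit

    struct proc {
        x86_64_pagetable* pagetable;   // this process's top-level (L4) page table
        // ... saved registers, process state, etc.
    };

    proc ptable[NPROC];                // ptable[pid] describes process pid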
The picture below shows an example of how the page tables for Alice's and Eve's process might be structured. The L4, L3, and L2 page tables each contain entries that are either empty or contain the address of the page containing a lower-level page table.
Why does the kernel have its own page table?
Modern computer architectures, including the x86-64 architecture, accelerate virtual address translation via page tables in hardware. As a consequence, it's actually impossible to turn virtual memory off and work directly with physical addresses, so even the kernel needs to use page tables to translate virtual addresses! It may use an identity mapping to make virtual addresses of key resources (such as the console, or special I/O memory) identical to their physical addresses, but it does need to go through the translation.
OS kernels, such as Linux and the WeensyOS kernel, can actually operate with different active page tables. For example, when a process makes a system call, the kernel executes in process context, meaning that the active page tables are those of the calling user-space process. But in other contexts, such as when handling an interrupt or at bootup, the kernel is not running on behalf of a userspace process and uses the kernel (process 0) page tables.
This structure means that there is 1 L4 page table, up to 512 L3 page tables (each of the 512 L4 PT entries can point to a single L3 PT page), up to 512² L2 page tables, and up to 512³ L1 page tables. In practice, there are far fewer, as the picture shows. Using all 512³ L1 page tables would allow addressing 512³ × 512 = 512⁴ ≈ 68 billion pages. 68 billion pages of 4096 bytes each would cover 256 TB of addressable memory; the page tables themselves would be 512 GB in size. Most programs only need a fraction of this space, and by leaving page table entries empty at high levels (e.g., L3 or L4), entire subtrees of the page table structure don't need to exist.
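These numbers all fall out of powers of two, and we can double-check them at compile time. A standalone sketch (not part of WeensyOS):

    #include <cstdint>

    constexpr uint64_t ENTRIES = 512;      // entries per page table
    constexpr uint64_t PAGESIZE = 4096;    // bytes per page
    constexpr uint64_t MAX_L1_TABLES = ENTRIES * ENTRIES * ENTRIES;   // 512³
    constexpr uint64_t MAX_PAGES = MAX_L1_TABLES * ENTRIES;           // 512⁴
    constexpr uint64_t ADDRESSABLE = MAX_PAGES * PAGESIZE;            // bytes covered
    constexpr uint64_t PT_STORAGE = MAX_L1_TABLES * PAGESIZE;         // L1 tables alone

    static_assert(MAX_PAGES == (1ULL << 36), "about 68 billion pages");
    static_assert(ADDRESSABLE == (1ULL << 48), "256 TB address space");
    static_assert(PT_STORAGE == (1ULL << 39), "512 GB of L1 page tables");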
The L1 page table entries are special, as they supply the actual information to translate a virtual address into a physical one. Instead of storing another page table's address, the L1 page table entries (PTEs) contain both part of the physical address (PA) that a given virtual address (VA) translates into, and also the permission bits that control the access rights on the page (PTE_P, PTE_W, and PTE_U; as well as other bits we don't cover in detail in this course).
The access permission bits are stored in the lowest 12 bits of the PTE, since those bits aren't needed as part of the physical address. Recall that the lowest 12 bits address a byte within the page, and that we use the offset (lowest 12 bits) from the virtual address directly; therefore, the lowest 12 bits of the page's physical address are always zero, making them available for metadata storage. (The top bit, i.e., bit 63, is also used for metadata: it represents the "execute disable" or NX/XD bit, which marks data pages as non-executable.)
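Splitting a PTE into its two halves is therefore just bit masking. The bit positions below are the standard x86-64 ones; the helper function names are made up for illustration:

    #include <cstdint>

    constexpr uint64_t PTE_P  = 1ULL << 0;    // present
    constexpr uint64_t PTE_W  = 1ULL << 1;    // writable
    constexpr uint64_t PTE_U  = 1ULL << 2;    // user-accessible
    constexpr uint64_t PTE_XD = 1ULL << 63;   // execute disable (NX/XD)

    // Low 12 bits hold permission metadata...
    constexpr uint64_t pte_flags(uint64_t pte) {
        return pte & 0xFFF;
    }

    // ...and bits 12 to 47 hold the physical address of the page (which is
    // itself page-aligned, so its low 12 bits are always zero).
    constexpr uint64_t pte_pa(uint64_t pte) {
        return pte & 0x0000'FFFF'FFFF'F000ULL;
    }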
The picture below zooms in on the L1 page table used in translating VA 0x10'0001 to PA 0x8001. Note that the indexes into the L4, L3, and L2 page tables are all zero, since the upper bits of the VA are all zero. (The full 48-bit VA is 0x0000'0010'0001.) The offset bits (lowest 12 bits) correspond to hexadecimal value 0x001, and they get copied straight into the PA. The next nine bits (bits 12 to 20) are, in binary, 0b1'0000'0000 (hex: 0x100, decimal 256). They serve as the index into the L1 page table, where the 256th entry contains the value 0x8 (0x0'0000'0008 as a full 36-bit value) in bits 12 to 47. This value gets copied into bits 12 to 47 of the PA, and combined with the offset of 0x001 results in the full PA of 0x0000'0000'8001.
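The hardware performs this walk with dedicated logic, but the algorithm itself is simple enough to sketch in software. Everything here is hypothetical scaffolding (in particular, pa_to_kptr stands in for however a kernel turns a physical address into a usable pointer):

    #include <cstdint>

    uint64_t* pa_to_kptr(uint64_t pa);   // assumed: kernel-visible pointer for a PA

    // Walk the four levels for virtual address `va`, starting from the
    // physical address of the L4 table. Returns the PA, or -1 if unmapped.
    uint64_t translate(uint64_t l4table_pa, uint64_t va) {
        uint64_t table_pa = l4table_pa;
        for (int level = 4; level >= 1; --level) {
            int shift = 12 + 9 * (level - 1);      // L4 uses bits 39-47, L1 bits 12-20
            int index = (va >> shift) & 0x1FF;     // 9-bit index at this level
            uint64_t pte = pa_to_kptr(table_pa)[index];
            if (!(pte & 1)) {                      // PTE_P clear: no mapping
                return (uint64_t) -1;
            }
            table_pa = pte & 0x0000'FFFF'FFFF'F000ULL;  // next table's (or page's) PA
        }
        return table_pa | (va & 0xFFF);            // splice the 12-bit offset back in
    }

On the example above, this walk would index entry 0 at L4, L3, and L2, entry 256 at L1, and return 0x8001.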
Page tables are the fundamental abstraction for virtual memory on modern computers. While you don't need to remember the exact details of the x86-64 page table structure, you should understand why the structure is designed this way, and how it works – for example, you might get asked to design a page table structure for another architecture in the quiz!
Finally, one important detail of virtual address translation is that user-space processes don't need to switch into the kernel to translate a VA to a PA. If every memory access from user-space required a protected control transfer into the kernel, it would be horrendously slow! Instead, even though the process page tables are not writable or accessible from userspace, the CPU itself can access them when operating in user-space. This may sound strange, but it works because the CPU stores the physical address of the L4 page table in a special register, %cr3 on x86-64. This register is privileged and user-space code cannot change its value (trying to do so traps into the kernel). When dereferencing a virtual address, the CPU looks at the L4 page table at the address stored in %cr3 and then follows the translation chain directly (i.e., using logic built into the chip, rather than assembly instructions). This makes address translation much faster – however, it turns out that even this isn't fast enough, and the CPU has a special cache for address translations. This is called the Translation Lookaside Buffer (TLB), and it stores the physical addresses for the most recently translated virtual addresses.
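For the curious, here is roughly what reading that register looks like from kernel code, as a sketch using GCC-style inline assembly (again: this access is privileged, so it only works in the kernel):

    #include <cstdint>

    uint64_t read_cr3() {
        uint64_t value;
        asm volatile("movq %%cr3, %0" : "=r"(value));  // privileged register read
        return value;  // physical address of the active L4 table (plus some flag bits)
    }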
Processes are how we run programs on our computers, and our computers often use several processes to get things done. For example, simple terminal commands such as ls or grep each run as a new process that produces some output and then exits.
In WeensyOS, the kernel starts four processes at startup, but (at least until step 5 of Project 3), there is no way for a user-space process to start another user-space process. A realistic operating system clearly needs to be able to do so.
Many Unix-based operating systems – which include Linux, the BSD line of operating systems, and Mac OS X – use a system call named fork() for process creation. fork elicits controversy even after nearly 50 years of use, and it's not the only way to create processes (Windows, for example, has a different approach). But it is how millions of computers and devices do it!
fork() has the effect of cloning a user-space process. For example, this program (fork1.cc) calls fork() ("forks"), prints a message, and exits:
#include "helpers.hh" int main() < pid_t p1 = fork(); assert(p1 >= 0); printf("Hello from pid %d\n", getpid()); >
How many times will the message be printed when we run it? It is printed twice:
    $ ./fork1
    Hello from pid 19244
    Hello from pid 19245
This happens because the call to fork() enters the kernel, which clones the process, and then continues user-space execution in both clones. Both processes execute the rest of the program, and thus both execute the printf function call. Note that the processes have different process IDs, as evidenced by the fact that the getpid() system call returns different values.
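The two clones can tell themselves apart, because fork() returns different values in each: 0 in the child, and the child's process ID in the parent. Here is a variant of fork1.cc that uses this (a sketch, assuming the same helpers.hh as above):

    #include "helpers.hh"

    int main() {
        pid_t p1 = fork();
        assert(p1 >= 0);   // fork() returns -1 on failure
        if (p1 == 0) {
            printf("Hello from child pid %d\n", getpid());
        } else {
            printf("Hello from parent pid %d (child is %d)\n", getpid(), p1);
        }
    }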
Remember that execution continues in the same program for both the parent and child process (although their execution can diverge). If the child forks again, it can create further processes (see fork2.cc, sketched below, which ends up with a total of four processes). But with fork() alone, an OS could only ever run processes with the exact same code! That's not the case in reality, of course – next time, we'll learn about how to execute a different program.
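fork2.cc itself isn't reproduced here, but a minimal program consistent with that description might look as follows: two fork() calls, each executed by every process that reaches it, yield 2 × 2 = 4 processes.

    #include "helpers.hh"

    int main() {
        pid_t p1 = fork();
        assert(p1 >= 0);   // after this line, 2 processes run the code below
        pid_t p2 = fork(); // both of them fork again...
        assert(p2 >= 0);   // ...so 4 processes reach this point
        printf("Hello from pid %d\n", getpid());
    }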
Today, we reviewed page tables and why their structure makes sense as a general, flexible, and performant abstraction for virtual memory. In Project 3, you are responsible for setting up the virtual memory mappings stored in the page tables for user-space processes, though you won't need to understand the specific details of the x86-64 page table structure.
We also talked about how the fork() system call allows a user-space process to start another process by essentially cloning itself. The two processes, called "parent" and "child", continue executing from the same place in the code, and they start with identical memory mappings (though these mappings are backed by different physical memory pages, for the most part). But the processes can evolve independently after the fork() system call returns. In Project 3, you will implement handling of the fork() system call in the WeensyOS kernel!