SF Senior Mod
Joined: 21 Sep 2003
Posted: Fri Feb 27, 2004 8:59 pm Post subject: Memory management explained
Memory management explained
Copyright 2004, 2006 Israel G. Lugo
Ok, what you're witnessing here is Virtual Memory Management at its best. You see, in virtual memory capable OSes (like Win9x/NT and *NIX), each process's address space is completely separate from the others. Well, not exactly completely; there are some exceptions. I'll try to explain without delving too deep into implementation details (I'll try to be as clear as I can, but I haven't slept in 24 hours, so I apologize if this turns out confusing):
When the OS launches a given process that runs with a flat memory model (which covers most 32 bit programs), it creates a linear address space for it (also known as virtual address space). This will be the linear memory range that the new process can access. Now, what does linear (or virtual) actually mean?
It means that the addresses you access within your program (e.g. 0x00401234 or whatever) are not real physical addresses. That is, when you do something like
*((int *)0x12345678) = 9;
you're not really writing to the physical address 0x12345678. You're writing to your virtual address 0x12345678, which is mapped into some physical address by the OS.
Ok, so you may be wondering, what's the point of all this? The point is, amongst other things, that each process has its own independent address space - and as such, it can use whatever addresses for its internal variables without having to worry if they are already used by some other process. Can you imagine what it would be like creating a program to run on a multitasking system without any sort of abstraction layer over the physical RAM? You'd have to know each and every address used by each and every process the system is ever going to run, to make sure that you don't use any of them or risk having two completely independent processes garble each other because their programmers happened to use the same addresses for variables. Kind of hard on the programmers...
Now, how does it all really work? Ok, let's get down and dirty with the memory managers. First of all, you need to understand a simple but important concept, which is the concept of paged memory. A page is nothing more than a small consecutive portion of memory. In Intel x86 systems each page is 4KBytes (= 4096 bytes). Paged memory is a way of managing memory (another way is segmented memory, which is somewhat harder on the programmer, although simpler to implement at the OS level). Nowadays the most used by far is paged memory, as it's far more practical for the programmer and much more flexible. Windows supports both models, as the underlying 32 bit CPUs support them as well (they had to be backwards compatible with DOS, which only used segmented memory management). Basically, what Windows does so it doesn't have to bother with the different segments is set the segment registers to span the whole memory range, so they all overlap and mean the same thing.
Ok, enough history. With paged memory, the OS divides your physical memory into pages, just as it divides your physical hard disk into clusters. Why? Well, it's just not practical dealing with individual bytes when you have, say, 256 million of them. Also, it would take up more space to store each byte's individual properties than the bytes we're supposed to be managing (the structures needed to manage 256 million bytes would take up more than 256 million bytes). So, we have the physical memory divided into 4Kb pages, and that's what we manage. Now, we need a way to access individual bytes within a page. We do that with the displacement (or offset) field. Fancy names aside, all of this means: when you access physical address 0x12345678, you're actually accessing byte 0x678 within page 0x12345. Get it? This is obviously for a 32-bit system with 4Kb pages. How did I know which portion of the address was the page and which portion was the displacement? Easy. We're talking about a 4Kb page system, so if each page is 4096 bytes long, how many bits do we need to index it? 12 (2^12 = 4096). So, the 12 rightmost bits in an address represent the displacement inside the page. The rest represent the page itself.
So, now we have a way of translating addresses into pages and back. We're still talking about physical pages, though. Actual RAM addresses. Time to implement the abstraction layer. For that we'll need a little something called a page table. This is just a table of page table entries (redundancy rules), but I'll go into more detail about that later. What we want to achieve is to have each process with its own page table, with the table holding the mapping between the process's virtual pages and the actual physical pages where they are located. So, exactly what is a page table entry (or PTE)? Basically, it's a struct with several fields. Its index in the page table (PT) indicates the virtual page it represents (so PT[3] would give us the PTE for virtual page 3, just as PT[0x00400] would give us the PTE for virtual page 0x00400). Actually it's a bit more complicated than that: since we can have up to 0xFFFFF pages and we don't want to be indexing a table with 1 million entries, what Intel CPUs (and consequently, Windows) do is have a table of tables, in a way similar to how ext2 inodes handle large files with many blocks. But moving on, we have a PTE for each virtual page that the process can access. Within that PTE are stored several fields, including the actual physical page that the virtual page maps to, its permission bits (readable, writeable, accessible from ring 3, etc), the present bit (which tells the OS whether the page is actually present in physical memory or has been swapped out to disk), the dirty bit (which tells the OS if the page's contents have changed since the last time it was written to disk), and the disk cluster in which its contents were stored (in case it's been swapped out).
Basically speaking, as it is, the PT is checked for each and every memory read/write every process does. This is done by the CPU itself (assuming it supports paged memory management, otherwise forget about all this, but x86 has supported it since the 386). Naturally this isn't ideal since it's very slow: we'll be adding 1 or 2 memory reads for each memory read you do, which means we're effectively multiplying the number of memory reads your process does by at least 2. So a solution came, and it was in the form of an on-die cache, better known as the Translation Lookaside Buffer or TLB. This is a special cache within the CPU that stores the last PTEs it's encountered. Its size varies with each CPU; IIRC the Intel Pentium 4 has 2 TLBs (one for instruction addresses and one for data addresses), with about 64 entries each (may be 128, I forget right now). With this we drastically reduce the need for bookkeeping memory reads, since by the principle of locality of reference this should cover a good proportion of the cases.
Ok so this is what happens when a process tries to read/write from a given virtual address (the only thing they know about are virtual addresses), let's consider the following example:
*((int *)0x00401234) = 5;
When the CPU encounters this instruction, it will have to find out the actual physical address where it has to write. To do that, it will take the virtual address 0x00401234 and find out which virtual page it belongs to. Assuming this is a 4Kb page machine (like the x86), the virtual page will be 0x00401 (just strip away the least significant 12 bits, remember). The displacement within the page is 0x234. Ok, now to find out where virtual page 0x00401 is stored. For that, the CPU will check its TLB and see if it finds it there (let's assume it doesn't, to make things interesting). Since it's not there, it will have to go check the page table for the current process. It gets the base address of the page directory of the current process from one of its registers (CR3) and goes there to check in which page table this particular page belongs (it's the whole table of tables concept; as I explained above, it's not practical having to index a table with 1 million entries). The details of how it reaches the PTE aren't important. The point is it will go search the page table for an entry for virtual page 0x00401 (I use the word "search" in a broad sense; it won't actually do a linear search, it's a matter of calculating the index of the PTE for this virtual page, if it exists). Let's assume it finds it - if it doesn't, it will throw an unresolvable page fault, as the process is trying to access an address that doesn't exist in its context, and Windows will give you a nice "This program has committed an access violation" message. Ok, so the CPU just found the PTE for page 0x00401. First it will check its permission bits, to see if we have write access. Let's assume we do. Next, it will check the Present bit, to see if the page is currently loaded in physical memory or if we have to load it from disk.
If the Present bit is 0 (meaning the page isn't in physical memory), the CPU will generate a page fault, bringing into action the OS's memory manager, which will read the PTE, find out in which disk block the page is stored, read it from disk, put it somewhere in physical memory and update the PTE to contain the new physical page where it's stored, as well as setting the present bit to 1. Ok, so now it's present in memory. Next, the CPU will read the physical page base address from the PTE. Let's assume this virtual page was stored in physical page 0x01F20. Now the CPU has everything it needs: a physical base address, and a displacement. All it has to do is join the two and it will have the physical address where it has to write: 0x01F20234. From then on, it just stores the 5 we originally wanted to store in there, and that's it.
Whether or not you're superuser has nothing to do with it; this is just memory management and the way it works. You *could* tap into other processes' address space from ring 0 (after all, the memory manager itself has to do it), say from a device driver - either you code it by hand (read CR3, do the math to find the PTE, read from the PTE, etc) or use one of the ring 0 memory manager functions that will allow you to do such things. In either case you shouldn't need to go so deep for something as simple as IPC (inter-process communication); just use shared memory - check out the MS SDK reference for the CreateFileMapping and MapViewOfFile APIs. Or, if you want to peek into the memory of a process that you didn't code yourself (and as such doesn't open your file mapped object), check out the ReadProcessMemory and WriteProcessMemory APIs.
Regarding shared memory, as you probably guessed by now, all it entails is simply having virtual addresses on different processes pointing to the same physical pages. The virtual addresses for each process don't even have to be the same (as you can see in the MapViewOfFile documentation).
There are more interesting things to learn about, such as the copy-on-write bit (not implemented in Win9x) and the execute permission, much in vogue now - had st00pid Intel thought of actually implementing a verification of the execute permission bit, buffer overflows would simply not be an issue... The stack has no business being executable in the first place. A buffer overflow attempt would just cause an access violation when the CPU tried to execute code from a non-executable page (such as the stack or data). Of course permissions can be changed, for example by using VirtualProtect or VirtualProtectEx, but to do that the attacker would have to be able to execute code in the first place. Oh well, perhaps some day; at least AMD was smart enough to include it in their new CPU.
Hmm this turned out rather big, I hope I managed to keep it clear enough. Any questions, fire away
Keywords: memory management, Israel G. Lugo, virtual memory, linear address, physical address, TLB, PTE, PT, pagefile, swapfile, page fault, access violation, process address space, CPU, operating system
Last edited by capi on Wed Jul 19, 2017 2:47 am; edited 9 times in total