NX flag in CPUs: a very old "new idea"

Networking/Security Forums -> Programming and More

Author: capi    Location: Portugal    Posted: Thu May 12, 2005 2:02 pm    Post subject: NX flag in CPUs: a very old "new idea"
I'm posting here what started out as a bit of a rant in one of the hidden sections of the forum, but eventually evolved into something that might actually have some interest : )

The whole concept of having an execute flag for memory pages has been in existence for ages. It is not a new, original, or revolutionary idea, not especially difficult to implement, and would have required minimal to no changes in actual CPU architecture for any CPU calling itself capable of paged memory management (read 386 for instance). The only novelty here is certain lame manufacturers of very widely used CPUs actually bothering to provide hardware support for it, after decades of intentional neglect ("we can't be bothered", as opposed to "we didn't know").

Memory page permissions are as old as paged memory management; this is not a new concept in OS design by any stretch of the imagination. OSes (even Winblows and obviously *nix) have explicitly offered support for the read, write and execute permissions for decades, alongside all the other property bits paged memory uses. The problem is the x86 folks only implemented hardware support for read and write permissions. Execute permissions were left out, in a misguided effort to spare some transistors.

For example, as you all know, a userspace program can't write to kernel memory; or to a constant segment; or to memory otherwise set to read-only. That's why you get a SIGSEGV when you try to do something like char *s = "Blah"; s[0] = 'b'; (unless your compiler/linker doesn't place strings in a constant segment). This is due to specific CPU hardware support that checks page permissions as specified by the OS, before accessing any given memory address. However, thanks to said CPU support being incomplete in the most widespread architectures, and only checking read/write permissions, you can execute any code from anywhere without the OS being able to stop you, or even be aware of it. You can VirtualProtect(..., PAGE_READWRITE, ...) all you want; you will still be doing exactly the same thing as VirtualProtect(..., PAGE_EXECUTE_READWRITE, ...). The result? Well, we all know that, now don't we?
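To make that concrete, here is a minimal sketch in C of the check a pre-NX 32-bit x86 CPU effectively applies to a user-mode access. The three flag bits are the classic present, read/write and user/supervisor bits of a 32-bit page table entry; the type and function names are made up for illustration, and this only models the logic, it is not how any real MMU is written.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a classic 32-bit x86 page table entry. */
#define PTE_PRESENT  (1u << 0)  /* page is mapped */
#define PTE_WRITABLE (1u << 1)  /* 0 = read-only, 1 = read/write */
#define PTE_USER     (1u << 2)  /* 0 = supervisor only, 1 = user allowed */

/* Hypothetical names for this sketch, not taken from any real header. */
typedef enum { ACCESS_READ, ACCESS_WRITE, ACCESS_EXECUTE } access_t;

/* Returns true if a user-mode access is allowed, false if it would fault. */
static bool legacy_user_access_ok(uint32_t pte, access_t kind)
{
    if (!(pte & PTE_PRESENT))
        return false;               /* not mapped: page fault */
    if (!(pte & PTE_USER))
        return false;               /* kernel-only page */
    if (kind == ACCESS_WRITE && !(pte & PTE_WRITABLE))
        return false;               /* write to a read-only page */
    /* There is no bit to consult for ACCESS_EXECUTE: a code fetch is
     * treated like any other read, so readable means executable. */
    return true;
}
```

With this model, a plain read/write data page passes the ACCESS_EXECUTE check just as easily as a code page does, which is the whole problem.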

Regarding whether this would only affect the stack or not, there is no technical reason for the effect to be limited to the stack. Page permissions are applicable to all memory pages, whether they be part of a process' stack segment, data segment, code segment or whatever. It just depends on the OS being sensible enough. The loader should not give execute permission to any segment other than the code segment. And it should not give write access to the code segment (most OS loaders already do at least that part). Data and stack segments have no business being executable, at least not by default. Those two right there would already take practically all the fun out of heap and stack based overflows; you can't execute from anywhere outside the code segment (as physically assured by CPU support), and you can't overflow into the code segment as it is not writeable (again as physically assured by CPU support).
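As a rough sketch of that loader policy (names and the segment table are hypothetical, not lifted from any real loader), the rule boils down to "no segment is both writable and executable, and only code is executable":

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical permission bits for this sketch. */
#define PERM_R 0x1u
#define PERM_W 0x2u
#define PERM_X 0x4u

typedef struct {
    const char *name;
    unsigned    perms;
} segment_t;

/* The default layout argued for above. */
static const segment_t default_layout[] = {
    { "code",  PERM_R | PERM_X },
    { "const", PERM_R          },
    { "data",  PERM_R | PERM_W },
    { "stack", PERM_R | PERM_W },
};

/* Returns true if no segment is simultaneously writable and executable. */
static bool layout_is_safe(const segment_t *segs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if ((segs[i].perms & PERM_W) && (segs[i].perms & PERM_X))
            return false;
    }
    return true;
}
```

On hardware that ignores the execute bit, every readable segment behaves as if PERM_X were set, so the data and stack rows silently fail this check no matter what the loader asked for.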

Sure, you can still DoS a poorly constructed app; just because you can't execute from the stack doesn't mean you can't blow the return address to hell and send the process on a one way ticket to coredump-land when the function returns. Or make it return into an already existing code address (original code, can't just inject some downloaded code and expect it to run), so you can still have some flow control. As an aside, even though you're limited to existing code, this is _much_ more powerful than some may think, and it is where I'd put my money for future exploit development on NX systems. Also, you can still control data in a poorly coded app (NX would not take away the possibility to overflow, it would just take away the possibility to execute from overflow). Ok, so it would not wipe exploit coders from existence. Still, it would clearly be more secure than what we have now, all of which is due to incomplete hardware support ever since the early 386 days.

To put this in terms that would be more familiar to a sysadmin, I'll use an analogy: think of your computer's memory as if it were a box's filesystem. Now, picture yourself trying to secure the box when the only permissions you can set or clear are read and write. Execute permission is not implemented (that is, it is there but not verified), everything is executable. And that means everything; every file, every folder, for anyone (as long as they can read it). Also, assume that certain folders and almost all data files have to be world writeable for the box to function properly (run normal apps without some _major_ hacking). You can see where this is going, right?

Now, on top of that, picture that the filesystem actually fully supported +x and -x. Ever since they invented the damn thing. Chmod did everything right, the permission bits were there, only problem was that the kernel developers thought "oh we can spare a few lines of code if we don't verify execution permissions on files before we run them".

Now, picture that all of a sudden, after a couple of decades of this, kernel developers suddenly came out and said "Hey everybody, we've got an incredible new technology! It's a great idea! We're actually going to verify execution permissions for files before running them!" Then picture that for some reason, people actually applauded them for this, instead of laughing in their face and damning them to some really hot place to be tortured by little red people carrying pitchforks. Picture all that, and you'll basically be where we are now with the memory/execute/no-execute thing. Just replace "filesystem" with "memory", "chmod" with "your OS handling memory" and "kernel developers" with "(some, especially x86) CPU manufacturers".

Anyway, I'll stop ranting now, hope this was entertaining at least...

Author: Stormhawk    Location: Warwickshire, England, UK    Posted: Fri May 13, 2005 12:41 pm
An enlightening article, capi. A most interesting read!

Author: capi    Location: Portugal    Posted: Fri May 13, 2005 1:06 pm
Thank you, glad it was of interest to someone :D

Author: Stormhawk    Location: Warwickshire, England, UK    Posted: Sat May 14, 2005 2:09 am
So this is now being marketed as something new?

Author: capi    Location: Portugal    Posted: Sat May 14, 2005 3:01 am
Oh yes, it's the greatest novelty! Brand new feature in the AMD 64 chips, an incredible breakthrough that Intel would soon add to their own 64-bit chips, following in the footsteps of the initial innovators. And I hear after that, they're going to try putting round wheels on cars, too! Seems some of their brilliant minds came up with this other innovative concept, they're planning to start phasing out the square-wheeled cars anytime in the near future.

Searching a bit came up with the following example: http://hardware.earthweb.com/chips/article.php/3358421

Funny how they present basic concepts inherent to paged memory management such as page faults, in an attempt to associate them with this "groundbreaking discovery". "All new Coke bottle! Now it comes with an opening! And it uses the laws of gravity, formulated by Isaac Newton, to let the liquid out!!"

Incidentally, in case you're wondering about the need to update your OS for it to work with those particular architectures' "new" feature, it is obviously because the binaries for those architectures never actually had the code for it since the CPUs did not support it. I assure you it is most certainly not a new concept by any stretch of the imagination. Heck, just get your hands on the Win95 SDK and go look up the VirtualProtect reference, see the possible values for the protection mask. Hell, it goes all the way back to paged memory management - that's 386 in Intel speak. And even before that, just think *nix on non-x86 platforms...

Author: mmelton    Posted: Fri May 20, 2005 2:15 am
(forgive me, I went Linux-specific here, as it's very hard to explain with closed OSes in mind... ring0 is a mystery!)

As a microprocessors student, I felt I had to reply to this.

capi wrote: "It is not a new, original, or revolutionary idea, not especially difficult to implement, and would have required minimal to no changes in actual CPU architecture"

It's not a new idea, you're right. But it was a good idea, and today, remains a nice idea.

You claim, though, that it would have required minimal changes to actual CPU architectures. That, however, I must disagree with. Let's take the x86 and the Alpha, the two main players in memory execution. Why on earth would you want to execute code in memory?

Because you have little memory? Yes!

The stack, if we take the ebp as the floor, grows down.


    mov $0x5, %eax      # 5 bytes
    mov %eax, %ebx      # "brk" 5 bytes more
    mov $0x2d, %eax     # 45: sys_brk
    int $0x80

Now our ebp is ebp+5, and we have 5 more bytes to use. Yay! 5 bytes isn't a lot, but when we start to think about the limitations of the stack, we slowly realize it's a lot of memory!

The stack on an x86 Linux machine is 8 KB (2 x 4096) per process, i.e. u8 supervisor_stack[0] in the Linux task_struct structure. You can't swap out the stack, that I know of. On x64 it's twice that, 16 KB. (As of 2.6 there's a newish abstraction layer aimed at 64-bit processors, so without using threadinfo you're not entirely aware of the extra space.)

The stack isn't paged, as it would be impossible to virtualize the VM's stack. (lol)

If we didn't have the stack, naturally, processes would use very costly calls in order to fetch memory stored all around external memory, every time there was a task switch. Completely missing the point of the L1 and L2 caches.

So. We need the stack. But what does that have to do with executable memory pages? It's simple. What we can't fit into the limited 8/16k unpaged stack space, we fit into higher memory with malloc.

malloc memory can be protected with mprotect. Virtual Memory Areas (process context) can also be created with specific protections with mmap, and later altered with mremap.

Most modern OSes implement (read: clone) the x86 memory execute ability so that they too can reap the benefits I'll discuss later. This means all the ports comply with all the MAP_ and PROT_ attributes thrown at them. That means they support PROT_EXEC and thus VM_EXEC.

So, most computer operating systems will execute memory locations. But why? And why did it take MS so long to 'invent' SP2's software feature?

Simply put: it's easier. It's easier assigning the stack pointer to some address that the GMT (global memory table) knows about, and the processor can run off and start executing code at that address. We want to run code at an address anywhere! Running code anywhere means we can evade copying data from a remote non-execution location to the stack, launch, and copy again. It takes the number of costly hits to memory down substantially.

On the ARM7 (a RISC processor), it takes something like 20 cycles to read from external memory - you've got to set registers, and call stuff, then reset r15... argh! It's a nightmare. On ARM and Intel computers and other embedded systems where you've got fast-access fixed memory you can just LOAD r15, #DEADBEEF, and the external memory that is hardware-mapped to the address DEADBEEF is executed. The Game Boy Advance does this! (weee ARM) So you don't have to do a bunch of memory-jostling operations until it's somewhere where you can execute.

So, where stack is already a premium and far too small to hold say a kernel image, we have to use memory execution.

The whole 'execute where I say' idea has recently been in the Linux kernel news. Execute In Place (http://thread.gmane.org/gmane.linux.kernel/302002) attempts to provide the kernel with the ability of executing mapped memory from block devices eventually leading the filesystems.

In order to implement the NX flag, what kind of changes do you have to make? In software, you have to rewrite all your malloc functions to turn it on. Similar to MS now zeroing all szPassword strings before freeing them, you have to rewrite so much code to make use of the NX flag. Microsoft took so long with SP2 because of this.

In hardware, you have to make sure there are several thousand logic tests within the CPU microcode. You have to tag every memory table entry, and you have to track it.

What you're asking here is *HARD* to do by any means.

Hth :)


Author: capi    Location: Portugal    Posted: Fri May 20, 2005 4:36 am
Hello mmelton, welcome to SFDC! Glad to have you with us, hope you'll stick around :D

First, thanks for commenting on the post. Always nice to share some thoughts.

Ok, now, we need to clarify a few misunderstandings here: I never said it wasn't a good idea; I said it wasn't a new idea. I'm all in favor of it being implemented, my point is it should have been implemented some 20 years ago, along with the whole paged memory concept and the read and write permissions which were implemented. Execution permission, as you agreed, is old, and it having been left out since has facilitated most of what we see today exploit-wise.

Also, I never said "ban execute from memory"; that would be a mistake and, as you pointed out, lead to wasteful, excessive copying of code from here to there every time we wanted to execute. No, what I said was "implement per-page execute permissions, just like they implemented read/write permissions, ring3 access permission, dirty bit, etc etc".

I'm talking PTEs (Page Table Entries), TLB (Translation Lookaside Buffer), etc. Paged-memory capable CPUs such as the 386 and onward already have the layout to deal with per-page memory permissions such as read and write access. Reading from a range of memory? Sure, check the TLB, raise a page fault if it's not there, etc until you get a permission mask, check permissions, go ahead and read if it's ok, throw up and die if it's not. The same thing could be extended to execution permission: just use the same mechanisms as for data fetch when prefetching code.

Let's take the Windows PE loader as an example (ironically, I am far more familiar with Winblows' internals, including ring-0, than with Linux; something which I will be fixing as soon as I get some time lol). You open a binary, you've got a few sections. Code section, data section, const section, etc. What do you do? You create a process object. Allocate yourself some pages, copy the code, the data, the const, etc over there, set permissions as appropriate (i.e. code - r/x, data - r/w, const - r, etc). Allocate a couple more pages, stick a guard page on the bottom (or the top, depending on how you look at it), give it r/w permission. Set the hardware context, create internal structures (add thread to internal thread list, set up security context, etc), and finally you bring the thing to life by placing it on the active list for the scheduler to eventually schedule it in (pass the redundancy). For Linux, pretty much the same thing.
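The "set permissions as appropriate" step can be sketched as a mapping from a PE section's characteristics flags to a page protection constant. The numeric values below are the ones the Win32 headers define for the IMAGE_SCN_MEM_* and PAGE_* names, copied here so the sketch builds without windows.h; the function itself is hypothetical and simplified (the real loader has more cases, copy-on-write among them), so take it as an illustration only.

```c
#include <stdint.h>

/* Section characteristics flags from the PE format. */
#define IMAGE_SCN_MEM_EXECUTE 0x20000000u
#define IMAGE_SCN_MEM_READ    0x40000000u
#define IMAGE_SCN_MEM_WRITE   0x80000000u

/* Page protection constants as used by VirtualProtect(). */
#define PAGE_NOACCESS          0x01u
#define PAGE_READONLY          0x02u
#define PAGE_READWRITE         0x04u
#define PAGE_EXECUTE           0x10u
#define PAGE_EXECUTE_READ      0x20u
#define PAGE_EXECUTE_READWRITE 0x40u

/* Hypothetical helper: pick a page protection for a section. */
static uint32_t section_to_page_protection(uint32_t characteristics)
{
    int x = (characteristics & IMAGE_SCN_MEM_EXECUTE) != 0;
    int r = (characteristics & IMAGE_SCN_MEM_READ) != 0;
    int w = (characteristics & IMAGE_SCN_MEM_WRITE) != 0;

    if (x) {
        if (w) return PAGE_EXECUTE_READWRITE;
        if (r) return PAGE_EXECUTE_READ;
        return PAGE_EXECUTE;
    }
    if (w) return PAGE_READWRITE;
    if (r) return PAGE_READONLY;
    return PAGE_NOACCESS;
}
```

A typical code section (read plus execute) comes out as PAGE_EXECUTE_READ and a typical data section (read plus write) as PAGE_READWRITE. The distinction has been expressible in the API all along; only the hardware enforcement of the execute half was missing.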

As noted above, I can't speak with much authority on the Linux kernel implementation, but I can tell you that on Windoze a process' stack is pageable like anything else. All ring-3 components of a process are pageable, including threads' stacks. You stick a guard page on the end, and you've got yourself a growing stack. As for task switching, that's obviously done within kernel threads at ring-0, with the dispatch component itself being mostly non-pageable (as is the actual memory manager, for obvious reasons).

The point here isn't having one small area of non-paged memory where you can execute and everything else non-exec, forcing you into expensive code copies every time you task switched, etc. The point is using what is already there, the same structure and layout that is already in use, but adding execution permissions so that only the code segment in a process is executable. And, of course, make it so the user can change permissions by the normal syscalls if he really wants to exec from the heap; again, that mechanism is already in place for the read and write permissions. Just need to set another bit in the PTE for whatever page the user referenced in the syscall. In fact, the Win32 API has had the execute permission bit for a good while (as far back as Win95 and probably further back, but I can't verify with the Win3.1 SDK right now); VirtualProtect(), the ring-3 API for changing the permissions of a range of pages, has had specific flags with and without execute (which were obviously doing the same thing, as there was no hardware support).

The changes from a ring-3 perspective would have been null. The developer doesn't have to think about setting read and write permissions when he malloc's. The kernel will do it by default when malloc eventually calls whatever API the OS may have to grab a page (e.g. LocalAlloc/GlobalAlloc for Win32, mmap and friends for *nix, etc). malloc is C library; there is no need to change it for a change in internal OS memory manager implementation. The existence of the eXecute permission bit requires no change at C library level. It might have required some rewrite of the API itself to deal with the no-longer-no-op case of setting or clearing the eXecute bit, but nothing particularly different from what was already being done with the read and write bits. At the kernel level, it would require similar modifications to the syscalls that got called from the API (i.e. actually set or clear the eXecute bit in the PTE), and also to the page fault manager, to deal with the new cases of executable and non-executable memory and what to do when a process tries to execute and it can't (bugcheck, panic, kill the process - most logical option - etc). Again, nothing different than what is already done when a read or write permission is violated by a process.

Hardware-wise, you would need to check one more bit in the PTE's permission field, and you would need to throw a page fault if you tried to fetch code from a non-eXecute page. Again, nothing entirely different from what's already being done, as you are already checking the TLB, PTE etc when you prefetch code in the first place: to check for read permission. As you are surely aware, just randomly setting EIP to DEADBEEF will usually end up in an ugly page fault followed by a core dump as the linear address does not map to anywhere (unless by chance that particular address does map to something readable, but you get the point).

All I was asking in the previous post was that the CPU checked one more bit in the PTE when it did code prefetching: the eXecute bit. If it's there, fetch the code and run it, if it's not, tell the OS about it (as is already done when fetching code, for the read permission).
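For illustration, here is roughly what that extra check looks like in a C model of a code fetch. As far as I know, on the AMD64 chips the bit ended up as bit 63 of the 64-bit page table entry, and it is only honoured once the OS switches the feature on, which the nx_enabled flag stands in for here. The function name is hypothetical, and this is a logic sketch, not a description of the actual silicon.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a 64-bit x86 page table entry (PAE or long mode). */
#define PTE64_PRESENT  (1ull << 0)
#define PTE64_WRITABLE (1ull << 1)
#define PTE64_USER     (1ull << 2)
#define PTE64_NX       (1ull << 63)  /* 1 = no instruction fetches */

/* Returns true if a user-mode instruction fetch from the page is allowed.
 * nx_enabled models the OS having turned the feature on. */
static bool user_fetch_ok(uint64_t pte, bool nx_enabled)
{
    if (!(pte & PTE64_PRESENT))
        return false;                 /* not mapped: page fault */
    if (!(pte & PTE64_USER))
        return false;                 /* kernel-only page */
    if (nx_enabled && (pte & PTE64_NX))
        return false;                 /* the one extra check */
    return true;
}
```

The first two tests are what a fetch already had to pass; the third line is the entire difference being argued about in this thread.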

As for the OS, all it has to do is turn on the X bit for the whole code segment when creating a process, and turn it off everywhere else. It already has to set permissions to R/W value for every page it allocates, it just needs to set it to R/X for those pages that will contain the process' code. A difference of a bit in a bitmask. That, and have a way of changing the bit on demand for a given page, through the normal APIs.

Forget about the stack and whatnot, you don't execute from the stack, never did. Nor do you execute from user-malloc'ed memory. If the user really wants to execute from malloc'ed memory, he just has to call the API to set the X bit on his allocated memory, and everything will work just the same.
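On the *nix side, "call the API to set the X bit" would look something like the sketch below, assuming a POSIX system with mmap() and mprotect(). Since mprotect() works on whole pages, the sketch grabs a page with mmap() rather than using malloc() directly. The helper name is made up for the example, and nothing here actually jumps into the copied bytes.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict glibc modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy len bytes of code into a fresh page, then
 * flip the page from read/write to read/execute.
 * Returns the page address, or NULL on failure. */
static void *make_executable_copy(const void *code, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len > (size_t)page)
        return NULL;                      /* sketch handles one page only */

    void *mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    memcpy(mem, code, len);               /* write while it is still R/W */

    /* The explicit request for execute permission. */
    if (mprotect(mem, (size_t)page, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, (size_t)page);
        return NULL;
    }
    return mem;
}
```

The page is never writable and executable at the same time: it is filled while R/W and only then switched to R/X. Anything that legitimately generates code at run time can ask for this explicitly, and everything else never needs to know the X bit exists.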

As I said, nothing difficult, nothing innovative (all this is as old as paged memory management), but certainly _very_ important and would have saved us many headaches if it were done properly the first time around, instead of being ignored.

Bleh, I hope I didn't repeat myself too much there, it's almost 4 in the morning and my mind isn't exactly at 100% ;)

Edit: spelling and grammar

Last edited by capi on Fri May 20, 2005 9:00 pm; edited 2 times in total

Author: mmelton    Posted: Fri May 20, 2005 2:45 pm
Hey thanks for the welcome!

I guess I did think you were somewhat against executable segments. My bad!

I still think it's fairly hard however. The i386 is a 36-bit processor. 4 bits of which are used for instructions by the decoder unit, allowing full 32-bit memory addressing.

If you extended the PGD - page global directory - or per process PTE, you add to the number of cycles the processor has to loop in order to check an NX bitstate. This is because every time it sends an address for data on the FSB, it would have to send a full 32-bit address (in order to prevent cross page execution from different processes). This would then be followed by a control bus check, from the processor's MMU, where the data would stall in the meantime. This check would be another clock cycle.

The seg/page fault the processor raises when accessing another process's memory space is a soft fault - the processor only maintains a list of used pages as supplied by software.

So most of the overhead for tracking would be software based, unless you extended the address bus to 33 bits, and the processor vector tables to 33bits also.

The solution to this is introducing a completely new address/control bus checker. And I feel that's a lot of silicon.

I agree with you that it's a great idea, but I can't see from the hardware side how it could have been implemented a few years ago.

I like this kind of debate. Are all these forums like this?


Author: capi    Location: Portugal    Posted: Fri May 20, 2005 7:34 pm
Ok, clearly, when talking about the present state of CPUs, the differences may be more noticeable as we are talking about changing something that has evolved from the beginning without implementing the eXecute bit. Naturally the differences would probably be greater if we set out to change current CPUs.

What I'm saying is this should have been done since the beginning, since paged memory was implemented at the hardware level; the execute permission should have been included in the design then, alongside read and write, ring-3 access, etc. It would not have been a big difference then, as you already had to change the CPU to implement everything in the first place (the PTE, the TLB, etc); instead of skimping on the silicon, they should have done the thing right.

Sure, it might have cost one more clock cycle or two; so what? It is not an acceptable compromise to throw away an important security measure in the name of a couple of clock cycles, which would not have made much difference anyway when compared to the whole process of checking all the other bits. This would be analogous to the kernel developers back in the 1970s thinking "hey why should we bother checking the execution bit in inodes? We can run binaries faster if we don't implement it". Simply not acceptable.

CPUs are vastly underused, as you no doubt are aware. People will call a Pentium 2 500MHz "ancient", and say "I _only_ have 256MB of RAM". The truth is most people basically use their fancy 3GHz CPUs as little more than a glorified heating element. It will sit there toasting electrons away for 99.9999999999% of its useful life. Software developers (especially certain closed-source companies) are spoiled for resources; it's easier to say "works slow? buy a faster computer" than it is to actually develop quality software - starting by not using those damn high-level languages such as VB, or (yuk) Java (spit spit).

But anyway, rant aside, I don't agree that it would have been that hard to implement the execution permission back in the 386 days. They already had to change a lot from previous models to cater for paged memory management; the added change of including another bit would have been negligible. The technology was already there, code fetching already had to map linear to physical addresses (and check read permissions in the process), adding another bit would have meant little to no change in what they already had to do to get it all working.

The 386 was a conceptually simple model. No pipelining, no added complications like today. Of course, if you think of today's processors, pipelined to the extreme, etc, you will find more differences. Naturally, if you take 20 years or so of evolution from a certain model, you may end up with something considerably different than you would if you had made a minor change back then (the whole butterfly effect thing, etc). But back then, when they actually implemented all of this, it would have been a negligible difference. In fact, even looking at today's CPUs I strongly doubt it would imply such a big difference. Certainly not a big change in architecture. The execution unit would remain untouched; you'd still have OOE (out of order execution), etc, you would still have instruction cache, same micro-ops, same ALU overall. Of course you would have to make some changes to the code fetch unit, possibly affecting some parts of the bus as well, perhaps the PTE might have to be larger than 32 bits. But again, that's thinking of things as they are now; those 32 bits have been taken up by other features, as they were not being used from the start.

For example, up to Pentium 2, bit 7 in a PTE (page size, 4K vs 4M) was reserved, not in use. That right there was an available bit. Of course evolution came and, since the bit was unused, someone thought of using it to create the 4M pages. If you look even further back, you would have had even more available bits. Back then at the time of the 386, when implementing the whole mechanism of paged memory, there was absolutely no reason that you could not have used one of the bits in the 32 that a PTE had available to signify the execution permission. As we've both agreed, the idea was not new, it was not a matter of "oh no one thought of that". The concept of execute permission was as inherent to page permissions as read/write permissions. It was simply not implemented, a design choice. A very poor one at that, since it allowed all the sorts of exploits which we see today.

Had the hardware support been properly implemented back then, we would have had a lot more protection against a threat that we've needlessly faced all this time.

Of course, I'm not saying that having an execute bit would have been the be-all and end-all of security. Obviously not. There are ways around it, you can still exploit a poorly coded app even in an environment where you can only execute from the code segment (and even if you can't write to the code segment). As long as you can overrun into the return address, you can tell the CPU where to execute, and that's still a very powerful thing (I'd not like to go into too much detail for obvious reasons, but just think overrunning the stack with your code, setting up a fake stack frame, setting the return address to mprotect with certain parameters, ... ;)).

mmelton wrote: "I like this kind of debate. Are all these forums like this?"

Indeed, it's been a pleasurable discussion. Always nice to have informed debates over such interesting issues. You will find that all of our forum is a great place both to learn and to share one's knowledge in the most varied fields, that is the very purpose of this community.

All times are GMT + 2 Hours