Thursday, September 28, 2006

 

Steve Yegge stirring things up again

I just have to say a few words about Steve Yegge and his latest blogging.

I first met Steve Yegge when he still worked at Amazon. He was a big inspiration to me -- he blogged inside Amazon before he was blogging on the Internet, and his blog entries were always exciting to me, even when I didn't agree with them. He was a driving force behind the "Amazon Developer's Journal," which had a programming challenge in it each month that I won second prize in the one time I entered! He breathed life into the debate about programmer productivity, programming languages and tools, all that interesting stuff.

I actually rubbed him the wrong way the first time I commented on his blog, to the point that he wrote and entire blog entry chastising my comment as FUD. Of course I didn't mean it that way (we go into this in the comments), but I think I just struck a nerve.

So Steve is blogging on the big wide Internet now, and naturally is just as engaging as he was at Amazon. Now he's at Google, and making Google sound like a pretty awesome place. But he's also got a lot to say about languages and programmer productivity, just like he always did.

It's really tempting to get sucked into this and other Internet-wide discussions/debates. There are lots of smart people on the Internet, and lots of interesting things to talk about. Yesterday there was an announcement from Intel that they're expecting to have 80-core chips within five years. What does this mean for the future of Operating Systems? Naturally I have a pretty strong opinion about that.

However, I have always been a person that would rather act than talk. When Steve was blogging about LISP and Ruby and such inside Amazon, I wanted to say "If you believe in LISP, do something cool with it to demonstrate why it's compelling." I did just that with Ruby inside Amazon. I am much more satisfied letting my Ruby-related work speak for itself rather than writing blog posts arguing about Ruby's potential.

That's how I continue to feel about arguing the merits of methodologies, operating systems, or languages. I'm not a very convincing arguer, but I think I'm a pretty good software developer. If my opinions are founded, my good software can speak for itself; if I'm full of shit, my bad software will make that clear to me and everyone else.

Don't get me wrong -- I think Steve Yegge is a doer too. He's so hyperactive that he can be a doer and a talker at the same time. But I'm going to concentrate on convincing people through compelling software rather than compelling words.

(I know I haven't posted any updates here in a while, but I have been working on several iterations of the memory manager. I'll post more about it when I have made more tangible progress).

Sunday, September 10, 2006

 

There's more to life than malloc() and free()

When it comes to memory allocation, we programmers don't have a lot of imagination. All we've got is malloc() and free(). Or new/delete. Or new/wait-for-a-collector-to-stop-the-world. It's all the same.

For example, what is the classic answer to the problem of "how do you implement a resizable array?" Well, you malloc() an array of some size, and when you exceed that size you malloc() a bigger array and copy everything over. Because malloc() and free() are all you've got, right?

Wrong! All modern operating systems use MMUs to map physical memory into a virtual memory space. So for your resizable array, you could ask the OS to reserve quite a large area of your memory space, but only allocate pages for a small amount of it. Then when you want more, ask the OS to allocate more of the large memory space you reserved. Especially as we move to 64-bit OS's, where most applications will use a tiny fragment of that space. Giving applications more control over their memory map could open the door for some serious optimizations.

To be fair, I think that people do this sort of thing by mmaping /dev/zero, or tricks like that.

This will be a consistent theme. I believe that interfaces to systems like the memory manager should be richer, and expose more of the capabilities that are available. malloc()/free() is a simple interface that works well, but the potential exists for other interfaces that work much better for certain use cases.

 

Entry #1: Information Overload

I should have started blogging before now, so I would have good incremental notes about my thought process and results. Unfortunately, this is my first entry, and therefore it will be quite long and detailed.

First I'm going to talk a little about my approach from a high level, then I'll talk about my progress to date. This entry does not talk about why I am doing this, or my vision for where I want it to go. For that, you'll have to wait for future entries.

Starting from a microkernel

Yes, building an OS is extremely difficult. Luckily I have chosen to build on an existing microkernel, L4. Actually, L4 is an entire microkernel family, for which there are several implementations that provide roughly the same API. Building on a microkernel buys me threads, address spaces, memory protection, and saves me from the nitty-gritty details of how to boot on a particular hardware architecture.

Lucky for me, L4 does exactly the set of things I would prefer not to do myself. Even more importantly, L4 leaves to me all the things I need control over. For example: yes, L4 has threads and therefore a thread scheduler. However, it has a mechanism that lets you take over scheduling if you really want. L4's thread scheduler is simplistic -- the highest-priority runnable thread gets scheduled, but there's a system call that lets you donate the rest of your timeslice to another process. So: make your scheduler the highest priority thread in the system, L4's thread scheduler will always pick it, and then you can let your scheduler donate the rest of its timeslice to whatever thread you think should run. There are similarly smart ways for giving you all the control you need over paging.

Let me start with a bit of backstory about how L4 came to be. I wasn't around for the days of Mach, but apparently Mach sucked. I don't really know the details, but the general impression I have is that it is big and slow (at least for a microkernel), and failed to deliver on a lot of the promise of microkernels. Microkernel optimism (of the kind Tanenbaum displayed in the oh-so-famous flamewar with Linus) was sort of crushed by Mach's lackluster showing.

(there is a bit of irony here, since I am typing on a Mac running OS X, which runs this weird frankenstein of BSD glued on top of Mach. but let's leave that for now. :)

Then Jochen Liedtke came along and wiped the slate clean. He designed a microkernel from the ground up, using these principles:To demonstrate that microkernels could perform well, he wrote his first version in assembly language. I think I even read somewhere that it was so small that the entire kernel could fit in the 4k CPU cache! His work spurred a new school of fast, lean microkernels now known as the L4 microkernel family.

The current L4 scene

Like I mentioned, there are multiple implementations of the L4 ABI/API. The two most prominent ones I have come across are Fiasco and L4Ka::Pistachio. The first comes out of TU Dresden in Germany, and is a small part of a much larger system called DROPS ("The Dresden Real-Time Operating System Project"). At first glance, I was inclined to go with this implementation because of their focus on real-time (I'll explain later why this is important to me), but I ultimately decided against it for three reasons. One is that the code-base is rather large and hard to understand. A second is that it's not well-separated from DROPS, and I am only interested in the microkernel part. Finally, it's GPL, and I want to be free from GPL if possible.

Instead I decided to use L4Ka::Pistachio. It is a joint collaboration between the University of Karlsruhe in Germany and the University of New South Wales, Australia (which I think is totally fantastic -- those guys could code around the clock from opposite corners of the world!) I like that its code is small and clean, and (relatively) easy to understand. I like that it's BSD. It also has decent documentation, though somewhat out of date. So far I've been happy with it. Its last release was in 2004, but there is a reasonable amount of CVS activity, so I'm using the CVS version.

It doesn't advertise itself as being "real-time" like Fiasco does. To be honest, I'm not sure what this means in practice. One thing it could mean is that it has a real-time scheduler. But L4 contains mechanisms that I believe will let you effectively implement scheduling in user space. It could mean that interrupts can always preempt the kernel. I am not sure exactly what kind of worst-case guarantees Pistachio offers about latency for dispatching interrupts to user-level processes.

IPC Model

IPC is of course the burning question. L4's IPC is apparently pretty fast (ten times as fast as Mach for a single tiny message, according to Wikipedia). What I find even more interesting than raw speed, though, is the execution model they've used for IPC. It's all synchronous: data is always passed from sender directly to receiver. The kernel never queues it. IPC can only complete when a sender is sending and a receiver is receiving, both at the same time. So if a thread decided to do a receive with no timeout, it is effectively put to sleep until someone sends it a message.

I think this model is going to be key to making the OS perform with low latency (which is my ultimate goal). It helps low-latency because it gives you more control over the flow of execution between processes.

I think of this as a baton of execution. Say you're a keyboard driver. The keyboard interrupt comes in, and since you've already registered on that irq line, the interrupt hands the baton to you. You process the input, then send an IPC to whoever your client is (say it's an application), thereby passing the baton to him. He might look at the key event and decide he wants to draw something to the screen, so he sends the video driver an IPC, passing the baton to her (yes, this is a female video driver). The video driver can write to video memory appropriately. She then sends a completion IPC back to the application, who then gets a chance to run again.

(I'm not 100% sure that L4's IPC works this way all the time. The crucial point is that if I perform an IPC that sends a message to another process and also waits indefinitely for someone to send a message to me (this is an atomic operation, and only one syscall), I want to be sure that the thread I just sent the message to gets to run automatically. I do not want the scheduler to run and pick whatever thread it considers most ready.)

In this way, the processer's time follows the interesting task around, from thread to thread, and address space to address space. We never once sent a message, but waited for the destination thread to get scheduled in -- every thread in the chain of (keyboard driver -> app -> vga driver -> app) got to run immediately, because there was interesting work for it to do. This all might have even happened during a single timeslice.

The "baton" idea reflects the reality of running multiple programs on a processer -- only one guy gets to move at a time. What's nice about synchronous IPC is that it gives you a little more control over how that baton gets passed around, instead of trying to make a scheduler smart enough to assess the state of the world and decide what is most important.

First steps

In the last two weeks, I've written enough code to get familiar with L4 programming. First I took the sample program and modified it to be able to write arbitrary characters to the screen. Obviously, the first message I had it write was "Hello, L4 World!" :). This involved reading various documentation about text-mode VGA programming. Then I extended it a bit to be able to move the cursor also.

After that was working, I wrote some keyboard code that can read keyboard scancodes as they come in. This involved learning how to register for an interrupt using L4. I certainly don't want to waste CPU by spinning in a tight loop, polling for keyboard events. Instead, I register for IRQ1, go to sleep, and when I wake up I look for the keyboard event I know is there. There's a lot of code you have to write to map keyboard scancodes to something that has meaning (using a table like this), and even more code to keep track of whether modifiers like shift are down and mapping "SHIFT+7" to ampersand.

With that, I had enough hardware support to implemement a basic console. But before I did so, I wanted to be able to run the vga and keyboard as their own servers, in their own address space. This turned out to be a lot of work. Luckily, I can make L4 do the work of loading more than one server into memory. It has a loader called KickStart that together with GRUB can load an arbitrary number of ELF binaries into memory. After that, I have to use L4 system calls to create a new thread, create and configure an address space for it, and map in some memory. Then I send it a message that contains its initial IP (instruction pointer -- entry point of the binary) and SP (stack pointer -- since it's written in C and needs a stack), and it's on its way!

But now that I have multiple processes working, it's time to start writing some memory management. I have this crazy idea that the memory manager itself can run in its own address space, thereby managing its memory along with the memory of all other processes. So the page table would describe even the memory used by the page table itself! I am not at all wedded to this idea, and I will abandon it if is impractical or more trouble than it's worth. I just figure: if it can be done, why not do it?

This page is powered by Blogger. Isn't yours?