The 8-fold path to reliable operating systems

Speaker: Herbert Bos, professor at the Vrije Universiteit, colleague of Tanenbaum

1. The ever-increasing bloat of OS

Herbert started his talk by pointing at all the bloat in our current OSes. Windows went from 6 MLOC in NT3.5 (that was in 1993) to 50 MLOC in XP (released in 2001). That means every year, the codebase grows by about 30%. Free OSes also saw dramatic growth: the Linux kernel was 10 kLOC in 1991, fattened to 13.5 MLOC in 2010; that's an average growth of +60% a year. Similar stats were shown for FreeBSD.1

In addition, all software contains bugs. I seem to remember Herbert quoted a single-digit number of bugs per 10 kLOC as a baseline. In this respect, free systems, the BSDs in particular, did better than average (cheers in the audience). FreeBSD did particularly well (louder cheers from part of the audience).

Therefore, current OSes are too complex to be understood. A single kernel hacker can't pretend he understands all the aspects of his kernel. There are bugs lurking around, and the most we can do is limit their impact when we hit them.

2. Reliability

How important is stability, anyway? After all, people have been using desktop OSes that blue-screen regularly for 20 years. However, in industrial systems, the reliability of your controlling processes is very important. You don't want the nearby chemical plant to have an incident where the computer that controls the valve in mixings locks up…

But after all, some computing systems are designed with reliability in mind. TCP is able to perform with segment loss (and even embraces loss as a flow control mechanism); RAID provides resiliency in the face of disk failure; the Tandem computers could survive the magic smoke escaping the CPU. Software, on the other hand, is not quite reliable. The Minix project tried to correct that, using some of the techniques of the systems cited.

Minix explicitly decided to forego performance in this quest for reliability, on the grounds that it is less important, and that hardware will improve anyway.

3. Empty your mind

The first step2 is to reduce the bloat of the system, at least its core. The Minix3 micro-kernel contains 10 kLOC. This makes it a very small system, understandable by a single hacker. The codebase is made very modular. Out of these 10 kLOC, most of the code runs in user mode.

4. Let go your privileges

Only code that has privilege to take down the whole system is actually able to take down the whole system. It is therefore necessary not to give more privileges to code that it actually needs; and code need to be able to let go its own privileges.

5. Balance the ten deadly powers

However, at least some code need to access, say, external devices. Privilege granting is very fine-grained. Access to DMA ranges is granted and enforced on a per-device and per-process basis.

IPC, which is fundamental in the Minix architecture, must be safe (more on this later).

Extensions to the micro-kernel can't access the full CPU, neither in terms of time nor in terms of instructions accessible. Access to external resources for extensions is done explicitly.

6. Insulate

Drivers each have a separate address space, so a faulty driver that would poke at random pages cannot take the system down. Of course, a driver can only access its own address space.

I/O to devices is done through kernel services, so the kernel can check what a driver does. Interrupts from the hardware are handled by the kernel. The information is then pushed to the driver through the regular IPC mechanism.

7. IPC

There are per-component ACLs, enforced by the kernel, that determine which process can communicate with which other process or which part of the kernel.

Of course, we don't want a faulty process to be able to hang another process. When a process doesn't trust another, in particular doesn't trust it to handle messages in a timely manner, message sending is done asynchronously. A typical chain of trust, where drivers are not trusted, is shown below.

 --------    <―――――――     --------    <·······     --------
( Server )               ( Driver )               ( Kernel )
 --------     ·······>    --------     ―――――――>    --------


  ······>  Untrusted, asynchronous communication
  ――――――>  Trusted, synchronous communication

8. The Minix separation of components

   ------                                           ^          ^
  (  PM  )              +-------------------------+ |          |
   ------   ---------   |         ----            | |          |
 ----      ( Syscall )  |        ( pf )           | |          |
( RS )      ---------   |  -----  ----            | |          |
 ----                   | ( TCP )                 | | Servers  |
       ----             |  -----   Network,       | |          |  User-
      ( MM )            |          single process | |          |    space
       ----             +-------------------------+ |          |  processes
                                                    |          |      
................................................    v          |
                                                    ^          |
    ------          ------------                    |          |
   ( SATA )        ( Net driver )                   | Drivers  |
    ------          ------------                    |          |
................................................    v          v
       _  __                    _                              ^
      | |/ /___ _ __ _ __   ___| |                             |  
      | ' // _ \ '__| '_ \ / _ \ |                             |  Kernel- 
      | . \  __/ |  | | | |  __/ |                             |    space
      |_|\_\___|_|  |_| |_|\___|_|   (10 kLOC)                 | 
                                                               v 

High-level view of the Minix architecture with a handful of components.

Note one insulation violation, where high-level network services (pf and TCP/UDP) are put into the same process. This is justified by the fact that when pf dies, we don't want TCP anymore; and when TCP dies, we want to restart pf anyway as all state is lost.

9. Overcome death

The reincarnation server shown in the figure above is the parent of all processes. The RS regularly pings other processes; when one dies or is caught in an endless loop, the RS can restart it, or maybe take some other action.

For a NIC device driver, it is sufficient to just restart it. For a printer driver, you can just restart the job; at most the user will see a document printed twice. However, when something like TCP dies, things get more hairy. All the internal state is lost, so connections are lost.

The network has two implementations: the initial, single-server one and a newer, decomposed one. Their respective sizes are:

Table 1: Recoverable code in the network stack
  Recoverable after Unrecoverable
  server crash  
Single server 1 kLOC 33 kLOC
Decomposed 35 kLOC 7 kLOC

Note that recoverability did introduce complexity, as IPC between processes is not free in terms of supporting code. However, the amount of code that is able to crash the network is much reduced.

Embracing the restarting of processes also enables live updates of the system components. Then again, this is difficult to do when services or device drivers need to keep a lot of internal state.

10. Additional design considerations

The kernel will not accept to run any external code. There is no support for run-time loading of external modules.

All data structures are static, allocated at boot time. There are no malloc() calls in the kernel. Similarly, IPC is done over fixed-length messages. This allows to avoid a whole class of buffer overruns bugs (and attack vectors).

The rendez-vous system for IPC is kept simple. This is essential in a design where IPC is paramount to the working of the system.

Keeping drivers as regular user processes has the additional benefit of making their development easier. They are also easier to debug with the usage tools.

11. Reliability experiments

Herbert showed the results of two experiments to show the benefits of the Minix design. In a first experiment, the Ethernet device driver is killed every second, while transmitting data over the network. Throughput went from 11 MB/s (saturation of the Ethernet link) to 8 MB/s. The network transfer itself was not affected.

A second experiment tested fault injection: in a loop, the condition is randomly modified. 100 faults are injected; if after one second the driver is still running, 100 new faults are injected until the driver crashes.

This resulted in 18 000 driver crashes. None of those crashes took the machine down, the kernel kept running.

More experiments are presented in Fault Isolation for Device Drivers (and the experimental protocol is better described than in a generic talk about the design of Minix).

12. Performance considerations

Lots of IPC between a lot of processes incurs significant overhead. It is hard to scale. In order to keep it reasonably fast, Minix avoids going through the kernel for IPC. This avoids polluting caches. Instead, the kernel will only set up the communication channels; afterwards, processes will rather share buffers (which of course are marked read-only for the process that doesn't own them).

Signalling in process communication is kept to a minimum. As much as possible, communication is made asynchronous, especially in the hot path. Drivers should adopt a event-driven design around message passing.

Tasks that require several actions can be done in parallel. For instance, fork() requires setting up a new address space and duplicating file descriptors. This is the responsibility of different devices, so this can be done in parallel.

Footnotes:

1

Of course, these rates of growth are not entirely comparable. Windows is much more than a kernel, so the growth of the core of the system, which is the object of the article, is somewhat amortised. Linux tends to carry much more drivers in its own tree, and the hardware landscape that runs Linux is much more diversified today than it was 20 years ago. Herbert's point about the increasing bloat and complexity of OSes still stands.

2

I am afraid I haven't noted the explicit eight steps. This is unfortunate, as the reader will lose the Zen spirit emanating from the step names.

Auteur : Frédéric Perrin

Date : samedi 29 octobre 2011

Sauf mention contraire, les textes de ce site sont sous licence Creative Common BY-SA.

Ce site est produit avec des logiciels libres 100% élevés au grain et en plein air.