OpenBSD/sun4v: porting OpenBSD to UltraSPARC T1 and T2
Speaker: Mark Kettenis
1. A short history of the SPARC architecture
The SPARC architecture is a design promoted by Sun through an organisation that licenses the design to other manufacturers. This is similar to the way the ARM architecture is managed. The SPARC architecture started at version 7 in 1986, in 32-bit mode. Version 8 was specified in 1990. Lots of implementations from several vendors (Sun as well as others) were available on the market.
In 1993, version 9 was released. It introduced in particular 64-bit mode. The main vendors are Fujitsu with sparc64, and Sun itself with its UltraSPARC sun4u.
Unfortunately, the architecture is not completely specified, in particular the privileged mode (especially the MMU). There are large differences between the Fujitsu and the Sun chips, and both companies attempted to make them compatible: Fujitsu wanted Solaris to run on its chips. The result was the sun4v architecture.
The K computer in Japan, the fastest in terms of the LINPACK benchmark, runs on Fujitsu's SPARC64.
2. Features
This is a RISC design, as opposed to the x86 architecture everybody and his dog runs.
The CPU has register windows: on function entry you can set aside the whole set of 32 generic registers and get a fresh new set, then restore the previous window on function return. This allows for fast function calls. Of course only so many windows are available, and deep function call graphs will still require saving registers on the stack beyond a certain depth.
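To make the "only so many windows" point concrete, here is a toy model of a windowed register file. It is only a sketch: the class name, the default of 8 windows and the spill/fill counters are my own assumptions for illustration, and real hardware handles overflow through traps rather than method calls.

```python
class RegisterWindows:
    """Toy model of a windowed register file with spill/fill to the stack."""

    def __init__(self, nwindows=8):
        self.nwindows = nwindows  # number of hardware windows (assumed value)
        self.resident = 1         # windows currently held in hardware
        self.depth = 1            # current call depth
        self.spills = 0           # windows written out to the stack
        self.fills = 0            # windows read back from the stack

    def call(self):
        """Function entry: take a fresh window, spilling the oldest if needed."""
        self.depth += 1
        if self.resident == self.nwindows:
            self.spills += 1      # no free window: oldest one goes to the stack
        else:
            self.resident += 1

    def ret(self):
        """Function return: go back to the caller's window, refilling if spilled."""
        self.depth -= 1
        if self.resident == 1:
            self.fills += 1       # caller's window was spilled earlier: reload it
        else:
            self.resident -= 1
```

In this model, calls are free of memory traffic as long as the call depth stays within the number of windows; every call beyond that costs one spill, and the matching return costs one fill.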
The CPU features fast trap handling. In addition, interrupts can be sorted by priority.
Lastly, its MMU relies on software handling of TLB misses. [1]
3. Chip multithreading
The architecture was designed from the ground up to offer massive parallelism. A single processor includes 4 to 8 cores, and each core can execute 4 or 8 threads. The core will preempt a thread when it sees a load from memory, and switch to the least recently run, runnable thread while waiting for the memory fetch.
This allows for up to 64 active threads per chip. Recent designs allow up to 4 chips per machine.
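The switch-on-load behaviour can be illustrated with a small simulation. This is a sketch under made-up assumptions: the function name, the "one load every 4 instructions" workload and the 12-cycle memory latency are invented for the example and are not figures from the talk.

```python
def run_core(nthreads, cycles, load_every=4, latency=12):
    """Return the fraction of cycles in which the core issued an instruction."""
    ready_at = [0] * nthreads    # cycle at which each thread is runnable again
    last_run = [-1] * nthreads   # cycle at which each thread last issued
    issued = [0] * nthreads      # instructions issued so far, per thread
    busy = 0
    for now in range(cycles):
        runnable = [t for t in range(nthreads) if ready_at[t] <= now]
        if not runnable:
            continue             # every thread is waiting on memory: idle cycle
        # Pick the least recently run thread among the runnable ones.
        t = min(runnable, key=lambda i: last_run[i])
        last_run[t] = now
        issued[t] += 1
        busy += 1
        if issued[t] % load_every == 0:
            # This instruction was a load: the thread sleeps until data arrives.
            ready_at[t] = now + latency
    return busy / cycles
```

With these made-up parameters, `run_core(1, 3600)` keeps the core busy only about a quarter of the time, while `run_core(8, 3600)` keeps it busy most of the time: the other threads fill the cycles that a single thread would spend waiting for memory.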
The raw speed of individual cores is rather low, typically 1.6GHz. However, those tradeoffs were made in order to reduce power consumption and heat dissipation while still achieving high efficiency. The K computer is 6th on the June 2011 Green500 list when measured in FLOPS per Watt.
4. Getting hacking
Mark started from the sun4u OpenBSD port, with the goal of having a single kernel for the two architectures, sun4u and sun4v. It should be noted that Solaris uses two different kernels.
For bare metal booting, no change was necessary for the first and second stage of boot(8). However, the kernel itself needed a lot of changes. The approach Mark used was to patch the code at run-time to handle the differences.
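The idea of run-time patching can be sketched as follows. This is not the OpenBSD implementation: the "instructions" are plain strings, and every name in the kernel text and in the patch table is hypothetical. The real kernel rewrites machine instructions in place early at boot; the sketch returns a patched copy instead, to keep the example easy to test.

```python
# Hypothetical "kernel text": one string per instruction slot.
KERNEL_TEXT = [
    "save_trap_state",
    "read_mmu_fault_sun4u",      # default code path targets sun4u
    "handle_fault",
    "flush_tlb_sun4u",
    "return_from_trap",
]

# Hypothetical patch table: (slot index, replacement to use on sun4v).
SUN4V_PATCHES = [
    (1, "read_mmu_fault_sun4v"),
    (3, "flush_tlb_via_hypervisor"),
]

def patch_kernel(text, patches, cpu_impl):
    """Return the kernel text to run on cpu_impl ('sun4u' or 'sun4v')."""
    patched = list(text)         # work on a copy of the original text
    if cpu_impl == "sun4v":
        for slot, replacement in patches:
            patched[slot] = replacement
    return patched
```

The point of the design is that a single kernel image can ship with the sun4u code paths and a list of places to overwrite once the kernel has detected that it is running on sun4v.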
5. Domains
The UltraSPARC CPU allows the creation of domains, which let the administrator split the machine into slices. The hypervisor is in the firmware (and as I understood it, it doesn't require a core for itself). Machine resources can be attributed to domains for their exclusive use: vCPUs (hardware threads), memory, I/O devices, cryptographic resources (there is hardware support for MD5, SHA, RSA and other cryptographic services, but fewer crypto units than cores). There is no overcommit of memory or CPU.
One domain is privileged: the control domain has access to the hypervisor, and can be used to configure the other domains, in particular resource partitioning.
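The "exclusive use, no overcommit" rule amounts to simple bookkeeping, sketched below. The class, its method and the resource names are assumptions of mine for illustration; this is not the hypervisor's or Solaris' configuration interface.

```python
class Machine:
    """Toy resource partitioner: resources are handed out exclusively."""

    def __init__(self, vcpus, memory_gb):
        self.free = {"vcpus": vcpus, "memory_gb": memory_gb}
        self.domains = {}

    def create_domain(self, name, vcpus, memory_gb):
        request = {"vcpus": vcpus, "memory_gb": memory_gb}
        # Check everything first so that a refused request changes nothing.
        for key, amount in request.items():
            if amount > self.free[key]:
                raise ValueError(f"not enough free {key} for domain {name!r}")
        for key, amount in request.items():
            self.free[key] -= amount
        self.domains[name] = request
```

Once the free pool is exhausted, a new domain can only be created by taking resources away from an existing one; nothing is ever shared or promised twice.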
Service domains present virtual devices to the other domains, arbitrating between guest domains. The control domain is typically also a service domain.
I/O domains have direct access to the hardware devices. Support for running in an I/O domain, talking directly to the NIC or the disk, was added rather quickly to OpenBSD; but to make good use of the hardware, it is preferable to run as many services in a guest domain as possible. [2]
Guest domains only use virtual devices exposed by a service domain. Sun's recommendation, of course, is to run as few applications in service and I/O domains as possible.
6. Bootstrapping guest domain support
The first step in porting OpenBSD to sun4v was to have it run as a guest domain.
An I/O domain was created from Solaris as a control domain. In this I/O domain, OpenBSD was booted diskless (root over NFS), from a real dedicated NIC. Once this was done, hacking on vnet(4), the device driver for the virtual NIC provided by a service domain, started. Once vnet(4) was available, it was possible to start OpenBSD diskless, this time using the virtual NIC instead of a dedicated one.
At this point, OpenBSD could run in a guest domain, from an NFS root accessed through vnet(4). Hacking on vdsk(4) to access a virtual disk provided by a service domain could now start. After some hacking, OpenBSD could access (and boot from) a vdsk(4) device.
7. Communication between domains
A communication system between domains exists. It consists of 64-byte messages. Sun makes no promises on the reliability of these messages. However, Mark never noticed any message loss. He conjectured that Sun did not guarantee the reliability of the messages in order to be able to implement message passing over the network between domains running on separate hosts, maybe even to support hot migration of guest domains across physical hosts.
Be that as it may, Sun has defined reliable data streams, called logical domain channels (LDCs), on top of those 64-byte messages. The virtual NIC and disk devices implemented in OpenBSD follow the Sun-specified protocol.
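To give an idea of how a reliable stream can be layered over small fixed-size messages, here is a minimal sketch. The header layout (sequence number, payload length, flags) is invented for this example and is not Sun's LDC packet format; the function names are hypothetical as well.

```python
import struct

MSG_SIZE = 64
HEADER = struct.Struct("<IHH")         # seq, payload length, flags (assumed layout)
PAYLOAD_MAX = MSG_SIZE - HEADER.size   # 56 bytes of data per message
FLAG_LAST = 1                          # marks the final fragment of a transfer

def fragment(data, first_seq=0):
    """Split data into a list of 64-byte messages."""
    chunks = [data[i:i + PAYLOAD_MAX]
              for i in range(0, len(data), PAYLOAD_MAX)] or [b""]
    msgs = []
    for n, chunk in enumerate(chunks):
        flags = FLAG_LAST if n == len(chunks) - 1 else 0
        header = HEADER.pack(first_seq + n, len(chunk), flags)
        msgs.append(header + chunk.ljust(PAYLOAD_MAX, b"\0"))  # pad to 64 bytes
    return msgs

def reassemble(msgs, first_seq=0):
    """Rebuild the data: (data, None) on success, (None, missing_seq) on a gap."""
    expected = first_seq
    out = bytearray()
    for msg in msgs:
        seq, length, flags = HEADER.unpack_from(msg)
        if seq != expected:
            return None, expected      # a receiver would ask for a resend from here
        out += msg[HEADER.size:HEADER.size + length]
        expected += 1
        if flags & FLAG_LAST:
            return bytes(out), None
    return None, expected              # messages ran out before the last fragment
```

The sequence number is what turns an unreliable message service into a reliable stream: the receiver notices a lost message as a gap and can report which sequence number it is still waiting for.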
7.1. Virtual NICs
The basic idea is to put the data to transmit in some shared memory page, and then send the address of this page through an LDC. Of course, the other domains are not trusted, so the kernel doesn't directly expose its mbufs; instead, the buffer is copied and only that copy is made available to the service domain.
Similarly, when receiving data, the packet is first copied to a proper mbuf and then the packet is handled.
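The copy-in, copy-out scheme described above can be sketched like this. It is a model, not the vnet(4) code: `SharedPages`, the 8192-byte page size and the use of a Python list as the channel are assumptions made for the example.

```python
class SharedPages:
    """Memory visible to both domains (hypothetical stand-in for exported pages)."""

    def __init__(self, npages, page_size=8192):
        self.pages = [bytearray(page_size) for _ in range(npages)]

def transmit(shared, page_no, mbuf, channel):
    """Copy the private mbuf into a shared page and announce it on the channel."""
    shared.pages[page_no][:len(mbuf)] = mbuf   # the peer only ever sees this copy
    channel.append((page_no, len(mbuf)))       # small descriptor, not the data

def receive(shared, channel):
    """Copy the announced packet out of shared memory into a private buffer."""
    page_no, length = channel.pop(0)
    # The private copy cannot be changed by the peer after this point.
    return bytes(shared.pages[page_no][:length])
```

The price is one extra copy in each direction; what it buys is that the network stack only ever works on memory the other domain cannot touch.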
For Ethernet switching (when OpenBSD is a service domain), a plain bridge(4) is used, with virtual NICs or physical PCIe interfaces as ports.
7.2. Virtual disks
OpenBSD implements both the client side (in vdsk(4)) for guest operation, and the server side (in vds(4)) for running as a service domain. The vdsk(4) driver presents a SCSI device to the rest of the kernel.
Between the service and the guest domain, data is exchanged using buffers exported by the guest. When OpenBSD is a guest, we already depend so much on what the service domain decides to return on reads that we have little choice but to trust it blindly. When OpenBSD is a service domain, it "just" needs to copy data to (and from) the buffers provided by the guest domain.
All in all, this means less copying around of data in the virtual disk case than in the virtual NIC case.
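For contrast with the virtual NIC case, here is a sketch of a disk read where the service domain fills the guest's exported buffer directly. The function names and the 512-byte sector size are assumptions for the example; this is not the vdsk(4) or vds(4) code.

```python
SECTOR = 512

def make_service(disk_image):
    """Service side: copy straight from the backing store into the guest's buffer."""
    def service_read(sector, guest_buf):
        start = sector * SECTOR
        guest_buf[:] = disk_image[start:start + len(guest_buf)]
    return service_read

def guest_read(service_read, sector, count):
    """Guest side: export a buffer and let the service domain fill it in place."""
    buf = bytearray(count * SECTOR)          # buffer exported to the service domain
    service_read(sector, memoryview(buf))    # data lands here, no second copy
    return buf
```

There is a single copy, done by the service domain; the guest uses the data where it landed instead of copying it into a private buffer first.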
8. Status
Not all features of the control domain are available to OpenBSD for the moment. In particular, configuring domains does not work. Initial support for starting and stopping other domains exists, but isn't ready yet. It is a good idea to keep a Solaris disk around for the moment.
OpenBSD is able to run as a service domain, as an I/O domain, and as a guest domain, servicing a Solaris guest domain or using virtual devices exported by a Solaris service domain. The specific setup Mark had in mind was using OpenBSD as a service domain with pf running on the real hardware, protecting several (Solaris or OpenBSD) guest domains.
Footnotes:
[1] OK, I'll admit I'm reaching the limits of my knowledge of CPU design, and I'm only listing these points because Mark insisted on them. I'll invite the interested reader to peruse the All About OpenSPARC slides for a much better written overview.
[2] I am not positive I completely understood the distinction between service and I/O domains. Surely service domains need to access the real hardware to present virtual devices? A service domain that presents a virtual network without having access to the real NIC would enable other guests to network amongst themselves, but not access the Real World. While this might be useful in some scenarios, in general this seems quite limiting.