Dell C6100 XS23-SB server

Last week’s laptop review reminds me that I should also write about a new server purchase. (I know, everyone’s moving to cloud computing, and here I am buying a rackmount server to colocate..)

Kelly Sommers has one of the best dev blogs out there, and she recently wrote about a new server she’s installing at home. It turns out that Dell made some dramatically useful servers around four years ago: the server is a slim rackmount size (2U) yet contains four independent nodes, each of which can carry dual Xeon processors, eight RAM banks, and three 3.5″ disks. Dell didn’t sell these via standard markets: they went to large enterprises and governments, and are now off-lease and cheaply available on eBay. They’re called “Dell C6100” servers, and there are two models that are easy to find: XS23-SB, which uses older (LGA771) CPUs and DDR2 RAM; and XS23-TY3, which uses newer LGA1366 CPUs and DDR3. Here’s a Serve the Home article about the two models. (There are also new C6100 models available from Dell directly, but they’re different.)

I got one of these — each of the four nodes has two quad-core Xeon L5420s @ 2.5GHz and 24GB RAM, for a total of 8 CPUs and 96GB RAM for $750 USD. I’ve moved the RAM around a bit to end up with:

CPU          RAM    Disk
2 * L5420    32GB   128GB SSD (btrfs), 1TB (btrfs)
2 * L5420    24GB   3 * 1TB (raid5, ext4)
2 * L5420    24GB   3 * 750GB (raid5, ext4)
2 * L5420    16GB   2 * 1TB (raid1, ext4)

While I think this is a great deal, there are some downsides. These machines were created outside of the standard Dell procedures, and there aren’t any BIOS updates or support documentation available (perhaps Coreboot could help with that?). This is mainly annoying because the BIOS on my XS23-SB (version 1.0.9) is extremely minimal, and there are compatibility issues with some of the disks I’ve tried. A Samsung 840 EVO 128GB SSD is working fine, but my older OCZ Vertex 2 does not, throwing “ata1: lost interrupt” to every command. The 1TB disks I’ve tried (WD Blue, Seagate Barracuda) all work, but the 3TB disk I tried (WD Green) wouldn’t transfer at more than 2MB/sec, even though the same disk does 100MB/sec transfers over USB3, so I have to suspect the SATA controller — it also detected the disk as having 512-byte logical sectors instead of 4k sectors. Kelly says that 2TB disks work for her; perhaps we’re limited to 2TB per drive bay by this problem.

So what am I going to use the machine for? I’ve been running a server (void.printf.net) for ten years now, hosting a few services (like tinderbox.x.org, openetherpad.org and a Tor exit node) for myself and friends. But it’s a Xen VM on an old machine with a small disk (100GB), so the first thing I’ll do is give that machine an upgrade.

While I’m upgrading the hardware, what about the software? Some new technologies have come about since I gave out accounts to friends by just running “adduser”, and I’m going to try using some of them: for starters, LXC and Btrfs.

LXC allows you to “containerize” a process, isolating it from its host environment. When that process is /sbin/init, you’ve just containerized an entire distribution. Not having to provide an entirely separate disk image or RAM reservation for each “virtual host” saves on resources and overhead compared with full virtualization from KVM, VirtualBox or Xen. And Btrfs allows for copy-on-write snapshots, which avoid duplicating data shared between multiple snapshots. So here’s what I did:

$ sudo lxc-create -B btrfs -n ubuntu-base -t ubuntu

The “-B btrfs” has to be specified for initial creation.

$ sudo lxc-clone -s -o ubuntu-base -n guest1

The documentation suggested to me that the -s is unneeded on btrfs, but it’s required — otherwise you get a subvol but not a snapshot.

root@octavius:/home/cjb# btrfs subvol list /
ID 256 gen 144 top level 5 path @
ID 257 gen 144 top level 5 path @home
ID 266 gen 143 top level 256 path var/lib/lxc/ubuntu-base/rootfs
ID 272 gen 3172 top level 256 path var/lib/lxc/guest1/rootfs

We can see that the new guest1 subvol is a Btrfs snapshot:

root@octavius:/home/cjb# btrfs subvol list -s /
ID 272 gen 3172 cgen 3171 top level 256 otime 2014-02-07 21:14:37 path var/lib/lxc/guest1/rootfs

The snapshot appears to take up no disk space at all (as you’d expect for a copy-on-write snapshot) — at least not as seen by df or btrfs filesystem df /. So we’re presumably bounded by RAM, not disk. How many of these base system snapshots could we start at once?

Comparing free before and after starting one of the snapshots with lxc-start shows only a 40MB difference. It’s true that this is a small base system running not much more than an sshd, but still — that suggests we could run upwards of 700 containers on the 32GB machine. Try doing that with VirtualBox!
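As a rough check on that estimate, here is a minimal Python sketch of the arithmetic. The 40MB per container and the 32GB total come from the measurements above; the `max_containers` helper and the 4GB held back for the host are assumptions made purely for illustration, and real containers doing actual work would need more memory than an idle sshd.

```python
def max_containers(total_ram_mb, per_container_mb, host_reserve_mb=0):
    """Estimate how many idle containers fit in RAM.

    host_reserve_mb is a hypothetical amount kept back for the host OS.
    """
    usable_mb = total_ram_mb - host_reserve_mb
    if usable_mb <= 0 or per_container_mb <= 0:
        return 0
    return usable_mb // per_container_mb


if __name__ == "__main__":
    total_mb = 32 * 1024  # the 32GB node
    per_container_mb = 40  # measured difference in `free` after one lxc-start

    # No reserve at all: an upper bound.
    print(max_containers(total_mb, per_container_mb))
    # Assumed 4GB reserve for the host itself.
    print(max_containers(total_mb, per_container_mb, host_reserve_mb=4096))
```

With no reserve this gives a bit over 800 containers, and with the assumed 4GB reserve a bit over 700, which is in line with the figure quoted above.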

So, what’s next? You might by now be wondering why I’m not using Docker, which is the hot new thing for Linux containers; especially since Docker 0.8 was just released with experimental Btrfs support. It turns out that Docker’s better at isolating a single process, like a database server (or even an sshd). Containerizing /sbin/init, which they call “machine mode”, is somewhat in conflict with Docker’s strategy and not fully supported yet. I’m still planning to try it out. I need to understand how secure LXC isolation is, too.

I’m also interested in Serf, which combines well with containers — e.g. automatically finding the container that runs a database, or (thanks to Serf’s powerful event hook system) handling horizontal scaling for web servers by simply noticing when new ones exist and adding them to a rotation.

But the first step is to work on a system to provision a new container for a new user — install their SSH key to a user account, regenerate machine host keys, and so on — so that’s what I’ll be doing next.

Git patches in Gnus

I took over maintaining the Linux kernel’s MMC/SD/SDIO subsystem recently, and quickly found that I was spending too much time saving, applying and compile-testing submitted patches (and trying to remember which of these I’d done for a given patch). The following Emacs/Gnus function helps with that — with a single keypress when looking at a mail that contains a patch, it applies the patch to my git tree, runs the kernel’s “checkpatch” tool to check for common errors, and kicks off a compile test in the background. I’m not much of an elisp coder, so feel free to critique it if you can.

(defun apply-mmc-patch ()
  "Take a gnus patch: apply; compile-test; checkpatch."
  (interactive)
  (setq default-directory "/home/cjb/git/mmc/")
  (setq compilation-directory "/home/cjb/git/mmc/")
  ;; First, apply the patch.
  (dvc-gnus-article-apply-patch 2)
  ;; Run 'git format-patch', and save the filename.
  (let ((patchfile (dvc-run-dvc-sync
                    'xgit (delq nil (list "format-patch" "-k" "-1"))
                    :finished 'dvc-output-buffer-handler)))
    ;; Compile the result.
    (compile "make modules")
    ;; Now run checkpatch.
    (let ((exit-code (call-process "perl" nil nil nil
                                   "scripts/checkpatch.pl"
                                   patchfile)))
      (if (eq exit-code 0)
          (message "Checkpatch: OK")
        (message "Checkpatch: Failed")))))

(define-key gnus-summary-mode-map "A" 'apply-mmc-patch)

KDB+KMS for nouveau/radeon

First, some background: KDB (a kernel debugger shell) and KMS (kernel mode-setting) combine to let you drop into a graphical shell when something debugger-worthy happens on your Linux machine. That thing might be a panic, or a breakpoint, or a hardware trap, or a manual entry into the kdb shell. Inside the shell you can, for example: get a backtrace, inspect dmesg or ps, look at memory contents, and kill tasks.

This is a big improvement over the previous model of “something bad happens to your laptop while it’s in X, and the keyboard LEDs start blinking, and you hard-reboot and wonder what happened and wish your laptop had a serial port”.

Here’s a video of KDB+KMS in action — it’s from Jason Wessel at Wind River, who deserves massive kudos for having enough patience to get all of this debugging code merged into mainline Linux to everyone’s satisfaction:

Jesse recently wrote about how to give KDB+KMS a spin on Intel graphics chipsets, and now I’ve written patches that allow radeon and nouveau users to join in too. The method for testing them is similar to Jesse’s:

If you test with radeon or nouveau, please let me know what hardware you tested on, and whether everything worked. Thanks!

Btrfs snapshots proposal

I’ve written up a feature proposal on how we can use Btrfs snapshots to enable system rollbacks in Fedora 13, by gluing together the existing kernel code to do Btrfs snapshots, a UI for performing rollbacks, and a yum plugin to make snapshots automatically before each yum transaction. Lots of good comments so far, and LWN has written an article about it.

(Updated: The LWN link is no longer subscriber-only.)

Fun with graphics drivers

It all started, as most things in the universe have, with a slight popping noise and a bad smell.

One of my coworkers sent me a message last weekend, while I was at home, pointing out that my desktop machine at work smelled like it was burning. After discussing the merits of turning off computers emitting burning smells before chatting with their owners about it, we had a look inside and found this:

The leftmost set of capacitors is fine; the set to the left of the fan is not. It’s actually a very well-controlled demolition: an electrolytic capacitor has boiled and blown out the top of its cap, which is perforated to enable exactly this failure mode, and then once one cap is gone the rest are obliged to follow suit ‘cause the current on them increases in response. So, anyway, this left me in the market for a new graphics card.

The card in the picture is an nVidia 7300, which I bought two years ago because it was the only cheap card that could handle the dual-link DVI required for my 30” display with free drivers. Two months ago, though, Dave Airlie committed support for dual-link TMDS on ATI Avivo cards. These cards had historically been the worst of the worst for drivers: the free drivers were left nowhere without any specifications, and even the ATI binary driver for Linux took a long time to add support. Thanks largely to AMD’s latest NDA-less specification drops, though, we’re well on the way to a free accelerated driver for these cards, so I bought an X1600+ from buy.com.

When I booted X on it with the radeon driver, though, it looked craptastic. Since I like hacking on drivers, and since I didn’t have much of a choice, I thought I’d try to figure out what the problem was. The rest of this post is about what I tried during the subsequent bonding exercise between me and the registers on my new graphics card.

First, I asked around on IRC, and learned that it isn’t a known problem and that obtaining a register dump might be a good idea. You can get the “avivotool” that does this:

git clone git://people.freedesktop.org/~airlied/radeontool
cd radeontool
git checkout -b avivo origin/avivo

I made the register dumps, but nothing was standing out. Then I lucked out: I tried the fglrx (proprietary) driver to see whether the problem happened there—it didn’t, and furthermore it didn’t happen on the radeon driver after fglrx had run. This suggests that fglrx is setting useful registers that radeon doesn’t even know enough about to reset when it starts.

The next step is to diff register dumps with broken-radeon and post-fglrx working-radeon. This got >100 differently-set registers. Dave Airlie gave me a basic explanation of the register layout: under 0x1000 and over 0x6000 are setup, and the rest are mostly the 2D acceleration space.

Manually setting the <0x1000 space to the working values didn’t come up with anything, though, and trying every register by hand was getting tedious. I wrote a perl script that takes two register dumps, assumes one is “good” and one “bad”, and offers to set each differently-set register to the “good” state, pausing for a second in between each so you can inspect the screen output for a fix.
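The Perl script itself isn’t reproduced here, but the comparison step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dump format (one “address value” pair per line, both in hex) is hypothetical and may not match what avivotool prints, the sample values are made up, and the sketch only lists the differing registers; it doesn’t write anything to the card.

```python
def parse_dump(text):
    """Parse a register dump into {address: value}.

    Assumes a hypothetical format of one "ADDRESS VALUE" pair per line,
    both in hex; lines that don't fit are skipped.
    """
    regs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            addr, value = int(parts[0], 16), int(parts[1], 16)
        except ValueError:
            continue
        regs[addr] = value
    return regs


def differing_registers(good, bad):
    """Return (address, bad_value, good_value) for registers present in
    both dumps whose values differ, sorted by address."""
    common = sorted(good.keys() & bad.keys())
    return [(a, bad[a], good[a]) for a in common if good[a] != bad[a]]


if __name__ == "__main__":
    # Made-up sample dumps, purely for illustration.
    good_dump = "0x0100 0x12345678\n0x6590 0x00000000\n0x6594 0x00000000\n"
    bad_dump = "0x0100 0x12345678\n0x6590 0x00000001\n0x6594 0x00000101\n"

    good, bad = parse_dump(good_dump), parse_dump(bad_dump)
    for addr, bad_val, good_val in differing_registers(good, bad):
        print(f"{addr:#06x}: bad={bad_val:#010x} good={good_val:#010x}")
```

Each candidate in that list would then be set to its “good” value one at a time, with a pause to look at the screen, which is the part the Perl script automated.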

And, eventually, it got it right: 0x6590 (which radeon_reg.h knows to be AVIVO_D1SCL_SCALER_ENABLE) and 0x6594 are both set when the card boots, but unset by the fglrx driver, and unsetting them while running radeon results in a fixed screen image. We can do this simply in the driver:

OUTREG(0x6590, 0);
OUTREG(0x6594, 0);

At this point, Alex Deucher (who works at AMD/ATI on open-source driver development—awesome!) stepped in and explained that the scaler code doesn’t work yet, and the driver should really turn it off when it isn’t being asked for. So now we do.

Here’s a photo of the working result. Planet Emacsers will be pleased to see a ridiculous amount of emacs in it.

Tracing internal function calls in a binary

Dear everyone who likes Unix,

I have a binary (which uses glib and was compiled from C) and I’d like to get output with the function name each time any function in that binary is called. So, I’d like the output of ltrace(1), but for function calls rather than dynamic library calls. I am bored of adding g_debug("%s", G_STRFUNC); to the top of all my functions.

You’d think this would be easy, given that incredibly similar tools have existed for twenty years, but so far the shortest answer I’ve heard starts “well, you could write a gcc profile function stub that..”. It would be nice not to have to recompile, since gdb certainly doesn’t have to, but I’d welcome a way to achieve this with a recompile as well.

Any ideas? Thanks!

Update: jmbr wins, with the only solution that doesn’t require anything more than gdb, and no recompile. Here’s his script: http://superadditive.com/software/callgraph. I’d like to work on it to add support for modules loaded with dlopen().

Systemtap for fun and profit

Kjartan Maraas pointed me at this Fedora bug yesterday — it points out that /proc/$pid/maps has been broken in Rawhide for a month. The patch that made linux/fs/proc/task_mmu.c (which is where map requests are handled) diverge from mainline is this one. I can’t read more about its motivation since the Bugzilla ID is security blocked.

So, where to start? mm_for_maps() is a new function that does a bunch of checks on the relationship between task and current before deciding whether to allow the request; I threw some printk()s in to find out which were failing, and found that we take the !__ptrace_may_attach(task) path to the out label in the code below:

if (task->mm != mm)
    goto out;
if (task->mm != current->mm && !__ptrace_may_attach(task))
    goto out;

From there, it got hazy. __ptrace_may_attach() returns whatever security_ptrace() does. This takes us into pluggable LSMs land — any LSM that gave a struct security_operations with a pointer to a ptrace function will have a shot at returning an error that would be sent back to security_ptrace() to stop our request from completing.

But how do I tell which LSM is complaining, or even which LSMs are loaded? After all, they’re registered at runtime. Enter systemtap, as wisely suggested to me by Bill Nottingham (whom I now surely owe beer to). Systemtap is similar to the Solaris dtrace; it’ll let you instrument and track kernel functions and system calls for a running kernel. It was installed on my Rawhide machine by default, which is always a nice touch.

So, how to see which ptrace functions were registered? Enter my first systemtap probe:

unity:cjb~ % cat list-ptrace.stp
probe kernel.function("*ptrace*") {
    printf("%s\n", probefunc())
}
unity:cjb~ % sudo stap list-ptrace.stp

At this point, our stp script is converted into C code and compiled into a kernel module, before being loaded into the running kernel. After running a cat /proc/<pid>/maps in another terminal, we see:

__ptrace_may_attach
cap_ptrace

.. which suggests that cap_ptrace was called by __ptrace_may_attach, and that’s where our __ptrace_may_attach might be being turned down. To be sure that we got to cap_ptrace via __ptrace_may_attach, we can ask for a backtrace:

unity:cjb~ % cat cap-backtrace.stp
probe kernel.function("cap_ptrace") {
    printf("%s -> %s\n", probefunc(), print_backtrace())
}
unity:cjb~ % sudo stap cap-backtrace.stp
cap_ptrace ->
trace for 6345 (cat)
 0xc04c2286 : cap_ptrace+0x7/0x49 []
 0xc042a600 : __ptrace_may_attach+0xac/0xae []
 0xc049350c : mm_for_maps+0x83/0xd8 []
 0xc0492892 : m_start+0x28/0x11d []
 0xc04800d9 : seq_read+0xdb/0x268 []
 0xc0446288 : audit_syscall_entry+0x104/0x12b []
 0xc047fffe : seq_read+0x0/0x268 []
 0xc04648e2 : vfs_read+0x9f/0x13e []
 0xc0464d2e : sys_read+0x3c/0x63 []
 0xc0403d07 : syscall_call+0x7/0xb []

We’re pretty sure that cap_ptrace was responsible. Hunting through its source, we see that it has a path to return -EPERM, which would do it. So, we recompile the kernel in order to have cap_ptrace tell us what return value it’s going to use, right? Well, no. Straight back to systemtap:

unity:cjb~ % cat return-codes.stp
probe kernel.function("*ptrace*").return {
    printf("%s -> ", probefunc())
    log(returnstr(1));
}

The .return after the function pattern tells systemtap to trigger when the function is returning, and returnstr(1) asks for the return value as a decimal. There’s also print_regs(), if you prefer to see what’s in EAX directly. Over to the other terminal to cat a maps file again, and:

unity:cjb~ % sudo stap return-codes.stp
cap_ptrace -> 0
__ptrace_may_attach -> 0

That’s odd. cap_ptrace is returning 0, which we can see in its code is meant to mean success, and __ptrace_may_attach is receiving it back unharmed. Cue an “ah-hah!” moment as we realise that the conditional:

if (task->mm != current->mm && !__ptrace_may_attach(task))
    goto out;

.. has the wrong polarity; each of the functions that __ptrace_may_attach backs onto returns zero for “success” (permission to attach), but the logic above is “if we’re not trying to get the map of the current process, and __ptrace_may_attach isn’t non-zero, we should fail”. The exclamation mark needs to disappear.
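To make the polarity problem concrete, here is a small Python model of the conditional, with and without the exclamation mark. It’s a sketch rather than kernel code: the function names are invented for illustration, `same_mm` stands in for `task->mm == current->mm`, and `ptrace_rc` stands in for the return value of __ptrace_may_attach(), assuming (as described above) that zero means permission was granted.

```python
EPERM = 1  # errno value; the kernel functions return -EPERM on refusal


def maps_allowed_buggy(same_mm, ptrace_rc):
    """Models: if (task->mm != current->mm && !__ptrace_may_attach(task)) goto out;"""
    if (not same_mm) and (not ptrace_rc):  # !rc is true when rc == 0, i.e. success
        return False  # goto out: request refused
    return True


def maps_allowed_fixed(same_mm, ptrace_rc):
    """Same check with the exclamation mark removed."""
    if (not same_mm) and ptrace_rc:  # non-zero rc means attach was refused
        return False
    return True


if __name__ == "__main__":
    for same_mm in (True, False):
        for rc in (0, -EPERM):
            print(same_mm, rc,
                  maps_allowed_buggy(same_mm, rc),
                  maps_allowed_fixed(same_mm, rc))
```

In this model the buggy version refuses exactly the case that should succeed (another process’s maps, with a zero return) and allows the case that should be refused, which matches the behaviour seen in the return-code probe.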

And so we’re done. My uses of systemtap weren’t nearly as complex as those in the tutorial, but I’m happy that I saved myself the kernel compiles. I’d somehow managed to miss any hype around systemtap; if you’re another systemtap user, please consider blogging your code!

All about the bling.

I noticed that the AIGLX movies on the Fedora wiki showed a neat wobbly minimize animation, and decided to find out where the code was. It was disabled in metacity CVS; here’s a patch to enable this animation as the default, with the explosion animation still available if you start metacity with USE_EXPLOSION=1. Enjoy!

No more oops.

GIT is immensely impressive. Sadly, Dominik Brodowski is even more impressive, and has a fix for the bug I was having fun bisecting sitting in his PCMCIA tree; note to self to next time mail the relevant maintainer and ask if they know anything about the bug you’re going to try and fix.

Here’s the git bisect visualisation letting me know which merge was responsible, which is where I decided to check out brodo’s tree. (Of course, I could also have continued the bisection down to the individual patch.)

Rawhide bug update.

Busy day. I’m pleased and impressed that 2/3 of the bugs I mentioned yesterday are fixed:

  • S3 sleep works again after applying this patch from Hugh Dickins. I don’t get video or ethernet when I come back, but that’s taken care of by unloading my ethernet driver beforehand, and by killing/restarting X on resume. I should see if using vbetool differently (restorestate, perhaps) helps with that.

  • A CVS commit to lvm claims to fix the problem I had, and the new package is in the buildqueue.

  • Nothing new on my pccardctl eject oops. I’m going to try and git bisect this over the weekend — it’s a nice candidate for bisection, since it was working as recently as 2.6.16 and it’s not clear whether this is a locking problem (the PCMCIA code moved from semaphores to mutexes, but the patch looks sensible) or a netdev problem.