Systemtap for fun and profit

Kjartan Maraas pointed me at this Fedora bug yesterday — it points out that /proc/$pid/maps has been broken in Rawhide for a month. The patch that made linux/fs/proc/task_mmu.c (which is where map requests are handled) diverge from mainline is this one. I can’t read more about its motivation since the Bugzilla ID is security blocked.

So, where to start? mm_for_maps() is a new function that does a bunch of checks on the relationship between task and current before deciding whether to allow the request; I threw some printk()s in to find out which were failing, and found that we take the !__ptrace_may_attach(task) path to the out label in the code below:

if (task->mm != mm)
    goto out;
if (task->mm != current->mm && !__ptrace_may_attach(task))
    goto out;

From there, it got hazy. __ptrace_may_attach() returns whatever security_ptrace() does. This takes us into pluggable LSM land: any LSM that registered a struct security_operations with a pointer to a ptrace function gets a shot at returning an error, which security_ptrace() would pass back to stop our request from completing.
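To make that dispatch concrete, here is a minimal user-space sketch in C of how a call might be routed through a registered table of hooks. It is only a model: struct task, the one-hook struct security_operations, and every demo_ function are hypothetical stand-ins, and the permission rule in demo_cap_ptrace is invented for illustration rather than copied from the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the kernel's task structure; only a uid is modeled. */
struct task {
    int uid;
};

/* Stand-in for struct security_operations, reduced to a single hook. */
struct security_operations {
    int (*ptrace)(const struct task *parent, const struct task *child);
};

/* Hypothetical capability-style hook (illustrative rule, not the kernel's). */
int demo_cap_ptrace(const struct task *parent, const struct task *child)
{
    if (parent->uid == 0 || parent->uid == child->uid)
        return 0;       /* zero means "allowed" */
    return -EPERM;      /* negative errno means "denied" */
}

/* The table currently registered; in this model it is set at runtime. */
static struct security_operations *demo_security_ops;

/* Hypothetical registration call: remember which table is active. */
void demo_register_security(struct security_operations *ops)
{
    assert(ops != NULL && ops->ptrace != NULL);
    demo_security_ops = ops;
}

/* Model of security_ptrace(): hand back whatever the registered hook says. */
int demo_security_ptrace(const struct task *parent, const struct task *child)
{
    return demo_security_ops->ptrace(parent, child);
}
```

The point of the sketch is that the caller cannot tell from the source alone which function sits behind the ptrace pointer; that depends on what was registered at runtime.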

But how do I tell which LSM is complaining, or even which LSMs are loaded? After all, they’re registered at runtime. Enter systemtap, as wisely suggested to me by Bill Nottingham (to whom I now surely owe a beer). Systemtap is similar to Solaris’s DTrace; it lets you instrument and trace kernel functions and system calls on a running kernel. It was installed on my Rawhide machine by default, which is always a nice touch.

So, how to see which ptrace functions were registered? Enter my first systemtap probe:

unity:cjb~ % cat list-ptrace.stp
probe kernel.function("*ptrace*") {
    printf("%s\n", probefunc())
}
unity:cjb~ % sudo stap list-ptrace.stp

At this point, our stp script is converted into C code and compiled into a kernel module, before being loaded into the running kernel. After running a cat /proc/<pid>/maps in another terminal, we see:

__ptrace_may_attach
cap_ptrace

.. which suggests that cap_ptrace was called by __ptrace_may_attach, and that this may be where our request is being turned down. To be sure that we got to cap_ptrace via __ptrace_may_attach, we can ask for a backtrace:

unity:cjb~ % cat cap-backtrace.stp
probe kernel.function("cap_ptrace") {
    printf("%s -> %s\n", probefunc(), print_backtrace())
}
unity:cjb~ % sudo stap cap-backtrace.stp
cap_ptrace ->
trace for 6345 (cat)
 0xc04c2286 : cap_ptrace+0x7/0x49 []
 0xc042a600 : __ptrace_may_attach+0xac/0xae []
 0xc049350c : mm_for_maps+0x83/0xd8 []
 0xc0492892 : m_start+0x28/0x11d []
 0xc04800d9 : seq_read+0xdb/0x268 []
 0xc0446288 : audit_syscall_entry+0x104/0x12b []
 0xc047fffe : seq_read+0x0/0x268 []
 0xc04648e2 : vfs_read+0x9f/0x13e []
 0xc0464d2e : sys_read+0x3c/0x63 []
 0xc0403d07 : syscall_call+0x7/0xb []

We’re pretty sure that cap_ptrace was responsible. Hunting through its source, we see that it has a path to return -EPERM, which would do it. So, we recompile the kernel in order to have cap_ptrace tell us what return value it’s going to use, right? Well, no. Straight back to systemtap:

unity:cjb~ % cat return-codes.stp
probe kernel.function("*ptrace*").return {
    printf("%s -> ", probefunc())
    log(returnstr(1));
}

The .return after the function pattern tells systemtap to trigger when the function is returning, and returnstr(1) asks for the return value as a decimal. There’s also print_regs(), if you prefer to see what’s in EAX directly. Over to the other terminal to cat a maps file again, and:

unity:cjb~ % sudo stap return-codes.stp
cap_ptrace -> 0
__ptrace_may_attach -> 0

That’s odd. cap_ptrace is returning 0, which we can see in its code is meant to mean success, and __ptrace_may_attach is receiving it back unharmed. Cue an “ah-hah!” moment as we realise that the conditional:

if (task->mm != current->mm && !__ptrace_may_attach(task))
    goto out;

.. has the wrong polarity; each of the functions that __ptrace_may_attach backs onto returns zero for “success” (permission to attach), but the logic above is “if we’re not trying to get the map of the current process, and __ptrace_may_attach returns zero, we should fail”. The exclamation mark needs to disappear.
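The inverted test is easy to see in a small user-space sketch in C. Everything here is a hypothetical stand-in: demo_may_attach models a function that returns zero for permission and -EPERM for refusal, and the two demo_*_denies functions model the conditional with and without the exclamation mark.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for __ptrace_may_attach(): 0 = allowed, -EPERM = denied. */
int demo_may_attach(bool allowed)
{
    return allowed ? 0 : -EPERM;
}

/* The check as quoted above: takes the "goto out" path when the callee returns 0. */
bool demo_buggy_denies(bool same_mm, bool allowed)
{
    return !same_mm && !demo_may_attach(allowed);
}

/* The check without the '!': takes the "goto out" path on a non-zero (error) return. */
bool demo_fixed_denies(bool same_mm, bool allowed)
{
    return !same_mm && demo_may_attach(allowed) != 0;
}

/* Walk the cases that matter; each assert states the expected outcome. */
void demo_self_check(void)
{
    /* Another process's maps, attach permitted: buggy check refuses it. */
    assert(demo_buggy_denies(false, true) == true);
    assert(demo_fixed_denies(false, true) == false);

    /* Another process's maps, attach refused: buggy check lets it through. */
    assert(demo_buggy_denies(false, false) == false);
    assert(demo_fixed_denies(false, false) == true);

    /* Our own maps: neither version refuses. */
    assert(demo_buggy_denies(true, true) == false);
    assert(demo_fixed_denies(true, true) == false);
}
```

In this model the buggy version gets both interesting cases backwards, which matches the symptom: a permitted read of another process's maps is turned away.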

And so we’re done. My uses of systemtap weren’t nearly as complex as those in the tutorial, but I’m happy that I saved myself the kernel compiles. I’d somehow managed to miss any hype around systemtap; if you’re another systemtap user, please consider blogging your code!

All about the bling.

I noticed that the AIGLX movies on the Fedora wiki showed a neat wobbly minimize animation, and decided to find out where the code was. It was disabled in metacity CVS; here’s a patch to enable this animation as the default, with the explosion animation still available if you start metacity with USE_EXPLOSION=1. Enjoy!

Rawhide bug update.

Busy day. I’m pleased and impressed that two of the three bugs I mentioned yesterday are fixed:

  • S3 sleep works again after applying this patch from Hugh Dickins. I don’t get video or ethernet when I come back, but that’s taken care of by unloading my ethernet driver beforehand, and by killing/restarting X on resume. I should see if using vbetool differently (restorestate, perhaps) helps with that.

  • A CVS commit to lvm claims to fix the problem I had, and the new package is in the buildqueue.

  • Nothing new on my pccardctl eject oops. I’m going to try and git bisect this over the weekend — it’s a nice candidate for bisection, since it was working as recently as 2.6.16 and it’s not clear whether this is a locking problem (the PCMCIA code moved from semaphores to mutexes, but the patch looks sensible) or a netdev problem.

Rawhide!

Keeping track of details for a few bugs I’m seeing in current Rawhide on the new laptop:

  • S3 sleep is broken, possibly libata-wide since Jeremy is hitting the same symptoms as me (no disk access on resume) and he’s on AHCI while I’m on sd_mod and ata_piix. The symptoms are syslogd complaining that it can’t write out the journal, and lots of repeated:
sd 0:0:0:0: SCSI error: return code = 0x40000
end_request: I/O error, dev sda, sector 18800717
  • pccardctl eject is giving me an oops, which is new to Rawhide; looks like this is upstream, since it’s the same oops as in this lkml post.
  • The lvm(8) in my kernel-2141 initrd fails to boot, giving:
Volume group for uuid not found: (uuid)
0 logical volume(s) in volume group "group0" now active
mount: could not find filesystem /dev/root

I reverted bin/lvm in the initrd to the version found in 2136, which has the same version number but a different md5sum — that booted, so this looks like a new bug in lvm.