Posted by Chris Ball
Thu, 11 May 2006 00:40:00 GMT
Kjartan Maraas pointed me at this Fedora bug yesterday — it points out that /proc/$pid/maps has been broken in Rawhide for a month. The patch that made linux/fs/proc/task_mmu.c (which is where map requests are handled) diverge from mainline is this one. I can't read more about its motivation since the Bugzilla ID is security blocked.
So, where to start? mm_for_maps() is a new function that does a bunch of checks on the relationship between task and current before deciding whether to allow the request; I threw some printk()s in to find out which were failing, and found that we take the !__ptrace_may_attach(task) path to the out label in the code below:
if (task->mm != mm)
    goto out;
if (task->mm != current->mm && !__ptrace_may_attach(task))
    goto out;
From there, it got hazy. __ptrace_may_attach() returns whatever security_ptrace() does. This takes us into pluggable-LSM land: any LSM that registered a struct security_operations with a pointer to a ptrace function gets a shot at returning an error, which is passed back through security_ptrace() to stop our request from completing.
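To make that hook lookup concrete, here is a minimal userspace sketch in C. It is not kernel code: struct task_model, struct security_ops_model, cap_ptrace_model() and security_ptrace_model() are hypothetical names invented for illustration, the capability check is a simplified assumption, and the loop over several modules (plus the denied_by out-parameter) is only a way of showing "who said no", not a claim about how the kernel chains LSMs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-in for the parts of a task this check looks at. */
struct task_model {
    unsigned int cap_permitted;   /* bitmask of permitted capabilities */
    int has_cap_sys_ptrace;       /* non-zero if CAP_SYS_PTRACE is held */
};

/* Simplified model of a security-operations table: only the ptrace hook. */
struct security_ops_model {
    const char *name;
    int (*ptrace)(const struct task_model *tracer,
                  const struct task_model *child);   /* NULL if not provided */
};

/* Capability-style hook (a simplified assumption, not the kernel function):
 * refuse when the child holds capabilities the tracer lacks, unless the
 * tracer has CAP_SYS_PTRACE. Returns 0 for "allowed". */
int cap_ptrace_model(const struct task_model *tracer,
                     const struct task_model *child)
{
    int child_has_extra = (child->cap_permitted & ~tracer->cap_permitted) != 0;

    if (child_has_extra && !tracer->has_cap_sys_ptrace)
        return -EPERM;
    return 0;
}

/* Ask each module that supplied a ptrace hook; the first non-zero answer
 * is handed straight back to the caller. *denied_by (optional) is set to
 * the name of the module that refused. */
int security_ptrace_model(const struct security_ops_model *mods, size_t n,
                          const struct task_model *tracer,
                          const struct task_model *child,
                          const char **denied_by)
{
    assert(mods != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (mods[i].ptrace == NULL)
            continue;                      /* module has no opinion */
        int err = mods[i].ptrace(tracer, child);
        if (err != 0) {
            if (denied_by != NULL)
                *denied_by = mods[i].name;
            return err;
        }
    }
    return 0;                              /* nobody objected */
}
```

The point of the sketch is the convention: every hook returns 0 to allow and a negative errno to refuse, and that value travels unchanged back up to the caller.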
But how do I tell which LSM is complaining, or even which LSMs are loaded? After all, they're registered at runtime. Enter systemtap, as wisely suggested to me by Bill Nottingham (to whom I now surely owe beer). Systemtap is similar to Solaris's DTrace; it'll let you instrument and track kernel functions and system calls for a running kernel. It was installed on my Rawhide machine by default, which is always a nice touch.
So, how to see which ptrace functions were registered? Enter my first systemtap probe:
unity:cjb~ % cat list-ptrace.stp
probe kernel.function("*ptrace*") {
    printf("%s\n", probefunc())
}
unity:cjb~ % sudo stap list-ptrace.stp
At this point, our stp script is converted into C code and compiled into a kernel module, before being loaded into the running kernel. After running a cat /proc/<pid>/maps in another terminal, we see:
__ptrace_may_attach
cap_ptrace
.. which suggests that cap_ptrace was called by __ptrace_may_attach, and that cap_ptrace is where our request might be getting turned down. To be sure that we got to cap_ptrace via __ptrace_may_attach, we can ask for a backtrace:
unity:cjb~ % cat cap-backtrace.stp
probe kernel.function("cap_ptrace") {
    printf("%s -> %s\n", probefunc(), print_backtrace())
}
unity:cjb~ % sudo stap cap-backtrace.stp
cap_ptrace ->
trace for 6345 (cat)
0xc04c2286 : cap_ptrace+0x7/0x49 []
0xc042a600 : __ptrace_may_attach+0xac/0xae []
0xc049350c : mm_for_maps+0x83/0xd8 []
0xc0492892 : m_start+0x28/0x11d []
0xc04800d9 : seq_read+0xdb/0x268 []
0xc0446288 : audit_syscall_entry+0x104/0x12b []
0xc047fffe : seq_read+0x0/0x268 []
0xc04648e2 : vfs_read+0x9f/0x13e []
0xc0464d2e : sys_read+0x3c/0x63 []
0xc0403d07 : syscall_call+0x7/0xb []
We're pretty sure that cap_ptrace was responsible. Hunting through its source, we see that it has a path to return -EPERM, which would do it. So, we recompile the kernel in order to have cap_ptrace tell us what return value it's going to use, right? Well, no. Straight back to systemtap:
unity:cjb~ % cat return-codes.stp
probe kernel.function("*ptrace*").return {
    printf("%s -> ", probefunc())
    log(returnstr(1));
}
The .return after the function pattern tells systemtap to trigger when the function is returning, and returnstr(1) asks for the return value as a decimal. There's also print_regs(), if you prefer to see what's in EAX directly. Over to the other terminal to cat a maps file again, and:
unity:cjb~ % sudo stap return-codes.stp
cap_ptrace -> 0
__ptrace_may_attach -> 0
That's odd. cap_ptrace is returning 0, which we can see in its code is meant to mean success, and __ptrace_may_attach is receiving it back unharmed. Cue an "ah-hah!" moment as we realise that the conditional:
if (task->mm != current->mm && !__ptrace_may_attach(task))
    goto out;
.. has the wrong polarity; each of the functions that __ptrace_may_attach backs onto returns zero for "success" (permission to attach), but the logic above says "if we're not trying to get the map of the current process, and __ptrace_may_attach returns zero, we should fail". The exclamation mark needs to disappear.
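The polarity problem is easy to reproduce outside the kernel. The following is a small userspace sketch in C, not the kernel source: may_attach_stub() is a hypothetical stand-in for __ptrace_may_attach() that follows the same "0 means allowed" convention, and the two maps_allowed_*() helpers are invented names wrapping the conditional with and without the exclamation mark.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for __ptrace_may_attach(): 0 means "allowed",
 * a negative errno means "denied". */
int may_attach_stub(bool allowed)
{
    return allowed ? 0 : -EPERM;
}

/* The check with the "!" in place: a successful (zero) return is
 * treated as a reason to bail out. */
bool maps_allowed_buggy(bool same_mm, bool allowed)
{
    if (!same_mm && !may_attach_stub(allowed))
        return false;   /* goto out */
    return true;
}

/* The check with the exclamation mark removed: any non-zero return
 * from the permission hook rejects the request. */
bool maps_allowed_fixed(bool same_mm, bool allowed)
{
    if (!same_mm && may_attach_stub(allowed))
        return false;   /* goto out */
    return true;
}

/* Self-check for a task that is not the current one. */
void polarity_self_check(void)
{
    assert(!maps_allowed_buggy(false, true));   /* permitted, yet refused */
    assert(maps_allowed_fixed(false, true));    /* permitted and allowed */
    assert(!maps_allowed_fixed(false, false));  /* denied and refused */
}
```

Note that the inverted version is wrong in both directions: it also lets the request through when the hook says no, which is worth keeping in mind for a check that exists for security reasons.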
And so we're done. My uses of systemtap weren't nearly as complex as those in the tutorial, but I'm happy that I saved myself the kernel compiles. I'd somehow managed to miss any hype around systemtap; if you're another systemtap user, please consider blogging your code!
Tags fedora, kernel, linux | 8 comments | 7722 trackbacks
Posted by Chris Ball
Mon, 24 Apr 2006 01:19:00 GMT
GIT is immensely impressive. Sadly, Dominik Brodowski is even more impressive, and has a fix for the bug I was having fun bisecting sitting in his PCMCIA tree. Note to self: next time, mail the relevant maintainer first and ask whether they know anything about the bug you're about to try to fix.
Here's the git bisect visualisation letting me know which merge was responsible, which is where I decided to check out brodo's tree. (Of course, I could also have continued the bisection down to the individual patch.)
Tags kernel, linux | 11 comments | 6 trackbacks
Posted by Chris Ball
Sat, 22 Apr 2006 03:23:00 GMT
Busy day. I'm pleased and impressed that 2/3 of the bugs I mentioned yesterday are fixed:
- S3 sleep works again after applying this patch from Hugh Dickins. I don't get video or ethernet when I come back, but that's taken care of by unloading my ethernet driver beforehand, and by killing/restarting X on resume. I should see if using vbetool differently (restorestate, perhaps) helps with that.
- A CVS commit to lvm claims to fix the problem I had, and the new package is in the buildqueue.
- Nothing new on my pccardctl eject oops. I'm going to try to git bisect this over the weekend; it's a nice candidate for bisection, since it was working as recently as 2.6.16 and it's not clear whether this is a locking problem (the PCMCIA code moved from semaphores to mutexes, but the patch looks sensible) or a netdev problem.
Tags fedora, kernel, linux | 1 comment | 9 trackbacks
Posted by Chris Ball
Fri, 31 Mar 2006 00:56:00 GMT
After writing my previous post about notifications (which is required reading for the rest of this one, I'm afraid), I was able to talk to Robert Love about the Yi Yang patch, and why he thinks it didn't get a good response. He gave a few reasons:
- It's lossy; you lose events from boot time, from before the userspace daemon starts, and if the daemon crashes.
- Requires root to listen on netlink, as opposed to the inotify_add_watch(2) syscall interface of inotify.
- For this purpose poll(2) is good, netlink is bad.
All of which are reasonable complaints. If that's the wrong solution, though, what would the right one look like? Thankfully, Robert has ideas on that too (and I hope I explain them correctly):
To avoid the causes of lossiness above, the log should be an on-disk log maintained by the filesystem. Having it done by each filesystem separately isn't necessarily awful for maintainability; the ext3 journalling layer is supposedly generic. There should be sequence points, so the (single) userspace daemon reading the log knows that it's up to date as of sequence n, and can tell the kernel to clean the parts of the log up to n.
The on-disk log would be fixed in size and circular — so still lossy so far — but Robert has an idea (which Tridge and Rusty Russell are apparently also partly responsible for) to make sure the lossiness doesn't hurt so bad. Here it is.
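As a rough illustration of the sequence-point idea, here is an in-memory sketch in C of a fixed-size circular log whose reader acknowledges "I'm up to date as of n". Everything in it is an assumption made for the example (the names struct event_log, log_append(), log_trim(), log_pending(), the slot count, the in-memory storage); the design described above is an on-disk log maintained by the filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define RING_SLOTS 4     /* deliberately tiny so overflow is easy to see */
#define PATH_LEN   64

struct fs_event {
    unsigned long seq;
    char path[PATH_LEN];
};

struct event_log {
    struct fs_event slots[RING_SLOTS];
    unsigned long head;   /* sequence number of the oldest retained event */
    unsigned long next;   /* sequence number the next event will receive */
    bool lost;            /* set once an unread event has been overwritten */
};

/* Record an event and return the sequence number it was given. */
unsigned long log_append(struct event_log *evlog, const char *path)
{
    assert(strlen(path) < PATH_LEN);

    if (evlog->next - evlog->head == RING_SLOTS) {
        evlog->head++;           /* overwrite the oldest: circular, so lossy */
        evlog->lost = true;
    }
    struct fs_event *ev = &evlog->slots[evlog->next % RING_SLOTS];
    ev->seq = evlog->next;
    strcpy(ev->path, path);
    return evlog->next++;
}

/* The reader has processed everything up to and including sequence n,
 * so those slots can be reused. Out-of-range values are ignored. */
void log_trim(struct event_log *evlog, unsigned long n)
{
    if (n >= evlog->head && n < evlog->next)
        evlog->head = n + 1;
}

/* Number of events recorded but not yet acknowledged. */
size_t log_pending(const struct event_log *evlog)
{
    return evlog->next - evlog->head;
}
```

A daemon that keeps up only ever sees the lost flag stay false; one that falls behind by more than RING_SLOTS events loses the oldest entries, which is exactly the lossiness the next idea tries to soften.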
You do event "compression"; the log is stored in a tree of path names and events. If you have change events for a couple of hundred files in /home/foo/{bar,baz,etc}, you mark all of /home/foo as dirty and throw away the events inside it. Userspace has to go off and stat(2)-dance inside /home/foo to find which files have changed, but at least you've traded precision for accuracy and come out with a log that enables every change to be noticed. You'd keep reparenting as you run out of room in the log, so if you exhausted log space recording changes in /home/foo and /home/bar, all of /home gets marked dirty.
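Here is one way the reparenting could look, as a toy sketch in C. It uses a flat array instead of a tree, keeps only "this path is dirty" rather than event types, and picks the deepest entry to collapse first; all of those are my simplifying assumptions, and struct dirty_log, mark_dirty() and is_dirty() are hypothetical names. Paths are assumed to be absolute and already normalised (no trailing slash).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LOG_SLOTS 4      /* tiny on purpose, to force compression */
#define PATH_LEN  256

struct dirty_log {
    char paths[LOG_SLOTS][PATH_LEN];
    size_t used;
};

/* True if `path` is `dir` itself or lies somewhere beneath it. */
static bool covers(const char *dir, const char *path)
{
    size_t n = strlen(dir);

    if (strcmp(dir, "/") == 0)
        return path[0] == '/';
    return strncmp(dir, path, n) == 0 && (path[n] == '\0' || path[n] == '/');
}

/* Number of path components: "/" is 0, "/home/foo" is 2. */
static size_t depth(const char *path)
{
    size_t d = 0;

    if (strcmp(path, "/") == 0)
        return 0;
    for (; *path != '\0'; path++)
        if (*path == '/')
            d++;
    return d;
}

/* Remove every entry that `dir` covers. `dir` must not point into the log. */
static void drop_covered(struct dirty_log *dlog, const char *dir)
{
    size_t out = 0;

    for (size_t i = 0; i < dlog->used; i++) {
        if (covers(dir, dlog->paths[i]))
            continue;
        if (out != i)
            strcpy(dlog->paths[out], dlog->paths[i]);
        out++;
    }
    dlog->used = out;
}

/* Replace the deepest entry (and anything beside it) with its parent. */
static void collapse_deepest(struct dirty_log *dlog)
{
    char parent[PATH_LEN];
    size_t deepest = 0;

    for (size_t i = 1; i < dlog->used; i++)
        if (depth(dlog->paths[i]) > depth(dlog->paths[deepest]))
            deepest = i;

    strcpy(parent, dlog->paths[deepest]);
    char *slash = strrchr(parent, '/');
    if (slash == NULL || slash == parent)
        strcpy(parent, "/");
    else
        *slash = '\0';

    drop_covered(dlog, parent);
    strcpy(dlog->paths[dlog->used++], parent);
}

/* Record that `path` changed, trading precision for space when full. */
void mark_dirty(struct dirty_log *dlog, const char *path)
{
    assert(path[0] == '/' && strlen(path) < PATH_LEN);

    for (;;) {
        for (size_t i = 0; i < dlog->used; i++)
            if (covers(dlog->paths[i], path))
                return;                  /* already implied by an entry */
        if (dlog->used < LOG_SLOTS)
            break;
        collapse_deepest(dlog);          /* out of room: reparent */
    }
    drop_covered(dlog, path);            /* new entry may cover old ones */
    strcpy(dlog->paths[dlog->used++], path);
}

/* Would userspace have to rescan `path`? */
bool is_dirty(const struct dirty_log *dlog, const char *path)
{
    for (size_t i = 0; i < dlog->used; i++)
        if (covers(dlog->paths[i], path))
            return true;
    return false;
}
```

With four slots, recording two changes under /home/foo, two under /home/bar and then one under /etc collapses the /home/foo pair into a single /home/foo entry; userspace is then told to rescan that whole directory, but no change has gone unrecorded.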
This is still root-only so far, but you can build security on it.
So! This is a lot more work than Yang's elegant netlink patch, but I've decided that ridding the world of updatedb is a worthy goal, and so I'll be starting work on Robert's design next week. I've booked days off work to go to LinuxWorld Boston, and an interested friend is visiting from the UK as well. A further idea is to get the userspace daemon to export inotify-compatible events, so that programs like Beagle can use this new mechanism without requiring a rewrite. I'm assured that apps like F-spot and Leaftag are waiting for this kind of event notification too.
(Note: Firefox hung as I was writing this post, taking the unfinished blog entry with it, with strace hanging on a futex. I blame the flash on the LinuxWorld site. But! Getting a core file with gdb's gcore and running strings on it gave me a perfectly-formatted blog post back. Yay!)
Tags kernel, linux | 16 comments | 5 trackbacks
Posted by Chris Ball
Mon, 27 Mar 2006 17:55:00 GMT
It took far too long to find this half-remembered linux-kernel thread on Google amidst all the results (mostly from distro installer guides) claiming that swap partitions are preferable to swap files. Here's hoping the link below will help future searchers.
Swap files and swap partitions have the same performance:
In 2.6 [swap files and swap partitions] have the same reliability and they will have the same performance unless the swapfile is badly fragmented.
-- Andrew Morton, on linux-kernel.
(There is a performance difference under Linux 2.4, though, as explained at the link above.)
Update: Tim comments that swsusp (the kernel software suspend-to-disk support) only works with swap partitions, so there is still one good reason to use a swap partition. Suspend2 doesn't have this limitation.
Tags kernel, linux | 11 comments | 5 trackbacks
Posted by Chris Ball
Sat, 25 Mar 2006 05:20:00 GMT
sweet rattle of disk:
a new locatedb comes;
I should go to sleep.
— me.
I've been thinking, over the last week or two, about the right way to handle an incremental updatedb — and in turn, the right way to handle generic filesystem notifications from the kernel. Fortunately, someone else has been thinking about it too and has actually been writing code, but we'll get to that in a moment.
These thoughts started off as a linux-kernel thread, with Jon Masters wondering if inotify can be used for this. Alas no, for what appear to be many reasons:
- inotify has no support for recursive watches; you'd have to put a watch on every directory on the system.
- Even this wouldn't work, because there's a hard limit on the number of watches on a system (8192 per "device", but inotify doesn't use a device interface anymore, it uses syscalls now) and this limit is an order of magnitude smaller than the number of directories on my /.
- There's a race condition which would kill performance, meaning that you have to do a stat(2) dance over each directory to make sure you see modifications: while you can guarantee that the kernel will deliver you each event for a directory you're interested in, you can't guarantee being able to register interest in a newly-created directory before something happens to files inside it, leaving you needing to scan inside the directory after registering the watch on it.
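The "watch first, then scan" ordering from the last point can be sketched in C for a single directory. This is Linux-only and illustrative: watch_then_scan() is a name I made up, the event mask is an arbitrary choice, and a real indexer would recurse into subdirectories and stat(2) each entry where the comment indicates.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>
#include <sys/inotify.h>

/* Register an inotify watch on `dir`, then scan it so that anything
 * created before the watch existed is still noticed. Returns the number
 * of entries seen by the scan, or -1 on error; *wd_out receives the
 * watch descriptor. */
int watch_then_scan(int inotify_fd, const char *dir, int *wd_out)
{
    assert(dir != NULL && wd_out != NULL);

    /* 1. Watch first, so that anything happening from now on is queued. */
    *wd_out = inotify_add_watch(inotify_fd, dir,
                                IN_CREATE | IN_MODIFY | IN_DELETE | IN_MOVED_TO);
    if (*wd_out < 0)
        return -1;

    /* 2. Then scan, to pick up whatever appeared before step 1. */
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int found = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        found++;   /* a real indexer would stat() and record the entry here */
    }
    closedir(d);
    return found;
}
```

Doing it in the other order (scan, then watch) leaves a window in which a file can appear unseen by both steps; this order closes the window at the price of possibly seeing the same file twice, once in the scan and once as an event.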
So that isn't going to work. As Jon points out, there are more uses for this than stopping your Linux box acting as a bedtime alarm clock every day; anti-virus people want it (and already use LSMs for the purpose), and smart indexing/backup tools could use it. OS X and Vista both have this kind of indexing service.
What's really needed is a lightweight layer that sends notifications of filesystem events to userspace via netlink, such that userspace can do what it wants with them. Luckily for us, that's exactly the patch that appeared on linux-kernel yesterday, courtesy of Yi Yang. This is a small and non-invasive patch (I think the relevant code-review phrase here is: "This is elegant, but correct.") that does just what we need and no more. It hasn't had a great reception so far, but I'd love to see it in mainline.
And now, I really should go to sleep. G'night!
Tags kernel, linux | 20 comments | 5 trackbacks