Filesystem notifications revisited

After writing my previous post about notifications (which is required reading for the rest of this one, I’m afraid), I was able to talk to Robert Love about the Yi Yang patch, and why he thinks it didn’t get a good response. He gave a few reasons:

  • It’s lossy; you lose events from boot time, from before the userspace daemon starts, and if the daemon crashes.
  • It requires root to listen on netlink, whereas inotify’s inotify_add_watch(2) syscall interface is available to unprivileged users.
  • For this purpose poll(2) is good, netlink is bad.

All of which are reasonable complaints. If that’s the wrong solution, though, what would the right one look like? Thankfully, Robert has ideas on that too (and I hope I explain them correctly):

To avoid the causes of lossiness above, the log should be an on-disk log maintained by the filesystem. Having it done by each filesystem separately isn’t necessarily awful for maintainability; the ext3 journalling layer is supposedly generic. There should be sequence points, so the (single) userspace daemon reading the log knows that it’s up to date as of sequence n, and can tell the kernel to clean the parts of the log up to n.
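
To make the sequence-point idea concrete, here is a minimal sketch in Python. It is an in-memory stand-in for the on-disk log, and the class and method names (EventLog, append, read_after, ack) are hypothetical, not anything from Robert's design or an existing kernel interface. It only illustrates the contract: the daemon reads up to sequence n, then tells the log it can discard everything up to n.

```python
from collections import deque


class EventLog:
    """Toy stand-in for the filesystem-maintained log (names are hypothetical)."""

    def __init__(self):
        self._entries = deque()  # (seq, path, event), in increasing seq order
        self._next_seq = 1

    def append(self, path, event):
        """Record one event and return the sequence number assigned to it."""
        seq = self._next_seq
        self._next_seq += 1
        self._entries.append((seq, path, event))
        return seq

    def read_after(self, seq):
        """Return every retained entry with a sequence number greater than seq."""
        return [entry for entry in self._entries if entry[0] > seq]

    def ack(self, seq):
        """The daemon is up to date as of seq, so drop entries up to and including it."""
        while self._entries and self._entries[0][0] <= seq:
            self._entries.popleft()


log = EventLog()
log.append("/home/foo/a", "modify")
log.append("/home/foo/b", "create")
log.append("/home/bar/c", "delete")

batch = log.read_after(0)      # daemon catches up from the start
last_seen = batch[-1][0]       # highest sequence number it has read
log.ack(2)                     # daemon has durably processed events 1 and 2
remaining = log.read_after(0)  # only event 3 is still held in the log
```

If the daemon crashes after processing but before acknowledging, it simply rereads from its last acknowledged sequence number on restart, which is the property the netlink approach lacks.
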
The on-disk log would be fixed in size and circular — so still lossy so far — but Robert has an idea (which Tridge and Rusty Russell are apparently also partly responsible for) to make sure the lossiness doesn’t hurt so bad. Here it is.

You do event “compression”; the log is stored in a tree of path names and events. If you have change events for a couple of hundred files in /home/foo/{bar,baz,etc}, you mark all of /home/foo as dirty and throw away the events inside it. Userspace has to go off and stat(2)-dance inside /home/foo to find which files have changed, but at least you’ve traded precision for accuracy and come out with a log that enables every change to be noticed. You’d keep reparenting as you run out of room in the log, so if you exhausted log space recording changes in /home/foo and /home/bar, all of /home gets marked dirty.
This is still root-only so far, but you can build security on it.
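
The compression scheme can be sketched like this. It is a toy model in Python with several assumptions of my own: the log is a flat dict rather than an on-disk tree, capacity is counted in entries rather than bytes, paths are absolute, and the rule for choosing which directory to collapse (the one with the most direct entries, preferring deeper directories on a tie) is just one plausible heuristic. The names CompressedLog and DIRTY are hypothetical.

```python
import posixpath

DIRTY = "dirty"  # hypothetical marker meaning "rescan everything under this path"


def _depth(path):
    """Number of path components, so "/" is 0 and "/home/foo" is 2."""
    return len([part for part in path.split("/") if part])


class CompressedLog:
    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = {}  # absolute path -> event name, or DIRTY for a collapsed subtree

    def _covered(self, path):
        """True if the path itself or any ancestor directory is marked dirty."""
        while True:
            if self.entries.get(path) == DIRTY:
                return True
            parent = posixpath.dirname(path)
            if parent == path:  # reached the root (or a path with no parent)
                return False
            path = parent

    def record(self, path, event):
        if self._covered(path):
            return  # a dirty ancestor already tells userspace to rescan here
        self.entries[path] = event
        while len(self.entries) > self.capacity:
            self._collapse()

    def _collapse(self):
        # Count entries per parent directory, then collapse the busiest one.
        counts = {}
        for path in self.entries:
            parent = posixpath.dirname(path)
            counts[parent] = counts.get(parent, 0) + 1
        victim = max(counts, key=lambda d: (counts[d], _depth(d)))
        self._mark_dirty(victim)

    def _mark_dirty(self, directory):
        # Throw away everything beneath the directory and keep one dirty marker.
        prefix = directory.rstrip("/") + "/"
        for path in [p for p in self.entries if p.startswith(prefix)]:
            del self.entries[path]
        self.entries[directory] = DIRTY


log = CompressedLog(capacity=3)
for name in ("bar", "baz", "etc", "qux"):
    log.record("/home/foo/" + name, "modify")
after_foo = dict(log.entries)   # four events no longer fit: /home/foo is dirty

for name in ("a", "b", "c"):
    log.record("/home/bar/" + name, "modify")
after_bar = dict(log.entries)   # /home/bar collapses next

log.record("/home/baz/x", "modify")
log.record("/home/qux/y", "modify")
final = dict(log.entries)       # out of room again: all of /home is dirty
```

Each collapse either frees entries or moves a marker one level closer to the root, so the log never exceeds its capacity and never silently drops a change. The worst case is a single dirty marker on /, which tells userspace to rescan everything.
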

So! This is a lot more work than Yang’s elegant netlink patch, but I’ve decided that ridding the world of updatedb is a worthy goal, and so I’ll be starting work on Robert’s design next week. I’ve booked days off work to go to LinuxWorld Boston, and an interested friend is visiting from the UK as well. A further idea is to get the userspace daemon to export inotify-compatible events, so that programs like Beagle can use this new mechanism without requiring a rewrite. I’m assured that apps like F-spot and Leaftag are waiting for this kind of event notification too.

(Note: Firefox hung as I was writing this post, taking the unfinished blog entry with it, with strace hanging on a futex. I blame the flash on the LinuxWorld site. But! Getting a core file with gdb’s gcore and running strings on it gave me a perfectly-formatted blog post back. Yay!)


  1. We ‘gnome-vfs & nautilus’ people would be even more interested. I wonder why f-spot and even leaftag get mention but not nautilus 😉

  2. If the log is fixed in size, why can’t it just be kept in memory? It seems like storing it on disk would cause a lot of thrashing, that could dramatically slow down large disk updates (perhaps by a factor of 2 or even an order of magnitude for some situations I can envision).

  3. Thanks for the Bugzilla link! That looks about right. I was running esd, although I didn’t have any esd functions in my backtrace.

    I think the reason rlove wants to use the disk is that the sort of log size we’re considering is about 20 MB, and we’re unlikely to get this enabled by default if it adds an extra 20 MB of RAM usage to the kernel.

    > dramatically slow down large disk updates

    Well, ext3 already gives you the choice of journal/ordered/writeback modes — perhaps we want to make the update mode of this log depend on the chosen journal update mode. I think for now it’ll be best to get it working and see what comes up.

