Fwd: VM bug

From: Andy Valencia <vandys_at_nospam.org>
Date: Mon Jun 07 1999 - 10:50:45 PDT

Real good debug work here. Note that the VSTa list doesn't permit binary
attachments in messages, so I'm forwarding Eric's message, along with the
test program in source form.

I need to scratch my head on the COW aspect a bit before I send an actual
answer.

Andy Valencia

>From: Eric_Jacobs@fc.mcps.k12.md.us (Eric Jacobs)
>To: vsta@zendo.com
>Date: Sun, 6 Jun 1999 19:47:59 -0400
>Subject: Threads race for VM
>Message-ID: <msg1997270.thr-2665246.10cf5d@fc.mcps.k12.md.us>
>Organization: Montgomery County Public Schools
>MIME-Version: 1.0
>
>I just finished uncovering a nasty race for virtual memory that
>sometimes occurs in multi-threaded processes. The processes most likely
>to be affected are those that call statically linked functions for the
>first time in two or more threads at the same time.
>
>Race.c, when linked using the static libraries (-lc_s), will cause
>a kernel panic after the thread fork (kern/atl.c: add_atl: already
>there). The sequence of events I suspect goes something like this:
>One thread goes first, and calls syslog. That page isn't loaded
>into memory yet, so it locks the page slot and asks the server for
>that page, with fod_fillslot. Before that read is completed, the
>other thread also tries to call syslog. The page isn't available
>yet, so it also faults. When this thread goes to lock the page
>slot, the slot is busy and so it blocks. The first thread completes
>the read and attaches the page, sets the hardware translation,
>and releases the lock. However, the second thread has already
>faulted! So the second thread wakes up, references the page that was
>just loaded, and tries to attach it, but this would mean that a
>page is being attached twice for the same pview, which is a no-no
>and the kernel panics.
>
>I added a simple loop to vm_fault.c after the lock_slot() that scans
>the attach list to see if the page is already there before it goes to
>fill it, and if it is, it just skips it and returns without an error.
>The logic is that if that page is already attached, there shouldn't
>have been a fault, and we just return. This simple solution works for
>all situations except for copy-on-write; a COW page would already
>have an attachment for that page anyway. More thought is required for
>this one; COW pages may still be able to race (although I haven't
>encountered a situation where this happens).
>
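
The check described above can be sketched in user space roughly as
follows. This is only an illustrative model, not the actual vm_fault.c
change: the struct layouts, the field names other than pp_atl, the enum,
and the functions is_attached() and model_fault() are all assumptions
made up for the example. It presumes the slot lock is already held.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; the real VSTa structures differ. */
struct pview { int id; };

struct atl {                    /* one attach-list entry */
    struct pview *a_pview;      /* view the page is attached to */
    struct atl *a_next;
};

struct perpage {
    struct atl *pp_atl;         /* head of the attach list */
};

enum fault_result { FAULT_FILLED, FAULT_SKIPPED, FAULT_DUPLICATE };

/* Return 1 if pv already has an attachment on pp, else 0. */
static int
is_attached(const struct perpage *pp, const struct pview *pv)
{
    const struct atl *a;

    for (a = pp->pp_atl; a != NULL; a = a->a_next) {
        if (a->a_pview == pv) {
            return 1;
        }
    }
    return 0;
}

/*
 * Model of the fault path once the slot lock is held. With check set,
 * an existing attachment means another thread already filled the page,
 * so return without error. Without it, the duplicate attach is
 * reported (the kernel would panic in add_atl at this point).
 */
enum fault_result
model_fault(struct perpage *pp, struct pview *pv, struct atl *node, int check)
{
    assert(pp != NULL && pv != NULL && node != NULL);

    if (check && is_attached(pp, pv)) {
        return FAULT_SKIPPED;
    }
    if (is_attached(pp, pv)) {
        return FAULT_DUPLICATE;
    }
    node->a_pview = pv;         /* attach: push onto the list */
    node->a_next = pp->pp_atl;
    pp->pp_atl = node;
    return FAULT_FILLED;
}
```

Calling model_fault() twice for the same view with check set gives
FAULT_FILLED and then FAULT_SKIPPED; with check clear, the second call
reports FAULT_DUPLICATE. As the message says, a check like this does not
cover the copy-on-write case, where an attachment already exists.
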
>The real issue here, I suppose, is that page slot "locks" aren't
>really like the normal p_lock/v_lock kind of lock, but rather more
>like semaphores that set up a critical section that encompasses the
>perpage, attach list and hat_*trans information. When another thread
>tries to access any address in that page while we have the lock, it's
>violating our critical section. Of course, the i386 doesn't know that;
>it just does what the HAT tells it to do. So when a thread gets a
>fault for a page and the page slot is busy, that means that when the
>i386 generated the fault, it used the HAT information when it was not
>algorithmically correct to do so.
>
>What this means is that when lock_slot finds the slot is busy and
>has to wait, we need to recheck all of the conditions that could
>have caused a fault, because the processor wasn't using "thread-safe"
>information. Fortunately, this condition seems to be rather rare.
>Perhaps a more ideal solution would be to have lock_slot() return
>a flag which indicates whether it had to wait for the slot to be
>free. That way we wouldn't have to scan the pp_atl every time.
>Or maybe we could just have vm_fault return in such a case, to try
>to regenerate the fault now that the HAT is up-to-date?
>
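
One way to picture the "return a flag" idea is the user-space sketch
below, built on POSIX mutexes and condition variables. It is a model
under stated assumptions, not kernel code: struct slot, SLOT_INIT,
handle_fault() and the enum are hypothetical, and the lock_slot() and
unlock_slot() here are stand-ins that only borrow the kernel routines'
names, with lock_slot() changed to report whether it slept.

```c
#include <pthread.h>

/* Hypothetical user-space model of a page slot; not the VSTa structure. */
struct slot {
    pthread_mutex_t mtx;
    pthread_cond_t freed;
    int busy;                   /* nonzero while some thread owns the slot */
};

#define SLOT_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Take the slot. Returns 1 if the caller had to sleep for it, else 0. */
int
lock_slot(struct slot *s)
{
    int waited = 0;

    pthread_mutex_lock(&s->mtx);
    while (s->busy) {
        waited = 1;
        pthread_cond_wait(&s->freed, &s->mtx);
    }
    s->busy = 1;
    pthread_mutex_unlock(&s->mtx);
    return waited;
}

/* Release the slot and wake any threads sleeping in lock_slot(). */
void
unlock_slot(struct slot *s)
{
    pthread_mutex_lock(&s->mtx);
    s->busy = 0;
    pthread_cond_broadcast(&s->freed);
    pthread_mutex_unlock(&s->mtx);
}

enum fault_result { FAULT_FILL, FAULT_RETRY };

/*
 * Fault path using the flag: if we slept, the state that caused the
 * fault may be stale, so give up and let the access be retried.
 */
enum fault_result
handle_fault(struct slot *s)
{
    if (lock_slot(s)) {
        unlock_slot(s);
        return FAULT_RETRY;
    }
    /* Uncontended: the page fill and attach would happen here. */
    unlock_slot(s);
    return FAULT_FILL;
}
```

With this shape the attach list only needs rechecking (or the fault
retrying) on the contended path; an uncontended fault pays nothing extra.
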
>Using shared libraries with -lc, race.c won't panic the kernel
>because the libc shared libraries are likely already loaded by the
>time the program runs. If you put a syslog() before the thread fork,
>it will also avoid the panic, because syslog gets paged in from the
>binary before the race happens. This can make debugging this kind of
>thing a real pain; when I put syslog() calls in to see where the panic
>was happening, the problem went away!

Race.c:

#include <syslog.h>
#include <sys/syscall.h>

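/*
 * Entry point for the second thread: its first call to syslog()
 * races with the one in main().
 */
void
thread(void) {
        syslog(LOG_INFO, "thread info");
        mutex_thread(0);
}

int
main(void) {
        int t;

        /* Start the second thread before syslog() has been paged in. */
        t = tfork(thread, 0);

        syslog(LOG_INFO, "main info");
        mutex_thread(0);

        return(0);
}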
Received on Mon Jun 7 09:44:58 1999
