Re: Performance

From: Dave Hudson <dave_at_nospam.org>
Date: Thu Jan 19 1995 - 09:31:27 PST

Andrew Valencia wrote:
>
> Bizarre. I've had ~4 unsubscribes since you made your performance posting.
> There's lies, damn lies, and performance numbers, but I guess some people
> still base decisions on 'em. :-)

Well, I hope people aren't reading too much into my figures, as they're
based on 3 kernel calls per 2 switches and involve messaging activity. I
noticed from a note sent to me yesterday that the 5us switch on QNX uses
sched_yield() - I think maybe I need to write a new benchmark (just think of
all those missing message copies and mallocs :-)). FWIW on my DX/2-66 I show
just under 100us now, but I should be able to halve that (easily) if I
reconstruct my test.

> >>[inlining]
> >> I agree, it's a powerful improvement to the language. Presumably we'll see
> >> an iteration of ANSI C which adds it.
> >I hope so, as yesterday I found it could save me another 5% or so, but
> >macroising the code in question would be really nasty as some of it's quite
> >long (but only used once or twice).
>
> Let's agree kernel source can require gcc (but not 2.X-only features, for
> now) and get the performance.

OK, everything so far builds fine except for one inline assembly construct
which I believe also fails under gcc 2.6.2 (sounds like I need to work this
one out and mail the gcc buglist).

> If you *really* want to give Linux a bad day, try measuring dispatch latency
> on a busy system. Which reminds me, we need to enable kernel preemption.
> :-) Linux is enjoying the benefits of a non-reentrant monolithic kernel,
> but it crashes on the rocks badly if it hits memory leaks or stale memory
> references (fun in a shared kernel address space), needs to dispatch
> something while a kernel thread is active (fork() is usually one of the most
> painful), or (heh!) wants to run on multiple processors.
 
Funny, I was only thinking of moving that check_preempt() line today :-)

> For some of this, we're just paying the price for now (as we compare
> performance with systems using simplifying assumptions) but the enabling
> technologies underneath should pay off over time.

Like atomic locking (which is where I seem to have my gcc buglet). This and
the mutex code could be really tweaked for single processor use!

> >On my quest for yet more performance I've found a real gotcha that to a
> >lesser extent affects server operations, but particularly affects kernel
> code - the hash table manipulations. Looking at the assembly output we
> generate a long divide instruction. This costs 40 clocks on either a 386,
> 486 or a Pentium. From what I understand, a lot of the RISC world needs to
> do this in software, so it's not a particularly desirable feature.
>
> Ok, I'll take a look at this. I think your idea of moving it up to a power
> of two makes a lot of sense. We could just store a hash mask. The only
> hesitation I feel is some dusty corner of my mind remembers being told that
> sometimes a prime is the best modulo for hash functions. But I certainly
> haven't used that to date.
>
> >I got some time last night to generate assembly output for all of the
> >kernel. I grepped it all for div instructions as they're really nasty (2.5
> >us on a 386sx-16), and found that apart from in the kernel debugger (not
> >really a problem) we have one each in use in the malloc() and free() code,
> >one in perm_ctl(), and one in pick_run().
>
> In free()? Why?
>
> The one in malloc() should probably be handled by using a big gnarly ?:
> construct to divine the right bucket index at compile time. The sizes
> passed are usually constants (which flattens the ?: down to a constant) and
> the binary search otherwise is probably better than the current code anyway.
> See how BSD's kernel malloc (malloc.h, in particular) does it.
>

Fixed these already, but I'm not sure whether the hash changes have helped
that much. I think two of the divs were caused by the use of % operators -
anyway, one disappears by using a shift instead, and I tweaked the malloc
code and got rid of the other. I'll take a look at the malloc approach you
mentioned - it looks like a good idea.

> Can you generate ASCII files from your analyzer trace? I'd be interested in
> traces of a zero-length message exchange for:

I'll see what I can do - I ought to at least be able to get at the
disassembled output and write a script to insert the values.

> I've Cc'ed the list as there appears to be some interest on performance
> tuning.

                                Regards,
                                Dave
Received on Thu Jan 19 09:20:01 1995

This archive was generated by hypermail 2.1.8 : Thu Sep 22 2005 - 15:12:17 PDT