Performance

From: Andrew Valencia <vandys_at_nospam.org>
Date: Thu Jan 19 1995 - 09:03:57 PST

Bizarre. I've had ~4 unsubscribes since you made your performance posting.
There's lies, damn lies, and performance numbers, but I guess some people
still base decisions on 'em. :-)

>> I'm going to be on vacation for 1-2 weeks. I'll have my laptop along, and
>> have been fixing the swapping portion of VSTa. I've found a couple bugs so
>> far, but it's still not quite right.
>That will be good to see. Have you implemented a reserved pool for the
>memory allocations yet? This seemed a favourite solution at one point, but
>I wondered if you'd thought of another way of doing things.

The page stealing code is *really* broken. I haven't found the problem yet,
but it looks like it might involve the references made by the pseudo-view
during message passing. I still think pure page stealing will cover most of
the needs which would otherwise require a special pool. We'll see.

>>[inlining]
>> I agree, it's a powerful improvement to the language. Presumably we'll see
>> an iteration of ANSI C which adds it.
>I hope so, as yesterday I found it could save me another 5% or so, but
>macroising the code in question would be really nasty as some of it's quite
>long (but only used once or twice).

Let's agree kernel source can require gcc (but not 2.X-only features, for
now) and get the performance.
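
For the record, the kind of thing I mean. A sketch only: this is not the
real p_lock(), just the shape of it, and the single-byte lock layout here
is assumed for illustration.

        /*
         * Sketch: a spin lock as a static inline.  As a macro this
         * would be unreadable; as an inline, gcc folds it into the
         * caller with no call overhead and still type-checks it.
         */
        static inline void
        lock_sketch(volatile unsigned char *lp)
        {
                unsigned char old;

                do {
                        __asm__ __volatile__("xchgb %0,%1"
                                : "=q" (old), "=m" (*lp)
                                : "0" (1)
                                : "memory");
                } while (old);
        }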

>On a quick test I ran yesterday, gcc running with no inlining showed itself
>to produce code about 25-30% slower than with inlining. I assume lcc would
>give about the same performance (based on what everyone tells me). I'm
>beginning to wonder if (for now at least) lcc is such a good idea for the
>kernel? (can't you tell that I'm starting to see my performance figures
>catching up with my Linux system :-))

lcc is pretty awful--no register allocation. Hopefully this will be fixed;
it looked like it was only the implicit register use for the string functions
which messed up the code generator.

>Actually on a serious note, I've been pondering performance of message
>passing generally. Assuming that there was hardware support I think it
>would soon dominate calls within monolithic kernels, but I've been looking
>at code complexity. It appears that a lot of monolithic kernel drivers have
>to jump through hoops to ensure that preemption doesn't kill them, whereas
>our drivers don't seem to need this, just letting the kernel handle this
>instead. I'm beginning to sympathise with the QNX claims that we can visit
>the kernel more often and still end up with as good a level of
>performance. I suspect this is largely because optimising monolithic kernel
>code would be far more difficult (our 5 minute modify, rebuild, reboot and
>test cycle is pretty impressive really) - I'd never really thought about it
>before.

If you *really* want to give Linux a bad day, try measuring dispatch latency
on a busy system. Which reminds me, we need to enable kernel preemption.
:-) Linux is enjoying the benefits of a non-reentrant monolithic kernel,
but it crashes on the rocks badly if it hits memory leaks or stale memory
references (fun in a shared kernel address space), needs to dispatch
something while a kernel thread is active (fork() is usually one of the most
painful), or (heh!) wants to run on multiple processors.
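
A crude probe (nothing VSTa-specific, and the numbers are only as good as
the timer): ask for a 10ms sleep over and over while you load the machine
down, and watch how late the wakeups come back.

        /* Crude dispatch latency probe: request 10ms sleeps and report
         * the worst-case slop beyond the 10ms we asked for.
         */
        #include <stdio.h>
        #include <sys/time.h>
        #include <unistd.h>

        int
        main(void)
        {
                struct timeval t0, t1, delay;
                long lat, worst = 0;
                int i;

                for (i = 0; i < 1000; ++i) {
                        delay.tv_sec = 0;
                        delay.tv_usec = 10000;
                        gettimeofday(&t0, 0);
                        select(0, 0, 0, 0, &delay);
                        gettimeofday(&t1, 0);
                        lat = (t1.tv_sec - t0.tv_sec) * 1000000L +
                                (t1.tv_usec - t0.tv_usec) - 10000L;
                        if (lat > worst)
                                worst = lat;
                }
                printf("worst wakeup slop: %ld usec\n", worst);
                return(0);
        }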

For some of this, we're just paying the price for now (as we compare
performance with systems using simplifying assumptions) but the enabling
technologies underneath should pay off over time.

>> [omit frame pointer]
>> This makes stacks almost unreadable in a crash. OK if you have no kernel
>> debugger & no DEBUG.
>I suspected as much, but it might be worth a thought in a "production"
>application.

Agreed.

>On my quest for yet more performance I've found a real gotcha that to a
>lesser extent affects server operations, but particularly affects kernel
>code - the hash table manipulations. Looking at the assembly output we
>generate a long divide instruction. This costs 40 clocks on a 386, a 486
>or a Pentium. From what I understand, a lot of the RISC world needs to do
>this in software, so it's not a particularly desirable feature.

Ok, I'll take a look at this. I think your idea of rounding the table size
up to a power of two makes a lot of sense. We could just store a hash mask.
The only hesitation I feel is that some dusty corner of my mind remembers
being told that a prime is sometimes the best modulo for hash functions, but
I certainly haven't made use of that to date.
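
In other words, something like this (a sketch; the field names are invented,
not the current kernel structures):

        /*
         * With the table size rounded up to a power of two, h_mask is
         * (size - 1) and the modulo collapses to a single AND: no div
         * at all.
         */
        struct hash_sketch {
                unsigned int h_mask;    /* table size - 1 */
                void *h_tab[1];         /* really h_mask + 1 entries */
        };

        #define HASHIDX(h, key)         ((key) & (h)->h_mask)

        /* versus the ~40 clock div we pay today: */
        /* #define HASHIDX(h, key)      ((key) % (h)->h_size) */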

>I got some time last night to generate assembly output for all of the
>kernel. I grepped it all for div instructions as they're really nasty (2.5
>us on a 386sx-16), and found that apart from in the kernel debugger (not
>really a problem) we have one each in use in the malloc() and free() code,
>one in perm_ctl(), and one in pick_run().

In free()? Why?

The one in malloc() should probably be handled with a big gnarly ?:
construct to divine the right bucket index at compile time. The sizes
passed are usually constants (which flattens the ?: down to a constant),
and otherwise the resulting binary search is probably better than the
current code anyway.
See how BSD's kernel malloc (malloc.h, in particular) does it.
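
Something along these lines, say. This is a sketch with assumed bucket
sizes (doubling from 16) and an invented name; BSD's actual macro differs
in detail.

        /*
         * With a constant size argument, gcc folds the whole chain down
         * to a constant index at compile time, so the usual
         * malloc(sizeof (struct foo)) costs no div and no loop.
         */
        #define BUCKET(size) \
                ((size) <=   16 ? 0 :  \
                 (size) <=   32 ? 1 :  \
                 (size) <=   64 ? 2 :  \
                 (size) <=  128 ? 3 :  \
                 (size) <=  256 ? 4 :  \
                 (size) <=  512 ? 5 :  \
                 (size) <= 1024 ? 6 :  \
                 (size) <= 2048 ? 7 : 8)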

>I must admit, I'm very surprised that I haven't really found any major
>improvements that I can make (apart from the inlining of the p_lock() and
>v_lock() functions) that will individually give me more than a couple of
>percent at best. That said though, I now have a kernel that even with
>debugging turned on is nearly twice as fast at a lot of things as when I
>started, and I haven't yet recompiled the servers with the improved library
>code.

This just means I was (usually) not asleep at the keyboard while writing
VSTa. I suspect the *big* wins still to be had after your excellent work
(which was necessary, and thanks much for your efforts!) lie in some design
modifications. Kernel continuations (where a thread switches directly over
to the context of the "other" side of a client/server interaction) showed a
big win in Mach 3.0. Are your tests using RT processes? Otherwise the
hierarchical scheduler will slow you down (although the "cheated" queue
sidesteps much of this for non-CPU-bound scenarios). A special path for
messages with small or no buffers is the only other change I've come up with.

Can you generate ASCII files from your analyzer trace? I'd be interested in
traces of a zero-length message exchange for:

        msg_send()..swtch()
        swtch()..msg_send() (return)
        msg_receive()..swtch()
        swtch()..msg_receive() (return)
        msg_reply()..swtch()
        swtch()..msg_reply() (return)

And again for a one-byte message. Also, a:

        msg_receive() (return)..page_fault (return)

can give me an idea of what it takes to bring in a page on demand.

If you can dump symbols into your disassembler on the analyzer, cool;
otherwise nm(1) it and I'll hack up something to interpose the symbolic
values here.

I've Cc'ed the list as there appears to be some interest in performance
tuning.

                                                Regards,
                                                Andy