Re: A different performance measure

From: Dave Hudson <dave_at_nospam.org>
Date: Sat Jan 21 1995 - 07:55:53 PST

Hi,

David Koogler wrote:
>
> The new numbers look a lot more reasonable! I just could not see how a context
> switch could take 100us, except maybe on a 6 MHz 80286.

As I just mentioned in my reply to Mike, there's room for improvement yet,
but I think we have to keep a sense of proportion. When Mike tested a
kernel for me last week on a 16MHz 386sx he saw the 1.4ms switch time drop
to about 750us, but file I/O throughput only improved by 5-10%.

> On a related point, I saw this note on comp.linux.development.system and
> thought I should pass it along:
>
> > From: pjensen@csi.compuserve.com (Phil Jensen)
> > Subject: code alignment - mod 16 vs mod 4
> > Date: 19 Jan 1995 11:43:56 -0500
> >
> > As this may affect kernel efficiency, I thought I'd ask: in my (Slackware)
> > 1.0.9 kernel, the compiler obviously used an alignment of 16, but the loader
> > used 4. So as you read through zSystem.map, for a while all the entry points
> > end in 4, then c, then 0... If 16-byte alignment is worth doing, it's worth
> > doing right - this way just wastes space. So I ask - is this still
> > happening?

A lot of this is down to compile-time switch settings. Linux is normally
built with -m486, which 16-byte-aligns functions and does some other 4-byte
alignment. It also changes some of the frame handling opcodes, since some
code sequences are faster than others. This is basically as described in
Appendix G of the 486 Programmer's Reference Manual. In reality this is not
always a good thing, but on the whole it leads to faster code (one of my
earlier "optimised" kernels was actually 10% *slower* with -m486, but now
it's about 4% faster). There's been quite a good discussion of these sorts
of points on comp.arch since I subscribed to it a week or so ago -
basically, padding functions makes them very easy to prefetch (a very big
win on clock-multiplied CPUs), but reduces the amount of real code in the
L1 cache (a big loss).
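
As a quick illustration (this is just a sketch I knocked up, not anything
from the kernel tree - the file and function names are made up, and the
flags assume a gcc of roughly this vintage), you can see the alignment
choice directly by compiling a trivial file both ways and looking at where
the symbols land:

/*
 * alignment_demo.c - purely illustrative.  -m486 pads function entry
 * points out to 16-byte boundaries; -m386 leaves them on smaller ones.
 *
 *   gcc -O2 -m486 -c alignment_demo.c
 *   nm alignment_demo.o | sort      (entry addresses end in 0)
 *
 *   gcc -O2 -m386 -c alignment_demo.c
 *   nm alignment_demo.o | sort      (entries now pack much closer together)
 *
 * The padding between entries is what shows up as wasted space in
 * zSystem.map when the loader then only honours 4-byte alignment.
 */

int add_one(int x)
{
	return x + 1;
}

int add_two(int x)
{
	return x + 2;
}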

> > Another compilation thing - has anyone ever actually benchmarked kernel speed
> > with and without -fomit-frame-pointer? I haven't, but I've looked at SIZE,
> > which is about 3% LARGER with -fomit... Why? Every frame reference is one
> > byte longer (the "s-i-b" byte - necessary to get references off of %esp).
>
> I have seen a lot of applications compiled for Linux where -fomit-frame-pointer
> is a common optimization. Using s-i-b references is not particularly fast,
> especially as Intel applied various RISC optimizations which speed up the
> simple instructions (register-to-register, mem-displacement-to-register,
> and the like). Any of the complex indexing modes are penalized and tend to
> run slower. The joys of overlapped instruction execution and its
> associated lockout problems.

-fomit-frame-pointer is a huge win for small functions, but these are much
better off being inlined anyway :-) It's also a win if we can do most
things with register variables, but deciding which files should and
shouldn't be compiled this way is very difficult. As for speed, I rebuilt a
kernel a couple of days ago with every file compiled -fomit-frame-pointer
and it was 5% slower than with the frame pointer left in.
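
To see what Phil is describing (again just a sketch - the file and function
names are mine, and this assumes a 32-bit x86 compiler of this era), compile
a trivial function both ways and compare the assembler output:

/*
 * frame_demo.c - illustrative only.
 *
 *   gcc -O2 -S frame_demo.c -o with_fp.s
 *   gcc -O2 -fomit-frame-pointer -S frame_demo.c -o without_fp.s
 *
 * With the frame pointer the arguments are addressed as disp8(%ebp), a
 * plain mod/rm + displacement encoding.  Without it they become
 * disp8(%esp), and any memory operand based on %esp has to go through the
 * s-i-b form, so each reference picks up an extra byte - which is where
 * the ~3% size growth comes from.
 */

int sum3(int a, int b, int c)
{
	return a + b + c;
}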

Not that inlining is the answer to everything - on a 486 the worsened cache
performance can actually lead to some horrendous CPU stalls while new code
is prefetched. I've now got a couple of really good examples of places
where it's appalling.

As another general comment, I think it is well worth considering the
difference in initial design aims for, say, QNX or Linux against VSTa.
VSTa set out to be portable to other architectures, with an initial design
requirement to support SMP. I could easily create a very much faster VSTa
kernel simply by limiting my ultimate goal to a fast single-CPU x86 O/S -
immediately the mutexing code could be made much faster, and there'd be no
messing around with per-CPU structures.
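
To make that concrete, here's a rough sketch in GNU C for a 32-bit x86
target - it's not VSTa's actual locking code, and all the names are mine:

/* SMP case: another CPU may be spinning on the same lock, so every
 * acquire needs an atomic, bus-locked read-modify-write.  */
typedef volatile int spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
	int old;

	do {
		/* xchg with a memory operand is implicitly bus-locked */
		__asm__ __volatile__("xchgl %0, %1"
				     : "=r"(old), "+m"(*l)
				     : "0"(1));
	} while (old != 0);
}

static inline void spin_unlock(spinlock_t *l)
{
	*l = 0;
}

/* Uniprocessor case: the only thing that can race with us is an interrupt
 * handler on the same CPU, so briefly disabling interrupts is enough -
 * no locked bus cycles and no per-CPU bookkeeping.  */
static inline unsigned long intr_save(void)
{
	unsigned long flags;

	__asm__ __volatile__("pushfl; popl %0; cli"
			     : "=r"(flags) : : "memory");
	return flags;
}

static inline void intr_restore(unsigned long flags)
{
	__asm__ __volatile__("pushl %0; popfl"
			     : : "r"(flags) : "memory", "cc");
}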

In the long run, I think VSTa has a more appropriate design to cope with
the systems that I expect we'll be seeing in the next few years. Just going
off subject for a while, the laws of physics are rapidly getting to the
stage where single CPUs will run out of steam, and even now I invariably
have to consider the rather finite speed of signal propagation (approx 3-4
inches per ns in a PCB seems a good rule of thumb) when working on fast
motherboards :-) Looking at things like the PCI bus spec, reflections are
now assumed to be the norm rather than the result of poor design, and going
to much wider buses gives little except increased silicon package costs and
more complex (read: more expensive) motherboards.

                                Regards,
                                Dave

PS. If anyone wishes to comment on that last paragraph I think it would be
better to take it off the mailing list (for now at least).
Received on Sat Jan 21 07:29:31 1995
