Re: Project registry

From: Andrew Valencia <vandys_at_nospam.org>
Date: Mon Sep 27 1993 - 19:23:05 PDT

[nick@nsis.cl.nec.co.jp (Gavin Thomas Nicol) writes:]

>Well, this list is quiet. I hope that doesn't mean that everyone has
>abandoned VSTa.

Well, I'll be the last one out the door! :-)

I have actually been making progress on my filesystem. Today, I got
large files working. A file is structured as:

| File header | Extents | Data...

So the stuff UNIX would put in an inode is actually just the first N bytes
of the file. Each extent is merely a <start,len> tuple. The sizes are
chosen carefully, so that a 100-byte file occupies a single 512-byte
sector, including both file data as well as file ownership, protection,
and allocation information. A directory is just a special file, so
smaller directories will also occupy a single sector.

As a file grows, its existing allocation is extended where possible. So
if your file is currently {<3,1>} (i.e., sector three, one sector long),
and you write enough data to push out beyond the initial sector, the
filesystem will see how many blocks can be had starting at 4. Up to 64K
is taken at once, in the hope that taking this much proactively will
spare you having to "piecemeal" your file together as you write.

Thus, your file could then be {<3,129>}. The file's length is marked to
be the entire length of the whole allocation (129*512 = 66048). If you
crash, the file thus correctly represents the storage it holds, without
fsck being required before filesystem operation is resumed.

As you write the file, the filesystem keeps track of the highest offset
you have written. On last close, blocks beyond this point are freed, and
the file is "shrunk" down to the actual size written. If you wrote
a K or so, then closed the file, the example file would end up as {<3,3>},
and sectors 6..131 would return to the free list.

The free list uses a sorted linked list of free ranges, with coalescing.
Each range is held in a single sector; each sector has a next pointer
to another sector's worth of free list ranges. A sector can hold 0 or
more entries; if it ever overflows, a global rebalance is done, resulting
in each sector being exactly half full (except the last in the list which
might hold less). This allows contiguous ranges of free blocks to be easily
identified, and is very storage efficient, even with small allocation
units (such as sectors). It is also easy to prove that the free list
can always represent every free block, because a block on the free
list itself can be consumed to provide another sector's worth of storage
for the free list. Bitmaps are popular for free block management, but
their size grows as a function of increasing filesystem size and decreasing
allocation unit size. Most implementations either take a larger allocation
size--say, 8K--and waste disk space for smaller files, or add tremendous
complexity to special-case sub-allocation-size units (i.e., "frags" in
the BSD filesystem). VSTa is primarily an experimental platform, so I
decided to try something different.

The most complex part of the filesystem is the block cache. This is
a misnomer; each entry in the buffering system is variably-sized.
For an extent of up to 64K, a buffer of the exact size is kept. For
larger extents, the extent is buffered in 64K chunks.

Between the contiguous block allocation and the variable-size buffer,
I hope to provide very efficient filesystem services. Consider a
file with allocation {<10,8>} (i.e., a file starting at sector 10 for 8
sectors--4K). When this file is opened and read, a 4K buffer is allocated.
A single disk I/O fills the entire 4K. Thus, most small files will have
their entire contents buffered in a single disk I/O. Larger files will
have their contents buffered in chunks of 64K. Once the file contents
are in the filesystem's buffer, they may be parcelled out to clients without
further I/O activity.

This scheme should work well for text files, objects, and some executables.
It will not work well for, say, databases. They would do better to talk
directly to a disk partition, or through a similar, but non-buffered
version of this filesystem. My initial prototype will also not protect
itself well against multiple, competing allocation requests. I hope to
incrementally improve this as I learn more from actual use.

                                                Regards,
                                                Andy Valencia
Received on Mon Sep 27 19:21:55 1993

This archive was generated by hypermail 2.1.8 : Wed Sep 21 2005 - 19:37:12 PDT