CS222 Lecture: Memory Systems Last revised 4/23/99
Introduction
A. In CS221, we looked at the basic building blocks of memory systems: the
   individual devices - chips, disks, tapes, etc.
B. We now focus on complete memory systems. We will see that memory systems
are seldom composed of just one type of memory; instead, they are
HIERARCHICAL systems composed of a mixture of technologies aimed at
achieving a good tradeoff between speed, capacity, cost, and physical
size.
I. The memory hierarchy
A. Since every instruction executed by the CPU requires at least one
memory access (to fetch the instruction) and often more, the performance
of memory has a strong impact on system performance.
B. With present technologies, it turns out to be possible to build very fast
memories - that are quite expensive - or to build quite inexpensive
memories - that are relatively slow. The following table summarizes the
currently available technologies:
(FILL IN QUANTITY COLUMN LAST)
   Technology       Typical quantity in     Access time    Transfer rate   $/bit
                    a multi-user                           (bytes/sec)
                    system (bytes)

   CPU registers    10^1..10^5              ~ 10 ns        10^9            -
   and "on chip"
   memory

   Static RAM       0..10^6                 ~ 20 ns        10^8            10^-3..10^-4

   Dynamic RAM      10^6..10^8              ~ 60 ns        10^7            10^-5..10^-6

   Disk             10^8..10^11             ~ 10 ms        10^6            10^-7

   Tape             10^8..10^10 / reel      > 1 sec (due   10^6            10^-7..10^-9
                    (unlimited # reels)     to mounting)
C. Thus, sophisticated systems will often have a HIERARCHY of memories,
using several different technologies, with a relatively small amount of
very fast memory, a much larger amount of fast memory (MOS), and a very
large amount of slow memory (disk and tape.)
FILL IN QUANTITY COLUMN - NOTE: Numbers represent a range of systems from
moderate PC's to multi-user systems
1. The hierarchy can be pictured this way:
_________________________________
Registers | CPU registers (part of CPU) |
----------------|-------------------------------|---------------
"Memory" | Very fast memory (cache) |
| (on CPU chip and/or separate) |
|-------------------------------|
| Fast memory (main) |
| (Dynamic RAM) |
| |
| |
| |
| |
| |
| |
|-------------------------------|
| Slow memory (virtual) |
| (Disk) |
| |
| |
| |
| |
| |
| |
: :
: :
----------------|-------------------------------|---------------
File system | Disk and tape |
: :
: :
2. Note that the portion described as "memory" corresponds to memory as
it is viewed by the assembly language program (locations that can be
specified by the operand address part of an instruction.) Within
this portion, only main memory is absolutely needed.
a. Cache memory serves to provide fast access to a subset of
memory that is needed often, and contains copies of frequently
accessed locations in main memory.
b. Virtual memory serves to provide additional capacity beyond what
is physically available in main memory.
c. Both look to the programmer like main memory, but are not
physically implemented as such.
3. Note, too, that disk plays two roles - as part of the memory system
and as part of the file system, and that tape is treated here as part
of the file system. This is because disk and tape are often accessed
from programs using IO statements, and are physically connected into
the system as part of the IO system. We consider only the virtual
memory role for disk in this lecture.
D. This hierarchy can provide very good performance by taking advantage of
the principle of LOCALITY OF REFERENCE: if one observes the memory
references generated by a program during a short window of time, it will
be the case that most of the references will pertain to a small subset
of the total address space of the program.
1. This happens because most of the execution time of a program is spent
executing loops.
2. Within a loop, the same body of code is being executed over and over;
in addition, it is usually the case that there are some data items
being accessed over and over.
3. The basic idea, then, is to keep in the fastest memory in the
hierarchy those data items that are being used currently, with the
moderately fast memory used for items that will be needed soon and
slow memory used for items that will not be needed until the distant
future.
a. Assembly language programmers and optimizing compilers try to
   make good use of the CPU registers toward that end, but we will
   not discuss that here.
b. We will focus on the memory system proper.
E. We will proceed soon to discuss cache memory and virtual memory in
detail. First, though, we must consider a little more of the interface
between the CPU and the memory system as a whole.
1. As a user or system program is running, it generates a stream of memory
references for instruction fetch, operand fetch, and result stores.
These take the form of an address plus a read/write specifier - e.g.
CPU ---> memory
Read 1703
Write 3424
Read 1704
2. The majority of requests are reads
a. Instruction fetches: every instruction involves at least one read
b. In evaluating expressions, intermediate results are typically kept
in CPU registers with a final store at the end.
Ex: X := A + B * C - D
would need (on a 1-address plus general register machine):
5 instruction fetches
4 operand fetches
1 store
So 90% of the accesses are reads. (70% - 90% reads is typical)
c. Therefore, the primary design goal in the memory system is to
optimize reads from memory - without, of course, penalizing writes.
3. Another issue we need to discuss is memory management:
a. Many systems place some address translation hardware between the
CPU and the memory system. This is called a memory management
unit, and may be physically part of the CPU or may be a separate
device.
CPU ---- Memory management ---- Memory system
b. Each logical address generated by the CPU is translated into a
physical address by the MMU.
c. The original incentive for doing this was multiprogramming -
allowing several different programs (perhaps belonging to several
different users) to be resident in memory at the same time.
i. To prevent conflicts between the programs, it is necessary to
ensure that each references a distinct region of physical
memory.
ii. However, when a program is written it is generally not possible
    to know what region of physical memory it will ultimately occupy
    when it is running.
iii. Thus, programs might be written on the assumption that their
range of accessible memory addresses ranges from 0 .. some upper
limit. It is the task of the MMU to translate the addresses
generated by the program into real physical addresses.
d. Memory management also supports virtual memory - the MMU may
translate some addresses into a reference to data currently on
disk, rather than in main memory, and may cause the data to be
brought from disk into main memory.
e. We will say more about memory management later.
II. Cache Memory
A. One technique used to improve overall memory system performance is CACHE
MEMORY.
1. At one time, cache memory was a feature generally found only in
higher-end computer systems. However, as CPU speeds have continued to
increase while memory speeds have not, cache memory has become a
necessity on even desktop computer systems.
a. About 10 years ago, common CPU clock speeds were on the
order of 8-16 MHz, and DRAM cycle times on the order of 70-80 ns.
Under these circumstances, it would be possible to perform a
memory access every 1-2 clock cycles.
b. Today, CPU clocks have gone to 200-500 MHz, while DRAM access time
has improved only slightly, to about 60 ns. Thus, an access to
main memory takes on the order of 20-30 clocks.
c. Since a RISC is designed to execute one instruction per clock,
and since each instruction must be fetched from memory, it would
seem that RISC's couldn't function at the clock speeds they do
using memory of the sort available today.
d. Execution of one (or more) instructions per clock is critically
dependent on the use of cache memory.
2. Cache memory is a small, high-speed memory, logically located between
the CPU and the rest of the memory system.
a. At one point in time, cache memory was usually separate from the
CPU.
b. Today's high-speed RISCs depend on having cache memory on the CPU
chip that operates at the same speed as the CPU itself. This has
been made possible by improved chip manufacturing techniques that
allow more transistors per chip.
c. Many systems now use a two-level cache, with a small primary cache
   on the CPU chip and a larger, separate secondary cache.
3. Cache memory works because of the phenomenon of locality of reference -
in any given interval of time, a program tends to do most of its memory
accesses in a limited region of memory. This comes about due to loops
in the code and frequently-used data structures in the data. The goal
is to keep as many as possible of the currently-being-used memory
locations in the cache.
a. Each memory read is first tried against the cache. If the data is
found there (a "hit"), the processor can proceed at maximum speed.
b. Otherwise, we have a cache miss and the processor must wait for a
slower access to secondary cache or (if there is a miss there
too) to main memory.
c. Well-designed caches can achieve hit rates of 95% or more much of
the time.
d. To see why this is important, suppose we have a 200 MHz CPU with
a memory system that requires 100 ns to do an access (including
bus overhead) and a single, on-chip cache.
i. Theoretically, the time to execute an instruction is 5 ns.
ii. Suppose, however, that the cache has a hit rate of 90%. Then
90% of instructions can be fetched in 5 ns, but 10% require
100 ns. Now the average time per instructions becomes
.9 x 5 ns + .1 x 100 ns = 14.5 ns - which is equivalent to
reducing the clock rate by a factor of three! (Plus any loss
of time due to data cache misses.)
iii. With a 95% hit rate primary cache, and a 20 ns secondary
     cache that hits 90% of the primary cache misses, we get an
     average instruction fetch time of
     .95 x 5 ns + (.05)(.9) x 20 ns + (.05)(.1) x 100 ns = 6.15 ns
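     The same arithmetic can be scripted. The following is a minimal sketch
     in C; the hit rates and latencies are just the illustrative figures used
     in this example, not measurements of any real machine:

         /* Average memory access time for the examples above. */
         #include <stdio.h>

         int main(void) {
             double t_l1 = 5.0, t_l2 = 20.0, t_mem = 100.0;   /* ns */

             /* single on-chip cache with a 90% hit rate */
             double avg1 = 0.90 * t_l1 + 0.10 * t_mem;

             /* 95% primary hit rate; secondary catches 90% of primary misses */
             double avg2 = 0.95 * t_l1
                         + 0.05 * 0.90 * t_l2
                         + 0.05 * 0.10 * t_mem;

             printf("one-level cache: %.2f ns\n", avg1);   /* 14.50 ns */
             printf("two-level cache: %.2f ns\n", avg2);   /*  6.15 ns */
             return 0;
         }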
B. In principle, a cache is an associative memory containing pairs of the
form:
memory address value
(Note: On a byte-addressable machine, it is common to have each entry in
the cache contain several successive bytes - called a line or block.
For example, most VAXes use a cache based on entries holding 8
consecutive bytes of data. Thus, the lower three bits of an address
would be ignored when looking up an entry, but once the entry is found
they would be used to select the correct portion of it.)
1. However, a fully associative memory is impractical. For example, in
a cache of 1000 entries, each address emitted by the CPU would have to
be compared to all 1000 entries at the same time. Even if the required
number of comparators could be economically built, the incoming address
would have to drive 1000 logic loads. This would require several
layers of buffering (since a typical gate output can drive < 10
others), which would inject intolerable delays. So one of several
approximations to associative memory is used.
2. Direct mapping.
a. The cache is constructed as a set of pairs of the form tag, value.
The data portion of the pair is called a CACHE LINE, or CACHE
BLOCK. Typically, the size of the line or block corresponds to the
system data bus size - e.g. 4-16 bytes - for first-level cache,
but may be much bigger for a second-level cache (as big as, say,
128 bytes).
b. The number of entries is always a power of 2. Suppose it is 2^n.
c. Then the address from the CPU is broken down as follows, assuming
each line contains 8 bytes of data:
          |      tag      |  entry number  | position in entry |
          |               |     n bits     |      3 bits       |
i. Bits 3 .. n + 2 of the address select one of the entries in
the cache. If the tag of that entry matches the rest of the
address, we have a hit. (The tag comparison is needed because
many addresses map to the same cache entry.)
ii. Otherwise, a complete line of data is obtained from memory.
It is stored (along with the tag portion of its address) in the
cache for future use, replacing the entry currently there.
iii. Note that this scheme implies that at any time at most one entry
     with any given pattern in its n entry-number bits can be in the
     cache. This is usually not a problem - but may be if a loop
     accesses a data structure whose address differs from that of one
     of the instructions in the loop by a multiple of 2^n lines
     (2^(n+3) bytes, given 8-byte lines).
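     As an illustration of the address breakdown just described, the
     following sketch in C decodes an address for a direct-mapped cache with
     2^n entries and 8-byte lines and tests for a hit. The structure, the
     field names, and the choice of n = 10 are hypothetical, chosen only for
     this example:

         #include <stdint.h>
         #include <stdbool.h>

         #define N_BITS    10                    /* n: 2^10 = 1024 entries (assumed) */
         #define N_ENTRIES (1u << N_BITS)

         typedef struct {
             bool     valid;
             uint32_t tag;                       /* high-order address bits */
             uint8_t  data[8];                   /* one 8-byte cache line   */
         } line_t;

         static line_t cache[N_ENTRIES];

         /* Returns true on a hit and copies the addressed byte into *out. */
         bool cache_read_byte(uint32_t addr, uint8_t *out) {
             uint32_t offset = addr & 0x7;                    /* bits 2..0   */
             uint32_t index  = (addr >> 3) & (N_ENTRIES - 1); /* bits 3..n+2 */
             uint32_t tag    = addr >> (3 + N_BITS);          /* the rest    */

             if (cache[index].valid && cache[index].tag == tag) {
                 *out = cache[index].data[offset];
                 return true;                    /* hit */
             }
             return false;   /* miss: fetch the line from memory and refill the entry */
         }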
3. Set Associative
a. This is an improvement on direct mapping. It addresses the problem
   of a loop that happens to access a data item whose address differs
   from that of some instruction in the loop by a multiple of 2^n lines.
The low order bits of the address select not one entry in the
cache, but a set of entries. (A set size of two or four is common.)
The tags for each entry are compared in parallel with the incoming
address, and if one matches there is a hit.
b. When a reference is not found in the cache, one of the entries
in the set must be replaced. This is typically done either in
FIFO fashion, LRU fashion, or randomly.
Example: The cache memory on a VAX 11/780 was a two-way set associative
cache using 8 byte lines. There are a total of 1024 entries,
organized as 512 sets of two entries each - so the cache
capacity is 8K bytes. Random replacement is used.
An address generated by the CPU is broken down as follows:
            29           12 11         3 2      0
           |      tag      |   index    | offset |
(Note: a VAX physical address is 30 bits)
Each entry consists of 64 bits of data plus an 18 bit tag
Suppose that the processor generated the following physical
address (hex): 0001AC44, and wants to access a 4-byte longword.
The cache interprets this as:
tag = 0001A index = 188 offset = 4
If set 188 (hex) contains an entry with tag 0001A, then bytes
4..7 of that entry are returned to the CPU. Otherwise, the
contents of memory locations 0001AC40 .. 0001AC47 are fetched.
One of the two entries in set 188 (randomly chosen) is
replaced with a tag of 0001A and a value equal to the 8 bytes
fetched. In addition, 4 of the 8 bytes fetched are returned
to the CPU. (Note: Because the memory bus on the VAX is 8
bytes wide, the entire line can be fetched in one main memory
cycle.)
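           The decomposition of the address 0001AC44 can be checked with a few
           shifts and masks. A minimal sketch, using the field widths given
           above for the 11/780 (this only checks the arithmetic; it is not
           VAX code):

               #include <stdio.h>
               #include <stdint.h>

               int main(void) {
                   uint32_t addr   = 0x0001AC44;           /* 30-bit physical address  */
                   uint32_t offset = addr & 0x7;           /* bits  2..0               */
                   uint32_t index  = (addr >> 3) & 0x1FF;  /* bits 11..3  (512 sets)   */
                   uint32_t tag    = addr >> 12;           /* bits 29..12 (18-bit tag) */

                   /* prints: tag=0001A index=188 offset=4 */
                   printf("tag=%05X index=%03X offset=%X\n",
                          (unsigned)tag, (unsigned)index, (unsigned)offset);
                   return 0;
               }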
C. Issues with regard to caches
1. Replacement policies
a. Because of its size, the cache can only store a small fraction of
all the addresses lying in the address space of the current program.
b. Thus, when a reference occurs to an item that is not in the cache,
some item currently in the cache must be removed to make room for
the item.
c. With a direct mapping cache, there is no choice involved, since
each address maps to one and only one cache location. As noted
above, with a set associative scheme LRU, FIFO, or random may be
used.
2. Write-through versus write-back
a. What happens in the cache when a memory access is a write rather
   than a read, and the location referenced is in the cache?
b. In a write-through cache, the data is written to Mp and to cache
at the same time. This slows the system down some (though the CPU
can go on to the next instruction while the memory write occurs),
but not drastically since writes are proportionally rare.
c. In a write-back cache, the data is written only to the cache.
A written in bit is associated with the item, so that when it is
selected for replacement by a new entry, it is then written to
Mp. This avoids waiting for multiple writes to a single location
in Mp; but there is a potential problem if a DMA IO device is to
access data that has not yet been written back. This can be
handled by forcing the cache to be flushed to main memory before
a DMA operation is initiated.
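   A minimal sketch of the write-back idea in C, using an invented per-line
   structure with a dirty ("written in") flag. It is meant only to show when
   the copy back to Mp happens, not to model any particular cache:

       typedef struct {
           int           valid;
           int           dirty;            /* the "written in" bit */
           unsigned      tag;
           unsigned char data[8];
       } wb_line_t;

       /* A write hit updates only the cache and marks the line dirty. */
       void cache_write_byte(wb_line_t *line, unsigned offset, unsigned char value) {
           line->data[offset] = value;
           line->dirty = 1;                /* the copy in Mp is now stale */
       }

       /* On replacement, only a modified line needs to be copied back to Mp. */
       void evict(wb_line_t *line, void (*write_line_to_mp)(const wb_line_t *)) {
           if (line->valid && line->dirty)
               write_line_to_mp(line);
           line->valid = 0;
       }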
3. Validity of cache items.
a. When the system is first started up, or when there is a change of
user in a multiprogrammed environment, the cache will not contain
valid data until a sufficient number of reads have been done.
Therefore, each entry in the cache must contain a valid bit - to
be cleared initially, and to be set when an entry is copied from
Mp. (When the valid bit is clear, the cache misses on any
attempt that maps to that item. Of course, in a set associative
cache each member of a set must have its own bit.)
b. There is often a provision for the operating system to invalidate
cache entries when a context change to a new user is done.
(If the cache is write-back, this also results in changed entries
being written back to main memory.)
4. Placement of the cache
a. As we shall see, many systems include memory management hardware
between the CPU and Mp to perform translation of the addresses.
The cache can either go between the CPU and the translation
hardware or between the translation hardware and memory.
CPU -- M.cache -- K.mapping -- Mp
or CPU -- K.mapping -- M.cache -- Mp
b. Which is better?
i. In the first position, it affects only CPU accesses to memory; in
the second, both CPU and DMA IO. The former has the problem that
a DMA write access to memory could make data in the cache
incorrect. This can be handled in one of two ways:
- Software restrictions - no CPU access to a region of memory
while DMA data is being transferred to it.
- The cache can "listen in" on the memory bus (though this is
difficult with translation hardware in between.)
ii. However, the first position has advantages, too:
- The cache only serves the CPU and is not cluttered with
one-time DMA accesses.
- The cache can help reduce the number of address translations
that must be done - if a CPU address is found in the cache, it
need not be translated.
5. Set size (for set associative caches).
a. When designing a set associative cache, a set size must be chosen
(always a power of 2). (Note that using a larger set size means
there can be fewer sets for a given total cache size.)
b. A set size of two is commonly used, because it is simpler to build.
A set size of four has been found experimentally to give marginally
better hit/miss performance. Experimental evidence suggests set
sizes greater than 4 produce no significant gain.
III. Mapping Logical Memory Addresses to Physical Memory Addresses
A. We noted earlier that the CPU generates a stream of addresses
representing accesses to the memory.
B. We will speak of the stream of addresses generated by the CPU as
LOGICAL ADDRESSES. (Sometimes the term virtual address is used, even if
the system is not a virtual memory system - but we will reserve the latter
term for virtual memory systems per se.).
1. We call them LOGICAL because many systems apply some mapping function
to the address before actually presenting it to the physical memory.
2. Logical addresses are generated by the following mechanisms:
a. Contents of the PC: instruction fetches
b. The various operand addressing modes: direct, relative, indirect
etc.
C. Physically, Mp is organized as an array of individual addressable units
(bytes or words). These units are numbered 0 .. total size-1. We call
this numbering a PHYSICAL ADDRESS.
D. We call the range of possible logical addresses the LOGICAL ADDRESS
SPACE, and the range of possible physical addresses the PHYSICAL ADDRESS
SPACE.
1. The logical address space is dictated by the CPU architecture. For
example, a CPU that has a 16 bit PC and 16 bit registers will
generally have a logical address space of 64K.
2. The physical address space is dictated by the number of memory chips
installed. This in turn may be limited by the number of bits
provided on the memory address bus. For example, a 24-bit memory
address bus would dictate a maximum physical space of 16 meg - though
a given system could have less if not all possible chips are
present.
E. There are a number of possibilities for the relationship between these
two spaces:
1. They might be equal.
a. This is, for example, true on small microcomputer systems,
particularly those using 8 or 16-bit CPU's. It was also the case
on early mainframes and minis.
b. Conceptually, this is the simplest possibility. The address
output of the CPU is simply connected directly to the address bus
of the memory system.
c. But it is inflexible. The system supports a single size of
memory, with no room for expansion.
d. In multiprogrammed systems, partitioning memory is a problem.
Each user's program must be compiled to run in a specific partition
of memory, unless relative addressing is used throughout.
2. Logical space might be smaller than physical space. This possibility
is more a matter of historical significance than something that is
true of modern systems.
a. Example: The PDP-11 had a 16-bit logical address (64K); but
PDP-11 systems used either an 18 or 22 bit physical address bus
(256K or up to 4Meg).
b. PC's based on Intel 8086/80186/80286 chips used a 20-bit logical
address (1 Meg with 640K available for RAM) but used various schemes
to allow access to a physical address space as big as 4 Meg.
c. To make use of all of physical memory, some sort of mapping scheme
becomes necessary.
i. Different tasks running on the system may have the same logical
address mapped to different physical addresses - so any one task
is still of limited size, but available memory is used to
support several tasks.
ii. Or, the mapping scheme may be changeable "on the fly" to allow
a program to access different regions of physical memory at
different times.
3. Logical space might be larger than physical space. Here, there are
two possibilities.
a. Some logical addresses might simply be illegal and unused.
Example: The PowerPC chips used in Macintoshes have a 32
bit address, which would allow 4 gigabytes of memory. But
the standard configuration is 16 to 64 meg of RAM plus
128K to 256K of ROM, so many addresses are unused.
b. The system might use virtual memory, which we shall discuss
shortly.
IV. Virtual Memory Systems
A. Virtual memory systems are characterized by logical address space >
physical address space; but rather than making some logical addresses
unusable, disk storage is used as an extension of main memory. (Note:
from here on out we will call "logical address space" "VIRTUAL address
space" in recognition of the fact that this makes the memory available
to a program look much larger than it actually is.)
B. The basic idea is this: only a portion of a program's virtual address
space is actually resident in physical memory at any given time. The
remainder is kept on secondary storage, to be brought in when needed.
C. As with systems where virtual space < physical space, a mapping scheme
is used to translate virtual addresses into physical. However, this
scheme includes one possibility not found in the earlier case.
1. The mapping scheme described earlier had three possibilities:
a. Success
b. Failure due to attempting to access an unmapped portion of
virtual space.
c. Failure due to writing a readonly portion of virtual space.
2. With virtual memory, a fourth possibility is introduced: the access
may be valid, but the portion of memory desired may not be physically
resident.
D. To accomplish the necessary mapping of addresses, many systems (but not
all) divide virtual space into fixed size regions called pages, and
physical space into fixed size regions called page frames. (The sizes
are equal, so one page will fit into one frame.)
Ex: DEC VAX - the pages and page frames are 512 bytes each.
1. A virtual address is broken into two fields: a page number and an
offset in page.
Ex: On the VAX, the low order 9 bits of the virtual address are the
offset, and the remaining 23 bits comprise the page number.
2. A physical address is likewise broken into two fields: a page frame
number and an offset.
Ex: On a VAX with 8 Meg of physical memory, a (valid) physical
address would be 23 bits long. Of these, the high order 14 bits
are the frame number.
3. For each user, the operating system maintains a table in memory
called the page table. A memory management system register holds
the address of the first word of the page table for the user whose
program is currently running. The page portion of a virtual address
is used as an index into this table to select an entry.
a. One portion of the entry contains various bits to control access
(such as readonly etc.) One key bit is a valid bit that indicates
if the page in question is currently resident in physical memory.
b. The rest of the entry has two possible uses:
i. If the page is physically resident, the entry will contain
the number of the frame where it is to be found.
ii. Otherwise, the entry may contain an indication as to where to
go on disk to find the page if it is needed.
c. An example of mapping, based on the DEC VAX:
Suppose the page table contains the following entries, among others:
Entry No     Valid     Frame Number
  ...
    4          1       101010101010101010101
    5          0       ----
  ...
If the CPU generates the virtual address
00000000000000000000100101010101
The memory management hardware divides this into a page number and
offset:
00000000000000000000100 101010101
Now, since the page number is 4, the hardware goes to entry 4 in
the page table. Since this is a valid entry, it extracts the
frame number
101010101010101010101
and forms the physical address
101010101010101010101101010101
On the other hand, a virtual address
00000000000000000000101101010101
would access page table entry 5. Since this is flagged as not
valid, a trap to the operating system would occur. The current
program would be suspended while the OS gets the proper page from
disk, loads it into an available frame, resets the page table
entry to point to the chosen frame (with valid bit now 1), and
then restarts the failed instruction.
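   The translation just described can be summarized in a few lines of C.
   The sketch below assumes the VAX-style split (9-bit offset, 23-bit page
   number) and an invented page table entry layout; returning -1 stands in
   for the trap to the operating system, and the length check anticipates
   the page table size register described in the next item:

       #include <stdint.h>

       typedef struct {
           unsigned valid : 1;
           uint32_t frame;                  /* physical frame number, if valid */
       } pte_t;

       int translate(uint32_t vaddr, const pte_t *page_table, uint32_t table_len,
                     uint32_t *phys) {
           uint32_t offset = vaddr & 0x1FF;     /* low-order 9 bits  */
           uint32_t page   = vaddr >> 9;        /* remaining 23 bits */

           if (page >= table_len)               /* beyond the page table    */
               return -1;
           if (!page_table[page].valid)         /* not resident: page fault */
               return -1;

           *phys = (page_table[page].frame << 9) | offset;
           return 0;
       }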
4. To make this scheme work, an additional control register is needed.
a. In principle, the page table would have to have one entry for each
possible page number. For example, a VAX page table would need
2^23 = 8 million entries. (Since each entry is several bytes long,
this is all of physical memory several times over!)
b. In practice, the total virtual space for any given user is usually
much, much less than the maximum implied by the architecture.
Therefore, the actual page table usually only has as many entries as
are needed for all of the pages actually used by the program.
Therefore, in addition to the register that contains the address of
the beginning of a user's page table, there is usually also a
page table size register. Any page number that exceeds the value in
this register causes the mapping process to fail.
5. Note that, when a page needs to be brought into main memory, the
operating system must find a free frame to hold it. This often means
that a currently-resident page must be paged out to disk to make room.
a. If the page has not been written in since it was last loaded, and
if the original is still on disk, the page does not need to be
written out to disk; its page table entry can be flagged as invalid
and it can be displaced. However, if it has been modified it must
first be written back to disk.
b. To facilitate this, each page table entry must include a "written
in" bit that is set by the hardware when a write access to the page
is done.
E. An alternative to paging is segmentation. The difference is that virtual
space is divided into variable-size segments instead of fixed-size pages.
1. One problem with paging is INTERNAL FRAGMENTATION of memory. Since
a user's actual needs seldom comprise a whole number of pages, the
last page allocated to each user will contain some unneeded and
therefore wasted space. Segmentation avoids this.
2. This also has the advantage that the division of virtual space may
mirror program logic - e.g. a segment may be a single procedure or a
single large data structure. This facilitates sharing of code among
different users.
3. The virtual address is now treated as a segment number plus offset
within segment. The segment number indexes a segment table, which is
like a page table except that each entry must include a segment length
field as well as written-in and valid bits and a frame address. Also,
since segments are variable in size, frames must be variable in size
too; therefore the frame address must be a complete address, not just
the high order bits.
F. Segmentation runs into a physical memory fragmentation problem, since
segments can be of any size.
1. In time, memory can become checkerboarded - e.g.
--------------
| 8 K in use | Suppose a program needs to bring in a
-------------- 4K segment from disk. Since a total
| 2 K free | of 5K of physical memory is available,
-------------- this should be possible. But since
| 6 K in use | the 5K is in two non-contiguous
-------------- portions, this cannot be done.
| 3 K free |
--------------
| rest in use|
2. This is called EXTERNAL FRAGMENTATION.
3. Some systems solve this problem by using segmentation with paging:
Each segment is composed of 1 or more fixed-size pages. The segment
table now points to a page table for the segment.
a. The virtual address now has three fields:
segment number | page number within segment | offset within page
b. The mapping process is as follows (a code sketch follows these steps):
i. Check segment number against segment limit. If it is in range,
use it as an index into the segment table.
ii. From the segment table entry, extract the address and size of the
page table for the segment.
iii. Check the page within segment to be sure it is within range. If
so, use it as an index into the segment's page table.
iv. From the page table entry extract control bits plus (if the valid
bit is set) the frame number of the physical address.
Concatenate the frame number and offset within page to form a
physical address.
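   Here is the promised sketch of these steps in C. The field widths (9-bit
   offset, 8-bit page-within-segment field) and all names are illustrative
   assumptions; only the order of the checks matters:

       #include <stdint.h>

       typedef struct { unsigned valid : 1; uint32_t frame; } pte_t;
       typedef struct {
           uint32_t limit;              /* number of pages in this segment */
           pte_t   *page_table;         /* page table for this segment     */
       } seg_t;

       #define OFFSET_BITS 9            /* assumed 512-byte pages          */
       #define PAGE_BITS   8            /* assumed pages-per-segment field */

       /* Returns 0 and sets *phys, or -1 on any of the failure cases. */
       int translate_seg_paged(uint32_t vaddr, const seg_t *seg_table,
                               uint32_t seg_count, uint32_t *phys) {
           uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
           uint32_t page   = (vaddr >> OFFSET_BITS) & ((1u << PAGE_BITS) - 1);
           uint32_t seg    = vaddr >> (OFFSET_BITS + PAGE_BITS);

           if (seg >= seg_count)            return -1;  /* i.   bad segment number  */
           const seg_t *s = &seg_table[seg];            /* ii.  segment table entry */
           if (page >= s->limit)            return -1;  /* iii. bad page number     */
           if (!s->page_table[page].valid)  return -1;  /* iv.  page fault          */

           *phys = (s->page_table[page].frame << OFFSET_BITS) | offset;
           return 0;
       }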
4. We have now reintroduced internal fragmentation, while getting rid
of external fragmentation. (However, this is a more manageable
problem.) However, we have retained the logical advantages of
segmentation for code sharing etc.
Example: The Intel 80386/80486/Pentium chips offer the choice of pure
segmentation or segmentation with paging - as determined by the
setting of a bit in a CPU control register.
In either mode, logical addresses are composed of two parts -
a segment selector, and an offset in segment. The segment
selector is contained in one of 6 CPU segment registers, and the
offset is computed using the addressing mode specified by the
instruction. (The choice of segment register is normally
implicit in the type of reference being done - e.g. instruction
fetches use the CS segment register, references to the stack use
the SS segment register, and ordinary data references use the DS
segment register. The programmer can override the default by
prefixing the instruction with a special segment prefix.)
The segment selector is used to reference a segment descriptor
contained in one of two segment descriptor tables - a global
table shared by all tasks, and a local table for each task. One
bit in the selector specifies which table to use, and 13 bits
are used to select one of 8192 entries in the appropriate table.
The descriptor, in turn, contains a segment base address, a
segment limit (size), and some validity and protection bits.
If the segment is valid and the access is allowed by the
protection, then the offset is compared to the segment
limit; if it is <= the limit then the offset is added to the
segment base to form a "linear address" (i.e. an address within
a linear, unsegmented space.)
At this point, if paging is enabled, the resulting value is
mapped using a two-level page table; otherwise, it is used
directly as the physical address.
G. Paging and both forms of segmentation suffer from an important
overhead problem.
1. Since page/segment tables are stored in memory, each memory access
turns into multiple accesses:
a. With paging: one access to the page table, then one to the data -
two in all.
b. With pure segmentation: one access to the segment table, then one
to the data - two in all.
c. With segmentation with paging: one access to the segment table, one
to the segment's page table, and one to the data - three in all!
(Or on the 80x86 four in all, because the page table has two
levels!)
2. This overhead would, of course, be intolerable. There are two ways
to avoid or reduce it.
a. If the page or segment table can be kept small, it can be held in
   high-speed registers rather than in memory.
i. The 80x86 does something like this with the segment table. The
CPU contains 6 segment registers, each of which can be loaded
with a segment number. Most instructions implicitly use one of
these segment registers in forming a logical address - e.g. code
references in branches etc. are always made using the code
segment register (CS). That is, most instructions generate
only the 32 bit offset portion of the address directly.
As a hidden step, whenever a program loads a segment register the
CPU also fetches the corresponding segment descriptor from the
segment table and stores it in a hidden part of the segment
register. Thus, the descriptor is always available when needed.
ii. Storing page/segment tables in registers is generally not useful
with virtual memory page tables, which tend to be quite large
because virtual memory is intended to support large address
spaces.
b. An alternative (that is more commonly used) is a set of registers
to store the most recently used virtual -> physical address
translations. These registers are sometimes called a translation
look-aside buffer. They are organized as an approximation to an
associative memory (like a cache). When translating a virtual
address, the hardware first looks to see if the translation is
available in the set of registers. If so, no additional memory
accesses are needed.
Pc -- K.mapping -- Mp
|
M.translation look-aside
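   A minimal sketch of a TLB consulted before the page table. For simplicity
   this version searches a small fully-associative array; real TLBs are
   usually built like a set associative cache. Sizes and names are
   illustrative:

       #include <stdint.h>

       #define TLB_ENTRIES 16

       typedef struct {
           int      valid;
           uint32_t page;       /* virtual page number   */
           uint32_t frame;      /* physical frame number */
       } tlb_entry_t;

       static tlb_entry_t tlb[TLB_ENTRIES];

       /* Returns 1 and sets *frame on a TLB hit; 0 means fall back to the
          page table in memory (and then refill one TLB entry). */
       int tlb_lookup(uint32_t page, uint32_t *frame) {
           for (int i = 0; i < TLB_ENTRIES; i++) {
               if (tlb[i].valid && tlb[i].page == page) {
                   *frame = tlb[i].frame;
                   return 1;
               }
           }
           return 0;
       }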
3. Cache memory can also help, if it is installed between the CPU and the
mapping hardware. Now, only memory references that miss in the cache
are translated at all - typically less than 10% of all references.
H. Page Replacement Schemes
1. In any virtual memory system, we must have a policy for deciding which
page is to be removed from memory to provide a page frame for a newly
requested page when a page-fault occurs. When a process is just
starting up, we normally give it an initial page frame allocation; and
as it needs more room we give it an additional page frame each time it
faults, so that the total allocation for the process grows dynamically.
However, there must be an upper limit set for the process, and there
must be some mechanism for shrinking the allocation to a process
whose requirements have decreased.
2. This is primarily a concern of a course on operating systems. However,
certain schemes require various sorts of hardware assistance,
which we consider now.
3. The optimal policy would be to replace the page whose next reference
is furthest in the future, treating a page that will not be referenced
again at all as having infinite time until its next reference.
a. It can be shown that this policy would minimize page faults.
b. But, except in very rare situations, it is totally impractical.
c. The other policies that are used in practice are approximations
aimed at achieving close to the same effect.
4. All schemes require a per-page written-in bit, so that when a page is
   to be removed from memory it is rewritten to disk only if it has been
   modified. (In fact, many schemes give priority to removing from memory
   pages which have not been modified, so as to reduce disk traffic. If a
   modified page is kept around long enough, one of two things will happen:
   either it will be referenced again (and hence the write-out and read-in
   traffic is saved), or the task using it will complete and it need not be
   written out at all.) Normally a copy of this bit is maintained in the
   TLB. When a write occurs to a page already in the TLB, if the bit there
   is already set then no update to the page table need occur; otherwise,
   the written-in bit is set both in the TLB and in the page table.
5. Several schemes make use of a per page "referenced" bit - normally
stored in the page table along with valid bit and written in bit.
a. When the page is first brought into memory - and perhaps at other
   times as well - the operating system software clears this bit.
b. When any reference is made to the page, the hardware sets the bit
automatically. Thus, the operating system can determine whether any
reference has been made to the page within the interval since it
last cleared the bit.
6. To summarize: hardware support for operating system's page replacement
scheme includes:
a. Minimally - a per-page "valid" bit and "written-in" bit.
b. In some cases, a per-page "referenced" bit.
Copyright ©1999 - Russell C. Bjork