CS222 Lecture: Pipelining 3/23/99
Objectives
----------
1. To introduce the basic concept of CPU speedup through pipelining
2. To explain how data and branch hazards arise as a result of pipelining, and
various means by which they can be resolved.
3. To introduce superpipelining and superscalar processors as means to get
further speedup, including techniques for dealing with more complex hazard
conditions that can arise.
Materials: Transparency of Patterson and Hennessy figure 6.12
I. Introduction
- ------------
A. We saw at the start of the course that the total time for the execution
of a given program is given by:
Time = cycle time * # of instructions * CPI

         # of instructions * CPI
       = ------------------------
                clock-rate
B. This equation suggests three basic strategies for running a given program
in less time: (ASK CLASS TO MAKE SUGGESTIONS FOR EACH)
1. Reduce the cycle time (increase the clock rate)
a. Can be achieved by use of improved hardware design/manufacturing
techniques. In particular, reducing the FEATURE SIZE (chip area
needed for one component) results in lower capacitance and
inductance, which allows the chip to be run at a higher frequency.
b. Do less computation on each cycle (which increases CPI, of course!)
2. Reduce the instruction count.
a. Better algorithms.
b. More powerful instruction sets - an impetus for the development of
CISCs. (This, however, leads to increased CPI!)
3. Reduce CPI
a. Simplify the instruction set - an impetus for the development of
RISCs. (This, however, leads to increased program length!).
b. Do more work per clock. (This, however, requires a longer
clock cycle and leads to a lower clock rate!).
4. Note that, in the case of clock rate and instruction count, there are
speedup techniques that are clear "wins" - utilizing them does not
adversely affect the other two components of the equation. It appears,
though, that it is only possible to reduce CPI at the cost of more
instructions or a slower clock.
5. While there is no way to reduce the total number of clocks needed for
an individual instruction without adversely impacting some other
component of performance, it is possible to reduce the AVERAGE CPI
by doing portions of two or more instructions in parallel. That is
the topic we look at in the next few lectures.
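The trade-offs in points 2 and 3 can be made concrete with a few lines of
Python (an illustrative sketch only - the machine parameters are made up):

    # Execution time = instruction count * CPI / clock rate
    def execution_time(instruction_count, cpi, clock_rate_hz):
        return instruction_count * cpi / clock_rate_hz

    base = execution_time(1_000_000, 4.0, 100e6)   # baseline (made-up numbers)
    cisc = execution_time(  800_000, 5.0, 100e6)   # fewer instructions, higher CPI
    risc = execution_time(1_250_000, 2.0, 100e6)   # more instructions, lower CPI

    print(base, cisc, risc)                        # 0.04 s, 0.04 s, 0.025 s

Whether a change actually helps depends on the whole product, not on any one
factor by itself.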
C. We now look at several speed-up techniques - of increasing sophistication
- whereby the different phases of several successive instructions are done
in parallel. To some extent, what we will look at is applicable to any
architecture; but it becomes most fully implementable for RISCs.
1. The timing of the CPU operations as we have discussed them thus far
can be pictured as follows, using a Gantt chart (where S1, S2, S3 are
a series of instructions being executed one after the other):
Step 5                    -S1-                -S2-                -S3-
Step 4                -S1-                -S2-                -S3-
Step 3            -S1-                -S2-                -S3-
Decode        -S1-                -S2-                -S3-
Fetch     -S1-                -S2-                -S3-
          Time ---------->
2. Notes:
a. This chart is a bit different in style from the ones in your book.
The horizontal axis is time, and the vertical axis is the various
steps of instruction execution. An instruction moves UPWARD and TO
THE RIGHT as it passes through the various steps.
b. To avoid tying the chart to any specific architecture, generic names
are used for the instruction phases after the first two.
(Regardless of architecture, the first two steps in any instruction
are fetching it and decoding it.)
c. The chart makes two simplifying assumptions that may or may not be
valid for a given architecture:
i. All instructions have the same number of steps.
ii. Each step takes the same amount of time.
In reality, neither of these assumptions is typically valid for
CISCs, but they do hold for RISCs (if we allow that, for
instructions that need fewer steps, we consider the last few
steps to be "do-nothings".)
3. By building additional hardware, it is possible to speed up execution
by doing portions of two or more instructions in parallel.
D. One of the simplest speedup techniques is PRE-FETCHING of instructions.
Step 5                    -S1-            -S2-            -S3-
Step 4                -S1-            -S2-            -S3-
Step 3            -S1-            -S2-            -S3-
Decode        -S1-            -S2-            -S3-
Fetch     -S1-            -S2-            -S3-
          Time ---------->
1. Though each instruction still takes 5 cycles from start to finish,
the average number of clocks per instruction - in the steady state -
is four - i.e. one instruction completes every 4th cycle. This
represents a 20% reduction in the average time per instruction.
2. One complication can arise if the current instruction is a conditional
branch. In this case, one cannot know while the instruction is being
executed whether to prefetch the next sequential instruction or the
one at the branch address. This is known as a BRANCH DEPENDENCY or
BRANCH (or CONTROL) HAZARD.
a. Prefetching can be suspended during execution of a branch
instruction until the outcome is known.
b. Some machines always prefetch the next instruction, or always
prefetch the branch target, or use some heuristic rule for
"guessing" which way the branch will turn out. If the guess is
wrong, then another fetch of the correct instruction must occur
before further computation can be done.
c. Other machines attempt to prefetch both ways and then discard the
wrong instruction.
d. We will look at some additional alternatives shortly.
3. Prefetching may not seem possible on machines that have variable
length instructions, since we don't know how long an instruction is
until we have decoded it (and perhaps partially executed it).
However, many variable length instruction machines still do a form of
prefetching by using an instruction queue that simply holds as yet
uninterpreted bytes from the instruction stream.
a. Whenever execution calls for another byte from the instruction
stream, one is taken from the queue if possible.
b. Whenever the memory is idle, if there is room in the queue then
one or more bytes are fetched ahead.
c. If the outcome of a conditional branch sends the program down a
different path from the one the CPU has been prefetching on, then
the queue is flushed.
E. Further parallelism between instructions can be achieved at the cost
of increased complexity of hardware.
1. For example, suppose that we not only prefetched instructions, but also
overlapped the decoding of each instruction with the execution of its
predecessor.
Step 5                    -S1-        -S2-        -S3-
Step 4                -S1-        -S2-        -S3-
Step 3            -S1-        -S2-        -S3-
Decode        -S1-        -S2-        -S3-
Fetch     -S1-        -S2-        -S3-
          Time ---------->
[In this - idealized - case we have reduced average per instruction
time by 40%]
2. The ultimate degree of parallelism would be total overlap between all
phases of different instructions - e.g.
Step 5                    -S1--S2--S3--S4--S5--S6--S7--S8-
Step 4                -S1--S2--S3--S4--S5--S6--S7--S8-
Step 3            -S1--S2--S3--S4--S5--S6--S7--S8-
Decode        -S1--S2--S3--S4--S5--S6--S7--S8-
Fetch     -S1--S2--S3--S4--S5--S6--S7--S8-
          Time ---------->
[We have now reduced average per instruction time by 80% !]
a. We call this a FULLY PIPELINED CPU. In the steady state, it
completes one instruction on every cycle, so its average CPI is 1.
b. Of course, an average CPI of 1 is attainable only when the pipeline
is full of valid instructions. When the pipeline has been
flushed (e.g. after a branch), it may take several cycles for the
pipeline to fill up again. As a result, a fully pipelined machine
ends up in practice having a CPI somewhat bigger than 1.
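A rough feel for point b can be had from a few lines of Python (the branch
frequency, taken rate, and flush penalty below are made-up numbers, not
measurements):

    def effective_cpi(ideal_cpi, branch_fraction, taken_fraction, flush_cycles):
        # Average CPI = ideal CPI plus the flush cycles charged per instruction
        return ideal_cpi + branch_fraction * taken_fraction * flush_cycles

    # e.g. 20% branches, 60% of them taken, 3 cycles to refill the pipeline
    print(effective_cpi(1.0, 0.20, 0.60, 3))    # 1.36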
3. This latter degree of parallelism introduces some further
complications, however.
a. Suppose instruction S2 uses some register as one of its
operands, and suppose that the result of S1 is stored in this
same register - e.g.
S1: lw $2, some-address
S2: addi $3, $2, 1
(Clearly, the intention is for S2 to use the value stored by S1)
If S2 needs this operand on its first execution step (Step 3) and
S1 doesn't store it until its last step (Step 5), then the value
that S2 gets will be the PREVIOUS VALUE in the register - not the
one stored by S1, as intended by the programmer.
b. This sort of situation is called a DATA HAZARD or DATA DEPENDENCY.
We will discuss how it can be handled shortly.
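The situation the hardware (or the compiler) must recognize is simply that the
destination register of one instruction shows up as a source of an instruction
following too closely behind it. A minimal sketch of that test, using an
invented (opcode, destination, sources) tuple form:

    def data_hazard(producer, consumer):
        # True if the consumer reads a register the producer has not yet written
        _, dest, _ = producer
        _, _, sources = consumer
        return dest is not None and dest in sources

    s1 = ("lw",   "$2", ["some-address"])
    s2 = ("addi", "$3", ["$2"])
    print(data_hazard(s1, s2))      # True - S2 needs $2 before S1 has stored it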
F. So far in our discussion, we have assumed that the time for the actual
computation (EXEC) phase of an instruction is a single cycle. This is
realistic for simple operations like AND, fixed point ADD etc. However,
for some instructions multiple cycles are needed for the actual
computation.
1. These include fixed-point multiply and divide and all floating point
operations.
2. To deal with this issue, some pipelined CPU's simply exclude
such instructions from their instruction set - relegating them to
co-processors.
3. If such long operations are common (as would be true in a machine
dedicated to scientific computations), further parallelism might
be considered in which the computation phases of two or more
instructions overlap. We will not discuss this now, but will come
back to it when we get to vector processors toward the end of the
course.
G. The book discusses MIPS pipelining in detail. Before considering this,
we will look at a simpler pipeline structure, based on that of one of
the earliest RISC's: Berkeley RISC.
II. A Three-Stage Pipeline
-- - ----------- --------
A. Before discussing the MIPS pipeline, which has five stages, we will
first look at a simpler, three-stage pipeline.
1. Stage one is instruction fetch, which both fetches an instruction from
the memory address pointed to by the PC and updates the PC.
2. Stage two is an ALU operation
a. Processing at this stage involves
i. Reading two source values out of appropriate register(s)
and/or a field in the instruction.
ii. Performing some basic operation (e.g. add, sub, etc.)
iii. Storing the result back into a register if appropriate.
(Note: MIPS does each of these in a separate stage. There is no
reason why they cannot all be done in one stage if we are willing
to accept a longer clock cycle - which is, of course, the motivation
for breaking them into three stages on MIPS!)
b. The precise function performed depends on the instruction type
i. For R-Type instructions, this step does the actual computation.
ii. For memory reference instructions, this step calculates the
address.
iii. For branch type instructions, this step calculates the target
address and (if appropriate) updates the PC with the new value.
3. Stage three is a memory operation (read or write) - if needed. The
value is transferred between memory and a register in a single
cycle (not two as on MIPS). This stage is a "do-nothing" stage for
R-Type and branch type instructions.
4. To allow these three stages to be pipelined, the CPU is organized as
follows:
            IF Unit                       ALU Unit            Load/Store Unit

        - +4 -                Inst ------------------------------> Inst
        |    |                Reg                                  Reg
        v    |                 |                                    |
        PC --+--> Instruction --> Register ---> Arithmetic          +---> Data
        ^         Memory          File     ---> Logic Unit --------------> Memory
        |                           ^                      |                |
        |                           |                      |                |
        +---------------------------+----------------------+----------------+
a. There are three functional units - an instruction fetch unit, an
ALU, and a load/store unit - corresponding to the three stages of
instruction execution. To facilitate this, there are two access
paths to memory - one used by the instruction fetch stage
(instruction memory) and one used by the load/store stage (data
memory.)
b. As an instruction is executed, it is passed from stage to stage -
like stations on an assembly line. It spends one basic clock
cycle at each stage. There are two instruction registers - one
between the instruction fetch stage and the ALU stage, and one
between the ALU stage and the load/store stage.
i. At the end of each cycle, the instruction fetch unit puts a new
instruction into the IR between the first two stages.
ii. At the end of each cycle, the instruction that is in the first
IR is copied into the second one.
c. Three instructions are in the pipeline at any time - one at each
stage. Thus, although it takes up to 3 clock cycles to execute any
one instruction, one instruction is completed on each clock and
so the effective time for an instruction is only one cycle - a
threefold speedup.
d. Actually, this pipeline - as we have described it - contains one
problematic feature. R-Type instructions write their result to the
register file in stage 2, while load-type instructions write their
result to the register file in stage 3. Thus, if a load instruction is
immediately followed by an R-Type instruction, two writes to the
register file would occur at the same time (though hopefully not
to the same register!). This is not impossible to handle - but in fact,
most actual pipelines avoid such a problem altogether because they
have more stages, as we shall see.
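The assembly-line movement described in b and c can be mimicked in a few lines
of Python (a simulation sketch only - the stage names follow the description
above, and hazards are ignored):

    def run_three_stage(program):
        # Advance instructions through IF -> ALU -> MEM (load/store),
        # one stage per clock
        ir1 = ir2 = None                  # the two instruction registers
        trace = []
        for clock in range(len(program) + 2):
            fetched = program[clock] if clock < len(program) else None
            trace.append({"IF": fetched, "ALU": ir1, "MEM": ir2})
            ir2, ir1 = ir1, fetched       # copy IR1 into IR2, latch the new fetch
        return trace

    for row in run_three_stage(["S1", "S2", "S3"]):
        print(row)      # after clock 2, all three stages hold an instruction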
B. Because the pipeline is so regular in its operation, the compiler can
use knowledge about the pipeline to optimize the code it generates.
1. One problem faced by pipelined CPU's is data dependencies.
a. In this case, if one instruction loads a memory value into a
register and the very next uses that same register as an
input, then we have a problem since the load/store phase of the
first (which actually transfers the data from memory) overlaps
the ALU phase of the second (which uses it.)
b. One approach to handle such a situation is a pipeline stall, or
"bubble".
i. The hardware can detect the situation where the second IR
contains a load instruction whose destination register is the
same as one of the source operands of the instruction in the
first IR. (This is a simple comparison between IR field
contents that is easily implemented with just a few gates.)
ii. In such cases, the hardware can replace the instruction in the
first IR with a NOOP and force the IF stage to refetch the same
instruction instead of going on to the next.
iii. Of course, this means wasting a clock cycle, since the NOOP
does no useful work.
c. A better approach is to require the compiler to anticipate such a
problem by never emitting an instruction that uses the result of a
load immediately after that load. (The compiler can either put some
other, unrelated instruction after the load, or it can emit a NOOP
if all else fails.)
Example: suppose a programmer writes:
d := a + b + c + 1
This could be translated to the following.
load r10, a ; r10 <- a
load r11, b ; r11 <- b
noop -- inserted to prevent data conflict
add r11, r10, r11 ; r11 <- r10 + r11
load r12, c ; r12 <- c
noop -- inserted to prevent data conflict
add r12, r11, r12 ; r12 <- r11 + r12
add r12, r12, #1 ; r12 <- r12 + 1
store r12, d ; d <- r12
However, the NOOPs can be eliminated by rearranging the
operations.
load r10, a ; r10 <- a
load r11, b ; r11 <- b
load r12, c ; r12 <- c
add r11, r10, r11 ; r11 <- r10 + r11
add r12, r11, r12 ; r12 <- r11 + r12
add r12, r12, #1 ; r12 <- r12 + 1
store r12, d ; d <- r12
d. This latter approach - requiring that code not use a register
immediately after it has been loaded - is called DELAYED LOAD.
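A toy version of the compiler's fallback job under delayed load - insert a NOOP
whenever the instruction right after a load reads the loaded register - might
look as follows (the (op, dest, sources) form is invented for illustration; a
real compiler would try to reorder instructions first, as in the example above):

    def insert_load_noops(code):
        # code is a list of (op, dest, sources); insert a noop after any load
        # whose result is used by the very next instruction
        out = []
        for i, (op, dest, sources) in enumerate(code):
            out.append((op, dest, sources))
            nxt = code[i + 1] if i + 1 < len(code) else None
            if op == "load" and nxt is not None and dest in nxt[2]:
                out.append(("noop", None, []))
        return out

    prog = [("load", "r10", ["a"]),
            ("load", "r11", ["b"]),
            ("add",  "r11", ["r10", "r11"])]
    for ins in insert_load_noops(prog):
        print(ins)      # a noop appears between the second load and the add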
2. Another source of potential problems is branch dependencies.
a. Branch instructions are executed in the second stage of the
pipeline, and the branch target address (and decision whether or
not the branch is to be taken) are not available until the end of
the cycle. Thus, while the branch is being executed, the next
sequential instruction is being fetched by the instruction fetch
unit.
b. In this case, attempting to somehow predict the outcome of the
branch does not help, because the TARGET ADDRESS of the branch
becomes available at the same time the outcome becomes known - if
we predict the branch to be taken in advance of knowing its outcome,
we cannot do anything with the prediction because we don't yet know
where the branch will go.
c. Again, one way to handle this situation is with a "bubble" - if
the branch is taken, then the instruction behind it is converted
to a NOOP.
d. An alternate approach (which was used on Berkeley RISC and is used
by a number of other RISCS) is called DELAYED BRANCH.
i. All control transfer instructions (subroutine calls and returns
as well as jumps) take effect AFTER the next instruction in
sequence is executed.
Example:
Suppose we were compiling
if something then
a := a + 1;
else
...
Suppose further that a is a local variable allocated to
reside in r16.
Then the code for the "then" part would consist of an add 1 to
r16 plus a branch to skip over the "else" part. This could be
done this way:
add r16, r16, #1            ; r16 <- r16 + 1
jmp end_if
noop
else_part:
....
end_if:
The noop is needed because the instruction after the jmp is
always executed - it's in the pipeline before the branch is
actually done.
However, a good compiler would emit the code this way
jmp end_if
add r16, r16, #1            ; r16 <- r16 + 1
else_part:
....
end_if:
ii. As the above example illustrates, the compiler can normally work
with this feature of the hardware by inserting the JUMP
instruction ahead of the last instruction to be done in the
current block of code. In some cases, though, a NOOP must be
inserted (e.g. if the jump is a conditional that depends on the
last operation to be done before the jump is taken.)
III. MIPS Pipelining
--- ---- ----------
A. As discussed in the book, MIPS instructions are implemented in up to
five steps. In a pipelined implementation, EVERY instruction has
all five steps (though some may not actually do any useful work), and
the pipeline has 5 stages:
TRANSPARENCY - FIGURE 6.12
1. IF - both of the following (in parallel)
a. instruction fetch
b. program counter increment
2. ID - all of the following (in parallel)
a. instruction decode
b. register file read into ALU input holding registers A and B
c. branch target address calculation
(not all of these results may in fact be used.)
3. EXEC - one of the following
a. ALU operation for an R-Type instruction
b. address calculation for a memory reference instruction
In either case, the inputs come from input holding registers A and B
and/or a field in the instruction, and the output goes to the
ALU Out holding register
4. MEM - one of the following
a. Memory operation - read or write.
b. On non memory-reference instructions, this step does nothing
5. WB - one of the following
a. Write the ALU Out register (computed in step 3) back to the
appropriate destination in the register file (R-Type instruction)
b. Write the value read from memory (in step 4) into the appropriate
destination in the register file (load instruction)
c. On all other instructions, do nothing
6. Note the presence of pipeline registers between each pair of stages.
Since there are five instructions in the pipeline at any time, there
is a need to keep copies of four instructions (or at least portions
of them) in registers at any time. (There are only four instructions
in registers, because one is coming out of memory during stage 1).
In addition, we need to keep certain data in these registers.
a. IF/ID holds an instruction, plus the incremented PC value of where
it came from.
b. ID/EXEC holds the op-code (possibly in some decoded form) plus the
destination register specifier, immediate value, and funct fields of
the instruction, plus the A and B source values read out of the
register file, plus the PC passed on from the IF/ID register.
c. EXEC/MEM holds the op-code and destination register specifier
fields of the instruction (copied from ID/EXEC), plus the ALUOut
value, plus the data that is to be written to memory if the
instruction is a store (contents of register specified by rt).
d. MEM/WB holds the op-code and destination register specifier
fields of the instruction and the ALUOut value (copied from
ID/EXEC) plus the value just read from memory if the instruction
is a load.
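One way to picture the four pipeline registers is as little record structures
holding the fields just listed. The sketch below only names those fields (the
field names are descriptive, not MIPS's own):

    from dataclasses import dataclass

    @dataclass
    class IF_ID:                # between IF and ID
        instruction: int = 0
        incremented_pc: int = 0

    @dataclass
    class ID_EXEC:              # between ID and EXEC
        opcode: int = 0
        dest: int = 0           # destination register specifier
        immediate: int = 0
        funct: int = 0
        a: int = 0              # source value read from the register file
        b: int = 0              # source value read from the register file
        incremented_pc: int = 0

    @dataclass
    class EXEC_MEM:             # between EXEC and MEM
        opcode: int = 0
        dest: int = 0
        alu_out: int = 0
        store_data: int = 0     # contents of rt, used only by stores

    @dataclass
    class MEM_WB:               # between MEM and WB
        opcode: int = 0
        dest: int = 0
        alu_out: int = 0
        mem_data: int = 0       # value read from memory, used only by loads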
B. The motivation for going to a five-stage pipeline on MIPS appears to be
the following
1. Doing the register file read, ALU operation, and write back of the
ALU result in one step (as is the case for the three-stage pipeline)
would require a longer clock cycle for this stage, and thus for all
stages in the pipeline.
2. Likewise, doing a memory read and writing the item read back to a
register in one step (as is also the case for the three-stage
pipeline) would pose a similar problem.
3. Since the load instruction - the longest - ends up requiring 5 steps,
a 5-stage pipeline is called for.
C. Actually, the way the book describes the MIPS pipeline - and the way we
have described it here - is a bit oversimplified. The actual pipeline
on most MIPS implementations has five stages, but uses only four clocks
because two of the stages are just half a clock long. (Recall that the
clock is a square wave, and a complete cycle includes both rising and
falling edges). Here is the actual structure:
WB  (1/2 cycle)                                                 | S1  |/////| S2  |/////| S3  |/////|  ...
MEM                                                 |    S1     |    S2     |    S3     |    S4     |  ...
EXEC                                    |    S1     |    S2     |    S3     |    S4     |  ...
ID  (1/2 cycle)                   | S1  |/////| S2  |/////| S3  |/////| S4  |  ...
IF                |    S1     |    S2     |    S3     |    S4     |  ...
                  Time ---------->

                  ///// = idle on this half-cycle
(We will stick with the simplified version used in the book for most of
our discussion, since the basic issues are not affected.)
D. The use of a five-stage pipeline allows for a shorter clock cycle, but
the downside is that it complicates dealing with hazards.
1. At first glance, it would appear that the branch hazard problem would
be exacerbated, because instructions are normally executed in stage 3
(meaning that there would now be 2 instructions in the pipeline when
a branch is executed.) However, the MIPS hardware is arranged so that
branch instructions are executed in stage 2 of the pipeline, and MIPS
deals with the single instruction behind it in the pipeline by using
delayed branching.
(Note: It would appear that use of delayed branching would have posed
a problem when you were writing MIPS assembly language programs in
lab, since you were unaware of this at the time. However, the MIPS
assembler automatically inserts a NOOP after a branch, so this was
not an issue.)
2. On the other hand, the data hazard issue is made much worse.
In particular, in the 3-stage pipeline, a data hazard could arise when
using the value read from memory by a load instruction too soon. In
the 5-stage pipeline, data hazards can also arise from dependent
sequences of computational instructions.
a. Example: Consider the following program fragment:
S1: add $2, $4, $5
S2: add $3, $6, $7
S3: add $3, $2, $3
S4: add $2, $2, $8
(where it is the intention that S3 use the values in $2 and $3
computed by S1 and S2, and S4 uses the value in $2 computed by S1)
i. This program would work correctly on the 3-stage pipeline.
ii. But consider what happens with a 5-stage pipeline
WB      S1:        S2:        ...        ...
        $2 <-      $3 <-
        $4+$5      $6+$7
MEM     S1:        S2:        ...        ...
        (pass      (pass
        ALUOut     ALUOut
        thru)      thru)
EXEC    S1:        S2:        S3:        S4:
        ALUOut     ALUOut     ALUOut     ALUOut
        <- A+B     <- A+B     <- A+B     <- A+B
        ($4+$5)    ($6+$7)    ($2+$3)    ($2+$8)
ID      S1:        S2:        S3:        S4:
        A<-$4      A<-$6      A<-$2      A<-$2
        B<-$5      B<-$7      B<-$3      B<-$8
                              (BOTH      (A
                              VALUES     VALUE
                              WRONG!)    WRONG!)
IF      S1         S2         S3         S4
How many bubbles, NOOPs, or other instructions would need to be
inserted between S2 and S3 to make S3 and S4 get the right values?
ASK
- One would take care of getting the right $2 for S4, but
would not help S3 at all.
- Two would take care of getting the right $2 for S3 as well,
but $3 would still be wrong
- Three would make everything work correctly:
WB      S1:        S2:        ...        ...
        $2 <-      $3 <-
        $4+$5      $6+$7
MEM     S1:        S2:        ...        ...        ...
        (pass      (pass
        ALUOut     ALUOut
        thru)      thru)
EXEC    S1:        S2:        ...        ...        ...        ...
        ALUOut     ALUOut
        <- A+B     <- A+B
        ($4+$5)    ($6+$7)
ID      S1:        S2:        extra1     extra2     extra3     S3:        S4:
        A<-$4      A<-$6                                       A<-$2      A<-$2
        B<-$5      B<-$7                                       B<-$3      B<-$8
IF      S1         S2         extra1     extra2     extra3     S3         S4
b. Requiring three instructions between the time a value is computed
and the time it is used would have a very severe negative impact
on performance, so some other solution is desirable. MIPS uses
two.
c. A one instruction reduction in the size of the problem is
achieved automatically by the fact that ID and WB are half cycles.
(Refer to more accurate timing diagram).
d. To squeeze the remaining two delays out, observe that the values
needed by S3 EXIST at the time S3 needs them - they're just not
in the right places.
i. The value which will go into $2 is sitting in the ALUOut
portion of the MEM/WB pipeline register when S3 needs to use it
during its EXEC step.
ii. The value which will go into $3 is sitting in the ALUOut
portion of the EXEC/MEM pipeline register when S3 needs to use
it during its EXEC step.
iii. The two stage delay that would otherwise be needed between
computing a result and using it could be avoided if the ALU
input selection logic could be modified to either use a value
from any of the following:
- The register A or B portion of the ID/EXEC pipeline register
(as the case may be)
or - The ALUOut register portion of the EXEC/MEM pipeline register
or - The ALUOut register portion of the MEM/WB pipeline register
(This must be handled separately for each of the two ALU
inputs).
WB      S1:        S2:        ...        ...
        $2 <-      $3 <-
        $4+$5      $6+$7
MEM     S1:        S2:        ...        ...
        (pass      (pass
        ALUOut     ALUOut
        thru)      thru)
EXEC    S1:        S2:        S3:        S4:
        ALUOut     ALUOut     ALUOut     ALUOut
        <- A+B     <- A+B     <- M/W     <- A+B
                              + E/M
                              ALUOuts
        ($4+$5)    ($6+$7)    ($2+$3)    ($2+$8)
ID      S1:        S2:        S3:        S4:
        A<-$4      A<-$6      A<-$2      A<-$2
        B<-$5      B<-$7      B<-$3      B<-$8
                              (BOTH
                              VALUES
                              WRONG!)
IF      S1         S2         S3         S4
iv. This strategy is called DATA FORWARDING.
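The forwarding decision itself is just a pair of comparisons per ALU input, as
sketched below (the field names are invented; real hardware implements this as
multiplexer control logic rather than code):

    def select_alu_input(src_reg, value_read_in_id, exec_mem, mem_wb):
        # exec_mem and mem_wb describe the two instructions ahead of this one
        if exec_mem["writes_reg"] and exec_mem["dest"] == src_reg:
            return exec_mem["alu_out"]      # forward from EXEC/MEM (1 ahead)
        if mem_wb["writes_reg"] and mem_wb["dest"] == src_reg:
            return mem_wb["alu_out"]        # forward from MEM/WB (2 ahead)
        return value_read_in_id             # no hazard: use the value read in ID

    # S3 needing $2 while S1's result is still sitting in MEM/WB:
    print(select_alu_input("$2", 999,
                           {"writes_reg": True, "dest": "$3", "alu_out": 13},
                           {"writes_reg": True, "dest": "$2", "alu_out": 7}))   # 7

Note that the EXEC/MEM value is checked first, so if both pipeline registers
happen to name the same destination, the more recent result wins.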
3. We have examined the impact of the five-stage pipeline on R-Type
instructions, and have seen that data forwarding can prevent
hazards. What about load-type instructions (where even the
three-stage pipeline had a hazard)?
a. Both R-Type instructions and load instructions write their value
to a register in the last stage of the pipeline, so the basic
issue is the same.
b. However, a key difference is that, with an R-Type instruction,
the value is actually available at the end of the EXEC stage
(stage 3) and can be forwarded from there, whereas with a
load instruction the value does not become available until
the end of the MEM stage (stage 4).
c. As a result, by use of data forwarding, we can reduce the
delay to the same one-instruction delay experienced in the
3-stage pipeline - i.e. we must have one instruction
intervening between a load instruction and any other
instruction that uses its result. On MIPS, this is handled by
using DELAYED LOAD, as discussed previously. (Once again,
this is hidden from the programmer by the compiler or
assembler.)
IV. Moving Beyond Basic Pipelining
-- ------ ------ ----- ----------
A. The potential speedup from pipelining is a function of the number of
stages in the pipeline.
1. For example, suppose that an instruction that would take 40 ns to
execute is implemented using a 4 stage pipeline with each stage
taking 10 ns. Then the speedup gained by pipelining is
w/o pipeline - 1 instruction / 40 ns
with pipeline - 1 instruction / 10 ns
40ns/10ns = 4:1
Now if, instead, we could implement the same instruction using a
5 stage pipeline with each stage taking 8ns, we could get a 5:1
speedup instead.
2. This leads to a desire to make the pipeline consist of as many
stages as possible, each as short as possible. This strategy is
known as SUPERPIPELINING, and is used by a number of CPU's.
However, superpipelining does run into some problems that prevent the
theoretical maximum speedup from being achieved.
a. If the individual steps an instruction is broken up into don't
all take the same time (a likely situation), then the time for
the longest step becomes the time for all steps.
Example: if in going from 4 stages to 5 in our example above it
turned out that one of the steps needed only 6 ns but
another needed 10 (still 40 ns total), the cycle time
would have to be 10ns, and the speedup would still be
only 4:1
b. The longer the pipeline, the greater the potential waste of
time due to data and branch hazards.
3. Note that superpipelining attempts to maintain CPI at 1 (or as close
as possible) while using a longer pipeline to allow the use of a
faster clock.
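The arithmetic in the example under 2.a can be captured in a few lines (an
illustrative sketch only; it ignores the hazard costs mentioned in 2.b):

    def pipeline_speedup(stage_times_ns):
        # The clock must be as long as the slowest stage
        return sum(stage_times_ns) / max(stage_times_ns)

    print(pipeline_speedup([10, 10, 10, 10]))     # 4.0 - the 4-stage example
    print(pipeline_speedup([8, 8, 8, 8, 8]))      # 5.0 - an even 5-way split
    print(pipeline_speedup([6, 8, 8, 8, 10]))     # 4.0 - the uneven split from 2.a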
B. It would appear, at first, that a CPI of 1 is as good as we can get -
so there is nothing further that can be done beyond full pipelining to
reduce CPI. Actually, though, we can get CPI less than one if we
execute two or more instructions fully in parallel (i.e. fetch them at
the same time, do each of their steps at the same time, etc) by
duplicating major portions of the instruction execution hardware.
1. If we can start 2 instructions at the same time and finish them at
the same time, we complete 2 instructions per clock, so average CPI
drops to 0.5. If we can do 4 at a time, average CPI drops to 0.25.
2. Because various hazards make it impossible to always achieve the
maximum degree of parallelism, we speak of a machine as ISSUING
some number of instructions on a given clock. (An instruction is
issued when it is actually allowed to begin execution.)
Potentially, for a given machine, this might be 2 or 4; however, the
actual number issued on a given clock may be less.
3. This is facilitated by taking advantage of the fact that many
CPU's have separate execution units for executing different types
of instructions - e.g. there may be:
a. An integer execution unit used for executing integer instructions
like add, bitwise or, shift etc..
b. A floating point execution unit for executing floating point
arithmetic instructions. (Note that many architectures use
separate integer and floating point register sets).
c. A branch execution unit used for executing branch instructions.
If two instructions need two different execution units (e.g. if one
is an integer instruction and one is floating point) then they can
be issued simultaneously and execute totally in parallel with each
other, without needing to replicate execution hardware (though
decode and issue hardware does need to be replicated.)
Note that, for example, many scientific programs contain a mixture of
floating point operations (that do the bulk of the actual computation),
integer operations (used for subscripting arrays of floating point
values and for loop control), and branch instructions (for loops).
For such programs, issuing multiple instructions at the same time
becomes very feasible.
Note: in effect, a group of instructions that does a computation on -
say - an array element and then updates a pointer to point to the
next element becomes like a single CISC instruction with
autoincrement!
4. The earliest scheme used for doing this was the VERY LONG INSTRUCTION
WORD architecture. In this architecture, a single instruction could
specify more than one operation to be performed - in fact, it could
specify one operation for each execution unit on the machine.
a. The instruction contains one group of fields for each type of
instruction - e.g. one to specify an integer operation, one to
specify a floating point operation, etc.
b. If it is not possible to find operations that can be done at the
same time for all functional units, then the instruction may
contain a NOOP in the group of fields for unneeded units.
c. The VLIW architecture requires the compiler to be very knowledgeable
of implementation details of the target computer, and may require a
program to be recompiled if moved to a different implementation of
the same architecture.
d. Because most instruction words contain some NOOP's, VLIW programs
tend to be very long.
5. Current practice - found on a number of RISCs including Dec Alpha and
the Power PC - is to use SUPERSCALAR architecture.
a. A superscalar CPU fetches groups of instructions at a time -
typically two (64 bits) or four (128 bits) and decodes them in
parallel.
b. A superscalar CPU has just one instruction fetch unit (which
fetches a whole group of instructions), but it has 2 or 4
decode units and a number of different execution units.
c. If the instructions fetched together need different execution units,
then they are issued at the same time. If two instructions need
the same execution unit, then only the first is issued; the second
is issued on the next clock. (This is called a STRUCTURAL HAZARD).
d. To reduce the number of structural hazards that occur, some
superscalar CPU's have two or more integer execution units,
along with a branch unit and a floating point unit, since integer
operations are more frequent. Or, they might have a unit that
handles integer multiply and divide and one that does add and
subtract.
6. Once again, the issue of data and branch hazards becomes more
complicated when multiple instructions are issued at once, since
an instruction cannot depend on the results of any instruction
issued at the same time, nor on the results of any instruction issued
on the next one or more clocks. With multiple instructions issued
per clock, this increases the potential for interaction between
instructions, of course.
a. Example: If a CPU issues 4 instructions per clock, then up to
seven instructions following a branch might be in the pipeline
by the time the branch instruction finishes computing its target
address. (If it is the first of a group of 4, plus a second
group of 4.)
b. Example: If a CPU issues 4 instructions per clock, then there may
need to be a delay of up to seven instructions before one can
use the result of a load instruction, even with data forwarding
as described above.
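Putting 5.c and 6 together, here is a toy sketch of the issue decision for a
2-way superscalar: two instructions fetched together are issued on the same
clock only if they need different execution units and the second does not use
a result of the first (the unit names and tuple form are invented for
illustration):

    def can_dual_issue(first, second):
        # Each instruction is (execution_unit, dest, sources)
        unit1, dest1, _ = first
        unit2, _, sources2 = second
        if unit1 == unit2:          # structural hazard: both need the same unit
            return False
        if dest1 in sources2:       # data hazard: second depends on the first
            return False
        return True

    print(can_dual_issue(("integer", "r1", ["r2", "r3"]),
                         ("float",   "f1", ["f2", "f3"])))     # True
    print(can_dual_issue(("integer", "r1", ["r2", "r3"]),
                         ("integer", "r4", ["r5", "r6"])))     # False - same unit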
C. Dealing with Hazards on a Superscalar Machine
1. Data hazards
a. We have previously seen how data forwarding can be used to
eliminate data hazards between successive instructions where one
instruction uses a result computed by an immediately preceding
one. However, if the "producer" and "consumer" instructions are
executed simultaneously in different execution units, forwarding
no longer helps. Likewise, the unavoidable one cycle delay
needed by a load could affect many successive instructions.
b. Superscalar machines typically incorporate hardware interlocks
to prevent data hazards from leading to wrong results. When an
instruction that will store a value into a particular register is
issued, a lock bit is set for that register that is not cleared
until the value is actually stored - typically several cycles
later. An instruction that uses a locked register as a data input
is not issued until the register(s) it needs is/are unlocked.
c. Further refinements on this include a provision that allows the
hardware to schedule instructions dynamically, so that a "later"
instruction that does not depend on a currently executing
instruction might be issued after an "earlier" instruction that
does. (This is called OUT OF ORDER EXECUTION.)
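A minimal sketch of the lock-bit scheme described in b (the register names are
arbitrary, and real hardware keeps one lock bit per register rather than a
set of names):

    locked = set()          # registers whose new value is still in flight

    def try_issue(dest, sources):
        # Issue only if no register this instruction touches is locked;
        # lock the destination until the value is actually written back
        if locked & (set(sources) | {dest}):
            return False    # stall: an operand (or the destination) is locked
        locked.add(dest)
        return True

    def write_back(dest):
        locked.discard(dest)

    print(try_issue("$2", ["$4", "$5"]))    # True  - nothing locked yet
    print(try_issue("$3", ["$2", "$7"]))    # False - $2 still locked
    write_back("$2")
    print(try_issue("$3", ["$2", "$7"]))    # True  - $2 unlocked by write back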
2. Branch hazards
a. To reduce the impact of branch hazards, some machines (CISCs as
well as highly-pipelined machines) make use of BRANCH PREDICTION.
When a conditional branch instruction is encountered, the fetch
unit makes a guess as to whether or not the branch is going to be
taken. If the prediction is that the branch will be taken (or the
branch is unconditional), then the fetch unit begins to fetch
instructions from the new location; otherwise, it keeps fetching
as usual. Only if the prediction is wrong does a pipeline flush
occur.
i. How can such a prediction be done?
- One way to do the prediction is to use the following rule of
thumb: assume that forward conditional branches will not be
taken, and backward conditional branches will be taken.
Why? ASK
- Forward branches typically arise from a construct like
if something then
common case
else
less common case
- Backward branches typically result from loops - and only the
last time the branch is encountered will it not be taken.
- Some machines incorporate bits into the format for branch
instructions whereby the compiler can furnish a hint as to
whether the branch will be taken.
- Some machines maintain a branch history table which stores
the address from which a given branch instruction was fetched
and an indication as to whether it was taken the last time
it was executed.
ii. In any case, prediction requires the ability to reach a
definitive decision about whether the branch is going to be
taken before any following instructions have stored any values
into memory or registers.
iii. Prediction cannot eliminate all problems, because if the branch
is predicted to be taken, it is not possible to begin
prefetching down the new path until the target address has
been computed (in stage 2 of the pipelines we have been using
for examples.) CPU's that record branch history can avoid this
problem by also storing the target address of the branch.
(Since the target address of a branch instruction is generally
computed by PC + displacement in instruction, a given branch
instruction will always point to the same target.)
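A toy sketch of such a branch history table: it is indexed by the address the
branch was fetched from and remembers both the last outcome and the target, so
prefetching can be redirected immediately (the structure and addresses below
are made up for illustration):

    branch_history = {}     # fetch address -> (taken last time?, target address)

    def predict(fetch_addr):
        # Default prediction for a branch we have never seen: not taken
        return branch_history.get(fetch_addr, (False, None))

    def update(fetch_addr, taken, target):
        # Record the actual outcome once the branch resolves
        branch_history[fetch_addr] = (taken, target)

    print(predict(0x400100))              # (False, None) - never seen before
    update(0x400100, True, 0x4000C0)      # a backward loop branch that was taken
    print(predict(0x400100))              # (True, 4194496) - taken; stored target (decimal)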
b. An alternative to branch prediction that is being used on the new
Intel RISC architecture in development (IA 64) is called
PREDICATION. In this strategy, the CPU includes a number of one
bit predicate registers that can be set by conditional instructions.
The instruction format includes a number of bits that allow
execution of an instruction to be contingent on a particular
predicate register being true (or false). Further, a predicated
instruction can begin executing before the value of its predicate
is actually known, as long as the value becomes known before the
instruction needs to store its result. This can eliminate the
need for a lot of branch instructions.
Example:
if r10 = r11 then
r9 = r9 + 1
else
r9 = r9 - 1
Would be translated on MIPS as:
bne  $10, $11, else
nop                     # Branch delay slot
b    endif
addi $9, $9, 1          # In branch delay slot - always done
else:
addi $9, $9, -1
endif:
Which is 5 instructions long and needs 4 clocks if $10 = $11
and 3 if not.
But on a machine with predication as:
set predicate register 1 true if $10 = $11
(if predicate register 1 is true) addi $9, $9, 1
(if predicate register 1 is false) addi $9, $9, -1
Which is 3 instructions long (all of which can be done in
parallel, provided the set predicate instruction sets the
predicate register earlier in its execution than the other
two store their results.)
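The effect of predication can be modeled in ordinary code: both additions are
expressed, but each is allowed to store its result only if its predicate is
true (a sketch of the idea only, not of any particular instruction set):

    def predicated_add(value, amount, predicate):
        # The instruction goes down the pipeline either way; it only gets to
        # store its result if its predicate turns out to be true
        return value + amount if predicate else value

    r9, r10, r11 = 5, 3, 3
    p1 = (r10 == r11)                       # set predicate register 1
    r9 = predicated_add(r9, +1, p1)         # (p1)  add 1
    r9 = predicated_add(r9, -1, not p1)     # (!p1) subtract 1
    print(r9)                               # 6 - only the first add took effect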
D. Advanced CPU's use both superpipelining and superscalar techniques.
(E.g. Dec Alpha is both superpipelined and superscalar). The benefits
that can be achieved are, of course, dependent on the ability of the
compiler to arrange instructions in the program so that when one
instruction depends upon another it occurs enough later in the program
to prevent hazards from stalling execution and wasting the speedup that
could otherwise be attained.
Copyright ©1999 - Russell C. Bjork