CS222 Lecture: Parallelism Last revised 5/2/99
I. Introduction
-- ------------
A. Today's computers are orders of magnitude faster than those of a
generation ago. Nonetheless, there is a continuing demand for ever
faster computers.
Why? ASK
1. Applications that are not feasible with present computing speeds, but
would become feasible if speeds improved - e.g. improved weather
forecasting, simulation of various natural and man-made systems, etc.
2. Volume of use - e.g. servers that must carry an ever-increasing load
as more and more people make use of the services they provide.
3. People who use computers want instant results.
B. We have already seen that the time taken to perform a given task on a
computer can be calculated using the following equation:
instructions X clock-cycles/instruction X seconds/clock-cycle
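For example (the numbers here are made up purely for illustration), a
program that executes 10,000,000 instructions, averaging 2 clock cycles
per instruction on a machine with a 10 ns clock (100 MHz), would take
10,000,000 X 2 X 10 ns = 0.2 seconds.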
C. Likewise, we have already seen that there are basically three ways to
make a computer system faster.
1. We can use better, faster algorithms - thus reducing the number of
instructions that need to be executed.
a. Historically, a great deal of progress has been made in this area,
and research certainly continues.
b. However, certain tasks have an inherent complexity that constitutes
a lower bound on the complexity of any algorithmic solution to
the problem.
Example: sorting by comparison is provably Omega(n log n). While
a better algorithm might improve the constant of
proportionality some, no comparison-based algorithm can
do asymptotically better than n log n for this task.
2. We can use faster components (CPU's, memories, etc.) - thus reducing
the time per clock cycle.
a. Again, significant progress has been made here.
i. The CPU's in the first PC's had clock rates on the order of 4
MHz - today systems are being developed that exceed 400 MHz - a
100-fold improvement.
ii. Magnetic core memory - the dominant memory technology 25
years ago - had an access time on the order of 1 microsecond.
Today's dynamic RAMs have access times on the order of 60 ns -
a 16 to 1 speed up - and the technology used for cache memory
is almost ten times as fast as that - so overall memory speeds
have also increased 100-fold.
b. While progress is continuing to be made in this area, we are
nearing some fundamental physical limits - e.g. CPU clock speeds
are increased by making the individual features on the chip
smaller, but there are fundamental limits on how small an
individual feature can be.
3. We can make use of parallelism - doing two or more things at the
same time - thus reducing the effective number of clock cycles
needed per instruction.
a. Effective number of cycles per instruction = number of cycles
needed for a single instruction / number of instructions being
executed in parallel.
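For example (with made-up numbers), if each instruction individually
needs 5 clock cycles but 5 instructions are in progress at once, the
effective number of cycles per instruction is 5 / 5 = 1; if two such
pipelines issue instructions in parallel, it drops to 5 / 10 = 0.5.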
b. Example: We have already looked at how RISC's make use of
pipelining - the execution of different portions of several
instructions at the same time - to reduce the effective number
of clock cycles per instruction to 1. Likewise, we have seen that
further development of this approach can reduce the effective number
of clock cycles per instruction to less than one by issuing more
than one instruction at a time.
i. This sort of parallelism is known as INSTRUCTION-LEVEL
PARALLELISM.
ii. The degree of parallelism that can be achieved this way is
inherently limited by sequential dependencies among instructions.
c. Pipelined computers typically have one instance of each basic
functional unit, but keep all of the functional units busy
constantly by working on different pieces of several different
instructions at the same time. Further
improvements are possible by the replication of functional units
or whole CPU's, so that several different instances of the SAME
kind of work can be done in parallel.
d. As we approach fundamental limits in terms of efficiency of
algorithms and raw CPU/memory speed, parallelism becomes the
key to achieving further major improvements in computer speed.
It is thus, in some sense, the "final frontier" of computer
architecture.
D. For the remainder of this lecture, we will look at:
1. Specialized CPU's for specialized tasks - vector processors.
2. Systems with multiple ALU's - array processors.
3. Multiple-CPU systems - multiprocessors.
NOTE: The chapter in the book focuses on the last of these kinds of
system, and discusses the first two types of system in its
historical perspective section.
E. First, though, we want to look at a system for classifying parallel
systems, known as FLYNN'S TAXONOMY.
1. Flynn's taxonomy classifies systems by the number of instruction
streams and data streams being processed at the same time.
2. Computers of the sort we have discussed up until now - i.e. all
conventional computers - are classified as SISD. A SISD machine has a
Single Instruction Stream and a Single Data Stream - it executes one
sequence of instructions, and each instruction works on one set of
operands.
3. A SIMD machine has a Single Instruction Stream and Multiple Data
Streams - it executes one sequence of instructions, but each instruction
can work on several sets of operands in parallel. Vector processors
and array processors - which we shall discuss shortly - fall into
this category.
4. A possibility that doesn't really exist in practice is MISD (Multiple
Instruction Streams, Single Data Stream) - i.e. one stream of data
passes through the system, but each item of data is operated on by
several instructions. (Though vector processors are sometimes
classified this way, that is probably not a good classification.)
5. A MIMD machine has Multiple Instruction Streams, each operating on
its own Data Stream. This kind of system, then, has multiple full
CPU's, and is commonly known as a multiprocessor system.
II. Vector Processors
-- ------ ----------
A. One interesting class of parallel machines is the class of Vector
processors. These differ from conventional CPU's in the following way:
1. A vector processor has a pipelined ALU with flexible interconnections
between the elements.
2. Its instruction set has two kinds of instructions:
a. Ordinary scalar instructions that operate on a single set of
scalar operands to produce a single scalar result.
b. Vector instructions that carry out an operation on a set of
vector operands to produce a vector result. This is done by
piping entire vectors through the pipeline, so that the operation
in question is applied to corresponding elements of each vector.
Once the pipeline is filled, it is usually possible to produce
one result per clock period.
c. Example: If A, B, and C are two-dimensional matrices, then the
matrix addition
MAT A = B + C
would require a pair of nested for loops on a conventional CPU,
but a single machine instruction on a vector processor. (Though
this instruction would take a long time to execute - typically
one clock per element of the matrix, plus some additional clocks
for setup and final flushing of the pipeline.)
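To make the contrast concrete, here is a rough C sketch of the nested
loops a conventional CPU would need (the dimensions are hypothetical);
on a vector processor, the whole assignment could instead be translated
into a single vector instruction.

#define ROWS 100                /* hypothetical matrix dimensions */
#define COLS 100

void mat_add(float a[ROWS][COLS], float b[ROWS][COLS],
             float c[ROWS][COLS])
{
    int i, j;

    for (i = 0; i < ROWS; i++)          /* one pass per row ... */
        for (j = 0; j < COLS; j++)      /* ... and per element in the row */
            a[i][j] = b[i][j] + c[i][j];
}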
B. For example, a few years ago, Digital announced a Vector Processing
extension to the VAX architecture that adds various vector operations to
the standard VAX instruction set. (This is an option on certain of
their larger machines.)
1. Example: VVADDF
Adds successive pairs of elements from two floating point source
vectors and stores the sums in successive elements of a floating
point destination vector. (All three vectors are, of course, of
the same length).
2. Contrast the task of adding two one-dimensional 1000 element real
vectors (B and C) to produce a vector sum (A), using conventional
instructions and VVADDF.
a. Conventional loop approach:
MOVL #1000, R0
MOVAF A, R1
MOVAF B, R2
MOVAF C, R3
LOOP: ADDF (R2)+, (R3)+, (R1)+
SOBGTR R0, LOOP
This requires the execution of a total of 2004 instructions, each
of which must be fetched, decoded, and executed. The total
number of clocks would be several times this, since each
instruction would require several clocks to execute.
b. Using VVADDF
VVADDF B, C, A
One instruction, and a bit more than 1000 clocks. (Once the
instruction and operand specifiers are fetched, execution
proceeds at one element per clock).
c. Note that, on a RISC, the total number of instructions (and hence
clocks) would be more like 7000-9000 - the loop body would need:
i. Two load instructions to load a pair of elements from the
two source vectors into registers.
ii. One add instruction
iii. One store instruction to store the result back to the
destination vector.
iv. At least one - and quite likely three - instructions to
increment address registers pointing to the two source vectors
and the destination vector. (To use a single index register
for all three, it would have to be possible to fit the base
address of each vector into the 16 bits allowed for a
displacement on an architecture like MIPS).
v. A decrement for the counter, followed by a conditional branch.
All of the above (7-9 instructions) would be executed 1000 times.
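At the source level, both the VAX loop above and the RISC code just
itemized correspond to something like the following C sketch (the
declarations are assumed); the difference lies entirely in how many
machine instructions each architecture needs per iteration.

void vec_add(float a[1000], const float b[1000], const float c[1000])
{
    int i;

    for (i = 0; i < 1000; i++)   /* 1000 iterations ... */
        a[i] = b[i] + c[i];      /* ... one element added per iteration */
}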
III. Array Processors
--- ----- ----------
A. Another interesting class of parallel machines is the class of
Array processors. These differ from conventional CPU's in the following
way:
1. A conventional CPU has a single (possibly pipelined) ALU. Thus, each
instruction that is fetched and executed initiates a single
computation in the ALU - though several such computations may overlap
in time.
2. An array processor has an array of processing elements - perhaps
scores or even hundreds.
a. Each processing element contains its own ALU (which may be
pipelined to achieve maximum speed.)
b. Each processing element also has its own private memory - a set
of registers and possibly a local RAM.
c. A single user instruction may initiate a computation in each ALU
(with each, of course, working on a different data item), or it may
initiate computation only in a subset of the ALU's or even just one.
d. To make full use of the power of the array processor, data must
be distributed appropriately across the processing elements.
Example: If we were working with matrices of 1000 elements each,
and had 1000 PE's, we might store one element of each
matrix in each PE - i.e. the [1] element of each matrix
would be stored with PE 1. This would allow all elements
of the matrix to be operated on at the same time.
3. The obvious use of such processors is in working with the vectors
and matrices that arise frequently in scientific computation. For
example, the matrix addition we considered above:
MAT A = B + C
could also be done by a single instruction on an array processor,
with each ALU performing a single addition in parallel with all the
others - assuming the elements of A, B, and C are distributed across
the various processing elements as described above.
a. Contrast this with the vector processor we just considered. On
the vector processor, the additions are still done sequentially,
but pipelining is used to finish one addition per clock period.
b. On an array processor, all the additions are done at the same
time, but by different ALU's - i.e. in at most a handful of clocks!
c. Clearly, the array processor is even faster than the vector
processor - but also more costly!
4. However, other applications arise as well - e.g. parallel searches -
each processor compares the key of a desired piece of information
against its locally-held information in parallel with other processors
doing the same thing.
B. An example of such a machine is the Connection Machine - developed
as a doctoral thesis by Daniel Hillis at MIT and at one time marketed
commercially by Thinking Machines, Inc in Cambridge.
1. A Connection Machine had a large array of very simple processing
elements. For example, the model CM-2 had up to 65,536 processors,
each of which had
a. A one bit ALU.
b. Four one bit registers
c. 64K bits of local memory
d. A network connection
2. The extreme simplicity of the processing elements was made up for
by the sheer number of them. For example, because arithmetic was
done one bit at a time, a 32-bit addition took 21 microseconds.
However, if 65,536 additions were done in parallel, the effective
time per addition dropped to about 0.3 ns (21 microseconds / 65,536).
3. All the processing elements received instructions broadcast by a
central control unit, and acted on them (when appropriate) in
parallel. (Condition bits in the individual PE's allowed only a subset
to respond to a given instruction, with the others remaining idle.)
4. A key feature of the architecture was the existence of flexible
interconnections between the PE's (hence the name "Connection
Machine") whereby the results of a computation on one PE may be
passed on to another PE for further work.
Example: suppose a 1024 element array is distributed over 512
PE's - 2 elements per PE. Consider summing all the
elements of the array (a 1023 step operation on a
conventional machine).
On the connection machine, each PE could add its two
elements in parallel, and the odd ones could broadcast
the sum to their even neighbor.
Next, each even PE could add its own sum plus that from its
neighbor, and half could broadcast their sums to the other
half.
Clearly, computing the overall sum this way would involve:
1 step using all 512 PE's
1 using 256
1 using 128
...
1 using 1 PE - or 10 steps taking just 10 time units.
In general, summing n numbers (with at most two on any
one processor) would take O(log n) time - hence this
algorithm is called the log sum algorithm.
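The following C program is a purely sequential sketch of the same
log-sum idea (the data values are made up); each pass of the outer loop
corresponds to one parallel step, in which all the additions for that
step would be carried out simultaneously by different PE's.

#include <stdio.h>

#define N 1024                        /* 1024 elements, as in the example */

int main(void)
{
    static double a[N];
    int i, stride;

    for (i = 0; i < N; i++)           /* made-up data: all 1's */
        a[i] = 1.0;

    /* Each pass halves the number of partial sums; for N = 1024 there
       are 10 passes, i.e. 10 parallel steps. */
    for (stride = 1; stride < N; stride *= 2)
        for (i = 0; i + stride < N; i += 2 * stride)
            a[i] += a[i + stride];

    printf("sum = %.0f\n", a[0]);     /* prints 1024 */
    return 0;
}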
C. Thinking Machines Corporation has since sold their product line to
the Gores Technology Group, which has created a business unit called
Connection Machines Services, Inc. that evidently still markets
the CM-2 and later CM-5 system; however, their web site was not
very informative. To my knowledge there are no other current commercial
examples of array processors. Instead of having multiple ALU's with a
single control unit, most of today's parallel machines are based on
multiple off-the-shelf CPU's. However, the array processor idea may not
be dead forever!
IV. System Speed-Up by Using Multiple Full CPU's: Multiprocessors
-- ------ -------- -- ----- -------- ---- ----- ---------------
A. A SISD machine has one CPU with one ALU, one memory, and one control.
A SIMD machine has one CPU with either a pipelined ALU (vector
processor) or else multiple ALU's and memories (array processor), but
still one control. Further parallelism can be achieved by using multiple
complete CPU's.
1. Such a machine is called MIMD. It has Multiple Instruction Streams,
each working on its own Data Stream - hence Multiple Data Streams as
well. That is, each processor carries out its own set of instructions.
2. MIMD machines are distinguished from computer networks (which they
resemble in having multiple CPU's) by the fact that in a network the
cooperation of the CPU's is occasional, while in a MIMD machine all
the CPU's work together on a common task.
B. MIMD systems are further classified by how the CPU's cooperate with
one another, and by how they are connected.
1. MIMD systems can be either based on SHARED MEMORY or on MESSAGE
PASSING.
a. In a shared memory system, all the CPU's share physical memory
(or at least some portion of it) in common. They communicate
with one another by writing/reading variables contained in
shared memory. A key feature of such a system is a single
address space - i.e. address 1000 refers to the same item in
memory, regardless of which processor generates the address.
b. In a message passing based system, each CPU has its own memory,
and CPU's cooperate by explicitly sending messages to one another.
2. The CPU's in a MIMD system can either be connected by having a COMMON
BUS, or by a NETWORK.
C. Further comments about shared memory systems.
1. Two further variations are possible in a shared memory system:
a. In a Uniform Memory Access system (UMA), the time needed to access
a given location in memory is the same, regardless of which
processor accesses it. (Such systems are also called SYMMETRIC
MULTIPROCESSING (SMP) SYSTEMS.)
b. In a Non-Uniform Memory Access system (NUMA), a given processor
may have faster access to some regions of memory than others.
This is usually a consequence of different processors "owning"
different parts of the overall memory, so that some accesses are
more efficient than others.
2. Either way, some mechanism is needed for SYNCHRONIZING accesses to
shared memory.
a. Example: Suppose two different processors both need to periodically
increment some shared variable (perhaps a counter of some sort.)
If the processors are RISCs, the code would look something like:
lw $1, v
nop
addi $1, $1, 1
sw $1, v
Now suppose two processors happen to need to increment the variable
at about the same time. Suppose, further, that its initial
value is 40, so that it should be 42 after both increments take
place. Finally, suppose the following sequence of operations
occurs:
Processor 1 Processor 2
lw $1, v
nop lw $1, v
addi $1, $1, 1 nop
sw $1, v addi $1, $1, 1
sw $1, v
What is the final value of v? ASK
What went wrong?
b. Synchronization mechanisms are studied in detail in the operating
systems course (CS322), which you will take next year. Suffice
it to say that, at the heart of most such mechanisms is the idea
of LOCKING a data item so that one processor has exclusive
access to it during the performance of an operation like the one
just described.
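For instance, on a shared-memory system programmed with POSIX threads
(an assumption for illustration - the lecture doesn't prescribe any
particular mechanism), the increment could be protected by a lock
roughly as follows:

#include <pthread.h>

static int v = 40;                       /* the shared counter */
static pthread_mutex_t v_lock = PTHREAD_MUTEX_INITIALIZER;

void increment_v(void)
{
    pthread_mutex_lock(&v_lock);         /* gain exclusive access to v */
    v = v + 1;                           /* the lw / addi / sw sequence */
    pthread_mutex_unlock(&v_lock);       /* let other processors proceed */
}

With the lock in place, the interleaving shown above cannot occur, and
two increments always leave v at 42.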
3. Finally, because contention for access to shared memory would
severely limit the performance of a shared memory system, it is
essential for each processor to have its own cache. (Thus, the number of
times a given processor actually accesses the shared memory is
minimized.) But now a new problem arises, because when one
processor updates a shared variable, copies of the old value may
still reside in the caches of other processors.
a. This problem is called the CACHE COHERENCY problem.
b. If a common bus is used to connect all the processors to the
shared memory, then one possible solution is to use SNOOPY
CACHEs, in conjunction with write-through caching.
i. In addition to responding to requests for data from its own
processor, a snoopy cache also monitors bus traffic and listens
for any memory write being done by another processor to a
location it contains.
ii. When it hears such a write taking place, it either updates or
invalidates its own copy.
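As a rough illustration (purely hypothetical code - no particular
machine's protocol is being described, and a direct-mapped cache is
assumed), the snooping logic amounts to something like this:

#define LINES 256                        /* hypothetical cache size */

struct cache {
    unsigned tag[LINES];
    int      valid[LINES];
};

/* Called for every write this cache observes some OTHER processor
   making on the shared bus.  With write-through caching, main memory
   already has the new value, so invalidating our copy is sufficient. */
void snoop_write(struct cache *c, unsigned block_addr)
{
    unsigned index = block_addr % LINES;

    if (c->valid[index] && c->tag[index] == block_addr / LINES)
        c->valid[index] = 0;             /* drop the now-stale copy */
}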
c. Because this simple snooping scheme depends on write-through caching
and a common bus, other cache coherency protocols have been developed
for write-back caches and for systems without a common bus. These
are discussed in the text.
4. Given the issues of synchronization and cache coherence, one might
ask why use shared memory? The basic answer is that it minimizes
overhead for inter-processor communication: the time needed to read
or write a shared variable is much less than the time needed to
send or receive a message.
D. Further comments about connection structures.
1. Conceptually, the simplest connection structure is a common bus.
However, the number of processors that can be connected to one bus
is limited in practice by problems of BUS CONTENTION, since only
one processor can be using the bus at a time. Also, as the number of
processors connected to the bus increases, its length must increase
and, as a result, its speed must decrease.
2. Use of some sort of network connection is attractive because it
allows more processors to be interconnected. Of course, there are
all sorts of possible network configurations that might be used.
a. A configuration in which each processor is directly connected to
each other processor is attractive, but requires building O(N^2)
links, with O(N^2) interfaces. (I.e. each CPU needs N-1
interfaces to the N-1 links connecting it to the other N-1 CPU's).
b. On the other hand, if the number of links is kept down (to reduce
cost), then communication delays increase when the destination
can be reached only indirectly; and links that lie on many paths
between different processors can become bottlenecks.
c. The book discusses several options for connection architectures.
We will look at just one.
E. One example of a MIMD connection architecture is the hyper-cube.
1. A hypercube has 2^n processors, for some n. Each CPU has a direct
link to n neighbors.
a. Example: n = 2 - 4 CPU's: P --- P
| |
P --- P
b. Example: n = 3 - 8 CPU's: P --- P
/ /|
P --- P P
| |/
P --- P
c. Example: n = 4 - 16 CPU's: Two 3-cubes (8 CPU's each), with each
node also linked to the corresponding node in the other cube.
d. Commercial hypercube systems are available for at least n=6
(64 CPU's), 7 (128), 8 (256), 9 (512).
2. On a hypercube of dimensionality n, any processor can send a message
to any other processor using at most n steps. A simple routing
algorithm can be derived if the processors are assigned id numbers in
the range 0 .. 2^n - 1 in such a way that the binary representation
of each processor's id number differs from that of each of its
neighbors in exactly one bit position. Each bit position corresponds
to a direction of movement in n-dimensional space.
Example: In a 4 cube, the neighbors of processor 1010 are
0010, 1110, 1000, and 1011
A path can be found from one processor to another by xoring their
id numbers. Each 1 in the result corresponds to a direction in
which the message has to move.
Example: 3 cube
110 --- 111
/ / |
100 --- 101 011
| | /
000 --- 001 (010 hidden in back)
To send a message from 110 to 011:
110 xor 011 = 101. Message needs to move up (1xx) and right (xx1)
Possible paths: 110 010 011 or 110 111 011
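A short C sketch of this routing rule (hypothetical code, not taken
from any particular machine): at each step, flip one bit in which the
current node's id still differs from the destination's.

#include <stdio.h>

void route(unsigned src, unsigned dst, unsigned n)   /* n = cube dimension */
{
    unsigned cur = src, bit;

    printf("%u", cur);
    for (bit = 0; bit < n; bit++)
        if ((cur ^ dst) & (1u << bit)) {  /* this bit still differs */
            cur ^= 1u << bit;             /* move along that dimension */
            printf(" -> %u", cur);
        }
    printf("\n");
}

int main(void)
{
    route(6, 3, 3);      /* 110 to 011 in a 3-cube: prints 6 -> 7 -> 3 */
    return 0;
}

Flipping the differing bits in the other order yields the other
possible shortest path (110 010 011).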
V. Conclusion
- ----------
A. One ongoing research question is how to get a performance increase out
of a parallel system that is commensurate with the investment in extra
hardware.
1. The SPEEDUP of a parallel system is defined as
Time to perform a given task on a single, non-parallel system
---------------------------------------------------------------
Time to perform the same task on a parallel system
2. Ideally, we would achieve LINEAR SPEEDUP, where a parallel system
with n processors has a speedup of n when compared with a single
processor.
3. However, such speedup is never totally attainable, and often
difficult to even approach, due to issues like:
a. Inherent sequentiality in the problem.
Example: Earlier we considered adding up the elements of a 1024
element vector using 512 processors. We showed that this
can be done by the log sum algorithm in 10 steps. Since
doing the addition on a single processor would take 1023
steps, the speedup is about 100 - far less than the ideal
given the use of 512 processors.
b. Various overheads associated with parallelism - e.g.
i. Contention for shared resources (bus, memory).
ii. Synchronization.
iii. Network overhead for communication.
4. Whole textbooks exist on this topic!
B. In addition to enhanced speed, though, there are other motivations
for building parallel systems that may be even more important.
ASK
1. Reliability/availability
2. Incremental upgrade
Copyright ©1999 - Russell C. Bjork