CS222 Lecture: Parallelism Last revised 5/2/99
I. Introduction
-- ------------
A. Today's computers are orders of magnitude faster than those of a
generation ago. Nonetheless, there is a continuing demand for ever
faster computers.
Why? ASK
1. Applications that are not feasible with present computing speeds, but
would become feasible if speeds improved - e.g. improved weather
forecasting, simulation of various natural and man-made systems, etc.
2. Volume of use - e.g. servers that must carry an ever-increasing load
as more and more people make use of the services they provide.
3. People who use computers want instant results.
B. We have already seen that the time taken to perform a given task on a
computer can be calculated using the following equation:
instructions X clock-cycles/instruction X seconds/clock-cycle
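For example (the numbers here are made up purely for illustration), a
program that executes 10,000,000 instructions, averaging 2 clock cycles
per instruction on a machine with a 10 ns clock (100 MHz), would take
10,000,000 X 2 X 10 ns = 0.2 seconds.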
C. Likewise, we have already seen that there are basically three ways to
make a computer system faster.
1. We can use better, faster algorithms - thus reducing the number of
instructions that need to be executed.
a. Historically, a great deal of progress has been made in this area,
and research certainly continues.
b. However, certain tasks have an inherent complexity that constitutes
a lower bound on the complexity of any algorithmic solution to
the problem.
Example: sorting by comparison is provably Omega(n log n). While
a better algorithm might improve the constant of
proportionality some, no comparison-based algorithm can
do asymptotically better than n log n for this task.
2. We can use faster components (CPU's, memories, etc.) - thus reducing
the time per clock cycle.
a. Again, significant progress has been made here.
i. The CPU's in the first PC's had clock rates on the order of 4
MHz - today systems are being developed that exceed 400 MHz - a
100-fold improvement.
ii. Magnetic core memory - the dominant memory technology 25
years ago - had an access time on the order of 1 microsecond.
Today's dynamic RAMs have access times on the order of 60 ns -
a 16 to 1 speed up - and the technology used for cache memory
is almost ten times as fast as that - so overall memory speeds
have also increased 100-fold.
b. While progress is continuing to be made in this area, we are
nearing some fundamental physical limits - e.g. CPU clock speeds
are increased by making the individual features on the chip
smaller, but there are fundamental limits on how small an
individual feature can be.
3. We can make use of parallelism - doing two or more things at the
same time - thus reducing the effective number of clock cycles
needed per instruction.
a. Effective number of cycles per instruction = number of cycles
needed for a single instruction / number of instructions being
executed in parallel.
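For example (with made-up numbers), if each instruction individually
needs 5 clock cycles but 5 instructions are in progress at once, the
effective number of cycles per instruction is 5 / 5 = 1; if two such
pipelines issue instructions in parallel, it drops to 5 / 10 = 0.5.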
b. Example: We have already looked at how RISC's make use of
pipelining - the execution of different portions of several
instructions at the same time - to reduce the effective number
of clock cycles per instruction to 1. Likewise, we have seen that
further development of this approach can reduce the effective number
of clock cycles per instruction to less than one by issuing more
than one instruction at a time.
i. This sort of parallelism is known as INSTRUCTION-LEVEL
PARALLELISM.
ii. The degree of parallelism that can be achieved this way is
inherently limited by sequential dependencies among instructions.
c. Pipelined computers typically have one instance of each basic
functional unit, but keep all of the functional units busy
constantly by working on different pieces of several different
instructions at the same time. Further
improvements are possible by the replication of functional units
or whole CPU's, so that several different instances of the SAME
kind of work can be done in parallel.
d. As we approach fundamental limits in terms of efficiency of
algorithms and raw CPU/memory speed, parallelism becomes the
key to achieving further major improvements in computer speed.
It is thus, in some sense, the "final frontier" of computer
architecture.
D. For the remainder of this lecture, we will look at:
1. Specialized CPU's for specialized tasks - vector processors.
2. Systems with multiple ALU's - array processors.
3. Multiple-CPU systems - multiprocessors.
NOTE: The chapter in the book focuses on the last of these kinds of
system, and discusses the first two types of system in its
historical perspective section.
E. First, though, we want to look at a system for classifying parallel
systems, known as FLYNN'S TAXONOMY.
1. Flynn's taxonomy classifies systems by the number of instruction
streams and data streams being processed at the same time.
2. Computers of the sort we have discussed up until now - i.e. all
conventional computers - are classified as SISD. A SISD machine has a
Single Instruction Stream and a Single Data Stream - it executes one
sequence of instructions, and each instruction works on one set of
operands.
3. A SIMD machine has a Single Instruction Stream and Multiple Data
Streams - it executes one sequence of instructions, but each instruction
can work on several sets of operands in parallel. Vector processors
and array processors - which we shall discuss shortly - fall into
this category.
4. A possibility that doesn't really exist in practice is MISD (Multiple
Instruction Streams, Single Data Stream) - i.e. one stream of data
passes through the system, but each item of data is operated on by
several instructions. (Though vector processors are sometimes
classified this way, that is probably not a good classification.)
5. A MIMD machine has Multiple Instruction Streams, each operating on
its own Data Stream. This kind of system, then, has multiple full
CPU's, and is commonly known as a multiprocessor system.
II. Vector Processors
-- ------ ----------
A. One interesting class of parallel machines is the class of Vector
processors. These differ from conventional CPU's in the following way:
1. A vector processor has a pipelined ALU with flexible interconnections
between the elements.
2. Its instruction set has two kinds of instructions:
a. Ordinary scalar instructions that operate on a single set of
scalar operands to produce a single scalar result.
b. Vector instructions that carry out an operation on a set of
vector operands to produce a vector result. This is done by
piping entire vectors through the pipeline, so that the operation
in question is applied to corresponding elements of each vector.
Once the pipeline is filled, it is usually possible to produce
one result per clock period.
c. Example: If A, B, and C are two-dimensional matrices, then the
matrix addition
MAT A = B + C
would require a pair of nested for loops on a conventional CPU,
but a single machine instruction on a vector processor. (Though
this instruction would take a long time to execute - typically
one clock per element of the matrix, plus some additional clocks
for setup and final flushing of the pipeline.)
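To make the contrast concrete, here is a rough C sketch of the nested
loops a conventional CPU would need (the dimensions are hypothetical);
on a vector processor, the whole assignment could instead be translated
into a single vector instruction.

#define ROWS 100                /* hypothetical matrix dimensions */
#define COLS 100

void mat_add(float a[ROWS][COLS], float b[ROWS][COLS],
             float c[ROWS][COLS])
{
    int i, j;

    for (i = 0; i < ROWS; i++)          /* one pass per row ... */
        for (j = 0; j < COLS; j++)      /* ... and per element in the row */
            a[i][j] = b[i][j] + c[i][j];
}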
B. For example, a few years ago, Digital announced a Vector Processing
extension to the VAX architecture that adds various vector operations to
the standard VAX instruction set. (This is an option on certain of
their larger machines.)
1. Example: VVADDF
Adds successive pairs of elements from two floating point source
vectors and stores the sums in successive elements of a floating
point destination vector. (All three vectors are, of course, of
the same length).
2. Contrast the task of adding two one-dimensional 1000 element real
vectors (B and C) to produce a vector sum (A), using conventional
instructions and VVADDF.
a. Conventional loop approach:
MOVL #1000, R0
MOVAF A, R1
MOVAF B, R2
MOVAF C, R3
LOOP: ADDF (R2)+, (R3)+, (R1)+
SOBGTR R0, LOOP
This requires the execution of a total of 2004 instructions, each
of which must be fetched, decoded, and executed. The total
number of clocks would be several times this, since each
instruction would require several clocks to execute.
b. Using VVADDF
VVADDF B, C, A
One instruction, and a bit more than 1000 clocks. (Once the
instruction and operand specifiers are fetched, execution
proceeds at one element per clock).
c. Note that, on a RISC, the total number of instructions (and hence
clocks) would be more like 7000-9000 - the loop body would need:
i. Two load instructions to load a pair of elements from the
two source vectors into registers.
ii. One add instruction
iii. One store instruction to store the result back to the
destination vector.
iv. At least one - and quite likely three - instructions to
increment address registers pointing to the two source vectors
and the destination vector. (To use a single index register
for all three, it would have to be possible to fit the base
address of each vector into the 16 bits allowed for a
displacement on an architecture like MIPS).
v. A decrement for the counter, followed by a conditional branch.
All of the above (7-9 instructions) would be executed 1000 times.
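At the source level, both the VAX loop above and the RISC code just
itemized correspond to something like the following C sketch (the
declarations are assumed); the difference lies entirely in how many
machine instructions each architecture needs per iteration.

void vec_add(float a[1000], const float b[1000], const float c[1000])
{
    int i;

    for (i = 0; i < 1000; i++)   /* 1000 iterations ... */
        a[i] = b[i] + c[i];      /* ... one element added per iteration */
}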
III. Array Processors
--- ----- ----------
A. Another interesting class of parallel machines is the class of
Array processors. These differ from conventional CPU's in the following
way:
1. A conventional CPU has a single (possibly pipelined) ALU. Thus, each
instruction that is fetched and executed initiates a single
computation in the ALU - though several such computations may overlap
in time.
2. An array processor has an array of processing elements - perhaps
scores or even hundreds.
a. Each processing element contains its own ALU (which may be
pipelined to achieve maximum speed.)
b. Each processing element also has its own private memory - a set
of registers and possibly a local RAM.
c. A single user instruction may initiate a computation in each ALU
(with each, of course, working on a different data item), or it may
initiate computation only in a subset of the ALU's or even just one.
d. To make full use of the power of the array processor, data must
be distributed appropriately across the processing elements.
Example: If we were working with matrices of 1000 elements each,
and had 1000 PE's, we might store one element of each
matrix in each PE - i.e. the [1] element of each matrix
would be stored with PE 1. This would allow all elements
of the matrix to be operated on at the same time.
3. The obvious use of such processors is in working with the vectors
and matrices that arise frequently in scientific computation. For
example, the matrix addition we considered above:
MAT A = B + C
could also be done by a single instruction on an array processor,
with each ALU performing a single addition in parallel with all the
others - assuming the elements of A, B, and C are distributed across
the various processing elements as described above.
a. Contrast this with the vector processor we just considered. On
the vector processor, the additions are still done sequentially,
but pipelining is used to finish one addition per clock period.
b. On an array processor, all the additions are done at the same
time, but by different ALU's - i.e. in at most a handful of clocks!
c. Clearly, the array processor is even faster than the vector
processor - but also more costly!
4. However, other applications arise as well - e.g. parallel searches -
each processor compares the key of a desired piece of information
against its locally-held information in parallel with other processors
doing the same thing.
B. An example of such a machine is the Connection Machine - developed
as a doctoral thesis by Daniel Hillis at MIT and at one time marketed
commercially by Thinking Machines, Inc in Cambridge.
1. A Connection Machine had a large array of very simple processing
elements. For example, the model CM-2 had up to 65,536 processors,
each of which had
a. A one bit ALU.
b. Four one bit registers
c. 64K bits of local memory
d. A network connection
2. The extreme simplicity of the processing elements was made up for
by the sheer number of them. For example, because arithmetic was
done one bit at a time, a 32-bit addition took 21 microseconds.
However, if 65,536 additions were done in parallel, the effective
time per addition dropped to about 0.3 ns (21 microseconds / 65,536).
3. All the processing elements received instructions broadcast by a
central control unit, and acted on them (when appropriate) in
parallel. (Condition bits in the individual PE's allowed only a subset
to respond to a given instruction, with the others remaining idle.)
4. A key feature of the architecture was the existence of flexible
interconnections between the PE's (hence the name "Connection
Machine") whereby the results of a computation on one PE may be
passed on to another PE for further work.
Example: suppose a 1024 element array is distributed over 512
PE's - 2 elements per PE. Consider summing all the
elements of the array (a 1023 step operation on a
conventional machine).
On the connection machine, each PE could add its two
elements in parallel, and the odd ones could broadcast
the sum to their even neighbor.
Next, each even PE could add its own sum plus that from its
neighbor, and half could broadcast their sums to the other
half.
Clearly, computing the overall sum this way would involve:
1 step using all 512 PE's
1 using 256
1 using 128
...
1 using 1 PE - or 10 steps taking just 10 time units.
In general, summing n numbers (with at most two on any
one processor) would take O(log n) time - hence this
algorithm is called the log sum algorithm.
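The following C program is a purely sequential sketch of the same
log-sum idea (the data values are made up); each pass of the outer loop
corresponds to one parallel step, in which all the additions for that
step would be carried out simultaneously by different PE's.

#include <stdio.h>

#define N 1024                        /* 1024 elements, as in the example */

int main(void)
{
    static double a[N];
    int i, stride;

    for (i = 0; i < N; i++)           /* made-up data: all 1's */
        a[i] = 1.0;

    /* Each pass halves the number of partial sums; for N = 1024 there
       are 10 passes, i.e. 10 parallel steps. */
    for (stride = 1; stride < N; stride *= 2)
        for (i = 0; i + stride < N; i += 2 * stride)
            a[i] += a[i + stride];

    printf("sum = %.0f\n", a[0]);     /* prints 1024 */
    return 0;
}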
C. Thinking Machines Corporation has since sold their product line to
the Gores Technology Group, which has created a business unit called
Connection Machines Services, Inc. that evidently still markets
the CM-2 and later CM-5 system; however, their web site was not
very informative. To my knowledge there are no other current commercial
examples of array processors. Instead of having multiple ALU's with a
single control unit, most of today's parallel machines are based on
multiple off-the-shelf CPU's. However, the array processor idea may not
be dead forever!
IV. System Speed-Up by Using Multiple Full CPU's: Multiprocessors
-- ------ -------- -- ----- -------- ---- ----- ---------------
A. A SISD machine has one CPU with one ALU, one memory, and one control.
A SIMD machine has one CPU with either a pipelined ALU (vector
processor) or else multiple ALU's and memories (array processor), but
still one control. Further parallelism can be achieved by using multiple
complete CPU's.
1. Such a machine is called MIMD. It has Multiple Instruction Streams,
each working on its own Data Stream - hence Multiple Data Streams as
well. That is, each processor carries out its own set of instructions.
2. MIMD machines are distinguished from computer networks (which they
resemble in having multiple CPU's) by the fact that in a network the
cooperation of the CPU's is occasional, while in a MIMD machine all
the CPU's work together on a common task.
B. MIMD systems are further classified by how the CPU's cooperate with
one another, and by how they are connected.
1. MIMD systems can be either based on SHARED MEMORY or on MESSAGE
PASSING.
a. In a shared memory system, all the CPU's share physical memory
(or at least some portion of it) in common. They communicate
with one another by writing/reading variables contained in
shared memory. A key feature of such a system is a single
address space - i.e. address 1000 refers to the same item in
memory, regardless of which processor generates the address.
b. In a message passing based system, each CPU has its own memory,
and CPU's cooperate by explicitly sending messages to one another.
2. The CPU's in a MIMD system can either be connected by having a COMMON
BUS, or by a NETWORK.
C. Further comments about shared memory systems.
1. Two further variations are possible in a shared memory system:
a. In a Uniform Memory Access system (UMA), the time needed to access
a given location in memory is the same, regardless of which
processor accesses it. (Such systems are also called SYMMETRIC
MULTIPROCESSING (SMP) SYSTEMS.)
b. In a Non-Uniform Memory Access system (NUMA), a given processor
may have faster access to some regions of memory than others.
This is usually a consequence of different processors "owning"
different parts of the overall memory, so that some accesses are
more efficient than others.
2. Either way, some mechanism is needed for SYNCHRONIZING accesses to
shared memory.
a. Example: Suppose two different processors both need to periodically
increment some shared variable (perhaps a counter of some sort.)
If the processors are RISCs, the code would look something like:
lw $1, v
nop
addi $1, $1, 1
sw $1, v
Now suppose two processors happen to need to increment the variable
at about the same time. Suppose, further, that its initial
value is 40, so that it should be 42 after both increments take
place. Finally, suppose the following sequence of operations
occurs:
Processor 1 Processor 2
lw $1, v
nop lw $1, v
addi $1, $1, 1 nop
sw $1, v addi $1, $1, 1
sw $1, v
What is the final value of v? ASK
What went wrong?
b. Synchronization mechanisms are studied in detail in the operating
systems course (CS322), which you will take next year. Suffice
it to say that, at the heart of most such mechanisms is the idea
of LOCKING a data item so that one processor has exclusive
access to it during the performance of an operation like the one
just described.
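For instance, on a shared-memory system programmed with POSIX threads
(an assumption for illustration - the lecture doesn't prescribe any
particular mechanism), the increment could be protected by a lock
roughly as follows:

#include <pthread.h>

static int v = 40;                       /* the shared counter */
static pthread_mutex_t v_lock = PTHREAD_MUTEX_INITIALIZER;

void increment_v(void)
{
    pthread_mutex_lock(&v_lock);         /* gain exclusive access to v */
    v = v + 1;                           /* the lw / addi / sw sequence */
    pthread_mutex_unlock(&v_lock);       /* let other processors proceed */
}

With the lock in place, the interleaving shown above cannot occur, and
two increments always leave v at 42.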
3. Finally, because contention for access to shared memory would
severely limit the performance of a shared memory system, it is
essential for each processor to have its own cache. (Thus, the number of
times a given processor actually accesses the shared memory is
minimized.) But now a new problem arises, because when one
processor updates a shared variable, copies of the old value may
still reside in the caches of other processors.
a. This problem is called the CACHE COHERENCY problem.
b. If a common bus is used to connect all the processors to the
shared memory, then one possible solution is to use SNOOPY
CACHEs, in conjunction with write-through caching.
i. In addition to responding to requests for data from its own
processor, a snoopy cache also monitors bus traffic and listens
for any memory write being done by another processor to a
location it contains.
ii. When it hears such a write taking place, it either updates or
invalidates its own copy.
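As a rough illustration (purely hypothetical code - no particular
machine's protocol is being described, and a direct-mapped cache is
assumed), the snooping logic amounts to something like this:

#define LINES 256                        /* hypothetical cache size */

struct cache {
    unsigned tag[LINES];
    int      valid[LINES];
};

/* Called for every write this cache observes some OTHER processor
   making on the shared bus.  With write-through caching, main memory
   already has the new value, so invalidating our copy is sufficient. */
void snoop_write(struct cache *c, unsigned block_addr)
{
    unsigned index = block_addr % LINES;

    if (c->valid[index] && c->tag[index] == block_addr / LINES)
        c->valid[index] = 0;             /* drop the now-stale copy */
}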
c. Because this simple snooping scheme depends on write-through caching
and a common bus, other cache coherency protocols have been developed
for write-back caches and for systems without a common bus. These
are discussed in the text.
4. Given the issues of synchronization and cache coherence, one might
ask why use shared memory? The basic answer is that it minimizes
overhead for inter-processor communication: the time needed to read
or write a shared variable is much less than the time needed to
send or receive a message.
D. Further comments about connection structures.
1. Conceptually, the simplest connection structure is a common bus.
However, the number of processors that can be connected to one bus
is limited in practice by problems of BUS CONTENTION, since only
one processor can be using the bus at a time. Also, as the number of
processors connected to the bus increases, its length must increase
and, as a result, its speed must decrease.
2. Use of some sort of network connection is attractive because it
allows more processors to be interconnected. Of course, there are
all sorts of possible network configurations that might be used.
a. A configuration in which each processor is directly connected to
each other processor is attractive, but requires building O(N^2)
links, with O(N^2) interfaces. (I.e. each CPU needs N-1
interfaces to the N-1 links connecting it to the other N-1 CPU's).
b. On the other hand, if the number of links is kept down (to reduce
cost), then communication delays increase when the destination
can be reached only indirectly; and links that lie on many paths
between different processors can become bottlenecks.
c. The book discusses several options for connection architectures.
We will look at just one.
E. One example of a MIMD connection architecture is the hyper-cube.
1. A hypercube has 2^n processors, for some n. Each CPU has a direct
link to n neighbors.
a. Example: n = 2 - 4 CPU's: P --- P
| |
P --- P
b. Example: n = 3 - 8 CPU's: P --- P
/ /|
P --- P P
| |/
P --- P
c. Example: n = 4 - 16 CPU's: Two 3-cubes (8 CPU's each), with each
node also linked to the corresponding node in the other cube.
d. Commercial hypercube systems are available for at least n=6
(64 CPU's), 7 (128), 8 (256), 9 (512).
2. On a hypercube of dimensionality n, any processor can send a message
to any other processor using at most n steps. A simple routing
algorithm can be derived if the processors are assigned id numbers in
the range 0 .. 2^n - 1 in such a way that the binary representation
of each processor's id number differs from that of each of its
neighbors in exactly one bit position. Each bit position corresponds
to a direction of movement in n-dimensional space.
Example: In a 4 cube, the neighbors of processor 1010 are
0010, 1110, 1000, and 1011
A path can be found from one processor to another by xoring their
id numbers. Each 1 in the result corresponds to a direction in
which the message has to move.
Example: 3 cube
110 --- 111
/ / |
100 --- 101 011
| | /
000 --- 001 (010 hidden in back)
To send a message from 110 to 011:
110 xor 011 = 101. Message needs to move up (1xx) and right (xx1)
Possible paths: 110 010 011 or 110 111 011
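A short C sketch of this routing rule (hypothetical code, not taken
from any particular machine): at each step, flip one bit in which the
current node's id still differs from the destination's.

#include <stdio.h>

void route(unsigned src, unsigned dst, unsigned n)   /* n = cube dimension */
{
    unsigned cur = src, bit;

    printf("%u", cur);
    for (bit = 0; bit < n; bit++)
        if ((cur ^ dst) & (1u << bit)) {  /* this bit still differs */
            cur ^= 1u << bit;             /* move along that dimension */
            printf(" -> %u", cur);
        }
    printf("\n");
}

int main(void)
{
    route(6, 3, 3);      /* 110 to 011 in a 3-cube: prints 6 -> 7 -> 3 */
    return 0;
}

Flipping the differing bits in the other order yields the other
possible shortest path (110 010 011).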
V. Conclusion
- ----------
A. One ongoing research question is how to get a performance increase out
of a parallel system that is commensurate with the investment in extra
hardware.
1. The SPEEDUP of a parallel system is defined as
Time to perform a given task on a single, non-parallel system
---------------------------------------------------------------
Time to perform the same task on a parallel system
2. Ideally, we would achieve LINEAR SPEEDUP, where a parallel system
with n processors has a speedup of n when compared with a single
processor.
3. However, such speedup is never totally attainable, and often
difficult to even approach, due to issues like:
a. Inherent sequentiality in the problem.
Example: Earlier we considered adding up the elements of a 1024
element vector using 512 processors. We showed that this
can be done by the log sum algorithm in 10 steps. Since
doing the addition on a single processor would take 1023
steps, the speedup is about 100 - far less than the ideal
given the use of 512 processors.
b. Various overheads associated with parallelism - e.g.
i. Contention for shared resources (bus, memory).
ii. Synchronization.
iii. Network overhead for communication.
4. Whole textbooks exist on this topic!
B. In addition to enhanced speed, though, there are other motivations
for building parallel systems that may be even more important.
ASK
1. Reliability/availability
2. Incremental upgrade
Copyright ©1999 - Russell C. Bjork