CS352 Lecture: Distributed Database Systems; last revised 11/26/02
Client-Server Systems
Materials: Transparency from Everest p. 747
I. What is a Distributed Database System?
- ---- -- - ----------- -------- ------
A. The text defined a distributed system this way: "In a distributed
database system, the database is stored on several computers".
1. Some sorts of parallel systems (e.g. shared nothing systems) would
satisfy this definition - and, indeed, many of the issues faced by
distributed systems of the sort we will discuss also arise in
parallel systems that reside at a single site.
2. Typically, though, when the term "distributed system" is used, we
mean what Gio Wiederhold called "remotely distributed" systems - i.e.
the computers holding the database are located at multiple physical
sites.
B. There are several further variations possible within this broad
definition.
1. The individual CPU's or nodes can be separated by long distances,
and be connected by a wide area network; or they can reside in the
same building or adjacent buildings, and be connected by a local
area network. The nature of the connection between the systems has
significant impact on the cost of transmitting information from site
to site, and thus on the way certain operations are performed.
2. A distributed system can be either HOMOGENEOUS or HETEROGENEOUS.
a. In a homogeneous system, all sites run similar computer systems, with
the same operating system and the same brand of DBMS software.
b. In a heterogeneous system, different sites run different DBMS
software, perhaps on host computers running different operating
systems as well.
c. Naturally, coordination of activities between sites is much
easier in the homogeneous case - and most actual implementations
are of this type.
3. Another variation is what Gio Wiederhold calls a Federated System
a. In a federated system, different computer systems (often of
different types), owned by different owners, are tied together
to facilitate information sharing.
b. As a general rule, information sharing in a federated system is
read-only. Only the owner of a particular site can actually
update the data stored there.
c. Each site may run its own brand of DBMS, and no attempt is made
to standardize information formats etc. between sites.
d. Federated systems are incorporated into many commercial networks,
to allow users to access a variety of different databases.
e. The world-wide web can be thought of as the ultimate expression
of this notion - though it is not a database in the sense we have
been using the term in this course.
f. One of the reasons for interest in XML is that it facilitates
communication of information between heterogeneous systems.
4. The client-server model, which we also discussed recently, can also
be thought of as a form of distributed system.
II. What are the Advantages and Disadvantages of a Distributed System?
-- ---- --- --- ---------- --- ------------- -- - ----------- ------
A. We will basically consider the advantages and disadvantages of a
remotely distributed system as compared with a large, centralized system.
This accords with the trend in many companies to move away from large
central computer centers toward networks of PC's and minicomputers.
1. One of the major advantages of a distributed system is sharing of
data generated at the different sites, without requiring that all
the data be moved to a single central site.
2. Another advantage is the possibility of LOCAL CONTROL and AUTONOMY.
Within boundaries established by the need for sharing, each site
can control its own data, determine what is stored how etc.
3. A third advantage is reliability and availability.
a. In a centralized system, the failure of the main system shuts down
all processing activity until the problem is fixed.
b. In a distributed system, the failure of a site may reduce
performance and/or make some data unavailable. But processing of
most kinds of requests can continue. We say that the system has
improved availability.
4. A fourth advantage is the possibility of improved response times to
queries. This can come about in two ways:
a. Compared with a centralized system, a distributed system that
stores data at the site(s) that use it the most allows those
sites to access the data more quickly than they would if they
had to get the data from a central site via a communication
link.
b. A multi-processor system can speed response to queries by the use of
parallel processing. Two or more nodes can work on different
parts of the same query at the same time, thus reducing the delay
between the issuing of the query and response to the user.
5. A fifth advantage is the possibility of upgrading system capacity
or performance incrementally.
a. If more capacity or speed is needed on a centralized system,
usually the only option is to replace it wholesale with a new,
larger or faster system.
b. However, a distributed system can be upgraded by either adding
one or more new nodes, or by upgrading one or more nodes. The
rest of the system continues to function as it is.
B. Disadvantages
1. One major disadvantage of a distributed system is the cost and
time required for communication between sites.
a. This is not necessarily a disadvantage, if the alternatives are
a centralized system where ALL queries require communication vs
a distributed system where SOME queries can be processed locally
at the requesting site.
b. But operations requiring access to data at multiple sites will
almost always involve more communication between sites than would
be required if all the data involved were at one location.
c. The performance impact of communication depends a great deal on
what kind of communication links are used. This, in turn, depends
on distances between sites - e.g. the communications links used
for LAN's (e.g. Ethernet) are much faster than those typically
used for WAN's (e.g. leased telephone lines).
d. Also to be considered in communication cost is access delay: the
overhead time needed to set up for a message between sites. This
varies greatly from system to system, but will tend to be a
constant that is independent of message length.
e. The time cost of any given message is given by:
Access delay + (message length) / (data rate)
i. For short messages, access delay may be the dominant cost.
ii. For longer messages, (message length) / (data rate) may be the
dominant cost.
f. Since communication cost usually dominates disk access cost,
distributed systems must be optimized to minimize the number and
volume of messages, rather than disk accesses.
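As a sketch, the cost formula above can be expressed as a small function; the link figures below are hypothetical, chosen to match the query-cost example later in these notes:

```python
def message_time(length_bits, data_rate_bps, access_delay_s):
    """Time cost of one message: fixed setup overhead plus transmission time."""
    return access_delay_s + length_bits / data_rate_bps

# Hypothetical link: 50,000 bits/second with a 0.1 second access delay.
short_msg = message_time(200, 50_000, 0.1)         # one 200-bit tuple: delay dominates
bulk_msg = message_time(20_000_000, 50_000, 0.1)   # 100,000 tuples: transmission dominates
```

Note how the 0.1 second delay is nearly all of the cost of the short message, but is negligible for the bulk transfer.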
2. A second disadvantage is increased complexity. As we shall see,
choosing a query processing strategy, performing updates, dealing
with crashes, and concurrency control are all much more complex in
a distributed system.
3. A third disadvantage, related to the second, is that distributed
systems are much harder to debug. In fact, the algorithms used must
be correct by design; discovering the cause of a problem that arises
only under particular timings of operations is not possible
using conventional debugging techniques.
III. Fragmentation and Replication of Data
--- ------------- --- ----------- -- ----
A. At the heart of the idea of a distributed system is the distribution
of data over multiple sites.
1. The conceptually simplest distribution scheme is to distribute at
the table level: any given table is stored in its entirety at some
site.
2. However, there are situations which call for splitting a table up
between sites. This is called FRAGMENTATION.
3. Oftentimes, an organization will have branch offices at different
physical locations, each of which is a network site. If different
tuples in a table are associated with different sites, it may make
sense to fragment the table HORIZONTALLY - i.e. by rows.
a. Example: the book used the example of a deposit relation for
a bank, where each branch stores the rows pertaining to its
customers.
b. Example: a customer relation for a sales organization may be
stored so that each customer's data is stored by the branch
office that normally services him/her/it.
c. When a table is fragmented horizontally, the entire table can
be reconstructed by doing the UNION of the fragments.
d. The impact of fragmentation on performance involves both gains
and losses:
i. Operations by a local office on relevant rows of the table
incur no communications cost.
ii. Operations on the entire table (e.g. complete printouts,
searches for rows meeting some criterion other than the one
used for fragmentation, summaries, etc) may incur massive
communications cost - more than if the entire unfragmented
relation were stored at one site.
4. In other cases, it will make sense to fragment a table VERTICALLY -
i.e. by columns.
a. One motivation for this is security.
i. Most DBMS's provide mechanisms whereby certain users may be
given access to some columns of a table but not others. (In
a relational scheme, this is done by using views. The
restricted users are given access to the view, but not directly
to the underlying table.)
ii. In a distributed system, there is increased risk of a breach
of the security mechanisms, since each site has some degree of
autonomy. Thus, it may be decided that certain sensitive
columns should only be stored at the site having primary
responsibility for them.
iii. For example, the salary column of a personnel relation may be
stored only at the site containing the payroll department.
Other sites may store personnel data contained in other
columns.
b. Another motivation for vertical fragmentation is selective
replication of a table. It may be decided that a copy of the
entire table should be kept at one central site, but that
other sites might store copies of the most frequently used
columns. In this way, most queries on the table can be done without
communication cost; but some queries (those involving the
infrequently-used columns) will have to be referred to the central
site.
c. A vertically-fragmented table can be reconstructed using a
natural join if one of the following criteria is met:
i. Each fragment participating in the join includes the primary
key.
or
ii. Each fragment includes a system-generated tuple identifier
that is unique for each row of the table. (In essence, this
is a system-generated primary key for the table; but it is
never actually seen by users.)
d. Note that joining fragments to reconstruct the table is only
necessary if there is NO site that has a copy of the whole table.
This would be a fairly unusual situation.
5. If reasons for both kinds of fragmentation pertain to a given table,
then it can be fragmented both ways.
EXAMPLE: TRANSPARENCY FROM EVEREST PAGE 747
a. Note that personnel data is distributed horizontally by regional
office.
b. In two of the regions, salary data is fragmented off vertically.
This is not done in the central region - perhaps the payroll
department there shares a computer system with other major users.
c. Job history data is fragmented off vertically but is not fragmented
horizontally. Presumably, an employee's past job history may
encompass several regions and so cannot be associated with any one.
This fragment is presumably used less often than other parts of
the table, and so may be stored at a central archival site.
6. A fragmentation scheme can be specified by giving a relational
algebra or SQL-like expression specifying exactly what goes into
each fragment.
a. In a horizontal fragmentation, each fragment is specified by
a selection operation that specifies what tuples belong in it.
b. In a vertical fragmentation, each fragment is specified by
a projection operation that specifies what columns belong in it.
c. In a combined fragmentation, both operations will be present.
For example, C.J. Date gives the following example of using an
extended SQL to specify fragmentation of an accounts relation for
a bank having offices in San Francisco and Los Angeles. Accounts
are fragmented horizontally by location, and also vertically so
that, at each site, there is one relation containing account#,
branch, and type and a second containing account#, customer, and
balance. (This is an invention by Date - not actually part of
the SQL standard.)
DEFINE FRAGMENT SF1
AS SELECT ACCOUNT#, BRANCH, TYPE
FROM ACCOUNTS
WHERE BRANCH = 'SF';
DEFINE FRAGMENT SF2
AS SELECT ACCOUNT#, CUSTOMER, BALANCE
FROM ACCOUNTS
WHERE BRANCH = 'SF';
DEFINE FRAGMENT LA1
AS SELECT ACCOUNT#, BRANCH, TYPE
FROM ACCOUNTS
WHERE BRANCH = 'LA';
DEFINE FRAGMENT LA2
AS SELECT ACCOUNT#, CUSTOMER, BALANCE
FROM ACCOUNTS
WHERE BRANCH = 'LA';
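Date's fragment definitions can be sketched in Python as a selection followed by a projection, with reconstruction by joining each site's vertical fragments (on the key) and unioning the sites' results. The sample rows are invented for illustration:

```python
# Toy ACCOUNTS table as a list of dicts; rows are invented.
accounts = [
    {"account": 101, "branch": "SF", "type": "checking", "customer": "Ann", "balance": 500},
    {"account": 102, "branch": "LA", "type": "savings",  "customer": "Bob", "balance": 900},
]

def fragment(rows, pred, cols):
    """Combined fragmentation: a selection (pred) then a projection (cols)."""
    return [{c: r[c] for c in cols} for r in rows if pred(r)]

sf1 = fragment(accounts, lambda r: r["branch"] == "SF", ["account", "branch", "type"])
sf2 = fragment(accounts, lambda r: r["branch"] == "SF", ["account", "customer", "balance"])
la1 = fragment(accounts, lambda r: r["branch"] == "LA", ["account", "branch", "type"])
la2 = fragment(accounts, lambda r: r["branch"] == "LA", ["account", "customer", "balance"])

def natural_join(r1, r2):
    """Join two fragments on their common columns (here the key, account)."""
    common = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in common)]

# Reconstruction: join the vertical fragments at each site, then union the sites.
rebuilt = natural_join(sf1, sf2) + natural_join(la1, la2)
```

Because every fragment retains the key (account#), the natural joins recover exactly the original rows.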
B. Two of the major motivations for distributing a database are improved
availability and improved performance through making it possible for
frequent queries to be processed locally at the originating site.
Both of these advantages typically require that some data be
REPLICATED - i.e. that the same information be stored at more than
one site.
1. Replicated data may still be available in the face of the failure
of one site, providing that another site has a copy.
2. Local access to a private replica of commonly-used data is more
efficient than having to go to some other site to get at the only copy.
3. Replication can be combined with fragmentation: often only the
most commonly-used fragment of a given table is replicated widely.
4. Of course, replication is a form of redundancy, and creates its
share of problems.
a. Replicating data uses extra storage space.
b. Updating of replicated data is a significant problem, to which
there are no easy solutions. Whatever update strategy is used
must ensure that all copies of the data agree with one another;
but it is inevitable that there be a time during the update
process when some copies have been updated while others have not.
The update strategy must prevent transactions from producing
inconsistent results through reading the wrong version of an
item being updated at the same time.
c. When data is replicated, the advantages of improved availability
and access must be weighed against these disadvantages to decide
whether replicating a particular portion of the database is worth
it.
C. One author has suggested the following broad guidelines for deciding
whether to fragment, replicate, or centralize a given relation:
1. If a relation used at many sites is small, then replicate it.
2. If a relation used at many sites is large, but there is a way to
partition it so that most queries will only access the local part,
then fragment it.
3. If a relation used at many sites is large, and cannot be fragmented
in such a way as to allow most queries to only access the local part,
then centralize it. (In this case, a fragmented scheme would require
more communications overhead since many queries would need access to
data stored at two or more sites.)
Example: suppose we were to take the example library database we have
been using throughout the course and distribute it. (We
assume that the library has several branches.)
- The category relation might well be replicated at each site,
since it is small and referred to frequently.
- The relation tying together a book's callno, copyno and shelfno
might well be fragmented horizontally, with each branch getting the
portion pertaining to the books that belong there.
- The relation tying together a book and its authors might well be
stored centrally, since it is large and cannot be fragmented in
a site-specific way.
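These guidelines can be sketched as a simple decision function; the size threshold is an invented placeholder, not from the notes:

```python
# A rough sketch of the replicate/fragment/centralize heuristic.
def placement(n_rows, partitionable_by_site, small_threshold=1_000):
    """Suggest a placement for a relation used at many sites."""
    if n_rows <= small_threshold:
        return "replicate"      # small: copy it everywhere
    if partitionable_by_site:
        return "fragment"       # large, but most access is local
    return "centralize"         # large and not site-partitionable

# The library example above: category is small; the callno/copyno/shelfno
# relation is branch-partitionable; book-author is large and not site-specific.
category = placement(50, False)
holdings = placement(500_000, True)
book_author = placement(500_000, False)
```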
IV. Naming Distributed Data
-- ------ ----------- ----
A. To access an item in a database, the person formulating a query must
know the item's name. Many DBMS's provide for two kinds of names for
items:
1. Complete or system names, which fully and unambiguously specify the item
wanted. No two objects can have the same name at this level.
Example: in SQL, within any given schema each table must have a
unique name. If someone is accessing two or more schemas
simultaneously, then he may have to qualify a table name
by specifying the schema from which it comes.
e.g. The table EMPLOYEES in the PERSONNEL schema would be
PERSONNEL.EMPLOYEES
2. Shorthand names or Synonyms, which are specific to each user.
Example: in SQL, if a user is only accessing one schema, then he
can refer to table names without specifying a schema name.
Example: Some versions of SQL include a CREATE SYNONYM facility that
allows a user to define a single name as a synonym for any
schema.tablename on the system. (Not in RDB version, though)
B. In a distributed system, the system name must include some indication
as to the site where an object is stored. Thus, if an object is
replicated, each replica will have its own name.
C. However, a highly desirable property of a distributed system is
LOCATION TRANSPARENCY: a user should not have to know WHERE an object is
stored in order to access it.
1. Location transparency makes it possible for data to be moved to another
site if appropriate, without requiring users to alter their queries.
2. Location transparency makes it possible for the query strategy planner
to select the most inexpensive site to get data from in the case of
replicated data.
3. Location transparency allows a query to access replicated data from
another site if the normal site from which it is obtained is down.
D. In order to support location transparency, the system must provide some
means of translating a user-specified name that does not include the
location of the object into a system name that does include the location.
This is done through some sort of system catalog. This catalog will
also include fragmentation and replication information, so that only
the needed fragment(s) are accessed and so that the most convenient
replica can be used. The question that now arises is where should this
catalog be stored?
1. One option is to store it at a centralized site.
a. To reduce communication costs, each site could also store a catalog
of objects stored locally, so that the central catalog is referenced
only in the case of access to a portion of the database not present
at the local site.
b. However, centralizing the catalog means that the central site can
become a bottleneck in the system.
c. Also, if the centralized site goes down then nobody can do anything!
2. Another option is to replicate the system catalog at each site. This,
however, suffers from updating problems of the sort we will discuss
shortly.
3. A third option is to distribute the catalog, by having each site store
catalog information about the data stored at it.
a. The catalog would contain entries for all local data plus
locally-defined synonyms for data stored elsewhere.
b. To reduce communication overhead, provision can be made to also
cache catalog entries recently looked up at other sites.
c. One problem that could arise with this scheme is dealing with a
table that is moved from one site to another.
i. Catalog entries at other sites that reference this table will
still point to its original site.
ii. But when another site attempts to access the data there, it won't
find it. How does the remote site know where to look?
iii. One answer (used in System R* - a research distributed version
of IBM's System R) is as follows:
- When a table is originally created, an entry is made for it
in the catalog of the site where it is created.
- If the table is moved to another site, the catalog at the
original site is updated to point to its new site.
- Wherever the table may be moved, the original site will always
contain an entry pointing to its current location.
- One component of the system name for the table is the name
of the site where it was originally created.
- Thus, other sites can always access the table by going to
the original site. Either it is there, or the original
site knows where it is.
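The System R*-style scheme can be sketched as follows; the site and table names are invented, and real message traffic is replaced by direct dictionary lookups:

```python
# Each table's system name embeds its birth site, and the birth site's
# catalog always tracks the table's current location.
catalogs = {"siteA": {}, "siteB": {}, "siteC": {}}

def create_table(site, table):
    catalogs[site][table] = site           # entry at the birth site
    return f"{site}.{table}"               # system name embeds the birth site

def move_table(sysname, new_site):
    birth, table = sysname.split(".")
    old = catalogs[birth][table]
    catalogs[birth][table] = new_site      # birth site points to the new home
    catalogs[new_site][table] = new_site
    if old != birth:
        del catalogs[old][table]           # drop the stale intermediate entry

def locate(sysname):
    birth, table = sysname.split(".")
    return catalogs[birth][table]          # one lookup at the birth site suffices

name = create_table("siteA", "employees")
move_table(name, "siteB")
move_table(name, "siteC")
```

However many times the table moves, locating it never takes more than one lookup beyond the birth site.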
V. Querying Distributed Data
- -------- ----------- ----
A. When data is distributed, we have to deal with the possibility that
a given transaction will involve processing at more than one site.
1. We associate a given transaction with the site where it originates;
this is the place where the results must finally be sent.
2. A transaction is said to be LOCAL if it only accesses data stored
at the site it originates at.
3. A transaction is said to be GLOBAL if it must access data at a site
other than its point of origin. From the standpoint of a given site
doing some of the work on a global transaction, the transaction may
be either:
a. A transaction originating locally, but needing data from
elsewhere.
b. A transaction originating elsewhere, but needing local data at the
current site.
B. An important goal of a distributed system is FRAGMENTATION TRANSPARENCY:
In the case of fragmented or replicated tables, the user should be able
to formulate a query without knowing how the table is fragmented and/or
replicated across multiple sites - i.e. from the user's standpoint, it
should appear as if the entire table exists at one site. (And when
this is combined with location transparency, the appearance is given
that this site is the user's own site.)
C. The strategy selection component of a query processor has a challenging
job when dealing with a global query. Because of the time costs
associated with communication of data, different plausible strategies
may differ in time needed by orders of magnitude. C.J. Date gives the
following illustration of this point.
1. Assume an organization has three tables, as follows:
a. Schemes:
Scheme S(S#, CITY) - relating suppliers to the city where they
are located. (Primary key S#)
Scheme P(P#, COLOR) - relating parts to their colors.
(Primary key P#)
Scheme SP(S#, P#) - specifying which suppliers supply which parts
b. Cardinalities of the tables (number of tuples):
n(S) = 10,000
n(P) = 100,000
n(SP) = 1,000,000
c. Each tuple is 200 bits long.
d. Tables S and SP are stored at site A, and P at site B.
2. Suppose that the data communication system can transmit 50,000 bits
per second, with a 0.1 second access delay for each message.
(This is an old example. While modern networks offer much higher
data rates, the basic relationships between times for various
strategies we shall see still hold.)
3. Now consider the following query: Give the supplier numbers of all
suppliers located in London who supply red parts. In SQL:
SELECT S.S#
FROM S NATURAL JOIN SP NATURAL JOIN P
WHERE CITY = 'LONDON' AND COLOR = 'RED';
Suppose that statistical information leads to the following
estimates of sizes:
a. From the given n values, we can estimate that each part is
supplied by 10 suppliers (n(SP) / n(P)), and that each supplier
supplies an average of 100 parts (n(SP) / n(S))
b. Suppose we also can learn that the number of red parts is 10,
and the number of SP tuples pertaining to London suppliers is
100,000.
4. Now consider various plausible strategies. In each case, we consider
only communication time. The cost for doing the necessary joins is
comparable for each strategy.
a. Copy relation P to site A and do all processing there.
T = 0.1 + (100,000 tuples * 200 bits/tuple) / (50,000 bits/second)
= 0.1 + 400 seconds = 400.1 seconds = 6.67 minutes
(The entire relation P is transmitted as a single long message)
b. Copy relations S and SP to site B and do all processing there.
T = 0.1 + (10,000 tuples * 200 bits/tuple) / (50,000 bits/second) +
0.1 + (1,000,000 tuples * 200 bits/tuple) / (50,000 bits/second)
= 40.1 seconds + 4000.1 seconds = 4040.2 seconds = 1.12 hours
c. Join relations S and SP at site A, then select out the London
tuples. For each, query site B to see if the part involved is
red.
T = 2 * 0.1 * 100,000 seconds = 20,000 seconds = 5.56 hours
- There are 100,000 SP tuples pertaining to London suppliers
- Each gives rise to a query message and a response (2 messages).
- Since the messages will presumably be small, the dominant
cost will be the access delay of 0.1 second for each. (Even if
the message were the full 200 bit tuple, transmission time
would only be 0.004 seconds.)
d. Select tuples from P at site B for which the color is red, and
send a message to site A requesting a list of London suppliers that
supply that particular part.
T = 10 * (0.1 + 0.1 + 200 / 50,000) = 2.04 seconds
- There are 10 P tuples for red parts
- Each gives rise to two messages - a query and a response. The
query message time is essentially just the access delay.
- The statistical data leads us to expect that each part is supplied
by 10 suppliers. Since 1/10 of the suppliers are in London, the
reply message will typically contain one supplier.
e. Join S and SP at site A, select tuples from the join for which
the city is London, and project the result over (S# P#). Move
the projection to site B and complete the processing there.
T = 0.1 + (100,000 * 200) / 50,000 = 400.1 seconds = 6.67 minutes
f. Select tuples from P at site B for which the color is red.
Send these to site A and complete the processing there
T = 0.1 + (10 * 200) / 50,000 = approx 0.1 second
g. Note, then, that these strategies (all of which are plausible)
have communication costs ranging from about 0.1 second to
nearly six hours!
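The six estimates can be checked mechanically. This sketch counts only access delay and transmission time, exactly as in the hand calculations above:

```python
RATE = 50_000       # bits per second
DELAY = 0.1         # seconds of access delay per message
TUPLE = 200         # bits per tuple

def msg(tuples):
    """Cost of one message carrying the given number of tuples."""
    return DELAY + tuples * TUPLE / RATE

t_a = msg(100_000)                   # a. ship all of P to site A
t_b = msg(10_000) + msg(1_000_000)   # b. ship S and SP to site B
t_c = 2 * DELAY * 100_000            # c. a query/response pair per London tuple
t_d = 10 * (DELAY + msg(1))          # d. per red part: a query plus a 1-tuple reply
t_e = msg(100_000)                   # e. ship the projected London join to B
t_f = msg(10)                        # f. ship the 10 red-part tuples to A
```

Running the numbers confirms the ordering: strategy f is cheapest by far, and strategy c, which pays the access delay 200,000 times, is worst.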
D. One observation that can help to narrow down the number of options
to be considered is to recognize that operations on data are of two
types: data reducing and data expanding.
1. A data reducing operation produces a result that is smaller in
size than what it started with.
a. Select and project are almost always data reducing.
b. Natural join may be data reducing if the number of tuples
which successfully join is a fraction of the total number of
tuples in the relations involved.
c. Theta join may also be data reducing if the selection condition
eliminates most of the tuples produced by the join.
d. Intersection and division are also data reducing.
e. Union is data reducing if some tuples appear in both relations
being combined.
f. Summarizing functions (count, sum, average, etc.) are data reducing.
2. A data expanding operation produces a result that is larger in
size than what it started with.
a. Cartesian join is always data expanding.
b. Natural join and theta join may be data expanding.
3. As a general rule, data reducing operations should be performed at
the site where data is stored before sending it to another site,
and data expanding operations should be performed at the site
where data is needed after it has been sent there.
E. One data reducing operator that can be used in some cases is the
SEMIJOIN.
1. Example:
a. Suppose we distribute our library database so that checkout
information is stored locally, but full catalog information on
a book (callno, title etc.) is stored centrally.
b. Suppose we need to print out overdue notices for a group of
checkouts pertaining to books that have just become overdue.
We will do something like:
select callno, copyno, title, borrower_id
from checkout natural join bookinfo
where datedue = yesterday;
c. One option for accomplishing this is to copy the entire bookinfo
table from the central site to the local branch, and then do the
join locally. This is an obviously bad idea, since the book
relation is presumably quite large, and only a few rows are
relevant to our query.
d. A better approach would be to send to the local branch only those
tuples of the book relation which are relevant to the query - i.e.
those whose callno matches the callno of one of the overdue
books. This is computed by taking the semijoin:
bookinfo |X checkout

which is defined as

project_(scheme of bookinfo) (bookinfo |X| checkout)
e. The complete strategy might then be as follows:
i. Compute project_callno (select_(datedue = yesterday) checkout)
ii. Send this to the central site.
iii. Compute the semijoin of bookinfo with this, project out
callno and title, and send the result back to the local
office.
iv. Join the information returned from the central site with
checkout and project the desired fields.
f. It is obvious that, in this case, the semijoin is far superior
to sending the entire bookinfo relation to the local site.
Even though tuples flow both ways, the total volume of information
transmitted is much less.
g. We might also compare this semijoin-based strategy to one other
option involving only regular joins: Send the relevant checkout
tuples to the central site, do the join with bookinfo there, and
send the result back. The semijoin-based strategy involves
transmitting the same number of tuples in each direction; but
since the tuples are projected down to smaller size first in
each case, the semijoin-based strategy has an edge in terms of
total data volume.
2. Formally, given relations r1 on scheme R1 and r2 on scheme R2,
r1 |X r2 is defined to be

project_R1 (r1 |X| r2)
3. Observe that r2 |X| (r1 |X r2) = r2 |X| r1. This is one fact we
relied on in the above example.
4. Further, r1 |X r2 = r1 |X (project_(R1 intersect R2) r2). This is what
let us do a projection locally before sending checkout tuples to the
central site to be semijoined with bookinfo there.
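A minimal sketch of the semijoin, with relations represented as sets of sorted (column, value) tuples; the library rows below are invented toy data:

```python
def natural_join(r1, r2):
    """Join two relations on all of their common columns."""
    out = set()
    for a in r1:
        for b in r2:
            da, db = dict(a), dict(b)
            common = set(da) & set(db)
            if all(da[c] == db[c] for c in common):
                out.add(tuple(sorted({**da, **db}.items())))
    return out

def project(r, cols):
    """Keep only the named columns of each tuple."""
    return {tuple(sorted((c, v) for c, v in row if c in cols)) for row in r}

def semijoin(r1, r2):
    """r1 |X r2: the r1 tuples that would participate in r1 |X| r2."""
    scheme1 = {c for row in r1 for c, _ in row}
    return project(natural_join(r1, r2), scheme1)

checkout = {(("callno", "QA76"), ("copyno", 1))}
bookinfo = {(("callno", "QA76"), ("title", "DB Concepts")),
            (("callno", "PZ7"), ("title", "Kids Book"))}

relevant = semijoin(bookinfo, checkout)   # only the QA76 tuple survives
```

The identity from point 3 can be checked directly on this data: joining checkout with the semijoin result gives the same answer as joining checkout with all of bookinfo, even though far fewer bookinfo tuples were "shipped".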
VI. Updating Distributed Data
-- -------- ----------- ----
A. Most of the complexities associated with distributed databases arise
in conjunction with the updating of data in the database. (Read-only
access, as in a federated system, is much easier to handle.) The two
major sources of complexity are these:
1. If a transaction updates data stored at more than one site, we must
ensure that either all the updates commit or none of the updates
commit. We must avoid the possibility that an update at one site
commits while another aborts.
2. When data is replicated, we must make sure that ALL replicas are
updated consistently. Since it is impossible to ensure that all
updates to replicas of an item occur at exactly the same moment of
time, we must also ensure that inconsistencies do not result from
a transaction reading a replica that is not yet updated.
3. In addressing each of these problems, we must bear in mind that a
distributed system is vulnerable to partial failure - some sites may
be up while others are down.
a. An individual site may suffer a hardware failure or software crash
at any time - including in the middle of an update operation in
which it is one of the participating sites.
b. A communication link between sites can fail at any time - including
in the middle of an update operation. The consequences of link
failure range from causing a site to be out of touch with the rest
of the network to a partitioning of the network into two
disconnected subnetworks. (We do not worry about a link failure
that still leaves a site connected to the rest of the network
through an alternate path. We assume that lower levels of
network software handle this transparently for us.)
c. In addition to the total failure of a site or link, we must also
consider the possible garbling or loss of a message in transit.
i. If a message is garbled, lower layers of the software will
request a repeat transmission.
ii. However, if a message is lost, appropriate action might need
to be taken at a higher level.
B. We consider first the matter of updating data stored at more than one
site.
1. To ensure that either all updates commit or no updates commit, we
can use a protocol called the TWO-PHASE COMMIT PROTOCOL.
a. In this protocol, one site acts as the coordinator. Normally,
this is the site where the transaction originated.
b. As each site completes its work on the transaction and becomes
partially-committed, it sends a message to the coordinator.
i. Once the coordinator receives completion messages from all
participating sites, it can begin the commit protocol.
ii. However, if it receives a failure message from any site,
then it must instruct all sites to abort the transaction.
iii. The coordinator must also abort the transaction if it
fails to receive a completion message from some site within a
reasonable time. This is necessary to deal with the
possibility that a participating site (or its link) might fail
in mid transaction.
c. In phase 1 of the protocol, the following events occur:
i. The coordinator adds a <prepare T> entry to its log and forces
all log records to stable storage.
ii. The coordinator sends a prepare-to-commit message to all
participating sites.
iii. Each site normally adds a <ready T> entry to its log and forces
all log entries to stable storage. It then sends a ready
message to the coordinator.
iv. However, if some site wishes to abort the transaction, it may
still do so by adding a <no T> entry to its log, forcing the
log entries to stable storage, and then sending an abort
message to the coordinator.
v. Once a site sends a ready message to the coordinator, it
gives up the right to abort the transaction. It must go ahead
with the commit if the coordinator tells it to do so.
d. In phase 2 of the protocol, the coordinator waits for replies
from its prepare-to-commit message to come back.
i. If any comes back negative, or fails to come back in a
reasonable time, then the coordinator must write an <abort T>
entry in its log and send an abort message to all sites.
ii. If all replies come back positive, the coordinator writes
a <commit T> entry in its log and sends a commit message to
all sites.
iii. At this point, the decision of the coordinator is final.
Whatever was decided, the protocol will now operate to ensure
that it occurs - even if a site or link should now fail.
iv. Each site, when it receives the message from the coordinator,
either commits or aborts the transaction, makes an
appropriate entry in its log, and then sends an acknowledge
message to the coordinator.
v. When the coordinator receives acknowledge messages from all
sites, it writes a <complete T> entry to its log.
vi. Should any site fail to respond within a reasonable time, the
coordinator may resend the message to it. This resending
can continue at regular intervals until the site responds.
(This is not strictly necessary, however. Once the coordinator
has decided the transaction's fate, the site will assume
responsibility for finding it out anyway.)
2. The two-phase commit protocol deals with site or link failures (other
than failure of the coordinator) as follows:
a. If a site or link fails before sending its ready message, then
the transaction will abort.
i. When the site comes back up or is reconnected, it may try to
send a ready message, but if the coordinator has decided to
abort the transaction the ready message will now be ignored.
ii. The coordinator will continue sending periodic abort messages
to the site until it acknowledges; thus the protocol will
eventually finish in a consistent state.
b. If a site or link fails after the site sends its ready message,
and the failure results in the ready message being lost, then
the result is the same as above.
c. If a site or link fails after the coordinator receives the
site's ready message but before the site receives the final
decision from the coordinator, then the site's log will contain a
<ready T> entry. During recovery, the site must determine the
fate of T by either receiving a message from the coordinator
or consulting some other site that participated in the transaction.
d. If a site or link fails after it receives the coordinator's
decision and records <abort T> or <commit T> in its log - but
before the site sends an acknowledge message - then the site
knows what to do about the transaction when it comes back up.
It will send an acknowledge message in response to a resend of
the coordinator's verdict. (The same situation holds if the site's
acknowledge message is lost in transmission.)
3. Coordinator failure is handled as follows:
a. If the coordinator fails before it sends a final decision as to
the transaction's fate to at least one site, then a site that
has already sent a ready message to the coordinator must wait
until the coordinator recovers or is reconnected before deciding
what to do about the transaction. (But a site that has not sent
a ready message can timeout and decide to abort the transaction,
of course; and if another site can find out it has done so, it
can abort too.)
b. If the coordinator fails after sending a final decision to at
least one site, other sites may be able to obtain the decision
from that site so as not to have to wait.
c. In any case, when the coordinator recovers or is reconnected,
it will inspect its log and send some decision to each site.
i. If it finds <start T> but no <prepare T>, it will presumably
abort the transaction.
ii. If it finds <prepare T> but no <commit T> or <abort T>, it
may try again to find out the status of the participating
sites, or it may just abort the transaction.
iii. If it finds <abort T> or <commit T> but no <complete T>, it
can start the process of sending abort/commit messages and
waiting for acknowledgements over again.
iv. Of course, if it finds <complete T> then nothing need be done.
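The coordinator's recovery scan in cases i-iv above can be sketched as a
single lookup over its log. This is an illustrative sketch only; the log is
modeled as a list of (entry, transaction) pairs whose entry names mirror
the notes' <start T>, <prepare T>, <commit T>, <abort T>, and <complete T>
records, and the returned action strings are hypothetical.

```python
# Sketch of the recovering coordinator's decision for a transaction txn,
# based on which log records it finds (cases i-iv in the notes).

def recovery_action(log, txn):
    entries = {entry for (entry, t) in log if t == txn}
    if "complete" in entries:
        return "nothing"          # iv: protocol already finished
    if "commit" in entries or "abort" in entries:
        return "resend decision"  # iii: restart the phase-2 messages
    if "prepare" in entries:
        return "abort"            # ii: safe default is to abort
    return "abort"                # i: only <start T> found; abort
```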
C. Now we consider the matter of updating replicated data.
1. Ultimately, all replicas of a given data item must be updated.
However, the system should exhibit REPLICATION TRANSPARENCY; any
update transaction should be able to execute without knowing how
many replicas of a given item exist. (I.e. a transaction's write
to one copy of an item should be converted by the DBMS into writes
to all of them.)
2. One way to do this is to perform simultaneous updates at the
transaction level - i.e. transform each update transaction into
a set of updates at each site.
a. This does ensure that all replicas of a given item are
consistent, if each replica is locked as part of the update
process.
b. However, it can make response time for update transactions poor.
c. Further, one or more replicas may be located at sites that are
either down or out of contact due to failed links.
i. If we make a transaction wait whenever ANY replica of an updated
item is unavailable, then replication works against availability
rather than for it. (We would be better off, from an
availability-for-update standpoint, if the data were not replicated.)
ii. For this reason, some provision must be made for postponing
an update to an inaccessible site. Some site must be responsible
for informing it of updates that were missed once it comes back.
iii. This can get complicated, because we must decide who is
responsible for informing the unavailable site. What if the
responsible site is offline when the other site comes back?
3. An alternative that is often used is to designate one copy of the data
as the PRIMARY copy.
a. Transactions that wish to read an item may read any replica, but
all updates are done first to the primary copy.
b. The site containing the primary copy is responsible for sending
updates to sites containing replicas.
i. This may be done after every update transaction.
ii. This may involve resending the updates periodically if a site
is down.
iii. Or, if updates are infrequent and currency of data is not
vital, a fresh copy of the entire file may be shipped from the
primary site to secondary sites on a periodic basis.
c. We accept the possibility that data at secondary sites may be a
bit out of date. Thus, any transaction for which it is critical
that we have the most recent data (e.g. a funds transfer) must
read the primary copy. Secondary copies can be read only in cases
where inconsistency could not result from stale data.
d. Of course, we now must face the possibility that the site
containing the primary copy of a data item may be offline or
disconnected.
i. In this case, we may declare the data item unavailable for
update until the primary copy site comes back online.
ii. Or, we may temporarily designate a secondary copy as the
primary copy, and then report changes back to the primary
site when it comes back online.
iii. However, designating a secondary site as the primary copy
could lead to inconsistencies (requiring human intervention
to straighten out) if partitioning of the network leads to
two primary copies being declared - one in each partition.
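The primary-copy scheme in 3.a-c above can be sketched as a small class.
This is a hypothetical illustration, not any real DBMS's replication API:
writes go only to the primary, propagation to a secondary is a separate
step that may lag behind, and a "critical" read must use the primary copy.

```python
# Toy sketch of primary-copy replication. Secondaries may be stale
# until the primary's updates are propagated to them.

class ReplicatedItem:
    def __init__(self, value):
        self.primary = value
        self.secondaries = {}          # site id -> (possibly stale) copy

    def write(self, value):
        self.primary = value           # all updates hit the primary first

    def propagate(self, site):
        self.secondaries[site] = self.primary  # push update to one replica

    def read(self, site=None, current=True):
        # Critical reads (e.g. a funds transfer) must use the primary.
        if current or site not in self.secondaries:
            return self.primary
        return self.secondaries[site]  # may be out of date
```

The gap between write() and propagate() is exactly the window in which a
secondary-site read can see stale data, which is why only transactions
tolerant of staleness may read secondary copies.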
D. One special issue that sometimes arises is dealing with derived data.
1. Derived data is data that can be computed on the basis of other
data in the database - e.g. summary tables. To avoid the need to
recompute this data every time it needs to be accessed, it can be
put in a table of its own.
2. Obviously, any change to any item underlying a derived table would
necessitate a recomputation of the derived table to bring it up to
date. To avoid having to do this all the time, derived data is
often "frozen" as of a certain date - e.g.
a. An organization's financial statement will reflect the state of
affairs at some point in time, such as the end of a business
period.
b. A company may produce a summary of its customer accounts
on a daily or weekly basis, and use these summaries for most
routine inquiries such as credit checks.
c. A company's price list for the goods it sells is generally updated
periodically; but once a price list has been distributed it
remains in effect for a specified time.
3. Derived data that has been frozen in this way can be replicated at
many sites in a system without incurring the update overhead that
would be needed if the underlying raw data were distributed.
Updates can be handled by periodically shipping a completely new
copy of the derived data to each site.
4. C.J. Date builds on this idea to suggest that one way of dealing with
the update problem for replicated data is to eliminate replicated data
per se. Instead, there would be one primary copy for each data item,
plus "snapshots" made at periodic intervals and shipped to other sites.
A transaction needing to access the most current value of the data
would have to go to the primary site, while other transactions
could access the snapshot.
VII. Concurrency Control for Distributed Data
--- ----------- ------- --- ----------- ----
A. So far, our discussion has dealt with problems that arise if just one
transaction attempts to access or update a data item. We now consider
how our concurrency control schemes must be modified to cope with
distributed data.
B. First, we consider the lock-based schemes.
1. As always, we must lock a data item before we access it. But now we
face the possibility of having to lock items stored at several
different sites.
2. One approach to locking is to use a CENTRALIZED LOCK MANAGER located
at one of the sites. All locks are obtained by means of a message
to this site.
a. Compared to other schemes, this has two advantages.
i. A site that needs to lock several replicas of the same item can
get all the locks it needs with a single message.
ii. As we shall see, this makes deadlock detection much easier.
b. However, centralized lock management has the usual disadvantages
of centralizing any function in a distributed system.
i. Even a local transaction will now involve communication overhead
if locking is needed.
ii. The lock manager is a potential bottleneck.
iii. Failure of the lock manager can cripple the system. (There is
a way to deal with this by using an election scheme to choose
a new lock manager in such cases. The book discusses this.)
3. An alternative to centralized locking is distributed locking, with each
site managing the locks on items stored there.
a. The advantages are the opposites of the disadvantages of the
centralized scheme. In particular, local transactions stay local.
b. There are some disadvantages, though.
i. The message overhead is increased significantly. For each
item to be locked, at least three messages are needed, IN
ADDITION TO THE MESSAGES NEEDED TO ACCESS THE DATA:
- A lock request message to the site where the item is stored.
- A lock granted message back to the requestor.
- An unlock message to finally release the item.
(Actually, this last message might be combined with the messages
involved in the two-phase commit protocol.)
ii. Deadlock detection now becomes much harder. No one site has all
the information needed to determine whether the system is
deadlocked.
Example: Transaction T1 locks item Q1 at its own site, site S1.
Transaction T2 (at site S2) locks item Q3 at
another site, site S3.
Transaction T1 now sends a message to site S3,
requesting a lock on Q3. Transaction T1 is forced to
wait.
Transaction T2 now sends a message to site S1,
requesting a lock on Q1. Transaction T2 is forced to
wait.
Transactions T1 and T2 are now deadlocked, but neither
site knows it!
- Site S1 knows that its transaction, T1, is waiting
for a lock at site S3, and that transaction T2 from S2
is waiting for a local lock held by T1. This can be
represented by the graph
(T2) --> (T1)
- Site S2 knows that its transaction, T2, is waiting
for a lock at site S1.
- Site S3 knows that transaction T1 from site S1 is
waiting for a local lock held by T2. This can be
represented by the graph
(T1) --> (T2)
- No one knows about the circular wait!
c. Distributed lock management can further complicate dealing with
updates to replicated data
i. If the updating transaction is responsible for updating all copies
of an item, then it would appear that no update can take place
without first obtaining a lock on ALL the copies. This requires
a great deal of message overhead (three messages per replica), and
would prevent doing an update if one site were down. Furthermore,
deadlock could occur if two transactions were trying to lock
the same item but started with different copies. (Each would
be holding locks on some copies while waiting for the others.)
ii. These problems can be reduced by using a MAJORITY PROTOCOL.
An updating transaction is only required to obtain locks on a
majority of the replicas in order to proceed with the update.
This cuts message traffic in half, allows updates even if one
site is down, and reduces (but does not eliminate) the chance
of deadlock.
Note: in order for this to work correctly, a transaction that
needs to read and then update an item must obtain an exclusive
lock originally, rather than obtaining a shared lock which is
subsequently upgraded to exclusive.
iii. It is also possible to use a BIASED PROTOCOL for updates, whereby
read-only operations are only required to lock one copy of an
item, while update transactions must lock all copies.
d. Many of the difficulties in updating replicated items can be reduced
if the primary copy method of updating is used. In this case, we
need only require the updating transaction to lock the primary copy
of the item.
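The majority protocol's lock-acquisition rule (c.ii above) can be sketched
as follows. The lock table here is an in-memory toy, not a real distributed
lock manager, and all names are hypothetical: an update proceeds only once
exclusive locks are held on a strict majority of the replicas.

```python
# Sketch of majority-protocol locking. replica_locks maps each site
# holding a replica to the current lock holder (None if free).

def try_majority_lock(txn, replica_locks):
    """Return True if txn acquired exclusive locks on a strict
    majority of the replicas; otherwise acquire nothing."""
    free = [site for site, holder in replica_locks.items() if holder is None]
    needed = len(replica_locks) // 2 + 1
    if len(free) < needed:
        return False               # cannot reach a majority; give up
    for site in free[:needed]:
        replica_locks[site] = txn  # grab just enough for a majority
    return True
```

The safety argument is that any two majorities of the replica set must
intersect, so two update transactions can never both hold a majority of
the locks on the same item at the same time.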
C. Another issue that must be dealt with is the management of locks in
conjunction with recovery. Any locks that were held by a transaction
whose fate is in doubt during recovery must be held until the fate is
determined - potentially resulting in an item being tied up for a long
time. (The book discusses an alternative to two-phase commit, called
three-phase commit, which addresses this issue.)
D. Because of the complexity of implementing locking in a distributed
system, timestamp-based methods of concurrency control become an
attractive alternative.
1. Some of the key ideas involved in adapting the timestamp algorithms
for distributed use are discussed in the book.
2. One important point in any distributed scheme based on timestamps
is ensuring consistency and uniqueness of timestamps across the
system.
a. Since each site generates its own timestamps, it is quite possible
for two transactions originating at different sites to be given
the same timestamp locally. To prevent this, we form a timestamp
by concatenating two fields: the locally-generated timestamp and
a unique site id, like this:
[ locally-generated timestamp ] [ site id ]
Note that the site id must become the low order part of the overall
timestamp, so that it is significant only when resolving a situation
where two transactions from different sites have the same
locally-generated portion.
b. We also want to ensure that if some transaction T1 receives a
timestamp TS(T1) at some moment of time, some other transaction
T2 does not receive a smaller timestamp at some later moment of
time.
i. This could theoretically be guaranteed if we used clocks
to generate timestamps at each site, and if we ensured that the
clocks were always synchronized. However, it is hard to ensure
this in practice.
ii. The following is sufficient: if any site receives a request
from a transaction originating elsewhere whose timestamp is
greater than the current reading of its own timestamp clock,
it must advance its timestamp clock to be one greater than
the value of the timestamp of the transaction.
iii. This latter method will work even if timestamps are generated
by a local counter, rather than a local clock.
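The two ideas in 2.a and 2.b above, concatenating a site id onto the
locally-generated portion and advancing the local clock past any larger
remote timestamp, can be sketched together. This is an illustrative toy
(essentially a Lamport clock); the class and method names are hypothetical.

```python
# Sketch of distributed timestamp generation. A timestamp is the pair
# (local_counter, site_id), compared lexicographically, so the site id
# is the low-order part and only breaks ties between sites.

class TimestampClock:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = 0

    def new_timestamp(self):
        self.counter += 1
        return (self.counter, self.site_id)

    def observe(self, remote_ts):
        # Rule ii/iii: on seeing a remote timestamp ahead of our clock,
        # advance our clock to one greater than it.
        remote_counter, _ = remote_ts
        if remote_counter > self.counter:
            self.counter = remote_counter + 1
```

Because tuples compare lexicographically, two timestamps with the same
locally-generated portion are ordered by site id, and a site that calls
observe() on every incoming request never issues a timestamp smaller than
one it has already seen.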
VIII. Deadlock Detection
---- -------- ---------
A. As we noted earlier, in a distributed scheme for controlling concurrency
based on locking, it is possible to have global deadlock without any
one site being aware of it. We now consider how to deal with this.
B. One approach to dealing with global deadlock is to have a site
designated as the deadlock coordinator. It will periodically poll the
other sites for information as to transactions waiting for resources
at those sites, and will combine the information to determine if a
global deadlock is present.
1. This, however, requires a fair amount of message traffic between
the various sites and the coordinator.
2. This scheme is vulnerable to all the usual problems of central site
failure we have looked at earlier. (As noted earlier, this can be
handled by an election scheme to choose a new coordinator.)
C. An alternative approach works like this:
1. Each site maintains a wait-for graph showing all local transactions
holding or requesting resources at that site, plus one additional
node called EXT, for external.
a. There is an edge from a transaction T in the local graph to EXT if
T is waiting for some event at an external site.
b. There is an edge from EXT to a transaction T in the local graph if
T represents a request from another site.
c. Example: our previous case of global deadlock
Graph at site S1
EXT --> (T2) --> (T1) --> EXT
Graph at site S2
(T2) --> EXT
Graph at site S3
EXT --> (T1) --> (T2) --> EXT
2. The existence of a pattern of the form:
EXT --> (Ti) --> (Tj) --> (Tk) ... --> EXT
at some site implies that global deadlock MAY be present (but
not necessarily!)
a. To decide if global deadlock is actually present, the site detecting
the potential deadlock must send its wait-for graph to the external
site holding the resource being waited for. This site can then
combine the graphs to determine whether:
i. Actual deadlock is present
ii. No deadlock is present
iii. Further information is needed.
Example: If S1 sends its graph to S3, then S3 will be able to
form the composite graph
--> (T1) --> (T2) -->
|___________________|
S3 can now initiate rollback of one of the transactions.
b. Actually, in any case where deadlock is present, there will be
two sites detecting possible deadlock. To minimize messages,
it suffices if a site sends its graph to a neighbor only when
the transaction ID of the "incoming" external transaction is
greater than that of the "outgoing" external transaction.
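The local wait-for graphs with an EXT node, and the composite-graph check
a site performs after receiving a neighbor's graph, can be sketched as
follows. The graphs below encode the S1/S3 example from section C.1.c;
the representation (a dict mapping each node to the set of nodes it waits
for) and the function name are illustrative choices, not from the notes.

```python
# Sketch of wait-for graphs with an EXT node. A cycle among the real
# transactions (ignoring EXT) in a combined graph means actual deadlock.

def has_cycle(graph):
    """Detect a cycle among real transactions (EXT excluded) via DFS."""
    def visit(node, path):
        if node in path:
            return True
        for nxt in graph.get(node, ()):
            if nxt != "EXT" and visit(nxt, path | {node}):
                return True
        return False
    return any(visit(n, set()) for n in graph if n != "EXT")

s1 = {"EXT": {"T2"}, "T2": {"T1"}, "T1": {"EXT"}}   # graph at site S1
s3 = {"EXT": {"T1"}, "T1": {"T2"}, "T2": {"EXT"}}   # graph at site S3

# Combining the two graphs (as S3 would on receiving S1's graph)
# turns the EXT edges into the real edge T2 --> T1, closing the cycle.
combined = {}
for g in (s1, s3):
    for node, targets in g.items():
        combined.setdefault(node, set()).update(targets)
```

Neither local graph alone contains a cycle among real transactions, which
is exactly why neither site can see the deadlock by itself; the cycle
T1 --> T2 --> T1 appears only in the composite graph.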
IX. Final Thoughts: Client Server Systems
-- ----- -------- ------ ------ -------
A. The various issues we have just been discussing (fragmentation,
replication, distributed query processing, distributed update, etc.)
have been discussed in terms of a remotely distributed DBMS - the most
sophisticated sort of system.
B. We noted earlier that, in recent years, the client-server model has
become very popular. Most of the issues we have discussed do not
arise in the client server model, because all database accesses are
usually done at a single site.
1. The tradeoff, however, is that to get this simplicity we usually
have to give up:
a. Location transparency - the client must know which server has
the data it needs.
b. Fragmentation transparency - if the data is fragmented, the
server we contact must have the fragment we need.
c. The ability to combine information stored at more than one site.
2. Some extensions to the model can give back some of these things, at
the cost of reintroducing complexity - e.g.
a. Although the basic model allows a client to be connected to only
one server at a time, some client-server systems allow a single
transaction to change connections in mid-transaction. The price
tag for this is that we reintroduce the need for something like
the two-phase commit protocol.
b. Some client-server systems allow precompiled procedures to be
created at a server and called remotely by a client. Presumably,
these procedures could contact other servers (so that the server
becomes a client as well) to get data from multiple sources -
but again, complexities like two-phase commit now come back.
Copyright ©2002 - Russell C. Bjork