CS352 Lecture: OO and Object-Relational Databases       last revised 10/13/04

Need: Tranparency: Figure 8.10 in book
      Excerpt from Date 7th ed ch. 25
      "Relational Model Rest in Peace" cartoon
      Transparency of Stonebraker matrix
      Excerpt from Industry Trends article - IEEE Computer 8/2000

I. Introduction
-  ------------

   A. There has been a considerable interest in recent years in bring together
      two technologies: Object Orientation and database systems.

      1. Example: when we transitioned our CS curriculum from a procedural
         approach to an OO approach, I attended a two-week NSF-sponsored
         workshop entitled  "Integrating Object-Orientation into Undergraduate 
         CS Curricula."  The professor who led the workshop had two research 
         interests: OO and Databases.  This is not an unusual combination.

      2. In one sense, this is surprising, because these two technologies have 
         very diverse origins and historically have served diverse application
         areas.

         a. OO: Comes out of the world of discrete simulation; much 
            subsequent work arose motivated by the development of GUI's in
            the desktop/laptop world - OO is the natural paradigm for 
            designing a GUI.

         b. Database systems: comes out of the world of business data
            processing; much of the work has been done in the mainframe
            world.  Database systems historically have had a strong
            batch processing flavor.

      3. Desire to bring these two technologies together has arisen from needs
         in both areas.

         a. In the OO world

            i. The need for PERSISTENCE - retaining objects between runs of 
               a program.

               Example: In OO research a classic paradigm for the structure of
               an application is the THREE TIER ARCHITECTURE, in which a 
               system is conceived of as having three major layers

               - An INTERFACE LAYER - typically a GUI

               - A BUSINESS LAYER - contains the features that are specific to
                 the particular application.

               - A DATABASE LAYER - provides persistent storage for permanent
                 objects between program executions.

                 e.g. WebAdvisor - the system faculty advisors use during 
                 registration.  The  interface layer is a web page; the business
                 layer is a collection of scripts and programs that provide
                 data on course availability and perform operations like
                 registering a student; the database layer is the college's 
                 POISE database.

               There is a desire in the OO world to take advantage of already
               developed database technology dealing with such matters as
            
               - crash recovery, concurrency, data integrity and security

               - the ability to use a common database to support diverse
                 applications, rather than having each application area "own"
                 its own data

               - the ability to query data interactively without having to write
                 a full-blown program to perform the query.  

            ii. Moreover, there is also a desire to be able to access 
                existing "legacy" data in organizations.  Such data is often
                stored (in large quantities) in relational databases.

         b. In the database world - interest in storing entities that don't
            fit the classic business data processing pardigm.

            i. Entities with complex internal structure - e.g. to support CAD,
               CASE.  

               The has-a/part-of/containment relationship between objects is 
               an important one in OO, but does not neatly fit traditional DBMS
               models.

           ii. Entities that have polymorphic behavior.  For example, different
               classes of employee may have different rules for calculating
               their paychecks.

               Traditional DBMS models do not provide for associating
               BEHAVIOR information with entities.

          iii. Entities which are "viewed" in diverse ways - e.g.
               multimedia, and hypertext/hypermedia systems with live links 
               between entities.

           iv. Traditional DBMS models are based on data types that are
               basically textual (e.g. strings, numbers, dates), and
               queries basically compare character strings (even numbers
               and dates are often represented this way in the "WHERE" clause
               of a SQL query).  But some applications call for manipulating
               data that does not have a simple textual representation.

            There is a desire in the database world to take advantage of
            already existing OO technology dealing with such matters as
            containment, inheritance and associating behaviors with entities,
            and the ability to easily define new types of objects.

      4. The desire to bring these technologies together has also been fueled
         by the growth of desktop/laptop usage in the business world.

         a. Historically, business application development was done using
            mainframe-style languages such as COBOL.  Such applications were
            typically designed to be run from "dumb" terminals, using
            a text-based user interface.

            The hierarchical, network, and relational DBMS models come out of 
            this world.

         b. Now, most business applications are developed for desktops/laptops
            (perhaps connected to a mainframe server), using GUI's.  Such 
            applications are typically developed using OO languages such as 
            Smalltalk, C++, and Java.

      5. This issue has risen to some prominence in the last decade or so.  
         For example, the 2nd (1991) edition of our text devoted one short 
         chapter to OODBMS's at the end of the text.  The 3rd (1998) and 
         4th (2002) editions have two chapters near the  middle of the text; 
         and the chapters in the 4th edition have some significant expansion 
         compared to the 3rd.

   B. Unfortunately, the basic relational model is not a good match for
      meeting these requirements.

      1. The requirement for atomic fields and normalization means that complex 
         objects must be "taken apart" to be stored in the database.

         Illustration: Can you store a car in a standard office filing
                       cabinet?  Sure - if you're willing to take it apart
                       far enough.  The problem is the work required to
                       reassemble it when you try to retrieve it!

         As we have just seen, normalization typically requires decomposition;
         and entities we are interested in may have to be reconstructed when
         needed by doing one or more joins.

      2. Lack of support for data types that are not basically textual in
         nature (e.g. multimedia as contrasted to strings, numbers, dates).

      3. Lack of support for storing behaviors with entities.

      4. Some writers have pointed out that there is a fundamental philosophical
         difference between the two approaches to representing information.

         a. Database systems seek to foster "data independence" - the idea
            that data should be stored in such a way as to be independent of
            the programs that use it - which facilitates sharing of data
            between application areas and the ability to perform ad-hoc
            queries.

         b. Object-orientation emphasizes the close connection between
            data and behavior.  An object doesn't just incorporate data -
            it also incorporates methods that operate on the data.  Moreover,
            OO stresses the encapsulation of data, which precludes operating
            on data except through the set of methods that a class furnishes.
       
         c. In a generic database, one often needs to use the data in ways
            that might not have been anticipated when the database was
            designed.  What does one do then?  Wait for new methods to be
            coded?  Break encapsulation?
     
      5. The phrase one often hears is that there is an "impedance mismatch"
         between the way relational database systems represent data (as
         normalized tables) and the way OO systems use data.

   C. Both the OO world and the relational DBMS world make use of 
      entity-relationship diagrams as modeling tools.  However, they 
      implement certain key concepts in very distinct ways.

      1. Identity

         a. In a relational DBMS, tuples have no identity apart from the values
            they hold.  Given the basic idea that a relation is a set, it is
            contrary to the fundamental ideas of the relational model to think
            of two tuples being distinct and yet having the same values.
        
            - Identity is modeled by VALUE - often by the use of an 
              externally-generated key - e.g. an ID number of some sort

            - A consequence of this is that the identifier for an entity is
              generally meaningful to the user - e.g. all of you know your
              student ID.

         b. In an OO system, objects have identity apart from the values they
            contain.  Two objects may have the exact same state, yet have
            distinct identities.

            - Identity is typically modeled by some sort of system-generated
              OBJECT IDENTIFIER - often related to the physical address in
              memory or on disk where the object resides.
              For example, in C++ or Java, an object's identity is the location 
              in memory where it is stored, and pointers/references to the 
              object hold the address of that location.

            - A consequence of this is that the identifier for an entity is
              generally _not_ meaningful to the user - e.g. I doubt whether
              anyone knows the disk address of his/her student record!

      2. Modeling of relationships

         a. When converting an E-R diagram to relational tables, one models
            each relationship set by creating a table whose columns are the
            primary keys of the entities related, plus any attributes of the
            relationship.  One key feature of the relational model is that
            entities and relationships are modeled in essentially the same
            way - as tables.

            That is, relationships are represented by foreign keys that must
            be looked up in the appropriate table to find the row actually
            being referred to.

         b. When converting an E-R diagram to an OO design, one models
            each relationship by either a pointer/reference in one object to
            another, or by a collection of such pointers.

            That is, the address of the object being referred to is stored
            directly, so no lookup is necessary.

         c. Note that, in general, foreign keys are more semantically
            meaningful, but physical references are more efficient.

            Example: Suppose you wanted to find the office of a given
                     faculty member, given their name.  You would have to
                     look up the office number in a directory, and then go
                     there.  But if you had the office number to begin with,
                     you could go there directly without having to look
                     anything up.

                     OTOH, as I was revising this lecture someone asked me
                     "where do I find room 209?"  After playing "20 questions"
                     for a while, I determined that what she was looking
                     for was the secretary's office.  Like most people, I can
                     remember where a person's office is much more easily than 
                     I can remember a room number - when I need to drop
                     something off for the secretary, I never think "I
                     need to go to room 209".

      3. Notice that, in both of these regards, there is more similarity 
         between OO and the network and hierarchical models (which also model
         relationships by links).  However, the latter models are "old
         technology" and lack the interactive query capability which creates
         interest in DBMS's in the OO world.

         Is it possible to create a model where one can "have his cake and
         eat it too"?

      4. Finally, though both E-R diagrams and OO have a notion of class
         hierarchy and inheritance, the way E-R diagram generalization/
         specialization is implemented using relational tables loses a lot of
         the potential power that is there - especially by way of polymorphic
         handling of operations by different subclasses derived from a
         common base class.

II. Approaches to Addressing this Issue
--  ---------- -- ---------- ---- -----

   A. Development of a new OO database model to replace the relational model.  
      Currently, there is no OODB model that is standardized and universally 
      accepted in the way the relational model is; but there are a variety of 
      commercial products plus industry efforts to develop such a model by 
      adding persistence to OO languages.

      1. Such approaches toward creating an OO data model typically have 
         the following characteristics.

         a. The design is based on equation that an entity-set = a class,
            and the class structure mirrors the class hierarchy and 
            associations in the ER diagram.

            (Note the similarity between a UML class diagram and an ER
             diagram - indeed, the latter is one of the ancestors of the
             former.)

         b. The query language is an OO programming language, with 
            extensions to support persistence typically in the form of 
            libraries rather than actual changes to the language.

            (Contrast this to the use of a separate query language embedded in 
             a host language, such as embedded SQL).

         c. The act of storing or retrieving data from the database is
            transparent - an object may be either transient or persistent.

            i. Transient objects reside only in main memory, and cease to
               exist when the program terminates.

           ii. Persistent objects reside on disk between runs of the program,
               and a given persistent object may reside either on disk or in
               memory while the program is running.  (But if it resides in
               memory, it is returned to disk before termination, and may
               be updated on disk whenever it is changed.)

          iii. Various OODB implementations differ as to whether the
               distinction between transient and persistent objects is made
               on a class basis (if a class is persistent, all its objects
               are persistent) or when an object is created (it is created
               as either transient or persistent), or by explicitly marking
               a given object to be persistent, or by reachability from a
               persistent root.  However, in all cases, apart from initially 
               making an object persistent all other references to the object 
               do not need to treat transient and permanent objects differently.

           iv. Contrast this to the use of something like embedded SQL in
               a host language such as Java, where the transfer of
               information to/from the database is performed by explicit
               SELECT, INSERT, DELETE, or UPDATE statement written in a
               different language than the rest of the program.

      2. Example: the ODMG extensions to C++ described in the text.

         a. A pointer/reference may either contain a memory location, or it may
            contain a location on disk.  The operator overloading mechanisms of
            C++ are used to allow a programmer to use "pointer syntax" to
            manipulate an object, regardless of where it actually resides.

            Of course, if an operation is being performed on an object that
            currently resides on disk, the underlying system must bring the
            object into memory.  This, however, can be done in a way that is
            transparent to the user; and a technique known as "pointer
            swizzling" can be used to prevent multiple trips to disk for the
            same object.

         b. The text also discussed a library of templates that are part of
            the ODMG proposal for C++, to support referential integrity.

      3. Problems: A lot may be lost by going this way:

         a. Support for ad-hoc queries.

            Example: it would be hard to imaging an ordinary user formulating
                     queries interactively in the language used for the
                     examples in the book!

            (As a result, some commercial OO DBMS systems provide a SQL
             query facility)

         b. Support for "set at a time" processing - to perform some operation
            on all the members of a collection, one must code a loop using an
            iterator.

            Example: contrast the use of an iterator in the book example
                
                     TRANSPARENCY: Figure 8.10

                     with

                        select * from customer;

                     Even if we neglect some of the other cumbersomeness,
                     a key difference is the need for an explicit while
                     loop instead of the implicit loop in SQL.

         c. Support for referential integrity through notion of keys, etc.  The
            ODMG extensions put a significant burden on the programmer to keep
            reference sets and their inverses up to date.

   B. Alternatively: extension/adaptation of the relational model to meet
      the need.

      Example: Chapter 9 is devoted to discussion of work on object-relational 
               models.

      A common thread in many (but not all) of these extensions is the equation
      domain = class - i.e. entity sets are still relations, but some of
      their attributes may move beyond traditional restrictions.

      These extensions move in a number of directions beyond the traditional
      relational model, including:

      1. Allowing non-atomic data types

         a. Columns whose values may contain internal structure (i.e. have
            fields of their own)

         b. Columns which may store a collection (e.g. set, array) of values 
            rather than a single value - a relation nested within a relation.

            A key motivation here is efficiency: multivalued attributes are
            associated with multivalued dependencies, which force decomposition
            during normalization and lead to the need for joins in queries.
            However, a join is inherently a computationally expensive 
            operation.

      2. Support for complex data types (clobs and blobs)

      3. Reference data types

         a. System-generated object identifiers (oids)

         b. The ability for a field to store the oid of a row in another
            table (in effect a pointer to it) rather than a key that must
            be looked up.

         Again, efficiency is a key motivation.

      4. Support for inheritance

         a. Types based on other types - so that a type includes fields of
            its own plus fields inherited from a parent type or types.

         b. Tables based on other tables - so that a given row, when
            inserted into a subtable, also becomes a row in a base table.

            (This moves into the realm of class = entity-set, of course)

      5. Storing procedures in the database along with the data

         a. As methods of specific data types.

         b. As "stand-alone" functions or procedures.

      6. Extending SQL to be a more fully-functional programming language in
         its own right, including various control structures and an ability to 
         call routines written in other languages from within SQL.

         Example: the CASE construct in SQL you used in an earlier homework.

      7. Various commercial systems developed during the 1990's incorporated
         some or all of these facilities, which became part of the 1999 SQL 
         standard and the latest revision - SQL 2003 - though standardization of
         implementations is still a ways off and no vendor comes even close to
         fully supporting the standard.  (And vendors often do things contained
         in the standard in their own, non-standard way.)
 
   C. There is some amount of conflict between proponents of a new OO model
      and proponents of the relational model.  

      1. Example: one OO database vendor's Technology Guide begins with a
         section headed "The computing world has entered a post-relational
         era".

      2. Example: title of a paper "Why the Object Data Model isn't a Data
         Model"

      3. Read excerpt from C.J. Date An Introduction to Database Systems
         - 7th ed (Addison Wesley, 2000) pp. 862-863

      4. One tension that often shows up is between efficiency for a specific
         application, on the one hand, and genericity and modifiability, on
         the other hand.  With regard to the present question, C.J. Date puts 
         it this way:

         "Although the programming language and database management
         disciplines certainly have a lot in common, they do also differ
         in in certain important aspects (of course).  To be specific:

         * An application program is intended - by definition - to solve
           some specific problem.

         * By contrast, a database is intended - again by definition - to
           solve a variety of different problems, some of which might not
           even be known at the time the database is established."

         (An Introduction to Database Systems - 7th ed (Addison Wesley,
          2000) p. 813)

   D. It is worth observing that how one intends to use the database affects 
      how one evaluates the relative merits of the traditional relational,
      object-relational, and OO models.

      1. A strictly OO model has performance advantages for high-volume 
         transaction processing (lots of inserts and deletes, supporting 
         access from many locations simultaneously, e.g. over the web or for
         an institution with many branch offices.)  Joins, in particular,
         are much more computationally-expensive than following pointers.

      2. The relational model has a mathematical simplicity and elegance
         that provides better support for preserving data integrity and
         for ad-hoc queries.  When pointers are used to model relationships,
         more of the burden has to shift to the programmer to navigate them
         and to ensure that they are manipulated correctly.  This gets in
         the way of facilitating ad-hoc queries by ordinary users.

      3. An object-relational database might allow one to use a relational
         approach in a domain where the traditional relational model doesn't
         provide the needed tools, though the question arises as to how
         much should the basic model be changed without losing the advantages
         of the relational model.

         Going down this road, of course, raises the issue of "how far do we
         go?" in terms of compromising the strict relational model to
         incorporate ideas from OO.  Date obviously has his thoughts on this
         subject; other writers would take a "tigher" or "looser" stance.

         TRANSPARENCY: "Rest in Peace" cartoon

      4. Arguably, an "OO database" might be more suited for providing 
         persistence in support of a specific application (e.g. CAD),
         while a relational database might be more suited for storing
         information in a generic way that supports lots of different ways
         of accessing it.  An object-relational database might be
         suitable for application domains that demand both the data independence
         of relational systems and support for complex data.

         One widely-cited representation of this approach is the
         "Stonebraker classification matrix" proposed by Michael
         Stonebraker:

         TRANSPARENCY

   E. Interestingly, it seems that recent developments in this area have
      been relatively slow.  I am going to read an excerpt from an article
      published in August, 2000; however, what it said then seems to still
      be fairly accurate today. 

      Read ecxerpt from "Industry Trends" article - IEEE Computer 33.8
      (August 2000) pp. 16-19.

Copyright ©2004 - Russell C. Bjork