CS352 Lecture: OO and Object-Relational Databases last revised 10/13/04
Need: Tranparency: Figure 8.10 in book
Excerpt from Date 7th ed ch. 25
"Relational Model Rest in Peace" cartoon
Transparency of Stonebraker matrix
Excerpt from Industry Trends article - IEEE Computer 8/2000
I. Introduction
- ------------
A. There has been a considerable interest in recent years in bring together
two technologies: Object Orientation and database systems.
1. Example: when we transitioned our CS curriculum from a procedural
approach to an OO approach, I attended a two-week NSF-sponsored
workshop entitled "Integrating Object-Orientation into Undergraduate
CS Curricula." The professor who led the workshop had two research
interests: OO and Databases. This is not an unusual combination.
2. In one sense, this is surprising, because these two technologies have
very diverse origins and historically have served diverse application
areas.
a. OO: Comes out of the world of discrete simulation; much
subsequent work arose motivated by the development of GUI's in
the desktop/laptop world - OO is the natural paradigm for
designing a GUI.
b. Database systems: comes out of the world of business data
processing; much of the work has been done in the mainframe
world. Database systems historically have had a strong
batch processing flavor.
3. Desire to bring these two technologies together has arisen from needs
in both areas.
a. In the OO world
i. The need for PERSISTENCE - retaining objects between runs of
a program.
Example: In OO research a classic paradigm for the structure of
an application is the THREE TIER ARCHITECTURE, in which a
system is conceived of as having three major layers
- An INTERFACE LAYER - typically a GUI
- A BUSINESS LAYER - contains the features that are specific to
the particular application.
- A DATABASE LAYER - provides persistent storage for permanent
objects between program executions.
e.g. WebAdvisor - the system faculty advisors use during
registration. The interface layer is a web page; the business
layer is a collection of scripts and programs that provide
data on course availability and perform operations like
registering a student; the database layer is the college's
POISE database.
There is a desire in the OO world to take advantage of already
developed database technology dealing with such matters as
- crash recovery, concurrency, data integrity and security
- the ability to use a common database to support diverse
applications, rather than having each application area "own"
its own data
- the ability to query data interactively without having to write
a full-blown program to perform the query.
ii. Moreover, there is also a desire to be able to access
existing "legacy" data in organizations. Such data is often
stored (in large quantities) in relational databases.
b. In the database world - interest in storing entities that don't
fit the classic business data processing pardigm.
i. Entities with complex internal structure - e.g. to support CAD,
CASE.
The has-a/part-of/containment relationship between objects is
an important one in OO, but does not neatly fit traditional DBMS
models.
ii. Entities that have polymorphic behavior. For example, different
classes of employee may have different rules for calculating
their paychecks.
Traditional DBMS models do not provide for associating
BEHAVIOR information with entities.
iii. Entities which are "viewed" in diverse ways - e.g.
multimedia, and hypertext/hypermedia systems with live links
between entities.
iv. Traditional DBMS models are based on data types that are
basically textual (e.g. strings, numbers, dates), and
queries basically compare character strings (even numbers
and dates are often represented this way in the "WHERE" clause
of a SQL query). But some applications call for manipulating
data that does not have a simple textual representation.
There is a desire in the database world to take advantage of
already existing OO technology dealing with such matters as
containment, inheritance and associating behaviors with entities,
and the ability to easily define new types of objects.
4. The desire to bring these technologies together has also been fueled
by the growth of desktop/laptop usage in the business world.
a. Historically, business application development was done using
mainframe-style languages such as COBOL. Such applications were
typically designed to be run from "dumb" terminals, using
a text-based user interface.
The hierarchical, network, and relational DBMS models come out of
this world.
b. Now, most business applications are developed for desktops/laptops
(perhaps connected to a mainframe server), using GUI's. Such
applications are typically developed using OO languages such as
Smalltalk, C++, and Java.
5. This issue has risen to some prominence in the last decade or so.
For example, the 2nd (1991) edition of our text devoted one short
chapter to OODBMS's at the end of the text. The 3rd (1998) and
4th (2002) editions have two chapters near the middle of the text;
and the chapters in the 4th edition have some significant expansion
compared to the 3rd.
B. Unfortunately, the basic relational model is not a good match for
meeting these requirements.
1. The requirement for atomic fields and normalization means that complex
objects must be "taken apart" to be stored in the database.
Illustration: Can you store a car in a standard office filing
cabinet? Sure - if you're willing to take it apart
far enough. The problem is the work required to
reassemble it when you try to retrieve it!
As we have just seen, normalization typically requires decomposition;
and entities we are interested in may have to be reconstructed when
needed by doing one or more joins.
2. Lack of support for data types that are not basically textual in
nature (e.g. multimedia as contrasted to strings, numbers, dates).
3. Lack of support for storing behaviors with entities.
4. Some writers have pointed out that there is a fundamental philosophical
difference between the two approaches to representing information.
a. Database systems seek to foster "data independence" - the idea
that data should be stored in such a way as to be independent of
the programs that use it - which facilitates sharing of data
between application areas and the ability to perform ad-hoc
queries.
b. Object-orientation emphasizes the close connection between
data and behavior. An object doesn't just incorporate data -
it also incorporates methods that operate on the data. Moreover,
OO stresses the encapsulation of data, which precludes operating
on data except through the set of methods that a class furnishes.
c. In a generic database, one often needs to use the data in ways
that might not have been anticipated when the database was
designed. What does one do then? Wait for new methods to be
coded? Break encapsulation?
5. The phrase one often hears is that there is an "impedance mismatch"
between the way relational database systems represent data (as
normalized tables) and the way OO systems use data.
C. Both the OO world and the relational DBMS world make use of
entity-relationship diagrams as modeling tools. However, they
implement certain key concepts in very distinct ways.
1. Identity
a. In a relational DBMS, tuples have no identity apart from the values
they hold. Given the basic idea that a relation is a set, it is
contrary to the fundamental ideas of the relational model to think
of two tuples being distinct and yet having the same values.
- Identity is modeled by VALUE - often by the use of an
externally-generated key - e.g. an ID number of some sort
- A consequence of this is that the identifier for an entity is
generally meaningful to the user - e.g. all of you know your
student ID.
b. In an OO system, objects have identity apart from the values they
contain. Two objects may have the exact same state, yet have
distinct identities.
- Identity is typically modeled by some sort of system-generated
OBJECT IDENTIFIER - often related to the physical address in
memory or on disk where the object resides.
For example, in C++ or Java, an object's identity is the location
in memory where it is stored, and pointers/references to the
object hold the address of that location.
- A consequence of this is that the identifier for an entity is
generally _not_ meaningful to the user - e.g. I doubt whether
anyone knows the disk address of his/her student record!
2. Modeling of relationships
a. When converting an E-R diagram to relational tables, one models
each relationship set by creating a table whose columns are the
primary keys of the entities related, plus any attributes of the
relationship. One key feature of the relational model is that
entities and relationships are modeled in essentially the same
way - as tables.
That is, relationships are represented by foreign keys that must
be looked up in the appropriate table to find the row actually
being referred to.
b. When converting an E-R diagram to an OO design, one models
each relationship by either a pointer/reference in one object to
another, or by a collection of such pointers.
That is, the address of the object being referred to is stored
directly, so no lookup is necessary.
c. Note that, in general, foreign keys are more semantically
meaningful, but physical references are more efficient.
Example: Suppose you wanted to find the office of a given
faculty member, given their name. You would have to
look up the office number in a directory, and then go
there. But if you had the office number to begin with,
you could go there directly without having to look
anything up.
OTOH, as I was revising this lecture someone asked me
"where do I find room 209?" After playing "20 questions"
for a while, I determined that what she was looking
for was the secretary's office. Like most people, I can
remember where a person's office is much more easily than
I can remember a room number - when I need to drop
something off for the secretary, I never think "I
need to go to room 209".
3. Notice that, in both of these regards, there is more similarity
between OO and the network and hierarchical models (which also model
relationships by links). However, the latter models are "old
technology" and lack the interactive query capability which creates
interest in DBMS's in the OO world.
Is it possible to create a model where one can "have his cake and
eat it too"?
4. Finally, though both E-R diagrams and OO have a notion of class
hierarchy and inheritance, the way E-R diagram generalization/
specialization is implemented using relational tables loses a lot of
the potential power that is there - especially by way of polymorphic
handling of operations by different subclasses derived from a
common base class.
II. Approaches to Addressing this Issue
-- ---------- -- ---------- ---- -----
A. Development of a new OO database model to replace the relational model.
Currently, there is no OODB model that is standardized and universally
accepted in the way the relational model is; but there are a variety of
commercial products plus industry efforts to develop such a model by
adding persistence to OO languages.
1. Such approaches toward creating an OO data model typically have
the following characteristics.
a. The design is based on equation that an entity-set = a class,
and the class structure mirrors the class hierarchy and
associations in the ER diagram.
(Note the similarity between a UML class diagram and an ER
diagram - indeed, the latter is one of the ancestors of the
former.)
b. The query language is an OO programming language, with
extensions to support persistence typically in the form of
libraries rather than actual changes to the language.
(Contrast this to the use of a separate query language embedded in
a host language, such as embedded SQL).
c. The act of storing or retrieving data from the database is
transparent - an object may be either transient or persistent.
i. Transient objects reside only in main memory, and cease to
exist when the program terminates.
ii. Persistent objects reside on disk between runs of the program,
and a given persistent object may reside either on disk or in
memory while the program is running. (But if it resides in
memory, it is returned to disk before termination, and may
be updated on disk whenever it is changed.)
iii. Various OODB implementations differ as to whether the
distinction between transient and persistent objects is made
on a class basis (if a class is persistent, all its objects
are persistent) or when an object is created (it is created
as either transient or persistent), or by explicitly marking
a given object to be persistent, or by reachability from a
persistent root. However, in all cases, apart from initially
making an object persistent all other references to the object
do not need to treat transient and permanent objects differently.
iv. Contrast this to the use of something like embedded SQL in
a host language such as Java, where the transfer of
information to/from the database is performed by explicit
SELECT, INSERT, DELETE, or UPDATE statement written in a
different language than the rest of the program.
2. Example: the ODMG extensions to C++ described in the text.
a. A pointer/reference may either contain a memory location, or it may
contain a location on disk. The operator overloading mechanisms of
C++ are used to allow a programmer to use "pointer syntax" to
manipulate an object, regardless of where it actually resides.
Of course, if an operation is being performed on an object that
currently resides on disk, the underlying system must bring the
object into memory. This, however, can be done in a way that is
transparent to the user; and a technique known as "pointer
swizzling" can be used to prevent multiple trips to disk for the
same object.
b. The text also discussed a library of templates that are part of
the ODMG proposal for C++, to support referential integrity.
3. Problems: A lot may be lost by going this way:
a. Support for ad-hoc queries.
Example: it would be hard to imaging an ordinary user formulating
queries interactively in the language used for the
examples in the book!
(As a result, some commercial OO DBMS systems provide a SQL
query facility)
b. Support for "set at a time" processing - to perform some operation
on all the members of a collection, one must code a loop using an
iterator.
Example: contrast the use of an iterator in the book example
TRANSPARENCY: Figure 8.10
with
select * from customer;
Even if we neglect some of the other cumbersomeness,
a key difference is the need for an explicit while
loop instead of the implicit loop in SQL.
c. Support for referential integrity through notion of keys, etc. The
ODMG extensions put a significant burden on the programmer to keep
reference sets and their inverses up to date.
B. Alternatively: extension/adaptation of the relational model to meet
the need.
Example: Chapter 9 is devoted to discussion of work on object-relational
models.
A common thread in many (but not all) of these extensions is the equation
domain = class - i.e. entity sets are still relations, but some of
their attributes may move beyond traditional restrictions.
These extensions move in a number of directions beyond the traditional
relational model, including:
1. Allowing non-atomic data types
a. Columns whose values may contain internal structure (i.e. have
fields of their own)
b. Columns which may store a collection (e.g. set, array) of values
rather than a single value - a relation nested within a relation.
A key motivation here is efficiency: multivalued attributes are
associated with multivalued dependencies, which force decomposition
during normalization and lead to the need for joins in queries.
However, a join is inherently a computationally expensive
operation.
2. Support for complex data types (clobs and blobs)
3. Reference data types
a. System-generated object identifiers (oids)
b. The ability for a field to store the oid of a row in another
table (in effect a pointer to it) rather than a key that must
be looked up.
Again, efficiency is a key motivation.
4. Support for inheritance
a. Types based on other types - so that a type includes fields of
its own plus fields inherited from a parent type or types.
b. Tables based on other tables - so that a given row, when
inserted into a subtable, also becomes a row in a base table.
(This moves into the realm of class = entity-set, of course)
5. Storing procedures in the database along with the data
a. As methods of specific data types.
b. As "stand-alone" functions or procedures.
6. Extending SQL to be a more fully-functional programming language in
its own right, including various control structures and an ability to
call routines written in other languages from within SQL.
Example: the CASE construct in SQL you used in an earlier homework.
7. Various commercial systems developed during the 1990's incorporated
some or all of these facilities, which became part of the 1999 SQL
standard and the latest revision - SQL 2003 - though standardization of
implementations is still a ways off and no vendor comes even close to
fully supporting the standard. (And vendors often do things contained
in the standard in their own, non-standard way.)
C. There is some amount of conflict between proponents of a new OO model
and proponents of the relational model.
1. Example: one OO database vendor's Technology Guide begins with a
section headed "The computing world has entered a post-relational
era".
2. Example: title of a paper "Why the Object Data Model isn't a Data
Model"
3. Read excerpt from C.J. Date An Introduction to Database Systems
- 7th ed (Addison Wesley, 2000) pp. 862-863
4. One tension that often shows up is between efficiency for a specific
application, on the one hand, and genericity and modifiability, on
the other hand. With regard to the present question, C.J. Date puts
it this way:
"Although the programming language and database management
disciplines certainly have a lot in common, they do also differ
in in certain important aspects (of course). To be specific:
* An application program is intended - by definition - to solve
some specific problem.
* By contrast, a database is intended - again by definition - to
solve a variety of different problems, some of which might not
even be known at the time the database is established."
(An Introduction to Database Systems - 7th ed (Addison Wesley,
2000) p. 813)
D. It is worth observing that how one intends to use the database affects
how one evaluates the relative merits of the traditional relational,
object-relational, and OO models.
1. A strictly OO model has performance advantages for high-volume
transaction processing (lots of inserts and deletes, supporting
access from many locations simultaneously, e.g. over the web or for
an institution with many branch offices.) Joins, in particular,
are much more computationally-expensive than following pointers.
2. The relational model has a mathematical simplicity and elegance
that provides better support for preserving data integrity and
for ad-hoc queries. When pointers are used to model relationships,
more of the burden has to shift to the programmer to navigate them
and to ensure that they are manipulated correctly. This gets in
the way of facilitating ad-hoc queries by ordinary users.
3. An object-relational database might allow one to use a relational
approach in a domain where the traditional relational model doesn't
provide the needed tools, though the question arises as to how
much should the basic model be changed without losing the advantages
of the relational model.
Going down this road, of course, raises the issue of "how far do we
go?" in terms of compromising the strict relational model to
incorporate ideas from OO. Date obviously has his thoughts on this
subject; other writers would take a "tigher" or "looser" stance.
TRANSPARENCY: "Rest in Peace" cartoon
4. Arguably, an "OO database" might be more suited for providing
persistence in support of a specific application (e.g. CAD),
while a relational database might be more suited for storing
information in a generic way that supports lots of different ways
of accessing it. An object-relational database might be
suitable for application domains that demand both the data independence
of relational systems and support for complex data.
One widely-cited representation of this approach is the
"Stonebraker classification matrix" proposed by Michael
Stonebraker:
TRANSPARENCY
E. Interestingly, it seems that recent developments in this area have
been relatively slow. I am going to read an excerpt from an article
published in August, 2000; however, what it said then seems to still
be fairly accurate today.
Read ecxerpt from "Industry Trends" article - IEEE Computer 33.8
(August 2000) pp. 16-19.
Copyright ©2004 - Russell C. Bjork