GEOG 863
GIS Mashups for Geospatial Professionals

Relational Database Concepts

PrintPrint

Why a database?

Reading your web map data from a shapefile or XML file works fine in certain situations.  However, it's not difficult to think of scenarios in which using flat files like these presents problems:

  • if the data are frequently changing, you'll need to have a step built into your business process that ensures there is an up-to-date copy of the files on the web server;
  • if you want to provide the user with the ability to map subsets of a larger dataset, doing so with flat files data can be clunky;
  • if you want to make it possible to allow updates of the data.

A major goal of this lesson is for you to learn how to write PHP scripts to extract data from a database and output it in XML form.  Once you've built such a script, with just a minor tweak to your map page you can go from reading a static XML file to reading XML data generated on the fly based on up-to-date information from your database.  But before we get to that point, let's first go over some basic RDBMS concepts.

Note

If you're experienced in working with relational databases, you can probably skip over the rest of this page.

What is a relational database?

A relational database is a system in which data are stored in tables of rows (records) and columns (fields). The defining characteristic of a relational database is that the full data set is broken down into multiple tables (each representing a single subject) and that relationships can be established between those tables through the use of key fields. Structured Query Language (SQL, sometimes pronounced "sequel") is used in an RDBMS to manipulate the tabular data.

There are many RDBMS software packages to choose from. In the commercial realm, Oracle has long been the most popular, followed by IBM's DB2 and Microsoft's SQL Server. In the open-source realm, MySQL and PostgreSQL are the major players. Microsoft Access is a relatively inexpensive package that is best suited for small datasets. Because of its user-friendly point-and-click environment, it is also frequently used as a "front-end" to more high-powered package — i.e., to connect to and manipulate data that are actually stored in a separate RDBMS.

In this course, we're going to create and populate MySQL databases using MS-Access as a front end.

Using SQL to select data

There are a number of database concepts that we should cover if you're going to build a database-driven mashup. Rather than start with database design concepts, which can be a bit dry, let's begin by looking at how SQL can be used to query an existing database. After getting a feel for the power of SQL, you should be more motivated to work towards understanding the less exciting steps involved in creating and populating a database.

The example queries in this section read data from a simple, two-table database of baseball statistics. Here are excerpts from these tables:

Players
PLAYER_ID FIRST_NAME LAST_NAME
1 Barry Bonds
2 Hank Aaron
3 Babe Ruth
4 Willie Mays
5 Sammy Sosa
STATS
PLAYER_ID YEAR AB HITS HR RBI TEAM
1 2006 367 99 26 77 SFG
1 2007 340 94 28 66 SFG
2 1954 468 131 13 69 MLN
2 1955 602 189 27 106 MLN
2 1956 609 200 26 92 MLN

First, let's look at a query that simply lists data from three selected columns in the STATS table:

SELECT PLAYER_ID, YEAR, HR FROM STATS;
PLAYER_ID YEAR HR
1 2006 26
1 2007 28
2 1954 13
2 1955 27
2 1956 26

Note the basic syntax used to list tabular data. The statement begins with the SELECT keyword and is followed by a list of the desired columns separated by commas. After the list of columns comes the FROM keyword and the name of the table in which the columns are found. The statement ends with a semi-colon. (Note that all columns can be retrieved using the asterisk; e.g., SELECT * FROM STATS;).

We can retrieve records that meet some criteria by adding a WHERE clause. For example, this query lists only those seasons in which the player had more than 50 home runs:

SELECT PLAYER_ID, YEAR, HR FROM STATS WHERE HR > 50;
PLAYER_ID YEAR HR
1 2001 73
3 1920 54
3 1921 59
3 1927 60
3 1928

54

4 1955 51
4 1965 52
5 1998 66

This list is begging to be sorted from high to low. To do that, we need to add an ORDER BY clause:

SELECT PLAYER_ID, YEAR, HR FROM STATS WHERE HR > 50 ORDER BY HR DESC;
PLAYER_ID YEAR HR
1 2001 73
8 1998 70
5 1998 66
8 1999 65
5 2001

64

3 1927 60
3 1921 59

If you wanted to sort from low to high, you would replace the DESC keyword with ASC or simply omit that keyword altogether (i.e., ascending order is the default). Also, keep in mind that it is possible to sort by multiple fields. For example, it is common to sort lists of people first by their last name, then by their first name. Such an ORDER BY clause would look like this: ORDER BY LAST_NAME, FIRST_NAME.

By now, I'm sure you're wondering how to view the names of the players along with their stats. This is done by joining the two tables based on the PLAYER_ID column they share in common. The following query brings together the names from the PLAYERS table and the year and HR count from the STATS table for the 50+ HR subset we've been working with:

SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, STATS.YEAR, STATS.HR
FROM STATS INNER JOIN PLAYERS
ON STATS.PLAYER_ID = PLAYERS.PLAYER_ID

WHERE HR > 50;

FIRST_NAME LAST_NAME YEAR HR
Barry Bonds 2001 73
Babe Ruth 1920 54
Babe Ruth 1921 59
Babe Ruth 1927 60
Babe Ruth 1928

54

Willie Mays 1955 51

Note that the FROM clause in this statement sandwiches the keywords INNER JOIN between the names of the two tables. The FROM clause is then immediately following by an ON clause in which the key fields to be used to create the join are specified. The field names are prefaced by the names of their parent tables using a "table.field" syntax. You may have noticed that the field names in the SELECT clause were also prefaced by their parent table names. This is not technically necessary if it is clear to the SQL execution engine which table holds each field. In other words, the FIRST_NAME field is found only in the PLAYERS table, so there is no question as to where to find that field. If on the other hand a FIRST_NAME field existed in both tables, it would be critical to include a reference to the appropriate table. This example errs on the side of caution by including the parent table names.

One of the more powerful features of SQL is the ability to group records by one or more fields and calculate summary statistics for each resulting group. In the example below, each player's career home run total is calculated:

SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, Sum(STATS.HR) AS CAREER_HR
FROM STATS INNER JOIN PLAYERS
ON STATS.PLAYER_ID = PLAYERS.PLAYER_ID
GROUP BY PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME
ORDER BY Sum(STATS.HR) DESC;

FIRST_NAME LAST_NAME CAREER_HR
Barry Bonds 762
Hank Aaron 755
Babe Ruth 714
Willie Mays 660
Ken Griffey

611

Sammy Sosa 609

The first item to note in this statement is its GROUP BY clause. This clause specifies that you would like the SQL execution engine to output all of the unique FIRST_NAME/LAST_NAME combinations. The other part of the statement to pay particular attention to is the Sum(STATS.HR) AS CAREER_HR part. This specifies that you would like to have the sum of the values in the HR field for each FIRST_NAME/LAST_NAME combination included in the output under a heading of CAREER_HR.

Earlier I mentioned that MS-Access is frequently used as a "front-end" to more powerful RDBMS packages like Oracle and MySQL. One of the reasons for this is that Access has a user-friendly graphical user interface (GUI) for building queries. For example, here is a look at the Access query GUI after building the previous query:

MS-Access query GUI
 
 

Database design

When building a relational database from scratch, it is important that you put good deal of thought into the process. A poorly designed database can cause a number of headaches for its users, including:

  • loss of data integrity over time
  • inability to support needed queries
  • slow performance

Entire courses can be spent on database design concepts, but we don't have that kind of time, so let's just focus on some basic design rules that should serve you well. A well-designed table is one that:

  • represents a single subject;
  • has a primary key (a field or set of fields that uniquely identifies each record in the table);
  • does not contain multi-part fields (e.g., "302 Walker Bldg, University Park, PA 16802");
  • does not contain multi-valued fields (e.g., an Author field shouldn't hold values of the form "Jones, Martin, Williams");
  • does not contain unnecessary duplicate fields (e.g., avoid using Author1, Author2, Author3);
  • does not contain fields that rely on other fields (e.g., don't create a Wage field in a table that has PayRate and HrsWorked fields);
  • contains a minimum of redundant data.

Let's work through an example design scenario to demonstrate how these rules might be applied to produce an efficient database. Those ice cream entrepreneurs, Jen and Barry, have opened their business and now need a database to track orders. When taking an order, they record the customer's name, the details of the order such as the flavors and quantities of ice cream needed, the date the order is needed, and the delivery address. Their database needs to help them answer two important questions:

  1. Which orders are due to be shipped within the next two days?
  2. Which flavors must be produced in greater quantities?

A first crack at storing the order information might look like this:

Order Information, first version
Customer Order DeliveryDate DeliveryAdd
Eric Cartman 1 vanilla, 2 chocolate 12/1/08 101 Main St
Bart Simpson 10 chocolate, 10 vanilla, 5 strawberry 12/3/08 202 School Ln
Stewie Griffin 1 rocky road 12/3/08 303 Chestnut St
Bart Simpson 3 mint chocolate chip, 2 strawberry 12/5/08 202 School Ln
Hank Hill 2 coffee, 3 vanilla 12/8/08 404 Canary Dr
Stewie Griffin 5 rocky road 12/10/08 303 Chestnut St

The problem with this design becomes clear when you imagine trying to write a query that calculates the number of gallons of vanilla that have been ordered. The quantities are mixed with the names of the flavors and any one flavor could be listed anywhere within the order field (i.e., it won't be consistently listed first or second).

A design like the following would be slightly better:

Order Information, improved version
Customer Flavor1 Qty1 Flavor2 Qty2 Flavor3 Qty3 DeliveryDate DeliveryAdd
Eric Cartman vanilla 1 chocolate 2     12/1/08 101 Main St
Bart Simpson chocolate 10 vanilla 10 strawberry 5 12/3/08 202 School Ln
Stewie Griffin rocky road 1         12/3/08 303 Chestnut St
Bart Simpson mint chocolate chip 3 strawberry 2     12/5/08 202 School Ln
Hank Hill coffee 2 vanilla 3     12/8/08 404 Canary Dr
Stewie Griffin rocky road 5         12/10/08 303 Chestnut St

This is an improvement because it enables querying on flavors and summing quantities. However, to calculate the gallons of vanilla ordered, you would need to sum the values from three fields. Also, the design would break down if a customer ordered more than three flavors.

Slightly better still is this design:

Order Information, even more improved version
Customer Flavor Qty DeliveryDate DeliveryAdd
Eric Cartman vanilla 1 12/1/08 101 Main St
Eric Cartman chocolate 2 12/1/08 101 Main St
Bart Simpson chocolate 10 12/3/08 202 School Ln
Bart Simpson vanilla 10 12/3/08 202 School Ln
Bart Simpson strawberry 5 12/3/08 202 School Ln
Stewie Griffin rocky road 1 12/3/08 303 Chestnut St
Hank Hill coffee 2 12/8/08 404 Canary Dr
Hank Hill vanilla 3 12/8/08 404 Canary Dr
Stewie Griffin rocky road 5 12/10/08 303 Chestnut St

This design makes calculating the gallons of vanilla ordered much easier. Unfortunately, it also produces a lot of redundant data and spreads a complete order across multiple rows.

Better than all of these approaches would be to separate the data into four entities (Customers, Flavors, Orders, and Order Items):

CUSTOMERS
CustID NameLast NameFirst DeliveryAdd
1 Cartman Eric 101 Main St
2 Simpson Bart 202 School Ln
3 Griffin Stewie 303 Chestnut St
4 Hill Hank 404 Canary Dr
FLAVORS
FlavorID Name
1 vanilla
2 chocolate
3 strawberry
4 rocky road
5 mint chocolate chip
6 coffee
ORDERS
OrderID CustID DeliveryDate
1 1 12/1/08
2 2 12/3/08
3 3 12/3/08
4 2 12/5/08
5 4 12/8/08
6 3 12/10/08
ORDER ITEMS
OrderItemID OrderID FlavorID Qty
1 1 1 1
2 1 2 2
3 2 2 10
4 2 1 10
5 2 3 5
6 3 4 1
7 4 5 3
8 4 3 2
9 5 6 2
10 5 1 3
11 6 4 5

If one were to implement a design like this in MS-Access, the query needed to display orders that must be delivered in the next 2 days would look like this in the GUI:

Jen and Barry Query
 
 

It would produce the following SQL:

SELECT Orders.OrderID, Customers.NameLast, Customers.NameFirst, Flavors.Name As Flavor,
    OrderItems.Qty, Customers.DeliveryAdd, Orders.DeliveryDate
FROM ((Orders INNER JOIN OrderItems ON Orders.OrderID = OrderItems.OrderID)
    INNER JOIN Customers ON Orders.CustID = Customers.CustID)
    INNER JOIN Flavors ON OrderItems.FlavorID = Flavors.FlavorID
WHERE (((Orders.DeliveryDate) < Now()+2))
ORDER BY Orders.OrderID;

If you aren't experienced with relational databases, then such a table join may seem intimidating. You probably won't need to do anything quite so complex in this course. The purpose of this example is two-fold:

  1. To illustrate that multi-table designs are often preferable to a "one-big-spreadsheet" approach, and
  2. To emphasize the importance of SQL in pulling together data spread across multiple tables to perform valuable database queries.

 

Try This!

To get some practice with building queries, work through this tutorial that references the same baseball statistics database utilized above.

 

  1. Query_Writing_Exercise (for Access 2003)
  2. Query Writing Exercise (for Access 2007/2010)
  3. Baseball_stats.mdb