When building a relational database from scratch, it is important that you put a good deal of thought into the process. A poorly designed database can cause a number of headaches for its users, including:
Entire courses can be spent on database design concepts, but we don't have that kind of time, so let's just focus on some basic design rules that should serve you well. A well-designed table is one that:
The process of designing a database according to the rules described above is formally referred to as normalization. All database designers carry out normalization, whether they use that term to describe the process or not. Hardcore database designers not only use the term normalization, they're also able to express the extent to which a database has been normalized:
In most cases, normalizing a database so that it is in 3NF is sufficient. However, it is worth pointing out that there are other normal forms including Boyce-Codman normal form (BCNF, or 3.5NF), fourth normal form (4NF) and fifth normal form (5NF). Rather than spend time going through examples of these other forms, I encourage you to simply keep in mind the basic characteristics of a well-designed table listed above. If you follow those guidelines carefully, in particular, constantly being on the lookout for redundant data, you should be able to reap the benefits of normalization.
Generally speaking, a higher level of normalization results in a higher number of tables. And as the number of tables increases, the costs of bringing together data through joins increases as well, both in terms of the expertise required in writing the queries and in the performance of the database. In other words, the normalization process can sometimes yield a design that is too difficult to implement or that performs too slowly. Thus, it is important to bear in mind that database design is often a balancing of concerns related to data integrity and storage efficiency (why we normalize) versus concerns related to its usability (getting data into and out of the database).
Earlier, we talked about city/state combinations being redundant with zip code. That is a great example of a situation in which de-normalizing the data might make sense. I have no hard data on this, but I would venture to say that the vast majority of relational databases that store these three attributes keep them all together in the same table. Yes, there is a benefit to storing the city and state names once in the zip code table (less chance of a misspelling, less disk space used). However, my guess is that the added complexity of joining the city/state together with the rest of the address elements outweighs that benefit to most database designers.
Let's work through an example design scenario to demonstrate how these rules might be applied to produce an efficient database. Ice cream entrepreneurs Jen and Barry have opened their business and now need a database to track orders. When taking an order, they record the customer's name, the details of the order such as the flavors and quantities of ice cream needed, the date the order is needed, and the delivery address. Their database needs to help them answer two important questions:
A first crack at storing the order information might look like this:
Customer | Order | DeliveryDate | DeliveryAdd |
---|---|---|---|
Eric Cartman | 1 vanilla, 2 chocolate | 12/1/11 | 101 Main St |
Bart Simpson | 10 chocolate, 10 vanilla, 5 strawberry | 12/3/11 | 202 School Ln |
Stewie Griffin | 1 rocky road | 12/3/11 | 303 Chestnut St |
Bart Simpson | 3 mint chocolate chip, 2 strawberry | 12/5/11 | 202 School Ln |
Hank Hill | 2 coffee, 3 vanilla | 12/8/11 | 404 Canary Dr |
Stewie Griffin | 5 rocky road | 12/10/11 | 303 Chestnut St |
The problem with this design becomes clear when you imagine trying to write a query that calculates the number of gallons of vanilla that have been ordered. The quantities are mixed with the names of the flavors, and any one flavor could be listed anywhere within the order field (i.e., it won't be consistently listed first or second).
A design like the following would be slightly better:
Customer | Flavor1 | Qty1 | Flavor2 | Qty2 | Flavor3 | Qty3 | DeliveryDate | DeliveryAdd |
---|---|---|---|---|---|---|---|---|
Eric Cartman | vanilla | 1 | chocolate | 2 | 12/1/11 | 101 Main St |
||
Bart Simpson | chocolate | 10 | vanilla | 10 | strawberry | 5 | 12/3/11 | 202 School Ln |
Stewie Griffin | rocky road | 1 | 12/3/11 | 303 Chestnut St |
||||
Bart Simpson | mint chocolate chip | 3 | strawberry | 2 | 12/5/11 | 202 School Ln |
||
Hank Hill | coffee | 2 | vanilla | 3 | 12/8/11 | 404 Canary Dr |
||
Stewie Griffin | rocky road | 5 | 12/10/11 | 303 Chestnut St |
This is an improvement because it enables querying on flavors and summing quantities. However, to calculate the gallons of vanilla ordered, you would need to sum the values from three fields. Also, the design would break down if a customer ordered more than three flavors.
Slightly better still is this design:
Customer | Flavor | Qty | DeliveryDate | DeliveryAdd |
---|---|---|---|---|
Eric Cartman | vanilla | 1 | 12/1/11 | 101 Main St |
Eric Cartman | chocolate | 2 | 12/1/11 | 101 Main St |
Bart Simpson | chocolate | 10 | 12/3/11 | 202 School Ln |
Bart Simpson | vanilla | 10 | 12/3/11 | 202 School Ln |
Bart Simpson | strawberry | 5 | 12/3/11 | 202 School Ln |
Stewie Griffin | rocky road | 1 | 12/3/11 | 303 Chestnut St |
Hank Hill | coffee | 2 | 12/8/11 | 404 Canary Dr |
Hank Hill | vanilla | 3 | 12/8/11 | 404 Canary Dr |
Stewie Griffin | rocky road | 5 | 12/10/11 | 303 Chestnut St |
This design makes calculating the gallons of vanilla ordered much easier. Unfortunately, it also produces a lot of redundant data and spreads a complete order from a single customer across multiple rows.
Better than all of these approaches would be to separate the data into four entities (Customers, Flavors, Orders, and Order Items):
CustID | NameLast | NameFirst | DeliveryAdd |
---|---|---|---|
1 | Cartman | Eric | 101 Main St |
2 | Simpson | Bart | 202 School Ln |
3 | Griffin | Stewie | 303 Chestnut St |
4 | Hill | Hank | 404 Canary Dr |
FlavorID | Name |
---|---|
1 | vanilla |
2 | chocolate |
3 | strawberry |
4 | rocky road |
5 | mint chocolate chip |
6 | coffee |
OrderID | CustID | DeliveryDate |
---|---|---|
1 | 1 | 12/1/11 |
2 | 2 | 12/3/11 |
3 | 3 | 12/3/11 |
4 | 2 | 12/5/11 |
5 | 4 | 12/8/11 |
6 | 3 | 12/10/11 |
OrderItemID | OrderID | FlavorID | Qty |
---|---|---|---|
1 | 1 | 1 | 1 |
2 | 1 | 2 | 2 |
3 | 2 | 2 | 10 |
4 | 2 | 1 | 10 |
5 | 2 | 3 | 5 |
6 | 3 | 4 | 1 |
7 | 4 | 5 | 3 |
8 | 4 | 3 | 2 |
9 | 5 | 6 | 2 |
10 | 5 | 1 | 3 |
11 | 6 | 4 | 5 |
If one were to implement a design like this in MS-Access, the query needed to display orders that must be delivered in the next 2 days would look like this in the GUI:
It would produce the following SQL:
If you aren't experienced with relational databases, then such a table join may seem intimidating. You probably won't need to do anything quite so complex in this course. The purpose of this example is two-fold:
Whether it's just a quick sketch on a napkin or a months-long process involving many stakeholders, the life cycle of any effective database begins with data modeling. Data modeling itself begins with a requirements analysis, which can be more or less formal, depending on the scale of the project. One of the common products of the data modeling process is an entity-relationship (ER) diagram. This sort of diagram depicts the categories of data that must be stored (the entities) along with the associations (or relationships) between them. The Wikipedia entry on ER diagrams is quite good, so I'm going to point you there to learn more:
Entity-relationship model article at Wikipedia [2]
An ER diagram is essentially a blueprint for a database structure. Some RDBMSs provide diagramming tools (e.g., Oracle Designer, MySQL Workbench) and often include the capability of automatically creating the table structure conceptualized in the diagram.
In a GIS context, Esri makes it possible to create new geodatabases based on diagrams authored using CASE (Computer-Aided Software Engineering) tools. This blog post, Using Case tools in Arc GIS 10 [3], provides details if you are interested in learning more.
To help drive these concepts home, here is a scenario for you to consider. You work for a group with an idea for a fun website: to provide a place for music lovers to share their reviews of albums on the 1001 Albums You Must Hear Before You Die list [4]. (All of these albums had been streamable from Radio Romania 3Net [5], but sadly, it appears that's no longer the case.)
Spend 15-30 minutes designing a database (on paper, no need to implement it in Access) for this scenario. Your database should be capable of efficiently storing all of the following data:
When you're satisfied with your design, move to the next page and compare yours to mine.
Credit for all screenshots: © Penn State is licensed under CC BY-NC-SA 4.0 [6]
Links
[1] http://www.1keydata.com/database-normalization/second-normal-form-2nf.php
[2] http://en.wikipedia.org/wiki/Entity-relationship_model
[3] http://blogs.esri.com/esri/arcgis/2010/08/05/using-case-tools-in-arcgis-10/
[4] http://www.rocklistmusic.co.uk/1001Albums.htm
[5] https://www.radio3net.ro/
[6] https://creativecommons.org/licenses/by-nc-sa/4.0/