RE: Relational Database Design

Subject:	RE: Relational Database Design
From:	Marty Michener <>
Date:	Thu, 21 Feb 2002 12:58:30 -0500

At 10:23 AM 2/21/02 +0800, Horst wrote:
>Dear Marty,
>I would be very interested in a short course on "third form relational
>database design", so if you could email me this information I would be
>grateful. These days, coping (well, trying to anyway) with information
>overload means being organized from the start, so any information and
>solutions that I can learn from are most welcome.
>
>Thanks in advance,
>
>Regards and best wishes
>
>Horst
>--
>*********************************************
>Dr. Horst Flotow
>Senior Research Leader
>Center for Natural Product Research
>59A Science Park Drive, The Fleming
>Singapore Science Park
>Singapore 118240

This relational design is basic to all the big guys - using SQL queries,
Ingres, Sun, Oracle, etc.
And example of a simple query of a single table would limit the response to
a group of rows and of selected columns:

SELECT GENUS, SPECIES, ENGLISH FROM SPECIES.DBF WHERE GENUS="Parus"

Ok. The proper title is Third Normal Form Relational Database Design. This
is pretty close to giving you the first half to 3/4 of the instruction you
may need. Let me know what is unclear after printing it out and reading it
carefully.

The basic ideas, that most people grasp intuitively, at least partially are:

You record data as attributes where one file represents one type of thing
in the world.

Each attribute has to have one value for each instance of this thing
(record in the table).

You record data in only one place, then have all other table references to
it use unique KEYS. Think employee ID number, here.

A key is a short UNIQUE character set usually from 2 to 10 characters long,
for most purposes.

The main idea is not to repeat any information in different files (usually
called tables), and not to have varying attributes all on one record in a
table. Each table should be named, so that it includes the real world list
of those things.

Ok, that's enough theory, an example is critical: Let us look at a list of
attributes that we want to save in our recording library, and we will first
evolve a simple case, that will turn out to be how NOT to do it, then show
the advantage of how it is better done, and the advantages and
disadvantages. What do we want to save from a recording session? We'll use
only a few, for starters.

TABLE: CUTSTRY, first try.
Species names,
cut number,
original medium,
original date,
time
recordist
length,
equipment,
quality
archive name
archive location

So we start with a database file with column heading like these, and start
filling it in, to see how it works.

The immediate problems we realize we have are:
1. we are typing in the same species name, over and over, inviting a
spelling or taxonomic oddity error that will later fail to produce a true
search. Plus there is only room for one species per cut!
2. we are also typing in recordists names over and over, same problem.
3. and what about several recordists, or several species? Do we make
multiple columns? (No, this is Not the solution)

This reveals major design flaws: We need to re-define what all these things
are actually basic attributes OF, if not of CUTS!

The table we redefine is called CUTS. A Species name is NOT an attribute
of a cut, but it needs to be associated with it. A recordist name is also
not a cut attribute, but needs associating (linking, or RELATING) to each cut.
We also will include suggested data types, that reflect actual best
practices for search engines and for accurately representing data. (Zip
codes are NOT numbers, they are character lists; don't believe it? are any
arithmetic functions valid for ZIPS? No, they only serve for sorting into
key values, which is characteristic of characters, not numbers.) C6 means
characters, 6 spaces.

So we make three more tables, and in each we put attributes unique to each
record (instance).
SPECIES:
GSCode C6 KEY, unique. 3 FROM GENUS + 3 FROM SPECIES
Genus c 20
Species C 30
English Name C 30
Spanish Name C30
Vocal? C 1

Examples of SPECIES instances
RANHEC, Rana, hecksheri, River Frog, Rana de Rio, Y
SIASIA, Sialia, sialis, Eastern Bluebird, Azuejo graganta canela, Y
etc.

RECORDISTS:
Rdstcode C3 KEY, unique
Fname C 30
Lname C 30
Rphone C 10
Raddress C 40
Rtown C 25
RSta C 3
RZip C 5

Examples of RECORDISTS instances
KNW, Walt, Knap, ##########, smfkaoejnhdfjkf, Valdostea, GA etc.
MIM, Marty, Michener, 6035981605, 75 Hannah Drive, Hollis, NH, 03049
etc.

STATES:
Rsta C3 KEY, unique
State C 30

GA, Georgia
NH, New Hampshire

Now let's have a go at designing the CUTS table again, knowing better what
does not belong in it:

CUTS:
CUTNO, cut number, C 8, unique
LOCLAT C 8
LOCLON C 8
COUNTRY C 20
original medium, C10
ODATE, original date, D
OTIME, time T
length, N5 in seconds
equipment, mic
equipment, preamp
equipment, recorder
equipment settings, etc.
Quality
archive name C 10
archive location c 15

Notice mainly that we are now listing only things that have a single real
value for each cut.
Things that can have a MANY-TO-ONE relationship do NOT belong in the cuts
table, as they violate the rule: single unique values only. For recordists
and species we must generate new tables, called MANY-TO-MANY tables, to
relate, by unique codes, CUTS to RECORDISTS, or CUTS to SPECIES.

CUTS2RCDST:
CUTNO, C 8
Rdstcode C3

For each record in this table, you can say is an instance of a recordist
(Rdistcode) making the recording (CUTNO).
The date and equipment we already know, as they are truly 1:1 attributes of
CUTNO.

CUTSPECIES
CUTNO C 8
GSCode C6

For each record in this table, you can say it is an instance of SPECIES
GSCode, being recorded somewhere in CUT CUTNO. Notice that because we have
accurately designed this system to reflect real reality, there immediately
are suggested other attributes that describe each species vocalization
itself, and the system can become one level more sophisticated, by
redefining a record as "each instance of a sound from species GSCode
somewhere in CUTNO."

Now we can actually track the start times, and other sound analytical data
that in a less carefully designed system would be hopeless to tame. We
also note that not all sounds we may wish to catalog are animal
species. We can modify the SPECIES table to include rain, thunder, trains,
etc., as long as we assign unique codes to each sound source, running into
details at a level we might someday want to make. So we now take a more
rigorous approach:

CUTSOUND:
CUTNO C 8
GSCode C6
Start_time, seconds from cutstart.
Start-Index on original medium
Species_length, in seconds
Species_loudness
Species_quality
Pitch, most energy
Pitch, highest
Pitch, lowest
Description C 50 (
ID_comments C 40 (give method of iding sound as to source or species)
etc.

Remember, I said this "was different" but not hard. Everyone, when
starting, designs a database allowing "X" species per cut. This is
conceptually for the beginner much easier to deal with than a MANY-TO-MANY
table. But it has these flaws:
1. It wastes file space much of the time, when only 1 or 2 are used;
2. It cannot accept X+1 species, distorting reality in abundant cases.

Obviously, one cannot do all this with Excel. One needs Foxpro, Access,
etc. with the original dBase you had to write, for days and days, a
program to keep all this relational structure intact through all data
entry, changes in time, etc.. This is particularly critical: if you
mistype a species code you "orphan" that record - that is - it is not
related to any real record in the other table with any real value. This is
a huge BAD BAD in relational management.

My own software actually has each table a separate physical file, e.g.
SPECIES.DBF. It has an editor, called SELECT.EXE
that opens, appends to or edits ANY DBF file, only it can easily be tamed
without programming.

For each table, I create a dictionary file, of my own design: SPECIES.DIC.
The easy secret, here, is that a DIC file is just a renamed DBF file of a
specific structure, (and I have a utility that searches any DBF file you
might have and CREATES a DIC file for it, gathering data entry rules as it
goes). These are the column names in all DIC files:

FLD_NAME C 10 the name of the DBF column
ALLOWED C 128 what characters or values are allowed when editing the
parent file, includes look-ups.
USE C 64 what the real use is for this particular column,
your notes.
HELP C 78 what help bottom line to display when adding data, to
remind you what is and is not allowed.

The DIC file has exactly as many records as the parent file has columns,
one to tell the editor exactly how to treat each column when data is
entered or changed.

Each record deals with one of the columns in the parent file. This is all
read into the editor when you open the DBF file.
An easy rule is "For date types, you want numbers, and do not want anyone
typing in alphabetical letters." So, in the dictionary, for every DATE
column title, in ALLOWED, you type 0123456789. The editor will never allow
any other characters entered in appending or editing. Press an "A" and it
never shows up, and BEEPS.

Oh, no, you think, why do I want to know this? Marty has some unique
system. Because it is easy to use the editor I wrote to edit dictionary
files themselves, and thus embody the structure I have described above into
the files themselves. A simple example will show this.

You are editing and adding to the CUTSOUND table. Please note that if you
mistype either CUTNO or GSCode, both KEY attributes (that uniquely must
agree with a single record in another table) you produce a nonsense
orphaned record. In most data systems, you would have to write an
elaborate program that compares each entry to all the valid values in the
remote table, and then goes back and tells the user: "wrong value". Set up
correctly, the SELECT editor does all this and more.

Set up: Use the editor to edit CUTSOUND.DIC
There will be one record (row, line, going down) for every COLUMN in
CUTSOUND.DBF
Let say the CUTNO is defined as all numbers.
Then the lookup table to validate each key will be the original table,
called CUTS.DBF

In the DIC, the ALLOWED field will look like this:
FFCUTS.DBF#CUTNO#ODATE#OTIME#COUNTRY&&0123456789

WHAT DOES ALL THIS MEAN? FF means this is the file to look in, to match
values entered into this column: CUTS.DBF
[# ARE DIVIDERS], and the next entry is the column to look in for an exact
match: CUTNO.

So, to enter or edit any value in this table, it must relate exactly to one
record in the table CUTS, matching a value in the column CUTNO of that table.

&& tells when the lookup list is done; SO all the next entries until && are
other columns to be looked up and reported back on the status line of the
editor, when any match is found.
All the info found after && is allowed key characters which can be used in
this field, CUTNO in CUTSOUND.DBF

So, in the editor, you type 01231006, hit enter and the editor looks all
through the other table, CUTS.DBF, and if it doesn't find this key value in
column CUTNO of a record, it will not let you leave the field, with a beep
and warning message.

If it DOES find the key value 01231006 in CUTS.DBF, it accepts the entry,
and moves to the next field and displays the date, time and country found
in that cut, on the bottom line, as : "FOUND: 20010123 1023 Panama". So,
you not only found A record that agrees, you can check your notes to see if
it is the CORRECT record you meant to match.

The ALLOWED for the SPECIES table lookups, would be:

FFSPECIES.DBF#GSCODE#GENUS#SPECIES#ENGLISH&&ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Same with the GSCODE, you type in ZIMVIL and it looks in SPECIES.DBF, IN
COLUMN GSCODE, and returns the line:
"FOUND: Zimmerius vilissima Paltry Tyrannulet"

After each record entry, you are assured that this MANY-TO-MANY table is
correct, and cannot be edited to orphan any record using the
system. "Programming" the DIC file takes only as long as to type the
ALLOWED lines. Errors are thus reduced by several thousand-fold.

I use the SELECT editor (limited to DOS, and the eight by three file names)
and DIC creator every day, but there are many things they need to do to be
generally useful that I will not have time to create and debug for at least
four months. Otherwise, I would share the software freely with anyone on
this group in a moment.

Best,
Marty Michener
Hollis, NH

________________________________________________________________________
________________________________________________________________________

<Prev in Thread]	Current Thread	[Next in Thread>
RE: Relational Database Design, Marty Michener <= Re: Relational Database Design, Walter Knapp

Previous by Date:	Re: Wind protection for mic, Lang Elliott
Next by Date:	Hello, Gordon Hempton
Previous by Thread:	Wind protection for mic, Callan Cohen
Next by Thread:	Re: Relational Database Design, Walter Knapp
Indexes:	[Date] [Thread] [Top] [All Lists]

The University of NSW School of Computer and Engineering takes no responsibility for the contents of this archive. It is purely a compilation of material sent by many people to the naturerecordists mailing list. It has not been checked for accuracy nor its content verified in any way. If you wish to get material removed from the archive or have other queries about the archive e-mail Andrew Taylor at this address: andrewt@cse.unsw.EDU.AU