naturerecordists
[Top] [All Lists]

RE: Relational Database Design

Subject: RE: Relational Database Design
From: Marty Michener <>
Date: Thu, 21 Feb 2002 12:58:30 -0500
At 10:23 AM 2/21/02 +0800, Horst wrote:
>Dear Marty,
>I would be very interested in a short course on "third form relational
>database design", so if you could email me this information I would be
>grateful. These days, coping (well, trying to anyway) with information
>overload means being organized from the start, so any information and
>solutions that I can learn from are most welcome.
>
>Thanks in advance,
>
>Regards and best wishes
>
>Horst
>--
>*********************************************
>Dr. Horst Flotow
>Senior Research Leader
>Center for Natural Product Research
>59A Science Park Drive, The Fleming
>Singapore Science Park
>Singapore 118240

This relational design is basic to all the big guys - using SQL queries, 
Ingres, Sun, Oracle, etc.
And example of a simple query of a single table would limit the response to 
a group of rows and of selected columns:

SELECT GENUS, SPECIES, ENGLISH FROM SPECIES.DBF WHERE GENUS="Parus"

Ok. The proper title is Third Normal Form Relational Database Design.  This 
is pretty close to giving you the first half to 3/4 of the instruction you 
may need.  Let me know what is unclear after printing it out and reading it 
carefully.

The basic ideas, that most people grasp intuitively, at least partially are:

You record data as attributes where one file represents one type of thing 
in the world.

Each attribute has to have one value for each instance of this thing 
(record in the table).

You record data in only one place, then have all other table references to 
it use unique KEYS.  Think employee ID number, here.

A key is a short UNIQUE character set usually from 2 to 10 characters long, 
for most purposes.

The main idea is not to repeat any information in different files (usually 
called tables), and not to have varying attributes all on one record in a 
table.  Each table should be named, so that it includes the real world list 
of those things.

Ok, that's enough theory, an example is critical: Let us look at a list of 
attributes that we want to save in our recording library, and we will first 
evolve a simple case, that will turn out to be how NOT to do it, then show 
the advantage of how it is better done, and the advantages and 
disadvantages.  What do we want to save from a recording session? We'll use 
only a few, for starters.

TABLE: CUTSTRY, first try.
Species names,
cut number,
original medium,
original date,
time
recordist
length,
equipment,
quality
archive name
archive location

So we start with a database file with column heading like these, and start 
filling it in, to see how it works.

The immediate problems we realize we have are:
1.  we are typing in the same species name, over and over, inviting a 
spelling or taxonomic oddity error that will later fail to produce a true 
search.  Plus there is only room for one species per cut!
2. we are also typing in recordists names over and over, same problem.
3. and what about several recordists, or several species?  Do we make 
multiple columns? (No, this is Not the solution)

This reveals major design flaws: We need to re-define what all these things 
are actually basic attributes OF, if not of CUTS!

The table we redefine is called CUTS.  A Species name is NOT an attribute 
of a cut, but it needs to be associated with it.  A recordist name is also 
not a cut attribute, but needs associating (linking, or RELATING) to each cut.
We also will include suggested data types, that reflect actual best 
practices for search engines and for accurately representing data. (Zip 
codes are NOT numbers, they are character lists; don't believe it? are any 
arithmetic functions valid for ZIPS? No, they only serve for sorting into 
key values, which is characteristic of characters, not numbers.)  C6 means 
characters, 6 spaces.

So we make three more tables, and in each we put attributes unique to each 
record (instance).
SPECIES:
GSCode  C6   KEY, unique. 3 FROM GENUS + 3 FROM SPECIES
Genus     c 20
Species   C 30
English Name  C 30
Spanish Name  C30
Vocal?  C 1

Examples of SPECIES instances
RANHEC, Rana, hecksheri, River Frog,  Rana de Rio, Y
SIASIA, Sialia, sialis, Eastern Bluebird, Azuejo graganta canela, Y
etc.

RECORDISTS:
Rdstcode  C3   KEY, unique
Fname  C 30
Lname  C 30
Rphone  C 10
Raddress C 40
Rtown  C 25
RSta    C 3
RZip    C 5

Examples of RECORDISTS instances
KNW, Walt, Knap, ##########, smfkaoejnhdfjkf, Valdostea, GA  etc.
MIM, Marty, Michener, 6035981605, 75 Hannah Drive, Hollis, NH, 03049
etc.

STATES:
Rsta  C3   KEY, unique
State C 30

GA, Georgia
NH, New Hampshire

Now let's have a go at designing the CUTS table again, knowing better what 
does not belong in it:

CUTS:
CUTNO, cut number,   C 8, unique
LOCLAT C 8
LOCLON C 8
COUNTRY C 20
original medium,  C10
ODATE, original date,   D
OTIME, time                T
length,             N5  in seconds
equipment,  mic
equipment, preamp
equipment, recorder
equipment settings, etc.
Quality
archive name   C 10
archive location   c 15

Notice mainly that we are now listing only things that have a single real 
value for each cut.
Things that can have a MANY-TO-ONE relationship do NOT belong in the cuts 
table, as they violate the rule: single unique values only.  For recordists 
and species we must generate new tables, called MANY-TO-MANY tables, to 
relate, by unique codes, CUTS to RECORDISTS, or CUTS to SPECIES.

CUTS2RCDST:
CUTNO, C 8
Rdstcode  C3

For each record in this table, you can say is an instance of a recordist 
(Rdistcode) making the recording (CUTNO).
The date and equipment we already know, as they are truly 1:1 attributes of 
CUTNO.

CUTSPECIES
CUTNO  C 8
GSCode  C6

For each record in this table, you can say it is an instance of SPECIES 
GSCode, being recorded somewhere in CUT CUTNO.  Notice that because we have 
accurately designed this system to reflect real reality, there immediately 
are suggested other attributes that describe each species vocalization 
itself, and the system can become one level more sophisticated, by 
redefining a record as "each instance of a sound from species GSCode 
somewhere in CUTNO."

Now we can actually track the start times, and other sound analytical data 
that in a less carefully designed system would be hopeless to tame.  We 
also note that not all sounds we may wish to catalog are animal 
species.  We can modify the SPECIES table to include rain, thunder, trains, 
etc., as long as we assign unique codes to each sound source, running into 
details at a level we might someday want to make.  So we now take a more 
rigorous approach:

CUTSOUND:
CUTNO  C 8
GSCode  C6
Start_time, seconds from cutstart.
Start-Index on original medium
Species_length, in seconds
Species_loudness
Species_quality
Pitch, most energy
Pitch, highest
Pitch, lowest
Description  C 50 (
ID_comments  C 40 (give method of iding sound as to source or species)
etc.

Remember, I said this "was different" but not hard.  Everyone, when 
starting, designs a database allowing "X" species per cut.  This is 
conceptually for the beginner much easier to deal with than a MANY-TO-MANY 
table.  But it has these flaws:
1.  It wastes file space much of the time, when only 1 or 2 are used;
2.  It cannot accept X+1 species, distorting reality in abundant cases.


Obviously, one cannot do all this with Excel.   One needs Foxpro, Access, 
etc.  with the original dBase you had to write, for days and days, a 
program to keep all this relational structure intact through all data 
entry, changes in time, etc..  This is particularly critical: if you 
mistype a species code you "orphan" that record - that is - it is not 
related to any real record in the other table with any real value.  This is 
a huge BAD BAD in relational management.

My own software actually has each table a separate physical file, e.g. 
SPECIES.DBF.  It has an editor, called SELECT.EXE
that opens, appends to or edits ANY DBF file, only it can easily be tamed 
without programming.

For each table, I create a dictionary file, of my own design: SPECIES.DIC.
The easy secret, here, is that a DIC file is just a renamed DBF file of a 
specific structure, (and I have a utility that searches any DBF file you 
might have and CREATES a DIC file for it, gathering data entry rules as it 
goes).  These are the column names in all DIC files:

FLD_NAME  C 10   the name of the DBF column
ALLOWED   C 128   what characters or values are allowed when editing the 
parent file, includes look-ups.
USE            C  64   what the real use is for this particular column, 
your notes.
HELP          C  78   what help bottom line to display when adding data, to 
remind you what is and is not allowed.

The DIC file has exactly as many records as the parent file has columns, 
one to tell the editor exactly how to treat each column when data is 
entered or changed.

Each record deals with one of the columns in the parent file. This is all 
read into the editor when you open the DBF file.
An easy rule is "For date types, you want numbers, and do not want anyone 
typing in alphabetical letters."  So, in the dictionary, for every DATE 
column title, in ALLOWED, you type 0123456789.  The editor will never allow 
any other characters entered in appending or editing. Press an "A" and it 
never shows up, and BEEPS.

Oh, no, you think, why do I want to know this?  Marty has some unique 
system.   Because it is easy to use the editor I wrote to edit dictionary 
files themselves, and thus embody the structure I have described above into 
the files themselves.  A simple example will show this.

You are editing and adding to the CUTSOUND table. Please note that if you 
mistype either CUTNO or GSCode, both KEY attributes (that uniquely must 
agree with a single record in another table) you produce a nonsense 
orphaned record.  In most data systems, you would have to write an 
elaborate program that compares each entry to all the valid values in the 
remote table, and then goes back and tells the user: "wrong value".  Set up 
correctly, the SELECT editor does all this and more.

Set up: Use the editor to edit CUTSOUND.DIC
There will be one record (row, line, going down) for every COLUMN in 
CUTSOUND.DBF
Let say the CUTNO is defined as all numbers.
Then the lookup table to validate each key will be the original table, 
called CUTS.DBF

In the DIC, the ALLOWED field will look like this:
FFCUTS.DBF#CUTNO#ODATE#OTIME#COUNTRY&&0123456789

  WHAT DOES ALL THIS MEAN?   FF means this is the file to look in, to match 
values entered into this column: CUTS.DBF
[# ARE DIVIDERS], and the next entry is the column to look in for an exact 
match: CUTNO.

So, to enter or edit any value in this table, it must relate exactly to one 
record in the table CUTS, matching a value in the column CUTNO of that table.

&& tells when the lookup list is done; SO all the next entries until && are 
other columns to be looked up and reported back on the status line of the 
editor, when any match is found.
All the info found after && is allowed key characters which can be used in 
this field, CUTNO in CUTSOUND.DBF

So, in the editor, you type 01231006, hit enter and the editor looks all 
through the other table, CUTS.DBF, and if it doesn't find this key value in 
column CUTNO of a record, it will not let you leave the field, with a beep 
and warning message.

If it DOES find the key value 01231006 in CUTS.DBF, it accepts the entry, 
and moves to the next field and displays the date, time and country found 
in that cut, on the bottom line, as : "FOUND:  20010123 1023 Panama".  So, 
you not only found A record that agrees, you can check your notes to see if 
it is the CORRECT record you meant to match.

The ALLOWED for the SPECIES table lookups, would be:

FFSPECIES.DBF#GSCODE#GENUS#SPECIES#ENGLISH&&ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Same with the GSCODE, you type in ZIMVIL and it looks in SPECIES.DBF, IN 
COLUMN GSCODE, and returns the line:
"FOUND:  Zimmerius vilissima Paltry Tyrannulet"

After each record entry, you are assured that this MANY-TO-MANY table is 
correct, and cannot be edited to orphan any record using the 
system.  "Programming" the DIC file takes only as long as to type the 
ALLOWED lines.  Errors are thus reduced by several thousand-fold.

I use the SELECT editor (limited to DOS, and the eight by three file names) 
and DIC creator every day, but there are many things they need to do to be 
generally useful that I will not have time to create and debug for at least 
four months.  Otherwise, I would share the software freely with anyone on 
this group in a moment.

Best,
Marty Michener
Hollis, NH




________________________________________________________________________
________________________________________________________________________

<Prev in Thread] Current Thread [Next in Thread>
Admin

The University of NSW School of Computer and Engineering takes no responsibility for the contents of this archive. It is purely a compilation of material sent by many people to the naturerecordists mailing list. It has not been checked for accuracy nor its content verified in any way. If you wish to get material removed from the archive or have other queries about the archive e-mail Andrew Taylor at this address: andrewt@cse.unsw.EDU.AU