Adding Data to the Database

Written by Greg Michalec (greg@primate.net) - February, 2011

This is a brief outline of the steps involved in adding new data to the database, as well as some of the practices we've implemented and some things to watch out for.

Some Naming Conventions

  • Try to preserve the original source data as much as possible, even if it needs to be altered to fit the schema. Thus, always preserve the name and department fields, even after you have assigned person_ids and department_ids.
  • Fields that should not be used by researchers (fields whose contents are reproduced in another, related field, such as a name or department name) should be named with a '__' prefix. Thus, there are several fields in the dataset named '__faculty_name'.
  • Tables that are not meant for researcher use should have a '_' prefix, e.g. '_name_source_lookup'.

Data Import Procedure
First off, don't do development on the live db. Use the mimir_dev schema, or create a local copy of the database. Whatever schema you use, update it from the live dataset before you begin, so that you have the most recent data and don't overwrite anything when you merge back into the live db.
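Refreshing the dev schema from live can be done with a dump/reload; a sketch, assuming the live database is named minerva as in the dump commands later in this document:

```shell
# Dump the live database and reload it into the dev schema.
# Database names ('minerva', 'mimir_dev') and the snapshot filename
# follow the conventions used elsewhere in this doc.
mysqldump -u minerva_admin -p minerva > minerva_snapshot.sql
mysql -u minerva_admin -p mimir_dev < minerva_snapshot.sql
```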

Since most of the source data arrives in spreadsheet form, you will usually want to reorder/rename the columns in the spreadsheet to match what is in the existing database (assuming you are appending new data). You may also want to do some normalization at this point, such as making sure dollar amounts are integers, getting dates into 2010-02-27 (YYYY-MM-DD) format, or cleaning up strange data translation problems (extraneous tabs or other odd characters). It is also sometimes necessary to add columns to spreadsheets or to combine spreadsheets - e.g. if you have one spreadsheet for 2008 data and one for 2009, you need to add a year field and combine the two into one spreadsheet.
Once the spreadsheet is cleaned up, export it as a CSV file. You can then use the import_csv.pl script to load the data into a temporary table (see the data utilities page for details on this script).
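If you ever need to load a CSV without the script, MySQL's LOAD DATA can do the same job; a sketch, where the table and column names are purely illustrative and not from the real schema:

```sql
-- Create a temporary working table matching the cleaned spreadsheet.
CREATE TABLE tmp_new_grants (
  year INT,
  __faculty_name VARCHAR(255),
  department_name VARCHAR(255),
  amount INT
);

-- Load the exported CSV, skipping the header row.
LOAD DATA LOCAL INFILE 'new_grants.csv'
INTO TABLE tmp_new_grants
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```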

Generally, I do most of the work of cleaning up/normalizing data in temporary working tables, and then merge the finished data into the real tables as a last step. This protects the existing data from getting messed up.
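Once a working table is clean, the final merge is typically just an insert-select; a sketch with hypothetical table and column names:

```sql
-- Append the cleaned, normalized rows from the working table
-- into the real destination table. 'grants' and 'tmp_new_grants'
-- are illustrative names, not the real schema.
INSERT INTO grants (year, __faculty_name, person_id, department_id, amount)
SELECT year, __faculty_name, person_id, department_id, amount
FROM tmp_new_grants;
```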

Much of the data comes in a non-normalized format, with values duplicated across rows. It's often necessary to split a single spreadsheet source into several related tables. The general rule is that there should be only one place in the data that refers to a specific entity.
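For example, if each source row repeats the grant title alongside every investigator, you might split the source into a grants table and a grant_investigators table; a sketch, with all table and column names hypothetical:

```sql
-- One row per grant: collapse the duplicated grant-level columns.
INSERT INTO grants (grant_no, title, year)
SELECT DISTINCT grant_no, title, year
FROM tmp_new_grants;

-- One row per (grant, person): keep only the relationship.
INSERT INTO grant_investigators (grant_no, person_id)
SELECT grant_no, person_id
FROM tmp_new_grants;
```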

To normalize department names, run an update that joins against the aliases table:
update new_table a join departments_aliases b on a.department_name = b.alias set a.department_id = b.department_id;
You should then look at any departments that did not get matched and try to resolve them into existing departments. If they seem to be part of an existing department, add them to the departments_aliases table; otherwise, create a new department in the departments table, then add them to the departments_aliases table.
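Finding the unmatched departments is a simple query, assuming department_id was left NULL by the update; the id and name in the insert are illustrative:

```sql
-- List department names that didn't match any alias.
SELECT DISTINCT department_name
FROM new_table
WHERE department_id IS NULL;

-- If a name is just a variant of an existing department, register the alias
-- so future imports match automatically (id/name here are made up).
INSERT INTO departments_aliases (department_id, alias)
VALUES (42, 'Dept. of Sociology');
```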

Normalizing person names is usually the most tedious part of the process (although maybe not anymore, now that I've automated much of it). For details on how to do this, refer to the Name Matching Procedure document. It's pretty crucial to the integrity of the overall data that this procedure is followed. The whole point of this project is to track collaboration, and you can't do that if the database thinks that 'Dan McFarland' is a different person than 'Daniel McFarland'.

Once the names and departments are normalized, and everything is in the proper format, you can then merge your temporary table into the destination tables (if applicable). Finally, you can refresh the data on the live database with the data in the dev database. You can do this on a table-by-table basis, but I usually just dump the entire dev db (using mysqldump) and load it into the live db, like this:
mysqldump -u minerva_admin -p mimir_dev > mimir.sql
mysql -u minerva_admin -p minerva < mimir.sql

If you do this, make sure to first delete any temporary tables you have created. If you are removing any tables, you'll need to manually delete them from the live db.

New Data Tables and Management Scripts
For any new data sources, there are some files that must be updated for the database management utilities to work correctly. (These files are in the /utils/db_management/ directory of the SVN checkout - see the data management utilities page for more details.) If you don't follow these steps, it's very likely that the person and department ids in your new data will get out of sync with the rest of the database over time.

  • update_name_source_lookup.pl - @qs array - You will need to add a query to this array that inserts data from your new table into the _name_source_lookup table. This should just be a matter of writing a proper insert query - you can use the existing queries as examples. The purpose of this table is to get data about individuals from each source into a normalized format that can be combined across all the sources. Thus, for each row, you want to capture as much descriptive information about the person that row refers to as exists in the source it comes from. Note the $common_values variable, which contains the fields "id, name, person_id, sourcefield, sourcetable, sourceid" - these should always be inserted, no matter what. The other fields may or may not appear, depending on the data source. Here's a description of the values in _name_source_lookup:
    • id - the autoincrement row id - should always be inserted as null
    • name - the name from the source data - e.g. 'Mcfarland, Dan'
    • person_id - the person id associated with the name - e.g. 234
    • sourcefield - the name of the field that contains the name data - e.g. __project_manager_name
    • sourcetable - the name of the table that this data is coming from - e.g. pta
    • sourceid - the primary key of the row that the data is coming from - e.g. 2352
    • emplid - the employee id
    • dept_id - a department id - if multiple department_ids exist for the data, choose the one that seems most primary
    • gender - the gender of the person (M/F)
    • fiscal_year - any year field associated with the data row
    • degreeyear - the year the person in question received their degree
    • school_id - the school_id associated with the row
    • ethnicity - the ethnicity code (or text) associated with the person - note that codes are expanded into text by the update script
    • hire_date - the date the person was hired
    • appointment_date - the date the person was appointed
    • birth_year - the approximate year the person was born - can often be calculated as year minus age
    • id_text - any other useful descriptive text associated with the row - e.g. patent titles, dissertation titles, grant titles, etc.
    • duplicate_name - a binary flag noting whether this name exists across multiple person_ids - set by the update script
    • normalized_name - a normalized version of the name, useful in aiding name matching - set by the update script
    • duplicate_normal_name - a binary flag noting whether this normalized name exists across multiple person_ids - set by the update script

    Note that it is okay to join in other tables if it is relevant to the particular data source. For example, it is useful to know the department of a dissertation that a faculty member read, so the dissertation_members query joins in the dissertations table.
    Also, some tables may produce more than one entry in _name_source_lookup per row - for example, if a table contains information about both a student and a faculty member (although we want to move away from this when designing tables). There should be one entry for each reference to a person in each row of a table.
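    A query added to @qs might look like the following sketch; everything except the _name_source_lookup column names is hypothetical, including the 'new_grants' source table:

```sql
-- Pull one normalized row per person reference out of a hypothetical
-- 'new_grants' source table. Fields this source lacks are simply omitted;
-- id is always inserted as NULL (autoincrement).
INSERT INTO _name_source_lookup
  (id, name, person_id, sourcefield, sourcetable, sourceid,
   dept_id, fiscal_year, id_text)
SELECT
  NULL, __faculty_name, person_id, '__faculty_name', 'new_grants', grant_id,
  department_id, year, title
FROM new_grants;
```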

  • common.pl - $peoplefields array - This is an array of arrays that contains information about the structure of tables that contain person_ids. There should be an entry for every person_id field in the dataset (except those in autogenerated tables). This array is used by the merge_people.pl and update_stanford_community.pl scripts. The format of an entry in the array looks like this:
    [ 'faculty_courses', 'person_id', '__faculty_name', 'faculty_course_id', 'year'],
    The fields in the array are as follows:
    1. Name of the table - 'faculty_courses'
    2. Name of the person_id field - 'person_id'
    3. Name of the name field (if it exists, else set to '') - '__faculty_name'
    4. Name of the primary key (row id) field - 'faculty_course_id'
    5. Name of the year field (if it exists, else set to '') - 'year'
  • common.pl - $departmentfields array - Also an array of arrays, this is similar to $peoplefields, but for departments. Again, there should be one entry for every department_id field in the dataset (excluding autogenerated tables). This array is used by the merge_department_ids script. The format of an entry in the array looks like:
    [ 'courtesy_appointments', 'department_id', '__department_name', 'courtesy_appointment_id']
    The fields are as follows:
    1. Name of the table - 'courtesy_appointments'
    2. Name of the department_id field - 'department_id'
    3. Name of the field that has the name of the department (if it exists, else set to '') - '__department_name'
    4. Name of the primary key (row id) field - 'courtesy_appointment_id'