Written by Greg Michalec (email@example.com) - February, 2011
This is an overview of some of the utilities that have been developed to help maintain the MIMIR dataset.
All of these files are in the mimir SVN repository, under 'utils/db_management/'. You can check them out like this:
svn co file:///afs/ir/group/minerva/mimir_svn/util/db_management/ ./db_management/
Some of the scripts need to be run from a web browser, and thus should be in a cgi-enabled web directory. See the Name Matching Procedure page for details.
The files are all written in perl (except for a simple shell script that runs several perl scripts in batch). This is primarily because the best name matching library i could find was written in perl. Apologies if you feel like you've been thrown back to 1997.
The scripts pull their database configuration from db_info.pl. You will need to set this up with the database password for any of the scripts to work. Do not commit db_info.pl with the database password into SVN. We need to keep the db admin password secure.
You should never run these scripts against the live data. Do your work in a local copy of the database, or in the mimir_dev schema.
The scripts rely on two external perl modules, Lingua::EN::NameParse and Lingua::EN::MatchName. Versions of these modules that are compiled to run on the corn.stanford.edu servers are committed into the 'utils/db_management/lib/' svn directory and will be found by the scripts. If you run these scripts on a different machine, you may need to replace these libraries with versions that are compiled for your architecture.
This file contains some variables specific to the name matching procedure (see that doc for details).
This file contains local database configuration variables - $dbhost, $dbuser, $dbschema, and $dbpassword. These should be self-explanatory. $dbhost defaults to 'choroid.stanford.edu' if it isn't set. You should probably never set dbschema to 'minerva' - work in a development schema. Also, this file should never be committed back into svn with the database password.
- Name Matching Utilities
The following are utilities used by the name matching procedure. Their use is documented on the Name Matching Procedure page.
This is a useful utility to import a CSV file into a mysql table. The table will be named whatever the CSV file was named, and it will name the fields based on the values of the first row of the file. All of the fields in the table will be set to varchar(300), unless you put the fieldname in the $clobs array - these will be set to text(60000). Warnings should be printed if data is truncated. Depending on your CSV formatting, you may have to tweak some it the configuration values when Text::CSV_XS->new() is called. See the documentation for the Text::CSV perl module for details.
This utility is used to merge two departments into one. In essence, it replaces all the instances of one department_id with another. It relies on the $departmentfields variable in common.pl. It wont work unless this variable has been updated. (see 'Importing New Data' for details on this array). It's usage is as follows:
perl merge_departments.pl NEW_DEPARTMENT_ID DEPARTMENT_ID_TO_BE_REPLACED
This utility aids in merging two person_ids into one. It is a web based script. First, search for a name - partial matches are accepted, and the script will replace ' ' with '%' in the search query (the match in the query also begins and ends with '%'). It searches all the names in _name_source_lookup, and presents summary data for each unique person_id who has a name that matches the query string. Use the radio button to pick the id that will be preserved, then use the checkboxes to select all the person_ids to be merged into the preserved person_id. When you submit, The script will replace all the instances of the checkboxed person_ids with the radio-button person_id. It uses the $peoplefields variable in common.pl. It wont work unless this variable has been updated. See 'Importing New Data' for details. The script will also refresh the summary data in the people table for the preserved person_id. This script is also very useful as a front end to the dataset, because it allows you to easily get summary data based on an open-ended name search. This can be quite helpful when doing name matching, for example.
This utility is for splitting together people who have been incorrectly combined into a single person id. You search for a single person id, and the utility displays all the entries in the _name_source_lookup table for that person_id. This should make it relatively easy to review these entries and identify which belong with that person_id, and which do not. Additionally, at the top of the page is an iframe that links to merge_people_web.pl, which serves as an easy way to search on a name and get summary data on any person_id with that name. Thus, if you are looking at a person_id with 5 entries with the name 'McFarland, Dan' and 5 entries with the name 'Claus, Santa', you can use the iframe to lookup each name to see if there are other records that match that name.
When you determine a particular record should be separated from the current person_id, you can flag it by checking the checkbox on that row. You can check as many rows as necessary. If the rows need to be reassigned to an existing person_id (i.e. they match against another person record in the database), you can enter that person_id into the 'Reassign checked rows to person id' box at the top. If the records need to be split off into a brand new person entity, or you can select the button next to 'Create brand new person_id', and this will create a new record in the people table and assign the records to its new id. When you submit the form by pressing the 'split ids' button, the records in the source tables represented by the checked rows will be reassigned person_ids. Additionally, the rows will be update in _name_source_lookup.
This utility deletes the _name_source_lookup table and regenerates it. It first simply runs a series of SQL queries in batch (one for each instance of a person name in the data - see 'Importing New Data' for details). It also replaces ethnicity codes with their string equivalents, sets the 'normalized name' field, and sets the duplicate_name and duplicate_normal_name fields in _name_source_lookup. This should be run every time data is added or changed. The script takes no arguments.
This utility runs through each person_id in the people table and updates their metadata in the people table based on the data in the _name_source_lookup table. For the fields 'name', 'emplid', 'gender', 'degreeyear', 'ethnicity', 'hire_date', 'appointment_date', and 'birth_year', it will take the most common value for that person_id. It also selects the longest (most characters) name that the person_id uses for long_name, and calculates the lastfirstmiddle and lastfirst fields based on the most common name. This should be run every time data is added or changed. The script optionally accepts a person_id argument (to only update that person_id). Otherwise, it updates all person_ids by default.
This utility deletes the stanford_community table and refreshes it with data from _name_source_lookup. It creates a row for each year a person_id appears in each dataset that has year values, based on the $peoplefields variable in common.pl. This should be run every time data is added or changed. The script takes no arguments.
This is a convenience utility that runs update_name_source_lookup.pl, update_stanford_community.pl, update_people_data.pl and generate_lookup_names_scores.pl in batch. This makes it easier to run these four utilities each time data is added or changed. Note this is a bash script, not perl, and thus is run like this:
This utility goes through each person_id in _name_source_lookup, and then for each unique name for that person, generates a name match score versus all the other unique names for that person. It then takes the average of those scores and stores them in the _name_source_lookup table. This is to aid in identifying cases where two different people have been incorrectly merged together. Cases where names are very different will result in a low average_name_score.
This file is sourced by all the others. It contains a number of configuration variables and shared functions that are uses by the other utilities. This script sources local.pl (which it expects to be in the same directory or one directory level above. It appends the './lib' directory to the library path. It also includes the Lingua::EN::NameParse perl module, so this needs to be in the path. I'll attempt to briefly explain each of these:
- $db - the database connection
- $peoplefields - see 'Importing New Data'
- $departmentfields - see 'Importing New Data'
- dbconnect() - function to initialize the database connection (stored in the variable $db)
- merge_person_ids() - function to merge two person_ids. See merge_people.pl above.
- merge_department_ids() - function to merge two department_ids. See merge_departments above.
- lookup_person_info() - function that runs a series of queries to generate summary data about a given person_id. This is deprecated - use generate_summary_html() now
- refresh_person_data() - function that generates summary information about a person_id from _name_source_lookup and stores in the people table. See update_people_data.pl above
- generate_summary_html() - function to generate a series of HTML table cells containing summary information about a given person_id. Tries to include as much information as possible, including counts for each time a unique name occurs, etc. It also pulls in the name_odds table to give statistics on the surname (from the 2000 US census). It returns a string containing several HTML table cells. This is used by the merge_people.pl, match_duplicates_web.pl, and match_verify_score_web.pl scripts.
- display_summary_header() - function to return a series of HTML table headers that contain header titles corresponding to the cells returned by generate_summary_html(). It returns a string containing several HTML table headers. This is used by the merge_people.pl, match_duplicates_web.pl, and match_verify_score_web.pl scripts.
- query_to_html() - function that formats the result of an SQL query into an html table. Header fields are named based on the query's field names. Returns a string containing an HTML table.
- normalize_name() - function that normalizes a given name string. It uses the Name::Parse perl module for the initial normalization, converts to uppercase, and removes any '.' in the field. The normalized name is used for name matching. (See the Name Matching Procedure document for details.) Returns a normalized name string