Name Matching Procedure (as of 2/2/11)

Written by Greg Michalec ( - February, 2011

This is the documentation on how to use the name-matching utilities to help match/disambiguate person entities (assign person_ids) when importing new data into the database.

The name matching procedure automates the process of matching person entities as much as possible. For portions that require human hand-matching, web-based user interfaces are used, allowing the tasks to be distributed across members of the mimir group. Note that this process assumes that every unique name in the dataset to be matched refers to the same person, i.e. it assumes there are no two people who share the same name who do not already have a person_id. Here is a quick rundown of the steps involved (more details explanations will follow).

  • Run
  • Set up configuration variables
  • Run
  • Run
  • Run in a web browser
  • Run
  • Run in a web browser
  • Run again

First off, you need to make sure you have personal CGI enabled in your stanford home dir: Then you can cd into you ~/cgi-bin/ dir and checkout the db_management code:
svn co file:///afs/ir/group/minerva/mimir_svn/util/db_management/ ./db_management/
You should then be able to browse to and see these instructions. NOTE: There is an .htacess file in the db_management dir which will restrict web access to this directory to only members of the minerva AFS group

More details on can be found on the db management utilities page, but it should be enough to know that this script polls the entire database and sets up the people and _name_source_lookup tables with all the relevant data. This should be run anytime new data is added to the database, or data is changed.

Set up configuration variables
There is a file called '' in the directory that is sourced in all the programs. It contains the database connection information, as well as information about the data source you are matching. Do not commit with the database password into SVN. We need to keep the db admin password secure.

  • $email: email address included in web pages for people to notify when matching processes are finished
  • $dbuser, $dbschema, $dbpassword: self-explanatory. You should probably never set dbschema to 'minerva'
  • $table: The source table that contains the data you are trying to assign person_ids to
  • $name_field: The name of the field in $table that contains the name information
  • $employee_id_field: The name of the field in $table that contains employeeids. If employee_ids are not present in the data, set this to ''
  • $person_id_field: The name of the field in $table where the person_id will be assigned. Is 'person_id' for most tables.
  • $id_query: A query that will pull any information from $table (and possibly other tables) to help identify the person record. This will be different for each dataset being matched, but will often contain things like department names, employee_ids, years, dissertation_titles, etc. Anything that can be compared against other person entities to help a user verify a positive or negative match. This is used in and The query is looked up by name, so put a '?' where the name should go in the query. Example:
    select $name_field as name, group_concat(distinct __student_emplid) as empid, group_concat(distinct department_name) as dept, group_concat(distinct graduation_academic_year) as years, group_concat(distinct dissertation_title) as title from $table a left join departments using(department_id) where person_id is null and $name_field = ?

Many of the data sources contain mangled names, names with odd characters in them, names in "FirstName LastName" format, and names that are not really names at all, such as "Staff" or "". This script will help put the names into the format expected by the scripts, and resolve the most common problems.

  • The script first creates a column called ___clean_name to store the updated version of the name (this script will not change source data by itself)
  • It then automatically assigns a person_id of 0 to any common incorrect names, such as '', ',', 'unknown', and 'Staff:'
  • Next, it fills in the __clean_name field with the value name field
  • It then modifies __clean_name to correct cases of 'LASTNAME,FIRSTNAME' (no space) and 'FIRSTNAME LASTNAME' (where the name does not contain any of a known list of name prefixes, such as 'van der' and 'de la')
  • When the script has finished, it will print out a series of instructions:
    • it will tell you whether it was able to get all the __clean_name fields into an acceptable format, and a query to find the cases that still need fixing
    • It will print out a query to run that will allow you to compare the original name field to the __clean_name field to make sure the changes are acceptable
    • Once you have verified the changes, you can run the query it prints out to overwrite the values of the field field with ___clean_name, as well as the query to delete the __clean_name field

This script begins the initial round of matching and assigning person_ids.

  • It first verifies if the names are in the proper format, and won't run if they aren't. If you have names that you can't get into the right format, but need to proceed with the matching, assign those names a person_id, either 0 or something else that you can come back to.
  • It the adds a temporary column to the table called ___normalized_name. It populates this column with a 'normalized' version of the name - the names are run through the Lingua::EN::NameParse perl module, are converted to uppercase, and have any '.' stripped from them. NameParse will be reasonably smart about identifying the last names in names like 'Lief Van Klempt' or 'Donald de Lillo'.
  • The script then attempts to match the names to _name_source_lookup on employee_id, and assigned person_ids accordingly
  • It then assigns person_ids based on exact matches to names in _name_source_lookup where duplicate_name = 0
  • It then assigns person_ids based on exact matches of the normalized version of names to the normalized names in _name_source_lookup where duplicate_normal_name = 0

Run in a web browser allows us to assign a person_id to rows that exact match on name to more than one person. At this writing, there are about 250 names that are used by more than one person. I.E. there are two distinct people in the dataset who use the name 'Cohen, Jeremy'. This script will present a webpage that will display the person record being matched (as defined by the $id_query) variable, as well as summary data on all the records from _name_source_lookup that match that record either on name or normalized name. The summary data aggregates information across all the tables - i.e. all the departments a person is associated with. These are usually followed by the number of times each occurs in the data. The lastname odds field contains data that lists the number of people in the 2000 United States Census that have that last name, and the probability that the name will occur in 100,000 people. This helps gauge the likelihood that two different people have this same last name.

The interface allows you to select one of the optional matches as being the correct match. Additionally, you can select 'none match', or 'not enough' info.

  • Selecting a person record will set the person_id for the source record being matched to the person_id for the chosen record.
  • Selecting 'none match' will set the person_id as 999999999, which will be reset to null in the step.
  • Selecting 'not sure' will set the person_ids to 888888888.

Once the matching is completed, if any matches were marked 'not sure', you will get a link offering to recheck them (resetting their person_ids to null)

This is the meat of the name matching process. To prep, the script resets any person_ids that have been set to 888888888 or 999999999 from the previous script to null. It also drops the temporary ___normalized_name field.
It then creates a temporary table called _match_$table. It then compares every name from the source data that has not been matched to every unique name in _name_source_lookup and generates a match score based on the 'closeness' of the two names to one another, which is stored in the temporary table. The script uses a modified version of the perl module Lingue::EN::MatchNames ( to generate the scores. This module checks for:

  • inconsistent case (MacHenry = Machenry = MACHENRY)
  • inconsistent symbols (O'Brien = Obrien = O BRIEN)
  • misspellings (Grene = Green)
  • last name hyphenation (Smith-Curry = Curry)
  • similar phonetics (Hanson = Hansen)
  • nicknames (Midge = Peggy = Margaret)
  • extraneous initials (H. Ross = Ross)
  • extraneous suffixes (Reed, Jr. = Reed II = Reed)

Note that depending on the number of remaining matches, this script can take a long time to run. You may want to open a screen session first.

Run in a web browser
This will present you with a similar interface as was seen in The summary data from $id_query is seen at the top, and all the person records that have a match score > 60 will be seen below. The person records are color-coded by their match score - low scores are dark red. scores of 100 are white. Again, options of 'not sure' and 'none' apply. As each match is served to a user for verification, it's match field is set to '5' - this is to make sure that if two or more people are matching simultaneously, they aren't duplicating work. A positive match is set to '1', not sure is '3', none is '2'. Note that the form is automatically focused, so you can quickly run through matches using your arrow keys to select the match and pressing enter to submit - no need to use that pesky mouse!
Once matching is complete, you will presented with an option to recheck the 'not sure' matches. When there are no 'not sure' matches left, you will be presented with an option to merge in the results of the name matching. When you click this option, all the records that have had verified positive matches will have their person_ids set. Then, any records that still have a person_id of null will be inserted into the people table, creating new person_ids. These new person_ids will then be merged back into the source table. Congratulations!

You changed the data, so you need to refresh the meta tables.