Census Linking

I do a lot of census linking. I will post code and instructions and readmes about the various methods here (more coming soon, I promise).

For more on census linking methods, see my paper and my paper with Abramitzky, Boustan, Eriksson, and Perez. For code implementing other methods, see Abramitzky’s website. For pre-made federal census to federal census links (with histids that will work with IPUMS data), see the Census Linking Project.

Census Linking Instructions

Linking the Iowa 1915 Census to the Federal 1940 Census

NB: These were the instructions given to census linkers building training data for my Iowa 1915-1940 links for this paper.

I have searched for every boy in the 1915 Iowa Census in the 1940 Federal Census, 25 years later. I’ve kept the possible or plausible matches (within pretty wide ranges of string distance on first and last name, +/- 3 years of birth, and blocked on state of birth and sex).
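As a rough illustration of that candidate search, here is a minimal Python sketch. It is not the code used for these links: the Record fields, the 0.7 similarity cutoff, and the use of difflib's SequenceMatcher as the string distance are all assumptions made for the example. Only the blocking on state of birth and sex and the +/- 3 year window come from the description above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout, for illustration only.
    first: str
    last: str
    birth_year: int
    birth_place: str
    sex: str


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for whatever string distance is used."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates(target, pool, min_sim=0.7, max_year_gap=3):
    """Keep plausible matches for `target` from `pool`."""
    kept = []
    for rec in pool:
        # Blocking: birth place and sex must agree exactly.
        if rec.birth_place != target.birth_place or rec.sex != target.sex:
            continue
        # Birth year may differ by a few years in either direction.
        if abs(rec.birth_year - target.birth_year) > max_year_gap:
            continue
        # Names only need to be reasonably close (assumed cutoff).
        if similarity(rec.first, target.first) < min_sim:
            continue
        if similarity(rec.last, target.last) < min_sim:
            continue
        kept.append(rec)
    return kept


target = Record("Timothy", "Mozgov", 1905, "Iowa", "M")
pool = [
    Record("Timofy", "Mozgav", 1906, "Iowa", "M"),   # close names, close year
    Record("Timothy", "Mozgov", 1905, "Ohio", "M"),  # wrong birth place
    Record("Timothy", "Mozgov", 1910, "Iowa", "M"),  # birth year too far off
    Record("Walter", "Jones", 1905, "Iowa", "M"),    # names too different
]
plausible = candidates(target, pool)
```

In this toy pool only the first record survives. The deliberately wide ranges mean the real candidate lists are much longer, which is why a human (and later a model) has to choose among them.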

Now, for the fun part. We need to produce training data to build the linking algorithm with. Eventually, we’ll use the training data to train the ML linking model I’ve developed in the past. The model will use features like string distances, sound similarities, specific character matches, name commonness, age distance, etc. But how does the model know (or learn) how much weight to put on those various features? Well, we need training data. When a human decides if a record is a link or not, the human is implicitly weighing various features or factors. We want to make those weights explicit.
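To make the idea of features concrete, here is a small Python sketch that turns one pair of records into a feature dictionary. It is only an assumption about what such features could look like: the dictionary keys, the simplified Soundex function, the difflib-based similarity, and the last_name_counts lookup are hypothetical names chosen for the example, not the features of the actual model.

```python
from difflib import SequenceMatcher

# Letter groups used by standard American Soundex.
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    code = name[0].upper()
    prev = _SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def sim(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(a: dict, b: dict, last_name_counts: dict) -> dict:
    """Features for one (1915 record, 1940 candidate) pair."""
    return {
        "first_sim": sim(a["first"], b["first"]),
        "last_sim": sim(a["last"], b["last"]),
        "last_sounds_alike": float(soundex(a["last"]) == soundex(b["last"])),
        "first_initial_match": float(a["first"][:1].lower() == b["first"][:1].lower()),
        "last_name_commonness": last_name_counts.get(a["last"].lower(), 0),
        "birth_year_gap": abs(a["birth_year"] - b["birth_year"]),
    }


rec_1915 = {"first": "Timothy", "last": "Mozgov", "birth_year": 1905}
rec_1940 = {"first": "Timofy", "last": "Mozgav", "birth_year": 1906}
feats = pair_features(rec_1915, rec_1940, {"mozgov": 3, "smith": 50000})
```

A classifier trained on the human-labeled pairs would then learn how much weight each of these numbers deserves, which is exactly the implicit weighing the human linkers are doing.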

[omitting details on logging into the NBER server and linking platform]

You will be presented with the name of a person in the 1915 census, and underneath you will see a list of potential matches in the 1940 census. You’ll see names, ages in the censuses, and state or country of birth. Your job is to look at the options offered to you and make your best choice as to which name in the 1940 census is our person in the 1915 census file. The set of possible matches is roughly ordered based on some features of the records to make linking easier (that is, the best match will often be first or second), but the ordering is not perfect and there is random noise in the ranking. Sometimes the correct or best match will be farther down the list, and in special cases it might be very far down.

You can do the linking by looking at how the first and last names, as well as middle initial and age, match up across datasets. These things do not have to be exact matches. Timothy Mozgov in one dataset could easily be Timofy Mozgav in another, for example, as spelling mistakes are common. Sometimes Nomar Garciaparra is written as N Garciaparra or N Garciapara in the census, etc. Humans are good at recognizing these kinds of clerical errors, which is why we have human coders train the computer, rather than force the computer to make exact matches or fuzzy string matches on its own.

The goal is to select whatever record you think is the correct link based on your best judgment. There are no hard and fast rules. Use all the real experience you have with looking at people’s names entered with some error or some natural variation, and decide whether you think it is likely that the two records are the same person. Think about which letters might be confused for others (“m” in cursive looks like “rn”, “o” and “a” could be confused, “F” and “T” can be written similarly, etc). When making a determination, don’t just compare the record we’re looking for to each individual candidate match; also think about the other candidates. You aren’t deciding just whether the records match but whether the given pair is a sufficiently good match compared to all the other candidates. That is, if you are trying to match a relatively rare name and only one candidate is even plausible, maybe you are looser on rules for year of birth or name. If you are trying to match a more common name but you have agreement on middle initial or middle name and year of birth, maybe that is a strong signal. The availability of middle initials should help as well, but lots of records won’t have them (or I might give an initial in 1915 but not in 1940).

When you decide two records match, click the blue “select” button next to the record. If there are no links that work, hit the red “no matches” button at the bottom. If there are multiple plausible links (you are looking for John Smith born in 1910 and there are 10 of them, or you are looking for John Smith and the results are John A Smith and John B Smith and you have no way to distinguish), hit the orange “multiple matches” button. (And if you realize you clicked the wrong thing on the previous page, you can undo with the light blue button, but try not to make bad clicks.)
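The three buttons correspond to three outcomes, and the choice depends on how the best candidate compares with the rest, not only on how good it looks alone. The Python sketch below is just one way to write that logic down. The score values, the 0.8 and 0.1 thresholds, and the decide function are all hypothetical. Human linkers should use judgment, not a fixed rule like this.

```python
def decide(scores, min_score=0.8, min_margin=0.1):
    """Map candidate match scores (higher is better) to one of three outcomes.

    Returns (outcome, index_of_selected_candidate_or_None).
    The thresholds are illustrative assumptions, not values from the project.
    """
    if not scores:
        return ("no matches", None)
    # Rank candidate positions from best score to worst.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = ranked[0]
    if scores[best] < min_score:
        # Nobody is good enough.
        return ("no matches", None)
    if len(ranked) > 1 and scores[best] - scores[ranked[1]] < min_margin:
        # Two or more candidates are too close to tell apart.
        return ("multiple matches", None)
    return ("select", best)


clear_winner = decide([0.40, 0.95])       # one strong candidate
nobody = decide([0.30, 0.20])             # no plausible candidate
two_john_smiths = decide([0.92, 0.90])    # indistinguishable candidates
```

The margin check is what captures the "John A Smith versus John B Smith" case: both candidates look good, so neither can be selected with confidence.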

A few more reasons why records do not match exactly:

  • All fields could be entered with some error
    • The person answering the census forgets or makes a mistake or guesses wrong (often one person answers for all household members)
    • The enumerator writes down the answer wrong
    • The transcriber types the answers wrong (cursive is hard to read)
  • Names
    • Nicknames or formal names are used inconsistently (Thomas or Tom or Tommy; William or Wm or Willy or Bill)
    • Also, names could become Americanized over time (Josef becomes Joseph or Joe, etc)
    • Last names are mispronounced or misspelled, especially foreign names
    • When testing matches, sometimes saying the names aloud will help: do they sound a lot alike?
  • Year of Birth
    • These might be entered with error (numbers are surprisingly easy to screw up)
    • People round their ages (to 0s or 5s often)
    • Censuses are taken at different points in the year and the census asks age on the census day NOT year of birth. And people are great at screwing up simple algorithms.
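The last point can be shown with a little arithmetic. Because the census records age on census day, the same person can imply two different birth years in two censuses even when nothing was recorded wrong. In the Python sketch below, the census days and the birth date are illustrative assumptions (April 1 is used for 1940 and January 1 for 1915 purely for the example), and the function names are hypothetical.

```python
from datetime import date


def age_on(census_day: date, birth: date) -> int:
    """Age in completed years on census day."""
    had_birthday = (census_day.month, census_day.day) >= (birth.month, birth.day)
    return census_day.year - birth.year - (0 if had_birthday else 1)


def possible_birth_years(census_day: date, age: int) -> tuple:
    """Birth years consistent with a reported age on census day."""
    # Someone aged `age` either has or has not yet had this year's birthday.
    return (census_day.year - age - 1, census_day.year - age)


birth = date(1905, 2, 10)          # hypothetical person
day_1915 = date(1915, 1, 1)        # assumed census day, for illustration
day_1940 = date(1940, 4, 1)        # assumed census day, for illustration

age_1915 = age_on(day_1915, birth)             # birthday not yet reached
age_1940 = age_on(day_1940, birth)             # birthday already passed
naive_1915 = day_1915.year - age_1915          # census year minus age
naive_1940 = day_1940.year - age_1940
```

Here the naive "census year minus age" calculation gives 1906 from the first census and 1905 from the second, a one-year gap with no error by anyone. This is one reason a difference of a year or two in birth year should not rule out a match.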