NYSIIS Comparison

March 17, 2018    R    stata    recordlinkage   

The R package phonics includes a nysiis function, and that function has an option for using either the original or the modified version of the NYSIIS algorithm.

nysiis(word, maxCodeLen = 6, modified = FALSE)

What does this mean? I’m not sure and the help file isn’t super helpful:

The variable modified directs nysiis to use the modified method instead of the original.

The key question: which of the two versions mimics the nysiis function (package?) used in Stata? Let’s figure it out.

library(data.table)
library(stringr)
library(tidyverse)
library(phonics)
library(haven)

Read in a file with a bunch of words to sound-score with NYSIIS. There are a bunch of last names here: https://www2.census.gov/topics/genealogy/1990surnames/dist.all.last, thanks Stack Overflow!

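# Columns in the Census file: surname, percent share, cumulative percent, rank
# (so "freq" below is the cumulative share, not a count)
dt_raw <- "https://www2.census.gov/topics/genealogy/1990surnames/dist.all.last" %>%
  fread() %>%
  setnames(c("name", "popshare", "freq", "rank"))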

dt_raw %>% head()
##        name popshare  freq rank
## 1:    SMITH    1.006 1.006    1
## 2:  JOHNSON    0.810 1.816    2
## 3: WILLIAMS    0.699 2.515    3
## 4:    JONES    0.621 3.136    4
## 5:    BROWN    0.621 3.757    5
## 6:    DAVIS    0.480 4.237    6
dt_raw %>% tail()
##         name popshare   freq  rank
## 1:    AARHUS        0 90.483 88794
## 2:   AARDEMA        0 90.483 88795
## 3:    AARANT        0 90.483 88796
## 4:  AANDERUD        0 90.483 88797
## 5:    AALUND        0 90.483 88798
## 6: AALDERINK        0 90.483 88799

Let’s NYSIIS score these names, both modified and regular.

dt <- dt_raw %>%
  .[, nysiis := nysiis(name)] %>%
  .[, nysiis_mod := nysiis(name, modified = TRUE)]

How often do the NYSIIS codes agree and disagree?

dt %>%
  .[, nysiis_agree := (nysiis == nysiis_mod)] %>%
  .[, mean(nysiis_agree)]
## [1] 0.8773072

They agree 87.7% of the time, so this probably isn’t a huge problem either way. Which of the most popular names disagree?

dt[nysiis != nysiis_mod] %>% head()
##        name popshare   freq rank nysiis nysiis_mod nysiis_agree
## 1:   TAYLOR    0.311  5.623   10 TAYLAR      TALAR        FALSE
## 2:   THOMAS    0.311  6.245   12   THAN        TAN        FALSE
## 3:    WHITE    0.279  6.834   14   WHAT        WAT        FALSE
## 4: THOMPSON    0.269  7.651   17 THANPS     TANPSA        FALSE
## 5:    YOUNG    0.193 10.090   28   YANG        YNG        FALSE
## 6:   WRIGHT    0.189 10.662   31  WRAGT       WAGT        FALSE

Those are some pretty high-ranking names, so this isn’t just a problem for very rare names.

So what is going on in Stata? I don’t have it locally (and can’t run Stata from inside R anyway), so let’s log in to the NBER server and see what Stata is doing.


* Each line of the file is read into a single string variable, v1
import delimited https://www2.census.gov/topics/genealogy/1990surnames/dist.all.last

* The surname is everything before the first space
gen name = substr(v1, 1, strpos(v1, " ") - 1)

list v1 name in 1/6

nysiis name, gen(nysiis_stata)

list name nysiis_stata if inlist(name, "TAYLOR", "THOMAS", "WHITE", "THOMPSON", "YOUNG", "WRIGHT")

keep name nysiis_stata

save "/disk/bulkw/feigen/nysiis_stata.dta", replace

And now look at it in R.

nysiis_stata <- "../../static/post/nysiis_stata.dta" %>%
  read_dta() %>%
  as.data.table() %>%
  .[, nysiis_stata := nysiis_stata %>% str_to_upper()]

nysiis_stata[name %in% c("TAYLOR", "THOMAS", "WHITE", "THOMPSON", "YOUNG", "WRIGHT")]
##        name nysiis_stata
## 1:   TAYLOR       TAYLAR
## 2:   THOMAS          TAN
## 3:    WHITE          WAT
## 4: THOMPSON      TANPSAN
## 5:    YOUNG         YANG
## 6:   WRIGHT        WRAGT
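
To tally these six names by hand, the codes can be typed in from the two tables above and compared directly. This is a small base-R sketch: the data frame `six` and its column names are made up for illustration, and the values are copied from the output shown earlier rather than recomputed.

```r
# Codes copied from the R and Stata output above (not recomputed here).
six <- data.frame(
  name  = c("TAYLOR", "THOMAS", "WHITE", "THOMPSON", "YOUNG", "WRIGHT"),
  reg   = c("TAYLAR", "THAN",   "WHAT",  "THANPS",   "YANG",  "WRAGT"),
  mod   = c("TALAR",  "TAN",    "WAT",   "TANPSA",   "YNG",   "WAGT"),
  stata = c("TAYLAR", "TAN",    "WAT",   "TANPSAN",  "YANG",  "WRAGT"),
  stringsAsFactors = FALSE
)

# Label each name by which R variant (if any) reproduces the Stata code.
six$matches <- ifelse(six$reg == six$stata, "regular",
                 ifelse(six$mod == six$stata, "modified", "neither"))

table(six$matches)
```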

Uh-oh. Three match unmodified NYSIIS, two match modified NYSIIS, and THOMPSON is coded as TANPSAN, matching neither… Let’s do this comparison across all names.

dt_compare <-
  dt %>% 
  merge(nysiis_stata, by = "name") %>%
  .[, match_reg := (nysiis == nysiis_stata)] %>%
  .[, match_mod := (nysiis_mod == nysiis_stata)] %>%
  .[, match_none := (nysiis != nysiis_stata & nysiis_mod != nysiis_stata)]

dt_compare %>% 
  group_by(nysiis_agree, match_reg, match_mod, match_none) %>% 
  summarize(total = n()) %>% 
  ungroup() %>% 
  mutate(share = 100 * total / sum(total)) %>%
  arrange(desc(total))
## # A tibble: 5 x 6
##   nysiis_agree match_reg match_mod match_none total share
##   <lgl>        <lgl>     <lgl>     <lgl>      <int> <dbl>
## 1 TRUE         TRUE      TRUE      FALSE      55218 62.2 
## 2 TRUE         FALSE     FALSE     TRUE       22686 25.5 
## 3 FALSE        FALSE     FALSE     TRUE        4228  4.76
## 4 FALSE        TRUE      FALSE     FALSE       3846  4.33
## 5 FALSE        FALSE     TRUE      FALSE       2820  3.18
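
One possible contributor to the mismatches, offered as a guess rather than a finding: the R calls above use the default maxCodeLen = 6, while Stata’s code for THOMPSON, TANPSAN, has seven characters and starts with the modified R code TANPSA. The sketch below shows a length-insensitive comparison in base R. The helper `same_up_to()` is hypothetical, and whether the Stata command truncates at all is not something this post establishes.

```r
# Hypothetical helper: TRUE when the Stata code, cut to max_len characters,
# equals the R code. max_len = 6 mirrors the default maxCodeLen shown above.
same_up_to <- function(r_code, stata_code, max_len = 6) {
  substr(stata_code, 1, max_len) == r_code
}

# THOMPSON, with codes copied from the output above.
same_up_to("TANPSA", "TANPSAN")  # modified R code vs. Stata: TRUE
same_up_to("THANPS", "TANPSAN")  # regular R code vs. Stata: FALSE
```

Applied to the full table, the same idea would mean comparing nysiis and nysiis_mod against the first six characters of nysiis_stata, or rerunning nysiis() with a larger maxCodeLen, and checking how much of the 25.5% row goes away.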

To summarize:

  • 62.2% of names agree on all three NYSIIS codes, the two in R and the one in Stata
  • Another 25.5% agree within R but differ from Stata…
  • The remaining names, where the two R codes disagree, split three ways:
    • 4.8%: Stata matches neither R code
    • 4.3%: regular in R matches Stata
    • 3.2%: modified in R matches Stata
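
Adding up the rows of the table gives an overall answer to the original question. The counts below are copied from the output above, and the variable names are only for illustration.

```r
# Row counts copied from the summary table above.
total <- c(all_agree = 55218, r_agree_stata_differs = 22686,
           none = 4228, reg_only = 3846, mod_only = 2820)

n <- sum(total)  # number of names left after the merge

# Stata matches regular R NYSIIS: all three agree, or regular only.
share_reg <- 100 * (total[["all_agree"]] + total[["reg_only"]]) / n

# Stata matches modified R NYSIIS: all three agree, or modified only.
share_mod <- 100 * (total[["all_agree"]] + total[["mod_only"]]) / n

round(c(regular = share_reg, modified = share_mod), 1)
```

By this count the regular version reproduces the Stata code for about 66.5% of names and the modified version for about 65.4%, so with the default settings neither option is a close mimic of the Stata output.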