The R package phonics includes a nysiis function, but the function has an option for using either the original or the modified version of the NYSIIS algorithm:
nysiis(word, maxCodeLen = 6, modified = FALSE)
What does this mean? I’m not sure and the help file isn’t super helpful:
The variable modified directs nysiis to use the modified method instead of the original.
The key question is which version mimics the nysiis function (package?) used in Stata. Let’s figure it out.
library(data.table)
library(stringr)
library(tidyverse)
library(phonics)
library(haven)
Read in a file with a bunch of words to sound-score with NYSIIS. There are a bunch of last names here: https://www2.census.gov/topics/genealogy/1990surnames/dist.all.last, thanks Stack Overflow!
dt_raw <- "https://www2.census.gov/topics/genealogy/1990surnames/dist.all.last" %>%
fread() %>%
setnames(c("name", "popshare", "freq", "rank"))
dt_raw %>% head()
## name popshare freq rank
## 1: SMITH 1.006 1.006 1
## 2: JOHNSON 0.810 1.816 2
## 3: WILLIAMS 0.699 2.515 3
## 4: JONES 0.621 3.136 4
## 5: BROWN 0.621 3.757 5
## 6: DAVIS 0.480 4.237 6
dt_raw %>% tail()
## name popshare freq rank
## 1: AARHUS 0 90.483 88794
## 2: AARDEMA 0 90.483 88795
## 3: AARANT 0 90.483 88796
## 4: AANDERUD 0 90.483 88797
## 5: AALUND 0 90.483 88798
## 6: AALDERINK 0 90.483 88799
Let’s NYSIIS score these names, both modified and regular.
dt <- dt_raw %>%
.[, nysiis := nysiis(name)] %>%
.[, nysiis_mod := nysiis(name, modified = TRUE)]
How often do the NYSIIS codes agree and disagree?
dt %>%
.[, nysiis_agree := (nysiis == nysiis_mod)] %>%
.[, mean(nysiis_agree)]
## [1] 0.8773072
They agree 87.7% of the time, so this probably isn’t a huge problem either way. Which are the most popular names with disagreements?
dt[nysiis != nysiis_mod] %>% head()
## name popshare freq rank nysiis nysiis_mod nysiis_agree
## 1: TAYLOR 0.311 5.623 10 TAYLAR TALAR FALSE
## 2: THOMAS 0.311 6.245 12 THAN TAN FALSE
## 3: WHITE 0.279 6.834 14 WHAT WAT FALSE
## 4: THOMPSON 0.269 7.651 17 THANPS TANPSA FALSE
## 5: YOUNG 0.193 10.090 28 YANG YNG FALSE
## 6: WRIGHT 0.189 10.662 31 WRAGT WAGT FALSE
Those are some pretty high ranking names, so this isn’t just a very rare name problem.
So what is going on in Stata? Well, I don’t have it locally (and can’t run Stata from inside R anyway), so let’s log in to the NBER server and see what Stata is doing.
import delimited https://www2.census.gov/topics/genealogy/1990surnames/dist.all.last
gen name = substr(v1, 1, strpos(v1, " ") - 1)
list v1 name in 1/6
nysiis name, gen(nysiis_stata)
list name nysiis_stata if inlist(name, "TAYLOR", "THOMAS", "WHITE", "THOMPSON", "YOUNG", "WRIGHT")
keep name nysiis_stata
save "/disk/bulkw/feigen/nysiis_stata.dta", replace
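As an aside for readers following along in R only: the gen name line keeps everything before the first space of each raw line. A rough base-R equivalent is sketched below. The sample lines are typed in by hand from the head() output above, and their exact spacing is an assumption, not copied from the file.

```r
# Hand-typed sample lines; the spacing is assumed, the values come from
# the head() output shown earlier.
raw_lines <- c(
  "SMITH          1.006  1.006      1",
  "JOHNSON        0.810  1.816      2"
)

# Drop the first space and everything after it, similar in spirit to
# substr(v1, 1, strpos(v1, " ") - 1) in the Stata code.
first_token <- sub(" .*$", "", raw_lines)
print(first_token)
```

In the R half of this post the step is not needed, because fread() already splits the file into columns.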
And now look at it in R.
nysiis_stata <- "../../static/post/nysiis_stata.dta" %>%
read_dta() %>%
as.data.table() %>%
.[, nysiis_stata := nysiis_stata %>% str_to_upper()]
nysiis_stata[name %in% c("TAYLOR", "THOMAS", "WHITE", "THOMPSON", "YOUNG", "WRIGHT")]
## name nysiis_stata
## 1: TAYLOR TAYLAR
## 2: THOMAS TAN
## 3: WHITE WAT
## 4: THOMPSON TANPSAN
## 5: YOUNG YANG
## 6: WRIGHT WRAGT
Uh-oh. 3 match unmodified NYSIIS, 2 match modified NYSIIS, and THOMPSON is coded as tanpsan, matching neither… Let’s do this comparison across all names.
dt_compare <-
dt %>%
merge(nysiis_stata, by = "name") %>%
.[, match_reg := (nysiis == nysiis_stata)] %>%
.[, match_mod := (nysiis_mod == nysiis_stata)] %>%
.[, match_none := (nysiis != nysiis_stata & nysiis_mod != nysiis_stata)]
dt_compare %>%
group_by(nysiis_agree, match_reg, match_mod, match_none) %>%
summarize(total = n()) %>%
ungroup() %>%
mutate(share = 100 * total / sum(total)) %>%
arrange(desc(total))
## # A tibble: 5 x 6
## nysiis_agree match_reg match_mod match_none total share
## <lgl> <lgl> <lgl> <lgl> <int> <dbl>
## 1 TRUE TRUE TRUE FALSE 55218 62.2
## 2 TRUE FALSE FALSE TRUE 22686 25.5
## 3 FALSE FALSE FALSE TRUE 4228 4.76
## 4 FALSE TRUE FALSE FALSE 3846 4.33
## 5 FALSE FALSE TRUE FALSE 2820 3.18
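The four logical columns in that table depend on each other, so they can be collapsed into a single label per name. Here is a minimal base-R sketch of that idea, run on only the six names printed earlier. The classify_match helper and its label strings are hypothetical names made up for this illustration.

```r
# Six names and their codes, copied from the printed output earlier in the post.
six <- data.frame(
  name         = c("TAYLOR", "THOMAS", "WHITE", "THOMPSON", "YOUNG", "WRIGHT"),
  nysiis       = c("TAYLAR", "THAN", "WHAT", "THANPS", "YANG", "WRAGT"),
  nysiis_mod   = c("TALAR", "TAN", "WAT", "TANPSA", "YNG", "WAGT"),
  nysiis_stata = c("TAYLAR", "TAN", "WAT", "TANPSAN", "YANG", "WRAGT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one label per name instead of four logical columns.
classify_match <- function(reg, mod, stata) {
  ifelse(reg == stata & mod == stata, "both",
    ifelse(reg == stata, "regular only",
      ifelse(mod == stata, "modified only", "neither")))
}

six$match <- classify_match(six$nysiis, six$nysiis_mod, six$nysiis_stata)
print(table(six$match))
```

On these six names this gives three "regular only", two "modified only" and one "neither", the same tally as in the text above.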
To summarize:
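For 62.2% of names, both versions of nysiis in R and the one in Stata agree.
For 25.5% of names, both versions in R agree with each other, but Stata is different…
For 4.76% of names, the two R versions disagree and neither matches Stata.
For 4.33% of names, only the unmodified nysiis in R matches with Stata.
For 3.18% of names, only the modified nysiis in R matches with Stata.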
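One loose end: the group totals in the table add up to 88,798, while the last row of the raw file shows rank 88799. By default, merge() keeps only names that appear in both tables, so a name that did not survive the round trip through Stata would be dropped without any warning. Whether that is what happened here is a guess. Below is a minimal sketch of how one might check, using toy vectors; r_names and stata_names are hypothetical stand-ins for dt$name and nysiis_stata$name.

```r
# Toy stand-ins; in the post these would be dt$name and nysiis_stata$name.
r_names     <- c("SMITH", "JOHNSON", "WILLIAMS")
stata_names <- c("SMITH", "WILLIAMS")

# Names present on one side only are dropped by an inner join.
only_in_r     <- setdiff(r_names, stata_names)
only_in_stata <- setdiff(stata_names, r_names)

print(only_in_r)
print(only_in_stata)
```

Any names that show up in either result would be worth inspecting by hand before trusting the shares above.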