library(ipumsr)
library(haven)
library(rstanarm)
13 Poststratification: regression with a non-representative sample
Sometimes the sample that we use to fit a regression model does not represent the population well. For example, a randomized experiment can have equal numbers of males and females, while the population of a city does not. In this case, the regression fitted on the sample cannot be used directly to make inferences about the population; instead, we can make predictions within each group and average them according to the population's composition. This technique is called post-stratification.
Suppose that we would like to predict \(y\) from a predictor \(X_{1}\) with a model \(\hat{y} = g^{-1}(\beta_{0}+\beta_{1}X_{1})\) for some link function \(g\). If the sample is not representative of the population, we cannot infer \(y\) at any level of \(X_{1}\) directly. But if we have access to data on the population, such as the census, we can use it to recalibrate the prediction of \(y\) as follows. Suppose that there are records of additional variables \(X_{-1}=(X_2,\ldots,X_p)\) in both the sample and the population. We instead fit a regression model of \(y\) on all the variables:
\[ \hat{y} = g^{-1}(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+\ldots+\hat{\beta}_pX_p). \]
Then, to infer \(y\) at level \(X_1=x_1\), we predict \(\hat{y}\) at each stratum, that is, at each observed value \(x_{-1}=(x_2,\ldots,x_p)\) of \(X_{-1}\):
\[ \hat{y}_{x_{-1} \vert x_{1}} = g^{-1}(\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\ldots+\hat{\beta}_px_p). \]
The final prediction \(\hat{y}_{x_{1}}\) is then obtained by combining these per-stratum predictions, weighted by the stratum proportions \(p_{x_{-1}\vert x_{1}}\) in the \(X_1=x_{1}\) subpopulation.
\[ \hat{y}_{x_{1}} = \sum_{x_{-1}} p_{x_{-1}\vert x_{1}}\hat{y}_{x_{-1} \vert x_{1}}. \]
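To make the formula concrete, here is a toy computation (the numbers are made up, not taken from the data below): a prediction for one level \(x_1\) is post-stratified over two hypothetical education strata.
# Hypothetical per-stratum predictions within one level x1 (e.g., one age group)
pred_by_stratum <- c(lowEd = 0.20, highEd = 0.40)
# Hypothetical population proportions of the strata; they must sum to 1
prop_by_stratum <- c(lowEd = 0.70, highEd = 0.30)
# Post-stratified estimate: proportion-weighted sum of the predictions
sum(prop_by_stratum * pred_by_stratum)  # 0.26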
We demonstrate using Monica Alexander’s example of post-marriage name change. The data, collected by Philip Cohen, consists of responses from 5,000 people regarding their decisions on changing their names after marriage. However, the respondents tend to have higher education levels and to be younger than average. If we want to, for example, find the proportion of women who kept their names after marriage, we cannot infer it directly from the data.
Instead, we fit a logistic regression model on the responses, calculate the proportion, and recalibrate it using the population data, namely the U.S. census data available from IPUMS USA.
Before anything else, we install `ipumsr` to read the census data and `haven` to read a `*.dta` file.
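If the packages are not already installed, the following one-off setup step (not shown in the original session) should be enough:
# Install the packages used in this chapter
install.packages(c("ipumsr", "haven", "rstanarm"))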
First, we import the survey data and focus only on ever-married women. We remove rows with missing data, then divide ages into five-year age groups, marriage years into decades, and education levels into pre-bachelor, bachelor, and post-bachelor.
mncs <- read_dta("data/MNCS-PV2.dta")
mncs <- mncs[, c("yrmar",
                 "agemar",
                 "agemarc",
                 "genmar",
                 "spgenmar",
                 "namechg",
                 "ednow",
                 "state")]
mncs <- mncs[!is.na(mncs$agemar) &
               !is.na(mncs$yrmar) &
               mncs$genmar == 2 &
               mncs$spgenmar == 1, ]
mncs$kept_name <- as.numeric(mncs$namechg == 1)
mncs$state_name <- tolower(
  as.character(
    factor(
      mncs$state,
      levels = attributes(mncs$state)$labels,
      labels = names(attributes(mncs$state)$labels)
    )
  )
)
mncs$age <- mncs$agemar + (2019 - mncs$yrmar)
mncs$age_group <- as.character(
  cut(mncs$age,
      breaks = c(seq(20, 80, by = 5), Inf),
      labels = seq(20, 80, by = 5),
      right = FALSE)
)
mncs$decade_married <- as.character(
  cut(mncs$yrmar,
      breaks = c(seq(1969, 2019, by = 10), Inf),
      labels = seq(1969, 2019, by = 10),
      right = FALSE)
)
mncs$educ_group <- cut(mncs$ednow,
                       breaks = c(-1, 4.5, 5, 9),
                       labels = c("<BA", "BA", ">BA"))
mncs <- mncs[c("kept_name", "state_name", "age_group",
               "decade_married", "educ_group")]
head(mncs)
# A tibble: 6 × 5
kept_name state_name age_group decade_married educ_group
<dbl> <chr> <chr> <chr> <fct>
1 0 ohio 50 1979 >BA
2 0 virginia 35 1999 >BA
3 1 new york 35 2009 >BA
4 0 rhode island 55 1999 >BA
5 0 illinois 35 2009 >BA
6 0 north carolina 25 2009 >BA
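As a quick check of the claim that the respondents are unusually highly educated (this check is our own addition, not part of the original analysis), we can look at the education composition of the processed survey data:
# Share of respondents in each education group; a representative sample
# of ever-married women would be much less concentrated in BA and above
prop.table(table(mncs$educ_group))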
We do not share the U.S. census data here, but you can sign up at the IPUMS USA website and request the data yourself. You may request a single-year or 5-year sample that includes the following variables: `AGE`, `PERWT`, `SEX`, `STATEFIP`, `MARST`, `YRMARR`, and `EDUC`. In addition to the `*.dat` data file, do not forget to download the accompanying `*.xml` file.
ddi <- read_ipums_ddi("data/usa_00001.xml")
ipums <- read_ipums_micro(ddi)
Use of data from IPUMS USA is subject to conditions including that users should
cite the data appropriately. Use command `ipums_conditions()` for more details.
ipums <- ipums[ipums$SEX == 2 &       # only women
                 ipums$AGE > 14 &
                 ipums$MARST != 6 &   # exclude singles
                 ipums$YRMARR > 1968, ]
ipums$state_name <- tolower(
  as.character(
    factor(
      ipums$STATEFIP,
      levels = attributes(ipums$STATEFIP)$labels,
      labels = names(attributes(ipums$STATEFIP)$labels)
    )
  )
)
ipums$age_group <- as.character(
  cut(ipums$AGE,
      breaks = c(seq(20, 80, by = 5), Inf),
      labels = seq(20, 80, by = 5),
      right = FALSE)
)
ipums$decade_married <- as.character(
  cut(ipums$YRMARR,
      breaks = c(seq(1969, 2019, by = 10), Inf),
      labels = seq(1969, 2019, by = 10),
      right = FALSE)
)
ipums$educ_group <- cut(ipums$EDUC,
                        breaks = c(-1, 9.5, 10, 12),
                        labels = c("<BA", "BA", ">BA"))
ipums <- ipums[c("state_name", "PERWT", "age_group",
                 "decade_married", "educ_group")]
head(ipums)
# A tibble: 6 × 5
state_name PERWT age_group decade_married educ_group
<chr> <dbl> <chr> <chr> <fct>
1 alabama 19 75 1989 <BA
2 alabama 30 50 1989 <BA
3 alabama 24 65 1969 <BA
4 alabama 3 60 1969 <BA
5 alabama 3 35 1999 BA
6 alabama 57 25 2009 <BA
Then, we aggregate the number of people (`PERWT`) by the other variables.
ipums <- aggregate(PERWT ~ age_group + state_name
                   + educ_group + decade_married,
                   ipums, sum)
head(ipums)
age_group state_name educ_group decade_married PERWT
1 50 alabama <BA 1969 160
2 55 alabama <BA 1969 11971
3 60 alabama <BA 1969 42487
4 65 alabama <BA 1969 40086
5 70 alabama <BA 1969 18581
6 75 alabama <BA 1969 7568
Suppose that we would like to estimate the proportion of women who kept their names in each age group. Then, within each age group, we need to weight our model’s predictions by the population proportions of the other variables.
ipums <- transform(ipums, prop = ave(PERWT,
                                     age_group,
                                     FUN = prop.table))
ipums$PERWT <- NULL

head(ipums)
age_group state_name educ_group decade_married prop
1 50 alabama <BA 1969 1.714893e-05
2 55 alabama <BA 1969 1.202384e-03
3 60 alabama <BA 1969 4.400218e-03
4 65 alabama <BA 1969 5.183921e-03
5 70 alabama <BA 1969 4.467144e-03
6 75 alabama <BA 1969 4.001971e-03
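As a sanity check (our own addition), the stratum proportions should sum to one within each age group:
# The proportions within each age group should add up to 1
tapply(ipums$prop, ipums$age_group, sum)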
Now we train the model on the survey data, regressing the name-keeping indicator on the age group, state, decade of marriage, and education group.
fit <- stan_glm(kept_name ~ age_group
                + state_name
                + decade_married
                + educ_group,
                family = binomial(link = "logit"),
                data = mncs, refresh = 0)
We then make point predictions on the census data.
ipums$pred <- predict(fit, type = "response",
                      newdata = ipums)
print(ipums$pred[1:10])
[1] 0.02470312 0.03027410 0.05212266 0.08664387 0.09547531 0.04009368
[7] 0.03826284 0.08259951 0.09928059 0.15786060
Let us use these predictions to estimate the proportion of 25-to-30-year-old women who keep their names after marriage. The estimate is the sum of the predictions for all strata in the 25-30 age group, weighted by the corresponding stratum proportions that we just calculated from the census.
age_group_25 <- (ipums$age_group == 25)
ipums25 <- ipums[age_group_25, ]

prop_age25 <- ipums25$prop
pred_age25 <- ipums25$pred
poststrat_est <- sum(prop_age25 * pred_age25)

cat("Predicted proportion for age group 25 =", poststrat_est, "\n")
Predicted proportion for age group 25 = 0.1867677
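Because the proportions within the age group sum to one, the same estimate can also be written as a weighted mean; the following equivalent one-liner is just for illustration:
# Equivalent computation: weighted mean with stratum proportions as weights
weighted.mean(ipums25$pred, ipums25$prop)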
To predict the proportions at all age levels, we can use `aggregate` to sum the weighted predictions over the age groups. Here, we use `transform` to create a new column named `total`, which contains the prediction-proportion products.
poststrat_pred <- aggregate(
  total ~ age_group,
  transform(ipums,
            total = prop * pred),
  sum)
print(poststrat_pred[1:10, ])
age_group total
1 20 0.2703339
2 25 0.1867677
3 30 0.2121638
4 35 0.2538540
5 40 0.2762636
6 45 0.3075979
7 50 0.3286734
8 55 0.3104564
9 60 0.3350600
10 65 0.3726031
Instead of point predictions, we can post-stratify the full posterior distribution of the predictions. Here, each row of the dataframe `poststrat` contains 4,000 post-stratified posterior draws for one age group.
pred_sim <- t(posterior_epred(fit, newdata = ipums))
poststrat <- data.frame(
  age_group = strtoi(ipums$age_group),
  ipums$prop * pred_sim)
poststrat <- aggregate(. ~ age_group,
                       poststrat,
                       sum)
print(poststrat[, 1:6])
age_group X1 X2 X3 X4 X5
1 20 0.23075332 0.1954003 0.2962969 0.30540858 0.2443019
2 25 0.18400275 0.1970721 0.1596328 0.20678737 0.1604003
3 30 0.22577802 0.2178550 0.2142322 0.19976549 0.1997783
4 35 0.26718977 0.2620582 0.2380734 0.23240628 0.2470484
5 40 0.29788354 0.2684942 0.2597924 0.24982996 0.2779903
6 45 0.32838205 0.3148195 0.3153428 0.30273628 0.3189486
7 50 0.33017046 0.3168893 0.2930055 0.32237179 0.2974434
8 55 0.30042490 0.2934830 0.2708072 0.32658511 0.2972430
9 60 0.37741082 0.3813491 0.3221430 0.30238999 0.3137456
10 65 0.41245492 0.3407676 0.3991818 0.29471220 0.3958659
11 70 0.36332892 0.4473712 0.4438214 0.31129420 0.3159407
12 75 0.42567943 0.3136357 0.1726485 0.23381899 0.1964923
13 80 0.08733452 0.1368219 0.3119119 0.07929532 0.3172687
Now we can plot the predicted proportion of women who keep their names in each age group, together with its uncertainty. The post-stratified predictions are compared with the simple estimates given by the raw sample proportions.
ncols <- ncol(poststrat)
means <- rowMeans(poststrat[, 2:ncols])
sds <- apply(poststrat[, 2:ncols], 1, sd)

simple_props <- aggregate(kept_name ~ age_group,
                          mncs,
                          mean)

# Plot stratified predictions
plot(poststrat$age_group,
     means,
     xlab = "Age group",
     ylab = "Proportion of women who keep their names",
     ylim = c(0, 0.7),
     pch = 16, cex = 2)

# Add error bars
arrows(x0 = poststrat$age_group,
       y0 = means - 2 * sds,
       x1 = poststrat$age_group,
       y1 = means + 2 * sds,
       code = 3,
       angle = 90,
       length = 0.1)

# Plot sample proportions
points(poststrat$age_group,
       simple_props$kept_name,
       col = "red", pch = 16,
       cex = 2)
We can see that the post-stratified predictions are lower than the sample proportions across all age groups. A possible explanation is that the respondents were mostly young and highly educated people, who are more likely to keep their names after marriage.