# 13Poststratification: regression with non-representative sample

Sometimes the sample that we use to fit the regression model does not represent the population well. For example, a randomized experiment can have equal numbers of males and females, but the same might not hold for the people in the city. In this case, the regression on the sample cannot be used to infer the population, but we can regress on each separate group and average the predictions according to the population; this technique is called post-stratification.

Suppose that we would like to predict $$y$$ on a predictors $$x_{1}$$ with a model $$\hat{y} = g^{-1}(\beta_{0}+\beta_{1}X_{1})$$ for some link function $$g$$. If the sample is not representative of the population, we cannot infer $$y$$ on any level of $$X_{1}$$. But if we have access to the data of the population, such as the census, we can use it to recalibrate the prediction of $$y$$ as follows: Suppose that there are records of additional variables $$X_{-1}=(X_2,\ldots,X_p)$$ in both the sample and the population. We can instead fit a regression model of $$y$$ on the other variables:

$\hat{y} = g^{-1}(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+\ldots+\hat{\beta}_pX_p).$

Then, to infer $$y$$ at level $$X_1=x_1$$, we predict $$\hat{y}$$ at each stratum, that is, each observed values $$x_{-1}=(x_2,\ldots,x_p)$$ of $$X_{-1}$$.

$\hat{y}_{x_{-1} \vert x_{1}} = g^{-1}(\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\ldots+\hat{\beta}_px_p).$

The final prediction $$\hat{y}_{x_{1}}$$ is then obtained by combining these per-stratum predictions, weighted by the stratum proportions $$p_{x_{-1}\vert x_{1}}$$ in the $$X_1=x_{1}$$ subpopulation.

$\hat{y}_{x_{1}} = \sum_{x_{-1}} p_{x_{-1}\vert x_{1}}\hat{y}_{x_{-1} \vert x_{1}}.$

We demonstrate using Monica Alexander’s example of post-marriage name change . The data, collected by Philip Cohen, consists of responses from 5,000 people regarding their decisions for a name change after marriage. However, the respondents tend to have higher education levels and are younger than average. If we want to, for example, find out the proportion of women who kept their names after marriage, we could not infer it directly from the data.

Instead, we fit a logistic regression model on the responses, calculate the proportion, and recalibrate it using population’s data, namely the U.S. census recorded at IPUMS USA.

Before anythin, we install ipumsr to read the census data and haven to read a *.dta file.

library(ipumsr)
library(haven)
library(rstanarm)

First, we import the survey data and only focus on ever-married women. We remove rows with missing data, divide ages into age groups, marriage years into decades, and education levels into pre-bachelor, bachelor and post-bachelor.

mncs <- read_dta("data/MNCS-PV2.dta")
mncs <- mncs[,c("yrmar",
"agemar",
"agemarc",
"genmar",
"spgenmar",
"namechg",
"ednow",
"state")]
mncs <- mncs[!is.na(mncs$agemar) & !is.na(mncs$yrmar) &
mncs$genmar == 2 & mncs$spgenmar == 1,]
mncs$kept_name <- as.numeric(mncs$namechg == 1)
mncs$state_name <- tolower( as.character( factor( mncs$state,
levels = attributes(mncs$state)$labels,
labels = names(attributes(mncs$state)$labels)
)
)
)
mncs$age <- mncs$agemar + (2019 - mncs$yrmar) mncs$age_group <- (as.character(
cut(mncs$age, breaks = c(seq(20, 80, by = 5), Inf), labels = seq(20, 80, by = 5), right = FALSE) ) ) mncs$decade_married <- (as.character(
cut(mncs$yrmar, breaks=c(seq(1969, 2019, by = 10), Inf), labels=seq(1969, 2019, by = 10), right=FALSE ) ) ) mncs$educ_group <- cut(mncs$ednow, breaks = c(-1, 4.5, 5, 9), labels = c("<BA", "BA", ">BA")) mncs <- mncs[c("kept_name", "state_name", "age_group", "decade_married", "educ_group")] head(mncs) # A tibble: 6 × 5 kept_name state_name age_group decade_married educ_group <dbl> <chr> <chr> <chr> <fct> 1 0 ohio 50 1979 >BA 2 0 virginia 35 1999 >BA 3 1 new york 35 2009 >BA 4 0 rhode island 55 1999 >BA 5 0 illinois 35 2009 >BA 6 0 north carolina 25 2009 >BA  We do not share the U.S. census data here, but you can sign up at the IPUMS US website and request for the data yourself. You may request for a single-year or 5-year data that includes the following variables: AGE, PERWT, SEX, STATEFIP, MARST, YRMARR, EDUC. In addition to the *.dat data, do not forget to download the accompanied *.xml file. ddi <- read_ipums_ddi("data/usa_00001.xml") ipums <- read_ipums_micro(ddi) Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command ipums_conditions() for more details. ipums <- ipums[ipums$SEX == 2 &  # only woman
ipums$AGE > 14 & ipums$MARST != 6 &  # exclude singles
ipums$YRMARR > 1968,] ipums$state_name <- tolower(
as.character(
factor(
ipums$STATEFIP, levels = attributes(ipums$STATEFIP)$labels, labels = names(attributes(ipums$STATEFIP)$labels) ) ) ) ipums$age_group <- as.character(
cut(ipums$AGE, breaks = c(seq(20, 80, by = 5), Inf), labels = seq(20, 80, by = 5), right = FALSE ) ) ipums$decade_married <- as.character(
cut(ipums$YRMARR, breaks = c(seq(1969, 2019, by = 10), Inf), labels = seq(1969, 2019, by = 10), right = FALSE ) ) ipums$educ_group <- cut(ipums$EDUC, breaks = c(-1, 9.5, 10, 12), labels = c("<BA", "BA", ">BA")) ipums <- ipums[c("state_name", "PERWT", "age_group", "decade_married", "educ_group")] head(ipums) # A tibble: 6 × 5 state_name PERWT age_group decade_married educ_group <chr> <dbl> <chr> <chr> <fct> 1 alabama 19 75 1989 <BA 2 alabama 30 50 1989 <BA 3 alabama 24 65 1969 <BA 4 alabama 3 60 1969 <BA 5 alabama 3 35 1999 BA 6 alabama 57 25 2009 <BA  Then, we aggregate the number of people (PERWT) by the other variables. ipums <- aggregate(PERWT ~ age_group + state_name + educ_group + decade_married, ipums, sum) head(ipums)  age_group state_name educ_group decade_married PERWT 1 50 alabama <BA 1969 160 2 55 alabama <BA 1969 11971 3 60 alabama <BA 1969 42487 4 65 alabama <BA 1969 40086 5 70 alabama <BA 1969 18581 6 75 alabama <BA 1969 7568 Suppose that we would like to calculate the proportion of women who changed their names for each age groups. Then, in each age groups, we need to calibrate our model’s predictions according to the proportion of the other variables. ipums <- transform(ipums, prop = ave(PERWT, age_group, FUN = prop.table)) ipums$PERWT <- NULL

head(ipums)
  age_group state_name educ_group decade_married         prop
1        50    alabama        <BA           1969 1.714893e-05
2        55    alabama        <BA           1969 1.202384e-03
3        60    alabama        <BA           1969 4.400218e-03
4        65    alabama        <BA           1969 5.183921e-03
5        70    alabama        <BA           1969 4.467144e-03
6        75    alabama        <BA           1969 4.001971e-03

Now we train the model on the survey data. Here, we decide not to regress on the education group

fit <- stan_glm(kept_name ~ age_group
+ state_name
+ educ_group,
data = mncs, refresh = 0)

We then make point predictions on the census data.

ipums$pred <- predict(fit, type = "response", newdata = ipums) print(ipums$pred[1:10])
 [1] 0.02480524 0.03045005 0.05248408 0.08573069 0.09311544 0.03936215
[7] 0.03813904 0.08590608 0.10329217 0.16408253

Let us use these predictions to estimate the proportion of 25-to-30-year-old women who keep their names after marriage. The estimate is given by the sum of predictions for all strata in the 25-30 age group, weighted by the proportions of the corresponding stratum that we just calculated from the census.

age_group_25 <- (ipums$age_group == 25) ipums25 <- ipums[age_group_25, ] prop_age25 <- ipums25$prop
pred_age25 <- ipums25$pred poststrat_est <- sum(prop_age25 * pred_age25) cat("Predicted proportion for age group 25 =", poststrat_est, "\n") Predicted proportion for age group 25 = 0.186577  To predict the proportions at all age levels, we can use aggregate to sum the weighted predictions over the age groups. Here, we use transform to create a new column named total, which consists of the prediction-proportion products. poststrat_pred <- aggregate( total ~ age_group, transform(ipums, total = prop * pred), sum) print(poststrat_pred[1:10, ])  age_group total 1 20 0.2713729 2 25 0.1865770 3 30 0.2127739 4 35 0.2544373 5 40 0.2770835 6 45 0.3078431 7 50 0.3292738 8 55 0.3116066 9 60 0.3368955 10 65 0.3732657 Instead of point predictions, we can post-stratify on the posterior predictive distributions at each age group. Here, each row of the dataframe poststrat consists of 4,000 posterior weighted predictions at each age group. pred_sim <- t(posterior_epred(fit, newdata = ipums)) poststrat <- data.frame( age_group = strtoi(ipums$age_group),
ipums$prop * pred_sim) poststrat <- aggregate(. ~ age_group, poststrat, sum) print(poststrat[, 1:6])  age_group X1 X2 X3 X4 X5 1 20 0.10289222 0.2128899 0.2026354 0.21333193 0.19260699 2 25 0.17950721 0.2186083 0.1707474 0.22991881 0.23801011 3 30 0.18568722 0.2125503 0.2284248 0.19557829 0.20283401 4 35 0.25300019 0.2713337 0.2598084 0.24940851 0.25186717 5 40 0.28778235 0.2671291 0.2788813 0.27108239 0.28716434 6 45 0.29400365 0.3244204 0.3169143 0.30648993 0.31493513 7 50 0.31673011 0.3416799 0.3828944 0.32393316 0.30410929 8 55 0.34455908 0.2723711 0.3267372 0.32108771 0.30560544 9 60 0.28512356 0.3331241 0.3604717 0.39278239 0.36532392 10 65 0.43598097 0.4678410 0.3018612 0.35143944 0.42690287 11 70 0.34580661 0.5128352 0.2823019 0.34404580 0.47651287 12 75 0.21994127 0.2859360 0.3836879 0.12597988 0.19206811 13 80 0.04654375 0.2179899 0.4236130 0.04573537 0.03246203 Now we can plot the prediction, with uncertainty, of the proportion of women in a particular age group who keep their names. The predictions across all age groups are compared with the simple predictions using the sample proportions. ncols <- ncol(poststrat) means <- rowMeans(poststrat[, 2:ncols]) sds <- apply(poststrat[, 2:ncols], 1, sd) simple_props <- aggregate(kept_name ~ age_group, mncs, mean) # Plot stratified predictions plot(poststrat$age_group,
means,
xlab = "Age group",
ylab = "Proportion of women who keep their names",
ylim = c(0, 0.7),
pch = 16, cex = 2)

arrows(x0 = poststrat$age_group, y0 = means - 2 * sds, x1 = poststrat$age_group,
y1 = means + 2 * sds,
code = 3,
angle = 90,
length = 0.1)

# Plot sample proportions
points(poststrat$age_group, simple_props$kept_name,
col = "red", pch = 16,
cex = 2)

We can see that the post-stratified predictions are lower than the sample proportions across all age groups. A possible explanation is that the respondents were mostly young and highly educated people, who are more likely to change their names after marriage.

# References

Alexander, Monica. 2019. “Analyzing Name Changes After Marriage Using a Non-Representative Survey.” Personal Blog. https://www.monicaalexander.com/posts/2019-08-07-mrp/.