13  Poststratification: regression with non-representative sample

Sometimes the sample that we use to fit the regression model does not represent the population well. For example, a randomized experiment can have equal numbers of males and females, but the same might not hold for the people in the city. In this case, the regression on the sample cannot be used to infer the population, but we can regress on each separate group and average the predictions according to the population; this technique is called post-stratification.

Suppose that we would like to predict \(y\) on a predictors \(x_{1}\) with a model \(\hat{y} = g^{-1}(\beta_{0}+\beta_{1}X_{1})\) for some link function \(g\). If the sample is not representative of the population, we cannot infer \(y\) on any level of \(X_{1}\). But if we have access to the data of the population, such as the census, we can use it to recalibrate the prediction of \(y\) as follows: Suppose that there are records of additional variables \(X_{-1}=(X_2,\ldots,X_p)\) in both the sample and the population. We can instead fit a regression model of \(y\) on the other variables:

\[ \hat{y} = g^{-1}(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+\ldots+\hat{\beta}_pX_p). \]

Then, to infer \(y\) at level \(X_1=x_1\), we predict \(\hat{y}\) at each stratum, that is, each observed values \(x_{-1}=(x_2,\ldots,x_p)\) of \(X_{-1}\).

\[ \hat{y}_{x_{-1} \vert x_{1}} = g^{-1}(\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\ldots+\hat{\beta}_px_p). \]

The final prediction \(\hat{y}_{x_{1}}\) is then obtained by combining these per-stratum predictions, weighted by the stratum proportions \(p_{x_{-1}\vert x_{1}}\) in the \(X_1=x_{1}\) subpopulation.

\[ \hat{y}_{x_{1}} = \sum_{x_{-1}} p_{x_{-1}\vert x_{1}}\hat{y}_{x_{-1} \vert x_{1}}. \]

We demonstrate using Monica Alexander’s example of post-marriage name change (Alexander 2019). The data, collected by Philip Cohen, consists of responses from 5,000 people regarding their decisions for a name change after marriage. However, the respondents tend to have higher education levels and are younger than average. If we want to, for example, find out the proportion of women who kept their names after marriage, we could not infer it directly from the data.

Instead, we fit a logistic regression model on the responses, calculate the proportion, and recalibrate it using population’s data, namely the U.S. census recorded at IPUMS USA.

Before anythin, we install ipumsr to read the census data and haven to read a *.dta file.

library(ipumsr)
library(haven)
library(rstanarm)

First, we import the survey data and only focus on ever-married women. We remove rows with missing data, divide ages into age groups, marriage years into decades, and education levels into pre-bachelor, bachelor and post-bachelor.

mncs <- read_dta("data/MNCS-PV2.dta")
mncs <- mncs[,c("yrmar",
                "agemar",
                "agemarc",
                "genmar",
                "spgenmar",
                "namechg",
                "ednow",
                "state")]
mncs <- mncs[!is.na(mncs$agemar) &
             !is.na(mncs$yrmar) &
             mncs$genmar == 2 &
             mncs$spgenmar == 1,]
mncs$kept_name <- as.numeric(mncs$namechg == 1)
mncs$state_name <- tolower(
    as.character(
        factor(
            mncs$state,
            levels = attributes(mncs$state)$labels,
            labels = names(attributes(mncs$state)$labels)
        )
    )
)
mncs$age <- mncs$agemar + (2019 - mncs$yrmar)
mncs$age_group <- (as.character(
         cut(mncs$age,
             breaks = c(seq(20, 80, by = 5), Inf),
             labels = seq(20, 80, by = 5),
             right = FALSE)
     )
)
mncs$decade_married <- (as.character(
         cut(mncs$yrmar,
             breaks=c(seq(1969, 2019, by = 10), Inf),
             labels=seq(1969, 2019, by = 10),
             right=FALSE
             )
     )
)
mncs$educ_group <- cut(mncs$ednow,
                       breaks = c(-1, 4.5, 5, 9),
                       labels = c("<BA", "BA", ">BA"))
mncs <- mncs[c("kept_name", "state_name", "age_group",
               "decade_married", "educ_group")]

head(mncs)
# A tibble: 6 × 5
  kept_name state_name     age_group decade_married educ_group
      <dbl> <chr>          <chr>     <chr>          <fct>     
1         0 ohio           50        1979           >BA       
2         0 virginia       35        1999           >BA       
3         1 new york       35        2009           >BA       
4         0 rhode island   55        1999           >BA       
5         0 illinois       35        2009           >BA       
6         0 north carolina 25        2009           >BA       

We do not share the U.S. census data here, but you can sign up at the IPUMS US website and request for the data yourself. You may request for a single-year or 5-year data that includes the following variables: AGE, PERWT, SEX, STATEFIP, MARST, YRMARR, EDUC. In addition to the *.dat data, do not forget to download the accompanied *.xml file.

ddi <- read_ipums_ddi("data/usa_00001.xml")
ipums <- read_ipums_micro(ddi)
Use of data from IPUMS USA is subject to conditions including that users should
cite the data appropriately. Use command `ipums_conditions()` for more details.
ipums <- ipums[ipums$SEX == 2 &  # only woman
               ipums$AGE > 14 &
               ipums$MARST != 6 &  # exclude singles
               ipums$YRMARR > 1968,]
ipums$state_name <- tolower(
    as.character(
        factor(
            ipums$STATEFIP,
            levels = attributes(ipums$STATEFIP)$labels,
            labels = names(attributes(ipums$STATEFIP)$labels)
        )
    )
)
ipums$age_group <- as.character(
    cut(ipums$AGE,
        breaks = c(seq(20, 80, by = 5), Inf),
        labels = seq(20, 80, by = 5),
        right = FALSE
        )
)
ipums$decade_married <- as.character(
    cut(ipums$YRMARR,
        breaks = c(seq(1969, 2019, by = 10), Inf),
        labels = seq(1969, 2019, by = 10),
        right = FALSE
        )
)
ipums$educ_group <- cut(ipums$EDUC,
                        breaks = c(-1, 9.5, 10, 12),
                        labels = c("<BA", "BA", ">BA"))
ipums <- ipums[c("state_name", "PERWT", "age_group",
                 "decade_married", "educ_group")]

head(ipums)
# A tibble: 6 × 5
  state_name PERWT age_group decade_married educ_group
  <chr>      <dbl> <chr>     <chr>          <fct>     
1 alabama       19 75        1989           <BA       
2 alabama       30 50        1989           <BA       
3 alabama       24 65        1969           <BA       
4 alabama        3 60        1969           <BA       
5 alabama        3 35        1999           BA        
6 alabama       57 25        2009           <BA       

Then, we aggregate the number of people (PERWT) by the other variables.

ipums <- aggregate(PERWT ~ age_group + state_name
                   + educ_group + decade_married,
                   ipums, sum)

head(ipums)
  age_group state_name educ_group decade_married PERWT
1        50    alabama        <BA           1969   160
2        55    alabama        <BA           1969 11971
3        60    alabama        <BA           1969 42487
4        65    alabama        <BA           1969 40086
5        70    alabama        <BA           1969 18581
6        75    alabama        <BA           1969  7568

Suppose that we would like to calculate the proportion of women who changed their names for each age groups. Then, in each age groups, we need to calibrate our model’s predictions according to the proportion of the other variables.

ipums <- transform(ipums, prop = ave(PERWT,
                                   age_group,
                                   FUN = prop.table))
ipums$PERWT <- NULL

head(ipums)
  age_group state_name educ_group decade_married         prop
1        50    alabama        <BA           1969 1.714893e-05
2        55    alabama        <BA           1969 1.202384e-03
3        60    alabama        <BA           1969 4.400218e-03
4        65    alabama        <BA           1969 5.183921e-03
5        70    alabama        <BA           1969 4.467144e-03
6        75    alabama        <BA           1969 4.001971e-03

Now we train the model on the survey data. Here, we decide not to regress on the education group

fit <- stan_glm(kept_name ~ age_group
                + state_name
                + decade_married
                + educ_group,
                family = binomial(link = "logit"),
                data = mncs, refresh = 0)

We then make point predictions on the census data.

ipums$pred <- predict(fit, type = "response",
                    newdata = ipums)
print(ipums$pred[1:10])
 [1] 0.02480524 0.03045005 0.05248408 0.08573069 0.09311544 0.03936215
 [7] 0.03813904 0.08590608 0.10329217 0.16408253

Let us use these predictions to estimate the proportion of 25-to-30-year-old women who keep their names after marriage. The estimate is given by the sum of predictions for all strata in the 25-30 age group, weighted by the proportions of the corresponding stratum that we just calculated from the census.

age_group_25 <- (ipums$age_group == 25)
ipums25 <- ipums[age_group_25, ]

prop_age25 <- ipums25$prop
pred_age25 <- ipums25$pred
poststrat_est <- sum(prop_age25 * pred_age25)

cat("Predicted proportion for age group 25 =", poststrat_est, "\n")
Predicted proportion for age group 25 = 0.186577 

To predict the proportions at all age levels, we can use aggregate to sum the weighted predictions over the age groups. Here, we use transform to create a new column named total, which consists of the prediction-proportion products.

poststrat_pred <- aggregate(
    total ~ age_group,
    transform(ipums,
              total = prop * pred),
    sum)

print(poststrat_pred[1:10, ])
   age_group     total
1         20 0.2713729
2         25 0.1865770
3         30 0.2127739
4         35 0.2544373
5         40 0.2770835
6         45 0.3078431
7         50 0.3292738
8         55 0.3116066
9         60 0.3368955
10        65 0.3732657

Instead of point predictions, we can post-stratify on the posterior predictive distributions at each age group. Here, each row of the dataframe poststrat consists of 4,000 posterior weighted predictions at each age group.

pred_sim <- t(posterior_epred(fit, newdata = ipums))
poststrat <- data.frame(
    age_group = strtoi(ipums$age_group),
    ipums$prop * pred_sim)
poststrat <- aggregate(. ~ age_group,
                       poststrat,
                       sum)

print(poststrat[, 1:6])
   age_group         X1        X2        X3         X4         X5
1         20 0.10289222 0.2128899 0.2026354 0.21333193 0.19260699
2         25 0.17950721 0.2186083 0.1707474 0.22991881 0.23801011
3         30 0.18568722 0.2125503 0.2284248 0.19557829 0.20283401
4         35 0.25300019 0.2713337 0.2598084 0.24940851 0.25186717
5         40 0.28778235 0.2671291 0.2788813 0.27108239 0.28716434
6         45 0.29400365 0.3244204 0.3169143 0.30648993 0.31493513
7         50 0.31673011 0.3416799 0.3828944 0.32393316 0.30410929
8         55 0.34455908 0.2723711 0.3267372 0.32108771 0.30560544
9         60 0.28512356 0.3331241 0.3604717 0.39278239 0.36532392
10        65 0.43598097 0.4678410 0.3018612 0.35143944 0.42690287
11        70 0.34580661 0.5128352 0.2823019 0.34404580 0.47651287
12        75 0.21994127 0.2859360 0.3836879 0.12597988 0.19206811
13        80 0.04654375 0.2179899 0.4236130 0.04573537 0.03246203

Now we can plot the prediction, with uncertainty, of the proportion of women in a particular age group who keep their names. The predictions across all age groups are compared with the simple predictions using the sample proportions.

ncols <- ncol(poststrat)
means <- rowMeans(poststrat[, 2:ncols])
sds <- apply(poststrat[, 2:ncols], 1, sd)

simple_props <- aggregate(kept_name ~ age_group,
                          mncs,
                          mean)

# Plot stratified predictions
plot(poststrat$age_group,
     means,
     xlab = "Age group",
     ylab = "Proportion of women who keep their names",
     ylim = c(0, 0.7),
     pch = 16, cex = 2)

# Add error bars
arrows(x0 = poststrat$age_group,
       y0 = means - 2 * sds,
       x1 = poststrat$age_group,
       y1 = means + 2 * sds,
       code = 3,
       angle = 90,
       length = 0.1)

# Plot sample proportions
points(poststrat$age_group,
       simple_props$kept_name,
       col = "red", pch = 16,
       cex = 2)

We can see that the post-stratified predictions are lower than the sample proportions across all age groups. A possible explanation is that the respondents were mostly young and highly educated people, who are more likely to change their names after marriage.

References

Alexander, Monica. 2019. “Analyzing Name Changes After Marriage Using a Non-Representative Survey.” Personal Blog. https://www.monicaalexander.com/posts/2019-08-07-mrp/.