on
The Ground Game
This blog is part of a series related to Gov 1347: Election Analytics, a course at Harvard University taught by Professor Ryan D. Enos.
This week we learned about how election turnout can affect the outcome of races, and how on-the-ground campaigning efforts aim to mobilize and persuade voters. We can look at this data on the district-level to uncover patterns, and compare it to last week’s data on ad spending to see how they interact.
So far, we have compiled data on election outcomes, generic ballot polls, expert district-level polls, advertising campaigns, and more. When thinking about elections, however, one of the first things that may come to mind are the on-the-ground campaigns. This week we add this “Ground Game” data to our models, specifically with voter turnout at the district level.
For starters, below we have a basic regression of a district’s turnout percentage on the Democratic party’s major vote percentage in a district. The turnout percentage is coded as the sum of Democratic votes and Republican votes, divided by the district’s citizen voting age population (CVAP). The regression below looks at data from 2014 to 2020, including both presidential and midterm years.
##
## Call:
## lm(formula = DemVotesMajorPercent ~ dist_turnout, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.629 -15.673 -2.765 14.912 48.823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.629 1.928 28.34 <2e-16 ***
## dist_turnout -5.991 3.722 -1.61 0.108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.5 on 1743 degrees of freedom
## Multiple R-squared: 0.001484, Adjusted R-squared: 0.0009114
## F-statistic: 2.591 on 1 and 1743 DF, p-value: 0.1077
In this regression, we can see that the intercept is 54.63 percent major vote share for Democrats, representing that when voter turnout is zero, we estimate an average Democratic major vote percentage of 54.63%. When voter turnout is 100%, the Democratic major vote percentage decreases by 5.99 percentage points, to 48.64%. While we do not expect a turnout of 0%, and can only one day hope for a voter turnout of 100%, this model shows us the overall relationship between party vote share and voter turnout. While it is common to think that high voter turnout would benefit Democrats, as they often stress the importance of voting more so than the Republican party, this regression model suggests that higher voter turnout actually hurts Democrats. It is worth noting, however, that the R-squared for this model is extremely low, at only 0.0015, essentially stating that there is no relationship between the values of turnout that affect the democratic two party vote percentage. The coefficient on turnout is also not statistically significant, so we don’t take much away from this model in terms of prediction power.
Now adding to this model, we incorporate whether or not the winning candidate in the district was an incumbent. It is thought that incumbents likely have an edge in an election, especially with house elections where often many people don’t even know who their representative is – it is just easier to vote for the incumbent if there are no real district-level issues. However, if people are unhappy with the government in some capacity at any level, then being an incumbent may be harmful: people just want change, even if the representative did nothing wrong. By regressing the candidate incumbency on the two party vote percentage for both Democrats and Republicans, we can see if there is a relationship between incumbency and vote share and whether it really is positive like we expect it to be. For this regression, incumbency is coded as 1 for an incumbent and 0 for a challenger. Again this model looks at the election years from 2014 to 2020. I will run it for both Republican vote share and Democratic vote share to see if there is a difference between parties.
##
## Call:
## lm(formula = DemVotesMajorPercent ~ cand_incumbent, data = x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.439 -7.749 -0.643 8.007 60.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6435 0.6448 61.48 <2e-16 ***
## cand_incumbent 25.7953 0.9446 27.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.68 on 1743 degrees of freedom
## Multiple R-squared: 0.2996, Adjusted R-squared: 0.2992
## F-statistic: 745.7 on 1 and 1743 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = RepVotesMajorPercent ~ cand_incumbent, data = x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.357 -8.007 0.643 7.749 65.439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.3565 0.6448 93.61 <2e-16 ***
## cand_incumbent -25.7953 0.9446 -27.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.68 on 1743 degrees of freedom
## Multiple R-squared: 0.2996, Adjusted R-squared: 0.2992
## F-statistic: 745.7 on 1 and 1743 DF, p-value: < 2.2e-16
For the Democratic regression, when the candidate is a challenger, they have a predicted democrat two-party vote percentage of 39.64%. When the candidate is an incumbent, however, the vote percentage increases by 25.8 percentage points, to a Democratic vote percentage of 65.44%. The coefficient here is also highly significant, showing us that the relationship between a candidate’s status as an incumbent and their democratic vote share is highly correlated. I also ran this regression on Republican vote percentage and found a different outcome. Here, the model shows that when a candidate is a challenger, they receive 60.36% of the two-party vote share. The coefficient on candidate incumbency in this model is highly negative, with a coefficient of -25.80, and highly significant. For the Republican two-party vote share, when a candidate is an incumbent, it actually hurts their votes. These results seem a bit strange to me, especially in their magnitude of difference. Not only would I expect incumbency to have the same sign on the coefficient for both Republicans and Democrats, but I also did not expect the coefficient to be so large and significant. The adjusted r-squared of this model is 0.2992, showing that while not necessarily a good predictor of two-party vote share, the candidate’s incumbency does have a somewhat notable relationship with the vote percentage.
I would like to see how both incumbency and voter turnout influence vote share by adding the variables to the same regression.
##
## Call:
## lm(formula = DemVotesMajorPercent ~ dist_turnout + cand_incumbent,
## data = x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.683 -7.762 -0.847 7.879 62.032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.601 1.731 21.725 <2e-16 ***
## dist_turnout 3.991 3.138 1.272 0.204
## cand_incumbent 25.936 0.951 27.274 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.68 on 1742 degrees of freedom
## Multiple R-squared: 0.3003, Adjusted R-squared: 0.2995
## F-statistic: 373.8 on 2 and 1742 DF, p-value: < 2.2e-16
Similar to the two separate regressions we saw before, in this model incumbency is again highly significant and of a large magnitude, while turnout is small and insignificant. The adjusted r-squared of this regression is 0.2995, which is pretty much the same as what we saw in the model that used only incumbency as an independent variable. Adding turnout does very little to adjust this model.
For the final part of this extension, I will also add the expert ratings to the model. This will hopefully have a large impact on the predicted vote percentage, as we saw these ratings to be a good indicator of vote share in earlier blogs. This expert rating is the generic ballot average for the year, as that data is national. I will use only the generic ballot support data for the democratic party, given that the dependent variable I am using is Democratic two-party vote percentage.
##
## Call:
## lm(formula = DemVotesMajorPercent ~ dist_turnout + cand_incumbent +
## gen_avg_dem, data = x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.330 -7.540 -0.070 7.817 61.315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.8545 10.3647 -1.337 0.1815
## dist_turnout -9.1218 4.0617 -2.246 0.0248 *
## cand_incumbent 25.4751 0.9488 26.849 < 2e-16 ***
## gen_avg_dem 1.2275 0.2438 5.034 5.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.54 on 1741 degrees of freedom
## Multiple R-squared: 0.3103, Adjusted R-squared: 0.3091
## F-statistic: 261.1 on 3 and 1741 DF, p-value: < 2.2e-16
Oh no… Looking at this we can immediately see that the intercept is negative, which as we spent a long time discussing this past week, is not possible! The democratic two-party vote share percentage only exists in the 0-100% range – it cannot be a negative value. And on top of that, the district’s turnout is a negative number as well with a high magnitude relative to the intercept coefficient. Something is seriously wrong here. As we discussed in class, this can be addressed by using a glm logit model rather than an lm model, as this would limit the range of dependent variables from 0 to 1. I’m not sure if this would solve all the issues going on with this model, as it just seems a little off in general, but it would probably help…
Rather than using the house representative’s incumbency, I am interested to see if the President’s party will make a difference.
##
## Call:
## lm(formula = DemVotesMajorPercent ~ dist_turnout + cand_incumbent +
## gen_avg_dem + president_party, data = x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.087 -7.545 -0.121 7.502 63.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.9418 10.3318 -0.962 0.3361
## dist_turnout -9.0459 4.0362 -2.241 0.0251 *
## cand_incumbent 25.8607 0.9463 27.329 < 2e-16 ***
## gen_avg_dem 1.0925 0.2439 4.479 7.99e-06 ***
## president_partyR 4.5300 0.9434 4.802 1.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.42 on 1740 degrees of freedom
## Multiple R-squared: 0.3193, Adjusted R-squared: 0.3178
## F-statistic: 204.1 on 4 and 1740 DF, p-value: < 2.2e-16
As seen above, president’s party is a significant variable in predicting the Democratic party’s vote percentage. Interestingly enough, when the President’s party is a Republican, it increases the Democratic vote percentage. Since the president’s party was a strong predictor in my fundamental model, it is reassuring to see that it is also a significant predictor here.
I am hoping to fix these models in the coming weeks by adjusting the model to a glm, mapping the models to each district, and picking variables that I think will be better predictors. I also will be separating the data into two data sets: one with solid Democratic or Republican districts, and another with all the others to better incorporate expert predictions.
##
## Call:
## glm(formula = cbind(RepVotes, cvap - RepVotes) ~ dist_turnout +
## cand_incumbent + gen_avg_dem, family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -540.12 -71.31 -0.95 61.60 648.48
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.112e+00 4.480e-04 -2482 <2e-16 ***
## dist_turnout 2.986e+00 1.759e-04 16974 <2e-16 ***
## cand_incumbent -7.537e-01 3.995e-05 -18868 <2e-16 ***
## gen_avg_dem -2.651e-02 1.027e-05 -2580 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1614644024 on 30543 degrees of freedom
## Residual deviance: 788511463 on 30540 degrees of freedom
## AIC: 788883064
##
## Number of Fisher Scoring iterations: 5