Final Prediction

This blog is part of a series related to Gov 1347: Election Analytics, a course at Harvard University taught by Professor Ryan D. Enos.

Over the past nine weeks, we have explored a multitude of variables that may impact election outcomes in an attempt to forecast the 2022 midterm election. We have learned about economic forces, polling, expert predictions, incumbency, advertising, campaigning, and more to find if they hold any predictive power, and how together in a model they may foreshadow what is to come in the results next week. In this blog, I will create a final prediction model for the House of Representatives and share its results, with the intention of reflecting on it after the election.

Background

Since this class first convened at the beginning of September, we have time and time again been reminded of the shortcomings of election forecasting. Elections are fickle, and there is no way to know for certain how voters will vote on election day. The experts oftentimes cannot predict what will happen, whether an election will be a toss-up or a landslide. There are shocks that no one sees coming, costs of voting that may be unanticipated, various polling biases, and ever-changing political opinions that respond to all sorts of issues and news. While these unknowns make forecasting very difficult, over the last nine weeks we have gained the tools that give us a good foundation to create predictions of our own for the House of Representatives midterm election.

My Model Choices

In building my model, the first choice I had was what to use as my dependent variable and the level at which to predict. For my dependent variable, I decided to run models for both Democratic seat share and Democratic vote share to see the results of both and how they may tell different stories about the election. I also created variables expressing vote share and seat share in terms of the incumbent party, to see how incumbents perform regardless of their party affiliation, but I decided that this approach would complicate interpreting the results. While we spent multiple weeks working with district-level data and forecasting each of the 435 districts, in the end I have decided to predict vote share and seat share at the national level. We have more data for these variables and do not have to rely on techniques like pooling, which can introduce larger margins of error. Given the sparse district-level data, a pooled model would have let us borrow data from neighboring, similar districts and treat it as a district's own. However, my national model was heading in the right direction in terms of predictive accuracy, and the lack of district data led me to settle on a national-level final model.

In the first week of class, we looked at the fundamentals of election forecasting models, with predictors like the president’s party. We learned about the phenomenon in most midterms where the president’s party tends to lose seats (Campbell 2018). Following these basics, in week two we looked at how Real Disposable Income (RDI) on its own does a decent job of predicting elections. We also took a look at a slew of other economic variables, such as the Consumer Price Index (CPI), the GDP growth rate, unemployment, and more. Despite the overlap between the government and the economy, most of these variables proved to be insignificant and poor predictors of House elections. The one economic variable I will be using in my final model is the unemployment rate, as it is typically a decent indicator of the health of the economy. In addition to using it on its own, I have also decided to interact it with the president’s party. The topics that matter most to voters vary by party affiliation, and Republicans are known to weigh economic factors more heavily. Therefore, an interaction term between party and the unemployment rate can provide more significant results than the two variables can on their own.

We continued on from the fundamentals and economic variables to polling, looking at the generic ballot as well as district-level polls. When I first added these polls to my model, I limited the data to 1990 and later, justifying this with the fact that polling methods have changed greatly since 1945, when the data was first collected. My hope was that reducing the sample would help the model better predict recent elections. While I think this justification still makes sense, especially for district-level polling, where little data exists even today, I will not be limiting the data by year as severely. In my final model, the only polling data I will be using is the generic ballot, which measures which party people support without regard for specific candidates. The data I will be using goes back to 1960, the earliest year for which all of the variables of interest have data. It would still make sense to limit the data to more recent years for the same polling reasons, but since I am only dealing with the generic ballot, I believe the changing polling methods will not hinder the model's predictive power, and the longer time series will be more useful.

While we did not dedicate much class time to discussing presidential approval ratings and their impact on elections, some classmates and I spoke about incorporating approval as another form of polling with easily accessible historical data. I gathered presidential approval data from Gallup Analytics dating back to 1960. Polling dates vary from year to year, so I pulled the president’s approval rating from the last poll taken before each election. For a few years in the 1960s and 1970s, this ended up being a poll taken between June and August of the election year, but for most other years, and especially in more recent history, the ratings come from the final weeks of October. While it would have been nice to pull the approval rating from a consistent date every year, I believe using the last poll prior to each election mitigates this issue.
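As a rough sketch of that step, assuming a polls data frame named `approval_polls` with columns `year`, `poll_date`, `election_date`, and `approve` (these names are illustrative, not the actual Gallup export), the wrangling might look like this:

```r
library(dplyr)

# Hypothetical columns: year, poll_date, election_date, approve.
# For each election year, keep the last approval poll taken before election day.
last_approval <- approval_polls %>%
  filter(poll_date < election_date) %>%
  group_by(year) %>%
  slice_max(poll_date, n = 1) %>%
  ungroup() %>%
  select(year, president_approval = approve)
```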

Incumbency has also been a popular word throughout our class, and it has proven to be a good predictor in House elections. Looking at the district-level data we have from 1945-2020, we find that the incumbent wins the House race about 84% of the time. Below is a bar graph showing this relationship.
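For reference, a minimal sketch of how that win rate and bar graph could be produced, assuming a district-level data frame `house_results` with one row per district race and a logical column `incumbent_won` (both names are hypothetical):

```r
library(dplyr)
library(ggplot2)

# Hypothetical district-level data: one row per district race, 1945-2020,
# with a logical column `incumbent_won`.
win_rate <- house_results %>%
  count(incumbent_won) %>%
  mutate(share = n / sum(n))   # incumbents win roughly 84% of races

ggplot(win_rate, aes(x = incumbent_won, y = share)) +
  geom_col() +
  labs(x = "Incumbent won the race", y = "Share of district races",
       title = "House incumbents win about 84% of the time, 1945-2020")
```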

While this data is important at the district level, we learned about other incumbency phenomena that take place at the national level, such as the House flipping at midterm elections and the president’s party losing seats. Although this incumbency data surely carries high predictive power at the district level, it does not give us much at the national level. In a similar vein, however, rather than looking at the House majority incumbent from the previous cycle, we can use the previous election’s vote share and seat share to predict those of the following election. I created lagged versions of both variables and will incorporate them into my final model, along with the president’s party variable that I previously mentioned.
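A small sketch of how those lags might be built, assuming a national-level data frame `nat` with one row per election year and the columns `D_majorvote_pct` and `DemSeatShare` that appear in the regression tables below (the data frame name and derived column names are illustrative):

```r
library(dplyr)

# One row per election year; lag() pulls the previous election's value forward.
nat <- nat %>%
  arrange(year) %>%
  mutate(
    lag_dem_vote_share = lag(D_majorvote_pct),  # previous election's vote share
    lag_dem_seat_share = lag(DemSeatShare)      # previous election's seat share
  )
```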

The final variable I will be using in my model is an indicator for whether the election takes place in a midterm year. From our discussions and readings, we know that a different set of voters turns out for midterm elections than for presidential elections. We also learned in class that many Americans do not even know who their representative is. Those who vote in a midterm year are likely more politically involved and stay up to date with the news. I will also be interacting this term with the president’s party variable, with the thought that it could account for some of the seat flipping that we historically see. Since 1955, every House flip has been the result of a midterm election (Reuters). This interaction term can account for the additional losses the president’s party suffers during midterms but not during presidential election years.

Throughout the semester, we have also worked with several other district-level variables that I will not be including in my model. We learned about advertising and the amount parties spend on it. Gerber et al. (2011) found that “televised ads have strong but short-lived effects on voting preferences.” With this in mind, along with in-class discussions about advertising data being unreliable and sometimes unavailable, I decided to forgo using it in my model. Other district-level variables, such as the cost of voting, ground campaigning efforts, and voter turnout, I will not be adapting for my national model.

Formulas

Given all the data we have considered using this semester, and after many rounds of trial and error, I have created two final models. Below are the model equations and their corresponding regression tables using the national data that we are working with.

Democratic Vote Share

\[ \text{Democratic Vote Share} = \beta_0 + \beta_1\,\text{Generic Ballot Democrat} + \beta_2\,\text{Lag House Dem Vote Share} + \beta_3\,\text{President Party R} + \beta_4\,\text{President Approval} + \beta_5\,\text{Midterm} + \beta_6\,\text{Unemployment Rate} + \beta_7\,(\text{President Party R} \times \text{Midterm}) + \beta_8\,(\text{President Party R} \times \text{Unemployment}) \]
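A sketch of how Model 6 could be fit in R, assuming the illustrative `nat` data frame from above with hypothetical column names for each predictor (this is the shape of the model, not the exact code behind the table):

```r
# Model 6: in R, a * b expands to a + b + a:b, so both main effects and the
# interaction terms are included automatically.
vote_model <- lm(
  D_majorvote_pct ~ generic_ballot_dem + lag_dem_vote_share + pres_approval +
    pres_party_rep * midterm_year + pres_party_rep * unemployment_rate,
  data = nat
)
summary(vote_model)
```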

Dependent variable: D_majorvote_pct

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| Generic Ballot Support - D | 0.386*** (0.099) | 0.354*** (0.098) | 0.366*** (0.093) | 0.389*** (0.093) | 0.272*** (0.083) | 0.229** (0.082) |
| Lag House Dem VS | | 0.237 (0.147) | 0.320** (0.144) | 0.296** (0.142) | 0.405*** (0.119) | 0.426*** (0.119) |
| President Party - R | | | 1.965** (0.922) | 1.905** (0.909) | -0.811 (1.042) | -6.339** (2.879) |
| President Approval Rating | | | | -5.416 (3.970) | -3.627 (3.256) | -5.435* (3.155) |
| Midterm Year | | | | | -3.837*** (1.046) | -4.644*** (1.053) |
| Unemployment Rate | | | | | | -0.690** (0.307) |
| President Party - R * Midterm Year | | | | | 5.851*** (1.564) | 6.793*** (1.565) |
| President Party - R * Unemployment | | | | | | 0.793* (0.424) |
| Constant | 33.616*** (4.809) | 22.757*** (8.182) | 16.775* (8.204) | 19.685** (8.353) | 20.475*** (6.756) | 27.238*** (7.042) |
| Observations | 31 | 31 | 31 | 31 | 31 | 31 |
| R² | 0.344 | 0.400 | 0.487 | 0.521 | 0.711 | 0.767 |
| Adjusted R² | 0.322 | 0.357 | 0.430 | 0.447 | 0.639 | 0.682 |
| Residual Std. Error | 2.674 (df = 29) | 2.602 (df = 28) | 2.452 (df = 27) | 2.414 (df = 26) | 1.951 (df = 24) | 1.830 (df = 22) |
| F Statistic | 15.217*** (df = 1; 29) | 9.343*** (df = 2; 28) | 8.530*** (df = 3; 27) | 7.067*** (df = 4; 26) | 9.849*** (df = 6; 24) | 9.051*** (df = 8; 22) |

Note: *p<0.1; **p<0.05; ***p<0.01

Model 6 is my final model for Democratic vote share; the other models in this regression table simply show how each variable impacts and improves the model one step at a time. In this model, all of the variables are significant at the 0.1 level, and most at the 0.05 level. While approval rating and the lag of House Democratic vote share are the only variables that were not immediately significant when first added to the model, they both increased the R-squared and adjusted R-squared values and eventually reached significance in the final model. Model 6 has an R-squared of 0.767 and an adjusted R-squared of 0.682. I did some testing with other variables, but adding any of the others I was considering (such as percent change in RDI) ended up decreasing the adjusted R-squared. The most significant variables in this model are the indicator for whether the election is a midterm and the interaction between this indicator and the president’s party.

In this model, every additional percentage point of Democratic support in the generic ballot increases the predicted Democratic vote share by 0.229 percentage points. This positive relationship makes sense, as we expect Democratic vote share to be higher in years when people feel more favorably towards the party overall. Every additional percentage point in the previous election’s House Democratic vote share predicts a 0.426 percentage point increase in the current Democratic vote share. Again, this positive relationship is very intuitive.

When the president is a Republican, this negatively impacts Democratic vote share for the House, lowering it by 6.339 percentage points. The pattern we have discussed of the president’s party losing seats in the House occurs mainly in midterm years, which we see in the interaction term between the president’s party and the midterm indicator. The coefficient on this interaction term is a positive 6.793, which tells us that in midterm years with a Republican president, the interaction nearly cancels out the main effect, nudging Democratic vote share slightly upward. When we first added the president’s party variable in Model 3, prior to any interaction, it was significant and positive; in that simplified model the coefficient says that, across all years, a Republican president is associated with higher House Democratic vote share. Once the interaction is included, however, the picture is clearer: in non-midterm years the president’s party aligns with the way voters vote for House representatives, so a Republican president hurts Democratic vote share, while in midterm years that penalty largely disappears.

While presidential approval rating was not significant in Models 4 or 5, it reached significance in Model 6. The approval rating is on a 0-1 scale, with the percentages represented as decimals rather than whole numbers like some of the other variables. A president with an approval of 1, or 100%, would see Democratic vote share decrease by 5.435 percentage points in the model. This variable would be easier to interpret if the dependent variable were the president’s party’s vote share rather than Democratic vote share, since it would then be independent of party. Approval ratings rarely fall at either extreme, with most landing between 40% and 50%; at those levels the implied decrease in vote share is only about 2 to 2.7 percentage points, but the intuition behind the sign of the coefficient remains ambiguous.

Looking at the unemployment rate coefficient and its interaction with the president’s party, there are a few things to note. First, the variable is coded as the unemployment rate in Quarter 3 (July to September) as reported by the Bureau of Labor Statistics. The unemployment rate coefficient is -0.690, telling us that for every percentage point increase in the unemployment rate, Democratic vote share declines by 0.69 percentage points when the president is a Democrat. When the president is a Republican, however, the interaction term of 0.793 comes into play, and the net effect becomes a 0.103 percentage point increase in Democratic vote share per point of unemployment. This suggests that when unemployment is high, people likely blame the party of the president in power and vote in opposition to that party.

Democratic Seat Share

\[ \text{Democratic Seat Share} = \beta_0 + \beta_1\,\text{Generic Ballot Democrat} + \beta_2\,\text{Lag House Dem Seat Share} + \beta_3\,\text{President Party R} + \beta_4\,\text{President Approval} + \beta_5\,\text{Midterm} + \beta_6\,\text{Unemployment Rate} + \beta_7\,(\text{President Party R} \times \text{Midterm}) + \beta_8\,(\text{President Party R} \times \text{Unemployment}) \]
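The seat share model follows the same recipe, swapping the seat share lag in for the vote share lag; a sketch under the same illustrative naming assumptions as above:

```r
# Same specification, with the seat share lag swapped in for the vote share lag.
seat_model <- lm(
  DemSeatShare ~ generic_ballot_dem + lag_dem_seat_share + pres_approval +
    pres_party_rep * midterm_year + pres_party_rep * unemployment_rate,
  data = nat
)
summary(seat_model)
```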

Dependent variable: DemSeatShare

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| Generic Ballot Support - D | 0.876*** (0.214) | 0.635*** (0.179) | 0.655*** (0.169) | 0.690*** (0.170) | 0.500*** (0.172) | 0.417** (0.173) |
| Lag House Dem Seat Share | | 0.512*** (0.121) | 0.560*** (0.116) | 0.550*** (0.115) | 0.624*** (0.108) | 0.633*** (0.112) |
| President Party - R | | | 3.403** (1.596) | 3.347** (1.580) | -0.713 (2.070) | -10.078* (5.857) |
| President Approval Rating | | | | -8.731 (7.009) | -6.249 (6.410) | -9.768 (6.311) |
| Midterm Year | | | | | -5.101** (2.064) | -6.607*** (2.088) |
| Unemployment Rate | | | | | | -1.314** (0.610) |
| President Party - R * Midterm Year | | | | | 8.521** (3.119) | 10.355*** (3.153) |
| President Party - R * Unemployment | | | | | | 1.294 (0.875) |
| Constant | 12.723 (10.406) | -4.066 (9.158) | -9.406 (8.983) | -6.202 (9.257) | -0.016 (8.649) | 14.566 (10.662) |
| Observations | 31 | 31 | 31 | 31 | 31 | 31 |
| R² | 0.365 | 0.614 | 0.670 | 0.688 | 0.766 | 0.807 |
| Adjusted R² | 0.344 | 0.586 | 0.633 | 0.640 | 0.707 | 0.736 |
| Residual Std. Error | 5.786 (df = 29) | 4.593 (df = 28) | 4.327 (df = 27) | 4.284 (df = 26) | 3.865 (df = 24) | 3.668 (df = 22) |
| F Statistic | 16.704*** (df = 1; 29) | 22.269*** (df = 2; 28) | 18.242*** (df = 3; 27) | 14.349*** (df = 4; 26) | 13.074*** (df = 6; 24) | 11.464*** (df = 8; 22) |

Note: *p<0.1; **p<0.05; ***p<0.01

Running the same regression again, but replacing vote share with seat share for both the dependent variable and the lagged independent variable, I found similar results. However, there were a couple of variables whose significance did not hold in the seat share model. The president’s approval rating is not significant in any of the three models it appears in and has very high standard errors. Additionally, the interaction between the president’s party and the unemployment rate does not hold significance in this model either. Despite these insignificant variables, their presence improves the adjusted R-squared, so I have decided to leave them in. As in the vote share model, the most significant variables are again the midterm indicator and its interaction with the president’s party. In addition, this model shows the lag of seat share to be a highly significant variable. Slightly improving on the R-squared and adjusted R-squared of the vote share model, my final model here has an R-squared of 0.807 and an adjusted R-squared of 0.736.

In changing the dependent variable, we do not see any changes in the signs of the final model coefficients. Even though the magnitudes of the coefficients vary slightly, the overall effect of each variable on Democratic seat share remains the same.

Midterm Models

After looking at both of those models that use data across all elections from 1960-2020, I wanted to see how these models would change if I limited the data to only midterm election years. As previously discussed, there is a different voting body in midterm elections versus presidential elections. In limiting the years I model on, I am hoping to improve their predictive power.

I removed the midterm indicator variable, since the filter takes care of that, and also removed the unemployment rate because it caused a significant decrease in the adjusted R-squared. This left me with just the generic ballot support for Democrats, the president’s party, and the previous election’s Democratic vote/seat share as independent variables. I ran this model on both vote share and seat share; a sketch of the setup and the results of this inquiry are below.
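A minimal sketch of this step, again under the illustrative naming assumptions used above:

```r
# Keep only midterm election years and refit the trimmed specification.
nat_midterm <- subset(nat, midterm_year == 1)

vote_model_mid <- lm(D_majorvote_pct ~ generic_ballot_dem + pres_party_rep +
                       lag_dem_vote_share, data = nat_midterm)
seat_model_mid <- lm(DemSeatShare ~ generic_ballot_dem + pres_party_rep +
                       lag_dem_seat_share, data = nat_midterm)
```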

Dependent variable: D_majorvote_pct (midterm years only)

| | (1) | (2) | (3) |
|---|---|---|---|
| Generic Ballot Support - D | 0.671*** (0.152) | 0.568*** (0.122) | 0.515*** (0.130) |
| President Party - R | | 3.290*** (1.058) | 4.165*** (1.310) |
| Lag House Dem VS | | | 0.268 (0.241) |
| Constant | 19.272** (7.461) | 22.578*** (5.876) | 10.626 (12.225) |
| Observations | 15 | 15 | 15 |
| R² | 0.600 | 0.779 | 0.801 |
| Adjusted R² | 0.569 | 0.742 | 0.747 |
| Residual Std. Error | 2.540 (df = 13) | 1.967 (df = 12) | 1.948 (df = 11) |
| F Statistic | 19.513*** (df = 1; 13) | 21.100*** (df = 2; 12) | 14.755*** (df = 3; 11) |

Note: *p<0.1; **p<0.05; ***p<0.01

Dependent variable: DemSeatShare (midterm years only)

| | (1) | (2) | (3) |
|---|---|---|---|
| Generic Ballot Support - D | 1.309*** (0.287) | 1.206*** (0.290) | 0.923*** (0.224) |
| President Party - R | | 3.266 (2.511) | 6.320*** (2.010) |
| Lag House Dem Seat Share | | | 0.493*** (0.142) |
| Constant | -9.051 (14.076) | -5.769 (13.947) | -20.835* (10.938) |
| Observations | 15 | 15 | 15 |
| R² | 0.616 | 0.663 | 0.840 |
| Adjusted R² | 0.586 | 0.607 | 0.796 |
| Residual Std. Error | 4.792 (df = 13) | 4.669 (df = 12) | 3.364 (df = 11) |
| F Statistic | 20.851*** (df = 1; 13) | 11.826*** (df = 2; 12) | 19.236*** (df = 3; 11) |

Note: *p<0.1; **p<0.05; ***p<0.01

Both of these models have higher R-squared and adjusted R-squared values than their all-elections counterparts, with the vote share model reaching an R-squared of 0.801 and the seat share model an R-squared of 0.840.

The lag of Democratic House vote share in the vote share model is insignificant, but it improves the R-squared and adjusted R-squared, so I kept it in, as well as for consistency with the other models.

2022 Predictions

Using generic ballot data from FiveThirtyEight, Biden approval ratings from Gallup, and the 2022 Q3 unemployment rate from the BLS, I generated predictions from each of the four models, including their 95% confidence intervals, denoted by “lwr” and “upr”:
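For context, a sketch of how such intervals can be produced with `predict()`, using the illustrative model objects defined above; the 2022 input values below are placeholders standing in for the FiveThirtyEight, Gallup, and BLS figures, not the exact numbers fed into the models:

```r
# Placeholder 2022 inputs; the real values come from FiveThirtyEight (generic
# ballot), Gallup (approval, on the 0-1 scale), and the BLS (Q3 unemployment).
new_2022 <- data.frame(
  generic_ballot_dem = 45.8,
  lag_dem_vote_share = 51.2,
  lag_dem_seat_share = 51.0,
  pres_party_rep     = 0,    # Democratic president in 2022
  pres_approval      = 0.42,
  midterm_year       = 1,
  unemployment_rate  = 3.6
)

predict(vote_model,     newdata = new_2022, interval = "confidence")
predict(seat_model,     newdata = new_2022, interval = "confidence")
predict(vote_model_mid, newdata = new_2022, interval = "confidence")
predict(seat_model_mid, newdata = new_2022, interval = "confidence")
```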

Democratic Vote Share using all years

##        fit      lwr     upr
## 1 50.20221 47.86202 52.5424

Democratic Seat Share using all years

##        fit     lwr      upr
## 1 50.50952 45.7082 55.31085

Democratic Vote Share using midterm years

##        fit      lwr      upr
## 1 47.85624 45.81439 49.89809

Democratic Seat Share using midterm years

##        fit      lwr      upr
## 1 46.38345 42.94734 49.81955

The predictions for the first two models, which use data across all years, are slightly higher than I would expect, especially for seat share. Seat share here is predicted to be split just about 50/50, which most experts do not believe will be the case this year. The models using data only from midterm elections, however, seem likely to predict this year’s election more accurately, at least in terms of being closer to what the experts are saying.

The two midterm-only models also have 95% confidence intervals that do not cross 50%, meaning the models are quite confident that Democrats will not win more than 50% of either the two-party vote share or the seat share; in other words, Republicans would win the House.

The model of Democratic seat share using midterm years predicts that Democrats will win 202 seats in the House, thus losing their majority, compared to the roughly 220 seats predicted by the model that uses all years.
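The seat counts follow from converting the predicted seat shares into seats out of 435; a quick check of that arithmetic:

```r
# Convert predicted Democratic seat share (percent of 435 seats) into seats.
round(0.4638345 * 435)   # midterm-only seat model: about 202 seats
round(0.5050952 * 435)   # all-years seat model: about 220 seats
```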

In-Sample Testing

To validate these models, I conducted in-sample testing, predicting the vote and seat shares for each election year in the data. I plotted these predictions against the actual vote and seat shares in those years, producing the following graphs. The R-squared value of each model’s fit is also shown on its plot; an R-squared of 1 would mean the model perfectly predicts the outcome.
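A sketch of that in-sample check for one of the models, using the illustrative objects defined earlier (the actual plots also annotate the R-squared and the 2022 prediction):

```r
library(ggplot2)

# Predicted vs. actual vote share for the all-years model, using only the
# election years that entered the regression.
insample <- data.frame(
  actual    = model.frame(vote_model)$D_majorvote_pct,
  predicted = fitted(vote_model)
)

ggplot(insample, aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Actual Democratic vote share (%)",
       y = "Predicted Democratic vote share (%)",
       title = "In-sample fit, all-years vote share model")
```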

In each graph, I also added the 2022 prediction from the corresponding regression model. While the second model (seat share using all years) had the highest R-squared in its in-sample validation, its prediction seems a bit too high given what we know about the expert forecasts for this year’s election: a predicted seat share of 50.50952% is equivalent to 220 Democratic seats. The fourth model (seat share using midterm years) had the lowest in-sample validation R-squared, but it had the highest R-squared of all four regressions.

Looking at these plots, they all have fairly high R-squared values, which instills some confidence in our predictions. Surprisingly, the all-years seat share model has the highest R-squared of 0.9, even though its prediction of Democrats winning 220 seats seemed very off. Both vote share models have similar accuracy, with the midterm-years version at an R-squared of 0.88 and the all-years version at 0.89. The midterm-years seat share model has the lowest accuracy, with an R-squared of 0.83, which is still high, just lower than the others.

Conclusion

The chart below shows the results from my seat share models and compares them to the current House make-up.

In conclusion, these models seem to suggest a close race. The models that use data across all elections predict a nearly even 50/50 split in both vote share and seat share, while the midterm-only models favor Republicans, predicting that they will take the House and leave Democrats with 202 seats.

Throughout this class we have learned all sorts of modelling techniques and taken close looks at various independent variables to include. While it is impossible to know what will certainly happen on election day and predict exactly how everyone will vote, these forecasting methods will hopefully get us close. I look forward to reflecting on these models and methods after seeing how they performed.