R Programming By Example

Predicting votes with linear models

Before we can make any predictions, we need to specify a model and train it with our training data (data_train) so that it learns how to provide the predictions we're looking for. Training means solving an optimization problem whose output is the set of numbers used as the parameters of our model's predictions. R makes it very easy for us to accomplish such a task.

The standard way of specifying a linear regression model in R is to use the lm() function, passing the model we want to build as a formula along with the data that should be used, and saving the result into an object (in this case fit) that we can use to explore the results in detail. For example, the simplest model we can build is one with a single regressor (independent variable), as follows:

fit <- lm(Proportion ~ Students, data_train)

In this simple model, we let R know that we want to run a regression in which we try to explain the Proportion variable using only the Students variable in the data. This model is too simple; what happens if we want to include a second variable? We can add it using the plus (+) sign after the other regressors. For example (keep in mind that this overrides the previous fit object with the new results, so if you want to keep both of them, make sure you give the resulting objects different names):

fit <- lm(Proportion ~ Students + Age_18to44, data_train)

This may be a better way of explaining the Proportion variable, since we are working with more information. However, keep the collinearity problem in mind: it's likely that the higher the student percentage in a ward (Students), the higher the percentage of relatively young people (Age_18to44), which means we may not be adding independent information to the regression. Of course, in most situations this is not a binary issue but one of degree, and the analyst must be able to handle such situations. We'll touch more on this when checking the model's assumptions in the next section. For now, let's get back to programming, shall we? What if we want to include all the variables in the data? We have two options: include all variables manually, or use R's shortcut for doing so:

# Manually
fit <- lm(Proportion ~ ID + RegionName + NVotes + Leave + Residents + Households + White +
          Owned + OwnedOutright + SocialRent + PrivateRent + Students + Unemp + UnempRate_EA +
          HigherOccup + Density + Deprived + MultiDepriv + Age_18to44 + Age_45plus + NonWhite +
          HighEducationLevel + LowEducationLevel, data_train)

# R's shortcut
fit <- lm(Proportion ~ ., data_train)

These two models are exactly the same. However, there are a couple of subtle points we need to mention. First, when specifying the model manually, we had to leave the Proportion variable explicitly out of the regressors (the variables after the ~ symbol) so that we don't get an error when running the regression (it would not make sense for R to let us try to explain the Proportion variable using the Proportion variable itself, among other things). Second, if we make any typos while writing the variable names, we will get errors, since those names will not be present in the data (and if, by coincidence, a typo happens to match another existing variable in the data, it can be a hard mistake to diagnose). Third, in both cases the list of regressors includes variables that should not be there, such as ID, RegionName, NVotes, Leave, and Vote. In the case of ID, it doesn't make sense to include it in the analysis because it carries no information about Proportion; it's just an identifier. In the case of RegionName, it's a categorical variable, so the regression would stop being a standard multiple linear regression; R would automatically make it work for us, but if we don't understand what we're doing, it may produce confusing results. In this case we want to work only with numerical variables, so we can easily remove it in the manual case, but we can't do that in the shortcut case. Finally, NVotes, Leave, and Vote express the same information in slightly different ways, so they shouldn't be included either, since we would have a multicollinearity problem.
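As an aside, if you want a quick, rough sense of how much two regressors overlap before including both, a pairwise correlation check is an easy starting point (a minimal sketch using the two variables discussed above; it only captures linear association between pairs of numeric columns):

# Correlation between two potentially redundant regressors;
# values close to 1 or -1 suggest they carry largely the same information
cor(data_train$Students, data_train$Age_18to44)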

Let's say the final model we want to work with includes all the valid numerical variables:

fit <- lm(Proportion ~ Residents + Households + White + Owned + OwnedOutright + SocialRent +
          PrivateRent + Students + Unemp + UnempRate_EA + HigherOccup + Density + Deprived +
          MultiDepriv + Age_18to44 + Age_45plus + NonWhite + HighEducationLevel +
          LowEducationLevel, data_train)

If we want to use the shortcut method instead, we can make sure that the data does not contain the problematic variables (using the selection techniques we looked at in Chapter 1, Introduction to R) and then use the shortcut.
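For example, one way to do this (a minimal sketch; the exact set of columns to drop depends on your data, and data_train_numerical is just an illustrative name) is to remove the problematic columns first and then fit the model with the dot shortcut:

# Drop the identifier and redundant variables, keeping only the valid numerical regressors
# (the column names listed here are the ones discussed above)
columns_to_drop <- c("ID", "RegionName", "NVotes", "Leave", "Vote")
data_train_numerical <- data_train[, !(names(data_train) %in% columns_to_drop)]

# The "." now expands to every remaining column except Proportion
fit <- lm(Proportion ~ ., data_train_numerical)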

To take a look at the results in detail, we use the summary() function on the fit object:

summary(fit)
#>
#> Call:
#> lm(formula = Proportion ~ Residents + Households + White + Owned +
#> OwnedOutright + SocialRent + PrivateRent + Students + Unemp +
#> UnempRate_EA + HigherOccup + Density + Deprived + MultiDepriv +
#> Age_18to44 + Age_45plus + NonWhite + HighEducationLevel +
#> LowEducationLevel, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.21606 -0.03189 0.00155 0.03393 0.26753
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.30e-02 3.38e-01 0.10 0.92222
#> Residents 7.17e-07 2.81e-06 0.26 0.79842
#> Households -4.93e-06 6.75e-06 -0.73 0.46570
#> White 4.27e-03 7.23e-04 5.91 6.1e-09 ***
#> Owned -2.24e-03 3.40e-03 -0.66 0.51071
#> OwnedOutright -3.24e-03 1.08e-03 -2.99 0.00293 **
#> SocialRent -4.08e-03 3.60e-03 -1.13 0.25847
#> PrivateRent -3.17e-03 3.59e-03 -0.89 0.37629
#> Students -8.34e-04 8.67e-04 -0.96 0.33673
#> Unemp 5.29e-02 1.06e-02 5.01 7.3e-07 ***
#> UnempRate_EA -3.13e-02 6.74e-03 -4.65 4.1e-06 ***
#> HigherOccup 5.21e-03 1.24e-03 4.21 2.9e-05 ***
#> Density -4.84e-04 1.18e-04 -4.11 4.6e-05 ***
#> Deprived 5.10e-03 1.52e-03 3.35 0.00087 ***
#> MultiDepriv -6.26e-03 1.67e-03 -3.75 0.00019 ***
#> Age_18to44 3.46e-03 1.36e-03 2.55 0.01117 *
#> Age_45plus 4.78e-03 1.27e-03 3.75 0.00019 ***
#> NonWhite 2.59e-03 4.47e-04 5.80 1.1e-08 ***
#> HighEducationLevel -1.14e-02 1.14e-03 -9.93 < 2e-16 ***
#> LowEducationLevel 4.92e-03 1.28e-03 3.85 0.00013 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Residual standard error: 0.0523 on 542 degrees of freedom
#> Multiple R-squared: 0.868, Adjusted R-squared: 0.863
#> F-statistic: 187 on 19 and 542 DF, p-value: <2e-16

These results tell us which command was used to create our model, which is useful when you're creating various models and want to quickly know which model the results you're looking at belong to. They also show some information about the distribution of the residuals. Next, they show the regression's results for each variable used in the model. We get the name of the variable ((Intercept) is the standard linear regression intercept used in the model's specification), the coefficient estimate for the variable, the standard error, the t statistic, the p-value, and a visual representation of the p-value using asterisks as significance codes. At the end of the results, we see other figures associated with the model as a whole, including the R-squared and the F-statistic. As mentioned earlier, we won't go into detail about what each of these means, and we will continue to focus on the programming techniques. If you're interested, you may look at Casella and Berger's Statistical Inference (2002) or Rice's Mathematical Statistics and Data Analysis (1995).
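If you want to work with these numbers programmatically rather than just read them off the printed summary, the fitted object and its summary can be queried directly (a small sketch using base R accessors):

# Coefficient table as a matrix: estimates, standard errors, t values, p-values
coef(summary(fit))

# Individual pieces of the fit and its summary
coef(fit)                   # named vector of coefficient estimates
summary(fit)$r.squared      # multiple R-squared
summary(fit)$adj.r.squared  # adjusted R-squared
confint(fit)                # confidence intervals for the coefficients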

Now that we have a fitted model ready in the fit object, we can use it to make predictions. To do so, we use the predict() function with the fit object and the data we want to produce predictions for, data_test in our case. This returns a vector of predictions that we store in the predictions object. We will get one prediction for each observation in the data_test object:

predictions <- predict(fit, data_test)

These predictions can be measured for accuracy, as we will do in a later section of this chapter. For now, we know how to generate predictions easily with R.
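As a preview of what such an accuracy check might look like (a minimal sketch; the later section may use a different score), you can compare the predictions against the observed Proportion values in the test set, for example with the root mean squared error:

# Root mean squared error between predicted and observed proportions
# (assumes data_test contains the observed Proportion column)
rmse <- sqrt(mean((data_test$Proportion - predictions)^2))
rmse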