Getting a better look with detailed scatter plots

Now that we know how to get a big-picture view of the scatter plots to get a general sense of the relations among variables, how can we get a more detailed look into each scatter plot? Well, I'm glad you asked! To achieve this, we'll do it in two steps. First, we are going to work on producing a single, detailed scatter plot that we're happy with. Second, we're going to develop a simple algorithm that will traverse all variable combinations and create the corresponding plot for each of them:

Scatter plot for NoQuals vs AdultMeanAge vs Proportion with Regression Line

The preceding graph shows our prototype scatter plot. It puts one variable on each axis (NoQuals on the x axis and AdultMeanAge on the y axis in our case), colors each point according to the corresponding Proportion, and places a linear regression line on top to give a general sense of the relation between the variables on the axes. Compare this plot to the left scatter plot of the previous pair of scatter plots. They are the same plot, but this one is more detailed and conveys more information. This plot seems good enough for now.

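library(ggplot2)
library(viridis)  # provides scale_color_viridis()

# Map the axes and the point color to columns of data
plot <- ggplot(data, aes(x = NoQuals, y = AdultMeanAge, color = Proportion))
# Overlay a linear regression line, without its confidence band
plot <- plot + stat_smooth(method = "lm", col = "darkgrey", se = FALSE)
plot <- plot + scale_color_viridis()
plot <- plot + geom_point()
print(plot)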

Now we need to develop the algorithm that will take all variable combinations and create the corresponding plots. We present the full algorithm and explain it part by part. As you can see, we start by defining the create_plots_iteratively() function, which receives two parameters: data and plot_function. The algorithm gets the variable names from the data and stores them in the vars variable. It then removes Proportion from them, because these names will be used to create the combinations for the axes, and Proportion will never be used on an axis; it will be used exclusively for the colors.

Now, if we imagine all the variable combinations in a matrix like the one for the matrix scatter plot shown previously, then we need to traverse either the upper triangle or the lower triangle to get all possible combinations (the upper and lower triangles of the matrix of scatter plots are symmetrical, so they convey the same information). To traverse one of these triangles, we can use a known pattern with two for loops, one for each axis, where the inner loop only needs to start just after the position of the outer loop (this is what forms a triangle). The -1 and +1 are there to make sure we start and finish in the appropriate places in each loop without running past the bounds of the vector.
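To see the triangle pattern on its own, here is a minimal sketch that only collects the pair names instead of plotting anything. The four variable names are hypothetical placeholders, not columns from our data.

```r
# Hypothetical variable names, used only for illustration.
vars <- c("A", "B", "C", "D")

pairs <- character(0)
for (i in 1:(length(vars) - 1)) {
    # The inner loop starts one past the outer index, so every
    # unordered pair is visited exactly once (the upper triangle).
    for (j in (i + 1):length(vars)) {
        pairs <- c(pairs, paste(vars[i], vars[j], sep = "_"))
    }
}
print(pairs)

# Four variables give choose(4, 2) = 6 pairs, with no repeats
# and no variable paired with itself.
stopifnot(length(pairs) == choose(length(vars), 2))
```

Note that the pattern assumes at least two variables; with a single one, 1:(length(vars) - 1) would count backwards from 1 to 0.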

Inside the inner loop is where we create the file name for the plot, by concatenating the two variable names with the paste() function, and where we create the plot itself with the plot_function we receive as a parameter (more on this ahead). The png() and dev.off() functions, which the plotting function calls whenever it receives a non-empty save_to, are there to save the plots to the computer's hard drive. Think of png() as the place where R starts looking for a graph, and dev.off() as the place where it stops the saving process. Feel free to look into their documentation or read more about devices in R.

create_plots_iteratively <- function(data, plot_function) {
    vars <- colnames(data)
    # Proportion is only used for color, never on an axis
    vars <- vars[vars != "Proportion"]
    for (i in 1:(length(vars) - 1)) {
        for (j in (i + 1):length(vars)) {
            save_to <- paste(vars[i], "_", vars[j], ".png", sep = "")
            # Pass save_to by name so it is not mistaken for var_color
            plot_function(data, vars[i], vars[j], save_to = save_to)
        }
    }
}
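The png() and dev.off() pairing can also be tried in isolation. The following is a minimal sketch using a base R plot and a hypothetical file name inside a temporary directory, assuming the R installation has PNG support.

```r
# Hypothetical output path inside R's per-session temporary directory.
save_to <- file.path(tempdir(), "device_demo.png")

png(save_to)            # open a PNG device; plots drawn from now on go to the file
plot(1:10, (1:10)^2)    # any plot drawn while the device is open
invisible(dev.off())    # close the device, which finishes writing the file

file.exists(save_to)
```

One detail worth knowing when using ggplot2 with this pairing: inside a function or a loop, a ggplot object is not drawn automatically, so it needs an explicit print() call between png() and dev.off().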

We're almost done; we just need to wrap the code we used for our plot prototype in a function and we will be all set. As you can see, we extracted the x, y, and color arguments of the aes() function into variables that are sent as parameters to the function (this is called parametrizing arguments), and we replaced the aes() function with the aes_string() function, which accepts the variable names as strings. We also added the option to send var_color as FALSE to get a version of the plot without colors. Everything else is kept the same:

prototype_scatter_plot <- function(data, var_x, var_y, var_color = "Proportion", save_to = "") {
    # A column name (string) means "color by this variable";
    # passing FALSE produces the plot without colors
    if (is.character(var_color)) {
        plot <- ggplot(data, aes_string(x = var_x, y = var_y, color = var_color))
    } else {
        plot <- ggplot(data, aes_string(x = var_x, y = var_y))
    }
    plot <- plot + stat_smooth(method = "lm", col = "darkgrey", se = FALSE)
    plot <- plot + scale_color_viridis()
    plot <- plot + geom_point()
    # Only open and close the PNG device when a file name was given
    if (not_empty(save_to)) png(save_to)
    print(plot)
    if (not_empty(save_to)) dev.off()
}

Since we will be checking in various places whether the save_to string is empty, we name the check and wrap it in the not_empty() function. Now it's a bit easier to read our code.

not_empty <- function(file) {
    return(file != "")
}
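As a side note, aes_string() has been soft-deprecated since ggplot2 3.0.0 in favor of using aes() with the .data pronoun. The following is a minimal sketch of the same parametrization in that style. It assumes a recent ggplot2, uses a hypothetical function name (scatter_plot_tidy) and made-up toy data, and leaves out the viridis scale and the file saving to stay short.

```r
library(ggplot2)

# Toy data with made-up values; only the column names mirror the text.
set.seed(1)
toy <- data.frame(
    NoQuals = runif(50, 10, 40),
    AdultMeanAge = runif(50, 35, 55),
    Proportion = runif(50)
)

# Hypothetical variant of the prototype: column names arrive as strings
# and are looked up with .data[[...]] inside aes().
scatter_plot_tidy <- function(data, var_x, var_y, var_color = "Proportion") {
    plot <- ggplot(data, aes(x = .data[[var_x]],
                             y = .data[[var_y]],
                             color = .data[[var_color]]))
    plot <- plot + stat_smooth(method = "lm", formula = y ~ x,
                               color = "darkgrey", se = FALSE)
    plot <- plot + geom_point()
    # Set the labels explicitly so they show the plain column names
    plot <- plot + labs(x = var_x, y = var_y, color = var_color)
    plot
}

p <- scatter_plot_tidy(toy, "NoQuals", "AdultMeanAge")
print(p)
```

The rest of this section keeps using aes_string(), which still works but may emit a deprecation warning in newer versions of ggplot2.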

With this prototype_scatter_plot() function, we can re-create the right-hand scatter plot of the pair shown previously, as well as the plot for any other variable combination, quite easily. This seems pretty powerful, doesn't it?

Scatter plot for L4Quals_plus vs AdultMeanAge vs Proportion with Regression Line

Let's have a look at the following code:

prototype_scatter_plot(data, "L4Quals_plus", "AdultMeanAge")

Now that we have done the hard work, we can create all possible combinations quite easily. We just need to call the create_plots_iteratively() function with our data and the prototype_scatter_plot() function. Using functions as parameters for other functions is known as the strategy pattern. The name comes from the fact that we can easily swap our plotting strategy for any other function that receives the same parameters (data, var_x, var_y, and save_to), without having to change the algorithm that traverses the variable combinations. This kind of flexibility is very powerful:

create_plots_iteratively(data, prototype_scatter_plot)

This will create all the plots for us and save them to our hard drive. Pretty cool, huh? Now we can look at each of them independently and do whatever we need with them, as we already have them as PNG files.
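To illustrate swapping the strategy, here is a minimal sketch that reuses the same traversal with a different plotting function built on base R graphics. The base_scatter_plot() function, the out_dir variable, and the toy data are all hypothetical; the traversal is repeated only so that the example runs on its own, and the files are written to a temporary directory rather than the working directory.

```r
# Traversal as described in the text; plot_function is the strategy.
create_plots_iteratively <- function(data, plot_function) {
    vars <- colnames(data)
    vars <- vars[vars != "Proportion"]
    for (i in 1:(length(vars) - 1)) {
        for (j in (i + 1):length(vars)) {
            save_to <- paste(vars[i], "_", vars[j], ".png", sep = "")
            plot_function(data, vars[i], vars[j], save_to = save_to)
        }
    }
}

# Hypothetical output directory for this sketch.
out_dir <- tempdir()

# Hypothetical alternative strategy: a plain base R scatter plot with a
# regression line and no color, accepting the same parameters.
base_scatter_plot <- function(data, var_x, var_y, save_to = "") {
    if (save_to != "") png(file.path(out_dir, save_to))
    plot(data[[var_x]], data[[var_y]], xlab = var_x, ylab = var_y)
    abline(lm(data[[var_y]] ~ data[[var_x]]), col = "darkgrey")
    if (save_to != "") invisible(dev.off())
}

# Toy data with made-up values; only the column names mirror the text.
set.seed(1)
toy <- data.frame(
    NoQuals = runif(30),
    AdultMeanAge = runif(30),
    L4Quals_plus = runif(30),
    Proportion = runif(30)
)

# Same traversal, different strategy: three variables give three files.
create_plots_iteratively(toy, base_scatter_plot)
list.files(out_dir, pattern = "\\.png$")
```

The traversal code did not change at all; only the function passed as plot_function did.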