Stephen Feagin
October 3, 2016
library(dplyr) library(ggplot2) set.seed(500)
Often, we have a distribution of data that we would like to visualize.
data <- data_frame( x = seq(-6, 6, length.out = 10000), y = 2 * x + x^2 + 20 * rnorm(x) )
Here, we have Y as a quadratic function of X, with normally distributed noise indicated by rnorm(x)
.
Let's go ahead and plot that:
plot1 <- ggplot(data, aes(x = x, y = y)) + geom_point(alpha = 0.25) + theme_bw()
plot1
Since we know the data generating process, how could we plot the CEF onto this graph?
First, we define the function.
CEF <- function(x) { return(2 * x + x^2) }
plot1 + stat_function(fun = CEF, col = "cyan", size = 1.5)
What does stat_function()
do?
It takes a function (fun
) as an input, and calculates the value of fun
for every value of X.
It then plots that as a line (by default) on top of the existing graph.
ggplot2 works a lot like R: the plots and their components are objects
cef_plot <- stat_function(fun = CEF, col = "cyan", size = 1.5)
We can now use cef_plot
over and over again:
plot1 + cef_plot
We can also invoke anonymous functions with stat.function()
:
plot1 + stat_function(fun = function(x) 2 * x + x^2, col = "cyan", size = 1.5)
plot1 + cef_plot
What if we want to plot the best line?
Y=α+βX
What is α?
α=E[Y]−Cov[X,Y]V[X]E[X]
What is β?
β=Cov[X,Y]V[X]
In other words:
α=E[Y]−βE[X]
Let's start by calculating the value of α:
alpha <- mean(data$y) - (cov(data$x, data$y) / var(data$x)) * mean(data$x) alpha
## [1] 11.97447
That looks pretty clunky with all of the data$
terms. Let's simplify:
alpha <- with(data, mean(y) - (cov(x, y) / var(x)) * mean(x) )
Now repeat for β:
beta <- with(data, cov(x, y) / var(x) ) beta
## [1] 2.082254
Now we need to write a function for our BLP so that R can use it.
BLP <- function(x) { return(alpha + beta * x) }
plot1
How do we add the BLP onto the plot?
plot_blp <- plot1 + stat_function(fun = BLP, col = "yellow", size = 1.5)
plot_blp
plot_blp + cef_plot
An easier way!
plot_blp_easy <- plot_blp + geom_smooth(method = "lm", se = FALSE, col = "red", linetype = 2)
plot_blp_easy
Which choices of μ and σ (mean and standard deviation) in a set range will give us the normal distribution with the tallest peak (the largest average value of the PDF)?
We can use simulations to test out many different values of μ and σ in order to figure out our question.
First, we create potential values for μ and σ:
x <- seq(-10, 10, length.out = 100) mu <- seq(-5, 5, length.out = 100) sigma <- seq(1, 11, length.out = 100)
We first need a place to store our guesses and the results
guesses <- matrix(NA, nrow = length(mu), ncol = length(sigma), dimnames = list( mu, sigma ) ) rownames(guesses) <- mu colnames(guesses) <- sigma
for(i in 1:nrow(guesses)){ for(j in 1:ncol(guesses)){ guesses[i, j] <- max(dnorm(x, mean = mu[i], sd = sigma[j])) } }
arrayInd(which(guesses == max(guesses)), .dim = dim(guesses), .dimnames = dimnames(guesses), useNames = TRUE)
## row col ## -1.86868686868687 32 1 ## -1.76767676767677 33 1 ## -0.959595959595959 41 1 ## 1.66666666666667 67 1 ## 2.47474747474747 75 1
Notice that all of the results are in column 1
What is the standard deviation in column 1?
What can we take away from this? How do we maximize the height of the normal PDF?