October 3, 2016
library(dplyr) library(ggplot2) set.seed(500)
Often, we have a distribution of data that we would like to visualize.
data <- data_frame( x = seq(-6, 6, length.out = 10000), y = 2 * x + x^2 + 20 * rnorm(x) )
Here, we have \(Y\) as a quadratic function of \(X\), with normally distributed noise indicated by rnorm(x)
.
Let's go ahead and plot that:
plot1 <- ggplot(data, aes(x = x, y = y)) + geom_point(alpha = 0.25) + theme_bw()
plot1
Since we know the data generating process, how could we plot the CEF onto this graph?
First, we define the function.
CEF <- function(x) { return(2 * x + x^2) }
plot1 + stat_function(fun = CEF, col = "cyan", size = 1.5)
What does stat_function()
do?
It takes a function (fun
) as an input, and calculates the value of fun
for every value of X.
It then plots that as a line (by default) on top of the existing graph.
ggplot2 works a lot like R: the plots and their components are objects
cef_plot <- stat_function(fun = CEF, col = "cyan", size = 1.5)
We can now use cef_plot
over and over again:
plot1 + cef_plot
We can also invoke anonymous functions with stat.function()
:
plot1 + stat_function(fun = function(x) 2 * x + x^2, col = "cyan", size = 1.5)
plot1 + cef_plot
What if we want to plot the best line?
\[ Y = \alpha + \beta X \]
What is \(\alpha\)?
\[ \alpha = \text{E}[Y] - \frac{\text{Cov}[X,Y]}{\text{V}[X]}\,\text{E}[X] \]
What is \(\beta\)?
\[ \beta = \frac{\text{Cov}[X,Y]}{\text{V}[X]} \]
In other words:
\[ \alpha = \text{E}[Y] - \beta\,\text{E}[X] \]
Let's start by calculating the value of \(\alpha\):
alpha <- mean(data$y) - (cov(data$x, data$y) / var(data$x)) * mean(data$x) alpha
## [1] 11.97447
That looks pretty clunky with all of the data$
terms. Let's simplify:
alpha <- with(data, mean(y) - (cov(x, y) / var(x)) * mean(x) )
Now repeat for \(\beta\):
beta <- with(data, cov(x, y) / var(x) ) beta
## [1] 2.082254
Now we need to write a function for our BLP so that R can use it.
BLP <- function(x) { return(alpha + beta * x) }
plot1
How do we add the BLP onto the plot?
plot_blp <- plot1 + stat_function(fun = BLP, col = "yellow", size = 1.5)
plot_blp
plot_blp + cef_plot
An easier way!
plot_blp_easy <- plot_blp + geom_smooth(method = "lm", se = FALSE, col = "red", linetype = 2)
plot_blp_easy
Which choices of \(\mu\) and \(\sigma\) (mean and standard deviation) in a set range will give us the normal distribution with the tallest peak (the largest average value of the PDF)?
We can use simulations to test out many different values of \(\mu\) and \(\sigma\) in order to figure out our question.
First, we create potential values for \(\mu\) and \(\sigma\):
x <- seq(-10, 10, length.out = 100) mu <- seq(-5, 5, length.out = 100) sigma <- seq(1, 11, length.out = 100)
We first need a place to store our guesses and the results
guesses <- matrix(NA, nrow = length(mu), ncol = length(sigma), dimnames = list( mu, sigma ) ) rownames(guesses) <- mu colnames(guesses) <- sigma
for(i in 1:nrow(guesses)){ for(j in 1:ncol(guesses)){ guesses[i, j] <- max(dnorm(x, mean = mu[i], sd = sigma[j])) } }
arrayInd(which(guesses == max(guesses)), .dim = dim(guesses), .dimnames = dimnames(guesses), useNames = TRUE)
## row col ## -1.86868686868687 32 1 ## -1.76767676767677 33 1 ## -0.959595959595959 41 1 ## 1.66666666666667 67 1 ## 2.47474747474747 75 1
Notice that all of the results are in column 1
What is the standard deviation in column 1?
What can we take away from this? How do we maximize the height of the normal PDF?