October 3, 2016

Plotting Functions

Create a Distribution

library(dplyr)
library(ggplot2)
set.seed(500)

Often, we have a distribution of data that we would like to visualize.

data <- data_frame(
                   x = seq(-6, 6, length.out = 10000),
                   y = 2 * x + x^2 + 20 * rnorm(x)
                  )

Here, we have \(Y\) as a quadratic function of \(X\), with normally distributed noise indicated by rnorm(x).

Plotting the Distribution

Let's go ahead and plot that:

plot1 <-
  ggplot(data, aes(x = x, y = y)) +
    geom_point(alpha = 0.25) +
    theme_bw()

plot1

Our Goal

Defining Functions

Since we know the data generating process, how could we plot the CEF onto this graph?

First, we define the function.

CEF <- function(x) {
  return(2 * x + x^2)
}

stat_function()

plot1 +
  stat_function(fun = CEF, col = "cyan", size = 1.5)

What does stat_function() do?

It takes a function (fun) as an input, and calculates the value of fun for every value of X.

It then plots that as a line (by default) on top of the existing graph.

Modularity in ggplot2

ggplot2 works a lot like R: the plots and their components are objects

cef_plot <- stat_function(fun = CEF,
                          col = "cyan",
                          size = 1.5)

We can now use cef_plot over and over again:

plot1 + cef_plot

Anonymous Functions

We can also invoke anonymous functions with stat.function():

plot1 + stat_function(fun = function(x) 2 * x + x^2,
                      col = "cyan",
                      size = 1.5)

plot1 + cef_plot

Best Linear Predictor

What if we want to plot the best line?

  • What do we mean by "best"?
    • The line that minimizes mean squared error
  • What is mean squared error?
    • For an estimate \(c\):
    • \(MSE = \text{E}\big[(X - c)^2\big]\)
  • How do we calculate the best linear predictor to the CEF?

Best Linear Predictor Equation

\[ Y = \alpha + \beta X \]

What is \(\alpha\)?

\[ \alpha = \text{E}[Y] - \frac{\text{Cov}[X,Y]}{\text{V}[X]}\,\text{E}[X] \]

What is \(\beta\)?

\[ \beta = \frac{\text{Cov}[X,Y]}{\text{V}[X]} \]

In other words:

\[ \alpha = \text{E}[Y] - \beta\,\text{E}[X] \]

Calculate Intercept

Let's start by calculating the value of \(\alpha\):

alpha <-
  mean(data$y) - (cov(data$x, data$y) / var(data$x)) * mean(data$x)

alpha
## [1] 11.97447

That looks pretty clunky with all of the data$ terms. Let's simplify:

alpha <- with(data,
              mean(y) - (cov(x, y) / var(x)) * mean(x)
              )

Calculate Slope

Now repeat for \(\beta\):

beta <- with(data,
             cov(x, y) / var(x)
             )

beta
## [1] 2.082254

Best Linear Predictor Function

Now we need to write a function for our BLP so that R can use it.

BLP <- function(x) {
  return(alpha + beta * x)
}

plot1

Plotting BLP

How do we add the BLP onto the plot?

plot_blp <- plot1 +
              stat_function(fun = BLP,
                            col = "yellow", size = 1.5)

plot_blp

plot_blp + cef_plot

geom_smooth()

An easier way!

plot_blp_easy <- plot_blp +
  geom_smooth(method = "lm", se = FALSE,
              col = "red", linetype = 2)

plot_blp_easy

Grid Search Simulation

The Question

Which choices of \(\mu\) and \(\sigma\) (mean and standard deviation) in a set range will give us the normal distribution with the tallest peak (the largest average value of the PDF)?

Simulation!

We can use simulations to test out many different values of \(\mu\) and \(\sigma\) in order to figure out our question.

First, we create potential values for \(\mu\) and \(\sigma\):

x <- seq(-10, 10, length.out = 100)
mu <- seq(-5, 5, length.out = 100)
sigma  <- seq(1, 11, length.out = 100)

Making Guesses

We first need a place to store our guesses and the results

guesses <- matrix(NA,
                  nrow = length(mu),
                  ncol = length(sigma),
                  dimnames = list(
                    mu, sigma
                    )
                  )

rownames(guesses) <- mu
colnames(guesses) <- sigma

Create a Loop

for(i in 1:nrow(guesses)){
  for(j in 1:ncol(guesses)){
    guesses[i, j] <- max(dnorm(x, mean = mu[i], sd = sigma[j]))
    }
}

Results

arrayInd(which(guesses == max(guesses)), .dim = dim(guesses),
         .dimnames = dimnames(guesses),
         useNames = TRUE)
##                    row col
## -1.86868686868687   32   1
## -1.76767676767677   33   1
## -0.959595959595959  41   1
## 1.66666666666667    67   1
## 2.47474747474747    75   1

Notice that all of the results are in column 1

What is the standard deviation in column 1?

What can we take away from this? How do we maximize the height of the normal PDF?

Questions?