Correlation and Regression

This is a reading note of this tutorial.

About Our Dataset

The dataset describes the relationship between the blood alcohol levels of two people and the number of beers they have drunk.

About Correlation Analysis

  • Suitable for two numerical variables.
  • Range: \([-1, 1]\).
    • positive: one variable increases as the other increases.
    • negative: one variable decreases as the other increases (see the toy sketch after this list).
  • Correlation is not a model.
    • It only tells how one variable behaves when another variable changes.
    • It can’t make predictions. E.g. in our case, it can’t tell how much blood alcohol a person has after drinking 5 beers.
    • Correlation doesn’t imply causation. Two correlated variables might have no meaningful connection at all.
  • Since it’s usually impossible to analyse the whole population, realistic correlation analyses are based on samples.
  • Since a correlation analysis is based on a sample, we need a \(p\)-value test to check whether the result is statistically significant.
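
A toy sketch (made-up numbers, not from the tutorial) illustrating the sign of the correlation:

set.seed(1)
x <- 1:10
cor(x,  2 * x + rnorm(10))  # close to +1: y increases with x
cor(x, -2 * x + rnorm(10))  # close to -1: y decreases as x increases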

Prepare Runtime Environment & Load Data

# (install &) load packages
pacman::p_load(
  broom,
  conflicted,
  modelbased,
  tidyverse)

# handle function conflicts
conflicts_prefer(dplyr::filter) 
[conflicted] Will prefer dplyr::filter over any other package.
conflicts_prefer(dplyr::select)
[conflicted] Will prefer dplyr::select over any other package.
# load data
path <- "https://raw.githubusercontent.com/SchmidtPaul/dsfair_quarto/master/data/DrinksPeterMax.csv"
dat <- read_csv(path)
Rows: 20 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Person
dbl (2): drinks, blood_alc

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
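
It often helps to look at the raw relationship first. A minimal sketch with ggplot2 (loaded via tidyverse), using the column names from the specification above:

# scatter plot of the two numerical variables
ggplot(dat, aes(x = drinks, y = blood_alc)) +
  geom_point()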

Make a Correlation Hypothesis

cor(dat$drinks, dat$blood_alc)
[1] 0.9559151

The estimate (\(r\)) is around \(0.96\).

Null Hypothesis Significance Testing & \(p\)-value

There are two important concepts:

  • \(r\): the estimated correlation, computed from the sample.
  • \(\rho\): the true correlation of the population.

Since we can’t obtain the true correlation (it’s impossible to run the experiment on the whole population), we need to test whether our estimate is convincing.

The definition of \(p\)-value from Wikipedia:

In null-hypothesis significance testing, the \(p\)-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.

In correlation analysis, given two variables (in our case: people’s blood alcohol content and the number of beers they have drunk), the null hypothesis \(H_0\) says the true correlation \(\rho\) between these two variables is \(0\).
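
Formally, we test \(H_0: \rho = 0\) against the alternative \(H_1: \rho \neq 0\).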

Note

When talking about significance, be clear about:

  • what data are used
  • what hypothesis is tested

Extra information: publication bias

In many cases, journals only accept research with a very significant \(p\)-value. This causes some problems in the research field. See more on this Wikipedia website and this statement.

After testing, the probability of obtaining a result at least as extreme as ours under the null hypothesis (\(p\)) is lower than \(0.05\). Thus we can reject the null hypothesis (i.e. claim that \(\rho \neq 0\)) and adopt the alternative hypothesis (namely \(H_A\) or \(H_1\)). In other words, we are confident that there DOES exist a correlation between the two variables.

Important

\(p < 0.05\) only tells us that a correlation DOES exist. It neither means the true correlation equals our estimate, nor does it mean we’ve found an important result.
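
The test itself is not shown in this note; a minimal sketch with base R’s cor.test(), which returns \(r\) together with its \(p\)-value and confidence interval:

# test H0: rho = 0 for drinks vs. blood_alc
cor.test(dat$drinks, dat$blood_alc)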

Make a Simple Linear Model

reg_simple <- lm(formula = blood_alc ~ drinks, data = dat)
reg_simple

Call:
lm(formula = blood_alc ~ drinks, data = dat)

Coefficients:
(Intercept)       drinks  
    0.04896      0.12105  

Suppose the linear regression relationship is written as \(y = a + bx\). Here blood_alc is \(y\) and drinks is \(x\), so the fitted model is roughly \(\hat{y} = 0.049 + 0.121\,x\).
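
As a sketch of how to read these coefficients, the prediction for 5 beers is \(0.049 + 0.121 \cdot 5 \approx 0.65\):

# manual prediction at 5 drinks from the fitted coefficients
coef(reg_simple)["(Intercept)"] + coef(reg_simple)["drinks"] * 5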

Test this Simple Linear Model

tibble(drinks = seq(0, 9)) %>% 
  estimate_expectation(model = reg_simple) %>% 
  as_tibble()
# A tibble: 10 × 5
   drinks Predicted     SE  CI_low CI_high
    <int>     <dbl>  <dbl>   <dbl>   <dbl>
 1      0    0.0490 0.0406 -0.0363   0.134
 2      1    0.170  0.0337  0.0993   0.241
 3      2    0.291  0.0278  0.233    0.349
 4      3    0.412  0.0238  0.362    0.462
 5      4    0.533  0.0226  0.486    0.581
 6      5    0.654  0.0247  0.602    0.706
 7      6    0.775  0.0294  0.713    0.837
 8      7    0.896  0.0357  0.821    0.971
 9      8    1.02   0.0428  0.927    1.11 
10      9    1.14   0.0505  1.03     1.24 

Problem: the predicted blood alcohol value at \(0\) drinks is \(0.049\) and thus larger than \(0\). By common sense this is wrong: someone who hasn’t drunk anything should have a blood alcohol level of \(0\).

We then get the \(p\)-values using broom::tidy(), which produces a more concise result than the summary() function.

tidy(reg_simple, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   0.0490   0.0406       1.21 2.43e- 1  -0.0363     0.134
2 drinks        0.121    0.00876     13.8  5.09e-11   0.103      0.139

The intercept’s \(p\)-value (\(0.243\)) is larger than \(0.05\), so the intercept is not significantly different from \(0\) and this model is not convincing.

Make an Improved Linear Model

This time we assume the linear model follows this pattern: \(y = bx\), i.e. the intercept \(a\) is forced to be \(0\).

reg_noint <- lm(formula = blood_alc ~ 0 + drinks, data = dat)
reg_noint

Call:
lm(formula = blood_alc ~ 0 + drinks, data = dat)

Coefficients:
drinks  
0.1298
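
To inspect the new slope the same way as before, one could run (a sketch, output not shown here):

tidy(reg_noint, conf.int = TRUE)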