Hypothesis Testing

Using computer simulation. Based on examples from the infer package. Code for Quiz 13

Load the R package we will use.

Question: t-test

#hr_2_tidy.csv# is the name of your data subset

hr  <- read_csv("https://estanny.com/static/week13/data/hr_2_tidy.csv", 
                col_types = "fddfff")

use the skim to summarize the data in hr

skim(hr)
Table 1: Data summary
Name hr
Number of rows 500
Number of columns 6
_______________________
Column type frequency:
factor 4
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 mal: 256, fem: 244
evaluation 0 1 FALSE 4 bad: 154, fai: 142, goo: 108, ver: 96
salary 0 1 FALSE 6 lev: 95, lev: 94, lev: 87, lev: 85
status 0 1 FALSE 3 fir: 194, pro: 179, ok: 127

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 39.86 11.55 20.3 29.60 40.2 50.1 59.9 ▇▇▇▇▇
hours 0 1 49.39 13.15 35.0 37.48 45.6 58.9 79.9 ▇▃▂▂▂

The mean hours worked per week is: 49.4

Q: Is the mean number of hours worked per week 48?

specify that hours is the variable of interest

hr %>%
  specify(response = hours)
Response: hours (numeric)
# A tibble: 500 x 1
   hours
   <dbl>
 1  78.1
 2  35.1
 3  36.9
 4  38.5
 5  36.1
 6  78.1
 7  76  
 8  35.6
 9  35.6
10  56.8
# … with 490 more rows

hypothesize that the average hours worked is 48

hr %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 48)
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500 x 1
   hours
   <dbl>
 1  78.1
 2  35.1
 3  36.9
 4  38.5
 5  36.1
 6  78.1
 7  76  
 8  35.6
 9  35.6
10  56.8
# … with 490 more rows

generate 1000 replicates representing the null hypothesis

hr %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 48) %>%
  generate(reps = 1000, type = "bootstrap")
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500,000 x 2
# Groups:   replicate [1,000]
   replicate hours
       <int> <dbl>
 1         1  37.5
 2         1  64.2
 3         1  37.1
 4         1  33.6
 5         1  50.8
 6         1  57.9
 7         1  45.3
 8         1  35.6
 9         1  40.8
10         1  37.4
# … with 499,990 more rows

The output has 500,000 rows

calculate the distribution of statistics from the generated data

null_t_distribution <- hr %>%
  specify(response = age) %>%
  hypothesize(null = "point", mu = 48) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "t")

null_t_distribution
# A tibble: 1,000 x 2
   replicate    stat
 *     <int>   <dbl>
 1         1 -1.56  
 2         2 -0.138 
 3         3  1.07  
 4         4  0.232 
 5         5  1.31  
 6         6  0.774 
 7         7 -0.0531
 8         8  0.748 
 9         9  0.164 
10        10  1.28  
# … with 990 more rows

visualize the simulated null distribution

visualize(null_t_distribution)

calculate the statistic from your observed data

observed_t_statistic <- hr %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 48) %>%
  calculate(stat = "t")

observed_t_statistic
# A tibble: 1 x 1
   stat
  <dbl>
1  2.37

get_p_value from the simulated null distribution and the observed statistic

null_t_distribution %>%
  get_p_value(obs_stat = observed_t_statistic, direction = "two-sided")
# A tibble: 1 x 1
  p_value
    <dbl>
1    0.02

shade_p_value on the simulated null distribution

null_t_distribution %>%
  visualize() +
  shade_p_value(obs_stat = observed_t_statistic, direction = "two-sided")

If the p-value < 0.05? Yes

Does your analysis support the null hypothesis that the true mean number of hours worked was 48? No

Question: 2 sample t-test

hr_1_tidy.csv is the name of your data subset

hr_2 <- read_csv("https://estanny.com/static/week13/data/hr_1_tidy.csv", 
                col_types = "fddfff")

Q: Is the average number of hours worked the same for both genders?

hr_2 %>%
  group_by(gender) %>%
  skim()
Table 2: Data summary
Name Piped data
Number of rows 500
Number of columns 6
_______________________
Column type frequency:
factor 3
numeric 2
________________________
Group variables gender

Variable type: factor

skim_variable gender n_missing complete_rate ordered n_unique top_counts
evaluation female 0 1 FALSE 4 fai: 81, bad: 71, ver: 57, goo: 51
evaluation male 0 1 FALSE 4 bad: 82, fai: 61, goo: 55, ver: 42
salary female 0 1 FALSE 6 lev: 54, lev: 50, lev: 44, lev: 41
salary male 0 1 FALSE 6 lev: 52, lev: 47, lev: 46, lev: 39
status female 0 1 FALSE 3 fir: 96, pro: 87, ok: 77
status male 0 1 FALSE 3 fir: 89, ok: 76, pro: 75

Variable type: numeric

skim_variable gender n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age female 0 1 41.78 11.50 20.5 32.15 42.35 51.62 59.9 ▆▅▇▆▇
age male 0 1 39.32 11.55 20.2 28.70 38.55 49.52 59.7 ▇▇▆▇▆
hours female 0 1 50.32 13.23 35.0 38.38 47.80 60.40 79.7 ▇▃▃▂▂
hours male 0 1 48.24 12.95 35.0 37.00 42.40 57.00 78.1 ▇▂▂▁▂

Use geom_boxplot to plot distributions of hours worked by gender

hr_2 %>%
  ggplot(aes(x = gender, y = hours)) +
  geom_boxplot()

specify the variables of interest are hours and gender

hr_2 %>%
  specify(response = hours, explanatory = gender)
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 500 x 2
   hours gender
   <dbl> <fct> 
 1  36.5 female
 2  55.8 female
 3  35   male  
 4  52   female
 5  35.1 male  
 6  36.3 female
 7  40.1 female
 8  42.7 female
 9  66.6 male  
10  35.5 male  
# … with 490 more rows

hypothesize that the number of hours worked and gender are independent

hr_2 %>%
  specify(response = hours, explanatory = gender) %>%
  hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
   hours gender
   <dbl> <fct> 
 1  36.5 female
 2  55.8 female
 3  35   male  
 4  52   female
 5  35.1 male  
 6  36.3 female
 7  40.1 female
 8  42.7 female
 9  66.6 male  
10  35.5 male  
# … with 490 more rows

generate 1000 replicates representing the null hypothesis

hr_2 %>%
  specify(response = hours, explanatory = gender) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups:   replicate [1,000]
   hours gender replicate
   <dbl> <fct>      <int>
 1  62.5 female         1
 2  54.5 female         1
 3  63.5 male           1
 4  52.9 female         1
 5  59.7 male           1
 6  36.3 female         1
 7  45   female         1
 8  35.3 female         1
 9  36.8 male           1
10  36.1 male           1
# … with 499,990 more rows

The output has 500,000 rows

calculate the distribution of statistics from the generated data

null_distribution_2_sample_permute <- hr_2 %>%
  specify(response = hours, explanatory = gender) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("female", "male"))

null_distribution_2_sample_permute
# A tibble: 1,000 x 2
   replicate    stat
 *     <int>   <dbl>
 1         1  1.56  
 2         2  0.0802
 3         3 -1.06  
 4         4 -0.539 
 5         5  0.0188
 6         6  0.195 
 7         7 -0.353 
 8         8  0.677 
 9         9 -0.778 
10        10  0.140 
# … with 990 more rows

visualize the simulated null distribution

visualize(null_distribution_2_sample_permute)

calculate the statistic from your observed data

observed_t_2_sample_stat <- hr_2 %>%
  specify(response = hours, explanatory = gender) %>%
  calculate(stat = "t", order = c("female", "male"))

observed_t_2_sample_stat
# A tibble: 1 x 1
   stat
  <dbl>
1  1.78

get_p_value from the simulated null distribution and the observed statistic

null_t_distribution %>%
  get_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")
# A tibble: 1 x 1
  p_value
    <dbl>
1   0.054

shade_p_value on the simulated null distribution

null_t_distribution %>%
  visualize() +
  shade_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")

Is the p-value < 0.05? No

Does your analysis support the null hypothesis that the true mean number of hours worked by female and male employees was the same? Yes

Question: ANOVA

hr_2_tidy.csv is the name of your data subset

hr_anova <- read_csv("https://estanny.com/static/week13/data/hr_2_tidy.csv")

Q: Is the average number of hours worked the same for all three status (fired, ok and promoted) ?

use skim to summarize the data in hr_anova by status

hr_anova %>%
  group_by(status) %>%
  skim()
Table 3: Data summary
Name Piped data
Number of rows 500
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 2
________________________
Group variables status

Variable type: character

skim_variable status n_missing complete_rate min max empty n_unique whitespace
gender fired 0 1 4 6 0 2 0
gender ok 0 1 4 6 0 2 0
gender promoted 0 1 4 6 0 2 0
evaluation fired 0 1 3 9 0 4 0
evaluation ok 0 1 3 9 0 4 0
evaluation promoted 0 1 3 9 0 4 0
salary fired 0 1 4 4 0 6 0
salary ok 0 1 4 4 0 6 0
salary promoted 0 1 4 4 0 6 0

Variable type: numeric

skim_variable status n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age fired 0 1 40.03 11.53 20.3 29.45 40.40 50.08 59.9 ▇▅▇▆▆
age ok 0 1 38.50 11.98 20.3 28.15 38.70 49.45 59.9 ▇▆▅▅▆
age promoted 0 1 40.63 11.25 20.4 30.75 41.10 50.25 59.9 ▆▇▇▇▇
hours fired 0 1 41.67 8.37 35.0 36.10 38.45 43.40 77.7 ▇▂▁▁▁
hours ok 0 1 47.35 10.86 35.0 37.10 45.70 54.50 78.9 ▇▅▃▂▁
hours promoted 0 1 59.21 12.66 35.0 49.75 58.90 70.65 79.9 ▅▆▇▇▇

Use geom_boxplot to plot distributions of hours worked by status

hr_anova %>%
  ggplot(aes(x = status, y = hours)) +
  geom_boxplot()

specify the variables of interest are hours and status

hr_anova %>%
  specify(response = hours, explanatory = status)
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 500 x 2
   hours status  
   <dbl> <fct>   
 1  78.1 promoted
 2  35.1 fired   
 3  36.9 fired   
 4  38.5 fired   
 5  36.1 fired   
 6  78.1 promoted
 7  76   promoted
 8  35.6 fired   
 9  35.6 ok      
10  56.8 promoted
# … with 490 more rows

hypothesize that the number of hours worked and status are independent

hr_anova %>%
  specify(response = hours, explanatory = status) %>%
  hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
   hours status  
   <dbl> <fct>   
 1  78.1 promoted
 2  35.1 fired   
 3  36.9 fired   
 4  38.5 fired   
 5  36.1 fired   
 6  78.1 promoted
 7  76   promoted
 8  35.6 fired   
 9  35.6 ok      
10  56.8 promoted
# … with 490 more rows

generate 1000 replicates representing the null hypothesis

hr_anova %>%
  specify(response = hours, explanatory = status) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups:   replicate [1,000]
   hours status   replicate
   <dbl> <fct>        <int>
 1  36.1 promoted         1
 2  35.1 fired            1
 3  45.9 fired            1
 4  46.6 fired            1
 5  55.4 fired            1
 6  50.7 promoted         1
 7  46.5 promoted         1
 8  37.3 fired            1
 9  35.2 ok               1
10  46.1 promoted         1
# … with 499,990 more rows

The output has 500,000 rows

calculate the distribution of statistics from the generated data

null_distribution_anova <- hr_anova %>%
  specify(response = hours, explanatory = gender) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

null_distribution_anova
# A tibble: 1,000 x 2
   replicate    stat
 *     <int>   <dbl>
 1         1 0.883  
 2         2 0.0283 
 3         3 0.00965
 4         4 2.54   
 5         5 0.00148
 6         6 0.167  
 7         7 0.00912
 8         8 1.23   
 9         9 0.382  
10        10 0.181  
# … with 990 more rows

visualize the simulated null distribution

visualize(null_distribution_anova)

calculate the statistic from your observed data

_ Assign the output observed_f_sample_stat

observed_f_sample_stat <- hr_anova %>%
  specify(response = hours, explanatory = status) %>%
  calculate(stat = "F")

observed_f_sample_stat
# A tibble: 1 x 1
   stat
  <dbl>
1  128.

get_p_value from the simulated null distribution and the observed statistic

null_distribution_anova %>%
  get_p_value(obs_stat = observed_f_sample_stat, direction = "greater")
# A tibble: 1 x 1
  p_value
    <dbl>
1       0

shade_p_value on the simulated null distribution

null_t_distribution %>%
  visualize() +
  shade_p_value(obs_stat = observed_f_sample_stat, direction = "greater")

ggsave(filename = "preview.png",
       path = here::here("_posts", "2021-05-18-hypothesis-testing"))

Is the p-value < 0.05? Yes

Does your analysis support the null hypothesis that the true means of the number of hours worked for those that were “fired”, “ok” and “promoted” were the same? No