+ - 0:00:00
Notes for current slide
Notes for next slide

STA 326 2.0 Programming and Data Analysis with R

Statistical Hypothesis Testing

Dr Thiyanga Talagala

1

Why statistical hypothesis testing is important?

2

Why statistical hypothesis testing is important?

Hypothesis testing provides a reliable framework for making any data-driven decisions for your population of interest.

3
4

Do 12.5kg gas cylinders actually hold 12.5kg of weight?

5

Step 1

Establish null and alternative hypotheses

6

Step 1

Establish null and alternative hypotheses

H1:μ<12.5kg

7

Step 1

Establish null and alternative hypotheses

8

Step 1

Establish null and alternative hypotheses

H0:μ12.5kg

H1:μ<12.5kg μ - population mean

9

Suppose the manager wants to determine whether their process is out-of-control by using the weight of gas cylinders

10

H1:μ12.5kg μ - population mean

11

H0:μ=12.5kg

H1:μ12.5kg

μ - population mean

12
14

Step 2: Gather sample data

gas
## # A tibble: 35 × 1
## weight
## <dbl>
## 1 9.44
## 2 9.77
## 3 11.6
## 4 10.1
## 5 10.1
## 6 11.7
## 7 10.5
## 8 8.73
## 9 9.31
## 10 9.55
## # … with 25 more rows
15

Step 3: Visualize data

library(ggplot2)
ggplot(gas, aes(x=weight)) + geom_boxplot(alpha=0.5)

16
ggplot(gas, aes(x=weight)) + geom_boxplot(alpha=0.5) +
theme( # remove axis text and ticks
axis.text.y = element_blank(), axis.ticks = element_blank()) + labs(x="weight")

17

Step 4: Determine the appropriate statistical test

n>30

Central Limit Theorem

18

Distribution of data unknown/ If sample size is small, perform normality test.

Method 1

H0: The data are normally distributed.

H1: The data are not normally distributed.

19

Method 2

Let X be the weight of a randomly selected cylinder

H0: X is normally distributed

H1: X is not normally distributed

20

Normal-probability plot

ggplot(gas, aes(sample = weight)) +
stat_qq() + stat_qq_line(col="red") + coord_equal() + ylab("Theoretical Quantiles") + xlab("Sample Quantiles")

21

How to perform normality test in R?

#perform shapiro-wilk test
shapiro.test(gas$weight)
##
## Shapiro-Wilk normality test
##
## data: gas$weight
## W = 0.98495, p-value = 0.9027
22

Step 5: Test

"The R package statsr provides functions and datasets to support the Coursera Statistics with R Specialization videos and open access book An Introduction to Bayesian Thinking for learning Bayesian and frequentist statistics using R"

source: https://www.rdocumentation.org/packages/statsr/versions/0.3.0

23

statsr package installation

library(devtools)
devtools::install_github("statswithr/statsr",
dependencies=TRUE,
upgrade_dependencies = TRUE)
24
25

Method 1: Using statsr package

library(statsr)
inference(weight, data=gas, statistic="mean", type="ht", null=12.5,
alternative ="less", method="theoretical")
## Single numerical variable
## n = 35, y-bar = 10.0375, s = 0.9463
## H0: mu = 12.5
## HA: mu < 12.5
## t = -15.3956, df = 34
## p_value = < 0.0001

26
## Single numerical variable
## n = 35, y-bar = 10.0375, s = 0.9463
## H0: mu = 12.5
## HA: mu < 12.5
## t = -15.3956, df = 34
## p_value = < 0.0001

27

Confidence intervals with statsr

inference(weight, data=gas, statistic="mean", type="ci", method="theoretical")
## Single numerical variable
## n = 35, y-bar = 10.0375, s = 0.9463
## 95% CI: (9.7125 , 10.3626)

28

Method 2: Using stats package

t.test(gas$weight, mu=12.5, alternative = "less")
##
## One Sample t-test
##
## data: gas$weight
## t = -15.396, df = 34, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 12.5
## 95 percent confidence interval:
## -Inf 10.30798
## sample estimates:
## mean of x
## 10.03752
29

Confidence intervals with stats

t.test(gas$weight, mu=12.5, alternative = "two.sided")
##
## One Sample t-test
##
## data: gas$weight
## t = -15.396, df = 34, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 12.5
## 95 percent confidence interval:
## 9.712466 10.362569
## sample estimates:
## mean of x
## 10.03752
30

Example 2

A chemist wants to measure the bias in a pH meter. She uses the meter to measure the pH in 14 neutral substances (pH=7) and obtains the data below.

ph <- c( 7.01, 7.04, 6.97, 7.00, 6.99, 6.97, 7.04, 7.04, 7.01, 7.00, 6.99, 7.04, 7.07, 6.97)

Is there sufficient evidence to support the claim that the pH meter is not correctly calibrated at the α = 0.05 level of significance?

31
ph.df <- data.frame(pH=ph)
ggplot(ph.df, aes(y=pH, x="")) +
geom_boxplot(outlier.shape = NA, fill="forestgreen") +
geom_jitter(alpha=0.5) +
labs(x = "")

32

In this case, we have only sixteen observations, meaning that the Central Limit Theorem does not apply. With a small sample, we should only use the t-test if we can reasonably assume that the population data is normally distributed. Hence, we must first verify that pH is normally distributed.

33
ggplot(ph.df,
aes(sample=pH))+
stat_qq() + stat_qq_line()+labs(x="Theoretical Quantiles", y="Sample Quantiles") +
theme(aspect.ratio=1)

34
shapiro.test(ph.df$pH)
##
## Shapiro-Wilk normality test
##
## data: ph.df$pH
## W = 0.91603, p-value = 0.1927
35

Hypothesis to be tested

H0:μ=7

H1:μ7

μ - Population mean pH value (in neutral substances).

36

Hypothesis test

inference(y=pH, data=ph.df, statistic="mean", type="ht", null=12.5,
alternative ="twosided", method="theoretical")
## Single numerical variable
## n = 14, y-bar = 7.01, s = 0.0316
## H0: mu = 12.5
## HA: mu != 12.5
## t = -649.5856, df = 13
## p_value = < 0.0001

37

Hypothesis test

## Single numerical variable
## n = 14, y-bar = 7.01, s = 0.0316
## H0: mu = 12.5
## HA: mu != 12.5
## t = -649.5856, df = 13
## p_value = < 0.0001

38

Confidence interval

inference(y=pH, data=ph.df, statistic="mean", type="ci", method="theoretical")
## Single numerical variable
## n = 14, y-bar = 7.01, s = 0.0316
## 95% CI: (6.9917 , 7.0283)

39

Confidence interval

## Single numerical variable
## n = 14, y-bar = 7.01, s = 0.0316
## 95% CI: (6.9917 , 7.0283)

40

Two-sample: Paired t-test

A dietician hopes to reduce a person’s cholesterol level by using a special diet supplemented with a combination of vitamin pills. Twenty (20) subjects were pre-tested and then placed on diet for two weeks. Their cholesterol levels were checked after the two week period. The results are shown below. Cholesterol levels are measured in milligrams per decilitre.

41

cont.

i) Test the claim that the Cholesterol level before the special diet is greater than the Cholesterol level after the special diet at α = 0.01 level of significance.

ii) Construct 99% confidence interval for the difference in mean cholesterol levels. Assume that the cholesterol levels are normally distributed both before and after.

42
id <- 1:20
before <- c(210, 235, 208, 190, 172, 244, 211, 235, 210,
190, 175, 250, 200, 270, 222, 203, 209, 220, 250, 280)
after <- c(190, 170, 210, 188, 173, 195, 228, 200, 210, 184,
196, 208, 211, 212, 205, 221, 240, 250, 230, 220)
cholesterol_1 <- tibble(id=id, before=before, after=after)
head(cholesterol_1)
## # A tibble: 6 × 3
## id before after
## <int> <dbl> <dbl>
## 1 1 210 190
## 2 2 235 170
## 3 3 208 210
## 4 4 190 188
## 5 5 172 173
## 6 6 244 195
43
library(tidyverse)
cholesterol_2 <- pivot_longer(cholesterol_1, before:after, "type", "value")
head(cholesterol_2)
## # A tibble: 6 × 3
## id type value
## <int> <chr> <dbl>
## 1 1 before 210
## 2 1 after 190
## 3 2 before 235
## 4 2 after 170
## 5 3 before 208
## 6 3 after 210
44
ggplot(data= cholesterol_2, aes(x=type, y=value)) +
geom_boxplot(outlier.shape = NA, aes(fill=type), alpha=0.5) +
geom_jitter(aes(fill=type))

45
cholesterol_2 %>%
group_by(type) %>%
summarize(mean = round(mean(value), 2),
sd = round(sd(value), 2))
## # A tibble: 2 × 3
## type mean sd
## <chr> <dbl> <dbl>
## 1 after 207. 21.0
## 2 before 219. 29.3
46
ggplot(data = cholesterol_2, aes(sample = value)) +
stat_qq() +
stat_qq_line() +
facet_grid(. ~ type)

47
shapiro.test(cholesterol_1$before)
##
## Shapiro-Wilk normality test
##
## data: cholesterol_1$before
## W = 0.9647, p-value = 0.6414
shapiro.test(cholesterol_1$after)
##
## Shapiro-Wilk normality test
##
## data: cholesterol_1$after
## W = 0.98535, p-value = 0.9836
48

Method 1: stats package

t.test(before, after, data=cholesterol_1, "greater", paired=TRUE)
##
## Paired t-test
##
## data: before and after
## t = 1.7754, df = 19, p-value = 0.04593
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
## 0.3167385 Inf
## sample estimates:
## mean difference
## 12.15
49

Method 2: statsr package

diff <- cholesterol_1$before -cholesterol_1$after
diff_data <- data.frame(diff=diff)
ggplot(diff_data, aes(sample=diff))+
stat_qq() + stat_qq_line()+
labs(x="Theoretical Quantiles", y="Sample Quantiles") + theme(aspect.ratio=1)

50
inference(y=diff, data=diff_data,statistic="mean", type="ht", method="theoretical", alternative ="greater", null=0L)
## Single numerical variable
## n = 20, y-bar = 12.15, s = 30.6049
## H0: mu = 0
## HA: mu > 0
## t = 1.7754, df = 19
## p_value = 0.0459

51

Method 2: stats

t.test(x = diff_data$diff, alternative = c("greater"), mu=0)
##
## One Sample t-test
##
## data: diff_data$diff
## t = 1.7754, df = 19, p-value = 0.04593
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
## 0.3167385 Inf
## sample estimates:
## mean of x
## 12.15
52

Your turn: Obtain confidence intervals

53

Two sample tests and ANOVA: switch to this

54

Thank you!

Slides available at: hellor.netlify.app

All rights reserved by Thiyanga S. Talagala

55

Why statistical hypothesis testing is important?

2
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow