Hypothesis testing provides a reliable framework for making any data-driven decisions for your population of interest.
Establish null and alternative hypotheses
Establish null and alternative hypotheses
H1:μ<12.5kg
Establish null and alternative hypotheses
Establish null and alternative hypotheses
H0:μ≥12.5kg
H1:μ<12.5kg μ - population mean
H1:μ≠12.5kg μ - population mean
H0:μ=12.5kg
H1:μ≠12.5kg
μ - population mean
gas
## # A tibble: 35 × 1## weight## <dbl>## 1 9.44## 2 9.77## 3 11.6 ## 4 10.1 ## 5 10.1 ## 6 11.7 ## 7 10.5 ## 8 8.73## 9 9.31## 10 9.55## # … with 25 more rows
library(ggplot2)ggplot(gas, aes(x=weight)) + geom_boxplot(alpha=0.5)
ggplot(gas, aes(x=weight)) + geom_boxplot(alpha=0.5) + theme( # remove axis text and ticks axis.text.y = element_blank(), axis.ticks = element_blank()) + labs(x="weight")
n>30
Central Limit Theorem
Distribution of data unknown/ If sample size is small, perform normality test.
H0: The data are normally distributed.
H1: The data are not normally distributed.
Let X be the weight of a randomly selected cylinder
H0: X is normally distributed
H1: X is not normally distributed
ggplot(gas, aes(sample = weight)) + stat_qq() + stat_qq_line(col="red") + coord_equal() + ylab("Theoretical Quantiles") + xlab("Sample Quantiles")
How to perform normality test in R?
#perform shapiro-wilk testshapiro.test(gas$weight)
## ## Shapiro-Wilk normality test## ## data: gas$weight## W = 0.98495, p-value = 0.9027
"The R package statsr
provides functions and datasets to support the Coursera Statistics with R Specialization videos and open access book An Introduction to Bayesian Thinking for learning Bayesian and frequentist statistics using R"
source: https://www.rdocumentation.org/packages/statsr/versions/0.3.0
statsr
package installationlibrary(devtools)devtools::install_github("statswithr/statsr", dependencies=TRUE, upgrade_dependencies = TRUE)
Method 1: Using statsr
package
library(statsr)inference(weight, data=gas, statistic="mean", type="ht", null=12.5, alternative ="less", method="theoretical")
## Single numerical variable## n = 35, y-bar = 10.0375, s = 0.9463## H0: mu = 12.5## HA: mu < 12.5## t = -15.3956, df = 34## p_value = < 0.0001
## Single numerical variable## n = 35, y-bar = 10.0375, s = 0.9463## H0: mu = 12.5## HA: mu < 12.5## t = -15.3956, df = 34## p_value = < 0.0001
statsr
inference(weight, data=gas, statistic="mean", type="ci", method="theoretical")
## Single numerical variable## n = 35, y-bar = 10.0375, s = 0.9463## 95% CI: (9.7125 , 10.3626)
stats
packaget.test(gas$weight, mu=12.5, alternative = "less")
## ## One Sample t-test## ## data: gas$weight## t = -15.396, df = 34, p-value < 2.2e-16## alternative hypothesis: true mean is less than 12.5## 95 percent confidence interval:## -Inf 10.30798## sample estimates:## mean of x ## 10.03752
stats
t.test(gas$weight, mu=12.5, alternative = "two.sided")
## ## One Sample t-test## ## data: gas$weight## t = -15.396, df = 34, p-value < 2.2e-16## alternative hypothesis: true mean is not equal to 12.5## 95 percent confidence interval:## 9.712466 10.362569## sample estimates:## mean of x ## 10.03752
A chemist wants to measure the bias in a pH meter. She uses the meter to measure the pH in 14 neutral substances (pH=7) and obtains the data below.
ph <- c( 7.01, 7.04, 6.97, 7.00, 6.99, 6.97, 7.04, 7.04, 7.01, 7.00, 6.99, 7.04, 7.07, 6.97)
Is there sufficient evidence to support the claim that the pH meter is not correctly calibrated at the α = 0.05 level of significance?
ph.df <- data.frame(pH=ph)ggplot(ph.df, aes(y=pH, x="")) +geom_boxplot(outlier.shape = NA, fill="forestgreen") +geom_jitter(alpha=0.5) +labs(x = "")
In this case, we have only sixteen observations, meaning that the Central Limit Theorem does not apply. With a small sample, we should only use the t-test if we can reasonably assume that the population data is normally distributed. Hence, we must first verify that pH is normally distributed.
ggplot(ph.df,aes(sample=pH))+stat_qq() + stat_qq_line()+labs(x="Theoretical Quantiles", y="Sample Quantiles") + theme(aspect.ratio=1)
shapiro.test(ph.df$pH)
## ## Shapiro-Wilk normality test## ## data: ph.df$pH## W = 0.91603, p-value = 0.1927
H0:μ=7
H1:μ≠7
μ - Population mean pH value (in neutral substances).
inference(y=pH, data=ph.df, statistic="mean", type="ht", null=12.5, alternative ="twosided", method="theoretical")
## Single numerical variable## n = 14, y-bar = 7.01, s = 0.0316## H0: mu = 12.5## HA: mu != 12.5## t = -649.5856, df = 13## p_value = < 0.0001
## Single numerical variable## n = 14, y-bar = 7.01, s = 0.0316## H0: mu = 12.5## HA: mu != 12.5## t = -649.5856, df = 13## p_value = < 0.0001
inference(y=pH, data=ph.df, statistic="mean", type="ci", method="theoretical")
## Single numerical variable## n = 14, y-bar = 7.01, s = 0.0316## 95% CI: (6.9917 , 7.0283)
## Single numerical variable## n = 14, y-bar = 7.01, s = 0.0316## 95% CI: (6.9917 , 7.0283)
A dietician hopes to reduce a person’s cholesterol level by using a special diet supplemented with a combination of vitamin pills. Twenty (20) subjects were pre-tested and then placed on diet for two weeks. Their cholesterol levels were checked after the two week period. The results are shown below. Cholesterol levels are measured in milligrams per decilitre.
cont.
i) Test the claim that the Cholesterol level before the special diet is greater than the Cholesterol level after the special diet at α = 0.01 level of significance.
ii) Construct 99% confidence interval for the difference in mean cholesterol levels. Assume that the cholesterol levels are normally distributed both before and after.
id <- 1:20before <- c(210, 235, 208, 190, 172, 244, 211, 235, 210,190, 175, 250, 200, 270, 222, 203, 209, 220, 250, 280)after <- c(190, 170, 210, 188, 173, 195, 228, 200, 210, 184,196, 208, 211, 212, 205, 221, 240, 250, 230, 220)cholesterol_1 <- tibble(id=id, before=before, after=after)head(cholesterol_1)
## # A tibble: 6 × 3## id before after## <int> <dbl> <dbl>## 1 1 210 190## 2 2 235 170## 3 3 208 210## 4 4 190 188## 5 5 172 173## 6 6 244 195
library(tidyverse)cholesterol_2 <- pivot_longer(cholesterol_1, before:after, "type", "value")head(cholesterol_2)
## # A tibble: 6 × 3## id type value## <int> <chr> <dbl>## 1 1 before 210## 2 1 after 190## 3 2 before 235## 4 2 after 170## 5 3 before 208## 6 3 after 210
ggplot(data= cholesterol_2, aes(x=type, y=value)) +geom_boxplot(outlier.shape = NA, aes(fill=type), alpha=0.5) +geom_jitter(aes(fill=type))
cholesterol_2 %>%group_by(type) %>%summarize(mean = round(mean(value), 2),sd = round(sd(value), 2))
## # A tibble: 2 × 3## type mean sd## <chr> <dbl> <dbl>## 1 after 207. 21.0## 2 before 219. 29.3
ggplot(data = cholesterol_2, aes(sample = value)) +stat_qq() +stat_qq_line() +facet_grid(. ~ type)
shapiro.test(cholesterol_1$before)
## ## Shapiro-Wilk normality test## ## data: cholesterol_1$before## W = 0.9647, p-value = 0.6414
shapiro.test(cholesterol_1$after)
## ## Shapiro-Wilk normality test## ## data: cholesterol_1$after## W = 0.98535, p-value = 0.9836
Method 1: stats package
t.test(before, after, data=cholesterol_1, "greater", paired=TRUE)
## ## Paired t-test## ## data: before and after## t = 1.7754, df = 19, p-value = 0.04593## alternative hypothesis: true mean difference is greater than 0## 95 percent confidence interval:## 0.3167385 Inf## sample estimates:## mean difference ## 12.15
Method 2: statsr package
diff <- cholesterol_1$before -cholesterol_1$afterdiff_data <- data.frame(diff=diff)ggplot(diff_data, aes(sample=diff))+stat_qq() + stat_qq_line()+labs(x="Theoretical Quantiles", y="Sample Quantiles") + theme(aspect.ratio=1)
inference(y=diff, data=diff_data,statistic="mean", type="ht", method="theoretical", alternative ="greater", null=0L)
## Single numerical variable## n = 20, y-bar = 12.15, s = 30.6049## H0: mu = 0## HA: mu > 0## t = 1.7754, df = 19## p_value = 0.0459
Method 2: stats
t.test(x = diff_data$diff, alternative = c("greater"), mu=0)
## ## One Sample t-test## ## data: diff_data$diff## t = 1.7754, df = 19, p-value = 0.04593## alternative hypothesis: true mean is greater than 0## 95 percent confidence interval:## 0.3167385 Inf## sample estimates:## mean of x ## 12.15
Your turn: Obtain confidence intervals
Two sample tests and ANOVA: switch to this
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |