class: center, middle, inverse, title-slide .title[ # STA 326 2.0 Programming and Data Analysis with R ] .subtitle[ ## Statistical Hypothesis Testing ] .author[ ### ] .author[ ### Dr Thiyanga Talagala ] --- <style type="text/css"> .remark-slide-content { font-size: 50px; } </style> # Why statistical hypothesis testing is important? -- Hypothesis testing provides a **reliable framework** for making any **data-driven** decisions for your **population of interest**. --- background-image: url(ht/ht1.png) background-size: contain --- class: center, middle # Do 12.5kg gas cylinders actually hold 12.5kg of weight? --- # Step 1 Establish null and alternative hypotheses -- `$$H_1: \mu < 12.5kg$$` --- # Step 1 Establish null and alternative hypotheses -- `$$H_0: \mu \geq 12.5kg$$` `$$H_1: \mu < 12.5kg$$` `\(\mu\)` - population mean --- class: center, middle # Suppose the manager wants to determine whether their process is out-of-control by using the weight of gas cylinders --- class: center, middle `$$H_1: \mu \neq 12.5kg$$` `\(\mu\)` - population mean --- class: center, middle `$$H_0: \mu = 12.5kg$$` `$$H_1: \mu \neq 12.5kg$$` `\(\mu\)` - population mean --- background-image: url(ht/ht2.jpg) background-size: contain .footnote[ [https://www.sciencedirect.com/topics/mathematics/tailed-test](https://www.sciencedirect.com/topics/mathematics/tailed-test) ] --- background-image: url(ht/ht2.jpg) background-size: contain --- # Step 2: Gather sample data ```r gas ``` ``` ## # A tibble: 35 × 1 ## weight ## <dbl> ## 1 9.44 ## 2 9.77 ## 3 11.6 ## 4 10.1 ## 5 10.1 ## 6 11.7 ## 7 10.5 ## 8 8.73 ## 9 9.31 ## 10 9.55 ## # … with 25 more rows ``` --- # Step 3: Visualize data ```r library(ggplot2) ggplot(gas, aes(x=weight)) + geom_boxplot(alpha=0.5) ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-3-1.png" width="100%" /> --- ```r ggplot(gas, aes(x=weight)) + geom_boxplot(alpha=0.5) + theme( # remove axis text and ticks axis.text.y = element_blank(), axis.ticks = element_blank()) + labs(x="weight") ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-4-1.png" width="100%" /> --- # Step 4: Determine the appropriate statistical test `$$n > 30$$` Central Limit Theorem --- Distribution of data unknown/ If sample size is small, perform normality test. ### Method 1 H0: The data are normally distributed. H1: The data are not normally distributed. --- ### Method 2 Let `\(X\)` be the weight of a randomly selected cylinder H0: `\(X\)` is normally distributed H1: `\(X\)` is not normally distributed --- # Normal-probability plot ```r ggplot(gas, aes(sample = weight)) + stat_qq() + stat_qq_line(col="red") + coord_equal() + ylab("Theoretical Quantiles") + xlab("Sample Quantiles") ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-5-1.png" width="100%" /> --- How to perform normality test in R? ```r #perform shapiro-wilk test shapiro.test(gas$weight) ``` ``` ## ## Shapiro-Wilk normality test ## ## data: gas$weight ## W = 0.98495, p-value = 0.9027 ``` --- # Step 5: Test "The R package `statsr` provides functions and datasets to support the Coursera Statistics with R Specialization videos and open access book An Introduction to Bayesian Thinking for learning Bayesian and frequentist statistics using R" source: https://www.rdocumentation.org/packages/statsr/versions/0.3.0 --- # `statsr` package installation ```r library(devtools) devtools::install_github("statswithr/statsr", dependencies=TRUE, upgrade_dependencies = TRUE) ``` --- background-image: url(ht/ht3.png) background-size: contain --- Method 1: Using `statsr` package ```r library(statsr) inference(weight, data=gas, statistic="mean", type="ht", null=12.5, alternative ="less", method="theoretical") ``` ``` ## Single numerical variable ## n = 35, y-bar = 10.0375, s = 0.9463 ## H0: mu = 12.5 ## HA: mu < 12.5 ## t = -15.3956, df = 34 ## p_value = < 0.0001 ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-7-1.png" width="100%" /> --- ``` ## Single numerical variable ## n = 35, y-bar = 10.0375, s = 0.9463 ## H0: mu = 12.5 ## HA: mu < 12.5 ## t = -15.3956, df = 34 ## p_value = < 0.0001 ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-8-1.png" width="100%" /> --- # Confidence intervals with `statsr` ```r inference(weight, data=gas, statistic="mean", type="ci", method="theoretical") ``` ``` ## Single numerical variable ## n = 35, y-bar = 10.0375, s = 0.9463 ## 95% CI: (9.7125 , 10.3626) ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-9-1.png" width="100%" /> --- # Method 2: Using `stats` package ```r t.test(gas$weight, mu=12.5, alternative = "less") ``` ``` ## ## One Sample t-test ## ## data: gas$weight ## t = -15.396, df = 34, p-value < 2.2e-16 ## alternative hypothesis: true mean is less than 12.5 ## 95 percent confidence interval: ## -Inf 10.30798 ## sample estimates: ## mean of x ## 10.03752 ``` --- # Confidence intervals with `stats` ```r t.test(gas$weight, mu=12.5, alternative = "two.sided") ``` ``` ## ## One Sample t-test ## ## data: gas$weight ## t = -15.396, df = 34, p-value < 2.2e-16 ## alternative hypothesis: true mean is not equal to 12.5 ## 95 percent confidence interval: ## 9.712466 10.362569 ## sample estimates: ## mean of x ## 10.03752 ``` --- # Example 2 A chemist wants to measure the bias in a pH meter. She uses the meter to measure the pH in 14 neutral substances (pH=7) and obtains the data below. ```r ph <- c( 7.01, 7.04, 6.97, 7.00, 6.99, 6.97, 7.04, 7.04, 7.01, 7.00, 6.99, 7.04, 7.07, 6.97) ``` Is there sufficient evidence to support the claim that the pH meter is not correctly calibrated at the α = 0.05 level of significance? --- ```r ph.df <- data.frame(pH=ph) ggplot(ph.df, aes(y=pH, x="")) + geom_boxplot(outlier.shape = NA, fill="forestgreen") + geom_jitter(alpha=0.5) + labs(x = "") ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-13-1.png" width="100%" /> --- In this case, we have only sixteen observations, meaning that the Central Limit Theorem does not apply. With a small sample, we should only use the t-test if we can reasonably assume that the population data is normally distributed. Hence, we must first verify that pH is normally distributed. --- ```r ggplot(ph.df, aes(sample=pH))+ stat_qq() + stat_qq_line()+labs(x="Theoretical Quantiles", y="Sample Quantiles") + theme(aspect.ratio=1) ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-14-1.png" width="100%" /> --- ```r shapiro.test(ph.df$pH) ``` ``` ## ## Shapiro-Wilk normality test ## ## data: ph.df$pH ## W = 0.91603, p-value = 0.1927 ``` --- # Hypothesis to be tested `$$H_0: \mu = 7$$` `$$H_1: \mu \neq 7$$` `\(\mu\)` - Population mean pH value (in neutral substances). --- ### Hypothesis test ```r inference(y=pH, data=ph.df, statistic="mean", type="ht", null=12.5, alternative ="twosided", method="theoretical") ``` ``` ## Single numerical variable ## n = 14, y-bar = 7.01, s = 0.0316 ## H0: mu = 12.5 ## HA: mu != 12.5 ## t = -649.5856, df = 13 ## p_value = < 0.0001 ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-16-1.png" width="100%" /> --- ### Hypothesis test ``` ## Single numerical variable ## n = 14, y-bar = 7.01, s = 0.0316 ## H0: mu = 12.5 ## HA: mu != 12.5 ## t = -649.5856, df = 13 ## p_value = < 0.0001 ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-17-1.png" width="100%" /> --- # Confidence interval ```r inference(y=pH, data=ph.df, statistic="mean", type="ci", method="theoretical") ``` ``` ## Single numerical variable ## n = 14, y-bar = 7.01, s = 0.0316 ## 95% CI: (6.9917 , 7.0283) ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-18-1.png" width="100%" /> --- # Confidence interval ``` ## Single numerical variable ## n = 14, y-bar = 7.01, s = 0.0316 ## 95% CI: (6.9917 , 7.0283) ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-19-1.png" width="100%" /> --- ## Two-sample: Paired t-test A dietician hopes to reduce a person’s cholesterol level by using a special diet supplemented with a combination of vitamin pills. Twenty (20) subjects were pre-tested and then placed on diet for two weeks. Their cholesterol levels were checked after the two week period. The results are shown below. Cholesterol levels are measured in milligrams per decilitre. --- cont. i) Test the claim that the Cholesterol level before the special diet is greater than the Cholesterol level after the special diet at α = 0.01 level of significance. ii) Construct 99% confidence interval for the difference in mean cholesterol levels. Assume that the cholesterol levels are normally distributed both before and after. --- ```r id <- 1:20 before <- c(210, 235, 208, 190, 172, 244, 211, 235, 210, 190, 175, 250, 200, 270, 222, 203, 209, 220, 250, 280) after <- c(190, 170, 210, 188, 173, 195, 228, 200, 210, 184, 196, 208, 211, 212, 205, 221, 240, 250, 230, 220) cholesterol_1 <- tibble(id=id, before=before, after=after) head(cholesterol_1) ``` ``` ## # A tibble: 6 × 3 ## id before after ## <int> <dbl> <dbl> ## 1 1 210 190 ## 2 2 235 170 ## 3 3 208 210 ## 4 4 190 188 ## 5 5 172 173 ## 6 6 244 195 ``` --- ```r library(tidyverse) cholesterol_2 <- pivot_longer(cholesterol_1, before:after, "type", "value") head(cholesterol_2) ``` ``` ## # A tibble: 6 × 3 ## id type value ## <int> <chr> <dbl> ## 1 1 before 210 ## 2 1 after 190 ## 3 2 before 235 ## 4 2 after 170 ## 5 3 before 208 ## 6 3 after 210 ``` --- ```r ggplot(data= cholesterol_2, aes(x=type, y=value)) + geom_boxplot(outlier.shape = NA, aes(fill=type), alpha=0.5) + geom_jitter(aes(fill=type)) ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-22-1.png" width="100%" /> --- ```r cholesterol_2 %>% group_by(type) %>% summarize(mean = round(mean(value), 2), sd = round(sd(value), 2)) ``` ``` ## # A tibble: 2 × 3 ## type mean sd ## <chr> <dbl> <dbl> ## 1 after 207. 21.0 ## 2 before 219. 29.3 ``` --- ```r ggplot(data = cholesterol_2, aes(sample = value)) + stat_qq() + stat_qq_line() + facet_grid(. ~ type) ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-24-1.png" width="100%" /> --- ```r shapiro.test(cholesterol_1$before) ``` ``` ## ## Shapiro-Wilk normality test ## ## data: cholesterol_1$before ## W = 0.9647, p-value = 0.6414 ``` ```r shapiro.test(cholesterol_1$after) ``` ``` ## ## Shapiro-Wilk normality test ## ## data: cholesterol_1$after ## W = 0.98535, p-value = 0.9836 ``` --- Method 1: stats package ```r t.test(before, after, data=cholesterol_1, "greater", paired=TRUE) ``` ``` ## ## Paired t-test ## ## data: before and after ## t = 1.7754, df = 19, p-value = 0.04593 ## alternative hypothesis: true mean difference is greater than 0 ## 95 percent confidence interval: ## 0.3167385 Inf ## sample estimates: ## mean difference ## 12.15 ``` --- Method 2: statsr package ```r diff <- cholesterol_1$before -cholesterol_1$after diff_data <- data.frame(diff=diff) ggplot(diff_data, aes(sample=diff))+ stat_qq() + stat_qq_line()+ labs(x="Theoretical Quantiles", y="Sample Quantiles") + theme(aspect.ratio=1) ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-27-1.png" width="100%" /> --- ```r inference(y=diff, data=diff_data,statistic="mean", type="ht", method="theoretical", alternative ="greater", null=0L) ``` ``` ## Single numerical variable ## n = 20, y-bar = 12.15, s = 30.6049 ## H0: mu = 0 ## HA: mu > 0 ## t = 1.7754, df = 19 ## p_value = 0.0459 ``` <img src="hypothesis_testing_22_files/figure-html/unnamed-chunk-28-1.png" width="100%" /> --- Method 2: stats ```r t.test(x = diff_data$diff, alternative = c("greater"), mu=0) ``` ``` ## ## One Sample t-test ## ## data: diff_data$diff ## t = 1.7754, df = 19, p-value = 0.04593 ## alternative hypothesis: true mean is greater than 0 ## 95 percent confidence interval: ## 0.3167385 Inf ## sample estimates: ## mean of x ## 12.15 ``` --- class: center, middle Your turn: Obtain confidence intervals --- Two sample tests and ANOVA: switch to [this](https://hellor.netlify.app/slides/Hypothesis_testing.pdf) --- class: center, middle ## Thank you! Slides available at: hellor.netlify.app All rights reserved by [Thiyanga S. Talagala](https://thiyanga.netlify.app/)