In this project, we will use inferential statistics to find out if men or women oppose sex education in public schools. As the data we have is a sample from the population, it’s important to use a hypothesis driven approach to draw conclusions from the sample.
Let’s get started.
The project was completed as a part of Duke University’s ‘Introduction to Probability and Data’ online course on Coursera, the first of the Statistics with R Specialization.
library(ggplot2)
library(dplyr)
library(statsr)
load("gss.Rdata")
According to the documentation, the data in the General Social Survey (GSS) is collected from an independently drawn sample of English-speaking persons (and Spanish-speaking persons, beginning from 2006) 18 years of age or above, living in non-institutional arrangements within the United States. Full probability sampling was employed in all surveys beginning from 1977.
As probability sampling was used, we know that random selection was was a part of the sampling process. Thus, the data is generalizable to the population as a whole. But, since the study is an observational study, and not experimental, we cannot use it to determine causality between variables in the data.
In the year 2012, were men, of 18 years or above, more likely to oppose sex education in public schools than women?
Sex education is an important part of a student’s schooling, so they are aware of their own health and make the correct decisions in life. If we know that there’s a chance of difference between men and women’s attitude towards sex education, we can put in the resources for further research to determine the cause of this difference, and solve the issue.
We’ll start with filtering out the data for the year 2012, and then selecting the variables we’re interested in. These are
sexeduc: For or against sex education in public schools
dataset <- filter(gss, year == 2012) %>% select(sex,sexeduc)
Let’s look at the type of variables we’re dealing with:
str(dataset)
## 'data.frame': 1974 obs. of 2 variables:
## $ sex : Factor w/ 2 levels "Male","Female": 1 1 1 2 2 2 2 2 2 2 ...
## $ sexeduc: Factor w/ 3 levels "Favor","Oppose",..: 1 NA NA 1 1 1 1 NA 1 NA ...
The sex variable has two levels, and the sexeduc has 3 levels. Let’s plot the data:
ggplot(dataset, aes(x=sexeduc)) + geom_bar() + ggtitle('Favourability to Sex Education in Public Schools') + xlab('Favour or Oppose Sex Education') + theme_bw()
It seems like more people favor sex education and oppose it. Which is good!
We also have a lot of NA values that we don’t want to deal with for our question, so let’s remove those, and take a look at the new distribution.
dataset <- na.omit(dataset)
table(dataset)
## sexeduc
## sex Favor Oppose Depends
## Male 519 64 0
## Female 638 53 0
It looks like that, in our sample, there are more males than females that oppose sex education. We also have no responses for the ‘Depends’ category, so let’s remove that and visualize our data.
dataset <- droplevels(dataset)
mosaicplot(prop.table(table(dataset),1), main = 'Sex vs Favourability of Sex Education')
Now, we’ll use statistical inference to know if a larger propertion of men than women, oppose sex education for the whole population.
For our question, the null hypotheses is, there’s no difference in the proportions of men and women that oppose sex education.
The alternative hypotheses is, a larger proportion of men, than women, oppose sex education.
We now check the conditions for inference, for a difference in proportion.
And for each model to follow a nearly normal model,
The conditions for inference on a difference in proportion also holds true because we’ve determined each proportion to follow a normal model. And that the two samples are independent from each other because of random sampling.
We’ll examine the difference in proportions of men and women who oppose sex education by:
Calculating the p-value for the hypotheses that a higher proportion of men oppose sex education, given our sample. Then deciding if the p-value is small or large enough to reject, or not reject the null hypotheses.
Checking a 95% confidence interval for the difference in the proportions, and seeing if it contains 0 i.e. there is no difference.
inference(x= sex,y = sexeduc, data = dataset, statistic = "proportion", type = "ht", null = 0 ,method = "theoretical",alternative="greater", success = "Oppose")
## Response variable: categorical (2 levels, success: Oppose)
## Explanatory variable: categorical (2 levels)
## n_Male = 583, p_hat_Male = 0.1098
## n_Female = 691, p_hat_Female = 0.0767
## H0: p_Male = p_Female
## HA: p_Male > p_Female
## z = 2.0367
## p_value = 0.0208
We can see that the p-value is less than 5%, so for this question, we reject the null hypotheses and conclude that for our population in general, a larger proportion of males, compared to females, oppose sex education.
Reason for including CI: We’re constructing a confidence interval for our hypotheses test, to double check the results of the method with the p-value test.
inference(x= sex,y = sexeduc, data = dataset, statistic = "proportion", type = "ci",method = "theoretical", success = "Oppose")
## Response variable: categorical (2 levels, success: Oppose)
## Explanatory variable: categorical (2 levels)
## n_Male = 583, p_hat_Male = 0.1098
## n_Female = 691, p_hat_Female = 0.0767
## 95% CI (Male - Female): (9e-04 , 0.0653)
The 95% confidence interval for the difference between proportions of males and females that oppose sex education is between 0.0009 to 0.0653. So even though the difference is really small, it still exists. Hence the results for both tests agree, and we can say that a greater proportion of men, than females, oppose sex education, and steps can be taken to better educate the men on the issue.