Chi-Squared Goodness-of-Fit Testing in R

Hello everyone, in today's session, we'll be diving into goodness-of-fit testing using R. We'll work through a couple of problems to understand the concept better. If you're unfamiliar with goodness-of-fit testing, I recommend watching my introductory video on the topic first (link provided above).

Let's start with the first problem. A college claims that 50% of students in its statistics classes are freshmen, 30% are sophomores, 10% are juniors, and 10% are seniors. We have obtained a simple random sample of 65 students, and the distribution in our sample is slightly different from the claimed proportions. We want to determine if these differences provide strong evidence against the college's claim or if they could be due to random variability.

To perform the goodness-of-fit test in R, we'll use the chisq.test function. I've pulled up its help file, but we'll focus solely on goodness-of-fit testing for now.

First, let's input our data. We'll create a vector called years to store the observed counts: 28 freshmen, 24 sophomores, 9 juniors, and 4 seniors.

years <- c(28, 24, 9, 4)

Next, we need to create a vector of expected proportions under the null hypothesis. In this case, the null hypothesis assumes that the claimed proportions are true. Let's call this vector props and assign the proportions: 0.5 for freshmen, 0.3 for sophomores, 0.1 for juniors, and 0.1 for seniors.

props <- c(0.5, 0.3, 0.1, 0.1)

Now, we can perform the chi-square test using the chisq.test function. The basic syntax is straightforward: chisq.test(data, p = expected_proportions). Remember to include p = props to specify the expected proportions.

result <- chisq.test(years, p = props)

The test will output the degrees of freedom, the chi-square test statistic, and the p-value. For this problem, we have three degrees of freedom, a chi-square test statistic of 3.58, and a p-value of 0.31. These results match what we obtained in the introductory video.
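Because we assigned the output to result, we can print it and pull out individual pieces. A minimal sketch (the component names below, such as expected and p.value, come from the htest object that chisq.test returns):

result            # prints the chi-square statistic, degrees of freedom, and p-value
result$expected   # expected counts under the claim: 65 * props = 32.5, 19.5, 6.5, 6.5
result$p.value    # the p-value on its own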

With a p-value of 0.31, we do not have enough evidence to reject the null hypothesis. Therefore, we cannot conclude that the differences between our sample distribution and the claimed proportions are statistically significant. The data is compatible with the college's claim.

Now, let's move on to the second problem. We have conducted 200 rolls of a die and recorded how many times each face came up. We want to determine if this distribution provides evidence that the die is unfair.

We'll follow the same process as before. Let's create a vector called counts to store the observed counts: 28 ones, 30 twos, 22 threes, 31 fours, 38 fives, and 51 sixes.

counts <- c(28, 30, 22, 31, 38, 51)

Now, we can apply chisq.test directly to these counts. When no p argument is supplied, the function assumes equal expected proportions, which is exactly the null hypothesis of a fair die.

result <- chisq.test(counts)
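If you prefer to spell out the null hypothesis, the following call is equivalent (fair_probs is just an illustrative name, not something the function requires):

fair_probs <- rep(1/6, 6)                      # equal probability for each face
result <- chisq.test(counts, p = fair_probs)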

The test will output the degrees of freedom, the chi-square test statistic, and the p-value. In this case, we have five degrees of freedom, a chi-square test statistic of 15.22, and a p-value of 0.009463.

With a very small p-value of 0.009463, we have sufficient evidence to reject the null hypothesis. Thus, we can conclude that the die appears to be weighted and not fair based on the observed distribution.

That wraps up our discussion and application of the chi-square goodness-of-fit test using R. Remember, this test allows us to evaluate the compatibility of observed data with an expected distribution and make statistical inferences based on the p-value.

Video: Chi-Squared Goodness-of-Fit Testing in R (www.youtube.com, 2020.11.30)
 

Chi-Squared Testing for Independence in R

Hey everyone, in today's video, we'll be using R to perform chi-squared testing for the independence of categorical variables. We'll be utilizing the chisq.test function for this purpose. Please note that in this video, we won't cover goodness-of-fit testing, which employs the same underlying function in R. If you're interested in learning about goodness-of-fit testing, I have a separate video on that topic (link provided above).

Let's work through a problem taken from the OpenStax textbook on introductory statistics. The problem involves a volunteer group where adults aged 21 and older volunteer for one to nine hours per week to spend time with a disabled senior citizen. The program recruits volunteers from three categories: community college students, four-year college students, and non-students. We have a contingency table, or a two-way table, that displays the distribution of volunteers based on two categorical variables: the type of volunteer and the number of hours volunteered, categorized as one to three hours, four to six hours, and seven to nine hours.
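For reference, here is the two-way table we'll be entering; the counts below are exactly the ones used in the matrix call that follows.

                              1-3 hours   4-6 hours   7-9 hours
Community College Students          111          96          48
Four-Year College Students           96         133          61
Non-Students                         91         150          53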

Now, let's switch over to R and input the data and run a chi-squared test to determine if these categorical variables are associated with each other or not.

To input the data, we'll create a matrix called volunteers using the matrix function. We'll input the data row-wise, from left to right, top to bottom.

volunteers <- matrix( c(111, 96, 48, 96, 133, 61, 91, 150, 53), nrow = 3, byrow = TRUE )

Next, let's add row names and column names to make the matrix more interpretable.

rownames(volunteers) <- c("Community College Students", "Four-Year College Students", "Non-Students")
colnames(volunteers) <- c("1-3 hours", "4-6 hours", "7-9 hours")

Now, we have a visually appealing table displaying the distribution of volunteers among the different categories.

To perform the chi-squared test, we'll use the chisq.test function. Assigning the result to a variable, such as model, allows us to access additional information if needed.

model <- chisq.test(volunteers)

To view the test results, simply type the name of the variable, model.

model

The test output will include the chi-squared test statistic, the degrees of freedom, and the p-value. For this example, we obtain a chi-squared test statistic of 12.991, 4 degrees of freedom (a 3 x 3 table gives (3 - 1) x (3 - 1) = 4), and a p-value of about 0.011. Since this p-value is below 0.05, we have evidence of an association between the type of volunteer and the number of hours volunteered.

It's important to note that the model object contains additional information, such as the expected cell counts and residuals. These can be accessed for further analysis if required.
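For instance, the expected counts and residuals are stored as components of the model object. A quick sketch of how to pull them out (these component names are part of the htest object that chisq.test returns):

model$expected    # expected cell counts under independence
model$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)
model$stdres      # standardized residuals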

Another way to perform the chi-squared test is by converting the matrix to a table and utilizing the summary function.

vol_table <- as.table(volunteers)
summary(vol_table)

This approach will also provide the chi-squared test statistic, degrees of freedom, and p-value.

That covers the process of performing chi-squared testing for the independence of categorical variables using R. Remember, the chi-squared test helps us determine if there is a significant association between two categorical variables based on observed and expected frequencies.

Video: Chi-Squared Testing for Independence in R (www.youtube.com, 2020.12.04)
 

Goodness of fit testing with R: example

Today we'll be using R to tackle a typical problem of goodness-of-fit testing. Here it is:

In a random sample of 100 three-child families, the distribution of girls was as follows:

  • 12 families had no girls
  • 31 families had one girl
  • 42 families had two girls
  • 15 families had three girls

The question is: Is it plausible that the number of girls in such families follows a binomial distribution with parameters n=3 and p=0.5?

Let's switch to R, where I've already entered the observed values. To proceed, we need to calculate the expected values and compare them to the observed counts. We'll start with the expected proportions, which can be obtained using the dbinom function in R.

Here are the expected proportions for having 0, 1, 2, or 3 girls in a family:

  • 12.5% for 0 girls
  • 37.5% for 1 girl
  • 37.5% for 2 girls
  • 12.5% for 3 girls

Next, we'll calculate the expected counts by multiplying the expected proportions by 100 (since we have 100 families in total).

Now, let's proceed with two different approaches to solve this problem. First, we'll use the chisq.test function in R, which provides a direct answer by computing the test statistic and p-value. Then, we'll go through the calculations step by step to gain a deeper understanding of the process.

Using chisq.test:

observed_counts <- c(12, 31, 42, 15)
expected_proportions <- dbinom(0:3, size = 3, prob = 0.5)
expected_counts <- expected_proportions * 100
result <- chisq.test(observed_counts, p = expected_proportions)
p_value <- result$p.value
print(p_value)

The obtained p-value indicates the probability of obtaining data at least as extreme as what we observed, assuming the null hypothesis is true. In this case, the p-value is approximately 0.53.

Since our significance level was set at 0.05, and the p-value is greater than that, we do not have sufficient evidence to reject the null hypothesis. We can conclude that the data is consistent with the hypothesis that the number of girls in these families follows a binomial distribution with parameters n=3 and p=0.5.

Now, let's compute the chi-squared test statistic manually to better understand the process:

chi_stat <- sum((observed_counts - expected_counts)^2 / expected_counts)
degrees_of_freedom <- length(observed_counts) - 1
p_value_manual <- 1 - pchisq(chi_stat, df = degrees_of_freedom)
print(p_value_manual)

The manually calculated p-value matches the result obtained using chisq.test, confirming our earlier finding of approximately 0.53.

In summary, both approaches yield the same conclusion: the data is consistent with the null hypothesis that the number of girls in these families follows a binomial distribution with parameters n=3 and p=0.5.

Video: Goodness of fit testing with R: example (www.youtube.com, 2023.01.04)
 

Correlation testing in R

Hello everyone! Today, we're going to discuss correlation testing. I'll be using R for this demonstration, but the concepts we'll cover are applicable universally, regardless of your working environment. So, stick around even if you're using a different software.

For this example, I'll be using the College dataset from the ISLR2 package. I've already loaded the dataset and, with the tidyverse loaded, set my ggplot theme to minimal. If you're interested in a detailed analysis of the College dataset, I have a video link in the description.

The College dataset consists of 777 observations, each representing a US college, with data from 1995. It contains various variables such as public or private designation, full-time undergraduate enrollment, and graduation rate.

Our focus will be on determining if there is a statistically significant correlation between the logarithm of full-time undergraduate enrollment and the graduation rate at public universities. We want to know if this apparent relationship is likely due to random chance or if it's a meaningful trend we should pay attention to.

To begin, I've created a scatter plot using ggplot, with the graduation rate on the y-axis and the logarithm of full-time undergraduate enrollment on the x-axis. I've also filtered out the private schools, so we only analyze the public colleges.
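A sketch of what that plot code might look like, assuming the College data from ISLR2 with the columns Private, F.Undergrad, and Grad.Rate:

library(ISLR2)
library(tidyverse)

public <- College %>% filter(Private == "No")    # keep only the public colleges

ggplot(public, aes(x = log10(F.Undergrad), y = Grad.Rate)) +
  geom_point() +
  theme_minimal()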

Now, let's address the logarithm. Don't be intimidated by it; it simply helps us interpret the scale of the data. In this case, we're using a base 10 logarithm, which tells us the number of zeros at the end of a value. For example, 3.0 on a logarithmic scale is 10^3, which is 1,000. By taking the logarithm, we achieve a more even spread and a roughly linear relationship between the variables.
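A quick numeric illustration of that base-10 scale:

log10(1000)     # 3: a college with 1,000 full-time undergrads sits at 3 on the log scale
log10(10000)    # 4: ten times as many students adds one unit on the log scale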

Correlation measures the strength of a generally linear relationship between two quantitative variables. In this case, we have a positive correlation of approximately 0.22, indicating that as the full-time undergraduate enrollment increases, the graduation rate tends to increase as well. This positive relationship between college size and graduation rate may seem surprising, but it's worth exploring further.

The correlation is relatively weak, as correlations range from -1 to 1. A correlation of -1 represents a perfect negative relationship, while a correlation of 1 represents a perfect positive relationship.

Now, let's conduct a correlation test in R to determine if this correlation is statistically significant. The syntax for running a correlation test is similar to computing correlation. By using the cor.test function with the two variables of interest, we obtain the correlation and additional information.
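Using the public subset from the plot sketch above, the calls might look like this (cor shown first for comparison):

with(public, cor(log10(F.Undergrad), Grad.Rate))       # the sample correlation, roughly 0.22
with(public, cor.test(log10(F.Undergrad), Grad.Rate))  # adds the t statistic, p-value, and confidence interval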

In this case, the test results provide a p-value of 0.001, which suggests that if there were no correlation between these variables in the population, the observed correlation would occur by random chance only about 0.1% of the time. Such a low probability indicates that the correlation we observed is statistically significant, and we can conclude that there is a correlation between the logarithm of full-time undergraduate enrollment and the graduation rate at public universities.

Now, let's delve a bit deeper into the test itself. It examines whether the observed correlation in the sample data could reasonably be attributed to random chance. The test assumes a linear relationship between the variables and independence of observations, making it unsuitable for time series data. Additionally, it assumes the data follows a bivariate normal distribution, but deviations from perfect normality are generally acceptable.

It's important to note that this correlation test specifically tests the null hypothesis that the population correlation is zero. It cannot be used to test for correlations other than zero.

Under the hood, the test takes the observed sample correlation r and converts it into a test statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2), which follows a Student's t-distribution with n - 2 degrees of freedom when the null hypothesis of zero correlation is true. The degrees of freedom, n - 2, depend on the sample size n and reflect the number of independent observations available for estimation.
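As a rough check of that formula, again assuming the public data frame from earlier:

r <- with(public, cor(log10(F.Undergrad), Grad.Rate))   # sample correlation
n <- nrow(public)                                        # number of public colleges
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)                # test statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 2)              # two-sided p-value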

The test also provides a confidence interval for the population correlation. In this case, the 95% confidence interval ranges from 0.087 to 0.354. This interval gives us a range of plausible values for the population correlation based on our sample data. Since the interval does not include zero, we can infer that the population correlation is likely to be positive.

It's worth noting that correlation does not imply causation. Just because we observe a correlation between two variables does not mean that one variable is causing the other to change. Correlation simply indicates a relationship between the variables, but additional research and analysis are needed to establish causality.

To visualize the correlation, we can add a regression line to our scatter plot. The regression line represents the best-fit line through the data points, indicating the general trend of the relationship. By using the geom_smooth function in ggplot with the argument method = "lm", we can add a regression line to our plot.
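Building on the earlier plot sketch, the line can be added like so:

ggplot(public, aes(x = log10(F.Undergrad), y = Grad.Rate)) +
  geom_point() +
  geom_smooth(method = "lm") +   # fit and draw a least-squares regression line
  theme_minimal()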

Now we can see the regression line fitted to the data points. It provides a visual representation of the positive relationship between the logarithm of full-time undergraduate enrollment and the graduation rate at public universities.

In summary, we have conducted a correlation test to determine if there is a statistically significant relationship between the logarithm of full-time undergraduate enrollment and the graduation rate at public universities. The test results indicated a positive correlation with a p-value of 0.001, suggesting a significant relationship. However, remember that correlation does not imply causation, and further analysis is needed to establish causal relationships.

Correlation testing is a valuable tool in statistical analysis, allowing us to explore relationships between variables and identify potential trends or associations. It helps us make informed decisions and draw meaningful conclusions based on data.

Video: Correlation testing in R (www.youtube.com, 2023.03.29)