# Programming tutorials - page 17

### Introduction to Linear Regression


Hello everyone! Today, we're diving into linear regression. We've been examining scatter plots and discussing situations where we observe a linear relationship between variables. In other words, as the X variable increases, the Y variable tends to increase or decrease at a constant rate. This holds both when the relationship is tight, as on the left side of the graph, and when it is more scattered, as on the right side.

To analyze this linear relationship, we can draw a line over the scatter plot in an intelligent manner. This line is known as the line of best fit or regression line. Now, let's delve into the mathematical aspects of linear regression. The key idea involves the notion of residuals. We place a line over our data and choose a specific X value. Then, we calculate the difference between the actual Y value in the data set and the predicted Y value on the line. This difference is called the residual, representing the deviation between the actual and expected heights. By calculating residuals for each point in our data set, squaring them, and summing them up, we obtain a quantity that can be minimized.

Using calculus, we can minimize this quantity and derive the equation for the least squares regression line. It turns out that this line passes through the point (X bar, Y bar), where X bar is the sample mean for the X values, and Y bar is the sample mean for the Y values. The slope of the least squares regression line is given by r × (s_y / s_x), where r is the coefficient of correlation, s_y is the standard deviation of the Y values, and s_x is the standard deviation of the X values. In summary, the equation for the least squares regression line is provided at the bottom of the slide.
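As a quick sketch with some made-up data (the x and y vectors below are hypothetical, not from the video), these formulas can be checked against R's built-in lm() function:

```r
# Hypothetical data for illustration only
x <- c(2, 3, 5, 7, 10)
y <- c(1.5, 1.0, 2.5, 3.0, 4.0)

# Slope from the formula r * (s_y / s_x)
slope <- cor(x, y) * (sd(y) / sd(x))
# The line passes through (X bar, Y bar), which gives the intercept
intercept <- mean(y) - slope * mean(x)

c(intercept, slope)
coef(lm(y ~ x))  # should agree with the values above
```

Both approaches give the same line; lm() is carrying out this same least squares minimization under the hood.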

Calculating these values manually can be cumbersome. To simplify the process, it's highly recommended to use technology or software. Let's consider the data corresponding to the scatter plot shown in a previous slide. By calculating the means and standard deviations, we find that X bar is 5.4, Y bar is 2.4, and so on. The coefficient of correlation is approximately 0.34, indicating a moderate to weak positive correlation. By plugging in these values, we obtain the equation for the least squares regression line: ŷ = 0.19x + 1.34.

I must emphasize that performing these calculations by hand can be tedious. Utilizing technology is a much more efficient approach. Here's an example of what the least squares regression line looks like for this data. It appears to be a reasonable fit to the data points.

Introduction to Linear Regression
• 2020.04.17
Drawing a line of best fit over a scatterplot. So easy and fun! If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more stat...

### Scatterplots and Regression Lines in R


Hello everyone! In this quick start guide, I'll show you how to create beautiful graphics using the ggplot2 package in RStudio. This discussion is suitable for beginners at the statistics one level. While there are more powerful and sophisticated methods available, I'll focus on the most intuitive and straightforward approaches. We'll be working with a subset of the iris dataset, specifically 50 rows corresponding to the virginica flower. Our goal is to create a scatter plot of sepal length versus sepal width.

Before we begin, make sure to load the tidyverse package or its family of packages. If you haven't installed it yet, use the command "install.packages('tidyverse')". If any errors occur during installation, it's recommended to search for solutions online. Once the package is loaded, we're ready to proceed.

To create a scatter plot, we'll use the basic syntax "qplot". First, specify the x-value, which is "virginica\$sepal_length" for the horizontal axis, where "virginica" is the dataset and "sepal_length" is the column name. Then, indicate the y-value as "virginica\$sepal_width" for the vertical axis. Next, we need to define how the data should be displayed. For a scatter plot, we use "geom = 'point'". Ensure that you spell "point" correctly. This will generate a basic scatter plot.

Let's improve the plot by adjusting the axis labels and exploring customization options like changing colors and point sizes. To modify the x-axis label, use "xlab = 'sepal length'". Similarly, set "ylab = 'sepal width'" to change the y-axis label. To alter the point color, add "color = I('darkred')". Note that the syntax is a bit peculiar: the I() wrapper tells qplot to treat 'darkred' as a literal color rather than a value to map to the data.

Now that the labels and point color have been adjusted, you can further experiment. For example, you can change the point size by using "size = ...". Additionally, you can add a main title to the plot. I encourage you to explore the capabilities of "qplot" further by using "?qplot" or searching online.
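Putting these options together, here is one possible version of the command. This is a sketch: the virginica subset is rebuilt below from R's built-in iris data, so the columns are named Sepal.Length and Sepal.Width rather than the snake_case names used in the video.

```r
library(ggplot2)  # the part of the tidyverse that provides qplot()

# 50-row subset for the virginica flower
virginica <- subset(iris, Species == "virginica")

p <- qplot(virginica$Sepal.Length, virginica$Sepal.Width,
           geom = "point",
           xlab = "sepal length",
           ylab = "sepal width",
           color = I("darkred"),  # I() gives a literal color
           size = I(2),
           main = "Virginica sepal measurements")
p
```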

Let's take it a step further and add a regression line. One advantage of ggplot2 and the tidyverse is that you can add layers to your plot by simply extending the existing command. Start with the "qplot" command we created earlier, and now add "geom_smooth()". This will generate a fitted line. Since we're interested in linear regression, specify "method = 'lm'" to use the linear model, and add "se = FALSE" to suppress the error ribbon for now. It's good practice to include the method argument, especially in introductory statistics classes.

If you'd like to change the color of the regression line, you can include "color = 'darkgray'" within the "geom_smooth()" command. This will result in a different color.

Lastly, let's address the question of what happens if we remove "se = FALSE". Without this argument, R will display an error ribbon. Roughly speaking, this ribbon represents a confidence interval. If we were to fit a regression line to the entire population from which these 50 observations were sampled, we would expect that line to lie within this error ribbon, so the ribbon provides a rough measure of uncertainty.
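Here is a sketch of the layered command, again using a virginica subset rebuilt from the built-in iris data:

```r
library(ggplot2)

virginica <- subset(iris, Species == "virginica")

# Scatter plot plus a least squares line, with the ribbon suppressed
p1 <- qplot(virginica$Sepal.Length, virginica$Sepal.Width, geom = "point") +
  geom_smooth(method = "lm", se = FALSE, color = "darkgray")

# Dropping se = FALSE brings back the confidence ribbon
p2 <- qplot(virginica$Sepal.Length, virginica$Sepal.Width, geom = "point") +
  geom_smooth(method = "lm")

p1
p2
```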

Scatterplots and Regression Lines in R
• 2020.04.17
A quickstart guide to making scatterplots in R using the qplot() command. So easy! So much fun! If this vid helps you, please help me a tiny bit by mashing t...

### Using Regression Lines to Make Predictions


Hello everyone! Today, we're going to delve deeper into regression lines. We'll explore how to use them for making predictions, discuss prediction errors, and understand when it's inappropriate to use them for predictions. Let's get started!

You might recognize this example from my previous video. We have a small dataset with five values, and I've drawn a line of best fit: Ŷ = 0.19X + 1.34. Now, let's consider a new input value, x = 6. Using the regression equation, we can predict the corresponding y-value: 0.19 × 6 + 1.34 = 2.48. We can plot this predicted value on the line as a blue dot at (6, 2.48).

Sometimes we make predictions when we have an x-value that corresponds to a y-value in the dataset. For example, at x = 3, we have the point (3, 1). In this case, what kind of error are we talking about? We refer to it as the residual. The residual for a data point is the difference between the actual y-value at that point and the y-value predicted by the regression line. At x = 3, the actual y-value is 1, and the predicted y-value is 0.19 × 3 + 1.34 = 1.91, resulting in a residual of 1 − 1.91 = −0.91. This means the point (3, 1) lies approximately 0.91 units below the regression line.
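The arithmetic above is easy to script. A minimal sketch using the rounded coefficients of the fitted line (so the printed values follow from 0.19 and 1.34 exactly):

```r
# Fitted line yhat = 0.19x + 1.34 (coefficients rounded)
predict_y <- function(x) 0.19 * x + 1.34

predict_y(6)                   # prediction at the new input x = 6
residual <- 1 - predict_y(3)   # actual y = 1 at x = 3
residual                       # negative: the point lies below the line
```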

When using regression lines to make predictions, it's crucial to consider the range of the data set. We should only make predictions for x-values that fall within the range or a reasonable extension of the data set. A classic example is age versus weight. As shown in the graph, there is a linear relationship for people under the age of about 12. Within this range, we can make reasonably accurate weight predictions based on age using the linear relationship. This is called interpolation, where we predict values within the data set's range.

However, it would be erroneous to use this linear relationship to make predictions outside of that range, such as for a forty-year-old individual. If we were to apply the linear relationship to predict their weight, the result would be over 340 pounds, which is clearly unrealistic. This is called extrapolation, and it should be avoided.

In summary, when using regression lines, it's essential to understand prediction errors and limitations. Residuals help us quantify the discrepancies between actual and predicted values. We should only make predictions within the range of the data set or a reasonable extension of it. Extrapolation, which involves predicting values outside of the data set's range, can lead to inaccurate and unreliable results.

Using Regression Lines to Make Predictions
• 2020.04.18
Also discussed: residuals, interpolation and extrapolation. All the good stuff! If this vid helps you, please help me a tiny bit by mashing that 'like' butto...

### Regression and Prediction in R Using the lm() Command


Hello everyone! Today, we'll be calculating regression lines in R using the built-in dataset "cars." To begin, let's take a look at the dataset and gather some information about it using the View(cars) and ?cars commands. The "cars" dataset consists of 50 entries representing speeds and stopping distances of cars from the 1920s. Although it's not fresh data, we can still explore linear relationships.

To visualize the data, we'll use the "ggplot2" package from the "tidyverse" library. Make sure to load the package using the "library(tidyverse)" command. If you haven't installed the "tidyverse" package yet, you can do so with the "install.packages('tidyverse')" command.

Next, we'll create a scatter plot of the data using the "qplot" command. We'll plot speed on the x-axis (explanatory variable) and distance on the y-axis (response variable), pass "data = cars" to indicate that we're working with the "cars" dataset, and use "geom = 'point'" for a scatter plot. The plot reveals a mostly linear relationship, suggesting that performing a linear regression is reasonable.

To add a regression line to the plot, we'll use "geom_smooth(method = 'lm', se = FALSE)". This specifies a linear regression smoother without the standard error bar.

Now, let's determine the equation of the regression line. We'll use the "lm" command, which stands for linear model. The syntax follows a "y ~ x" pattern, where the response variable (distance) is related to the explanatory variable (speed). We'll assign the result to a variable called "model". By entering "summary(model)", we can obtain additional information about the regression line, including coefficients, residuals, and statistical measures like multiple R-squared and adjusted R-squared.
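A sketch of this workflow (the coefficient values in the comments are approximately what lm() reports for the cars data):

```r
# y ~ x: response (dist) modeled as a function of the explanatory variable (speed)
model <- lm(dist ~ speed, data = cars)

coef(model)     # intercept about -17.58, slope about 3.93
summary(model)  # coefficients, residuals, R-squared, and more
```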

If we want to access specific information from the "model" object, we can use "\$" to extract its components, much as we would with a data frame. For example, "model\$residuals" gives a vector of the 50 residuals.

We can even add the residuals and fitted values as new columns to the original "cars" dataset using "cars\$residuals" and "cars\$predicted" respectively.

Lastly, let's use the "predict" function to obtain predictions for speeds not present in the dataset. We'll provide the "model" as the first argument and create a data frame with a column named "speed" (matching the explanatory variable). Using the "data.frame" function, we'll input the desired speed values. For instance, we can predict stopping distances for speeds like 12.5, 15.5, and 17. The predicted values will be displayed.
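A sketch of predict() in action. The new speeds go in a data frame whose column name matches the explanatory variable (the approximate outputs in the comment follow from the fitted coefficients):

```r
model <- lm(dist ~ speed, data = cars)

# Column name "speed" must match the explanatory variable in the model
new_data <- data.frame(speed = c(12.5, 15.5, 17))
predict(model, new_data)  # predicted stopping distances, roughly 31.6, 43.4, 49.3
```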

Regression and Prediction in R Using the lm() Command
• 2021.02.24
Let's learn about the lm() and predict() functions in R, which let us create and use linear models for data. If this vid helps you, please help me a tiny bit...

### Residual Plots in R


Hello everyone, in today's video, we will be exploring residual plots in R using the qplot command. I'll primarily be using base R functions in this tutorial. I'm also working on another video about the broom package, which is a standard way of performing tasks in R. I'll provide a link to that video once it's ready.

In this tutorial, we'll focus on the variables "Wind" and "Temp" from R's built-in airquality dataset. This dataset contains daily air quality measurements in New York from May to September 1973.

To begin, let's load the tidyverse package. Although we'll only use the qplot function, let's load the entire package for consistency.

Before diving into modeling, it's essential to visualize our data. Let's create a qplot by setting "Wind" as the explanatory variable (airquality\$Wind) and "Temp" as the response variable (airquality\$Temp). Since we have two variables, R will default to a scatter plot.

Upon examining the plot, we can observe a linear relationship between the two variables, although it's not particularly strong. To quantify this relationship, let's calculate the correlation coefficient using the cor function. The resulting correlation coefficient is -0.458, indicating a negative correlation.

Now that we have established a linear relationship, we can add a regression line to the plot. We'll modify the qplot command by including the geom_smooth function with method = "lm" to indicate a linear model. Let's exclude the error ribbon for simplicity.

With the regression line added, we can proceed to construct a linear model and obtain the equation for the regression line. Let's assign the linear model to a variable called "model" using the lm function. We'll specify "Temp" as the response variable and "Wind" as the explanatory variable. It's important to mention the name of the data frame explicitly.

To gain more insights into the model, we can use the summary function to obtain a summary of the model. The summary provides various information, including the intercept (90.1349) and the coefficient for the slope (-1.23). The interpretation of the slope coefficient is that for every unit increase in wind, the temperature decreases by approximately 1.23 units. Checking the help file will provide information on the units used, such as wind in miles per hour and temperature in degrees Fahrenheit.
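As a sketch, the pieces described so far look like this (values in the comments are approximate):

```r
# Temp as a function of Wind in the built-in airquality data
model <- lm(Temp ~ Wind, data = airquality)

cor(airquality$Wind, airquality$Temp)  # about -0.458
coef(model)                            # intercept about 90.13, slope about -1.23
```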

We can directly access the coefficients using the coefficients function, which returns the intercept and wind coefficient from the model. Additionally, we can obtain the fitted values using the fitted.values function, providing us with a vector of predicted temperatures for each wind value. We can add this as a new column, "predicted," to the air quality data frame.

Similarly, we can obtain the residuals using the residuals function, which gives us the differences between the observed and predicted values. Adding the residuals as another column, "residuals," to the data frame completes our exploration. We can visualize the data frame again to confirm the presence of the new columns.

To assess the relationship between the fitted values and residuals, we can create a residuals plot. In the qplot command, we'll set the fitted values as the x-axis variable (fitted.values(model)) and the residuals as the y-axis variable (residuals(model)). A scatter plot will be generated as specified in the qplot arguments.

The purpose of the residuals plot is to identify any patterns or trends in the residuals. In a valid linear model with constant variance, the plot should resemble a cloud without any discernible pattern. Adding a regression line with geom_smooth and method = "lm" will help verify this. We'll also set se = FALSE to remove the standard error bar.
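A sketch of the residual plot command described above:

```r
library(ggplot2)

model <- lm(Temp ~ Wind, data = airquality)

# Residuals against fitted values; a patternless cloud supports the model
p <- qplot(fitted.values(model), residuals(model), geom = "point") +
  geom_smooth(method = "lm", se = FALSE)
p
```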

By examining the residuals plot, we can see that there is no discernible pattern or trend, indicating that our model captures the linear relationship adequately. The fitted line, which coincides with the horizontal line y = 0, confirms this observation.

That concludes our tutorial on creating residual plots in R using the qplot command. By visualizing and analyzing the residuals, we can assess the goodness of fit and the appropriateness of our linear model. Remember that there are multiple ways to achieve the same results in R, and exploring different syntaxes and functions can enhance your understanding of the language.

Residual Plots in R
• 2021.08.11
It's easy to make beautiful residual plots in R with ggplot. Let's go!If this vid helps you, please help me a tiny bit by mashing that 'like' button. For mor...

### Outliers: Leverage, Discrepancy, and Influence


Hello everyone! Today, we'll be delving into the concepts of leverage, discrepancy, and influence in the context of linear regression. Although I'll focus on the scenario with a single explanatory variable, please note that everything discussed here applies directly to higher dimensions as well.

In a dataset with two variables, individual observations can exhibit unusual characteristics in their x-values, y-values, or both. When we use the term "outlier," we specifically refer to observations that significantly deviate in the y-direction compared to the general trend of the data. These outliers are points with high discrepancy.

However, in everyday language, we often use the term "outlier" more loosely. To illustrate this concept, let's consider three data sets, each displaying a linear trend with one unusual observation. In the first two graphs, you'll notice a point that lies far away from the regression line, exhibiting high discrepancy. In the third case, the unusual value aligns fairly well with the overall data trend, so it wouldn't be considered an outlier based on discrepancy alone.

Now, let's shift our focus to leverage. Observations with unusual x-values have a greater potential to impact the model's fit, and such observations are said to have high leverage. Examining the same three plots from a leverage perspective, we find that the two rightmost plots contain observations with high leverage. These outliers have x-values that are significantly distant from the majority of the data. Conversely, the first plot features an outlier with low leverage since its x-value aligns well with the other values in the dataset.

An observation that substantially alters the fit of a model is considered to have high influence. Returning to the first two outliers from the previous plots, let's examine them through the lens of influence. In the first graph, we observe an outlier with low influence. If we remove this value from the dataset, the regression line doesn't undergo significant shifts. Notably, the slope remains relatively unchanged. Conversely, in the rightmost plot, we see an outlier with high influence. Upon removing it from the dataset, the regression line experiences substantial changes. Typically, influential observations exhibit both high discrepancy and high leverage.

While all of these concepts can be quantified, I won't delve into the details in this video. However, I do want to point you in the right direction if you wish to explore this further. Discrepancy is often measured using studentized residuals, which are standardized residuals that quantify the deviation of observations in the y-direction from the model's prediction. Leverage can be assessed using hat values, which measure the distance of x-values from the expected average x-value. Finally, influence is frequently quantified using Cook's distance.

Fortunately, you don't have to calculate these measures by hand, as R provides convenient methods. The broom package is particularly useful in this regard, and I'll create a video on it as soon as possible.

Outliers: Leverage, Discrepancy, and Influence
• 2021.07.14
How should we think about unusual values in two-variable data sets? How is an unusual x-value different from an unusual y-value? In this vid, we'll learn all...

### R^2: the Coefficient of Determination


Today's topic is R-squared, the coefficient of determination. It measures the spread of observations around a regression line or any statistical model. It represents the proportion of the variance in the response variable (y) that can be attributed to changes in the explanatory variable or variables, in both the single-variable and higher-dimensional cases.

For linear models, R-squared always falls between 0 and 1. Values closer to 1 indicate that the data points are tightly clustered around the regression line, while values closer to 0 indicate greater spread.

To make this concept clearer, let's visualize three data sets. Each set has a variance of 1 for the y-values, and I've drawn the regression line for each case. As R-squared increases from 0.2 to 0.5 to 0.8, we observe a tighter and tighter spread of the data around the regression line.

Now, let's dive into a more precise definition. R-squared is calculated as the variance of the fitted y-values divided by the variance of the observed y-values. Algebraically, this can be expressed as:

R-squared = 1 - (variance of residuals) / (variance of observed y-values)

To simplify further, we often abbreviate this as R-squared = 1 - (RSS / TSS), where RSS represents the residual sum of squares and TSS denotes the total sum of squares.
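These identities are easy to verify in R. A sketch using the built-in cars data:

```r
model <- lm(dist ~ speed, data = cars)
y     <- cars$dist

var(fitted(model)) / var(y)        # variance of fitted over variance of observed
1 - var(residuals(model)) / var(y) # equivalently, 1 - RSS/TSS
summary(model)$r.squared           # about 0.651: matches both lines above
cor(cars$speed, cars$dist)^2       # with one predictor, also little r squared
```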

In a least-squares regression model with a single explanatory variable, an important fact to note is that the coefficient of determination is equal to the square of the sample coefficient of correlation (r). In other words, big R-squared is equal to little r-squared.

In the case of higher-dimensional models, the statement is similar. R-squared is equal to the square of the correlation between observed and fitted y-values. This holds true even for the single-variable case, although we don't usually think of it in those terms.

It's worth mentioning that R-squared is often misunderstood and misinterpreted. So, let's clarify its meaning and limitations. R-squared measures the proportion of variability in y that can be explained by the variability in x. By definition, it will be lower for datasets where much of the variability in the y-values is left unexplained by the model. Therefore, models with R-squared close to 1 are not necessarily good, as demonstrated in an example where R-squared is 0.93, but the linear model is a poor fit for the data.

Similarly, models with low R-squared are not necessarily bad. For instance, a model with an R-squared of 0.16 may fit the data very well, but the data itself inherently contains a lot of natural variability and noise.

Remember that R-squared only measures variability about the regression line and does not directly indicate the usefulness or reasonability of a model. To assess linear models properly, consider multiple tools and factors, such as residual standard error (the standard deviation of the residuals), which provides insight into the variability of the data compared to predicted values. Additionally, you can examine the significance level of the regression using the t statistic for linear fits and the f statistic for testing the null hypothesis that all regression coefficients are zero in higher-dimensional models.

When evaluating models, it's crucial not to rely solely on R-squared but to consider it in conjunction with other metrics and analyses.

R^2: the Coefficient of Determination
• 2021.10.20
Let's get to know R^2, the coefficient of determination, which measures the spread of observations about a regression line or other statistical model.If this...

### Chi-Squared Calculations in R


Today we will be performing some chi-squared calculations in R. The chi-squared test is commonly used in inferential statistics for various purposes, such as goodness-of-fit testing and hypothesis testing involving variances. A chi-squared random variable is continuous and skewed to the right. For r degrees of freedom, its expected value is r and its variance is 2r. In most applications, r is a positive integer, although it can also be a non-integer.

As the value of r increases, the probability density function (PDF) of the chi-squared distribution shifts to the right and starts to resemble a bell curve due to the central limit theorem. The parameter r is known as the number of degrees of freedom for the chi-squared distribution.

In R, there are four basic functions for calculating chi-squared distributions:

1. rchisq(n, r): This function generates n random values from the chi-squared distribution with r degrees of freedom. Note that the number of draws comes first. For example, rchisq(16, 5) generates 16 random values from chi-squared with 5 degrees of freedom.

2. pchisq(x, r): This is the cumulative distribution function (CDF) for the chi-squared distribution with r degrees of freedom. It returns the probability of randomly getting a value less than or equal to x in that distribution. For example, pchisq(8, 5) gives the probability of getting a value less than or equal to 8 in chi-squared with 5 degrees of freedom, which is approximately 0.844.

3. qchisq(p, r): This is the inverse CDF for the chi-squared distribution with r degrees of freedom. It returns the x value for which the probability of getting a value less than or equal to x is equal to p. For example, qchisq(0.5, 12) gives the median of chi-squared with 12 degrees of freedom, which is approximately 11.34.

4. dchisq(x, r): This function gives the value of the probability density function (PDF) of the chi-squared distribution with r degrees of freedom at x. The PDF is of theoretical importance but is less commonly used in numerical calculations.
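Summarizing the four functions in one place (note that the degrees of freedom always come after the primary argument):

```r
rchisq(16, 5)    # 16 random draws from chi-squared with 5 degrees of freedom
pchisq(8, 5)     # CDF: P(X <= 8), about 0.844
qchisq(0.5, 12)  # inverse CDF: the median with 12 df, about 11.34
dchisq(8, 5)     # PDF evaluated at x = 8
```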

Now, let's solve a few sample problems using these functions:

Problem 1: Compute the probability of randomly getting an x value between 12 and 18 in chi-squared with 15 degrees of freedom.

prob <- pchisq(18, 15) - pchisq(12, 15)

The probability is approximately 0.4163.

Problem 2: Given that there's an 80% chance a random draw from chi-squared with 20 degrees of freedom is greater than x, find the value of x.

x <- qchisq(0.2, 20)

The value of x is approximately 14.57844.

Problem 3: Simulate ten thousand draws from the chi-squared distribution with 4 degrees of freedom and generate a histogram of the results.

x <- rchisq(10000, 4)
library(ggplot2)
qplot(x, geom = "histogram", col = I("black"))

This will generate a histogram of the simulated values.

I hope this helps you understand and apply chi-squared calculations in R.

Chi-Squared Calculations in R
• 2020.10.15
In the vid, I cover the functions pchisq(), qchisq(), rchisq(), and dchisq(). If this vid helps you, please help me a tiny bit by mashing that 'like' button....

### Understanding the chi-squared distribution


Today, we're going to discuss the chi-squared distribution, a fundamental concept you'll encounter while studying statistical inference in your journey through data science. The chi-squared distribution arises when you want to measure how far a set of independent numerical observations deviates from their expected values.

To explain this more formally, you calculate a z-score for each observation by subtracting the expected value from the observation and dividing it by the standard deviation. After squaring each of these z-scores and summing them up, you obtain the chi-squared random variable. This variable quantifies the overall deviation of your observations from their expected values.

For instance, if all observations align perfectly with their expected values, the chi-squared statistic would be zero. As the results diverge further from the expected values, the chi-squared value increases. By squaring the z-scores, we ensure that low and high deviations don't cancel each other out.
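A quick simulation sketch makes this concrete: summing r squared standard normal draws produces one chi-squared value, and repeating the experiment many times shows the sample mean landing near r.

```r
set.seed(1)  # for reproducibility
r <- 5

# Each trial: sum of r squared standard normal z-scores
draws <- replicate(10000, sum(rnorm(r)^2))

mean(draws)  # close to the expected value, r = 5
var(draws)   # close to the variance, 2r = 10
```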

The chi-squared distribution with r degrees of freedom represents the sampling distribution of this random variable. The degrees of freedom (r) correspond to the number of independent observations or z-scores. Note that the random variable shares the same name as the distribution, but the context usually distinguishes between them.

Since each z-score is, at least approximately, a standard normal random variable, the sum of their squares follows a chi-squared distribution. The probability density function of the chi-squared distribution is positive only for non-negative chi-squared values. The distribution is right-skewed because extremely high values for individual z-scores become increasingly less likely.

The typical graph of the chi-squared distribution with 5 degrees of freedom showcases this strong rightward skew. Its support (set of possible outcomes) consists of the non-negative values. Two important facts to remember are that the expected value of the chi-squared distribution with r degrees of freedom is equal to r and that the peak of the distribution occurs at r minus 2, provided r is at least 2 (otherwise, the peak is at zero).

As the number of degrees of freedom increases, the chi-squared distribution approaches a normal distribution according to the central limit theorem. This approximation is observable in a sketch showing the chi-squared distribution with r = 50, which still exhibits a slight rightward skew.

The chi-squared distribution is frequently used in inferential statistics, as evident from the initial slide. Some common applications include significance testing for variance under the assumption of a normal distribution, goodness-of-fit testing for categorical variables, and chi-squared tests for independence.

To compute probabilities in a chi-squared distribution, you can use the cumulative distribution function (CDF). The CDF, denoted as F(x), provides the probability of obtaining a value less than or equal to x in the specified chi-squared distribution. This can be better understood with a visual representation, where the shaded area represents the probability.

In R, you can perform chi-squared computations using the pchisq() command, specifying the value of interest and the number of degrees of freedom. For example, to compute the probability of obtaining a value less than or equal to 8 in the chi-squared distribution with five degrees of freedom, you would use pchisq(8, 5), resulting in approximately 0.843.

If you're interested in further details or computations involving the chi-squared distribution in R, I have specific videos that cover these topics. Feel free to check them out for more in-depth explanations.

Understanding the chi-squared distribution
• 2022.12.07
In absolute terms, just how far are your results from their expected values?If this vid helps you, please help me a tiny bit by mashing that 'like' button. F...

### Goodness-of-Fit Testing


Hey everyone, today we're going to discuss goodness-of-fit testing using the chi-squared distribution. Suppose we have a categorical variable, such as the year of college students in statistics classes at a large university, and we're told it follows a specific distribution: 50% freshmen, 30% sophomores, 10% juniors, and 10% seniors. How can we test if this distribution fits our sample data?

To begin, let's set up the null and alternative hypotheses. The null hypothesis states that the population of all students in statistics classes follows the claimed distribution (50% freshmen, 30% sophomores, etc.), while the alternative hypothesis assumes a different distribution. To test between these hypotheses, we'll compare the observed counts in our sample data to the expected counts under the null hypothesis.

Let's denote the observed counts as 'o' and the expected counts as 'e.' We'll calculate a test statistic called chi-squared, which is the sum of (o - e)^2 / e. If the null hypothesis is true, this test statistic follows a chi-squared distribution with k - 1 degrees of freedom, where k is the number of categories.

In our case, we have four categories, so we'll be using the chi-squared distribution with three degrees of freedom. A larger test statistic indicates that our sample data is less compatible with the null hypothesis, suggesting a poorer fit.

To perform the significance test and compute chi-squared, we need to calculate the expected counts under the null hypothesis. For a sample size of 65, we multiply the percentages by 65 to obtain expected counts of 32.5, 19.5, 6.5, and 6.5.
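Under the null hypothesis the expected counts are just the claimed proportions times the sample size. A one-line Python sketch of that step, using the proportions and n = 65 from the example:

```python
props = [0.50, 0.30, 0.10, 0.10]  # claimed distribution: fresh., soph., jr., sr.
n = 65                            # sample size
expected = [p * n for p in props]
print(expected)  # → [32.5, 19.5, 6.5, 6.5]
```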

Next, we calculate the chi-squared test statistic by subtracting the expected count from the observed count for each cell, squaring the result, dividing by the expected count, and summing these values across all categories. In our case, the test statistic is 3.58.

To find the probability of obtaining a value greater than or equal to our observed chi-squared statistic, we use the cumulative distribution function in R via the pchisq() command; subtracting pchisq(3.58, 3) from one gives the p-value. In this example, the p-value is approximately 0.31, indicating that the data does not provide strong evidence against the null hypothesis.
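Putting the pieces together, here is a self-contained Python sketch of the whole test. The observed counts below are hypothetical, since the transcript reports only the expected counts and the resulting statistic of 3.58; `chisq_cdf_df3` stands in for R's pchisq using the closed form of the chi-squared CDF at three degrees of freedom:

```python
import math

def chisq_cdf_df3(x):
    """Chi-squared CDF with 3 degrees of freedom (closed form via erf)."""
    y = x / 2.0
    return math.erf(math.sqrt(y)) - math.sqrt(2 * x / math.pi) * math.exp(-y)

observed = [36, 22, 4, 3]          # hypothetical sample of 65 students
expected = [32.5, 19.5, 6.5, 6.5]  # 65 * (0.50, 0.30, 0.10, 0.10)

# chi-squared statistic: sum of (o - e)^2 / e across all categories
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# p-value: probability of a statistic at least this large under H0
p_value = 1.0 - chisq_cdf_df3(stat)
print(stat, p_value)
```

With these made-up counts the statistic and p-value come out close to, but not identical to, the transcript's 3.58 and 0.31; the point is the mechanics, not the specific numbers.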

It's essential to note that a large p-value does not prove the null hypothesis; it simply suggests a lack of evidence against it. Finally, we should consider when it's appropriate to use a chi-squared goodness-of-fit test. Firstly, it applies to categorical variables. If you have quantitative variables, you can transform them into categorical variables by binning them. Additionally, the data should be obtained through simple random sampling, and the expected cell counts should generally be at least five. If many bins are nearly empty, alternative methods may be more appropriate, such as Fisher's exact test in certain situations.
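The binning step mentioned above can be sketched in a few lines of Python; the bin edges and ages here are hypothetical, chosen only to illustrate turning a quantitative variable into category counts:

```python
# Bin a quantitative variable into categories so a chi-squared
# goodness-of-fit test can be applied (edges and data are made up).
ages = [18, 19, 19, 20, 22, 23, 25, 31, 34, 40]
edges = [18, 21, 25, 41]  # bins: [18, 21), [21, 25), [25, 41)

counts = [0] * (len(edges) - 1)
for a in ages:
    for i in range(len(edges) - 1):
        if edges[i] <= a < edges[i + 1]:
            counts[i] += 1
            break
print(counts)  # → [4, 2, 4]
```

The resulting counts then play the role of the observed counts in the test, with expected counts taken from whatever distribution the null hypothesis claims for the bins.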

Apart from the considerations we've mentioned earlier, there are a few more points to keep in mind when deciding whether to use a chi-squared goodness-of-fit test. These include:

1. Independence: The observations within each category should be independent of each other. This assumption is important for the validity of the test. If the observations are not independent, alternative statistical tests may be more suitable.

2. Sample size: While there is no fixed rule, larger sample sizes tend to provide more reliable results. With larger samples, even small deviations from the expected distribution may yield statistically significant results. However, very large sample sizes can sometimes lead to significant results even for trivial deviations from the expected distribution, so it's essential to consider the practical significance as well.

3. Parameter estimation: In some cases, the expected counts for each category are not known precisely but are estimated from the data. When estimating parameters from the same data used for hypothesis testing, it can lead to biased results. In such situations, adjustments or alternative methods should be considered.

4. Categorical variables with multiple levels: The chi-squared goodness-of-fit test we discussed so far is appropriate when testing the fit of a single categorical variable to a specified distribution. However, if you have multiple categorical variables and want to examine their joint distribution, other tests like the chi-squared test of independence or log-linear models may be more suitable.

It's worth noting that the chi-squared goodness-of-fit test is a useful tool for examining whether observed data follows an expected distribution. However, it does not provide information about the reasons behind any discrepancies or identify which specific categories contribute the most to the differences.

As with any statistical test, the interpretation of the results should consider the context, background knowledge, and the specific objectives of the analysis. It's crucial to understand the limitations and assumptions of the test and to use it as part of a comprehensive analysis rather than relying solely on its outcome.

In summary, the chi-squared goodness-of-fit test is a valuable method for assessing the fit between observed data and an expected distribution for categorical variables. By comparing observed and expected counts, calculating the test statistic, and determining the p-value, we can evaluate the compatibility of the data with the null hypothesis. However, it's important to consider the assumptions, sample size, and other factors to ensure the validity and relevance of the test in a given context.

Goodness-of-Fit Testing
• 2020.11.10