
 

Describing Data Qualitatively

Hello everyone, today we'll be discussing the qualitative description of dataset shapes, focusing on building vocabulary to effectively communicate our observations. We will explore various graphical representations such as histograms, frequency polygons, and stem plots, and discuss their characteristics. Let's dive into some examples:

First, let's examine a histogram. In this case, the graph exhibits a symmetric shape, with the left half resembling the right half. Although real data rarely exhibits perfect symmetry, we focus on describing the overall shape rather than pinpointing specific values. Another type of symmetric distribution is a uniform graph, where data values are evenly distributed across bins. This results in a horizontally flat shape, indicating equal likelihood of values falling into each bin.

Now, let's explore datasets that are not symmetric. Instead of histograms, we'll consider stem plots for a change. In this stem plot example, we can observe an asymmetric shape. It is evident that the distribution is not the same on both sides of the center, which lies around 92. Moreover, we can discern the direction of the asymmetry. In this case, there is a longer tail towards higher numbers, away from the center. This indicates a right-skewed distribution.

On the other hand, here is a stem plot that is left-skewed. We notice a longer tail on the smaller values side, whereas the data is more concentrated towards larger values. It is important to accurately describe the direction of asymmetry to provide a comprehensive understanding of the dataset.

Lastly, let's consider a dataset that may initially appear right-skewed due to a single large outlier around 160 or 170. However, if we disregard this outlier, the distribution exhibits a fairly symmetric shape, potentially resembling a bell curve. It is crucial to identify outliers as they may represent errors, exceptional cases, or phenomena requiring separate analysis. When describing the overall shape of the data, outliers should be acknowledged but not heavily considered.

By developing a vocabulary to describe dataset shapes, we can effectively communicate the key characteristics and patterns observed in the data. Understanding the shape of a dataset aids in interpreting its properties and enables us to draw meaningful insights.

Describing Data Qualitatively
  • 2020.07.12
  • www.youtube.com
It's time to build some vocabulary for describing single-variable data sets, and to look at some example histograms and stem plots. Yay! If this vid helps yo...
 

Understanding Mean, Median, and Mode

Hello everyone, today we will discuss the concepts of mean, median, and mode, focusing on their interpretations as measures of central tendency. Each measure has its own usefulness and understanding them is crucial. Let's quickly go through their definitions.

The mean represents the numerical average of a dataset. It is calculated by summing up all the values in the set and dividing the total by the number of values. The mean is commonly denoted by X-bar or X with a line over it, especially when dealing with samples.

The median is the value that divides the data exactly in half. To find the median, arrange the data from lowest to highest. If there is an odd number of values, the median is the middle value. For an even number of values, average the two middle values to find the median. The median is often denoted by a capital M.

The mode is simply the most common value in the dataset. A distribution can have multiple modes if two or more values tie for the highest frequency, and if every value occurs equally often, we say the distribution has no mode.

Let's consider an example. Suppose we have a dataset with 16 values. The mean is calculated by summing all the values and dividing by 16. In this case, the mean is 67.9375. The median, since we have an even number of values, is found by taking the average of the two middle values, resulting in 65.5. The mode, the most common value, is 65.
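If you'd like to check these with technology, here is a small R sketch. The 16 values below are stand-ins (the original data isn't reproduced in this transcript), chosen so that the results match the numbers quoted above.

    # Stand-in data with mean 67.9375, median 65.5, and mode 65
    x <- c(48, 55, 58, 60, 62, 65, 65, 65, 66, 68, 70, 72, 75, 78, 88, 92)
    mean(x)                      # 67.9375
    median(x)                    # 65.5 (average of the two middle values)
    names(which.max(table(x)))   # "65" -- R has no built-in statistical mode, so we count frequencies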

Each measure of central tendency also has a graphical interpretation. In a histogram, the mode is the highest point on the histogram, representing the most frequent value. The median is the value that splits the histogram in half, dividing the area equally. The mean is the value that would allow the histogram to balance.

Consider the example of a histogram. The mode can be determined by identifying the x-value where the histogram is tallest, which is slightly larger than 3 in this case. The median is the value that splits the histogram's area in half, which is around 4.5. The mean is the value that would balance the histogram, slightly less than 5.

Why do we need three measures of central tendency? Each measure has its advantages and disadvantages. The mean is commonly used in statistical analysis, and it is intuitive. However, it is highly influenced by outliers and may not be suitable for skewed distributions.

The median is simple to compute and understand, and it is not sensitive to outliers. However, it does not utilize all the information in the dataset and may present challenges in statistical inference.

The mode is a universal measure of central tendency, even for categorical variables. However, the most common value does not necessarily represent the middle of the distribution, making it less reliable as a measure of center.

Consider a small dataset of exam scores, including an outlier. In this case, the mean of 79 does not accurately describe the typical student's performance. The median of 94 is a more descriptive measure. Removing the outlier reveals the difference more clearly, as the mean changes significantly while the median remains unchanged.

Understanding the distinctions between the mean, median, and mode allows us to effectively interpret and communicate the central tendencies of a dataset, considering their strengths and limitations in different scenarios.

Understanding Mean, Median, and Mode
  • 2020.07.13
  • www.youtube.com
How can we measure the center of a data set? What are the strengths and weaknesses of each measure? How can we understand each graphically? If this vid helps...
 

Percentiles and Quantiles in R

Today we will be discussing percentiles and quantiles in R. Let's begin by reviewing their meanings.

Percentiles are a way of measuring the relative position of a value within a dataset. In general, the p-th percentile of a dataset is a value that is greater than p percent of the data. For example, the 50th percentile is the median, the 25th percentile is the first quartile, and the 75th percentile is the third quartile. It represents the value that lies above 75 percent of the data.

Different methods exist for computing percentiles, and there is no universally accepted approach. However, the good news is that all of the methods yield very similar results. To compute percentiles, it is best to rely on technology, such as R, which offers efficient and accurate calculations.

Quantiles, on the other hand, are essentially the same as percentiles. However, the term "quantiles" is often used when referring to decimal values, while "percentiles" are associated with integer values. For instance, you may have the 15th percentile but the 0.15 quantile. The advantage of quantiles is that they allow for greater precision by expressing values with as many decimal places as needed.

Now, let's switch over to R and explore how to compute percentiles and quantiles using the "faithful" dataset, which contains information about eruption length and waiting time of the Old Faithful geyser in the United States, measured in minutes.

To compute percentiles and quantiles in R, we can use the "quantile" function. It requires two arguments. First, we specify the variable we are interested in, which in this case is "faithful$waiting." Next, we indicate the desired quantile, written as a decimal. For example, to calculate the 35th percentile (0.35 quantile), we write 0.35 as the quantile argument. By executing the command, we obtain the result, such as 65 in this case. This implies that approximately 35% of all eruptions have a waiting time less than or equal to 65.

In R, it is possible to compute multiple quantiles simultaneously by providing a vector of quantiles. For instance, using the "c()" function, we can specify quantiles 0.35, 0.70, and 0.95. The result will be a vector containing the respective quantiles: 65, 81, and 89.
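In R, that looks something like the following minimal sketch, using the built-in faithful dataset described above.

    quantile(faithful$waiting, 0.35)                  # the 0.35 quantile (35th percentile)
    quantile(faithful$waiting, c(0.35, 0.70, 0.95))   # several quantiles at once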

Another useful command is "summary," which provides a summary of the variable. By passing the variable "faithful$waiting" to the command, we obtain the first quartile (25th percentile), median (50th percentile), third quartile (75th percentile), as well as the minimum, maximum, and mean values.

Now, let's address the opposite question. If we have a value within the dataset and want to determine its percentile, we can use the "ecdf" command. By specifying the variable of interest, such as "faithful$waiting," and providing a specific value from the dataset, like 79, the command will return the percentile of that value. In this example, the result is 0.6617647, indicating that a waiting time of 79 corresponds to approximately the 66th percentile.
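Here is a short sketch of both commands; the value 0.6617647 quoted above is what ecdf reports for a waiting time of 79.

    summary(faithful$waiting)    # min, Q1, median, mean, Q3, max
    ecdf(faithful$waiting)(79)   # proportion of waiting times at or below 79 (about 0.66)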

Understanding percentiles and quantiles allows us to assess the relative position of values within a dataset, providing valuable insights into the distribution and characteristics of the data.

Percentiles and Quantiles in R
  • 2020.07.18
  • www.youtube.com
Computing percentiles and quantiles by hand is for suckers! If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more #rstats ...
 

Sample Variance and Standard Deviation

Hey everyone, today we're going to delve into the concept of sample variance and standard deviation. These two measures help us understand the extent of variability or spread in a data set. They provide insights into how far the values in the data set deviate from the mean, on average.

Let's take a look at the formulas. In the formulas, "n" represents the total sample size, "X_i" denotes the values in the data set (e.g., X_1, X_2, X_3, and so on), and "X bar" (X with a line over it) represents the sample mean. While we typically use technology like R to compute these measures, it's crucial to understand the underlying concepts, especially since we no longer perform these calculations manually.
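The on-screen formulas aren't reproduced in this transcript, so here they are reconstructed in LaTeX notation; these are the standard definitions of the sample variance and sample standard deviation.

    s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
    \qquad
    s = \sqrt{s^2} = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 }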

The key component in both measures is the term "X_i minus X bar," which represents the deviation of each value (X_i) from the sample mean. In other words, it quantifies how much each value differs, positively or negatively, from the average. Ideally, we want to determine the average of these deviations, but taking a simple average would yield zero since positive and negative deviations cancel each other out. To address this, we square each deviation (X_i minus X bar) before computing the average. This results in the formula for sample variance, which represents the average of the squared deviations from the mean.

However, you might have noticed that we divide by (n-1) instead of n in the variance formula. There are several reasons for this, but here's a straightforward one: once the sample mean (X bar) is known, only (n-1) of the deviations are free to vary. The deviations (X_i minus X bar) always add up to zero, so the last one is completely determined by the others. Dividing by (n-1), the number of independent deviations, accounts for this, and we obtain the sample variance as a meaningful measure of variability.

Another issue is that variance is not on the same scale as the original data, making it abstract. To address this, we take the square root of the sample variance, resulting in the formula for the sample standard deviation. While the standard deviation requires more computation and can be theoretically challenging, it is easier to interpret and visualize than the variance. Both variance and standard deviation have their uses in different contexts.

Let's consider an example with a data set of only four values. To compute the sample variance and standard deviation, we first calculate the sample mean by summing the four values and dividing by four, obtaining a mean of 121. Using the variance formula, we square the deviations (X_i minus X bar) for each value and average the squared deviations, dividing by three (one less than the number of values). This yields a variance of 220. However, this value lacks immediate interpretability. To address this, we take the square root of the variance, resulting in a standard deviation of 14.8. This value makes more sense as a measure of spread in the data set.

In terms of technology, we can use commands like "var" and "sd" in R to compute variance and standard deviation, respectively. It is highly recommended to leverage technology for these calculations, as it saves time and provides accurate results. Calculating variance and standard deviation manually is no longer necessary in most cases.
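For example, a minimal sketch (the four values below are stand-ins, since the data from the example above isn't listed in the transcript, so the outputs here won't match the 220 and 14.8 quoted earlier):

    x <- c(105, 116, 128, 135)   # hypothetical data
    var(x)   # sample variance, dividing by n - 1
    sd(x)    # sample standard deviation, the square root of var(x)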

Furthermore, it's important to note that in the majority of cases, about two-thirds of the data values will fall within one standard deviation of the mean. For a bell-shaped distribution (normal distribution), approximately 68% of the data lies within one standard deviation, about 95% lies within two standard deviations, and nearly all of it (99.7%) lies within three standard deviations of the mean. This is known as the empirical rule or the 68-95-99.7 rule.

To illustrate this, let's consider a dataset of 200 values randomly chosen from integers between 0 and 100. The mean of this dataset is 49.9, and the standard deviation is 27.3. Applying the empirical rule, if we go one standard deviation above and below the mean, we would capture 68% of the values, which amounts to 136 values. If the distribution follows a bell shape (normal distribution), we can make even more precise estimates. In this case, approximately 95% of the values (190 out of 200) would fall within two standard deviations of the mean, and nearly all values (199 out of 200) would lie within three standard deviations of the mean.

Let's conclude with one more example using the empirical rule. Suppose we have scores from a standardized test that approximately follow a bell-shaped distribution. The mean score is 1060, and the standard deviation is 195. Applying the empirical rule, we can estimate that about 68% of the scores would fall between 865 and 1255 (one standard deviation below and above the mean). Approximately 95% of the scores would lie between 670 and 1450 (two standard deviations below and above the mean). Finally, about 99.7% of the scores would be within the range of 475 and 1645 (three standard deviations below and above the mean).
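The three ranges above come straight from mean ± k·(standard deviation); here's the arithmetic as a quick R sketch.

    m <- 1060; s <- 195
    m + c(-1, 1) * s   # about 68% of scores:   865 to 1255
    m + c(-2, 2) * s   # about 95% of scores:   670 to 1450
    m + c(-3, 3) * s   # about 99.7% of scores: 475 to 1645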

Understanding variance and standard deviation helps us grasp the spread and variability within a dataset. While technology facilitates their computation, it is crucial to comprehend the underlying concepts to interpret and analyze data effectively. By utilizing these measures, we can gain valuable insights and make informed decisions based on the characteristics of the data.

Sample Variance and Standard Deviation
  • 2020.07.15
  • www.youtube.com
Let's measure the spread of data sets! Variance and standard deviation are hugely important in statistics; they're also easy to misunderstand. If this vid he...
 

Z-Scores

Hello everyone, in today's discussion, we will explore z-scores, also known as standard scores. This method allows us to measure the relative position of values within a dataset.

A z-score represents the number of standard deviations by which a value deviates from the mean. For example, if we have a dataset with a mean of 50 and a standard deviation of 8, a value of 62 would have a z-score of 1.5. This means that the value of 62 is 1.5 standard deviations above the mean.

Z-scores are particularly useful for assessing relative positions in datasets with symmetric distributions, especially those that follow a bell-shaped or normal distribution. However, when dealing with skewed data or datasets containing outliers, the mean and standard deviation may not accurately represent the center and spread of the data. Consequently, the usefulness of z-scores diminishes in such cases.

The formula for calculating a z-score is: z = (x - μ) / σ, where x is the value in the dataset, μ is the mean, and σ is the standard deviation. The mean is sometimes represented by x-bar and the standard deviation by s, but the formula remains the same.

Z-scores are particularly valuable when comparing the relative positions of values across different datasets. Let's consider an example to illustrate this. The average height of adult men in the United States is 69.4 inches, with a standard deviation of 3.0 inches. On the other hand, the average height of adult women in the United States is 64.2 inches, with a standard deviation of 2.7 inches. Now, we can compare the relative rarity of a 64.2-inch tall man and a 69.4-inch tall woman.

To calculate the z-score for the man, we use the formula (64.2 - 69.4) / 3.0. The resulting z-score is -1.73, indicating that the man's height is 1.73 standard deviations below the mean height of men. For the woman, the z-score is (69.4 - 64.2) / 2.7, resulting in a z-score of 1.93. This means that the woman's height is 1.93 standard deviations above the mean height of women. Comparing the absolute values of the two z-scores, we can conclude that the woman's height is more unusual relative to the average height of women.
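The two calculations can be checked with a couple of lines of R, using the means and standard deviations quoted above:

    z_man   <- (64.2 - 69.4) / 3.0   # about -1.73
    z_woman <- (69.4 - 64.2) / 2.7   # about  1.93
    abs(z_woman) > abs(z_man)        # TRUE: the woman's height is the more unusual one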

It's important to note that z-scores alone do not provide a definitive distinction between "usual" and "unusual" values. A common convention is to consider values more than two standard deviations away from the mean as unusual and values more than three standard deviations away as very unusual. However, this is just a rule of thumb, and the decision ultimately depends on the context and specific distribution of the data.

To demonstrate this, let's consider the case of a 76-inch tall man. Using the same formula and the given mean and standard deviation for men, we calculate a z-score of 2.2. Since this value is greater than 2 in absolute value, we would consider the man's height as unusual according to the convention.

The empirical rule provides a guideline when dealing with approximately bell-shaped distributions. Around 68% of values fall within one standard deviation of the mean (z-scores between -1 and 1), approximately 95% fall within two standard deviations (z-scores between -2 and 2), and about 99.7% fall within three standard deviations (z-scores between -3 and 3).

In conclusion, z-scores offer a useful way to assess the relative position of values within a dataset. They are particularly valuable for comparing values across different datasets and determining the rarity or unusualness of a specific value. However, it's essential to consider the distribution's shape, outliers, and the context of the data when interpreting z-scores.

Let's conclude with a brief example. Suppose we have a dataset of adult women's heights in the United States, which approximately follows a bell-shaped distribution. The mean height is 64.2 inches, with a standard deviation of 2.7 inches.

Using the empirical rule, we can estimate the height ranges within which a certain percentage of women fall. Within one standard deviation of the mean, approximately 68% of women's heights will be found. By subtracting 2.7 from 64.2, we obtain 61.5 inches, and by adding 2.7, we get 66.9 inches. Thus, we can estimate that about 68% of women's heights will fall between 61.5 and 66.9 inches.

Expanding to two standard deviations, we find that approximately 95% of women's heights lie within this range. Subtracting 2.7 twice from the mean, we get 58.8 inches, and adding 2.7 twice gives us 69.6 inches. Therefore, about 95% of women's heights can be expected to fall between 58.8 and 69.6 inches.

Finally, within three standard deviations, which covers approximately 99.7% of the data, we subtract 2.7 three times from the mean to get 56.1 inches, and we add 2.7 three times to obtain 72.3 inches. Hence, we can estimate that about 99.7% of women's heights will fall between 56.1 and 72.3 inches.

Understanding z-scores and their interpretation allows us to assess the relative position and rarity of values within a dataset, providing valuable insights in various fields such as statistics, research, and data analysis.

Remember, z-scores provide a standardized measure of relative position, considering the mean and standard deviation of the dataset. They are a powerful tool for understanding the distribution and comparing values across different datasets.

Z-Scores
  • 2020.07.19
  • www.youtube.com
Let's understand z-scores! This is a simple way of describing position within a data set, most appropriate to symmetric (particularly bell-shaped) distributi...
 

The Five-Number Summary and the 1.5 x IQR Test for Outliers

Hello everyone! Today, we will delve into the concepts of the five-number summary and the 1.5 times IQR test for outliers. Let's start by defining the quartiles of a dataset. Quartiles are values that divide a dataset into four equal parts. The first quartile (Q1) lies above approximately 25% of the data, the second quartile (Q2) lies above about half of the data (also known as the median), and the third quartile (Q3) lies above approximately 75% of the data.

It's important to note that the division into four equal parts may not be exact if the dataset doesn't evenly divide. The first and third quartiles can be found by first determining the median. To find Q1 and Q3, we divide the dataset into an upper half and a lower half and calculate the medians of those two halves. The median of the upper half is Q3, while the median of the lower half is Q1.

Let's work through an example to illustrate this. Consider a dataset with 17 values, listed from lowest to highest. Since 17 is an odd number of values, the median (Q2) is the ninth value, which in this case is 42. To find Q1, we take the eight values smaller than the median; because this is an even number of values, Q1 is the average of the two middle values of that lower half, which works out to 18. Similarly, Q3 is the average of the two middle values of the eight values larger than the median, which works out to 52.

Thus, for this example, the quartiles are Q1 = 18, Q2 = 42, and Q3 = 52. The five-number summary of a dataset consists of these quartiles along with the minimum and maximum values in the dataset. In our case, the five-number summary is 5, 18, 42, 52, and 93, where 5 represents the minimum value and 93 represents the maximum.

Another useful measure is the interquartile range (IQR), which quantifies the spread of the middle half of the data. It is calculated as the difference between Q3 and Q1. In our example, the IQR is 52 - 18 = 34. The IQR focuses on the range of values within the middle 50% of the dataset and is less affected by extreme values.

Now, let's consider another example. Suppose we have the exam scores of 22 students listed below. We want to describe the distribution of scores using the five-number summary and IQR. First, we should be cautious of using the mean as a measure of center, as it might be influenced by extreme values. In this case, the mean is 75.3, but since a few students scored exceptionally low, the mean might not represent the typical student performance accurately. Similarly, the range, which is the difference between the minimum and maximum values (2 and 100, respectively), can be misleading due to the extreme values.

To obtain a more accurate description, we calculate the five-number summary. Sorting the scores, we find the minimum value is 2 and the maximum is 100. With 22 values, the median (Q2) is the average of the two middle scores, which is 80. Q1, the median of the lower half of the scores, is 79, and Q3, the median of the upper half, is 83.

Therefore, the five-number summary for this dataset is 2, 79, 80, 83, and 100. From this summary, we observe that the middle half of the scores lies between 79 and 83, indicating that the scores are tightly packed around the median.

To identify outliers in the dataset, we can employ the 1.5 times IQR test. The IQR, as calculated earlier, is 83 - 79 = 4. Multiplying the IQR by 1.5 gives us 6. We subtract 6 from Q1 and add 6 to Q3 to establish the range within which values are not considered outliers. In this case, any value below 73 or above 89 should be treated as an outlier according to this rule.

Applying this test to the dataset, we find that 2 and 100 should be considered outliers. As a professor, it is advisable to disregard these extreme scores or give them less weight when determining the exam curve.
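Here's a small R sketch of the 1.5 times IQR test using the quartiles quoted above; the score vector lists only the five-number-summary values as stand-ins, since the full set of 22 scores isn't reproduced in this transcript.

    q1 <- 79; q3 <- 83
    iqr <- q3 - q1                                  # 4
    fences <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)     # 73 and 89
    scores <- c(2, 79, 80, 83, 100)                 # stand-in values from the five-number summary
    scores[scores < fences[1] | scores > fences[2]] # flags 2 and 100 as outliers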

By utilizing the five-number summary, IQR, and the 1.5 times IQR test, we gain a better understanding of the distribution of scores and can identify potential outliers that might affect the overall analysis.

The Five-Number Summary and the 1.5 x IQR Test for Outliers
  • 2020.07.15
  • www.youtube.com
The Five-Number Summary and the 1.5 x IQR Test for Outliers. If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more statist...
 

Boxplots

Today, we will be discussing box plots, also known as box and whisker plots. A box plot is a graphical representation of a single-variable dataset based on the five-number summary. Let's dive right into an example to better understand them.

Suppose we have a dataset for which we want to construct a five-number summary and a box plot. The dataset is as follows: 34, 42, 48, 51.5, and 58. First, we arrange the numbers in ascending order to find the minimum (34) and maximum (58) values. As there is an odd number of values, the median is the value in the middle, which in this case is 48.

Next, we divide the dataset into two halves: the lower half and the upper half. The median of the lower half is 42, and the median of the upper half is 51.5. These values are known as the first quartile (Q1) and the third quartile (Q3), respectively.

Using the five-number summary, we can construct the box plot. The box plot consists of a box that represents the range between Q1 and Q3. The bottom of the box corresponds to Q1, the top of the box corresponds to Q3, and the horizontal line inside the box represents the median. The "arms" of the box plot extend from the box to the minimum and maximum values (34 and 58, respectively).

The purpose of the box plot is to visualize the distribution of the data. The box represents the middle 50% of the dataset, while the arms encompass the remaining values. In the given example, since there are no extreme values, there are no outliers displayed on the box plot.

Let's consider another example where we want to determine the five-number summary, test for outliers using the 1.5 times IQR test, and construct a box plot. For this dataset, the five-number summary works out to 62, 64, 75, 81.5, and 110.

Calculating the interquartile range (IQR) by subtracting Q1 from Q3, we find it to be 17.5. To perform the 1.5 times IQR test, we multiply the IQR by 1.5. Subtracting 1.5 times the IQR from Q1 (64 - 1.5 * 17.5), we obtain 37.75. Adding 1.5 times the IQR to Q3 (81.5 + 1.5 * 17.5), we get 107.75. Any value below 37.75 or above 107.75 should be considered an outlier.

In this case, the value 110 exceeds the upper limit and is classified as an outlier. Constructing the box plot, we draw the arms of the box plot only up to the most extreme values that are not outliers. The outlier value of 110 is indicated by a separate point, and the upper arm extends only up to 90, which represents the highest value within the non-outlier range.

Box plots are particularly useful when comparing data between groups, such as plotting one categorical and one quantitative variable. This type of plot, often referred to as a side-by-side box plot, provides a clear visual comparison of different groups. As an example, we can consider the famous iris dataset, where we compare the petal widths of three species: setosa, versicolor, and virginica. By examining the box plot, we can observe that the setosa species generally has narrower petals compared to the other two species. Additionally, we can discern the differences in spread among the petal widths within each group.

In summary, box plots provide a concise visualization of the five number summary and allow for easy comparison between different groups. They display the minimum, first quartile (Q1), median, third quartile (Q3), and maximum values of a dataset. The box represents the middle 50% of the data, with the bottom of the box at Q1 and the top of the box at Q3. The line inside the box represents the median.

Box plots also have the capability to display outliers, which are values that fall outside the range determined by the 1.5 times IQR test. To determine outliers, we calculate the IQR (Q3 - Q1) and multiply it by 1.5. We then subtract 1.5 times the IQR from Q1 and add 1.5 times the IQR to Q3. Any values below the lower limit or above the upper limit are considered outliers.

When constructing a box plot with outliers, the arms of the plot extend only up to the most extreme values that are not outliers. Outliers are depicted as individual points outside the arms of the box plot. This ensures that the box plot accurately represents the distribution of the non-outlier data and avoids misleading interpretations.

Box plots are particularly useful when comparing data between different groups or categories. By plotting multiple box plots side by side, it becomes easier to compare the distributions and understand the differences in the variables being analyzed.

For instance, using the iris dataset, we can create a side-by-side box plot to compare the petal widths of the setosa, versicolor, and virginica species. This allows us to visually observe the differences in petal width between the species and the spread of values within each group.

In summary, box plots provide a visual summary of the five-number summary, making it easier to understand the distribution of data and compare different groups. They provide insights into the central tendency, spread, and presence of outliers in a dataset, making them a valuable tool for data analysis and visualization.

Boxplots
  • 2020.07.16
  • www.youtube.com
What is a boxplot? How can you construct one? Why would you want to? If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more...
 

Boxplots in R

Hello everyone! Today, we're going to learn how to create beautiful box plots in R using the qplot command. There are multiple ways of creating box plots in R, but the most visually appealing ones often come from the ggplot2 package, which is part of the tidyverse family of packages. So, let's dive into it!

If you haven't used these functions before, you'll need to install the tidyverse package on your machine using the install.packages command. This step is quick if you haven't done it already. Once installed, you need to load the package into memory using the library(tidyverse) command at the beginning of each session to access its functions.

In this tutorial, we'll focus on using the qplot command from the ggplot2 package. Now, let's start with two examples of creating box plots.

First, let's manually input some data. We'll create a vector called "scores" with a length of 21, which could represent scores of students on a math exam in a class of size 21.

To create a box plot of the scores, we use the qplot command. The basic syntax remains the same: specify the variables for the x and y axes, and use the geom argument to indicate that we want a box plot. In this case, we'll plot the scores on the x-axis.

To make our box plot more visually appealing, we can make some improvements. Firstly, we can remove the meaningless numbers on the y-axis using y = "". Next, if we want a vertical box plot, we can switch the axes by using y for the scores and removing the x-axis label. We can also add color to the lines and the interior of the box using the color and fill arguments, respectively. Finally, we can customize the labels and add a title to the graph using ylab and main arguments.
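Putting those pieces together, a sketch of the command might look like this; the 21 scores below are stand-ins, and the colors, labels, and title are just examples.

    library(tidyverse)

    scores <- c(58, 61, 64, 66, 68, 70, 72, 73, 75, 76, 78,
                80, 81, 83, 85, 86, 88, 90, 92, 95, 98)   # hypothetical exam scores

    qplot(x = "", y = scores, geom = "boxplot",
          color = I("darkblue"), fill = I("lightblue"),
          xlab = "", ylab = "Exam score",
          main = "Scores on a math exam")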

Now, let's move on to the second example using a built-in dataset called chickwts. This dataset contains 71 observations with two variables: the weights of different chicks and the feeds they were given. We'll create a side-by-side box plot to compare the distributions of chick weights across different feed types.

Similar to the previous example, we use the qplot command and specify the dataset using data = chickwts. We then indicate that we want a vertical box plot with the weights on the y-axis and the feeds on the x-axis. To differentiate the box plots by feed type, we can use the fill argument and map it to the feed variable.
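A sketch of the side-by-side version, using the built-in chickwts data (the axis labels are just examples):

    qplot(x = feed, y = weight, data = chickwts, geom = "boxplot",
          fill = feed, xlab = "Feed type", ylab = "Weight (grams)")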

Once again, there are many other options available for customization, including font styles, label sizes, and point sizes. You can explore further by searching online.

With just a few modifications, we can create professional-looking box plots in R. These examples demonstrate the power and flexibility of the ggplot2 package for data visualization.

Boxplots in R
  • 2020.07.17
  • www.youtube.com
In this vid, we use the qplot() command in the {ggplot2} package to produce gorgeous boxplots in R. Note: since I recorded this vid, the qplot() command has ...
 

Probability Experiments, Outcomes, Events, and Sample Spaces

Hello everyone! Today, we will be delving into the fundamentals of probability. We'll explore topics such as sample spaces, outcomes, events, and more. A probability experiment, also known as a random experiment, is a trial where the outcome cannot be predicted with certainty. However, repeated trials may reveal certain trends. Let's take a look at a few examples.

  1. Flip a coin and record whether it lands on heads or tails.
  2. Use a random dialer to contact 10 voters and ask whom they intend to vote for.
  3. Roll two dice and record the sum of the numbers.
  4. Roll two dice and count the number of times a six appears.

Notice that in the last two examples, although the action is the same (rolling two dice), the data recorded is slightly different. Hence, we consider them as separate probability experiments. Now, let's discuss some vocabulary.

The result of a specific trial in a probability experiment is called an outcome. The collection of all possible outcomes in a probability experiment is referred to as the sample space (denoted by capital S). A subset of the sample space is called an event.

To illustrate this, let's consider an example. Suppose we flip two coins and record the results. The sample space consists of four outcomes: heads-heads, heads-tails, tails-heads, and tails-tails. If we define the event E as "both flips are the same," then we have two outcomes within that event: heads-heads and tails-tails. This event is a subset of the sample space.

Generally, an event represents something that can occur during a probability experiment, but there may be multiple ways for it to happen. In the previous example, the event "both flips are the same" can occur in two different ways.

If an event can only happen in one way, meaning it consists of a single outcome, we call it a simple event. The complement of an event E, denoted as E' or sometimes with a bar over E, is the set of all outcomes in the sample space that are not in E. When E occurs, E' does not occur, and vice versa.

For instance, suppose we randomly select an integer from 1 to 9 using a spinner. Let E be the event "the result is a prime number." The sample space is the integers from 1 to 9, and E is the set of prime numbers less than 10: {2, 3, 5, 7}. The complement of E (E') is the event that E does not occur, which consists of the numbers less than 10 that are not prime: {1, 4, 6, 8, 9}.

Two events are disjoint if they have no outcomes in common, meaning they cannot both occur simultaneously in one trial of the probability experiment. For example, consider flipping four coins and recording the results. Let E be the event "the first two flips are heads," and let F be the event "there are at least three tails." These two events can be represented as follows:

E: {HHHH, HHHT, HHTH, HHTT}     F: {TTTT, TTTH, TTHT, THTT, HTTT}

Notice that there are no outcomes shared between the sets E and F. Thus, these events are disjoint.

There are different ways to describe the probability of an event, and two common approaches are empirical probability (or statistical probability) and classical probability (or theoretical probability).

Empirical probability is based on observation. We run a probability experiment multiple times, count how many times the event occurs, and divide it by the total number of trials. It corresponds to the proportion of times the event has occurred in the past. For example, if we flip a coin 100 times and it comes up heads 53 times, the empirical probability of the coin coming up heads is 53/100 or 53%.

Classical probability, on the other hand, applies when all the outcomes in a sample space are equally likely. We count the number of outcomes in the event and divide it by the total number of outcomes in the sample space. Mathematically, it is expressed as the cardinality (number of elements) of event E divided by the cardinality of the sample space S. For instance, if we roll a fair die, there are six equally likely outcomes, and if we're interested in the simple event E of getting a five, the classical probability is 1/6.

Let's consider another example. If we flip a fair coin three times, there are eight equally likely outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Let E be the event of getting exactly two heads. Within the sample space, there are three outcomes (HHT, HTH, and THH) in event E. Therefore, the classical probability of event E is 3/8.
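If you want to verify the 3/8 with technology, a short R sketch can enumerate all eight outcomes and count the ones with exactly two heads:

    flips <- expand.grid(first = c("H", "T"), second = c("H", "T"), third = c("H", "T"))
    heads <- rowSums(flips == "H")   # number of heads in each of the 8 outcomes
    mean(heads == 2)                 # classical probability of exactly two heads: 0.375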

Now, let's explore a probability question using the frequency distribution of an introductory statistics class at a large university. The distribution shows the number of students in each class level: 67 freshmen, 72 sophomores, and so on. If we randomly select a person from this class, what's the probability that they are a sophomore? This is a classical probability question.

In the given frequency distribution, there are 222 total outcomes (students in the class), and out of those, 72 outcomes correspond to sophomores. Thus, the probability of randomly selecting a sophomore is 72/222, approximately 32.4%.

Now, let's shift our focus to a slightly different question using the same frequency distribution. What's the probability that the next person who registers for the course will be either a junior or a senior? This time, we're interested in empirical probability since we don't have certainty about the future registration.

We look at the data we have about students who have already registered. Among them, there are 29 juniors and 54 seniors. To calculate the empirical probability, we divide the number of students who fit the event (junior or senior) by the total number of registered students. Therefore, the probability is (29 + 54) / 222, approximately 37.4%.
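Both of these questions are just counts divided by a total; here is a quick R sketch using the class counts quoted above.

    counts <- c(freshman = 67, sophomore = 72, junior = 29, senior = 54)
    total  <- sum(counts)                           # 222 students
    counts["sophomore"] / total                     # about 0.324
    (counts["junior"] + counts["senior"]) / total   # about 0.374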

It's important to note that whether we're dealing with empirical or classical probability, certain facts hold true. The probability of any event lies between 0 and 1. An event with a probability of 0 is impossible, while an event with a probability of 1 is certain. If the sample space is denoted as S, the probability of S occurring is always 1.

If we have disjoint events E and F (with no outcomes in common), the probability of at least one of them occurring is the sum of their individual probabilities. However, the probability of both E and F occurring simultaneously is 0, as they are mutually exclusive.

Additionally, an event and its complement together cover all possible outcomes, so their probabilities always sum to 1. In other words, the probability that the complement E' occurs is 1 minus the probability that E occurs.

In everyday language, we often use probability informally based on intuition and personal experience. This is known as subjective probability. However, in statistics, we rely on empirical and classical probability for rigorous calculations. Subjective probability lacks mathematical precision and is not the focus of statistical analysis.

Probability Experiments, Outcomes, Events, and Sample Spaces
  • 2020.07.25
  • www.youtube.com
We'll also learn about empirical vs. classical probability, as well as disjoint events. All the good stuff.If this vid helps you, please help me a tiny bit b...
 

The Addition Rule for Probabilities

Hello everyone, today we'll be discussing the addition rule for probabilities. This rule allows us to calculate the probabilities of unions of events. Let's start with a simplified version of the rule.

Suppose we have two events, A and B, that are disjoint, meaning they have no outcomes in common. In this case, the probability of either event happening is simply the sum of their individual probabilities. This can be written as:

P(A ∪ B) = P(A) + P(B)

Here, A ∪ B represents the set of all outcomes that are in A or in B, essentially meaning "A or B". It's important to remember that disjoint events cannot both occur as they have no outcomes in common. Sometimes these events are referred to as mutually exclusive.

To illustrate this version of the addition rule, let's consider an example. Suppose we roll a fair die twice, and we define event A as the first roll being a six, and event B as the sum of the rolls being three. These events are mutually exclusive because if the first roll is a six, the sum cannot be three. Now, to compute the probability of A or B (the first roll being a six or the sum being three), we need the individual probabilities of these events.

The probability of the first roll being a six is 1/6 since there are six possible outcomes and only one of them is a six. The probability of the sum of the rolls being three is 2/36, considering there are 36 total possible outcomes for two dice rolls, and two outcomes result in a sum of three (1+2 and 2+1). Adding these probabilities, we get a total probability of 2/9.

Let's move on to another example, taken from the textbook "Elementary Statistics" by Larson and Farber. In a survey of homeowners, they were asked about the time that passes between house cleanings. The results are summarized in a pie chart, showing different time intervals. We want to find the probability that a randomly selected homeowner lets more than two weeks pass between cleanings.

In this case, we're interested in the probability of selecting a homeowner from either the blue or yellow segment of the pie chart. Since these segments are mutually exclusive (you can't clean your house both every three weeks and four weeks or more), we can add the probabilities of these events. The probability of cleaning the house every three weeks is 10% and the probability of cleaning it four weeks or more is 22%. Adding these probabilities gives us a total probability of 32%.

Now, let's consider a more general case where two events, A and B, are not disjoint. In this scenario, the addition rule becomes slightly more complex. The probability of A or B is given by:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Here, A ∩ B represents the outcomes that are in both A and B. It's important to subtract the probability of A ∩ B because when A and B overlap, the outcomes in A ∩ B are counted twice (once in A and once in B).

To illustrate this version of the addition rule, let's use an example from a survey about smoking habits and seat belt use. The survey asked 242 respondents about their habits, and a table summarizes the results. We want to find the probability that a randomly selected respondent doesn't smoke or wear a seat belt.

Let A be the event of not smoking and B be the event of not wearing a seat belt. We're interested in the probability of A or B (A ∪ B). To calculate this, we need the individual probabilities of A, B, and A ∩ B. The probability of not smoking is 169 out of 242, as there are 169 individuals who don't smoke in the sample of 242 people. The probability of not wearing a seat belt is 114 out of 242. Now, we also need the probability of A ∩ B, which represents the individuals who both don't smoke and don't wear a seat belt. From the table, we see that there are 81 such individuals.

Using the addition rule for events that are not disjoint, we can calculate the probability of A or B as follows:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Substituting the values, we get:

P(A ∪ B) = 169/242 + 114/242 - 81/242

Simplifying the expression, we find that:

P(A ∪ B) = 202/242

Now, let's compute the probability of A or B directly by adding the individual probabilities. In this case, we can use the addition rule for disjoint events since the events in each cell of the table are mutually exclusive. Adding the probabilities of the five cells representing A or B, we obtain:

P(A ∪ B) = 88/242 + 81/242 + 9/242 + ... (remaining probabilities)

After performing the addition, we again arrive at the probability of 202/242.

Therefore, both methods yield the same probability of A or B, which is 202/242.
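As a quick sanity check, the same arithmetic can be reproduced in R using only the counts quoted above:

    p_A  <- 169 / 242   # P(does not smoke)
    p_B  <- 114 / 242   # P(does not wear a seat belt)
    p_AB <-  81 / 242   # P(does neither)
    p_A + p_B - p_AB    # P(A or B) = 202/242, about 0.835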

The Addition Rule for Probabilities
  • 2021.02.17
  • www.youtube.com
How can we compute P(A or B)? With the addition rule, of course! If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more sta...