Chapter 9 Introduction to Statistical Inference

9.1 Student Learning Objectives

The next section of this chapter introduces the basic issues and tools of statistical inference. These tools are the subject matter of the second part of this book. In Chapters 9–15 we use data on the specifications of cars in order to demonstrate the application of the tools for making statistical inference. In the third section of this chapter we present the data frame that contains this data. The fourth section reviews probability topics that were discussed in the first part of the book and are relevant for the second part. By the end of this chapter, the student should be able to:

Define key terms that are associated with inferential statistics.
Recognize the variables of the “cars.csv” data frame.
Revise concepts related to random variables, the sampling distribution and the Central Limit Theorem.

9.2 Key Terms

The first part of the book deals with descriptive statistics and with probability. In descriptive statistics one investigates the characteristics of the data by using graphical tools and numerical summaries. The frame of reference is the observed data. In probability, on the other hand, one extends the frame of reference to include all data sets that could have potentially emerged, with the observed data as one among many.

The second part of the book deals with inferential statistics. The aim of statistical inference is to gain insight regarding the population parameters from the observed data. The method for obtaining such insight involves the application of formal computations to the data. The interpretation of the outcome of these formal computations is carried out in the probabilistic context, in which one considers the application of these formal computations to all potential data sets. The justification for using the specific form of computation on the observed data stems from the examination of the probabilistic properties of the formal computations.

Typically, the formal computations will involve statistics, which are functions of the data. The assessment of the probabilistic properties of the computations will result from the sampling distribution of these statistics.

An example of a problem that requires statistical inference is the estimation of a parameter of the population using the observed data. Point estimation attempts to obtain the best guess of the value of that parameter. An estimator is a statistic that produces such a guess. One may prefer an estimator whose sampling distribution is more concentrated about the population parameter value over another estimator whose sampling distribution is less so. Hence, the justification for selecting a specific statistic as an estimator is a consequence of the probabilistic characteristics of this statistic in the context of the sampling distribution.

For example, a car manufacturer may be interested in the fuel consumption of a new type of car. To investigate it, the manufacturer may apply a standard test cycle to a sample of 10 new cars of the given type and measure their fuel consumption. The parameter of interest is the average fuel consumption among all cars of the given type. The average consumption of the 10 cars is a point estimate of the parameter of interest.

An alternative approach for the estimation of a parameter is to construct an interval that is most likely to contain the population parameter. Such an interval, which is computed on the basis of the data, is called a confidence interval. The sampling probability that the confidence interval will indeed contain the parameter value is called the confidence level. Confidence intervals are constructed so as to have a prescribed confidence level.

A different problem in statistical inference is hypothesis testing. The scientific paradigm involves the proposal of new theories and hypotheses that presumably provide a better description of the laws of Nature. On the basis of these hypotheses one may propose predictions that can be examined empirically. If the empirical evidence is consistent with the predictions of the new hypothesis but not with those of the old theory then the old theory is rejected in favor of the new one. Otherwise, the established theory maintains its status. Statistical hypothesis testing is a formal method that uses this paradigm for determining which of the two hypotheses should prevail.

Each of the two hypotheses, the old and the new, predicts a different distribution for the empirical measurements. In order to decide which of the distributions is more in tune with the data a statistic is computed. This statistic is called the test statistic. A threshold is set and, depending on where the test statistic falls with respect to this threshold, the decision is made whether or not to reject the old theory in favor of the new one.

This decision rule is not error proof, since the test statistic may fall by chance on the wrong side of the threshold. Nonetheless, by the examination of the sampling distribution of the test statistic one is able to assess the probability of making an error. In particular, the probability of erroneously rejecting the currently accepted theory (the old one) is called the significance level of the test. Indeed, the threshold is selected in order to assure a small enough significance level.

Let us return to the car manufacturer. Assume that the car in question is manufactured in two different factories. One may want to examine the hypothesis that the car’s fuel consumption is the same for both factories. If 5 of the tested cars were manufactured in one factory and the other 5 in the other factory then the test may be based on the absolute value of the difference between the average consumption of the first 5 and the average consumption of the other 5.
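As a minimal sketch of the two computations in the car manufacturer example, the following R code uses simulated measurements, since the text gives no real test data. The values, the assumed mean of 30 and standard deviation of 2, and the names `consumption`, `factory`, `point.estimate` and `test.statistic` are hypothetical and serve only as an illustration.

```r
# Hypothetical data: simulated fuel consumption for 10 cars (not real measurements).
set.seed(1)
consumption <- rnorm(10, mean = 30, sd = 2)
factory <- rep(c("A", "B"), each = 5)  # first 5 cars from factory A, last 5 from B

# Point estimate of the average fuel consumption among all cars of the type.
point.estimate <- mean(consumption)

# Test statistic: absolute difference between the two factory averages.
group.means <- tapply(consumption, factory, mean)
test.statistic <- unname(abs(group.means["A"] - group.means["B"]))

point.estimate
test.statistic
```

Whether the computed test statistic is large enough to reject the hypothesis of equal consumption depends on the threshold, which is set with reference to the sampling distribution of the statistic.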

The method of testing hypotheses is also applied in other practical settings where it is required to make decisions. For example, before a new treatment for a medical condition is approved for marketing by the appropriate authorities it must undergo a process of objective testing through clinical trials. In these trials the new treatment is administered to some patients while others obtain the (currently) standard treatment. Statistical tests are applied in order to compare the two groups of patients. The new treatment is released to the market only if it is shown to be beneficial with statistical significance and it is shown to have no unacceptable side effects.

In subsequent chapters we will discuss in more detail the computation of point estimates, the construction of confidence intervals, and the application of hypothesis testing. The discussion will be initiated in the context of a single measurement but will later be extended to settings that involve comparison of measurements.

An example of such analysis is the analysis of clinical trials where the response of the patients treated with the new procedure is compared to the response of patients that were treated with the conventional treatment. This comparison involves the same measurement taken for two sub-samples. The tools of statistical inference – hypothesis testing, point estimation and the construction of confidence intervals – may be used in order to carry out this comparison.

Other comparisons may involve two measurements taken for the entire sample. An important tool for the investigation of the relations between two measurements, or variables, is regression. Models of regression describe the change in the distribution of one variable as a function of the other variable. Again, point estimation, confidence intervals, and hypothesis testing can be carried out in order to examine regression models. The variable whose distribution is the target of investigation is called the response. The other variable that may affect that distribution is called the explanatory variable.

9.3 The Cars Data Set

Statistical inference is applied to data in order to address specific research questions. We will demonstrate different inferential procedures using a specific data set with the aim of making the discussion of the different procedures more concrete. The same data set will be used for all procedures that are presented in Chapters 10–15²². This data set contains information on various models of cars and is stored in the CSV file “cars.csv”²³. The file can be found on the internet at http://pluto.huji.ac.il/~msby/StatThink/Datasets/cars.csv. You are advised to download this file to your computer and store it in the working directory of R.

Let us read the content of the CSV file into an R data frame and produce a brief summary:

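# In R 4.0.0 and later strings are not converted to factors by default,
# so the conversion is requested explicitly to reproduce the summary below.
cars <- read.csv("_data/cars.csv", stringsAsFactors = TRUE)
summary(cars)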

##          make      fuel.type   num.of.doors       body.style drive.wheels
##  toyota    : 32   diesel: 20   four:114     convertible: 6   4wd:  9     
##  nissan    : 18   gas   :185   two : 89     hardtop    : 8   fwd:120     
##  mazda     : 17                NA's:  2     hatchback  :70   rwd: 76     
##  honda     : 13                             sedan      :96               
##  mitsubishi: 13                             wagon      :25               
##  subaru    : 12                                                          
##  (Other)   :100                                                          
##  engine.location   wheel.base         length          width      
##  front:202       Min.   : 86.60   Min.   :141.1   Min.   :60.30  
##  rear :  3       1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10  
##                  Median : 97.00   Median :173.2   Median :65.50  
##                  Mean   : 98.76   Mean   :174.0   Mean   :65.91  
##                  3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90  
##                  Max.   :120.90   Max.   :208.1   Max.   :72.30  
##                                                                  
##      height       curb.weight    engine.size      horsepower   
##  Min.   :47.80   Min.   :1488   Min.   : 61.0   Min.   : 48.0  
##  1st Qu.:52.00   1st Qu.:2145   1st Qu.: 97.0   1st Qu.: 70.0  
##  Median :54.10   Median :2414   Median :120.0   Median : 95.0  
##  Mean   :53.72   Mean   :2556   Mean   :126.9   Mean   :104.3  
##  3rd Qu.:55.50   3rd Qu.:2935   3rd Qu.:141.0   3rd Qu.:116.0  
##  Max.   :59.80   Max.   :4066   Max.   :326.0   Max.   :288.0  
##                                                 NA's   :2      
##     peak.rpm       city.mpg      highway.mpg        price      
##  Min.   :4150   Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7775  
##  Median :5200   Median :24.00   Median :30.00   Median :10295  
##  Mean   :5125   Mean   :25.22   Mean   :30.75   Mean   :13207  
##  3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16500  
##  Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :45400  
##  NA's   :2                                      NA's   :4

Observe that the first 6 variables are factors, i.e. they contain qualitative data that is associated with categorization or the description of an attribute. The last 11 variables are numeric and contain quantitative data.

Factors are summarized in R by listing the attributes and the frequency of each attribute value. If the number of attributes is large then only the most frequent attributes are listed. Numerical variables are summarized in R with the aid of the smallest and largest values, the three quartiles (Q1, the median, and Q3) and the average (mean).

The third factor variable, “num.of.doors”, as well as several of the numerical variables have a special category titled “NA’s”. This category describes the number of missing values among the observations. For a given variable, the observations for which a value is not recorded are marked as missing. R uses the symbol “NA” to identify a missing value²⁴.

Missing observations are a concern in the analysis of statistical data. If the relative frequency of missing values is substantial and the reason for not obtaining the data for specific observations is related to the phenomenon under investigation then naïve statistical inference may produce biased conclusions. In the “cars” data frame missing values are less of a concern since their relative frequency is low.

One should be on the lookout for missing values when applying R to data since the different functions may have different ways for dealing with missing values. One should make sure that the appropriate way is applied for the specific analysis.
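The following short sketch illustrates how some common R functions treat missing values. The vector `x` holds hypothetical toy values and is not taken from “cars.csv”.

```r
# Toy vector with one missing value (hypothetical values for illustration).
x <- c(10, 20, NA, 40)

mean(x)                 # the missing value propagates: the result is NA
mean(x, na.rm = TRUE)   # missing values are removed before averaging
sum(is.na(x))           # number of missing values

# table() drops missing values unless it is asked to keep them.
table(c("two", "four", NA))
table(c("two", "four", NA), useNA = "ifany")
```

Functions such as “mean” and “sd” return “NA” unless the argument “na.rm = TRUE” is given, whereas “summary” reports the number of missing values separately, as seen in the output above.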

Consider the variables of the data frame “cars”:

make:: The name of the car producer (a factor).
fuel.type:: The type of fuel used by the car, either diesel or gas (a factor).
num.of.doors:: The number of passenger doors, either two or four (a factor).
body.style:: The type of the car (a factor).
drive.wheels:: The wheels powered by the engine (a factor).
engine.location:: The location in the car of the engine (a factor).
wheel.base:: The distance between the centers of the front and rear wheels in inches (numeric).
length:: The length of the body of the car in inches (numeric).
width:: The width of the body of the car in inches (numeric).
height:: The height of the car in inches (numeric).
curb.weight:: The total weight in pounds of a vehicle with standard equipment and a full tank of fuel, but with no passengers or cargo (numeric).
engine.size:: The volume swept by all the pistons inside the cylinders in cubic inches (numeric).
horsepower:: The power of the engine in horsepower (numeric).
peak.rpm:: The top speed of the engine in revolutions per minute (numeric).
city.mpg:: The fuel consumption of the car in city driving conditions, measured as miles per gallon of fuel (numeric).
highway.mpg:: The fuel consumption of the car in highway driving conditions, measured as miles per gallon of fuel (numeric).
price:: The retail price of the car in US Dollars (numeric).

9.4 The Sampling Distribution

9.4.1 Statistics

Statistical inferences, be they point estimation, confidence intervals, or hypothesis testing, are based on statistics computed from the data. Examples of statistics are the sample average and the sample standard deviation. These are important examples, but clearly not the only ones. Given numerical data, one may compute the smallest value, the largest value, the quartiles, and the median. All are examples of statistics. Statistics may also be associated with factors. The frequency of a given attribute among the observations is a statistic. (An example of such a statistic is the frequency of diesel cars in the data frame.) As part of the discussion in the subsequent chapters we will consider these and other types of statistics.

Any statistic, when computed in the context of the data frame being analyzed, obtains a single numerical value. However, once a sampling distribution is being considered then one may view the same statistic as a random variable. A statistic is a function or a formula which is applied to the data frame. Consequently, when a random collection of data frames is the frame of reference then the application of the formula to each of the data frames produces a random collection of values, which is the sampling distribution of the statistic.

We distinguish in the text between the case where the statistic is computed in the context of the given data frame and the case where the computation is conducted in the context of the random sample. This distinction is emphasized by the use of small letters for the former and capital letters for the latter. Consider, for example, the sample average. In the context of the observed data we denote the data values for a specific variable by $x_1, x_2, \ldots, x_n$. The sample average computed for these values is denoted by

\[\bar x = \frac{x_1 + x_2 + \cdots + x_n}{n}\;.\]

On the other hand, if the discussion of the sample average is conducted in the context of a random sample then the sample is a sequence $X_1, X_2, \ldots, X_n$ of random variables. The sample average is denoted in this context as

\[\bar X = \frac{X_1 + X_2 + \cdots + X_n}{n}\;.\]

The same formula that was applied to the data values is applied now to the random components of the random sample. In the first context $\bar x$ is an observed non-random quantity. In the second context $\bar X$ is a random variable, an abstract mathematical concept.

A second example is the sample variance. When we compute the sample variance for the observed data we use the formula:

\[s^2 = \frac{\mbox{Sum of the squares of the deviations}}{\mbox{Number of values in the sample}-1}= \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}\;.\]

However, when we discuss the sampling distribution of the sample variance we apply the same formula to the random sample:

\[S^2 = \frac{\mbox{Sum of the squares of the deviations}}{\mbox{Number of values in the sample}-1}= \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}\;.\]

Again, $S^2$ is a random variable whereas $s^2$ is a non-random quantity: the evaluation of the random variable at the specific sample that is being observed.
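As a small illustration of the formulas for $\bar x$ and $s^2$, the following sketch applies them to a toy vector and compares the results with the built-in functions “mean” and “var”. The data values and the names `x.bar` and `s2` are hypothetical.

```r
# Toy data values x_1, ..., x_n (hypothetical numbers for illustration only).
x <- c(2, 4, 4, 5, 7)
n <- length(x)

x.bar <- sum(x) / n                  # sample average
s2 <- sum((x - x.bar)^2) / (n - 1)   # sample variance with the n - 1 divisor

# The built-in functions apply the same formulas.
all.equal(x.bar, mean(x))
all.equal(s2, var(x))
```

Here `x.bar` and `s2` are observed, non-random quantities. Their random counterparts $\bar X$ and $S^2$ arise when the same computations are applied to a randomly generated sample.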

9.4.2 The Sampling Distribution

The sampling distribution may emerge from the random selection of samples from a particular population. In such a case, the sampling distribution of the sample, and hence of the statistic, is linked to the distribution of values of the variable in the population.

Alternatively, one may assign a theoretical distribution to the measurement associated with the variable. In this other case the sampling distribution of the statistic is linked to the theoretical model.

Consider, for example, the variable “price” that describes the prices of the 205 car types (with 4 prices missing) in the data frame “cars”. In order to define a sampling distribution one may imagine a larger population of car types, perhaps all the car types that were sold during the 80’s in the United States, or some other frame of reference, with the car types that are included in the data frame considered as a random sample from that larger population. The observed sample corresponds to car types that were sold in 1985. Had one chosen to consider car types from a different year then one may expect to obtain other evaluations of the price variable. The reference population, in this case, is the distribution of the prices of the car types that were sold during the 80’s and the sampling distribution is associated with a random selection of a particular year within this period and the consideration of prices of car types sold in that year. The data for 1985 is what we have at hand. But in the sampling distribution we take into account the possibility that we could have obtained data for 1987, for example, rather than the data we did get.

An alternative approach for addressing the sampling distribution is to consider a theoretical model. Referring again to the variable “price” one may propose an Exponential model for the distribution of the prices of cars. This model implies that car types in the lower spectrum of the price range are more frequent than cars with a higher price tag. With this model in mind, one may propose the sampling distribution to be composed of 205 unrelated copies from the Exponential distribution (or 201 if we do not want to include the missing values). The rate $\lambda$ of the associated Exponential distribution is treated as an unknown parameter. One of the roles of statistical inference is to estimate the value of this parameter with the aid of the data at hand.

The sampling distribution is relevant also for factor variables. Consider the variable “fuel.type” as an example. In the given data frame the frequency of diesel cars is 20. However, had one considered another year during the 80’s one may have obtained a different frequency, resulting in a sampling distribution. This type of sampling distribution refers to all car types that were sold in the United States during the 80’s as the frame of reference.

Alternatively, one may propose a theoretical model for the sampling distribution. Imagine there is a probability $p$ that a car runs on diesel (and probability $1-p$ that it runs on gas). Hence, when one selects 205 car types at random then one obtains that the distribution of the frequency of car types that run on diesel has the $\mathrm{Binomial}(205,p)$ distribution. This is the sampling distribution of the frequency statistic. Again, the value of $p$ is unknown and one of our tasks is to estimate it from the data we observe.
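The Binomial model for the frequency of diesel cars can be sketched in R as follows. Plugging the observed relative frequency in place of the unknown $p$ is an assumption made only for the sake of illustration, and the names `p.hat` and `freq` are hypothetical.

```r
n <- 205        # number of car types in the data frame
diesel <- 20    # observed frequency of diesel cars (from the summary above)

# A natural guess for p is the observed relative frequency.
p.hat <- diesel / n

# Simulate the sampling distribution of the frequency under Binomial(205, p.hat).
# Using p.hat in place of the unknown p is an assumption made for illustration.
set.seed(1)
freq <- rbinom(10^4, size = n, prob = p.hat)

mean(freq)   # should be close to the expectation n * p.hat
sd(freq)     # should be close to sqrt(n * p.hat * (1 - p.hat))
```

The spread of the simulated frequencies gives a sense of how much the observed count of 20 could vary from one sample to another under this model.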

In the context of statistical inference the use of theoretical models for the sampling distribution is the standard approach. There are situations, such as the application of surveys to a specific target population, where the consideration of the entire population as the frame of reference is more natural. But in most other applications the consideration of theoretical models is the method of choice. In this part of the book, where we consider statistical inference, we will always use the theoretical approach for modeling the sampling distribution.

9.4.3 Theoretical Distributions of Observations

In the first part of the book we introduced several theoretical models that may describe the distribution of an observation. Let us take the opportunity and review the list of models:

Binomial:: The Binomial distribution is used in settings that involve counting the number of occurrences of a particular outcome. The parameters that determine the distribution are $n$, the number of observations, and $p$, the probability of obtaining the particular outcome in each observation. The expression “$\mathrm{Binomial}(n,p)$” is used to mark the Binomial distribution. The sample space for this distribution is formed by the integer values $\{0, 1, 2, \ldots, n\}$. The expectation of the distribution is $np$ and the variance is $np(1-p)$. The functions “dbinom”, “pbinom”, and “qbinom” may be used in order to compute the probability, the cumulative probability, and the percentiles, respectively, for the Binomial distribution. The function “rbinom” can be used in order to simulate a random sample from this distribution.
Poisson:: The Poisson distribution is also used in settings that involve counting. This distribution approximates the Binomial distribution when the number of examinations $n$ is large but the probability $p$ of the particular outcome is small. The parameter that determines the distribution is the expectation $\lambda$. The expression “$\mathrm{Poisson}(\lambda)$” is used to mark the Poisson distribution. The sample space for this distribution is the entire collection of natural numbers $\{0, 1, 2, \ldots\}$. The expectation of the distribution is $\lambda$ and the variance is also $\lambda$. The functions “dpois”, “ppois”, and “qpois” may be used in order to compute the probability, the cumulative probability, and the percentiles, respectively, for the Poisson distribution. The function “rpois” can be used in order to simulate a random sample from this distribution.
Uniform:: The Uniform distribution is used in order to model measurements that may have values in a given interval, with all values in this interval equally likely to occur. The parameters that determine the distribution are $a$ and $b$, the two end points of the interval. The expression “$\mathrm{Uniform}(a,b)$” is used to identify the Uniform distribution. The sample space for this distribution is the interval $[a,b]$. The expectation of the distribution is $(a+b)/2$ and the variance is $(b-a)^2/12$. The functions “dunif”, “punif”, and “qunif” may be used in order to compute the density, the cumulative probability, and the percentiles for the Uniform distribution. The function “runif” can be used in order to simulate a random sample from this distribution.
Exponential:: The Exponential distribution is frequently used to model times between events. It can also be used in other cases where the outcome of the measurement is a positive number and where a smaller value is more likely than a larger value. The parameter that determines the distribution is the rate $\lambda$. The expression “$\mathrm{Exponential}(\lambda)$” is used to identify the Exponential distribution. The sample space for this distribution is the collection of positive numbers. The expectation of the distribution is $1/\lambda$ and the variance is $1/\lambda^2$. The functions “dexp”, “pexp”, and “qexp” may be used in order to compute the density, the cumulative probability, and the percentiles, respectively, for the Exponential distribution. The function “rexp” can be used in order to simulate a random sample from this distribution.
Normal:: The Normal distribution frequently serves as a generic model for the distribution of a measurement. Typically, it also emerges as an approximation of the sampling distribution of statistics. The parameters that determine the distribution are the expectation $\mu$ and the variance $\sigma^2$. The expression “$\mathrm{Normal}(\mu,\sigma^2)$” is used to mark the Normal distribution. The sample space for this distribution is the collection of all numbers, negative or positive. The expectation of the distribution is $\mu$ and the variance is $\sigma^2$. The functions “dnorm”, “pnorm”, and “qnorm” may be used in order to compute the density, the cumulative probability, and the percentiles for the Normal distribution. The function “rnorm” can be used in order to simulate a random sample from this distribution.
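The functions listed above share a common naming scheme, which the following sketch demonstrates for a few of the models. The parameter values are arbitrary choices for illustration. Note that the R functions for the Normal distribution are parameterized by the standard deviation $\sigma$ rather than by the variance $\sigma^2$.

```r
# Binomial(10, 0.5): probability, cumulative probability, percentile.
dbinom(3, size = 10, prob = 0.5)      # P(X = 3)
pbinom(3, size = 10, prob = 0.5)      # P(X <= 3)
qbinom(0.5, size = 10, prob = 0.5)    # the median

# Exponential with rate 2: the expectation is 1/2.
pexp(1, rate = 2)                     # P(X <= 1)

# Normal(10, 4): R expects the standard deviation, not the variance.
mu <- 10
sigma2 <- 4
pnorm(12, mean = mu, sd = sqrt(sigma2))      # P(X <= 12)
qnorm(0.975, mean = mu, sd = sqrt(sigma2))   # 97.5th percentile

# Simulate a random sample and compare its average and variance with mu and sigma2.
set.seed(1)
x <- rnorm(10^4, mean = mu, sd = sqrt(sigma2))
mean(x)
var(x)
```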

9.4.4 Sampling Distribution of Statistics

Theoretical models describe the distribution of a measurement as a function of a parameter, or a small number of parameters. For example, in the Binomial case the distribution is determined by the number of trials $n$ and by the probability of success in each trial $p$. In the Poisson case the distribution is a function of the expectation $\lambda$. For the Uniform distribution we may use the end-points of the interval, $a$ and $b$, as the parameters. In the Exponential case, the rate $\lambda$ is a natural parameter for specifying the distribution and in the Normal case the expectation $\mu$ and the variance $\sigma^2$ may be used for that role.

The general formulation of statistical inference problems involves the identification of a theoretical model for the distribution of the measurements. This theoretical model is a function of a parameter whose value is unknown. The goal is to produce statements that refer to this unknown parameter. These statements are based on a sample of observations from the given distribution.

For example, one may try to guess the value of the parameter (point estimation), one may propose an interval which contains the value of the parameter with some prescribed probability (confidence interval) or one may test the hypothesis that the parameter obtains a specific value (hypothesis testing).

The vehicles for conducting the statistical inferences are statistics that are computed as a function of the measurements. In the case of point estimation these statistics are called estimators. When the goal is the construction of an interval that contains the value of the parameter, the statistics are called confidence intervals. In the case of testing hypotheses these statistics are called test statistics.

In all cases of inference, the relevant statistic possesses a distribution that it inherits from the sampling distribution of the observations. This distribution is the sampling distribution of the statistic. The properties of the statistic as a tool for inference are assessed in terms of its sampling distribution. The sampling distribution of a statistic is a function of the sample size and of the parameters that determine the distribution of the measurements, but otherwise may be of complex structure.

In order to assess the performance of the statistics as agents of inference one should be able to determine their sampling distribution. We will apply two approaches for this determination. One approach is to use a Normal approximation. This approach relies on the Central Limit Theorem. The other approach is to simulate the distribution. This other approach relies on the functions available in R for the simulation of a random sample from a given distribution.

9.4.5 The Normal Approximation

In general, the sampling distribution of a statistic is not the same as the sampling distribution of the measurements from which it is computed. For example, if the measurements are from the Uniform distribution then a function of the measurements will, in most cases, not possess the Uniform distribution. Nonetheless, in many cases one may still identify, at least approximately, what the sampling distribution of the statistic is.

The most important scenario where the limit distribution of the statistic has a known shape is when the statistic is the sample average or a function of the sample average. In such a case the Central Limit Theorem may be applied in order to show that, at least for a sample size not too small, the distribution of the statistic is approximately Normal.

When the Normal approximation may be applied, a probabilistic statement associated with the sampling distribution of the statistic can be substituted by the same statement formulated for the Normal distribution. For example, the probability that the statistic falls inside a given interval may be approximated by the probability that a Normal random variable with the same expectation and the same variance (or standard deviation) as the statistic falls inside the given interval.

For the special case of the sample average one may use the fact that the expectation of the average of a sample of measurements is equal to the expectation of a single measurement and the fact that the variance of the average is the variance of a single measurement, divided by the sample size. Consequently, the probability that the sample average falls within a given interval may be approximated by the probability of the same interval according to the Normal distribution. The expectation that is used for the Normal distribution is the expectation of the measurement. The standard deviation is the standard deviation of the measurement, divided by the square root of the number of observations.
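A minimal sketch of this recipe, assuming $\mathrm{Uniform}(0,1)$ measurements and a sample size of 30 (both arbitrary choices for illustration), compares the Normal approximation of the probability that the sample average falls in the interval $[0.45, 0.55]$ with a simulated relative frequency. The names `normal.prob` and `simulated` are hypothetical.

```r
n <- 30                          # sample size (an arbitrary choice for illustration)
a <- 0
b <- 1                           # Uniform(0, 1) measurements
mu <- (a + b) / 2                # expectation of a single measurement
sigma <- sqrt((b - a)^2 / 12)    # standard deviation of a single measurement

# Normal approximation of P(0.45 <= average <= 0.55).
sd.avg <- sigma / sqrt(n)        # standard deviation of the sample average
normal.prob <- pnorm(0.55, mean = mu, sd = sd.avg) -
  pnorm(0.45, mean = mu, sd = sd.avg)

# Simulated relative frequency of the same event.
set.seed(1)
x.bar <- replicate(10^4, mean(runif(n, a, b)))
simulated <- mean(x.bar >= 0.45 & x.bar <= 0.55)

normal.prob
simulated
```

The two numbers should be close to each other if the Normal approximation is adequate for this sample size.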

The Normal approximation of the distribution of a statistic is valid for cases other than the sample average or functions thereof. For example, it can be shown (under some conditions) that the Normal approximation applies to the sample median, even though the sample median is not a function of the sample average.

On the other hand, one should not assume that the distribution of a statistic is necessarily Normal. In many cases it is not, even for a large sample size. For example, the minimal value of a sample that is generated from the Exponential distribution can be shown to follow the Exponential distribution with an appropriate rate²⁵, regardless of the sample size.
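The claim regarding the minimal value can be examined with a short simulation. The sketch below relies on the standard result that the minimum of $n$ independent $\mathrm{Exponential}(\lambda)$ observations has the $\mathrm{Exponential}(n\lambda)$ distribution. The sample size, the rate and the name `x.min` are arbitrary choices made for illustration.

```r
n <- 50          # sample size (arbitrary choice)
lambda <- 2      # rate of each observation (arbitrary choice)

# Simulate the minimal value of many Exponential samples.
set.seed(1)
x.min <- replicate(10^4, min(rexp(n, rate = lambda)))

# Compare the simulation with the Exponential(n * lambda) distribution.
mean(x.min)                      # simulated expectation of the minimum
1 / (n * lambda)                 # expectation of Exponential(n * lambda)
mean(x.min <= 0.01)              # simulated P(minimum <= 0.01)
pexp(0.01, rate = n * lambda)    # the same probability under Exponential(n * lambda)
```

A histogram of `x.min` would show the skewed shape of the Exponential density rather than the symmetric bell shape of the Normal density.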

9.4.6 Simulations

In most problems of statistical inference that will be discussed in this book we will be using the Normal approximation for the sampling distribution of the statistic. However, every now and then we may want to check the validity of this approximation in order to reassure ourselves of its appropriateness. Computerized simulations can be carried out for that checking. The simulations are equivalent to those used in the first part of the book.

A model for the distribution of the observations is assumed each time a simulation is carried out. The simulation itself involves the generation of random samples from that model for the given sample size and for a given value of the parameter. The statistic is evaluated and stored for each generated sample. Thereby, via the generation of many samples, an approximation of the sampling distribution of the statistic is produced. A probabilistic statement inferred from the Normal approximation can be compared to the results of the simulation. Substantial disagreement between the Normal approximation and the outcome of the simulations is evidence that the Normal approximation may not be valid in the specific setting.

As an illustration, assume the statistic is the average price of a car. It is assumed that the price of a car follows an Exponential distribution with some unknown rate parameter $\lambda$. We consider the sampling distribution of the average of 201 Exponential random variables. (Recall that in our sample there are 4 missing values among the 205 observations.) The expectation of the average is $1/\lambda$, which is the expectation of a single Exponential random variable. The variance of a single observation is $1/\lambda^2$. Consequently, the standard deviation of the average is $\sqrt{(1/\lambda^2)/201} = (1/\lambda)/\sqrt{201} = (1/\lambda)/14.17745 = 0.0705/\lambda$.
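The arithmetic in this paragraph can be checked directly in R. The lines below are only a sketch, and the names “n” and “factor.avg” are introduced here for illustration:

```r
n <- 201                   # number of non-missing observations
sqrt(n)                    # approximately 14.17745
factor.avg <- 1/sqrt(n)    # multiplies 1/lambda to give the sd of the average
factor.avg                 # approximately 0.0705
```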

In the first part of the book we found out that for $\mathrm{Normal}(\mu,\sigma^2)$, the Normal distribution with expectation $\mu$ and variance $\sigma^2$, the central region that contains 95% of the distribution takes the form $\mu \pm 1.96\, \sigma$ (namely, the interval $[\mu-1.96\,\sigma,\mu + 1.96\, \sigma]$). Thereby, according to the Normal approximation for the sampling distribution of the average price we state that the region $1/\lambda \pm 1.96 \cdot 0.0705/\lambda$ should contain 95% of the distribution.

We may use simulations in order to validate this approximation for selected values of the rate parameter $\lambda$. Hence, for example, we may choose $\lambda = 1/12000$ (which corresponds to an expected price of \$12,000 for a car) and validate the approximation for that parameter value.
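For this choice of the rate, the central region proposed by the Normal approximation can be computed explicitly. The following sketch uses the names “lam”, “mu” and “sd.avg”, which are introduced here for illustration, and the function “qnorm” to recall the origin of the value 1.96:

```r
lam <- 1/12000                 # the rate selected in the text
mu <- 1/lam                    # expectation of the sample average
sd.avg <- (1/lam)/sqrt(201)    # standard deviation of the sample average

qnorm(0.975)                   # approximately 1.96

# End points of the region that should contain 95% of the distribution
c(mu - 1.96*sd.avg, mu + 1.96*sd.avg)
```

The end points are roughly 10,341 and 13,659.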

The simulation itself is carried out by the generation of a sample of size $n=201$ from the $\mathrm{Exponential}(1/12000)$ distribution using the function “rexp” for generating Exponential samples²⁶. The function for computing the average (mean) is applied to each sample and the result is stored. We repeat this process a large number of times (100,000 is the typical number we use) in order to produce an approximation of the sampling distribution of the sample average. Finally, we check the relative frequency of cases where the simulated average is within the given range²⁷. This relative frequency is an approximation of the required probability and may be compared to the target value of 0.95.

Let us run the proposed simulation for the sample size of $n=201$ and for a rate parameter equal to $\lambda = 1/12000$. Observe that the expectation of the sample average is equal to $12,000$ and the standard deviation is $0.0705\times 12000$. Hence:

X.bar <- rep(0,10^5)
for(i in 1:10^5) {
  X <- rexp(201,1/12000)
  X.bar[i] <- mean(X)
}
mean(abs(X.bar-12000) <= 1.96*0.0705*12000)

## [1] 0.94978

Observe that the simulation produces 0.94978 as the probability of the interval. This result is close enough to the target probability of 0.95, suggesting that the Normal approximation is adequate in this example.

The simulation demonstrates the appropriateness of the Normal approximation for the specific value of the parameter that was used. In order to gain more confidence in the approximation we may want to consider other values as well. However, simulations in this book are used only for demonstration. Hence, in most cases where we conduct a simulation experiment, we conduct it only for a single evaluation of the parameters. We leave it to the curiosity of the reader to expand the simulations and try other evaluations of the parameters.
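A reader who wishes to try other values of the rate may wrap the simulation in a function. The sketch below is one possible way of doing so and is not part of the original simulation: the function name “check.normal”, its arguments, the selected rates and the reduced number of iterations ($10^4$, chosen to shorten the running time) are all assumptions made for illustration.

```r
# Hypothetical helper: relative frequency of simulated averages that fall
# inside the region proposed by the Normal approximation, for a given rate
check.normal <- function(lam, n = 201, reps = 10^4) {
  X.bar <- rep(0,reps)
  for(i in 1:reps) {
    X <- rexp(n,lam)
    X.bar[i] <- mean(X)
  }
  mean(abs(X.bar - 1/lam) <= 1.96*(1/lam)/sqrt(n))
}

# Rates that correspond to expected prices of 5,000, 12,000 and 30,000
lam.values <- c(1/5000, 1/12000, 1/30000)
freq <- sapply(lam.values, check.normal)
freq
```

Since changing the rate merely rescales an Exponential measurement, the relative frequencies are expected to be about the same for all rates, up to simulation error.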

Simulations may also be used in order to compute probabilities in cases where the Normal approximation does not hold. As an illustration, consider the mid-range statistic. This statistic is computed as the average between the largest and the smallest values in the sample. This statistic is discussed in the next chapter.

Consider the case where we obtain 100 observations. Let the distribution of each observation be Uniform. Suppose we are interested as before in the central range that contains 95% of the distribution of the mid-range statistic. The Normal approximation does not apply in this case. Yet, if we specify the parameters of the Uniform distribution then we may use simulations in order to compute the range.

As a specific example let the distribution of an observation be $\mathrm{Uniform}(3,7)$. In the simulation we generate a sample of size $n=100$ from this distribution²⁸ and compute the mid-range for the sample.

For the computation of the statistic we need to obtain the minimal and the maximal values of the sample. The minimal value of a sequence is computed with the function “min”. The input to this function is a sequence and the output is the minimal value of the sequence. Similarly, the maximal value is computed with the function “max”. Again, the input to the function is a sequence and the output is the maximal value in the sequence. The statistic itself is obtained by adding the two extreme values to each other and dividing the sum by two²⁹.
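As a small illustration, consider a short made-up sequence of four values (the values are hypothetical and serve only to demonstrate the functions):

```r
X <- c(4.2, 6.5, 3.1, 5.8)   # hypothetical sample
min(X)                       # the minimal value: 3.1
max(X)                       # the maximal value: 6.5
(max(X)+min(X))/2            # the mid-range: 4.8
```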

We produce, just as before, a large number of samples and compute the value of the statistic for each sample. The distribution of the simulated values of the statistic serves as an approximation of the sampling distribution of the statistic. The central range that contains 95% of the sampling distribution may be approximated with the aid of this simulated distribution.

Specifically, we approximate the central range by the identification of the 0.025-percentile and the 0.975-percentile of the simulated distribution. Between these two values are 95% of the simulated values of the statistic. The percentiles of a sequence of simulated values of the statistic can be identified with the aid of the function “quantile” that was presented in the first part of the book. The first argument to the function is a sequence of values and the second argument is a number $p$ between 0 and 1. The output of the function is the $p$-percentile of the sequence³⁰. The $p$-percentile of the simulated sequence serves as an approximation of the $p$-percentile of the sampling distribution of the statistic.

The second argument to the function “quantile” may be a sequence of values between 0 and 1. If so, the percentile for each value in the second argument is computed³¹.
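For instance, applying the function to the sequence of integers between 0 and 100 (a sequence chosen here only for illustration, stored under the assumed name “q” after the computation) produces both percentiles in a single call:

```r
# Both percentiles are computed at once when the second argument is a sequence
q <- quantile(0:100,c(0.025,0.975))
q   # the 2.5% entry is 2.5 and the 97.5% entry is 97.5
```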

Let us carry out the simulation that produces an approximation of the central region that contains 95% of the sampling distribution of the mid-range statistic for the Uniform distribution:

mid.range <- rep(0,10^5)
for(i in 1:10^5) {
  X <- runif(100,3,7)
  mid.range[i] <- (max(X)+min(X))/2
}
quantile(mid.range,c(0.025,0.975))

##     2.5%    97.5% 
## 4.941019 5.058398

Observe that (approximately) 95% of the sampling distribution of the statistic is in the range $[4.941019, 5.058398]$.

Simulations can be used in order to compute the expectation, the standard deviation or any other numerical summary of the sampling distribution of a statistic. All one needs to do is compute the required summary for the simulated sequence of statistic values and hence obtain an approximation of the required summary. For example, we may use the sequence “mid.range” in order to obtain the expectation and the standard deviation of the mid-range statistic of a sample of 100 observations from the $\mathrm{Uniform}(3,7)$ distribution:

mean(mid.range)

## [1] 4.999871

sd(mid.range)

## [1] 0.02772162

The expectation of the statistic is obtained by the application of the function “mean” to the sequence. Observe that it is practically equal to 5. The standard deviation is obtained by the application of the function “sd”. Its value is approximately equal to 0.028.
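The simulated values may also be used to examine, informally, the earlier statement that the Normal approximation does not apply to the mid-range statistic. The sketch below repeats the simulation with fewer iterations ($10^4$, an assumption made to shorten the running time) and computes the relative frequency of values that fall within 1.96 standard deviations of the mean. The name “freq” is introduced here for illustration. For a Normal distribution this frequency would be about 0.95:

```r
# Repeat the mid-range simulation (fewer iterations than in the text)
mid.range <- rep(0,10^4)
for(i in 1:10^4) {
  X <- runif(100,3,7)
  mid.range[i] <- (max(X)+min(X))/2
}

# Relative frequency of simulated values within 1.96 standard
# deviations of the simulated mean
freq <- mean(abs(mid.range - mean(mid.range)) <= 1.96*sd(mid.range))
freq
```

In runs of this kind the relative frequency tends to fall noticeably below 0.95, which is consistent with a sampling distribution that is not Normal.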

9.5 Exercises

Magnetic fields have been shown to have an effect on living tissue and were proposed as a method for treating pain. In the case study presented here, Carlos Vallbona and his colleagues³² sought to answer the question “Can the chronic pain experienced by postpolio patients be relieved by magnetic fields applied directly over an identified pain trigger point?”

A total of 50 patients experiencing post-polio pain syndrome were recruited. Some of the patients were treated with an active magnetic device and the others were treated with an inactive placebo device. All patients rated their pain before (score1) and after application of the device (score2). The variable “change” is the difference between “score1” and “score2”. The treatment condition is indicated by the variable “active”. The value “1” indicates subjects receiving treatment with the active magnet and the value “2” indicates subjects treated with the inactive placebo.

This case study is taken from the Rice Virtual Lab in Statistics. More details on this case study can be found in the case study Magnets and Pain Relief that is presented in that site.

Exercise 9.1 The data for the 50 patients is stored in file “magnets.csv”. The file can be found on the internet at http://pluto.huji.ac.il/~msby/StatThink/Datasets/magnets.csv. Download this file to your computer and store it in the working directory of R. Read the content of the file into an R data frame. Produce a summary of the content of the data frame and answer the following questions:

What is the sample average of the change in score between the patient’s rating before the application of the device and the rating after the application?
Is the variable “active” a factor or a numeric variable?
Compute the average value of the variable “change” for the patients that received an active magnet and the average value for those that received an inactive placebo. (Hint: Notice that the first 29 patients received an active magnet and the last 21 patients received an inactive placebo. The sub-sequence of the first 29 values of the given variable can be obtained via the expression “change[1:29]” and the last 21 values are obtained via the expression “change[30:50]”.)
Compute the sample standard deviation of the variable “change” for the patients that received an active magnet and the sample standard deviation for those that received an inactive placebo.
Produce a boxplot of the variable “change” for the patients that received an active magnet and for patients that received an inactive placebo. What is the number of outliers in each subsequence?

Exercise 9.2 In Chapter 13 we will present a statistical test for testing if there is a difference between the patients that received the active magnets and the patients that received the inactive placebo in terms of the expected value of the variable that measures the change. The test statistic for this problem is taken to be

\[\frac{\bar X_1 - \bar X_2}{\sqrt{S^2_1/29 + S^2_2/21}}\;,\] where $\bar X_1$ and $\bar X_2$ are the sample averages for the 29 patients that receive active magnets and for the 21 patients that receive inactive placebo, respectively. The quantities $S^2_1$ and $S_2^2$ are the sample variances for each of the two samples. Our goal is to investigate the sampling distribution of this statistic in a case where both expectations are equal to each other and to compare this distribution to the observed value of the statistic.

Assume that the expectation of the measurement is equal to 3.5, regardless of the type of treatment that the patient received. We take the standard deviation of the measurement for patients that received an active magnet to be equal to 3 and for those that received the inactive placebo we take it to be equal to 1.5. Assume that the distribution of the measurements is Normal and there are 29 patients in the first group and 21 in the second. Find the interval that contains 95% of the sampling distribution of the statistic.
Does the observed value of the statistic, computed for the data frame “magnets”, fall inside or outside of the interval that is computed in 1?

9.6 Summary

Glossary

Statistical Inference:: Methods for gaining insight regarding the population parameters from the observed data.
Point Estimation:: An attempt to obtain the best guess of the value of a population parameter. An estimator is a statistic that produces such a guess. The estimate is the observed value of the estimator.
Confidence Interval:: An interval that is most likely to contain the population parameter. The confidence level of the interval is the sampling probability that the confidence interval contains the parameter value.
Hypothesis Testing:: A method for determining between two hypotheses, with one of the two being the currently accepted hypothesis. A determination is based on the value of the test statistic. The probability of falsely rejecting the currently accepted hypothesis is the significance level of the test.
Comparing Samples:: Samples emerge from different populations or under different experimental conditions. Statistical inference may be used to compare the distributions of the samples to each other.
Regression:: Relates different variables that are measured on the same sample. Regression models are used to describe the effect of one of the variables on the distribution of the other one. The former is called the explanatory variable and the latter is called the response.
Missing Value:: An observation for which the value of the measurement is not recorded. R uses the symbol “NA” to identify a missing value.

Discuss in the forum

A data set may contain missing values. A missing value is an observation of a variable for which the value is not recorded. Most statistical procedures delete observations with missing values and conduct the inference on the remaining observations.

Some people say that the method of deleting observations with missing values is dangerous and may lead to biased analysis. The reason is that missing values may contain information. What is your opinion?

When you formulate your answer to this question it may be useful to come up with an example from your own field of interest. Think of an example in which a missing value contains information relevant for inference or an example in which it does not. In the former case try to assess the possible effects on the analysis that may emerge due to the deletion of observations with missing values.

For example, the goal in some clinical trials is to assess the effect of a new treatment on the survival of patients with a life-threatening illness. The trial is conducted for a given duration, say two years, and the time of death of the patients is recorded. The time of death is missing for patients that survived the entire duration of the trial. Yet, one is advised not to ignore these patients in the analysis of the outcome of the trial.

Other data sets will be used in Chapter \[ch:CaseStudies\] and in the quizzes and assignments.↩
The original “Automobiles” data set is accessible at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). This data was assembled by Jeffrey C. Schlimmer, using as source the 1985 Model Import Car and Truck Specifications, 1985 Ward’s Automotive Yearbook. The current file “cars.csv” is based on all 205 observations of the original data set. We selected 17 of the 26 variables available in the original source.↩
Indeed, if you scan the CSV file directly by opening it with a spreadsheet then every now and again you will encounter this symbol.↩
If the rate of an Exponential measurement is $\lambda$ then the rate of the minimum of $n$ such measurements is $n\lambda$.↩
The expression for generating a sample is “rexp(201,1/12000)”↩
In the case where the simulated averages are stored in the sequence “X.bar” then we may use the expression “mean(abs(X.bar - 12000) <= 1.96*0.0705*12000)” in order to compute the relative frequency.↩
With the expression “runif(100,3,7)”.↩
If the sample is stored in an object by the name “X” then one may compute the mid-range statistic with the expression “(max(X)+min(X))/2”.↩
The $p$-percentile of a sequence is a number with the property that the proportion of entries with values smaller than that number is $p$ and the proportion of entries with values larger than the number is $1-p$.↩
If the simulated values of the statistic are stored in a sequence by the name “mid.range” then the 0.025-percentile and the 0.975-percentile of the sequence can be computed with the expression “quantile(mid.range,c(0.025,0.975))”.↩
Vallbona, Carlos, Carlton F. Hazlewood, and Gabor Jurida. (1997). Response of pain to static magnetic fields in postpolio patients: A double-blind pilot study. Archives of Physical Medicine and Rehabilitation 78(11): 1200-1203.↩