Chapter 10 Point Estimation

10.1 Student Learning Objectives

The subject of this chapter is the estimation of the value of a parameter on the basis of data. An estimator is a statistic that is used for estimation. Criteria for selecting among estimators are discussed, with the goal of seeking an estimator that tends to obtain values that are as close as possible to the value of the parameter. Different examples of estimation problems are presented and analyzed. By the end of this chapter, the student should be able to:

  • Recognize issues associated with the estimation of parameters.

  • Define the notions of bias, variance and mean squared error (MSE) of an estimator.

  • Estimate parameters from data and assess the performance of the estimation procedure.

10.2 Estimating Parameters

Statistics is the science of data analysis. The primary goal in statistics is to draw meaningful and solid conclusions about a given phenomenon on the basis of observed data. Typically, the data emerges as a sample of observations. An observation is the outcome of a measurement (or several measurements) taken for a subject that belongs to the sample. These observations may be used in order to investigate the phenomenon of interest. The conclusions are drawn from the analysis of the observations.

A key aspect in statistical inference is the association of a probabilistic model to the observations. The basic assumption is that the observed data emerges from some distribution. Usually, the assumption is that the distribution is linked to a theoretical model, such as the Normal, Exponential, Poisson, or any other model that fits the specifications of the measurement taken.

A standard setting in statistical inference is the presence of a sequence of observations. It is presumed that all the observations emerged from a common distribution. The parameters one seeks to estimate are summaries or characteristics of that distribution.

For example, one may be interested in the distribution of the price of cars. A reasonable assumption is that the distribution of the prices is the \(\mathrm{Exponential}(\lambda)\) distribution. Given an observed sample of prices one may be able to estimate the rate \(\lambda\) that specifies the distribution.

The target in statistical point estimation of a parameter is to produce the best possible guess of the value of a parameter on the basis of the available data. The statistic that tries to guess the value of the parameter is called an estimator. The estimator is a formula applied to the data that produces a number. This number is the estimate of the value of the parameter.

An important characteristic of a distribution, which is always of interest, is the expectation of the measurement, namely the central location of the distribution. A natural estimator of the expectation is the sample average. However, one may propose other estimators that make sense, such as the sample mid-range that was presented in the previous chapter. The main topic of this chapter is the identification of criteria that may help us choose which estimator to use for the estimation of which parameter.

In the next section we discuss issues associated with the estimation of the expectation of a measurement. The following section deals with the estimation of the variance and standard deviation – summaries that characterize the spread of the distribution. The last section deals with the theoretical models of distribution that were introduced in the first part of the book. It discusses ways by which one may estimate the parameters that characterize these distributions.

10.3 Estimation of the Expectation

A natural candidate for the estimation of the expectation of a random variable on the basis of a sample of observations is the sample average. Consider, as an example, the estimation of the expected price of a car using the information in the data file “cars.csv”. Let us read the data into a data frame named “cars” and compute the average of the variable “price”:

cars <- read.csv("_data/cars.csv")
mean(cars$price)
## [1] NA

The application of the function “mean” for the computation of the sample average produced a missing value. The reason is that the variable “price” contains 4 missing values. By default, when applied to a sequence that contains missing values, the function “mean” produces a missing value as its output.

The behavior of the function “mean” in the presence of missing values is determined by the argument “na.rm”33. If we want to compute the average of the non-missing values in the sequence we should specify the argument “na.rm” as “TRUE”. This can be achieved by the inclusion of the expression “na.rm=TRUE” in the arguments of the function:

mean(cars$price,na.rm=TRUE)
## [1] 13207.13

The resulting average price is, approximately, $13,200.

10.3.1 The Accuracy of the Sample Average

How close is the estimated value of the expectation – the average price – to the expected price?

There is no way of answering this question on the basis of the data we observed. Indeed, we think of the price of a random car as a random variable. The expectation we seek to estimate is the expectation of that random variable. However, the actual value of that expectation is unknown. Hence, not knowing what is the target value, how can we determine the distance between the computed average 13207.13 and that unknown value?

Since we cannot answer the question that we would like to address, we instead change the question. In the new formulation of the question we ignore the data at hand altogether. The new formulation considers the sample average as a statistic, and the question is posed in terms of the sampling distribution of that statistic. The question, in its new formulation, is: How close is the sample average of the price, taken as a random variable, to the expected price?

Notice that in the new formulation of the question the observed average price \(\bar x = 13207.13\) has no special role. The question is formulated in terms of the sampling distribution of the sample average (\(\bar X\)). The observed average value is only one among many in the sampling distribution of the average.

The advantage of the new formulation of the question is that it can be addressed. We do have means for investigating the closeness of the estimator to the parameter and thereby producing meaningful answers. Specifically, consider the current case where the estimator is the sample average \(\bar X\). This estimator attempts to estimate the expectation \(\Expec(X)\) of the measurement, which is the parameter. Assessing the closeness of the estimator to the parameter corresponds to the comparison between the distribution of the random variable, i.e. the estimator, and the value of the parameter.

For this comparison we may note that the expectation \(\Expec(X)\) is also the expectation of the sample average \(\bar X\). Consequently, in this problem the assessment of the closeness of the estimator to the parameter is equivalent to the investigation of the spread of the distribution of the sample average about its expectation.

Consider an example of such an investigation. Imagine that the expected price of a car is equal to $13,000. A question one may ask is: how likely is it that the estimator’s guess is within $1,000 of this actual value? In other words, what is the probability that the sample average falls in the range \([12,000, 14,000]\)?

Let us investigate this question using simulations. Recall our assumption that the distribution of the price is Exponential. An expectation of 13,000 corresponds to a rate parameter of \(\lambda = 1/13,000\). We simulate the sampling distribution of the estimator by the generation of a sample of 201 Exponential random variables with this rate. The sample average is computed for each sample and stored. The sampling distribution of the sample average is approximated via the production of a large number of sample averages:

lam <- 1/13000
X.bar <- rep(0,10^5)
for(i in 1:10^5) {
  X <- rexp(201,lam)
  X.bar[i] <- mean(X)
}
mean(abs(X.bar - 1/lam) <= 1000)
## [1] 0.72405

In the last line of the code we compute the probability of being within $1,000 of the expected price. Recall that the expected price in the Exponential case is the reciprocal of the rate \(\lambda\). In this simulation we obtained 0.72405 as an approximation of the probability.

In the case of the sample average we may also apply the Normal approximation in order to assess the probability under consideration. In particular, if \(\lambda = 1/13,000\) then the expectation of an Exponential observation is \(\Expec(X) = 1/\lambda = 13,000\) and the variance is \(\Var(X) = 1/\lambda^2 = (13,000)^2\). The expectation of the sample average is equal to the expectation of the measurement, 13,000 in this example. The variance of the sample average is equal to the variance of the observation, divided by the sample size. In the current setting it is equal to \((13,000)^2/201\). The standard deviation is equal to the square root of the variance.

The Normal approximation uses the Normal distribution in order to compute probabilities associated with the sample average. The Normal distribution that is used has the same expectation and standard deviation as the sample average:

mu <- 13000
sig <- 13000/sqrt(201)
pnorm(14000,mu,sig) - pnorm(12000,mu,sig)
## [1] 0.7245391

The probability of falling within the interval \([12000, 14000]\) is computed as the difference between the cumulative Normal probability at 14,000 and the cumulative Normal probability at 12,000.

These cumulative probabilities are computed with the function “pnorm”. Recall that this function computes the cumulative probability for the Normal distribution. If the first argument is 14,000 then the function produces the probability that a Normal random variable is less than or equal to 14,000. Likewise, if the first argument is 12,000 then the computed probability is the probability of being less than or equal to 12,000. The expectation of the Normal distribution enters in the second argument of the function and the standard deviation enters in the third argument.

The Normal approximation of the probability of falling in the interval \([12000, 14000]\), computed as the difference between the two cumulative probabilities, produces 0.7245391 as the probability34. Notice that the probability 0.72405 computed in the simulations is in agreement with the Normal approximation.

If we wish to assess the accuracy of the estimator at other values of the parameter, say \(\Expec(X) = 12,000\) (which corresponds to \(\lambda = 1/12,000\)) or \(\Expec(X) = 14,033\), (which corresponds to \(\lambda = 1/14,033\)) all we need to do is change the expression “lam <- 1/13000” to the new value and rerun the simulation.
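For instance, a minimal sketch of such a rerun for an expected price of $12,000 is given below (the simulated probability varies slightly from run to run, and should be close to the Normal approximation computed next):

lam <- 1/12000                    # new rate, corresponding to an expectation of 12,000
X.bar <- rep(0,10^5)
for(i in 1:10^5) {
  X <- rexp(201,lam)              # a sample of 201 Exponential prices
  X.bar[i] <- mean(X)
}
mean(abs(X.bar - 1/lam) <= 1000)  # probability of being within $1,000 of the expectation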

Alternatively, we may use a Normal approximation with modified interval, expectation, and standard deviation. For example, consider the case where the expected price is equal to $12,000. In order to assess the probability that the sample average falls within $1,000 of the parameter we should compute the probability of the interval \([11,000, 13,000]\) and change the entries of the first argument of the function “pnorm” accordingly. The new expectation is 12,000 and the new standard deviation is \(12,000/\sqrt{201}\):

mu <- 12000
sig <- 12000/sqrt(201)
pnorm(13000,mu,sig) - pnorm(11000,mu,sig)
## [1] 0.7625775

This time we get that the probability is, approximately, 0.763.

The fact that the computed value of the average 13,207.13 belongs to the interval \([12,000, 14,000]\) that was considered in the first analysis but does not belong to the interval \([11,000, 13,000]\) that was considered in the second analysis is irrelevant to the conclusions drawn from the analysis. In both cases the theoretical properties of the sample average as an estimator were considered and not its value at specific data.

In the simulation and in the Normal approximation we applied one method for assessing the closeness of the sample average to the expectation it estimates. This method involved the computation of the probability of being within $1,000 of the expected price. The higher this probability, the more accurate is the estimator.

An alternative method for assessing the accuracy of an estimator of the expectation may involve the use of an overall summary of the spread of the distribution. A standard method for quantifying the spread of a distribution about the expectation is the variance (or its square root, the standard deviation). Given an estimator of the expectation of a measurement, the sample average for example, we may evaluate the accuracy of the estimator by considering its variance. The smaller the variance the more accurate is the estimator.

Consider again the case where the sample average is used in order to estimate the expectation of a measurement. In such a situation the variance of the estimator, i.e. the variance of the sample average, is obtained as the variance of the measurement \(\Var(X)\) divided by the sample size \(n\):

\[\Var(\bar X) = \Var(X)/n\;.\] Notice that for larger sample sizes the estimator is more accurate. The larger the sample size \(n\) the smaller is the variance of the estimator, in which case the values of the estimator tend to be more concentrated about the expectation. Hence, one may make the estimator more accurate by increasing the sample size.
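As a small sketch of this relation, the following simulation (with illustrative sample sizes of 50 and 201 observations and the Exponential model for the price used above) compares the simulated variance of the sample average with the theoretical value \(\Var(X)/n\):

lam <- 1/13000
for(n in c(50,201)) {
  X.bar <- rep(0,10^4)
  for(i in 1:10^4) X.bar[i] <- mean(rexp(n,lam))
  # sample size, simulated variance of the average, and the theoretical Var(X)/n
  print(c(n, var(X.bar), (1/lam)^2/n))
}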

Another method for improving the accuracy of the average of measurements in estimating the expectation is the application of a more accurate measurement device. If the variance \(\Var(X)\) of the measurement device decreases so does the variance of the sample average of such measurements.

In the sequel, when we investigate the accuracy of estimators, we will generally use overall summaries of the spread of their distribution around the target value of the parameter.

10.3.2 Comparing Estimators

Notice that the formulation of the accuracy of estimation that we use replaces the question: “How close is the given value of the estimator to the unknown value of the parameter?” by the question: “How close are the unknown (and random) values of the estimator to a given value of the parameter?” In the second formulation the question is completely academic and unrelated to actual measurement values. In this academic context we can consider different potential values of the parameter. Once the value of the parameter has been selected it can be treated as known in the context of the academic discussion. Clearly, this does not imply that we actually know what is the true value of the parameter.

The sample average is a natural estimator of the expectation of the measurement. However, one may propose other estimators. For example, when the distribution of the measurement is symmetric about the expectation then the median of the distribution is equal to the expectation. The sample median, which is a natural estimator of the measurement median, is an alternative estimator of the expectation in such case. Which of the two alternatives, the sample average or the sample median, should we prefer as an estimator of the expectation in the case of a symmetric distribution?

The straightforward answer to this question is to prefer the better one, the one which is more accurate. As part of the solved exercises you are asked to compare the sample average to the sample median as estimators of the expectation. Here we compare the sample average to yet another alternative estimator – the mid-range estimator – which is the average between the smallest and the largest observations.

In the comparison between estimators we do not evaluate them in the context of the observed data. Rather, we compare them as random variables. The comparison deals with the properties of the estimators in a given theoretical context. This theoretical context is motivated by the realities of the situation as we know them. But, still, the frame of reference is the theoretical model and not the collected data.

Hence, depending on the context, we may assume in the comparison that the observations emerge from some distribution. We may specify parameter values for this distribution and select the appropriate sample size. After setting the stage we can compare the accuracy of one estimator against that of the other. Assessment at other parameter values in the context of the given theoretical model, or of other theoretical models, may provide insight and enhance our understanding regarding the relative merits and weaknesses of each estimator.

Let us compare the sample average to the sample mid-range as estimators of the expectation in a situation that we design. Consider a Normal measurement \(X\) with expectation \(\Expec(X) = 3\) and variance that is equal to 2. Assume that the sample size is \(n = 100\). Both estimators, due to the symmetry of the Normal distribution, are centered at the expectation. Hence, we may evaluate the accuracy of the two estimators using their variances. These variances are the measure of the spread of the distributions of each estimator about the target parameter value.

We produce the sampling distribution and compute the variances using a simulation. Recall that the distribution of the mid-range statistic was simulated in the previous chapter. In the computation of the mid-range statistic we used the function “max” that computes the maximum value of its input and the function “min” that computes the minimum value:

mu <- 3
sig <- sqrt(2)
X.bar <- rep(0,10^5)
mid.range <- rep(0,10^5)
for(i in 1:10^5) {
  X <- rnorm(100,mu,sig)
  X.bar[i] <- mean(X)
  mid.range[i] <- (max(X)+min(X))/2
}
var(X.bar)
## [1] 0.02003947
var(mid.range)
## [1] 0.1870101

We get that the variance of the sample average35 is approximately equal to 0.02. The variance of the mid-range statistic is approximately equal to 0.187, more than 9 times as large. We see that the accuracy of the sample average is better in this case than the accuracy of the mid-range estimator. Evaluating the two estimators at other values of the parameter will produce the same relation. Hence, in the current example it seems as if the sample average is the better of the two.

Is the sample average necessarily the best estimator for the expectation? The next example will demonstrate that this need not always be the case.

Consider again a situation of observing a sample of size \(n=100\). However, this time the measurement \(X\) is Uniform and not Normal. Say \(X \sim \mathrm{Uniform}(0.5,5.5)\) has the Uniform distribution over the interval \([0.5, 5.5]\). The expectation of the measurement is equal to 3 like before, since \(\Expec(X) = (0.5+5.5)/2 = 3\). The variance of an observation is \(\Var(X) = (5.5 - 0.5)^2/12 = 2.083333\), not much different from the variance that was used in the Normal case. The Uniform distribution, like the Normal distribution, is symmetric about its center. Hence, using the mid-range statistic as an estimator of the expectation makes sense36.

We re-run the simulations, using the function “runif” for the simulation of a sample from the Uniform distribution and the parameters of the Uniform distribution instead of the function “rnorm” that was used before:

a <- 0.5
b <- 5.5
X.bar <- rep(0,10^5)
mid.range <- rep(0,10^5)
for(i in 1:10^5) {
  X <- runif(100,a,b)
  X.bar[i] <- mean(X)
  mid.range[i] <- (max(X)+min(X))/2
}
var(X.bar)
## [1] 0.02087122
var(mid.range)
## [1] 0.001209319

Again, we get that the variance of the sample average is approximately equal to 0.02, which is close to the theoretical value37. The variance of the mid-range statistic is approximately equal to 0.0012.

Observe that in the current comparison between the sample average and the mid-range estimator we get that the latter is a clear winner. Examination of other values of \(a\) and \(b\) for the Uniform distribution will produce the same relation between the two competitors. Hence, we may conclude that for the case of the Uniform distribution the sample average is an inferior estimator.

The last example may serve as yet another reminder that life is never simple. A method that is good in one situation may not be as good in a different situation.

Still, the estimator of choice for the expectation is the sample average. Indeed, in some cases other methods may produce more accurate estimates. However, in most settings the sample average beats its competitors. The sample average also possesses other useful properties. Its sampling distribution is always centered at the expectation it is trying to estimate. Its variance has a simple form, i.e. it is equal to the variance of the measurement divided by the sample size. Moreover, its sampling distribution can be approximated by the Normal distribution. Henceforth, due to these properties, we will use the sample average whenever estimation of the expectation is required.

10.4 Estimation of the Variance and Standard Deviation

The spread of the measurement about its expected value may be measured by the variance or by the standard deviation, which is the square root of the variance. The standard estimator for the variance of the measurement is the sample variance and the square root of the sample variance is the default estimator of the standard deviation.

The computation of the sample variance from the data is discussed in Chapter 3. Recall that the sample variance is computed via the formula:

\[s^2 = \frac{\mbox{Sum of the squares of the deviations}}{\mbox{Number of values in the sample}-1}= \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}\;,\] where \(\bar x\) is the sample average and \(n\) is the sample size. The term \(x_i-\bar x\) is the deviation from the sample average of the \(i\)th observation and \(\sum_{i=1}^n (x_i - \bar x)^2\) is the sum of the squares of deviations. It is pointed out in Chapter 3 that the reason for dividing the sum of squares by \((n-1)\), rather than \(n\), stems from considerations of statistical inference. A promise was made that this reasoning would be discussed in due course. Now we want to deliver on this promise.

Let us compare between two competing estimators for the variance, both considered as random variables. One is the estimator \(S^2\), which is equal to the formula for the sample variance applied to a random sample:

\[S^2 = \frac{\mbox{Sum of the squares of the deviations}}{\mbox{Number of values in the sample}-1}= \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}\;.\] The computation of this statistic can be carried out with the function “var”.

The second estimator is the one obtained when the sum of squares is divided by the sample size (instead of the sample size minus 1):

\[\frac{\mbox{Sum of the squares of the deviations}}{\mbox{Number of values in the sample}}=\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n}\;.\] Observe that the second estimator can be represented in the form:

\[\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n} =\frac{n-1}{n} \cdot \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}= [(n-1)/n] S^2\;.\] Hence, the second estimator may be obtained by the multiplication of the first estimator \(S^2\) by the ratio \((n-1)/n\). We seek to compare between \(S^2\) and \([(n-1)/n] S^2\) as estimators of the variance.
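As a small numeric sketch of this identity (the vector below is an arbitrary illustrative sample), dividing the sum of squared deviations by \(n\) produces the same value as multiplying the output of the function “var” by \((n-1)/n\):

x <- c(12.3, 15.1, 9.8, 14.4, 11.0)   # an arbitrary illustrative sample
n <- length(x)
sum((x - mean(x))^2)/n                # sum of squared deviations divided by n
((n-1)/n)*var(x)                      # the same value, via [(n-1)/n] times S^2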

In order to make the comparison concrete, let us consider it in the context of a Normal measurement with expectation \(\mu = 5\) and variance \(\sigma^2 = 3\). Let us assume that the sample is of size 20 (\(n=20\)).

Under these conditions we carry out a simulation. Each iteration of the simulation involves the generation of a sample of size \(n=20\) from the given Normal distribution. The sample variance \(S^2\) is computed from the sample with the application of the function “var”. The resulting estimate of the variance is stored in an object that is called “X.var”:

mu <- 5
std <- sqrt(3)
X.var <- rep(0,10^5)
for(i in 1:10^5) {
  X <- rnorm(20,mu,std)
  X.var[i] <- var(X)
}

The content of the object “X.var”, at the end of the simulation, approximates the sampling distribution of the estimator \(S^2\).

Our goal is to compare between the performance of the estimator of the variance \(S^2\) and that of the alternative estimator. In this alternative estimator the sum of squared deviations is divided by the sample size (\(n=20\)) and not by the sample size minus 1 (\(n-1 = 19\)). Consequently, the alternative estimator is obtained by multiplying \(S^2\) by the ratio \(19/20\). The sampling distribution of the values of \(S^2\) is approximated by the content of the object “X.var”. It follows that the sampling distribution of the alternative estimator is approximated by the object “(19/20)*X.var”, in which each value of \(S^2\) is multiplied by the appropriate ratio. The comparison between the sampling distribution of \(S^2\) and the sampling distribution of the alternative estimator is obtained by comparing between “X.var” and “(19/20)*X.var”.

Let us start with the investigation of the expectations of the estimators. Recall that when we analyzed the sample average as an estimator of the expectation of a measurement we found that the expectation of the sampling distribution of the estimator is equal to the value of the parameter it is trying to estimate. One may wonder: What is the situation for the estimators of the variance? Is it or is it not the case that the expectation of their sampling distribution equals the value of the variance? In other words, is the distribution of either estimator of the variance centered at the value of the parameter it is trying to estimate?

Compute the expectations of the two estimators:

mean(X.var)
## [1] 2.995744
mean((19/20)*X.var)
## [1] 2.845957

Note that 3 is the value of the variance of the measurement that was used in the simulation. Observe that the expectation of \(S^2\) is essentially equal to 3, whereas the expectation of the alternative estimator is less than 3. Hence, at least in the example that we consider, the center of the distribution of \(S^2\) is located on the target value. On the other hand, the center of the sampling distribution of the alternative estimator is located off that target value.

As a matter of fact it can be shown mathematically that the expectation of the estimator \(S^2\) is always equal to the variance of the measurement. This holds true regardless of what is the actual value of the variance. On the other hand the expectation of the alternative estimator is always off the target value38.

An estimator is called unbiased if its expectation is equal to the value of the parameter that it tries to estimate. We get that \(S^2\) is an unbiased estimator of the variance. Similarly, the sample average is an unbiased estimator of the expectation. Unlike these two estimators, the alternative estimator of the variance is a biased estimator.

The default is to use \(S^2\) as the estimator of the variance of the measurement and to use its square root as the estimator of the standard deviation of the measurement. A justification that is frequently quoted for this choice is the fact that \(S^2\) is an unbiased estimator of the variance39.

In the previous section, when comparing two competing estimators of the expectation, our main concern was the quantification of the spread of the sampling distribution of either estimator about the target value of the parameter. We used that spread as a measure of the distance between the estimator and the value it tries to estimate. In the setting of the previous section both estimators were unbiased. Consequently, the variance of the estimators, which measures the spread of the distribution about its expectation, could be used in order to quantify the distance between the estimator and the parameter. (Since, for unbiased estimators, the parameter is equal to the expectation of the sampling distribution.)

In the current section one of the estimators (\(S^2\)) is unbiased, but the other (the alternative estimator) is not. In order to compare their accuracy in estimation we need to figure out a way to quantify the distance between a biased estimator and the value it tries to estimate.

Towards that end let us recall the definition of the variance. Given a random variable \(X\) with an expectation \(\Expec(X)\), we consider the squared deviation \((X - \Expec(X))^2\), which measures the (squared) distance between each value of the random variable and the expectation. The variance is defined as the expectation of the squared distance: \(\Var(X) = \Expec[(X-\Expec(X))^2]\). One may think of the variance as an overall measure of the distance between the random variable and the expectation.

Assume now that the goal is to assess the distance between an estimator and the parameter it tries to estimate. In order to keep the discussion on an abstract level let us use the Greek letter \(\theta\) (read: theta) to denote this parameter40. The estimator is denoted by \(\hat \theta\) (read: theta hat). It is a statistic, a formula applied to the data. Hence, with respect to the sampling distribution, \(\hat \theta\) is a random variable41. The issue is to measure the distance between the random variable \(\hat \theta\) and the parameter \(\theta\).

Motivated by the method that led to the definition of the variance we consider the deviations between the estimator and the parameter. The square deviations \((\hat \theta - \theta)^2\) may be considered in the current context as a measure of the (squared) distance between the estimator and the parameter. When we take the expectation of these square deviations we get an overall measure of the distance between the estimator and the parameter. This overall distance is called the mean square error of the estimator and is denoted by MSE:

\[\mathrm{MSE} = \Expec\big[(\hat \theta - \theta)^2\big]\;.\]

The mean square error of an estimator is tightly linked to the bias and the variance of the estimator. The bias of an estimator \(\hat \theta\) is the difference between the expectation of the estimator and the parameter it seeks to estimate:

\[\mathrm{Bias} = \Expec(\hat \theta) - \theta\;.\] In an unbiased estimator the expectation of the estimator and the estimated parameter coincide, i.e. the bias is equal to zero. For a biased estimator the bias is either negative, as is the case for the alternative estimator of the variance, or else it is positive.

The variance of the estimator, \(\mbox{Variance} = \Var(\hat \theta)\), is a measure of the spread of the sampling distribution of the estimator about its expectation.

The link between the mean square error, the bias, and the variance is described by the formula:

\[\mbox{MSE} = \mbox{Variance} + (\mbox{Bias})^2\;.\] Hence, the mean square error of an estimator is the sum of its variance, the (squared) distance between the estimator and its expectation, and the square of the bias, the square of the distance between the expectation and the parameter. The mean square error is influenced both by the spread of the distribution about the expected value (the variance) and by the distance between the expected value and the parameter (the bias). The larger either of them become the larger is the mean square error, namely the distance between the estimator and the parameter.

Let us compare between the mean square error of the estimator \(S^2\) and the mean square error of the alternative estimator \([19/20] S^2\). Recall that we have computed their expectations and found out that the expectation of \(S^2\) is essentially equal to 3, the target value of the variance. The expectation of the alternative estimator turned out to be equal to 2.845957, which is less than the target value42. It turns out that the bias of \(S^2\) is zero (or essentially zero in the simulations) and the bias of the alternative estimator is \(2.845957 - 3 = -0.154043 \approx -0.15\).

In order to compute the mean square errors of both estimators, let us compute their variances:

var(X.var)
## [1] 0.9484016
var((19/20)*X.var)
## [1] 0.8559324

Observe that the variance of \(S^2\) is essentially equal to 0.948 and the variance of the alternative estimator is essentially equal to 0.856.

The estimator \(S^2\) is unbiased. Consequently, the mean square error of \(S^2\) is equal to its variance. The bias of the alternative estimator is approximately -0.15. As a result we get that the mean square error of this estimator, which is the sum of the variance and the square of the bias, is essentially equal to

\[0.856 + (-0.15)^2 = 0.856 + 0.0225 = 0.8785\;.\] Observe that the mean square error of the estimator \(S^2\), which is equal to 0.948, is larger than the mean square error of the alternative estimator.
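The mean square errors may also be computed directly from the simulated sampling distributions and compared with this decomposition, assuming that the object “X.var” produced above is still available (the exact values vary between runs of the simulation):

mean((X.var - 3)^2)                               # MSE of S^2
mean(((19/20)*X.var - 3)^2)                       # MSE of the alternative estimator
var((19/20)*X.var) + (mean((19/20)*X.var) - 3)^2  # variance plus squared bias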

Notice that even though the alternative estimator is biased it still has a smaller mean square error than the default estimator \(S^2\). Indeed, it can be proved mathematically that when the measurement has a Normal distribution the mean square error of the alternative estimator is always smaller than the mean square error of the sample variance \(S^2\).

Still, although the alternative estimator is slightly more accurate than \(S^2\) in the estimation of the variance, the tradition is to use the latter. Obeying this tradition we will henceforth use \(S^2\) whenever estimation of the variance is required. Likewise, we will use \(S\), the square root of the sample variance, to estimate the standard deviation.

In order to understand how it is that the biased estimator produced a smaller mean square error than the unbiased estimator, let us consider the two components of the mean square error. The alternative estimator is biased but, on the other hand, it has a smaller variance. Both the bias and the variance contribute to the mean square error of an estimator. The price for reducing the bias in estimation is usually an increase in the variance and vice versa. The consequence of producing an unbiased estimator such as \(S^2\) is an inflated variance. A better estimator is an estimator that balances between the error that results from the bias and the error that results from the variance. Such is the alternative estimator.

We will use \(S^2\) in order to estimate the variance of a measurement. A context in which an estimate of the variance of a measurement is relevant is in the assessment of the variance of the sample mean. Recall that the variance of the sample mean is equal to \(\Var(X)/n\), where \(\Var(X)\) is the variance of the measurement and \(n\) is the size of the sample. In the case where the variance of the measurement is not known one may estimate it from the sample using \(S^2\). It follows that the estimator of the variance of the sample average is \(S^2/n\). Similarly, \(S/\sqrt{n}\) can be used as an estimator of the standard deviation of the sample average.
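As a sketch of this application, using the “cars” data frame that was read at the beginning of the chapter (with the missing prices excluded from the computation):

n <- sum(!is.na(cars$price))        # number of non-missing price observations
var(cars$price,na.rm=TRUE)/n        # estimate of the variance of the sample average
sd(cars$price,na.rm=TRUE)/sqrt(n)   # estimate of its standard deviation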

10.5 Estimation of Other Parameters

In the previous two sections we considered the estimation of the expectation and the variance of a measurement. The proposed estimators, the sample average for the expectation and the sample variance for the variance, are not tied to any specific model for the distribution of the measurement. They may be applied to data whether or not a theoretical model for the distribution of the measurement is assumed.

In the cases where a theoretical model for the measurement is assumed one may be interested in the estimation of the specific parameters associated with this model. In the first part of the book we introduced the Binomial, the Poisson, the Uniform, the Exponential, and the Normal models for the distribution of measurements. In this section we consider the estimation of the parameters that determine each of these theoretical distributions based on a sample generated from the same distribution. In some cases the estimators coincide with the estimators considered in the previous sections. In other cases the estimators are different.

Start with the Binomial distribution. We will be interested in the special case \(X \sim \mathrm{Binomial}(1,p)\). This case involves the outcome of a single trial. The trial has two possible outcomes, one of which is designated as “success” and the other as “failure”. The parameter \(p\) is the probability of success. The \(\mathrm{Binomial}(1,p)\) distribution is also called the Bernoulli distribution. Our concern is the estimation of the parameter \(p\) based on a sample of observations from this Bernoulli distribution.

This estimation problem emerges in many settings that involve the assessment of the probability of an event based on a sample of \(n\) observations. In each observation the event either occurs or not. A natural estimator of the probability of the event is its relative frequency in the sample. Let us show that this estimator can be represented as the sample average of a Bernoulli sample, so that estimating the probability amounts to using a sample average in order to estimate a Bernoulli expectation.

Given an event of interest, one may code a measurement \(X\), associated with an observation, as 1 if the event occurs and as 0 if it does not. Given a sample of size \(n\), one thereby produces \(n\) observations with values 0 or 1: an observation has the value 1 if the event occurs for that observation and the value 0 otherwise.

Notice that \(\Expec(X) = 1 \cdot p = p\). Consequently, the probability of the event is equal to the expectation of the Bernoulli measurement43. It turns out that the parameter one seeks to estimate is the expectation of a Bernoulli measurement. The estimation is based on a sample of size \(n\) of Bernoulli observations.

In Section 10.3 it was proposed to use the sample average as an estimate of the expectation. The sample average is the sum of the observations, divided by the number of observations. In the specific case of a sample of Bernoulli observations, the sum of the observations is a sum of zeros and ones. The zeros do not contribute to the sum. Hence, the sum is equal to the number of times that 1 occurs, namely the frequency of the occurrences of the event. When we divide by the sample size we get the relative frequency of the occurrences. The conclusion is that the sample average of the Bernoulli observations and the relative frequency of occurrences of the event in the sample are the same. Consequently, the sample relative frequency of the event is also a sample average that estimates the expectation of the Bernoulli measurement.
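A minimal numeric sketch (the 0/1 values below are hypothetical indicators of the event):

x <- c(1,0,0,1,1,0,1,0,0,1)   # hypothetical indicators: 1 if the event occurred
mean(x)                       # the sample average ...
sum(x == 1)/length(x)         # ... equals the relative frequency of the event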

We seek to estimate \(p\), the probability of the event. The estimator is the relative frequency of the event in the sample. We denote this estimator by \(\hat P\). This estimator is a sample average of Bernoulli observations that is used in order to estimate the expectation of the Bernoulli distribution. From the discussion in Section 10.3 one may conclude that this estimator is an unbiased estimator of \(p\) (namely, \(\Expec(\hat P) = p\)) and that its variance is equal to:

\[\Var(\hat P) = \Var(X) / n = p(1-p)/n\;,\] where the variance of the measurement is obtained from the formula for the variance of a \(\mathrm{Binomial}(1,p)\) distribution44.
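A simulation sketch of these properties, with the illustrative values \(p = 0.3\) and \(n = 150\), may look as follows:

p <- 0.3
n <- 150
P.hat <- rep(0,10^5)
for(i in 1:10^5) P.hat[i] <- mean(rbinom(n,1,p))   # sample proportion in each sample
mean(P.hat)                                        # should be close to p
var(P.hat)                                         # should be close to p*(1-p)/n
p*(1-p)/n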

The second example of an integer valued random variable that was considered in the first part of the book is the \(\mathrm{Poisson}(\lambda)\) distribution. Recall that \(\lambda\) is the expectation of a Poisson measurement. Hence, one may use the sample average of Poisson observations in order to estimate this parameter.

The first example of a continuous distribution that was discussed in the first part of the book is the \(\mathrm{Uniform}(a,b)\) distribution. This distribution is parameterized by \(a\) and \(b\), the end-points of the interval over which the distribution is defined. A natural estimator of \(a\) is the smallest value observed and a natural estimator of \(b\) is the largest value. One may use the function “min” for the computation of the former estimate from the sample and the function “max” for the computation of the latter. Both estimators are slightly biased but have a relatively small mean square error.
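As a sketch, for a single simulated sample of size 100 from the \(\mathrm{Uniform}(0.5,5.5)\) distribution that was used earlier:

X <- runif(100,0.5,5.5)
min(X)   # estimate of a; tends to fall slightly above 0.5
max(X)   # estimate of b; tends to fall slightly below 5.5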

Next, consider the \(X \sim \mathrm{Exponential}(\lambda)\) random variable. This distribution was applied in this chapter to model the distribution of the prices of cars. The distribution is characterized by the rate parameter \(\lambda\). In order to estimate the rate one may notice the relation between it and the expectation of the measurement:

\[\Expec(X) = 1/\lambda \quad \Longrightarrow \quad \lambda = 1/\Expec(X)\;.\]

The rate is equal to the reciprocal of the expectation. The expectation can be estimated by the sample average. Hence a natural proposal is to use the reciprocal of the sample average as an estimator of the rate:

\[\hat \lambda = 1/ \bar X\;.\]
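For example, applying this estimator to the car prices (a sketch, with the missing values removed from the computation of the average):

1/mean(cars$price,na.rm=TRUE)   # estimated rate of the Exponential model for the price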

The final example that we mention is the \(\mathrm{Normal}(\mu,\sigma^2)\) case. The parameter \(\mu\) is the expectation of the measurement and may be estimated by the sample average \(\bar X\). The parameter \(\sigma^2\) is the variance of a measurement, and can be estimated using the sample variance \(S^2\).
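A minimal sketch for a simulated Normal sample, reusing the parameter values \(\mu = 5\) and \(\sigma^2 = 3\) from the previous section:

X <- rnorm(100,5,sqrt(3))   # a simulated sample of size 100
mean(X)                     # estimate of mu
var(X)                      # estimate of sigma^2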

10.6 Exercises

Exercise 10.1 In Subsection 10.3.2 we compare the average against the mid-range as estimators of the expectation of the measurement. The goal of this exercise is to repeat the analysis, but this time compare the average to the median as estimators of the expectation in symmetric distributions.

  1. Simulate the sampling distribution of average and the median of a sample of size \(n=100\) from the \(\mathrm{Normal}(3,2)\) distribution. Compute the expectation and the variance of the sample average and of the sample median. Which of the two estimators has a smaller mean square error?

  2. Simulate the sampling distribution of average and the median of a sample of size \(n=100\) from the \(\mathrm{Uniform}(0.5,5.5)\) distribution. Compute the expectation and the variance of the sample average and of the sample median. Which of the two estimators has a smaller mean square error?

Exercise 10.2 The goal in this exercise is to assess estimation of a proportion in a population on the basis of the proportion in the sample.

The file “pop2.csv” was introduced in Exercise 7.1 of Chapter 7. This file contains information associated with the blood pressure of an imaginary population of size 100,000. The file can be found on the internet (http://pluto.huji.ac.il/~msby/StatThink/Datasets/pop2.csv). One of the variables in the file is a factor by the name “group” that identifies levels of blood pressure. The levels of this variable are “HIGH”, “LOW”, and “NORMAL”.

The file “ex2.csv” contains a sample of size \(n=150\) taken from the given population. This file can also be found on the internet (http://pluto.huji.ac.il/~msby/StatThink/Datasets/ex2.csv). It contains the same variables as in the file “pop2.csv”. The file “ex2.csv” corresponds in this exercise to the observed sample and the file “pop2.csv” corresponds to the unobserved population.

Download both files to your computer and answer the following questions:

  1. Compute the proportion in the sample of those with a high level of blood pressure45.

  2. Compute the proportion in the population of those with a high level of blood pressure.

  3. Simulate the sampling distribution of the sample proportion and compute its expectation.

  4. Compute the variance of the sample proportion.

  5. It is proposed in Section 10.5 that the variance of the sample proportion is \(\Var(\hat P) = p(1-p)/n\), where \(p\) is the probability of the event (having a high blood pressure in our case) and \(n\) is the sample size (\(n=150\) in our case). Examine this proposal in the current setting.

10.7 Summary

Glossary

Point Estimation:

An attempt to obtain the best guess of the value of a population parameter. An estimator is a statistic that produces such a guess. The estimate is the observed value of the estimator.

Bias:

The difference between the expectation of the estimator and the value of the parameter. An estimator is unbiased if the bias is equal to zero. Otherwise, it is biased.

Mean Square Error (MSE):

A measure of the concentration of the distribution of the estimator about the value of the parameter. The mean square error of an estimator is equal to the sum of the variance and the square of the bias. If the estimator is unbiased then the mean square error is equal to the variance.

Bernoulli Random Variable:

A random variable that obtains the value “1” with probability \(p\) and the value “0” with probability \(1-p\). It coincides with the \(\mathrm{Binomial}(1,p)\) distribution. Frequently, the Bernoulli random variable emerges as the indicator of the occurrence of an event.

Discuss in the forum

Performance of estimators is assessed in the context of a theoretical model for the sampling distribution of the observations. Given a criterion for optimality, an optimal estimator is an estimator that performs better than any other estimator with respect to that criterion. A robust estimator, on the other hand, is an estimator that is not sensitive to misspecification of the theoretical model. Hence, a robust estimator may be somewhat inferior to an optimal estimator in the context of an assumed model. However, if in actuality the assumed model is not a good description of reality then the robust estimator will tend to perform better than the estimator deemed optimal.

Some say that optimal estimators should be preferred while others advocate the use of more robust estimators. What is your opinion?

When you formulate your answer to this question it may be useful to come up with an example from your own field of interest. Think of an estimation problem and possible estimators that can be used in the context of this problem. Try to identify a model that is natural to this problem and ask yourself in what ways this model may err in its attempt to describe the real situation in the estimation problem.

As an example consider estimation of the expectation of a Uniform measurement. We demonstrated that the mid-range estimator is better than the sample average if indeed the measurements emerge from the Uniform distribution. However, if the modeling assumption is wrong then this may no longer be the case. If the distribution of the measurement is in actuality not symmetric, or if it is more concentrated in the center than in the tails (as is the case for the Normal distribution), then the performance of the mid-range estimator may deteriorate. The sample average, on the other hand, is not sensitive to the distribution not being symmetric.

Formulas:

  • Bias: \(\mathrm{Bias} = \Expec(\hat \theta) - \theta\).

  • Variance: \(\Var(\hat \theta) = \Expec\big[(\hat \theta - \Expec(\hat\theta))^2\big]\).

  • Mean Square Error: \(\mathrm{MSE} = \Expec\big[(\hat \theta - \theta)^2\big]\).


  33. The name of the argument stands for “NA remove”. If the value of the argument is set to “TRUE” then the missing values are removed in the computation of the average. Consequently, the average is computed for the sub-sequence of non-missing values. The default specification of the argument in the definition of the function is “na.rm=FALSE”, which implies a missing value for the mean when computed on a sequence that contains missing values.

  34. As a matter of fact, the difference is the probability of falling in the half-open interval \((12000,14000]\). However, for continuous distributions the probability of the end-points is zero and they do not contribute to the probability of the interval.

  35. As a matter of fact, the variance of the sample average is exactly \(\Var(X)/100 = 0.02\). Due to the inaccuracy of the simulation we got a slightly different variance.

  36. Observe that the middle range of the \(\mathrm{Uniform}(a,b)\) distribution, the middle point between the maximum value of the distribution \(b\) and the minimal value \(a\), is \((a+b)/2\), which is equal to the expectation of the distribution.

  37. Actually, the exact value of the variance of the sample average is \(\Var(X)/100 = 0.02083333\). The results of the simulation are consistent with this theoretical computation.

  38. For the estimator \(S^2\) we get that \(\Expec(S^2) = \Var(X)\). On the other hand, for the alternative estimator we get that \(\Expec([(n-1)/n]\cdot S^2) = [(n-1)/n]\Var(X) \not = \Var(X)\). This statement holds true also in the cases where the distribution of the measurement is not Normal.

  39. As part of your homework assignment you are required to investigate the properties of \(S\), the square root of \(S^2\), as an estimator of the standard deviation of the measurement. A conclusion of this investigation is that \(S\) is a biased estimator of the standard deviation.

  40. The letter \(\theta\) is frequently used in the statistical literature to denote a parameter of the distribution. In the previous section we considered \(\theta = \Expec(X)\) and in this section we consider \(\theta=\Var(X)\).

  41. Observe that we diverge here slightly from our promise to use capital letters to denote random variables. However, denoting the parameter by \(\theta\) and denoting the estimator of the parameter by \(\hat \theta\) is standard in the statistical literature. As a matter of fact, we will use the “hat” notation, where a hat is placed over a Greek letter that represents the parameter, in other places in this book. The letter with the hat on top will represent the estimator and will always be considered as a random variable. For Latin letters we will still use capital letters, with or without a hat, to represent a random variable and small letters to represent evaluations of the random variable for given data.

  42. It can be shown mathematically that \(\Expec([(n-1)/n] S^2) = [(n-1)/n] \Expec(S^2)\). Consequently, the actual value of the expectation of the alternative estimator in the current setting is \([19/20]\cdot 3 = 2.85\) and the bias is \(-0.15\). The results of the simulation are consistent with this fact.

  43. The expectation of \(X\sim\mathrm{Binomial}(n,p)\) is \(\Expec(X)=np\). In the Bernoulli case \(n=1\). Therefore, \(\Expec(X) = 1\cdot p = p\).

  44. The variance of \(X\sim\mathrm{Binomial}(n,p)\) is \(\Var(X)=np(1-p)\). In the Bernoulli case \(n=1\). Therefore, \(\Var(X) = 1\cdot p(1-p) = p(1-p)\).

  45. Hint: You may use the function summary or you may note that the expression “variable==level” produces a sequence with logical “TRUE” or “FALSE” entries that identify entries in the sequence “variable” that obtain the value “level”.