### Overview

This topic relates to what we do in the analysis of discrete data in the following ways...

### Review Topics

The following are the topics that are covered in this lesson:

- Review of Important Concepts
- Loglikelihood Functions
- Binomial Loglikelihood Functions
- Poisson Loglikelihood Functions
- Asymptotic Confidence Intervals
- Example: Online-Class Exercise
- Another Example
- Observed and Expected Information
- Bernoulli Asymptotic Confidence Intervals
- Binomial Asymptotic Confidence Intervals
- Poisson Asymptotic Confidence Intervals
- Alternative Parameterizations
- Intervals Based on the Likelihood Ratio

**Review of Important Concepts**

Let $X_1, X_2, \ldots, X_n$ be a simple random sample from a probability distribution $f(x; \theta)$.

A **parameter** θ of $f(x; \theta)$ is a variable that is a characteristic of $f(x; \theta)$.

A **statistic** $T$ is any quantity that can be calculated from a sample; it's a function of $X_1, \ldots, X_n$.

An **estimate** $\hat{\theta}$ for θ is a single number that is a reasonable value for θ.

An **estimator** $\hat{\theta}$ for θ is a statistic that gives the formula for computing the estimate.

The **likelihood** of the sample is the joint PDF (or PMF)

$$L(\theta) = f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta).$$

The **maximum likelihood estimate** (MLE) $\hat{\theta}$ maximizes $L(\theta)$:

$$L(\hat{\theta}) \ge L(\theta) \quad \text{for all } \theta.$$

If we use $X_i$'s instead of $x_i$'s, then $\hat{\theta}$ is the maximum likelihood estimator. Usually, the MLE:

- is unbiased, $E(\hat{\theta}) = \theta$
- is consistent, $\hat{\theta} \to \theta$ as $n \to \infty$
- is efficient, has small $SE(\hat{\theta})$ as $n \to \infty$
- is asymptotically normal, $(\hat{\theta} - \theta)/SE(\hat{\theta}) \approx N(0, 1)$

The **loglikelihood function** is defined to be the natural
logarithm of the likelihood function,

*l*(θ ; *x*) = log*L*(θ ; *x*).

For a variety of reasons, statisticians often work with
the loglikelihood rather than with the likelihood. One
reason is that *l*(θ ; *x*) tends to be a simpler function
than *L*(θ ; *x*). When we take logs, products are
changed to sums. If $X = (X_1, X_2, \ldots, X_n)$ is an *iid* sample from a probability distribution $f(x; \theta)$, the overall likelihood is the product of the likelihoods for the individual $X_i$'s:

$$L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta).$$

The loglikelihood, however, is the sum of the individual loglikelihoods:

$$l(\theta; x) = \sum_{i=1}^{n} \log f(x_i; \theta).$$

Next, we will go through some examples of loglikelihood functions associated with different distributions.

### Binomial Loglikelihood Functions

Suppose $X \sim \text{Bin}(n, p)$ where $n$ is known.
The likelihood function is

$$L(p; x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x},$$

and so the loglikelihood function is

$$l(p; x) = k + x \log p + (n - x) \log(1 - p),$$

where $k$ is a constant that does not involve the
parameter $p$. In the future we will omit the constant,
because it has no effect on inference about $p$.
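As a quick numerical check that the constant does not matter, the following R sketch compares the loglikelihood without $k$ to the full binomial log-probability returned by `dbinom`. The values `x = 1` and `n = 4` and the helper name `loglik_kernel` are illustrative assumptions, not part of the lesson.

```r
# Binomial loglikelihood with the constant k dropped.
loglik_kernel <- function(p, x, n) x * log(p) + (n - x) * log(1 - p)

x <- 1; n <- 4                       # illustrative values
p <- seq(0.05, 0.95, by = 0.05)

# Full loglikelihood, which includes k = log(choose(n, x)).
full <- dbinom(x, size = n, prob = p, log = TRUE)

# The gap between the two is the same at every p, so it cannot move the maximum.
k <- full - loglik_kernel(p, x, n)
range(k)
lchoose(n, x)
```

Both printed quantities should agree, which is why dropping $k$ leaves the shape of the curve unchanged.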

### Poisson Loglikelihood Functions

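Suppose $X = (X_1, X_2, \ldots, X_n)$ is an *iid* sample from a Poisson distribution with parameter λ. The likelihood is

$$L(\lambda; x) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum_{i} x_i}\, e^{-n\lambda}}{x_1!\, x_2! \cdots x_n!},$$

and the loglikelihood is

$$l(\lambda; x) = \sum_{i=1}^{n} x_i \log \lambda - n\lambda,$$

ignoring the constant terms that do not depend on λ.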
As the above examples show, *l*(θ ; *x*) often looks nicer
than *L*(θ ; *x*) because the products become sums and
the exponents become multipliers.
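To make the Poisson case concrete, here is a minimal R sketch that evaluates the loglikelihood (constants dropped) and maximizes it numerically. The counts in `x` are made up for illustration, and the search interval passed to `optimize` is an arbitrary choice.

```r
# Hypothetical Poisson counts, invented for this illustration only.
x <- c(2, 0, 3, 1, 4, 2, 1)
n <- length(x)

# Poisson loglikelihood without the terms that do not involve lambda.
loglik <- function(lambda) sum(x) * log(lambda) - n * lambda

# Numerical maximization over a bracketing interval.
fit <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)
fit$maximum   # numerical maximizer
mean(x)       # sample mean, the closed-form maximizer of this function
```

The numerical maximizer should match the sample mean up to the tolerance of `optimize`.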

### Asymptotic Confidence Intervals

Loglikelihood forms the basis for many approximate
confidence intervals and hypothesis tests because it
behaves in a predictable manner as the sample size
grows. The following example illustrates what
happens to *l*(θ ; *x*) as *n* becomes large.

#### Example: Online-Class Exercise

Suppose that we observe $X = 1$ from a binomial distribution with $n = 4$ and $p$ unknown. Calculate the loglikelihood. What does the graph of the loglikelihood look like? Find the MLE. (Do you understand the difference between the estimator and the estimate?) Locate the MLE on the graph of the loglikelihood.

The MLE is $\hat{p} = 1/4 = .25$. Ignoring constants, the loglikelihood is

$$l(p; x) = \log p + 3 \log(1 - p).$$

Here is sample code for plotting this function in R:

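```r
# Grid of p values from .01 to .80
p <- seq(from = .01, to = .80, by = .01)
# Binomial loglikelihood for x = 1, n = 4, with the constant dropped
loglik <- log(p) + 3 * log(1 - p)
plot(p, loglik, xlab = "p", ylab = "loglikelihood", type = "l", xlim = c(0, 1))
```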

For clarity I omitted from the plot all values of $p$ beyond .8, because for $p > .8$ the loglikelihood drops
down so low that including these values of $p$ would
distort the plot's appearance. When plotting
loglikelihoods, we don't need to include all θ values in
the parameter space; in fact, it's a good idea to limit
the domain to those θ's for which the loglikelihood is
no more than 2 or 3 units below the maximum value
$l(\hat{\theta}; x)$ because, in a single-parameter problem, any θ whose loglikelihood is more than 2 or 3 units below
the maximum is highly implausible.

Now suppose that we observe $X = 10$ from a binomial
distribution with $n = 40$. The MLE is again $\hat{p} = 10/40 = .25$, but the loglikelihood is

$$l(p; x) = 10 \log p + 30 \log(1 - p).$$

Finally, suppose that we observe $X = 100$ from a
binomial with $n = 400$. The MLE is still $\hat{p} = 100/400 = .25$, but the loglikelihood is now

$$l(p; x) = 100 \log p + 300 \log(1 - p).$$

As $n$ gets larger, two things are happening to the
loglikelihood. First, $l(p; x)$ is becoming more sharply
peaked around $\hat{p}$. Second, $l(p; x)$ is becoming more
symmetric about $\hat{p}$.
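The three binomial loglikelihoods in this example can be overlaid to see both effects. The sketch below is one possible way to do it in R; subtracting each curve's maximum so that all three peak at zero is a plotting choice made here, not something the lesson prescribes.

```r
# Loglikelihood (constant dropped) for x successes in n trials.
loglik <- function(p, x, n) x * log(p) + (n - x) * log(1 - p)

p <- seq(0.01, 0.80, by = 0.001)
cases <- list(c(x = 1, n = 4), c(x = 10, n = 40), c(x = 100, n = 400))

# One column per sample size, each shifted so that its maximum is 0.
rel <- sapply(cases, function(cs) {
  ll <- loglik(p, cs[["x"]], cs[["n"]])
  ll - max(ll)
})

matplot(p, rel, type = "l", lty = 1:3, col = 1, ylim = c(-3, 0),
        xlab = "p", ylab = "loglikelihood minus its maximum")
abline(v = 0.25, lty = 3)   # the common MLE
legend("topright", legend = c("n = 4", "n = 40", "n = 400"), lty = 1:3)

# Width of the range of p within 2 units of the maximum, for each n.
widths <- apply(rel, 2, function(r) diff(range(p[r >= -2])))
widths
```

The widths shrink as the sample size grows, which is the "more sharply peaked" behavior described above.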

__The first point shows that as the sample size grows,
we are becoming more confident that the true
parameter lies close to $\hat{p}$.__ If the loglikelihood is highly
peaked (that is, if it drops sharply as we move away
from the MLE), then the evidence is strong that $p$ is near $\hat{p}$. A flatter loglikelihood, on the other hand, means that $p$ is not well estimated and the range of plausible values is wide. In fact, the curvature of the loglikelihood (i.e. the second derivative of $l(\theta; x)$ with respect to θ) is an important measure of statistical information about θ.

__The second point, that the loglikelihood function
becomes more symmetric about the MLE as the
sample size grows, forms the basis for constructing
asymptotic (large-sample) confidence intervals for the
unknown parameter.__ In a wide variety of problems, as
the sample size grows the loglikelihood approaches a
quadratic function (i.e. a parabola) centered at the
MLE.

The parabola is significant because that is the shape of the loglikelihood from the normal distribution.

If we had a random sample of any size from a normal
distribution with known variance $\sigma^2$ and unknown
mean μ, the loglikelihood would be a perfect parabola
centered at the sample mean $\bar{x}$.

From elementary statistics, we know that if we have a
sample from a normal distribution with known
variance $\sigma^2$, a 95% confidence interval for the mean μ
is

$$\bar{x} \pm 1.96\, \frac{\sigma}{\sqrt{n}}. \qquad (1)$$

The quantity σ/√n is called the standard error; it measures the variability of the sample mean about the true mean μ. The number 1.96 comes from a table of the standard normal distribution; the area under the standard normal density curve between −1.96 and 1.96 is .95 or 95%.

The confidence interval (1) is valid because over repeated samples the estimate $\bar{x}$ is normally distributed about the true value μ with a standard deviation of σ/√n.

There is much confusion about how to interpret a confidence interval (CI).

A CI is NOT a probability statement about θ, since θ is a fixed value, not a random variable (unless you are doing a Bayesian analysis and interpreting a credible interval).

**One interpretation:** if we took many samples, most of
our intervals would capture the true parameter (e.g. 95%
of our intervals would contain the true parameter).
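The repeated-sampling interpretation can be illustrated by simulation. The R sketch below draws many normal samples with known σ, forms interval (1) for each, and records how often the interval contains the true mean. The "true" values `mu`, `sigma`, `n`, the number of replications, and the seed are all made up for illustration.

```r
set.seed(504)                    # arbitrary seed, for reproducibility
mu <- 10; sigma <- 2; n <- 25    # hypothetical true mean, known sd, sample size
reps <- 10000

covered <- replicate(reps, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  half <- 1.96 * sigma / sqrt(n)            # half-width of interval (1)
  (xbar - half <= mu) && (mu <= xbar + half)
})

mean(covered)   # proportion of simulated intervals that contain mu
```

Each individual interval either contains μ or it does not; the 95% describes the long-run proportion across samples.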

### Another Example

A nationwide telephone poll of 1118 adults was conducted by NY Times/CBS News between Jan. 14 and 18. About 58% of respondents felt optimistic about the next four years. The results are reported with a margin of error of 3%.
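The reported margin of error can be roughly reproduced in R. This sketch assumes that the margin of error is the half-width of a 95% normal-approximation interval for a proportion under simple random sampling; the source does not say how the pollsters actually computed it.

```r
n <- 1118        # number of adults polled
p_hat <- 0.58    # reported proportion feeling optimistic

# Assumed formula: normal-approximation standard error of a sample proportion.
se <- sqrt(p_hat * (1 - p_hat) / n)
moe <- qnorm(0.975) * se   # half-width of a 95% interval
round(moe, 3)
```

Under that assumption the half-width comes out close to the quoted 3%.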

In STAT 504, the parameter of interest will not be the mean of a normal population, but some other parameter θ pertaining to a discrete probability distribution.

We will often estimate the parameter by its MLE $\hat{\theta}$.

Because in large samples the loglikelihood
function $l(\theta; x)$ approaches a parabola centered at $\hat{\theta}$,
we will be able to use a method similar to (1) to form
approximate confidence intervals for θ.

Just as $\bar{x}$ is normally distributed about μ, $\hat{\theta}$ is approximately normally distributed about θ in large samples.

This property is called the “asymptotic normality of the MLE,” and the technique of forming confidence intervals is called the “asymptotic normal approximation.” This method works for a wide variety of statistical models, including all the models that we will use in this course.

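The asymptotic normal 95% confidence interval for a parameter θ has the form

$$\hat{\theta} \pm 1.96\, \frac{1}{\sqrt{-l''(\hat{\theta}; x)}}, \qquad (2)$$

where $l''(\hat{\theta}; x)$ is the second derivative of the
loglikelihood function with respect to θ, evaluated at
$\theta = \hat{\theta}$.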

Of course, we can also form intervals with confidence coefficients other than 95%.

All we need to do is to replace 1.96 in (2) by *z*, a value
from a table of the standard normal distribution,
where ± *z* encloses the desired level of confidence.

If we wanted a 90% confidence interval, for example, we would use 1.645.
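Instead of a table, the multiplier $z$ can be obtained from the standard normal quantile function. A short R sketch (the confidence levels listed are just examples):

```r
conf <- c(0.90, 0.95, 0.99)

# Two-sided interval: leave (1 - conf)/2 in each tail.
z <- qnorm(1 - (1 - conf) / 2)
round(z, 3)
```

The first two entries correspond to the 1.645 and 1.96 used in the text.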

### Observed and Expected Information

The quantity $-l''(\hat{\theta}; x)$ is called the “observed
information,” and $1/\sqrt{-l''(\hat{\theta}; x)}$ is an approximate
standard error for $\hat{\theta}$. As the loglikelihood becomes
more sharply peaked about the MLE, the second
derivative becomes more negative and the standard error goes down.

When calculating asymptotic confidence intervals,
statisticians often replace the second derivative of the
loglikelihood by its expectation; that is, replace $-l''(\theta; x)$ by the function

$$I(\theta) = -E\left[\, l''(\theta; x) \,\right],$$

which is called the expected information or the Fisher information. In that case, the 95% confidence interval would become

$$\hat{\theta} \pm 1.96\, \frac{1}{\sqrt{I(\hat{\theta})}}. \qquad (3)$$

When the sample size is large, the two confidence intervals (2) and (3) tend to be very close. In some problems, the two are identical.

What follows are a few examples of asymptotic confidence intervals.

### Bernoulli Asymptotic Confidence Intervals

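If $X$ is Bernoulli with success probability $p$, the loglikelihood is

$$l(p; x) = x \log p + (1 - x) \log(1 - p),$$

the first derivative is

$$l'(p; x) = \frac{x}{p} - \frac{1 - x}{1 - p} = \frac{x - p}{p(1 - p)},$$

and the second derivative is

$$l''(p; x) = -\frac{x}{p^2} - \frac{1 - x}{(1 - p)^2} = -\frac{(x - p)^2}{p^2 (1 - p)^2}$$

(to derive this, use the fact that $x^2 = x$). Because

$$E\left[(x - p)^2\right] = V(x) = p(1 - p),$$

the Fisher information is

$$I(p) = \frac{1}{p(1 - p)}.$$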
Of course, a single Bernoulli trial does not provide
enough information to get a reasonable confidence
interval for *p*. Let's see what happens when we have
multiple trials.

### Binomial Asymptotic Confidence Intervals

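If $X \sim \text{Bin}(n, p)$, then the loglikelihood is

$$l(p; x) = x \log p + (n - x) \log(1 - p),$$

the first derivative is

$$l'(p; x) = \frac{x}{p} - \frac{n - x}{1 - p} = \frac{x - np}{p(1 - p)},$$

the second derivative is

$$l''(p; x) = -\frac{x}{p^2} - \frac{n - x}{(1 - p)^2},$$

and the Fisher information is

$$I(p) = \frac{n}{p(1 - p)}.$$

Thus an approximate 95% confidence interval for $p$ based on the Fisher information is

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \qquad (4)$$

where $\hat{p} = x/n$ is the MLE.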

Notice that the Fisher information for the $\text{Bin}(n, p)$
model is $n$ times the Fisher information from a single
Bernoulli trial. This is a general principle; if we
observe a sample of size $n$,

$$X = (X_1, X_2, \ldots, X_n),$$

where $X_1, X_2, \ldots, X_n$ are independent random
variables, then the Fisher information from $X$ is the
sum of the Fisher information functions from the
individual $X_i$'s. If $X_1, X_2, \ldots, X_n$ are iid, then the Fisher information from $X$ is $n$ times the Fisher information from a single observation $X_i$.

What happens if we use the observed information
rather than the expected information? Evaluating the
second derivative $l''(p; x)$ at the MLE $\hat{p} = x/n$ gives

$$l''(\hat{p}; x) = -\frac{n}{\hat{p}(1 - \hat{p})},$$

so the 95% interval based on the observed information is identical to (4).
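The agreement between observed and expected information at the MLE can be checked numerically. The R sketch below approximates the second derivative of the binomial loglikelihood by a central finite difference; the data values `x = 10`, `n = 40` and the step size `h` are illustrative choices.

```r
x <- 10; n <- 40                              # illustrative data
loglik <- function(p) x * log(p) + (n - x) * log(1 - p)
p_hat <- x / n

# Central finite-difference approximation to l''(p_hat).
h <- 1e-4
d2 <- (loglik(p_hat + h) - 2 * loglik(p_hat) + loglik(p_hat - h)) / h^2

obs_info <- -d2                               # observed information
exp_info <- n / (p_hat * (1 - p_hat))         # Fisher information at p_hat
c(observed = obs_info, expected = exp_info)
```

The two numbers should agree up to the error of the finite-difference approximation, so both lead to the same standard error.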

Unfortunately, Agresti (2002, p. 15) points out that
the interval (4) performs poorly unless *n* is very large;
the actual coverage can be considerably less than the
nominal rate of 95%.

The confidence interval (4) has two unusual features:

- The endpoints can stray outside the parameter space; that is, one can get a lower limit less than 0 or an upper limit greater than 1.
- If we happen to observe no successes ($x = 0$) or no failures ($x = n$), the interval becomes degenerate (has zero width) and misses the true parameter $p$. This unfortunate event becomes quite likely when the actual $p$ is close to zero or one.
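The poor coverage of interval (4) can be seen by computing its exact coverage probability: for a given true $p$, add up the binomial probabilities of every outcome $x$ whose interval contains $p$. The following R sketch does this; the function name `wald_coverage` and the particular values of `n` and `p` are assumptions made for illustration.

```r
# Exact coverage of interval (4) for a given n and true p.
wald_coverage <- function(n, p, z = qnorm(0.975)) {
  x <- 0:n
  p_hat <- x / n
  half <- z * sqrt(p_hat * (1 - p_hat) / n)   # zero when x = 0 or x = n
  covers <- (p_hat - half <= p) & (p <= p_hat + half)
  sum(dbinom(x, size = n, prob = p)[covers])
}

cov_small <- wald_coverage(n = 20, p = 0.10)
cov_large <- wald_coverage(n = 400, p = 0.10)
c(n20 = cov_small, n400 = cov_large)
```

With a small sample and $p$ near the boundary, the coverage falls noticeably short of the nominal 95%.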

A variety of fixes are available. One ad hoc fix, which can work surprisingly well, is to replace $\hat{p}$ by

$$\tilde{p} = \frac{x + .5}{n + 1},$$

which is equivalent to adding half a success and half a failure; that keeps the interval from becoming degenerate. To keep the endpoints within the parameter space, we can express the parameter on a different scale, such as the log-odds

$$\phi = \log \frac{p}{1 - p},$$

which we will discuss later.

### Poisson Asymptotic Confidence Intervals

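If $X = (X_1, X_2, \ldots, X_n)$ is an iid sample
from a Poisson distribution with parameter λ, the
loglikelihood is

$$l(\lambda; x) = \sum_{i=1}^{n} x_i \log \lambda - n\lambda,$$

the first derivative is

$$l'(\lambda; x) = \frac{\sum_{i} x_i}{\lambda} - n,$$

the second derivative is

$$l''(\lambda; x) = -\frac{\sum_{i} x_i}{\lambda^2},$$

and the Fisher information is

$$I(\lambda) = \frac{n}{\lambda}.$$

An approximate 95% interval based on the observed or expected information is

$$\hat{\lambda} \pm 1.96 \sqrt{\frac{\hat{\lambda}}{n}},$$

where $\hat{\lambda} = \sum_{i} x_i / n$ is the MLE.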

Once again, this interval may not perform well in some circumstances; we can often get better results by changing the scale of the parameter.
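A minimal R sketch of the Poisson interval, using made-up counts (the vector `x` is hypothetical and only serves to show the arithmetic):

```r
x <- c(2, 0, 3, 1, 4, 2, 1)     # hypothetical Poisson counts
n <- length(x)

lambda_hat <- mean(x)            # MLE of lambda
se <- sqrt(lambda_hat / n)       # 1 / sqrt(I(lambda_hat)), with I(lambda) = n / lambda
ci <- lambda_hat + c(-1, 1) * qnorm(0.975) * se
ci
```

With so few observations the lower endpoint can sit close to zero, which is one of the circumstances where a change of scale may help.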

### Alternative Parameterizations

Statistical theory tells us that if *n* is large enough, the
true coverage of the approximate intervals (2) or (3)
will be very close to 95%. How large *n* must be in
practice depends on the particulars of the problem.
Sometimes an approximate interval performs poorly
because the loglikelihood function doesn't closely
resemble a parabola. If so, we may be able to improve
the quality of the approximation by applying a
suitable *reparameterization*, a transformation of the
parameter to a new scale. Here is an example.

Suppose we observe $X = 2$ from a binomial
distribution $\text{Bin}(20, p)$. The MLE is $\hat{p} = 2/20 = .10$
and the loglikelihood is not very symmetric about it.

This asymmetry arises because $\hat{p}$ is close to the
boundary of the parameter space. We know that $p$ must lie between zero and one. When $\hat{p}$ is close to
zero or one, the loglikelihood tends to be more skewed
than it would be if $\hat{p}$ were near .5. The usual 95%
confidence interval is

$$.10 \pm 1.96 \sqrt{\frac{.10 \times .90}{20}},$$

or (−.031, .231), which strays outside the parameter space.
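The arithmetic behind this interval can be reproduced in a few lines of R, using the same multiplier 1.96 as the text:

```r
x <- 2; n <- 20
p_hat <- x / n

# Interval (4): p_hat plus or minus 1.96 standard errors.
ci <- p_hat + c(-1, 1) * 1.96 * sqrt(p_hat * (1 - p_hat) / n)
round(ci, 3)
```

The negative lower endpoint is what "strays outside the parameter space" refers to.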

The “logistic” or “logit” transformation is defined as

$$\phi = \log \frac{p}{1 - p}. \qquad (6)$$

The logit is also called the “log odds,” because $p/(1 - p)$ is the odds associated with $p$. Whereas $p$ is
a proportion and must lie between 0 and 1, φ may
take any value from −∞ to +∞, so the logit
transformation solves the problem of a sharp
boundary in the parameter space.

Solving (6) for $p$ produces the back-transformation

$$p = \frac{e^{\phi}}{1 + e^{\phi}}. \qquad (7)$$

Let's rewrite the binomial loglikelihood in terms of φ:

$$l(\phi; x) = x \log p + (n - x) \log(1 - p) = x \log \frac{p}{1 - p} + n \log(1 - p) = x\phi - n \log\left(1 + e^{\phi}\right).$$

A graph of the loglikelihood $l(\phi; x)$ versus φ shows the effect of the transformation.
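The graph on the φ scale can be reproduced with the following R sketch for the example $x = 2$, $n = 20$. It uses the identity $l(\phi; x) = x\phi - n\log(1 + e^{\phi})$, which follows from substituting the back-transformation (7) into the binomial loglikelihood; the plotting range for φ is an arbitrary choice.

```r
x <- 2; n <- 20

# Binomial loglikelihood on the logit scale (constant dropped).
loglik_phi <- function(phi) x * phi - n * log1p(exp(phi))

phi <- seq(-5, 0.5, by = 0.01)
plot(phi, loglik_phi(phi), type = "l", xlab = "phi", ylab = "loglikelihood")
abline(v = qlogis(x / n), lty = 2)   # MLE on the phi scale, log(.1/.9)

# Sanity check: same value as the p-scale loglikelihood at the matching p.
p <- plogis(-1)
check <- c(loglik_phi(-1), x * log(p) + (n - x) * log(1 - p))
check
```

`qlogis` and `plogis` are R's built-in logit and inverse-logit functions.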

It's still skewed, but not quite as sharply as before.
This plot strongly suggests that an asymptotic
confidence interval constructed on the φ scale will be
more accurate in coverage than an interval
constructed on the *p* scale. An approximate 95%
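confidence interval for φ is

$$\hat{\phi} \pm 1.96\, \frac{1}{\sqrt{I(\hat{\phi})}},$$

where $\hat{\phi}$ is the MLE of φ, and $I(\phi)$ is the Fisher
information for φ. To find the MLE for φ, all we need
to do is apply the logit transformation to $\hat{p}$:

$$\hat{\phi} = \log \frac{\hat{p}}{1 - \hat{p}}.$$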
Assuming for a moment that we know the Fisher
information for φ, we can calculate this 95%
confidence interval for φ. Then, because our interest
is not really in φ but in *p*, we can transform the
endpoints of the confidence interval back to the $p$ scale. This new confidence interval for $p$ will not be
exactly symmetric (i.e. $\hat{p}$ will not lie exactly in the
center of it), but the coverage of this procedure
should be closer to 95% than for intervals computed
directly on the $p$-scale.

The general method for reparameterization is as follows.

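First, we choose a transformation φ = φ(θ) for which we think the loglikelihood will be symmetric.

Then we calculate $\hat{\theta}$, the MLE for θ, and transform it to the φ scale,

$$\hat{\phi} = \phi(\hat{\theta}).$$

Next we need to calculate $I(\hat{\phi})$, the Fisher information for φ. It turns out that this is given by

$$I(\hat{\phi}) = \frac{I(\hat{\theta})}{\left[\phi'(\hat{\theta})\right]^2}, \qquad (8)$$

where $\phi'(\hat{\theta})$ is the first derivative of φ with respect to θ, evaluated at $\hat{\theta}$. Then the endpoints of a 95% confidence interval for φ are:

$$\phi_{\text{low}} = \hat{\phi} - 1.96 \sqrt{\frac{1}{I(\hat{\phi})}}, \qquad \phi_{\text{high}} = \hat{\phi} + 1.96 \sqrt{\frac{1}{I(\hat{\phi})}}.$$

The approximate 95% confidence interval for φ is $[\phi_{\text{low}}, \phi_{\text{high}}]$. The corresponding confidence interval for θ is obtained by transforming $\phi_{\text{low}}$ and $\phi_{\text{high}}$ back to the original θ scale. A few common transformations are shown in Table 1, along with their back-transformations and derivatives.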

**Table 1**: Some common transformations, their back
transformations, and derivatives.

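Going back to the binomial example with $n = 20$ and $X = 2$, let's form a 95% confidence interval for
$\phi = \log[p/(1 - p)]$. The MLE for $p$ is $\hat{p} = 2/20 = .10$,
so the MLE for φ is $\hat{\phi} = \log(.1/.9) = -2.197$. Using
the derivative of the logit transformation, $\phi'(p) = 1/[p(1 - p)]$, the Fisher information for φ is

$$I(\phi) = \frac{I(p)}{\left[\phi'(p)\right]^2} = \frac{n}{p(1 - p)} \left[p(1 - p)\right]^2 = n p (1 - p).$$

Evaluating it at the MLE gives

$$I(\hat{\phi}) = 20 \times 0.1 \times 0.9 = 1.8.$$

The endpoints of the 95% confidence interval for φ are

$$\phi_{\text{low}} = -2.197 - 1.96 \sqrt{\frac{1}{1.8}} = -3.658, \qquad \phi_{\text{high}} = -2.197 + 1.96 \sqrt{\frac{1}{1.8}} = -0.736,$$

and the corresponding endpoints of the confidence interval for $p$ are

$$p_{\text{low}} = \frac{e^{-3.658}}{1 + e^{-3.658}} = .025, \qquad p_{\text{high}} = \frac{e^{-0.736}}{1 + e^{-0.736}} = .324.$$

The MLE $\hat{p} = .10$ is not exactly in the middle of this
interval, but who says that a confidence interval must
be symmetric about the point estimate?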
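The general reparameterization recipe can be wrapped in a small R function and applied to this example. The function name `reparam_ci` and its argument names are hypothetical, chosen here for illustration; the steps follow equation (8) and the back-transformation (7).

```r
# Interval on a transformed scale, then mapped back to the original scale.
# All names below are illustrative, not from the lesson.
reparam_ci <- function(theta_hat, info_theta, phi, dphi, phi_inv, z = 1.96) {
  phi_hat  <- phi(theta_hat)                         # MLE on the phi scale
  info_phi <- info_theta / dphi(theta_hat)^2         # equation (8)
  ends_phi <- phi_hat + c(-1, 1) * z / sqrt(info_phi)
  list(phi = ends_phi, theta = phi_inv(ends_phi))
}

x <- 2; n <- 20; p_hat <- x / n
ci <- reparam_ci(
  theta_hat  = p_hat,
  info_theta = n / (p_hat * (1 - p_hat)),            # binomial Fisher information
  phi        = function(p) log(p / (1 - p)),         # logit
  dphi       = function(p) 1 / (p * (1 - p)),        # derivative of the logit
  phi_inv    = function(f) exp(f) / (1 + exp(f))     # back-transformation (7)
)
round(ci$phi, 3)
round(ci$theta, 3)
```

Because the back-transformation maps every real number into (0, 1), the endpoints on the $p$ scale cannot leave the parameter space.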

### Intervals Based on the Likelihood Ratio

Another way to form a confidence interval for a single
parameter is to find all values of θ for which the
loglikelihood $l(\theta; x)$ is within a given tolerance of the
maximum value $l(\hat{\theta}; x)$. Statistical theory tells us
that, if $\theta_0$ is the true value of the parameter, then the
likelihood-ratio statistic

$$2\left[\, l(\hat{\theta}; x) - l(\theta_0; x) \,\right] \qquad (9)$$

is approximately distributed as $\chi^2_1$
when the sample
size $n$ is large. This gives rise to the well-known
likelihood-ratio (LR) test.

In the LR test of the null hypothesis

*H*_{0} : θ = θ_{0}

versus the two-sided alternative

*H*_{1} : θ ≠ θ_{0},

we would reject $H_0$ at the α-level if the LR statistic
(9) exceeds the 100(1 − α)th percentile of the $\chi^2_1$
distribution. That is, for an α = .05-level test, we
would reject $H_0$ if the LR statistic is greater than
3.84.
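The critical value can be looked up in R with the chi-square quantile function. The comparison with the squared normal cutoff is added here as a cross-check; it relies on the fact that the square of a standard normal variable has a chi-square distribution with one degree of freedom.

```r
crit <- qchisq(0.95, df = 1)   # 95th percentile of chi-square with 1 df
crit                           # about 3.84
crit / 2                       # about 1.92: the allowed drop in loglikelihood
qnorm(0.975)^2                 # same as crit: the square of the normal cutoff 1.96
```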

The LR testing principle can also be used to construct
confidence intervals. An approximate 100(1 − α)%
confidence interval for θ consists of all the possible
θ_{0}'s for which the null hypothesis *H*_{0} : θ = θ_{0} would
not be rejected at the α level. For a 95% interval, the
interval would consist of all the values of θ for which

$$2\left[\, l(\hat{\theta}; x) - l(\theta; x) \,\right] \le 3.84$$

or

$$l(\theta; x) \ge l(\hat{\theta}; x) - 1.92.$$

In other words, the 95% interval includes all values of θ for which the loglikelihood function drops off by no more than 1.92 units.

Returning to our binomial example, suppose that we
observe $X = 2$ from a binomial distribution with $n = 20$ and $p$ unknown. The loglikelihood function is

$$l(p; x) = 2 \log p + 18 \log(1 - p),$$

the MLE is $\hat{p} = x/n = .10$, and the maximized
loglikelihood is

$$l(\hat{p}; x) = 2 \log .1 + 18 \log .9 = -6.50.$$

On a plot of the loglikelihood against $p$, draw a horizontal line at the loglikelihood value −6.50 − 1.92 = −8.42.

The horizontal line intersects the loglikelihood curve
at *p* = .018 and *p* = .278. Therefore, the LR
confidence interval for *p* is (.018, .278).
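Rather than reading the crossing points off a graph, they can be found numerically. The R sketch below uses `uniroot` to locate where the loglikelihood equals the cutoff on each side of the MLE; the bracketing intervals and tolerance are choices made here. The results should land close to the endpoints quoted above, with small differences attributable to rounding and to reading values off a plot.

```r
x <- 2; n <- 20
loglik <- function(p) x * log(p) + (n - x) * log(1 - p)
p_hat <- x / n

# Cutoff: maximized loglikelihood minus half the chi-square critical value.
cutoff <- loglik(p_hat) - qchisq(0.95, df = 1) / 2

# The curve crosses the cutoff once on each side of the MLE.
g <- function(p) loglik(p) - cutoff
lower <- uniroot(g, interval = c(1e-6, p_hat), tol = 1e-8)$root
upper <- uniroot(g, interval = c(p_hat, 1 - 1e-6), tol = 1e-8)$root
round(c(lower, upper), 3)
```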

When *n* is large, the LR method will tend to produce
intervals very similar to those based on the observed
or expected information. Unlike the
information-based intervals, however, the LR intervals
are scale-invariant. That is, if we find the LR interval
for a transformed version of the parameter such as
φ = log[*p*/(1 − *p*)] and then transform the endpoints
back to the *p*-scale, we get exactly the same answer as
if we apply the LR method directly on the *p*-scale.
For that reason, statisticians tend to like the LR
method better.
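Scale invariance can be checked directly for the binomial example with $x = 2$ and $n = 20$. The R sketch below finds the LR interval once on the $p$ scale and once on the logit scale, then back-transforms the latter; the helper `root`, the bracketing intervals, and the use of 1.92 as the drop are choices made for this illustration.

```r
x <- 2; n <- 20
ll_p   <- function(p)   x * log(p) + (n - x) * log(1 - p)   # p scale
ll_phi <- function(phi) x * phi - n * log1p(exp(phi))       # logit scale

cut <- ll_p(x / n) - 1.92          # same cutoff on both scales
phi_hat <- qlogis(x / n)

# Solve f(t) = cut within the bracket [lo, hi].
root <- function(f, lo, hi) {
  uniroot(function(t) f(t) - cut, c(lo, hi), tol = 1e-10)$root
}

ci_p   <- c(root(ll_p, 1e-6, x / n), root(ll_p, x / n, 1 - 1e-6))
ci_phi <- c(root(ll_phi, -15, phi_hat), root(ll_phi, phi_hat, 15))

rbind(direct = ci_p, via_logit = plogis(ci_phi))
```

The two rows agree up to numerical tolerance, whereas the information-based intervals on the two scales do not.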

If the loglikelihood function expressed on a particular scale is nearly quadratic, then an information-based interval calculated on that scale will agree closely with the LR interval. Therefore, if the information-based interval agrees with the LR interval, that provides some evidence that the normal approximation is working well on that particular scale. If the information-based interval is quite different from the LR interval, the appropriateness of the normal approximation is doubtful, and the LR approximation is probably better.