What distribution is used when the sample size N is small and/or when the population standard deviation is unknown?


Just to clarify in relation to the title, we aren't using the t-distribution to estimate the mean (in the sense of a point estimate at least), but to construct an interval for it.

But why use an estimate when you can get your confidence interval exactly?

It's a good question (as long as we don't get too insistent on 'exactly', since the assumptions for it to be exactly t-distributed won't actually hold).

"You must use the t-distribution table when working problems when the population standard deviation (σ) is not known and the sample size is small (n<30)"

Why don't people use the t-distribution all the time when the population standard deviation is not known (even when n>30)?

I regard the advice as - at best - potentially misleading. In some situations, the t-distribution should still be used when degrees of freedom are a good deal larger than that.

Where the normal is a reasonable approximation depends on a variety of things (and so depends on the situation). However, since (with computers) it's not at all difficult to just use the $t$, even if the d.f. are very large, you'd have to wonder why there's any need to do something different at n=30.

If the sample sizes are really large, it won't make a noticeable difference to a confidence interval, but I don't think n=30 is always sufficiently close to 'really large'.

There is one circumstance in which it might make sense to use the normal rather than the $t$ - that's when your data clearly don't satisfy the conditions to get a t-distribution, but you can still argue for approximate normality of the mean (if $n$ is quite large). However, in those circumstances, often the t is a good approximation in practice, and may be somewhat 'safer'. [In a situation like that, I might be inclined to investigate via simulation.]
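A simulation of the kind mentioned above might be sketched as follows. This is only an illustration under stated assumptions: the sample size of 15, the two populations (standard normal and exponential with mean 1), the number of trials, and the helper name `coverage` are all chosen here for the example and do not come from the discussion. The value 2.145 is the standard table value of the upper 2.5% point of the t-distribution with 14 degrees of freedom.

```python
import math
import random
from statistics import NormalDist

def coverage(sampler, true_mean, n, crit, trials=10000, seed=0):
    """Fraction of intervals xbar ± crit * s / sqrt(n) that contain true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [sampler(rng) for _ in range(n)]
        xbar = sum(sample) / n
        # Sample standard deviation with the usual n - 1 divisor.
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        if abs(xbar - true_mean) <= crit * s / math.sqrt(n):
            hits += 1
    return hits / trials

n = 15                                    # assumed sample size for the illustration
z_crit = NormalDist().inv_cdf(0.975)      # normal critical value, about 1.960
t_crit = 2.145                            # table value for 14 degrees of freedom

# Hypothetical populations: (sampler, true mean).
populations = {
    "normal": (lambda rng: rng.gauss(0.0, 1.0), 0.0),
    "exponential": (lambda rng: rng.expovariate(1.0), 1.0),
}
for name, (sampler, mu) in populations.items():
    cov_z = coverage(sampler, mu, n, z_crit)
    cov_t = coverage(sampler, mu, n, t_crit)
    print(f"{name:12s} nominal 0.95   z: {cov_z:.3f}   t: {cov_t:.3f}")
```

Because both calls reuse the same seed, the z and t intervals are built from identical samples, so the comparison isolates the effect of the critical value. Swapping in a different sampler or a larger `n` is the natural way to probe a specific situation.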

Learning Objectives

  1. To become familiar with Student’s \(t\)-distribution.
  2. To understand how to apply additional formulas for a confidence interval for a population mean.

The confidence interval formulas in the previous section are based on the Central Limit Theorem, the statement that for large samples \(\overline{X}\) is approximately normally distributed with mean \(\mu\) and standard deviation \(\sigma /\sqrt{n}\). When the population mean \(\mu\) is estimated with a small sample (\(n<30\)), the approximation supplied by the Central Limit Theorem cannot be relied upon. In order to proceed we assume that the numerical population from which the sample is taken has a normal distribution to begin with. If this condition is satisfied, then when the population standard deviation \(\sigma\) is known the old formula \(\bar{x}\pm z_{\alpha /2}(\sigma /\sqrt{n})\) can still be used to construct a \(100(1-\alpha )\%\) confidence interval for \(\mu\).

If the population standard deviation is unknown and the sample size \(n\) is small, then when we substitute the sample standard deviation \(s\) for \(\sigma\) the normal approximation is no longer valid. The solution is to use a different distribution, called Student’s \(t\)-distribution with \(n-1\) degrees of freedom. Student’s \(t\)-distribution is very much like the standard normal distribution in that it is centered at \(0\) and has the same qualitative bell shape, but it has heavier tails than the standard normal distribution does, as indicated by Figure \(\PageIndex{1}\). In that figure, the curve (in brown) that meets the dashed vertical line at the lowest point is the \(t\)-distribution with two degrees of freedom, the next curve (in blue) is the \(t\)-distribution with five degrees of freedom, and the thin curve (in red) is the standard normal distribution. As also indicated by the figure, as the sample size \(n\) increases, Student’s \(t\)-distribution ever more closely resembles the standard normal distribution. Although there is a different \(t\)-distribution for every value of \(n\), once the sample size is \(30\) or more it is typically acceptable to use the standard normal distribution instead, as we will always do in this text.

Figure \(\PageIndex{1}\): Student’s \(t\)-Distribution
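The way the \(t\)-distribution approaches the standard normal distribution can also be checked numerically. The sketch below is one possible approach using only the Python standard library: it integrates the \(t\) density with Simpson's rule and finds the critical value by bisection. The function names `t_pdf`, `t_right_tail`, and `t_crit`, the step counts, and the search bound of 100 are choices made for this illustration, not part of the text; a statistics package would normally be used instead.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_right_tail(x, df, steps=1000):
    """Area to the right of x >= 0, using Simpson's rule on [0, x]."""
    if x == 0:
        return 0.5
    h = x / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_crit(c, df, hi=100.0):
    """Value that cuts off a right tail of area c (0 < c < 0.5), by bisection.

    Assumes the answer lies below `hi`, which holds for the cases shown here.
    """
    lo = 0.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_right_tail(mid, df) > c:
            lo = mid                   # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(1 - 0.025)    # normal value cutting off a right tail of 0.025
for df in (2, 5, 14, 29, 100):
    print(f"df = {df:3d}   t = {t_crit(0.025, df):.3f}   z = {z:.3f}")
```

Each printed \(t\) value exceeds the normal value, reflecting the heavier tails, and the gap narrows as the degrees of freedom grow.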

Just as the symbol \(z_c\) stands for the value that cuts off a right tail of area \(c\) in the standard normal distribution, so the symbol \(t_c\) stands for the value that cuts off a right tail of area \(c\) in Student’s \(t\)-distribution. This gives us the following confidence interval formulas.

Small Sample \( 100(1−α)\%\) Confidence Interval for a Population Mean

If \(σ\) is known:

\[\overline{x} ± z_{α/2} \left( \dfrac{σ}{\sqrt{n}}\right) \]

If \(σ\) is unknown:

\[\overline{x} ± t_{α/2} \left( \dfrac{s}{\sqrt{n}}\right) \label{tdist}\]

with the degrees of freedom \( df=n−1\).

The population must be normally distributed and a sample is considered small when \(n < 30\).

To use the new formula we use the line in Figure 7.1.6 that corresponds to the relevant sample size.

Example \(\PageIndex{1}\)

A sample of size \(15\) drawn from a normally distributed population has sample mean \(35\) and sample standard deviation \(14\). Construct a \(95\%\) confidence interval for the population mean, and interpret its meaning.

Solution:

Since the population is normally distributed, the sample is small, and the population standard deviation is unknown, the formula that applies is Equation \ref{tdist}.

Confidence level \(95\%\) means that

\[α=1−0.95=0.05\]

so \(α/2=0.025\). Since the sample size is \(n = 15\), there are \(n−1=14\) degrees of freedom. By Figure 7.1.6 \(t_{0.025}=2.145\). Thus

\[\begin{align} \overline{x} ± t_{α/2} \left( \dfrac{s}{\sqrt{n}}\right) &=35 ± 2.145 \left( \dfrac{14}{\sqrt{15}} \right) \\ &\approx 35 ±7.8 \end{align} \]

One may be \(95\%\) confident that the true value of \(μ\) is contained in the interval

\[(35−7.8, 35+7.8) = (27.2,42.8).\]
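The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `small_sample_interval` is made up for the illustration, and the critical value is passed in by hand exactly as it was read from the table in the solution.

```python
import math

def small_sample_interval(xbar, s, n, t_crit):
    """Return (lower, upper) for xbar ± t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Values from the example: n = 15, sample mean 35, sample standard
# deviation 14, and t_{0.025} = 2.145 with 14 degrees of freedom.
lower, upper = small_sample_interval(xbar=35, s=14, n=15, t_crit=2.145)
print(f"({lower:.1f}, {upper:.1f})")
```

Rounded to one decimal place, the endpoints agree with the interval \((27.2,42.8)\) found above.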

Example \(\PageIndex{2}\)

A random sample of \(12\) students from a large university yields mean GPA \(2.71\) with sample standard deviation \(0.51\). Construct a \(90\%\) confidence interval for the mean GPA of all students at the university. Assume that the numerical population of GPAs from which the sample is taken has a normal distribution.

Solution:

Since the population is normally distributed, the sample is small, and the population standard deviation is unknown, the formula that applies is Equation \ref{tdist}.

Confidence level \(90\%\) means that

\[α=1−0.90=0.10\]

so \(α/2=0.05\). Since the sample size is \(n = 12\), there are \(n−1=11\) degrees of freedom. By Figure 7.1.6 \(t_{0.05}=1.796\). Thus

\[\begin{align} \overline{x} ± t_{α/2} \left( \dfrac{s}{\sqrt{n}}\right) &=2.71 ± 1.796 \left( \dfrac{0.51}{\sqrt{12}} \right) \\ &\approx 2.71 ±0.26 \end{align} \]

One may be \(90\%\) confident that the true average GPA of all students at the university is contained in the interval

\[(2.71−0.26,2.71+0.26)=(2.45,2.97).\]

Compare "Example 4" in Section 7.1 with the present example (Example \(\PageIndex{2}\)). The sample mean and sample standard deviation in the two samples are the same, but the \(90\%\) confidence interval for the average GPA of all students at the university in "Example 4" in Section 7.1, \((2.63,2.79)\), is shorter than the \(90\%\) confidence interval \((2.45,2.97)\) found here. This is partly because in "Example 4" in Section 7.1 the sample size is larger; there is more information pertaining to the true value of \(\mu\) in the large data set than in the small one.
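The word "partly" can be made concrete with a short calculation. The sketch below holds the small sample fixed (\(n=12\), \(s=0.51\)) and compares the margin of error obtained with \(t_{0.05}=1.796\) to the one that would result from using the normal value \(z_{0.05}\) instead. It isolates only the effect of the critical value; the sample size of the larger example is not restated in this section, so that part of the comparison is left out. The variable names are illustrative.

```python
import math
from statistics import NormalDist

s, n = 0.51, 12                          # summary statistics from the GPA example
standard_error = s / math.sqrt(n)

t_value = 1.796                          # t_{0.05} with 11 degrees of freedom, from the solution
z_value = NormalDist().inv_cdf(0.95)     # z_{0.05}, about 1.645

margin_t = t_value * standard_error
margin_z = z_value * standard_error

print(f"standard error : {standard_error:.4f}")
print(f"margin using t : {margin_t:.3f}")
print(f"margin using z : {margin_z:.3f}")
```

With the same data, the \(t\)-based margin is the wider of the two, so some of the extra width of the small-sample interval comes from the larger critical value and the rest from the smaller sample size.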

Key Takeaway

  • In selecting the correct formula for construction of a confidence interval for a population mean ask two questions: is the population standard deviation \(\sigma\) known or unknown, and is the sample large or small?
  • We can construct confidence intervals with small samples only if the population is normal.