An important tool in the study of the relationships that exist between two jointly distributed random variables, \(X\) and \(Y\) , is provided by the notion of conditional expectation. In section 3 of Chapter 7 the notion of the conditional distribution function \(F_{Y \mid X}(\cdot \mid x)\) of the random variable \(Y\) , given the random variable \(X\) , is defined. We now define the conditional mean of \(Y\) , given \(X\) , by

\[E[Y \mid X=x] = \begin{cases} \displaystyle\int_{-\infty}^{\infty} y \, dF_{Y \mid X}(y \mid x) \\[5mm] \displaystyle\int_{-\infty}^{\infty} y f_{Y \mid X}(y \mid x) \, dy \\[5mm] \displaystyle\sum_{\substack{\text{over all } y \text{ such that} \\ p_{Y \mid X}(y \mid x) > 0}} y p_{Y \mid X}(y \mid x); \end{cases} \tag{7.1}\] 

the last two equations hold, respectively, in the cases in which \(F_{Y \mid X}(\cdot \mid x)\) is continuous or discrete. From a knowledge of the conditional mean of \(Y\) , given \(X\) , the value of the mean \(E[Y]\) may be obtained:

\[E[Y] = \begin{cases} \displaystyle\int_{-\infty}^{\infty} E[Y \mid X=x] \, dF_{X}(x) \\[5mm] \displaystyle\int_{-\infty}^{\infty} E[Y \mid X=x] f_{X}(x) \, dx \\[5mm] \displaystyle\sum_{\substack{\text{over all } x \text{ such that} \\ p_{X}(x) > 0}} E[Y \mid X=x] p_{X}(x). \end{cases} \tag{7.2}\] 
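As a quick numerical illustration of (7.2), one may check the relation \(E[Y]=E[E[Y \mid X]]\) by simulation. The following sketch makes arbitrary illustrative choices of the distributions: \(X\) is exponentially distributed and, given \(X=x\) , \(Y\) is normally distributed with mean \(2x+1\) .

```python
import numpy as np

# Monte Carlo check of E[Y] = E[ E[Y | X] ], as in (7.2).
# Arbitrary illustrative choices: X ~ Exp(1); given X = x, Y ~ N(2x + 1, 1).
rng = np.random.default_rng(0)
n = 1_000_000

x = rng.exponential(scale=1.0, size=n)        # E[X] = 1
y = rng.normal(loc=2.0 * x + 1.0, scale=1.0)  # Y | X = x has mean 2x + 1

print(y.mean())                # direct estimate of E[Y]; approximately 3
print((2.0 * x + 1.0).mean())  # estimate of E[ E[Y | X] ]; also approximately 3
```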

Example 7A. Sampling from an urn of random composition. Let a random sample of size \(n\) be drawn without replacement from an urn containing \(N\) balls. Suppose that the number \(X\) of white balls in the urn is a random variable. Let \(Y\) be the number of white balls contained in the sample. The conditional distribution of \(Y\) , given \(X\) , is discrete, with probability mass function for \(x=0,1, \ldots, N\) and \(y=0,1, \ldots, x\) given by

\[p_{Y \mid X}(y \mid x)=P[Y=y \mid X=x]=\frac{\dbinom{x}{y} \dbinom{N-x}{n-y}}{\dbinom{N}{n}}, \tag{7.3}\] 

since the conditional probability law of \(Y\) , given \(X\) , is hypergeometric. The conditional mean of \(Y\) , given \(X\) , can be readily obtained from a knowledge of the mean of a hypergeometric random variable;

\[E[Y \mid X=x]=n \frac{x}{N}. \tag{7.4}\] 

The mean number of white balls in the sample drawn is then equal to

\[E[Y]=\sum_{x=0}^{N} E[Y \mid X=x] p_{X}(x)=\frac{n}{N} \sum_{x=0}^{N} x p_{X}(x)=\frac{n}{N} E[X]. \tag{7.5}\] 

Now \(E[X] / N\) is the mean proportion of white balls in the urn. Consequently, (7.5) is analogous to the formulas for the mean of a binomial or hypergeometric random variable. Note that the probability law of \(Y\) is hypergeometric if \(X\) is hypergeometric, and binomial if \(X\) is binomial. (See theoretical exercise 4.1 of Chapter 4.)
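The conclusion (7.5) is easily checked by simulation; in the following sketch the number \(X\) of white balls is taken (arbitrarily, for purposes of illustration) to be binomially distributed with parameters \(N\) and \(p\) .

```python
import numpy as np

# Simulation sketch of example 7A: an urn of N balls contains a random number X
# of white balls; a sample of size n is drawn without replacement, and Y is the
# number of white balls in the sample.  X is taken to be binomial(N, p).
rng = np.random.default_rng(1)
N, n, p = 20, 5, 0.3
trials = 200_000

x = rng.binomial(N, p, size=trials)   # random composition of the urn
y = rng.hypergeometric(x, N - x, n)   # Y | X = x is hypergeometric, as in (7.3)

print(y.mean())          # empirical E[Y]
print(n / N * x.mean())  # (n/N) E[X], as asserted by (7.5); here equal to n p = 1.5
```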

Example 7B. The conditional mean of jointly normal random variables. Two random variables, \(X_{1}\) and \(X_{2}\) , are jointly normally distributed if they possess a joint probability density function given by (2.18) . Then

\[f_{X_{2} \mid X_{1}}\left(x_{2} \mid x_{1}\right)=\frac{1}{\sigma_{2} \sqrt{1-\rho^{2}}} \phi\left(\frac{x_{2}-m_{2}-\left(\sigma_{2} / \sigma_{1}\right) \rho\left(x_{1}-m_{1}\right)}{\sigma_{2} \sqrt{1-\rho^{2}}}\right). \tag{7.6}\] 

Consequently, the conditional mean of \(X_{2}\) , given \(X_{1}\) , is given by

\[E\left[X_{2} \mid X_{1}=x_{1}\right]=m_{2}+\frac{\sigma_{2}}{\sigma_{1}} \rho\left(x_{1}-m_{1}\right)=\alpha_{1}+\beta_{1} x_{1} \tag{7.7}\] 

in which we define the constants \(\alpha_{1}\) and \(\beta_{1}\) by

\[\alpha_{1}=m_{2}-\frac{\sigma_{2}}{\sigma_{1}} \rho m_{1}, \quad \beta_{1}=\frac{\sigma_{2}}{\sigma_{1}} \rho. \tag{7.8}\] 

Similarly,

\[E\left[X_{1} \mid X_{2}=x_{2}\right]=\alpha_{2}+\beta_{2} x_{2}; \quad \alpha_{2}=m_{1}-\frac{\sigma_{1}}{\sigma_{2}} \rho m_{2}, \quad \beta_{2}=\frac{\sigma_{1}}{\sigma_{2}} \rho. \tag{7.9}\] 

From (7.7) it is seen that the conditional mean of a random variable \(X_{2}\) , given the value \(x_{1}\) of a random variable \(X_{1}\) with which \(X_{2}\) is jointly normally distributed, is a linear function of \(x_{1}\) . Except in the case in which the two random variables, \(X_{1}\) and \(X_{2}\) , are jointly normally distributed, it is generally to be expected that \(E\left[X_{2} \mid X_{1}=x_{1}\right]\) is a nonlinear function of \(x_{1}\) .

The conditional mean of one random variable, given another random variable, represents one possible answer to the problem of prediction . Suppose that a prospective father of height \(x_{1}\) wishes to predict the height of his unborn son. If the height of the son is regarded as a random variable \(X_{2}\) and the height \(x_{1}\) of the father is regarded as an observed value of a random variable \(X_{1}\) , then as the prediction of the son’s height we take the conditional mean \(E\left[X_{2} \mid X_{1}=x_{1}\right]\) . The justification of this procedure is that the conditional mean \(E\left[X_{2} \mid X_{1}=x_{1}\right]\) may be shown to have the property that

\begin{align} & E\left[\left(X_{2}-E\left[X_{2} \mid X_{1}=x_{1}\right]\right)^{2}\right] \tag{7.10} \\ & =\int_{-\infty}^{\infty} \int_{-\infty}^{\infty}\left(x_{2}-E\left[X_{2} \mid X_{1}=x_{1}\right]\right)^{2} f_{X_{1}, X_{2}}\left(x_{1}, x_{2}\right) d x_{1} d x_{2} \\ & \leq E\left[\left(X_{2}-g\left(X_{1}\right)\right)^{2}\right]=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty}\left[x_{2}-g\left(x_{1}\right)\right]^{2} f_{X_{1}, X_{2}}\left(x_{1}, x_{2}\right) d x_{1} d x_{2} \end{align} 

for any function \(g\left(x_{1}\right)\) for which the last written integral exists. In words, (7.10) is interpreted to mean that if \(X_{2}\) is to be predicted by a function \(g\left(X_{1}\right)\) of the random variable \(X_{1}\) then the conditional mean \(E\left[X_{2} \mid X_{1}=x_{1}\right]\) has the smallest mean square error among all possible predictors \(g\left(X_{1}\right)\) .
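A sketch of the argument behind (7.10) may be given as follows (assuming that all the expectations written exist). Writing \(E\left[X_{2} \mid X_{1}\right]\) for the random variable whose value is \(E\left[X_{2} \mid X_{1}=x_{1}\right]\) when \(X_{1}=x_{1}\) , one has, for any function \(g\left(x_{1}\right)\) ,

\begin{align} E\left[\left(X_{2}-g\left(X_{1}\right)\right)^{2}\right] = \; & E\left[\left(X_{2}-E\left[X_{2} \mid X_{1}\right]\right)^{2}\right]+E\left[\left(E\left[X_{2} \mid X_{1}\right]-g\left(X_{1}\right)\right)^{2}\right] \\ & +2 E\left[\left(X_{2}-E\left[X_{2} \mid X_{1}\right]\right)\left(E\left[X_{2} \mid X_{1}\right]-g\left(X_{1}\right)\right)\right]. \end{align} 

The cross-product term vanishes, since for each fixed value \(x_{1}\) of \(X_{1}\) the factor \(E\left[X_{2} \mid X_{1}\right]-g\left(X_{1}\right)\) is a constant, while \(X_{2}-E\left[X_{2} \mid X_{1}\right]\) has conditional mean 0. The middle term is nonnegative and is equal to 0 when \(g\left(x_{1}\right)=E\left[X_{2} \mid X_{1}=x_{1}\right]\) , from which the inequality (7.10) follows.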

From (7.7) it is seen that in the case in which the random variables are jointly normally distributed the problem of computing the conditional mean \(E\left[X_{2} \mid X_{1}=x_{1}\right]\) may be reduced to that of computing the constants \(\alpha_{1}\) and \(\beta_{1}\) , for which one requires a knowledge only of the means, variances, and correlation coefficient of \(X_{1}\) and \(X_{2}\) . If these moments are not known, they must be estimated from observed data. The part of statistics concerned with the estimation of the parameters \(\alpha_{1}\) and \(\beta_{1}\) is called regression analysis .
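To illustrate the estimation step just described, the sketch below forms the usual plug-in estimates of \(\alpha_{1}\) and \(\beta_{1}\) , replacing the means, standard deviations, and correlation coefficient in (7.8) by their sample counterparts; the parameter values used to generate the illustrative data are arbitrary.

```python
import numpy as np

# Sketch: plug-in (moment) estimates of the regression constants alpha_1 and
# beta_1 of (7.8) from observed pairs (x1, x2).  The "true" parameters below
# are arbitrary and serve only to generate illustrative data.
rng = np.random.default_rng(2)
m1, m2, s1, s2, rho = 68.0, 69.0, 2.7, 2.7, 0.5
cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
x1, x2 = rng.multivariate_normal([m1, m2], cov, size=5000).T

beta_hat = np.corrcoef(x1, x2)[0, 1] * x2.std(ddof=1) / x1.std(ddof=1)
alpha_hat = x2.mean() - beta_hat * x1.mean()

print(alpha_hat, beta_hat)                          # estimated alpha_1, beta_1
print(m2 - (s2 / s1) * rho * m1, (s2 / s1) * rho)   # values being estimated
```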

It may happen that the joint probability law of the random variables \(X_{1}\) and \(X_{2}\) is unknown or is known but is such that the calculation of the conditional mean \(E\left[X_{2} \mid X_{1}=x_{1}\right]\) is intractable. Suppose, however, that one knows the means, variances (assumed to be positive), and correlation coefficient of \(X_{1}\) and \(X_{2}\) . Then the prediction problem may be solved by forming the best linear predictor of \(X_{2}\) , given \(X_{1}\) , denoted by \(E^{*}\left[X_{2} \mid X_{1}=x_{1}\right]\) . The best linear predictor of \(X_{2}\) , given \(X_{1}\) , is defined as the linear function \(a+b X_{1}\) of the random variable \(X_{1}\) that minimizes the mean square error of prediction \(E\left[\left(X_{2}-\left(a+b X_{1}\right)\right)^{2}\right]\) involved in the use of \(a+b X_{1}\) as a predictor of \(X_{2}\) . Now

\begin{align} -\frac{\partial}{\partial a} E\left[\left(X_{2}-\left(a+b X_{1}\right)\right)^{2}\right] & =2 E\left[X_{2}-\left(a+b X_{1}\right)\right] \tag{7.11} \\ -\frac{\partial}{\partial b} E\left[\left(X_{2}-\left(a+b X_{1}\right)\right)^{2}\right] & =2 E\left[\left(X_{2}-\left(a+b X_{1}\right)\right) X_{1}\right]. \end{align} 

Solving for the values of \(a\) and \(b\) , denoted by \(\alpha\) and \(\beta\) , at which these derivatives are equal to 0, one sees that \(\alpha\) and \(\beta\) satisfy the equations

\begin{align} \alpha+\beta E\left[X_{1}\right] & =E\left[X_{2}\right] \\ \alpha E\left[X_{1}\right]+\beta E\left[X_{1}^{2}\right] & =E\left[X_{1} X_{2}\right]. \tag{7.12} \end{align} 

Multiplying the first equation in (7.12) by \(E\left[X_{1}\right]\) and subtracting it from the second, one finds that \(\beta \operatorname{Var}\left[X_{1}\right]=\operatorname{Cov}\left[X_{1}, X_{2}\right]\) . Therefore, \(E^{*}\left[X_{2} \mid X_{1}=x_{1}\right]=\alpha+\beta x_{1}\) , in which

\[\alpha=E\left[X_{2}\right]-\beta E\left[X_{1}\right], \quad \beta=\frac{\operatorname{Cov}\left[X_{1}, X_{2}\right]}{\operatorname{Var}\left[X_{1}\right]}=\frac{\sigma\left[X_{2}\right]}{\sigma\left[X_{1}\right]} \rho\left(X_{1}, X_{2}\right). \tag{7.13}\] 

Comparing (7.7) and (7.13), one sees that the best linear predictor \(E^{*}\left[X_{2} \mid X_{1}=x_{1}\right]\) coincides with the best predictor, or conditional mean, \(E\left[X_{2} \mid X_{1}=x_{1}\right]\) , in the case in which the random variables \(X_{1}\) and \(X_{2}\) are jointly normally distributed.

We can readily compute the mean square error of prediction achieved with the use of the best linear predictor. We have

\begin{align} E\left[\left(X_{2}-E^{*}\left[X_{2} \mid X_{1}=x_{1}\right]\right)^{2}\right] & =E\left[\left\{\left(X_{2}-E\left[X_{2}\right]\right)-\beta\left(X_{1}-E\left[X_{1}\right]\right)\right\}^{2}\right] \tag{7.14} \\ & =\operatorname{Var}\left[X_{2}\right]+\beta^{2} \operatorname{Var}\left[X_{1}\right]-2 \beta \operatorname{Cov}\left[X_{2}, X_{1}\right] \\ & =\operatorname{Var}\left[X_{2}\right]-\frac{\operatorname{Cov}^{2}\left[X_{1}, X_{2}\right]}{\operatorname{Var}\left[X_{1}\right]} \\ & =\operatorname{Var}\left[X_{2}\right]\left\{1-\rho^{2}\left(X_{1}, X_{2}\right)\right\}. \end{align} 

From (7.14) one obtains the important conclusion that the closer the correlation coefficient of two random variables is to 1 in absolute value, the smaller the mean square error of prediction involved in predicting the value of one of the random variables from the value of the other. 
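As a numerical check of (7.13) and (7.14), the following sketch constructs a deliberately non-normal pair \(\left(X_{1}, X_{2}\right)\) (the particular construction is an arbitrary choice), forms the best linear predictor from the sample moments, and compares its mean square error with \(\operatorname{Var}\left[X_{2}\right]\left\{1-\rho^{2}\left(X_{1}, X_{2}\right)\right\}\) .

```python
import numpy as np

# Numerical check of (7.13) and (7.14) for a non-normal pair (X1, X2).
# The construction of X2 below (a quadratic function of X1 plus noise) is an
# arbitrary illustrative choice.
rng = np.random.default_rng(3)
n = 500_000

x1 = rng.exponential(1.0, size=n)
x2 = x1 + 0.5 * x1**2 + rng.normal(0.0, 1.0, size=n)

C = np.cov(x1, x2)                    # sample covariance matrix of (X1, X2)
beta = C[0, 1] / C[0, 0]              # Cov[X1, X2] / Var[X1], as in (7.13)
alpha = x2.mean() - beta * x1.mean()
rho = np.corrcoef(x1, x2)[0, 1]

print(((x2 - (alpha + beta * x1)) ** 2).mean())  # empirical mean square error
print(C[1, 1] * (1.0 - rho**2))                  # Var[X2](1 - rho^2), as in (7.14)
```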

The Phenomenon of “Spurious” Correlation. Given three random variables \(U, V\) , and \(W\) , let \(X\) and \(Y\) be defined by

\[X=U+W, \quad Y=V+W \quad \text { or } \quad X=\frac{U}{W}, \quad Y=\frac{V}{W}, \tag{7.15}\] 

(or in some similar way) as functions of \(U, V\) , and \(W\) . The reader should be careful not to infer the existence of a correlation between \(U\) and \(V\) from the existence of a correlation between \(X\) and \(Y\) .

Example 7C. Do storks bring babies? Let \(W\) be the number of women of child-bearing age in a certain geographical area, \(U\) , the number of storks in the area, and \(V\) , the number of babies born in the area during a specified period of time. The random variables \(X\) and \(Y\) , defined by

\[X=\frac{U}{W}, \quad Y=\frac{V}{W}, \tag{7.16}\] 

then represent, respectively, the number of storks per woman and the number of babies born per woman in the area. If the correlation coefficient \(\rho(X, Y)\) between \(X\) and \(Y\) is close to 1, does that not prove that storks bring babies? Indeed, even if it is proved only that the correlation coefficient \(\rho(X, Y)\) is positive, would that not prove that the presence of storks in an area has a beneficial influence on the birth rate there? The reader interested in a discussion of these delightful questions would be well advised to consult J. Neyman, Lectures and Conferences on Mathematical Statistics and Probability , Washington, D.C., 1952, pp. 143–154.
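A small simulation makes the warning concrete; in the sketch below \(U, V\) , and \(W\) are independent (the uniform distributions chosen are arbitrary), yet the ratios \(X=U / W\) and \(Y=V / W\) are strongly correlated merely because they share the denominator \(W\) .

```python
import numpy as np

# "Spurious" correlation: U, V, W independent, yet X = U/W and Y = V/W are
# correlated because of the common denominator W.  Distributions are arbitrary.
rng = np.random.default_rng(4)
n = 200_000

u = rng.uniform(1.0, 2.0, size=n)
v = rng.uniform(1.0, 2.0, size=n)
w = rng.uniform(1.0, 2.0, size=n)

print(np.corrcoef(u, v)[0, 1])          # approximately 0
print(np.corrcoef(u / w, v / w)[0, 1])  # clearly positive (roughly 0.5 here)
```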

Theoretical Exercises

In the following exercises let \(X_{1}, X_{2}\) , and \(Y\) be jointly distributed random variables whose first and second moments are assumed known and whose variances are positive.

7.1. The best linear predictor , denoted by \(E^{*}\left[Y \mid X_{1}, X_{2}\right]\) , of \(Y\) , given \(X_{1}\) and \(X_{2}\) , is defined as the linear function \(a+b_{1} X_{1}+b_{2} X_{2}\) , which minimizes \(E\left[\left(Y-\left(a+b_{1} X_{1}+b_{2} X_{2}\right)\right)^{2}\right]\) . Show that

\[E^{*}\left[Y \mid X_{1}, X_{2}\right]=E[Y]+\beta_{1}\left(X_{1}-E\left[X_{1}\right]\right)+\beta_{2}\left(X_{2}-E\left[X_{2}\right]\right)\] 

where

\begin{align} & \beta_{1}=\operatorname{Cov}\left[Y, X_{1}\right] \Sigma_{11}+\operatorname{Cov}\left[Y, X_{2}\right] \Sigma_{12} \\ & \beta_{2}=\operatorname{Cov}\left[Y, X_{1}\right] \Sigma_{21}+\operatorname{Cov}\left[Y, X_{2}\right] \Sigma_{22}, \end{align} 

in which we define

\[\begin{gather} \Sigma_{11}=\operatorname{Var}\left[X_{2}\right] / \Delta, \quad \Sigma_{22}=\operatorname{Var}\left[X_{1}\right] / \Delta, \quad \Sigma_{12}=\Sigma_{21}=-\operatorname{Cov}\left[X_{1}, X_{2}\right] / \Delta, \\ \Delta=\operatorname{Var}\left[X_{1}\right] \operatorname{Var}\left[X_{2}\right]\left[1-\rho^{2}\left(X_{1}, X_{2}\right)\right]. \end{gather}\] 

7.2. The residual of \(Y\) with respect to \(X_{1}\) and \(X_{2}\) , denoted by \(\eta\left[Y \mid X_{1}, X_{2}\right]\) , is defined by

\[\eta\left[Y \mid X_{1}, X_{2}\right]=Y-E^{*}\left[Y \mid X_{1}, X_{2}\right].\] 

Show that \(\eta\left[Y \mid X_{1}, X_{2}\right]\) is uncorrelated with \(X_{1}\) and \(X_{2}\) . Consequently, conclude that the mean square error of prediction, called the residual variance of \(Y\) , given \(X_{1}\) and \(X_{2}\) , is given by

\[E\left[\eta^{2}\left[Y \mid X_{1}, X_{2}\right]\right]=\operatorname{Var}[Y]-\operatorname{Var}\left[E^{*}\left[Y \mid X_{1}, X_{2}\right]\right].\] 

Next show that the variance of the predictor is given by

\begin{align} \operatorname{Var}\left[E^{*}\left[Y \mid X_{1}, X_{2}\right]\right]= & \beta_{1}^{2} \operatorname{Var}\left[X_{1}\right]+\beta_{2}^{2} \operatorname{Var}\left[X_{2}\right]+2 \beta_{1} \beta_{2} \operatorname{Cov}\left[X_{1}, X_{2}\right] \\ = & \Sigma_{11} \operatorname{Cov}^{2}\left[Y, X_{1}\right]+\Sigma_{22} \operatorname{Cov}^{2}\left[Y, X_{2}\right] \\ & +2 \Sigma_{12} \operatorname{Cov}\left[Y, X_{1}\right] \operatorname{Cov}\left[Y, X_{2}\right]. \end{align} 

The positive quantity \(R\left[Y \mid X_{1}, X_{2}\right]\) , defined by

\[R^{2}\left[Y \mid X_{1}, X_{2}\right]=\frac{\operatorname{Var}\left[E^{*}\left[Y \mid X_{1}, X_{2}\right]\right]}{\operatorname{Var}[Y]}=\rho^{2}\left(Y, E^{*}\left[Y \mid X_{1}, X_{2}\right]\right),\] 

is called the multiple correlation coefficient between \(Y\) and the random vector \(\left(X_{1}, X_{2}\right)\) . To understand the meaning of the multiple correlation coefficient, express in terms of it the residual variance of \(Y\) , given \(X_{1}\) and \(X_{2}\) .

7.3. The partial correlation coefficient of \(X_{1}\) and \(X_{2}\) with respect to \(Y\) is defined by

\[\rho\left[X_{1}, X_{2} \mid Y\right]=\rho\left(\eta\left[X_{1} \mid Y\right], \eta\left[X_{2} \mid Y\right]\right),\] 

in which \(\eta\left[X_{i} \mid Y\right]=X_{i}-E^{*}\left[X_{i} \mid Y\right]\) for \(i=1,2\) . Show that

\[\rho\left[X_{1}, X_{2} \mid Y\right]=\frac{\rho\left(X_{1}, X_{2}\right)-\rho\left(X_{1}, Y\right) \rho\left(X_{2}, Y\right)}{\sqrt{\left(1-\rho^{2}\left(X_{1}, Y\right)\right)\left(1-\rho^{2}\left(X_{2}, Y\right)\right)}}.\] 

7.4. (Continuation of example 7A). Show that \[\operatorname{Var}[Y]=n \frac{E[X]}{N}\left(1-\frac{E[X]}{N}\right) \frac{N-n}{N-1}+\frac{n-1}{N-1} \frac{n}{N} \operatorname{Var}[X]. \tag{7.17}\] 
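The identity (7.17) can be checked numerically; the sketch below reuses the urn model of example 7A, with the composition \(X\) taken (arbitrarily) to be binomially distributed.

```python
import numpy as np

# Monte Carlo check of (7.17) for the urn of example 7A, with X binomial(N, p).
rng = np.random.default_rng(5)
N, n, p = 20, 5, 0.3
trials = 500_000

x = rng.binomial(N, p, size=trials)
y = rng.hypergeometric(x, N - x, n)

ex, varx = x.mean(), x.var()
rhs = (n * (ex / N) * (1 - ex / N) * (N - n) / (N - 1)
       + (n - 1) / (N - 1) * (n / N) * varx)

print(y.var())  # empirical Var[Y]
print(rhs)      # right-hand side of (7.17), with E[X] and Var[X] estimated
```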

Exercises

7.1. Let \(X_{1}, X_{2}, X_{3}\) be jointly distributed random variables with zero means, unit variances, and covariances \(\operatorname{Cov}\left[X_{1}, X_{2}\right]=0.80, \operatorname{Cov}\left[X_{1}, X_{3}\right]=-0.40\) , \(\operatorname{Cov}\left[X_{2}, X_{3}\right]=-0.60\) . Find (i) the best linear predictor of \(X_{1}\) , given \(X_{2}\) , (ii) the best linear predictor of \(X_{3}\) , given \(X_{2}\) , (iii) the partial correlation between \(X_{1}\) and \(X_{3}\) , given \(X_{2}\) , (iv) the best linear predictor of \(X_{1}\) , given \(X_{2}\) and \(X_{3}\) , (v) the residual variance of \(X_{1}\) , given \(X_{2}\) and \(X_{3}\) , (vi) the residual variance of \(X_{1}\) , given \(X_{2}\) .

 

Answer

(i) \(0.8 x_{2}\) ; (ii) \(-0.6 x_{2}\) ; (iii) \(\frac{1}{6}\) ; (iv) \(\frac{7}{8} x_{2}+\frac{1}{8} x_{3}\) ; (v) 0.35; (vi) 0.36.
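These answers may be verified mechanically from the given covariance matrix, using the formulas of theoretical exercises 7.1 to 7.3; a sketch of such a check follows.

```python
import numpy as np

# Numerical check of the answers to exercise 7.1, starting from the covariance
# matrix of (X1, X2, X3) (unit variances, zero means).
S = np.array([[ 1.0,  0.8, -0.4],
              [ 0.8,  1.0, -0.6],
              [-0.4, -0.6,  1.0]])

# (i), (ii): best linear predictors given X2 alone; slope = Cov / Var.
print(S[0, 1] / S[1, 1], S[2, 1] / S[1, 1])      # 0.8 and -0.6

# (iii): partial correlation of X1 and X3 with respect to X2.
num = S[0, 2] - S[0, 1] * S[2, 1]
den = np.sqrt((1 - S[0, 1] ** 2) * (1 - S[2, 1] ** 2))
print(num / den)                                 # 1/6

# (iv), (v): predictor of X1 given (X2, X3) and the residual variance.
b = np.linalg.solve(S[1:, 1:], S[1:, 0])         # coefficients 7/8 and 1/8
print(b, S[0, 0] - S[1:, 0] @ b)                 # residual variance 0.35

# (vi): residual variance of X1 given X2 alone.
print(S[0, 0] - S[0, 1] ** 2 / S[1, 1])          # 0.36
```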

 

7.2. Find the conditional mean of \(Y\) , given \(X\) , if \(X\) and \(Y\) are jointly continuous random variables with a joint probability density function \(f_{X, Y}(x, y)\) vanishing except for \(x>0, y>0\) , and in the case in which \(x>0, y>0\) given by

  1. \[\frac{4}{5} (x + 3y) e^{-x - 2y},\] 
  2. \[\frac{y}{(1 + x)^4} e^{-y/(1 + x)},\] 
  3. \[\frac{9}{2} \frac{1 + x + y}{(1 + x)^4 (1 + y)^4}.\] 

7.3. Let \(X=\cos 2 \pi U, Y=\sin 2 \pi U\) , in which \(U\) is uniformly distributed on 0 to 1. Show that for \(|x| \leq 1\) 

\[E^{*}[Y \mid X=x]=0, \quad E[Y \mid X=x]=\sqrt{1-x^{2}}.\] 

Find the mean square error of prediction achieved by the use of (i) the best linear predictor, (ii) the best predictor.

Answer

(i) \(\operatorname{Var}[Y]=0.5\) ; (ii) 0.

7.4. Let \(U, V\) , and \(W\) be uncorrelated random variables with equal variances. Let \(X=U \pm W, Y=V \pm W\) . Show that

\[\rho(X, W)=\rho(Y, W)=1 / \sqrt{2}, \quad \rho(X, Y)=0.5.\] 