Regression Analysis
The regression model is used to answer the question whether two variables are related. An example is the relation between high blood pressure of the mother and the birth weight of her child. Is high blood pressure a prognostic factor for low birth weight? With y = birth weight and x = blood pressure, the regression line is

y = a + bx

where a is the intercept and b is the slope of the regression line. This relation does not hold exactly for every woman: the outcome y is the average birth weight of all women with the same blood pressure. To complete the model we need an error term e, which represents the deviations from the regression line. This error term is normally distributed with mean zero and variance sigma^2. The variance is a measure of the quality of the regression line: large values of sigma^2 indicate more scatter around the line.
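As an illustration, the line and the residual variance can be estimated by least squares. The data below are hypothetical, and the variable names are my own:

```python
import numpy as np

# Hypothetical measurements: x = blood pressure of the mother, y = birth weight.
x = np.array([110.0, 120.0, 125.0, 130.0, 140.0, 150.0])
y = np.array([3.6, 3.5, 3.3, 3.2, 3.0, 2.8])

# np.polyfit with degree 1 returns the slope b and intercept a of the
# least-squares line y = a + b*x.
b, a = np.polyfit(x, y, 1)

# The residuals e = y - (a + b*x) estimate the error term; their variance
# estimates sigma^2 (dividing by n - 2 because two parameters were fitted).
e = y - (a + b * x)
sigma2 = np.sum(e**2) / (len(x) - 2)
```

A large value of sigma2 would indicate much scatter around the fitted line.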

An extension is multiple regression analysis, where three or more variables are considered. In the study of birth weight, the cholesterol level of the mother would be considered in addition to her blood pressure.
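A minimal sketch of such a model with two explanatory variables, again with hypothetical data and variable names of my own choosing:

```python
import numpy as np

# Hypothetical data: blood pressure and cholesterol level of the mother.
blood_pressure = np.array([110.0, 120.0, 125.0, 130.0, 140.0, 150.0])
cholesterol = np.array([4.8, 5.2, 5.0, 5.6, 6.1, 6.4])
birth_weight = np.array([3.6, 3.5, 3.3, 3.2, 3.0, 2.8])

# Design matrix: a column of ones for the intercept plus one column
# per explanatory variable.
X = np.column_stack([np.ones_like(blood_pressure), blood_pressure, cholesterol])

# Least-squares estimates: intercept a and one slope per explanatory variable.
coef, _, _, _ = np.linalg.lstsq(X, birth_weight, rcond=None)
a, b1, b2 = coef
```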

A regression model is useful for the following purposes:

  • to describe how the response changes when one of the explanatory variables is changed;
  • to describe the relation between the response and one or more explanatory variables;
  • or to predict the value of the response for a new observation with known values of the explanatory variables.

The regression model is the most direct approach, offering quantities such as the multiple correlation coefficient. The standard error of the estimate is the square root of the residual mean square given in the ANOVA table; another name for it is the standard deviation of the residuals. This quantity is useful in the calculation of confidence intervals for new values. The multiple correlation coefficient describes how well the model fits; the square of this measure is the proportion of variance explained by the model.
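These quantities can be computed from the observed and fitted values. A sketch with hypothetical numbers, where y_hat stands for the fitted values of some regression model with p explanatory variables:

```python
import numpy as np

# Hypothetical observed responses and fitted values from a regression model.
y = np.array([3.6, 3.5, 3.3, 3.2, 3.0, 2.8])
y_hat = np.array([3.58, 3.45, 3.38, 3.25, 3.05, 2.79])
p = 1  # number of explanatory variables (assumed)
n = len(y)

ss_total = np.sum((y - y.mean())**2)
ss_residual = np.sum((y - y_hat)**2)

# Proportion of variance explained, and the multiple correlation coefficient.
r_squared = 1 - ss_residual / ss_total
r = np.sqrt(r_squared)

# Standard error of the estimate: square root of the residual mean square,
# i.e. the standard deviation of the residuals.
std_error = np.sqrt(ss_residual / (n - p - 1))
```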

ANOVA Table


The ANOVA table summarizes the results from a regression analysis or an ANOVA. The basis is the equation

Observation = Model + Residual

This decomposition of the data also applies to the sums of squares of the deviations from the mean. If we subtract the average of the observations from each observation, we have

SS_total = SS_regression + SS_residual

where SS is the abbreviation of sum of squares. A similar equation holds for the degrees of freedom:

df_total = df_regression + df_residual

The other quantities in the ANOVA table follow from the equations given above. The mean square is the sum of squares divided by its degrees of freedom. The F statistic is the mean square due to regression divided by the mean square due to residual, and it tests the hypothesis that all coefficients (except the intercept) are zero.
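Putting the pieces together, the whole table can be computed from data. The sketch below fits a simple regression to hypothetical data and checks the sum-of-squares decomposition; scipy is assumed to be available for the p-value:

```python
import numpy as np
from scipy import stats

# Hypothetical data for a simple regression.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# Sums of squares: SS_total = SS_regression + SS_residual.
ss_total = np.sum((y - y.mean())**2)
ss_regression = np.sum((y_hat - y.mean())**2)
ss_residual = np.sum((y - y_hat)**2)

# Degrees of freedom and mean squares.
df_regression = 1              # one explanatory variable
df_residual = len(y) - 2
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F statistic and its p-value under the hypothesis that the slope is zero.
f_stat = ms_regression / ms_residual
p_value = stats.f.sf(f_stat, df_regression, df_residual)
```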

ANOVA Table    Sum of Squares    df    Mean Square    F Statistic    p-value
Regression           20.89698     1       20.89698        1.67514    0.21809
Residual            162.17235    13        12.4748
Total               183.06933    14
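The entries of this table can be checked against the relations above (mean square = sum of squares divided by df, F = the ratio of the two mean squares):

```python
# Numbers copied from the ANOVA table above.
ss_regression, df_regression = 20.89698, 1
ss_residual, df_residual = 162.17235, 13

ms_regression = ss_regression / df_regression   # 20.89698
ms_residual = ss_residual / df_residual         # approximately 12.4748

f_stat = ms_regression / ms_residual            # approximately 1.67514
ss_total = ss_regression + ss_residual          # 183.06933
```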