
General


Simple Linear Regression

Normal equations

Σy = n·b0 + b1·Σx
Σxy = b0·Σx + b1·Σx²

 

Least-squares calculations

Sxx = Σ(x − x̄)² = Σx² − (Σx)²/n
Sxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
Syy = Σ(y − ȳ)² = Σy² − (Σy)²/n

 

Solution for normal equations

b1 = Sxy / Sxx
b0 = ȳ − b1·x̄
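As a sketch, the least-squares calculations and the solution of the normal equations can be coded directly; the data values below are made up for illustration only.

```python
# Fit a simple linear regression by the least-squares formulas.
# Sample data (made up for this example).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n  # 3.0
y_bar = sum(y) / n  # 4.0

# Least-squares calculations
S_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 10.0
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 6.0

# Solution for the normal equations
b1 = S_xy / S_xx          # slope: 0.6
b0 = y_bar - b1 * x_bar   # intercept: 2.2
```

The same slope and intercept would come out of solving the two normal equations above simultaneously; the Sxy/Sxx form is just the closed-form solution.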

 

Standard error of estimate

se = √(SSE / (n − 2)) = √((Syy − b1·Sxy) / (n − 2))

 

Test Statistic for inferences concerning β

t = (b1 − β) / s_b1, where s_b1 = se / √Sxx

Or it can be written as

t = (b1 − β)·√Sxx / se

with n − 2 degrees of freedom.

 

Confidence interval for regression coefficient β

b1 ± t(α/2, n−2) · se / √Sxx

 

Confidence interval for the mean of y when x = x0

ŷ0 ± t(α/2, n−2) · se · √(1/n + (x0 − x̄)² / Sxx)

 

Coefficient of correlation

r = Sxy / √(Sxx · Syy)
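Continuing the made-up sample from the earlier sketch, the standard error of estimate and the correlation coefficient follow from the same sums of squares.

```python
import math

# Same made-up sample as before.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

S_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 10.0
S_yy = sum((yi - y_bar) ** 2 for yi in y)                        # 6.0
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 6.0
b1 = S_xy / S_xx                                                 # 0.6

# Standard error of estimate: se = sqrt((Syy - b1*Sxy) / (n - 2))
s_e = math.sqrt((S_yy - b1 * S_xy) / (n - 2))

# Coefficient of correlation: r = Sxy / sqrt(Sxx * Syy)
r = S_xy / math.sqrt(S_xx * S_yy)
```

Note that for simple linear regression r carries the sign of the slope b1, since Sxx and Syy are always positive.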

 

Multiple Regression

Model: y = β0 + β1x1 + β2x2 + … + βkxk + ε

Expected value: E(y) = β0 + β1x1 + β2x2 + … + βkxk

 

Estimated Equation

ŷ = b0 + b1x1 + b2x2 + … + bkxk

The total variation decomposes as

SST = SSR + SSE
Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²

Where

SST is the total sum of squares,

SSR is the sum of squares due to regression,

SSE is the sum of squares due to error.

 

Multiple Coefficient of Determination

R² = SSR / SST

This is the percentage of the variation in y that can be explained by the sample regression line.

 

Adjusted Multiple Coefficient of Determination

Ra² = 1 − (1 − R²)·(n − 1) / (n − k − 1)

Ra² will always be smaller than R².
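A minimal sketch of both coefficients of determination, using made-up sums of squares, sample size, and number of predictors:

```python
# Made-up values for illustration.
SST = 100.0   # total sum of squares
SSE = 20.0    # sum of squares due to error
SSR = SST - SSE
n, k = 30, 3  # sample size and number of independent variables

R2 = SSR / SST                                  # 0.8
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)   # penalizes extra predictors
```

The (n − 1)/(n − k − 1) factor is what penalizes adding predictors: for fixed R², Ra² drops as k grows.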

 

Assumptions

Linearity

The relationship between the explanatory variable X and the response variable Y should be linear.

Methods for fitting a model to non-linear relationships exist but are beyond the scope of this course.

Check using a scatterplot of the data, or a residuals plot.

 

Nearly Normal Residuals

The residuals should be nearly normal.

This condition may not be satisfied when there are unusual observations that do not follow the trend of the rest of the data.

Check using histogram or normal probability plot of residuals.

 

Constant variability

The variability of points around the least squares line should be roughly constant.

This implies that the variability of residuals around the 0 line should be roughly constant as well.

It is also called homoscedasticity.

Check using a plot of residuals against the fitted values.

 

Testing for Significance

Whole Model

To determine whether a significant relationship exists between the dependent variable y and the set of all the independent variables, an F test is used.

Setting H0 and H1:

H0: β1 = β2 = … = βk = 0
H1: at least one of the parameters is not equal to zero

Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

Reject H0 if F exceeds the critical value Fα with k numerator and n − k − 1 denominator degrees of freedom.
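A sketch of the whole-model F statistic, again with made-up sums of squares (comparing the result against the critical Fα would normally be done with a table or software):

```python
# Made-up values for illustration.
SSR, SSE = 80.0, 20.0
n, k = 30, 3

MSR = SSR / k              # mean square due to regression
MSE = SSE / (n - k - 1)    # mean square due to error
F = MSR / MSE              # large F -> evidence against H0
```

With these numbers F ≈ 34.7, which would far exceed any conventional critical value for (3, 26) degrees of freedom.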

 

Coefficient of Individual X

Setting H0 and H1:

H0: βi = 0
H1: βi ≠ 0

Test statistic:

t = bi / s_bi

with n − k − 1 degrees of freedom. Reject H0 if |t| > t(α/2).

 

Multicollinearity

The correlation among the independent variables.

When the independent variables are highly correlated, say |r| > 0.7, it is not possible to determine the separate effect of each on the dependent variable.

Every attempt should be made to avoid including independent variables that are highly correlated.

Two predictor variables are said to be collinear when they are correlated, and this collinearity complicates model estimation.

Predictors that are associated with each other should generally not both be added to the model, since the second often contributes little new information. Instead, the simplest adequate model, called the parsimonious model, is preferred.

While collinearity cannot always be avoided in observational data, experiments are usually designed to prevent correlation among predictors.
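The |r| > 0.7 screening rule can be sketched by correlating two predictor columns before fitting; x1 and x2 below are made-up predictor values chosen to be nearly proportional.

```python
import math

def pearson_r(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)

x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 7, 8, 10]   # nearly a multiple of x1, so highly correlated

r = pearson_r(x1, x2)
collinear = abs(r) > 0.7   # flag the pair before adding both to a model
```

In practice this check would be run over every pair of candidate predictors.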

 

Qualitative Independent Variables

Examples include gender (male, female) and method of payment (cash, check, credit card).

For example, X1 might represent gender where X1 = 0 indicates male and X1 = 1 indicates female.

In this case, X1 is called a dummy or indicator variable.

 

More Complex Qualitative Variables

If a qualitative variable has k levels, k - 1 dummy variables are required, with each dummy variable being coded as 0 or 1.

For example, a variable with levels A, B, and C could be represented by X1 and X2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.

Care must be taken in defining and interpreting the dummy variables.
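The A/B/C coding above can be sketched as a small helper; the level names and the choice of A as the baseline (0, 0) level follow the example in the text.

```python
levels = ["A", "B", "C"]   # k = 3 levels of the qualitative variable
baseline = "A"             # the level represented by all-zero dummies

def dummies(value):
    """Return the (X1, X2) dummy coding for one observation."""
    return tuple(1 if value == lv else 0 for lv in levels if lv != baseline)

# dummies("A") -> (0, 0), dummies("B") -> (1, 0), dummies("C") -> (0, 1)
```

Using k − 1 rather than k dummies avoids making the dummy columns sum to a constant, which would itself be a perfect collinearity with the intercept.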

 

Residual Analysis

In multiple regression analysis it is preferable to use the residual plot against ŷ to determine if the model assumptions are satisfied.

Standardized residuals are frequently used in residual plots for:

Identifying outliers (typically, standardized residuals < -2 or > 2).

Providing insight into the assumption that the error term e has a normal distribution.

The computation of the standardized residuals in multiple regression analysis is too complex to be done by hand; Excel's regression tool can be used instead.
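The |z| > 2 outlier rule can still be illustrated with a simplified standardized residual, residual / se; the full version that software computes also divides by a leverage factor √(1 − hii), which is omitted here. The residuals below are made up.

```python
import math

# Made-up residuals from a fitted regression with k = 2 predictors.
residuals = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, 3.0]
n, k = len(residuals), 2

SSE = sum(e ** 2 for e in residuals)   # 11.25
s_e = math.sqrt(SSE / (n - k - 1))     # standard error of estimate

# Simplified standardized residuals (no leverage adjustment).
std_res = [e / s_e for e in residuals]
outliers = [i for i, z in enumerate(std_res) if abs(z) > 2]   # flags index 9
```

Only the last observation is flagged: its residual of 3.0 is more than two standard errors from zero, while the rest sit well inside the band.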