Chapter 14
Regression Analysis and Correlation
Modified: 2007-05-03
- I. Analyzing two interval variables--prediction is the key
- A. Regression equation--a way of predicting from OBSERVED data
- 1. Obtaining data on two interval variables
- Collect two data points from each participant
- 2. Plotting the data on a graph
- 3. Determining if plotted data can be summarized by a straight line
- This is the assumption of linearity: is a straight line the best summary of the data?
- 4. Calculating the regression equation
- When two or more variables for a set of data are known, one
variable can be used to predict the other. Such prediction is
usually accomplished via a regression line, and the process is
known as linear regression. Other functions (quadratic,
hyperbolic, curvilinear, or exponential) can be used to predict
data, but they are beyond the scope of this course.
[Figure: a regression line showing the best linear fit for bivariate data (mileage and weight).]
- Linear Regression
- Calculating the b coefficient
- Calculate b as follows:
b = r(Sy/Sx)
- where:
- r = correlation coefficient for X and Y
- Sy = standard deviation of Y
- Sx = standard deviation of X
- Can you see that for positive correlations, b will be
positive too?
- Can you see that for negative correlations, b will be
negative too?
- (Sy and Sx are never negative, so b always takes the sign of r.)
- Calculating the a coefficient
- Calculate a as follows:
a = Ȳ - bX̄
- where Ȳ and X̄ are the means of Y and X
- Finding Y'
- use:
Y' = a + bX
- or use:
Y' = r(Sy/Sx)(X - X̄) + Ȳ
- Drawing Regression Lines
- Use the mean of X and the mean of Y for your first
point
- Pick a value of X for the second point (any value will work)
and compute Y' using the regression equation.
- Draw a line between the two points (and extend the line in
either direction)
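The b, a, and Y' calculations above can be sketched in Python. This is a minimal illustration rather than part of the original notes: the function names and the five data points are hypothetical, and the intercept uses a = (mean of Y) - b(mean of X), which follows from setting the two forms of the Y' equation equal to each other.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return numerator / denominator

def regression_line(xs, ys):
    """Return (a, b) for Y' = a + bX, with b = r(Sy/Sx)."""
    b = pearson_r(xs, ys) * std_dev(ys) / std_dev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

def predict(a, b, x):
    """Predicted value Y' for a given X."""
    return a + b * x

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_line(xs, ys)
print(f"Y' = {a:.2f} + {b:.2f}X")
print(f"Predicted Y for X = 6: {predict(a, b, 6):.2f}")
```

Because the line always passes through the point (mean of X, mean of Y), predicting at the mean of X returns the mean of Y. That is why the two means make a convenient first point when drawing the line.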
Correlated data are often graphed on a scatterplot. One of the variables is plotted on the X axis and the other on the Y axis. Each dot of a scatterplot represents a pair of scores, one for the X variable and one for the Y variable.
[Figures: example scatterplots.]
- Don't forget to interpret the sign of the correlation!
- What Does + and - Sign Mean?
- Remember to interpret the sign of the correlation also. The
sign tells you the direction of the relationship.
- Positive sign
- Both variables change in the same direction (e.g.,
height and weight)
- Negative sign
- The variables change in opposite directions (e.g.,
smoking and distance run)
- Other examples?
- Effect Size for r
- What does the size of a correlation mean?
- Small relationship: r = .10
- Medium relationship: r = .30
- Large relationship: r = .50
- Uses of r
- Reliability
- one of the best and most common uses of r is to assess
the reliability of sets of measurements.
- typical examples include correlating the observations of
two or more observers, estimating the variability within a
population, and assessing whether or not developmental
processes are at work
- Sign of possible causation
- Some correlations will turn out to reflect a causal
relationship, but you cannot use r alone to prove it. So, another
common use of r is as a quick way to run a pilot study to be
followed later by experimentation.
- Coefficient of determination (r²)
- This is a very useful feature of r. The coefficient of
determination, r², tells you what proportion of
the variance is shared by two variables. This is useful as
an estimate of how much of the variance in a particular
behavior is accounted for by other variables.
- Other Issues Related to r
- Nonlinearity
- r is not the appropriate statistic to use for data that
are not linear. More complex relationships (e.g.,
curvilinear, exponential) may exist, but r will not indicate
those relationships. Other, nonlinear correlation techniques
must be used for such data.
- Truncated Range
- A truncated (or shortened) range occurs when the
sample's range is less than the population's range. In such
cases, r may understate, or miss entirely, a correlation
that actually exists in the population.
- b. Regression equation
- (1) Y' = a + bX, where a = constant, b = slope, Y' = predicted value of Y,
- X = independent (predictor) variable
- c. Y intercept (constant)
- d. Regression coefficient (slope)
- e. Predicted value of Y
- f. Residual
- g. Goodness of fit
- B. Correlation coefficient
The Pearson correlation coefficient (symbolized r) is a widely used descriptive statistic that shows the degree of relationship between two variables. Correlation coefficients range from -1.00 to 1.00, with .00 in the middle.
The strongest degree of relationship is indicated by r = 1.00 and r = -1.00. Both coefficients indicate that the relationship is perfect, which means that changes in the scores of one variable are accompanied by perfectly predictable changes in the scores of the other variable. The middle value, r = .00, means that there is no relationship between the two variables. When r = .00, the changes in one variable give no clue as to changes in the other variable.
Positive correlation coefficients (from .01 to 1.00) indicate that the two variables vary in the same direction. That is, as scores on one variable increase, scores on the other variable increase as well. Negative correlation coefficients (from -.01 to -1.00) mean that the variables change in opposite directions. As scores on one variable increase, scores on the other variable decrease. The closer r is to 1.00 or -1.00, the more predictable the increase or decrease is.
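The range of r described above can be illustrated with a short Python sketch. It assumes the z-score form of the Pearson formula (the sum of the products of paired z scores, divided by n - 1). The function names and all three small datasets are hypothetical, chosen only so that the results land on the landmark values of 1.00, -1.00, and .00.

```python
import math

def z_scores(values):
    """Convert raw scores to z scores using the sample standard deviation."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / s for v in values]

def pearson_r(xs, ys):
    """Pearson r as the sum of z-score products divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical scores, made up for illustration only.
x = [1, 2, 3, 4, 5]

r_positive = pearson_r(x, [10, 20, 30, 40, 50])  # Y rises as X rises
r_negative = pearson_r(x, [50, 40, 30, 20, 10])  # Y falls as X rises
r_none = pearson_r(x, [2, 1, 0, 1, 2])           # U shape: no linear trend

print(f"perfect positive: r = {r_positive:.2f}")
print(f"perfect negative: r = {r_negative:.2f}")
print(f"no linear relationship: r = {abs(r_none):.2f}")
```

The third dataset is deliberately U-shaped: Y is fully determined by X, yet r is zero, because r only measures how well a straight line describes the data.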
- 2. r²--the coefficient of determination
- This is a very useful feature of r. The coefficient of
determination, r², tells you what proportion of
the variance is shared by two variables. This is useful as
an estimate of how much of the variance in a particular
behavior is accounted for by other variables.
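As a small worked example, squaring the effect-size benchmarks listed under "Effect Size for r" shows how little variance a modest correlation accounts for. The function name `shared_variance` is hypothetical; the arithmetic is simply r multiplied by itself.

```python
def shared_variance(r):
    """Coefficient of determination: proportion of variance shared."""
    return r ** 2

# Benchmarks for small, medium, and large relationships from these notes.
benchmarks = [("small", 0.10), ("medium", 0.30), ("large", 0.50)]

for label, r in benchmarks:
    proportion = shared_variance(r)
    print(f"{label}: r = {r:.2f}, r squared = {proportion:.2f} "
          f"({proportion * 100:.0f}% of the variance)")
```

Even a "large" correlation of .50 leaves three quarters of the variance unaccounted for.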
- C. Other factors in interpreting linear regression
- 1. Size of the regression coefficient
- a. Size does not imply strength
- b. measurement scale
- 2. Standard error of the regression coefficient
- a. Estimate true value of the regression coefficient
- b. Determine if the relationship is statistically significant
- c. Determine significance of the regression coefficient
- 3. Range of X values used to calculate the regression equation
- a. Do not estimate outside range of data
- b. Extreme points – outliers
Outlier: A score separated from the others, conventionally one lying more than 1.5(IQR) below the 25th percentile or above the 75th percentile.
Here's an example of a dataset with an outlier:
[Figure: number of patents awarded by state in 1940 vs. 1950.]
Look for the outlier in these data; it is not that easy to see. If the outlier were not there, where would the regression line be? What should we do with these data?
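The 1.5(IQR) rule in the outlier definition can be sketched in Python as follows. The scores are made up for illustration and are not the patent data. The quartiles come from the standard library's `statistics.quantiles` with its default method. Textbooks compute quartiles in slightly different ways, so borderline scores may be classified differently by hand.

```python
import statistics

def iqr_outliers(scores):
    """Return scores more than 1.5 * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [s for s in scores if s < low_fence or s > high_fence]

# Hypothetical scores: seven similar values and one extreme value.
scores = [12, 15, 14, 10, 18, 16, 13, 95]

flagged = iqr_outliers(scores)
print("Flagged as outliers:", flagged)
```

Flagging a score is only the first step. Whether to keep, correct, or drop it is a judgment call that depends on why the score is extreme.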
- 4. Time interval between measuring X and Y
- a. Lagging a variable--use when there is a time interval between measurement and prediction.
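Lagging can be sketched as re-pairing the scores so that each X is matched with the Y measured some number of periods later. The function name `lag_pairs` and the monthly figures are hypothetical. The resulting pairs would then go into the usual regression calculation.

```python
def lag_pairs(xs, ys, lag):
    """Pair X at time t with Y at time t + lag.

    xs and ys must be equally long series measured at the same times.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if lag < 0 or lag >= len(xs):
        raise ValueError("lag must be between 0 and len(xs) - 1")
    return list(zip(xs[: len(xs) - lag], ys[lag:]))

# Hypothetical monthly data: X is measured now, Y responds a month later.
x_by_month = [1, 2, 3, 4]
y_by_month = [10, 11, 12, 13]

pairs = lag_pairs(x_by_month, y_by_month, lag=1)
print(pairs)
```

Each period of lag costs one pair of observations, so long lags on short series leave very little data for the regression.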
- II. Regression analysis: the multivariate case--here the issue is can prediction be improved by adding variables.
- A. Multiple regression equation
- 1. Defined and explained--does adding information improve prediction?
- How tall is a particular person?
- Most likely they will not be 10 feet tall or 2 inches tall
- Add variables: adults, gender, weight, play college basketball
- Does your prediction improve with these additional variables?
- 2. Partial regression coefficient--inverse of multiple regression, in a way
- What is the correlation between two variables when each is also under the influence of a third variable?
- Third variable is called a covariate
- Partial correlation removes (partials out) the effect of the covariate and allows measurement of the relatedness of the other two variables.
- Consider: V1 = reading speed, V2 = reading comprehension, Covariate is IQ
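For the reading example, a first-order partial correlation can be computed from the three pairwise correlations using the standard formula r12.3 = (r12 - r13 r23) / sqrt((1 - r13²)(1 - r23²)). The sketch below is illustrative only: the function name is hypothetical and the three correlation values are invented, not taken from any study.

```python
import math

def partial_r(r_12, r_13, r_23):
    """Correlation between variables 1 and 2 with variable 3 partialled out."""
    return (r_12 - r_13 * r_23) / math.sqrt((1 - r_13 ** 2) * (1 - r_23 ** 2))

# Hypothetical correlations, made up for illustration only.
r_speed_comprehension = 0.60  # V1 with V2
r_speed_iq = 0.50             # V1 with the covariate
r_comprehension_iq = 0.50     # V2 with the covariate

r_partial = partial_r(r_speed_comprehension, r_speed_iq, r_comprehension_iq)
print(f"speed-comprehension correlation with IQ removed: {r_partial:.2f}")
```

With these invented values the correlation drops once IQ is held constant, which is the pattern expected when part of the original relationship was due to the covariate.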
- 3. Dummy variables--dichotomous variables (personal characteristics) that are used in regression equation
- B. Measures of association for multiple regression
- 1. Coefficient of multiple determination--R² is the proportion of variance in Y accounted for by all of the independent variables together
- 2. Adjusted R² "penalizes" R² when too many variables are used; adjusted R² cannot be larger than R²
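The penalty can be sketched with the commonly used adjustment formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of cases and k the number of independent variables. The function name and the example values are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R squared for n cases and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than independent variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical result: R squared of .50 from 30 cases.
few_predictors = adjusted_r2(0.50, n=30, k=4)
many_predictors = adjusted_r2(0.50, n=30, k=12)

print(f"4 predictors:  adjusted R squared = {few_predictors:.2f}")
print(f"12 predictors: adjusted R squared = {many_predictors:.2f}")
```

The same R² is worth less when it took more variables to reach it, and the penalty is heavier when the sample is small.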
- C. Importance of each independent variable: beta weights
- 1. Defined and explained--beta weights are standardized regression coefficients. They make it possible to compare the relative contributions of each variable.
- 2. t-ratio
- III. Statistical significance and linear regression
- A. t-ratio
- B. F-ratio
- both of the above test null hypotheses
- the t-ratio tests the null hypothesis that a regression coefficient is 0
- the F-ratio tests the null hypothesis that all of the regression coefficients are 0 (the equation as a whole predicts no better than the mean of Y)
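For the bivariate case, both ratios can be computed directly from r and the sample size. This is a sketch under that assumption: with one predictor, the test of the slope is equivalent to the test of r, with t = r·sqrt(n - 2)/sqrt(1 - r²) on n - 2 degrees of freedom, and F equal to t squared. The function names and the example values are hypothetical.

```python
import math

def t_ratio(r, n):
    """t-ratio for the null hypothesis that the slope (and r) is 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def f_ratio(r, n):
    """F-ratio for the one-predictor regression equation as a whole."""
    return (r ** 2) * (n - 2) / (1 - r ** 2)

# Hypothetical result: r = .50 from 27 pairs of scores.
r, n = 0.50, 27

t = t_ratio(r, n)
f = f_ratio(r, n)
print(f"t({n - 2}) = {t:.2f}")
print(f"F(1, {n - 2}) = {f:.2f}")
```

Each value would be compared with the critical value from a t or F table at the chosen significance level. With several predictors the two tests are no longer equivalent: each coefficient gets its own t-ratio, and F tests the equation as a whole.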
- IV. Multicollinearity
- A. Defined--the independent variables are too highly correlated with one another, so the regression equation cannot estimate the independent effect of each variable
- B. Symptoms
- 1. Equation is significant, but no t-ratios are significant
- 2. Addition of an independent variable radically changes beta weights or regression coefficients
- V. Regression and non-interval variables
- A. Logistic regression--a model related to multiple regression that allows the use of a dichotomous dependent variable
- VI. Regression models to analyze time-series data
- A. Defined--predicting future trends
- linearity is required
- short term, seasonal, or cyclical data may invalidate results
- (e.g., forecasting an accountant's business from a sample drawn from Jan. to Apr.)
- B. Forecasting uses
- C. Autocorrelation (see: NIST definition)
- Use autocorrelation to check for non-randomness and appropriateness of time series design
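A lag-k autocorrelation can be sketched as the correlation of a series with a copy of itself shifted k periods. The version below uses one common definition: the sum of products of deviations k periods apart, divided by the total sum of squared deviations. The function name and both series are hypothetical.

```python
def autocorrelation(series, lag):
    """Lag-k autocorrelation coefficient of a time series."""
    n = len(series)
    if lag < 1 or lag >= n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    m = sum(series) / n
    numerator = sum(
        (series[i] - m) * (series[i + lag] - m) for i in range(n - lag)
    )
    denominator = sum((y - m) ** 2 for y in series)
    return numerator / denominator

# Hypothetical series, made up for illustration only.
trending = [1, 2, 3, 4, 5]            # steady upward trend
alternating = [1, -1, 1, -1, 1, -1]   # flips sign every period

r1_trend = autocorrelation(trending, lag=1)
r1_alternating = autocorrelation(alternating, lag=1)
print(f"trending series, lag 1: {r1_trend:.2f}")
print(f"alternating series, lag 1: {r1_alternating:.2f}")
```

Values near zero at every lag are what a random series would produce. Clearly positive or negative values, as in both examples here, signal non-randomness that a time-series model needs to account for.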
- VII. Regression and causality--experiments guarantee causality. Can multiple regression substitute for experimentation in the search for causality? Only when the following issues are handled. Even then, caution is advised.
- A. Control for possible causal variables--have all possible variables been identified?
- B. Quality of model--is the model sufficiently powerful?
- C. Linearity--is the model linear? If not, what is the relationship?
- D. Correlation does not equal causality!
- VIII. Chapter summary
- IX. Appendix 15.1: Calculating Regression Statistics for Bivariate Relationships
Online Resources
Correlation coefficient
Interactive page allows user to see regression line and scatterplot of bivariate distribution whose values range from -1.00 to +1.00.
http://noppa5.pc.helsinki.fi/koe/corr/cor7.html
Wikipedia on correlation coefficient
Long page discusses correlation coefficient in detail including mathematical properties, non-parametric coefficients, and common misconceptions about correlation.
http://en.wikipedia.org/wiki/Correlation
Scatterplot
Shows sample scatterplots and explains positive and negative associations between bivariate data.
http://www.stat.yale.edu/Courses/1997-98/101/scatter.htm
Multiple Regression
Online primer covers basic topics in multiple regression
http://www.statsoft.com/textbook/stmulreg.html