The linear, or Pearson, correlation coefficient is used to describe the strength of the linear relationship between two numeric variables. The sample correlation r is a number between -1 and 1. If r=-1, then the variables X and Y perfectly follow a straight line, with Y decreasing as X increases. If r=0, then X and Y do not have a linear relationship. If r=1, then X and Y perfectly follow a straight line, with Y increasing as X increases. If all values of X and/or Y are equal, then the correlation is undefined mathematically. PROC CORR is the most common method of obtaining correlation coefficients in SAS.
The following data were obtained from StatLib at Carnegie Mellon University. Thirty samples of cheddar cheese were tasted by a panel, and their average scores across panelists were calculated. The amounts of acetic acid, hydrogen sulfide (H2S), and lactic acid in each sample were also calculated.
data cheddar;
  input case taste acetic h2s lactic @@;
  datalines;
 1 12.3 4.543  3.135 0.86
 2 20.9 5.159  5.043 1.53
 3 39   5.366  5.438 1.57
 4 47.9 5.759  7.496 1.81
 5  5.6 4.663  3.807 0.99
 6 25.9 5.697  7.601 1.09
 7 37.3 5.892  8.726 1.29
 8 21.9 6.078  7.966 1.78
 9 18.1 4.898  3.85  1.29
10 21   5.242  4.174 1.58
11 34.9 5.74   6.142 1.68
12 57.2 6.446  7.908 1.9
13  0.7 4.477  2.996 1.06
14 25.9 5.236  4.942 1.3
15 54.9 6.151  6.752 1.52
16 40.9 6.365  9.588 1.74
17 15.9 4.787  3.912 1.16
18  6.4 5.412  4.7   1.49
19 18   5.247  6.174 1.63
20 38.9 5.438  9.064 1.99
21 14   4.564  4.949 1.15
22 15.2 5.298  5.22  1.33
23 32   5.455  9.242 1.44
24 56.7 5.855 10.199 2.01
25 16.8 5.366  3.664 1.31
26 11.6 6.043  3.219 1.46
27 26.5 6.458  6.962 1.72
28  0.7 5.328  3.912 1.25
29 13.4 5.802  6.685 1.08
30  5.5 6.176  4.787 1.25
;
run;
A cheesemaker may be interested in the correlations of the chemical constituents with taste and with each other. A correlation coefficient by itself does not indicate whether two variables have a strong relationship! It is possible, for example, for X and Y to have a curvilinear relationship with a small correlation. It is also possible for a correlation coefficient near 1 or -1 to be solely due to an outlying point. To see if linear relationships of the chemicals with taste are reasonable, plots of the variables should be constructed before pursuing regression or correlation analyses. One way to produce the plots of taste versus the chemicals is shown below.
proc plot data=cheddar;
  plot taste*(acetic h2s lactic);
run;
One of the plots is shown below. All three chemicals have a positive relationship with TASTE.
[Plot of TASTE*H2S: TASTE on the vertical axis (0 to 60) against H2S on the horizontal axis (2 to 12). Legend: A = 1 obs, B = 2 obs, etc.]
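On SAS releases that include ODS Graphics, a higher-resolution version of the same scatter plot can usually be produced with PROC SGPLOT. The sketch below is an alternative to the PROC PLOT step rather than part of the original example; it assumes that the CHEDDAR dataset has been created as shown earlier and that PROC SGPLOT is available in your installation.

```sas
/* Assumption: ODS Graphics (PROC SGPLOT) is available in this SAS installation. */
proc sgplot data=cheddar;
  scatter x=h2s y=taste;   /* one marker per cheese sample */
run;
```

Repeating the step with X=ACETIC or X=LACTIC gives the other two plots.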
The correlations of TASTE with the other three variables can be obtained as follows:
proc corr data=cheddar;
  var taste;
  with acetic h2s lactic;
run;
The following statements produce the same results, except that the correlations are listed horizontally across the page rather than vertically.
proc corr data=cheddar;
  var acetic h2s lactic;
  with taste;
run;
The last set of statements produces the following output.
                        Correlation Analysis

   1 'WITH' Variables:  TASTE
   3 'VAR'  Variables:  ACETIC   H2S      LACTIC

                         Simple Statistics

Variable    N       Mean    Std Dev        Sum    Minimum    Maximum
TASTE      30    24.5333    16.2554   736.0000     0.7000    57.2000
ACETIC     30     5.4980     0.5709   164.9410     4.4770     6.4580
H2S        30     5.9418     2.1269   178.2530     2.9960    10.1990
LACTIC     30     1.4420     0.3035    43.2600     0.8600     2.0100

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 30

             ACETIC        H2S     LACTIC
TASTE       0.54954    0.75575    0.70424
             0.0017     0.0001     0.0001
Notice that PROC CORR automatically computes some simple statistics for each variable. The output shows that the correlation between taste score and acetic acid is r=.54954. When testing the null hypothesis that the population correlation between these two variables is zero against the alternative hypothesis that it is different from zero, a low p-value of .0017 is obtained, based on the sample size of N=30 cheeses. Thus, taste and acetic acid have a significant positive relationship.
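The p-value for a Pearson correlation comes from a t test with N-2 degrees of freedom, where t = r*sqrt((N-2)/(1-r**2)). The DATA step below is a minimal sketch that repeats this calculation by hand for taste and acetic acid. The variable names R, N, T, and P are chosen here for illustration only.

```sas
data _null_;
  r = 0.54954;                      /* sample correlation of TASTE and ACETIC */
  n = 30;                           /* number of cheeses */
  t = r*sqrt((n-2)/(1-r**2));       /* t statistic with n-2 degrees of freedom */
  p = 2*(1 - probt(abs(t), n-2));   /* two-sided p-value */
  put t= p=;                        /* write both values to the SAS log */
run;
```

The value of P written to the log should be close to the .0017 reported by PROC CORR.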
The following statements produce all pairwise correlations among the three chemicals, including the correlation of each variable with itself.
proc corr data=cheddar;
  var acetic h2s lactic;
run;
You may need to transfer the correlations to another dataset for subsequent work. If so, you can use the OUTP= option to create an output dataset (the P in OUTP is for Pearson correlation coefficients). You can also use NOPRINT to suppress printing those correlations; this is useful if many variables are involved. An example is shown below.
proc corr data=cheddar outp=chedcorr noprint;
  var acetic h2s lactic;
run;
proc print data=chedcorr;
run;
This produces the following output. Notice that means, standard deviations, and sample sizes are also included. If you only needed to use the correlations, you could remove the other statistics by keeping only the observations with _TYPE_='CORR' in a DATA step.
OBS   _TYPE_   _NAME_     ACETIC       H2S    LACTIC
  1   MEAN                5.4980    5.9418    1.4420
  2   STD                 0.5709    2.1269    0.3035
  3   N                  30.0000   30.0000   30.0000
  4   CORR     ACETIC     1.0000    0.6180    0.6038
  5   CORR     H2S        0.6180    1.0000    0.6448
  6   CORR     LACTIC     0.6038    0.6448    1.0000
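The DATA step mentioned above, which keeps only the correlation rows, might look like the following sketch. The output dataset name CORRONLY is hypothetical; CHEDCORR is the dataset created by the OUTP= option.

```sas
data corronly;            /* hypothetical name for the reduced dataset */
  set chedcorr;
  if _type_ = 'CORR';     /* subsetting IF: drop the MEAN, STD, and N rows */
run;
proc print data=corronly;
run;
```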
Regression involves estimating the terms of an equation that predicts a response Y from one or more predictor variables X1, X2, etc. In linear regression, the equation has the form Y = b0 + b1X1 + b2X2 + ... + bkXk + e, where e represents a random error term that is assumed to be normally distributed with mean 0 and constant variance, and to be independent of the errors of all other observations.
PROC REG is the main SAS procedure for linear regression, but it can also be done with PROC GLM. PROC NLIN is used for nonlinear regression. SAS has other regression routines for special types of data.
In the cheese study, we may want to develop an equation to predict taste scores from the various chemical constituents. For example, we may want to know how hydrogen sulfide affects taste. The plot above showed that these two variables are positively associated with each other, and the linear correlation between the two variables was large. The following SAS code fits the regression line.
proc reg data=cheddar;
  model taste = h2s;
run;
In the MODEL statement, the response (or Y, or dependent) variable is listed before an equals sign, followed by the predictor (or X, or independent) variables. This produces the following output.
                     Analysis of Variance

                     Sum of          Mean
Source     DF       Squares        Square    F Value    Prob>F
Model       1    4376.74585    4376.74585     37.293    0.0001
Error      28    3286.14082     117.36217
C Total    29    7662.88667

Root MSE    10.83338    R-square    0.5712
Dep Mean    24.53333    Adj R-sq    0.5558
C.V.        44.15781
The table above shows various statistics computed from the means and variances of TASTE and H2S, as well as their correlation (Dep Mean is the mean of TASTE). All of these quantities are defined in standard statistics textbooks, with the possible exceptions of C.V. = 100(Root MSE)/(Dep Mean) and Adj R-sq = 1 - (1 - R^2)(n - 1)/dfe, where R^2 is R-square, n is the number of observations, and dfe is the degrees of freedom for error.
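As a quick check of the two formulas, the following DATA step recomputes C.V. and Adj R-sq from the other values in the table. It is only a sketch; the variable names are made up for this illustration.

```sas
data _null_;
  rootmse = 10.83338;                    /* Root MSE from the table */
  depmean = 24.53333;                    /* Dep Mean from the table */
  rsq     = 0.5712;                      /* R-square from the table */
  n       = 30;                          /* number of observations */
  dfe     = 28;                          /* error degrees of freedom */
  cv      = 100*rootmse/depmean;         /* coefficient of variation */
  adjrsq  = 1 - (1 - rsq)*(n - 1)/dfe;   /* adjusted R-square */
  put cv= adjrsq=;                       /* write both values to the SAS log */
run;
```

The values written to the log should agree with the printed C.V. and Adj R-sq, apart from small differences caused by the rounding of R-square in the table.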
                      Parameter Estimates

                  Parameter      Standard    T for H0:
Variable    DF     Estimate         Error  Parameter=0   Prob > |T|
INTERCEP     1    -9.786837    5.95791028       -1.643       0.1116
H2S          1     5.776089    0.94584996        6.107       0.0001
From this section of the output, we obtain the equation (Estimated TASTE) = -9.79 + 5.78 (H2S). The coefficients' standard errors are printed next, followed by results of testing whether the parameters differ significantly from zero. In this case, we can conclude that hydrogen sulfide is a significant predictor of taste.
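To see how these pieces fit together, the sketch below recomputes the t statistic for H2S as the estimate divided by its standard error, and uses the fitted equation to predict taste at H2S=5. The variable names are hypothetical, and H2S=5 is an arbitrary value chosen for illustration.

```sas
data _null_;
  b0   = -9.786837;                  /* intercept estimate */
  b1   = 5.776089;                   /* slope estimate for H2S */
  se1  = 0.94584996;                 /* standard error of the slope */
  t    = b1/se1;                     /* T for H0: Parameter=0 */
  p    = 2*(1 - probt(abs(t), 28));  /* two-sided p-value, 28 error df */
  pred = b0 + b1*5;                  /* predicted TASTE at H2S = 5 */
  put t= p= pred=;                   /* write the results to the SAS log */
run;
```

The t statistic should match the printed 6.107, and the predicted taste score at H2S=5 is about 19.09.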
The assumptions of the regression model should be checked before drawing conclusions. For example, you should visually verify that a straight line is an appropriate model when using one predictor. Also, the residuals, or the differences between the observed and predicted values, should have equal variability regardless of the size of the predicted taste value, and they should have no apparent trends. You can check these with the following statements.
proc reg data=cheddar;
  model taste=h2s;
  plot predicted.*h2s='+' taste*h2s='O' / overlay;
  plot residual.*predicted.;
run;
SAS produces the following plots.
[Overlay plot: predicted value of TASTE (plotted as '+') and observed TASTE (plotted as 'O') on the vertical axis (0 to 60) against H2S on the horizontal axis (3 to 11).]
The regression line (+'s) seems to "go through" the cloud of data points (O's) quite well.
[Residual plot: RESIDUAL on the vertical axis (-20 to 40) against the predicted value of TASTE on the horizontal axis (5 to 50).]
Notice that the residuals are somewhat more spread out for middle predicted values of taste than for low or high values. However, the difference is not large enough to give a clear indication that the variability of the residuals is not constant.
Suppose that you wanted to see if the residuals are normally distributed. Then, you would need to make a new dataset which contains the residuals. This can be done with the following:
proc reg data=cheddar;
  model taste=h2s;
  output out=results residual=_resid_;
run;
The new dataset RESULTS looks like this:
OBS   CASE   TASTE   ACETIC     H2S   LACTIC   _RESID_
  1      1    12.3    4.543   3.135     0.86    3.9788
  2      2    20.9    5.159   5.043     1.53    1.5580
  3      3    39.0    5.366   5.438     1.57   17.3765
(more lines follow)
The variable _RESID_ can then be analyzed for normality using PROC UNIVARIATE.
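One possible way to do this is sketched below. The NORMAL option requests tests of normality, and the PLOT option requests simple diagnostic plots, including a normal probability plot. The exact appearance of the output depends on the SAS release.

```sas
proc univariate data=results normal plot;
  var _resid_;   /* residuals saved by the OUTPUT statement in PROC REG */
run;
```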
SAS provides regression equations, but it can also use those equations to provide new estimates for you. For example, suppose that you wanted to use the regression equation above to predict TASTE for H2S=3, 4, 5, ..., 8. You could create new observations with these H2S values (and all other variables missing) and append them to a copy of the CHEDDAR dataset as follows:
data new;
  do h2s=3 to 8;
    output;
  end;
run;
data cheesy;
  set cheddar new;
run;
Now, use SAS to fit the regression line. Only the observations with nonmissing TASTE and H2S will be used to calculate the coefficients, but SAS will predict TASTE for every observation with a nonmissing H2S value.
proc reg data=cheesy;
  model taste=h2s;
  output out=pset predicted=_pred_;
run;
proc print data=pset;
run;
SAS produces the following output.
OBS   CASE   TASTE   ACETIC     H2S   LACTIC    _PRED_
  1      1    12.3    4.543   3.135     0.86    8.3212
  2      2    20.9    5.159   5.043     1.53   19.3420
  3      3    39.0    5.366   5.438     1.57   21.6235
(lines deleted)
 30     30     5.5    6.176   4.787     1.25   17.8633
 31      .      .      .      3.000      .      7.5414
 32      .      .      .      4.000      .     13.3175
 33      .      .      .      5.000      .     19.0936
 34      .      .      .      6.000      .     24.8697
 35      .      .      .      7.000      .     30.6458
 36      .      .      .      8.000      .     36.4219
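If you also want a measure of uncertainty for each prediction, the OUTPUT statement can save prediction limits along with the predicted values. The sketch below uses the LCL= and UCL= keywords, which give the lower and upper limits of a 95% interval (by default) for an individual prediction. The names PSET2, _LOWER_, and _UPPER_ are hypothetical.

```sas
proc reg data=cheesy;
  model taste=h2s;
  /* LCL= and UCL= save prediction limits for an individual observation */
  output out=pset2 predicted=_pred_ lcl=_lower_ ucl=_upper_;
run;
proc print data=pset2;
  where case = .;                   /* show only the appended rows */
  var h2s _pred_ _lower_ _upper_;
run;
```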
To use two or more predictors in a regression model, simply list all of them after the equals sign in the MODEL statement. See the following example.
proc reg data=cheddar;
  model taste=acetic h2s lactic;
run;
Suppose that you wanted to fit a polynomial regression model to predict taste score from (lactic acid) and (lactic acid)^2. You would first need to define a new variable for the squared term, as follows:
data cheddar;
  set cheddar;
  lactic_2=lactic**2;
run;
proc reg data=cheddar;
  model taste=lactic lactic_2;
run;
Adding more variables to a regression model is not always beneficial. SAS provides several ways to help you decide whether new variables should be added to the model. Consult a regression textbook and SAS documentation for details.
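As one illustration, the SELECTION= option of the MODEL statement in PROC REG requests automated variable selection. The sketch below runs stepwise selection on the three chemicals. The entry and stay significance levels of 0.15 are illustrative choices, not recommendations, and other methods (for example FORWARD, BACKWARD, or RSQUARE) may suit a given problem better.

```sas
proc reg data=cheddar;
  /* SLENTRY= and SLSTAY= set the significance levels to enter and to stay */
  model taste = acetic h2s lactic
        / selection=stepwise slentry=0.15 slstay=0.15;
run;
```

Automated selection is only an aid. The chosen model should still be checked against subject-matter knowledge and the diagnostic plots described earlier.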
K.M. Portier and the University of Florida, 2004