curiouseyes: What is wrong with computing correlations on averages, in terms of statistics?
Explain when it is appropriate to use averages when computing correlations, and what statisticians should be aware of when doing this.
Answers and Views:
Answer by David F
Generally, correlations are used to test for relationships among individual-level variables. For example, one might want to test for a correlation between SAT scores and freshman GPA in college.
However, there are some circumstances where it is appropriate to correlate averages. Suppose we had a hypothesis that job satisfaction was related to sales performance in a fast-food restaurant chain. Since the dependent variable (sales) exists only at the store level, a practical way to test this hypothesis is to compute the average job satisfaction level for each restaurant and correlate it with restaurant sales. In other words, the restaurant becomes the unit of analysis instead of the individual employee. Obviously, you need many restaurants to do this kind of analysis, but this kind of research is pretty common.
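As a rough illustration, here is a minimal sketch in Python (pandas) of that aggregation step; the tables, column names, and numbers are all hypothetical, not from the original answer:

```python
import pandas as pd

# Hypothetical employee-level data: one row per employee with a satisfaction score.
employees = pd.DataFrame({
    "store_id":     [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "satisfaction": [4.2, 3.8, 4.0, 2.9, 3.1, 4.5, 4.7, 4.3, 3.3, 3.6],
})

# Hypothetical store-level data: one row per restaurant with its sales figure.
stores = pd.DataFrame({
    "store_id": [1, 2, 3, 4],
    "sales":    [120_000, 90_000, 150_000, 100_000],
})

# Aggregate satisfaction to the store level, then correlate the averages with sales.
avg_satisfaction = employees.groupby("store_id")["satisfaction"].mean()
merged = stores.set_index("store_id").join(avg_satisfaction)
print(merged["satisfaction"].corr(merged["sales"]))  # Pearson r across restaurants
```

The key design point is that the join happens at the store level, so each restaurant contributes exactly one pair of values to the correlation.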
Answer by Merlyn
The correlation coefficient, r, is a measure of the linear relationship between two variables. If the relationship is non-linear, the correlation coefficient can be misleading, because it captures only the linear component of the association.
r takes on values between -1 and 1. Negative values indicate the relationship between the variables is inverse, i.e., on a scatter plot the data tend to have a negative slope. Positive values of r indicate the data tend to have a positive slope. If r = 0 we say the variables are uncorrelated.
The closer the absolute value of r is to 1, the stronger the linear association between the two variables.
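A quick illustration of the linearity caveat, using NumPy on made-up data: a relationship that is perfectly deterministic but non-linear can still give r near zero.

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y_linear = 2 * x + 1     # perfectly linear relationship: r = 1
y_quadratic = x ** 2     # strong but non-linear relationship

print(round(np.corrcoef(x, y_linear)[0, 1], 3))     # 1.0
print(round(np.corrcoef(x, y_quadratic)[0, 1], 3))  # ~0.0, even though y is fully determined by x
```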
There are many different formulas for calculating the value of r. If we let xbar and ybar be the means of the two data sets, sx and sy their sample standard deviations, and n the total sample size, then:
r = 1/(n − 1) * Σ( ((xi − xbar)/sx) * ((yi − ybar)/sy) ), with the sum going from i = 1 to n
r = Cov(X, Y) / ( √Var(X) * √Var(Y) )
The second equation shows that the correlation coefficient is the ratio of the covariance (the spread between the two variables) to the product of the spread within each variable.
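Here is a minimal numerical check of both formulas on made-up data, compared against NumPy's built-in corrcoef (note that the formulas above use the sample standard deviation, i.e. an n − 1 denominator):

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.5, 3.0, 4.5, 6.5, 8.0])
n = len(x)

# First formula: sum of products of standardized scores, divided by n - 1.
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r1 = np.sum(zx * zy) / (n - 1)

# Second formula: covariance divided by the product of the standard deviations.
r2 = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

print(r1, r2, np.corrcoef(x, y)[0, 1])  # all three values agree
```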
r is unitless.
r is not affected by adding a constant to each data set, multiplying each data set by a positive constant, or interchanging x and y.
r is sensitive to outliers.
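Those properties are easy to verify on simulated data; the sketch below (made-up numbers) also shows how much a single outlier can move r:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)

r = np.corrcoef(x, y)[0, 1]
print(np.corrcoef(3 * x + 10, y)[0, 1] - r)  # ~0: positive rescale and shift leave r unchanged
print(np.corrcoef(y, x)[0, 1] - r)           # 0: swapping x and y leaves r unchanged

# A single extreme point can drag r down noticeably.
x_out = np.append(x, 10.0)
y_out = np.append(y, -10.0)
print(r, np.corrcoef(x_out, y_out)[0, 1])
```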
Also note that correlation is not causation. Here is an example: the shoe size of grade school students and the students' vocabulary are highly correlated. In other words, the larger the shoe size, the larger the vocabulary the student has. Now it is easy to see that shoe size and vocabulary have nothing to do with each other, but they are highly correlated. The reason is that there is a confounding factor: age. The older the grade school student, the larger the shoe size and the larger the vocabulary.
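Here is a small simulation of that confounding story, with made-up coefficients: shoe size and vocabulary are each driven by age, so they correlate strongly overall, but within a single age group the correlation largely disappears.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.integers(6, 12, size=500)                     # grade-school ages, 6 through 11
shoe_size = 0.8 * age + rng.normal(scale=0.5, size=500)
vocabulary = 900 * age + rng.normal(scale=800, size=500)

print(np.corrcoef(shoe_size, vocabulary)[0, 1])         # strongly positive overall

# Hold the confounder fixed: among 8-year-olds alone, the correlation collapses.
mask = age == 8
print(np.corrcoef(shoe_size[mask], vocabulary[mask])[0, 1])  # near 0
```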
You cannot compare models by comparing their r values. This is a long discussion, a full-day lecture in the prob/stat courses I've instructed. Model comparison is a topic usually saved for upper-level undergraduate or graduate courses.
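One classic illustration of why r alone is not enough is Anscombe's quartet; the two datasets below have nearly identical r values, yet a straight-line model is reasonable for the first and clearly inappropriate for the second.

```python
import numpy as np

# Anscombe's quartet, sets I and II: same x values, very different shapes, nearly identical r.
x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

print(np.corrcoef(x, y1)[0, 1])  # ~0.816, roughly linear scatter
print(np.corrcoef(x, y2)[0, 1])  # ~0.816, but the relationship is clearly curved
```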
Good sites with info about correlation are:
https://mathworld.wolfram.com/CorrelationCoefficient.html
https://mathworld.wolfram.com/LeastSquaresFitting.html