Calculating correlation coefficient r

The most common way to calculate the correlation coefficient (r) is by using technology, but using the formula can help us understand how r measures the direction and strength of the linear association between two quantitative variables.

Want to join the conversation?

  • Alison
    Why would you not divide by 4 when getting the SD for x? I don't understand how we got three.
    (37 votes)
  • ju lee
    Why is r always between -1 and 1?

    I know that this question has been asked before, but the answers are either too technical or too naive. Could someone please provide an answer that is mathematical in nature but can be understood by someone who has an OK but not strong mathematical foundation?

    Thanks for your help.
    (14 votes)
    • Mendel Yaffe
      Hey there,

      r is very different from slope. Slope can have any positive or negative value since the question being addressed is as follows: How much does y change when x changes? r is not telling us how great that change is. r is just trying to tell us whether the linear relationship between x and y is positive, negative, or neither, and how strong it is. In math terms, is r close to 1, -1, or 0? If the relationship between x and y is perfectly positive then r=1. If the relationship is positive but not perfectly so it might have a score of 0.85 (or any other number between 0 and 1). If there is no linear relationship then r=0. If the relationship is perfectly negative then r=-1.

      I think this is easiest to understand if you try to visualize what a change in r means in terms of the angle of the least squares line (which Sal draws in the video). The least squares line will always go through the mean of X and the mean of Y. So imagine the minute hand on a clock which can rotate 360 degrees but is pinned down to the centre of the clock.

      When it comes to telling the time we refer to the angle of the minute hand by splitting the clock into 60. Here, when we say that r has a value of 1 we are basically saying that on average an increase in X will result in an increase in Y. This line (r=1) is an upwards sloping line. Now let's rotate our line clockwise....until the line is a straight horizontal line. This line has an r value of 0. (Which btw means that a change in X results in no change in Y).

      Let's continue rotating our imaginary line clockwise....now we are moving 'beneath' the line which has an r of 0 so we are moving into negative territory. Do you see what I am getting at? Now r has a negative value. As this line moves further and further from the line which has an r of 0 we are getting closer to the 'opposite' of the line which had an r of positive 1. A line which is a 'perfect opposite' of r=1 will be r=-1, i.e. a downwards sloping line.

      But this cannot go on forever. As we rotate our line further and further clockwise we once again pass the perfectly horizontal line (r=0), but this time we are moving into positive territory i.e. we are moving away from r=0 and closer and closer to the line which has an r of 1.

      Hope that helps :)
      (14 votes)
  • Mihaita Gheorghiu
    Why is r always between -1 and 1?
    (6 votes)
  • In_Math_I_Trust
    Is the correlation coefficient also called the Pearson correlation coefficient?
    (6 votes)
    • Robin Yadav
      Yes. The Pearson correlation coefficient (also known as the Pearson product-moment correlation coefficient) computed from a sample is the same quantity as the sample correlation coefficient r. In this video, Sal showed the calculation for the sample correlation coefficient.
      (4 votes)
  • poojapatel.3010
    How was the formula for correlation derived?
    (6 votes)
  • Grace
    bro like im in 6th and i can't understand a single word of this
    (6 votes)
  • circlePulse
    What calculator is Sal using?
    (4 votes)
  • Joshua Kim
    What does the little i stand for? Like in xi or yi in the equation. Also, the sideways m means sum right?
    (2 votes)
    • DiannaFaulk
      This is a bit of math lingo related to doing the sum function, "Σ". The "i" tells us which x or y value we want. Imagine we're going through the data points in order: (1,1) then (2,2) then (2,3) then (3,6). Remembering that these stand for (x,y), if we went through all the "x"s, we would get "1" then "2" then "2" again then "3". The "i" indicates which index of that list we're on. So if "i" is 1, then "Xi" is "1", if "i" is 2 then "Xi" is "2", if "i" is 3 then "Xi" is "2" again, and then when "i" is 4 then "Xi" is "3".
      (7 votes)
  • kjs2214
    I understand that the equation for r is the average of the x-zscores multiplied by their corresponding y-zscores. But what's the reasoning behind multiplying the zscores?
    (4 votes)
    • daniella
      Multiplying the z-scores of X and Y in the correlation coefficient formula helps capture the relationship between the deviations of each variable from their respective means.
      When both z-scores have the same sign (both positive or both negative), their product is positive, indicating a positive correlation. When the signs differ (one positive, one negative), their product is negative, indicating a negative correlation.
      Essentially, multiplying the z-scores accounts for the direction of deviation from the mean in both variables, which is crucial for assessing the relationship between them.
      (1 vote)
  • Teresa Chan
    Why is the denominator n-1 instead of n?
    Thanks.
    (3 votes)
    • Vyacheslav Shults
      When the instructor calculated the standard deviation (std), he used the sample formula, which has n-1 in the denominator. If you have the whole population (or almost all of it), there is another way to calculate the correlation: use the population std, which has n in the denominator, and divide by n instead of n-1 in the overall formula. It does not matter which way you choose, as long as you use the same divisor throughout. The result will be the same.
      (3 votes)
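
The claim in the last reply, that dividing by n throughout gives the same r as dividing by n-1 throughout, can be checked with a short Python sketch. It uses the four data pairs from the video and the standard library's `statistics.stdev` (sample, n-1) and `statistics.pstdev` (population, n). The function name `correlation` and its `sample` flag are illustrative choices, not anything from the video.

```python
import statistics

# Data pairs from the video: (1, 1), (2, 2), (2, 3), (3, 6)
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]


def correlation(xs, ys, sample=True):
    """Compute r using n - 1 throughout (sample=True) or n throughout (sample=False)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    if sample:
        s_x, s_y, divisor = statistics.stdev(xs), statistics.stdev(ys), n - 1
    else:
        s_x, s_y, divisor = statistics.pstdev(xs), statistics.pstdev(ys), n
    # Sum of the products of corresponding z-scores
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / divisor


r_sample = correlation(xs, ys, sample=True)
r_population = correlation(xs, ys, sample=False)
print(abs(r_sample - r_population) < 1e-12)  # True
```

The two versions agree because the divisor appears once in the outer factor and once (under square roots) in each of the two standard deviations, so it cancels.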

Video transcript

- [Instructor] What we're going to do in this video is calculate by hand the correlation coefficient for a set of bivariate data. Now, when I say bivariate it's just a fancy way of saying for each X data point, there's a corresponding Y data point. Now, before I calculate the correlation coefficient, let's just make sure we understand some of these other statistics that they've given us. So, we assume that these are samples of the X and the corresponding Y from our broader population. And so, we have the sample mean for X and the sample standard deviation for X. The sample mean for X is quite straightforward to calculate, it would just be one plus two plus two plus three over four and this is eight over four which is indeed equal to two. The sample standard deviation for X, we've also seen this before, this should be a little bit of review, it's gonna be the square root of the distance from each of these points to the sample mean squared. So, one minus two squared plus two minus two squared plus two minus two squared plus three minus two squared, all of that over, since we're talking about sample standard deviation, we have four data points, so one less than four is three, all of that over three. Now, this actually simplifies quite nicely because this is zero, this is zero, this is one, this is one and so you essentially get the square root of 2/3 which is approximately 0.816. So, that's that. And the same thing is true for Y. The sample mean for Y, if you just add up one plus two plus three plus six over four, four data points, this is 12 over four which is indeed equal to three and then the sample standard deviation for Y you would calculate the exact same way we did it for X and you would get 2.160. Now, with all of that out of the way, let's think about how we calculate the correlation coefficient. Now, right over here is a representation for the formula for the correlation coefficient and at first it might seem a little intimidating until you realize a few things. 
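
The on-screen formula is not reproduced in the transcript. Going by the description that follows (one over n minus one, times a sum of products of z-scores), it is presumably the following, where the symbols $\bar{x}$, $\bar{y}$ for the sample means and $s_x$, $s_y$ for the sample standard deviations are notation assumed here:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each parenthesized factor is a z-score: the first for the i-th X value, the second for the corresponding Y value.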
All this is saying is for each corresponding X and Y, find the Z score for X, so we could call this Z sub X for that particular X, so Z sub X sub I and we could say this is the Z score for that particular Y. Z sub Y sub I is one way that you could think about it. Look, this is just saying for each data point, find the difference between it and its mean and then divide by the sample standard deviation. And so, that's how many sample standard deviations it is away from its mean, and so that's the Z score for that X data point and this is the Z score for the corresponding Y data point. How many sample standard deviations is it away from the sample mean? In the real world you won't have only four pairs and it'll be very hard to do it by hand and we typically use software, computer tools, to do it but it's really valuable to do it by hand to get an intuitive understanding of what's going on here. So, in this particular situation, R is going to be equal to one over N minus one. We have four pairs, so it's gonna be 1/3 and it's gonna be times a sum of the products of the Z scores. So, this first pair right over here, so the Z score for this one is going to be one minus two, which is how far it is away from the X sample mean, divided by the X sample standard deviation, 0.816, that times, now we're looking at the Y variable, the Y Z score, so it's one minus three, one minus three over the Y sample standard deviation, 2.160 and we're just going to keep doing that. I'll do it like this. 
So, the next one it's going to be two minus two over 0.816, this is where I got the two from and I'm subtracting from that the sample mean right over here, times, now we're looking at this two, two minus three over 2.160 plus, I'm happy there's only four pairs here, two minus two again, two minus two over 0.816 times now we're gonna have three minus three, three minus three over 2.160 and then the last pair you're going to have three minus two, three minus two over 0.816 times six minus three, six minus three over 2.160. So, before I get a calculator out, let's see if there's some simplifications I can do. Two minus two, that's gonna be zero, zero times anything is zero, so this whole thing is zero, two minus two is zero, three minus three is zero, this is actually gonna be zero times zero, so that whole thing is zero. Let's see this is going to be one minus two which is negative one, one minus three is negative two, so this is going to be R is equal to 1/3 times, negative times negative is positive, and so this is going to be two over 0.816 times 2.160 and then plus three minus two is one, six minus three is three, so plus three over 0.816 times 2.160. Well, these are the same denominator, so actually I could rewrite if I have two over this thing plus three over this thing, that's gonna be five over this thing, so I could rewrite this whole thing, five over 0.816 times 2.160 and now I can just get a calculator out to actually calculate this, so we have one divided by three times five divided by 0.816 times 2.16, the zero won't make a difference but I'll just write it down, and then I will close those parentheses and let's see what we get. We get an R of, and since everything else goes to the thousandths place, I'll just round to the thousandths place, an R of 0.946. So, R is approximately 0.946. So, what does this tell us? The correlation coefficient is a measure of how well a line can describe the relationship between X and Y. 
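
The hand calculation above can be reproduced with a short Python sketch that uses only the standard library. The function name `sample_correlation` is an illustrative choice. Because the code keeps full precision instead of the rounded values 0.816 and 2.160, it gives about 0.945 rather than the 0.946 obtained above; the difference comes only from rounding the intermediate standard deviations.

```python
import statistics

# Data pairs from the video: (1, 1), (2, 2), (2, 3), (3, 6)
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]


def sample_correlation(xs, ys):
    """Return r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_x = statistics.stdev(xs)  # sample standard deviation (divides by n - 1)
    s_y = statistics.stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


r = sample_correlation(xs, ys)
print(round(statistics.stdev(xs), 3))  # 0.816
print(round(statistics.stdev(ys), 3))  # 2.16
print(round(r, 3))  # 0.945
```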
R is always going to be greater than or equal to negative one and less than or equal to one. If R is positive one, it means that an upwards sloping line can completely describe the relationship. If R is negative one, it means a downwards sloping line can completely describe the relationship. R anywhere in between says well, it won't be as good. If R is zero that means that a line isn't describing the relationship well at all. Now in our situation here, not to use a pun, in our situation here, our R is pretty close to one which means that a line can get pretty close to describing the relationship between our Xs and our Ys. So, for example, I'm just going to try to hand draw a line here and it does turn out that our least squares line will always go through the mean of the X and the Y, so the mean of the X is two, mean of the Y is three, we'll study that in more depth in future videos but let's see, this actually does look like a pretty good line. So, let me just draw it right over there. You see that I actually can draw a line that gets pretty close to describing it. It isn't perfect. If it went through every point then I would have an R of one but it gets pretty close to describing what is going on. Now, the next thing I wanna do is focus on the intuition. What was actually going on here with these Z scores and how does taking products of corresponding Z scores get us this property that I just talked about where an R of one would be a strong positive correlation and an R of negative one would be a strong negative correlation? Well, let's draw the sample means here. So, the X sample mean is two, this is our X axis here, this is X equals two and our Y sample mean is three. This is the line Y is equal to three. Now, we can also draw the standard deviations. 
This is, let's see, the standard deviation for X is 0.816 so I'll be approximating it, so if I go .816 less than our mean it'll get us to some place around there, so that's one standard deviation below the mean, one standard deviation above the mean would put us some place right over here, and if I do the same thing in Y, one standard deviation above the mean, 2.160 so that'll be 5.160 so it would put us some place around there and one standard deviation below the mean, so let's see we're gonna go, if we took away two, we would go to one and then we're gonna take away another .160, so it's gonna be some place right around here. So, for example, for this first pair, one comma one. What were we doing? Well, we said alright, how many standard deviations is this below the mean? And that turned out to be negative one over 0.816, that's what we have right over here, that's what this would have calculated, and then how many standard deviations in the Y direction, and that is our negative two over 2.160 but notice, since both of them were negative it contributed to the R, this would become a positive value and so, one way to think about it, it might be helping us get closer to the one. If both of them have a negative Z score that means that there's a positive correlation between the variables. When one is below the mean, the other is you could say, similarly below the mean. Now, if we go to the next data point, two comma two right over here, what happened? Well, the X variable was right on the mean and because of that that entire term became zero. The X Z score was zero. And so, that would have taken away a little bit from our correlation coefficient. The reason why it would take away even though it's not negative, you're not contributing to the sum but you're going to be dividing by a slightly higher value by including that extra pair. 
If you had a data point where let's say X was below the mean and Y was above the mean, something like this, if this was one of the points, this term would have been negative because the Y Z score would have been positive and the X Z score would have been negative and so, when you put it in the sum it would have actually taken away from the sum and so, it would have made the R score even lower. Similarly something like this would have made the R score even lower because you would have a positive Z score for X and a negative Z score for Y and so a product of a positive and a negative would be a negative.
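
The sign argument above can be checked numerically. The Python sketch below labels each pair's z-score product as positive, zero, or negative, then adds one extra point with X below its mean and Y above its mean. The extra point (1, 5) is a hypothetical choice, not a point from the video, and the helper names are illustrative. Adding a point also shifts the means and standard deviations, so r is recomputed from scratch for the extended data.

```python
import statistics


def z_products(xs, ys):
    """Return the product of the x and y z-scores for each data pair."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]


def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    return sum(z_products(xs, ys)) / (len(xs) - 1)


def sign_label(value):
    """Describe whether a term adds to, subtracts from, or leaves the sum unchanged."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "zero"


# Data pairs from the video
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

for x, y, p in zip(xs, ys, z_products(xs, ys)):
    print((x, y), sign_label(p))

# Hypothetical extra point (not from the video): x below its mean, y above its mean.
xs_extra = xs + [1]
ys_extra = ys + [5]
print(sign_label(z_products(xs_extra, ys_extra)[-1]))  # negative
print(correlation(xs_extra, ys_extra) < correlation(xs, ys))  # True
```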