# Mean and standard deviation versus median and IQR

Learn to choose the "preferred" measures of center and spread when outliers are present in a set of data.

## Want to join the conversation?

- 1, 2, 3, 1000, 2000, 10000, 20000

The median is 1000.

It just tries to stay in between.

The mean is like finding a point that is closest to all the values, but it gets skewed by outliers.

For a distribution, if the mean is bad then so is the SD, obviously.

Standard deviation measures how far the points typically deviate from the mean.

For two datasets, the one with a bigger range is more likely to be the more dispersed one.

IQR is like focusing on the middle portion of sorted data. So it doesn’t get skewed.

Why not use the IQR range only?

Or use the standard deviation formula with the median instead of the mean?

Or create levels expanding from the IQR range: level 1, level 2, and so on?

Is it a good idea?(7 votes)
- When you perform an exploratory data analysis you may be interested in the range.

There is no such thing as IQR range. IQR is a form of range (interquartile range).

There is no such thing as levels in IQR. But perhaps you can create a new feature if you feel it is necessary.(2 votes)

- How about mode? Wouldn't that often be more reliable? Like when calculating the average salary in a large population - would the amount most people make not seem the most representative?(4 votes)
- Not necessarily, **Powel**. The example Carlos explained above is accurate.(3 votes)

- If median and IQR are preferred when there are outliers, doesn't that imply that they are more accurate when there is **any variance at all**?

The only case where mean and standard deviation are going to be as accurate as median and IQR is if there is no variance at all in the data.

With that being said, is there **any** situation where mean and standard deviation would be preferable?(4 votes)
- While median and IQR are more robust in the presence of outliers, mean and standard deviation are still useful in certain situations:

- If the data is symmetrically distributed around the mean without significant outliers, mean and standard deviation can provide a good representation of the data's central tendency and spread.

- In datasets that follow a normal distribution, mean and standard deviation are commonly used because they accurately summarize the distribution's properties.

- Mean and standard deviation are often preferred for mathematical calculations and comparisons between different datasets due to their mathematical properties and ease of interpretation.

Ultimately, the choice between mean/standard deviation and median/IQR depends on the nature of the data and the specific objectives of the analysis. If the data is heavily skewed or contains outliers, using median and IQR can provide a more accurate representation of the central tendency and spread.(1 vote)

- What does the standard deviation have to do with the IQR?(1 vote)
- They are both measures of spread around a center: the standard deviation describes how far the typical data point is from the mean, while the IQR is the width of the middle half of the data around the median.(7 votes)

- Why can't we mix and match? Since we figured out that the median captures central tendency better, why can't we use the median in the standard deviation formula? That would better capture the total variance/spread in the data set.(3 votes)
- Interesting idea, and it would somewhat remedy the distortion caused by the biased mean.

But the skew, and thus the bias from an outlier, remains even when the median is used to calculate the standard deviation.

I think that's why it is better to rely on the IQR in that type of situation, as it simply ignores the most extreme cases.(1 vote)
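One way to test the "mix and match" idea is a minimal sketch that plugs the median into the standard deviation formula, using the nine salaries from the video (in thousands). The helper name `rms_deviation` is made up for illustration and is not a standard statistic.

```python
from math import sqrt
from statistics import mean, median

# Salaries from the video, in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]


def rms_deviation(values, center):
    """Root-mean-square distance of the values from `center` (divides by n)."""
    return sqrt(sum((v - center) ** 2 for v in values) / len(values))


# The usual population standard deviation: distances measured from the mean.
sd_about_mean = rms_deviation(salaries, mean(salaries))

# The "mix and match" version: same formula, distances measured from the median.
sd_about_median = rms_deviation(salaries, median(salaries))

# The same median-based measure after dropping the 250 outlier.
without_outlier = salaries[:-1]
sd_without_outlier = rms_deviation(without_outlier, median(without_outlier))

print(f"spread about the mean:                    {sd_about_mean:.1f}")
print(f"spread about the median:                  {sd_about_median:.1f}")
print(f"spread about the median, outlier removed: {sd_without_outlier:.1f}")
```

With this data the median-based version comes out slightly larger than the ordinary standard deviation (roughly 65 versus 62), because the mean is the point that minimizes the sum of squared distances. Removing the single outlier drops the value to about 11, which suggests that the squaring step, not the choice of center, is what lets the outlier dominate.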

- is there any simpler way to explain this?(3 votes)
- Sure! In simpler terms, when we talk about "sample biased variance," we're talking about a correction needed when calculating the variance from a sample to make it more accurate. And when we say the data is "skewed," we mean it's not evenly spread out around the average. It's like if most people earn around $50,000, but one person earns $1 million, that would skew the average salary higher.(1 vote)

- I have 2 questions. The first one is on variance: why did the previous video refer to it as sample biased variance, and what does that mean? The second question is about the term skew: what does it mean here? Thank you.(2 votes)
- The term "sample biased variance" likely refers to the variance calculated from a sample (as opposed to the entire population) by dividing by n, which is a biased estimate of the population variance. Dividing by n−1 instead of n (Bessel's correction) corrects for this bias, because dividing by n tends to underestimate the true population variance.

In statistics, "skew" refers to the asymmetry or lack of symmetry in a distribution of data. If a distribution is skewed, it means that the data points are not symmetrically distributed around the mean. In the context of the video, the term "skewed data set" implies that the distribution of salaries is not symmetric, likely due to the presence of extreme values (outliers) like the $250,000 salary mentioned. These extreme values can disproportionately influence measures of central tendency like the mean, causing it to be skewed or distorted.(1 vote)
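The n versus n−1 distinction described above can be seen with Python's standard `statistics` module. This is only a sketch: it treats the nine salaries from the video as if they were a sample, purely for illustration, even though the video treats them as a whole population.

```python
from statistics import pvariance, variance

# Salaries from the video, in thousands of dollars (treated here as a sample).
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]
n = len(salaries)

# pvariance divides the sum of squared deviations by n (the "biased" form
# when the data is only a sample).
biased = pvariance(salaries)

# variance divides by n - 1 (Bessel's correction).
corrected = variance(salaries)

print(f"divide by n:     {biased:.1f}")
print(f"divide by n - 1: {corrected:.1f}")
print(f"ratio:           {corrected / biased:.4f}  (equals n / (n - 1) = {n / (n - 1):.4f})")
```

The corrected value is always the larger of the two, by a factor of n / (n − 1), and the gap shrinks as the sample grows.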

- Would the mean be robust if there are outliers on both sides of the main group of data points?(1 vote)
- Still no, because it is unknown how drastically the outliers differ from each other. For example, if most of the data were from 50-60, one of the outliers could be 30 while another outlier is 200. Thus, as a general rule, use the median if there are any outliers.(3 votes)
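The example in this answer can be sketched in a few lines. The values in `main_group` are hypothetical, chosen only to fall in the 50-60 range; the outliers 30 and 200 are the ones mentioned in the answer.

```python
from statistics import mean, median

# Hypothetical main group of values between 50 and 60.
main_group = [50, 52, 54, 55, 56, 58, 60]

# Add one low outlier and one high outlier, as in the answer above.
with_outliers = main_group + [30, 200]

print("main group:    mean =", mean(main_group), " median =", median(main_group))
print("with outliers: mean =", round(mean(with_outliers), 1),
      " median =", median(with_outliers))
```

Even with an outlier on each side, the mean moves above every value in the main group because 200 is much farther from the center than 30 is, while the median does not move.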

- If the mean is 80, how far away is 60 and in what direction?(2 votes)
- 20 to the left. So, you would subtract 20 from 80 to get 60.(1 vote)

- What is the minimum number of points for using each of the choices, standard deviation or IQR?(1 vote)
- Look at the spread and use your own intuitive reasoning. If you feel the data is roughly symmetrical around the mean then use standard deviation, else go with IQR.(2 votes)

## Video transcript

- [Narrator] So we have
nine students who recently graduated from a small school
that has a class size of nine, and they wanna figure out
what is the central tendency for salaries one year after graduation? And they also wanna have a
sense of the spread around that central tendency one
year after graduation. So they all agree to put in
their salaries into a computer, and so these are their salaries. They're measured in thousands. So one makes 35,000, 50,000,
50,000, 50,000, 56,000, two make 60,000, one makes
75,000, and one makes 250,000. So she's doing very well for herself, and the computer it spits
out a bunch of parameters based on this data here. So it spits out two typical
measures of central tendency. The mean is roughly 76.2. The computer would calculate
it by adding up all of these numbers, these nine numbers,
and then dividing by nine, and the median is 56, and median
is quite easy to calculate. You just order the numbers and you take the middle number here which is 56. Now what I want you to
do is pause this video and think about for this data set, for this population of
salaries, which measure of central
tendency is a better measure? All right, so let's think
about this a little bit. I'm gonna plot it on a line here. I'm gonna plot my data
so we get a better sense, so we don't just see things
as numbers, but we see
where those numbers sit relative to each other. So let's say this is zero. Let's say this is, let's see,
one, two, three, four, five. So this would be 250, this
is 50, 100, 150, 200, and let's see. Let's say if this is 50
then this would be roughly 40 right here, and I just wanna get rough. So this would be about 60,
70, 80, 90, close enough. I could draw this
a little bit neater, but 60, 70, 80, 90. Actually, let me just clean
this up a little bit more too. This one right over here would be a little bit closer to this one. Let me just put it right around here. So that's 40, and then
this would be 30, 20, 10. Okay, that's pretty good. So let's plot this data. So, one student makes 35,000,
so that is right over there. Two make 50,000, or three make 50,000, so one, two, and three. I'll put it like that. One makes 56,000 which would
put them right over here. One makes 60,000, or
actually, two make 60,000, so it's like that. One makes 75,000, so
that's 60, 70, 75,000. So it's gonna be right around there, and then one makes 250,000. So one's salary is all
the way around there, and then when we
calculate the mean as 76.2 as our measure of central tendency, 76.2 is right over there. So is this a good measure
of central tendency? Well to me it doesn't feel that good, because our measure of central
tendency is higher than all of the data points except for
one, and the reason is that
our data is skewed significantly by this
data point at $250,000. It is so far from the
rest of the distribution, from the rest of the data,
that it has skewed the mean, and this is something
that you see in general. If you have data that is skewed,
and especially things like salary data where someone might
make, most people are making 50, 60, $70,000, but someone
might make two million dollars, and so that will skew the
average or skew the mean I should say, when you add them all
up and divide by the number of data points you have. In this case, especially when
you have data points that would skew the mean,
median is much more robust. The median at 56 sits right
over here, which seems to be much more indicative for central tendency. And think about it. Even if, instead of 250,000, you made this 250,000
thousand, which would be 250 million dollars,
a ginormous amount of money to make, it would
skew the mean incredibly, but it actually would not
even change the median, because the median, it doesn't matter how high this number gets. This could be a trillion dollars. This could be a quadrillion dollars. The median is going to stay the same. So the median is much more robust if you have a skewed data set. Mean makes a little bit more
sense if you have a symmetric data set, or if
things are roughly balanced
above and below the mean, or things aren't skewed
incredibly in one direction, especially by a handful of data points like we have right over here. So in this example, the median is a much better measure of central tendency. And so what about spread? Well you might say, well,
Sal you already told us that the mean is not so good and the standard deviation
is based on the mean. You take each of these data
points, find their distance from the mean, square that
number, add up those squared distances, divide by the
number of data points if we're taking the population standard
deviation, and then you take the
square root of the whole thing. And so since this is based on
the mean, which isn't a good measure of central tendency
in this situation, this is also going to skew
that standard deviation. It is going to be a lot larger than you would want for an indication of the actual spread. Yes, you have this one data
point that's way far away from either the mean or
the median depending on how you wanna think about it, but
most of the data points seem much closer, and so for that situation, not only are we using the median, but the interquartile range
is once again more robust. How do we calculate the
interquartile range? Well, you take the median
and then you take the bottom group of numbers and
calculate the median of those. So that's 50 right over here
and then you take the top group of numbers, the
upper group of numbers, and the median there is
halfway between 60 and 75, so it's 67.5. If this looks unfamiliar
we have many videos on interquartile range and calculating standard deviation and median and mean. This is just a little bit of a review, and then the difference
between these two is 17.5, and notice, this distance
between these two, this 17.5, this isn't going to change, even if this is 250 billion dollars. So once again, it is both of
these measures are more robust when you have a skewed data set. So the big take away here is
mean and standard deviation, they're not bad if you have
a roughly symmetric data set, if you don't have any
significant outliers, things that really skew the data set, mean and standard deviation
can be quite solid. But if you're looking at
something that could get really skewed by a handful of data
points, median and interquartile range are the better choice:
median for central tendency, interquartile range for spread
around that central tendency, and that's why you'll see when
people talk about salaries they'll often talk about
median, because you can have some skewed salaries,
especially on the up side. When we talk about things
like home prices you'll see median reported
more often than mean, because in a
neighborhood or in a city, a lot of the
houses might be in the 200,000, $300,000 range, but maybe
there's one ginormous mansion that is 100 million dollars,
and if you calculated mean that would skew and give a
false impression of the average or the central tendency
of prices in that city.
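The figures quoted in the transcript can be reproduced with a short sketch using Python's standard `statistics` module. The `iqr` and `summarize` helpers are hypothetical names written for this example. `iqr` follows the "median of the lower group, median of the upper group" method described in the video, which may give different results from the quantile methods built into statistics libraries.

```python
from statistics import mean, median, pstdev


def iqr(values):
    """Interquartile range using the median-of-each-half method from the video."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(upper) - median(lower)


def summarize(values):
    """Return the four measures discussed in the video."""
    return {
        "mean": mean(values),
        "median": median(values),
        "std_dev": pstdev(values),  # population standard deviation (divide by n)
        "iqr": iqr(values),
    }


# Salaries from the video, in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

# Same data, but the top earner makes 250,000 thousand (250 million dollars).
extreme = salaries[:-1] + [250_000]

for label, data in [("original", salaries), ("extreme outlier", extreme)]:
    stats = summarize(data)
    print(f"{label:>16}: " + ", ".join(f"{k} = {v:,.1f}" for k, v in stats.items()))
```

For the original data this gives a mean of about 76.2, a median of 56, and an IQR of 17.5, matching the transcript. Making the outlier a thousand times larger changes the mean and standard deviation enormously but leaves the median and IQR exactly where they were.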