Judging outliers in a dataset
Using the inter-quartile range (IQR) to judge outliers in a dataset.
- At 3:47, how did you get 1.5?(19 votes)
- Good question!
1.5 is simply a given number that statisticians have decided on using when finding outliers, so it is just a part of the equation.
Hope this helps!(34 votes)
- In the bottom box-and-whiskers plot, I noticed the instructor only moved the lower whisker to 6. So he only changed the lower value to 6 since the outliers were removed. If the outliers are removed, wouldn't that also change the median and other quartiles?(21 votes)
- When Sal says he's going to "not include the outliers" he is only talking about not considering them for the min or max. You are correct that if you took them out of the data set completely it could affect all the quartiles.(20 votes)
- In this example, there were two 1s in the data. Sal puts a single dot to represent the two of them in the box plot as outliers. Can I put two dots instead? Or is it just a matter of convention to represent any number of outliers having the same numerical value with a single dot?(8 votes)
- You can just put one dot to indicate that there is data at that point.(5 votes)
- When you are creating box-and-whisker plots, how do you know when you should include outliers or not?(9 votes)
- Outliers are by definition elements that exist outside of a pattern (i.e. it’s an extreme case or exception). While they might be due to anomalies (e.g. defects in measuring machines), they can also show uncertainty in our capability to measure. Just as there is no perfect mathematical model to characterize the universe, there isn’t a perfect machine to measure it. Hence, when plotting data sets one should never exclude outliers from plots. Someone might come up one day with a better model to characterize your data, and show those outliers are part of something magnificent.(4 votes)
- so q1 is median of first half and q2 is the median of the set and q3 is the median of second half so what is q4 ?!(4 votes)
- There is no Q4. Splitting a data set into four quarters takes only three cut points.(9 votes)
- I'm working with count data for my thesis and apparently 94% of my data are outliers (a lot of zeros as well as extreme values when we find large clusters of the animals). Would I exclude 94% of the data in this instance just because they are outliers?(3 votes)
- What is your data set? Logically, at least 50% of the data can't be considered outliers, because those values fall between Q1 and Q3. To find the outliers, you check whether a value is < Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR. So it is not possible to have 94% of your data as outliers.(8 votes)
- When we exclude outliers, doesn't it make sense to adjust Q1, Q2, and Q3 accordingly?(3 votes)
- Since the median is literally the middle number of the data set, it is not necessarily affected by the value of the outliers (unlike the mean). The same goes for Q1 and Q3, since they are technically medians as well.
Here's a more detailed explanation of solving for Q1 and Q3 https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/interquartile-range-iqr/a/interquartile-range-review
Hope this helps!😄(3 votes)
- And I gotta do this for school(3 votes)
- This is so freakin frustrating(3 votes)
- Where is the logic behind the use of the number 1.5?(3 votes)
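Several threads above ask whether the quartiles move if the outliers are dropped from the data entirely. Here is a minimal sketch, assuming the 15 values from the video and the median-of-halves quartile method Sal uses; the function names `median` and `quartiles` are hypothetical, chosen only for illustration. It recomputes the quartiles after deleting the two 1s.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: with an odd count, the median is left out of both halves,
    as in the video.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


# The 15 values from the video.
data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]

# Remove the two 1s, the only values below the 5.5 cutoff.
trimmed = [x for x in data if x != 1]

print(quartiles(data))     # (13, 14, 18)
print(quartiles(trimmed))  # (13.5, 15, 18.0)
```

With the two 1s deleted, the median moves from 14 to 15 and Q1 from 13 to 13.5, while Q3 stays at 18. In the video's second box plot the box is unchanged and only the lower whisker is shortened, because the outliers are still part of the data set and are simply drawn as separate points.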
- [Instructor] We have a list of 15 numbers here, and what I want to do is think about the outliers. And to help us with that, let's actually visualize this, the distribution of actual numbers. So let us do that.

So here, on a number line, I have all the numbers from one to 19. And let's see, we have two ones. So I could say that's one one and then two ones. We have one six. So let's put that six there. We have got a 13, or we have two 13s. So we're gonna go up here, one 13 and two 13s. Let's see, we have three 14s. So 14, 14, and 14. We have a couple of 15s, 15, 15. So 15, 15. We have one 16. So that's our 16 there. We have three 18s. One, two, three. So one, two, and then three. And then we have a 19. Then we have a 19.

So when you look, when you look visually at the distribution of numbers, it looks like the meat of the distribution, so to speak, is in this area, right over here. And so some people might say, "Okay, we have three outliers. There are these two ones and the six." Some people might say, "Well, the six is kinda close enough. Maybe only these two ones are outliers." And those would actually be both reasonable things to say.

Now to get on the same page, statisticians will use a rule sometimes. We say, well, anything that is more than one and a half times the interquartile range from below Q-one or above Q-three, well, those are going to be outliers. Well, what am I talking about? Well, let's actually, let's figure out the median, Q-one and Q-three here. Then we can figure out the interquartile range. And then we can figure out by that definition, what is going to be an outlier? And if that all made sense to you so far, I encourage you to pause this video and try to work through it on your own, or I'll do it for you right now.

All right, so what's the median here? Well, the median is the middle number. We have 15 numbers, so the middle number is going to be whatever number has seven on either side. So it's gonna be the eighth number. One, two, three, four, five, six, seven. Is that right? Yep, six, seven, so that's the median. And then you have one, two, three, four, five, six, seven numbers on the right side too. So that is the median, sometimes called Q-two. That is our median.

Now what is Q-one? Well, Q-one is going to be the middle of this first group. This first group has seven numbers in it. And so the middle is going to be the fourth number. It has three and three, three to the left, three to the right. So that is Q-one. And then Q-three is going to be the middle of this upper group. Well, that also has seven numbers in it. So the middle is going to be right over there. It has three on either side. So that is Q-three.

Now what is the interquartile range going to be? Interquartile range is going to be equal to Q-three minus Q-one, the difference between 18 and 13. Between 18 and 13, well, that is going to be 18 minus 13, which is equal to five.

Now to figure out outliers, well, outliers are gonna be anything that is below. So outliers, outliers, are going to be less than our Q-one minus 1.5, times our interquartile range. And this, once again, this isn't some rule of the universe. This is something that statisticians have kind of said, well, if we want to have a better definition for outliers, let's just agree that it's something that's more than one and half times the interquartile range below Q-one. Or, or an outlier could be greater than Q-three plus one and half times the interquartile range, interquartile range. And once again, this is somewhat, you know, people just decided it felt right. One could argue it should be 1.6. Or one could argue it should be one, or two, or whatever. But this is what people have tended to agree on.

So let's think about what these numbers are. Q-one we already know. So this is going to be 13 minus 1.5 times our interquartile range. Our interquartile range here is five. So it's 1.5 times five, which is 7.5. So this is 7.5. 13 minus 7.5 is what? 13 minus seven is six, and then you subtract another .5, is 5.5. So we have outliers, outliers. Outliers would be less than 5.5. Or the Q-three is 18, this is, once again, 7.5. 18 plus 7.5 is 25.5, or outliers, outliers greater than 25, 25.5.

So based on this, we have a, kind of a numerical definition for what's an outlier. We're not just subjectively saying, well, this feels right or that feels right. And based on this, we only have two outliers, that only these two ones are less than 5.5. Only these two ones are less than 5.5. This is the cutoff, right over here. So this dot just happened to make it. And we don't have any outliers on the high side.

Now another thing to think about is drawing box-and-whiskers plots based on Q-one, our median, our range, all the range of numbers. And you could do it either taking in consideration your outliers or not taking into consideration your outliers. So there's a couple of ways that we can do it. So let me actually clear, let me clear all of this. We've figured out all of this stuff. So let me clear all of that out. And let's actually draw a box-and-whiskers plot. So I'll put another, another, actually let me do two here. That's one, and then let me put another one down there. And then this is another.

Now if we were to just draw a classic box-and-whiskers plot here, we would say, all right, our median's at 14. And actually, I'll do it both ways. Our median's at 14. Median's at 14. Q-one's at 13. Q-one's at 13, and Q-one's at 13. Q-three is at 18. Q-three is at 18, Q-three is 18. So that's the box part. Now let me draw that as an actual, let me actually draw that as a box. So my best attempt, there you go. That's the box. And this is also a box. So far, I'm doing the exact same thing.

Now if we don't want to consider outliers, we would say, well, what's the entire range here? Well, we have things that go from one all the way to 19. So one way to do it is to, hey, we start at one. And so our entire range, we go, actually let me draw it a little bit better than that. We're going all the way, all the way from one to 19. Now in this one, we're including everything. We're including even these two outliers.

But if we don't want to include those outliers, we want to make it clear that they're outliers, well, let's not include them. And what we can do instead is say, all right, including (chuckles) our non-outliers, we would start at six 'cause six we're saying is in our data set, but it is not an outlier. Let me make this look better. So we're gonna, we are going to start at six and go all the way to 19. And then to say that we have these outliers, we would put this, we have outliers over there.

So once again, this is a box-and-whiskers plot of the same data set without outliers. And this is one where we make specific, we make it clear where the outliers actually are.
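The steps worked through in the video can be sketched in a few lines of Python. This is a minimal sketch rather than anything shown in the video: it assumes the 15 values read off the number line and the median-of-halves quartile method described in the transcript, where the median is left out of both halves when the count is odd. The names `median`, `quartiles`, and `iqr_fences` are hypothetical, and other quartile conventions (such as the ones built into statistics libraries) can give slightly different cutoffs.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, skip the median itself when forming the upper half.
    upper = data[half + 1:] if len(data) % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


def iqr_fences(values, k=1.5):
    """Cutoffs below and above which a value counts as an outlier."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# The 15 values from the video's number line.
data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]

q1, q2, q3 = quartiles(data)
low, high = iqr_fences(data)
outliers = [x for x in data if x < low or x > high]
inside = [x for x in data if low <= x <= high]

print(q1, q2, q3)                # 13 14 18
print(low, high)                 # 5.5 25.5
print(outliers)                  # [1, 1]
print(min(inside), max(inside))  # 6 19
```

The last line gives the whisker ends for the second box plot: the whiskers run from 6 to 19, and the two 1s are drawn as a separate outlier point. The multiplier `k` is a parameter because, as the video says, 1.5 is a convention rather than a rule of the universe.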