Statistics and probability
An introduction to density curves for visualizing distributions. A brief review of frequency histograms and relative frequency histograms as well.
- What's the unit on the y-axis once you convert the frequency histogram to a density curve? You marked the peak as 0.2 in the density curve. What does 0.2 stand for?(23 votes)
- I was struggling with this as well. The unit on the y-axis of the density curve is not percentage, it is density. Density values can be greater than 1. In the relative frequency histogram the y-axis was percentage, but in the density curve the y-axis is density and the area gives the percentage. When creating the density curve, the values on the y-axis are scaled so that the total area under the curve is 1. This allows us to define an arbitrary interval on the x-axis and calculate the percentage for that continuous interval from the area, as opposed to reading the percentage for a fixed bucket off the relative frequency histogram. It can be confusing because the visual shapes of the histogram and density curve are the same, but the units on the y-axis are different things. I assume this is explained better than I can in a video down the line.(37 votes)
- 3:28 Why doesn't the percentage decrease if we become more granular, so that there are fewer data points per rectangle?(22 votes)
- Because there are more rectangles.
When you have a relative frequency histogram (discrete) you plot the probability that the data falls in one of those rectangles, but each value of your data can only be exactly in one of them. If you add them all you get 100%.
Imagine that you have a continuous data set (like the quantity of water that people drink every day), and you try to fit the values in your discrete rectangles. Then each rectangle should be a range, and you plot the probability that someone would drink water inside that range.
What you are thinking makes a lot of sense. If I simply plot the probability of someone drinking an amount of water inside a smaller range, then the probability should be smaller, and the shape of the rectangles should change, being squashed.
BUT, then the area under the curve wouldn't be equal to 100%, because the total range is the same but the probabilities are smaller.
What those numbers represent is the probability that some value falls into those ranges, DIVIDED by the range. Those values are the probability PER UNIT. Now it makes sense to have infinitely many small rectangles.
If you plot the probability divided by the range, you can take an integral that sums up infinitely many infinitely thin rectangles, each with an area equal to the probability of a value falling into it. That area gets smaller as the rectangles get narrower, because the width shrinks but the height does not.
In summary, what you thought was the height of the rectangle is in reality the area of that rectangle. The height is the probability per unit of range, and the width is the range in which we could include values.(10 votes)
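The point about the height being probability per unit of range can be checked numerically. The following is a minimal sketch, assuming a made-up continuous data set: the sample, the seed, and the helper name `density_heights` are illustrative and not taken from the video. It bins the same data with bucket widths of 1, 0.5 and 0.25 and compares the tallest relative frequency with the tallest density (relative frequency divided by width).

```python
import random


def density_heights(data, lo, hi, width):
    """Return (relative_frequencies, densities) for equal-width bins on [lo, hi)."""
    n_bins = round((hi - lo) / width)
    counts = [0] * n_bins
    for x in data:
        if lo <= x < hi:
            # Guard against floating-point rounding at the upper edge.
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
    rel = [c / len(data) for c in counts]   # fraction of the data in each bin
    dens = [r / width for r in rel]         # fraction per unit of width
    return rel, dens


random.seed(0)
# Hypothetical "glasses of water per day" values, clipped so all lie in [0, 9).
data = [min(max(random.gauss(3.5, 1.5), 0.0), 8.999) for _ in range(100_000)]

for width in (1.0, 0.5, 0.25):
    rel, dens = density_heights(data, 0.0, 9.0, width)
    area = sum(d * width for d in dens)     # sum of height * width over all bins
    print(f"width={width}: tallest rel. freq.={max(rel):.3f}, "
          f"tallest density={max(dens):.3f}, total area={area:.3f}")
```

With this sample, the tallest relative frequency shrinks as the buckets get narrower, while the tallest density stays at about the same level and the total area stays at 1.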
- When we switch from histogram to a density curve, we are told to conceptualize the buckets as getting infinitesimally narrow until they are no longer represented by bars but by a line connecting their tops.
However, any data point matching a given value becomes less and less likely as our buckets get narrower and narrower. So if you could zoom in, it would look like a spike followed by zero for a while, then another spike.
Is the best way to conceptualize this as incredibly small buckets connected by a sort of trend line that is "stiff" (i.e. not dropping to zero in the space between buckets)? Or should the x-axis of density curves be conceptualized as legitimately continuous values (i.e. not buckets)? And if so, why isn't the line at 0 most of the time?(5 votes)
- That was exactly my question, although you asked it one year before me :P
As of now I think the analogy for the density curve as a relative frequency histogram with incredibly thin rectangles is incorrect.
From other sites, I've gathered that the density curve can rather be thought of as the derivative of the cumulative distribution function/curve.
I am not the best person to explain exactly what this means, but check out this link:
The probability of getting an X in an interval for relative frequency histograms is represented by the tops of the rectangles, but for a density curve, it is represented by the area under the curve. Therefore, I think that the density curve is not a relative frequency histogram with really thin rectangles.(2 votes)
- 7:59 Not to be picky, but it would be molecule, right?(4 votes)
- Well technically yes(2 votes)
- So there are no values that are exactly 3? This part confuses me a little bit. What if you asked how many people drank 2.9 glasses of water, which was an original data point? Is there any way to tell what % drank 2.9 glasses of water, or do you just have to assume that it isn't exactly 2.9 glasses?(3 votes)
- At 2:47 he mentioned that you have "more data points" and want "more granular categories", which suggests that the data set used is not the same one as in the first histogram. (3:20 - 16 million data points)
It seems like the lesson assumed that the first data set is discrete, 'cause it's countable (16 students), and the precision is only 1 decimal digit. It takes a huge number of data points to construct a distribution function of a continuous variable, and the variable itself, when described as a curve, is not countable.
In reality, the number of glasses of water drunk per day by, say, 1 billion people, with each data point written with 10 decimal digits, is a continuous variable. The probability that someone drinks exactly 2.9000000000 glasses of water/day is close to 0.(3 votes)
- With a discrete (think that's the right word) dataset, making the graph more and more granular wouldn't result in a curve like Sal shows at 3:43, right? So he's talking about a continuous dataset.
I understand that the area under the curve is the probability of getting an X in that interval. But for a rel frequency histogram, the tops of the rects represent that probability. So the analogy of a rel freq histogram with infinitely thin rectangles being the density curve is incorrect right?
Even if our dataset wasn't specific values but instead a continuous variable X, and we imagine a rel freq histogram with very thin rects, then all the tops of the rects would be very close to zero right?
So a relative frequency histogram with infinitely thin rectangles is not the density curve?
If we take Sal's data set at 2:38 and the rel frequency histogram, and we make the histogram's rectangles very thin, we won't get a curve; we'll get a bunch of very thin rectangles with their tops at zero, and then thin rects at all the data points with a 1/16 probability (except for 3.2, which has a 2/16 probability since there are 2 in the dataset).(4 votes)
- Is it relative frequency on y-axis?(2 votes)
- Please explain how you put 0.2 on the y-axis. I'm so confused about it. Please help. From India(2 votes)
- I think the example of #glasses of water doesn't suit the density curve, since it's a discrete variable (integer/30). If you drink 90 glasses of water per 30 days, you are drinking 3.00000 glasses of water per day on average, and one atom less can't make a glass of water not a glass of water. So there should be some probability that somebody drinks exactly 3.00000 glasses of water per day.(2 votes)
I agree with your statement. But there is no interval here, just the % of data drinking exactly 3 glasses. Why would I want to think about the area here? If it were an interval made up of multiple infinitesimally thin intervals, then I would calculate the area to sum up the corresponding frequencies. But here we are interested in an infinitely thin interval at 3, so it should be just the height.
Also, it depends on the data from which we constructed the curve. If our data does not have observations in certain intervals, why do we still draw a curve instead of leaving holes in the curve?(2 votes)
- [Instructor] What we're going to do in this video is think about how to visualize distributions of data, and then to analyze those visualizations, and we will eventually get to something known as a density curve. But let's start with a simple example, just to review some concepts. Let's say I go to 16 students and I ask them to measure how many glasses of water they drink per day for the last 30 days, and then to average it. And so this data point right over here tells us one student drank an average of 0.5 glasses of water per day. That person is probably very dehydrated. This person drank 8.1 glasses of water per day, on average, for the last 30 days, they are better hydrated. If we want to visualize that we can set up a frequency histogram, where we can create some categories. So this first category would be for data points that are greater than or equal to zero, and less than one, and we can see that two data points fall into that category, and that's why the bar right over here for that category is up to two. This category right over here is greater than or equal to three, and less than four. Notice, there are four data points in that category and on this frequency histogram the height of the bar is indeed four. So this is a nice way of looking at a distribution. But you might be more concerned with what percentage of my data falls into each of these categories, and that becomes especially interesting if we have many, many, many data points, and if we had, you know, 1,600,432,507 data points, well just knowing the absolute number that fit into each category isn't so useful, the percent that fits into each category is a lot more useful. And so for that, we could set up a relative frequency histogram. So notice, this is representing the same data. But in that first category, instead of the bar height being two, the bar height is now 12.5%. Why is that? Because two of the 16 data points fall into this category. 2/16 is 1/8, which is 12.5%. 
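The step from counts to percentages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 16 values below are hypothetical, chosen so that two fall in the first bucket and four fall in the bucket from three to four as described above, and only 0.5 and 8.1 are quoted in the transcript itself.

```python
import math
from collections import Counter

# Hypothetical data set of 16 daily averages (glasses of water per day).
# Only 0.5 and 8.1 are quoted in the transcript; the rest are made up to
# match the bucket counts it describes.
glasses = [0.5, 0.8, 1.4, 1.9, 2.3, 2.9, 3.2, 3.2,
           3.5, 3.8, 4.4, 4.7, 5.2, 6.1, 7.0, 8.1]

# Bucket k holds values with k <= x < k + 1.
counts = Counter(math.floor(x) for x in glasses)
n = len(glasses)

for k in range(9):
    freq = counts[k]              # bar height in the frequency histogram
    rel = freq / n                # bar height in the relative frequency histogram
    print(f"[{k}, {k + 1}): frequency={freq}, relative frequency={rel:.1%}")
```

Both histograms are built from the same counts; the relative version simply divides each count by the number of data points.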
And this one right over here, notice, instead of the height being four for four data points, it's now 25%. But these are saying the same thing. Four out of the 16 data points fall into this category. 4/16 is 1/4, which is 25%.
So both of these types of histograms are really useful and you will see them used all of the time. But there are also cases where you have many, many, many more data points, and you want more granular categories. So what you could do, is, well, let's just make our categories a little more granular. So for example, instead of them being one glass of water wide, maybe you make them half a glass of water wide. So this first category could be greater than or equal to zero, and less than 0.5, and that will give you a clearer picture, and I'm now assuming a world where we have more than 16 data points, maybe we have 16 million data points, this would be percentages on the left hand side.
But maybe that isn't good enough for you, maybe you wanna get even more granular. So you make everything, each category, a quarter of a glass. But maybe that doesn't satisfy, you wanna get more and more and more granular. Well, you could imagine where this is going. You could get to a point where you're approaching an infinite number of categories, and each category is infinitely thin, is super, super thin, to a point that if you just connect the tops of the bars that you will actually get a curve.
And this type of curve is something that we actually use in statistics, and, as promised at the beginning of the video, this is the density curve we talk about. And what's valuable about a density curve, it is a visualization of a distribution where the data points can take on any value in a continuum. They're not just thrown into these coarse buckets. So how would you interpret something like this?
If you look over the entire interval from zero, let's say, to nine, assuming no one drank more than an average of nine glasses per day, even in our 16 million data points, well then the area under the curve over that interval is going to be 100%, or 1.0. This is going to be true for any density curve, that the entire area of the curve is 100%, it represents all of the data points. A density curve will also never take on a negative value, you won't see the curve dip down and do something strange like that. Now, with that out of the way, let's think about how we would make use of it. If I wanted to know what percentage of my data falls between two and four glasses, well I would look at that interval. I'd go from two to four, I would look at this interval right over here, and I would try to figure out the area under the curve here. And this area is going to be greater than or equal to zero, and less than or equal to 100%. When I eyeball it right over here, it looks like it's about 40% of the entire area under the curve, so just eyeballing it, I would say roughly 40% of my data falls into this interval. If I were to ask you what percentage of the data is greater than three, well then you would be looking at this area, and it looks like it is about 50%, but once again, I am estimating it. But you can start to see how, even with estimation, a density curve could be useful. In the real world, statisticians will often have tables that might represent the information for the density curve, they might have computer programs or some type of automated tool, and there are also well-known density curves. The famous Bell Curve that we will study later on, where there's a lot of precise data and a lot of tools to exactly figure out the areas. The last thing I'd like to cover is a key misconception for density curves. If I were to ask you, approximately what percentage of my data is exactly three glasses of water per day? 
And when I say exactly, I mean exactly the number 3.000 with zeroes just going on and on forever, the exact number three. Well, you might be tempted to just say okay, this is three. Let me see the corresponding point on the curve. It looks like it is about 0.2, or a little higher than that, so maybe you would say a little bit more than 20%, or approximately 20%. And what I would say to you, is this is wrong.
Remember, the percentage of the data in an interval is not the height of the curve, it is the area under the curve in that interval. And if we're just talking about one precise value, like exactly the number three, there is no area under the curve. This vertical line that I just drew over the number three has no width, and this actually makes sense in the real world. Even if you were to look at 16 million people, it is very unlikely that anyone would drink exactly three glasses of water per day. I'm talking about not one atom more or one atom less than three glasses. There might be many people between 2.9 and 3.1, but no one is exactly three glasses a day. When someone says I'm drinking three glasses of water per day, that would be a rough estimate. They're probably 3.001, or 2.99999, or 3.15, or whatever else.
And so instead, you could say what percentage falls in the interval, maybe, that is greater than or equal to 2.9 and less than or equal to 3.1. And so once you have an interval, then you actually can look at the area, so we're gonna go from 2.9 to 3.1, so now we have an interval that actually has width, and so it'd be roughly the size of this yellow area that I'm shading in right over here, and we can approximate it with a rectangle even though the top of this curve isn't flat, but we can say, look, it's approximately like a rectangle that is 0.2 high, and what's the width?
The width here, if we're going from 2.9 to 3.1, the width is going to be 0.2 wide, and so we could approximate this area by approximating this rectangle, the area of the rectangle. 0.2 times 0.2, that would give us an area of 0.04. Or we could say approximately 4% of the data falls in this interval.
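The difference between the height of the curve and the area under it can be tried out with a small Python sketch. The curve drawn in the video is not given as a formula, so this assumes a stand-in: a normal density with mean 3 and standard deviation 2, picked only because its peak height is close to 0.2. The name `curve` and the helper `area` are illustrative.

```python
from statistics import NormalDist

# Hypothetical stand-in for the density curve in the video (an assumption,
# not the actual curve): a normal density with a peak height of about 0.2.
curve = NormalDist(mu=3.0, sigma=2.0)


def area(dist, a, b):
    """Area under the density between a and b, i.e. the share of data in [a, b]."""
    return dist.cdf(b) - dist.cdf(a)


height_at_3 = curve.pdf(3.0)              # a density, not a percentage
rectangle = height_at_3 * (3.1 - 2.9)     # height * width approximation
exact_slice = area(curve, 2.9, 3.1)       # area over the same interval
exactly_3 = area(curve, 3.0, 3.0)         # zero width, so zero area

print(f"height at 3:           {height_at_3:.4f}")
print(f"rectangle 2.9 to 3.1:  {rectangle:.4f}")
print(f"area 2.9 to 3.1:       {exact_slice:.4f}")
print(f"area at exactly 3:     {exactly_3:.4f}")
print(f"area 2 to 4:           {area(curve, 2.0, 4.0):.4f}")
print(f"area above 3:          {1.0 - curve.cdf(3.0):.4f}")
```

Under this assumed curve, the rectangle and the area from 2.9 to 3.1 both come out near 0.04, the area at exactly 3 is 0, and the areas from 2 to 4 and above 3 land near the eyeballed 40% and 50%.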