# Analyzing trends in categorical data

Sal solves an example where he is asked to calculate relative frequencies and analyze trends in categorical data.  Created by Sal Khan.

• I need more explanation and examples on this topic, and some more methods for cracking questions like these. I start well, but partway through I stumble and end up with the wrong calculation. The row-wise and column-wise percentages and the total calculations get confusing. Help, please.
• The table basically condenses the conditional distributions while including total percentages. If you just want to look at one row and see the conditional distribution for that row, you look at the row %. If you want to see the conditional distribution for a column, you compare the column % within that column. It also includes the percent of each option out of the total, which makes it easy to find the total if you are given a number in a particular category.
• The claim "because 55.0% of people who get 7 or more hours are minimal computer users" makes sense to me. But why does the claim "only 35.8% of all people are minimal computer users" support the conclusion?
• It is saying that 55% of people who get 7+ hours of sleep are minimal computer users. To counter this, some people might say that this is only because most of the people taking the survey are minimal computer users. However, to rule that out and back up the original claim, it points out that most people are not minimal computer users; only 35.8% are. So the point of the second figure is to back up the first: 55% of people who get 7+ hours of sleep are minimal computer users, and this is not simply because minimal users are the majority.
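The comparison in this answer can be checked with a couple of lines of arithmetic. Here is a minimal sketch in Python that uses only the two percentages quoted in the question; the variable names are just illustrative. The idea is that if computer use and sleep were unrelated, the share of minimal users among 7+ hour sleepers would be roughly the same as their share of everyone.

```python
# Percentages quoted in the question above.
pct_minimal_among_7plus = 55.0   # minimal users among people who sleep 7+ hours
pct_minimal_overall = 35.8       # minimal users among all people surveyed

# With no association, these two figures would be roughly equal.
difference = pct_minimal_among_7plus - pct_minimal_overall
ratio = pct_minimal_among_7plus / pct_minimal_overall

print(f"difference: {difference:.1f} percentage points")
print(f"ratio: {ratio:.2f}")
```

A difference well above zero (or a ratio well above 1) means minimal users are over-represented among the people who sleep 7+ hours, which is what the claim relies on.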
• I can understand the concept but not the rows and columns.
• This is how I've understood it. It's more about how the two groups of categories (computer-time and hours-p-night) relate to each other.
Let's call the computer-time group the row-group (its categories are on the rows of the table) and the hours-p-night group the column-group.

The row-group has the Minimal, Moderate, and Extreme categories.
The column-group has the 5 or fewer, 5 - 7, and 7 or more categories.

Because the categories are grouped, we look at the data from the perspective of each group.
From the perspective of computer-time (for example Minimal) we can say that:
16.3% of the Minimal Computer users have 5 or fewer hours of sleep
32.6% of the Minimal Computer users have 5 to 7 hours of sleep
51.1% of the Minimal Computer users have 7 or more hours of sleep

These values go in the 'Row %' entry of each hours-p-night category (the column-group). In other words: for this row (Minimal is a row of the table), what are the values in each column (the hours-p-night categories)? Note that 'Row total' doesn't have a value for 'Column %'. This is because the 'Column %' values describe the other group of categories (hours-p-night), and the totals for those are in 'Column total'.

Now, looking from the perspective of hours-p-night (for example 5 or fewer), we can say that:
17.5% of people that have 5 or fewer hours of sleep are Minimal Computer users.
32.5% of people that have 5 or fewer hours of sleep are Moderate Computer users.
50.0% of people that have 5 or fewer hours of sleep are Extreme Computer users.

These values go in the 'Column %' entry of each category in the row-group. In other words: for this column (5 or fewer), what are the values for each computer-time category?
The same goes for every category in the sleeping-time group.

With two groups of categories, it would have been hard to understand everything without the 'Row %' and 'Column %' to guide us.
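To see where the three percentages in each cell come from, it can help to compute them from raw counts. The sketch below is plain Python with made-up counts: they are an assumption for illustration and are NOT the survey's actual data, and the `percentages` helper is a hypothetical name. Each cell is divided by its row total, its column total, and the grand total in turn.

```python
# Hypothetical counts (NOT the survey's data): rows are computer-time
# categories, columns are hours-of-sleep categories.
columns = ["5 or fewer", "5 - 7", "7 or more"]
counts = {
    "Minimal":  [6, 12, 22],
    "Moderate": [10, 15, 10],
    "Extreme":  [14, 8, 3],
}

row_totals = {name: sum(cells) for name, cells in counts.items()}
col_totals = [sum(cells[j] for cells in counts.values()) for j in range(len(columns))]
grand_total = sum(row_totals.values())

def percentages(row_name, j):
    """Return (row %, column %, total %) for one cell of the table."""
    cell = counts[row_name][j]
    row_pct = 100 * cell / row_totals[row_name]   # share of this row's people
    col_pct = 100 * cell / col_totals[j]          # share of this column's people
    total_pct = 100 * cell / grand_total          # share of everyone
    return row_pct, col_pct, total_pct

for name in counts:
    for j, col in enumerate(columns):
        r, c, t = percentages(name, j)
        print(f"{name:8} | {col:10} | row {r:5.1f}%  col {c:5.1f}%  total {t:5.1f}%")
```

Reading the output one row at a time gives the 'Row %' view (it adds up to 100% across a row); reading one column at a time gives the 'Column %' view (it adds up to 100% down a column).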
• This seems like a confusing way to present data. I hope a module on "Selecting Appropriate Charts/Tables" will be added in the future. That said, I do understand that data won't always be presented in the way that is most legible to the reader.
• When the instructor checks the last answer, I did not think it was supporting his claim, which was that there is indeed an association between minimal computer usage and getting 7+ hours of sleep. Saying that 55% of people who get 7+ hours of sleep are minimal computer users supports his point, but linking it to the data saying x percent of computer users are minimal users does not suggest a positive association. Am I missing something?

A better comparison would be that 18.3% of all users are minimal users who get 7+ hours of sleep, while 15% of all users are moderate to extreme users who get 7+ hours of sleep. Since 18.3% > 15%, minimal users who get 7+ hours outnumber the others by 3.3 percentage points, which suggests a positive association.
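The two total percentages in this comment are also enough to recover the 55.0% figure from the video. A small sketch in Python, using only numbers quoted in this thread (the variable names are illustrative):

```python
# Total % figures quoted in the comment above.
minimal_and_7plus = 18.3   # % of all people: minimal users who sleep 7+ hours
other_and_7plus = 15.0     # % of all people: moderate or extreme users who sleep 7+ hours

# Everyone who sleeps 7+ hours, as a share of all people.
all_7plus = minimal_and_7plus + other_and_7plus

# Column %: share of the 7+ hour sleepers who are minimal users.
col_pct_minimal = 100 * minimal_and_7plus / all_7plus

print(f"{all_7plus:.1f}% of all people sleep 7+ hours")
print(f"{col_pct_minimal:.1f}% of them are minimal users")
```

One way to read this: 18.3% against 15.0% compares one group with two groups combined, so it does not account for group size. Setting the resulting 55.0% against the 35.8% of all people who are minimal users does account for it, because the minimal group is smaller than the other two together (64.2% of people).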
• This has to do with how the data was gathered. In this problem we grabbed a bunch of people, estimated their computer usage patterns and sleep patterns, and made a table.

If on the other hand we grabbed some people in Town A and Town B and estimated their sleep patterns, the last claim (replacing "minimal computer user" with "being from Town A") would not be valid.

In the computer usage example we cannot influence how many people end up in each of the usage groups; we simply sample the population. In the Towns example, we decide how many people from each town we sample. (And more importantly, the ratio.)
• Khan Academy should have included "filling out frequency table for independent events" before teaching "analyzing trends in categorical data".
• What do Row & Column labels mean?
• The Row % is the conditional percent in that particular row or how often that particular outcome appeared in that one row. Column % is the same thing, but downward rather than across.
• I am unable to read the table that specifies the values of row %, column %, and total %.
• Set everything else aside and look at the topmost row and the leftmost column. The top row lists how long a person sleeps: 5 or fewer, 5-7, or 7 or more hours. So whenever you read a row %, remember that it breaks a row down by amount of sleep. The leftmost column lists the type of computer user: minimal, moderate, or extreme. So whenever you read a column %, your first thought should be that it breaks a column down by type of computer user.
So, let's analyze the first row, which says minimal computer time. All of the data in this row is about minimal computer users. Next to it we see hours per night as the second variable, and just below that the labels row %, column %, total %. Don't worry about these. They are just a key to the layout and don't hold any data themselves. The next entry is 5 or fewer. Just below that (in the 1st row) we see 3 entries: 16.3% (row %), 17.5% (column %), 5.8% (total %). Now let's see what they mean.
16.3% (row %): We are dealing with minimal computer users, and this figure is the row %. Recall what we discussed earlier: a row % is about sleep. So 16.3% of the minimal computer users sleep 5 or fewer hours.
17.5% (column %): We are dealing with minimal computer users, and this figure is the column %. A column % is about the type of user. So 17.5% of the people who sleep 5 or fewer hours are minimal computer users.
Do you get the hunch that, since there are 3 categories (minimal, moderate, and extreme) and 17.5% of THE PEOPLE WHO SLEEP 5 OR FEWER HOURS are minimal computer users, the column % of minimal + column % of moderate + column % of extreme (ALL of the 5 or fewer column) will add up to 100%? After all, everyone who sleeps 5 or fewer hours makes up 100% of that category. Check whether this is the case.
5.8% (total %): We are still in the minimal row and the 5 or fewer column, but this figure is out of everyone surveyed. So 5.8% of all people are minimal computer users who get 5 or fewer hours of sleep.
And what % of all people are minimal computer users? It's 35.8%; see the rightmost column.
Try to analyze the rest by yourself.
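The "adds up to 100%" hunch can be checked directly with the percentages quoted in this thread. A short Python sketch follows; the variable names are illustrative, and a small tolerance is assumed because the published figures are rounded to one decimal place.

```python
import math

# Percentages quoted in this thread for the survey table.
row_pct_minimal = [16.3, 32.6, 51.1]      # Minimal row: 5 or fewer, 5-7, 7 or more
col_pct_5_or_fewer = [17.5, 32.5, 50.0]   # "5 or fewer" column: Minimal, Moderate, Extreme

# Each list is a full breakdown of one group, so it should add up to 100%
# (allowing a little slack for rounding in the table).
row_sum = sum(row_pct_minimal)
col_sum = sum(col_pct_5_or_fewer)
print(math.isclose(row_sum, 100, abs_tol=0.2))
print(math.isclose(col_sum, 100, abs_tol=0.2))

# The total % of a cell divided by the row's share of everyone gives the row %.
total_pct_cell = 5.8       # minimal users sleeping 5 or fewer hours, % of all people
total_pct_minimal = 35.8   # minimal users, % of all people
row_pct_from_totals = 100 * total_pct_cell / total_pct_minimal
print(f"{row_pct_from_totals:.1f}")  # lands near the table's 16.3; the gap is presumably rounding
```

The last step shows how the three kinds of percentage tie together: a row % is just the cell's total % rescaled by the size of its row.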
• I just can't seem to answer word questions about associations, but numbers are fine. I don't understand what I'm doing wrong, and it's frustrating because I keep practicing over and over and over, and the only thing I keep getting wrong is the "is there an association..." question. I don't know what else to do about it; I'm in an infinite loop.