Understanding and describing groups

Context note: this is a sub-part of the fundamental concepts of statistics section of the computational literacy for humanities and social sciences course. You can use this to teach yourself some fundamental concepts of statistics. However, if you want to understand more broadly when you might want to use them, you're better off going through the whole course.

Consider this dataset of the age at death of the two million Finns (2 027 385 to be exact) who died between 1980 and 2020. While aggregated in the original, the data essentially contains the following information:

| Sex | Year of death | Age at death |
| --- | --- | --- |
| Male | 1980 | 86 |
| Male | 1980 | 76 |
| Male | 1980 | 76 |
| Female | 1980 | 75 |
| Male | 1981 | 96 |

... (for a longer sample of 1000 people, see here)

As a table, this tells us about each individual. Further, we might order the two million rows by, for example, Age at death to find the longest-lived people in this time period:

| Sex | Year of death | Age at death |
| --- | --- | --- |
| Female | 2000 | 112 |
| Female | 2005 | 111 |
| Male | 2009 | 111 |
| Female | 2015 | 111 |

...

However, what if we want to look at the data as a whole, to see what it tells us about the lifespans of Finns as a whole? For this, we need to turn to statistics. Let us start with a simple visualization that just takes all two million people and plots their ages at death in increasing order:

While this is not the way these types of data are usually presented, what does this visualization tell us? First, one can read proportions out of it. Because the people are ordered by age at death, by looking at the midpoint of the graph (around the one-millionth person) and reading off the age recorded there (around 77), we can say that 50% of Finns live to be older than 77. This works for any percentage: looking at around 200 000 (10% of 2 000 000) and finding the age 52, we can say that only 10% of Finns die before reaching that age, while looking at 1 800 000 (90% of 2 000 000) and finding the age 90, we can conclude that only 10% of Finns live longer than 90 years.
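
The lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's own code: the function name `age_at_fraction` is made up for this example, and the `ages` list holds only the five sample rows from the first table instead of the full two million people. It is also a plain positional lookup, not an interpolated percentile.

```python
# Ages at death from the five sample rows shown in the first table.
ages = [86, 76, 76, 75, 96]


def age_at_fraction(ages, fraction):
    """Return the age found `fraction` (0..1) of the way through the sorted ages."""
    ordered = sorted(ages)
    # Convert the fraction into a position, staying inside the list.
    index = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]


# The midpoint of the ordered list: half of the people lie on either side of it.
midpoint_age = age_at_fraction(ages, 0.5)
print(midpoint_age)
```

With the full dataset, the same call with `0.1` or `0.9` would correspond to reading the graph at person 200 000 or 1 800 000.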

This works the other way around as well. For example, looking at age 40 and finding the number 100 000 (5% of 2 000 000), we can say that only 5% of Finns die before reaching 40. If we want to know the proportion of Finns who die between 40 and 80, we look up 80 (at about 1 200 000, or 60% of 2 000 000) and can calculate that 60% - 5% (the proportion from 0 to 40) = 55% of Finns die in that age range.
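
The reverse lookup, from an age to a proportion, might look like the following sketch. The helper name `share_dying_before` is hypothetical, and the data is again just the five sample rows from the first table, so the thresholds used here (76 and 90) are chosen to suit that tiny sample rather than the 40 and 80 of the text.

```python
from bisect import bisect_left

# Ages at death from the five sample rows, sorted in increasing order.
ordered_ages = sorted([86, 76, 76, 75, 96])


def share_dying_before(ordered_ages, age):
    """Fraction of people whose age at death is strictly below `age`."""
    # bisect_left counts how many sorted values are smaller than `age`.
    return bisect_left(ordered_ages, age) / len(ordered_ages)


before_76 = share_dying_before(ordered_ages, 76)
before_90 = share_dying_before(ordered_ages, 90)

# Proportion dying within a range = difference of the two cumulative shares.
between_76_and_90 = before_90 - before_76
print(before_76, before_90, between_76_and_90)
```

The subtraction in the last step is the same calculation as the 60% - 5% in the text, only on different numbers.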

To make these calculations easier, we can replace each person's number with their position as a percentage of the dataset. Further, let's switch the X and Y axes with each other (in this form, the resulting graph has a name: the empirical cumulative distribution function).

So, only 5% of Finns die within the first 40 years of their lives, while 55% die in their next 40. Comparing these two 40-year spans in the graphs, we can see that in the first graph, arranged by people, there are few people in the horizontal bands where the graph moves up quickly. In the bands where it moves up slowly, there are many more people. In the second graph, arranged by age at death, this is reversed. Where the graph moves up slowly, there are few people, and where it moves up quickly, there are many. This, in general, is how one examines statistical visualizations: knowing what the shapes represent in each different graph and how to read them, one can start exploring patterns in the data.
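
For readers who want to see how the points of an empirical cumulative distribution function are obtained, here is a minimal sketch. The function name `ecdf` and its output format (a list of age and cumulative-percentage pairs) are choices made for this example, and the input is only the five sample rows from the first table.

```python
def ecdf(ages):
    """Return (age, cumulative percentage) pairs, one per distinct age."""
    ordered = sorted(ages)
    n = len(ordered)
    points = []
    for i, age in enumerate(ordered, start=1):
        # i people have died at or before this age.
        percentage = 100 * i / n
        if points and points[-1][0] == age:
            # Same age as the previous person: raise the existing point.
            points[-1] = (age, percentage)
        else:
            points.append((age, percentage))
    return points


# Ages at death from the five sample rows shown in the first table.
for age, percentage in ecdf([86, 76, 76, 75, 96]):
    print(age, percentage)
```

The jump from 20% to 60% at age 76 reflects the two people in the sample who died at that age: a steep rise in this graph means many people.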

While these graphs are good for dividing the data into percentages, from them it is still difficult to get a good idea of when exactly people are likely to die. To get a better overview of this, we need to move from cumulative graphs to ones showing local density. For this, we need to calculate the distribution of the data over the Age at death. What this means is that we take each Age at death, and count how many people die at that age. The resulting table looks like this:

| Age at death | Number of people |
| --- | --- |
| 112 | 1 |
| 111 | 3 |
| 110 | 6 |
| 109 | 10 |

...

Plotted visually, this table looks as follows:

Here, the height of each bar corresponds directly to the number of people dying at that age. This allows immediate lookup and comparison of exact ages (e.g. that about 10 000 people died aged 50, while about double that many died aged 60). Comparing ranges, on the other hand, now requires comparing the areas of different regions of the graph. Humans are not good at doing this accurately, but general impressions are still available.
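
The counting step behind this distribution can be sketched with Python's standard `collections.Counter`. As before, this is only an illustration on the five sample rows from the first table, not the course's own code, and the text "bars" are a stand-in for a real plot.

```python
from collections import Counter

# Ages at death from the five sample rows shown in the first table.
ages = [86, 76, 76, 75, 96]

# Count how many people died at each age.
counts = Counter(ages)

# One text bar per age, oldest first; bar length equals the count.
for age in sorted(counts, reverse=True):
    print(f"{age:>3} {'#' * counts[age]} ({counts[age]})")
```

Looking up an exact age is then a direct read, for example `counts[76]`, while an age nobody in the sample died at simply gives `0`.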

For example, from this visualization we can immediately see the following things:

  1. Many people die in the first year of their lives.

  2. After that, there is only a small chance of dying before reaching 30; it seems to be very low between ages 1 and 15 and then increases somewhat in the late teen years.

  3. After 30, the probability of death slowly increases. The average age at death by natural causes seems to be somewhere around 80 years, but there is a large variation of 10-20 years around that as well.

This last observation takes us to a side path on summary statistics.
