This content is not yet complete. In the meantime, see this presentation: Data analysis: fundamental concepts of statistics (pdf, gd)

Consider this dataset describing the sex, lifespan, social rank, education and number of letters written for a group of people (taken from the metadata of the Corpus of Early English Correspondence). Its first five rows are reproduced below:

| Name | Sex | Lifespan | Rank | Education | NumberOfLetters |
|---|---|---|---|---|---|
| Mary Wortley Montagu née Pierrepont | Female | 73 | Nobility | High | 189 |
| JOHN HOLLES | Male | 72 | Nobility | High | 136 |
| Samuel Pepys | Male | 70 | Professional | High | 136 |
| Daniel Fleming | Male | 68 | Gentry | High | 122 |
| Walter Ralegh | Male | 64 | Gentry | High | 119 |

What can one do with such data? First of all, one can look at the individuals. To see how they compare against each other with regard to all the axes, it may make sense to order or sort the data according to an axis of interest, such as lifespan or number of letters. One can also visualize the data graphically, similarly ordered by lifespan or number of letters.
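As a minimal sketch of such sorting, the five rows from the table above can be ordered in Python. The tuple layout and the names `rows`, `LIFESPAN` and `LETTERS` are assumptions made for this illustration; they are not part of the original dataset or its tooling.

```python
# The five rows shown in the table above, as
# (name, sex, lifespan, rank, education, number_of_letters).
rows = [
    ("Mary Wortley Montagu née Pierrepont", "Female", 73, "Nobility", "High", 189),
    ("JOHN HOLLES", "Male", 72, "Nobility", "High", 136),
    ("Samuel Pepys", "Male", 70, "Professional", "High", 136),
    ("Daniel Fleming", "Male", 68, "Gentry", "High", 122),
    ("Walter Ralegh", "Male", 64, "Gentry", "High", 119),
]

# Column positions inside each tuple (illustrative names).
LIFESPAN, LETTERS = 2, 5

# Shortest lifespan first.
by_lifespan = sorted(rows, key=lambda row: row[LIFESPAN])

# Most letters first; sorted() is stable, so ties keep their original order.
by_letters = sorted(rows, key=lambda row: row[LETTERS], reverse=True)

for name, _sex, lifespan, _rank, _education, letters in by_lifespan:
    print(f"{lifespan:>3}  {letters:>4}  {name}")
```

Sorting by a different axis only requires changing the `key` function, which is why the same data can easily be viewed from several angles.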

However, what if one is interested not just in individuals, but in groups within the data, or in the data as a whole? What can one say about the lifespans, or the numbers of letters written, across the whole dataset? Looking at the lifespan graph, for example, one can say that the highest lifespan in the data is 95, while the lowest is 16. By finding the centrepoint of the graph and looking up the lifespan there, one can also see that about half of the people lived at least 65 years, while about half lived less than that. From this view, however, it is quite hard to say what the most common lifespans are. For this, it helps to *aggregate* the data by lifespan. In practice, this is done by taking each lifespan appearing in the data and counting how many times it appears. The result of this calculation is a new table and visualization describing the *distribution* of the lifespans.
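The counting step can be sketched in a few lines of Python. The example uses only the five lifespans from the first rows of the table, so its numbers differ from those of the full dataset (where the range is 16 to 95). Grouping into decades is an extra assumption added here to make such a small sample readable; it is not something the text above prescribes.

```python
from collections import Counter
from statistics import median

# Lifespans from the five rows shown earlier (not the full dataset).
lifespans = [73, 72, 70, 68, 64]

# Whole-sample summaries: lowest, highest and middle value.
print(min(lifespans), max(lifespans), median(lifespans))

# Aggregation: count how many times each lifespan appears.
distribution = Counter(lifespans)

# With so few rows every exact value occurs once, so also count per decade.
by_decade = Counter((age // 10) * 10 for age in lifespans)

# A crude text histogram of the per-decade distribution.
for decade in sorted(by_decade):
    print(f"{decade}s: {'#' * by_decade[decade]}")
```

On the full data, `distribution.most_common()` would list the most frequent lifespans first, which answers the question that is hard to read off the sorted graph.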

categorical

ordinal

numerical / interval (/ ratio)

continuous, discrete


“The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English, both spoken and written, from the late twentieth century.”

90% written, 10% speech

Written:

70-80% informative, 20-30% imaginative

60% books, 30% periodicals, 10% miscellaneous

Informative: 5% natural and pure science, 5% applied science, 15% social and community, 15% world and current affairs, 10% commerce and finance, 10% arts, 5% belief and thought, 10% leisure

High, low and middle-level language

Spoken: demographic sample of discussions, event-based sample of educational, business, public/institutional and leisure speech (60% dialogue, 40% monologue)
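As a quick arithmetic check of the design figures above, the shares can be turned into approximate sizes. This sketch assumes that the percentages refer to word counts and that the informative domain shares are percentages of the written part; neither is stated explicitly above, and the variable names are purely illustrative.

```python
# "100 million word collection" from the description quoted above.
TOTAL_WORDS = 100_000_000

# Assumption: the 90% / 10% split is measured in words.
written_words = TOTAL_WORDS * 90 // 100
spoken_words = TOTAL_WORDS * 10 // 100

# Informative domain shares as listed above (percent).
informative_domains = {
    "natural and pure science": 5,
    "applied science": 5,
    "social and community": 15,
    "world and current affairs": 15,
    "commerce and finance": 10,
    "arts": 10,
    "belief and thought": 5,
    "leisure": 10,
}

# The listed shares should fall inside the stated 70-80% informative band.
informative_total = sum(informative_domains.values())

print(written_words, spoken_words, informative_total)
```

Under these assumptions the listed domains add up to 75%, which sits in the middle of the 70-80% range given for informative texts.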

“The average life expectancy at birth is 63 years for males and 64 years for females”

What does this mean?
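One way to start unpacking the question is to see how an average reacts to a single early death. The sketch below takes the five lifespans from the table earlier on this page and adds one person with the lowest lifespan mentioned for that dataset (16). It only illustrates how the mean and the median behave; it is not a model of how life expectancy at birth is actually calculated.

```python
from statistics import mean, median

# Lifespans from the five table rows shown earlier.
lifespans = [73, 72, 70, 68, 64]

mean_before = mean(lifespans)
median_before = median(lifespans)

# Add one short life: 16 is the lowest lifespan reported for the full data.
with_early_death = lifespans + [16]

mean_after = mean(with_early_death)
median_after = median(with_early_death)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

The mean drops by almost nine years while the median barely moves, so an average of 63 or 64 years need not mean that most people died around that age.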

https://en.wikipedia.org/wiki/Anscombe%27s_quartet

https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php

Reading assignment: The civilizing process in London’s Old Bailey

Try to answer the questions given under the "Reading material" heading

Check out the Explained Visually site, and especially PCA explained visually