Data analysis: fundamental concepts of statistics

This content is not yet complete. In the meantime, see this presentation: Data analysis: fundamental concepts of statistics (pdf, gd)

Consider this dataset describing the sex, lifespan, social rank, education and number of letters written for a group of people (taken from the metadata of the Corpus of Early English Correspondence). The first five rows are reproduced below:

| Name | Sex | Lifespan | Rank | Education | NumberOfLetters |
|---|---|---|---|---|---|
| Mary Wortley Montagu née Pierrepont | Female | 73 | Nobility | High | 189 |
| JOHN HOLLES | Male | 72 | Nobility | High | 136 |
| Samuel Pepys | Male | 70 | Professional | High | 136 |
| Daniel Fleming | Male | 68 | Gentry | High | 122 |
| Walter Ralegh | Male | 64 | Gentry | High | 119 |

What can one do with such data? First of all, one can look at the individuals. To see how they compare with each other along each of the axes, it may make sense to sort the data by an axis of interest, such as lifespan or number of letters. One can also visualize the data graphically, ordered in the same way by lifespan or number of letters.
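As a minimal sketch of such sorting, the five rows shown above can be held as a list of dictionaries and ordered by either axis. The variable names (`people`, `by_lifespan`, `by_letters`) are chosen here for illustration only; the full dataset would normally be loaded from a file instead.

```python
# The five rows reproduced from the table above.
people = [
    {"Name": "Mary Wortley Montagu née Pierrepont", "Sex": "Female",
     "Lifespan": 73, "Rank": "Nobility", "Education": "High", "NumberOfLetters": 189},
    {"Name": "JOHN HOLLES", "Sex": "Male",
     "Lifespan": 72, "Rank": "Nobility", "Education": "High", "NumberOfLetters": 136},
    {"Name": "Samuel Pepys", "Sex": "Male",
     "Lifespan": 70, "Rank": "Professional", "Education": "High", "NumberOfLetters": 136},
    {"Name": "Daniel Fleming", "Sex": "Male",
     "Lifespan": 68, "Rank": "Gentry", "Education": "High", "NumberOfLetters": 122},
    {"Name": "Walter Ralegh", "Sex": "Male",
     "Lifespan": 64, "Rank": "Gentry", "Education": "High", "NumberOfLetters": 119},
]

# Shortest lifespan first.
by_lifespan = sorted(people, key=lambda p: p["Lifespan"])

# Most prolific letter writer first; ties keep their original order.
by_letters = sorted(people, key=lambda p: p["NumberOfLetters"], reverse=True)

for person in by_lifespan:
    print(f'{person["Lifespan"]:>3}  {person["Name"]}')
```

Sorting by a different axis only requires changing the key, which makes it easy to compare several orderings of the same individuals.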

However, what if one is interested not just in individuals, but in groups within the data, or in the data as a whole? What can one say about the lifespans, or about the number of letters written, across the whole dataset? Looking at the lifespan graph, for example, one can say that the longest lifespan in the data is 95 years, while the shortest is 16. Also, by finding the midpoint of the graph and looking up the lifespan there, one can see that about half of the people lived at least 65 years, while about half lived less than that (this midpoint value is the median).

However, from this view it is quite hard to say, for example, what the most common lifespans are. For this, it helps to aggregate the data by lifespan. In practice, this is done by taking each distinct lifespan appearing in the data and counting how many times it appears. The result of this calculation is a new table and visualization describing the distribution of the lifespans.
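The aggregation described above could be sketched roughly as follows. Note that the `lifespans` list is an illustrative assumption: its first five values come from the table above, while the remaining five are invented, so the results do not match the figures quoted for the full dataset.

```python
from collections import Counter
from statistics import median

# First five values are from the table; the last five are invented
# purely for illustration and are NOT part of the real dataset.
lifespans = [73, 72, 70, 68, 64, 70, 72, 70, 55, 68]

shortest = min(lifespans)
longest = max(lifespans)
middle = median(lifespans)  # half the values lie at or below this point

# Count how many times each distinct lifespan occurs.
distribution = Counter(lifespans)

print(f"shortest={shortest}, longest={longest}, median={middle}")

# A crude text histogram, ordered by lifespan.
for lifespan in sorted(distribution):
    print(f"{lifespan:>3} {'#' * distribution[lifespan]}")
```

The `distribution` counter plays the role of the new table: each row pairs a lifespan with the number of people who reached it, and `distribution.most_common(1)` would give the most frequent lifespan.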

Archetypes of data from the viewpoint of statistics

  • categorical

  • ordinal

  • numerical / interval (/ ratio)

    • continuous, discrete
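One practical consequence of these data types is that different summary measures make sense for each. The sketch below is an illustration under stated assumptions: treating sex as categorical, education as ordinal and lifespan as numerical is this example's reading of the dataset, and the education values and their ordering (`Low < Medium < High`) are hypothetical.

```python
from statistics import mean, median_low, mode

# Categorical: values have no order, so only counting makes sense.
sex = ["Female", "Male", "Male", "Male", "Male"]
most_common_sex = mode(sex)

# Ordinal: values can be ranked, but distances between them are undefined.
# Both the values and the ordering below are hypothetical.
education_order = ["Low", "Medium", "High"]
education = ["High", "Low", "Medium", "High", "Medium"]
ranks = [education_order.index(level) for level in education]
median_education = education_order[median_low(ranks)]

# Numerical: differences are meaningful, so an arithmetic mean is too.
lifespan = [73, 72, 70, 68, 64]
mean_lifespan = mean(lifespan)

print(most_common_sex, median_education, mean_lifespan)
```

Taking a mean of the education ranks would silently assume that the step from Low to Medium equals the step from Medium to High, which ordinal data does not guarantee.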

Terminology: representativeness

“The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English, both spoken and written, from the late twentieth century.”

  • 90% written, 10% speech

  • Written:

    • 70-80% informative, 20-30% imaginative

    • 60% books, 30% periodicals, 10% miscellaneous

    • Informative: 5% natural and pure science, 5% applied science, 15% social and community, 15% world and current affairs, 10% commerce and finance, 10% arts, 5% belief and thought, 10% leisure

    • High, low and middle-level language

  • Spoken: demographic sample of discussions, event-based sample of educational, business, public/institutional and leisure speech (60% dialogue, 40% monologue)

Terminology: average

“The average life expectancy at birth is 63 years for males and 64 years for females”

What does this mean?
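One way to see why the question matters is that "average" can refer to several different measures of central tendency, which may disagree considerably. The following sketch uses an invented list of ages at death (not real data) that includes two early deaths:

```python
from statistics import mean, median, mode

# Hypothetical ages at death, invented for illustration only.
ages_at_death = [1, 2, 60, 70, 72, 75, 78, 80, 80, 85]

mean_age = mean(ages_at_death)      # sum of the values divided by their count
median_age = median(ages_at_death)  # middle value of the sorted data
mode_age = mode(ages_at_death)      # most frequently occurring value

print(f"mean={mean_age}, median={median_age}, mode={mode_age}")
```

In this made-up sample the mean is pulled down by the two early deaths, while the median and mode stay much higher, so a single reported "average" hides what a typical lifespan looks like.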

Anscombe's quartet is a set of four datasets that have nearly identical descriptive statistics (means, variances, correlation) yet look very different when plotted:

https://en.wikipedia.org/wiki/Anscombe%27s_quartet
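The claim about Anscombe's quartet can be checked with a short sketch. The data values below are those commonly reproduced for the quartet (for example on the Wikipedia page linked above); the `pearson` helper is written out here for illustration rather than taken from a library.

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


# (mean x, variance x, mean y, variance y, correlation) per dataset.
summary = {
    name: (mean(xs), variance(xs), mean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, vx, my, vy, r) in summary.items():
    print(f"{name:>3}: mean_x={mx:.2f} var_x={vx:.2f} "
          f"mean_y={my:.2f} var_y={vy:.2f} r={r:.3f}")
```

The printed rows agree to about two decimal places, even though a scatter plot of each dataset shows a clearly different shape. This is why summary statistics alone are not a substitute for looking at the data.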

https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php

  1. Reading assignment: The civilizing process in London’s Old Bailey

    • Try to answer the questions given under the "Reading material" heading

  2. Check out the Explained Visually site, and especially PCA explained visually