Understanding relationships

Context note: this page is a sub-part of the fundamental concepts of statistics section of the computational literacy for humanities and social sciences course. You can use it to teach yourself some fundamental concepts of statistics. However, if you want to understand more broadly when you might want to use them, you are better off going through the whole course.

Sometimes, we are interested neither in describing groups nor in validating their differences. Instead, we may be interested in relationships between variables. For example, we might want to know how much income level affects life expectancy, and even compare that effect to those of sex and healthcare expenditure. This is an area of statistics that quickly grows in complexity (see e.g. Bayesian probabilistic modeling). For this short introduction, we will therefore limit ourselves to the simplest of methods, intended only to convey the general gist of what these are about.

When evaluating the relationship between two numerical variables (e.g. life expectancy and healthcare spending), the simplest approach is to look at their correlation. Correlation measures the extent to which the variables are linearly related, i.e. have a relationship where, if one variable grows by a certain amount, the other grows or diminishes by a proportional amount (e.g. that for every 100 million spent on healthcare, life expectancy would increase by a year). Note that, as stated, correlation only accounts for linear relationships, so it would not, for example, be able to model diminishing returns in healthcare spending.
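To make this concrete, here is a minimal sketch of computing a Pearson correlation coefficient by hand in Python; the per-capita spending and life-expectancy numbers are made up purely for illustration.

```python
# A minimal sketch of the Pearson correlation coefficient,
# computed by hand on made-up (hypothetical) country data.
from math import sqrt

spending = [2.0, 3.5, 4.0, 5.5, 7.0]       # hypothetical healthcare spending
life_exp = [72.0, 75.0, 76.5, 79.0, 81.0]  # hypothetical life expectancy (years)

def pearson(xs, ys):
    """Correlation = covariance divided by the product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(spending, life_exp)  # close to 1: strong linear relationship
```

The result always falls between -1 (perfect negative linear relationship) and 1 (perfect positive one), with 0 meaning no linear relationship at all.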

Assignment

  • To get a better idea of how correlation works, play around with this correlation visualization. For example, think of X as the size of one's hand, and Y as their height. If the two are correlated, and you measure both the height and the hand size of a group of people, a large hand size (a large X value) should go together with a large Y value (a tall height).

  • Once you understand how correlation works, ponder what it means by looking at spurious correlations found in real-world data. The take-home message here is that correlation does not imply causation. Instead, there are multiple ways through which a correlation can appear, including, for example, a common external cause for both variables, or merely random chance.
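The random-chance route to correlation is easy to demonstrate: generate enough unrelated random variables and some pair will usually correlate noticeably anyway. A small sketch (all the data below is pure noise, so any correlation it finds is spurious by construction):

```python
# Sketch: among many unrelated random series, the strongest pairwise
# correlation is usually sizeable purely by chance.
import random
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)  # fixed seed so the run is repeatable
# 50 completely unrelated random series of 10 observations each
series = [[random.random() for _ in range(10)] for _ in range(50)]

# strongest correlation found between any two of the unrelated series
best = max(abs(pearson(series[i], series[j]))
           for i in range(50) for j in range(i + 1, 50))
```

With 50 series there are 1,225 pairs to compare, so even though every series is noise, the best-correlated pair typically looks quite convincing.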

Stepping on from correlation, one may want to start building a formal model of the data, through which one could posit and verify laws about the world that gave rise to it. While such models can grow increasingly complex (again, see e.g. Bayesian probabilistic modeling), it is good to start with the simplest of them: the linear regression model. Here, the idea is to formally encode the linear relationship sought through correlation. For example, one might come up with a formula stating that the height of a person is 50 cm plus 8 times their hand size in cm (height = 50 + 8 × hand size).
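Such a formula can be recovered from data by ordinary least squares. A minimal sketch, using made-up hand-size/height pairs constructed to lie exactly on the toy line height = 50 + 8 × hand size:

```python
# Minimal one-predictor linear regression via ordinary least squares.
# The data is made up to lie exactly on height = 50 + 8 * hand_size.
hand = [16.0, 17.0, 18.0, 19.0, 20.0]      # hand sizes in cm
height = [50 + 8 * h for h in hand]        # heights in cm

n = len(hand)
mx, my = sum(hand) / n, sum(height) / n

# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
slope = (sum((x - mx) * (y - my) for x, y in zip(hand, height))
         / sum((x - mx) ** 2 for x in hand))
intercept = my - slope * mx

def predict(hand_size):
    """Predicted height from the fitted linear model."""
    return intercept + slope * hand_size
```

On real, noisy measurements the fitted slope and intercept would of course only approximate the underlying relationship, but the fitting recipe stays the same.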

Assignment

  • To get a better idea about linear regression, go look at regression explained visually and play around with it.

  • Finally, as a bridge toward computational data analysis approaches, look at principal component analysis explained visually. In this approach, the viewpoint switches from describing, comparing or modelling data on given axes to instead using statistical computation to 1) figure out which axes in the data are important in the first place, and 2) automatically discover the most important patterns in the data overall (or more precisely, the most important by a certain precise definition concerning axes of maximal variance).
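The core idea behind principal component analysis can be sketched for the two-variable case without any special libraries, since the variances along the principal axes are the eigenvalues of the 2×2 covariance matrix, which have a closed form. The data below is made up so that it mostly varies along a single direction:

```python
# Sketch of the idea behind principal component analysis for two
# variables: the eigenvalues of the 2x2 covariance matrix give the
# variance along each principal axis. Data is made up and nearly
# one-dimensional, so the first axis should dominate.
from math import sqrt

xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.9, 4.2, 5.8, 8.1, 9.9]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# entries of the covariance matrix [[a, b], [b, c]]
a = sum((x - mx) ** 2 for x in xs) / n                     # var(x)
c = sum((y - my) ** 2 for y in ys) / n                     # var(y)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # cov(x, y)

# closed-form eigenvalues: variance captured by each principal axis
mid = (a + c) / 2
off = sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mid + off, mid - off

explained = lam1 / (lam1 + lam2)  # share of variance on the first axis
```

Here `explained` comes out above 0.99: a single automatically discovered axis captures nearly all the variation, which is exactly the kind of pattern discovery the assignment's visualization demonstrates in higher dimensions.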
