Data

This content is not yet complete (incomplete portions are marked separately below).

On this course, computational human sciences has been defined as the application of computational and/or statistical techniques to data in order to yield results of interest to the human sciences. In this part of the course, we will delve into what we mean by data and discuss the particular problems inherent in the types of data most often used in computational human sciences research.

First, to build an intuitive understanding of what data is and how it can be used, we take several short looks at data from different viewpoints: its archetypes, modes of engagement with it, and ways of accessing it.

Archetypes of data

For the purposes of this course, data is defined as information in digital format. For practical purposes, it is useful to partition data into two archetypes: structured and unstructured. Examples of structured data include Excel sheets such as this one containing information on ancient Greek books, and databases, such as this one containing information on books, places and people related to the French book trade in Enlightenment Europe. Examples of unstructured data include raw texts (like this Complete Works of William Shakespeare), sounds (like this recording from 1906) and images (like this image of a newspaper page from 1884).

In the end, computers are really only good with structured data and numbers. They want to count explicit things, be they occurrences of certain gods in ancient Greek texts, or particular types of relationships between people involved in the book trade. Thus, if you start with unstructured data, your first task will often be to extract structure from it (for example, word frequencies from text, or neural-network-derived keyword descriptions of images).
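As a minimal sketch of what extracting structure from unstructured text can look like, the following Python example counts word frequencies in a plain-text file (the filename and the words of interest are placeholders for illustration, not part of any particular dataset):

```python
import re
from collections import Counter

# Read a plain-text source, e.g. a public-domain book saved as a .txt file.
# The filename here is just a placeholder.
with open("shakespeare.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Tokenise very crudely: any run of letters counts as a word.
words = re.findall(r"[a-z]+", text)

# The Counter turns the unstructured text into structured data:
# a mapping from each word to how many times it occurs.
frequencies = Counter(words)

# E.g. how often do particular words of interest appear?
for word in ["king", "queen", "love", "death"]:
    print(word, frequencies[word])
```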

So how does e.g. a neural network manage to extract structure from an "unstructured" image, you might ask? Well, to the computer the image is actually already structured, just at an extremely low level: an image is a grid of numbers, each number describing the colour and brightness of one square in the grid. What a neural network does is take this extremely low-level information and build higher-level features from it by combining information from multiple squares in the grid. For example, if there is a row of bright squares whose neighbours above and below are dark, that indicates there may be a line there. Once the network knows where the lines are, it can perhaps combine certain patterns of lines into circles, certain patterns of circles into eyes, and finally certain patterns of eyes, mouths and noses into a guess that the image depicts a face.
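To make the "grid of numbers" idea concrete, here is a small sketch (using NumPy, on a hand-made greyscale grid) that scores each interior row by how much brighter it is than its neighbours above and below. This is the essence of a simple horizontal-line detector, hand-rolled rather than learned:

```python
import numpy as np

# A tiny greyscale "image": 0.0 = dark, 1.0 = bright.
# The middle row forms a bright horizontal line.
image = np.array([
    [0.0, 0.1, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.9, 1.0, 0.9],   # the "line"
    [0.0, 0.1, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.1, 0.0],
])

# Score each interior row: bright itself, dark neighbours above and below.
# This mimics what a convolutional filter in a neural network computes.
for y in range(1, image.shape[0] - 1):
    score = (2 * image[y] - image[y - 1] - image[y + 1]).mean()
    print(f"row {y}: line score {score:.2f}")
# The middle row gets by far the highest score: "there may be a line here".
```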

This is also the way many of your analysis processes will proceed. You start with some data, then apply various means to derive more involved and sophisticated features from it, until you arrive at something which corresponds (nearly enough) to your object of interest. For example, given a set of digitised newspaper images, you may first apply one tool to detect illustrations, a second to filter those down to advertisement images, and a third to extract keyword descriptions from them. You then add a manual step to filter and categorise those keywords into categories corresponding to your object of interest: if you are interested in gender images in advertising, you might map "woman" and "girl" to "female" and "man" and "boy" to "male", while grouping a whole slew of other keywords into categories such as "technical", "household" and "outdoors". Only after this do you take this data and use your analytical and statistical tools to count how many times each gender is associated with each category across source, place and time, in order to answer your original question. A sketch of this final counting step follows below.
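The sketch below shows only the counting stage of such a pipeline. The keyword records are invented for illustration; in practice they would be the output of the earlier detection and keyword-extraction steps:

```python
from collections import Counter

# Hypothetical output of the earlier pipeline steps: one record per
# detected advertisement image, with extracted keywords and the year.
records = [
    {"year": 1890, "keywords": ["woman", "household", "soap"]},
    {"year": 1890, "keywords": ["man", "machine", "outdoors"]},
    {"year": 1900, "keywords": ["girl", "household"]},
]

# The manual step: map raw keywords to analytical categories.
gender_map = {"woman": "female", "girl": "female",
              "man": "male", "boy": "male"}
topic_map = {"household": "household", "soap": "household",
             "machine": "technical", "outdoors": "outdoors"}

# Count how often each gender co-occurs with each topic category per year.
counts = Counter()
for record in records:
    genders = {gender_map[k] for k in record["keywords"] if k in gender_map}
    topics = {topic_map[k] for k in record["keywords"] if k in topic_map}
    for gender in genders:
        for topic in topics:
            counts[(record["year"], gender, topic)] += 1

for (year, gender, topic), n in sorted(counts.items()):
    print(year, gender, topic, n)
```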

Modes of engagement with data

Next, again for the purposes of this course, it is useful to delineate and contrast some archetypal modes of engaging with data (all examples below are different ways of interacting with the National Library of Finland's collection of digitised newspapers):

  • Non-digital / digital without any functionality - both finding relevant sources and analysing them happen through manual work and reading

  • Search interfaces - aid in finding relevant sources, but analysis and understanding still happen through reading

  • Analytical interfaces - transform and aggregate the data in new ways to yield insight. For example:

    • Keyword in context (KWIC) interfaces show search results ordered around the word and its particular instances. Data is returned not in its original structure, but transformed into a view centred on the keyword and its appearances.

    • Frequency graphs transform the data through aggregation, here counting how many times a keyword appears in the data and projecting these counts through time. Through this presentation, large-scale trends become visible. A small code sketch of both of these views follows after this list.
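To make these two analytical views concrete, here is a small Python sketch of both a keyword-in-context view and the yearly counts behind a frequency graph. The example pages are invented stand-ins for digitised newspaper text; a real collection would of course be far larger:

```python
import re
from collections import Counter

# Invented stand-ins for digitised newspaper pages: (year, text) pairs.
pages = [
    (1890, "the new steam engine arrived in town and the engine was admired"),
    (1895, "an engine of progress, they said, though the horse disagreed"),
    (1900, "electric light replaced the old gas lamps on the main street"),
]

def kwic(pages, keyword, width=25):
    """Show each occurrence of the keyword centred in a window of context."""
    for year, text in pages:
        for match in re.finditer(rf"\b{re.escape(keyword)}\b", text):
            start, end = match.start(), match.end()
            left = text[max(0, start - width):start].rjust(width)
            right = text[end:end + width]
            print(f"{year}  {left}[{keyword}]{right}")

def yearly_frequency(pages, keyword):
    """Aggregate: count keyword occurrences per year (a frequency graph's data)."""
    counts = Counter()
    for year, text in pages:
        counts[year] += len(re.findall(rf"\b{re.escape(keyword)}\b", text))
    return dict(sorted(counts.items()))

kwic(pages, "engine")
print(yearly_frequency(pages, "engine"))
```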

Ways of accessing data

This section is not yet complete. In the meantime, please see this presentation.

  • User interfaces provide access to pre-defined functionalities in a user-friendly manner

  • APIs provide access to pre-defined functionalities for programs

  • Having access to the dataset as data gives you control over which tools to use, and how to tie those tools together into a workflow.

  • Data dumps provide access to the raw data. However, they are often very large, and their format may be complex in order to include all facets of the data. Thus, raw dumps can be difficult to process and use.

  • Both user interfaces and APIs may allow subsets of the data to be selected and downloaded as data for further analysis. Often these allow limiting both the size of the data and its features, making the resulting data much easier to process and handle. However, APIs very often do not contain all facets of the original data, instead trading richness for ease of use in order to serve the data in a simpler format suited to a particular subset of use cases. A generic sketch of such an API request follows below.
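As a generic illustration of downloading a subset of data through an API, here is a sketch using the widely used Python requests library. The endpoint URL, parameter names and response structure are entirely hypothetical; a real API's documentation defines the actual ones:

```python
import requests

# Hypothetical API endpoint and parameters, for illustration only.
# Real APIs define their own URLs, parameter names and response formats.
response = requests.get(
    "https://api.example.org/newspapers/search",
    params={
        "query": "engine",           # limit by content
        "startYear": 1890,           # limit the size of the subset...
        "endYear": 1900,
        "fields": "date,title,text", # ...and which facets are returned
    },
    timeout=30,
)
response.raise_for_status()
results = response.json()

# The downloaded subset is now structured data, ready for local analysis.
for item in results.get("items", []):
    print(item.get("date"), item.get("title"))
```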

Problems with non-standard data

This section is not yet written. In the meantime, please see this presentation.

Assignment

  1. Find a dataset that could be of interest to you in your final project. Post a message on #datasets on Slack giving:

    1. A link to the dataset

    2. A note on why you selected it

    3. A short description of what types of information the dataset contains, and

    4. The structure, technical format and way of downloading the data

  2. Find a potential source of bias in a dataset someone else picked. Reply to their message on Slack to let them know about it.

  3. If someone notes a bias in your dataset, respond by thinking of ways of overcoming or mitigating it.

Potential datasets and APIs include, for example:

Further resources
