Different types of data, data quality, available open datasets

This content is not yet complete. In the meantime, please see this presentation: Different types of data, data quality, available open datasets (pdf, gd)

Archetypes of data

In the end, computers are really only good with numbers: they want to count things. Thus, if you are starting with unstructured data, your first task will be to extract structure from that data (for example, frequencies of certain words in a text, or neural-network-derived keyword descriptions of images).
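
Turning raw text into word frequencies, for instance, takes only a few lines. Below is a minimal sketch in Python; the sample sentence is made up for illustration:

```python
from collections import Counter
import re

def word_frequencies(text: str) -> Counter:
    """Derive simple structure from unstructured text: per-word counts."""
    # Lowercase the text and split it into rough word tokens
    words = re.findall(r"[a-zåäö]+", text.lower())
    return Counter(words)

freqs = word_frequencies("The cat sat on the mat. The mat was flat.")
print(freqs.most_common(3))  # [('the', 3), ('mat', 2), ('cat', 1)]
```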

So how do neural networks manage to extract structure from an "unstructured" image, you might ask? Well, the image is actually already structured for the computer; it just has structure at an extremely low level. To a computer, an image is a grid of numbers, each number describing the colour and brightness of one square in the grid. What a neural network does is take this extremely low-level information and build higher-level features from it by combining information from multiple squares in the grid. For example, if there is a row of bright squares whose other neighbours in the grid are dark, that indicates there may be a line there. Then, once the network knows where lines are, it can perhaps combine certain patterns of lines into circles, certain patterns of circles into eyes, and finally certain patterns of eyes, mouths and noses into a guess that the image depicts a face.
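
To make the "grid of numbers" idea concrete, here is a toy sketch of the line-detection step described above. The detector is hand-coded for illustration; a convolutional neural network learns filters of this kind rather than being given them:

```python
import numpy as np

# To a computer, a greyscale image is just a grid of numbers
# (0 = dark, 1 = bright). This toy 5x5 "image" has a bright
# horizontal line on row 2, surrounded by darker pixels.
image = np.array([
    [0.0, 0.1, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.9, 1.0, 0.9],  # the bright line
    [0.0, 0.1, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.1, 0.0],
])

# A hand-coded "line detector": a row scores high when its pixels are
# bright while the neighbouring rows are dark.
for row in range(1, image.shape[0] - 1):
    score = image[row].mean() - (image[row - 1].mean() + image[row + 1].mean()) / 2
    if score > 0.5:
        print(f"Possible horizontal line at row {row}")  # prints: row 2
```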

This is also the way in which many of your analysis processes will proceed. You start with some data, and then apply various means to derive more involved and sophisticated features from it, until you arrive at something which corresponds (near enough) to your object of interest. For example, given a set of digitised newspaper images, you may first apply one tool to detect illustrations, a second to filter those down to advertisement images, and a third to extract keyword descriptions from the advertisements. You then add a manual step to filter and categorize those keywords into categories corresponding to your object of interest (e.g. if you are interested in images of gender in advertising, you map "woman" and "girl" to "female" and "man" and "boy" to "male", while grouping a whole slew of other keywords into categories such as "technical", "household" or "outdoors"). Only after this will you take the resulting data and use your analytical and statistical tools to count how often each gender is associated with each category across sources, places and time, in order to answer your original question.
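
The mapping and counting steps of such a pipeline might look like the following sketch. The records, keyword maps and category names here are invented stand-ins for whatever your actual tools produce:

```python
from collections import Counter

# Invented stand-ins for the output of the earlier pipeline steps:
# one record per detected advertisement, with machine-assigned keywords.
ads = [
    {"keywords": ["woman", "kitchen", "stove"], "year": 1925},
    {"keywords": ["man", "car", "road"], "year": 1925},
    {"keywords": ["girl", "sewing"], "year": 1930},
]

# The manual mapping step: collapse raw keywords into analytic categories.
gender_map = {"woman": "female", "girl": "female", "man": "male", "boy": "male"}
topic_map = {"kitchen": "household", "stove": "household", "sewing": "household",
             "car": "technical", "road": "outdoors"}

# The counting step: how often is each gender associated with each category?
counts = Counter()
for ad in ads:
    genders = {gender_map[k] for k in ad["keywords"] if k in gender_map}
    topics = {topic_map[k] for k in ad["keywords"] if k in topic_map}
    for gender in genders:
        for topic in topics:
            counts[(gender, topic, ad["year"])] += 1

print(counts.most_common())  # e.g. (('female', 'household', 1925), 1), ...
```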

Modes of engagement with data

  • Non-digital / digital without any functionality - both finding relevant sources and analysing them happen through manual work and reading

  • Search interfaces - aid in finding, but analysis and understanding happen through reading

  • Analytical interfaces - transform and aggregate the data in new ways to yield insight. For example:

    • Keyword in context (KWIC) interfaces show search results, but ordered around the word and its individual occurrences. The data is returned not in its original structure, but transformed into a view centred on the word and its appearances (see the first sketch after this list).

    • Frequency graphs transform the data through aggregation, counting how many times a keyword is found in the data and projecting these counts through time. Through this presentation, large-scale trends become visible (the second sketch after this list shows the idea).
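
Both of these transformations are simple enough to sketch. First, a minimal keyword-in-context view; the sample sentence is invented:

```python
def kwic(text: str, keyword: str, width: int = 25) -> None:
    """Print each occurrence of a keyword centred in its surrounding context."""
    lowered = text.lower()
    start = lowered.find(keyword.lower())
    while start != -1:
        left = text[max(0, start - width):start]
        right = text[start + len(keyword):start + len(keyword) + width]
        print(f"{left:>{width}} [{text[start:start + len(keyword)]}] {right}")
        start = lowered.find(keyword.lower(), start + 1)

kwic("The steam engine changed cities. Every engine needs fuel.", "engine")
```

And second, the aggregation behind a frequency graph: count keyword hits per year and project the counts through time (rendered here as a crude text chart; the corpus is made up):

```python
from collections import Counter

# Invented corpus: (year, text) pairs, e.g. one entry per digitised issue.
corpus = [
    (1900, "the engine age begins with the steam engine"),
    (1900, "horses still pull most carriages"),
    (1910, "every factory now runs an engine"),
]

hits_per_year = Counter()
for year, text in corpus:
    hits_per_year[year] += text.lower().split().count("engine")

for year in sorted(hits_per_year):
    print(year, "#" * hits_per_year[year])  # a crude textual frequency graph
```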

Data access

Ways of accessing data

  • User interfaces provide access to pre-defined functionalities in a user-friendly manner

  • APIs provide access to pre-defined functionalities for programs

  • Having access to the dataset itself as data gives you control over which tools to use, and how to tie those tools together into a workflow.

  • Data dumps provide access to the raw data. However, they are often very large, and their format may be complex in order to include all facets of the data. Thus, raw dumps can be difficult to process and use (see the streaming sketch after this list).

  • Both user interfaces and APIs may allow subsets of the data to be selected and downloaded as data for further analysis. Often, they allow limiting both the size of the data and its features, making the resulting data much easier to process and handle. However, very often APIs do not expose all facets of the original data, instead trading richness for ease of use by serving the data in a simpler format suited to a particular subset of use cases (the second sketch after this list shows the general shape of such a query).
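
As an illustration, a large dump is usually processed by streaming it record by record rather than loading it whole. The sketch below assumes a hypothetical compressed JSON Lines dump; the file name and field names are invented:

```python
import gzip
import json

# Stream a (hypothetical) compressed JSON Lines dump record by record,
# so that files far larger than memory can still be processed.
with gzip.open("dump.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Keep only the facets you actually need; skip the rest
        if record.get("type") == "advertisement":
            print(record.get("id"), record.get("date"))
```

Fetching a limited subset through an API typically has the general shape below. The endpoint and parameter names are hypothetical; substitute those documented by your actual data provider:

```python
import requests

# Most dataset APIs follow this shape: a base URL, a query, and
# parameters limiting which fields and how many records are returned.
BASE_URL = "https://api.example.org/newspapers/search"  # hypothetical
params = {
    "query": "advertisement",
    "fields": "id,date,title",  # request only the features you need
    "limit": 100,               # keep the downloaded subset small
}

response = requests.get(BASE_URL, params=params)
response.raise_for_status()
records = response.json()
print(f"Downloaded {len(records)} records")
```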

Open data in the digital humanities - the good

Open data in the digital humanities - the bad

Using data for research - Bias, the great bogeyman

Potential datasets and APIs include, for example:

Assignment

  1. Find a dataset that could be of interest for your final project. Post a message in #datasets on Slack, giving:

    1. A link to the dataset

    2. A note on why you selected it

    3. A short description of what types of information the dataset contains, and

    4. The structure, technical format and way of downloading the data

  2. Find a potential source of bias in a dataset someone else picked. Reply to their message on Slack to let them know about it.

  3. If you get a note of bias on your dataset, respond by thinking of ways of overcoming or mitigating it.

Further resources