Final project
To pass the course, you must demonstrate that you can take what you've learned and use it to design and carry out computational work in practice. You are therefore tasked with taking a dataset and processing it to yield an analysis that tackles a question of interest in the humanities or social sciences.
To do this, you will need to navigate between the limits of the data, the methods and the research questions to work out which line of research is possible. This is often an iterative process: you start from something, run up against the limits of either the data or the methodology, and then try to sidestep them. The most important learning goal of this assignment is to gain practical experience of this process by going through it.
Potential datasets/APIs include, for example (but please choose a dataset that is relevant to yourself instead of these):
Tools for processing and analysis include, for example:
Preprocessing: R, Python, pandas, tm, OpenRefine, OpenCV, TensorFlow™ Image Recognition, tuneR, pyAudioAnalysis, ...
Topic modeling: Mallet, topicmodels, LDAvis, gensim, ...
Simulation: NetLogo, ...
Neural networks: som, TensorFlow™, ...
Anomaly detection: AnomalyDetection, ...
In your work, you are expected to follow as best as possible the guidelines for open, reproducible research. Thus, include with your project a document (e.g. a README.md) describing what you've done. Make sure the document answers the following questions:
1. What are your humanities/social science research questions? How do they relate to possible prior and related work?
2. Which data did you use?
3. What did you do to the data, and how can I reproduce it?
4. What does the analysis show, and how does it answer the humanities/social science research question? How do the results relate to possible prior and related work?
5. Critically analyse your data and pipeline for potential bias and problems. What would still need to be done for the analysis to be trustworthy?
Further info: as stated above, the most important learning goal for this assignment is to learn how to navigate between the shoals of data, methods and questions in designing a computational human science research process. Thus, for submissions, I prefer full pipelines that go from raw data to results. To get there, it is okay to cut massive corners as long as you know which corners you cut (and that is what question 5 is for). However, sometimes a full pipeline just isn't possible. Therefore, submissions can also consist of just some steps towards a complete pipeline (e.g. the data cleaning part). If you don't have end results, though, you need to describe very explicitly what your next steps would be to get them (i.e. a plan for future research).
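To make the idea of a full pipeline "from raw data to results" concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a required structure: the two sample sentences, the stopword list and the function names `clean`, `analyse` and `run_pipeline` are all hypothetical, and a real project would read its raw data from files and use a method suited to its research question. The point is that one call reruns every step, and that the corners cut (here, a tiny stopword list and naive tokenisation) are visible and can be named in the documentation.

```python
"""Hypothetical pipeline sketch: raw text -> cleaned tokens -> term counts."""
import re
from collections import Counter

# Stand-in for a raw dataset (assumption); a real project would load files instead.
RAW_DOCUMENTS = [
    "The ghost appeared at midnight.",
    "A ghost, again! The villagers fled.",
]

# Deliberately tiny stopword list: a cut corner that should be documented.
STOPWORDS = {"the", "a", "at"}


def clean(text: str) -> list[str]:
    """Lowercase the text, keep runs of a-z as tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def analyse(documents: list[str]) -> Counter:
    """Count term frequencies over all cleaned documents."""
    counts = Counter()
    for doc in documents:
        counts.update(clean(doc))
    return counts


def run_pipeline() -> list[tuple[str, int]]:
    """Run every step in order so that a single call reproduces the result."""
    return analyse(RAW_DOCUMENTS).most_common(3)


if __name__ == "__main__":
    for term, count in run_pipeline():
        print(f"{term}\t{count}")
```

Keeping each step in its own function also makes a partial submission easier to evaluate, since for example the cleaning step can be checked for reliability on its own.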
To return the assignment following open science best practices, you will need to upload your data, code and results into a GitHub repository, link that repository with Zenodo, and send the resulting Zenodo DOI to Eetu on Slack, along with your student ID number. You probably won't want to include the ID number in the project files themselves, as all of those are public in perpetuity. Remember to also fill in the course feedback form! (University of Helsinki students should use the official version)
Evaluation criteria
Minimum requirement (grade 1/5): Your project must include a humanities/social science research question, and a description of a complete pipeline that moves from a dataset toward that question. In addition, at least some step of the pipeline needs to be fully implemented.
You need to document your pipeline in a way that it can be rerun and its results reproduced. (+1 grade)
You need to include an analysis of the results of your pipeline. If you do not end up with a full pipeline from data to analytical results, then you need to evaluate the reliability of the part of the pipeline that you did develop. (+1 grade)
To get a 4 or a 5, both your analysis and your documentation need to be robust, logical and understandable. This includes:
A clear, logical description of your whole research process that will enable it to be critiqued and reproduced in full - what did you do at each point to the data, and why? Also be sure to include an analysis of points of possible biases and problems in your data and pipeline (+1 grade)
Importantly, a reasoned and thorough discussion of the results from your analysis from the viewpoint of the humanities/social science research questions. If possible, contextualise your analysis with regard to other disciplinary knowledge (+1 grade)
Note that hitting all of these marks will be much easier with a pipeline that yields an analytical result at the end. It is possible to attain them with a partial pipeline too, but without an analytical result, you need to employ indirection and projection to relate your reliability analysis to how its results would affect substantive analysis. Alternatively or in addition, you might need to do a manual substantive analysis of a subset to be able to discuss implications from the viewpoint of humanities/social science scholarship.
As a special consideration: while you naturally hope that your pipeline succeeds, sometimes you find out in the end that the approach you picked just can't be made to work. In this case, what I'm looking for is a "robust report of failure", meaning that you can document that you've spent a reasonable amount of effort trying different options and ways to get the pipeline to work. In addition, you must be able to explain in detail exactly how the results are insufficient and/or problematic for answering your initial research questions.
Importantly, you are free to submit your final assignment as many times as you want, until you obtain the grade that you desire.
Submissions from previous years
To further aid you in your work, here are some previous submissions for inspiration (for most of them, you should actually click the GitHub link on the right to start to make sense of them):
Errors in machine translating Finnish surnames - DOI: 10.5281/zenodo.7469559
Themes in Hungarian folk love songs - DOI: 10.5281/zenodo.44570
Differences and similarities in the depiction of ghosts in two Chinese novels from different eras - DOI: 10.5281/zenodo.7467017
Extracting and visualizing biographical information from an old bank matricle - DOI: 10.5281/zenodo.225890
Analysis of a survey on user involvement in software development - DOI: 10.5281/zenodo.237727
Comparing Language Complexity in Fact-Checked Fake and Real News - DOI: 10.5281/zenodo.4327219
Polite vs casual address form use by Finnish language learners in different situations - DOI: 10.5281/zenodo.218844
Discovering patterns in chalcolithic and early bronze age burials in northeast England - DOI: 10.5281/zenodo.215932
Finnish politicians in pictures - biases in the contents of the Finna portal - DOI: 10.5281/zenodo.4313215
Analysing the poets and themes selected for the book "Three Hundred Tang Poems" - DOI: 10.5281/zenodo.5796611
Analysing the composition of the collection of the Metropolitan Art museum - DOI: 10.5281/zenodo.8076250
Themes discussed in Helsingin Sanomat in 1905 - DOI: 10.5281/zenodo.44572
Topics covered in Finnish proverbs from the 1930s - DOI: 10.5281/zenodo.6365445
Differences in use between the words maahanmuuttaja and pakolainen in Finnish newspapers from 1970 to present - DOI: 10.5281/zenodo.44544
Differences in how frequently Finnish and Swedish newspapers talk about the Romani people - DOI: 10.5281/zenodo.44590
Contrasting Beck's lyrics to blues lyrics - DOI: 10.5281/zenodo.215292
Extracting and analysing recipe information in an old cookbook - DOI: 10.5281/zenodo.216232
Theories of consequence in early English books (1473-1700) - DOI: 10.5281/zenodo.5800084
Comparing the use of polite plural "you" in Mandarin Chinese and Lithuanian - DOI: 10.5281/zenodo.1134294
A thematic analysis of the discussion around Guggenheim on the Suomi24 forum - DOI: 10.5281/zenodo.217719
Sentiment analysis of Twitter discussion related to the Indian biometric identifier system Aadhaar - DOI: 10.5281/zenodo.1134623
Exploring themes in Helsinki tourist brochures 1967-2008 - DOI: 10.5281/zenodo.6045173
Differences in language between texts dealing with altered states of mind and normal fiction - DOI: 10.5281/zenodo.230676
Social relations as expressed in District Court Sessions of Iisalmi parish 1639-1651 - DOI: 10.5281/zenodo.4327155
Analysing the composition of print and audio book versions of the New York Times bestseller lists - DOI: 10.5281/zenodo.5795399
Preliminary analysis of Free Direct Speech in Se tapahtui täällä by Raija Siekkinen - DOI: 10.5281/zenodo.4338190
Exploring ways to compare adaptations of a literary work - DOI: 10.5281/zenodo.1127754
Preliminary analysis comparing different Finnish cabinet strategies against each other - DOI: 10.5281/zenodo.216604
Preliminary analysis of patterns in the holdings of the Finnish National Gallery - DOI: 10.5281/zenodo.218735