In the end, computers are really only good with numbers: they want to count things. Thus, if you start with unstructured data, your first task is to extract structure from it (for example, frequencies of certain words from text, or keyword descriptions of images deduced by a neural network).
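As a minimal sketch of the first kind of extraction, the snippet below turns a piece of text into word frequencies. The sample sentence and the function name word_frequencies are made up for illustration, and the tokenisation (lowercase runs of letters and apostrophes) is a deliberately crude assumption.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, split it into words and count each word."""
    # Crude tokenisation (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample text, only for illustration.
sample = "The cat sat on the mat. The dog sat too."
counts = word_frequencies(sample)
print(counts.most_common(2))  # [('the', 3), ('sat', 2)]
```

The text itself is unchanged; what the computer now has is a table of words and counts, which it can sort, compare and aggregate.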
So how do neural networks manage to extract structure from an "unstructured" image, you might ask? Well, to the computer the image is actually already structured; it just has structure at an extremely low level. To a computer, an image is a grid of numbers, each number describing the colour and brightness of one square in the grid. What a neural network does is take this extremely low-level information and build higher-level features from it by combining information from multiple squares in the grid. For example, if there are multiple bright squares in a row while their other neighbours in the grid are dark, that indicates there may be a line there. Then, once the neural network knows where the lines are, it can perhaps combine certain patterns of lines into circles, certain patterns of circles into eyes, and finally certain patterns of eyes, mouths and noses into a guess that the image depicts a face.
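The "bright squares in a row with dark neighbours" idea can be sketched by hand. Note that this is not a neural network: a real network learns detectors like this from examples rather than having them written out. The 5x5 image, the function names and the threshold below are all assumptions chosen for illustration.

```python
# A tiny 5x5 greyscale "image": 0 is black, 255 is white.
# Made-up data with one bright horizontal line in row 2.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [255, 255, 255, 255, 255],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]


def horizontal_line_score(img, row, col):
    """Brightness of a square and its left/right neighbours, minus the
    average brightness of the three squares above and the three below."""
    cols = (col - 1, col, col + 1)
    on_line = sum(img[row][c] for c in cols)
    above = sum(img[row - 1][c] for c in cols)
    below = sum(img[row + 1][c] for c in cols)
    return on_line - (above + below) / 2


def find_line_rows(img, threshold=300):
    """Return the rows where at least one square scores above the threshold."""
    rows = set()
    # Skip the border so every square has neighbours on all sides.
    for row in range(1, len(img) - 1):
        for col in range(1, len(img[0]) - 1):
            if horizontal_line_score(img, row, col) > threshold:
                rows.add(row)
    return sorted(rows)


print(find_line_rows(image))  # [2]
```

The output is a higher-level feature ("there is a horizontal line in row 2") built purely from low-level numbers, which is the kind of step a network stacks many times over.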
This is also the way in which many of your analysis processes will proceed. You start with some data, and then apply various means to derive more involved and sophisticated features from that data, until you arrive at something which corresponds (near enough) to your object of interest. For example, given a set of digitised newspaper images, you may first apply one tool to detect illustrations, a second one to filter those down to advertisement images, and a third one to extract keyword descriptions from those. You then add a manual step to filter and group those keywords into categories corresponding to your object of interest. If you're interested in gender images in advertising, for instance, you might map "woman" and "girl" to "female" and "man" and "boy" to "male", while grouping a whole slew of other keywords into categories such as "technical", "household" and "outdoors". Only after this will you take the data and use your analytical and statistical tools to count the number of times each gender is associated with each category across source, place and time in order to answer your original question.
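The last two steps of such a pipeline, mapping keywords to categories and then counting associations, can be sketched as follows. Everything here is hypothetical: the keyword lists, the two mapping tables and the records stand in for whatever your earlier tools and your manual categorisation would actually produce.

```python
from collections import Counter

# Hypothetical manual mappings from raw keywords to analytical categories.
GENDER = {"woman": "female", "girl": "female", "man": "male", "boy": "male"}
CATEGORY = {
    "car": "technical",
    "radio": "technical",
    "stove": "household",
    "tent": "outdoors",
}

# Made-up records: (year, keywords extracted from one advertisement image).
ads = [
    (1950, ["woman", "stove"]),
    (1950, ["man", "car"]),
    (1960, ["girl", "radio"]),
    (1960, ["boy", "tent", "car"]),
]


def count_associations(records):
    """Count how often each gender co-occurs with each category, per year."""
    counts = Counter()
    for year, keywords in records:
        # Sets ensure each gender/category counts once per advertisement.
        genders = {GENDER[k] for k in keywords if k in GENDER}
        categories = {CATEGORY[k] for k in keywords if k in CATEGORY}
        for gender in genders:
            for category in categories:
                counts[(year, gender, category)] += 1
    return counts


counts = count_associations(ads)
for key in sorted(counts):
    print(key, counts[key])
```

Only the final loop is "analysis" in the statistical sense; everything before it is the work of deriving countable features from the sources.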
Non-digital / digital without any functionality - both finding relevant sources and analysing them happen through manual work and reading
Search interfaces - aid in finding, but analysis and understanding happens through reading
Analytical interfaces - transform and aggregate the data in new ways to yield insight. For example:
Keyword in context (KWIC) interfaces show search results, but ordered around the word and its particular instances. Data is returned not in its original structure, but transformed to a view centred around the word and its appearances.
Frequency graphs transform the data through aggregation, counting how many times a keyword is found in the data, and projecting these counts through time. Through this presentation, large-scale trends can be seen.
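Both transformations can be sketched in a few lines. The documents below are made up, the function names kwic and frequency_by_year are illustrative, and splitting on whitespace is a simplifying assumption (a real interface would handle punctuation, case and word forms).

```python
from collections import Counter

# Made-up documents: (year, text).
docs = [
    (1900, "the railway opened and the railway company grew"),
    (1900, "a new bridge was built"),
    (1910, "the railway strike ended"),
]


def kwic(records, keyword, width=2):
    """Return one (year, left context, keyword, right context) per hit."""
    lines = []
    for year, text in records:
        words = text.split()
        for i, word in enumerate(words):
            if word == keyword:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                lines.append((year, left, word, right))
    return lines


def frequency_by_year(records, keyword):
    """Aggregate the number of keyword occurrences per year."""
    counts = Counter()
    for year, text in records:
        counts[year] += text.split().count(keyword)
    return dict(sorted(counts.items()))


for year, left, word, right in kwic(docs, "railway"):
    print(f"{year}  {left:>10} [{word}] {right}")

print(frequency_by_year(docs, "railway"))  # {1900: 2, 1910: 1}
```

The KWIC view reorders the data around each occurrence of the word, while the frequency view discards the text entirely and keeps only counts per year, which is what a frequency graph would plot.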
Ways of accessing data
User interfaces provide access to pre-defined functionalities in a user-friendly manner
APIs provide access to pre-defined functionalities for programs
Having access to the dataset as data gives you control over which tools to use, and over how to tie those tools together into a workflow.
Data dumps provide access to raw data. However, they may often be very large, and their format may also be complex in order to include all facets of the data. Thus, raw dumps may be difficult to process and use.
Both user interfaces and APIs may allow subsets of the data to be selected and downloaded as data for further analysis. Often these allow you to limit both the size of the data and its features, making the resulting data much easier to process and handle. However, APIs very often do not contain all facets of the original data, instead trading richness for ease of use in order to serve the data in a simpler format suited to a particular subset of use cases.
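The trade-off between richness and ease of use can be illustrated with a small sketch. The records and field names below are invented and do not correspond to any real API or dump; the point is only that limiting both rows (size) and fields (features) yields something much simpler, at the cost of dropping facets such as the full text.

```python
import csv
import io

# Made-up records standing in for a rich data dump.
dump = [
    {"id": 1, "year": 1900, "title": "Railway opens",
     "text": "full article text", "ocr_confidence": 0.91},
    {"id": 2, "year": 1905, "title": "Bridge built",
     "text": "full article text", "ocr_confidence": 0.78},
    {"id": 3, "year": 1910, "title": "Railway strike",
     "text": "full article text", "ocr_confidence": 0.85},
]


def select_subset(records, fields, keep):
    """Keep only the records passing `keep`, and only the listed fields."""
    return [{f: r[f] for f in fields} for r in records if keep(r)]


fields = ["id", "year", "title"]
subset = select_subset(dump, fields, lambda r: r["year"] < 1910)

# Write the subset as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(subset)
print(buffer.getvalue())
```

The resulting table is easy to open in a spreadsheet or statistics package, but anything that depended on the dropped fields (here, the text and the OCR confidence) can no longer be analysed from it.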
Academic libraries have a long tradition of collaborating with library service companies (primarily EBSCO Information Services, ProQuest LLC and Gale Cengage Learning) to produce services
But this is also part of a wider culture within the humanities, e.g. Electronic Enlightenment
Potential datasets/APIs are for example:
Data Organization in Spreadsheets for Social Scientists (not really anything social science specific) / Tidy data for librarians (nothing library specific either)
Big? Smart? Clean? Messy? Data in the Humanities (simple introduction to different kinds of data in the digital humanities)
Biases caused by using publicly available Twitter APIs (search/streaming) for sampling: