Data processing: regular expressions
Regular expressions are not programming per se, but instead a versatile grammar for specifying text matching patterns. As such, they are useful for finding complex things in unstructured text, be that for close reading or for extracting those things into structured datasets.
For example, suppose you have the following text, which comes from an automatic transcription of a Swedish-language bank employee matricle:
Aaltio (född Hinderson), Ellen, född i Björneborg 25/B 1887, genomgått 6 klasser i Björneborgs, svenska samskola och en bokföringskurs 1899. -- Kassör vid Wasa-Aktie-Banks filialkontor i Björneborg från % 1904, i 4 års tid under vintermånaderna tjänstgjort vid Björneborgs Sparbank.
Adlercreutz, Herman, född i Kyrkslätt 28/io 1864, stu-. dent 2% 1884, allmän rätts-ex. 81/6 1889, auskultant i Åbo h ofrätt 6/6 s. å. - Kanslist vid Finlands Banks hufvud-kontor sedan 1893. - Vicehäradshöfding 2%2 1892, förste stadsfogde i Helsingfors sedan 1899. -- Deltagit i ridder-skåpet och adelns förhandlingar vid landtdagarne 1897, .1899, 1900, 1904--05, (suppleant i lagutskottet) och-1905--06 (suppleant i fusteringsutskottet).
Aejmelaeus, Otto, född i Paldamo 9/x 1850, genomgått 6 klasser i elementarläroverket i Uleåborg. - Direktör för Nordiska Aktiebankens 'för Handel och Industri filialkontor i Jyväskylä från 1896. - Konditionerat hos C. E. Carlström i Kristinestad, Aug. Eklöf i Borgå och i Paul Wahl & C:os trävaruaffär i Jyväskylä. -- Varit ordförande i fattigvårdsnämden och i drätselkammaren i Jyväskylä.
Now, suppose that you'd like to analyse the general profile of these bank employees. For that, you'd like to extract from this text a structured dataset of people, birth places and all attached dates. How would you go about it?
Looking at the text, there is considerable regularity in how the information is presented. To allow you to better see it, here is the text with all last names italicised, first names rendered in a different font, birthplaces bolded and years both italicised and bolded:
Aaltio (född Hinderson), Ellen, född i Björneborg 25/B 1887, genomgått 6 klasser i Björneborgs, svenska samskola och en bokföringskurs 1899. -- Kassör vid Wasa-Aktie-Banks filialkontor i Björneborg från % 1904, i 4 års tid under vintermånaderna tjänstgjort vid Björneborgs Sparbank.
Adlercreutz, Herman, född i Kyrkslätt 28/io 1864, stu-. dent 2% 1884, allmän rätts-ex. 81/6 1889, auskultant i Åbo h ofrätt 6/6 s. å. - Kanslist vid Finlands Banks hufvud-kontor sedan 1893. - Vicehäradshöfding 2%2 1892, förste stadsfogde i Helsingfors sedan 1899. -- Deltagit i ridder-skåpet och adelns förhandlingar vid landtdagarne 1897, .1899, 1900, 1904--05, (suppleant i lagutskottet) och-1905--06 (suppleant i fusteringsutskottet).
Aejmelaeus, Otto, född i Paldamo 9/x 1850, genomgått 6 klasser i elementarläroverket i Uleåborg. - Direktör för Nordiska Aktiebankens 'för Handel och Industri filialkontor i Jyväskylä från 1896. - Konditionerat hos C. E. Carlström i Kristinestad, Aug. Eklöf i Borgå och i Paul Wahl & C:os trävaruaffär i Jyväskylä. -- Varit ordförande i fattigvårdsnämden och i drätselkammaren i Jyväskylä.
Looking at the patterns, you could derive for example the following rules to extract the elements:
Year: four numbers anywhere
Last name: all letters before a space or a comma at the start of the line
First name: consecutive letters following the first comma on a line
Birthplace: consecutive letters following “född i”
Removing all punctuation, multiple spaces etc., and replacing them with a single space: s/\W+/ /g
Finding full names (some basic named-entity recognition or NER): \p{Lu}\p{L}* \p{Lu}\p{L}*
Matching different ways of spelling the word cannot in a varied historical corpus: [kc]an.?no.?t.
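The first and the last of these examples can be tried out with Python's built-in re module. The following is only a minimal sketch: the sample strings are made up for illustration, and it assumes that the full stop at the end of the spelling pattern is sentence punctuation rather than part of the pattern. The \p{Lu} syntax of the name example is not understood by re, so that one is left out here.

```python
import re

# Python equivalent of s/\W+/ /g: replace every run of
# non-word characters with a single space.
def normalise(text):
    return re.sub(r"\W+", " ", text)

# Assumed reading of the spelling pattern (without a trailing full stop).
CANNOT = re.compile(r"[kc]an.?no.?t")

sample = "stu-. dent 2% 1884, allmän"
print(normalise(sample))

# Made-up sentence containing several spellings of the same word.
variants = "I cannot, he kannot, they can not, we can-not"
print(CANNOT.findall(variants))
```

Note that in Python 3, \W treats accented letters such as ä as word characters, so they survive the clean-up.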
Regular expressions allow you to transform such rules into a form the computer is able to process automatically. For example, the regular expression for matching four numbers anywhere can be written by spelling out a digit character class four times in a row. However, because the regular expression grammar contains functionality for describing repeats, it can also be written as a single digit class followed by a repeat count. Further, because numbers are such a common class of things one might want to match, they too have their own shorthand symbol. Thus, the shortest equivalent regular expression for matching four consecutive numbers combines that shorthand with the repeat count.
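As a rough illustration of the three formulations described above, here is one way they might look in Python. The concrete patterns are assumptions written to fit the description, not necessarily the exact ones the author had in mind. The re.ASCII flag is passed so that \d means exactly the digits 0-9, as the bracketed classes do.

```python
import re

text = "född i Kyrkslätt 28/io 1864, stu-. dent 2% 1884, allmän rätts-ex. 81/6 1889"

# Three assumed spellings of "four consecutive digits".
SPELLED_OUT = r"[0-9][0-9][0-9][0-9]"  # the digit class written four times
WITH_REPEAT = r"[0-9]{4}"              # the digit class with a repeat count
SHORTHAND = r"\d{4}"                   # the digit shorthand with a repeat count

results = [
    re.findall(pattern, text, flags=re.ASCII)
    for pattern in (SPELLED_OUT, WITH_REPEAT, SHORTHAND)
]

for pattern, found in zip((SPELLED_OUT, WITH_REPEAT, SHORTHAND), results):
    print(pattern, found)
```

All three patterns should find the same years in this snippet, which is the sense in which they are equivalent.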
In practice, crafting regular expressions is an iterative process, where you experiment with different formulations of a pattern to end up with one that matches what you want, but doesn't match anything else. For example, the current formulation does not discover the year 1905 from the string 1904--05. If that year was interesting information, a second pattern might need to be added. On the other hand, if you discovered that there are other four-digit numbers in the text that aren't years, maybe you'd think of countering that by restricting matches only to numbers that start with either 18 or 19.
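Both refinements can be sketched in Python as follows. The two patterns are assumptions chosen for illustration, the trailing "konto 4521" is a made-up non-year added to the sample, and the range handling naively assumes that both years of a range fall in the same century.

```python
import re

# Snippet from the matricle, plus a made-up four-digit non-year at the end.
text = ("vid landtdagarne 1897, .1899, 1900, 1904--05, "
        "(suppleant i lagutskottet) och-1905--06, konto 4521")

# Assumed pattern: only four-digit numbers starting with 18 or 19.
YEAR = re.compile(r"1[89]\d{2}")

# Assumed second pattern: a year, two hyphens, and a two-digit end year.
# Group 1 is the century, group 2 the abbreviated end year.
YEAR_RANGE = re.compile(r"(1[89])\d{2}--(\d{2})")

years = YEAR.findall(text)
range_ends = [m.group(1) + m.group(2) for m in YEAR_RANGE.finditer(text)]

print(years)
print(range_ends)
```

The first pattern skips 4521, and the second recovers the end years that the first one cannot see.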
Go through
(everything before a space at the start of the line)
(consecutive word characters at the start of the line [but see below])
(One uppercase character followed by consecutive upper or lower case characters. Here, the \p{Lu} and \p{L} come from the Unicode character classes. This is important, because historically, regular expressions were very Anglocentric. In many regular expression engines, \w therefore only matches the unaccented letters a-z (along with digits and the underscore), and not for example ümläüts, áccênts, or symbols from completely different sets such as Hangul or Kanji. Using the Unicode character classes fixes all this.)
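The differences between these last-name formulations can be explored in Python. The patterns below are assumptions reconstructed from the descriptions, and the name "Ångström, Anders" is a made-up example. Python's built-in re module does not understand \p{Lu} or \p{L} (the third-party regex package does), so the sketch uses [^\W\d_] as a stand-in for "any letter". In Python 3, \w is Unicode-aware by default for text strings; the re.ASCII flag reproduces the older, Anglocentric behaviour.

```python
import re

lines = [
    "Aaltio (född Hinderson), Ellen, född i Björneborg 25/B 1887",
    "Adlercreutz, Herman, född i Kyrkslätt 28/io 1864",
    "Aejmelaeus, Otto, född i Paldamo 9/x 1850",
]

# re.match only matches at the start of the string, i.e. the start of the line.
before_space = [re.match(r"[^ ]*", line).group() for line in lines]
word_chars = [re.match(r"\w+", line).group() for line in lines]
# Stand-in for a Unicode letter class: word characters minus digits and "_".
letters_only = [re.match(r"[^\W\d_]+", line).group() for line in lines]

print(before_space)  # note the trailing commas picked up here
print(word_chars)
print(letters_only)

# Made-up name showing the effect of ASCII-only matching.
unicode_match = re.match(r"\w+", "Ångström, Anders")
ascii_match = re.match(r"\w+", "Ångström, Anders", flags=re.ASCII)
print(unicode_match.group(), ascii_match)
```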
It is often easier to match more than you actually want and then extract the part you need from that match, instead of trying to craft a pattern capturing just the element (which often anyway requires fooling around with non-capturing constructs).
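One common way of matching more than you need and then pulling out the interesting parts is to use parenthesised groups. The sketch below, with an assumed pattern, matches the whole "född i ..." phrase and then extracts the birthplace, the raw date and the year from it.

```python
import re

entry = "Aaltio (född Hinderson), Ellen, född i Björneborg 25/B 1887, genomgått 6 klasser"

# Assumed pattern: match the whole birth phrase, with groups around
# the birthplace, the (possibly garbled) day/month and the year.
BIRTH = re.compile(r"född i (\w+) (\S+) (\d{4})")

match = BIRTH.search(entry)
print(match.group(0))  # everything the pattern matched
print(match.group(1))  # birthplace only
print(match.group(3))  # birth year only
```

The literal text "född i" anchors the match but never ends up in the extracted values, which keeps the pattern itself simple.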
One of the problems with regular expressions is that when you develop them iteratively, you invariably end up with patterns that are hard to understand (even to yourself later on). Here's for example my best effort in crafting a single regular expression to capture both last as well as first names from the matricle: (which again speaks to the fact that you really shouldn't try to capture too much in a single expression, as the same information could be extracted using multiple much smaller and thus more understandable patterns).
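As a hedged illustration of the "multiple smaller patterns" approach, the following Python sketch extracts last and first names with two short patterns. Both patterns and the helper name extract_names are assumptions made for this example, derived from the rules listed earlier, and are not the author's single expression.

```python
import re

# Assumed small patterns, one per element.
LAST_NAME = re.compile(r"^(\w+)")           # letters at the start of the line
FIRST_NAME = re.compile(r"^[^,]*,\s*(\w+)")  # letters after the first comma

def extract_names(line):
    """Return (last name, first name); None for any part not found."""
    last = LAST_NAME.search(line)
    first = FIRST_NAME.search(line)
    return (
        last.group(1) if last else None,
        first.group(1) if first else None,
    )

lines = [
    "Aaltio (född Hinderson), Ellen, född i Björneborg 25/B 1887",
    "Adlercreutz, Herman, född i Kyrkslätt 28/io 1864",
    "Aejmelaeus, Otto, född i Paldamo 9/x 1850",
]

names = [extract_names(line) for line in lines]
print(names)
```

Each pattern can be read, tested and fixed on its own, which is the main benefit over one long expression.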
, an environment for experimenting with regular expressions
reference