DHC Weekly: Part of Speech Tagger

May 7, 2019
 
Continuing with the theme of text analysis, this week I want to go hyper-granular and a little old school to talk about a part-of-speech tagging tool, CLAWS. A part-of-speech (POS) tagger is exactly what it sounds like: a tool that takes a piece of text and automatically tags each word in it with its part of speech. POS tagging predates more robust and fancy tools like Voyant; it is one of the earliest and most common forms of corpus annotation and, indeed, of computational humanities, as the field was then called. Computational corpus analysis has been undertaken since at least the 1960s, but the CLAWS POS-tagging software dates its earliest iterations to the 1980s. You can read all about how, grammatically and computationally, CLAWS functions here.

 

As an example text, I turned to our old friend The Adventures of Sherlock Holmes and plugged in the first story, "A Scandal in Bohemia." A (very, very small) section of my output looked like this:

    
0000002 080 To                                          00 PRP     
0000002 090 Sherlock                                    00 NP0     
0000002 100 Holmes                                      00 NP0     
0000002 110 she                                         00 PNP     
0000002 120 is                                          00 VBZ     
0000002 130 always                                      00 AV0     
0000002 140 THE                                         00 AT0     
0000002 150 woman                                       00 NN1        

Here is the first sentence of "Scandal," word by word, tagged by part of speech according to this tagset. So: preposition, proper noun, proper noun, personal pronoun, -s form of the verb "be", adverb, article, singular noun. This, of course, extends all the way through the end of the story.
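
If you want to work with output like this yourself, a minimal Python sketch for pulling out the word and tag columns might look like the following. It assumes, based only on the excerpt above, that every row has five whitespace-separated fields with the word third and the tag last. The names parse_vertical and GLOSSES are my own and not part of CLAWS, and the gloss table covers only the seven tags that appear in the excerpt.

```python
# Sample rows copied from the excerpt above (trailing spaces dropped).
SAMPLE = """\
0000002 080 To                                          00 PRP
0000002 090 Sherlock                                    00 NP0
0000002 100 Holmes                                      00 NP0
0000002 110 she                                         00 PNP
0000002 120 is                                          00 VBZ
0000002 130 always                                      00 AV0
0000002 140 THE                                         00 AT0
0000002 150 woman                                       00 NN1
"""

# Glosses for the tags in the excerpt only, as described in the post.
GLOSSES = {
    "PRP": "preposition",
    "NP0": "proper noun",
    "PNP": "personal pronoun",
    "VBZ": '-s form of the verb "be"',
    "AV0": "adverb",
    "AT0": "article",
    "NN1": "singular noun",
}


def parse_vertical(text):
    """Return (word, tag) pairs from rows shaped like the excerpt.

    Assumption: each row is five whitespace-separated fields. Rows that
    do not fit that shape (blank lines, anything else) are skipped.
    """
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        _ref, _position, word, _flag, tag = fields
        pairs.append((word, tag))
    return pairs


pairs = parse_vertical(SAMPLE)
for word, tag in pairs:
    print(f"{word:<10} {tag}  {GLOSSES.get(tag, 'unknown tag')}")
```

Real output for a whole story may well contain rows that break the five-field assumption, so treat the skip-on-mismatch behaviour as a starting point and check what gets dropped.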

So, what use might one find for this sort of POS tagging? For one thing, once you've tagged all your words it becomes much easier to pull them apart from one another and look at them as groups. One could imagine, for example, a line of inquiry into how active vs. passive the verbs in a text are, or how many proper nouns a text contains as a measure of its specificity of place and person.
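
As a rough illustration of the "look at them as groups" idea, here is a small Python sketch that groups words by tag and works out what share of the words are proper nouns. The (word, tag) pairs are typed in by hand from the sample sentence, and group_by_tag and tag_share are hypothetical helper names of my own. Measuring active vs. passive verbs would take more than a simple tag count, so it is not attempted here.

```python
from collections import defaultdict

# (word, tag) pairs for the sample sentence, entered by hand.
TAGGED = [
    ("To", "PRP"),
    ("Sherlock", "NP0"),
    ("Holmes", "NP0"),
    ("she", "PNP"),
    ("is", "VBZ"),
    ("always", "AV0"),
    ("THE", "AT0"),
    ("woman", "NN1"),
]


def group_by_tag(pairs):
    """Map each tag to the list of words carrying it, in text order."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)


def tag_share(pairs, tag):
    """Fraction of all words that carry the given tag (0.0 if empty)."""
    if not pairs:
        return 0.0
    return sum(1 for _, t in pairs if t == tag) / len(pairs)


groups = group_by_tag(TAGGED)
proper_noun_share = tag_share(TAGGED, "NP0")

print(groups["NP0"])                # ['Sherlock', 'Holmes']
print(f"{proper_noun_share:.0%}")   # 25%
```

On a single sentence the number means very little, of course; the point is that the same two functions would run unchanged over the tagged pairs for a whole story or a whole corpus.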