Hello DH fans!

This week we're leaving mapping behind us and turning to a category of DH tools often used in the classroom: text analysis! I'm going to be taking a look at one of the most popular text analysis tools, Voyant! Voyant is so popular because it's meant to be easy to use right out of the box, with no coding necessary. In practice, I have found this to mean that Voyant is a little idiosyncratic and difficult -- but I'm going to try to break down its basics for you all this week!

Voyant allows you to upload a corpus of text (it comes pre-loaded with the complete works of William Shakespeare and those of Jane Austen) and analyze various things about it, mostly related to word frequency and order. To have something to play with, I downloaded a .txt file of The Adventures of Sherlock Holmes from Project Gutenberg and uploaded it as my corpus. Right away, with no customization added, Voyant gives me a couple of views of my corpus text. First, it plainly gives me some facts and figures about my corpus, namely that it has "108,612 total words and 8,307 unique word forms," a Vocabulary Density of 0.076, and an Average Words Per Sentence of 16.2. The most frequent words in the corpus are said (486); holmes (467); man (291); mr (275); little (269).
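If you're curious where a number like Vocabulary Density comes from, it looks like it is simply unique word forms divided by total words (8,307 divided by 108,612 is about 0.076). Here is a minimal Python sketch of that arithmetic. The function name, the sample sentence, and the way I split text into words are all my own assumptions for illustration; Voyant's tokenizer may well count words differently.

```python
import re
from collections import Counter

def corpus_summary(text):
    # Lowercase the text and pull out word tokens.
    # This tokenizer is an assumption; Voyant's may split words differently.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    unique = len(counts)
    return {
        "total_words": total,
        "unique_word_forms": unique,
        # Vocabulary density here means unique word forms / total words.
        "vocabulary_density": unique / total if total else 0.0,
        "most_frequent": counts.most_common(5),
    }

# A made-up sample sentence, not text from the actual corpus.
sample = "Holmes said little. The man said Holmes knew the man."
summary = corpus_summary(sample)
print(summary)

# The same ratio applied to the figures Voyant reported for my corpus.
reported_density = round(8307 / 108612, 3)
print(reported_density)
```

To try it on a real file, you could read the text with `open("your_file.txt", encoding="utf-8").read()` and pass it to `corpus_summary`.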

On the subject of word frequency, there's also my cirrus:

This is a word cloud of the most frequently repeated terms in your corpus, where a larger size means a higher frequency. My Sherlock Holmes corpus yielded unsurprising key terms, like "Holmes," "Mr.," "man," "said," and "think," as well as words that are fairly common but that, even as a devoted, life-long Holmes fan (my college common app essay was about Sherlock Holmes - please do not make fun of me, I was 16 and really ticked off about BBC Sherlock), I don't know if I would have guessed were so central to Sherlock Holmes. "Door," for example, "room," "window," and "house" all appear in the cirrus, perhaps as an example of the sort of spatialization of plot in detective fiction that Peter Brooks (in his book Reading for the Plot) lays out in a reading of "The Musgrave Ritual". Body parts are another category that crops up here - "hand," "hands," "face," "head" - which suggests that the stories themselves pay the same sort of attention to the markers of trade and personality that Holmes himself tracks in order to perform his sleuthing.

Even just from the basic, un-altered word cloud, I have two inferences about The Adventures of Sherlock Holmes that are perhaps worth pursuing! But let's take a look at some other Voyant features. There's the trends feature, which in this case is not very useful, since The Adventures of Sherlock Holmes is a set of short stories whose bounds don't line up with Voyant's automatically applied segmentations. Now, if my corpus had consisted of individual files of each story, I could have used the trends feature to track keywords across stories, but it didn't, so that's that on trends.

[Image: line chart with instances of "holmes," "little," "man," "mr," and "said" (the biggest words in the cirrus) plotted against "document segments"]
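To make the "document segments" idea a bit more concrete, here is a rough Python sketch of what a trends-style count might look like: chop the text into equal-sized chunks of words and count a term in each chunk. The function name, the sample text, and the default of 10 segments are my own assumptions, not something taken from Voyant's code.

```python
import re

def trend(text, term, segments=10):
    # Split into lowercase word tokens (an assumed tokenizer, not Voyant's).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Ceiling division so every token lands in some segment;
    # short texts may end up with fewer segments than requested.
    size = max(1, -(-len(tokens) // segments))
    return [tokens[i:i + size].count(term) for i in range(0, len(tokens), size)]

# A made-up sample, not text from the actual corpus.
sample = "Holmes Holmes Watson said the man said Holmes"
counts = trend(sample, "holmes", segments=4)
print(counts)
```

Because the chunks are cut by word count, they pay no attention to where one short story ends and the next begins, which is exactly why the chart is hard to read for a collection like this one.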

There's also a feature called Contexts, whereby you can select a term and see the sentences in which it appears in each instance - something that is far more convenient than having to track down all those "faces" to see, for example, whether it's Holmes or Watson's narration that is marking them.
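This kind of view is often called a keyword-in-context (KWIC) listing, and the basic idea is easy to sketch in Python. Note that my version shows a fixed window of words on either side of the term rather than whole sentences, and the function name, sample text, and window size are all hypothetical choices of mine.

```python
import re

def contexts(text, term, width=5):
    # Keep original casing for display; match the term case-insensitively.
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# A made-up sample, not text from the actual corpus.
sample = "His face was pale. I looked at the face of my friend Holmes."
rows = contexts(sample, "face", width=3)
for left, hit, right in rows:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {hit} | {right}")
```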

There are many more features you can enable on Voyant - you can see them all listed here - but it is perhaps worth noting that most of them are just different ways to graphically represent word frequency and word collocation (aka, which words appear together). 
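For the curious, collocation can be approximated in a few lines of Python by counting which words fall within a small window around a chosen term. This is only a sketch under my own assumptions (the function name, the window of two words, and the sample text are all made up), and Voyant's own collocation tools may define "together" differently.

```python
import re
from collections import Counter

def collocates(text, term, window=2):
    # Lowercase word tokens (an assumed tokenizer, not Voyant's).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            # Words up to `window` positions before and after the term.
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts

# A made-up sample, not text from the actual corpus.
sample = "Holmes opened the door. The door of the room was shut."
near_door = collocates(sample, "door", window=2)
print(near_door.most_common(3))
```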

There is also - theoretically - much customization that can go into which words Voyant counts and which it ignores. Voyant automatically filters for words it sees as too ubiquitous to be worth counting, things like "a" or "the" or "I", and, again, theoretically, it's possible to edit that filter-list and tell Voyant that you do want "me" to be counted, or that you don't want "Mr." to show up in your cirrus because it's...not even a word, so how is it literally the 4th highest occurring word in the whole corpus?! Theoretically, one would do such a thing by clicking on the "options" toggle over the cirrus cloud, and then clicking "Edit List" next to the "Stopwords" category. I say "theoretically" (I have said it a couple of times, perhaps you have noticed), because in many, many trial runs and experiments with Voyant I have never been able to get a word I added to the stopwords list to actually stop showing up in the cirrus cloud. I have seen others online successfully use this feature, so maybe it's just inconsistent, or I'm just doing something wrong, but - something to keep in mind and proceed with caution around if you are interested in using Voyant!
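If the stopwords editor won't cooperate for you either, the underlying idea is simple enough to sketch outside Voyant: drop any word on the list before counting. The tiny stopword list, the function name, and the sample text below are all hypothetical, and this is in no way Voyant's actual filter.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; Voyant's built-in list is much longer.
DEFAULT_STOPWORDS = {"a", "the", "i", "and", "of"}

def top_terms(text, stopwords, n=5):
    # Lowercase word tokens (an assumed tokenizer, not Voyant's).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Drop every token that appears on the stopword list before counting.
    kept = [t for t in tokens if t not in stopwords]
    return Counter(kept).most_common(n)

# A made-up sample, not text from the actual corpus.
sample = "Mr. Holmes and Mr. Watson met the man. Mr. Holmes said little."

before = top_terms(sample, DEFAULT_STOPWORDS)
# Add "mr" to the list so it no longer counts as a term.
after = top_terms(sample, DEFAULT_STOPWORDS | {"mr"})
print(before)
print(after)
```

Note that the token is "mr" rather than "Mr." once the text is lowercased and the punctuation is stripped, so the stopword has to be written the same way to match.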