If you haven’t heard of the Google Books Ngram Viewer, I highly suggest you check it out. As the caption toward the bottom of that page suggests, you are encouraged to run your own experiments using the library of thousands of digitized books that Google has made available for your access.
The process of digitizing text and then searching through an entire collection to find patterns is a new type of science known as Culturomics. Scholars attempt to make strong inferences about human thought, themes in culture, and trends in history by analyzing entire corpora, or bodies of text. This process is discussed further in the article by Jean-Baptiste Michel and his colleagues titled “Quantitative Analysis of Culture Using Millions of Digitized Books,” in which he explains, “Culturomics is the application of high-throughput data collection and analysis to the study of human culture.”
For our fifth mini-project, we were instructed to interact with texts at scale using text-mining and visualization tools such as Google Ngrams, the Voyant text analysis tools, and/or Wordle. The big question here was: how do we formulate queries that properly answer our questions about culture? To begin, I defined my corpus as the complete texts of Alice’s Adventures in Wonderland and its sequel, Through the Looking-Glass (both written by Lewis Carroll). The focus of my research was to see whether these text analysis tools could reveal differences in theme and setting between the two novels. I started by creating a Wordle for each text separately and comparing them. In both books, the name Alice was predominantly larger than any other word, which tells me that she is the central figure of the stories. In Looking Glass, however, the word “queen” was also quite prominent, which leads me to believe the Queen plays a more important role in the sequel.
I tried looking for words that could hint at setting, but these were hard to find. Instead, both books feature many abstract words involving time and sequence (e.g. minute, beginning, last, first, two, and three) and thinking (e.g. thought, know, and think). Many animals and characters are also mentioned in both books (e.g. Humpty, turtle, Tweedledee, Tweedledum, unicorn, Hatter, Knight, Cat, Mouse, Rabbit), which makes me think that Alice interacts with many different characters throughout the stories. This stands in interesting contrast to the fact that the word “one” is fairly large in both Wordles. But “one” has multiple possible meanings, such as a single being or the number, so it is difficult to say exactly what the prominence of this word signifies in these novels.
Additionally, the word “little” stood out to me because it was displayed in medium font on both Wordles. But just like “one,” this word can carry many different connotations. This was one of the drawbacks of analyzing text through Wordle and Voyant Tools (which I used to see word counts for both stories). By transforming stories, which invite readers to be interpretive, into plain text to be analyzed by a computer, which cannot pick up on deeper meaning the way a human brain can, we lose an important component of the story.
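Under the hood, tools like Wordle and Voyant are doing little more than counting how often each word appears after stripping punctuation and common filler words. The following is a minimal sketch of that kind of count; the stopword list and the sample sentence are my own illustration, not taken from either novel or from the tools themselves.

```python
from collections import Counter
import re

# A tiny, illustrative stopword list; real tools use much longer ones.
STOPWORDS = frozenset({"the", "and", "a", "to", "of", "it", "she", "in"})

def word_frequencies(text):
    """Count word occurrences, ignoring case, punctuation, and stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

# Hypothetical sample text standing in for a full novel.
sample = "Alice said the Queen was little, and the Queen said Alice was little too."
freqs = word_frequencies(sample)
print(freqs.most_common(5))
```

Note that this count treats every occurrence of a word identically, which is exactly the limitation described above: “one” the pronoun and “one” the number would fall into the same bucket.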
The problem I ran into shows that this science has flaws. To believe this type of text analysis is foolproof is to believe in parallelism, the assumption that a word means the same thing in every case. Beyond that, you must also ignore other sources of error, such as the misrecognition of characters, letters, and words when texts are digitized.
In a blog post titled “Text: A Massively Addressable Object,” Michael Witmore explains that texts have always been massively addressable at different levels of scale, but the ability to digitally record and query enormous amounts of text is what allows us to perform analyses at a more complicated level. As he puts it, “When we create a digitized population of texts, our modes of address become more and more abstract.”
But regardless of all this, I am still left with a burning question: if the full scope of a text is being misrepresented due to computational errors, how can we draw accurate conclusions about our culture from it?