Some Possibilities for Reading Dystopian Fiction Discursively with Topic Modeling Tools 

 

By Tristan Denton, Dystopian Novel Project

 

           The Dystopian Novel Project, undertaken for English 149, was an attempt to understand and analyze the recent rise and transformation of dystopian fiction over time, from its popular origins in the early twentieth century to its early twenty-first-century iterations as young adult fiction series. The assumption that guided the group's research was that current dystopian fiction has undergone formal and narrative transformations that differentiate it from earlier forms of the genre, and that these transformations correlate with a noticeable increase in the genre's popularity and visibility. The project, then, aimed to understand this shift by analyzing a corpus of dystopian fiction spanning from the genre's foundational texts to contemporary works. To cover this scope, each group member focused mainly on one research approach and a related set of digital humanities tools: infographic, topic modeling, textual analysis, and digital timeline. Because each approach was undertaken essentially independently, this essay focuses mainly on the methods and results of the approach with which this researcher was most involved: topic modeling.

 

Methods

            The process of topic modeling research began with the compilation of a corpus of dystopian literature from which to generate topics. On the basis of the Goodreads infographic "Dystopian Books Again Seize Power," the decision was made to generate separate sets of topics from three individual corpuses of dystopian fiction, each comprised of texts from one of the eras in the infographic: "1930's-1960's," "Second Wave," and "Young Adult." While each era identified in the infographic contains at least four works, only three from each were included, because V for Vendetta, listed in the Second Wave era, is a graphic novel, and thus a problematic, if not nonviable, text for topic modeling. The corpuses comprise Brave New World by Aldous Huxley, Fahrenheit 451 by Ray Bradbury, and 1984 by George Orwell in the 1930's-1960's corpus; Never Let Me Go by Kazuo Ishiguro, The Handmaid's Tale by Margaret Atwood, and The Children of Men by P.D. James in the Second Wave corpus; and Uglies by Scott Westerfeld, The Hunger Games by Suzanne Collins, and Crossed by Ally Condie in the Young Adult corpus. Many of these texts were obtained from webpages that hosted the novels in plain text format. The rest had to be obtained as EPUB documents for e-readers, which were then converted into plain text files with the Zamzar online file converter. The texts were then compiled into three large plain text files, one for each corpus, and prepared for topic modeling by removing extraneous words inserted by the websites from which they were obtained.
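The preparation step can be illustrated with a short script. The following is a minimal sketch, not the project's actual code: the filenames and the list of extraneous strings are hypothetical stand-ins for the site-specific boilerplate that was actually removed.

```python
# Minimal sketch of corpus preparation: clean each novel's plain text
# and concatenate the results into a single corpus file for MALLET.
import re

# Hypothetical examples of boilerplate strings inserted by hosting sites.
EXTRANEOUS = ["xml", "unzipped"]

def clean(text):
    """Strip known extraneous words from a text."""
    for word in EXTRANEOUS:
        text = re.sub(r"\b{}\b".format(re.escape(word)), "", text)
    return text

def build_corpus(input_paths, output_path):
    """Concatenate cleaned plain-text novels into one corpus file."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as f:
                out.write(clean(f.read()))
            out.write("\n")

# Hypothetical filenames for the 1930's-1960's corpus.
build_corpus(["brave_new_world.txt", "fahrenheit_451.txt", "1984.txt"],
             "corpus_1930s_1960s.txt")
```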

The next step was the identification of an ideal topic modeling tool. Favoring simplicity, in-browser topic modeling tools that could be used without installing anything beyond a web browser were initially preferred. To this end, David Mimno's In-Browser Topic Modeling tool and the Overview tool, both available in the Digital Humanities Toychest, were considered and tested. Mimno's tool was deemed insufficient for this project's research because it had no export functionality, meaning the data it produced could not be imported into other programs, an essential part of this topic modeling research. Overview was decided against both because it required that the texts be uploaded to its server, a remarkably slow process that took over an hour when tested, and because the data it produced did not seem useful to the research at hand. David Newman's Java-based GUI for the MALLET topic modeling toolkit was ultimately chosen on the basis of its intuitive interface, simple operation, and its export functionality, which outputs data as both CSV and HTML files.

            Once the topic modeling tool was selected, the three plain text files comprising the three corpuses were imported into the tool in order to, in the vocabulary of Newman's MALLET GUI, "Learn Topics" within them. An immediate problem arose in the data from the first, exploratory round of topic modeling: meaningless non-words or parts of words, as well as extraneous words inserted by online sources and inadvertently passed over when preparing the texts, appeared frequently in the generated topics. On closer inspection, and with more research into topic modeling, many of these seemingly meaningless groups of characters became explainable by the fact that MALLET disregards punctuation. That is, common yet seemingly nonsensical topic words like "ve" could be explained by the fact that MALLET, ignoring apostrophes, had recognized contractions as two separate words. To solve this problem, the default list of stopwords (words excluded from the program's consideration) was expanded: words like "ve" and "t" were added to eliminate the tail ends of contractions while keeping their beginnings in the results, and words like "xml" and "unzipped" were added to eliminate text introduced by the online hosting services from which the novels were obtained.
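As a rough illustration of how the stopword expansion might be scripted, the sketch below merges additions like those described above into a default stoplist. The filenames are hypothetical; MALLET accepts a plain-text stoplist with one word per line.

```python
# Sketch: expand a default stopword list with contraction fragments
# and website-inserted words, writing one word per line for MALLET.
EXTRA_STOPWORDS = ["ve", "t", "ll", "xml", "unzipped"]  # example additions

with open("default_stopwords.txt", encoding="utf-8") as f:  # hypothetical path
    stopwords = set(f.read().split())

stopwords.update(EXTRA_STOPWORDS)

with open("stopwords_expanded.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(stopwords)))
```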

            Some external criteria were applied to topic generation through MALLET in order to produce results that were, in the opinion of those in charge of topic modeling, the most useful for their research. The first criterion was the number of topics. Deciding that the default number generated by Newman's GUI, ten, was too few, the group eventually settled on generating twenty topics for each corpus. This number, it was reasoned, would provide enough data to eliminate topics from consideration as necessary without generating too many topics to consider. The topics were then numbered 1-60, with 1-20 comprising the 1930's-1960's corpus, 21-40 the Second Wave corpus, and 41-60 the Young Adult corpus. The second set of criteria governed the elimination of topics from analysis: incoherence, over-specificity, and lack of connection. Incoherent topics, such as topic 35, which contains the words "back don long make side car wouldn face front feel," were eliminated because no semantic connection could be determined between their words. Over-specific topics, like topic 20, which contains the words "page brien blair chapter war thought telescreen big brother human," were eliminated because they seemed to pertain mostly or entirely to a single work, and thus were not useful for researching an entire corpus. Topics without connection, of which there were several, were eliminated because they did not share a topic word with any other topic; these were eliminated retroactively, after the process of visualization in Gephi.
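Of the three elimination criteria, only the lack-of-connection test is mechanical enough to automate. The sketch below, using invented topic data, shows one way that test could be expressed; the project itself performed this elimination by inspecting the Gephi graphs rather than by script.

```python
# Sketch: drop any topic that shares no word with any other topic
# (the "lack of connection" criterion). Topic data here is invented.
topics = {
    15: ["man", "face", "eyes", "light"],
    16: ["man", "time", "world", "thought"],
    35: ["back", "don", "long", "make", "side", "car"],
}

def connected_topics(topics):
    """Keep only topics that share at least one word with another topic."""
    keep = {}
    for tid, words in topics.items():
        other_words = set()
        for other_id, other in topics.items():
            if other_id != tid:
                other_words.update(other)
        if set(words) & other_words:
            keep[tid] = words
    return keep

print(sorted(connected_topics(topics)))  # [15, 16]; topic 35 is dropped
```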

            After the three sets of topics had been generated and the unsuitable topics eliminated, each set of topic words was then assigned a theme or set of themes. Each theme is one word, with up to three assigned to each numbered topic. These themes were assigned based purely on subjective interpretations of the semantic associations between the words in any given topic.

            Once the topics were generated and assigned thematic interpretations, the next step was to find a method for visualizing the data. As with the initial stages of topic modeling, the researchers looked to tools in the Digital Humanities Toychest and initially favored in-browser tools requiring no installation or expertise. To that end, the ManyEyes and RAW visualization tools were considered and tested by importing data generated by MALLET. These tools ultimately proved too simplistic: they could not represent the topic modeling data with the necessary specificity, or in a way that added anything substantial to the project. What their use did add was the realization of how valuable the topic modeling data was in the form of CSV spreadsheets. These files proved far more useful, at least for visualization, than the HTML files MALLET also generates, because tables translate naturally into graphs. Unable to settle on a visualization tool purely by trying those in the Digital Humanities Toychest, but still aiming to visualize the topic modeling data, the researchers turned to the literature on visualizing topic modeling results. Veronica Poplawski's "Topic Modeling and Gephi: A Work in Progress" provided the impetus to pursue Gephi as the visualization tool. While Gephi required more expertise than the in-browser tools, as well as a certain level of customization, Poplawski's article mitigated these obstacles by laying out the steps she took to represent topic modeling data generated with MALLET in Gephi.
            While Poplawski's article did inspire the use of Gephi to visualize data from MALLET, the method it laid out was not followed in its entirety. Poplawski's method, while complete enough to suggest the use of Gephi, left out several crucial steps in the process of visualizing topic modeling data, which had to be filled in by a "Quick Start" tutorial published by the Gephi Consortium. A major divergence from Poplawski's method was the format in which topic modeling data was imported into Gephi: whereas Poplawski converted her data into XML-based GEXF files, ours was imported simply as CSV spreadsheets.
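To make the divergence concrete, here is a minimal sketch of how topic-word data might be flattened into an edge list that Gephi's spreadsheet import can read. The input structure is a hypothetical simplification; the actual CSV produced by Newman's tool is organized differently.

```python
# Sketch: write topic-word pairs as a Source,Target edge list,
# a CSV layout Gephi's "Import Spreadsheet" feature understands.
import csv

# Invented sample data standing in for MALLET's output.
topics = {
    15: ["man", "face", "eyes", "light"],
    16: ["man", "time", "world", "thought"],
}

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])
    for topic_id, words in topics.items():
        for word in words:
            writer.writerow(["topic_{}".format(topic_id), word])
```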

            The three CSV tables generated from the three corpuses analyzed by MALLET were imported into Gephi and rendered as three separate graphs, while a fourth graph was made from the data of all three combined. Once the CSV files were imported, several parameters were applied to the resulting visualizations, all of which contributed to the usefulness of the graphs as measurements of the connectedness of topics. That is, all of the graphs were set so that words or topics, represented as nodes, with more connections to other nodes would appear larger and follow a green-blue-red color gradient, with green being the least connected, blue more connected, and red the most connected. Further, the graphs representing the topics of a single era were set so that only nodes with at least two connections to other nodes would appear. The graph representing the topics from all three corpuses was set so that only nodes with at least three connections would appear, owing to the incomprehensibility that arose when the same connection parameter as the other three graphs was applied to it. All four graphs are reproduced in the appendix and linked on the project page on the course website.
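These degree-based settings can be approximated outside Gephi. As a sketch only, the following uses the networkx library (not part of the project's actual workflow) to compute node degrees and apply the minimum-connection filter used for the single-era graphs.

```python
# Sketch: mimic the Gephi settings by sizing nodes by degree and
# hiding nodes below a minimum number of connections.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("topic_15", "man"), ("topic_16", "man"),   # a shared word links two topics
    ("topic_15", "face"), ("topic_16", "time"),
])

MIN_DEGREE = 2  # single-era graphs showed nodes with at least two connections
degrees = dict(G.degree())
visible = [n for n, d in degrees.items() if d >= MIN_DEGREE]
sizes = {n: 10 + 5 * degrees[n] for n in visible}  # larger = more connected

print(visible)  # ['topic_15', 'man', 'topic_16']
print(sizes)
```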

 

Results

            Even though topic modeling, as utilized within the scope of this project, and especially in work beyond it, requires a prolonged engagement with the texts under study, the question remains: what use is topic modeling? The short answer, informed by Ted Underwood's assertion that "Topic modeling is a way of extrapolating backward from a collection of documents to infer the discourses (“topics”) that could have generated them," is that topic modeling is useful because it allows texts to be read discursively. This premise, which seems to encapsulate topic modeling's promised contribution to digital humanities research, can be taken alongside the central premise of Joseph Campbell's critical essay "The Treatment for Stirrings: Dystopian Literature for Adolescents": that dystopian literature plays out and interrogates society's discourse networks from an imagined future for its audience. Together, the two sources suggest a hybrid paradigm for analyzing and interpreting the results of this project's visualized topic modeling; namely, that the topic modeling undertaken here points to the discourse networks at play in different eras of dystopian fiction and to how those discourses shift over time.

            By examining a visualization of the topics generated for one era of dystopian fiction, it is possible to at least suggest some of the discourse networks that may be operating in that era. For example, Figure 1 offers a fairly broad overview of the discourses operating in the era represented by the 1930's-1960's corpus. Provided one accepts the interpretations assigned to the topics generated from this corpus, which are available on the project page on the course website, one can infer from this visualization that the discourses shaping this era of dystopian fiction cluster insistently around issues of power, time, perception, and authority. Interestingly, none of the discourses representing the same themes, all nearly equally distributed, connect within this discourse network, except for topics 15 and 16, both discourses of perception, which intersect at the word "man." While similar topics here do not connect with each other, a larger network of discourse links them to other, externally dissimilar topics. What this seems to suggest, then, is that these texts play out thematic discourses in a multivalent way: discourses that are externally similar may be internally dissimilar, with the same issues tackled from a multitude of angles. After all, the words of topics 1 and 13, both dealing with time, are, respectively, "young back repeated pale girls opened end whispered boy isn" and "moment make words side times book great began truth hand." The two topics are undeniably similar semantically, but the discourse each plays out runs along different lines and represents different concerns.

            By comparing the prevalent discourses in each era's corpus, it is possible to conceptualize, in a broad way, how the prevalent discourses of dystopian fiction change over time. By looking only at the points of connection between topics, rather than the topics themselves, in Figure 4, which shows the topics of all three corpuses, one can broadly establish the prevalent concerns of dystopian fiction. Some of the largest connective nodes in this graph are "back," "eyes," "look," "thought," "remember," and "time." From these large points of connection, one can establish that dystopian fiction, across eras, is concerned with issues of time, perception, memory, and thinking. All of these issues are represented as thematic discourses in the topics of nearly every era of dystopian fiction, as shown in Figures 1-3. For example, discourses surrounding issues of time in the 1930's-1960's era are linked to issues of authority, while in the Second Wave and Young Adult eras they are linked to issues of domesticity and the body. This phenomenon seems to point to a shift over time in how these discourses are played out and in which other issues influence or participate in fostering them.

 

Conclusion

            The methods utilized by the Dystopian Novel Project to analyze a corpus of dystopian fiction through topic modeling, and the results they subsequently produced, point to the possibility of using digital humanities tools to read dystopian literature discursively. That is, these methods can illuminate the discourse networks at play within a large historical scope of dystopian fiction. What they fail to do, however, is point toward causality: they can reveal underlying discourses, but they are insufficient to explain why those discourses arise and shift as they do over a span of time. Examining causality requires other approaches, which, fortunately, the Dystopian Novel Project includes. Taken as a whole, topic modeling provides just one component of this project's analysis of dystopian fiction, one that offers insight into discourse networks that other methods cannot, but that is, in and of itself, fundamentally limited.

 

Works Cited

Campbell, Joseph. “The Treatment for Stirrings: Dystopian Literature for Adolescents.” Blast, Corrupt, Dismantle, Erase: Contemporary North American Dystopian Literature. Ed. Brett Josef Grubisic, Gisèle M. Baxter, and Tara Lee. Waterloo: Wilfrid Laurier University Press, 2014. pp. 165-180. Print. 

 

“Dystopian Books Again Seize Power.” Goodreads. Goodreads Inc., 21 March 2012. Web. 11 Dec 2014.

 

The Gephi Consortium. Gephi. 2008-2014. Web. 13 Dec 2014.

 

“Gephi Quick Start.” Slideshare. Mar 2010. Web. 13 Dec 2014.

 

Newman, David. Topic Modeling Tool. 2011. Web. 12 Dec 2014.

 

Poplawski, Veronica. “Topic Modeling and Gephi: A Work in Progress.” Digital Environmental Humanities. Social Sciences and Humanities Research Council of Canada, July 2014. Web. 13 Dec 2014.

 

Underwood, Ted. “Topic Modeling Made Just Simple Enough.” The Stone and the Shell. Web. 20 Nov 2014.

 

Zamzar. 2006-2014. Web. 11 Dec 2014. 

 

Appendix

 


Figure 1: Visualization of 1930’s-1960’s Corpus Topics

 

Figure 2: Visualization of Second Wave Corpus Topics

 

Figure 3: Visualization of Young Adult Corpus Topics

 


 

Figure 4: Visualization of Topics from All Three Corpuses

 

 
