Project Management R+ Text Analysis


As the incubator portion of this digital project comes to an end, I’d like to provide a bit of transparency about how and why I am using the text files.

As a historian and digital humanist, I am always plagued by the lack of sources. For those of us who work in Native American history, it is especially troubling, and even worse for those who examine the lives of 19th-century American Indian women.

Race and gender are very important to me as a historian. One way to get at how both operated in the data sets was to analyze them separately and then together. First, I split the data sets into the following groups: American Indian males, Jewish male settlers, and Jewish female settlers. I am still actively seeking American Indian female transcripts or journals. Analyzing the groups in this way allows me to understand how race and gender each shaped their historical experiences in the Dakotas.

From there, I analyzed all of the male text files and all of the female text files separately. When I am able to locate more female American Indian files, I will be able to complete the American Indian data sets. I will then process them with the same code as the others to look for similarities and differences in the topics by gender.
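Here is a minimal sketch of that workflow, written in Python for illustration rather than in the R code the project actually uses. The same counting function is applied to every group, so any differences come from the texts and not from the processing. The group labels mirror the ones above; the sample sentences and the function name `top_terms` are made up for the example.

```python
from collections import Counter
import re

def top_terms(texts, n=3):
    """Count lowercase word tokens across a list of texts; return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# Hypothetical stand-ins for the real transcripts and journals.
corpus = {
    "american_indian_male": [
        "The river was high that spring.",
        "We crossed the river at dawn.",
    ],
    "jewish_male_settler": ["The claim was filed in spring."],
    "jewish_female_settler": ["The winter was long and the house was cold."],
}

# Run the identical analysis on every group.
results = {group: top_terms(texts) for group, texts in corpus.items()}
for group, terms in results.items():
    print(group, terms)
```

When a new group becomes available, it only needs a new entry in `corpus`; the analysis itself does not change.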


Data Management


Being a digital humanist, and before that, a bit of a web nerd, I had file naming and data management conventions drilled into my head.  At least that is what I thought.  As part of our incubator, we had the opportunity to meet with Assistant Professor and Data Curation Librarian, Jennifer Thoegersen.


Here are just a few of the things that I learned:

  • Keep a local copy and backups: three copies in total, on two different types of media, with one copy stored remotely.
  • When dealing with human subjects, it’s important to think about private information—especially if they are alive.
  • For my project, which uses R, it is important to ensure that the names of the R output files match the text input and that there are no spaces or special characters in the file names (for example, airp730.pdf, airp730.txt, airp730.R).
  • Always keep a README file. It helps others who might not be familiar with my project to understand my choices. For example, I removed non-essential pages from the PDFs before I used OCR software.
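The file naming advice above can be checked automatically. What follows is a rough sketch in Python (not the project's R code) that flags names containing spaces or special characters and reports any base name missing one of its companion files. The function name `check_names`, the set of required extensions, and the idea that every item needs all three files are assumptions made for the example.

```python
import re
from collections import defaultdict
from pathlib import PurePath

# Letters, digits, underscores and hyphens only (an assumed definition of "safe").
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+$")

def check_names(filenames, required=(".pdf", ".txt", ".R")):
    """Return a list of human-readable problems found in a list of file names."""
    problems = []
    extensions_by_stem = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        if not SAFE_NAME.match(path.stem):
            problems.append(f"{name}: spaces or special characters in name")
        extensions_by_stem[path.stem].add(path.suffix)
    # Every base name should appear once per required extension.
    for stem, found in sorted(extensions_by_stem.items()):
        for ext in required:
            if ext not in found:
                problems.append(f"{stem}: missing {ext} file")
    return problems

files = ["airp730.pdf", "airp730.txt", "airp730.R", "airp 731.txt"]
for problem in check_names(files):
    print(problem)
```

Running a check like this before each analysis session catches naming drift early, while it is still cheap to fix.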




Oral History + R

Oral history presents certain challenges that you never think about when you start a digital project. Over the past few weeks I have been struggling to make certain choices about how to handle and then process the text files. For example: Do I change the diction to standard English? Do I include both the questions and the answers for text analysis and topic modeling? And then there are the dreaded speech-to-text issues that seem to plague the process.

I decided that I needed to convert various words (cuz to because; differnt to different, and so on) so that I could standardize the texts. Because I am also topic modeling, I did not want to add any more words to my stop list file when training the modeler. I was also afraid that keeping the words in their original form would alter some of the KWIC (keyword in context) results in the text analysis portion of my project.
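The word conversion can be sketched as a small lookup table applied to whole words only. This is an illustration in Python, not the R code used in the project. The two entries come from the examples above; the function name `standardize` and the decision to preserve an initial capital are assumptions.

```python
import re

# Replacement table; the real list would grow as transcripts are reviewed.
REPLACEMENTS = {"cuz": "because", "differnt": "different"}

# \b anchors make sure only whole words are matched, never fragments of longer words.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b",
    re.IGNORECASE,
)

def standardize(text):
    """Replace nonstandard spellings with standard forms, matching whole words only."""
    def swap(match):
        word = match.group(0)
        replacement = REPLACEMENTS[word.lower()]
        # Preserve an initial capital, e.g. at the start of a sentence.
        return replacement.capitalize() if word[0].isupper() else replacement
    return PATTERN.sub(swap, text)

print(standardize("Cuz it was differnt then, cuz of the drought."))
```

Keeping the table in one place also documents every change made to the original diction, which is useful material for the README.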

In terms of keeping both the questions and answers for data mining, I decided to create two data sets: the original and one with just the interviewee’s answers. I will then compare the two to see how the choice shifts the results.
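Building the answers-only data set could look something like the Python sketch below. It assumes a transcript format that may not match the real files: each turn starts with a speaker label ("Q:" for the interviewer, "A:" for the interviewee), and unlabelled lines continue the previous turn. The labels, the function name `answers_only`, and the sample exchange are all invented for illustration.

```python
def answers_only(transcript, answer_label="A:", question_label="Q:"):
    """Keep only the interviewee's turns from a transcript with labelled speaker turns."""
    kept = []
    in_answer = False
    for line in transcript.splitlines():
        stripped = line.strip()
        if stripped.startswith(question_label):
            in_answer = False
        elif stripped.startswith(answer_label):
            in_answer = True
            kept.append(stripped[len(answer_label):].strip())
        elif in_answer and stripped:
            # Unlabelled line: treat it as a continuation of the current answer.
            kept.append(stripped)
    return "\n".join(kept)

# Made-up exchange, not taken from any real transcript.
sample = """Q: Where did your family settle?
A: Near the river, north of town.
We stayed there ten years.
Q: And after that?
A: We moved to Bismarck."""

print(answers_only(sample))
```

The original file stays untouched, so both versions can be run through the same analysis and compared.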

There is really no easy workaround for converting speech to text. It’s a slow process of training, and even more training, to ensure that the text matches the spoken file. In the past I have had to just transcribe it by hand because the software could not handle a thick New York accent. Wish me luck as I start to train Mac2Speech this Friday!


First project frights


I am worried that this entire academic enterprise is a blunder. In fact, before I posted this blog I shared my fears with Jason Heppler. He also understands the fears of writing a blog, making mistakes, and the pressure of perfection that all academics face—or at least self-imposed pressure. Luckily for me, Jason shared some great pointers for academic bloggers. Over the summer, I will write on a weekly basis about my first digital humanities project.

I was first introduced to text analysis and topic modeling through Cameron Blevins’ work on Martha Ballard’s diary that he completed when he was Matthew Jockers’s student at Stanford. Nearly a year later I had the good fortune of working with Amanda Gailey on a digital project about children’s literature and race as part of my internship for my certificate in Digital Humanities. As I encoded various newspapers from the Carlisle Indian Boarding School, it occurred to me that Native American history is ready for computational analysis. But I did not have the right archival sources at the time. The next fall, I took a Microanalysis class with Matthew Jockers at UNL in hopes that the training would allow me to move forward. It did—tenfold. At the same time, I discovered a series of American Indian oral history transcriptions and journals written by Jewish settlers. Since then I’ve been following the work of Lincoln Mullen and other historians like Kellen Funk. Friends have directed me to Ben Schmidt’s Bookworm project.

Other than a group project for Matthew Jockers’s course, this is my first digital humanities project, which means I am learning through trial and error. The University Libraries, Jockers, and other faculty affiliated with the Center for Digital Research in the Humanities have provided strong mentorship and support for my project. This summer I am part of the second iteration of the Graduate Student Incubator project that works with Liz Lorang, the Digital Projects Librarian for the CDRH. One of the first things we had to do was go through an annotated checklist. It included such things as defining the scope of the project, stating our research question, developing a communication plan, and developing a data management plan, to name a few. The checklist is geared to make us think more clearly about our projects, and I spent nearly eight hours crafting mine in hopes that a lot of upfront work would save me from bigger headaches as I moved through the project. I was wrong, because I had still been thinking about my project like a manuscript. That said, I learned a lot from Liz’s instructive comments. I thought one way that the blog might serve the larger graduate digital humanities community is to actively show the struggles and successes of this project in hopes that those who come across the space will avoid my own “lessons.”

Here are some helpful hints for conceiving your digital project:

  • It is not your dissertation. What I mean is that the way you must think about a digital project is entirely different from the way you think about a dissertation or any other manuscript. For example, dates are important for scope in your written work, but may not be for your digital project.
  • Think about copyright immediately. Consider not only the archival sources you will pull from, but also what license you will use if you provide source code. Standard archival agreements do not always meet the conditions that digital projects create. For example, as Liz rightly pointed out in my first draft: does clearance mean I have the right to use the materials in computational analysis and to publish about them and quote from them in limited form? Or do I also have permission to make my input data available (the transcriptions I create)?
  • When creating a project work plan, allow for some wiggle room in your schedule. Some things will take much longer than you think. Do you really think that you can knock out that code in a week? Don’t we all make mistakes? Pad your schedule to account for human error.
  • Various aspects of your project may require different licenses. The MIT License might work for one part of your project, but might not be frequently used for R or MALLET source code.