Oral History + R

Oral history presents challenges that you never think about when you start a digital project. Over the past few weeks I have been struggling with choices about how to handle and then process the text files. For example: Do I change the diction to standard English? Do I include both the questions and the answers for text analysis and topic modeling? And then there are the dreaded speech-to-text issues that seem to plague the process.

I decided that I needed to convert various words (cuz to because, differnt to different, and so on) so that I could standardize the texts. Because I am also topic modeling, I did not want to add any more words to my stop list file when training the modeler. I was also afraid that keeping the words in their original form would alter some of the KWIC (keyword in context) results in the text analysis portion of my project.
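To make the conversion step concrete, here is a minimal sketch of how a word list like this could be applied. It is written in Python purely for illustration (the same idea carries over to R), and the names `REPLACEMENTS` and `normalize` are hypothetical. Only the two word pairs mentioned above come from my actual list; everything else is an assumption.

```python
import re

# Hypothetical mapping of nonstandard spellings to standard forms.
# Only "cuz" and "differnt" come from the post; extend as needed.
REPLACEMENTS = {
    "cuz": "because",
    "differnt": "different",
}

# One pattern that matches any listed word as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in REPLACEMENTS) + r")\b",
    re.IGNORECASE,
)


def normalize(text: str) -> str:
    """Replace each nonstandard word in text with its standard form."""
    def swap(match: re.Match) -> str:
        original = match.group(0)
        replacement = REPLACEMENTS[original.lower()]
        # Keep a leading capital, e.g. "Cuz" becomes "Because".
        if original[0].isupper():
            return replacement.capitalize()
        return replacement

    return _PATTERN.sub(swap, text)
```

Matching on whole words matters here: without the word boundaries, a replacement for "cuz" would also rewrite the middle of longer words, which would quietly distort the KWIC output.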

As for keeping both the questions and the answers for data mining, I decided to create two data sets: the original, and one with just the interviewee's answers. I will then compare the two to see how the results shift.
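A rough sketch of how the answers-only data set could be produced is below. It assumes each speaker turn in the transcript starts with a label such as "Q:" or "A:", which is a hypothetical convention for this example and may not match how your transcripts are marked up. The function name `answers_only` is likewise made up for illustration.

```python
def answers_only(transcript: str,
                 answer_label: str = "A:",
                 question_label: str = "Q:") -> str:
    """Return only the interviewee's lines from a labeled transcript.

    Assumes (hypothetically) that every turn begins with a speaker label
    and that unlabeled lines continue the previous speaker's turn.
    """
    kept = []
    in_answer = False
    for line in transcript.splitlines():
        stripped = line.strip()
        if stripped.startswith(question_label):
            # A new question ends the current answer.
            in_answer = False
        elif stripped.startswith(answer_label):
            # A new answer starts; drop the label itself.
            in_answer = True
            stripped = stripped[len(answer_label):].strip()
        if in_answer and stripped:
            kept.append(stripped)
    return "\n".join(kept)
```

Running the same text analysis and topic model over the original file and over the output of `answers_only` gives the two sets of results to compare.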

There is really no easy workaround for converting speech to text. It’s a slow process of training, and then even more training, to ensure that the text matches the spoken file. In the past I have had to just transcribe by hand because the software could not handle a thick New York accent. Wish me luck as I start to train Mac2Speech this Friday!
