Oral history presents challenges that you never think about when you start a digital project. Over the past few weeks I have been struggling with choices about how to handle and then process the text files. For example: Do I change the diction to standard English? Do I include both the questions and the answers for text analysis and topic modeling? And then there are the dreaded speech-to-text issues that seem to plague the process.
I decided that I needed to convert various nonstandard words (cuz to because, differnt to different, and so on) so that I could standardize the texts. Because I am also topic modeling, I did not want to add any more words to my stop list file when training the modeler. I was also afraid that keeping the words in their original form would change some of the KWIC (keyword in context) results in the text analysis portion of my project.
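For anyone curious what that standardization could look like in practice, here is a minimal Python sketch. It assumes a small hand-built lookup table; the two entries come from the examples above, while the names `REPLACEMENTS` and `standardize` are made up for illustration and are not part of any particular tool. It only swaps whole words, so a longer word that merely contains "cuz" is left alone.

```python
import re

# Hypothetical lookup table of spoken or misspelled forms and their
# standard spellings. A real project would grow this list by hand.
REPLACEMENTS = {
    "cuz": "because",
    "differnt": "different",
}

# One pattern that matches any listed form as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in REPLACEMENTS) + r")\b",
    re.IGNORECASE,
)


def _swap(match):
    """Return the standard form, keeping an initial capital if there was one."""
    word = match.group(0)
    standard = REPLACEMENTS[word.lower()]
    return standard.capitalize() if word[0].isupper() else standard


def standardize(text):
    """Replace every listed nonstandard form in text with its standard form."""
    return _PATTERN.sub(_swap, text)
```

Running `standardize` over each transcript before topic modeling would keep the variant spellings out of the vocabulary, so they never need to go on the stop list.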
In terms of keeping both the questions and answers for data mining, I decided to create two data sets: the original and one with just the answers of the interviewee. I will then compare the two to see how leaving out the questions shifts the results.
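Building the answers-only data set could be sketched roughly as follows. This assumes, purely for illustration, that each turn in a transcript begins with a speaker label such as "Q:" or "A:" and that unlabeled lines continue the previous turn; actual transcripts may well use different labels, which is why both are parameters. The function name `answers_only` is hypothetical.

```python
def answers_only(transcript, answer_label="A:", question_label="Q:"):
    """Return only the interviewee's lines from a labeled transcript.

    Assumes each turn starts with a speaker label and that unlabeled
    lines belong to whichever turn came before them.
    """
    kept = []
    in_answer = False
    for line in transcript.splitlines():
        stripped = line.strip()
        if stripped.startswith(question_label):
            # A new question starts, so stop collecting.
            in_answer = False
        elif stripped.startswith(answer_label):
            # A new answer starts: drop the label and start collecting.
            in_answer = True
            stripped = stripped[len(answer_label):].strip()
        if in_answer and stripped:
            kept.append(stripped)
    return "\n".join(kept)
```

The original file stays untouched, and the returned text can be saved as the second data set for the comparison.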
There is really no easy workaround for converting speech to text. It’s a slow process of training, and then more training, to ensure that the text matches the spoken file. In the past I have had to transcribe by hand because the software could not handle a thick New York accent. Wish me luck as I start to train Mac2Speech this Friday!