The method behind Botnick, and many other predictive text generators, is Markov Chains. Ben Shraver wrote an excellent Medium post about Markov Chains and how they can be used in these predictive text analyses. Basically, Markov Chains use probabilities to predict future events in sequences. Like any machine learning method, Markov Chains need to be fed training data, which in this case is a text corpus. A Markov Chain model, then, will attempt to predict future events in a sequence based on the known sequence patterns in the training data. In the case of text, an algorithm learns which words commonly follow which other words and uses those pairings to predictively create new text. Botnick calls this a “voice,” which is essentially an algorithm that “writes” in the style of a character, musician, etc.
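To make the idea concrete, here is a minimal sketch of a word-level Markov Chain in base R. It is not the code I used for the project; the three-sentence corpus and the `generate()` helper are made up for illustration, and it only looks at one word at a time to choose the next.

```r
# Toy corpus (hypothetical; not the Bachelor bios).
corpus <- c("she loves dogs", "she loves hiking", "he loves dogs")

# Split each sentence into words and collect (current word, next word) pairs.
pairs <- do.call(rbind, lapply(strsplit(corpus, " ", fixed = TRUE), function(words) {
  cbind(from = head(words, -1), to = tail(words, -1))
}))

# Count how often each word follows each other word,
# then convert each row of counts into probabilities.
counts <- table(pairs[, "from"], pairs[, "to"])
probs  <- prop.table(counts, margin = 1)

# Generate text by repeatedly sampling the next word given the current one.
generate <- function(start, n_words) {
  out <- start
  for (i in seq_len(n_words - 1)) {
    current <- out[length(out)]
    if (!current %in% rownames(probs)) break  # dead end: word never had a successor
    nxt <- sample(colnames(probs), size = 1, prob = probs[current, ])
    out <- c(out, nxt)
  }
  paste(out, collapse = " ")
}

set.seed(1)
generate("she", 3)
```

In this toy corpus "loves" is followed by "dogs" twice and "hiking" once, so the model picks "dogs" about two thirds of the time. With only a handful of sentences the output can do little more than reshuffle the input, which is the same limitation I ran into below.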
So I got to work implementing Markov Chain methods on my corpus of Bachelor biographies, but it didn’t turn out exactly how I expected.
Like I mentioned earlier, these predictive text methods rely heavily on training data. So, if you’re creating an episode of Seinfeld, a blink-182 album, or a new Harry Potter chapter, you’re going to have better results because of the amount of training text available to create that voice. Unfortunately for me, around 90 biographies of training data weren’t quite enough to satisfy my clients. Here’s the best of what my Bot came up with:
"Bunny Shes looking for her heart and hand Tattoos No dogs She has it the Mormon culture but neither one Bachelor with drug addiction Victoria could break any kind of her to talk interact with her biological mother and are you Dinner and cut her lease was the west"
Taking creative co-writing liberties, I came up with something like this.
Bunny is looking for someone to take her heart and hand. She has tattoos, but none of dogs. Her two tattoos ground her in her love of Mormon culture, but neither one depicts the Bachelor with his drug addiction. She will break anyone who stands in the way of her talking to and interacting with her biological mother, Victoria, at dinner. She is so committed to finding love that she ended her lease to move out West.
Like I said, according to my clients, this was not satisfactory. If we were going to produce a full episode of the Bachelor, we were going to need more character depth than these methods could give us.
I was crushed that my very own Bot was unemployed. Still, I didn’t lose sight of our goal and determination to see this through. My clients wanted more creative control and I was going to give it to them.
I ended up going back to the basics, a good ol’ spreadsheet, to see this project through. I asked my friends to contribute possible ideas to personality categories: names, ages, occupations, hometowns, traits, interests, dating history, and sob story. Once we had a sufficient number of ideas, I used the RANDBETWEEN() function to randomly select one of the submissions from each category to create a character for each of us.
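For anyone who would rather stay in R than open a spreadsheet, the same random draw can be sketched in a few lines. The submissions below are hypothetical placeholders (the real spreadsheet entries aren't shown here), and `pick_one()` and `make_character()` are names I made up for this example.

```r
# Hypothetical submissions, one vector per personality category.
ideas <- list(
  name       = c("Bunny", "Victoria", "Kayleigh"),
  occupation = c("Dog walker", "Tattoo artist", "Dental hygienist"),
  hometown   = c("Provo, UT", "Tampa, FL", "Fargo, ND")
)

# Like RANDBETWEEN(1, n): draw a random position, then look up that entry.
pick_one <- function(options) options[sample.int(length(options), 1)]

# Build one character by picking a single entry from every category.
make_character <- function(ideas) lapply(ideas, pick_one)

set.seed(42)
str(make_character(ideas))
```

Calling `make_character(ideas)` once per friend gives everyone their own randomly assembled character.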
This method ended up working really well. We all had so much fun with our characters and developing a story. Though it was really fun to learn more about predictive text methods, I think this project taught me that most of the time human creativity just can’t be beat.
[1] Since I originally completed this project, ABC has actually archived the Bachelor bios I was working with. This is a really good lesson for anyone getting into web scraping. The Web is constantly changing. If you need to keep your data reliably up to date, web scraping is probably not the best tool for your data collection.
Here is the code I used to create the Bachelor bot!
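## Load packages ---------
library(stringr)      # str_replace_all()
library(markovchain)  # markovchainFit(), markovchainSequence()
## Create bot descriptions ---------
bios <- unlist(bios)
## text <- text[nchar(bios)]
# Strip punctuation, then split the text into individual words
text <- str_replace_all(bios, "[[:punct:]]", "")
terms <- unlist(strsplit(text, ' '))
# Fit a Markov chain to the sequence of words
fit <- markovchainFit(data = terms)
# Generate 100 bios of 50 words each
new1 <- NULL
for(i in 1:100){
  new1 <- c(new1,
            paste(markovchainSequence(n=50, markovchain=fit$estimate), collapse=' '))
}
### Create bot input -------------
descriptions <- unlist(descriptions)
## text <- text[nchar(descriptions)]
text <- str_replace_all(descriptions, "[[:punct:]]", "")
terms <- unlist(strsplit(text, ' '))
fit <- markovchainFit(data = terms)
# Generate 50 inputs of 5 words each (note: this overwrites new1 from above)
new1 <- NULL
for(i in 1:50){
  new1 <- c(new1,
            paste(markovchainSequence(n=5, markovchain=fit$estimate), collapse=' '))
}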