FUN WITH RECURRENT NEURAL NETWORKS:

Fake cannabis strain names (and silly words, and Doctor Who and Star Trek titles)

(If you aren't already familiar with recurrent neural networks, why not see Andrej Karpathy's excellent blog?)

These days, thanks to The Wonders of Science[TM], we can train neural networks to imitate different styles of text by showing them some examples. Often the results are gibberish, but occasionally in this gibberish there is a nugget of... less gibberish. There are many fine Python libraries out there for running RNN experiments: I am using textgenrnn, and fine-tuning its stock model on data of my own whimsical fancy. Here is a selection of the most interesting, perplexing, or otherwise notable outputs.
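
If you want to play along at home, getting the stock model up and babbling takes only a few lines. This is a minimal sketch, assuming textgenrnn's usual API; the filename is a stand-in for whatever one-item-per-line training file you have:

    from textgenrnn import textgenrnn

    textgen = textgenrnn()   # load the pretrained stock model
    textgen.generate(5)      # print five lines of its default babble

    # fine-tuning on a file of one-line examples is a single call
    textgen.train_from_file('my_whimsical_data.txt', num_epochs=2)
    textgen.generate(5)      # now the babble should lean towards your data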

In honour of no occasion in particular, and for no special reason, I scraped a list of cannabis strains to feed to my neural network. Amateur genetics enthusiasts give their hybrids some very whimsical names, and I thought these might pair well with my list of silly words to train a neural network to generate some really primo, high-quality nonsense. There were a couple of confounding factors to consider. Firstly, the list of silly words was only ~500 items long, while the cannabis strains numbered over 10,000. If I simply combined the datasets and trained a network on the lot, the cannabis strains would so outnumber the silly words that the network would learn only the features of the strains, and nothing about the silly words. So from the outset I knew I had to employ transfer learning: train a network up on the silly words first to prime it, and then give it a little fine-tuning on the cannabis strains--just a single epoch, given the size of that dataset--to see if it could learn to generate plausible strain names infused with the spirit of codswallop and poppycock. (There's a rough sketch of the whole recipe in code a little further down.)

Secondly, there were some orthographic differences between the lists. The cannabis strains were often capitalized and included a lot of eccentric punctuation; the silly words were all lowercase and contained far fewer digits and special characters. Remember, these networks generate text character by character, conditioning each character on the ones generated so far: so once the network emits a character that occurs in only one of the training sets, it's liable to finish the name in the style of that set alone, since those are the only continuations it has ever seen follow that character. That means the network won't learn how to combine features from both sets, which is what I really want.

This happens a lot in machine learning research: whenever you want to combine data from multiple sources, you will always find that there are some eccentricities of the datasets that make them look very different, even though, morally speaking, they shouldn't be. Presence or absence of capitalization and punctuation in text, for example. Or the average brightness or dimensions of a set of images. Spurious features that distract your machines, for machines are notoriously naive and flighty. To get data into a useful form for machine learning, you must clean and process it using your living wits. In this case, to get the two datasets onto a common character set, I converted everything to lowercase and stripped out all punctuation and special characters except whitespace, before training the network first on the list of silly words and then on the list of cannabis strains.
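
Here is roughly what that looked like, as a sketch rather than gospel: the filenames are placeholders, the epoch counts are illustrative, and I'm assuming textgenrnn's usual train_on_texts/generate API.

    import re
    from textgenrnn import textgenrnn

    def normalise(name):
        # lowercase, then drop everything except letters, digits, and whitespace
        return re.sub(r'[^a-z0-9\s]', '', name.lower()).strip()

    # one item per line in each file (filenames are stand-ins)
    silly_words = [normalise(w) for w in open('silly_words.txt').read().splitlines()]
    strains = [normalise(s) for s in open('strains.txt').read().splitlines()]

    textgen = textgenrnn()                              # start from the pretrained stock model
    textgen.train_on_texts(silly_words, num_epochs=20)  # prime: only ~500 silly words, so many passes
    textgen.train_on_texts(strains, num_epochs=1)       # fine-tune: 10,000+ strains, one pass is plenty
    textgen.generate(10, temperature=0.6)               # harvest the resulting nonsense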

Let's see what we get!

Along the way, since I was training the network on them already, I generated some bonus silly words:

For kicks, I took the strains network and fine-tuned it on the list of Doctor Who episode titles too:

Then I infused the Star Trek titles as well: