FUN WITH RECURRENT NEURAL NETWORKS:

Training a Recurrent Neural Network on Combined Datasets by Transfer Learning

(If you aren't already familiar with recurrent neural networks, why not see Andrej Karpathy's excellent blog?)

These days, thanks to The Wonders of Science[TM], we can train neural networks to imitate different styles of text by showing them some examples. Often the results are gibberish, but occasionally in this gibberish there is a nugget of... less gibberish. There are many fine Python libraries out there for running RNN experiments; I am using textgenrnn and fine-tuning its stock model on data of my own whimsical fancy. Here is a selection of the most interesting, perplexing, or otherwise notable outputs.

Whenever one has two good things, the natural human impulse is to smoosh them together: peanut butter and jelly, cheese and cake, jam and wasp. So too with neural networks! If a neural network can generate text based on television episode titles, and separately based on grandolinquent poppycock and balderdashery, why not train it on both together and hope it smooshes them into one unholy gestalt of awesomeness?

Would that it were so simple.

When a char-rnn generates text, it chooses each character based on the characters that came before it, drawing from the characters that were likely to follow in the training data you showed it. If some characters usually follow others in the training set, that same pattern will often repeat in the trained network. If you train it on a bunch of examples of pattern A and a bunch of examples of pattern B, and then ask the network to generate some text, some of the time it'll generate something like pattern A, and some of the time it'll generate something like pattern B. If the two patterns are different enough, it may never learn to generate examples that blend the two.
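To make that concrete, here's a toy sketch of the idea. This is not textgenrnn's actual machinery (a real char-rnn conditions on a long stretch of preceding characters via a recurrent network); it just illustrates "pick each next character from whatever tended to follow in the training data." The training text is a made-up example:

    import random
    from collections import defaultdict

    # Toy character-level model: for each character, remember which characters
    # followed it in the training text. A real char-rnn looks at much more
    # context than one character, but the sampling idea is the same.
    training_text = "the cage / the naked time / the enemy within / the man trap"

    followers = defaultdict(list)
    for prev_char, next_char in zip(training_text, training_text[1:]):
        followers[prev_char].append(next_char)

    def generate(length=40, seed="t"):
        out = seed
        for _ in range(length):
            candidates = followers.get(out[-1])
            if not candidates:  # no character ever followed this one in training
                break
            out += random.choice(candidates)  # picks in proportion to training frequency
        return out

    print(generate())

Run it a few times and you'll get plausible-ish letter sequences that echo the training text, plus plenty of gibberish, which is roughly the experience of working with these models, writ small.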

But instead of just blindly smooshing our datasets together like rocky and road, we can use the technique of transfer learning. This is a very popular strategy in machine learning, where training data is at a premium. To train a neural network up from scratch can take an enormous amount of training data, which needs to be cleaned, curated, annotated, brushed, and washed. Who has that kind of time? Instead, practitioners perform transfer learning:

Gather just a small amount of data of your own. Then go out and find a neural network that has already been trained on similar data, the more similar the better. Now retrain slash fine-tune this network on your own small dataset. The idea is that the network will already have learned enough useful correspondences and features from the original training data that this learning will transfer over and allow it to adapt to your own data comparatively quickly. This is tremendously popular in computer vision, for example.

It's also already being used, secretly, in these experiments: since I only have 100-500 examples in each of my datasets, I've been starting from the stock neural network in textgenrnn, which has already been trained up on a diverse corpus and thus "knows about" many different linguistic patterns already, and then fine-tuning it on my own data.

Let's try transfer learning here! Let's train the textgenrnn neural network on my list of silly words, and then fine-tune it on various other lists, but only for a few epochs, in the hopes of combining the features of the new and original datasets. This is hard to get right: if you don't train the network on the new data enough, it'll just spit out gibberish, but if you train it too much, it'll forget all about the older dataset altogether and only generate text like the new dataset. In my experiments with textgenrnn and training sets of 100-500 words, somewhere around 5 epochs generally seems to be about right. (To perform an epoch of training is to train a network on each example from your training set exactly once.)
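For the curious, the whole procedure looks roughly like this in textgenrnn. This is a sketch rather than my exact script: the file names and epoch counts below are placeholders standing in for whatever datasets and settings you're experimenting with.

    from textgenrnn import textgenrnn

    # Start from textgenrnn's stock pretrained model, which already "knows about"
    # general English character patterns.
    textgen = textgenrnn()

    # Fine-tune it thoroughly on the silly word list so that becomes our base model.
    textgen.train_from_file('silly_words.txt', num_epochs=20)
    textgen.save('silly_word_weights.hdf5')

    # The transfer learning step: reload the silly word model and nudge it toward
    # the new dataset for only a few epochs, so it doesn't forget the silly words.
    textgen = textgenrnn('silly_word_weights.hdf5')
    textgen.train_from_file('doctor_who_titles.txt', num_epochs=5)

    # Sample and hope for an unholy gestalt of awesomeness.
    textgen.generate(10, temperature=0.5)

The interesting knob is that second num_epochs: too low and the new dataset barely registers, too high and the silly words are forgotten entirely.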

First, some bonus examples of machine-generated silly words inspired by words like "apparatus" and "poltroon":

When I took my silly word network and tuned it on Star Trek episode titles, the poor network mostly just seemed confused. There were a few gems, though they don't necessarily seem very Star Trekky:

(That last is clearly a Bashir/O'Brien story.) Doctor Who titles seemed to work somewhat better. After taking the silly word network and tuning it up on Doctor Who episode titles, many of the examples are almost plausible:

However, others are gratuitously preposterous in their circumlocution: