Using an RNN to generate Bill Wurtz notes

Textgenrnn is fun

2019-10-05 project

Bill Wurtz is an American musician who became reasonably famous through short musical videos posted to Vine and YouTube. I was searching through his website the other day and stumbled upon a page labeled notebook, so I thought I should check it out.

Bill’s notebook is a large (about 580 posts) collection of random thoughts, ideas, and sometimes just collections of words. It is a prime source of entertainment, and of neural network inputs.

“If you are looking to burn something, fire may be just the ticket” - Bill Wurtz

Choosing the right tool for the job

If you haven’t noticed yet, I’m building a neural net to generate notes based on his writing style and content. Anyone who has read my first post will know that I have already done a similar project in the past. This means it’s time to reuse some code!

For this project, I decided to use an amazing library by @minimaxir called textgenrnn. This Python library will handle all of the heavy (and light) work of training an RNN on a text dataset, then generating new text.

Building a dataset

This project was a joke, so I didn’t bother with properly grabbing each post, categorizing them, and parsing them. Instead, I built a little script to pull every HTML file from Bill’s website, and regex out the body. This ended up leaving some artifacts in the output, but I don’t really mind.

import re
import requests


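def loadAllUrls():
    # Fetch the notebook index page and collect the links to each note
    page = requests.get("https://billwurtz.com/notebook.html").text

    # Non-greedy match, so several links on one line are captured separately
    links = re.findall(r"HREF=\"(.*?)\"style", page)

    return links


def dumpEach(urls):
    for url in urls:
        # Flatten each note onto a single line and drop line-break tags
        page = requests.get(f"https://billwurtz.com/{url}").text.strip().replace(
            "</br>", "").replace("<br>", "").replace("\n", " ")

        # Everything after the closing </head> tag is treated as the note body
        data = re.findall(r"</head>(.*)", page)

        # Skip pages where no body was found
        if not data:
            continue

        print(data[0])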


urls = loadAllUrls()
print(f"Loaded {len(urls)} pages")
dumpEach(urls)

This script will print each of Bill’s notes to the console (each on its own line). I used a simple redirect to write this to a file.

python3 scrape.py > posts.txt
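The artifacts mentioned earlier are most likely leftover tags and HTML entities in each line. Here is a minimal sketch of an optional cleanup pass over the scraped file, assuming that is what the leftovers look like. The `clean_line` and `clean_file` helpers and the `posts_clean.txt` output name are hypothetical and not part of the original script.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # any leftover HTML tag
SPACE_RE = re.compile(r"\s+")    # runs of whitespace


def clean_line(line):
    """Strip tags, decode entities, and collapse whitespace in one scraped note."""
    text = TAG_RE.sub(" ", line)
    text = html.unescape(text)
    return SPACE_RE.sub(" ", text).strip()


def clean_file(src="posts.txt", dst="posts_clean.txt"):
    """Write a cleaned copy of src to dst, skipping lines that end up empty."""
    with open(src, encoding="utf-8") as infile, \
            open(dst, "w", encoding="utf-8") as outfile:
        for line in infile:
            cleaned = clean_line(line)
            if cleaned:
                outfile.write(cleaned + "\n")
```

Whether this step is worth it is a judgment call: for a joke project, a few stray tags in the training data may just add to the charm.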

Training

To train the RNN, I just used some of textgenrnn’s example code to read the posts file, and build an HDF5 file to store the RNN’s trained weights.

from textgenrnn import textgenrnn

generator = textgenrnn()
generator.train_from_file("/path/to/posts.txt", num_epochs=100)

This takes quite a while to run, so I offloaded it to a Droplet, and left it running overnight.
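Once training finishes, the saved weights can be loaded again to generate notes. This is a rough sketch, assuming textgenrnn wrote its weights to its default `textgenrnn_weights.hdf5` file in the working directory. The `generate_notes` helper, the note count, and the temperature value are illustrative choices, not taken from the original project.

```python
def generate_notes(weights_path="textgenrnn_weights.hdf5", count=5, temperature=0.5):
    """Load trained weights and return a list of generated notes."""
    # Imported lazily so this module can be loaded without TensorFlow installed.
    from textgenrnn import textgenrnn

    # Assumption: weights_path points at the HDF5 file produced by training.
    generator = textgenrnn(weights_path=weights_path)

    # Lower temperatures give safer, more repetitive text; higher ones get weirder.
    return generator.generate(n=count, temperature=temperature, return_as_list=True)
```

Calling `generate_notes(count=3)` should return three generated strings, which can then be printed or picked through by hand.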

The results

Here are some of my favorite generated notes:

“note: do not feel better”

“hi I am a car.”

“i am stuff and think about this before . this is it, the pond. how do they make me feel better?”

“i am still about the floor”

Not perfect, but it is readable English, so I call it a win!

Play with the code

I have uploaded the basic code, the scraped posts, and a partial HDF5 file to GitHub for anyone to play with. Maybe make a Twitter bot out of this?





Made with ♥ by Evan Pratten