Working with texts part 3 - word chains

# COMMENTS

At this point, we've done a fair amount of playing with text so it's time for a fun little project. We're going to generate some text "in the style" of a source text. The technique we're going to use is usually called a Markov Chain text generator. Basically a model where the next state or word is based entirely on the current state. I don't dwell on the math under the hood but in case you're interested, here are a few links: Wikipedia, Explained Visually, UC Davis Math.

You can motivate this lesson in a number of ways. One is to start by showing some generated text - for this, a flowery, poetic text like the Book of Psalms works well. You can generate text a bunch of times until you get a good result.

Here's the start of the Book of Psalms:

Blessed is the man that walketh not in the counsel of the
ungodly, nor standeth in the way of sinners, nor sitteth in
the seat of the scornful.

and here's some generated text:

Blessed is the people that delight in war. Princes shall come out of
Zion,  the perfection of beauty god hath shined our god shall come and
shall declare thy mighty acts. I will speak of the glory of the lord
surely i will not lie unto David.

I added the punctuation as our generator strips it all and converts the text to lower case.

The generated text looks reasonable in some ways but is clearly wrong in others.

This leads to the conversation.

If we look at the original text, it starts with "blessed" but if we look at our bag of words we also can see that "blessed" appears 49 times. What's interesting for generating text are the words the follow "blessed" and the number of times they follow it. Here they are (sorted for convenience):

Word How many times it follows the word "blessed"
are 5
art 1
be 16
blessed 1
depart 1
for 1
his 1
in 1
is 14
of 2
that 1
thee 1
thy 1
upon 1
wealth 1
you 1

So we'll have a lot of things like "blessed be the" "blessed be he" etc. but only one occurrence of "blessed upon...." To generate text, we can build a dictionary where the keys are the words and the values are all the words that follow the key. We can build this with a variation on our bag of words builder:

  
  words_raw = "a cleaned string with our text"
  words = words.raw.split()
  wordlists = {}
  for i in range(len(words)-1):
      current_word = words[i]
      next_word = words[i+1]
      wordlists.setdefault(current_word,[])
      wordlists.[current_word].append(next_word)
  

Note that we need a counting loop rather than just iterating over the list since we need both the current item and the one after it in each iteration.

The (partial) value for the wordlists for the Book of Psalms would be:

  
  wordlists = {...
               'blessed' : ['are', 'are', 'are', 'are', 'are', 'art', 'be', ...],
               'wealth' : ['and','and','by','to'],
               ...
  }
  
    At this point, generating text is easy.
  1. Start with a word
  2. Add the word to yoru story.
  3. Look it up in your wordlists dictionary.
  4. select a random word from that words dictionary entry (it's list).
  5. That becomes the word for the next iteration
  6. Repeat until you have enough words.

This won't add any punctuation and it will crash when it gets to the last word in the original text (as it can't find any "next words" but basically it will work. Here's the code:

  
  # assume wordlists is the dictionary built as specified above
  # and that we're starting with the word "blessed"
  current_word = "blessed"
  storylist = []
  for i in range(storylength):
      story.append(current_word)
      current_word = random.choice(wordlists[current_word])
  story = " ".join(storylist)
  

The random.choice() randomly selects a word from a list. This is perfect for us since words that follow our key more frequently will have more entries in the list and thus appear more often. I decided to build a list of words and then use the " ".join() to turn it all into a space separated string.

This is pretty fun but the students will note that it really doesn't work that well. This can lead to increasing our sample. Instead of using a single word as our key and having chain links based on two words (key, random choice from value), we can use three word links. Use a two word tuple for the key and the same list of words as the value.

For example, a partial dictionary based on this idea from Macbeth Act 1 Scene 1 could be:

  
  # source -> "when shall we three meet again"
  tuple_dict = { ("when","shall") : ["we"],
                 ("shall","we") : ["three"],
                 ("we","three") : ["meet"],
                 ("tree","meet") : ["again"]}

  

This takes a bit more work to build and also a bit more work to generate text but it's eminently doable and the result is better. What about three word keys? Four? These aren't hard to write and ambitious students can write a generic dicitonary builder and story generator routine so as to be able to use any key lenght.

Students will find that the longer the key, the better the resultant story. They'll also notice that at some point, all you'll ever generate is the original back again. This is a great time to talk about over-training.

This can be a fun unit and / or assignment. I used source materials that I like but any text will work.