Matthew Denner

Something something witty comment something

Rick Rolling Markov Chains

I had this whacky idea: could I build a Clojure program that could generate arbitrary Rick Rolling text? Turns out it wasn't as hard as I expected! Ok, so the text it generates isn't brilliant, but the code is incredibly simple.

The idea

First, understand that a simple way to generate sensible yet random text is to use a Markov Chain: a graph where each node is a word and each directed edge between two nodes determines the probability of the words flowing in that order. To generate text simply pick a node and start randomly walking, based on the probability from the edges, until you have all the text you want. It doesn't generate great text, and is the plague of Twitter, but it does get you something that may make sense!

Second, if we're going to generate Rick Rolling text we know that our path through the Markov Chain will have to pass through particular words at particular points. If we pretend that our text is broken into lines of a particular length, let's say L, then we know that every L words we will have to pass through one of the words in the lyrics of Never Gonna Give You Up.

The Markov Chain

So let's start simple by writing the code that will generate our Markov Chain:

(defn build-markov-chain
  [words]
  (-> words
      words->edges
      edges->counted-edges
      counted-edges->markov-map))

Here the code takes in a sequence of words, words, and then passes them through several functions:

(defn- words->edges
  [words]
  (map vector words (rest words)))

The words->edges function takes a sequence of words and returns a sequence of edges: each edge is a pair from one word to the word that follows it in the text.

(defn- edges->counted-edges
  [edges]
  (let [accumulate (fn [memo edge] (merge-with + memo {edge 1}))]
    (map flatten (reduce accumulate {} edges))))

The edge->counted-edge simply counts the number of occurrences of each edge, returning a sequence of triples. Effectively this is generating the properly weighted edges of our Markov Chain.

(defn- counted-edges->markov-map
  [counted-edges]
  (let [merge-edge (fn [memo [word following occurrences]] (merge-with merge memo {word {following occurrences}}))]
    (reduce merge-edge {} counted-edges)))

The final function, counted-edge->markov-map, takes our weighted edges and builds the final Markov Chain. Here the returned map goes from a word to another map which is the occurrences of each possible following word.

It's easier to see it in practice (formatted for better readability):

user=> (build-markov-chain (split "this is some simple text is it" #" "))
{"this"   {"is" 1},
 "some"   {"simple" 1},
 "simple" {"text" 1},
 "text"   {"is" 1},
 "is"     {"some" 1, "it" 1}}

If you read the text you can see that is appears twice and is followed by some and it.

This isn't quite a Markov Chain as it has counts on the edges, rather than probabilities, but the code that comes will determine the probabilities at each visit to a node in the graph. The reason for this will be apparent later.

Rick-Rolling

Right, so we know we need to have the lyrics of Never Gonna Give You Up in order in our text, and we also know that gonna is going to be a PITA to actually find in any other text, so we'll have a synonym of going to:

(def ^{:private true}
  rick-roll-lyrics
  "never gonna give you up
   never gonna let you down
   never gonna run around and desert you
   never gonna make you cry
   never gonna say goodbye
   never gonna tell a lie and hurt you")

(def ^{:private true}
  synonyms
  {"gonna" (split "going to" #"\s+")})

(defn- replace-synonyms
  [words]
  (flatten (map #(get synonyms %1 %1) words)))

(def ^{:private true}
  rick-roll-sequence
  (flatten (replace-synonyms (split rick-roll-lyrics #"\s+"))))

We also know that any Markov Chain must contain all of the lyrics otherwise there is no point in even trying to generate the text:

(defn- contains-all-lyrics?
  [markov-map]
  (let [words-in-markov-map (set (keys markov-map))
        words-in-lyrics     (set rick-roll-sequence)]
    (subset? words-in-lyrics words-in-markov-map)))

Now we have to generate a path through the Markov Chain that passes through particular words. We can make a sequence that represents this path but with blanks that need filling in, and we already know that we'll have a number of these blanks on each line.

(defn- generate-path
  [length-between-lyrics]
  (let [blanks (repeat length-between-lyrics :pick-a-word)]
    (flatten (interleave rick-roll-sequence (repeat blanks)))))

Here the code marks the blanks with :pick-a-word. To fill in these blanks we're going to have to pick a word based on the previous word, which means that we'll be using the occurrence map from the Markov Chain for that previous word. But if we pick a word, and it later doesn't work out, we'll need to take out the word out of the occurrence map, at least for this point in the walk. So two functions:

(defn- pick-word
  [occurrences]
  (when-not (empty? occurrences)
    (let [weighted-words (map (fn [[k v]] (repeat v k)) occurrences)
          pick-list      (flatten weighted-words)
          index-to-pick  (rand-int (count pick-list))]
      (nth pick-list index-to-pick))))

(def ^{:private true}
  remove-word
  dissoc)

pick-word takes an occurrence map and generates a sequence where each word is repeated. It then uses rand-int to pick an index in that sequence: effectively this combination gives us the probability of picking a word.

The penultimate piece of the puzzle is taking some text and Rick Rolling from it:

(defn rick-roll
  [words length]
  (let [markov-chain                (markov/build-markov-chain words)
        [first-word & path-to-walk] (generate-path length)
        where-to-begin              (seq [first-word])]
    (when-let [rick-roll-text (walk-the-path markov-chain path-to-walk where-to-begin)]
      (reverse rick-roll-text))))

This should be fairly obvious: we build a Markov Chain, generate a path to walk, and start at the first word of that path. The only bit to fill in is walk-the-path, which will walk through the Markov Chain to build the Rick Rolling text:

(defn- try-various-steps-in-path
  [markov-map occurrences-from-previous-word path-to-walk rick-roll-text-so-far]
  (loop [occurrences occurrences-from-previous-word]
    (when-let [suggested-word (pick-word occurrences)]
      (let [rick-roll-text (walk-the-path markov-map path-to-walk (cons suggested-word rick-roll-text-so-far))]
        (if-not (nil? rick-roll-text)
          rick-roll-text
          (recur (remove-word occurrences suggested-word)))))))

(defn- walk-the-path
  [markov-map [current-word & path-to-walk] [previous-word & rest-of-rick-roll :as rick-roll-text-so-far]]
  (when (contains-all-lyrics? markov-map)
    (if (nil? current-word)
      (cons previous-word rest-of-rick-roll)
      (let [occurrences-from-previous-word (get markov-map previous-word)]
        (if (= :pick-a-word current-word)
          (try-various-steps-in-path markov-map occurrences-from-previous-word path-to-walk rick-roll-text-so-far)
          (when-let [occurrence-of-current-word (get occurrences-from-previous-word current-word)]
            (walk-the-path markov-map path-to-walk (cons current-word rick-roll-text-so-far))))))))

Notice that try-various-steps-in-path loops until it either walk-the-path finds a final sequence of words, or it runs out of words to follow the current one. So this code will always terminate, it just may never find a valid path through the Markov Chain.

Generating text

Turns out that writing this code isn't too hard: it took me a couple of hours and could be better. The difficulty comes in finding sources of text that can form a Markov Chain that will work for Rick Rolling. Even with the gonna/going to synonym switch it's actually desert that is the hard word to find! I tried Project Gutenberg's top 100 and ended up using a couple.

Once you do though:

(require '[rick-roll-my-markov-chain.core :as core] :reload)
(require '[rick-roll-my-markov-chain.markov :as markov] :reload)
(require '[clojure.java.io :as io])
(require '[clojure.string :as str])

(defn article->text [filename]
  (remove #(= % "")
    (map (fn [w] (-> w str/lower-case (str/replace #"[^a-z'!.]" "") str/trim))
      (str/split (str/join " " (line-seq (io/reader filename))) #"\s+"))))

(defn sequence->paragraphs [rick-roll-text sentence-length]
  (loop [text-sequence rick-roll-text
         p             []]
    (if (empty? text-sequence)
      p
      (let [head (take sentence-length text-sequence)
            tail (drop sentence-length text-sequence)]
        (recur tail (conj p head))))))

(let [sentence-length 15
      filenames       ["the-adventures-of-sherlock-holmes" "frankenstein"]
      texts           (map (comp article->text #(str "resources/" %1 ".txt")) filenames)
      rick-roll-text  (core/rick-roll (apply concat texts) (dec sentence-length))
      paragraphs      (sequence->paragraphs rick-roll-text sentence-length)]
  (doseq [p paragraphs]
    (println (str/join " " p))))

And, if you give it a moment as this code isn't efficient:

never did this man who they stood fixed for the cottage it out upon you
going to hear me to you a restrained by them overlook the papers. if you
to for thy lover and carpeted it was. then that we retired to a just
give new preceptors i left oxford with mr. windibank wished to the sorrows of which
you tonight. you made a light. it's a more illiterate than her whole of taking
up my pulses paused i try to me neglect the town in regent street i
never ventured to geneva but you see me. i sincerely sympathised in justice and about
going out. i looked upon me to walk out of science. with my revolver in
to reflect that he loved. this suspense. presently i have five and i would have
let me are. and the wreck and prescribed medicines and then i should run for
you i experienced them and coming in height of treachery and they gained on him
down to claim his steps because we shall have devoted affection. my former ways and
never for my mind was able to a heavy footfall in my assistant mr. mccarthy
going and it be aware that said mr. ryder. pray continue your confession i asked.
to light. it's town for the advertised description tallied in favour i was possible to
run back and the light and rugged mountains of yours mr. turner's lodgekeeper his destruction
around the sledge to her mother could it was alive. i felt helpless. i saw
and that hurts my own good and bloomed in the hall this sort of the
desert heaths and a marriage with the inquest. the greatest fear leaves a sudden and
you know more flurried than with chains and i met him made many objections which
never beheld a victim. she beheld a woman deposed that purpose unfulfilled. yet when you
going the door and source of england and five in some surprising since i have
to conceal some small lake and perilous situation marks of my person it clear and
make the ambition of the wind quickly over for a broadbrimmed hat actually freezing in
you at my eye was not return. they are the owner of the storm that
cry 'cooee!' then at the use and pay such joy so blind vacancy open and
never to him for you will find their walks with the market. all to me
going in possession of your left him but there came upon the trustees are well
to communicate. should die but a trumpet of receipt of my travels for he must
say that there is worthy of heaven whom the appearances of charity descends to him.
goodbye and earth and made of flowers and when that i don't think that he
never had penetrated into the door and knock you appear to state visit to me
going in this is a heading upon young openshaw shall see your own. indeed so
to do you reasoned with more general look for he does his hand. your case
tell you must form a more tangible cause. and animate us. that he pressed upon
a violent start full and i saw the danger would each new scene to a
lie behind him. and were already dusk and wretchedness and glad of the gloomy idea
and this ebook complying with them. which i very truly and the most only to
hurt you wish to the horizon as pressing my father. my father of copet. another
you can afford. i have been in the utmost limit itself while my word that

Obviously the larger the text used to create the Markov Chain, the more possible paths through it for Rick Rolling, but more garbage text will be produced.

Not happy with some of the code, but the result was pleasing.