Clustering Characters

In this section I’ll describe a modest NLP success. I’ve been wanting to try unsupervised learning for some time, and thought that I could take all the character dialogue and cluster it, using a method called k-means. If it worked, this would show non-obvious character groupings. I imagined that perhaps Sandor Clegane and the other profane characters would be in one cluster, with the holy men in another, and the courtiers in still another. That did end up happening, but it took a lot of tinkering to get there. Here’s how it went:

First, I winnowed the ~1100 characters down to 200. Almost every natural language dataset roughly obeys Zipf’s law, in which a relatively small number of contributors account for most of the occurrences. Here, the 200 most talkative characters cover ~90% of the dialogue in ASOIAF. I wanted to narrow the field because initial tests were being skewed by the huge number of characters with one-liner parts.
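The coverage check described above can be sketched in a few lines. This is a minimal illustration, assuming dialogue is stored as (speaker, text) pairs and that "coverage" means share of spoken words; the function name and the toy lines are hypothetical and not taken from the actual dataset.

```python
from collections import Counter

def coverage_of_top_speakers(lines, top_n):
    """Return the fraction of all spoken words that come from the top_n speakers."""
    words_per_speaker = Counter()
    for speaker, text in lines:
        words_per_speaker[speaker] += len(text.split())
    total = sum(words_per_speaker.values())
    top_total = sum(count for _, count in words_per_speaker.most_common(top_n))
    return top_total / total if total else 0.0

# Hypothetical toy dialogue, invented for illustration only.
toy_lines = [
    ("Tyrion", "wine first then we can talk about the debts of the crown"),
    ("Jon", "the wall will hold"),
    ("Tyrion", "bring me another cup"),
    ("Guard", "halt"),
    ("Innkeep", "more ale"),
]

top_two_share = coverage_of_top_speakers(toy_lines, top_n=2)
print(f"Top 2 speakers cover {top_two_share:.0%} of the words")
```

Raising `top_n` until the share levels off is one way to pick a cutoff like 200.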

I then had to figure out the parameters of the clustering. The tricky part was deciding how common or how rare a word could be and still be included. When you’re clustering data, you want to throw out ubiquitous terms, because they won’t help you discriminate. That means you toss prepositions and pronouns, but also corpus-specific terms, like “lord” and “ser”.

On the other end, you don’t want to admit extremely rare words, because then you’ll have overly specific clusters that don’t capture enough characters. I decided that if ten separate characters (out of 200, remember) used a word, it was common enough to go into the analysis.
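The two-sided filter described above (drop ubiquitous words, drop words used by too few characters) might look something like the sketch below. The ten-character minimum comes from the text; the `max_share` cutoff, the function name, and the toy corpus are assumptions made for illustration. Libraries such as scikit-learn expose similar thresholds (`min_df` and `max_df` on its text vectorizers), though nothing here says which tool was actually used.

```python
from collections import Counter

def build_vocabulary(words_by_character, min_characters=10, max_share=0.8, stop_words=()):
    """Keep words used by at least min_characters characters and at most max_share of them."""
    n_characters = len(words_by_character)
    character_count = Counter()
    for words in words_by_character.values():
        character_count.update(set(words))  # count each character once per word
    return sorted(
        word for word, n in character_count.items()
        if n >= min_characters
        and n / n_characters <= max_share  # assumed cutoff for "ubiquitous" words
        and word not in stop_words
    )

# Hypothetical toy corpus with 4 characters, so the thresholds are scaled down.
toy = {
    "A": ["lord", "ship", "sail", "the"],
    "B": ["lord", "ship", "the", "pray"],
    "C": ["lord", "the", "pray", "harpy"],
    "D": ["lord", "the", "tree"],
}
vocab = build_vocabulary(toy, min_characters=2, max_share=0.8, stop_words={"the"})
print(vocab)
```

In the toy corpus, "lord" is dropped for being used by everyone, while "sail", "harpy" and "tree" are dropped for being used by only one character each.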

There’s also the question of how many clusters you want. This was a hard problem as well, because there’s no good way to check it. That’s because the data suffers from “the curse of dimensionality.” The characters have a large vocabulary of about 17,000 words. Every one of those can be thought of as a spatial dimension. If I were clustering on two or three dimensions, it would be simple to check the data. I could throw it on a scatterplot and eyeball it. But with 17,000 dimensions, there’s no way for me to do that. (I think; I’m still new at this data science thing.) All I could do was run a bunch of tests, see if the clusters seemed too large or too small, and refine from there.
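The "run a bunch of tests" loop can be made a little more systematic by recording, for each candidate number of clusters, how tightly the points sit around their centroids (often called inertia). The sketch below is a bare-bones k-means (Lloyd’s algorithm) on hypothetical 2-D toy points; it is not the implementation used for the analysis, and real word-count vectors would have thousands of dimensions. Inertia tends to fall as k grows, so the usual heuristic is to look for the point where the drop flattens out rather than for a minimum.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, inertia

# Hypothetical toy data: three loose groups of points in two dimensions.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, len(points) + 1)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

Because k-means starts from random centroids, a single run can land on a poor solution; repeating each k with several seeds and keeping the lowest inertia is a common precaution.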

After a lot of trial-and-error, I arrived at the clustering below. Take a look and see how it did.

I think it performed decently at identifying geographic clusters. It did that by homing in on the topics which dominate the conversation in the various locales: dragons, ships, trees, walls.

But since I know which characters talk to each other, I already had this data. I was hoping for more clusters like Cluster 5, the backcountry cluster. These are the characters who speak in a poor, rural dialect, like the ranger Dywen, who has wooden teeth. These characters don’t necessarily converse with each other, but they are similar in their speech patterns. Cluster 7 captures those sitting above the salt, and you can see their genteel speech patterns in “fear” and “hope”. I’m honestly not sure what “doe” is doing there. (I’m writing this a few months after the fact.) I believe it’s the result of the stemmer trimming “does” down to “doe”. Stemmers are used to normalize the data by reducing the singular and plural forms of a word to one root form.
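To see how a stemmer could turn “does” into “doe”, here is a minimal sketch of the plural-stripping rules from the first step of the Porter stemmer. It is an assumption that the stemmer used in the analysis behaves this way; the function name is hypothetical, and a real stemmer applies several more steps after this one.

```python
def strip_plural(word):
    """Apply the plural-stripping rules from step 1a of the Porter stemmer."""
    if word.endswith("sses"):
        return word[:-2]   # "caresses" -> "caress"
    if word.endswith("ies"):
        return word[:-2]   # "ponies" -> "poni"
    if word.endswith("ss"):
        return word        # "caress" is left alone
    if word.endswith("s"):
        return word[:-1]   # "ships" -> "ship", but also "does" -> "doe"
    return word

for word in ["dragons", "ships", "does"]:
    print(word, "->", strip_plural(word))
```

The rule has no way of knowing that “does” is a verb rather than the plural of “doe”, which is how the odd keyword can end up in a cluster.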

Cluster 0 – The Meereenese Knot

Keywords: dragon, meereen, yunkai, astapor, worship, khaleesi

Cluster 1 – Religiosity

Keywords: sin, trial, seven, faith, pray, land

Cluster 2 – Northern Magic

Keywords: dream, tree, watch, summer, wall, giant

Cluster 3 – Naval Matters

Keywords: dragon, sea, ship, sail, qarth, fleet

Cluster 4 – Stannis & The Night’s Watch

Keywords: night, command, wall, watch, light, dead

Cluster 5 – Countryfolk

Keywords: o’, t’, got, aye, har, fool

Cluster 6 – The Dog and the Brotherhood

Keywords: aye, dog, dead, song, fight, got

Cluster 7 – The Nobles and Courtiers

Keywords: fear, daughter, wed, hope, doe, land
