Clustering Characters

In this section I’ll describe a modest NLP success. I’ve been wanting to try unsupervised learning for some time, and thought that I could take all the character dialogue and cluster it, using a method called k-means. If it worked, this would show non-obvious character groupings. I imagined that perhaps Sandor Clegane and the other profane characters would be in one cluster, with the holy men in another, and the courtiers in still another. That did end up happening, but it took a lot of tinkering to get there. Here’s how it went:

First, I winnowed the ~1100 characters down to 200. Most natural language datasets roughly follow Zipf’s law: frequency falls off steeply with rank, so a relatively small number of contributors account for most of the occurrences. Here, the 200 most talkative characters cover ~90% of the dialogue in ASOIAF. I wanted to narrow the field because initial tests were being skewed by the huge number of one-liner parts.
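To make the winnowing step concrete, here is a minimal sketch of how the coverage of the top speakers could be measured. The function name and the word counts are invented for illustration; they are not the real ASOIAF figures or the code I actually ran.

```python
from collections import Counter

def coverage_of_top_speakers(word_counts, top_n):
    """Return the top_n speakers and the share of all dialogue words they account for."""
    ranked = Counter(word_counts).most_common(top_n)
    total = sum(word_counts.values())
    covered = sum(count for _, count in ranked)
    return [name for name, _ in ranked], covered / total

# Hypothetical word counts per character, made up for this example.
toy_counts = {"Tyrion": 500, "Jon": 300, "Arya": 150, "A Guard": 30, "A Man": 20}

top, share = coverage_of_top_speakers(toy_counts, top_n=3)
print(top, round(share, 2))
```

With real data, you would raise `top_n` until the share stops growing quickly, which is roughly how a cutoff like 200 falls out.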

I then had to figure out the parameters of the clustering. The tricky thing here was deciding how common or how rare a word could be and still be included. When you’re clustering data, you want to throw out ubiquitous terms, because they won’t help you discriminate. That means you toss prepositions and pronouns, but also corpus-specific terms, like “lord” and “ser”.

On the other end, you don’t want to admit extremely rare words, because then you’ll have overly specific clusters that don’t capture enough characters. I decided that if ten separate characters (out of 200, remember) used a word, it was common enough to go into the analysis.
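The two cutoffs can be sketched as a filter on how many characters use each word. This is only an illustration: the lower bound mirrors the ten-character rule above, but the upper bound (`max_share`) and the function name are assumptions, since there is more than one way to detect ubiquitous terms.

```python
from collections import Counter

def build_vocabulary(dialogue_by_character, min_characters, max_share):
    """Keep words used by at least min_characters speakers,
    but by no more than max_share of all speakers."""
    character_freq = Counter()
    for words in dialogue_by_character.values():
        # Count each word once per character, however often they say it.
        character_freq.update(set(words))
    n = len(dialogue_by_character)
    return sorted(
        word for word, freq in character_freq.items()
        if freq >= min_characters and freq / n <= max_share
    )

# Toy dialogue, invented for this example.
toy_dialogue = {
    "A": "the lord dragon ship".split(),
    "B": "the lord dragon wall".split(),
    "C": "the lord wall aye".split(),
    "D": "the lord ship khaleesi".split(),
}

vocab = build_vocabulary(toy_dialogue, min_characters=2, max_share=0.75)
print(vocab)
```

In the toy data, “the” and “lord” are dropped because everyone says them, and “aye” and “khaleesi” are dropped because only one character does. For the real run the lower bound would be `min_characters=10` across the 200 characters.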

There’s also the question of how many clusters you want. This was a hard problem as well, because there’s no good way to check it. That’s because the data suffers from “the curse of dimensionality.” The characters have a large vocabulary of about 17,000 words, and every one of those can be thought of as a spatial dimension. If I were clustering on two or three dimensions, it would be simple to check the data: I could throw it on a scatterplot and eyeball it. But with 17,000 dimensions, there’s no way for me to do that. (I think. I’m still new at this data science thing.) All I could do was run a bunch of tests, see if the clusters seemed too large or too small, and refine from there.
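For readers who haven’t met k-means, here is a bare-bones sketch of the algorithm, plus one common way to compare cluster counts: run it for several values of k and watch how the total within-cluster squared distance (often called inertia) drops. This is not the code I ran. The seeding strategy, the function names, and the three-word toy vectors are all assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) with deterministic farthest-first seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Seed the next centroid with the point farthest from the existing ones.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centroid to the mean of its members.
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = assign(points, centroids)
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, inertia

# Toy word-count vectors (counts of "dragon", "ship", "wall"), invented here.
points = [(5, 0, 0), (6, 1, 0), (0, 5, 0), (1, 6, 0), (0, 0, 5), (0, 1, 6)]

for k in range(1, 5):
    labels, inertia = kmeans(points, k)
    print(k, labels, round(inertia, 2))
```

On data this tidy, the inertia falls sharply until k reaches the true number of groups and then flattens out. On 17,000-dimensional dialogue the curve is far less clear-cut, which is part of why I ended up judging the clusters by eye.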

After a lot of trial-and-error, I arrived at the clustering below. Take a look and see how it did.

I think it performed decently at identifying geographic clusters. It did that by homing in on the topics which dominate the conversation in the various locales: dragons, ships, trees, walls.

But since I know which characters talk to each other, I already had this data. I was hoping for more clusters like Cluster 5, the backcountry cluster. These are the characters with a rural, poor dialect, like the ranger Dywen, who has wooden teeth. These characters don’t necessarily converse, but they are similar in their speech patterns. Cluster 7 captures those sitting above the salt, and you can see their genteel speech patterns with “fear” and “hope”. I’m honestly not sure what “doe” is doing there. (I’m writing this a few months after the fact.) I believe that’s the result of the stemmer trimming “does” down. Stemmers are used to normalize the data by reducing singular and plural forms to one root form.
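To show how a stemmer could produce “doe”, here is a toy plural-stripping rule. It is a deliberately crude stand-in, not the stemmer I used, and the function name is made up for this example.

```python
def strip_plural(word):
    """Toy rule: drop a single trailing 's', but leave words ending in 'ss' alone."""
    if len(word) > 2 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

for word in ["dragons", "ships", "walls", "does", "glass"]:
    print(word, "->", strip_plural(word))
```

A rule like this can’t tell the verb “does” from a plural noun, so it trims both to “doe”, which would explain the odd keyword in Cluster 7.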

Cluster 0 – The Meereenese Knot

Keywords: dragon, meereen, yunkai, astapor, worship, khaleesi

Dany Targaryen
Barristan Selmy
Skahaz mo Kandaq
Daario Naharis
Hizdahr zo Loraq
Quentyn Martell
Gerris Drinkwater
Kraznys
The Tattered Prince
Missandei
Arys Oakheart
Reznak mo Reznak
Galazza Galare
Grey Worm
Qavo Nogarys
Nurse
Irri
Ghael

Cluster 1 – Religiosity

Keywords: sin, trial, seven, faith, pray, land

Kevan Lannister
Elder Brother
High Sparrow
Lancel Lannister
Jacelyn Bywater

Cluster 2 – Northern Magic

Keywords: dream, tree, watch, summer, wall, giant

Bran Stark
Maester Luwin
Meera Reed
Mance Rayder
Jojen Reed
Shae
Qhorin Halfhand
Godric Borrell
Osha
Bloodraven
Osmund Kettleblack
Hot Pie
Balon Greyjoy
Mya Stone
Old Nan
Benjen Stark
Robett Glover
Tommen
Big Walder
A Messenger
Leaf
Jeyne Poole

Cluster 3 – Naval Matters

Keywords: dragon, sea, ship, sail, qarth, fleet

Jorah Mormont
Asha Greyjoy
Illyrio
Xaro Xhoan Daxos
Victarion Greyjoy
Aeron Damphair
A Captain
Euron Greyjoy
Rodrik Harlaw
Viserys
Tristifer Botley
Vogarro’s Whore
Aegon Targaryen
Harry Strickland
Tycho Nestoris
Moqorro
Leo Tyrell
Aurane Waters
Pyat Pree

Cluster 4 – Stannis & The Night’s Watch

Keywords: night, command, wall, watch, light, dead

Jon Snow
Sam
Jeor Mormont
Sansa Stark
Brienne
Davos Seaworth
Melisandre
A Kindly Man
Aemon Targaryen
Salladhor Saan
Haldon Halfmaester
Bowen Marsh
Barbrey Dustin
Rodrik Cassel
Jon Connington
Janos Slynt
Loras Tyrell
Selyse Baratheon
Gilly
Alliser Thorne
Donal Noye
Randyll Tarly
Axell Florent
Grenn
Pyp
Maester Cressen
Hallyne
Val
Jaqen H’ghar
Syrio Forel
Alys Karstark
Maester Pylos
Robert Arryn
Denys Mallister
Alleras
Justin Massey
Tyene Sand
Othell Yarwyck
Armen
Tytos Blackwood
Ravella Swann
Falyse Stokeworth
Hoster Tully
Rennifer Longwaters
Archmaester Marwyn
Edric Storm
The Waif
Godry Farring

Cluster 5 – Countryfolk

Keywords: o’, t’, got, aye, har, fool

Tormund Giantsbane
Ygritte
Yoren
Dick Crabb
Brown Ben Plumm
Lem Lemoncloak
A Sallow Man
Craster
A Man
An Old Man
Rattleshirt
Dywen

Cluster 6 – The Dog and the Brotherhood

Keywords: aye, dog, dead, song, fight, got

Arya Stark
Sandor Clegane
Robert Baratheon
Septon Meribald
Joffrey Lannister
Bronn
Gendry
Walder Frey
Dolorous Edd
Ramsay Bolton
Thoros
Tom Sevenstrings
Dontos
Hyle Hunt
Genna Lannister
Harwin
Daven Lannister
Penny
Beric Dondarrion
Mirri Maz Duur
Podrick Payne
Archibald Yronwood
A Guard
Cleos Frey
Jonos Bracken
Obara Sand
A Maester
Osney Kettleblack
Chett
Marillion
Dareon
Chiswyck
Tobho Mott
Cotter Pyke
Nymeria Sand
Weese
Ghost of High Heart
Edwyn Frey
Khal Drogo
Creighton Longbough
A Dwarf
Rolly Duckfield
Rowan

Cluster 7 – The Nobles and Courtiers

Keywords: fear, daughter, wed, hope, doe, land

Tyrion Lannister
Cersei Lannister
Jaime Lannister
Catelyn Stark
Littlefinger
Stannis Baratheon
Ned Stark
Varys
Robb
Theon Greyjoy
Tywin Lannister
Roose Bolton
Grand Maester Pycelle
Doran Martell
Arianne Martell
Oberyn Martell
Lysa Arryn
Brynden Tully
Qyburn
Wyman Manderly
Renly Baratheon
Edmure Tully
Olenna Tyrell
Taena Merryweather
Margaery Tyrell
Myranda Royce
Mace Tyrell
Lothar Frey
Alester Florent
Nestor Royce
Harys Swyft
Orton Merryweather