As Conan O'Brien has been winding down his TBS show this summer, he's been bringing back some of his favorite guests for a last hurrah. It got me wondering which late night hosts had the deepest bench and, once I realized that episode names were listed in billing order, whether we could establish a celebrity pecking order using Elo rankings.
All data was scraped from TVMaze.com, and comes from the 2015-2021 seasons -- when all of the following six hosts were on the air:
Here's a JavaScript snippet for scraping episode info from TVMaze.com, using the XPath helper ($x) and the copy() utility provided by Chrome's devtools console.
copy({
  metadata: window.location.href,
  data: $x("//section[contains(@class, 'season')]/article[contains(@class, 'episode-row')]").map(row => {
    // Keep the text of the first three cells: episode number, air date, guests
    const cells = [...row.children].map((x, i) => i < 3 ? x.innerText : '')
    return {
      'episodeNumber': cells[0],
      'airDate': cells[1],
      'guests': cells[2]
    }
  })
})
import pandas as pd
import json
import pickle
import copy
import numpy as np
from scipy.spatial import distance
import matplotlib.pyplot as plt
The implementation of the Elo system below is courtesy of ddm7018 on GitHub.
class Elo:
    def __init__(self, k, g=1, homefield=100):
        self.ratingDict = {}
        self.k = k
        self.g = g
        self.homefield = homefield

    def addPlayer(self, name, rating=1500):
        self.ratingDict[name] = rating

    def gameOver(self, winner, loser, winnerHome):
        if winnerHome:
            result = self.expectResult(self.ratingDict[winner] + self.homefield, self.ratingDict[loser])
        else:
            result = self.expectResult(self.ratingDict[winner], self.ratingDict[loser] + self.homefield)
        self.ratingDict[winner] = self.ratingDict[winner] + (self.k*self.g)*(1 - result)
        self.ratingDict[loser] = self.ratingDict[loser] + (self.k*self.g)*(0 - (1 - result))

    def expectResult(self, p1, p2):
        exp = (p2-p1)/400.0
        return 1/((10.0**(exp))+1)
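For reference, the update that this class performs can be written out as follows. The symbols are my own shorthand rather than anything from the original code: $R_w$ and $R_l$ are the winner's and loser's ratings before the match, $E_w$ is the winner's expected score, and any home-field bonus is assumed to have been added to the relevant rating before computing $E_w$.

```latex
E_w = \frac{1}{1 + 10^{(R_l - R_w)/400}}
\qquad
R_w' = R_w + k\,g\,(1 - E_w)
\qquad
R_l' = R_l - k\,g\,(1 - E_w)
```

Whatever the winner gains, the loser gives up, so the total number of rating points in the league stays constant.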
%matplotlib notebook
We'll do some initial setup -- each season of TV is a big blob of JSON with some metadata letting me know the URL that generated it. I hacked together a dictionary providing the relevant data for each of those URLs.
Then we loop through all the seasons and all their episodes, breaking the episode titles -- which contains all of the guests names -- into individual rows.
While we're here, we'll also run the Elo calculations. I decided to treat a three-guest episode as a series of one-on-one matchups, in which the lead guest defeats the second guest, the second guest defeats the third guest, and the first guest then defeats the third guest as well.
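The matchup scheme can be sketched in isolation. This is a minimal illustration rather than the notebook's actual loop: `expected_score`, `play_episode`, and the guest names are hypothetical, and it assumes k=20 with no home-field bonus.

```python
from itertools import combinations

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def play_episode(ratings, guests, k=20):
    """Treat billing order as a round robin: each guest beats everyone billed below them."""
    for guest in guests:
        ratings.setdefault(guest, 1500.0)
    # combinations() preserves input order, so the first name in each pair is billed higher
    for winner, loser in combinations(guests, 2):
        gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += gain
        ratings[loser] -= gain
    return ratings

# Hypothetical three-guest episode, listed in billing order
ratings = play_episode({}, ["Lead Guest", "Second Guest", "Musical Guest"])
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

A three-guest episode produces three matchups, and a single episode is already enough to separate the lead guest from the closer.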
# Initialize Elo league -- starting rating is 1500
pecking_order = Elo(k=20, g = 1, homefield = 0)
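import ast
# Assumes the lookup file holds a plain Python literal; literal_eval parses it without executing arbitrary code
href_lookup = ast.literal_eval(open('host-season-lookup.pkl', 'r', encoding='utf-8').read())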
print("href_lookup looks like:")
print(list(href_lookup.items())[:3])
json_array = json.load(open('data.json', 'r', encoding='utf-8'))
dfs = []
for season in json_array:
    records = []
    index = href_lookup[season['metadata']]
    episodes = season['data']
    for ep in episodes:
        if ep['episodeNumber'] == 's':
            continue # it's a special of some sort
        splitter = ';' if ';' in ep['guests'] else ','
        guests = [x.strip() for x in ep['guests'].split(splitter)]
        for guest in guests:
            # Ensure all guests are present in the Elo rankings
            if guest not in pecking_order.ratingDict:
                pecking_order.addPlayer(guest)
        for billing, current_guest in enumerate(guests):
            record = copy.deepcopy(ep)
            del record['guests']
            record['guest'] = current_guest
            record['billing'] = billing + 1
            records.append(record)
            # Have the current guest lose to every guest billed above them on that episode.
            # billing is zero-based here, so range(billing) covers exactly the earlier slots.
            for guest_slot in range(billing):
                earlier_guest = guests[guest_slot]
                pecking_order.gameOver(earlier_guest, current_guest, True)
    df = pd.DataFrame(records)
    df['host'] = index[0]
    df['season'] = index[1]
    df.set_index(['season', 'host', 'episodeNumber'], inplace=True)
    dfs.append(df)
df = pd.concat(dfs)
print("\nSample episode datum:")
print(ep)
node_ids = pd.DataFrame(list(set(df.guest.values)))
node_ids.index.name = 'Id'
node_ids = node_ids.reset_index().set_index(0)
node_ids.index.name = 'Label'
inverse_node_ids = node_ids.reset_index().set_index('Id')
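# Aggregate only the numeric billing column; the other columns are strings and can't be averaged
guest_data = df.reset_index().groupby(['guest', 'host'])[['billing']].agg([np.size, np.mean])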
appearances_per_show = guest_data.billing.sort_values('size', ascending=False).reset_index()
aps = appearances_per_show.pivot_table(values='size', index='guest', columns='host', margins=True, fill_value=0, margins_name='total', aggfunc=sum)
aps.sort_values('total', ascending=False).head(20)
Kind of bizarre that Bernie Sanders is the most frequent late night guest for the last 7 years.
Anyway, the six hosts can be bucketed along a few dimensions, which are visible even in a cursory glance at this leaderboard of guest visits.
Then there are their various networks (TBS, CBS, ABC, NBC) and the conglomerates that own them -- Kimmel gets more of the Avengers because Disney owns ABC. Corden's English, so he's bound to get more Brits, and his producer Ben Winston has an in with the NBA somehow, so you'll see people like JJ Redick and Steph Curry show up alongside the more bread-and-butter guests, which are Broadway types.
It's also interesting to notice which guests like doing late night shows but shun certain hosts. There are probably two reasons for this:
1.) The guest doesn't want to do that person's show, because it's too small, or they don't like the host.
2.) The booker doesn't think the guest will work on their show.
Corden's show is an apolitical couch-hang in the style of Graham Norton -- Bernie Sanders isn't going to work in that format.
And I can't imagine Tig Notaro's sense of humor meshing with Jimmy Kimmel's.
Somebody like Bryan Cranston or Nick Kroll, on the other hand, will talk to anybody.
Let's try to figure out which celebrities fall where.
We can think of each row in our appearances table as a 6-dimensional vector, representing that celebrity's affinity for the various late night hosts. Using cosine similarity, we can figure out which guests have similar appearance patterns.
vectors = aps.join(node_ids).set_index('Id').drop(columns='total').dropna().sort_index().drop(index=[np.nan])
vectors
Now we'll tack on 13 synthetic guests.
First we'll put their ids into our lookup, then append their vectors to the end of a numpy array. From there, we use scipy's distance functions to compute a squareform distance matrix. That means every guest is compared against every other guest using cosine distance (1 minus cosine similarity), so 0 is the closest possible match.
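To make the synthetic guests concrete, here is a small pure-Python sketch of the same comparison. The `cosine_distance` helper and the appearance counts are made up for illustration; the formula is the standard one, 1 minus the cosine of the angle between the two vectors.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v): 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical appearance counts across six hosts (made-up numbers, not from the data)
loyalist = [12, 0, 0, 0, 0, 0]    # only ever visits the first host
regular = [3, 3, 3, 3, 3, 3]      # visits everyone equally

loves_first = [1, 0, 0, 0, 0, 0]  # synthetic "Loves <host>" vector
snubs_first = [0, 1, 1, 1, 1, 1]  # synthetic "Snubs <host>" vector
butterfly = [1, 1, 1, 1, 1, 1]    # synthetic "Social Butterfly" vector

print(cosine_distance(loyalist, loves_first))
print(cosine_distance(loyalist, snubs_first))
print(cosine_distance(regular, butterfly))
```

Cosine distance ignores magnitude, so a guest with 12 visits to one host points in the same direction as the unit "Loves" vector. That is why the synthetic guests can be built from plain ones and zeros.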
vectors.columns.values
mapping = inverse_node_ids.to_dict()['Label']
max_key = max(mapping.keys())
for i, hated_host in enumerate(vectors.columns.values):
    key = max_key + 1 + i
    value = f"Snubs {hated_host}"
    mapping[key] = value
max_key = max(mapping.keys())
for i, loved_host in enumerate(vectors.columns.values):
    key = max_key + 1 + i
    value = f"Loves {loved_host}"
    mapping[key] = value
mapping[max(mapping.keys()) + 1] = "Social Butterfly"
snub_vectors = np.ones((6,6))
np.fill_diagonal(snub_vectors, 0)
love_vectors = np.zeros((6,6))
np.fill_diagonal(love_vectors, 1)
butterfly = np.ones((1,6))
data = vectors.to_numpy()
data = np.concatenate((data, snub_vectors, love_vectors, butterfly))
pdist = distance.pdist(data, 'cosine')
sims = pd.DataFrame(distance.squareform(pdist))
sims.index = sims.index.map(mapping)
sims.columns = sims.columns.map(mapping)
# Attach each guest's total appearance count so we can filter to frequent guests
sims = sims.join(aps['total'])
freq_guests = sims[sims.total >= 10]
freq_guests = freq_guests.transpose()
Spoiler -- there aren't many legit snubs in here. Either the vector I chose for cosine similarity isn't a great measure, or this type of pettiness is simply rare.
It is interesting to note some of the loyalists. For instance, Kimmel & Jimmy Fallon each have their own animal handler.
And there's a curious West Wing connection with the Corden show: he's had Aaron Sorkin, Allison Janney, and Bradley Whitford on quite a bit.
print("Maximum similarity == 0, Maximum dissimilarity == 1\n")
synth_df = freq_guests.reindex(list(mapping.values())[-13:])
for index, row in synth_df.iterrows():
    print(index, '\n')
    print(row.sort_values().head(20))
    print('\n***********\n')
elo = pd.DataFrame(pecking_order.ratingDict.items(), columns=['guest', 'rating']).set_index('guest').sort_values('rating', ascending=False)
elo.rating.hist(bins=50)
At the top of the rankings we've got a few different types of guests:
The pros:
elo.head(50)
Spader's and Maddow's presence definitely feels mandated by NBC execs. Look where they show up:
aps.loc['James Spader']
aps.loc['Rachel Maddow']
The bottom is musical guests, who always close out the show. At the very bottom are the drummers-in-residence that Seth Meyers would have on while Fred Armisen wasn't around. These people would get last billing and stick around for a few weeks, which explains the Elo pummeling they took. Ditto for the Game of Thrones kids: they would show up in big groups, which meant they'd "lose" to 8 or 9 people in a single appearance.
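A rough sketch of how expensive a big-group appearance is for whoever is billed last. It makes a simplifying assumption that the notebook does not: every opponent sits at a fixed 1500, whereas in the real run their ratings move too. The function names are hypothetical, and k=20 matches the setup above.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def rating_after_losses(n_losses, k=20, start=1500.0, opponent=1500.0):
    """Rating of a guest who loses n_losses matchups in a row to fixed-rating opponents."""
    rating = start
    for _ in range(n_losses):
        # The loser gives up k times their own expected score
        rating -= k * expected_score(rating, opponent)
    return rating

for losses in (1, 2, 9):
    print(losses, round(rating_after_losses(losses), 1))
```

Each successive loss costs a little less, because the expected score shrinks as the rating falls, but nine losses in one night still add up to far more than a normal last-billed guest gives up.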
elo.tail(30)
elo_aps = elo.join(aps)
elo_aps = elo_aps[elo_aps.total > 15]
elo_aps.to_csv('elo.csv')
It's kind of interesting to check out frequent, non-musical guests who are hovering around the average of 1500 -- tells you that they're typically batting second. A lot of the people on here are just good at being guests -- Andrew Rannells, Bill Burr, Marc Maron, Jenny Slate, Pete Holmes -- but aren't famous enough to lead off a show.
elo_aps.tail(30)
Gephi can help explore social networks. To import data, you need to create CSVs with a particular format: first an edge list containing Source,Target, and then a nodes table with some extra info. In this case I'm tracking total appearances across shows and the average billing order.
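As a minimal sketch of the two files, here is the same layout built with only the standard library. The names and appearances are invented, and unlike the pandas code in this notebook, this version assigns ids to hosts as well as guests so that every edge endpoint has a row in the nodes table.

```python
import csv
import io

# Hypothetical appearances: (guest, host) pairs with made-up names
appearances = [("Guest A", "Host X"), ("Guest B", "Host X"), ("Guest A", "Host Y")]

# Give every distinct name an integer Id, in first-seen order
ids = {}
for guest, host in appearances:
    for name in (guest, host):
        ids.setdefault(name, len(ids))

# Edge list: one row per appearance, referencing node Ids
edges_csv = io.StringIO()
writer = csv.writer(edges_csv, lineterminator="\n")
writer.writerow(["Source", "Target"])
for guest, host in appearances:
    writer.writerow([ids[guest], ids[host]])

# Nodes table: Id plus a human-readable Label (extra columns can be appended)
nodes_csv = io.StringIO()
writer = csv.writer(nodes_csv, lineterminator="\n")
writer.writerow(["Id", "Label"])
for name, node_id in ids.items():
    writer.writerow([node_id, name])

print(edges_csv.getvalue())
print(nodes_csv.getvalue())
```

Writing to `io.StringIO` keeps the sketch side-effect free. Swapping in `open('edges.csv', 'w', newline='')` would produce files on disk instead.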
total_appearances = aps['total']
# Average only the numeric billing column; the remaining columns are strings
average_billing = df.groupby('guest')[['billing']].mean()
node_info = pd.concat([total_appearances, average_billing], axis=1)
edges = df.reset_index()[['guest', 'host']]
edges = edges.merge(node_ids, left_on='guest', right_index=True).merge(node_ids, left_on='host', right_index=True)
edges.rename(columns={'Id_x': 'Source', 'Id_y': 'Target'})[['Source', 'Target']].to_csv('edges.csv', index=False)
node_ids.join(node_info).reset_index()[['Id', 'Label', 'total', 'billing']].to_csv('nodes.csv', index=False)
avg_billing = appearances_per_show.pivot_table(values='mean', index='guest', columns='host', margins=True, fill_value=None, margins_name='total')
avg_billing['slot'] = avg_billing.total.round()
avg_billing['total_appearances'] = aps.total
avg_billing.sort_values(['slot', 'total_appearances'], ascending=[True, False]).head(25)
pcts = aps.div(aps.total, axis=0)
pcts.sort_values("Conan O'Brien", ascending=False).iloc[400:430]
avg_billing[(avg_billing.total > 2.0) & (avg_billing.total < 3.0)].tail(30)
aps.sort_values('total', ascending=False).to_csv("appearances_per_show_per_guest.csv")