Teammates of Champions

Starting in the 1983-84 season, every NBA champion has featured a player who was on the same team as Shaquille O'Neal at some point in his career. This summer the Warriors won again -- except they did it without Matt Barnes, their last link to the Big Kevin Bacon -- ending the streak at 33 years. Last week, /u/packmanwiscy heroically sifted through fifty candidates to fill Shaq's outsize shoes. He limited the pool to accomplished players drafted more recently than 1990, and settled on Chris Paul (and soon LeBron James) as the heirs to O'Neal.

That felt like an appropriate finding, since Paul and James are both well-travelled vets and perennial contenders, which means they cross paths with a lot of champion role players looking for their next parade. But with something like 2,600 players getting game checks from the NBA since the beginning of Shaq's streak, I wondered if there was another player out there, perhaps less well-known, with a streak that could rival Shaq's.

To answer that question, I scraped roster data stretching back to the 1983-84 season from stats.nba.com, and fed it into a Neo4j graph database. Graph databases are a little different from typical relational databases. Unlike MySQL, which keeps its data in a series of tables not much different than an Excel spreadsheet, Neo4j puts everything into a giant network, full of nodes connected to each other by links. It's built to answer questions like this one, and here's what it came up with. In each row, after the player's name, you'll find their longest streak in parentheses. Darker shades indicate that the player has logged more time with the players on that year's championship team, based on seasons shared. (One caveat: I don't think stats.nba.com is as reliable as basketball-reference.)

Highlight a season to see which of that year's champions a player played with, and where.
Credit to the Stats & Info desk at ESPN, who -- I think -- first discovered this stat. See the original story here.

How I Created This Table

Loading the database

If you're curious about graph databases, let me give you an idea of how they can be used, and some of the quirks involved.

First we need to load our database, and that means grabbing data. The endpoints at stats.nba.com are open to the public, so long as your request has a User-Agent header that isn't obviously from a headless browser. A query like this will get you a team's roster for a given season:

season = "1999-00"
team = 1610612737
endpoint = "https://stats.nba.com/stats/commonteamroster?LeagueID=00&Season=
{season}&TeamID={team_id}".format(season=season, team_id=team_id)
response = r.get(endpoint, headers={
	"User-Agent": '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
	(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'''
})
body = json.loads(response.text)["resultSets"]
roster = body[0]
columns = roster["headers"]
columns = [
	"teamId",
	"seasonPlayed",
	"leagueId",
	"playerName",
	"number",
	"position",
	"height",
	"weight",
	"birthDate",
	"age",
	"experience",
	"school",
	"playerId"
]
player_data = roster["rowSet"]
		

I created a Pandas DataFrame for each of these results, like so:

df = pd.DataFrame.from_records(data=player_data, columns=columns)

... and after churning through all 30 teams and all 34 seasons between my start and end points (with a respectful sleep between each request, of course), I glued them all together in a concat operation.

Once I had my database, I did a little post-processing to give myself some more fields that I knew I'd need to reference later in my queries. All this data wrangling took probably the most time, and was not particularly exciting, so I'll gloss over the rest of it.

One thing that would have helped was setting up a pipeline for getting data from Pandas into Neo4j. I kept making the classic mistake of thing that "well, this is just a fun exploration, I don't need a whole pipeline for this". But nothing makes a fun exploration tedious like copy-pasting a sequence of IMPORT commands into Neo4j's browser, which doesn't allow for multiple queries to execute at once.

(Note: I was just told that newer versions of Neo4j's browser does allow for multiple queries, so you may have better luck than me in this regard!)

Had I known how many times I'd have to modify my schema and reimport CSVs, I would have taken the time to write up a Python script capable of firing off the few requests necessary to the REST API, which is not at all challenging.

Here's a look at some of the cypher queries I used to load my database.

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///teams.csv" AS row
CREATE (:Team {name: row.teamName,
abbreviation: row.teamAbbreviation,
city: row.teamCity});

Most of these queries are taking a CSV and remapping its columns into some property names I like better. Here I'm stripping off "team" from a number of properties.

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///team-seasons.csv" AS row
CREATE (:TeamSeason {teamId: row.primaryKey,
code: row.teamId,
fullName: row.fullName,
name: row.teamName,
displayName: row.displayName,
abbreviation: row.teamAbbreviation,
city: row.teamCity,
season: row.season,
champ: toBoolean(row.champ),
runnerUp: toBoolean(row.runnerUp)});

You can load CSVs from regular URLs as well as files on your file system. In order to access an arbitrary directory in your load statement, you need to disable a security feature in your Neo4j configuration file, which is a text file to be found in the same folder as the binaries you execute to start your database. In these queries I'm not messing with that setting, so all these paths are being resolved relative to PATH_TO_MY_NEO4J_DIRECTORY/import

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///players.csv" AS row
CREATE (:Player {name: row.playerName,
number: row.number,
position: row.position,
height: row.height,
weight: row.weight,
birthDate: row.birthDate,
school: row.school,
playerId: row.playerId})

One more to get our players. By the way, these "period commits" are a way to keep your database from dropping dead of a memory error. Wasn't a problem for me at any point, it's just mentioned in most of the examples.

CREATE CONSTRAINT ON (p:Player)
 ASSERT p.playerId IS UNIQUE;

Putting a constraint on a node type will also index it, which is important for performance in the later queries. Without indexes, Neo4j can take quite awhile to find and merge your relationships.

CREATE INDEX ON :TeamSeason(code);

You can also put a plain old index like this.

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///player-seasons.csv" AS row
MATCH (player:Player {playerId: row.playerId})
MATCH (team:TeamSeason {code: row.teamId, season: row.season})
MERGE (player)-[:PLAYED_FOR]->(team);

Here's the query that lets us really get to work. At this point, Neo4j knows about our players and the seasons, but not how they're related. So I feed it a CSV that represents the join table between the two, where each row indicates a player and where they played that season. I then match both those components, and MERGE a relationship between them. Merging in Neo4j is like saying "find or create this relationship".

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///championship-belt.csv" AS row
MATCH (priorChamp:TeamSeason {teamId: row.succeeded})
MATCH (newChamp:TeamSeason {teamId: row.primaryKey})
MERGE (priorChamp)<-[:SUCCEEDED]-(newChamp);

This one was important to know the lineage of championship teams, essential for determining a streak.

Querying the database

First, I wanted to explore my data, get a sense for who has the most champion teammates.

// Most champion teammates
MATCH (bacon:Player)-->(baconTeam:TeamSeason)<--
(champPlayer:Player)-->(t:TeamSeason {champ: true})
RETURN bacon.name,
	collect(DISTINCT champPlayer.name),
	collect(DISTINCT t.displayName) as `champTeams`,
	count(DISTINCT t.teamId) as `champCount`
ORDER BY champCount DESC
LIMIT 100;

I saw Shaq and some other Lakers at the top, which gave me confidence that my data had been imported correctly. I made use of some aggregating functions like collect and count to give myself a more compact output. Keep in mind that a "row" to Neo4j is a single pattern match. We tend to think in terms of the object of interest, here a player, but Neo4j has been instructed to find a particular configuration of players and teams; that's what it'll return unless you tell it to bundle up some of that information.

When I was poking around my database, I was a little concerned to see certain players matched up. They never played together, right? To spot-check these relationships, I asked Neo4j for the shortest path between them.

// Shortest path between players
MATCH path=shortestPath(
	(p1:Player {name:'Grant Hill'})-[:PLAYED_FOR*1..20]-(p2:Player {name:'Zaza Pachulia'})
)
RETURN path

And sure enough, Grant Hill did play with Zaza Pachulia, on the '03-04 Magic.

Hill's name was popping up quite a bit in my initial exploration, so I wanted a query that would let me quickly see on a year-by-year basis all his connections with a championship squad:

//BOILED DOWN CONNECTIONS
MATCH (p:Player)-->(champTeam:TeamSeason {champ: true})
WITH p as champion, champTeam
MATCH (bacon:Player {name: "Grant Hill"})-->(baconTeam:TeamSeason)<--(champion)
RETURN bacon.name,
	collect(DISTINCT champion.name) as teammates,
	champTeam.abbreviation as abbr,
	champTeam.season as season
ORDER BY champTeam.season ASC
LIMIT 3

This gave me results like this:

╒════════════╤════════════════════════════╤══════╤═════════╕
│"bacon.name"│"teammates"                 │"abbr"│"season" │
╞════════════╪════════════════════════════╪══════╪═════════╡
│"Grant Hill"│["Rick Mahorn","Joe Dumars"]│"DET" │"1988-89"│
├────────────┼────────────────────────────┼──────┼─────────┤
│"Grant Hill"│["Joe Dumars"]              │"DET" │"1989-90"│
├────────────┼────────────────────────────┼──────┼─────────┤
│"Grant Hill"│["Horace Grant"]            │"CHI" │"1990-91"│
└────────────┴────────────────────────────┴──────┴─────────┘

I should mention how important these WITH statements are. Without them, I wasn't getting all of my results, and from what I could glean in my Googling, this had to do with cardinality errors, explained here.

After a lot more exploration, I was ready to find some streaks. I thought it'd be useful down the road to refer to the length of a player's best streak, so this adds a new property to matching players.

This is a monster query, so I'll annotate it inline.

MATCH (n1:Player)-->(t1:TeamSeason)<--(n2:Player)
WITH n1, t1, collect(n2) as teammates
UNWIND teammates as teammate
MATCH (teammate)-->(champTeam:TeamSeason {champ: true})
WITH n1, collect(distinct champTeam) as champTeams

// We need some way of finding the longest -- not the shortest -- path through
// all the teams a player has connections to. While Neo4j is optimized for
// finding short paths, we need to do something like this to find the longest.

UNWIND champTeams as t
UNWIND champTeams as u
WITH * 
MATCH p=(t)<-[:SUCCEEDED*1..35]-(u)
WHERE id(t) < id(u)
AND u.teamId = "1610612744_2017-2018"

// The chunk above finds all combinations of teams and uses them as the start
// and endpoint of a streak. It would be all permutations of teams, but by
// insisting that the id (an internal Neo4j designation, not anything I
// assigned) of one is before the other, you're guaranteed to have the older
// team first -- so long as your teams got fed into the database oldest first.

// After getting all those teams, the database tries to find a path between
// them. I added an additional constraint of having the endpoint be the most
// recent champion, so I would only be seeing streaks that were still active

// But there's a problem -- Neo4j was forgetting about the set of teams we
// were working on originally. In order to find paths between two champions,
// it might make use of a whole bunch of teams the player in question had no
// relation to. So we want to constrain it by saying "make sure each node
// along the path belongs to the championship teams our player is associated
// with." Here's how that translates to Cypher:

AND all(x in nodes(p) WHERE x IN champTeams)

// Now we need  to get the longest path. I don't know a better way for
// plucking the max value of a result set in Cypher, so this may be
// inefficient:

WITH n1, p, length(p) as chainLength
ORDER BY chainLength DESC
WITH n1, collect(p)[..1] as paths
UNWIND paths as path

// We finally set a property with the path we found
SET n1.longestActiveChain = length(path)
return n1.name,
	extract(x in nodes(path) | x.displayName) as teamsPath,
	length(path) as chainLength
ORDER BY chainLength DESC;

And here's the query that gave me the data for the D3 script responsible for the table above:

MATCH (n:Player)

// I only want to see players with long streaks, so we ignore those who don't
// have that property set, and grab the top 25 according to streak length.

WHERE exists(n.longestChain)
WITH collect(n) as players, n
ORDER BY n.longestChain DESC
LIMIT 25

// We've got the players, now let's try and find their connections to the
// championship teams:

MATCH (t:TeamSeason {champ: true})
WITH players, collect(t) as champTeams
UNWIND players as n
UNWIND champTeams as champTeam
MATCH (n)-[:PLAYED_FOR]->(baconTeam:TeamSeason)<-[:PLAYED_FOR]-(p2:Player)
-[:PLAYED_FOR]->(champTeam)
RETURN DISTINCT n.name as player,
	n.longestChain as longestChain,
	p2.name as championTeammate,
	champTeam.displayName as champTeam,
	baconTeam.displayName as baconTeam
ORDER BY n.name ASC

I recommend aliasing your output columns, so they've got tidy names when you send them into the next step of you process.

I'd love to tell you about how I continued into D3 for rendering this information... but it's lunch time! I hope you enjoyed seeing some of these queries. Graph databases are fascinating, and can really flex their muscles on problems like this one. How would I have even begun to approach this in something like MySQL? I would speculate, but again... lunch.