The streak started twice, in some sense. It first began on June 12, 1984, when the Garden's faithful stormed the parquet. The Celtics had just beaten the Lakers in seven games, and Greg Kite rose from his spot at the end of the bench to join the celebration. Twelve inches of height was about all that separated Kite from the crush of fans: he'd outscored the ticketholders by a whole seven points during the playoff run. (Larry Bird, praising him after a key defensive performance against Kareem, once gushed, "I've seen Greg play real well at times.") But he was in the locker room when the champagne popped, and he was in the Orlando Magic's locker room in 1993, when Shaquille O'Neal entered the league.
This was the streak's second beginning. Kite was the first NBA champion Shaq played with. Shaq would go on to play with another 34 over the course of his 19-year career. Not all were champions when they crossed paths, of course. Danny Green was just a rookie when he and Shaq played together in Cleveland in 2010.
(Editor's note: That number still could rise if the Lakers, Milwaukee, Utah, or Miami win it all this year.
35 is a lot of decorated teammates, but not completely unusual: there are 100+ players in my dataset that have 20+ champs in their phonebook. 15 guys win rings every June, and with all the roster turnover in the NBA, they have a quite a few ex-teammates; ~175 players have played with the members of the 2018-19 Raptors.
What's intriguing about Shaq's champion teammates is how evenly distributed they were -- there was at least one from every single season between 1983-84 and 2017-18. It's a weird stat, a streak that existed only in retrospect. It wasn't until 2013 that someone at ESPN's Stats & Info desk noticed it, and it ended five years later when the Warriors won without Matt Barnes, who had been their previous link to the Big Kevin Bacon. Some people noted the ending of the streak, and speculated on who would take the crown. I compiled a database of players and teams to answer it conclusively.
Hover over the cells in the table to see which of that year's champions a player played with, and where.
I scraped roster data stretching back to the 1983-84 season from stats.nba.com, and fed it into a Neo4j graph database. Graph databases are a little different from typical relational databases. Unlike MySQL, which keeps its data in a series of tables not much different than an Excel spreadsheet, Neo4j puts everything into a giant network, full of nodes connected to each other by links. It's built to answer questions like this one, and here's what it came up with. In each row, after the player's name, you'll find their longest streak in parentheses. Darker shades indicate that the player has logged more time with the players on that year's championship team, based on seasons shared. (One caveat: I don't think stats.nba.com is as reliable as basketball-reference.)
If you're curious about graph databases, let me give you an idea of how they can be used, and some of the quirks involved.
First we need to load our database, and that means grabbing data. The endpoints at stats.nba.com are open to the public, so long as your request has a User-Agent header that isn't obviously from a headless browser. A query like this will get you a team's roster for a given season:
season = "1999-00" team = 1610612737 endpoint = "https://stats.nba.com/stats/commonteamroster?LeagueID=00&Season= {season}&TeamID={team_id}".format(season=season, team_id=team_id) response = r.get(endpoint, headers={ "User-Agent": '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36''' }) body = json.loads(response.text)["resultSets"] roster = body[0] columns = roster["headers"] columns = [ "teamId", "seasonPlayed", "leagueId", "playerName", "number", "position", "height", "weight", "birthDate", "age", "experience", "school", "playerId" ] player_data = roster["rowSet"]
I created a Pandas DataFrame for each of these results, like so:
df = pd.DataFrame.from_records(data=player_data, columns=columns)
... and after churning through all 30 teams and all 34 seasons between my start and end points (with a respectful sleep between each request, of course), I glued them all together in a concat
operation.
Once I had my database, I did a little post-processing to give myself some more fields that I knew I'd need to reference later in my queries. All this data wrangling took probably the most time, and was not particularly exciting, so I'll gloss over the rest of it.
One thing that would have helped was setting up a pipeline for getting data from Pandas into Neo4j. I kept making the classic mistake of thing that "well, this is just a fun exploration, I don't need a whole pipeline for this". But nothing makes a fun exploration tedious like copy-pasting a sequence of IMPORT
commands into Neo4j's browser, which doesn't allow for multiple queries to execute at once.
Had I known how many times I'd have to modify my schema and reimport CSVs, I would have taken the time to write up a Python script capable of firing off the few requests necessary to the REST API, which is not at all challenging.
Here's a look at some of the cypher queries I used to load my database.
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///teams.csv" AS row CREATE (:Team {name: row.teamName, abbreviation: row.teamAbbreviation, city: row.teamCity});
Most of these queries are taking a CSV and remapping its columns into some property names I like better. Here I'm stripping off "team" from a number of properties.
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///team-seasons.csv" AS row CREATE (:TeamSeason {teamId: row.primaryKey, code: row.teamId, fullName: row.fullName, name: row.teamName, displayName: row.displayName, abbreviation: row.teamAbbreviation, city: row.teamCity, season: row.season, champ: toBoolean(row.champ), runnerUp: toBoolean(row.runnerUp)});
You can load CSVs from regular URLs as well as files on your file system. In order to access an arbitrary directory in your load statement, you need to disable a security feature in your Neo4j configuration file, which is a text file to be found in the same folder as the binaries you execute to start your database. In these queries I'm not messing with that setting, so all these paths are being resolved relative to PATH_TO_MY_NEO4J_DIRECTORY/import
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///players.csv" AS row CREATE (:Player {name: row.playerName, number: row.number, position: row.position, height: row.height, weight: row.weight, birthDate: row.birthDate, school: row.school, playerId: row.playerId})
One more to get our players. By the way, these "period commits" are a way to keep your database from dropping dead of a memory error. Wasn't a problem for me at any point, it's just mentioned in most of the examples.
CREATE CONSTRAINT ON (p:Player) ASSERT p.playerId IS UNIQUE;
Putting a constraint on a node type will also index it, which is important for performance in the later queries. Without indexes, Neo4j can take quite awhile to find and merge your relationships.
CREATE INDEX ON :TeamSeason(code);
You can also put a plain old index like this.
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///player-seasons.csv" AS row MATCH (player:Player {playerId: row.playerId}) MATCH (team:TeamSeason {code: row.teamId, season: row.season}) MERGE (player)-[:PLAYED_FOR]->(team);
Here's the query that lets us really get to work. At this point, Neo4j knows about our players and the seasons, but not how they're related. So I feed it a CSV that represents the join table between the two, where each row indicates a player and where they played that season. I then match both those components, and MERGE
a relationship between them. Merging in Neo4j is like saying "find or create this relationship".
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///championship-belt.csv" AS row MATCH (priorChamp:TeamSeason {teamId: row.succeeded}) MATCH (newChamp:TeamSeason {teamId: row.primaryKey}) MERGE (priorChamp)<-[:SUCCEEDED]-(newChamp);
This one was important to know the lineage of championship teams, essential for determining a streak.
First, I wanted to explore my data, get a sense for who has the most champion teammates.
// Most champion teammates MATCH (bacon:Player)-->(baconTeam:TeamSeason)<-- (champPlayer:Player)-->(t:TeamSeason {champ: true}) RETURN bacon.name, collect(DISTINCT champPlayer.name), collect(DISTINCT t.displayName) as `champTeams`, count(DISTINCT t.teamId) as `champCount` ORDER BY champCount DESC LIMIT 100;
I saw Shaq and some other Lakers at the top, which gave me confidence that my data had been imported correctly. I made use of some aggregating functions like collect
and count
to give myself a more compact output. Keep in mind that a "row" to Neo4j is a single pattern match. We tend to think in terms of the object of interest, here a player, but Neo4j has been instructed to find a particular configuration of players and teams; that's what it'll return unless you tell it to bundle up some of that information.
When I was poking around my database, I was a little concerned to see certain players matched up. They never played together, right? To spot-check these relationships, I asked Neo4j for the shortest path between them.
// Shortest path between players MATCH path=shortestPath( (p1:Player {name:'Grant Hill'})-[:PLAYED_FOR*1..20]-(p2:Player {name:'Zaza Pachulia'}) ) RETURN path
And sure enough, Grant Hill did play with Zaza Pachulia, on the '03-04 Magic.
Hill's name was popping up quite a bit in my initial exploration, so I wanted a query that would let me quickly see on a year-by-year basis all his connections with a championship squad:
//BOILED DOWN CONNECTIONS MATCH (p:Player)-->(champTeam:TeamSeason {champ: true}) WITH p as champion, champTeam MATCH (bacon:Player {name: "Grant Hill"})-->(baconTeam:TeamSeason)<--(champion) RETURN bacon.name, collect(DISTINCT champion.name) as teammates, champTeam.abbreviation as abbr, champTeam.season as season ORDER BY champTeam.season ASC LIMIT 3
This gave me results like this:
╒════════════╤════════════════════════════╤══════╤═════════╕ │"bacon.name"│"teammates" │"abbr"│"season" │ ╞════════════╪════════════════════════════╪══════╪═════════╡ │"Grant Hill"│["Rick Mahorn","Joe Dumars"]│"DET" │"1988-89"│ ├────────────┼────────────────────────────┼──────┼─────────┤ │"Grant Hill"│["Joe Dumars"] │"DET" │"1989-90"│ ├────────────┼────────────────────────────┼──────┼─────────┤ │"Grant Hill"│["Horace Grant"] │"CHI" │"1990-91"│ └────────────┴────────────────────────────┴──────┴─────────┘
I should mention how important these WITH
statements are. Without them, I wasn't getting all of my results, and from what I could glean in my Googling, this had to do with cardinality errors, explained here.
After a lot more exploration, I was ready to find some streaks. I thought it'd be useful down the road to refer to the length of a player's best streak, so this adds a new property to matching players.
This is a monster query, so I'll annotate it inline.
MATCH (n1:Player)-->(t1:TeamSeason)<--(n2:Player) WITH n1, t1, collect(n2) as teammates UNWIND teammates as teammate MATCH (teammate)-->(champTeam:TeamSeason {champ: true}) WITH n1, collect(distinct champTeam) as champTeams // We need some way of finding the longest -- not the shortest -- path through // all the teams a player has connections to. While Neo4j is optimized for // finding short paths, we need to do something like this to find the longest. UNWIND champTeams as t UNWIND champTeams as u WITH * MATCH p=(t)<-[:SUCCEEDED*1..35]-(u) WHERE id(t) < id(u) AND u.teamId = "1610612744_2017-2018" // The chunk above finds all combinations of teams and uses them as the start // and endpoint of a streak. It would be all permutations of teams, but by // insisting that the id (an internal Neo4j designation, not anything I // assigned) of one is before the other, you're guaranteed to have the older // team first -- so long as your teams got fed into the database oldest first. // After getting all those teams, the database tries to find a path between // them. I added an additional constraint of having the endpoint be the most // recent champion, so I would only be seeing streaks that were still active // But there's a problem -- Neo4j was forgetting about the set of teams we // were working on originally. In order to find paths between two champions, // it might make use of a whole bunch of teams the player in question had no // relation to. So we want to constrain it by saying "make sure each node // along the path belongs to the championship teams our player is associated // with." Here's how that translates to Cypher: AND all(x in nodes(p) WHERE x IN champTeams) // Now we need to get the longest path. I don't know a better way for // plucking the max value of a result set in Cypher, so this may be // inefficient: WITH n1, p, length(p) as chainLength ORDER BY chainLength DESC WITH n1, collect(p)[..1] as paths UNWIND paths as path // We finally set a property with the path we found SET n1.longestActiveChain = length(path) return n1.name, extract(x in nodes(path) | x.displayName) as teamsPath, length(path) as chainLength ORDER BY chainLength DESC;
And here's the query that gave me the data for the D3 script responsible for the table above:
MATCH (n:Player) // I only want to see players with long streaks, so we ignore those who don't // have that property set, and grab the top 25 according to streak length. WHERE exists(n.longestChain) WITH collect(n) as players, n ORDER BY n.longestChain DESC LIMIT 25 // We've got the players, now let's try and find their connections to the // championship teams: MATCH (t:TeamSeason {champ: true}) WITH players, collect(t) as champTeams UNWIND players as n UNWIND champTeams as champTeam MATCH (n)-[:PLAYED_FOR]->(baconTeam:TeamSeason)<-[:PLAYED_FOR]-(p2:Player) -[:PLAYED_FOR]->(champTeam) RETURN DISTINCT n.name as player, n.longestChain as longestChain, p2.name as championTeammate, champTeam.displayName as champTeam, baconTeam.displayName as baconTeam ORDER BY n.name ASC
I recommend aliasing your output columns, so they've got tidy names when you send them into the next step of you process.
I'd love to tell you about how I continued into D3 for rendering this information... but it's lunch time! I hope you enjoyed seeing some of these queries. Graph databases are fascinating, and can really flex their muscles on problems like this one. How would I have even begun to approach this in something like MySQL? I would speculate, but again... lunch.