DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Enterprise AI Trend Report: Gain insights on ethical AI, MLOps, generative AI, large language models, and much more.

2024 Cloud survey: Share your insights on microservices, containers, K8s, CI/CD, and DevOps (+ enter a $750 raffle!) for our Trend Reports.

PostgreSQL: Learn about the open-source RDBMS' advanced capabilities, core components, common commands and functions, and general DBA tasks.

AI Automation Essentials. Check out the latest Refcard on all things AI automation, including model training, data security, and more.

Related

  • Unveiling the Clever Way: Converting XML to Relational Data
  • Enhancing Performance: Optimizing Complex MySQL Queries for Large Datasets
  • Recovering an MS SQL Database From Suspect Mode: Step-By-Step Guide
  • Keep Calm and Column Wise

Trending

  • C4 PlantUML: Effortless Software Documentation
  • AWS Fargate: Deploy and Run Web API (.NET Core)
  • Code Complexity in Practice
  • The Impact of Biometric Authentication on User Privacy and the Role of Blockchain in Preserving Secure Data
  1. DZone
  2. Data Engineering
  3. Databases
  4. Graph Databases and Baseball

Graph Databases and Baseball

Baseball is a statistics-driven sport and has been since its inception. Can a graph database be used to further analyze the data? You'd be surprised...

By 
Scott Sosna user avatar
Scott Sosna
DZone Core CORE ·
Jul. 06, 23 · Tutorial
Like (7)
Save
Tweet
Share
3.4K Views

Join the DZone community and get the full member experience.

Join For Free

In my opinion, stating that baseball remains America's Pastime in 2023 is a tough argument to make, as it often is the third-most popular sport in the United States. However, it is very obvious that baseball has been and always will be statistics-driven: baseball-reference.com has data all the way back to the 1876 National League season. Two franchises from that season still exist — the Chicago Cubs and Atlanta Braves — but interestingly, both used a name subsequently used by their later-arriving American League brethren.

And the data available continues to grow: Retrosheet researchers continue to search for box scores and play-by-plays for older seasons, their latest release including box scores and play-by-play for the 1919 and 1920 seasons. Occasionally this research changes individuals' stats. The data is available for download if interested.

Transactions

As a fan of the Oakland (soon-to-be Las Vegas) Athletic, I harass my wife constantly by identifying the former Oakland players on other teams — a bone-crushingly simple exercise in recent decades as stars are traded for prospects instead of retaining them at higher salaries. Very frustrating.

Similarly, there's the ex-Cubs Factor theory, tying teams' playoff success (or lack thereof) to the number of ex-Cubs who played for each team in a playoff series or World Series; fortunately, it did not prevent Oakland in 1973 from winning the World Series (though mitigating factors are claimed).

In recent years the Minnesota Twins seem hesitant to trade promising-but-underperforming young players because of  David Ortiz's success in Boston, keeping Miguel Sano until his struggles were too obvious — and painful to ignore, who is currently out of baseball after being released following the 2022 season.

With this background, I decided to explore player movements between teams over time. Through the 1975 season, baseball's Reserve Clause made it basically impossible for players to have any say in where they played or what they were paid; that said, players could change teams if owners decided to trade or sell players.  Free agency was introduced before the 1976 season, dramatically changing the game.

Fortunately — and perhaps unsurprisingly — data about all player transactions are available for analysis, and I thought that I could use Neo4J to interrogate the data.

Transactions

As a fan of the Oakland (soon-to-be Las Vegas) Athletic, I harass my wife constantly by identifying the former Oakland players on other teams — a bone-crushingly simple exercise in recent decades as stars are traded for prospects instead of retaining them at higher salaries. Very frustrating.

Similarly, there's the ex-Cubs Factor theory, tying teams' playoff success (or lack thereof) to the number of ex-Cubs who played for each team in a playoff series or World Series; fortunately, it did not prevent Oakland in 1973 from winning the World Series (though mitigating factors are claimed).

In recent years the Minnesota Twins seem hesitant to trade promising-but-underperforming young players because of  David Ortiz's success in Boston, keeping Miguel Sano until his struggles were too obvious — and painful to ignore, who is currently out of baseball after being released following the 2022 season.

With this background, I decided to explore player movements between teams over time. Through the 1975 season, baseball's Reserve Clause made it basically impossible for players to have any say in where they played or what they were paid; that said, players could change teams if owners decided to trade or sell players. Free agency was introduced before the 1976 season, dramatically changing the game.

Fortunately — and perhaps unsurprisingly — data about all player transactions are available for analysis, and I thought that I could use Neo4J to interrogate the data.

Raw Data

As mentioned above, Retrosheet has an abundance of raw data available, of which the player and transaction data appeared relevant. Future enhancements may incorporate additional data, such as batting average, earned run average, wins above replacement, etc.

Players

Retrosheet's biographical database contains information about players, managers, coaches, and umpires, including name, birth and death information, first and last appearance date, bat, throws, height, weight, death, and cemetery information. The values are comma-separated and double-quoted only when required.

The biographic zip file is available here.

Transactions

Retrosheet's transaction database contains the details of all transactions: e.g., transaction date, the player involved, the team(s) involved, and the transaction type. The values are comma-separated, and every string is double-quoted, whether required or not.

Baseball has approximately sixty transaction types that describe how players join or leave teams, from the mundane — player drafted, traded, sold — to the interesting — jumping teams, refusing to report, purchase voided.

When the team or teams involved are from the three baseball major leagues — e.g., American, Federal, National — the well-known team abbreviation is used: LAD for Los Angeles Dodgers, BRO for Brooklyn Dodgers, etc. For all other leagues, the team and league names are explicitly spelled out. e.g., Providence of the International League.

The transactions zip file is available here.

Note #1: the home page provides links for year-by-year transactions; however, the page has not been updated since 2020. The zip file contains all transactions since 2022.

Note #2: some Negro League data is available but is separated from the other major leagues. I have not yet looked into what is available and how it may be used.

Note #3: I only loaded post-1900 transaction data as pre-1900 transaction data is less complete and consistent and required additional exception handling. And let's be honest: how many of you even know that the 1899 Cleveland Spiders lost 134 games, much less who was on their roster?

Teams

Retrosheet does not provide a definitive list of teams, so I created my own.

The file teams.txt is a list of all current major league teams or, for the Federal Leagues, the teams that existed when the league folded. No American or National League team has folded since the American League was founded in 1901.

The file teammap.txt allows us to aggregate transactions for a franchise, e.g., BRO maps to LAD, combining Dodger transactions for both Brooklyn and Los Angeles into the current Los Angeles location.

Transactions Types

Though the many transaction types uniquely describe how players moved between teams, the transaction types can be grouped into three basic types. 

To Transaction

A player joins a team after being unsigned by any team, e.g., drafted out of high school or college or signed as an international player.

This shows that John Samuel Wilson was drafted by the Red Sox; the transaction date (October 1926) is a property on the relationship.

This shows that John Samuel Wilson was drafted by the Red Sox; the transaction date (October 1926) is a property on the relationship.

From Transaction

A player leaves or is dissociated with their team after their signed contract is invalidated, e.g., the contract expires, the player is released by the team, or the player retires.

Since a player must be signed to a team before leaving a team, this graph shows that John Cornelius Ray was traded to the Angels before the 1988 season before being released.

Since a player must be signed to a team before leaving a team, this graph shows that John Cornelius Ray was traded to the Angels before the 1988 season before being released.

FromTo Transaction

A player disassociates from one team and joins another, most often through a trade or sale of the player.

Roger Maris, best known for breaking Babe Ruth's season home run mark in 1961, was traded twice, though, from the graph, it's impossible to determine the specifics.

Roger Maris, best known for breaking Babe Ruth's season home run mark in 1961, was traded twice, though, from the graph, it's impossible to determine the specifics. [In fact, he was traded by Cleveland to the Athletics in 1958 and then to the Yankees in 1960.]

Extreme Player Movements

Octavio Dotel played for 13 teams, none more than once.

This is only the second-most, as Edwin Jackson played for 14 teams.

This is only the second-most, as Edwin Jackson played for 14 teams.

Examples

Team Frequency

For a pair of teams, how many players have been exchanged, either through trade or sale? Most are trades, but scrolling through the results you do see player sales as well.

Cypher Query

Cypher
 
MATCH (t1:Team)-[tx1]->(p:Player)-[tx2]->(t2:Team)
WHERE tx1.retrosheetId = tx2.retrosheetId AND t1.retrosheetId <> t2.retrosheetId 
RETURN t1.retrosheetId, t2.retrosheetId, tx1.transactionType, COUNT(p)
ORDER BY COUNT(p) DESC, tx1.transactionType, t1.retrosheetId, t2.retrosheetId


Results

Results

Additional Notes

This query would not be difficult in a relational database or even in Excel but demonstrates a Neo4J Cypher query using the data model. An easy place to start.

Teammates during a career

With whom did a player play, by the team, and how (transaction type) did they join the team?

This query uses retired pitcher Jon Lester as the base player.

Cypher Query

Cypher
 
MATCH (bp:Player {firstName: "Jonathan Tyler", lastName: "Lester"})-[tx1]-(bt:Team)
WITH bt.retrosheetId AS teamId,
     bp.playerDebut AS debut,
     MAX(tx1.transactionDate) AS lastTxn
MATCH (ot:Team)<-[tx2]-(op:Player)
WHERE ot.retrosheetId = teamId AND 
      tx2.transactionDate >= debut AND
      tx2.transactionDate <= lastTxn
RETURN teamId,
       op.name,
       tx2.transactionDate,
       tx2.transactionType
ORDER BY teamId,
         op.name,
         tx2.transactionDate


Results

Players Returning To the Same Team

Identify players who left a team and later returned, including details about the team and the transaction.

This query only considers players who debuted in the 1960s, which is a property on the Player node.

Cypher Query

Cypher
 
MATCH (t1:Team)-[tx1]->(p:Player)-[tx2]->(t2:Team)
WHERE DATETIME(p.playerDebut) > DATETIME('1960-01-01T00:00:00')
  AND DATETIME(p.playerDebut) > DATETIME('1970-01-01T00:00:00') 
  AND t1 = t2 
  AND DATETIME(tx1.transactionDate) < DATETIME(tx2.transactionDate)
RETURN p.name, 
       t1.name, tx1.transactionType, tx1.transactionDate,
       tx2.transactionType, tx2.transactionDate


Results

Results2

How To Read

  • Stan Bahnsen became a free agent in October 1980 and resigned from the team in December 1980.
  • Rich Baney was sold by the Orioles in June 1971 but returned to the Orioles in September 1971 for unknown reasons.

Players Traded Back To Original Team by Number of Days

Identify players who were traded between two teams, ordered by the number of days between the transactions.

Cypher Query

Cypher
 
MATCH (t1:Team)-[tx1:TRADED_FROM]->(p:Player)-[tx2:TRADED_TO]->(t2:Team)-[tx3:TRADED_FROM]->(p:Player)-[tx4:TRADED_TO]->(t1:Team)
WHERE tx1.retrosheetId = tx2.retrosheetId 
  AND tx3.retrosheetId = tx4.retrosheetId 
  AND t1.retrosheetId IS NOT NULL
  AND t2.retrosheetId IS NOT NULL 
  AND tx1.transactionDate < tx3.transactionDate 
RETURN p.name, 
       t1.name AS OriginalTeam, 
       t2.name AS TradedTeam, 
       tx1.transactionDate AS Trade1Date, 
       tx3.transactionDate AS Trade2Date, 
       Duration.inDays(DATETIME(tx1.transactionDate), DATETIME(tx3.transactionDate)).days as DaysBetween 
ORDER BY DaysBetween, 
         p.name, 
         t1.name, 
         t2.name


Results

Results3

How To Read

Colin David Rea was traded from the Padres to the Marlins and back to the Padres within 3 days.

Closing Thoughts

These examples scratch the surface of how the transactions can be analyzed.

With additional player information, e.g., their lifetime statistics, more detailed analysis could be achieved, such as using WAR to provide a value to a transaction.

The most awkward aspect of this data model is ordering transactions, as relationships have no concept of sequencing. Sometimes the transaction date is easy to use, but not in all situations.

Database Data (computing) Database transaction Graph Query Language

Opinions expressed by DZone contributors are their own.

Related

  • Unveiling the Clever Way: Converting XML to Relational Data
  • Enhancing Performance: Optimizing Complex MySQL Queries for Large Datasets
  • Recovering an MS SQL Database From Suspect Mode: Step-By-Step Guide
  • Keep Calm and Column Wise

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: