Introduction
This project aims to extract rating history data from the OlimpBase website. Because the official FIDE website only publishes data from January 2000 to now, there is a lot of missing data especially for older chess players. This is where OlimpBase comes in. Unlike, most chess websites, OlimpBase actually has rating information dating back to the late 70s.
Our procedure for this project will be web-scraping OlimpBase’s monthly rating records using individual player OlimpBase URLs. Although OlimpBase player URLs follow a fairly consistent pattern, there are inconsistencies that will make accessing these pages difficult. Additionally, because the website is outdated, extracting information will require a combination of new tools.
Data Preparation
Like previous web-scraping projects, we will be getting player names using FIDE’s downloadable monthly data. From there, we will take the names and combine them with the OlimpBase’s root URL so that we can access individual player pages.
Let’s load the necessary packages and libraries.
library(reticulate)
import pandas as pd
from bs4 import BeautifulSoup
import requests
from IPython.display import display
import pprint as pp
import lxml.html as lh
from googlesearch import search
from scrapy import Selector
Let’s load the September monthly data and convert the “B-day” variable to the numeric class.
file = 'C:/Users/laryl/Desktop/Data Sets/players_list_foa_sept.txt'
= pd.read_fwf(file)
chess 'B-day'] = pd.to_numeric(chess["B-day"])
chess[print(chess.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1022038 entries, 0 to 1022037
## Data columns (total 19 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 ID Number 1022038 non-null int64
## 1 Name 1022037 non-null object
## 2 Fed 1022038 non-null object
## 3 Sex 1022038 non-null object
## 4 Tit 19664 non-null object
## 5 WTit 3955 non-null object
## 6 OTit 930 non-null object
## 7 FOA 2984 non-null object
## 8 SRtng 372890 non-null float64
## 9 SGm 372890 non-null float64
## 10 SK 370667 non-null float64
## 11 RRtng 204379 non-null float64
## 12 RGm 204379 non-null float64
## 13 Rk 204379 non-null float64
## 14 BRtng 146562 non-null float64
## 15 BGm 146562 non-null float64
## 16 BK 146562 non-null float64
## 17 B-day 1022038 non-null int64
## 18 Flag 339531 non-null object
## dtypes: float64(9), int64(2), object(8)
## memory usage: 148.2+ MB
## None
Because our focus is older grandmasters, we need to filter for not only players with grandmaster titles but also players born before a particular year. Our future projects will only really be concerned with ratings after players reach the age of 10. Ten years before the start of the FIDE website data is 1990. So that will be our cutoff. Additionally, our future projects will be focused on the strongest chess players, so we can use the 2600 cutoff to reduce the players we need to gather information for.
= chess[(chess["B-day"] < 1990) & (chess["Tit"] == "GM")]
players_before_1990 = players_before_1990[players_before_1990["SRtng"] >= 2600]
players_before_1990_clean = players_before_1990_clean[["ID Number", "Name", "B-day"]]
players_before_1990_cut
pp.pprint(players_before_1990_cut.info())
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 149 entries, 12393 to 1021560
## Data columns (total 3 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 ID Number 149 non-null int64
## 1 Name 149 non-null object
## 2 B-day 149 non-null int64
## dtypes: int64(2), object(1)
## memory usage: 4.7+ KB
## None
With only 149 players to work with. It’s time to get started.
Our first real problem comes from the inconsistencies in the URLs of the chess players. Because some players have 3 names or have initials in their names, this makes it difficult to easily generate player URLs. So we are going to separate the chess players into “problematic” players and “clean” players using the amount of spaces between the players names as filters.
library(tidyverse)
# Count spaces between names
<- py$players_before_1990_cut %>%
players_before_1990_counted mutate(space_count = str_count(Name, " "))
# Problematic players with more than 1 space between names
<- players_before_1990_counted %>%
problematic_players_df filter(space_count > 1) %>%
filter(Name != "Li, Chao b" & Name != "Iturrizaga Bonelli, Eduardo" ) %>%
arrange(Name)
# Clean players with only 1 space between names
<- players_before_1990_counted %>%
clean_players_df filter(space_count == 1)
str(problematic_players_df)
## 'data.frame': 14 obs. of 4 variables:
## $ ID Number : num 3503739 3503240 14102536 5002150 3800024 ...
## $ Name : chr "Bruzon Batista, Lazaro" "Dominguez Perez, Leinier" "Fedorchuk, Sergey A." "Ganguly, Surya Shekhar" ...
## $ B-day : num 1982 1983 1981 1983 1967 ...
## $ space_count: int 2 2 2 2 3 2 2 3 2 2 ...
## - attr(*, "pandas.index")=Int64Index([ 12393, 14511, 25622, 25696, 33897, 39272, 41660,
## 46262, 47693, 48838,
## ...
## 972365, 973795, 979970, 980452, 989280, 997979, 1013959,
## 1014101, 1015338, 1021560],
## dtype='int64', length=149)
str(clean_players_df)
## 'data.frame': 133 obs. of 4 variables:
## $ ID Number : num 400041 10601619 13300580 13300032 4138147 ...
## $ Name : chr "Adams, Michael" "Adly, Ahmed" "Akobian, Varuzhan" "Akopian, Vladimir" ...
## $ B-day : num 1971 1987 1983 1971 1985 ...
## $ space_count: int 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "pandas.index")=Int64Index([ 12393, 14511, 25622, 25696, 33897, 39272, 41660,
## 46262, 47693, 48838,
## ...
## 972365, 973795, 979970, 980452, 989280, 997979, 1013959,
## 1014101, 1015338, 1021560],
## dtype='int64', length=149)
We had to exclude two problematic chess players because they did not have any records on OlimpBase.
In these next few code blocks, we are going to take our clean chess players and append their first and last names to a base URL so that we can access their OlimpBase webpages.
<- clean_players_df %>%
clean_players_df_updatedseparate(Name, c("Last", "First"), sep = ", ", remove = TRUE, convert = FALSE)
<- clean_players_df_updated$Last
lastname <- clean_players_df_updated$First firstname
= list(r.lastname)
lastname_list = list(r.firstname) firstname_list
= list(range(0,len(lastname_list),1))
index_python
= []
prepped_olimpbase_pages for p in index_python:
= "https://www.olimpbase.org/Elo/player/" + lastname_list[p] + ",%20" + firstname_list[p] + ".html"
player_page
prepped_olimpbase_pages.append(player_page)
print("Our clean list contains", len(prepped_olimpbase_pages), "webpages!")
## Our clean list contains 133 webpages!
As tempting as it would be to find a method that would allow us to create URLs from these problematic players, it is probably best to manually insert these 14 URLs.
= [
problematic_olimpbase_pages "https://www.olimpbase.org/Elo/player/Bruzon,%20Lazaro.html",
"https://www.olimpbase.org/Elo/player/Dominguez,%20Lenier.html",
"https://www.olimpbase.org/Elo/player/Fedorchuk,%20Sergey%20A..html",
"https://www.olimpbase.org/Elo/player/Ganguly,%20Surya%20Shekhar.html",
"https://www.olimpbase.org/Elo/player/Granda%20Zuniga,%20Julio%20E..html",
"https://www.olimpbase.org/Elo/player/Hodgson,%20Julian%20M..html",
"http://www.olimpbase.org/Elo/player/Illescas%20Cordoba,%20Miguel.html",
"https://www.olimpbase.org/Elo/player/Jones,%20Gawain%20C..html",
"https://www.olimpbase.org/Elo/player/Mc%20Shane,%20Luke%20J..html",
"https://www.olimpbase.org/Elo/player/Nielsen,%20Peter%20Heine.html",
"https://www.olimpbase.org/Elo/player/Sadler,%20Matthew.html",
"https://www.olimpbase.org/Elo/player/Short,%20Nigel%20D..html",
"https://www.olimpbase.org/Elo/player/Vallejo%20Pons,%20Francisco.html",
"https://www.olimpbase.org/Elo/player/Van%20Wely,%20Loek.html"
]
print("Our problematic list contains", len(problematic_olimpbase_pages), "webpages!")
## Our problematic list contains 14 webpages!
Let’s put all of the pages together.
= [*prepped_olimpbase_pages, *problematic_olimpbase_pages]
all_olimpbase_pages print("In total, our master list contains", len(all_olimpbase_pages), "webpages!")
## In total, our master list contains 147 webpages!
Webscraping
As mentioned above, one of the biggest issues with extracting information from OlimpBase is the outdated structure of its webpages. Essentially, the key thing that this web-scraper does is extract specific text elements using a combination of string matching functions and the Scrapy library.
But before doing any web-scraping, there is one more issue to resolve. There were 2 chess players who’s missing 1994 data caused my web-scraper to err. The 2 chess players were Garry Kasparov and Nigel Short. Why did they cause problems? It’s because in 1993 Kasparov and Short, frustrated with the FIDE regulations, split from the federation and held their own World Championship. This lead to FIDE stripping them of their titles, which explains why they were missing their 1994 rating data and caused problems for my web-scraper. For more information, check out this Wikipedia page.
This means we need to extract Kasparov and Short’s data separately.
# Extract Kasparov and Short's URLs from main lis.
"https://www.olimpbase.org/Elo/player/Kasparov,%20Garry.html")
all_olimpbase_pages.remove("https://www.olimpbase.org/Elo/player/Short,%20Nigel%20D..html")
all_olimpbase_pages.remove(= all_olimpbase_pages
all_olimpbase_pages_updated
print("There are now", len(all_olimpbase_pages_updated), "in our main list!")
## There are now 145 in our main list!
Let’s load the functions that will permit us to extract specific elements on the webpages.
# Load library for access to regular expressions
import re
# Finds rating information
def find_info(x):
= re.findall(r'(?<![0-9])[0-9]{4}(?!\.|\=|[0-9])', x)
y return(y)
# Merges strings together
def combine_strings(x):
= ", ".join((str(elements) for elements in x))
y return(y)
# Extracts player names
def find_names(x):
= re.findall(r'([A-Z][a-z]+)', x)
y return(y)
Let’s append all of the web-scraped information to a list.
# Extract date, rating, id, and names from webpage
def extract_data_clean_players(pages):
= []
player_information_list for page in pages:
= requests.get(page).content
html = Selector( text = html )
sel = sel.xpath( '//pre[2]/a/text()').extract()
dates = sel.xpath( '//pre[2]/text()').extract()
infos = [str(date) for date in dates]
dates_flat = [combine_strings(find_info(str(info))) for info in infos]
infos_flat = [x for x in infos_flat if x]
infos_clean = [combine_strings(find_names(str(name))) for name in infos]
names = [x for x in names if x]
names_clean = list(range(0,len(dates_flat),1))
indices for number in indices:
str(names_clean[number]) + ": " + str(dates_flat[number]) + ": " + str(infos_clean[number]))
player_information_list.append( return(player_information_list)
= extract_data_clean_players(all_olimpbase_pages_updated)
player_information_listprint("Our clean player list has", len(player_information_list), "elements!")
## Our clean player list has 2375 elements!
We will now convert the list into a data frame with a single column.
= pd.DataFrame(player_information_list, columns= ["Bio"])
player_information_df 10)) pp.pprint(player_information_df.head(
## Bio
## 0 Adams, Michael: Oct 2001: 2731
## 1 Adams, Michael: Jul 2001: 2744
## 2 Adams, Michael: Apr 2001: 2750
## 3 Adams, Michael: Jan 2001: 2746
## 4 Adams, Michael: Oct 2000: 2754
## 5 Adams, Michael: Jul 2000: 2755
## 6 Adams, Michael: Jan 2000: 2715
## 7 Adams, Michael: Jul 1999: 2708
## 8 Adams, Michael: Jan 1999: 2716
## 9 Adams, Michael: Jul 1998: 2715
Let’s use a similar scraper to extract Kasaprov and Short’s information and convert that into data frames.
# Save Kasparov and Short's URLs
= "https://www.olimpbase.org/Elo/player/Kasparov,%20Garry.html"
kasparov = "https://www.olimpbase.org/Elo/player/Short,%20Nigel%20D..html"
short
# Function that extracts all information on the webpages and removes 1994 records.
def create_df_problem_players(page):
= requests.get(page).content
html = Selector( text = html )
sel = sel.xpath( '//pre[2]/a/text()').extract()
dates = sel.xpath( '//pre[2]/text()').extract()
infos = [combine_strings(find_names(str(name))) for name in infos]
names = [x for x in names if x]
names_clean = [str(date) for date in dates]
dates_flat "Jan 1994")
dates_flat.remove("Jul 1994")
dates_flat.remove(= [combine_strings(find_info(str(info))) for info in infos]
infos_flat = [x for x in infos_flat if x]
infos_clean = {"Name": names_clean, "Date": dates_flat, "RTNG": infos_clean}
player_dic = pd.DataFrame(player_dic)
player_df return(player_df)
# Run function and save Kasparov and Short data as data frames.
= create_df_problem_players(kasparov)
kasparov_df_raw = create_df_problem_players(short) short_df_raw
Data Cleaning and Validation
Now that we have all of the data we need, let’s put it all together.
# Modify main data frame
<- py$player_information_df %>%
player_information_df_clean separate(Bio, c("Name", "Date", "RTNG"), sep = ": ", remove = TRUE, convert = FALSE)
# Append data frames
<- rbind(player_information_df_clean, py$kasparov_df_raw, py$short_df_raw)
combined_player_info str(combined_player_info)
## 'data.frame': 2466 obs. of 3 variables:
## $ Name: chr "Adams, Michael" "Adams, Michael" "Adams, Michael" "Adams, Michael" ...
## $ Date: chr "Oct 2001" "Jul 2001" "Apr 2001" "Jan 2001" ...
## $ RTNG: chr "2731" "2744" "2750" "2746" ...
On the OlimpBase website, I discovered that some chess players had multiple IDs because of federation switches and FIDE ID procedure changes. For this reason, we will use the September monthly rating supplement from the beginning of this project to ensure that each chess player has a consistent current FIDE ID.
<- py$players_before_1990_cut %>%
supplemental_player_infofilter(Name != "Li, Chao b" & Name != "Iturrizaga Bonelli, Eduardo" )
# Merge main data with supplemental data using Name string distances.
library(fuzzyjoin)
<- combined_player_info %>%
combined_player_info_almost #Change the names that don't merge easily with supplemental data
mutate(Name = dplyr::recode(Name,
"Bruzon, Lazaro" = "Bruzon Batista, Lazaro",
"Dominguez, Lenier" = "Dominguez Perez, Leinier",
"Jones, Gawain" = "Jones, Gawain C B"
%>%
)) stringdist_left_join(supplemental_player_info, by = "Name", method= "qgram", max_dist = 3) %>%
select(Name.y, `ID Number`, Date, RTNG)%>%
rename(Name = Name.y)
The last step is to convert the columns to their proper data types and filter for data before January 2000 because we already have that information. Note that FIDE published only January and July monthly ratings until the 2000s.
library(lubridate)
<- combined_player_info_almost %>%
top_grandmaster_ratings_before_2000 mutate(RTNG = as.numeric(RTNG)) %>%
mutate(Date = parse_date_time(Date, "%b %Y")) %>%
filter(Date <="1999-07-01")
top_grandmaster_ratings_before_2000
Let’s export our data.
#write.csv(top_grandmaster_ratings_before_2000,"C:/Users/laryl/Desktop/Data Sets//top_grandmaster_ratings_before_2000.csv")
Conclusion
Although this project was short, it was extremely challenging. The majority of my time was used to find a method that could produce URLs and extract the deeply nested data within them. One of the biggest takeaways from this project was the amount of inconsistencies found in the FIDE data. Being able to see how FIDE’s data management has improved over the years proved to be more interesting than I originally thought. With the data acquired through this project and the previous chess web-scraping projects, we can now have fun creating visualizations that give us a better understanding of chess players growth over the last 50 years.
Sources
For the September chess player ratings, check out the FIDE Website.
For access to chess player records before 2000, see the OlimpBase’s monthly ratings.
For more information about the 1993 split, see this Wikipedia page.