Introduction
My goal for this project is to gather as many chess grandmaster birthdates and birthplaces as possible using web-scraping tools. Although a great deal of chess player data is publicly available, more personal data can be quite difficult to find, especially for lesser-known players. To deal with this problem, I will divide this project into two parts.
In part 1, I will use a combination of web-scraping tools to extract birthdates and birthplaces from the Wikipedia pages of chess players rated 2600 or higher. I set this threshold because players who reach a 2600 rating usually become more visible in the chess world and are more likely to have a Wikipedia page.
In part 2, for grandmasters below this rating threshold, I will extract almost all of their birth information from Wikipedia's list of chess grandmasters page (see the sources for why I am using an older version of this page). I do not use this method for all of the grandmasters because I found this page only after I had devised my own web-scraping methods for part 1. Using a more complex framework for the first part also gives me the opportunity to practice new Python and R web-scraping tools.
I do make use of the list of grandmasters page in the first part of the project when my web-scraper does not extract the correct information. There are some conflicts that arise between the list of grandmasters page and the players’ Wikipedia pages, so as we progress through this project, I will detail my process for solving these inconsistencies.
Part 1: Getting 2600 Birth Information
Data Preparation
Before anything else, we need to load the package that permits us to use both R and Python.
library(reticulate)
Let’s load the libraries we need for Python.
import pandas as pd
from bs4 import BeautifulSoup
import requests
from IPython.display import display
import pprint as pp
import lxml.html as lh
Next, we will be loading our chess rating data, which we obtained from the FIDE website. I chose the September ratings.
file = 'C:/Users/laryl/Desktop/Data Sets/players_list_foa_sept.txt'
chess = pd.read_fwf(file)
Let’s filter for players rated 2600 or higher.
grandmaster_2600_raw = chess[chess['SRtng'] >= 2600]
display(grandmaster_2600_raw.info())
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 266 entries, 1963 to 1021560
## Data columns (total 19 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 ID Number 266 non-null int64
## 1 Name 266 non-null object
## 2 Fed 266 non-null object
## 3 Sex 266 non-null object
## 4 Tit 266 non-null object
## 5 WTit 2 non-null object
## 6 OTit 3 non-null object
## 7 FOA 0 non-null object
## 8 SRtng 266 non-null float64
## 9 SGm 266 non-null float64
## 10 SK 266 non-null float64
## 11 RRtng 255 non-null float64
## 12 RGm 255 non-null float64
## 13 Rk 255 non-null float64
## 14 BRtng 253 non-null float64
## 15 BGm 253 non-null float64
## 16 BK 253 non-null float64
## 17 B-day 266 non-null int64
## 18 Flag 29 non-null object
## dtypes: float64(9), int64(2), object(8)
## memory usage: 41.6+ KB
## None
From the output above, we have 266 players to find information for. Because the players’ last names are first, we need to flip them with their first names so that our search queries are more effective later on. Luckily, we can easily do this in R and clean up some of the names at the same time. Here is a preview of the data:
# Flip the players' names
library(tidyverse)
grandmaster_2600_r <- py$grandmaster_2600_raw %>%
  separate(Name, c("Last", "First"), sep = ", ", remove = TRUE, convert = FALSE) %>%
  unite(Name, First, Last, sep = " ") %>%
  arrange(Name)
# Fix some problematic names for the next phase
grandmaster_2600_cleanish <- grandmaster_2600_r %>%
  mutate(Name = dplyr::recode(Name,
"A.R. Saleh Salem" = "Salem Saleh (chess player)",
"B. Adhiban" = "Adhiban Baskaran",
"Chao b Li" = "Li Chao (chess player)",
"Chithambaram VR. Aravindh" = "Aravindh Chithambaram",
"David W L Howell" = "David Howell (chess player)",
"Fabiano Caruana" = "Fabiano Caruana wikipedia",
"Robert Hovhannisyan" = "Robert Hovhannisyan en wikipedia" ,
"Hao Wang" = "Wang Hao (chess player)",
"NA Narayanan.S.L" = "S.L. Narayanan",
"NA Nihal Sarin" = "Nihal Sarin",
"NA Praggnanandhaa R"= "Rameshbabu Praggnanandhaa",
"Johan-Sebastian Christiansen" = "Johan-Sebastian Christiansen wikipedia",
"Konstantin Landa" = "Konstantin Landa wikipedia"))
head(grandmaster_2600_cleanish)
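The "NA" prefixes in names such as "NA Nihal Sarin" most likely arise because those FIDE entries contain no ", " separator, so separate() leaves the first-name field missing. As a rough cross-check, the same flip can be sketched in plain Python. The helper name flip_name and the sample names are illustrative assumptions, not part of the pipeline above.

```python
def flip_name(fide_name: str) -> str:
    """Turn 'Last, First' into 'First Last'; leave names without ', ' as-is."""
    last, sep, first = fide_name.partition(", ")
    if not sep:
        # No separator: nothing to flip, so avoid producing an 'NA' prefix.
        return fide_name.strip()
    return f"{first.strip()} {last.strip()}"


# Hypothetical sample inputs in the FIDE "Last, First" style.
samples = ["Carlsen, Magnus", "Nihal Sarin", "Van Foreest, Jorden"]
flipped = [flip_name(name) for name in samples]
print(flipped)
```

Handling the missing separator at flip time would reduce, but probably not eliminate, the manual recoding shown above, since several names need Wikipedia-specific disambiguation anyway.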
Getting Data Using Google and Wikipedia
The Wikipedia URLs of chess players are fairly consistent. The formula for any person is (and we’ll use the world champion as an example):
the base URL(https://en.wikipedia.org/wiki/) + First Name(Magnus) + Last Name(_Carlsen) = https://en.wikipedia.org/wiki/Magnus_Carlsen
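The formula can be written as a few lines of Python. This is only a sketch of the naive approach: the function name naive_wiki_url is an assumption made for illustration, and the function does not check that the resulting page actually exists.

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"


def naive_wiki_url(first_name: str, last_name: str) -> str:
    """Build the URL that the formula predicts for a player."""
    title = f"{first_name} {last_name}".replace(" ", "_")
    # quote() percent-encodes non-ASCII characters such as accented letters.
    return BASE_URL + quote(title)


print(naive_wiki_url("Magnus", "Carlsen"))
```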
However, because the data set was made by FIDE and not Wikipedia, there are a number of inconsistencies that will arise if we just blindly follow the formula above. Here are some of the issues:
- Names from data set could be misspelled
- Data set first and last names could be flipped (even after we rearranged them)
- Names that are long are sometimes abbreviated
- Wikipedia page titles are sometimes case sensitive (especially for Dutch players)
- Player Wikipedia pages sometimes exist only in other languages (some of the Spanish and Latin American players only have pages in Spanish)
- Some players have special characters in their names (like accents from other languages)
- Some players have common names (so we need to ensure that we get the chess player)
- Some players may not even have a Wikipedia page
With all of these issues in mind, I decided to use Google searches to ensure that I only get Wikipedia URLs that exist. Luckily, there is a Python library (googlesearch) that returns webpage links for search queries made in Python. So let’s build our queries from the names in our data set.
grandmaster_list = list(r.grandmaster_2600_cleanish["Name"])

grandmaster_list_wiki_query = [player + ' wiki chess player' for player in grandmaster_list]
pp.pprint(grandmaster_list_wiki_query)
## ['Salem Saleh (chess player) wiki chess player',
## 'Abhijeet Gupta wiki chess player',
## 'Ahmed Adly wiki chess player',
## 'Alan Pichot wiki chess player',
## 'Aleksandar Indjic wiki chess player',
## 'Aleksandr Lenderman wiki chess player',
## 'Aleksandr Rakhmanov wiki chess player',
## 'Aleksandra Goryachkina wiki chess player',
## 'Aleksey Dreev wiki chess player',
## 'Alexander Areshchenko wiki chess player',
## 'Alexander Chernin wiki chess player',
## 'Alexander Donchenko wiki chess player',
## 'Alexander Galkin wiki chess player',
## 'Alexander Grischuk wiki chess player',
## 'Alexander Ipatov wiki chess player',
## 'Alexander Khalifman wiki chess player',
## 'Alexander Moiseenko wiki chess player',
## 'Alexander Morozevich wiki chess player',
## 'Alexander Motylev wiki chess player',
## 'Alexander Onischuk wiki chess player',
## 'Alexander Riazantsev wiki chess player',
## 'Alexandr Predke wiki chess player',
## 'Alexei Shirov wiki chess player',
## 'Alexey Sarana wiki chess player',
## 'Alireza Firouzja wiki chess player',
## 'Anatoly Karpov wiki chess player',
## 'Andrei Volokitin wiki chess player',
## 'Andrey Esipenko wiki chess player',
## 'Anish Giri wiki chess player',
## 'Ante Brkic wiki chess player',
## 'Anton Demchenko wiki chess player',
## 'Anton Korobov wiki chess player',
## 'Anton Kovalyov wiki chess player',
## 'Anton Smirnov wiki chess player',
## 'Aram Hakobyan wiki chess player',
## 'Arkadij Naiditsch wiki chess player',
## 'Arman Pashikian wiki chess player',
## 'Aryan Tari wiki chess player',
## 'Adhiban Baskaran wiki chess player',
## 'Bartosz Socko wiki chess player',
## 'Bassem Amin wiki chess player',
## 'Benjamin Bok wiki chess player',
## 'Benjamin Gledura wiki chess player',
## 'Bogdan-Daniel Deac wiki chess player',
## 'Boris Alterman wiki chess player',
## 'Boris Gelfand wiki chess player',
## 'Boris Grachev wiki chess player',
## 'Li Chao (chess player) wiki chess player',
## 'Aravindh Chithambaram wiki chess player',
## 'Christian Bauer wiki chess player',
## 'Constantin Lupulescu wiki chess player',
## 'Cristobal Henriquez Villagra wiki chess player',
## 'Daniel Fridman wiki chess player',
## 'Daniel Naroditsky wiki chess player',
## 'Daniel Stellwagen wiki chess player',
## 'Daniele Vocaturo wiki chess player',
## 'Daniil Dubov wiki chess player',
## 'Dariusz Swiercz wiki chess player',
## 'Darmen Sadvakasov wiki chess player',
## 'David Anton Guijarro wiki chess player',
## 'David Baramidze wiki chess player',
## 'David Navara wiki chess player',
## 'David Paravyan wiki chess player',
## 'David Howell (chess player) wiki chess player',
## 'Denis Khismatullin wiki chess player',
## 'Dimitrios Mastrovasilis wiki chess player',
## 'Dmitrij Kollars wiki chess player',
## 'Dmitry Andreikin wiki chess player',
## 'Dmitry Jakovenko wiki chess player',
## 'Dmitry Kononenko wiki chess player',
## 'Eduardo Iturrizaga Bonelli wiki chess player',
## 'Emil Sutovsky wiki chess player',
## 'Eric Hansen wiki chess player',
## 'Ernesto Inarkiev wiki chess player',
## "Erwin L'Ami wiki chess player",
## 'Etienne Bacrot wiki chess player',
## 'Evgeniy Najer wiki chess player',
## 'Evgeny Alekseev wiki chess player',
## 'Evgeny Bareev wiki chess player',
## 'Evgeny Shtembuliak wiki chess player',
## 'Evgeny Tomashevsky wiki chess player',
## 'Fabiano Caruana wikipedia wiki chess player',
## 'Farrukh Amonatov wiki chess player',
## 'Ferenc Berkes wiki chess player',
## 'Francisco Vallejo Pons wiki chess player',
## 'Gabriel Sargissian wiki chess player',
## 'Gadir Guseinov wiki chess player',
## 'Garry Kasparov wiki chess player',
## 'Gata Kamsky wiki chess player',
## 'Gawain C B Jones wiki chess player',
## 'Georg Meier wiki chess player',
## 'Georgy Pilavov wiki chess player',
## 'Giovanni Vescovi wiki chess player',
## 'Grigoriy Oparin wiki chess player',
## 'Grzegorz Gajewski wiki chess player',
## 'Haik M. Martirosyan wiki chess player',
## 'Hans Moke Niemann wiki chess player',
## 'Wang Hao (chess player) wiki chess player',
## 'Hikaru Nakamura wiki chess player',
## 'Hrant Melkumyan wiki chess player',
## 'Hristos Banikas wiki chess player',
## 'Hua Ni wiki chess player',
## 'Ian Nepomniachtchi wiki chess player',
## 'Igor Kovalenko wiki chess player',
## 'Igor Lysyj wiki chess player',
## 'Igors Rausis wiki chess player',
## 'Ildar Khairullin wiki chess player',
## 'Ilia Smirin wiki chess player',
## 'Illya Nyzhnyk wiki chess player',
## 'Ioannis Papaioannou wiki chess player',
## 'Ivan Cheparinov wiki chess player',
## 'Ivan Popov wiki chess player',
## 'Ivan Salgado Lopez wiki chess player',
## 'Ivan Saric wiki chess player',
## 'Jaime Santos Latasa wiki chess player',
## 'Jan-Krzysztof Duda wiki chess player',
## 'Jan Gustafsson wiki chess player',
## 'Jeffery Xiong wiki chess player',
## 'Jeroen Piket wiki chess player',
## 'Jianchao Zhou wiki chess player',
## 'Jiangchuan Ye wiki chess player',
## 'Jinshi Bai wiki chess player',
## 'Joel Lautier wiki chess player',
## 'Johan-Sebastian Christiansen wikipedia wiki chess player',
## 'Jon Ludvig Hammer wiki chess player',
## 'Jorden Van Foreest wiki chess player',
## 'Jorge Cori wiki chess player',
## 'Jose Eduardo Martinez Alcantara wiki chess player',
## 'Judit Polgar wiki chess player',
## 'Jules Moussard wiki chess player',
## 'Julian M Hodgson wiki chess player',
## 'Julio E Granda Zuniga wiki chess player',
## 'Jun Zhao wiki chess player',
## 'Kacper Piorun wiki chess player',
## 'Karen H. Grigoryan wiki chess player',
## 'Kirill Alekseenko wiki chess player',
## 'Kirill Shevchenko wiki chess player',
## 'Konstantin Landa wikipedia wiki chess player',
## 'Krishnan Sasikiran wiki chess player',
## 'Laurent Fressinet wiki chess player',
## 'Lazaro Bruzon Batista wiki chess player',
## 'Leinier Dominguez Perez wiki chess player',
## 'Levon Aronian wiki chess player',
## 'Liren Ding wiki chess player',
## 'Liviu-Dieter Nisipeanu wiki chess player',
## 'Loek Van Wely wiki chess player',
## 'Luka Lenic wiki chess player',
## 'Luke J McShane wiki chess player',
## 'M. Amin Tabatabaei wiki chess player',
## 'Magnus Carlsen wiki chess player',
## 'Maksim Chigaev wiki chess player',
## 'Manuel Petrosyan wiki chess player',
## 'Marin Bosiocic wiki chess player',
## 'Markus Ragger wiki chess player',
## 'Martyn Kravtsiv wiki chess player',
## 'Mateusz Bartel wiki chess player',
## 'Matthew D Sadler wiki chess player',
## 'Matthias Bluebaum wiki chess player',
## 'Maxim Matlakov wiki chess player',
## 'Maxime Lagarde wiki chess player',
## 'Maxime Vachier-Lagrave wiki chess player',
## 'Michael Adams wiki chess player',
## 'Miguel Illescas Cordoba wiki chess player',
## 'Miguel Santos Ruiz wiki chess player',
## 'Mikhail Al. Antipov wiki chess player',
## 'Mikhail Kobalia wiki chess player',
## 'Murali Karthikeyan wiki chess player',
## 'Mustafa Yilmaz wiki chess player',
## 'NA Erigaisi Arjun wiki chess player',
## 'S.L. Narayanan wiki chess player',
## 'Nihal Sarin wiki chess player',
## 'Rameshbabu Praggnanandhaa wiki chess player',
## 'Ngoc Truong Son Nguyen wiki chess player',
## 'Nigel D Short wiki chess player',
## 'Nijat Abasov wiki chess player',
## 'Nikita Vitiugov wiki chess player',
## 'Nils Grandelius wiki chess player',
## 'Nodirbek Abdusattorov wiki chess player',
## 'Nodirbek Yakubboev wiki chess player',
## 'Olexandr Bortnyk wiki chess player',
## 'Parham Maghsoodloo wiki chess player',
## 'Parimarjan Negi wiki chess player',
## 'Pavel Eljanov wiki chess player',
## 'Pavel Ponkratov wiki chess player',
## 'Pentala Harikrishna wiki chess player',
## 'Peter Heine Nielsen wiki chess player',
## 'Peter Leko wiki chess player',
## 'Peter Svidler wiki chess player',
## 'Pouya Idani wiki chess player',
## 'Quang Liem Le wiki chess player',
## 'Qun Ma wiki chess player',
## 'Radoslaw Wojtaszek wiki chess player',
## 'Rasmus Svane wiki chess player',
## 'Rauf Mamedov wiki chess player',
## 'Raunak Sadhwani wiki chess player',
## 'Ray Robson wiki chess player',
## 'Richard Rapport wiki chess player',
## 'Rinat Jumabayev wiki chess player',
## 'Robert Hovhannisyan en wikipedia wiki chess player',
## 'Robin Van Kampen wiki chess player',
## 'Ruslan Ponomariov wiki chess player',
## 'Rustam Kasimdzhanov wiki chess player',
## 'S.P. Sethuraman wiki chess player',
## 'Sam Shankland wiki chess player',
## 'Samuel Sevian wiki chess player',
## 'Samvel Ter-Sahakyan wiki chess player',
## 'Sanan Sjugirov wiki chess player',
## 'Sandro Mareco wiki chess player',
## 'Santosh Gujrathi Vidit wiki chess player',
## 'Sergei Azarov wiki chess player',
## 'Sergei Movsesian wiki chess player',
## 'Sergei Rublevsky wiki chess player',
## 'Sergei Tiviakov wiki chess player',
## 'Sergey A. Fedorchuk wiki chess player',
## 'Sergey Karjakin wiki chess player',
## 'Shakhriyar Mamedyarov wiki chess player',
## 'Shanglei Lu wiki chess player',
## 'Shant Sargsyan wiki chess player',
## 'Suri Vaibhav wiki chess player',
## 'Surya Shekhar Ganguly wiki chess player',
## 'Tamas Banusz wiki chess player',
## 'Tamir Nabaty wiki chess player',
## 'Teimour Radjabov wiki chess player',
## 'Tigran Gharamian wiki chess player',
## 'Vadim Milov wiki chess player',
## 'Vadim Zvjaginsev wiki chess player',
## 'Valery Salov wiki chess player',
## 'Varuzhan Akobian wiki chess player',
## 'Vasif Durarbayli wiki chess player',
## 'Vasyl Ivanchuk wiki chess player',
## 'Velimir Ivic wiki chess player',
## 'Veselin Topalov wiki chess player',
## 'Viktor Erdos wiki chess player',
## 'Viktor Laznicka wiki chess player',
## 'Vincent Keymer wiki chess player',
## 'Viswanathan Anand wiki chess player',
## 'Vitaliy Bernadskiy wiki chess player',
## 'Vladimir Afromeev wiki chess player',
## 'Vladimir Akopian wiki chess player',
## 'Vladimir Fedoseev wiki chess player',
## 'Vladimir Kramnik wiki chess player',
## 'Vladimir Malakhov wiki chess player',
## 'Vladimir Onischuk wiki chess player',
## 'Vladislav Artemiev wiki chess player',
## 'Vladislav Kovalev wiki chess player',
## 'Vladislav Tkachiev wiki chess player',
## 'Wesley So wiki chess player',
## 'Wojciech Moranda wiki chess player',
## 'Xiangzhi Bu wiki chess player',
## 'Yangyi Yu wiki chess player',
## 'Yannick Gozzoli wiki chess player',
## 'Yaroslav Zherebukh wiki chess player',
## 'Yasser Seirawan wiki chess player',
## 'Yevgeniy Vladimirov wiki chess player',
## 'Yi Wei wiki chess player',
## 'Yifan Hou wiki chess player',
## 'Yue Wang wiki chess player',
## 'Yuri Drozdovskij wiki chess player',
## 'Yuriy Kryvoruchko wiki chess player',
## 'Yuriy Kuzubov wiki chess player',
## 'Zahar Efimenko wiki chess player',
## 'Zhong Zhang wiki chess player',
## 'Zoltan Almasi wiki chess player',
## 'Zoltan Gyimesi wiki chess player',
## 'Zurab Azmaiparashvili wiki chess player',
## 'Zviad Izoria wiki chess player']
We will now use our queries to acquire the URLs.
# Load library
from googlesearch import search
# List to store URLs
player_wiki_pages = []

# Loop so that we get one URL for every query
for p in grandmaster_list_wiki_query:
    for i in search(p,
                    tld = 'co.in',
                    num = 1,
                    stop = 1,
                    pause = 2.0):
        player_wiki_pages.append(i)
pp.pprint(player_wiki_pages)
## ['https://en.wikipedia.org/wiki/Salem_Saleh_(chess_player)',
## 'https://en.wikipedia.org/wiki/Abhijeet_Gupta',
## 'https://en.wikipedia.org/wiki/Ahmed_Adly',
## 'https://en.wikipedia.org/wiki/Alan_Pichot',
## 'https://en.wikipedia.org/wiki/Aleksandar_In%C4%91i%C4%87',
## 'https://en.wikipedia.org/wiki/Aleksandr_Lenderman',
## 'https://en.wikipedia.org/wiki/Aleksandr_Rakhmanov',
## 'https://en.wikipedia.org/wiki/Aleksandra_Goryachkina',
## 'https://en.wikipedia.org/wiki/Alexey_Dreev',
## 'https://en.wikipedia.org/wiki/Alexander_Areshchenko',
## 'https://en.wikipedia.org/wiki/Alexander_Chernin',
## 'https://en.wikipedia.org/wiki/Alexander_Donchenko',
## 'https://en.wikipedia.org/wiki/Aleksandr_Galkin_(chess_player)',
## 'https://en.wikipedia.org/wiki/Alexander_Grischuk',
## 'https://en.wikipedia.org/wiki/Alexander_Ipatov',
## 'https://en.wikipedia.org/wiki/Alexander_Khalifman',
## 'https://en.wikipedia.org/wiki/Alexander_Moiseenko',
## 'https://en.wikipedia.org/wiki/Alexander_Morozevich',
## 'https://en.wikipedia.org/wiki/Alexander_Motylev',
## 'https://en.wikipedia.org/wiki/Alexander_Onischuk',
## 'https://en.wikipedia.org/wiki/Alexander_Riazantsev_(chess_player)',
## 'https://en.wikipedia.org/wiki/Alexandr_Predke',
## 'https://en.wikipedia.org/wiki/Alexei_Shirov',
## 'https://en.wikipedia.org/wiki/Alexey_Sarana',
## 'https://en.wikipedia.org/wiki/Alireza_Firouzja',
## 'https://en.wikipedia.org/wiki/Anatoly_Karpov',
## 'https://en.wikipedia.org/wiki/Andrei_Volokitin',
## 'https://en.wikipedia.org/wiki/Andrey_Esipenko',
## 'https://en.wikipedia.org/wiki/Anish_Giri',
## 'https://en.wikipedia.org/wiki/Ante_Brki%C4%87',
## 'https://en.wikipedia.org/wiki/Anton_Demchenko',
## 'https://en.wikipedia.org/wiki/Anton_Korobov',
## 'https://en.wikipedia.org/wiki/Anton_Kovalyov',
## 'https://en.wikipedia.org/wiki/Anton_Smirnov_(chess_player)',
## 'https://www.wikidata.org/wiki/Q27525651',
## 'https://en.wikipedia.org/wiki/Arkadij_Naiditsch',
## 'https://en.wikipedia.org/wiki/Arman_Pashikian',
## 'https://en.wikipedia.org/wiki/Aryan_Tari',
## 'https://en.wikipedia.org/wiki/Adhiban_Baskaran',
## 'https://en.wikipedia.org/wiki/Bartosz_So%C4%87ko',
## 'https://en.wikipedia.org/wiki/Bassem_Amin',
## 'https://en.wikipedia.org/wiki/Benjamin_Bok',
## 'https://en.wikipedia.org/wiki/Benj%C3%A1min_Gledura',
## 'https://en.wikipedia.org/wiki/Bogdan-Daniel_Deac',
## 'https://en.wikipedia.org/wiki/Boris_Alterman',
## 'https://en.wikipedia.org/wiki/Boris_Gelfand',
## 'https://en.wikipedia.org/wiki/Boris_Grachev',
## 'https://en.wikipedia.org/wiki/Li_Chao_(chess_player)',
## 'https://en.wikipedia.org/wiki/Aravindh_Chithambaram',
## 'https://en.wikipedia.org/wiki/Christian_Bauer',
## 'https://en.wikipedia.org/wiki/Constantin_Lupulescu',
## 'https://en.wikipedia.org/wiki/Cristobal_Henriquez_Villagra',
## 'https://en.wikipedia.org/wiki/Daniel_Fridman',
## 'https://en.wikipedia.org/wiki/Daniel_Naroditsky',
## 'https://en.wikipedia.org/wiki/Dani%C3%ABl_Stellwagen',
## 'https://en.wikipedia.org/wiki/Daniele_Vocaturo',
## 'https://en.wikipedia.org/wiki/Daniil_Dubov',
## 'https://en.wikipedia.org/wiki/Dariusz_%C5%9Awiercz',
## 'https://en.wikipedia.org/wiki/Darmen_Sadvakasov',
## 'https://en.wikipedia.org/wiki/David_Ant%C3%B3n_Guijarro',
## 'https://en.wikipedia.org/wiki/David_Baramidze',
## 'https://en.wikipedia.org/wiki/David_Navara',
## 'https://en.wikipedia.org/wiki/David_Paravyan',
## 'https://en.wikipedia.org/wiki/David_Howell_(chess_player)',
## 'https://en.wikipedia.org/wiki/Denis_Khismatullin',
## 'https://en.wikipedia.org/wiki/Dimitrios_Mastrovasilis',
## 'https://en.wikipedia.org/wiki/Dmitrij_Kollars',
## 'https://en.wikipedia.org/wiki/Dmitry_Andreikin',
## 'https://en.wikipedia.org/wiki/Dmitry_Jakovenko',
## 'https://en.wikipedia.org/wiki/Dmitry_Kononenko',
## 'https://en.wikipedia.org/wiki/Eduardo_Iturrizaga',
## 'https://en.wikipedia.org/wiki/Emil_Sutovsky',
## 'https://en.wikipedia.org/wiki/Eric_Hansen_(chess_player)',
## 'https://en.wikipedia.org/wiki/Ernesto_Inarkiev',
## 'https://en.wikipedia.org/wiki/Erwin_l%27Ami',
## 'https://en.wikipedia.org/wiki/%C3%89tienne_Bacrot',
## 'https://en.wikipedia.org/wiki/Evgeniy_Najer',
## 'https://en.wikipedia.org/wiki/Evgeny_Alekseev_(chess_player)',
## 'https://en.wikipedia.org/wiki/Evgeny_Bareev',
## 'https://en.wikipedia.org/wiki/Evgeny_Shtembuliak',
## 'https://en.wikipedia.org/wiki/Evgeny_Tomashevsky',
## 'https://en.wikipedia.org/wiki/Fabiano_Caruana',
## 'https://en.wikipedia.org/wiki/Farrukh_Amonatov',
## 'https://en.wikipedia.org/wiki/Ferenc_Berkes',
## 'https://en.wikipedia.org/wiki/Francisco_Vallejo_Pons',
## 'https://en.wikipedia.org/wiki/Gabriel_Sargissian',
## 'https://en.wikipedia.org/wiki/Gadir_Guseinov',
## 'https://en.wikipedia.org/wiki/Garry_Kasparov',
## 'https://en.wikipedia.org/wiki/Gata_Kamsky',
## 'https://en.wikipedia.org/wiki/Gawain_Jones',
## 'https://en.wikipedia.org/wiki/Georg_Meier_(chess_player)',
## 'https://www.wikidata.org/wiki/Q4362660',
## 'https://en.wikipedia.org/wiki/Giovanni_Vescovi',
## 'https://en.wikipedia.org/wiki/Grigoriy_Oparin',
## 'https://en.wikipedia.org/wiki/Grzegorz_Gajewski',
## 'https://en.wikipedia.org/wiki/Haik_M._Martirosyan',
## 'https://en.wikipedia.org/wiki/Hans_Niemann',
## 'https://en.wikipedia.org/wiki/Wang_Hao_(chess_player)',
## 'https://en.wikipedia.org/wiki/Hikaru_Nakamura',
## 'https://en.wikipedia.org/wiki/Hrant_Melkumyan',
## 'https://en.wikipedia.org/wiki/Hristos_Banikas',
## 'https://en.wikipedia.org/wiki/Ni_Hua',
## 'https://en.wikipedia.org/wiki/Ian_Nepomniachtchi',
## 'https://en.wikipedia.org/wiki/Igor_Kovalenko',
## 'https://en.wikipedia.org/wiki/Igor_Lysyj',
## 'https://en.wikipedia.org/wiki/Igors_Rausis',
## 'https://en.wikipedia.org/wiki/Ildar_Khairullin',
## 'https://en.wikipedia.org/wiki/Ilya_Smirin',
## 'https://en.wikipedia.org/wiki/Illia_Nyzhnyk',
## 'https://en.wikipedia.org/wiki/Ioannis_Papaioannou',
## 'https://en.wikipedia.org/wiki/Ivan_Cheparinov',
## 'https://en.wikipedia.org/wiki/Ivan_Popov_(chess_player)',
## 'https://en.wikipedia.org/wiki/Iv%C3%A1n_Salgado_L%C3%B3pez',
## 'https://en.wikipedia.org/wiki/Ivan_%C5%A0ari%C4%87_(chess_player)',
## 'https://second.wiki/wiki/jaime_santos_latasa',
## 'https://en.wikipedia.org/wiki/Jan-Krzysztof_Duda',
## 'https://en.wikipedia.org/wiki/Jan_Gustafsson',
## 'https://en.wikipedia.org/wiki/Jeffery_Xiong',
## 'https://en.wikipedia.org/wiki/Jeroen_Piket',
## 'https://en.wikipedia.org/wiki/Zhou_Jianchao',
## 'https://en.wikipedia.org/wiki/Ye_Jiangchuan',
## 'https://en.wikipedia.org/wiki/Bai_Jinshi',
## 'https://en.wikipedia.org/wiki/Jo%C3%ABl_Lautier',
## 'https://en.wikipedia.org/wiki/Johan-Sebastian_Christiansen',
## 'https://en.wikipedia.org/wiki/Jon_Ludvig_Hammer',
## 'https://en.wikipedia.org/wiki/Jorden_van_Foreest',
## 'https://en.wikipedia.org/wiki/Jorge_Cori',
## 'https://en.wikipedia.org/wiki/Jose_Eduardo_Martinez_Alcantara',
## 'https://en.wikipedia.org/wiki/Judit_Polg%C3%A1r',
## 'https://en.wikipedia.org/wiki/Jules_Moussard',
## 'https://en.wikipedia.org/wiki/Julian_Hodgson',
## 'https://en.wikipedia.org/wiki/Julio_Granda',
## 'https://en.wikipedia.org/wiki/Zhao_Jun_(chess_player)',
## 'https://en.wikipedia.org/wiki/Kacper_Piorun',
## 'https://en.wikipedia.org/wiki/Karen_H._Grigoryan',
## 'https://en.wikipedia.org/wiki/Kirill_Alekseenko',
## 'https://en.wikipedia.org/wiki/Kirill_Shevchenko',
## 'https://en.wikipedia.org/wiki/Konstantin_Landa',
## 'https://en.wikipedia.org/wiki/Krishnan_Sasikiran',
## 'https://en.wikipedia.org/wiki/Laurent_Fressinet',
## 'https://en.wikipedia.org/wiki/L%C3%A1zaro_Bruz%C3%B3n',
## 'https://en.wikipedia.org/wiki/Leinier_Dom%C3%ADnguez',
## 'https://en.wikipedia.org/wiki/Levon_Aronian',
## 'https://en.wikipedia.org/wiki/Ding_Liren',
## 'https://en.wikipedia.org/wiki/Liviu-Dieter_Nisipeanu',
## 'https://en.wikipedia.org/wiki/Loek_van_Wely',
## 'https://en.wikipedia.org/wiki/Luka_Leni%C4%8D',
## 'https://en.wikipedia.org/wiki/Luke_McShane',
## 'https://en.wikipedia.org/wiki/Amin_Tabatabaei',
## 'https://en.wikipedia.org/wiki/Magnus_Carlsen',
## 'https://en.wikipedia.org/wiki/Maksim_Chigaev',
## 'https://en.wikipedia.org/wiki/Manuel_Petrosyan',
## 'https://en.wikipedia.org/wiki/Marin_Bosio%C4%8Di%C4%87',
## 'https://en.wikipedia.org/wiki/Markus_Ragger',
## 'https://en.wikipedia.org/wiki/Martyn_Kravtsiv',
## 'https://en.wikipedia.org/wiki/Mateusz_Bartel',
## 'https://en.wikipedia.org/wiki/Matthew_Sadler',
## 'https://en.wikipedia.org/wiki/Matthias_Bl%C3%BCbaum',
## 'https://en.wikipedia.org/wiki/Maxim_Matlakov',
## 'https://en.wikipedia.org/wiki/Maxime_Lagarde',
## 'https://en.wikipedia.org/wiki/Maxime_Vachier-Lagrave',
## 'https://en.wikipedia.org/wiki/Michael_Adams_(chess_player)',
## 'https://en.wikipedia.org/wiki/Miguel_Illescas',
## 'https://en.wikipedia.org/wiki/Miguel_Santos_Ruiz',
## 'https://en.wikipedia.org/wiki/Mikhail_Antipov',
## 'https://en.wikipedia.org/wiki/Mikhail_Kobalia',
## 'https://en.wikipedia.org/wiki/Karthikeyan_Murali',
## 'https://en.wikipedia.org/wiki/Mustafa_Y%C4%B1lmaz',
## 'https://en.wikipedia.org/wiki/Arjun_Erigaisi',
## 'https://en.wikipedia.org/wiki/S._L._Narayanan',
## 'https://en.wikipedia.org/wiki/Nihal_Sarin',
## 'https://en.wikipedia.org/wiki/Rameshbabu_Praggnanandhaa',
## 'https://en.m.wikipedia.org/wiki/Nguyen_Ngoc_Truong_Son',
## 'https://en.wikipedia.org/wiki/Nigel_Short',
## 'https://en.wikipedia.org/wiki/Nijat_Abasov',
## 'https://en.wikipedia.org/wiki/Nikita_Vitiugov',
## 'https://en.wikipedia.org/wiki/Nils_Grandelius',
## 'https://en.wikipedia.org/wiki/Nodirbek_Abdusattorov',
## 'https://en.wikipedia.org/wiki/Nodirbek_Yakubboev',
## 'https://en.wikipedia.org/wiki/Olexandr_Bortnyk',
## 'https://en.wikipedia.org/wiki/Parham_Maghsoodloo',
## 'https://en.wikipedia.org/wiki/Parimarjan_Negi',
## 'https://en.wikipedia.org/wiki/Pavel_Eljanov',
## 'https://en.wikipedia.org/wiki/Pavel_Ponkratov',
## 'https://en.wikipedia.org/wiki/Pentala_Harikrishna',
## 'https://en.wikipedia.org/wiki/Peter_Heine_Nielsen',
## 'https://en.wikipedia.org/wiki/Peter_Leko',
## 'https://en.wikipedia.org/wiki/Peter_Svidler',
## 'https://en.wikipedia.org/wiki/Pouya_Idani',
## 'https://en.wikipedia.org/wiki/L%C3%AA_Quang_Li%C3%AAm',
## 'https://en.wikipedia.org/wiki/Ma_Qun',
## 'https://en.wikipedia.org/wiki/Rados%C5%82aw_Wojtaszek',
## 'https://en.wikipedia.org/wiki/Rasmus_Svane',
## 'https://en.wikipedia.org/wiki/Rauf_Mamedov',
## 'https://en.wikipedia.org/wiki/Raunak_Sadhwani',
## 'https://en.wikipedia.org/wiki/Ray_Robson',
## 'https://en.wikipedia.org/wiki/Rich%C3%A1rd_Rapport',
## 'https://en.wikipedia.org/wiki/Rinat_Jumabayev',
## 'https://en.wikipedia.org/wiki/Robert_Hovhannisyan',
## 'https://en.wikipedia.org/wiki/Robin_van_Kampen',
## 'https://en.wikipedia.org/wiki/Ruslan_Ponomariov',
## 'https://en.wikipedia.org/wiki/Rustam_Kasimdzhanov',
## 'https://en.wikipedia.org/wiki/S._P._Sethuraman',
## 'https://en.wikipedia.org/wiki/Sam_Shankland',
## 'https://en.wikipedia.org/wiki/Samuel_Sevian',
## 'https://en.wikipedia.org/wiki/Samvel_Ter-Sahakyan',
## 'https://en.wikipedia.org/wiki/Sanan_Sjugirov',
## 'https://en.wikipedia.org/wiki/Sandro_Mareco',
## 'https://en.wikipedia.org/wiki/Vidit_Gujrathi',
## 'https://en.wikipedia.org/wiki/Sergei_Azarov',
## 'https://en.wikipedia.org/wiki/Sergei_Movsesian',
## 'https://en.wikipedia.org/wiki/Sergei_Rublevsky',
## 'https://en.wikipedia.org/wiki/Sergei_Tiviakov',
## 'https://en.wikipedia.org/wiki/Sergey_Fedorchuk',
## 'https://en.wikipedia.org/wiki/Sergey_Karjakin',
## 'https://en.wikipedia.org/wiki/Shakhriyar_Mamedyarov',
## 'https://en.wikipedia.org/wiki/Lu_Shanglei',
## 'https://en.wikipedia.org/wiki/Shant_Sargsyan',
## 'https://en.wikipedia.org/wiki/List_of_Indian_chess_players',
## 'https://en.wikipedia.org/wiki/Surya_Shekhar_Ganguly',
## 'https://en.wikipedia.org/wiki/Tam%C3%A1s_B%C3%A1nusz',
## 'https://en.wikipedia.org/wiki/Tamir_Nabaty',
## 'https://en.wikipedia.org/wiki/Teimour_Radjabov',
## 'https://en.wikipedia.org/wiki/Tigran_Gharamian',
## 'https://en.wikipedia.org/wiki/Vadim_Milov',
## 'https://en.wikipedia.org/wiki/Vadim_Zvjaginsev',
## 'https://en.wikipedia.org/wiki/Valery_Salov',
## 'https://en.wikipedia.org/wiki/Varuzhan_Akobian',
## 'https://en.wikipedia.org/wiki/Vasif_Durarbayli',
## 'https://en.wikipedia.org/wiki/Vasyl_Ivanchuk',
## 'https://en.wikipedia.org/wiki/Velimir_Ivi%C4%87',
## 'https://en.wikipedia.org/wiki/Veselin_Topalov',
## 'https://en.wikipedia.org/wiki/Viktor_Erd%C5%91s',
## 'https://en.wikipedia.org/wiki/Viktor_L%C3%A1zni%C4%8Dka',
## 'https://en.wikipedia.org/wiki/Vincent_Keymer',
## 'https://en.wikipedia.org/wiki/Viswanathan_Anand',
## 'https://en.wikipedia.org/wiki/Vitaliy_Bernadskiy',
## 'https://en.wikipedia.org/wiki/Vladimir_Afromeev',
## 'https://en.wikipedia.org/wiki/Vladimir_Akopian',
## 'https://en.wikipedia.org/wiki/Vladimir_Fedoseev',
## 'https://en.wikipedia.org/wiki/Vladimir_Kramnik',
## 'https://en.wikipedia.org/wiki/Vladimir_Malakhov_(chess_player)',
## 'https://en.wikipedia.org/wiki/Volodymyr_Onyshchuk',
## 'https://en.wikipedia.org/wiki/Vladislav_Artemiev',
## 'https://en.wikipedia.org/wiki/Vladislav_Kovalev',
## 'https://en.wikipedia.org/wiki/Vladislav_Tkachiev',
## 'https://en.wikipedia.org/wiki/Wesley_So',
## 'https://en.wikipedia.org/wiki/Wojciech_Moranda',
## 'https://en.wikipedia.org/wiki/Bu_Xiangzhi',
## 'http://t1.gstatic.com/licensed-image?q=tbn:ANd9GcR-sXmvQj_rsGFI0Z2h8Y8n62Hw1T7L8umRy3URmaukMqSdwXhB-6r8HfGF1run',
## 'https://en.wikipedia.org/wiki/Yannick_Gozzoli',
## 'https://en.wikipedia.org/wiki/Yaroslav_Zherebukh',
## 'https://en.wikipedia.org/wiki/Yasser_Seirawan',
## 'https://en.wikipedia.org/wiki/Yevgeniy_Vladimirov',
## 'https://en.wikipedia.org/wiki/Wei_Yi',
## 'https://en.wikipedia.org/wiki/Hou_Yifan',
## 'https://en.wikipedia.org/wiki/Wang_Yue',
## 'https://en.wikipedia.org/wiki/Yuri_Drozdovskij',
## 'https://en.wikipedia.org/wiki/Yuriy_Kryvoruchko',
## 'https://en.wikipedia.org/wiki/Yuriy_Kuzubov',
## 'https://en.wikipedia.org/wiki/Zahar_Efimenko',
## 'https://en.wikipedia.org/wiki/Zhang_Zhong',
## 'https://en.wikipedia.org/wiki/Zolt%C3%A1n_Alm%C3%A1si',
## 'https://en.wikipedia.org/wiki/Zoltan_Gyimesi',
## 'https://en.wikipedia.org/wiki/Zurab_Azmaiparashvili',
## 'https://en.wikipedia.org/wiki/Zviad_Izoria']
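One caveat with the loop above is that a query returning no result would silently shift every later URL out of line with its player. Below is a minimal sketch of a guard against that, assuming a search function that yields URLs for a query. The helper first_result and the stand-in fake_search are hypothetical names, used here so the example runs without network access.

```python
from typing import Callable, Iterable, Optional


def first_result(query: str,
                 search_fn: Callable[[str], Iterable[str]]) -> Optional[str]:
    """Return the first URL yielded by search_fn, or None if there is none."""
    return next(iter(search_fn(query)), None)


def fake_search(query: str) -> Iterable[str]:
    # Stand-in for a real search call; it only returns canned results.
    canned = {
        "Magnus Carlsen wiki chess player":
            ["https://en.wikipedia.org/wiki/Magnus_Carlsen"],
    }
    return canned.get(query, [])


queries = ["Magnus Carlsen wiki chess player",
           "Unknown Player wiki chess player"]
# Keying the results by query keeps each URL tied to its player.
urls = {q: first_result(q, fake_search) for q in queries}
print(urls)
```

With the real library, search_fn could be a small wrapper around the search call with the same arguments as in the loop above. Storing None for an empty result keeps the output aligned with the list of queries.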
This worked well: only a few of the returned URLs are incorrect. We can deal with them in R by filtering out URLs that do not match the expected pattern.
# Look for URLs that don't have a particular pattern
player_wiki_pages_test <- grepl(pattern = "^https://en.wikipedia.org/wiki/", x = py$player_wiki_pages)

# Clean URLs
player_wiki_pages_clean <- py$player_wiki_pages[player_wiki_pages_test]

# Problematic URLs
problematic_player_pages <- py$player_wiki_pages[!player_wiki_pages_test]
problematic_player_pages
## [1] "https://www.wikidata.org/wiki/Q27525651"
## [2] "https://www.wikidata.org/wiki/Q4362660"
## [3] "https://second.wiki/wiki/jaime_santos_latasa"
## [4] "https://en.m.wikipedia.org/wiki/Nguyen_Ngoc_Truong_Son"
## [5] "http://t1.gstatic.com/licensed-image?q=tbn:ANd9GcR-sXmvQj_rsGFI0Z2h8Y8n62Hw1T7L8umRy3URmaukMqSdwXhB-6r8HfGF1run"
Some problematic pages were not detected because the base URLs were present. Their queries actually generated Wikipedia lists of chess players from particular countries.
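As a rough illustration of how both kinds of problem page could be flagged in one pass, here is a minimal Python sketch. It is not the code used in this project: `WIKI_PREFIX`, `classify_url`, and `sample_urls` are hypothetical names, and the list-page URL is an assumed example reconstructed from the page title that appears in the scraped output.

```python
# Hypothetical sketch: classify scraped URLs before requesting them.
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

def classify_url(url):
    """Return 'problematic', 'list_page', or 'clean' for a scraped URL."""
    if not url.startswith(WIKI_PREFIX):
        return "problematic"      # same test as the grepl() pattern
    title = url[len(WIKI_PREFIX):]
    if title.startswith("List_of_"):
        return "list_page"        # right domain, but not a player page
    return "clean"

sample_urls = [
    "https://en.wikipedia.org/wiki/Wesley_So",
    "https://www.wikidata.org/wiki/Q27525651",
    "https://en.m.wikipedia.org/wiki/Nguyen_Ngoc_Truong_Son",
    "https://en.wikipedia.org/wiki/List_of_Indian_chess_players",  # assumed URL
]

labels = {url: classify_url(url) for url in sample_urls}
for url, label in labels.items():
    print(label, url)
```

A check like the `List_of_` test would only catch list pages that follow that naming pattern, so a manual look at the results is still worthwhile.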
Because we don’t have a way to get these players’ URLs through web-scraping, we will manually insert their data into our final data frame.
Name <- c("Jaime, Santos Latasa", "Yu Yangyi", "Georgy, Pilavov", "Suri, Vaibhav", "Aram, Hakobyan", "Ngoc Truong Son Nguyen")
Birthdate <- c('1996-07-03', '1994-06-08', "1974-12-13", "1997-02-08", "2001-04-01", "1990-02-23")
City_of_birth <- c('San Sebastián', 'Hubei', "Luhansk", "New Delhi", "Yerevan", "Rach Gia")
problematic_grandmaster_bios <- data.frame(Name, Birthdate, City_of_birth)
problematic_grandmaster_bios
Before we do the actual Wikipedia web-scraping, we need a few helper functions that will make our scraping easier.
# Function to extract name from url so we can keep track of whose page we are scraping
def trim_url(x):
    y = x.replace("https://en.wikipedia.org/wiki/", "").replace("_(chess_player)", "").replace('_', " ")
    return(y)
# Function to combine web-scraping elements and convert them to strings
def combine_strings(x):
    y = ", ".join((str(elements) for elements in x))
    return(y)
# Function that eliminates unnecessary strings (keeps only the text before the first comma)
def extra_string_remover(x):
    y = combine_strings(x).split(",", 1)
    z = y[0]
    return(z)
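Because the name is taken straight from the URL, any non-ASCII characters stay percent-encoded (for example `Zolt%C3%A1n_Alm%C3%A1si`). As a hedged sketch that is not part of the original pipeline, the standard library's `urllib.parse.unquote` could decode them. `trim_url_decoded` is a hypothetical variant of the helper above.

```python
from urllib.parse import unquote

def trim_url_decoded(url):
    """Hypothetical variant of trim_url that also decodes %XX escapes."""
    name = url.replace("https://en.wikipedia.org/wiki/", "")
    name = name.replace("_(chess_player)", "").replace("_", " ")
    return unquote(name)  # percent-escapes are decoded as UTF-8 by default

print(trim_url_decoded("https://en.wikipedia.org/wiki/Zolt%C3%A1n_Alm%C3%A1si"))
print(trim_url_decoded("https://en.wikipedia.org/wiki/Vladimir_Malakhov_(chess_player)"))
```

Decoding at this stage would give names with their accents intact, which may or may not match the spelling used in the rating data, so it is a trade-off rather than a clear win.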
Let’s now convert our clean pages vector above into a list in Python and view the number of players we are working with.
player_wiki_pages_final = list(r.player_wiki_pages_clean)
len(player_wiki_pages_final)
## 261
This count includes both the “clean” chess player pages and the pages that are actually lists of chess players from particular countries.
Now, let’s use the “Scrapy” library to extract birthdate and birthplace information from each player’s Wikipedia page.
from scrapy import Selector
player_bios = []
for url in list(r.player_wiki_pages_clean):
    html = requests.get(url).content
    sel = Selector(text = html)
    bday_text = sel.xpath('//span[@class="bday"]/text()').extract()
    birthplace = sel.xpath('//td[@class="infobox-data"]//a/text()').extract()
    birthplace_clean = extra_string_remover(birthplace)
    url_name = trim_url(url)
    player_bios.append(url_name + ": " + combine_strings(bday_text) + ": " + birthplace_clean)
pp.pprint(player_bios)
## ['Salem Saleh: 1993-01-04: Sharjah',
## 'Abhijeet Gupta: 1989-10-16: Bhilwara',
## 'Ahmed Adly: 1987-02-18: Cairo',
## 'Alan Pichot: 1998-08-13: Buenos Aires',
## 'Aleksandar In%C4%91i%C4%87: : Belgrade',
## 'Aleksandr Lenderman: 1989-09-23: Leningrad',
## 'Aleksandr Rakhmanov: : Cherepovets',
## 'Aleksandra Goryachkina: 1998-09-28: Orsk',
## 'Alexey Dreev: 1969-01-30: Stavropol',
## 'Alexander Areshchenko: 1986-06-15: Voroshilovgrad',
## 'Alexander Chernin: : ',
## 'Alexander Donchenko: : Moscow',
## 'Aleksandr Galkin: : ',
## 'Alexander Grischuk: 1983-10-31: Moscow',
## 'Alexander Ipatov: 1993-07-16: Lviv',
## 'Alexander Khalifman: 1966-01-18: Leningrad',
## 'Alexander Moiseenko: 1980-05-17: Severomorsk',
## 'Alexander Morozevich: 1977-07-18: Moscow',
## 'Alexander Motylev: 1979-06-17: Sverdlosk',
## 'Alexander Onischuk: 1975-09-03: Sevastopol',
## 'Alexander Riazantsev: 1985-09-12: Moscow',
## 'Alexandr Predke: 1994-01-05: Dimitrovgrad',
## 'Alexei Shirov: 1972-07-04: Riga',
## 'Alexey Sarana: : Moscow',
## 'Alireza Firouzja: : Babol',
## 'Anatoly Karpov: 1951-05-23: Zlatoust',
## 'Andrei Volokitin: 1986-06-18: Lviv',
## 'Andrey Esipenko: : Novocherkassk',
## 'Anish Giri: 1994-06-28: Sopiko Guramishvili',
## 'Ante Brki%C4%87: : 2607',
## 'Anton Demchenko: 1987-08-20: 2654',
## 'Anton Korobov: 1985-06-25: Mezhdurechensk',
## 'Anton Kovalyov: 1992-03-04: Kharkiv',
## 'Anton Smirnov: 2001-01-28: Canberra',
## 'Arkadij Naiditsch: 1985-10-25: Riga',
## 'Arman Pashikian: 1987-07-28: Irkutsk',
## 'Aryan Tari: 1999-06-04: Stavanger',
## 'Adhiban Baskaran: 1992-08-15: Mayiladuthurai',
## 'Bartosz So%C4%87ko: 1978-11-10: Piaseczno',
## 'Bassem Amin: 1988-09-09: 2682',
## 'Benjamin Bok: 1995-01-25: Lelystad',
## 'Benj%C3%A1min Gledura: 1999-07-04: Eger',
## 'Bogdan-Daniel Deac: : Râmnicu Vâlcea',
## 'Boris Alterman: : ',
## 'Boris Gelfand: 1968-06-24: Minsk',
## 'Boris Grachev: 1986-03-27: Moscow',
## 'Li Chao: 1989-04-21: Taiyuan',
## 'Aravindh Chithambaram: : Thirunagar',
## 'Christian Bauer: 1977-01-11: Forbach',
## 'Constantin Lupulescu: 1984-03-25: Buftea',
## 'Cristobal Henriquez Villagra: 1996-08-07: La Florida',
## 'Daniel Fridman: 1976-02-15: Riga',
## 'Daniel Naroditsky: 1995-11-09: San Mateo',
## 'Dani%C3%ABl Stellwagen: 1987-03-01: Soest',
## 'Daniele Vocaturo: 1989-12-16: Rome',
## 'Daniil Dubov: 1996-04-18: Moscow',
## 'Dariusz %C5%9Awiercz: 1994-05-31: Tarnowskie Góry',
## 'Darmen Sadvakasov: 1979-04-28: 2629',
## 'David Ant%C3%B3n Guijarro: 1995-06-23: Murcia',
## 'David Baramidze: 1988-09-27: Georgia',
## 'David Navara: 1985-03-27: Prague',
## 'David Paravyan: : Moscow',
## 'David Howell: 1990-11-14: Eastbourne',
## 'Denis Khismatullin: 1984-12-28: Neftekamsk',
## 'Dimitrios Mastrovasilis: 1983-06-12: 2618',
## 'Dmitrij Kollars: : Bremen',
## 'Dmitry Andreikin: 1990-02-05: Ryazan',
## 'Dmitry Jakovenko: 1983-06-29: Nizhnevartovsk',
## 'Dmitry Kononenko: : ',
## 'Eduardo Iturrizaga: 1989-11-01: Caracas',
## 'Emil Sutovsky: 1977-09-19: Baku',
## 'Eric Hansen: 1992-05-24: Irvine',
## 'Ernesto Inarkiev: 1985-12-09: Khaidarkan',
## 'Erwin l%27Ami: 1985-04-05: Woerden',
## '%C3%89tienne Bacrot: 1983-01-22: Lille',
## 'Evgeniy Najer: 1977-06-22: Moscow',
## 'Evgeny Alekseev: 1985-11-28: Pushkin',
## 'Evgeny Bareev: 1966-11-21: Yemanzhelinsk',
## 'Evgeny Shtembuliak: : ',
## 'Evgeny Tomashevsky: 1987-07-01: Saratov',
## 'Fabiano Caruana: 1992-07-30: Miami',
## 'Farrukh Amonatov: 1978-04-13: Dushanbe',
## 'Ferenc Berkes: 1985-08-08: Baja',
## 'Francisco Vallejo Pons: 1982-08-21: Es Castell',
## 'Gabriel Sargissian: 1983-09-03: Yerevan',
## 'Gadir Guseinov: 1986-05-21: Moscow',
## 'Garry Kasparov: 1963-04-13: Baku',
## 'Gata Kamsky: 1974-06-02: Novokuznetsk',
## 'Gawain Jones: 1987-12-11: Keighley',
## 'Georg Meier: 1987-08-26: Trier',
## 'Giovanni Vescovi: 1978-06-14: Porto Alegre',
## 'Grigoriy Oparin: 1997-07-01: Munich',
## 'Grzegorz Gajewski: 1985-07-19: Skierniewice',
## 'Haik M. Martirosyan: 2000-07-14: Byuravan',
## 'Hans Niemann: : San Francisco',
## 'Wang Hao: 1989-08-04: Harbin',
## 'Hikaru Nakamura: 1987-12-09: Hirakata',
## 'Hrant Melkumyan: 1989-04-30: Yerevan',
## 'Hristos Banikas: 1978-05-20: Salonica',
## 'Ni Hua: 1983-05-31: Shanghai',
## 'Ian Nepomniachtchi: 1990-07-14: Bryansk',
## 'Igor Kovalenko: 1988-12-29: Novomoskovsk',
## 'Igor Lysyj: 1987-01-01: Sverdlovsk',
## 'Igors Rausis: 1961-04-07: Komunarsk',
## 'Ildar Khairullin: 1990-08-22: Perm',
## 'Ilya Smirin: 1968-01-12: Vitebsk',
## 'Illia Nyzhnyk: 1996-09-27: Vinnytsia',
## 'Ioannis Papaioannou: : Athens',
## 'Ivan Cheparinov: 1986-11-26: Asenovgrad',
## 'Ivan Popov: 1990-03-20: Rostov-on-Don',
## 'Iv%C3%A1n Salgado L%C3%B3pez: : ',
## 'Ivan %C5%A0ari%C4%87: 1990-08-17: Split',
## 'Jan-Krzysztof Duda: 1998-04-26: Wieliczka',
## 'Jan Gustafsson: 1979-06-25: Hamburg',
## 'Jeffery Xiong: 2000-10-30: Plano',
## 'Jeroen Piket: 1969-01-27: Leiden',
## 'Zhou Jianchao: 1988-06-11: Shanghai',
## 'Ye Jiangchuan: 1960-11-20: Wuxi',
## 'Bai Jinshi: 1999-05-18: 2593',
## 'Jo%C3%ABl Lautier: 1973-04-12: Scarborough',
## 'Johan-Sebastian Christiansen: 1998-06-10: 2584',
## 'Jon Ludvig Hammer: 1990-06-02: Bergen',
## 'Jorden van Foreest: 1999-04-30: Utrecht',
## 'Jorge Cori: 1995-07-30: Lima',
## 'Jose Eduardo Martinez Alcantara: 1999-01-31: Lima',
## 'Judit Polg%C3%A1r: 1976-07-23: Budapest',
## 'Jules Moussard: 1995-01-16: Paris',
## 'Julian Hodgson: 1963-07-25: London',
## 'Julio Granda: 1967-02-25: Camaná',
## 'Zhao Jun: 1986-12-12: Jinan',
## 'Kacper Piorun: 1991-11-24: Łowicz',
## 'Karen H. Grigoryan: 1995-02-25: Yerevan',
## 'Kirill Alekseenko: : Vyborg',
## 'Kirill Shevchenko: : Kyiv',
## 'Konstantin Landa: 1972-05-22: Omsk',
## 'Krishnan Sasikiran: 1981-01-07: Chennai',
## 'Laurent Fressinet: 1981-11-30: Dax',
## 'L%C3%A1zaro Bruz%C3%B3n: 1982-05-02: Holguín',
## 'Leinier Dom%C3%ADnguez: 1983-09-23: Havana',
## 'Levon Aronian: 1982-10-06: Yerevan',
## 'Ding Liren: 1992-10-24: Wenzhou',
## 'Liviu-Dieter Nisipeanu: 1976-08-01: Braşov',
## 'Loek van Wely: 1972-10-07: Heesch',
## 'Luka Leni%C4%8D: 1988-05-13: Ljubljana',
## 'Luke McShane: 1984-01-07: 2647',
## 'Amin Tabatabaei: : Tehran',
## 'Magnus Carlsen: 1990-11-30: Tønsberg',
## 'Maksim Chigaev: : ',
## 'Manuel Petrosyan: 1998-05-06: 2637',
## 'Marin Bosio%C4%8Di%C4%87: 1988-08-08: Rijeka',
## 'Markus Ragger: 1988-02-05: Klagenfurt',
## 'Martyn Kravtsiv: 1990-11-26: Lviv',
## 'Mateusz Bartel: 1985-01-03: Warsaw',
## 'Matthew Sadler: 1974-05-15: Chatham',
## 'Matthias Bl%C3%BCbaum: 1997-04-18: Lemgo',
## 'Maxim Matlakov: 1991-03-05: Leningrad',
## 'Maxime Lagarde: : Niort',
## 'Maxime Vachier-Lagrave: 1990-10-21: Nogent-sur-Marne',
## 'Michael Adams: 1971-11-17: Truro',
## 'Miguel Illescas: 1965-12-03: Barcelona',
## 'Miguel Santos Ruiz: 1999-10-04: Utrera',
## 'Mikhail Antipov: 1997-06-10: Moscow',
## 'Mikhail Kobalia: 1978-05-03: 2596',
## 'Karthikeyan Murali: 1999-01-05: Thanjavur',
## 'Mustafa Y%C4%B1lmaz: 1992-11-05: Mamak',
## 'Arjun Erigaisi: : 2633',
## 'S. L. Narayanan: 1998-01-10: Thiruvananthapuram',
## 'Nihal Sarin: 2004-07-13: Thrissur',
## 'Rameshbabu Praggnanandhaa: 2005-08-10: Chennai',
## 'Nigel Short: 1965-06-01: 2620',
## 'Nijat Abasov: 1995-05-14: Baku',
## 'Nikita Vitiugov: 1987-02-04: Leningrad',
## 'Nils Grandelius: 1993-06-03: Lund',
## 'Nodirbek Abdusattorov: 2004-09-18: Tashkent',
## 'Nodirbek Yakubboev: : 2630',
## 'Olexandr Bortnyk: 1996-10-18: Oleksandrivka',
## 'Parham Maghsoodloo: : Gorgan',
## 'Parimarjan Negi: 1993-02-09: New Delhi',
## 'Pavel Eljanov: 1983-05-10: Kharkiv',
## 'Pavel Ponkratov: : 2641',
## 'Pentala Harikrishna: 1986-05-10: Guntur',
## 'Peter Heine Nielsen: 1973-05-24: Holstebro',
## 'Peter Leko: 1979-09-08: Subotica',
## 'Peter Svidler: 1976-06-17: Leningrad',
## 'Pouya Idani: 1995-09-22: Ahvaz',
## 'L%C3%AA Quang Li%C3%AAm: 1991-03-13: Ho Chi Minh City',
## 'Ma Qun: 1991-11-09: Shandong',
## 'Rados%C5%82aw Wojtaszek: 1987-01-13: Elbląg',
## 'Rasmus Svane: 1997-05-21: Allerød Municipality',
## 'Rauf Mamedov: 1988-04-26: Baku',
## 'Raunak Sadhwani: 2005-12-22: Nagpur',
## 'Ray Robson: 1994-10-25: Guam',
## 'Rich%C3%A1rd Rapport: 1996-03-25: Szombathely',
## 'Rinat Jumabayev: 1989-07-23: Shymkent',
## 'Robert Hovhannisyan: 1991-03-23: Yerevan',
## 'Robin van Kampen: 1994-11-14: Blaricum',
## 'Ruslan Ponomariov: 1983-10-11: Horlivka',
## 'Rustam Kasimdzhanov: 1979-12-05: Tashkent',
## 'S. P. Sethuraman: 1993-02-25: Madras',
## 'Sam Shankland: 1991-10-01: Berkeley',
## 'Samuel Sevian: 2000-12-26: Corning',
## 'Samvel Ter-Sahakyan: 1993-09-19: Vanadzor',
## 'Sanan Sjugirov: 1993-01-31: Elista',
## 'Sandro Mareco: 1987-05-13: Haedo',
## 'Vidit Gujrathi: 1994-10-24: [1]',
## 'Sergei Azarov: : ',
## 'Sergei Movsesian: 1978-11-03: Tbilisi',
## 'Sergei Rublevsky: 1974-10-15: Kurgan',
## 'Sergei Tiviakov: 1973-02-14: Krasnodar',
## 'Sergey Fedorchuk: 1981-03-14: 2605',
## 'Sergey Karjakin: 1990-01-12: Simferopol',
## 'Shakhriyar Mamedyarov: 1985-04-12: Sumgait',
## 'Lu Shanglei: 1995-07-10: Shenyang',
## 'Shant Sargsyan: : 2639',
## 'List of Indian chess players: : ',
## 'Surya Shekhar Ganguly: 1983-02-24: Kolkata',
## 'Tam%C3%A1s B%C3%A1nusz: : Mohács',
## 'Tamir Nabaty: 1991-05-04: Ness Ziona',
## 'Teimour Radjabov: 1987-03-12: Baku',
## 'Tigran Gharamian: 1984-07-24: Yerevan',
## 'Vadim Milov: 1972-08-01: Ufa',
## 'Vadim Zvjaginsev: 1976-08-18: Moscow',
## 'Valery Salov: 1964-05-26: Wrocław',
## 'Varuzhan Akobian: 1983-11-19: Armenian SSR',
## 'Vasif Durarbayli: 1992-02-24: Sumqayit',
## 'Vasyl Ivanchuk: 1969-03-18: Kopychyntsi',
## 'Velimir Ivi%C4%87: : Belgrade',
## 'Veselin Topalov: 1975-03-15: Ruse',
## 'Viktor Erd%C5%91s: : 2613',
## 'Viktor L%C3%A1zni%C4%8Dka: 1988-01-09: Pardubice',
## 'Vincent Keymer: 2004-11-15: Mainz',
## 'Viswanathan Anand: 1969-12-11: [1]',
## 'Vitaliy Bernadskiy: 1994-11-17: 2601',
## 'Vladimir Afromeev: : Magadan',
## 'Vladimir Akopian: 1971-12-07: Baku',
## 'Vladimir Fedoseev: 1995-02-16: Saint Petersburg',
## 'Vladimir Kramnik: 1975-06-25: Tuapse',
## 'Vladimir Malakhov: 1980-11-27: Ivanovo',
## 'Volodymyr Onyshchuk: 1991-07-21: Ivano-Frankivsk',
## 'Vladislav Artemiev: 1998-03-05: Omsk',
## 'Vladislav Kovalev: 1994-01-06: Minsk',
## 'Vladislav Tkachiev: 1973-11-09: Russian SFSR',
## 'Wesley So: 1993-10-09: Bacoor',
## 'Wojciech Moranda: 1988-08-17: Kielce',
## 'Bu Xiangzhi: 1985-12-10: Qingdao',
## 'Yannick Gozzoli: : Marseille',
## 'Yaroslav Zherebukh: 1993-07-14: Lviv',
## 'Yasser Seirawan: 1960-03-24: Damascus',
## 'Yevgeniy Vladimirov: 1957-01-20: Alma Ata',
## 'Wei Yi: 1999-06-02: Yancheng',
## 'Hou Yifan: 1994-02-27: Xinghua',
## 'Wang Yue: 1987-03-31: Taiyuan',
## 'Yuri Drozdovskij: : Ukraine',
## 'Yuriy Kryvoruchko: 1986-12-19: Lviv',
## 'Yuriy Kuzubov: 1990-01-26: Sychyovka',
## 'Zahar Efimenko: 1985-07-03: Makiivka',
## 'Zhang Zhong: 1978-09-05: Chongqing',
## 'Zolt%C3%A1n Alm%C3%A1si: 1976-08-29: 2678',
## 'Zoltan Gyimesi: 1977-03-31: 2674',
## 'Zurab Azmaiparashvili: 1960-03-16: Tbilisi',
## 'Zviad Izoria: 1984-01-06: Georgia']
Although we can see some mistakes like ratings and blanks in place of birthplaces, the web-scraper did extract the majority of the data we needed.
We will first convert the information into a data frame in python.
player_bios_table_raw = pd.DataFrame(player_bios, columns = ["Bio"])
print(player_bios_table_raw)
## Bio
## 0 Salem Saleh: 1993-01-04: Sharjah
## 1 Abhijeet Gupta: 1989-10-16: Bhilwara
## 2 Ahmed Adly: 1987-02-18: Cairo
## 3 Alan Pichot: 1998-08-13: Buenos Aires
## 4 Aleksandar In%C4%91i%C4%87: : Belgrade
## .. ...
## 256 Zhang Zhong: 1978-09-05: Chongqing
## 257 Zolt%C3%A1n Alm%C3%A1si: 1976-08-29: 2678
## 258 Zoltan Gyimesi: 1977-03-31: 2674
## 259 Zurab Azmaiparashvili: 1960-03-16: Tbilisi
## 260 Zviad Izoria: 1984-01-06: Georgia
##
## [261 rows x 1 columns]
Next, we will use R to split the “Bio” column into the three variables we need. We will then combine the “clean” chess players and the problematic ones.
player_bios_table_cleanish <- py$player_bios_table_raw %>%
  separate(Bio, c("Name", "Birthdate", "City_of_birth"), sep = ": ", remove = TRUE, convert = FALSE) %>%
  filter(Name != "List of Armenian chess players" & Name != "List of Indian chess players")
player_bios_table_updated <- rbind(player_bios_table_cleanish, problematic_grandmaster_bios) %>%
  arrange(Name)
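For comparison, the same three-way split could be done on the Python side before the data reaches R. This is only a sketch under the assumption that every bio string has the `Name: birthdate: birthplace` shape produced by the scraper. `split_bio` is a hypothetical helper, and the sample strings are copied from the scraper output.

```python
def split_bio(bio):
    """Split 'Name: birthdate: birthplace' into a dict; blanks become None."""
    # maxsplit=2 keeps any later ': ' inside the birthplace field intact
    name, birthdate, city = bio.split(": ", 2)
    return {
        "Name": name,
        "Birthdate": birthdate or None,
        "City_of_birth": city or None,
    }

print(split_bio("Salem Saleh: 1993-01-04: Sharjah"))
print(split_bio("Alexander Chernin: : "))
```

Turning blanks into `None` at this point would make the missing values explicit instead of leaving empty strings to be handled later.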
Our last step is to produce a final data set that will be cleaned and validated against the list of grandmasters Wikipedia page. To make sure our data set can be joined with the data from the Wikipedia list, we need to flip the first and last names back to their original order, with the last name first. We also need to add a second “Name” column and a second “Birthdate” column from the original rating data, because the web-scraped “Name” and “Birthdate” columns have issues. As successful as our web-scraper was, it was not perfect, so we need variables that we know don’t have mistakes in them for data validation later on.
# Flip first and last names so that last names come first again
player_bios_table_updated2 <- player_bios_table_updated %>%
  separate(Name, c("First", "Last"), sep = " ", remove = TRUE, convert = FALSE) %>%
  unite(Name, Last, First, sep = ", ") %>%
  arrange(Name)
## Warning: Expected 2 pieces. Additional pieces discarded in 19 rows [55, 63, 88,
## 99, 113, 117, 124, 125, 127, 133, 140, 146, 165, 169, 184, 197, 200, 201, 217].
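The warning above means that names with more than two words lost their extra pieces (for example, "Jorden van Foreest" became "van, Jorden"). As a hedged illustration, the Python sketch below splits on the first space only, so nothing is discarded. `flip_name` is a hypothetical helper, not part of the original code; in R, the analogous option would be the `extra = "merge"` argument of `separate`.

```python
def flip_name(full_name):
    """Turn 'First Rest Of Name' into 'Rest Of Name, First'.

    Splits on the first space only, so no piece of the name is dropped.
    """
    first, sep, rest = full_name.partition(" ")
    if not sep:  # single-token name: nothing to flip
        return full_name
    return f"{rest}, {first}"

print(flip_name("Magnus Carlsen"))
print(flip_name("Jorden van Foreest"))
print(flip_name("Peter Heine Nielsen"))
```

No single rule fits every name: "Peter Heine Nielsen" comes out as "Heine Nielsen, Peter" here, so some manual corrections would still be needed either way.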
# Supplemental validation columns from original data frame
grandmaster_2600_supplement <- py$grandmaster_2600_raw %>%
  select(Name, `B-day`) %>%
  rename(Birthdate = `B-day`)
# Join supplemental data to bio table using the names and birthdate columns as the keys
library(fuzzyjoin)
## Warning: package 'fuzzyjoin' was built under R version 4.1.2
grandmaster_biotable_2600 <- player_bios_table_updated2 %>%
  stringdist_left_join(grandmaster_2600_supplement, by = c("Name", "Birthdate"), method = "qgram", q = 2, max_dist = 9)
str(grandmaster_biotable_2600)
## 'data.frame': 281 obs. of 5 variables:
## $ Name.x : chr "%C5%9Awiercz, Dariusz" "%C5%A0ari%C4%87, Ivan" "Abasov, Nijat" "Abdusattorov, Nodirbek" ...
## $ Birthdate.x : chr "1994-05-31" "1990-08-17" "1995-05-14" "2004-09-18" ...
## $ City_of_birth: chr "Tarnowskie Góry" "Split" "Baku" "Tashkent" ...
## $ Name.y : chr "Swiercz, Dariusz" NA "Abasov, Nijat" "Abdusattorov, Nodirbek" ...
## $ Birthdate.y : num 1994 NA 1995 2004 1971 ...
For transparency, let’s go over the similar columns:
- Name.x: web-scraped names
- Birthdate.x: web-scraped birthdates
- Name.y: original names
- Birthdate.y: original birth years
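To get a feel for what the `qgram` method with `q = 2` measures, here is a small pure-Python sketch. It is an illustration only, not the `stringdist` implementation, and it assumes each key column is compared separately against `max_dist`. `qgrams` and `qgram_distance` are hypothetical helper names.

```python
from collections import Counter

def qgrams(s, q=2):
    """Count every run of q consecutive characters in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the q-gram counts of a and b."""
    ca, cb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ca[g] - cb[g]) for g in set(ca) | set(cb))

# A flipped name stays close to its original form...
print(qgram_distance("Hao, Wang", "Wang, Hao"))   # 4
# ...but a different player with the same surname is also within 9
print(qgram_distance("Hao, Wang", "Wang, Yue"))   # 8
# A full date against a bare year, matching and non-matching
print(qgram_distance("1989-08-04", "1989"))       # 6
print(qgram_distance("1989-08-04", "1987"))       # 8
```

Under these assumptions, a threshold of 9 is generous enough to pair one scraped name with more than one original record, which is consistent with the row count growing after the join.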
Data Cleaning
Dealing with Names
Because we went from 266 observations to 281, we know that the joining process created duplicates. We accepted this because we needed the function to join as many records as possible. But duplicates are only part of the problem: the “Name.y” column also has missing values because the join did not achieve a 100% match. For now, let’s take a look at the duplicate names and the missing data.
# Find duplicate names
grandmaster_biotable_2600_dups <- (duplicated(grandmaster_biotable_2600$Name.x))
grandmaster_biotable_2600$Name.x[grandmaster_biotable_2600_dups]
## [1] "Areshchenko, Alexander" "Chao, Li" "Chernin, Alexander"
## [4] "Donchenko, Alexander" "Gyimesi, Zoltan" "Hao, Wang"
## [7] "L., S." "L., S." "Navara, David"
## [10] "P., S." "Paravyan, David" "van, Robin"
## [13] "Yangyi, Yu" "Yi, Wei" "Yue, Wang"
# Save missing values as data frame
grandmaster_biotable_2600_missing <- grandmaster_biotable_2600 %>%
  filter(is.na(Name.y))
print(grandmaster_biotable_2600_missing)
## Name.x Birthdate.x City_of_birth Name.y Birthdate.y
## 1 %C5%A0ari%C4%87, Ivan 1990-08-17 Split <NA> NA
## 2 Alm%C3%A1si, Zolt%C3%A1n 1976-08-29 2678 <NA> NA
## 3 Ant%C3%B3n, David 1995-06-23 Murcia <NA> NA
## 4 B%C3%A1nusz, Tam%C3%A1s Mohács <NA> NA
## 5 Baskaran, Adhiban 1992-08-15 Mayiladuthurai <NA> NA
## 6 Bl%C3%BCbaum, Matthias 1997-04-18 Lemgo <NA> NA
## 7 Bosio%C4%8Di%C4%87, Marin 1988-08-08 Rijeka <NA> NA
## 8 Bruz%C3%B3n, L%C3%A1zaro 1982-05-02 Holguín <NA> NA
## 9 Dom%C3%ADnguez, Leinier 1983-09-23 Havana <NA> NA
## 10 Eduardo, Jose 1999-01-31 Lima <NA> NA
## 11 Gujrathi, Vidit 1994-10-24 [1] <NA> NA
## 12 H., Karen 1995-02-25 Yerevan <NA> NA
## 13 Henriquez, Cristobal 1996-08-07 La Florida <NA> NA
## 14 Illescas, Miguel 1965-12-03 Barcelona <NA> NA
## 15 In%C4%91i%C4%87, Aleksandar Belgrade <NA> NA
## 16 Iturrizaga, Eduardo 1989-11-01 Caracas <NA> NA
## 17 L%C3%A1zni%C4%8Dka, Viktor 1988-01-09 Pardubice <NA> NA
## 18 M., Haik 2000-07-14 Byuravan <NA> NA
## 19 Onyshchuk, Volodymyr 1991-07-21 Ivano-Frankivsk <NA> NA
## 20 Praggnanandhaa, Rameshbabu 2005-08-10 Chennai <NA> NA
## 21 Quang, L%C3%AA 1991-03-13 Ho Chi Minh City <NA> NA
## 22 Salgado, Iv%C3%A1n <NA> NA
## 23 Santos, Jaime, 1996-07-03 San Sebastián <NA> NA
## 24 Shekhar, Surya 1983-02-24 Kolkata <NA> NA
## 25 Truong, Ngoc 1990-02-23 Rach Gia <NA> NA
## 26 van, Jorden 1999-04-30 Utrecht <NA> NA
With 15 duplicate web-scraped names and 25 missing original names, we have quite a bit of work to do. There are a number of ways to fill in the missing values. At this point in the project, we have successfully created and executed our web-scraper. Because we have the list of grandmasters Wikipedia page available, the easiest way to handle some of our issues is to repeatedly merge and clean our web-scraped data using the grandmaster list. This should significantly reduce the number of manual insertions we need to do.
Let’s load the data from the grandmaster list using pandas.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

df = pd.read_html('https://en.wikipedia.org/w/index.php?title=List_of_chess_grandmasters&diff=prev&oldid=1043484298', attrs = {'id' : 'grandmasters'})
all_grandmaster_wiki_table = df[0]
all_grandmaster_wiki_table = all_grandmaster_wiki_table.drop(labels=0, axis=0)
pp.pprint(all_grandmaster_wiki_table.info())
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 1945 entries, 1 to 1945
## Data columns (total 9 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Name 1945 non-null object
## 1 FIDE ID 1872 non-null float64
## 2 Born 1945 non-null object
## 3 Birthplace 1662 non-null object
## 4 Died 217 non-null object
## 5 TitleYear 1945 non-null float64
## 6 Federation 1945 non-null object
## 7 Sex 1945 non-null object
## 8 Notes 1945 non-null object
## dtypes: float64(2), object(7)
## memory usage: 152.0+ KB
## None
Using the missing names table we created, let’s try to fill in as many of the missing values as possible, clean the data, and fuse those missing values back into our main table.
# Merge missing table with grandmaster list data and clean it
grandmaster_biotable_2600_miss_resolved <- grandmaster_biotable_2600_missing %>%
  stringdist_left_join(py$all_grandmaster_wiki_table, by = c("Birthdate.x" = "Born"), method = "lcs", max_dist = 1) %>%
  mutate(Name = dplyr::recode(Name,
    "Nguy<U+1EC5>n Ng<U+1ECD>c Tru<U+1EDD>ng Son" = "Nguyen, Ngoc Truong Son"
  )) %>%
  arrange(Name) %>%
  distinct(`Name.x`, .keep_all = TRUE)
library(data.table)
grandmaster_biotable_2600_miss_resolved <- data.table(grandmaster_biotable_2600_miss_resolved)

# Manually fill in missing chess players
grandmaster_biotable_2600_miss_resolved <- grandmaster_biotable_2600_miss_resolved[23, 6 := "Banusz, Tamas"][23, 2 := "1989-04-08"][24, 6 := "Indjic, Aleksandar"][24, 2 := "1995-08-24"][25, 6 := "Salgado López, Iván"][25, 2 := "1991-06-29"]

grandmaster_biotable_2600_miss_resolved <- grandmaster_biotable_2600_miss_resolved %>%
  select(Name, Birthdate.x, Birthplace)
Let’s take a look at the filled in missing values.
print(grandmaster_biotable_2600_miss_resolved)
## Name Birthdate.x Birthplace
## 1: Adhiban B. 1992-08-15 Chennai
## 2: Almási, Zoltán 1976-08-29 Járdánháza
## 3: Antón Guijarro, David 1995-06-23 Murcia
## 4: Blübaum, Matthias 1997-04-18 Lemgo
## 5: Bosiocic, Marin 1988-08-08 Rijeka
## 6: Bruzón Batista, Lázaro 1982-05-02 Holguín
## 7: Domínguez Pérez, Leinier 1983-09-23 Havana
## 8: Ganguly, Surya Shekhar 1983-02-24 Kolkata
## 9: Grigoryan, Karen H. 1995-02-25 Yerevan
## 10: Gujrathi, Vidit 1994-10-24 Indore
## 11: Henriquez Villagra, Cristóbal 1996-08-07 Santiago
## 12: Illescas Cordoba, Miguel 1965-12-03 Barcelona
## 13: Iturrizaga Bonelli, Eduardo 1989-11-01 Caracas
## 14: Laznicka, Viktor 1988-01-09 Pardubice
## 15: Lê Quang Liêm 1991-03-13 Ho Chi Minh City
## 16: Martinez Alcantara, Jose Eduardo 1999-01-31 Lima
## 17: Martirosyan, Haik M. 2000-07-14 Artashat
## 18: Nguy<U+1EC5>n Ng<U+1ECD>c Tru<U+1EDD>ng Son 1990-02-23 R<U+1EA1>ch Giá
## 19: Onischuk, Vladimir 1991-07-21 Ivano-Frankivsk
## 20: Praggnanandhaa R 2005-08-10 Chennai
## 21: Santos Latasa, Jaime 1996-07-03 San Sebastián
## 22: Šaric, Ivan 1990-08-17 Split
## 23: Banusz, Tamas 1989-04-08 Groningen
## 24: Indjic, Aleksandar 1995-08-24
## 25: Salgado López, Iván 1991-06-29
## 26: <NA>
## Name Birthdate.x Birthplace
Now, we are going to merge the resolved missing data with our main table and observe any changes. We will also fill in our missing Name.y values (the clean names) with the brand new names we got from the resolved missing table.
# Merge main bio table and resolved missing data table
grandmaster_biotable_2600_updated <- grandmaster_biotable_2600 %>%
  stringdist_left_join(grandmaster_biotable_2600_miss_resolved, by = c("Birthdate.x", "City_of_birth" = "Birthplace"), method = "qgram", max_dist = 4) %>%
  mutate(Name.y = coalesce(Name.y, Name)) %>%
  select(Name.x, Name.y, Birthdate.x.x, Birthdate.y, City_of_birth, Birthplace) %>%
  rename(Name = Name.y,
         Birthdate.x = Birthdate.x.x,
         Birthyear = Birthdate.y) %>%
  arrange(Name)
# Convert updated bio table data frame to data table class and observe how many missing values there are
grandmaster_biotable_2600_updated <- setDT(grandmaster_biotable_2600_updated)

print(paste(c(sum(is.na(grandmaster_biotable_2600_updated$Name)), "missing values!"), collapse = " "))
## [1] "9 missing values!"
# See the missing values
library(knitr)
kable(grandmaster_biotable_2600_updated[is.na(grandmaster_biotable_2600_updated$Name)])
Name.x | Name | Birthdate.x | Birthyear | City_of_birth | Birthplace |
---|---|---|---|---|---|
Alm%C3%A1si, Zolt%C3%A1n | NA | 1976-08-29 | NA | 2678 | NULL |
B%C3%A1nusz, Tam%C3%A1s | NA | | NA | Mohács | NULL |
Baskaran, Adhiban | NA | 1992-08-15 | NA | Mayiladuthurai | NULL |
Gujrathi, Vidit | NA | 1994-10-24 | NA | [1] | NULL |
Henriquez, Cristobal | NA | 1996-08-07 | NA | La Florida | NULL |
In%C4%91i%C4%87, Aleksandar | NA | | NA | Belgrade | NULL |
M., Haik | NA | 2000-07-14 | NA | Byuravan | NULL |
Salgado, Iv%C3%A1n | NA | | NA | | NULL |
van, Jorden | NA | 1999-04-30 | NA | Utrecht | NULL |
The merge appears to be a success because we managed to fill 16 missing “clean” names. With only 9 missing “clean” names left, it should be easy to fill them in manually.
# Fill in missing clean names
grandmaster_biotable_2600_updated2 <- grandmaster_biotable_2600_updated[273, Name := "Almasi, Zoltan"][274, Name := "Banusz, Tamas"][275, Name := "Baskarans, Adhiban"][276, Name := "Gujrathi Vidit"][277, Name := "Henriquez, Cristobal"][278, Name := "Indjic, Aleksandar"][279, Name := "Martirosyan, Haik M."][280, Name := "Salgado López, Iván"][281, Name := "Van Foreest, Jorden"]

# Check how many missing names are left
print(paste(c(sum(is.na(grandmaster_biotable_2600_updated2$Name)), "missing values!"), collapse = " "))
## [1] "0 missing values!"
Now, we can begin dealing with the duplicates. Remember that the “Name.x” variable is where the web-scraped names are. This means that if we want to detect any duplicates, it is going to show in that column.
grandmaster_biotable_2600_duplicates <- grandmaster_biotable_2600_updated2 %>%
  count(Name.x) %>%
  filter(n > 1) %>%
  rename(Copies = n)
kable(grandmaster_biotable_2600_duplicates)
Name.x | Copies |
---|---|
Areshchenko, Alexander | 2 |
Chao, Li | 2 |
Chernin, Alexander | 2 |
Donchenko, Alexander | 2 |
Gyimesi, Zoltan | 2 |
Hao, Wang | 2 |
L., S. | 3 |
Navara, David | 2 |
P., S. | 2 |
Paravyan, David | 2 |
van, Robin | 2 |
Yangyi, Yu | 2 |
Yi, Wei | 2 |
Yue, Wang | 2 |
We can eliminate all of the duplicates using R’s distinct function.
grandmaster_biotable_2600_unique <- grandmaster_biotable_2600_updated2 %>%
  distinct(`Name.x`, .keep_all = TRUE)
grandmaster_biotable_2600_unique
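Keeping only the first row per name is simple, but the first row is not always the right match. The Python sketch below mimics the count-then-keep-first logic on a few toy rows. The rows, their order, and the helper names `duplicate_counts` and `keep_first` are all hypothetical and chosen only to show the effect.

```python
from collections import Counter

def duplicate_counts(rows, key):
    """Return {value: copies} for every value of `key` seen more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

def keep_first(rows, key):
    """Keep the first row for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Toy rows (hypothetical order): the wrong match happens to come first
toy_rows = [
    {"Name.x": "Hao, Wang", "Name": "Wang, Yue"},
    {"Name.x": "Hao, Wang", "Name": "Wang, Hao"},
    {"Name.x": "Navara, David", "Name": "Navara, David"},
]

print(duplicate_counts(toy_rows, "Name.x"))
print(keep_first(toy_rows, "Name.x"))
```

In this toy case the surviving row pairs the scraped name with the wrong player, which is the kind of leftover error that has to be corrected by hand.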
Now that the duplicates are gone, there are a number of manual corrections that need to be made due to the merging mistakes that were created earlier.
# Corrections
grandmaster_biotable_2600_almost <- grandmaster_biotable_2600_unique[10, Name := "Gyimesi, Zoltan"][10, Birthyear := "1977"][20, Name := "Donchenko, Alexander"][20, Birthdate.x := "1998-03-22"][20, Birthyear := "1998"][130, Name := "Nielsen, Peter Heine"][130, Birthyear := "1973"][138, Name := "Narayanan, S. L."][138, Birthyear := "1998"][139, Name := "Sethuraman, S. P."][139, Birthyear := "1993"][139, City_of_birth := "Chennai"][165, Name := "Paravyan, David"][165, Birthdate.x := "1998-03-08"][165, Birthyear := "1998"][197, Name := "Van Kampen, Robin"][222, Name := "Wei, Yi"][222, Birthyear := "1999"][222, City_of_birth := "Wuxi"][246, Name := "Wang, Yue"][246, Birthyear := "1987"][247, Name := "Yangyi, Yu"][247, Birthyear := "1994"][259, Name := "Banusz, Tamas"][259, Birthdate.x := "1989-04-08"][259, Birthyear := "1989"][260, Birthyear := "1992"]
Let’s check out our data.
# Clean up data a bit
grandmaster_biotable_2600_almost2 <- grandmaster_biotable_2600_almost %>%
  arrange(Name) %>%
  select(Name, Birthdate.x, Birthyear, City_of_birth, Birthplace)
grandmaster_biotable_2600_almost2
Dealing with Birthdates
Phase 2 of the data cleaning involves making sure our birthdates are correct. Our main issue is that the “Birthdate” column has a number of blank observations. To fix this, we will do another merge with the grandmaster list table, using the “Born” column in that data set to fill in the blanks. This merge will also let us include the “FIDE ID” column so that future merges are more exact.
# Fill in missing birthdates with born column
grandmaster_biotable_2600_almost3 <- grandmaster_biotable_2600_almost2 %>%
  stringdist_left_join(py$all_grandmaster_wiki_table, by = c("Name"), method = "qgram", max_dist = 2) %>%
  mutate(Birthdate.x = ifelse(Birthdate.x == "", NA, Birthdate.x)) %>%
  mutate(Birthdate.x = coalesce(Birthdate.x, Born)) %>%
  select(`FIDE ID`, Name.x, Birthdate.x, Birthyear, City_of_birth) %>%
  rename(ID = `FIDE ID`)
grandmaster_biotable_2600_almost3 <- data.table(grandmaster_biotable_2600_almost3)

# View missing data
print(paste(c(sum(is.na(grandmaster_biotable_2600_almost3$Birthdate.x)) , "missing values!"), collapse = " "))
## [1] "3 missing values!"
Now we only have 2 birthdates to manually insert. We can also use this opportunity to fix the FIDE IDs.
# Manual Corrections
grandmaster_biotable_2600_almost4 <- grandmaster_biotable_2600_almost3[5, ID := 4157770][5, Birthdate.x := "1954-04-02"][10, ID := 702293][16, ID := 4107012][18, ID := 5072786][18, Birthdate.x := "1999-09-11"][27, ID := 722413][31, ID := 5018471][79, ID := 3800024][92, ID := 3409350][139, ID := 8604436][200, ID := 11600098]
Dealing with Birthplaces
During the web-scraping process, some of the birthplaces either came out as ratings or blanks. To solve this, we will mutate the columns so that those mistakes become NA values.
#Replace problematic birthplaces with NA values
grandmaster_biotable_2600_almost5 <- grandmaster_biotable_2600_almost4 %>%
  mutate(City_of_birth = ifelse(str_detect(City_of_birth, ".*\\d"), NA, City_of_birth)) %>%
  mutate(City_of_birth = ifelse(City_of_birth == "", NA, City_of_birth))
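The rule here is simply "anything containing a digit, or nothing at all, is not a birthplace". A minimal Python sketch of the same rule, using only the standard library, could look like this. `clean_birthplace` is a hypothetical helper, and the sample values are taken from the scraper output.

```python
import re

def clean_birthplace(value):
    """Return None for blanks and for values containing a digit (ratings, '[1]')."""
    if value is None or value.strip() == "":
        return None
    if re.search(r"\d", value):
        return None
    return value

for raw in ["Tønsberg", "2607", "[1]", "", "Ho Chi Minh City"]:
    print(repr(raw), "->", repr(clean_birthplace(raw)))
```

The digit test works for this data because no genuine birthplace in the output contains a number; that is an assumption that would need rechecking on other data.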
Using our grandmaster list table, we will fill in our missing city of birth data using the “Birthplace” column from the list.
#Load wiki table list in R
all_grandmaster_wiki_table_r <- py$all_grandmaster_wiki_table
all_grandmaster_wiki_table_r$Birthplace <- unlist(all_grandmaster_wiki_table_r$Birthplace)

# Join and fill in missing values
grandmaster_biotable_2600_almost6 <- grandmaster_biotable_2600_almost5 %>%
  left_join(all_grandmaster_wiki_table_r, by = c("ID" = "FIDE ID")) %>%
  mutate(City_of_birth = coalesce(City_of_birth, Birthplace)) %>%
  select(ID, Name.x, Birthdate.x, Birthyear, City_of_birth, Birthplace)
grandmaster_biotable_2600_almost6
With the necessary columns being filled, it’s time to move on to validating the data.
Data Validation
Validation of dates
We kept the birth year column throughout the merges because we needed some way to ensure that the birthdates matched the players. Because the birth years came with the players in the original data, the column is well suited for validating the birthdates. Using R’s stringdist package, we can compute the string distances between the two columns. If the values agree, the distance should be exactly 6, because the “Birthdate” column has 2 additional hyphens and 4 additional digits.
library(stringdist)
stringdist(grandmaster_biotable_2600_almost6$Birthdate.x, grandmaster_biotable_2600_almost6$Birthyear)
## [1] 6 6 6 6 6 6 6 6 6 NA 6 6 6 6 6 6 NA 6 6 6 6 6 6 6 6
## [26] 6 6 6 6 6 6 6 6 6 NA 6 6 NA 6 NA 6 6 6 6 6 6 6 6 6 6
## [51] 6 NA 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 NA 6 6 6 6
## [76] 6 6 6 6 6 NA 6 NA 6 6 6 6 6 6 6 6 NA 6 6 6 6 6 NA 6 NA
## [101] 6 NA 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [126] 6 6 6 6 6 6 6 6 NA NA 6 6 6 6 6 6 6 6 6 6 6 6 6 NA NA
## [151] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 NA 6 6 6
## [176] 6 6 6 6 NA 6 6 6 6 6 6 6 6 6 6 6 6 6 NA 6 6 6 6 6 6
## [201] 6 6 6 6 6 6 6 NA 6 NA 6 6 6 6 NA 6 6 6 6 6 6 6 6 6 6
## [226] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 NA 6 6 6 6 6
## [251] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
The NA values are caused by blanks in the birth year column. Overall, it seems that all of our work paid off because we are only seeing string distances equal to 6.
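To see why a correct birthdate gives a distance of 6, here is a small Python sketch of a plain Levenshtein edit distance. I am assuming it agrees with what stringdist reports for this kind of comparison, since turning a year into a full date only requires insertions; the function name and the test strings are my own.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free when the characters match)
            ))
        previous = current
    return previous[-1]

# A date whose year matches: only "-05-14" (6 characters) separates the strings
matching = levenshtein("1995-05-14", "1995")
# A date paired with the wrong year needs extra substitutions on top of that
mismatched = levenshtein("1995-05-14", "1987")
print(matching, mismatched)
```

A distance of exactly 6 means the year is intact and only the month and day were added, while a mismatched year pushes the distance above 6.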
Validation of Birthplaces
Validating birthplaces is actually very difficult because of one problem: conflicts between the Wikipedia pages (City_of_birth variable) and the grandmaster list (Birthplace variable). The conflicts are mainly caused by inaccuracies, outdated information, and historical changes. For example, a number of Russian cities have gone through recent political transitions, so their names have changed in the past 20-30 years. Additionally, some cities in Russia and Ukraine share the same name. Moreover, it may be difficult to know where a Russian or Ukrainian player was born because some players were born in one country but were raised in the other.
Let’s take a look at the birthplace conflicts.
grandmaster_biotable_2600_almost6[grandmaster_biotable_2600_almost6$City_of_birth != grandmaster_biotable_2600_almost6$Birthplace]
There are 45 conflicts and there is no automated way to deal with them. The best thing to do is to go through them manually and decide which ones are worth changing. The grandmaster list Wikipedia page does support many of its records with FIDE applications, which sometimes contain player birthplaces. For our purposes, if the grandmaster list has documentation supporting its data (mainly in the form of grandmaster title applications), then that location is chosen over the web-scraped data. Otherwise, the Wikipedia birthplaces are left alone.
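The conflict check itself is simple to express. Below is a minimal Python sketch, assuming each row is a dictionary with the two columns; the function name `find_conflicts` and the sample rows are hypothetical. Rows where either source is missing are skipped, since a missing value is a gap rather than a disagreement.

```python
def find_conflicts(rows):
    """Return rows where both sources give a birthplace and the two values differ."""
    return [
        row for row in rows
        if row["City_of_birth"] is not None
        and row["Birthplace"] is not None
        and row["City_of_birth"] != row["Birthplace"]
    ]

# Invented rows for illustration; the renamed-city case mirrors the kind of conflict described above
sample_rows = [
    {"Name": "Player A", "City_of_birth": "Baku", "Birthplace": "Baku"},
    {"Name": "Player B", "City_of_birth": "Sverdlovsk", "Birthplace": "Yekaterinburg"},
    {"Name": "Player C", "City_of_birth": "Truro", "Birthplace": None},
]
conflicts = find_conflicts(sample_rows)
print([row["Name"] for row in conflicts])
```

Only Player B is flagged here. Deciding which value to keep still has to be done by hand, as described above.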
I could not find a function that could be used to substitute the “City_of_birth” variable with the “Birthplace” variable, so I made my own function.
# Correction based on the grandmaster list
grandmaster_biotable_2600_almost7 <- grandmaster_biotable_2600_almost6[142, City_of_birth := "Yekaterinburg"][258, City_of_birth := "Tashkent"]
# Row numbers that are going to change
changing_indices = c(6, 9, 19, 28, 55, 67, 74, 90, 92, 99, 100, 105, 106, 133, 140, 160, 239, 245, 261)
# Function that replaces City_of_birth information with Birthplace information
row_substitute <- function(dt, index) {
  value = dt[index, Birthplace]
  dt = dt[index, City_of_birth := value]
  dt
}
for (i in changing_indices){
  row_substitute(grandmaster_biotable_2600_almost7, i)
}
grandmaster_biotable_2600_almost7[grandmaster_biotable_2600_almost7$City_of_birth != grandmaster_biotable_2600_almost7$Birthplace]
The last 27 conflicts were left as is.
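For comparison, the row substitution can be sketched in Python as well. This is a rough equivalent under my own assumptions: rows are dictionaries in a list, the row numbers are 1-based to match R, and `substitute_rows` is a hypothetical name. Like data.table's `:=`, it modifies the rows in place.

```python
def substitute_rows(rows, row_numbers):
    """Overwrite City_of_birth with Birthplace for the given 1-based row numbers."""
    for number in row_numbers:
        row = rows[number - 1]  # convert R-style 1-based numbering to Python's 0-based
        row["City_of_birth"] = row["Birthplace"]
    return rows

# Invented rows for illustration only
table = [
    {"City_of_birth": "City A (scraped)", "Birthplace": "City A (list)"},
    {"City_of_birth": "City B (scraped)", "Birthplace": "City B (list)"},
    {"City_of_birth": "City C (scraped)", "Birthplace": "City C (list)"},
]
substitute_rows(table, [1, 3])
print([row["City_of_birth"] for row in table])
```

Rows 1 and 3 now carry the value from the list, while row 2 keeps its scraped value.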
Here is a glimpse of the final data set after removing chess players that are not grandmasters (Afromeev, Vladimir and Rausis, Igors).
grandmaster_biotable_2600_complete <- grandmaster_biotable_2600_almost7 %>%
  select(ID, Name.x, Birthdate.x, City_of_birth) %>%
  mutate(City_of_birth = ifelse(City_of_birth == "NaN", NA, City_of_birth)) %>%
  rename(Name = Name.x,
         Birthdate = Birthdate.x) %>%
  filter(Name != "Afromeev, Vladimir" & Name != "Rausis, Igors") # Filter out non grandmasters
str(grandmaster_biotable_2600_complete)
## Classes 'data.table' and 'data.frame': 264 obs. of 4 variables:
## $ ID : num 13402960 14204118 400041 10601619 13300580 ...
## $ Name : chr "Abasov, Nijat" "Abdusattorov, Nodirbek" "Adams, Michael" "Adly, Ahmed" ...
## $ Birthdate : chr "1995-05-14" "2004-09-18" "1971-11-17" "1987-02-18" ...
## $ City_of_birth: chr "Baku" "Tashkent" "Truro" "Cairo" ...
## - attr(*, ".internal.selfref")=<externalptr>
Part 2: Getting the Rest of the Grandmasters
Unlike Part 1, there is not enough information on Wikipedia for many of these grandmasters. The best thing to do is to get the majority of the information from the grandmaster list.
Data Preparation
Let’s filter our FIDE data for grandmasters below 2600 and prepare our data sets for merging.
# Filter for grandmasters under 2600
grandmaster_rest_raw = chess[(chess['SRtng'] < 2600) & (chess["Tit"] == "GM")]
print(grandmaster_rest_raw.info())
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 1474 entries, 368 to 1020287
## Data columns (total 19 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 ID Number 1474 non-null int64
## 1 Name 1474 non-null object
## 2 Fed 1474 non-null object
## 3 Sex 1474 non-null object
## 4 Tit 1474 non-null object
## 5 WTit 33 non-null object
## 6 OTit 24 non-null object
## 7 FOA 3 non-null object
## 8 SRtng 1474 non-null float64
## 9 SGm 1474 non-null float64
## 10 SK 1474 non-null float64
## 11 RRtng 1167 non-null float64
## 12 RGm 1167 non-null float64
## 13 Rk 1167 non-null float64
## 14 BRtng 1161 non-null float64
## 15 BGm 1161 non-null float64
## 16 BK 1161 non-null float64
## 17 B-day 1474 non-null int64
## 18 Flag 414 non-null object
## dtypes: float64(9), int64(2), object(8)
## memory usage: 230.3+ KB
## None
# Data set preparation
grandmaster_rest_raw_r <- py$grandmaster_rest_raw %>%
  rename(ID = `ID Number`) %>%
  select(ID, Name)
all_grandmaster_wiki_table_r2 <- all_grandmaster_wiki_table_r %>%
  rename(ID = `FIDE ID`) %>%
  select(ID, Born, Birthplace)
Merging
Now, we can merge the data and find the number of missing birthplace and birthdate values.
grandmaster_biotable_rest <- grandmaster_rest_raw_r %>%
  left_join(all_grandmaster_wiki_table_r2, by = "ID") %>%
  rename(Birthdate = Born,
         City_of_birth = Birthplace) %>%
  mutate(City_of_birth = ifelse(City_of_birth == "NaN", NA, City_of_birth))
print(paste(c(sum(is.na(grandmaster_biotable_rest$City_of_birth)), "missing birthplace values!"), collapse = " "))
## [1] "258 missing birthplace values!"
print(paste(c(sum(is.na(grandmaster_biotable_rest$Birthdate)), "missing birthdate values!"), collapse = " "))
## [1] "2 missing birthdate values!"
We can fill in these missing birthdate values using chess.com and Wikipedia.
grandmaster_biotable_rest <- setDT(grandmaster_biotable_rest)
grandmaster_biotable_rest_complete <- grandmaster_biotable_rest[865, Birthdate := "2009-02-05"][1268, Birthdate := "2005-03-22"]
The 258 missing birthplace values are not the only problem; some grandmaster birthdates only have birth years. Unfortunately, this is the best that can be done for this data set. It’s now time to append the two data sets.
Part 3: The Final Merge
Let’s show the final table and export all of the data sets.
grandmaster_bdates_bplaces <- rbind(grandmaster_biotable_2600_complete, grandmaster_biotable_rest_complete) %>%
  arrange(Name)
grandmaster_bdates_bplaces
#write.csv(grandmaster_bdates_bplaces,"C:/Users/laryl/Desktop/Data Sets//all_grandmaster_bdates_bplaces.csv")
#write.csv(grandmaster_biotable_2600_complete,"C:/Users/laryl/Desktop/Data Sets//top_grandmaster_bdates_bplaces.csv")
#write.csv(grandmaster_biotable_rest_complete,"C:/Users/laryl/Desktop/Data Sets//rest_of_grandmaster_bdates_bplaces.csv")
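The final append is just a row bind followed by a sort on the name. A minimal Python sketch of the same idea is below. The first two rows are copied from the str() output shown earlier, while the third row is a hypothetical placeholder standing in for a grandmaster with only a birth year and no birthplace.

```python
# Two rows taken from the Part 1 output; deliberately listed out of order
top_players = [
    {"Name": "Adams, Michael", "Birthdate": "1971-11-17", "City_of_birth": "Truro"},
    {"Name": "Abasov, Nijat", "Birthdate": "1995-05-14", "City_of_birth": "Baku"},
]
# Hypothetical placeholder row for the under-2600 group
rest_players = [
    {"Name": "Example, Player", "Birthdate": "1980", "City_of_birth": None},
]

# Both tables share the same columns, so they can be stacked and then sorted by name
combined = sorted(top_players + rest_players, key=lambda row: row["Name"])
names = [row["Name"] for row in combined]
print(names)
```

This only works cleanly because both tables were reduced to the same set of columns before being combined, which is also what rbind expects.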
Conclusion
Although this project began with the simple goal of obtaining grandmaster birthdates, we ended up acquiring birthplaces too. This project proved to be very challenging, especially during part 1. But along with these challenges came the opportunity to combine new R and Python tools like the “fuzzyjoin” package and the “googlesearch” library.
The data extracted from this project will be combined with other original data sets from previous chess web-scraping projects so that questions about chess player origins and rating trajectories can be answered. Note that these data sets are not complete because there are many manual insertions and corrections that need to be done. However, there is going to be an updated version of this data set that will include country of birth information and longitude and latitude data. If you want to get more information about the data sets (the ones here and the updated one) and download them, please visit my GitHub.
Sources
For the September chess player ratings, check out the FIDE Website.
For some of the birthplace information, I used chess.com’s top chess players page.
For the birth information, check out the current list of chess grandmasters and the old list of chess grandmasters, which used to have a birthplace column before editors changed it.
For information about the highest rated FIDE Master, check out Vladimir Afromeev’s chessgames page and Wikipedia page.