Grandmaster B-days and B-places

Cedric Lary

9/13/2021

Introduction

My goal for this project is to gather as many chess grandmaster birthdates and birthplaces as possible using web-scraping tools. Although a great deal of chess player data is publicly available, personal details such as birthdates and birthplaces can be hard to find, especially for lesser-known players. To deal with this problem, I am dividing this project into two parts.

In Part 1, I will use a combination of web-scraping tools to extract birthdates and birthplaces from the Wikipedia pages of chess players rated 2600 or higher. I chose this threshold because players who reach a 2600 rating usually become more visible in the chess world and are more likely to have a Wikipedia page.

In Part 2, for grandmasters below this rating threshold, I will extract almost all of their birth information from Wikipedia's list of grandmasters page (see Sources for why I am using an older version of this page). I am not using this method for all of the grandmasters because I found the page only after I had devised my own web-scraping methods for Part 1. Using a more complex framework for the first part of this project also gives me the opportunity to practice new Python and R web-scraping tools.

I do make use of the list of grandmasters page in the first part of the project when my web-scraper does not extract the correct information. There are some conflicts that arise between the list of grandmasters page and the players’ Wikipedia pages, so as we progress through this project, I will detail my process for solving these inconsistencies.

Part 1: Getting 2600 Birth Information

Data Preparation

Before anything else, we need to load the package that permits us to use both R and Python.

library(reticulate)

Let’s load the libraries we need for Python.

import pandas as pd
from bs4 import BeautifulSoup
import requests
from IPython.display import display
import pprint as pp
import lxml.html as lh

Next, we will be loading our chess rating data, which we obtained from the FIDE website. I chose the September ratings.

file = 'C:/Users/laryl/Desktop/Data Sets/players_list_foa_sept.txt'
chess = pd.read_fwf(file)

Let’s filter for players rated 2600 or higher.

grandmaster_2600_raw = chess[chess['SRtng'] >= 2600]
display(grandmaster_2600_raw.info())
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 266 entries, 1963 to 1021560
## Data columns (total 19 columns):
##  #   Column     Non-Null Count  Dtype  
## ---  ------     --------------  -----  
##  0   ID Number  266 non-null    int64  
##  1   Name       266 non-null    object 
##  2   Fed        266 non-null    object 
##  3   Sex        266 non-null    object 
##  4   Tit        266 non-null    object 
##  5   WTit       2 non-null      object 
##  6   OTit       3 non-null      object 
##  7   FOA        0 non-null      object 
##  8   SRtng      266 non-null    float64
##  9   SGm        266 non-null    float64
##  10  SK         266 non-null    float64
##  11  RRtng      255 non-null    float64
##  12  RGm        255 non-null    float64
##  13  Rk         255 non-null    float64
##  14  BRtng      253 non-null    float64
##  15  BGm        253 non-null    float64
##  16  BK         253 non-null    float64
##  17  B-day      266 non-null    int64  
##  18  Flag       29 non-null     object 
## dtypes: float64(9), int64(2), object(8)
## memory usage: 41.6+ KB
## None

From the output above, we have 266 players to find information for. Because the players’ last names are first, we need to flip them with their first names so that our search queries are more effective later on. Luckily, we can easily do this in R and clean up some of the names at the same time. Here is a preview of the data:

# Flip the players' names
library(tidyverse)
grandmaster_2600_r <- py$grandmaster_2600_raw %>%
  separate(Name, c("Last", "First"), sep = ", ", remove = TRUE, convert = FALSE) %>%
  unite(Name, First, Last, sep = " ") %>%
  arrange(Name)

# Fix some problematic names for the next phase
grandmaster_2600_cleanish <- grandmaster_2600_r %>%
  mutate(Name = dplyr::recode(Name,
    "A.R. Saleh Salem" = "Salem Saleh (chess player)",
    "B. Adhiban" = "Adhiban Baskaran",
    "Chao b Li" = "Li Chao (chess player)",
    "Chithambaram VR. Aravindh" = "Aravindh Chithambaram",
    "David W L Howell" = "David Howell (chess player)",
    "Fabiano Caruana" = "Fabiano Caruana wikipedia",
    "Robert Hovhannisyan" = "Robert Hovhannisyan en wikipedia",
    "Hao Wang" = "Wang Hao (chess player)",
    "NA Narayanan.S.L" = "S.L. Narayanan",
    "NA Nihal Sarin" = "Nihal Sarin",
    "NA Praggnanandhaa R" = "Rameshbabu Praggnanandhaa",
    "Johan-Sebastian Christiansen" = "Johan-Sebastian Christiansen wikipedia",
    "Konstantin Landa" = "Konstantin Landa wikipedia"
  ))

head(grandmaster_2600_cleanish)
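The name flip itself could also be done on the Python side. The snippet below is only a sketch of that idea, not the code used in this project: `flip_name` and the three sample names are made up for illustration. It assumes names arrive as "Last, First" and leaves names without a comma untouched, which would avoid the "NA ..." prefixes that the R code above has to recode by hand.

```python
import pandas as pd


def flip_name(raw_name):
    """Turn 'Last, First' into 'First Last'; leave comma-less names unchanged."""
    last, sep, first = raw_name.partition(", ")
    if not sep:
        # No ", " separator found, so there is nothing to flip.
        return raw_name.strip()
    return f"{first.strip()} {last.strip()}"


# Illustrative sample only; the real names come from the FIDE rating file.
sample = pd.Series(["Carlsen, Magnus", "Nihal Sarin", "Howell, David W L"])
flipped = sample.map(flip_name)
print(flipped.tolist())
```

Names such as "David W L Howell" would still need a manual fix, because the middle initials do not appear in the Wikipedia title.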

Getting Data Using Google and Wikipedia

The Wikipedia URLs of chess players are fairly consistent. Using the world champion as an example, the formula for any person is:

the base URL (https://en.wikipedia.org/wiki/) + first name (Magnus) + last name (_Carlsen) = https://en.wikipedia.org/wiki/Magnus_Carlsen
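As a rough illustration of this formula, here is a minimal Python sketch. The function name `naive_wiki_url` is hypothetical, and the choice to percent-encode accented characters while keeping parentheses literal is an assumption based on how such URLs usually look, not something this project relies on.

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"


def naive_wiki_url(name):
    """Build a Wikipedia URL by joining the words of a name with underscores."""
    title = name.strip().replace(" ", "_")
    # Percent-encode non-ASCII characters; keep parentheses readable.
    return BASE_URL + quote(title, safe="()")


print(naive_wiki_url("Magnus Carlsen"))
print(naive_wiki_url("Judit Polgár"))
```

A URL built this way is only a guess: nothing here checks that the page actually exists.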

However, because the data set was made by FIDE and not Wikipedia, there are a number of inconsistencies that will arise if we just blindly follow the formula above. Here are some of the issues:

  • Names in the data set could be misspelled
  • First and last names in the data set could be flipped (even after we rearranged them)
  • Long names are sometimes abbreviated
  • Wikipedia page titles are sometimes case sensitive (especially for Dutch players)
  • Some players' Wikipedia pages exist only in other languages (some of the Spanish and Latin American players only have pages in Spanish)
  • Some players have special characters in their names (like accents from other languages)
  • Some players have common names (so we need to ensure that we get the chess player)
  • Some players may not even have a Wikipedia page

With all of these issues in mind, I decided to use Google searches so that I could ensure that I get Wikipedia URLs that exist. Luckily, there is a Python library that returns webpage links for search queries. So let’s build our queries from the names in our data set.


grandmaster_list = list(r.grandmaster_2600_cleanish["Name"])

grandmaster_list_wiki_query = [player + ' wiki chess player'  for player in grandmaster_list]
pp.pprint(grandmaster_list_wiki_query)
## ['Salem Saleh (chess player) wiki chess player',
##  'Abhijeet Gupta wiki chess player',
##  'Ahmed Adly wiki chess player',
##  'Alan Pichot wiki chess player',
##  'Aleksandar Indjic wiki chess player',
##  'Aleksandr Lenderman wiki chess player',
##  'Aleksandr Rakhmanov wiki chess player',
##  'Aleksandra Goryachkina wiki chess player',
##  'Aleksey Dreev wiki chess player',
##  'Alexander Areshchenko wiki chess player',
##  'Alexander Chernin wiki chess player',
##  'Alexander Donchenko wiki chess player',
##  'Alexander Galkin wiki chess player',
##  'Alexander Grischuk wiki chess player',
##  'Alexander Ipatov wiki chess player',
##  'Alexander Khalifman wiki chess player',
##  'Alexander Moiseenko wiki chess player',
##  'Alexander Morozevich wiki chess player',
##  'Alexander Motylev wiki chess player',
##  'Alexander Onischuk wiki chess player',
##  'Alexander Riazantsev wiki chess player',
##  'Alexandr Predke wiki chess player',
##  'Alexei Shirov wiki chess player',
##  'Alexey Sarana wiki chess player',
##  'Alireza Firouzja wiki chess player',
##  'Anatoly Karpov wiki chess player',
##  'Andrei Volokitin wiki chess player',
##  'Andrey Esipenko wiki chess player',
##  'Anish Giri wiki chess player',
##  'Ante Brkic wiki chess player',
##  'Anton Demchenko wiki chess player',
##  'Anton Korobov wiki chess player',
##  'Anton Kovalyov wiki chess player',
##  'Anton Smirnov wiki chess player',
##  'Aram Hakobyan wiki chess player',
##  'Arkadij Naiditsch wiki chess player',
##  'Arman Pashikian wiki chess player',
##  'Aryan Tari wiki chess player',
##  'Adhiban Baskaran wiki chess player',
##  'Bartosz Socko wiki chess player',
##  'Bassem Amin wiki chess player',
##  'Benjamin Bok wiki chess player',
##  'Benjamin Gledura wiki chess player',
##  'Bogdan-Daniel Deac wiki chess player',
##  'Boris Alterman wiki chess player',
##  'Boris Gelfand wiki chess player',
##  'Boris Grachev wiki chess player',
##  'Li Chao (chess player) wiki chess player',
##  'Aravindh Chithambaram wiki chess player',
##  'Christian Bauer wiki chess player',
##  'Constantin Lupulescu wiki chess player',
##  'Cristobal Henriquez Villagra wiki chess player',
##  'Daniel Fridman wiki chess player',
##  'Daniel Naroditsky wiki chess player',
##  'Daniel Stellwagen wiki chess player',
##  'Daniele Vocaturo wiki chess player',
##  'Daniil Dubov wiki chess player',
##  'Dariusz Swiercz wiki chess player',
##  'Darmen Sadvakasov wiki chess player',
##  'David Anton Guijarro wiki chess player',
##  'David Baramidze wiki chess player',
##  'David Navara wiki chess player',
##  'David Paravyan wiki chess player',
##  'David Howell (chess player) wiki chess player',
##  'Denis Khismatullin wiki chess player',
##  'Dimitrios Mastrovasilis wiki chess player',
##  'Dmitrij Kollars wiki chess player',
##  'Dmitry Andreikin wiki chess player',
##  'Dmitry Jakovenko wiki chess player',
##  'Dmitry Kononenko wiki chess player',
##  'Eduardo Iturrizaga Bonelli wiki chess player',
##  'Emil Sutovsky wiki chess player',
##  'Eric Hansen wiki chess player',
##  'Ernesto Inarkiev wiki chess player',
##  "Erwin L'Ami wiki chess player",
##  'Etienne Bacrot wiki chess player',
##  'Evgeniy Najer wiki chess player',
##  'Evgeny Alekseev wiki chess player',
##  'Evgeny Bareev wiki chess player',
##  'Evgeny Shtembuliak wiki chess player',
##  'Evgeny Tomashevsky wiki chess player',
##  'Fabiano Caruana wikipedia wiki chess player',
##  'Farrukh Amonatov wiki chess player',
##  'Ferenc Berkes wiki chess player',
##  'Francisco Vallejo Pons wiki chess player',
##  'Gabriel Sargissian wiki chess player',
##  'Gadir Guseinov wiki chess player',
##  'Garry Kasparov wiki chess player',
##  'Gata Kamsky wiki chess player',
##  'Gawain C B Jones wiki chess player',
##  'Georg Meier wiki chess player',
##  'Georgy Pilavov wiki chess player',
##  'Giovanni Vescovi wiki chess player',
##  'Grigoriy Oparin wiki chess player',
##  'Grzegorz Gajewski wiki chess player',
##  'Haik M. Martirosyan wiki chess player',
##  'Hans Moke Niemann wiki chess player',
##  'Wang Hao (chess player) wiki chess player',
##  'Hikaru Nakamura wiki chess player',
##  'Hrant Melkumyan wiki chess player',
##  'Hristos Banikas wiki chess player',
##  'Hua Ni wiki chess player',
##  'Ian Nepomniachtchi wiki chess player',
##  'Igor Kovalenko wiki chess player',
##  'Igor Lysyj wiki chess player',
##  'Igors Rausis wiki chess player',
##  'Ildar Khairullin wiki chess player',
##  'Ilia Smirin wiki chess player',
##  'Illya Nyzhnyk wiki chess player',
##  'Ioannis Papaioannou wiki chess player',
##  'Ivan Cheparinov wiki chess player',
##  'Ivan Popov wiki chess player',
##  'Ivan Salgado Lopez wiki chess player',
##  'Ivan Saric wiki chess player',
##  'Jaime Santos Latasa wiki chess player',
##  'Jan-Krzysztof Duda wiki chess player',
##  'Jan Gustafsson wiki chess player',
##  'Jeffery Xiong wiki chess player',
##  'Jeroen Piket wiki chess player',
##  'Jianchao Zhou wiki chess player',
##  'Jiangchuan Ye wiki chess player',
##  'Jinshi Bai wiki chess player',
##  'Joel Lautier wiki chess player',
##  'Johan-Sebastian Christiansen wikipedia wiki chess player',
##  'Jon Ludvig Hammer wiki chess player',
##  'Jorden Van Foreest wiki chess player',
##  'Jorge Cori wiki chess player',
##  'Jose Eduardo Martinez Alcantara wiki chess player',
##  'Judit Polgar wiki chess player',
##  'Jules Moussard wiki chess player',
##  'Julian M Hodgson wiki chess player',
##  'Julio E Granda Zuniga wiki chess player',
##  'Jun Zhao wiki chess player',
##  'Kacper Piorun wiki chess player',
##  'Karen H. Grigoryan wiki chess player',
##  'Kirill Alekseenko wiki chess player',
##  'Kirill Shevchenko wiki chess player',
##  'Konstantin Landa wikipedia wiki chess player',
##  'Krishnan Sasikiran wiki chess player',
##  'Laurent Fressinet wiki chess player',
##  'Lazaro Bruzon Batista wiki chess player',
##  'Leinier Dominguez Perez wiki chess player',
##  'Levon Aronian wiki chess player',
##  'Liren Ding wiki chess player',
##  'Liviu-Dieter Nisipeanu wiki chess player',
##  'Loek Van Wely wiki chess player',
##  'Luka Lenic wiki chess player',
##  'Luke J McShane wiki chess player',
##  'M. Amin Tabatabaei wiki chess player',
##  'Magnus Carlsen wiki chess player',
##  'Maksim Chigaev wiki chess player',
##  'Manuel Petrosyan wiki chess player',
##  'Marin Bosiocic wiki chess player',
##  'Markus Ragger wiki chess player',
##  'Martyn Kravtsiv wiki chess player',
##  'Mateusz Bartel wiki chess player',
##  'Matthew D Sadler wiki chess player',
##  'Matthias Bluebaum wiki chess player',
##  'Maxim Matlakov wiki chess player',
##  'Maxime Lagarde wiki chess player',
##  'Maxime Vachier-Lagrave wiki chess player',
##  'Michael Adams wiki chess player',
##  'Miguel Illescas Cordoba wiki chess player',
##  'Miguel Santos Ruiz wiki chess player',
##  'Mikhail Al. Antipov wiki chess player',
##  'Mikhail Kobalia wiki chess player',
##  'Murali Karthikeyan wiki chess player',
##  'Mustafa Yilmaz wiki chess player',
##  'NA Erigaisi Arjun wiki chess player',
##  'S.L. Narayanan wiki chess player',
##  'Nihal Sarin wiki chess player',
##  'Rameshbabu Praggnanandhaa wiki chess player',
##  'Ngoc Truong Son Nguyen wiki chess player',
##  'Nigel D Short wiki chess player',
##  'Nijat Abasov wiki chess player',
##  'Nikita Vitiugov wiki chess player',
##  'Nils Grandelius wiki chess player',
##  'Nodirbek Abdusattorov wiki chess player',
##  'Nodirbek Yakubboev wiki chess player',
##  'Olexandr Bortnyk wiki chess player',
##  'Parham Maghsoodloo wiki chess player',
##  'Parimarjan Negi wiki chess player',
##  'Pavel Eljanov wiki chess player',
##  'Pavel Ponkratov wiki chess player',
##  'Pentala Harikrishna wiki chess player',
##  'Peter Heine Nielsen wiki chess player',
##  'Peter Leko wiki chess player',
##  'Peter Svidler wiki chess player',
##  'Pouya Idani wiki chess player',
##  'Quang Liem Le wiki chess player',
##  'Qun Ma wiki chess player',
##  'Radoslaw Wojtaszek wiki chess player',
##  'Rasmus Svane wiki chess player',
##  'Rauf Mamedov wiki chess player',
##  'Raunak Sadhwani wiki chess player',
##  'Ray Robson wiki chess player',
##  'Richard Rapport wiki chess player',
##  'Rinat Jumabayev wiki chess player',
##  'Robert Hovhannisyan en wikipedia wiki chess player',
##  'Robin Van Kampen wiki chess player',
##  'Ruslan Ponomariov wiki chess player',
##  'Rustam Kasimdzhanov wiki chess player',
##  'S.P. Sethuraman wiki chess player',
##  'Sam Shankland wiki chess player',
##  'Samuel Sevian wiki chess player',
##  'Samvel Ter-Sahakyan wiki chess player',
##  'Sanan Sjugirov wiki chess player',
##  'Sandro Mareco wiki chess player',
##  'Santosh Gujrathi Vidit wiki chess player',
##  'Sergei Azarov wiki chess player',
##  'Sergei Movsesian wiki chess player',
##  'Sergei Rublevsky wiki chess player',
##  'Sergei Tiviakov wiki chess player',
##  'Sergey A. Fedorchuk wiki chess player',
##  'Sergey Karjakin wiki chess player',
##  'Shakhriyar Mamedyarov wiki chess player',
##  'Shanglei Lu wiki chess player',
##  'Shant Sargsyan wiki chess player',
##  'Suri Vaibhav wiki chess player',
##  'Surya Shekhar Ganguly wiki chess player',
##  'Tamas Banusz wiki chess player',
##  'Tamir Nabaty wiki chess player',
##  'Teimour Radjabov wiki chess player',
##  'Tigran Gharamian wiki chess player',
##  'Vadim Milov wiki chess player',
##  'Vadim Zvjaginsev wiki chess player',
##  'Valery Salov wiki chess player',
##  'Varuzhan Akobian wiki chess player',
##  'Vasif Durarbayli wiki chess player',
##  'Vasyl Ivanchuk wiki chess player',
##  'Velimir Ivic wiki chess player',
##  'Veselin Topalov wiki chess player',
##  'Viktor Erdos wiki chess player',
##  'Viktor Laznicka wiki chess player',
##  'Vincent Keymer wiki chess player',
##  'Viswanathan Anand wiki chess player',
##  'Vitaliy Bernadskiy wiki chess player',
##  'Vladimir Afromeev wiki chess player',
##  'Vladimir Akopian wiki chess player',
##  'Vladimir Fedoseev wiki chess player',
##  'Vladimir Kramnik wiki chess player',
##  'Vladimir Malakhov wiki chess player',
##  'Vladimir Onischuk wiki chess player',
##  'Vladislav Artemiev wiki chess player',
##  'Vladislav Kovalev wiki chess player',
##  'Vladislav Tkachiev wiki chess player',
##  'Wesley So wiki chess player',
##  'Wojciech Moranda wiki chess player',
##  'Xiangzhi Bu wiki chess player',
##  'Yangyi Yu wiki chess player',
##  'Yannick Gozzoli wiki chess player',
##  'Yaroslav Zherebukh wiki chess player',
##  'Yasser Seirawan wiki chess player',
##  'Yevgeniy Vladimirov wiki chess player',
##  'Yi Wei wiki chess player',
##  'Yifan Hou wiki chess player',
##  'Yue Wang wiki chess player',
##  'Yuri Drozdovskij wiki chess player',
##  'Yuriy Kryvoruchko wiki chess player',
##  'Yuriy Kuzubov wiki chess player',
##  'Zahar Efimenko wiki chess player',
##  'Zhong Zhang wiki chess player',
##  'Zoltan Almasi wiki chess player',
##  'Zoltan Gyimesi wiki chess player',
##  'Zurab Azmaiparashvili wiki chess player',
##  'Zviad Izoria wiki chess player']

We will now use our queries to acquire the URLs.

# Load library
from googlesearch import search

# List to store URLs
player_wiki_pages = []

# Loop so that we get one URL for every query
for p in grandmaster_list_wiki_query:
    for i in search(p, tld='co.in', num=1, stop=1, pause=2.0):
        player_wiki_pages.append(i)
pp.pprint(player_wiki_pages)
## ['https://en.wikipedia.org/wiki/Salem_Saleh_(chess_player)',
##  'https://en.wikipedia.org/wiki/Abhijeet_Gupta',
##  'https://en.wikipedia.org/wiki/Ahmed_Adly',
##  'https://en.wikipedia.org/wiki/Alan_Pichot',
##  'https://en.wikipedia.org/wiki/Aleksandar_In%C4%91i%C4%87',
##  'https://en.wikipedia.org/wiki/Aleksandr_Lenderman',
##  'https://en.wikipedia.org/wiki/Aleksandr_Rakhmanov',
##  'https://en.wikipedia.org/wiki/Aleksandra_Goryachkina',
##  'https://en.wikipedia.org/wiki/Alexey_Dreev',
##  'https://en.wikipedia.org/wiki/Alexander_Areshchenko',
##  'https://en.wikipedia.org/wiki/Alexander_Chernin',
##  'https://en.wikipedia.org/wiki/Alexander_Donchenko',
##  'https://en.wikipedia.org/wiki/Aleksandr_Galkin_(chess_player)',
##  'https://en.wikipedia.org/wiki/Alexander_Grischuk',
##  'https://en.wikipedia.org/wiki/Alexander_Ipatov',
##  'https://en.wikipedia.org/wiki/Alexander_Khalifman',
##  'https://en.wikipedia.org/wiki/Alexander_Moiseenko',
##  'https://en.wikipedia.org/wiki/Alexander_Morozevich',
##  'https://en.wikipedia.org/wiki/Alexander_Motylev',
##  'https://en.wikipedia.org/wiki/Alexander_Onischuk',
##  'https://en.wikipedia.org/wiki/Alexander_Riazantsev_(chess_player)',
##  'https://en.wikipedia.org/wiki/Alexandr_Predke',
##  'https://en.wikipedia.org/wiki/Alexei_Shirov',
##  'https://en.wikipedia.org/wiki/Alexey_Sarana',
##  'https://en.wikipedia.org/wiki/Alireza_Firouzja',
##  'https://en.wikipedia.org/wiki/Anatoly_Karpov',
##  'https://en.wikipedia.org/wiki/Andrei_Volokitin',
##  'https://en.wikipedia.org/wiki/Andrey_Esipenko',
##  'https://en.wikipedia.org/wiki/Anish_Giri',
##  'https://en.wikipedia.org/wiki/Ante_Brki%C4%87',
##  'https://en.wikipedia.org/wiki/Anton_Demchenko',
##  'https://en.wikipedia.org/wiki/Anton_Korobov',
##  'https://en.wikipedia.org/wiki/Anton_Kovalyov',
##  'https://en.wikipedia.org/wiki/Anton_Smirnov_(chess_player)',
##  'https://www.wikidata.org/wiki/Q27525651',
##  'https://en.wikipedia.org/wiki/Arkadij_Naiditsch',
##  'https://en.wikipedia.org/wiki/Arman_Pashikian',
##  'https://en.wikipedia.org/wiki/Aryan_Tari',
##  'https://en.wikipedia.org/wiki/Adhiban_Baskaran',
##  'https://en.wikipedia.org/wiki/Bartosz_So%C4%87ko',
##  'https://en.wikipedia.org/wiki/Bassem_Amin',
##  'https://en.wikipedia.org/wiki/Benjamin_Bok',
##  'https://en.wikipedia.org/wiki/Benj%C3%A1min_Gledura',
##  'https://en.wikipedia.org/wiki/Bogdan-Daniel_Deac',
##  'https://en.wikipedia.org/wiki/Boris_Alterman',
##  'https://en.wikipedia.org/wiki/Boris_Gelfand',
##  'https://en.wikipedia.org/wiki/Boris_Grachev',
##  'https://en.wikipedia.org/wiki/Li_Chao_(chess_player)',
##  'https://en.wikipedia.org/wiki/Aravindh_Chithambaram',
##  'https://en.wikipedia.org/wiki/Christian_Bauer',
##  'https://en.wikipedia.org/wiki/Constantin_Lupulescu',
##  'https://en.wikipedia.org/wiki/Cristobal_Henriquez_Villagra',
##  'https://en.wikipedia.org/wiki/Daniel_Fridman',
##  'https://en.wikipedia.org/wiki/Daniel_Naroditsky',
##  'https://en.wikipedia.org/wiki/Dani%C3%ABl_Stellwagen',
##  'https://en.wikipedia.org/wiki/Daniele_Vocaturo',
##  'https://en.wikipedia.org/wiki/Daniil_Dubov',
##  'https://en.wikipedia.org/wiki/Dariusz_%C5%9Awiercz',
##  'https://en.wikipedia.org/wiki/Darmen_Sadvakasov',
##  'https://en.wikipedia.org/wiki/David_Ant%C3%B3n_Guijarro',
##  'https://en.wikipedia.org/wiki/David_Baramidze',
##  'https://en.wikipedia.org/wiki/David_Navara',
##  'https://en.wikipedia.org/wiki/David_Paravyan',
##  'https://en.wikipedia.org/wiki/David_Howell_(chess_player)',
##  'https://en.wikipedia.org/wiki/Denis_Khismatullin',
##  'https://en.wikipedia.org/wiki/Dimitrios_Mastrovasilis',
##  'https://en.wikipedia.org/wiki/Dmitrij_Kollars',
##  'https://en.wikipedia.org/wiki/Dmitry_Andreikin',
##  'https://en.wikipedia.org/wiki/Dmitry_Jakovenko',
##  'https://en.wikipedia.org/wiki/Dmitry_Kononenko',
##  'https://en.wikipedia.org/wiki/Eduardo_Iturrizaga',
##  'https://en.wikipedia.org/wiki/Emil_Sutovsky',
##  'https://en.wikipedia.org/wiki/Eric_Hansen_(chess_player)',
##  'https://en.wikipedia.org/wiki/Ernesto_Inarkiev',
##  'https://en.wikipedia.org/wiki/Erwin_l%27Ami',
##  'https://en.wikipedia.org/wiki/%C3%89tienne_Bacrot',
##  'https://en.wikipedia.org/wiki/Evgeniy_Najer',
##  'https://en.wikipedia.org/wiki/Evgeny_Alekseev_(chess_player)',
##  'https://en.wikipedia.org/wiki/Evgeny_Bareev',
##  'https://en.wikipedia.org/wiki/Evgeny_Shtembuliak',
##  'https://en.wikipedia.org/wiki/Evgeny_Tomashevsky',
##  'https://en.wikipedia.org/wiki/Fabiano_Caruana',
##  'https://en.wikipedia.org/wiki/Farrukh_Amonatov',
##  'https://en.wikipedia.org/wiki/Ferenc_Berkes',
##  'https://en.wikipedia.org/wiki/Francisco_Vallejo_Pons',
##  'https://en.wikipedia.org/wiki/Gabriel_Sargissian',
##  'https://en.wikipedia.org/wiki/Gadir_Guseinov',
##  'https://en.wikipedia.org/wiki/Garry_Kasparov',
##  'https://en.wikipedia.org/wiki/Gata_Kamsky',
##  'https://en.wikipedia.org/wiki/Gawain_Jones',
##  'https://en.wikipedia.org/wiki/Georg_Meier_(chess_player)',
##  'https://www.wikidata.org/wiki/Q4362660',
##  'https://en.wikipedia.org/wiki/Giovanni_Vescovi',
##  'https://en.wikipedia.org/wiki/Grigoriy_Oparin',
##  'https://en.wikipedia.org/wiki/Grzegorz_Gajewski',
##  'https://en.wikipedia.org/wiki/Haik_M._Martirosyan',
##  'https://en.wikipedia.org/wiki/Hans_Niemann',
##  'https://en.wikipedia.org/wiki/Wang_Hao_(chess_player)',
##  'https://en.wikipedia.org/wiki/Hikaru_Nakamura',
##  'https://en.wikipedia.org/wiki/Hrant_Melkumyan',
##  'https://en.wikipedia.org/wiki/Hristos_Banikas',
##  'https://en.wikipedia.org/wiki/Ni_Hua',
##  'https://en.wikipedia.org/wiki/Ian_Nepomniachtchi',
##  'https://en.wikipedia.org/wiki/Igor_Kovalenko',
##  'https://en.wikipedia.org/wiki/Igor_Lysyj',
##  'https://en.wikipedia.org/wiki/Igors_Rausis',
##  'https://en.wikipedia.org/wiki/Ildar_Khairullin',
##  'https://en.wikipedia.org/wiki/Ilya_Smirin',
##  'https://en.wikipedia.org/wiki/Illia_Nyzhnyk',
##  'https://en.wikipedia.org/wiki/Ioannis_Papaioannou',
##  'https://en.wikipedia.org/wiki/Ivan_Cheparinov',
##  'https://en.wikipedia.org/wiki/Ivan_Popov_(chess_player)',
##  'https://en.wikipedia.org/wiki/Iv%C3%A1n_Salgado_L%C3%B3pez',
##  'https://en.wikipedia.org/wiki/Ivan_%C5%A0ari%C4%87_(chess_player)',
##  'https://second.wiki/wiki/jaime_santos_latasa',
##  'https://en.wikipedia.org/wiki/Jan-Krzysztof_Duda',
##  'https://en.wikipedia.org/wiki/Jan_Gustafsson',
##  'https://en.wikipedia.org/wiki/Jeffery_Xiong',
##  'https://en.wikipedia.org/wiki/Jeroen_Piket',
##  'https://en.wikipedia.org/wiki/Zhou_Jianchao',
##  'https://en.wikipedia.org/wiki/Ye_Jiangchuan',
##  'https://en.wikipedia.org/wiki/Bai_Jinshi',
##  'https://en.wikipedia.org/wiki/Jo%C3%ABl_Lautier',
##  'https://en.wikipedia.org/wiki/Johan-Sebastian_Christiansen',
##  'https://en.wikipedia.org/wiki/Jon_Ludvig_Hammer',
##  'https://en.wikipedia.org/wiki/Jorden_van_Foreest',
##  'https://en.wikipedia.org/wiki/Jorge_Cori',
##  'https://en.wikipedia.org/wiki/Jose_Eduardo_Martinez_Alcantara',
##  'https://en.wikipedia.org/wiki/Judit_Polg%C3%A1r',
##  'https://en.wikipedia.org/wiki/Jules_Moussard',
##  'https://en.wikipedia.org/wiki/Julian_Hodgson',
##  'https://en.wikipedia.org/wiki/Julio_Granda',
##  'https://en.wikipedia.org/wiki/Zhao_Jun_(chess_player)',
##  'https://en.wikipedia.org/wiki/Kacper_Piorun',
##  'https://en.wikipedia.org/wiki/Karen_H._Grigoryan',
##  'https://en.wikipedia.org/wiki/Kirill_Alekseenko',
##  'https://en.wikipedia.org/wiki/Kirill_Shevchenko',
##  'https://en.wikipedia.org/wiki/Konstantin_Landa',
##  'https://en.wikipedia.org/wiki/Krishnan_Sasikiran',
##  'https://en.wikipedia.org/wiki/Laurent_Fressinet',
##  'https://en.wikipedia.org/wiki/L%C3%A1zaro_Bruz%C3%B3n',
##  'https://en.wikipedia.org/wiki/Leinier_Dom%C3%ADnguez',
##  'https://en.wikipedia.org/wiki/Levon_Aronian',
##  'https://en.wikipedia.org/wiki/Ding_Liren',
##  'https://en.wikipedia.org/wiki/Liviu-Dieter_Nisipeanu',
##  'https://en.wikipedia.org/wiki/Loek_van_Wely',
##  'https://en.wikipedia.org/wiki/Luka_Leni%C4%8D',
##  'https://en.wikipedia.org/wiki/Luke_McShane',
##  'https://en.wikipedia.org/wiki/Amin_Tabatabaei',
##  'https://en.wikipedia.org/wiki/Magnus_Carlsen',
##  'https://en.wikipedia.org/wiki/Maksim_Chigaev',
##  'https://en.wikipedia.org/wiki/Manuel_Petrosyan',
##  'https://en.wikipedia.org/wiki/Marin_Bosio%C4%8Di%C4%87',
##  'https://en.wikipedia.org/wiki/Markus_Ragger',
##  'https://en.wikipedia.org/wiki/Martyn_Kravtsiv',
##  'https://en.wikipedia.org/wiki/Mateusz_Bartel',
##  'https://en.wikipedia.org/wiki/Matthew_Sadler',
##  'https://en.wikipedia.org/wiki/Matthias_Bl%C3%BCbaum',
##  'https://en.wikipedia.org/wiki/Maxim_Matlakov',
##  'https://en.wikipedia.org/wiki/Maxime_Lagarde',
##  'https://en.wikipedia.org/wiki/Maxime_Vachier-Lagrave',
##  'https://en.wikipedia.org/wiki/Michael_Adams_(chess_player)',
##  'https://en.wikipedia.org/wiki/Miguel_Illescas',
##  'https://en.wikipedia.org/wiki/Miguel_Santos_Ruiz',
##  'https://en.wikipedia.org/wiki/Mikhail_Antipov',
##  'https://en.wikipedia.org/wiki/Mikhail_Kobalia',
##  'https://en.wikipedia.org/wiki/Karthikeyan_Murali',
##  'https://en.wikipedia.org/wiki/Mustafa_Y%C4%B1lmaz',
##  'https://en.wikipedia.org/wiki/Arjun_Erigaisi',
##  'https://en.wikipedia.org/wiki/S._L._Narayanan',
##  'https://en.wikipedia.org/wiki/Nihal_Sarin',
##  'https://en.wikipedia.org/wiki/Rameshbabu_Praggnanandhaa',
##  'https://en.m.wikipedia.org/wiki/Nguyen_Ngoc_Truong_Son',
##  'https://en.wikipedia.org/wiki/Nigel_Short',
##  'https://en.wikipedia.org/wiki/Nijat_Abasov',
##  'https://en.wikipedia.org/wiki/Nikita_Vitiugov',
##  'https://en.wikipedia.org/wiki/Nils_Grandelius',
##  'https://en.wikipedia.org/wiki/Nodirbek_Abdusattorov',
##  'https://en.wikipedia.org/wiki/Nodirbek_Yakubboev',
##  'https://en.wikipedia.org/wiki/Olexandr_Bortnyk',
##  'https://en.wikipedia.org/wiki/Parham_Maghsoodloo',
##  'https://en.wikipedia.org/wiki/Parimarjan_Negi',
##  'https://en.wikipedia.org/wiki/Pavel_Eljanov',
##  'https://en.wikipedia.org/wiki/Pavel_Ponkratov',
##  'https://en.wikipedia.org/wiki/Pentala_Harikrishna',
##  'https://en.wikipedia.org/wiki/Peter_Heine_Nielsen',
##  'https://en.wikipedia.org/wiki/Peter_Leko',
##  'https://en.wikipedia.org/wiki/Peter_Svidler',
##  'https://en.wikipedia.org/wiki/Pouya_Idani',
##  'https://en.wikipedia.org/wiki/L%C3%AA_Quang_Li%C3%AAm',
##  'https://en.wikipedia.org/wiki/Ma_Qun',
##  'https://en.wikipedia.org/wiki/Rados%C5%82aw_Wojtaszek',
##  'https://en.wikipedia.org/wiki/Rasmus_Svane',
##  'https://en.wikipedia.org/wiki/Rauf_Mamedov',
##  'https://en.wikipedia.org/wiki/Raunak_Sadhwani',
##  'https://en.wikipedia.org/wiki/Ray_Robson',
##  'https://en.wikipedia.org/wiki/Rich%C3%A1rd_Rapport',
##  'https://en.wikipedia.org/wiki/Rinat_Jumabayev',
##  'https://en.wikipedia.org/wiki/Robert_Hovhannisyan',
##  'https://en.wikipedia.org/wiki/Robin_van_Kampen',
##  'https://en.wikipedia.org/wiki/Ruslan_Ponomariov',
##  'https://en.wikipedia.org/wiki/Rustam_Kasimdzhanov',
##  'https://en.wikipedia.org/wiki/S._P._Sethuraman',
##  'https://en.wikipedia.org/wiki/Sam_Shankland',
##  'https://en.wikipedia.org/wiki/Samuel_Sevian',
##  'https://en.wikipedia.org/wiki/Samvel_Ter-Sahakyan',
##  'https://en.wikipedia.org/wiki/Sanan_Sjugirov',
##  'https://en.wikipedia.org/wiki/Sandro_Mareco',
##  'https://en.wikipedia.org/wiki/Vidit_Gujrathi',
##  'https://en.wikipedia.org/wiki/Sergei_Azarov',
##  'https://en.wikipedia.org/wiki/Sergei_Movsesian',
##  'https://en.wikipedia.org/wiki/Sergei_Rublevsky',
##  'https://en.wikipedia.org/wiki/Sergei_Tiviakov',
##  'https://en.wikipedia.org/wiki/Sergey_Fedorchuk',
##  'https://en.wikipedia.org/wiki/Sergey_Karjakin',
##  'https://en.wikipedia.org/wiki/Shakhriyar_Mamedyarov',
##  'https://en.wikipedia.org/wiki/Lu_Shanglei',
##  'https://en.wikipedia.org/wiki/Shant_Sargsyan',
##  'https://en.wikipedia.org/wiki/List_of_Indian_chess_players',
##  'https://en.wikipedia.org/wiki/Surya_Shekhar_Ganguly',
##  'https://en.wikipedia.org/wiki/Tam%C3%A1s_B%C3%A1nusz',
##  'https://en.wikipedia.org/wiki/Tamir_Nabaty',
##  'https://en.wikipedia.org/wiki/Teimour_Radjabov',
##  'https://en.wikipedia.org/wiki/Tigran_Gharamian',
##  'https://en.wikipedia.org/wiki/Vadim_Milov',
##  'https://en.wikipedia.org/wiki/Vadim_Zvjaginsev',
##  'https://en.wikipedia.org/wiki/Valery_Salov',
##  'https://en.wikipedia.org/wiki/Varuzhan_Akobian',
##  'https://en.wikipedia.org/wiki/Vasif_Durarbayli',
##  'https://en.wikipedia.org/wiki/Vasyl_Ivanchuk',
##  'https://en.wikipedia.org/wiki/Velimir_Ivi%C4%87',
##  'https://en.wikipedia.org/wiki/Veselin_Topalov',
##  'https://en.wikipedia.org/wiki/Viktor_Erd%C5%91s',
##  'https://en.wikipedia.org/wiki/Viktor_L%C3%A1zni%C4%8Dka',
##  'https://en.wikipedia.org/wiki/Vincent_Keymer',
##  'https://en.wikipedia.org/wiki/Viswanathan_Anand',
##  'https://en.wikipedia.org/wiki/Vitaliy_Bernadskiy',
##  'https://en.wikipedia.org/wiki/Vladimir_Afromeev',
##  'https://en.wikipedia.org/wiki/Vladimir_Akopian',
##  'https://en.wikipedia.org/wiki/Vladimir_Fedoseev',
##  'https://en.wikipedia.org/wiki/Vladimir_Kramnik',
##  'https://en.wikipedia.org/wiki/Vladimir_Malakhov_(chess_player)',
##  'https://en.wikipedia.org/wiki/Volodymyr_Onyshchuk',
##  'https://en.wikipedia.org/wiki/Vladislav_Artemiev',
##  'https://en.wikipedia.org/wiki/Vladislav_Kovalev',
##  'https://en.wikipedia.org/wiki/Vladislav_Tkachiev',
##  'https://en.wikipedia.org/wiki/Wesley_So',
##  'https://en.wikipedia.org/wiki/Wojciech_Moranda',
##  'https://en.wikipedia.org/wiki/Bu_Xiangzhi',
##  'http://t1.gstatic.com/licensed-image?q=tbn:ANd9GcR-sXmvQj_rsGFI0Z2h8Y8n62Hw1T7L8umRy3URmaukMqSdwXhB-6r8HfGF1run',
##  'https://en.wikipedia.org/wiki/Yannick_Gozzoli',
##  'https://en.wikipedia.org/wiki/Yaroslav_Zherebukh',
##  'https://en.wikipedia.org/wiki/Yasser_Seirawan',
##  'https://en.wikipedia.org/wiki/Yevgeniy_Vladimirov',
##  'https://en.wikipedia.org/wiki/Wei_Yi',
##  'https://en.wikipedia.org/wiki/Hou_Yifan',
##  'https://en.wikipedia.org/wiki/Wang_Yue',
##  'https://en.wikipedia.org/wiki/Yuri_Drozdovskij',
##  'https://en.wikipedia.org/wiki/Yuriy_Kryvoruchko',
##  'https://en.wikipedia.org/wiki/Yuriy_Kuzubov',
##  'https://en.wikipedia.org/wiki/Zahar_Efimenko',
##  'https://en.wikipedia.org/wiki/Zhang_Zhong',
##  'https://en.wikipedia.org/wiki/Zolt%C3%A1n_Alm%C3%A1si',
##  'https://en.wikipedia.org/wiki/Zoltan_Gyimesi',
##  'https://en.wikipedia.org/wiki/Zurab_Azmaiparashvili',
##  'https://en.wikipedia.org/wiki/Zviad_Izoria']

This worked well: only a few of the URLs are incorrect. We can deal with those in R by filtering out the URLs that do not follow the expected pattern.

# Flag URLs that start with the English Wikipedia article prefix
player_wiki_pages_test <- grepl(pattern = "^https://en.wikipedia.org/wiki/", x = py$player_wiki_pages)

# Clean URLs
player_wiki_pages_clean <- py$player_wiki_pages[player_wiki_pages_test]

# Problematic URLS
problematic_player_pages <- py$player_wiki_pages[!player_wiki_pages_test]
problematic_player_pages
## [1] "https://www.wikidata.org/wiki/Q27525651"                                                                        
## [2] "https://www.wikidata.org/wiki/Q4362660"                                                                         
## [3] "https://second.wiki/wiki/jaime_santos_latasa"                                                                   
## [4] "https://en.m.wikipedia.org/wiki/Nguyen_Ngoc_Truong_Son"                                                         
## [5] "http://t1.gstatic.com/licensed-image?q=tbn:ANd9GcR-sXmvQj_rsGFI0Z2h8Y8n62Hw1T7L8umRy3URmaukMqSdwXhB-6r8HfGF1run"

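Some problematic pages were not detected because their URLs start with the expected base URL. For these players, the search returned a Wikipedia list of chess players from a particular country (for example, List_of_Indian_chess_players) instead of the player’s own page.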
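As a rough sketch, the prefix check and a check for list pages could be combined in a single Python filter. The helper name is_player_page and the "List_of_" rule are our own assumptions for illustration; they are not part of the pipeline used in this project.

```python
# Sketch only: URLs are copied from the output above; the two-step filter is an assumption.
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

urls = [
    "https://en.wikipedia.org/wiki/Wesley_So",
    "https://en.wikipedia.org/wiki/List_of_Indian_chess_players",
    "https://en.m.wikipedia.org/wiki/Nguyen_Ngoc_Truong_Son",
    "https://www.wikidata.org/wiki/Q27525651",
]

def is_player_page(url):
    """True for English Wikipedia article URLs that are not 'List of ...' pages."""
    if not url.startswith(WIKI_PREFIX):
        return False
    title = url[len(WIKI_PREFIX):]
    return not title.startswith("List_of_")

clean = [u for u in urls if is_player_page(u)]
problematic = [u for u in urls if not is_player_page(u)]
print(clean)
print(problematic)
```

A filter like this would only flag the list pages; it would not recover the correct player URLs.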

Because we don’t have a way to get these players’ URLs through web-scraping, we will manually insert their data into our final data frame.

Name <- c( "Jaime, Santos Latasa", "Yu Yangyi", "Georgy, Pilavov", "Suri, Vaibhav", "Aram, Hakobyan","Ngoc Truong Son Nguyen")
Birthdate <- c('1996-07-03',  '1994-06-08',  "1974-12-13", "1997-02-08", "2001-04-01", "1990-02-23")
City_of_birth <- c( 'San Sebastián', 'Hubei', "Luhansk", "New Delhi", "Yerevan", "Rach Gia" )



problematic_grandmaster_bios <- data.frame(Name, Birthdate, City_of_birth)
problematic_grandmaster_bios

Before we do the actual Wikipedia web-scraping, we need a few helper functions that will make our scraping easier.

# Function to extract the name from a URL so we can keep track of whose page we are scraping
def trim_url(x):
    y = x.replace("https://en.wikipedia.org/wiki/", "").replace("_(chess_player)", "").replace('_'," ")
    return(y)

# Function to combine web-scraping elements and convert them to strings
def combine_strings(x):
  y = ", ".join((str(elements) for elements in x)) 
  return(y)

# Function that eliminates unnecessary strings
def extra_string_remover(x):
  y = combine_strings(x).split(",", 1)
  z = y[0]
  return(z)
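One thing trim_url does not do is decode percent-escapes, which is why names such as 'Velimir Ivi%C4%87' appear in the output further down. A minimal sketch of a decoding variant, using urllib.parse.unquote from the standard library, might look like this. The name trim_url_decoded is hypothetical, and the results shown in the rest of this document were produced without it.

```python
from urllib.parse import unquote

def trim_url_decoded(url):
    """Hypothetical variant of trim_url that also decodes percent-escapes."""
    title = url.replace("https://en.wikipedia.org/wiki/", "")
    title = unquote(title)  # e.g. '%C4%87' becomes the single character 'ć'
    return title.replace("_(chess_player)", "").replace("_", " ")

print(trim_url_decoded("https://en.wikipedia.org/wiki/Velimir_Ivi%C4%87"))
print(trim_url_decoded("https://en.wikipedia.org/wiki/Vladimir_Malakhov_(chess_player)"))
```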

Let’s now convert our clean pages vector above into a list in Python and view the number of players we are working with.

player_wiki_pages_final = list(r.player_wiki_pages_clean)
len(player_wiki_pages_final)
## 261

This count includes both the “clean” player pages and the country lists of chess players that passed the URL filter.

Now, let’s use the “Scrapy” library to extract birthdate and birthplace information from each player’s Wikipedia page.

from scrapy import Selector
player_bios = []
for url in list(r.player_wiki_pages_clean):
  html = requests.get(url).content
  sel = Selector( text = html ) 
  bday_text = sel.xpath( '//span[@class="bday"]/text()').extract()
  birthplace = sel.xpath( '//td[@class="infobox-data"]//a/text()').extract()
  birthplace_clean = extra_string_remover(birthplace)
  url_name = trim_url(url)
  player_bios.append( url_name + ": "+   combine_strings(bday_text) + ": "+  birthplace_clean)
pp.pprint(player_bios)
## ['Salem Saleh: 1993-01-04: Sharjah',
##  'Abhijeet Gupta: 1989-10-16: Bhilwara',
##  'Ahmed Adly: 1987-02-18: Cairo',
##  'Alan Pichot: 1998-08-13: Buenos Aires',
##  'Aleksandar In%C4%91i%C4%87: : Belgrade',
##  'Aleksandr Lenderman: 1989-09-23: Leningrad',
##  'Aleksandr Rakhmanov: : Cherepovets',
##  'Aleksandra Goryachkina: 1998-09-28: Orsk',
##  'Alexey Dreev: 1969-01-30: Stavropol',
##  'Alexander Areshchenko: 1986-06-15: Voroshilovgrad',
##  'Alexander Chernin: : ',
##  'Alexander Donchenko: : Moscow',
##  'Aleksandr Galkin: : ',
##  'Alexander Grischuk: 1983-10-31: Moscow',
##  'Alexander Ipatov: 1993-07-16: Lviv',
##  'Alexander Khalifman: 1966-01-18: Leningrad',
##  'Alexander Moiseenko: 1980-05-17: Severomorsk',
##  'Alexander Morozevich: 1977-07-18: Moscow',
##  'Alexander Motylev: 1979-06-17: Sverdlosk',
##  'Alexander Onischuk: 1975-09-03: Sevastopol',
##  'Alexander Riazantsev: 1985-09-12: Moscow',
##  'Alexandr Predke: 1994-01-05: Dimitrovgrad',
##  'Alexei Shirov: 1972-07-04: Riga',
##  'Alexey Sarana: : Moscow',
##  'Alireza Firouzja: : Babol',
##  'Anatoly Karpov: 1951-05-23: Zlatoust',
##  'Andrei Volokitin: 1986-06-18: Lviv',
##  'Andrey Esipenko: : Novocherkassk',
##  'Anish Giri: 1994-06-28: Sopiko Guramishvili',
##  'Ante Brki%C4%87: : 2607',
##  'Anton Demchenko: 1987-08-20: 2654',
##  'Anton Korobov: 1985-06-25: Mezhdurechensk',
##  'Anton Kovalyov: 1992-03-04: Kharkiv',
##  'Anton Smirnov: 2001-01-28: Canberra',
##  'Arkadij Naiditsch: 1985-10-25: Riga',
##  'Arman Pashikian: 1987-07-28: Irkutsk',
##  'Aryan Tari: 1999-06-04: Stavanger',
##  'Adhiban Baskaran: 1992-08-15: Mayiladuthurai',
##  'Bartosz So%C4%87ko: 1978-11-10: Piaseczno',
##  'Bassem Amin: 1988-09-09: 2682',
##  'Benjamin Bok: 1995-01-25: Lelystad',
##  'Benj%C3%A1min Gledura: 1999-07-04: Eger',
##  'Bogdan-Daniel Deac: : Râmnicu Vâlcea',
##  'Boris Alterman: : ',
##  'Boris Gelfand: 1968-06-24: Minsk',
##  'Boris Grachev: 1986-03-27: Moscow',
##  'Li Chao: 1989-04-21: Taiyuan',
##  'Aravindh Chithambaram: : Thirunagar',
##  'Christian Bauer: 1977-01-11: Forbach',
##  'Constantin Lupulescu: 1984-03-25: Buftea',
##  'Cristobal Henriquez Villagra: 1996-08-07: La Florida',
##  'Daniel Fridman: 1976-02-15: Riga',
##  'Daniel Naroditsky: 1995-11-09: San Mateo',
##  'Dani%C3%ABl Stellwagen: 1987-03-01: Soest',
##  'Daniele Vocaturo: 1989-12-16: Rome',
##  'Daniil Dubov: 1996-04-18: Moscow',
##  'Dariusz %C5%9Awiercz: 1994-05-31: Tarnowskie Góry',
##  'Darmen Sadvakasov: 1979-04-28: 2629',
##  'David Ant%C3%B3n Guijarro: 1995-06-23: Murcia',
##  'David Baramidze: 1988-09-27: Georgia',
##  'David Navara: 1985-03-27: Prague',
##  'David Paravyan: : Moscow',
##  'David Howell: 1990-11-14: Eastbourne',
##  'Denis Khismatullin: 1984-12-28: Neftekamsk',
##  'Dimitrios Mastrovasilis: 1983-06-12: 2618',
##  'Dmitrij Kollars: : Bremen',
##  'Dmitry Andreikin: 1990-02-05: Ryazan',
##  'Dmitry Jakovenko: 1983-06-29: Nizhnevartovsk',
##  'Dmitry Kononenko: : ',
##  'Eduardo Iturrizaga: 1989-11-01: Caracas',
##  'Emil Sutovsky: 1977-09-19: Baku',
##  'Eric Hansen: 1992-05-24: Irvine',
##  'Ernesto Inarkiev: 1985-12-09: Khaidarkan',
##  'Erwin l%27Ami: 1985-04-05: Woerden',
##  '%C3%89tienne Bacrot: 1983-01-22: Lille',
##  'Evgeniy Najer: 1977-06-22: Moscow',
##  'Evgeny Alekseev: 1985-11-28: Pushkin',
##  'Evgeny Bareev: 1966-11-21: Yemanzhelinsk',
##  'Evgeny Shtembuliak: : ',
##  'Evgeny Tomashevsky: 1987-07-01: Saratov',
##  'Fabiano Caruana: 1992-07-30: Miami',
##  'Farrukh Amonatov: 1978-04-13: Dushanbe',
##  'Ferenc Berkes: 1985-08-08: Baja',
##  'Francisco Vallejo Pons: 1982-08-21: Es Castell',
##  'Gabriel Sargissian: 1983-09-03: Yerevan',
##  'Gadir Guseinov: 1986-05-21: Moscow',
##  'Garry Kasparov: 1963-04-13: Baku',
##  'Gata Kamsky: 1974-06-02: Novokuznetsk',
##  'Gawain Jones: 1987-12-11: Keighley',
##  'Georg Meier: 1987-08-26: Trier',
##  'Giovanni Vescovi: 1978-06-14: Porto Alegre',
##  'Grigoriy Oparin: 1997-07-01: Munich',
##  'Grzegorz Gajewski: 1985-07-19: Skierniewice',
##  'Haik M. Martirosyan: 2000-07-14: Byuravan',
##  'Hans Niemann: : San Francisco',
##  'Wang Hao: 1989-08-04: Harbin',
##  'Hikaru Nakamura: 1987-12-09: Hirakata',
##  'Hrant Melkumyan: 1989-04-30: Yerevan',
##  'Hristos Banikas: 1978-05-20: Salonica',
##  'Ni Hua: 1983-05-31: Shanghai',
##  'Ian Nepomniachtchi: 1990-07-14: Bryansk',
##  'Igor Kovalenko: 1988-12-29: Novomoskovsk',
##  'Igor Lysyj: 1987-01-01: Sverdlovsk',
##  'Igors Rausis: 1961-04-07: Komunarsk',
##  'Ildar Khairullin: 1990-08-22: Perm',
##  'Ilya Smirin: 1968-01-12: Vitebsk',
##  'Illia Nyzhnyk: 1996-09-27: Vinnytsia',
##  'Ioannis Papaioannou: : Athens',
##  'Ivan Cheparinov: 1986-11-26: Asenovgrad',
##  'Ivan Popov: 1990-03-20: Rostov-on-Don',
##  'Iv%C3%A1n Salgado L%C3%B3pez: : ',
##  'Ivan %C5%A0ari%C4%87: 1990-08-17: Split',
##  'Jan-Krzysztof Duda: 1998-04-26: Wieliczka',
##  'Jan Gustafsson: 1979-06-25: Hamburg',
##  'Jeffery Xiong: 2000-10-30: Plano',
##  'Jeroen Piket: 1969-01-27: Leiden',
##  'Zhou Jianchao: 1988-06-11: Shanghai',
##  'Ye Jiangchuan: 1960-11-20: Wuxi',
##  'Bai Jinshi: 1999-05-18: 2593',
##  'Jo%C3%ABl Lautier: 1973-04-12: Scarborough',
##  'Johan-Sebastian Christiansen: 1998-06-10: 2584',
##  'Jon Ludvig Hammer: 1990-06-02: Bergen',
##  'Jorden van Foreest: 1999-04-30: Utrecht',
##  'Jorge Cori: 1995-07-30: Lima',
##  'Jose Eduardo Martinez Alcantara: 1999-01-31: Lima',
##  'Judit Polg%C3%A1r: 1976-07-23: Budapest',
##  'Jules Moussard: 1995-01-16: Paris',
##  'Julian Hodgson: 1963-07-25: London',
##  'Julio Granda: 1967-02-25: Camaná',
##  'Zhao Jun: 1986-12-12: Jinan',
##  'Kacper Piorun: 1991-11-24: Łowicz',
##  'Karen H. Grigoryan: 1995-02-25: Yerevan',
##  'Kirill Alekseenko: : Vyborg',
##  'Kirill Shevchenko: : Kyiv',
##  'Konstantin Landa: 1972-05-22: Omsk',
##  'Krishnan Sasikiran: 1981-01-07: Chennai',
##  'Laurent Fressinet: 1981-11-30: Dax',
##  'L%C3%A1zaro Bruz%C3%B3n: 1982-05-02: Holguín',
##  'Leinier Dom%C3%ADnguez: 1983-09-23: Havana',
##  'Levon Aronian: 1982-10-06: Yerevan',
##  'Ding Liren: 1992-10-24: Wenzhou',
##  'Liviu-Dieter Nisipeanu: 1976-08-01: Braşov',
##  'Loek van Wely: 1972-10-07: Heesch',
##  'Luka Leni%C4%8D: 1988-05-13: Ljubljana',
##  'Luke McShane: 1984-01-07: 2647',
##  'Amin Tabatabaei: : Tehran',
##  'Magnus Carlsen: 1990-11-30: Tønsberg',
##  'Maksim Chigaev: : ',
##  'Manuel Petrosyan: 1998-05-06: 2637',
##  'Marin Bosio%C4%8Di%C4%87: 1988-08-08: Rijeka',
##  'Markus Ragger: 1988-02-05: Klagenfurt',
##  'Martyn Kravtsiv: 1990-11-26: Lviv',
##  'Mateusz Bartel: 1985-01-03: Warsaw',
##  'Matthew Sadler: 1974-05-15: Chatham',
##  'Matthias Bl%C3%BCbaum: 1997-04-18: Lemgo',
##  'Maxim Matlakov: 1991-03-05: Leningrad',
##  'Maxime Lagarde: : Niort',
##  'Maxime Vachier-Lagrave: 1990-10-21: Nogent-sur-Marne',
##  'Michael Adams: 1971-11-17: Truro',
##  'Miguel Illescas: 1965-12-03: Barcelona',
##  'Miguel Santos Ruiz: 1999-10-04: Utrera',
##  'Mikhail Antipov: 1997-06-10: Moscow',
##  'Mikhail Kobalia: 1978-05-03: 2596',
##  'Karthikeyan Murali: 1999-01-05: Thanjavur',
##  'Mustafa Y%C4%B1lmaz: 1992-11-05: Mamak',
##  'Arjun Erigaisi: : 2633',
##  'S. L. Narayanan: 1998-01-10: Thiruvananthapuram',
##  'Nihal Sarin: 2004-07-13: Thrissur',
##  'Rameshbabu Praggnanandhaa: 2005-08-10: Chennai',
##  'Nigel Short: 1965-06-01: 2620',
##  'Nijat Abasov: 1995-05-14: Baku',
##  'Nikita Vitiugov: 1987-02-04: Leningrad',
##  'Nils Grandelius: 1993-06-03: Lund',
##  'Nodirbek Abdusattorov: 2004-09-18: Tashkent',
##  'Nodirbek Yakubboev: : 2630',
##  'Olexandr Bortnyk: 1996-10-18: Oleksandrivka',
##  'Parham Maghsoodloo: : Gorgan',
##  'Parimarjan Negi: 1993-02-09: New Delhi',
##  'Pavel Eljanov: 1983-05-10: Kharkiv',
##  'Pavel Ponkratov: : 2641',
##  'Pentala Harikrishna: 1986-05-10: Guntur',
##  'Peter Heine Nielsen: 1973-05-24: Holstebro',
##  'Peter Leko: 1979-09-08: Subotica',
##  'Peter Svidler: 1976-06-17: Leningrad',
##  'Pouya Idani: 1995-09-22: Ahvaz',
##  'L%C3%AA Quang Li%C3%AAm: 1991-03-13: Ho Chi Minh City',
##  'Ma Qun: 1991-11-09: Shandong',
##  'Rados%C5%82aw Wojtaszek: 1987-01-13: Elbląg',
##  'Rasmus Svane: 1997-05-21: Allerød Municipality',
##  'Rauf Mamedov: 1988-04-26: Baku',
##  'Raunak Sadhwani: 2005-12-22: Nagpur',
##  'Ray Robson: 1994-10-25: Guam',
##  'Rich%C3%A1rd Rapport: 1996-03-25: Szombathely',
##  'Rinat Jumabayev: 1989-07-23: Shymkent',
##  'Robert Hovhannisyan: 1991-03-23: Yerevan',
##  'Robin van Kampen: 1994-11-14: Blaricum',
##  'Ruslan Ponomariov: 1983-10-11: Horlivka',
##  'Rustam Kasimdzhanov: 1979-12-05: Tashkent',
##  'S. P. Sethuraman: 1993-02-25: Madras',
##  'Sam Shankland: 1991-10-01: Berkeley',
##  'Samuel Sevian: 2000-12-26: Corning',
##  'Samvel Ter-Sahakyan: 1993-09-19: Vanadzor',
##  'Sanan Sjugirov: 1993-01-31: Elista',
##  'Sandro Mareco: 1987-05-13: Haedo',
##  'Vidit Gujrathi: 1994-10-24: [1]',
##  'Sergei Azarov: : ',
##  'Sergei Movsesian: 1978-11-03: Tbilisi',
##  'Sergei Rublevsky: 1974-10-15: Kurgan',
##  'Sergei Tiviakov: 1973-02-14: Krasnodar',
##  'Sergey Fedorchuk: 1981-03-14: 2605',
##  'Sergey Karjakin: 1990-01-12: Simferopol',
##  'Shakhriyar Mamedyarov: 1985-04-12: Sumgait',
##  'Lu Shanglei: 1995-07-10: Shenyang',
##  'Shant Sargsyan: : 2639',
##  'List of Indian chess players: : ',
##  'Surya Shekhar Ganguly: 1983-02-24: Kolkata',
##  'Tam%C3%A1s B%C3%A1nusz: : Mohács',
##  'Tamir Nabaty: 1991-05-04: Ness Ziona',
##  'Teimour Radjabov: 1987-03-12: Baku',
##  'Tigran Gharamian: 1984-07-24: Yerevan',
##  'Vadim Milov: 1972-08-01: Ufa',
##  'Vadim Zvjaginsev: 1976-08-18: Moscow',
##  'Valery Salov: 1964-05-26: Wrocław',
##  'Varuzhan Akobian: 1983-11-19: Armenian SSR',
##  'Vasif Durarbayli: 1992-02-24: Sumqayit',
##  'Vasyl Ivanchuk: 1969-03-18: Kopychyntsi',
##  'Velimir Ivi%C4%87: : Belgrade',
##  'Veselin Topalov: 1975-03-15: Ruse',
##  'Viktor Erd%C5%91s: : 2613',
##  'Viktor L%C3%A1zni%C4%8Dka: 1988-01-09: Pardubice',
##  'Vincent Keymer: 2004-11-15: Mainz',
##  'Viswanathan Anand: 1969-12-11: [1]',
##  'Vitaliy Bernadskiy: 1994-11-17: 2601',
##  'Vladimir Afromeev: : Magadan',
##  'Vladimir Akopian: 1971-12-07: Baku',
##  'Vladimir Fedoseev: 1995-02-16: Saint Petersburg',
##  'Vladimir Kramnik: 1975-06-25: Tuapse',
##  'Vladimir Malakhov: 1980-11-27: Ivanovo',
##  'Volodymyr Onyshchuk: 1991-07-21: Ivano-Frankivsk',
##  'Vladislav Artemiev: 1998-03-05: Omsk',
##  'Vladislav Kovalev: 1994-01-06: Minsk',
##  'Vladislav Tkachiev: 1973-11-09: Russian SFSR',
##  'Wesley So: 1993-10-09: Bacoor',
##  'Wojciech Moranda: 1988-08-17: Kielce',
##  'Bu Xiangzhi: 1985-12-10: Qingdao',
##  'Yannick Gozzoli: : Marseille',
##  'Yaroslav Zherebukh: 1993-07-14: Lviv',
##  'Yasser Seirawan: 1960-03-24: Damascus',
##  'Yevgeniy Vladimirov: 1957-01-20: Alma Ata',
##  'Wei Yi: 1999-06-02: Yancheng',
##  'Hou Yifan: 1994-02-27: Xinghua',
##  'Wang Yue: 1987-03-31: Taiyuan',
##  'Yuri Drozdovskij: : Ukraine',
##  'Yuriy Kryvoruchko: 1986-12-19: Lviv',
##  'Yuriy Kuzubov: 1990-01-26: Sychyovka',
##  'Zahar Efimenko: 1985-07-03: Makiivka',
##  'Zhang Zhong: 1978-09-05: Chongqing',
##  'Zolt%C3%A1n Alm%C3%A1si: 1976-08-29: 2678',
##  'Zoltan Gyimesi: 1977-03-31: 2674',
##  'Zurab Azmaiparashvili: 1960-03-16: Tbilisi',
##  'Zviad Izoria: 1984-01-06: Georgia']

Although we can see some mistakes like ratings and blanks in place of birthplaces, the web-scraper did extract the majority of the data we needed.
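The obvious mistakes could be flagged with a small heuristic like the one below. This is only a sketch under our own assumptions (the function name and the three rules are illustrative), and it would not catch every problem: a person’s name in the birthplace slot, such as the 'Sopiko Guramishvili' entry for Anish Giri, would still pass.

```python
import re

def looks_like_birthplace(value):
    """Heuristic: reject blanks, pure numbers (ratings) and footnote markers like '[1]'."""
    value = value.strip()
    if not value:
        return False
    if value.isdigit():
        return False
    if re.fullmatch(r"\[\d+\]", value):
        return False
    return True

# Sample values taken from the scraper output above
samples = ["Sharjah", "2607", "", "[1]", "Rostov-on-Don"]
flags = [looks_like_birthplace(s) for s in samples]
print(list(zip(samples, flags)))
```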

We will first convert the information into a data frame in python.

player_bios_table_raw = pd.DataFrame(player_bios, columns= ["Bio"])
print(player_bios_table_raw)
##                                             Bio
## 0              Salem Saleh: 1993-01-04: Sharjah
## 1          Abhijeet Gupta: 1989-10-16: Bhilwara
## 2                 Ahmed Adly: 1987-02-18: Cairo
## 3         Alan Pichot: 1998-08-13: Buenos Aires
## 4        Aleksandar In%C4%91i%C4%87: : Belgrade
## ..                                          ...
## 256          Zhang Zhong: 1978-09-05: Chongqing
## 257   Zolt%C3%A1n Alm%C3%A1si: 1976-08-29: 2678
## 258            Zoltan Gyimesi: 1977-03-31: 2674
## 259  Zurab Azmaiparashvili: 1960-03-16: Tbilisi
## 260           Zviad Izoria: 1984-01-06: Georgia
## 
## [261 rows x 1 columns]

Next, we will use R to split the “Bio” column into the three variables we need. We will then combine the “clean” chess players and the problematic ones.

player_bios_table_cleanish <- py$player_bios_table_raw %>%
  separate(Bio, c("Name", "Birthdate", "City_of_birth"), sep = ": ", remove = TRUE, convert = FALSE)%>%
  filter(Name != "List of Armenian chess players" & Name != "List of Indian chess players")


player_bios_table_updated <- rbind(player_bios_table_cleanish, problematic_grandmaster_bios) %>%
  arrange(Name)
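For comparison, the same split can be sketched in plain Python. The helper split_bio is hypothetical, and turning empty fields into None is our own choice; separate() in the R code above keeps them as empty strings.

```python
def split_bio(bio):
    """Split 'Name: Birthdate: City' into a dict; empty fields become None."""
    name, birthdate, city = bio.split(": ", 2)
    return {
        "Name": name,
        "Birthdate": birthdate or None,
        "City_of_birth": city or None,
    }

# Entries copied from the scraper output above
print(split_bio("Salem Saleh: 1993-01-04: Sharjah"))
print(split_bio("Aleksandr Rakhmanov: : Cherepovets"))
print(split_bio("Alexander Chernin: : "))
```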

Our last step is to produce a final data set that will be cleaned and validated against the list of grandmasters Wikipedia page. To make sure our data set can be joined with the data from the Wikipedia list, we need to flip the names back so that last names come first, as they did in the original data. We also add a second “Name” column and a second “Birthdate” column taken from the original data frame, because the web-scraped versions of these columns contain errors. As successful as our web-scraper was, it was not perfect, so we need columns we can trust for data validation later on.

# Flip first and last names so that last names come first again
player_bios_table_updated2 <- player_bios_table_updated %>%
  separate(Name, c("First", "Last"), sep = " ", remove = TRUE, convert = FALSE) %>%
  unite(Name, Last, First, sep = ", ")%>%
  arrange(Name)
## Warning: Expected 2 pieces. Additional pieces discarded in 19 rows [55, 63, 88,
## 99, 113, 117, 124, 125, 127, 133, 140, 146, 165, 169, 184, 197, 200, 201, 217].
# Supplemental validation columns from original data frame
grandmaster_2600_supplement<- py$grandmaster_2600_raw %>%
  select(Name, `B-day`) %>%
  rename(Birthdate= `B-day`)


# Join supplemental data to bio table using the names and birthdate columns as the keys
library(fuzzyjoin)
## Warning: package 'fuzzyjoin' was built under R version 4.1.2
grandmaster_biotable_2600 <- player_bios_table_updated2 %>%
  stringdist_left_join(grandmaster_2600_supplement, by = c("Name", "Birthdate"), method= "qgram", q=2, max_dist = 9 ) 
  

str(grandmaster_biotable_2600)
## 'data.frame':    281 obs. of  5 variables:
##  $ Name.x       : chr  "%C5%9Awiercz, Dariusz" "%C5%A0ari%C4%87, Ivan" "Abasov, Nijat" "Abdusattorov, Nodirbek" ...
##  $ Birthdate.x  : chr  "1994-05-31" "1990-08-17" "1995-05-14" "2004-09-18" ...
##  $ City_of_birth: chr  "Tarnowskie Góry" "Split" "Baku" "Tashkent" ...
##  $ Name.y       : chr  "Swiercz, Dariusz" NA "Abasov, Nijat" "Abdusattorov, Nodirbek" ...
##  $ Birthdate.y  : num  1994 NA 1995 2004 1971 ...
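The warning above tells us that separate() dropped the extra pieces of 19 names with more than two words, which is why entries such as “van, Jorden” and “L., S.” appear among the web-scraped names. A possible alternative, sketched here in Python, is to split on the first space only so that nothing is discarded. The helper flip_name is hypothetical and the rule is still imperfect: 'Peter Heine Nielsen' would become 'Heine Nielsen, Peter'.

```python
def flip_name(full_name):
    """'First Rest Of Name' -> 'Rest Of Name, First' (splits on the first space only)."""
    first, _, rest = full_name.partition(" ")
    return f"{rest}, {first}" if rest else first

for name in ["Magnus Carlsen", "Jorden van Foreest", "Peter Heine Nielsen"]:
    print(flip_name(name))
```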

For transparency, let’s go over the similar columns:

  • Name.x: web-scraped names
  • Birthdate.x: web-scraped birthdates
  • Name.y: original names
  • Birthdate.y: original birth years
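To get a feel for why a row like “%C5%9Awiercz, Dariusz” was still matched with “Swiercz, Dariusz”, here is a small sketch of the q-gram distance as we understand it from the stringdist documentation: count every substring of length q in both strings and sum the absolute differences of the counts. This is our own re-implementation for illustration, not the code fuzzyjoin runs internally.

```python
from collections import Counter

def qgrams(text, q=2):
    """Count all overlapping substrings of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))

def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the q-gram counts of a and b."""
    ca, cb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())

# Names taken from the str() output above
print(qgram_distance("%C5%9Awiercz, Dariusz", "Swiercz, Dariusz"))
# Full date against a year, assuming the numeric year is compared as the string "1994"
print(qgram_distance("1994-05-31", "1994"))
```

Both values fall under the max_dist = 9 threshold used in the join, which is consistent with that row being matched.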

Data Cleaning

Dealing with Names

Because we went from 266 observations to 281, we know that the join created duplicates. We accepted this because we needed the function to match as many records as possible. But duplicates are only part of the problem: the “Name.y” column also has missing values, because not every record found a match. For now, let’s take a look at the duplicate names and the number of missing values.

# Find duplicate names 
grandmaster_biotable_2600_dups <- (duplicated(grandmaster_biotable_2600$Name.x))
grandmaster_biotable_2600$Name.x[grandmaster_biotable_2600_dups]
##  [1] "Areshchenko, Alexander" "Chao, Li"               "Chernin, Alexander"    
##  [4] "Donchenko, Alexander"   "Gyimesi, Zoltan"        "Hao, Wang"             
##  [7] "L., S."                 "L., S."                 "Navara, David"         
## [10] "P., S."                 "Paravyan, David"        "van, Robin"            
## [13] "Yangyi, Yu"             "Yi, Wei"                "Yue, Wang"
# Save missing values as data frame
grandmaster_biotable_2600_missing <- grandmaster_biotable_2600 %>%
  filter(is.na(Name.y )) 

print(grandmaster_biotable_2600_missing)
##                         Name.x Birthdate.x    City_of_birth Name.y Birthdate.y
## 1        %C5%A0ari%C4%87, Ivan  1990-08-17            Split   <NA>          NA
## 2     Alm%C3%A1si, Zolt%C3%A1n  1976-08-29             2678   <NA>          NA
## 3            Ant%C3%B3n, David  1995-06-23           Murcia   <NA>          NA
## 4      B%C3%A1nusz, Tam%C3%A1s                       Mohács   <NA>          NA
## 5            Baskaran, Adhiban  1992-08-15   Mayiladuthurai   <NA>          NA
## 6       Bl%C3%BCbaum, Matthias  1997-04-18            Lemgo   <NA>          NA
## 7    Bosio%C4%8Di%C4%87, Marin  1988-08-08           Rijeka   <NA>          NA
## 8     Bruz%C3%B3n, L%C3%A1zaro  1982-05-02          Holguín   <NA>          NA
## 9      Dom%C3%ADnguez, Leinier  1983-09-23           Havana   <NA>          NA
## 10               Eduardo, Jose  1999-01-31             Lima   <NA>          NA
## 11             Gujrathi, Vidit  1994-10-24              [1]   <NA>          NA
## 12                   H., Karen  1995-02-25          Yerevan   <NA>          NA
## 13        Henriquez, Cristobal  1996-08-07       La Florida   <NA>          NA
## 14            Illescas, Miguel  1965-12-03        Barcelona   <NA>          NA
## 15 In%C4%91i%C4%87, Aleksandar                     Belgrade   <NA>          NA
## 16         Iturrizaga, Eduardo  1989-11-01          Caracas   <NA>          NA
## 17  L%C3%A1zni%C4%8Dka, Viktor  1988-01-09        Pardubice   <NA>          NA
## 18                    M., Haik  2000-07-14         Byuravan   <NA>          NA
## 19        Onyshchuk, Volodymyr  1991-07-21  Ivano-Frankivsk   <NA>          NA
## 20  Praggnanandhaa, Rameshbabu  2005-08-10          Chennai   <NA>          NA
## 21              Quang, L%C3%AA  1991-03-13 Ho Chi Minh City   <NA>          NA
## 22          Salgado, Iv%C3%A1n                                <NA>          NA
## 23              Santos, Jaime,  1996-07-03    San Sebastián   <NA>          NA
## 24              Shekhar, Surya  1983-02-24          Kolkata   <NA>          NA
## 25                Truong, Ngoc  1990-02-23         Rach Gia   <NA>          NA
## 26                 van, Jorden  1999-04-30          Utrecht   <NA>          NA

With 15 duplicate web-scraped names and 26 rows with missing original names, we have quite a bit of work to do. There are a number of ways to fill in the missing values. At this point in the project, we have successfully built and run our web-scraper, and the list of grandmasters Wikipedia page is available to us. The easiest way to handle some of our issues is therefore to repeatedly merge our web-scraped data with the grandmaster list and clean the result. This should significantly reduce the number of manual insertions we need to do.

Let’s load the data from the grandmaster list in Python.

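import ssl
# Skip HTTPS certificate verification so pd.read_html can fetch the page
# (a workaround: this disables a security check for the whole session)
ssl._create_default_https_context = ssl._create_unverified_context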

df = pd.read_html('https://en.wikipedia.org/w/index.php?title=List_of_chess_grandmasters&diff=prev&oldid=1043484298', attrs = {'id' : 'grandmasters'})
all_grandmaster_wiki_table = df[0]
all_grandmaster_wiki_table =  all_grandmaster_wiki_table.drop(labels=0, axis=0)
pp.pprint(all_grandmaster_wiki_table.info())
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 1945 entries, 1 to 1945
## Data columns (total 9 columns):
##  #   Column      Non-Null Count  Dtype  
## ---  ------      --------------  -----  
##  0   Name        1945 non-null   object 
##  1   FIDE ID     1872 non-null   float64
##  2   Born        1945 non-null   object 
##  3   Birthplace  1662 non-null   object 
##  4   Died        217 non-null    object 
##  5   TitleYear   1945 non-null   float64
##  6   Federation  1945 non-null   object 
##  7   Sex         1945 non-null   object 
##  8   Notes       1945 non-null   object 
## dtypes: float64(2), object(7)
## memory usage: 152.0+ KB
## None

Using the missing names table we created, let’s try to fill in as many of the missing values as possible, clean the data, and fuse those missing values back into our main table.

# Merge missing table with grandmaster list data and clean it
grandmaster_biotable_2600_miss_resolved <- grandmaster_biotable_2600_missing %>%
  stringdist_left_join(py$all_grandmaster_wiki_table, by = c("Birthdate.x" = "Born"), method= "lcs" , max_dist = 1 ) %>%
  mutate(Name = dplyr::recode(Name, 
    "Nguy<U+1EC5>n Ng<U+1ECD>c Tru<U+1EDD>ng Son" = "Nguyen, Ngoc Truong Son"
    )) %>%
  arrange(Name)%>%
  distinct(`Name.x`, .keep_all = TRUE)


library(data.table)
grandmaster_biotable_2600_miss_resolved <- data.table(grandmaster_biotable_2600_miss_resolved)

# Manually fill in missing chess players  
grandmaster_biotable_2600_miss_resolved <- (
  grandmaster_biotable_2600_miss_resolved
  [23, 6 := "Banusz, Tamas"][23, 2 := "1989-04-08"]
  [24, 6 := "Indjic, Aleksandar"][24, 2 := "1995-08-24"]
  [25, 6 := "Salgado López, Iván"][25, 2 := "1991-06-29"]
)

grandmaster_biotable_2600_miss_resolved <- grandmaster_biotable_2600_miss_resolved %>%
  select( Name, Birthdate.x, Birthplace)

Let’s take a look at the filled in missing values.

print(grandmaster_biotable_2600_miss_resolved)
##                                            Name Birthdate.x       Birthplace
##  1:                                  Adhiban B.  1992-08-15          Chennai
##  2:                              Almási, Zoltán  1976-08-29       Járdánháza
##  3:                       Antón Guijarro, David  1995-06-23           Murcia
##  4:                           Blübaum, Matthias  1997-04-18            Lemgo
##  5:                             Bosiocic, Marin  1988-08-08           Rijeka
##  6:                      Bruzón Batista, Lázaro  1982-05-02          Holguín
##  7:                    Domínguez Pérez, Leinier  1983-09-23           Havana
##  8:                      Ganguly, Surya Shekhar  1983-02-24          Kolkata
##  9:                         Grigoryan, Karen H.  1995-02-25          Yerevan
## 10:                             Gujrathi, Vidit  1994-10-24           Indore
## 11:               Henriquez Villagra, Cristóbal  1996-08-07         Santiago
## 12:                    Illescas Cordoba, Miguel  1965-12-03        Barcelona
## 13:                 Iturrizaga Bonelli, Eduardo  1989-11-01          Caracas
## 14:                            Laznicka, Viktor  1988-01-09        Pardubice
## 15:                               Lê Quang Liêm  1991-03-13 Ho Chi Minh City
## 16:            Martinez Alcantara, Jose Eduardo  1999-01-31             Lima
## 17:                        Martirosyan, Haik M.  2000-07-14         Artashat
## 18: Nguy<U+1EC5>n Ng<U+1ECD>c Tru<U+1EDD>ng Son  1990-02-23  R<U+1EA1>ch Giá
## 19:                          Onischuk, Vladimir  1991-07-21  Ivano-Frankivsk
## 20:                           Praggna­nandhaa R  2005-08-10          Chennai
## 21:                        Santos Latasa, Jaime  1996-07-03    San Sebastián
## 22:                                 Šaric, Ivan  1990-08-17            Split
## 23:                               Banusz, Tamas  1989-04-08        Groningen
## 24:                          Indjic, Aleksandar  1995-08-24                 
## 25:                         Salgado López, Iván  1991-06-29                 
## 26:                                        <NA>                             
##                                            Name Birthdate.x       Birthplace

Now, we are going to merge the resolved missing data with our main table and observe any changes. We will also fill in our missing Name.y values (the clean names) with the brand new names we got from the resolved missing table.

# Merge main bio table and resolved missing data table
grandmaster_biotable_2600_updated<- grandmaster_biotable_2600 %>%
  stringdist_left_join(grandmaster_biotable_2600_miss_resolved, by = c("Birthdate.x" , "City_of_birth" = "Birthplace"), method= "qgram", max_dist = 4 ) %>%
  mutate(Name.y= coalesce(Name.y, Name))%>%
  select(Name.x, Name.y,  Birthdate.x.x, Birthdate.y, City_of_birth, Birthplace ) %>%
  rename(Name = Name.y, 
         Birthdate.x = Birthdate.x.x, 
         Birthyear = Birthdate.y) %>%
  arrange(Name) 

# Convert the updated bio table to a data.table and count the missing names
grandmaster_biotable_2600_updated <- setDT(grandmaster_biotable_2600_updated)

print(paste(c(sum(is.na(grandmaster_biotable_2600_updated$Name)), "missing values!"), collapse = " "))
## [1] "9 missing values!"
# See the missing values 
library(knitr)
kable(grandmaster_biotable_2600_updated[is.na(grandmaster_biotable_2600_updated$Name)])
| Name.x | Name | Birthdate.x | Birthyear | City_of_birth | Birthplace |
|---|---|---|---|---|---|
| Alm%C3%A1si, Zolt%C3%A1n | NA | 1976-08-29 | NA | 2678 | NULL |
| B%C3%A1nusz, Tam%C3%A1s | NA | | NA | Mohács | NULL |
| Baskaran, Adhiban | NA | 1992-08-15 | NA | Mayiladuthurai | NULL |
| Gujrathi, Vidit | NA | 1994-10-24 | NA | [1] | NULL |
| Henriquez, Cristobal | NA | 1996-08-07 | NA | La Florida | NULL |
| In%C4%91i%C4%87, Aleksandar | NA | | NA | Belgrade | NULL |
| M., Haik | NA | 2000-07-14 | NA | Byuravan | NULL |
| Salgado, Iv%C3%A1n | NA | | NA | | NULL |
| van, Jorden | NA | 1999-04-30 | NA | Utrecht | NULL |

The merge appears to be a success: it filled in all but 9 of the missing “clean” names, so it should be easy to fill in the rest manually.

# Fill in missing clean names 
grandmaster_biotable_2600_updated2 <- (
  grandmaster_biotable_2600_updated
  [273, Name := "Almasi, Zoltan"]
  [274, Name := "Banusz, Tamas"]
  [275, Name := "Baskarans, Adhiban"]
  [276, Name := "Gujrathi Vidit"]
  [277, Name := "Henriquez, Cristobal"]
  [278, Name := "Indjic, Aleksandar"]
  [279, Name := "Martirosyan, Haik M."]
  [280, Name := "Salgado López, Iván"]
  [281, Name := "Van Foreest, Jorden"]
)

# Check how many missing names are left

print(paste(c(sum(is.na(grandmaster_biotable_2600_updated2$Name)), "missing values!"), collapse = " "))
## [1] "0 missing values!"

Now, we can begin dealing with the duplicates. Remember that the “Name.x” variable holds the web-scraped names, so any duplicates will show up in that column.

grandmaster_biotable_2600_duplicates <- grandmaster_biotable_2600_updated2 %>%
  count(Name.x) %>%
  filter(n>1)%>%
  rename(Copies = n)

kable(grandmaster_biotable_2600_duplicates)
| Name.x | Copies |
|---|---|
| Areshchenko, Alexander | 2 |
| Chao, Li | 2 |
| Chernin, Alexander | 2 |
| Donchenko, Alexander | 2 |
| Gyimesi, Zoltan | 2 |
| Hao, Wang | 2 |
| L., S. | 3 |
| Navara, David | 2 |
| P., S. | 2 |
| Paravyan, David | 2 |
| van, Robin | 2 |
| Yangyi, Yu | 2 |
| Yi, Wei | 2 |
| Yue, Wang | 2 |

We can eliminate all of the duplicates using dplyr’s distinct function, which keeps the first row for each “Name.x” value.

grandmaster_biotable_2600_unique <-  grandmaster_biotable_2600_updated2 %>%
  distinct(`Name.x`, .keep_all = TRUE)

grandmaster_biotable_2600_unique
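The counting and the “keep the first row per name” logic can be sketched in plain Python as follows. The rows here are illustrative placeholders rather than real matches from our table, and the variable names are our own.

```python
from collections import Counter

# Illustrative rows: the "Name" values are placeholders, not actual join results
rows = [
    {"Name.x": "Navara, David", "Name": "first match"},
    {"Name.x": "Navara, David", "Name": "second match"},
    {"Name.x": "So, Wesley", "Name": "only match"},
]

# Count copies per web-scraped name and keep those that occur more than once
copies = Counter(row["Name.x"] for row in rows)
duplicated = {name: n for name, n in copies.items() if n > 1}

# Keep only the first row seen for each web-scraped name
seen = set()
unique_rows = []
for row in rows:
    if row["Name.x"] not in seen:
        seen.add(row["Name.x"])
        unique_rows.append(row)

print(duplicated)
print(unique_rows)
```

Keeping the first row is simple, but nothing guarantees that the first row is the right match, so the result still needs to be checked.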

Now that the duplicates are gone, we need to make a number of manual corrections to fix mistakes introduced by the earlier merges.

# Corrections 

grandmaster_biotable_2600_almost <- (
  grandmaster_biotable_2600_unique
  [10, Name := "Gyimesi, Zoltan"][10, Birthyear := "1977"]
  [20, Name := "Donchenko, Alexander"][20, Birthdate.x := "1998-03-22"][20, Birthyear := "1998"]
  [130, Name := "Nielsen, Peter Heine"][130, Birthyear := "1973"]
  [138, Name := "Narayanan, S. L."][138, Birthyear := "1998"]
  [139, Name := "Sethuraman, S. P."][139, Birthyear := "1993"][139, City_of_birth := "Chennai"]
  [165, Name := "Paravyan, David"][165, Birthdate.x := "1998-03-08"][165, Birthyear := "1998"]
  [197, Name := "Van Kampen, Robin"]
  [222, Name := "Wei, Yi"][222, Birthyear := "1999"][222, City_of_birth := "Wuxi"]
  [246, Name := "Wang, Yue"][246, Birthyear := "1987"]
  [247, Name := "Yangyi, Yu"][247, Birthyear := "1994"]
  [259, Name := "Banusz, Tamas"][259, Birthdate.x := "1989-04-08"][259, Birthyear := "1989"]
  [260, Birthyear := "1992"]
)
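Corrections by row number depend on the exact sort order of the table, so they can silently hit the wrong player if an earlier step changes. One possible alternative, sketched here in Python, is to key the corrections by the web-scraped name instead. The helper apply_corrections and the small rows list are hypothetical; the two corrected birthdates are the ones used above.

```python
# Overrides keyed by web-scraped name (values taken from the corrections above)
corrections = {
    "Donchenko, Alexander": {"Birthdate.x": "1998-03-22", "Birthyear": "1998"},
    "Paravyan, David": {"Birthdate.x": "1998-03-08", "Birthyear": "1998"},
}

def apply_corrections(rows, corrections, key="Name.x"):
    """Return new rows with per-player overrides applied; fail on unknown names."""
    known = {row[key] for row in rows}
    missing = sorted(set(corrections) - known)
    if missing:
        raise KeyError(f"No row found for: {missing}")
    return [{**row, **corrections.get(row[key], {})} for row in rows]

# Small illustrative table; blank birthdates mimic the scraper output
rows = [
    {"Name.x": "Donchenko, Alexander", "Birthdate.x": "", "Birthyear": None},
    {"Name.x": "Paravyan, David", "Birthdate.x": "", "Birthyear": None},
    {"Name.x": "Carlsen, Magnus", "Birthdate.x": "1990-11-30", "Birthyear": "1990"},
]

fixed = apply_corrections(rows, corrections)
for row in fixed:
    print(row)
```

Raising an error for a name that is not in the table makes a typo in a correction visible right away.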

Let’s check out our data.

# Clean up data a bit
grandmaster_biotable_2600_almost2 <- grandmaster_biotable_2600_almost %>%
  arrange(Name) %>%
  select(Name, Birthdate.x, Birthyear, City_of_birth, Birthplace)

grandmaster_biotable_2600_almost2

Dealing with Birthdates

Phase 2 of the data cleaning involves making sure our birthdates are correct. Our main issue is that the “Birthdate.x” column has a number of blank observations. To fix this, we will do another merge with the grandmaster list table and use the “Born” column in that data set to fill in the blanks. This merge also lets us bring in the “FIDE ID” column so that future merges can match on an exact key instead of on names.

# Fill in missing birthdates with born column
grandmaster_biotable_2600_almost3<- grandmaster_biotable_2600_almost2 %>%
  stringdist_left_join(py$all_grandmaster_wiki_table, by = c("Name"), method= "qgram", max_dist = 2 )%>%
  mutate(Birthdate.x = ifelse(Birthdate.x == "", NA, Birthdate.x)) %>%
  mutate(Birthdate.x=  coalesce(Birthdate.x, Born)) %>%
  select(`FIDE ID`, Name.x,  Birthdate.x,  Birthyear, City_of_birth ) %>%
  rename(ID= `FIDE ID`)

grandmaster_biotable_2600_almost3 <- data.table(grandmaster_biotable_2600_almost3)

# View missing data 

print(paste(c(sum(is.na(grandmaster_biotable_2600_almost3$Birthdate.x)) , "missing values!"), collapse = " "))
## [1] "3 missing values!"

The check reports 3 missing birthdates. The corrections below insert 2 birthdates by hand, and we can use the opportunity to fix several FIDE IDs as well.

# Manual Corrections
grandmaster_biotable_2600_almost4 <- grandmaster_biotable_2600_almost3[5, ID := 4157770 ][5, Birthdate.x := "1954-04-02"][10, ID := 702293 ][16, ID :=  4107012 ][18, ID:= 5072786][18, Birthdate.x := "1999-09-11" ][27, ID := 722413 ][31, ID := 5018471 ][79, ID := 3800024 ][92, ID := 3409350 ][139, ID := 8604436 ][200, ID := 11600098 ]
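
The fuzzy merge and fill step above can be sketched in plain Python. This is only an illustration under stated assumptions: the players, dates, and the helpers `qgram_distance` and `coalesce` are hypothetical, and I am assuming that the q-gram method is applied with q = 1, which I believe is the stringdist default.

```python
from collections import Counter

def qgram_distance(a, b, q=1):
    """Sum of absolute differences between the q-gram counts of a and b."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(grams_a[g] - grams_b[g]) for g in grams_a.keys() | grams_b.keys())

def coalesce(*values):
    """Return the first value that is not None, similar to dplyr's coalesce."""
    return next((v for v in values if v is not None), None)

# Hypothetical stand-ins for the scraped table and the grandmaster list
scraped = [
    {"Name": "Example, Anna", "Birthdate": "1980-01-01"},
    {"Name": "Sample, Boris", "Birthdate": ""},
    {"Name": "Nomatch, Carl", "Birthdate": ""},
]
wiki_list = [
    {"Name": "Example, Anna", "Born": "1980-01-01"},
    {"Name": "Sample Boris", "Born": "1990-02-03"},
]

MAX_DIST = 2
merged = []
for row in scraped:
    # Left join: every scraped row is kept, whether or not a match is found
    matches = [w for w in wiki_list
               if qgram_distance(row["Name"], w["Name"]) <= MAX_DIST]
    for w in matches or [{"Born": None}]:
        birthdate = row["Birthdate"] or None  # treat "" as missing
        merged.append({"Name": row["Name"],
                       "Birthdate": coalesce(birthdate, w["Born"])})

missing = sum(r["Birthdate"] is None for r in merged)
print(merged)
print(missing, "missing values!")
```

One design point worth noting: with q = 1 the distance only compares character counts, so `qgram_distance("Wang, Hao", "Hao, Wang")` is 0. That makes the match tolerant of swapped name order, but it can also pair up names that merely share the same letters, which is why matching on the FIDE ID is preferable once it is available.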

Dealing with Birthplaces

During the web-scraping process, some of the birthplaces came out either as ratings or as blanks. To solve this, we will mutate the column so that those mistakes become NA values.

#Replace problematic birthplaces with NA values
grandmaster_biotable_2600_almost5 <-  grandmaster_biotable_2600_almost4 %>%
  mutate(City_of_birth = ifelse(str_detect(City_of_birth , ".*\\d") , NA, City_of_birth) )%>%
  mutate(City_of_birth = ifelse(City_of_birth == "", NA, City_of_birth))
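
As a rough Python equivalent of the cleanup above, the sketch below turns any value that contains a digit, or that is blank, into `None`. The helper name `clean_birthplace` and the sample values are hypothetical. For detection purposes, searching for a single digit should behave the same as the `.*\\d` pattern used in the R code.

```python
import re

def clean_birthplace(value):
    """Return None for missing values, blanks, or values containing a digit."""
    if value is None or value == "" or re.search(r"\d", value):
        return None
    return value

# Illustrative scraped values: two cities, a rating that leaked in, and a blank
raw = ["Baku", "2654", "", "Truro"]
cleaned = [clean_birthplace(v) for v in raw]
print(cleaned)
```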

Using our grandmaster list table, we will fill in our missing city of birth data using the “Birthplace” column from the list.

#Load wiki table list in R
all_grandmaster_wiki_table_r <- py$all_grandmaster_wiki_table
all_grandmaster_wiki_table_r$Birthplace <- unlist(all_grandmaster_wiki_table_r$Birthplace )

# Join and fill in missing values
grandmaster_biotable_2600_almost6 <- grandmaster_biotable_2600_almost5 %>%
  left_join(all_grandmaster_wiki_table_r, by = c("ID" = "FIDE ID") )  %>%
  mutate(City_of_birth =  coalesce(City_of_birth, Birthplace))%>%
  select(ID, Name.x, Birthdate.x, Birthyear, City_of_birth, Birthplace )



grandmaster_biotable_2600_almost6

With the necessary columns being filled, it’s time to move on to validating the data.

Data Validation

Validation of Dates

We kept the birth year column throughout the merges because we needed some way to confirm that the birthdates belong to the right players. Because the birth years came with the players, that column is well suited to validating the birthdates. Using R’s stringdist package, we can compute the string distance between the two columns. If the values agree, the distance should be exactly 6, because the “Birthdate” column has 2 extra hyphens and 4 extra digits (the month and the day).

library(stringdist)
stringdist(grandmaster_biotable_2600_almost6$Birthdate.x, grandmaster_biotable_2600_almost6$Birthyear)
##   [1]  6  6  6  6  6  6  6  6  6 NA  6  6  6  6  6  6 NA  6  6  6  6  6  6  6  6
##  [26]  6  6  6  6  6  6  6  6  6 NA  6  6 NA  6 NA  6  6  6  6  6  6  6  6  6  6
##  [51]  6 NA  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6 NA  6  6  6  6
##  [76]  6  6  6  6  6 NA  6 NA  6  6  6  6  6  6  6  6 NA  6  6  6  6  6 NA  6 NA
## [101]  6 NA  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6
## [126]  6  6  6  6  6  6  6  6 NA NA  6  6  6  6  6  6  6  6  6  6  6  6  6 NA NA
## [151]  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6 NA  6  6  6
## [176]  6  6  6  6 NA  6  6  6  6  6  6  6  6  6  6  6  6  6 NA  6  6  6  6  6  6
## [201]  6  6  6  6  6  6  6 NA  6 NA  6  6  6  6 NA  6  6  6  6  6  6  6  6  6  6
## [226]  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6 NA  6  6  6  6  6
## [251]  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6

The NA values are caused by the blanks in the birth year column. Overall, it seems that all of our work paid off because we are only seeing string distances equal to 6.
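
To see why a matching pair gives a distance of 6, here is a small Python sketch. It uses a plain Levenshtein distance as a stand-in, not stringdist’s own default method (which I believe is optimal string alignment); the two agree when only deletions are needed, as here. The `year_matches` helper and the year "1905" are hypothetical additions for illustration.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def year_matches(birthdate, birthyear):
    """Stricter check: the date must start with the year followed by a hyphen."""
    return birthdate.startswith(birthyear + "-")

# A date and year that agree: six characters ("-05-14") must be deleted
print(levenshtein("1995-05-14", "1995"))

# A hypothetical wrong year whose digits still appear in order inside the date
print(levenshtein("1995-05-14", "1905"))
print(year_matches("1995-05-14", "1905"))
```

Both distances come out as 6, so a distance of 6 is consistent with a correct year but does not strictly prove it. A prefix comparison like `year_matches` is a stricter complement to the distance check.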

Validation of Birthplaces

Validating birthplaces is difficult because of one problem: conflicts between the players’ Wikipedia pages (City_of_birth variable) and the grandmaster list (Birthplace variable). The conflicts are mainly caused by inaccuracies, outdated information, and historical changes. For example, a number of Russian cities have been renamed in the past 20-30 years following political transitions. Additionally, some cities in Russia and Ukraine share the same name. Moreover, it can be difficult to know where a Russian or Ukrainian player was born because some players were born in one country but raised in the other.

Let’s take a look at the birthplace conflicts.

grandmaster_biotable_2600_almost6[grandmaster_biotable_2600_almost6$City_of_birth != grandmaster_biotable_2600_almost6$Birthplace]

There are 45 conflicts, and there is no automated way to deal with them. The best thing to do is to go through them manually and decide which ones are worth changing. The grandmaster list Wikipedia page supports many of its records with FIDE applications, which sometimes contain player birthplaces. For our purposes, if the grandmaster list has documentation supporting its data (mainly in the form of grandmaster title applications), then that location is chosen over the web-scraped data. Otherwise, the Wikipedia birthplaces are left alone.

I could not find a function that could be used to substitute the “City_of_birth” variable with the “Birthplace” variable, so I made my own function.

# Correction based on the grandmaster list 
grandmaster_biotable_2600_almost7 <- grandmaster_biotable_2600_almost6[142, City_of_birth := "Yekaterinburg"][258, City_of_birth := "Tashkent"]

# Row numbers that are going to change
changing_indices = c(6,9, 19, 28, 55, 67, 74, 90, 92, 99, 100,  105, 106,  133, 140, 160, 239, 245, 261)

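# Function that replaces City_of_birth information with Birthplace information.
# data.table's := updates the table by reference, so the caller's table changes
# even though the loop below does not capture the return value.
row_substitute <- function(dt, index) {
  value <- dt[index, Birthplace]
  dt[index, City_of_birth := value]
  dt
}

for (i in changing_indices) {
  row_substitute(grandmaster_biotable_2600_almost7, i)
}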

grandmaster_biotable_2600_almost7[grandmaster_biotable_2600_almost7$City_of_birth != grandmaster_biotable_2600_almost7$Birthplace]

The last 27 conflicts were left as is.
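
The substitution step above can be sketched in Python as follows. The rows, the player labels, and the helper name `substitute_rows` are hypothetical; the indices are 1-based to mirror R’s row numbering.

```python
def substitute_rows(rows, indices, source="Birthplace", target="City_of_birth"):
    """Copy rows[i][source] into rows[i][target] for each 1-based index, in place."""
    for i in indices:
        rows[i - 1][target] = rows[i - 1][source]
    return rows

def conflicts(rows):
    """Rows where the scraped city and the list birthplace disagree."""
    return [r for r in rows if r["City_of_birth"] != r["Birthplace"]]

# Illustrative rows: two disagreements caused by renamed cities, one agreement
rows = [
    {"Name": "Player A", "City_of_birth": "Sverdlovsk", "Birthplace": "Yekaterinburg"},
    {"Name": "Player B", "City_of_birth": "Baku", "Birthplace": "Baku"},
    {"Name": "Player C", "City_of_birth": "Leningrad", "Birthplace": "Saint Petersburg"},
]

n_before = len(conflicts(rows))
substitute_rows(rows, [1])  # accept the grandmaster list for row 1 only
n_after = len(conflicts(rows))
print(n_before, n_after)
```

In data.table itself, an update of the form `dt[changing_indices, City_of_birth := Birthplace]` should achieve the same result in one step, since `:=` accepts a vector of row numbers.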

Here is a glimpse of the final data set after removing the chess players who are not grandmasters (Afromeev, Vladimir and Rausis, Igors).

grandmaster_biotable_2600_complete <- grandmaster_biotable_2600_almost7%>%
  select(ID, Name.x, Birthdate.x, City_of_birth) %>%
  mutate(City_of_birth = ifelse(City_of_birth == "NaN", NA, City_of_birth))%>%
  rename(Name = Name.x, 
         Birthdate = Birthdate.x) %>%
  filter(Name != "Afromeev, Vladimir" & Name != "Rausis, Igors" ) # Filter out non grandmasters
str(grandmaster_biotable_2600_complete)
## Classes 'data.table' and 'data.frame':   264 obs. of  4 variables:
##  $ ID           : num  13402960 14204118 400041 10601619 13300580 ...
##  $ Name         : chr  "Abasov, Nijat" "Abdusattorov, Nodirbek" "Adams, Michael" "Adly, Ahmed" ...
##  $ Birthdate    : chr  "1995-05-14" "2004-09-18" "1971-11-17" "1987-02-18" ...
##  $ City_of_birth: chr  "Baku" "Tashkent" "Truro" "Cairo" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Part 2: Getting the Rest of the Grandmasters

Unlike the players in Part 1, many of these grandmasters do not have enough information on Wikipedia. The best approach is to take most of the information from the grandmaster list.

Data Preparation

Let’s filter our FIDE data for grandmasters below 2600 and prepare our data sets for merging.

# Filter for grandmasters under 2600
grandmaster_rest_raw = chess[(chess["SRtng"] < 2600) & (chess["Tit"] == "GM")]
# info() prints its summary directly and returns None, so it is not wrapped in print()
grandmaster_rest_raw.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 1474 entries, 368 to 1020287
## Data columns (total 19 columns):
##  #   Column     Non-Null Count  Dtype  
## ---  ------     --------------  -----  
##  0   ID Number  1474 non-null   int64  
##  1   Name       1474 non-null   object 
##  2   Fed        1474 non-null   object 
##  3   Sex        1474 non-null   object 
##  4   Tit        1474 non-null   object 
##  5   WTit       33 non-null     object 
##  6   OTit       24 non-null     object 
##  7   FOA        3 non-null      object 
##  8   SRtng      1474 non-null   float64
##  9   SGm        1474 non-null   float64
##  10  SK         1474 non-null   float64
##  11  RRtng      1167 non-null   float64
##  12  RGm        1167 non-null   float64
##  13  Rk         1167 non-null   float64
##  14  BRtng      1161 non-null   float64
##  15  BGm        1161 non-null   float64
##  16  BK         1161 non-null   float64
##  17  B-day      1474 non-null   int64  
##  18  Flag       414 non-null    object 
## dtypes: float64(9), int64(2), object(8)
## memory usage: 230.3+ KB

# Data set preparation
grandmaster_rest_raw_r <- py$grandmaster_rest_raw %>% 
  rename(ID = `ID Number`)%>%
  select(ID, Name)
  

all_grandmaster_wiki_table_r2 <- all_grandmaster_wiki_table_r %>% 
  rename(ID = `FIDE ID`) %>%
  select(ID, Born, Birthplace)

Merging

Now, we can merge the data and find the number of missing birthplace and birthdate values.

grandmaster_biotable_rest <- grandmaster_rest_raw_r %>% 
  left_join(all_grandmaster_wiki_table_r2, by= "ID") %>%
  rename(Birthdate= Born, 
         City_of_birth = Birthplace)%>% 
  mutate(City_of_birth = ifelse(City_of_birth == "NaN", NA, City_of_birth)) 
  

print(paste(c(sum(is.na(grandmaster_biotable_rest$City_of_birth)), "missing birthplace values!"), collapse = " "))
## [1] "258 missing birthplace values!"
print(paste(c(sum(is.na(grandmaster_biotable_rest$Birthdate)), "missing birthdate values!"), collapse = " "))
## [1] "2 missing birthdate values!"

We can fill in these missing birthdate values using chess.com and Wikipedia.

grandmaster_biotable_rest <- setDT(grandmaster_biotable_rest)
grandmaster_biotable_rest_complete <- grandmaster_biotable_rest[865, Birthdate := "2009-02-05"][1268, Birthdate := "2005-03-22"]
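
The ID-based merge and the missing-value counts above can be sketched in Python like this. All IDs, names, and dates below are hypothetical placeholders, and the `year_only` count is an extra check that is not in the R code.

```python
# Hypothetical FIDE rows and grandmaster-list rows keyed by ID
fide_rows = [
    {"ID": 1, "Name": "Example, Anna"},
    {"ID": 2, "Name": "Sample, Boris"},
    {"ID": 3, "Name": "Nomatch, Carl"},
]
wiki_by_id = {
    1: {"Born": "1980-01-01", "Birthplace": "NaN"},
    2: {"Born": "1990", "Birthplace": "Baku"},
}

merged = []
for row in fide_rows:
    # Left join on the exact ID: unmatched rows keep None in the new columns
    match = wiki_by_id.get(row["ID"], {})
    city = match.get("Birthplace")
    merged.append({
        "ID": row["ID"],
        "Name": row["Name"],
        "Birthdate": match.get("Born"),
        "City_of_birth": None if city == "NaN" else city,  # "NaN" text means missing
    })

missing_places = sum(r["City_of_birth"] is None for r in merged)
missing_dates = sum(r["Birthdate"] is None for r in merged)
# Birthdates that consist of a four-character year only
year_only = sum(r["Birthdate"] is not None and len(r["Birthdate"]) == 4 for r in merged)

print(missing_places, "missing birthplace values!")
print(missing_dates, "missing birthdate values!")
print(year_only, "birthdates with a year only")
```

Joining on the ID avoids the fuzzy name matching needed in Part 1, at the cost of leaving any player without a matching list entry entirely blank.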

The 258 missing birthplace values are not the only problem; some grandmasters’ birthdates consist of only a birth year. Unfortunately, this is the best that can be done for this data set. It’s now time to append the two data sets.

Part 3: The Final Merge

Let’s show the final table and export all of the data sets.

grandmaster_bdates_bplaces <- rbind(grandmaster_biotable_2600_complete, grandmaster_biotable_rest_complete) %>%
  arrange(Name) 

grandmaster_bdates_bplaces
#write.csv(grandmaster_bdates_bplaces,"C:/Users/laryl/Desktop/Data Sets//all_grandmaster_bdates_bplaces.csv")
#write.csv(grandmaster_biotable_2600_complete,"C:/Users/laryl/Desktop/Data Sets//top_grandmaster_bdates_bplaces.csv")
#write.csv(grandmaster_biotable_rest_complete,"C:/Users/laryl/Desktop/Data Sets//rest_of_grandmaster_bdates_bplaces.csv")

Conclusion

Although this project began with the simple goal of obtaining grandmaster birthdates, we ended up acquiring birthplaces too. This project proved to be very challenging, especially during Part 1. But along with these challenges came the opportunity to combine new R and Python tools like the “fuzzyjoin” package and the “googlesearch” library.

The data extracted in this project will be combined with other original data sets from previous chess web-scraping projects so that questions about chess player origins and rating trajectories can be answered. Note that these data sets are not complete, because many manual insertions and corrections still need to be made. However, there is going to be an updated version of this data set that will include country of birth information as well as longitude and latitude data. If you want more information about the data sets (the ones here and the updated one) or want to download them, please visit my GitHub.

Sources