Scraping Advanced Football Stats for the Indian Super League using Python
The Indian Super League (also referred to as the ISL) is one of the budding leagues in Asia. In my opinion, the league has the potential to compete with the best clubs in Asia, provided the clubs think long-term and don't just plan season by season.
Every league needs fans: some passionate ones, and some analytical ones. I was a passionate kind of fan slowly leaning into the analytical side. One thing I know for sure about analytics fans is that we love data (better if it's free). Hence this article, in which I'm going to demonstrate how I scraped advanced stats like xG, xG set play, xG open play, etc.
Prerequisites:
- Knowledge of python
- A little knowledge of web scraping (HTML & CSS)
I would recommend reading this article before proceeding:
https://www.analyticsvidhya.com/blog/2020/08/web-scraping-selenium-with-python/
Tools used:
- Jupyter Notebook
- Python, libraries used: Pandas, NumPy and Selenium
What is web scraping?
Web scraping is a way to collect data from websites. Some websites do not allow web scraping, since it places additional load on their servers. After this tutorial, you should be able to scrape similar websites to collect whatever data you need. Data collection is a very important skill for a data analyst.
Libraries description
Now, if you are used to Python, you probably know that BeautifulSoup is one of the most popular web scraping libraries, so why Selenium? Selenium is an open-source module used for browser automation as well as web scraping.
The reason I used Selenium over BeautifulSoup is that the website we will be scraping has dynamic pages, and BeautifulSoup alone cannot handle dynamically rendered content. Hence, Selenium to the rescue.
Note: I tried scraping with BeautifulSoup, which didn't work. If any reader has found a way, please let everyone know in the comments. Thanks.
Scraping Procedure
Importing the necessary libraries
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import numpy as np
Defining a function to get the driver up and running
def get_driver():
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # loading the browser
    # driver = webdriver.Chrome("chromedriver", options=chrome_options)
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    return driver
My approach was to scrape the stats of every match: goals, shots, xG, xG on target, etc. To extract the stats from one page, you need to supply its URL and then run the scraping function (which will be discussed later on).
I wanted to scrape stats for all the matches, not just one, but it is quite troublesome to input the URL manually for every match. To solve this you have to be a bit creative; there's no perfect solution to this problem.
In my case, I noticed that each match had a unique match ID in its URL, and the following match had the ID of the previous match plus one.
ISL first match URL: https://www.fotmob.com/match/3727946/
ISL second match URL: https://www.fotmob.com/match/3727947/
Now I had a pattern and an increment, so things became clear: I just needed to run a for loop up to the last match of the season. However, I encountered an unexpected problem. The structure of Fotmob's website went through some changes in the middle of the season, so the match IDs for the later matches were a bit different.
Match 55 ID: 3728002
Match 56 ID: 3791566
Thus the match IDs were:
- 3727946 to 3728002 for the first 55 matches
- 3791566 to 3791620 for the last 55 matches
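Rather than typing each URL by hand, the two ID ranges noted above can be generated programmatically. A minimal sketch (the endpoints are the ones observed on Fotmob for this season; adjust them for a different season):

```python
# Build the full list of match IDs for the season from the two
# contiguous ranges, then turn each ID into a match URL.
first_block = list(range(3727946, 3728002 + 1))   # IDs before the site change
second_block = list(range(3791566, 3791620 + 1))  # IDs after the site change

match_ids = first_block + second_block
urls = [f"https://www.fotmob.com/match/{mid}" for mid in match_ids]
```

Looping over `urls` then replaces all the manual URL entry.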
Now that the URLs are sorted comes the main part: finding ways to gather the data. To scrape the data you can use one of the following methods:
- find_elements_by_name
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector
From the above list, find_elements_by_class_name is the one I used throughout the code. (In Selenium 4 these helpers were replaced by driver.find_elements(By.CLASS_NAME, ...), which is why By is imported above.) With this method, all you need to do is find the 'class name' of the data. Let's get started to get a better understanding.
1. Team Names
The first thing is we need the home and away team names, using the ‘Inspect Element’ command we can search for the class name of the content. The snapshot below shows the class name of the team names container:
Team names class → ‘css-er0nau-TeamName’
match_1 = 3727946
URL = f"https://www.fotmob.com/match/{match_1}"
driver = get_driver()
driver.get(URL)
scores_divs = driver.find_elements_by_class_name('css-jkaqxa')
names_divs = driver.find_elements_by_class_name('css-er0nau-TeamName')
Both scores_divs and names_divs contain the info we require. Both of these are lists, so we need to locate the right index. I did this manually, but you can try to automate it as well.
home_team_name = names_divs[2].text
away_team_name = names_divs[3].text
2. Scoreline
Similarly, scrape the scores for the home and away teams respectively.
scores_divs = driver.find_elements_by_class_name('css-jkaqxa')
home_team_score = (list(scores_divs[1].text)[0])
away_team_score = (list(scores_divs[1].text)[4])
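As a side note, pulling characters at fixed positions (index 0 and 4) assumes a single-digit score like '1 - 0' and would misread '10 - 2'. A small alternative sketch, assuming the score text has the form 'home - away', is to split on the dash instead:

```python
def split_score(text):
    """Split a score string such as '1 - 0' into integer home and away
    scores. Splitting on '-' also survives double-digit scores, unlike
    fixed character indices."""
    home, away = text.split('-')
    return int(home.strip()), int(away.strip())
```

Inspect the actual `.text` of the element to confirm the format before relying on this.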
3. Advanced Stats (xG, xGOT, xG Set Play, xG Open Play)
In the stats section, there are 2 main classes: one with the stat titles like xG, shots, etc., and the other with the values.
stats_divs = driver.find_elements_by_class_name('css-radwzz-Stat')
stats_titles_divs = driver.find_elements_by_class_name('css-1s3albn-StatTitle')
stats_divs contains the actual value of each stat, and stats_titles_divs contains the titles, as the name suggests. Since these are two lists, I looped over stats_titles_divs and stored the index of each stat that I wanted. With this index, we can easily retrieve the value for that stat from the stats_divs list.
I got the stat titles using the Inspect Element command:
for i in range(len(stats_titles_divs)):
    if stats_titles_divs[i].text == 'Expected goals (xG)':
        index_xG = i
    elif stats_titles_divs[i].text == 'Total shots':
        index_total_shots = i
    elif stats_titles_divs[i].text == 'xG first half':
        index_xG_first_half = i
    elif stats_titles_divs[i].text == 'xG second half':
        index_xG_second_half = i
    elif stats_titles_divs[i].text == 'xG open play':
        index_open_play = i
    elif stats_titles_divs[i].text == 'xG set play':
        index_set_play = i
    elif stats_titles_divs[i].text == 'xG on target (xGOT)':
        index_xGOT = i
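The elif ladder above is essentially a title-to-index lookup. With plain strings standing in for the element texts (the titles below are placeholders; the real values come from stats_titles_divs[i].text), the same pass can be sketched as a dictionary comprehension:

```python
# Hypothetical stat titles in page order, standing in for
# stats_titles_divs[i].text values.
titles = ['Expected goals (xG)', 'Total shots', 'xG first half',
          'xG second half', 'xG open play', 'xG set play',
          'xG on target (xGOT)']

# The stats we care about; anything else is skipped.
wanted = {'Expected goals (xG)', 'Total shots', 'xG on target (xGOT)'}

# One pass builds a title -> index map, replacing the elif chain.
index_of = {title: i for i, title in enumerate(titles) if title in wanted}
```

The map also makes "stat missing from this page" an explicit `key not in index_of` check instead of a NameError.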
After getting all the indices, the only step remaining was to extract the values from the stats_divs list.
xG = stats_divs[index_xG].text
home_xG, away_xG = xG[:4], xG[25:]

shots = stats_divs[index_total_shots].text
home_shots, away_shots = shots[:2], shots[15:]

xG_first_half = stats_divs[index_xG_first_half].text
home_xG_first_half, away_xG_first_half = xG_first_half[:4], xG_first_half[19:]

xG_second_half = stats_divs[index_xG_second_half].text
home_xG_second_half, away_xG_second_half = xG_second_half[:4], xG_second_half[20:]

xG_Open_Play = stats_divs[index_open_play].text
home_xG_Open_Play, away_xG_Open_Play = xG_Open_Play[:4], xG_Open_Play[19:]

xG_Set_Play = stats_divs[index_set_play].text
home_xG_Set_Play, away_xG_Set_Play = xG_Set_Play[:4], xG_Set_Play[17:]

xGOT = stats_divs[index_xGOT].text
home_xGOT, away_xGOT = xGOT[:4], xGOT[25:]
The final step was to separate out the home and away team stats, which I did by trial and error, that is, slicing at guessed indices.
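Those guessed slice offsets are brittle: they depend on the exact title length and on values always having the same width. Assuming the element text looks like '1.23Expected goals (xG)0.87' (inspect the actual .text to confirm), a regular expression that grabs the first and last numbers is a more forgiving sketch:

```python
import re

def split_stat(text):
    """Pull the home and away values out of a stat string such as
    '1.23Expected goals (xG)0.87'. Falls back to (0.0, 0.0) when two
    numbers cannot be found in the text."""
    numbers = re.findall(r'\d+(?:\.\d+)?', text)
    if len(numbers) >= 2:
        return float(numbers[0]), float(numbers[-1])
    return 0.0, 0.0
```

This works because the stat titles themselves ('Expected goals (xG)', 'Total shots', etc.) contain no digits, so the only numbers in the string are the two values.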
Function for scraping the stats
def get_team_stats(driver, match):
    URL = f"https://www.fotmob.com/match/{match}"
    driver.get(URL)

    # getting team names
    names_divs = driver.find_elements_by_class_name('css-er0nau-TeamName')
    if len(names_divs) != 0:
        home_team_name = names_divs[2].text
        away_team_name = names_divs[3].text

    # getting team scores
    # 3791566 to 3791620 for the last 55 matches
    scores_divs = driver.find_elements_by_class_name('css-slmchi-wrapper')
    if len(scores_divs) == 0:
        # for the first matches (3727946 to 3728002) the class name is this
        scores_divs = driver.find_elements_by_class_name('css-jkaqxa')
    home_team_score = (list(scores_divs[1].text)[0])
    away_team_score = (list(scores_divs[1].text)[4])

    # getting team stats; the class names also changed mid-season
    stats_divs = driver.find_elements_by_class_name('css-1k6dpoy-Stat')
    if len(stats_divs) != 0:
        stats_titles_divs = driver.find_elements_by_class_name('css-1w7h84q-StatTitle')
    else:
        stats_divs = driver.find_elements_by_class_name('css-radwzz-Stat')
        stats_titles_divs = driver.find_elements_by_class_name('css-1s3albn-StatTitle')

    # map each stat title to its index
    for i in range(len(stats_titles_divs)):
        if stats_titles_divs[i].text == 'Expected goals (xG)':
            index_xG = i
        elif stats_titles_divs[i].text == 'Total shots':
            index_total_shots = i
        elif stats_titles_divs[i].text == 'xG first half':
            index_xG_first_half = i
        elif stats_titles_divs[i].text == 'xG second half':
            index_xG_second_half = i
        elif stats_titles_divs[i].text == 'xG open play':
            index_open_play = i
        elif stats_titles_divs[i].text == 'xG set play':
            index_set_play = i
        elif stats_titles_divs[i].text == 'xG penalty':
            index_xG_penalty = i
        elif stats_titles_divs[i].text == 'xG on target (xGOT)':
            index_xGOT = i

    # extract each stat; a NameError means the stat was missing from the page
    try:
        xG = stats_divs[index_xG].text
        home_xG, away_xG = xG[:4], xG[25:]
    except NameError:
        home_xG, away_xG = 0, 0
    try:
        shots = stats_divs[index_total_shots].text
        home_shots, away_shots = shots[:2], shots[15:]
    except NameError:
        home_shots, away_shots = 0, 0
    try:
        xG_first_half = stats_divs[index_xG_first_half].text
        home_xG_first_half, away_xG_first_half = xG_first_half[:4], xG_first_half[19:]
    except NameError:
        home_xG_first_half, away_xG_first_half = 0, 0
    try:
        xG_second_half = stats_divs[index_xG_second_half].text
        home_xG_second_half, away_xG_second_half = xG_second_half[:4], xG_second_half[19:]
    except NameError:
        home_xG_second_half, away_xG_second_half = 0, 0
    try:
        xG_Open_Play = stats_divs[index_open_play].text
        home_xG_Open_Play, away_xG_Open_Play = xG_Open_Play[:4], xG_Open_Play[19:]
    except NameError:
        home_xG_Open_Play, away_xG_Open_Play = 0, 0
    try:
        xG_Set_Play = stats_divs[index_set_play].text
        home_xG_Set_Play, away_xG_Set_Play = xG_Set_Play[:4], xG_Set_Play[17:]
    except NameError:
        home_xG_Set_Play, away_xG_Set_Play = 0, 0
    try:
        xG_penalty = stats_divs[index_xG_penalty].text
        home_xG_penalty, away_xG_penalty = xG_penalty[:4], xG_penalty[16:]
    except NameError:
        home_xG_penalty, away_xG_penalty = 0, 0
    try:
        xGOT = stats_divs[index_xGOT].text
        home_xGOT, away_xGOT = xGOT[:4], xGOT[25:]
    except NameError:
        home_xGOT, away_xGOT = 0, 0

    return {
        'match_id': match,
        'home_team': home_team_name,
        'away_team': away_team_name,
        'home_team_score': float(home_team_score),
        'away_team_score': float(away_team_score),
        'home_xG': float(home_xG),
        'away_xG': float(away_xG),
        'home_shots': float(home_shots),
        'away_shots': float(away_shots),
        'home_xG_first_half': float(home_xG_first_half),
        'away_xG_first_half': float(away_xG_first_half),
        'home_xG_second_half': float(home_xG_second_half),
        'away_xG_second_half': float(away_xG_second_half),
        'home_xG_Open_Play': float(home_xG_Open_Play),
        'away_xG_Open_Play': float(away_xG_Open_Play),
        'home_xG_Set_Play': float(home_xG_Set_Play),
        'away_xG_Set_Play': float(away_xG_Set_Play),
        'home_xG_penalty': float(home_xG_penalty),
        'away_xG_penalty': float(away_xG_penalty),
        'home_xGOT': float(home_xGOT),
        'away_xGOT': float(away_xGOT)
    }
Running this function once scrapes all the stats of one particular match. Since I wanted the stats of all the matches, we'll run the same function in a loop.
# 3727946 to 3728002 for the first 56 matches
# 3791566 to 3791620 for the matches after that
match_1 = 3727946
stats = []
for i in range(0, 58):
    match = match_1 + i
    print("Trying to get match", match, "stats")
    try:
        URL = f"https://www.fotmob.com/match/{match}"
        driver.get(URL)
        stats.append(get_team_stats(driver, match))
        print("Stats scraped!")
    except:
        print("Could not get match", match, "stats")

df1 = pd.DataFrame(stats)  # part 1 of the stats
Looping over the second half of the league
# 3727946 to 3728002 for the first 56 matches
# 3791566 onwards for the matches after that
match_1 = 3791565
stats = []
for i in range(0, 56):
    match = match_1 + i
    print("Trying to get match", match, "stats")
    try:
        URL = f"https://www.fotmob.com/match/{match}"
        driver.get(URL)
        stats.append(get_team_stats(driver, match))
        print("Stats scraped!")
    except:
        print("Could not get match", match, "stats")

df2 = pd.DataFrame(stats)  # part 2 of the stats
df = pd.concat([df1, df2], ignore_index=True)  # combined dataset
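With toy rows in place of the scraped dicts (the values below are made up purely to illustrate the shape of the data), the assemble-and-combine step works like this:

```python
import pandas as pd

# Hypothetical per-match stat dicts standing in for the scraped rows.
part1 = [{'match_id': 3727946, 'home_xG': 1.2, 'away_xG': 0.8}]
part2 = [{'match_id': 3791566, 'home_xG': 0.5, 'away_xG': 2.1}]

df1 = pd.DataFrame(part1)  # part 1 of the stats
df2 = pd.DataFrame(part2)  # part 2 of the stats
df = pd.concat([df1, df2], ignore_index=True)  # combined dataset
```

ignore_index=True renumbers the rows 0..n-1 so the combined dataset doesn't carry duplicate index labels from the two parts.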
Conclusion
Thanks for reading! I'm still a beginner at this and would love to know your thoughts on this work. Any suggestions and comments are welcome.
In my next article, I'll explain how I built an expected points table using this data. Stay tuned.
You can find the code and dataset on my GitHub profile.