serve-and-volley / atp-world-tour-tennis-data
Using Python to scrape ATP World Tour tennis data
Hi,
this is a great database, many thanks for it! I have a note on a data quality issue that may help you make it even better.
The field match_id in the match_scores and match_stats files should be a primary key, but it is not unique. Consider the case where, in a single tournament, the same two players met twice and the outcome (in terms of winner/loser) was the same, for instance in the round-robin phase and again in the final. Both matches then share one match_id, so it is unclear which match in the match_scores file corresponds to which stats in the match_stats file.
Also, in 3 cases the first match (or the second?) is completely missing from the match_stats file. As an example, consider the following 14 match_ids:
2001-605-h432-g379
2003-605-f324-a092
2004-605-f324-h432
2008-605-d643-d402
2009-306-c514-r383
2010-573-c882-r772
2011-605-f324-t786
2013-319-h571-sf41
2013-520-p701-w522
2015-410-r772-h756
2015-573-bc72-m824
2017-337-db59-f586
2017-7648-bh09-ca99
results-605-d875-gb88
The first 3 from 2001, 2003 and 2004 are the ones with missing matches from match_stats.
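One way to list the affected rows is to count how often each match_id occurs in a file. This is only a minimal sketch: the function name, the column position and the sample rows are hypothetical, and the real CSV layout may differ.

```python
import csv
import io
from collections import Counter


def duplicate_match_ids(csv_file, id_column):
    """Return {match_id: count} for ids that appear more than once."""
    counts = Counter(row[id_column] for row in csv.reader(csv_file) if row)
    return {match_id: n for match_id, n in counts.items() if n > 1}


# Hypothetical sample rows (tourney, round, match_id); two rows share an id.
sample = io.StringIO(
    "2003-605,RR,2003-605-f324-a092\n"
    "2003-605,F,2003-605-f324-a092\n"
    "2004-605,F,2004-605-f324-h432\n"
)
duplicates = duplicate_match_ids(sample, id_column=2)
print(duplicates)
```

Running the same check on both the match_scores and match_stats files would show which ids are ambiguous and which appear in only one of the two.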
Hello @serve-and-volley!
Just to tell you I really like what you did on scraping this humongous website; that's a lot of work, and I appreciate it.
I recently found a way to use their API directly by decoding the encrypted data. I thought you might be interested, as it could make scraping a bit easier and less of a burden to maintain as long as they do not change the cipher method (it no longer relies on the interface):
https://stackoverflow.com/questions/73735401/scraping-an-atptour-com-api-returns-what-looks-like-encrypted-data/75086660#75086660
If I may be of any help, please reach out.
Cheers
Fantastic work, this is a very cool project!
I tried running the "match_stats.py..." command (for 2018), and it looks like most of the matches came back with no results.
Would this indicate that the script needs to be adjusted to the ATP website?
I have tried all 3 scripts and they run, but the data count is always zero. The problem seems to be in the line
tourney_titles_parsed = xpath_parse(year_tree, tourney_titles_xpath)
at least in tournament.py.
Could you help me?
While scraping player stats, I am not able to find the players' names. Any idea how to extract the winner's and loser's names for each match?
I am having trouble figuring out the column headers for the csv files in https://github.com/serve-and-volley/atp-world-tour-tennis-data/tree/master/csv/3_match_stats
Hi, could you please add the player overviews CSV column titles, etc to the python folder?
Thank you soooo much!
How do you go about filling in missing data?
I'm new to the scraping realm but very interested in using the tennis data for analysis in a side project. With the scraping script, I got a 'list index out of range' type of error on the Routine Command. I was running your script in Spyder and Jupyter Notebook.
start_year = str(sys.argv[1])
end_year = str(sys.argv[2])
The lines above were where the error occurred.
best,
Paripon Thanthong
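A likely cause is that the script was started without command-line arguments, as happens when a file is simply run from an IDE; sys.argv then holds only the script name, and sys.argv[1] raises the IndexError. A minimal sketch of a fallback follows. The function name and the default years are hypothetical placeholders, not part of the project.

```python
def parse_years(argv, default_start="2016", default_end="2017"):
    """Read start and end year from an argv-style list.

    Falls back to the (hypothetical) defaults when the arguments are missing.
    """
    if len(argv) >= 3:
        return str(argv[1]), str(argv[2])
    return default_start, default_end


# Simulated argument lists; in the script one would pass sys.argv instead.
with_args = parse_years(["match_scores.py", "1998", "2016"])
without_args = parse_years(["match_scores.py"])
print(with_args, without_args)
```

Alternatively, the years can be passed as command-line options in the IDE's run configuration so that the script itself stays unchanged.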
Hi,
I was wondering if you plan on adding a licence so others can use these programs and datasets? I think it would be very useful in my academic research but due to ethical reasons require a licence - simple examples you could add to the readme are here: https://creativecommons.org/share-your-work/
Thanks,
W
Hi,
I ran the following command as per your suggestion - $ python atp_match_data_player.py roger-federer f324 1998 2016
and got the following error:
File "atp_match_data_player.py", line 268, in <module>
match_score += "(" + match_score_tiebreak_parsed[k] + ") "
IndexError: list index out of range
It seems like the element the code is trying to call is outside the range but I am not sure where the exact problem is. Would be great if you can help. Thanks!
Hemant
Hi :)
First of all, great job on this project. Very nice work and thanks for making it available. :)
I have been trying to get the "tournaments.py" script to work. When I try to run it for 2016 - 2017, it returns 0 results for both years.
When I debug it, I noticed the server is returning a "403 forbidden" code when I try to GET the URL (
It looks like a Cloudflare bot check or a captcha of some sort.
I've put the server response in the attached file.
html-error-response.txt
I noticed in the code that none of the headers are set or changed before sending the GET request. By default, the headers of the Python GET request are as follows:
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "www.atptour.com",
"User-Agent": "python-requests/2.31.0"
}
The user-agent being "python" seems to advertise ourselves as a bot and so I thought that was the problem.
I have tried setting these custom headers before making the GET request:
HEADERS = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en;q=0.9,fr;q=0.8",
"Sec-Ch-Ua": '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"Windows"',
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
"Host": "www.atptour.com"
}
page = requests.get(url, headers=HEADERS)
... but that doesn't work either; I am still getting the 403 error. It doesn't seem to be an IP issue: if I open the URL in my browser, it displays fine.
Has something changed on their side? Have you tested this recently?
Any ideas?
Thanks. :)
When I try to run the script I don't receive any data. When I look into the content of requests.get(url), I get something like this:
<title>Attention Required! | Cloudflare</title>
...
...
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
<p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
</div>
<div class="cf-column">
<h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>
<p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.</p>
<p data-translate="resolve_captcha_network">If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.</p>
<p data-translate="resolve_captcha_privacy_pass"> Another way to prevent getting this page in the future is to use Privacy Pass. You may need to download version 2.0 now from the <a href="https://addons.mozilla.org/en-US/firefox/addon/privacy-pass/">Firefox Add-ons Store</a>.</p>
</div>
Is this new behaviour or is there still a way to avoid this?
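The sketch below does not get around the check. It only shows how a script could notice the block page and stop with a clear message instead of silently reporting zero results. The function name is hypothetical, and the markers are taken from the response quoted above, so they are an assumption about what the site returns.

```python
def looks_blocked(status_code, body):
    """Heuristic: True if a response resembles the block page quoted above."""
    markers = ("Attention Required! | Cloudflare", "why_captcha_headline")
    return status_code == 403 or any(marker in body for marker in markers)


# Simulated responses; with requests one would pass
# page.status_code and page.text instead.
blocked = looks_blocked(200, "<title>Attention Required! | Cloudflare</title>")
normal = looks_blocked(200, "<html><title>Results</title></html>")
print(blocked, normal)
```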
Trying to run with the example
time python match_scores.py 2016 2017
from command window and I'm getting the following error.
The system cannot accept the time entered.
Enter the new time:
If I set up a run configuration with PyCharm I get a different error when using a script parameter of the same command.
NameError: name 'xrange' is not defined
This behavior occurs with Python 3.6.1.
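Both errors probably have simple causes. The time prefix in the example is a Unix shell timing utility, whereas in the Windows command prompt time is the command that sets the system clock, so dropping the prefix (python match_scores.py 2016 2017) should avoid the first message. The second error arises because xrange exists only in Python 2. A minimal compatibility sketch, assuming the scripts use xrange only for ordinary loops:

```python
try:
    xrange  # Python 2: the name already exists
except NameError:
    xrange = range  # Python 3: range is already lazy

total = sum(i for i in xrange(5))
print(total)  # 10
```

Other Python 2 constructs in the scripts may still need changes; this only covers the NameError quoted above.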
Hey man great work on the python scripts!
I have picked up an issue with the match stats though. The "scrape_match_stats" function in functions.py assumes that the winner is always on the left. This is not always the case, though.
Here is an example: stats
To get around this you can check which side has the "won-game" class which produces the checkmark next to their name. Here is the xpath for finding the class.
//table[@class='scores-table']/tbody/tr[1]/td[1]/@class
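As a rough illustration of that check, here is a sketch that uses the standard library's ElementTree in place of whatever parser the scraper uses. The markup is a hypothetical, simplified stand-in for the real page, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<div>
  <table class="scores-table">
    <tbody>
      <tr><td class="won-game"></td><td>Player A</td></tr>
      <tr><td class=""></td><td>Player B</td></tr>
    </tbody>
  </table>
</div>
""".strip()


def first_row_player_won(root):
    """True if the first cell of the first row carries the won-game class."""
    cell = root.find(".//table[@class='scores-table']/tbody/tr[1]/td[1]")
    if cell is None:
        return False
    return "won-game" in cell.get("class", "").split()


root = ET.fromstring(SAMPLE)
print(first_row_player_won(root))
```

If the function returns False, the stats columns would need to be swapped before writing the winner and loser fields.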
Cool stuff, but the .csv it generates displays only the first line. Any ideas of what could it be?
After rebuilding the match_stats file, I noticed there are some discrepancies in the current CSV files.
Please see the file match_stats_2016 against the following URLs:
http://www.atpworldtour.com/en/scores/2016/96/MS036/match-stats
http://www.atpworldtour.com/en/scores/2016/96/MS045/match-stats
http://www.atpworldtour.com/en/scores/2016/308/QS005/match-stats
http://www.atpworldtour.com/en/scores/2016/322/MS015/match-stats
http://www.atpworldtour.com/en/scores/2016/329/MS008/match-stats
http://www.atpworldtour.com/en/scores/2016/329/QS006/match-stats
http://www.atpworldtour.com/en/scores/2016/338/MS015/match-stats
http://www.atpworldtour.com/en/scores/2016/403/MS049/match-stats
http://www.atpworldtour.com/en/scores/2016/403/MS058/match-stats
http://www.atpworldtour.com/en/scores/2016/416/MS054/match-stats
http://www.atpworldtour.com/en/scores/2016/520/MS023/match-stats
http://www.atpworldtour.com/en/scores/2016/520/QS119/match-stats
http://www.atpworldtour.com/en/scores/2016/533/MS019/match-stats
http://www.atpworldtour.com/en/scores/2016/533/QS013/match-stats
http://www.atpworldtour.com/en/scores/2016/560/MS070/match-stats
http://www.atpworldtour.com/en/scores/2016/560/MS120/match-stats
http://www.atpworldtour.com/en/scores/2016/580/MS090/match-stats
http://www.atpworldtour.com/en/scores/2016/580/MS122/match-stats
http://www.atpworldtour.com/en/scores/2016/807/MS019/match-stats
http://www.atpworldtour.com/en/scores/2016/6242/MS011/match-stats
http://www.atpworldtour.com/en/scores/2016/6932/MS012/match-stats
http://www.atpworldtour.com/en/scores/2016/6932/MS030/match-stats
http://www.atpworldtour.com/en/scores/2016/6932/QS004/match-stats
http://www.atpworldtour.com/en/scores/2016/7290/QS015/match-stats
http://www.atpworldtour.com/en/scores/2016/7480/MS003/match-stats
http://www.atpworldtour.com/en/scores/2016/7480/QS008/match-stats
Only 26 rows were incorrect out of the 3923 matches that year.
The only year I have tested so far is 2016; however, if these issues exist there, it's fair to say they probably exist in other years too. PR #4 produces CSVs without these errors.
When I run any of the scrapers it reports that 0 tournaments were found. Any idea what I could be doing wrong?
Hi thank you very much for this nice piece of code,
I can't get the match_scores data past Auckland 2018. I get the following error:
Traceback (most recent call last):
File "match_scores.py", line 29, in <module>
scrape_tourney_output = scrape_tourney(tourney_urls_scrape[i])
File "C:\Users\Efstratios\atp-world-tour-tennis-data\python\functions.py", line 342, in scrape_tourney
winner_slug = winner_url_split[3]
IndexError: list index out of range
Do you have any idea how to fix this?
Many Thanks in advance !
I believe that in line 363 of the match_scores.py script, the index of match_stats_url_suffix_split should be set to 7 instead of 5.
I noticed this when trying to join the match_stats and match_scores data sets on the match_id column, which gave me no matches. This was because, for example, the match_id in the match_stats csv corresponding to /en/scores/stats-centre/archive/2022/8998/ms001 is 2022-8998-ms001-7-1-mc65-ke29, whereas in the match_scores csv, it's 2022-8998-2022-7-1-mc65-ke29. I believe the former is the intended match_id.
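Splitting the example URL suffix shows why index 5 yields the year a second time while index 7 yields the match code. This is only an illustration of the split; the variable names are hypothetical and the surrounding scraper code is not reproduced.

```python
suffix = "/en/scores/stats-centre/archive/2022/8998/ms001"
parts = suffix.split("/")

# parts[0] is the empty string before the leading slash.
print(parts[5])  # 2022
print(parts[6])  # 8998
print(parts[7])  # ms001

id_prefix = "-".join([parts[5], parts[6], parts[7]])
print(id_prefix)  # 2022-8998-ms001
```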
First off, thanks for making your project available.
This is probably my fault as I'm a python newbie. You'll see I'm using python 2.7 and installed the futures and requests packages.
When trying to import match stats I receive the following error:
$ python match_stats.py 2017 0
Collecting match stats data for 66 tournaments:
Index Tourney slug Matches
----- ------------ -------
Traceback (most recent call last):
File "match_stats.py", line 51, in <module>
match_stats_data_scrape += asynchronous(match_stats_url_suffixes, scrape_match_stats, tourney_index, tourney_slug)
File "/.../bin/atp-world-tour-tennis-data/python/functions.py", line 567, in asynchronous
scrape_match_stats_output += future.result()
File "/usr/local/lib/python2.7/site-packages/concurrent/futures/_base.py", line 422, in result
return self.__get_result()
File "/usr/local/lib/python2.7/site-packages/concurrent/futures/_base.py", line 381, in __get_result
raise exception_type, self._exception, self._traceback
IndexError: list index out of range
I can run tournament.py and match_scores.py without issue.