Using Python to scrape ATP World Tour tennis data



atp-world-tour-tennis-data's Issues

Data quality issue

Hi,

this is a great database, many thanks for it! But I have a note on a data quality issue. Perhaps it can help you to make it even better.

I think the match_id field in the match_scores and match_stats files should be a primary key, but currently it is not. Consider the case where two matches in a single tournament were played between the same players with the same outcome (same winner and loser), for instance if they met in the round-robin phase and again in the final. It is then unclear which match in the match_scores file corresponds to which stats in the match_stats file.

Also, in 3 cases the first match (or the second?) is completely missing from the match_stats file. As an example, consider the following 14 match_ids:

2001-605-h432-g379
2003-605-f324-a092
2004-605-f324-h432
2008-605-d643-d402
2009-306-c514-r383
2010-573-c882-r772
2011-605-f324-t786
2013-319-h571-sf41
2013-520-p701-w522
2015-410-r772-h756
2015-573-bc72-m824
2017-337-db59-f586
2017-7648-bh09-ca99
results-605-d875-gb88

The first 3 from 2001, 2003 and 2004 are the ones with missing matches from match_stats.
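A minimal sketch of how such collisions could be detected, assuming the match_id values have already been read from a CSV column into a list. The helper name `duplicated_ids` and the sample list are hypothetical and only illustrate the check:

```python
from collections import Counter


def duplicated_ids(match_ids):
    """Return the match_id values that occur more than once, in first-seen order."""
    counts = Counter(match_ids)
    return [match_id for match_id, n in counts.items() if n > 1]


# Hypothetical sample: one ID appears twice, as it would for a round-robin
# match and a final between the same winner and loser.
sample_ids = [
    "2015-410-r772-h756",
    "2004-605-f324-h432",
    "2004-605-f324-h432",
]
duplicates = duplicated_ids(sample_ids)
print(duplicates)
```

Any ID this returns cannot serve as a primary key on its own; adding the round to the key would be one possible way to disambiguate.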

Way to directly use their API

Hello @serve-and-volley!

Just to say that I really like what you did scraping this humongous website. That's a lot of work, and I appreciate it.

I recently found a way to use their API directly by decoding the encrypted data. I thought you might be interested, as it could make scraping a bit easier and less of a burden to maintain, since it no longer relies on the interface (as long as they do not change the cipher method):
https://stackoverflow.com/questions/73735401/scraping-an-atptour-com-api-returns-what-looks-like-encrypted-data/75086660#75086660

If I can be of any help, please reach out 😉

Cheers

Scraping Issue

Fantastic work, this is a very cool project!

I tried running the "match_stats.py..." command (for 2018), and it looks like most of the matches came back with no results.

Would this indicate that the script needs to be adjusted to the ATP website?

Please help

I have tried all 3 scripts and they run, but the data count is always zero. It seems that the problem is in the line

tourney_titles_parsed = xpath_parse(year_tree, tourney_titles_xpath)

at least in tournament.py.

Could you help me?

Getting Player Names for Match Stats

While scraping player stats, I am not able to find the players' names. Any idea how to extract the winner's and loser's names for each match?

csv column headers

I am having trouble figuring out the column headers for the csv files in https://github.com/serve-and-volley/atp-world-tour-tennis-data/tree/master/csv/3_match_stats

About the player overviews

Hi, could you please add the player overviews CSV column titles, etc to the python folder?
Thank you soooo much!

2021 is missing from ATP site.

How do you go about filling in missing data?

Scraping Script.

I'm new to scraping but very interested in using the tennis data for analysis in a side project. When running the scraping script, I got a 'list index out of range' type of error at the command-line input lines. I was running your script in Spyder and Jupyter Notebook.

Command line input

start_year = str(sys.argv[1])
end_year = str(sys.argv[2])

The lines above are where the error occurred.

best,

Paripon Thanthong
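The two `sys.argv` lines expect the years to be passed on the command line, which is usually not the case when a script is run from Spyder or a notebook cell. A minimal sketch of a guard, assuming the scripts only need a start and an end year; the helper name `parse_years` is hypothetical and not part of the project:

```python
def parse_years(argv):
    """Return (start_year, end_year) as strings, or exit with a usage message."""
    if len(argv) < 3:
        raise SystemExit("usage: python <script>.py START_YEAR END_YEAR")
    return str(argv[1]), str(argv[2])


# On the command line this would be called as parse_years(sys.argv).
# In an interactive session, pass the arguments explicitly instead:
start_year, end_year = parse_years(["match_scores.py", "2016", "2017"])
print(start_year, end_year)
```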

Licence?

Hi,

I was wondering if you plan on adding a licence so others can use these programs and datasets? I think it would be very useful in my academic research, but for ethical reasons I require a licence. Simple examples you could add to the readme are here: https://creativecommons.org/share-your-work/

Thanks,
W

Getting the following error

Hi,

I ran the following command as per your suggestion - $ python atp_match_data_player.py roger-federer f324 1998 2016

and got the following error:

 File "atp_match_data_player.py", line 268, in <module>

match_score += "(" + match_score_tiebreak_parsed[k] + ") "

IndexError: list index out of range

It seems like the element the code is trying to call is outside the range but I am not sure where the exact problem is. Would be great if you can help. Thanks!

Hemant

403 error

Hi :)

First of all, great job on this project. Very nice work and thanks for making it available. :)

I have been trying to get the "tournaments.py" script to work. When I try to run it for 2016 - 2017, it returns 0 results for both years.

When I debug it, I noticed the server is returning a "403 forbidden" code when I try to GET the URL (

It looks like a Cloudflare bot check or a captcha of some sort.

I've put the server response in the attached file.
html-error-response.txt

I noticed that the code does not set or change any headers before sending the GET request. By default, the headers of a python-requests GET request are as follows:

"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "www.atptour.com",
"User-Agent": "python-requests/2.31.0"
}

The "python-requests" user-agent seems to advertise us as a bot, so I thought that was the problem.

I have tried setting these custom headers before making the GET request:

HEADERS = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en;q=0.9,fr;q=0.8",
"Sec-Ch-Ua": '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"Windows"',
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
"Host": "www.atptour.com"
}

page = requests.get(url, headers=HEADERS)

.... but that doesn't work either. I am still getting the 403 error. It doesn't seem to be an IP issue either. If I try to go to the URL with my browser, it displays fine.

Has something changed on their side? Have you tested this recently?

Any ideas?

Thanks. :)

CAPTCHA checking

When I try to run the script, I don't receive any data. When I inspect the content returned by requests.get(url), I get something like this:

<title>Attention Required! | Cloudflare</title>

...
...

 <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
            
            <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
          </div>

          <div class="cf-column">
            <h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>
            

            <p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.</p>

            <p data-translate="resolve_captcha_network">If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.</p>
            
              
              <p data-translate="resolve_captcha_privacy_pass"> Another way to prevent getting this page in the future is to use Privacy Pass. You may need to download version 2.0 now from the <a href="https://addons.mozilla.org/en-US/firefox/addon/privacy-pass/">Firefox Add-ons Store</a>.</p>
              
            
          </div>

Is this new behaviour or is there still a way to avoid this?
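A minimal sketch of a check that would make this failure visible instead of silently producing empty results. It assumes the status code and body text of the response are available (with requests, `page.status_code` and `page.text`); the function name `looks_blocked` and the marker strings, taken from the page quoted above, are illustrative and may not cover every challenge page:

```python
# Strings that appear in the challenge page quoted above.
BLOCK_MARKERS = (
    "Attention Required! | Cloudflare",
    "why_captcha_headline",
)


def looks_blocked(status_code, body):
    """Heuristic: True if the response looks like a block or CAPTCHA page."""
    if status_code == 403:
        return True
    return any(marker in body for marker in BLOCK_MARKERS)


sample_body = "<title>Attention Required! | Cloudflare</title>"
print(looks_blocked(200, sample_body))
print(looks_blocked(200, "<html><body>scores</body></html>"))
```

This only detects the block; it does not get around it.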

Cannot run match_scores in python 3.5 windows

Trying to run with the example

time python match_scores.py 2016 2017

from the command window and I'm getting the following error.

The system cannot accept the time entered.
Enter the new time:

If I set up a run configuration in PyCharm with the same script parameters, I get a different error:
NameError: name 'xrange' is not defined

This behavior is with python 3.6.1
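Two separate things are going on here. In the Windows command prompt, `time` is the built-in command that sets the system clock, not a timing utility as in Unix shells, so the prefix has to be dropped there. The NameError occurs because `xrange` exists only in Python 2; in Python 3, `range` is already lazy. A minimal compatibility sketch, assuming the scripts use `xrange` only for iteration:

```python
try:
    xrange  # Python 2: the name exists, nothing to do
except NameError:
    xrange = range  # Python 3: range is already a lazy sequence

total = sum(i for i in xrange(5))
print(total)
```

Placing the shim near the top of a script lets the remaining `xrange` calls run unchanged on both versions.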

Match Stats Winner & Loser Issues

Hey man great work on the python scripts!

I have picked up an issue with the match stats though. The "scrape_match_stats" function in functions.py assumes that the winner is always on the left. This is not always the case though.

Here is an example: stats

To get around this you can check which side has the "won-game" class which produces the checkmark next to their name. Here is the xpath for finding the class.
//table[@class='scores-table']/tbody/tr[1]/td[1]/@class
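A minimal sketch of that check, run with the standard library's ElementTree on a hypothetical, well-formed fragment. The markup below is an assumption modeled on the XPath above, not a copy of the real page, and real pages are HTML, so they would need an HTML parser rather than `ET.fromstring`:

```python
import xml.etree.ElementTree as ET

# Same location as the XPath above, without the trailing /@class,
# because ElementTree reads attributes with .get() instead.
CELL_PATH = ".//table[@class='scores-table']/tbody/tr[1]/td[1]"


def first_row_is_winner(fragment):
    """Return True if the first row's first cell carries the won-game class."""
    root = ET.fromstring(fragment)
    cell = root.find(CELL_PATH)
    if cell is None:
        raise ValueError("scores table not found")
    return "won-game" in cell.get("class", "").split()


# Hypothetical fragments: the checkmark cell is in row 1, then in row 2.
ROW_1_WON = """<div><table class="scores-table"><tbody>
<tr><td class="won-game"></td><td>Player A</td></tr>
<tr><td class=""></td><td>Player B</td></tr>
</tbody></table></div>"""
ROW_2_WON = ROW_1_WON.replace('class="won-game"', 'class="x"', 1).replace(
    'class=""', 'class="won-game"', 1
)

print(first_row_is_winner(ROW_1_WON))
print(first_row_is_winner(ROW_2_WON))
```

When the function returns False, the winner and loser columns would need to be swapped before the row is written.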

Only header is displayed

Cool stuff, but the .csv it generates contains only the header line. Any idea what it could be?

Data errors

After rebuilding the match_stats file, I noticed there are some discrepancies in the current CSV files.

Please see the file match_stats_2016 against the following URLs:

http://www.atpworldtour.com/en/scores/2016/96/MS036/match-stats
http://www.atpworldtour.com/en/scores/2016/96/MS045/match-stats
http://www.atpworldtour.com/en/scores/2016/308/QS005/match-stats
http://www.atpworldtour.com/en/scores/2016/322/MS015/match-stats
http://www.atpworldtour.com/en/scores/2016/329/MS008/match-stats
http://www.atpworldtour.com/en/scores/2016/329/QS006/match-stats
http://www.atpworldtour.com/en/scores/2016/338/MS015/match-stats
http://www.atpworldtour.com/en/scores/2016/403/MS049/match-stats
http://www.atpworldtour.com/en/scores/2016/403/MS058/match-stats
http://www.atpworldtour.com/en/scores/2016/416/MS054/match-stats
http://www.atpworldtour.com/en/scores/2016/520/MS023/match-stats
http://www.atpworldtour.com/en/scores/2016/520/QS119/match-stats
http://www.atpworldtour.com/en/scores/2016/533/MS019/match-stats
http://www.atpworldtour.com/en/scores/2016/533/QS013/match-stats
http://www.atpworldtour.com/en/scores/2016/560/MS070/match-stats
http://www.atpworldtour.com/en/scores/2016/560/MS120/match-stats
http://www.atpworldtour.com/en/scores/2016/580/MS090/match-stats
http://www.atpworldtour.com/en/scores/2016/580/MS122/match-stats
http://www.atpworldtour.com/en/scores/2016/807/MS019/match-stats
http://www.atpworldtour.com/en/scores/2016/6242/MS011/match-stats
http://www.atpworldtour.com/en/scores/2016/6932/MS012/match-stats
http://www.atpworldtour.com/en/scores/2016/6932/MS030/match-stats
http://www.atpworldtour.com/en/scores/2016/6932/QS004/match-stats
http://www.atpworldtour.com/en/scores/2016/7290/QS015/match-stats
http://www.atpworldtour.com/en/scores/2016/7480/MS003/match-stats
http://www.atpworldtour.com/en/scores/2016/7480/QS008/match-stats

Only 26 of the year's 3923 match rows were incorrect.

The only year I have tested so far is 2016; however, if these issues exist there, it's fair to say they probably exist in other years too. PR #4 produces CSVs without these errors.

Not finding any tournaments

When I run any of the scrapers, it reports that 0 tournaments were found. Any idea what I could be doing wrong?

2018-2019 Data

Hi thank you very much for this nice piece of code,
I can't get the match_scores data past Auckland 2018. I get the following error:

Traceback (most recent call last):
  File "match_scores.py", line 29, in <module>
    scrape_tourney_output = scrape_tourney(tourney_urls_scrape[i])
  File "C:\Users\Efstratios\atp-world-tour-tennis-data\python\functions.py", line 342, in scrape_tourney
    winner_slug = winner_url_split[3]
IndexError: list index out of range

Do you have any idea how to fix this?

Many Thanks in advance !
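The traceback shows that `winner_url_split` had fewer than four elements, which would happen if the winner link was empty or had a different shape on that page. A minimal sketch of a defensive lookup, assuming player URLs of the form `/en/players/<slug>/<id>/overview` split on "/"; the URL shape and the helper name `url_segment` are assumptions, not taken from functions.py:

```python
def url_segment(url, index, default=""):
    """Return the index-th "/"-separated part of url (index >= 0), or default."""
    parts = url.split("/")
    return parts[index] if index < len(parts) else default


# Assumed URL shape: index 3 is the player slug.
winner_slug = url_segment("/en/players/roger-federer/f324/overview", 3)
# An empty link no longer raises; it yields the default instead.
missing_slug = url_segment("", 3)
print(repr(winner_slug), repr(missing_slug))
```

Rows with an empty slug could then be logged or skipped rather than aborting the whole run.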

Minor update to match_scores.py

I believe that in line 363 of the match_scores.py script, the index of match_stats_url_suffix_split should be set to 7 instead of 5.

I noticed this when trying to join the match_stats and match_scores data sets on the match_id column, which gave me no matches. This was because, for example, the match_id in the match_stats csv corresponding to /en/scores/stats-centre/archive/2022/8998/ms001 is 2022-8998-ms001-7-1-mc65-ke29, whereas in the match_scores csv, it's 2022-8998-2022-7-1-mc65-ke29. I believe the former is the intended match_id.
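The index arithmetic can be checked directly on the URL suffix quoted above, assuming the script splits it on "/" (an assumption based on the variable name `match_stats_url_suffix_split`):

```python
suffix = "/en/scores/stats-centre/archive/2022/8998/ms001"
parts = suffix.split("/")

# The leading "/" produces an empty string at index 0, so:
# index 5 is the year, index 6 the tournament code, index 7 the match code.
year_segment = parts[5]
match_code = parts[7]
print(year_segment, match_code)
```

Index 5 yields the year, which explains the repeated 2022 in the match_scores IDs. Taking `parts[-1]` would be an alternative that does not depend on the path depth.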

IndexError: list index out of range

First off, thanks for making your project available.

This is probably my fault as I'm a Python newbie. You'll see I'm using Python 2.7 and have installed the futures and requests packages.

When trying to import match stats I receive the following error:

$ python match_stats.py 2017 0

Collecting match stats data for 66 tournaments:

Index    Tourney slug       Matches
-----    ------------       -------
Traceback (most recent call last):
  File "match_stats.py", line 51, in <module>
    match_stats_data_scrape += asynchronous(match_stats_url_suffixes, scrape_match_stats, tourney_index, tourney_slug)
  File "/.../bin/atp-world-tour-tennis-data/python/functions.py", line 567, in asynchronous
    scrape_match_stats_output += future.result()
  File "/usr/local/lib/python2.7/site-packages/concurrent/futures/_base.py", line 422, in result
    return self.__get_result()
  File "/usr/local/lib/python2.7/site-packages/concurrent/futures/_base.py", line 381, in __get_result
    raise exception_type, self._exception, self._traceback
IndexError: list index out of range

I can run tournament.py and match_scores.py without issue.
