rymscraper's Introduction

rymscraper

rymscraper is an unofficial Python API to extract data from rateyourmusic.com (👍 consider supporting them!).

⚠️ Excessive use of rymscraper can get your IP address banned by rateyourmusic for a few days.
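One way to reduce that risk is to space out calls. The sketch below is only an illustration: `fetch_politely` and its `delay_seconds` parameter are hypothetical helpers, not part of rymscraper, and the default delay is an arbitrary placeholder rather than a limit known to be safe.

```python
import time


def fetch_politely(fetch, items, delay_seconds=5.0):
    """Call fetch(item) for each item, sleeping between consecutive calls.

    `fetch` is any callable, for example a lambda wrapping a scraper method.
    The default delay is an arbitrary placeholder, not a documented limit.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every call except the first
        results.append(fetch(item))
    return results
```

For instance, `fetch_politely(lambda n: network.get_artist_infos(name=n), ["Air", "M83"])` would query two artists with a pause in between, using the `network` object created in the Example section.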

Requirements

  • beautifulsoup4
  • lxml
  • requests
  • pandas
  • selenium with geckodriver
  • tqdm

Installation

Classic installation

python setup.py install

Installation in a virtualenv with pipenv

pipenv install '-e .'

Example

The library returns its data as Python dicts, which can easily be converted to CSV or JSON.

>>> import pandas as pd
>>> from rymscraper import rymscraper, RymUrl

>>> network = rymscraper.RymNetwork()

Artist

>>> artist_infos = network.get_artist_infos(name="Daft Punk")
>>> # or network.get_artist_infos(url="https://rateyourmusic.com/artist/daft-punk")
>>> import json
>>> print(json.dumps(artist_infos, indent=4, ensure_ascii=False))
{
    "Name": "Daft Punk",
    "Formed": "1993, Paris, Île-de-France, France",
    "Disbanded": "22 February 2021",
    "Members": "Thomas Bangalter (programming, synthesizer, keyboards, drum machine, guitar, bass, vocals, vocoder, talk box), Guy-Manuel de Homem-Christo (programming, synthesizer, keyboards, drums, drum machine, guitar)",
    "Related Artists": "Darlin'",
    "Notes": "See also: Discovered: A Collection of Daft Funk Samples",
    "Also Known As": "Draft Ponk",
    "Genres": "French House, Film Score, Disco, Electronic, Synthpop, Electroclash"
}
>>> # you can easily convert all returned values to a pandas dataframe
>>> df = pd.DataFrame([artist_infos])
>>> df[['Name', 'Formed', 'Disbanded']]
     Name                              Formed         Disbanded
Daft Punk  1993, Paris, Île-de-France, France  22 February 2021

You can also extract several artists at once:

>>> # several artists
>>> list_artists_infos = network.get_artists_infos(names=["Air", "M83"])
>>> # or network.get_artists_infos(urls=["https://rateyourmusic.com/artist/air", "https://rateyourmusic.com/artist/m83"])
>>> df = pd.DataFrame(list_artists_infos)
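
As a rough sketch of the CSV and JSON conversion mentioned earlier, the standard library is enough. The records below are hand-written stand-ins for scraper output (the second one is a pure placeholder), and `to_json` and `to_csv` are hypothetical helper names, not rymscraper functions.

```python
import csv
import io
import json

# Stand-in records; real scraper output contains more keys.
records = [
    {"Name": "Daft Punk", "Formed": "1993, Paris, Île-de-France, France",
     "Disbanded": "22 February 2021"},
    {"Name": "Placeholder Artist", "Formed": "placeholder"},
]


def to_json(rows):
    """Serialize a list of dicts to a JSON string, keeping non-ASCII text."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_csv(rows):
    """Serialize a list of dicts to CSV text; missing keys become empty cells."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # union of keys, first-seen order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_json(records))
print(to_csv(records))
```

When the data is already in a DataFrame, `df.to_csv("artists.csv", index=False)` is a shorter route to the same result.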

Album

>>> # name field should use the format Artist - Album name (not ideal but it works for now)
>>> album_infos = network.get_album_infos(name="XTC - Black Sea")
>>> # or network.get_album_infos(url="https://rateyourmusic.com/release/album/xtc/black-sea/")
>>> df = pd.DataFrame([album_infos])

You can also extract several albums at once:

>>> # several albums
>>> list_album_infos = network.get_albums_infos(names=["Ride - Nowhere", "Electrelane - Axes"])
>>> # or network.get_albums_infos(urls=["https://rateyourmusic.com/release/album/ride/nowhere/", "https://rateyourmusic.com/release/album/electrelane/axes/"])
>>> df = pd.DataFrame(list_album_infos)

Album Timeline

Number of ratings per day:

>>> import datetime
>>> album_timeline = network.get_album_timeline(url="https://rateyourmusic.com/release/album/feu-chatterton/palais-dargile/")
>>> df = pd.DataFrame(album_timeline)
>>> df["Date"] = df["Date"].apply(lambda x: datetime.datetime.strptime(x, "%d %b %Y"))
>>> df["Date"].groupby(df["Date"].dt.to_period("D")).count().plot(kind="bar")

[Figure: bar chart of the number of ratings per day]
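
The same per-day count can be sketched without pandas. The rows below are invented placeholders that only assume a "Date" field in the `%d %b %Y` format used above; `ratings_per_day` is a hypothetical helper name. Note that `%b` parsing depends on the locale, so English month abbreviations are assumed.

```python
import datetime
from collections import Counter

# Placeholder rows mimicking the assumed "Date" field format.
album_timeline = [
    {"Date": "12 Mar 2021"},
    {"Date": "12 Mar 2021"},
    {"Date": "13 Mar 2021"},
]


def ratings_per_day(timeline):
    """Return {date: number of ratings}, sorted by date."""
    days = (
        datetime.datetime.strptime(row["Date"], "%d %b %Y").date()
        for row in timeline
    )
    return dict(sorted(Counter(days).items()))


for day, count in ratings_per_day(album_timeline).items():
    print(day.isoformat(), count)
```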

Chart

>>> # (slow for very long charts)
>>> rym_url = RymUrl.RymUrl() # default: top of all-time. See examples/get_chart.py source code for more options.
>>> chart_infos = network.get_chart_infos(url=rym_url, max_page=3)
>>> df = pd.DataFrame(chart_infos)
>>> df[['Rank', 'Artist', 'Album', 'RYM Rating', 'Ratings']]
Rank                         Artist                                              Album RYM Rating Ratings
   1                      Radiohead                                        OK Computer       4.23   67360
   2                     Pink Floyd                                 Wish You Were Here       4.29   46534
   3                   King Crimson                   In the Court of the Crimson King       4.30   42784
   4                      Radiohead                                              Kid A       4.21   55999
   5            My Bloody Valentine                                           Loveless       4.24   47394
   6                 Kendrick Lamar                                To Pimp a Butterfly       4.27   41040
   7                     Pink Floyd                          The Dark Side of the Moon       4.20   55535
   8                    The Beatles                                         Abbey Road       4.25   42739
   9  The Velvet Underground & Nico                      The Velvet Underground & Nico       4.24   44002
  10                    David Bowie  The Rise and Fall of Ziggy Stardust and the Sp...       4.26   37963
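
Once the chart rows are available as dicts, they can be reordered with plain Python. This is a minimal sketch using three rows copied from the table above; it assumes the "RYM Rating" values may arrive as strings, which is why they are passed through `float()`. `sort_by_rating` is a hypothetical helper name.

```python
# Three rows copied from the chart output above, written as dicts.
chart_infos = [
    {"Artist": "Radiohead", "Album": "OK Computer",
     "RYM Rating": "4.23", "Ratings": "67360"},
    {"Artist": "Pink Floyd", "Album": "Wish You Were Here",
     "RYM Rating": "4.29", "Ratings": "46534"},
    {"Artist": "King Crimson", "Album": "In the Court of the Crimson King",
     "RYM Rating": "4.30", "Ratings": "42784"},
]


def sort_by_rating(rows):
    """Return the rows ordered from highest to lowest RYM rating."""
    # float() accepts both strings and numbers, so either type works here.
    return sorted(rows, key=lambda row: float(row["RYM Rating"]), reverse=True)


for row in sort_by_rating(chart_infos):
    print(row["RYM Rating"], row["Artist"], "-", row["Album"])
```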

Discography

>>> discography_infos = network.get_discography_infos(name="Aufgang", complementary_infos=True)
>>> # or network.get_discography_infos(url="https://rateyourmusic.com/artist/aufgang")
>>> df = pd.DataFrame.from_records(discography_infos)
>>> # don't forget to close and quit the browser (prevent memory leaks)
>>> network.browser.close()
>>> network.browser.quit()
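
To make sure the browser is released even when a call raises, the two cleanup calls can be wrapped in a context manager. This is a sketch, not something rymscraper provides: `closing_browser` is a hypothetical helper that only relies on the `network.browser.close()` and `network.browser.quit()` calls shown above.

```python
from contextlib import contextmanager


@contextmanager
def closing_browser(network):
    """Yield the network, then close and quit its browser even on error."""
    try:
        yield network
    finally:
        try:
            network.browser.close()
        finally:
            # quit() still runs if close() itself raises
            network.browser.quit()
```

It would be used as `with closing_browser(rymscraper.RymNetwork()) as network:` followed by the scraping calls in the indented body.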

Example Scripts

Some scripts are included in the examples folder.

  • get_artist_infos.py : extract information about one or several artists, by name or URL, to a CSV file.
  • get_chart.py : extract information about the albums in a chart (selected by genre, year, country or URL) to a CSV file.
  • get_discography.py : extract the discography of one or several artists, by name or URL, to a CSV file.
  • get_album_infos.py : extract information about one or several albums, by name or URL, to a CSV file.
  • get_album_timeline.py : extract the timeline of an album to a JSON file.

Usage

python get_artist_infos.py -a "u2,xtc,brad mehldau"
python get_artist_infos.py --file_artist artist_list.txt

python get_chart.py -g rock
python get_chart.py -g ambient -y 2010s -c France --everything

python get_discography.py -a magma
python get_discography.py -a "the new pornographers, ween, stereolab" --complementary_infos --separate_export

python get_album_infos.py -a "ride - nowhere"
python get_album_infos.py --file_url urls_list.txt --no_headless

python get_album_timeline.py -a "ride - nowhere"
python get_album_timeline.py -u "https://rateyourmusic.com/release/album/feu-chatterton/palais-dargile/"
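
The input file for `--file_artist` holds one artist per line. As a small sketch, such a file could be produced from a comma-separated string like the ones passed to `-a`. Both helpers are hypothetical and do not reflect how the scripts parse their arguments internally.

```python
from pathlib import Path


def split_names(raw):
    """Split a comma-separated string into trimmed, non-empty names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def write_name_file(names, path):
    """Write one name per line, the layout expected by --file_artist."""
    Path(path).write_text("\n".join(names) + "\n", encoding="utf-8")


print(split_names("u2,xtc,brad mehldau"))  # ['u2', 'xtc', 'brad mehldau']
```

Calling `write_name_file(split_names("u2,xtc,brad mehldau"), "artist_list.txt")` would then create a file usable with the second command above.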

Help

python get_artist_infos.py -h
usage: get_artist_infos.py [-h] [--debug] [-u URL] [--file_url FILE_URL]
                           [--file_artist FILE_ARTIST] [-a ARTIST] [-s]
                           [--no_headless]

Scraper rateyourmusic (artist version).

optional arguments:
  -h, --help            show this help message and exit
  --debug               Display debugging information.
  -u URL, --url URL     URLs of the artists to extract (separated by comma).
  --file_url FILE_URL   File containing the URLs to extract (one by line).
  --file_artist FILE_ARTIST
                        File containing the artists to extract (one by line).
  -a ARTIST, --artist ARTIST
                        Artists to extract (separated by comma).
  -s, --separate_export
                        Also export the artists in separate files.
  --no_headless         Launch selenium in foreground (background by default).

python get_chart.py -h
usage: get_chart.py [-h] [--debug] [-u URL] [-g GENRE] [-y YEAR] [-c COUNTRY]
                    [-p PAGE] [-e] [--no_headless]

Scraper rateyourmusic (chart version).

optional arguments:
  -h, --help            show this help message and exit
  --debug               Display debugging information.
  -u URL, --url URL     Chart URL to parse.
  -g GENRE, --genre GENRE
                        Chart Option : Genre (use + if you need a space).
  -y YEAR, --year YEAR  Chart Option : Year.
  -c COUNTRY, --country COUNTRY
                        Chart Option : Country.
  -p PAGE, --page PAGE  Number of page to extract. If not set, every pages
                        will be extracted.
  -e, --everything      Chart Option : Extract Everything / All Releases
                        (otherwise only albums).
  --no_headless         Launch selenium in foreground (background by default).

python get_discography.py -h
usage: get_discography.py [-h] [--debug] [-u URL] [--file_url FILE_URL]
                          [--file_artist FILE_ARTIST] [-a ARTIST] [-s] [-c]
                          [--no_headless]

Scraper rateyourmusic (discography version).

optional arguments:
  -h, --help            show this help message and exit
  --debug               Display debugging information.
  -u URL, --url URL     URLs to extract (separated by comma).
  --file_url FILE_URL   File containing the URLs to extract (one by line).
  --file_artist FILE_ARTIST
                        File containing the artists to extract (one by line).
  -a ARTIST, --artist ARTIST
                        Artists to extract (separated by comma).
  -s, --separate_export
                        Also export the artists in separate files.
  -c, --complementary_infos
                        Extract complementary informations for each releases
                        (slower, more requests on rym).
  --no_headless         Launch selenium in foreground (background by default).

python get_album_infos.py -h
usage: get_album_infos.py [-h] [--debug] [-u URL] [--file_url FILE_URL]
                          [--file_album_name FILE_ALBUM_NAME] [-a ALBUM_NAME]
                          [-s] [--no_headless]

Scraper rateyourmusic (album version).

optional arguments:
  -h, --help            show this help message and exit
  --debug               Display debugging information.
  -u URL, --url URL     URL to extract (separated by comma).
  --file_url FILE_URL   File containing the URLs to extract (one by line).
  --file_album_name FILE_ALBUM_NAME
                        File containing the name of the albums to extract (one
                        by line, format Artist - Album).
  -a ALBUM_NAME, --album_name ALBUM_NAME
                        Albums to extract (separated by comma, format Artist -
                        Album).
  -s, --separate_export
                        Also export the artists in separate files.
  --no_headless         Launch selenium in foreground (background by default).

python get_album_timeline.py -h
usage: get_album_timeline.py [-h] [--debug] [-u URL] [-a ALBUM_NAME]
                             [--no_headless]

Scraper rateyourmusic (album timeline version).

optional arguments:
  -h, --help            show this help message and exit
  --debug               Display debugging information.
  -u URL, --url URL     URL to extract.
  -a ALBUM_NAME, --album_name ALBUM_NAME
                        Album to extract (format Artist - Album).
  --no_headless         Launch selenium in foreground (background by default).

rymscraper's People

Contributors

chotom, dbeley, dependabot-preview[bot], shatteringlass, vuizur

rymscraper's Issues

get album cover url

Would there be any way to get the album cover url when calling network.get_album_infos() ?

Support track ratings

Most music players (iTunes, MusicBee, foobar2000, etc.) support track ratings, not album ratings. Since paid RYM customers have access to track ratings on each album, it would be nice to be able to query that information.

Might want to note that user data exports are supported by RYM

User data exports are visible on the bottom of your user page (while logged in?). Example user page is https://rateyourmusic.com/~M4rcus

I don't see the export button (very bottom of the page) there, but on my own page, I have it. It exports a simple CSV of user ratings.

Having this saves me some effort - I was going to cook something up that used this project as a baseline. This project comes up on Google when searching for "rateyourmusic api", so maybe note this in the readme.md?

difflib.get_close_matches is case sensitive and can cause lack of matches

Here's an example of what could happen when querying for albums whose titles are sometimes stylised:

>>> ai = n.get_album_infos(name='Tyler the creator - IGOR')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/federico/rymscraper/rymscraper/__init__.py", line 28, in get_album_infos
    url = utils.get_url_from_album_name(self.browser, name)
  File "/Users/federico/rymscraper/rymscraper/utils.py", line 41, in get_url_from_album_name
    difflib.get_close_matches(album_name, artist_album_name)[0]
IndexError: list index out of range

This is caused by the implicit case-sensitivity of the difflib.get_close_matches method.
I will submit a PR with a patch that can work around this limitation.
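
A minimal sketch of one possible workaround (not necessarily the patch from the PR): lowercase both the query and the candidates before calling `difflib.get_close_matches`, then map the matches back to their original spelling. `get_close_matches_icase` is a hypothetical helper name, and the candidate titles are made up for the demonstration.

```python
import difflib


def get_close_matches_icase(word, possibilities, n=3, cutoff=0.6):
    """Case-insensitive variant of difflib.get_close_matches.

    Returns the matching candidates in their original spelling.
    """
    lowered = {}
    for candidate in possibilities:
        # keep the first candidate for each lowercase form
        lowered.setdefault(candidate.lower(), candidate)
    matches = difflib.get_close_matches(
        word.lower(), list(lowered), n=n, cutoff=cutoff
    )
    return [lowered[match] for match in matches]


titles = ["igor", "flower boy"]
print(difflib.get_close_matches("IGOR", titles))  # [] (no shared characters)
print(get_close_matches_icase("IGOR", titles))    # ['igor']
```

Indexing the result with `[0]` still raises `IndexError` when nothing matches at all, so checking for an empty list before indexing remains a sensible precaution.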

Infinite receiving of a RYM page in get_url method

When I try to get info from RYM, I encounter an infinite loop in the get_url() method of the RymBrowser class (in the Selenium get() method of the WebDriver class, to be precise). Somehow it can't finish loading any RYM page.

I've tried passing other websites (google.com) to the get() method and it works just fine.

I also noticed that another Sonemic service, glitchwave.com, has the same problem.

I couldn't figure out how to solve this. Maybe I'm just missing something, idk.

Unable to change URL path

First off, thank you so much for making this available!!

This is the very first thing I've used Python for or pulled from GitHub, so please excuse any noob issues.

I cannot figure out how to change the URL that is being used in get_chart.py.

It looks like at the top it comes from rymscraper, and then in rymscraper it looks like it comes from RymUrl.py, where I can see a hard-coded path. When I update that path, it has no effect on my get_chart script.

I see I can use -u to change the URL, but that only exports one page and then exits with the error: 'str' object has no attribute 'page'

New Cloudflare protection

Hello,

It seems that with the new Cloudflare protection implemented on RateYourMusic, the scraper no longer works.

Does anybody have a solution for this?

Thanks and Best Regards!

Albums with "Needs more weighted ratings to attain its full chart position" are currently ignored

Hello,

Currently, when a scraped chart has entries with this tag, all of those entries are ignored. Here is an example (the code works on my fork):

from rymscraper.RymUrl import RymUrl
from rymscraper import rymscraper
url = RymUrl(genres="dream pop", origin_countries="russia")
print(url)
network = rymscraper.RymNetwork()

list_rows = network.get_chart_infos(url)
print(list_rows)

This prints only 4 results, because the other entries under https://rateyourmusic.com/charts/top/album/all-time/g:dream-pop/loc:russia/1/ have this indicator. It would be nice if they were scraped regardless.

Track list

Hi, thank you for the awesome work! I am wondering if there is a way that I can extract the track titles in the album? Thank you!
