Git Product home page Git Product logo

goodbooks-10k's Introduction

goodbooks-10k

This dataset contains six million ratings for ten thousand most popular (with most ratings) books. There are also:

  • books marked to read by the users
  • book metadata (author, year, etc.)
  • tags/shelves/genres

Access

Some of these files are quite large, so GitHub won't show their contents online. See samples/ for smaller CSV snippets.

Open the notebook for a quick look at the data. Download individual zipped files from releases.

The dataset is accessible from Spotlight, recommender software based on PyTorch.

A smaller version of the dataset was generated with the notebook and is under the folder reduced. In the reduced folder, a single file books_summary.csv contains the main textual information of books and tags.

Contents

ratings.csv contains ratings sorted by time. It is 69MB and looks like that:

user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3

Ratings go from one to five. Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424.

to_read.csv provides IDs of the books marked "to read" by each user, as user_id,book_id pairs, sorted by time. There are close to a million pairs.

books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The metadata have been extracted from goodreads XML files, available in books_xml.

Tags

book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs. They are sorted by goodreads_book_id ascending and count descending.

In raw XML files, tags look like this:

<popular_shelves>
	<shelf name="science-fiction" count="833"/>
	<shelf name="fantasy" count="543"/>
	<shelf name="sci-fi" count="542"/>
	...
	<shelf name="for-fun" count="8"/>
	<shelf name="all-time-favorites" count="8"/>
	<shelf name="science-fiction-and-fantasy" count="7"/>	
</popular_shelves>

Here, each tag/shelf is given an ID. tags.csv translates tag IDs to names.

goodreads IDs

Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.

You can use the goodreads book and work IDs to create URLs as follows:

https://www.goodreads.com/book/show/2767052
https://www.goodreads.com/work/editions/2792775

Note that book_id in ratings.csv and to_read.csv maps to work_id, not to goodreads_book_id, meaning that ratings for different editions are aggregated.

goodbooks-10k's People

Contributors

zygmuntz avatar cimarieta avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.