davrob01 / amazonbestsellerbooksscraper

This program scrapes Amazon to get book data (the ISBN, title, author, price, etc.) of the top 100 Best Selling books in EVERY category and every subcategory in both US and International domains. Download EXE from here:

Home Page: http://students.cs.niu.edu/~z1540454/scraper/amazonbooks.php

License: Other

C# 85.37% HTML 12.29% CSS 2.34%
web scraping-websites scraping amazon books best seller csharp windows

****************************************
*                                      *
*   Amazon Best Seller Books Scraper   *
*                                      *
* Copyright (c) David T Robertson 2016 *
*                                      *
****************************************

-------------------------------------
DESCRIPTION:
-------------------------------------

This program scrapes Amazon to get all the ISBNs of the top 100 Best Selling books in EVERY category and every subcategory.

The following websites are included in the scraping process:

https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/
https://www.amazon.co.uk/gp/bestsellers/books/

-------------------------------------
SYSTEM REQUIREMENTS:
-------------------------------------

Windows 7 (64-bit) or later

.NET Framework version 4.5 installed

                  Minimum                Recommended
                  =======                ===========
Memory:           2 GB available         4 GB available
CPU:              Core 2 Duo             Intel i5 or better
Bandwidth:        At least 500 Kbps      10 Mbps
Connection:       Wired Ethernet
HDD:              500 MB free

-------------------------------------
INSTRUCTIONS:
-------------------------------------

1. Extract the zip file to a new folder.
2. Open the folder and run the file AmazonBestSellerBooksScraper.
3. A new window should appear. Click the "Start" button to begin the process.
4. Progress is updated throughout. The number of books added should keep increasing.
5. Let the program run until completion.
6. A successful run will say "Process Complete!" and about 2 million books should be added.
7. See the "Results" folder inside the same folder that contains AmazonBestSellerBooks.

Note: Depending on your internet connection and system specs, the process may take between 1 and 3 hours.

* You can optionally do a quick test run via the "Test Run" button. About 53,500 books are scraped in a successful test run. This test verifies that you can run the process and also gives you an example of the results.

-------------------------------------
OUTPUT:
-------------------------------------

See the "Results" folder, created in the current folder after running.

A text file containing all the ISBNs, unordered, one per line, is always created. The file is marked with the date and time so that future scrapes do not overwrite previous result files.

Example of file contents:

8856653141
8867315196
B016P0AYC2
8817085006
8863862192
B0064BV4RW
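
The entries that start with "B0" in the example do not look like ISBN-10 numbers; they are presumably Amazon's own identifiers for editions without an ISBN (an assumption, the original text does not say). A minimal sketch, in Python rather than the program's own C#, of how the result file could be split into valid ISBN-10s and everything else. The function names are hypothetical and not part of the program:

```python
def is_valid_isbn10(code):
    """Return True if code passes the standard ISBN-10 check-digit test."""
    if len(code) != 10:
        return False
    total = 0
    for position, char in enumerate(code):
        if char in "0123456789":
            value = int(char)
        elif char in "Xx" and position == 9:
            value = 10  # 'X' is only allowed as the final check digit
        else:
            return False
        # Weights run from 10 (first character) down to 1 (last character).
        total += (10 - position) * value
    return total % 11 == 0


def split_identifiers(lines):
    """Separate valid ISBN-10s from all other non-empty lines."""
    isbns, others = [], []
    for line in lines:
        code = line.strip()
        if not code:
            continue  # skip blank lines
        if is_valid_isbn10(code):
            isbns.append(code)
        else:
            others.append(code)
    return isbns, others
```

To apply it to a real result file, pass an open file object (which yields one line at a time) to split_identifiers.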

------------------------------------------------
OPTIONAL OUTPUT - MORE DETAIL - OPEN WITH EXCEL:
------------------------------------------------

Selecting the corresponding check box creates additional output files, one per domain, in CSV (comma-separated values) format.

Each file contains the book category, rank, ISBN, price, author, and title, one record per line.

These files are designed to be opened in Microsoft Excel.

With this option checked, the scraping process will take a bit longer and consume more system memory. See System Requirements.

Example when viewing in Excel:

Category                      Rank  ISBN        Price   Author          Title
US Books - Arts & Photography   1   0804189765  $17.98  George W. Bush  Portraits of Courage: A Commander...
US Books - Arts & Photography   2   B01IW9TM5O  $21.95  Trevor Noah     Born a Crime: Stories from a South...
US Books - Arts & Photography   3   B0017SYOTM  $1.99   Benjamin Blech  The Sistine Secrets: Michelangelo's...
US Books - Arts & Photography   4   080241270X  $9.59   Gary Chapman    The 5 Love Languages: The Secret...
US Books - Arts & Photography   5   1607747308  $10.19  Marie Kondo     The Life-Changing Magic of Tidying...
US Books - Arts & Photography   6   B01BEGV5DY  $2.99   Grace Bonney    In the Company of Women: Inspiration...
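
The detailed files can also be processed without Excel. A minimal Python sketch that counts the records per category and lists categories with fewer than 100 books. It assumes the CSV has a header row with the column names shown in the example above (Category, Rank, ISBN, Price, Author, Title); the real files may differ, so treat the names as assumptions:

```python
import csv
import io
from collections import Counter


def books_per_category(csv_file):
    """Count records per category; assumes a header row with a 'Category' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Category"] for row in reader)


def short_categories(counts, expected=100):
    """Return the categories that hold fewer than the expected number of books."""
    return {category: n for category, n in counts.items() if n < expected}


# Small in-memory sample; the header row is an assumption about the file layout.
sample = io.StringIO(
    "Category,Rank,ISBN,Price,Author,Title\n"
    "US Books - Arts & Photography,1,0804189765,$17.98,George W. Bush,Portraits of Courage: A Commander...\n"
    "US Books - Arts & Photography,2,B01IW9TM5O,$21.95,Trevor Noah,Born a Crime: Stories from a South...\n"
)
counts = books_per_category(sample)
print(short_categories(counts))
```

For a real file, open it with open(path, newline="") and pass the file object in place of the sample. The file's text encoding is not documented, so it may need to be specified explicitly.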

-------------------------------------
ADVANCED OPTIONS:
-------------------------------------

A) Connections_Per_Domain

If you know you have a very fast internet connection (20 Mbps or greater), you may optionally increase the number of connections per domain (website). This setting is found in the .config file. The default is 10 and the maximum is 100.

Increasing this number may speed up the process. However, it also consumes more bandwidth, so anything else that uses your internet connection may be slow while the process is running (in that case you may even want to reduce this number).
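
To illustrate what a per-domain connection limit means in general, here is a minimal Python sketch that caps the number of simultaneous downloads with a semaphore. It is not the program's actual implementation (which is written in C#). The class, the function names and the fake download are hypothetical; only the default value of 10 comes from the text above:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CONNECTIONS_PER_DOMAIN = 10  # mirrors the documented default (documented maximum: 100)


class DomainLimiter:
    """Allow at most `limit` tasks to run at once and record the peak concurrency."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def run(self, task, *args):
        with self._slots:  # blocks while `limit` tasks are already running
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return task(*args)
            finally:
                with self._lock:
                    self.active -= 1


def fake_download(page_number):
    """Stand-in for a real HTTP request (hypothetical)."""
    time.sleep(0.01)
    return page_number


def download_all(page_numbers, limit=CONNECTIONS_PER_DOMAIN, workers=32):
    """Run fake downloads on a thread pool, never exceeding `limit` at once."""
    limiter = DomainLimiter(limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda n: limiter.run(fake_download, n), page_numbers))
    return results, limiter.peak
```

A higher limit lets more requests overlap, which is why it can shorten the run while using more bandwidth at any given moment.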

B) Automatic Start and Scheduling

The command line option "autostart" bypasses the Start button and runs the process immediately. This can be used in conjunction with the Windows Task Scheduler if you would like to run this process daily or at a set time.
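
One possible way to set up a daily run is the Windows schtasks command. The Python sketch below only builds the command; the install path, the task name and the start time are hypothetical. It also assumes that "autostart" is passed as a bare word after the executable name, which the text above suggests but does not spell out:

```python
def build_schtasks_command(exe_path, task_name, start_time="02:00"):
    """Build a schtasks call that runs the scraper daily with the autostart option."""
    return [
        "schtasks", "/Create",
        "/SC", "DAILY",                     # schedule type: once a day
        "/TN", task_name,                   # name shown in Task Scheduler
        "/TR", f'"{exe_path}" autostart',   # program to run plus its argument
        "/ST", start_time,                  # start time in 24-hour HH:mm format
    ]


# Hypothetical install location; adjust to wherever the zip file was extracted.
command = build_schtasks_command(
    r"C:\Scraper\AmazonBestSellerBooksScraper.exe",
    "AmazonBestSellerScrape",
)
print(command)

# To register the task on Windows, the list could be passed to subprocess:
#   import subprocess
#   subprocess.run(command, check=True)
```

The same task can be created by hand in the Task Scheduler user interface by adding "autostart" to the action's arguments field.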

C) Error Logging

An error log file is always created in the Results folder after every run.

-------------------------------------
FREQUENTLY ASKED QUESTIONS
-------------------------------------

1. I selected the detailed output. Why do some categories have fewer than 100 books?

A: If you browse to that category on Amazon's websites, you will most likely find that some categories do in fact have fewer than 100 books.

	i. I checked Amazon and the number of books in that category is very different.
	
	A: Check the log.txt file. It is possible there was an error downloading one of the pages.

2. Regarding detailed output, why are the results divided into separate files by domain? Why not just make one big file?

A: MS Excel cannot open a file that has more than 1,048,576 rows. Also, the file would be very large and would take time to open.
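
The arithmetic behind this answer can be sketched in a few lines of Python. The 2,000,000 figure is the approximate total from the instructions above, and the single header row is an assumption about the file layout:

```python
EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit quoted in the answer above


def fits_in_excel(record_count, header_rows=1):
    """Return True if the records plus header rows fit within one worksheet."""
    return record_count + header_rows <= EXCEL_MAX_ROWS


# One combined file with about 2 million records would not fit in a worksheet.
print(fits_in_excel(2_000_000))
```

Whether each per-domain file stays under the limit depends on how many books that domain contributes in a given run.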

-------------------------------------
SUPPORT AND FEEDBACK
-------------------------------------

Please send any feedback or questions to [email protected]


-------------------------------------
CREDITS
-------------------------------------

This software uses Html Agility Pack. 
https://htmlagilitypack.codeplex.com/
