Scrapy Crawler

Summary

  • An App Store crawler built with Python and Scrapy.
  • A basic Scrapy setup plus extras such as a progress bar.
  • Results are saved as a CSV file, together with each app's icon image.

Setup

brew install pyenv
brew install pipenv

pipenv install
pipenv install --dev

Run

pipenv shell
cd appstore_crawler
export LANG=en_US.utf8  # to avoid ParserError
  • run one category
# set category
CATEGORY=photo-video

# this example crawls URLs that include /genre/ios-{category}
# e.g. https://apps.apple.com/us/genre/ios-photo-video/id6008

scrapy crawl appstore \
    -a category=${CATEGORY} \
    -s JOBDIR=crawls/${CATEGORY}-1  # for a fresh crawl, bump this number or delete the dir
  • or run all categories
scrapy crawl appstore \
    -s JOBDIR=crawls/appstore-1 \
    -o csvdata/appstore.csv
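
Scrapy keeps crawl state in JOBDIR, so reusing an existing directory resumes the earlier crawl instead of starting over. A minimal sketch of picking the next unused directory automatically, rather than bumping the number by hand. The helper name `next_jobdir` is hypothetical and not part of this repository; it assumes job directories follow the `crawls/<prefix>-<number>` naming used above.

```python
import re
from pathlib import Path


def next_jobdir(prefix: str, root: str = "crawls") -> str:
    """Return '<root>/<prefix>-<n>', where n is one past the highest existing run number.

    Hypothetical helper: assumes job directories are named '<prefix>-<number>'.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)$")
    numbers = []
    root_path = Path(root)
    if root_path.is_dir():
        for entry in root_path.iterdir():
            match = pattern.match(entry.name)
            if entry.is_dir() and match:
                numbers.append(int(match.group(1)))
    # With no earlier runs, numbering starts at 1.
    return f"{root}/{prefix}-{max(numbers, default=0) + 1}"


if __name__ == "__main__":
    # Prints a path that can be passed as: -s JOBDIR=<path>
    print(next_jobdir("photo-video"))
```

The printed path can be passed to `-s JOBDIR=...` in either command above.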

Read icon image

from PIL import Image
import numpy as np

icon = Image.open('appstore_crawler/icondata/photo-video/1006639052.png')
img = np.array(icon, 'f')

print(img.dtype)   # float32
print(img.shape)   # (246, 246, 3)
# NOTE: the channel order is RGB (cf. OpenCV, which uses BGR)

Read CSV result

import pandas as pd

df = pd.read_csv('appstore_crawler/csvdata/photo-video.csv')

df.head()

df.columns
Index(['id', 'category', 'name', 'subtitle', 'url', 'date_published',
       'rating_value', 'rating_count', 'rating_ratio', 'price_category',
       'price', 'price_currency', 'has_in_app_purchases', 'author_name',
       'author_url', 'description'],
      dtype='object')

# for example, show the top-rated apps
df.sort_values(
    ['rating_value', 'rating_count'], ascending=False
    )[['rating_value', 'rating_count', 'name']].head()
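
Sorting by `rating_value` first can put apps with only a handful of ratings at the top. A minimal standard-library sketch of the same ranking with a minimum `rating_count` filter. The threshold of 100, the function name `top_rated`, and the sample rows are made up for illustration; only the column names come from the CSV above.

```python
import csv
import io

# Made-up placeholder rows so the sketch runs on its own;
# real data comes from the crawler's CSV output.
SAMPLE_CSV = """name,rating_value,rating_count
App A,5.0,3
App B,4.8,12000
App C,4.8,900
App D,4.9,150
"""


def top_rated(csv_file, min_count=100, limit=5):
    """Return (rating_value, rating_count, name) tuples, best first."""
    rows = []
    for row in csv.DictReader(csv_file):
        try:
            value = float(row["rating_value"])
            count = int(float(row["rating_count"]))
            name = row["name"]
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric ratings
        if count >= min_count:
            rows.append((value, count, name))
    # Highest rating first; ties are broken by the larger rating count.
    rows.sort(key=lambda r: (r[0], r[1]), reverse=True)
    return rows[:limit]


ranking = top_rated(io.StringIO(SAMPLE_CSV))
for value, count, name in ranking:
    print(value, count, name)
```

To run it on real output, pass an open file instead of the sample, for example `open('appstore_crawler/csvdata/photo-video.csv', newline='', encoding='utf-8')`.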

Debug scrapy

# for example Evernote
scrapy shell https://apps.apple.com/us/app/evernote/id281796108
# another example Dropbox
scrapy shell https://apps.apple.com/us/app/dropbox/id327630330
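
When debugging, it can help to pull the numeric app ID out of a page URL, for instance to look up the matching CSV row or icon file. A minimal sketch, assuming app pages follow the `/app/<name>/id<digits>` shape of the URLs above. The function `extract_app_id` is hypothetical, and whether the `id` column and icon file names use exactly this number is an assumption.

```python
import re
from typing import Optional

# Matches ".../app/<name>/id<digits>" and ".../app/id<digits>".
APP_ID_PATTERN = re.compile(r"/app/(?:[^/]+/)?id(\d+)")


def extract_app_id(url: str) -> Optional[str]:
    """Return the numeric app ID from an App Store app URL, or None if absent."""
    match = APP_ID_PATTERN.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_app_id("https://apps.apple.com/us/app/evernote/id281796108"))
```

Genre listing URLs such as `/genre/ios-photo-video/id6008` deliberately return `None`, since that number identifies a category rather than an app.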

(reference) command log of initial setup

export PIPENV_VENV_IN_PROJECT=true
pipenv --python 3.6
pipenv install scrapy tqdm python-dateutil pandas
pipenv run pip list
pipenv graph
pipenv shell
scrapy startproject appstore_crawler
cd appstore_crawler
scrapy genspider appstore example.com
# check for outdated packages
pipenv update --outdated

# update packages
pipenv update
pipenv update --dev
pipenv clean

# re-create the virtual env
pipenv --rm
pipenv install
pipenv install --dev

Reference

Scrapy

App Store
