Scrapers for ThePornDB

These are scrapers used by ThePornDB, written in Python using Scrapy.

Writing a scraper

Most of the hard work is abstracted away from each scraper, making writing a new scraper quite simple.

We'll use AmateurBoxxx as an example.

import scrapy
from tpdb.BaseSceneScraper import BaseSceneScraper

class AmateurBoxxxSpider(BaseSceneScraper):
    name = 'AmateurBoxxx'
    network = 'Amateur Boxxx'

First we import BaseSceneScraper, which provides all of our helper methods; you can see it in our scraper repository.

name is the name of this scraper, used when running it. This scraper is called AmateurBoxxx, so we would run it with scrapy crawl AmateurBoxxx.

network is the network that this site or group of sites belongs to. We use it to group sites, scenes and performers together. AmateurBoxxx is a one-off site as far as we know, so we'll just leave it as part of its own network.

    start_urls = [
        'https://tour.amateurboxxx.com'
    ]

start_urls is a required property that lists all the URLs the scraper will start scraping from. For a general-purpose scraper, the sites that belong to a single network may all share the same XPath, so we can just list all of their URLs here and the scraper will loop over them.

    selector_map = {
        'title': 'span.update_title::text',
        'description': 'span.latest_update_description::text',
        'performers': 'span.tour_update_models a::text',
        'date': 'span.availdate::text',
        'image': 'img.large_update_thumb::attr(src)',
        'tags': '',
        'trailer': '',
        'external_id': 'updates/(.+).html',
        'pagination': '/categories/updates_%s_d.html'
    }

The selector_map property is the main part of the scraper you'll be working with. title, description, performers, date, image, tags and trailer are the selectors for extracting the actual data from the page. They can be either CSS selectors or XPath selectors. For this specific scraper I have opted to use CSS selectors. If a selector starts with /, it is treated as XPath; otherwise it is treated as CSS.

All selectors are required except tags and trailer, which can be left as empty strings as in the example above. You can read more about writing selectors in the Scrapy documentation.
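
The XPath-or-CSS rule described above can be sketched in a few lines. This is only an illustration of the rule as stated in the text, not the base scraper's actual code; the function name selector_kind and the XPath example are assumptions made for the sketch.

```python
def selector_kind(selector: str) -> str:
    """Classify a selector_map entry using the rule described in the text."""
    # A leading "/" marks an XPath expression; anything else is treated as CSS.
    return "xpath" if selector.startswith("/") else "css"


examples = [
    "span.update_title::text",               # CSS, from the selector_map above
    "//span[@class='update_title']/text()",  # hypothetical XPath equivalent
]

for selector in examples:
    print(selector_kind(selector), selector)
```

Under this rule, an XPath expression that begins with ( or ./ would be classified as CSS, so it is probably safest to start XPath selectors with / or //.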

The external_id selector is a regular expression for extracting the ID of the scene from the URL. It is also a required field and must include a single capture group for the ID. If the ID is not in the URL, you can override what's returned by implementing the get_id function.

    def get_id(self, response):
        # Assumes a slugify helper has been imported elsewhere in the module
        return slugify(self.get_title(response))
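
For the common case where the ID is in the URL, the following sketch shows how the external_id pattern from the selector_map might be applied. It assumes the base scraper runs the pattern with re.search against the scene URL, which the text does not confirm; extract_id and the scene URL are hypothetical.

```python
import re
from typing import Optional

EXTERNAL_ID_PATTERN = r"updates/(.+).html"  # from the selector_map above


def extract_id(url: str) -> Optional[str]:
    """Return the first capture group of the pattern, or None if it does not match."""
    match = re.search(EXTERNAL_ID_PATTERN, url)
    return match.group(1) if match else None


# Hypothetical scene URL, used only for illustration
scene_url = "https://tour.amateurboxxx.com/updates/Example-Scene.html"
print(extract_id(scene_url))
```

Note that the dot before html in the pattern is unescaped, so it matches any character; writing it as \. would make the pattern stricter.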

Finally, pagination is a string that we format with the page number, so the scraper knows how to loop over the pages that list scenes.
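
As a rough illustration, the pagination string can be combined with a start URL using ordinary %-formatting. How the base scraper actually builds these URLs is not shown here, so page_url is a hypothetical helper written for this sketch.

```python
PAGINATION = "/categories/updates_%s_d.html"  # from the selector_map above
BASE_URL = "https://tour.amateurboxxx.com"    # from start_urls above


def page_url(base_url: str, pagination: str, page: int) -> str:
    """Fill the page number into the pagination string and prepend the base URL."""
    return base_url.rstrip("/") + pagination % page


for page in range(1, 4):
    print(page_url(BASE_URL, PAGINATION, page))
```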

Additionally, if you're scraping a network site that has multiple subsites, you can return the site for that specific scene by implementing the get_site function.
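
One possible way to decide the site is to map the host name of the scene URL to a site name. The sketch below is an assumption about how a get_site override might do its work; the hosts, site names and the site_from_url helper are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical subsites of a hypothetical network; replace with real hosts.
SITES_BY_HOST = {
    "one.example.com": "Example One",
    "two.example.com": "Example Two",
}


def site_from_url(url: str, default: str) -> str:
    """Map a scene URL to a site name, falling back to a default."""
    host = urlparse(url).netloc.lower()
    return SITES_BY_HOST.get(host, default)


print(site_from_url("https://one.example.com/scenes/42", "Example Network"))
print(site_from_url("https://other.example.com/scenes/7", "Example Network"))
```

Inside a scraper, a get_site override could presumably return site_from_url(response.url, self.network); check the base scraper for the exact signature it expects.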

If you look at the base scraper, every data point has a corresponding function, e.g. get_title, which just takes the selector for that field from selector_map and returns the extracted data. If a page has data that cannot be extracted with a selector alone, you can override the corresponding function, such as get_title. It receives the Scrapy response object, which includes the raw HTML, so you can extract the data however you want. You can see how it is done in the Scrapy repository.

For example, this is how we get the images from PornPros:

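    def get_image(self, response):
        # Prefer the Twitter card image when the page provides one
        if response.xpath('//meta[@name="twitter:image"]').get() is not None:
            return response.xpath('//meta[@name="twitter:image"]/@content').get()

        # Otherwise fall back to the poster frame of the video player
        if response.xpath('//video').get() is not None:
            if response.xpath('//video/@poster').get() is not None:
                return response.xpath('//video/@poster').get()

        # Last resort: the placeholder image shown when there is no player.
        # .get() is needed here; the bare SelectorList is never None.
        if response.xpath('//img[@id="no-player-image"]').get() is not None:
            return response.xpath('//img[@id="no-player-image"]/@src').get()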

If the data you need is missing from the scene page but is available on the pagination page, you can see how to use that data below.

    def get_scenes(self, response):
        scenes = response.css('.updateItem h4 a::attr(href)').getall()
        for scene in scenes:
            yield scrapy.Request(url=self.format_link(response, scene), callback=self.parse_scene)

Finally, we have the get_scenes function. This is a required function that yields the links to scenes from the paginated pages.

Scrapy will loop over each pagination page, e.g. https://amateurboxxx.com/categories/updates_1_d.html, and pass the response to get_scenes, where you must loop over the list of scenes and pass each one back to Scrapy using yield.

format_link is a handy helper that formats the URL found on the page to make sure it is an absolute URL.

The request must be yielded as a scrapy.Request, and must pass self.parse_scene as the callback.
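
To illustrate what a helper like format_link has to do, here is a minimal stand-in built on the standard library's urljoin. It is not the helper's actual implementation, and unlike the real helper, which is called with the response, this sketch takes the page URL as a plain string. The scene paths are hypothetical.

```python
from urllib.parse import urljoin


def format_link(page_url: str, link: str) -> str:
    """Hypothetical stand-in: resolve a link found on a page against that page's URL."""
    return urljoin(page_url, link)


listing_page = "https://tour.amateurboxxx.com/categories/updates_1_d.html"

# A root-relative link is resolved against the host of the listing page
print(format_link(listing_page, "/updates/Example-Scene.html"))

# A link that is already absolute is returned unchanged
print(format_link(listing_page, "https://tour.amateurboxxx.com/updates/Other-Scene.html"))
```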

If there is data on these paginated pages that you want to pass into the scraper, you can pass the meta object through. Here's an example from PornPros:

    def get_scenes(self, response):
        # Requires `import dateparser` at the top of the module
        scenes = response.xpath("//div[contains(@class, 'video-releases-list')]//div[@data-video-id]")
        for scene in scenes:
            link = scene.css('a::attr(href)').get()
            meta = {}

            # Carry the date from the listing page through to parse_scene
            date = scene.css('div::attr(data-date)').get()
            if date is not None:
                meta['date'] = dateparser.parse(date).isoformat()

            # Carry the video ID from the listing page through as well
            video_id = scene.css('div::attr(data-video-id)').get()
            if video_id is not None:
                meta['id'] = video_id

            yield scrapy.Request(url=self.format_link(response, link), callback=self.parse_scene, meta=meta)
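
On the receiving side, values passed this way are available on response.meta in the callback. The sketch below shows one plausible way to prefer a meta value over the page selector; whether the base scraper's parse_scene does exactly this is an assumption, and scene_date plus the SimpleNamespace stand-ins for a Scrapy response are hypothetical.

```python
from types import SimpleNamespace


def scene_date(response, get_date):
    """Prefer a date passed via meta; otherwise fall back to the page selector."""
    return response.meta.get("date") or get_date(response)


def fallback_get_date(response):
    # Stand-in for a get_date method that reads the date from the scene page
    return "2020-12-25T00:00:00"


# SimpleNamespace objects stand in for Scrapy responses in this sketch
with_meta = SimpleNamespace(meta={"date": "2021-01-31T00:00:00"})
without_meta = SimpleNamespace(meta={})

print(scene_date(with_meta, fallback_get_date))
print(scene_date(without_meta, fallback_get_date))
```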

If the site you're scraping doesn't have a standard pagination listing page, or is an API, all functions can be overridden to work for your specific case. Check out our MetArt or ProjectOneService scrapers.
