gs.chainsearch

This package provides methods to perform a forward chaining search via Google Scholar. It accomplishes this by scraping publications that cite a cornerstone publication via the “Cited By” search feature. The primary purpose of this package is to enable researchers to produce comprehensive literature reviews.

The general strategy is as follows:

  1. Identify a cornerstone publication.
  2. Scrape the search results for citing publications and save the raw html.
  3. Parse and combine key metadata (publication details) from the raw html.
  4. Process metadata (e.g., remove duplicates).

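Assuming the package's exported helpers, a session following these steps might look like the sketch below. Only storage_update, save_gs_page (with auto_cycle_ip), and app_run are named elsewhere in this README; the parse/process function names and the publication ID are hypothetical placeholders.

```r
library(gs.chainsearch)

# 1. Point package storage at a working directory and identify the
#    cornerstone publication ("ABC123" is a placeholder ID).
storage_update("~/chainsearch-data")
pub_id <- "ABC123"

# 2. Scrape the "Cited By" result pages and save the raw HTML;
#    auto_cycle_ip = TRUE swaps proxies when an IP ban is detected.
save_gs_page(pub_id, auto_cycle_ip = TRUE)

# 3-4. Parse metadata from the saved HTML, then post-process it.
#      These function names are illustrative, not confirmed API.
meta_raw   <- parse_gs_pages(pub_id)
meta_final <- process_meta(meta_raw)  # e.g., remove duplicates
```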
Contributions are more than welcome! See CONTRIBUTING for guidance.

Package Data

By default, this package stores its files within the R user package cache (see tools::R_user_dir). The storage location can be changed to any mounted directory via storage_update. Each cornerstone publication is given a dedicated subdirectory within the storage root. The structure is as follows:

/<storage>/
├── proxy_table.csv
├── proxy_blacklist.csv
└── <publication_id>/
    ├── pages/
    │   ├── page1.html
    │   ├── page2.html
    │   └── ...
    ├── meta_raw.csv
    └── meta_final.csv
  • proxy_table.csv: A table of proxy IPs. The table includes ip, port, and active (either TRUE or FALSE, indicating whether the proxy is the current default).
  • proxy_blacklist.csv: A table of proxy IPs that are marked as “blacklisted”, either manually via blacklist_ip or automatically via save_gs_page(..., auto_cycle_ip = TRUE). The table includes ip and mark_method (either “manual” or “automatic”).
  • <publication_id>/: Dedicated storage for a cornerstone publication.
  • pages/: A subdirectory used as storage for raw HTML files that are scraped.
  • meta_raw.csv: A table of raw metadata extracted from raw HTML pages.
  • meta_final.csv: A table of processed metadata.
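Because all of the artifacts above are plain CSV and HTML files, they can be inspected directly with base R once the storage root is known. The sketch below assumes the default tools::R_user_dir cache location described above and a placeholder publication ID.

```r
# Locate the package storage root (default: the R user cache dir;
# the package name passed here is an assumption).
storage <- tools::R_user_dir("gs.chainsearch", which = "cache")

# Inspect the proxy table and the processed metadata for one
# publication ("ABC123" is a placeholder publication ID).
proxies <- read.csv(file.path(storage, "proxy_table.csv"))
meta    <- read.csv(file.path(storage, "ABC123", "meta_final.csv"))

head(meta)  # publication details extracted from the scraped pages
```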

Proxy Cycling

A proxy cycling procedure is implemented internally to recover gracefully from IP bans issued by Google. At the beginning of a working session, a fresh list of public proxies is fetched (thanks, Geonode!). This list is randomly sampled during the scraping process. Each time an IP ban is detected, the culprit IP is blacklisted and excluded from subsequent scrapes.
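The ban-recovery loop described above can also be driven by hand. A minimal sketch, assuming the blacklist_ip and save_gs_page helpers named in this README (the auto_cycle_ip = FALSE call and the IP address are illustrative):

```r
library(gs.chainsearch)

# Automatic cycling: on a detected ban, the offending proxy is
# blacklisted (mark_method = "automatic") and a new one is sampled.
save_gs_page("ABC123", auto_cycle_ip = TRUE)

# Manual cycling: blacklist a misbehaving proxy yourself
# (mark_method = "manual") before retrying the scrape.
blacklist_ip("203.0.113.7")
save_gs_page("ABC123", auto_cycle_ip = FALSE)
```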

Shiny App

This package includes a Shiny interface, accessible by running gs.chainsearch::app_run().

Session Settings

Via the Session Settings interface, the user is able to select the storage directory, indicate the cornerstone publication, and manage package files.

Proxy Settings

Via the Proxy Settings interface, the user is able to browse and refresh the proxy list, manually set the active proxy, blacklist individual proxies, and view proxy logs.

Scrape

Via the Scrape interface, the user is able to control and monitor the active scraping job.

Results

Via the Results interface, the user is able to view and modify publication metadata extracted from the scraped HTML.
