fossasia / loklak_scraper_js

Scrapers for loklak in javascript

License: GNU Lesser General Public License v2.1

JavaScript 98.82% Java 1.18%

loklak_scraper_js's People

Contributors

hemantjadon djmgit vibhcool singhpratyush achint08 kavithaenair lakshaykapoor7198 filoingko akshitgrover samiullahd a-v-lymarenko kapillamba4 psikarwal dshivansh parth910 12-malak

loklak_scraper_js's Issues

Add a Twitter scraper

Add scraping logic to scrapers/twitter.js.

Update TimeAndDate scraper to use base.js

At present, the structure of the TimeAndDate.js scraper is simple. To follow the convention, it should implement BaseScraper.

Add a Youtube scraper

Add a youtube scraper similar to loklak_server's.

Improve flickr scraper

Improve flickr scraper and upgrade to ES6 and extend base scraper.

Base scraper not working

The base scraper does not have an onInit() method, but an object of it is still created, so it throws an error.

Short Description

Implement the scraping logic of the Youtube scraper in ES6 by extending BaseLoklakScraper and using request-promise-native to send GET requests. The current implementation is written in ES5.1 and uses synchronous requests.

Add test framework and add tests for Github profile scraper

Add a test framework and add tests for the Github profile scraper.

JSON format of Twitter Scraper doesn't match with output of loklak search API

The difference in JSON format, in a status element:

timestamp is a string: it should be parsed to an int.
created_at is a string: it should match the value of timestamp (datatype: int).
screen_name is missing: there should be two screen_name fields, one in the user object and another in the parent object.
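
The three differences listed above could be handled in one normalization step. The following is only a sketch: `normalizeStatus` is a hypothetical helper, and the sample status object is made up for illustration rather than taken from the scraper's real output.

```javascript
// Hypothetical helper: returns a copy of a scraped status in the shape
// described above, without mutating the input.
function normalizeStatus(status) {
  // timestamp arrives as a string, so parse it to an integer.
  const timestamp = parseInt(status.timestamp, 10);
  const user = Object.assign({}, status.user);
  // screen_name must exist both on the user object and on the parent object.
  const screenName = status.screen_name || user.screen_name;
  user.screen_name = screenName;
  return Object.assign({}, status, {
    timestamp: timestamp,
    created_at: timestamp, // created_at mirrors the integer timestamp
    screen_name: screenName,
    user: user,
  });
}

// Made-up sample input for illustration.
const scraped = {
  timestamp: '1491471872000',
  created_at: 'Thu Apr 06 09:44:32 +0000 2017',
  text: 'hello',
  user: { screen_name: 'loklak_app', name: 'loklak' },
};
const normalized = normalizeStatus(scraped);
console.log(normalized);
```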

Wrong "library" output name of bundled files

Issue type: Bug report

Short description

Bundled files generated by Webpack have the wrong library name. The library name is used to import the CommonJS module.

Environment

Operating system: Independent of OS
Software source: Github repository

Steps to reproduce

Execute the **webpack** command in the root directory.
Files will be generated in the build directory.
The second word in any file in the build directory is the name of the file itself, e.g. for twitter.js it is **var twitter**.

Expected behaviour

The library name, i.e. the second word in the generated bundled file, should be the same as the name of the scraper class.

Actual behaviour

The library name is the name of the generated bundled file.

Would you like to work on the issue?

Yes

Add build script

Issue type: Feature request

Usecase

Users should be able to create an optimized build using webpack.

Description

The script should produce an optimized build and also show progress while building.

webpack -p --progress

TimeandDate scraper is not working

Issue: Bug

Short Description

The JS scraper does not output the expected data when run in the console.
The current output is: https://pastebin.ubuntu.com/25220615/

This should be fixed so that the correct data is shown in the output.

Add tests for quora scraper.

Adding Flickr Scraper.

Flickr is one of the ten most used social networking sites. Its HTML can be scraped, so a Flickr scraper can be added.

Process date fetched to standard format in TimeAndDate scraper

The TimeAndDate scraper does not convert the fetched date to a standard format in which it can be used directly,
such as Thu Apr 06 15:14:32 IST 2017 or 2017-04-06T09:44:32.000Z.

The scraper needs to process the fetched data.
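
As a rough illustration of the conversion, the sketch below turns a local wall-clock time and its UTC offset into the ISO form shown above. `toIsoUtc` is a hypothetical helper, and it assumes the scraper has already split the fetched text into numeric parts, which the issue does not confirm.

```javascript
// Hypothetical helper: converts a local wall-clock time plus its UTC offset
// (in minutes) into an ISO 8601 string in UTC.
function toIsoUtc(year, month, day, hours, minutes, seconds, offsetMinutes) {
  // Date.UTC uses zero-based months, so subtract 1.
  const asIfUtc = Date.UTC(year, month - 1, day, hours, minutes, seconds);
  // Remove the offset to move from local time to UTC.
  return new Date(asIfUtc - offsetMinutes * 60 * 1000).toISOString();
}

// 15:14:32 on 6 April 2017 in IST (UTC+05:30, i.e. 330 minutes).
const iso = toIsoUtc(2017, 4, 6, 15, 14, 32, 330);
console.log(iso); // 2017-04-06T09:44:32.000Z
```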

Add API for using JS scrapers as loklak harvesting workers

Issue Description

Issue type: Parent issue

As of now, this JS has to be bundled before it can be used in other projects, and even then the functions have to be imported manually.

It would be good to have an API of the following type or similar -

import { loklakHarvester } from 'loklak_scrapers_js';

let myLoklakHarvester = loklakHarvester('http://api.loklak.org', 4)
                            .onHarvestStart((backend, query) => {
                                ...
                            })
                            .onHarvestComplete((backend, query, messages) => {
                                ...
                            })
                            .onHarvestError((backend, query, error) => {
                                ...
                            })
                            .onPushStart((backend, messages) => {
                                ...
                            })
                            .onPushComplete((backend, messages) => {
                                ...
                            })
                            .onPushError((backend, messages, error) => {
                                ...
                            })
                            .onSuggestionFetch((backend, suggestions) => {
                                ...
                            })
                            .onShutDown(() => {
                                ...
                            });

...

myLoklakHarvester.setBackend('http://backend.loklak.org');
myLoklakHarvester.setWorkers(3);

...

myLoklakHarvester.shutDown();

This would facilitate usage of loklak_scrapers_js in many projects and also allow an easy, plug and play interface for any website.
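
The chaining in the proposal could be implemented by having every hook setter return the harvester itself. The sketch below covers only callback registration and dispatch, not harvesting or pushing; `emit`, `getBackend` and `getWorkers` are hypothetical additions for illustration and are not part of the proposed API.

```javascript
// Names of the hooks listed in the proposal.
const HOOKS = ['onHarvestStart', 'onHarvestComplete', 'onHarvestError',
  'onPushStart', 'onPushComplete', 'onPushError', 'onSuggestionFetch', 'onShutDown'];

function loklakHarvester(backend, workers) {
  const callbacks = {};
  const harvester = {
    setBackend(url) { backend = url; return harvester; },
    setWorkers(count) { workers = count; return harvester; },
    getBackend() { return backend; }, // hypothetical, for inspection only
    getWorkers() { return workers; }, // hypothetical, for inspection only
    // Hypothetical internal helper: call a registered hook, if there is one.
    emit(hook, ...args) { if (callbacks[hook]) callbacks[hook](...args); },
    shutDown() { harvester.emit('onShutDown'); },
  };
  // Each hook setter stores the callback and returns the harvester for chaining.
  HOOKS.forEach((hook) => {
    harvester[hook] = (callback) => { callbacks[hook] = callback; return harvester; };
  });
  return harvester;
}

const events = [];
const sketchHarvester = loklakHarvester('http://api.loklak.org', 4)
  .onHarvestStart((backend, query) => events.push('start:' + query))
  .onShutDown(() => events.push('shutdown'));

sketchHarvester.setWorkers(3);
sketchHarvester.emit('onHarvestStart', sketchHarvester.getBackend(), 'fossasia');
sketchHarvester.shutDown();
console.log(events);
```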

Create a BaseScraper Class

Create a BaseLoklakScraper class that all the other scrapers extend, providing an easy and uniform API for scraper implementations.
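
A base class of this kind might look roughly like the sketch below. The method names (`getSearchUrl`, `parse`, `scrape`), the constructor arguments and the `ExampleScraper` subclass are all assumptions made for illustration; the issue does not specify the actual interface.

```javascript
class BaseLoklakScraper {
  constructor(scraperName, baseUrl) {
    this.SCRAPER_NAME = scraperName;
    this.BASE_URL = baseUrl;
  }

  // Subclasses override these two steps.
  getSearchUrl(query) { throw new Error('getSearchUrl() not implemented'); }
  parse(html) { throw new Error('parse() not implemented'); }

  // Shared flow: build the URL, fetch the HTML, parse it.
  // fetchHtml is injected (url -> Promise of HTML) so the sketch runs offline.
  scrape(query, fetchHtml) {
    return fetchHtml(this.getSearchUrl(query)).then((html) => this.parse(html));
  }
}

// Hypothetical subclass showing how little each scraper would need to define.
class ExampleScraper extends BaseLoklakScraper {
  constructor() { super('Example', 'https://example.com'); }
  getSearchUrl(query) { return this.BASE_URL + '/search?q=' + encodeURIComponent(query); }
  parse(html) { return { title: (html.match(/<title>(.*?)<\/title>/) || [])[1] }; }
}

// Stub fetcher that echoes the requested URL inside a <title> tag.
const fakeFetch = (url) => Promise.resolve('<title>' + url + '</title>');
const resultPromise = new ExampleScraper().scrape('loklak', fakeFetch);
resultPromise.then((data) => console.log(data));
```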

Add a Github Scraper

Add a Github Profile Scraper.

Update Quora profile scraper to use base.js

Make the Quora profile scraper use base.js and enhance it to bring it on par with loklak_server's Quora scraper.

Improve Quora Scraper

Short Description

Currently, a valid profile (name of a user) needs to be provided as the query parameter to scrape the details of a profile, and the GET request that obtains the HTML of the user profile is sent when a QuoraLoklakScraper object is instantiated.

Required enhancement

Rather than requiring a valid profile (name of a user), the scraper should take a query as input and run a profile search using https://www.quora.com/search?q=QUERY&type=profile. Profile links are obtained from that URL and each profile is then scraped. Finally, the scraped profile data is aggregated in a list and returned.
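
The search, scrape and aggregate flow described above can be sketched with promises. The three helpers (`fetchHtml`, `extractProfileLinks`, `parseProfile`) are hypothetical and injected here as stubs, and the `/profile/` URLs and page contents are stand-ins, so no real Quora markup is assumed.

```javascript
// helpers.fetchHtml(url) resolves to HTML, helpers.extractProfileLinks(html)
// returns profile URLs, helpers.parseProfile(html) returns one profile object.
function scrapeQuoraProfiles(query, helpers) {
  const searchUrl = 'https://www.quora.com/search?q=' +
    encodeURIComponent(query) + '&type=profile';
  return helpers.fetchHtml(searchUrl)
    .then((searchHtml) => helpers.extractProfileLinks(searchHtml))
    // Scrape every profile concurrently and aggregate the results in a list.
    .then((links) => Promise.all(
      links.map((link) => helpers.fetchHtml(link).then(helpers.parseProfile))
    ));
}

// Stand-in pages and stub helpers so the sketch runs without network access.
const pages = {
  'https://www.quora.com/search?q=fossasia&type=profile': 'A,B',
  'https://www.quora.com/profile/A': 'Profile of A',
  'https://www.quora.com/profile/B': 'Profile of B',
};
const stubs = {
  fetchHtml: (url) => Promise.resolve(pages[url]),
  extractProfileLinks: (html) =>
    html.split(',').map((name) => 'https://www.quora.com/profile/' + name),
  parseProfile: (html) => ({ bio: html }),
};

const profilesPromise = scrapeQuoraProfiles('fossasia', stubs);
profilesPromise.then((profiles) => console.log(profiles));
```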

Improve Reddit scraper

Issue type: Enhancement

Required Enhancement

The current scraper takes the query parameter from a command line argument, and scraping is done when RedditLoklakScraper is instantiated. A new method should be created that takes the query parameter and provides the scraped data through a callback.
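
A minimal sketch of that shape follows, assuming a hypothetical method name `getScrapedData`, an injected `fetchHtml(url, done)` function and a `/user/` profile URL. None of these are confirmed by the issue, and the class below is a stand-in rather than the repository's implementation.

```javascript
// Stand-in class for illustration only.
class RedditLoklakScraper {
  // fetchHtml is a hypothetical injected function: (url, done) => done(error, html).
  // No request is sent at instantiation.
  constructor(fetchHtml) {
    this.fetchHtml = fetchHtml;
  }

  // Takes the query as an argument and hands the result to a callback.
  getScrapedData(query, callback) {
    const url = 'https://www.reddit.com/user/' + encodeURIComponent(query);
    this.fetchHtml(url, (error, html) => {
      if (error) {
        callback(error);
        return;
      }
      callback(null, { query: query, url: url, html: html });
    });
  }
}

// Stub fetcher that calls back immediately with made-up HTML.
let received = null;
const redditScraper = new RedditLoklakScraper((url, done) => done(null, '<html></html>'));
redditScraper.getScrapedData('fossasia', (error, data) => { received = data; });
console.log(received);
```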

Add instagram profile scraper

Add a scraper for instagram user profiles.

Improve wordpress scraper by extending base.js

Extend base.js in the wordpress scraper.

Add a scraper for reddit

Add a js scraper for reddit profiles

Improve Github scraper

Issue type: Enhancement

Short Description

Currently, the Github scraper fetches data from the Github API synchronously.

Required Enhancement

The data should be fetched in an asynchronous way using request-promise-native. Along with that, the code should be converted to ES6 and the scraper should extend BaseLoklakScraper.
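
The asynchronous version might be structured as in the sketch below. The `BaseLoklakScraper` here is a minimal stand-in, not the repository's real base class, and `GithubProfileScraper`, `getProfile` and the returned fields are assumed names. The HTTP client is injectable so the example runs offline; without an argument it falls back to request-promise-native.

```javascript
// Minimal stand-in for the repository's base class (its real API is not shown here).
class BaseLoklakScraper {
  constructor(name, baseUrl) {
    this.SCRAPER_NAME = name;
    this.BASE_URL = baseUrl;
  }
}

class GithubProfileScraper extends BaseLoklakScraper {
  // `request` is injectable; by default request-promise-native is loaded lazily.
  constructor(request) {
    super('Github Profile', 'https://api.github.com');
    this.request = request || require('request-promise-native');
  }

  // Returns a promise instead of blocking until the response arrives.
  getProfile(username) {
    return this.request({
      uri: this.BASE_URL + '/users/' + encodeURIComponent(username),
      headers: { 'User-Agent': 'loklak_scraper_js' }, // the GitHub API requires a User-Agent
      json: true, // parse the response body as JSON
    }).then((profile) => ({ login: profile.login, name: profile.name }));
  }
}

// Offline usage with a stub in place of the real HTTP client.
const stubRequest = (options) => Promise.resolve({ login: 'fossasia', name: 'FOSSASIA' });
const profilePromise = new GithubProfileScraper(stubRequest).getProfile('fossasia');
profilePromise.then((profile) => console.log(profile));
```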

Create Github Issue creation and Pull request template

Short Description

Create templates for issue creation and pull request as in loklak_server and apps.loklak.org.

I am working on this.

Add webpack config file to transpile and bundle

Create a webpack.config.js file that can be used to bundle third-party NodeJS modules, so that the bundled files can be used on Java platforms (loklak_server and loklak_wok_android). Also include Babel plugins in webpack.config.js for transpiling ES6.

Add tests for twitter scraper and scrape last tweet.

Short description

The Twitter scraper currently does not harvest the last tweet. For more details, see #31 (comment).
The issue with last tweet does not seem to exist anymore.

Would you like to work on the issue?

Yes

StackOverflow Scraper

Hello! I am new here and want to contribute.
I was thinking of adding a StackOverflow scraper that takes a question as the query and returns related questions already asked on SO, along with their answers.
I can also add a user details scraper.
I will start working on it as soon as you let me know.

~~Compile using Webpack to generate bundled files and push the bundled files to a new branch build.~~
~~Push the bundled twitter.js file to loklak_wok_android, so that it can be used by LiquidCore for scraping tweets.~~

Add a scraper for Quora

Add a scraper for scraping Quora
