fossasia / loklak_scraper_js
Scrapers for loklak in JavaScript
License: GNU Lesser General Public License v2.1
Add scraping logic to scrapers/twitter.js.
At present, the TimeAndDate.js scraper's structure is simple. To follow the convention, it should implement BaseScraper.
A scraper for Pinterest can be added.
Add a youtube scraper similar to loklak_server's.
Improve the Flickr scraper: upgrade it to ES6 and have it extend the base scraper.
The base scraper doesn't have an onInit() method, but an object of it is created, so an error is thrown.
Issue Type: Enhancement
Implement the scraping logic of the YouTube scraper in ES6 by extending BaseLoklakScraper and using request-promise-native for sending GET requests. The current implementation is written in ES5.1 and uses synchronous requests.
Add a test framework and add tests for the GitHub profile scraper.
The differences in JSON format, in a status element:
timestamp is of datatype string: parse it to int.
created_at is a string: it should match the value of timestamp (datatype: int).
screen_name is missing: there should be two screen_name fields, one in the user object and another in the parent object.
Bundled files generated by Webpack have the wrong library name. The library name is used to import the CommonJS module.
The library name, i.e. the 2nd word in the generated bundled file, should be the same as the name of the scraper class.
The library name is the name of the generated bundled file.
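For the status-element differences listed above (timestamp, created_at, screen_name), a normalisation step could look roughly like the sketch below. The helper name `normalizeStatus`, the sample values, and the rule that whichever `screen_name` is present gets copied to both places are assumptions for illustration, not existing code in this repository.

```javascript
// Hypothetical helper: brings one status element into the expected format.
function normalizeStatus(status) {
  // `timestamp` arrives as a string; parse it to an integer.
  const timestamp = parseInt(status.timestamp, 10);
  const user = Object.assign({}, status.user);
  // Assumption: whichever screen_name is present is copied to both places.
  const screenName = status.screen_name || user.screen_name;
  user.screen_name = screenName;
  return Object.assign({}, status, {
    timestamp: timestamp,
    created_at: timestamp, // should match the value of timestamp
    screen_name: screenName,
    user: user,
  });
}

// Illustrative input only; the values are made up.
const fixedStatus = normalizeStatus({
  timestamp: '1491471872000',
  created_at: 'Thu Apr 06 15:14:32 IST 2017',
  user: { screen_name: 'loklak_app' },
});
console.log(fixedStatus);
```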
Yes
Users should be able to create an optimized build using webpack.
The script should produce an optimized build and also show progress while building:
webpack -p --progress
Issue: Bug
The js scraper is not outputting the expected data when run in the console.
The output now seen is: https://pastebin.ubuntu.com/25220615/
This should be fixed so that the correct data is shown in the output.
Add tests for quora scraper.
Flickr is one of the ten most used social networking sites. Flickr HTML can be scraped, so a Flickr scraper can be added.
One issue in the TimeAndDate scraper is that it doesn't convert the fetched date to a standard format in which it can be used directly.
Something like: Thu Apr 06 15:14:32 IST 2017 or 2017-04-06T09:44:32.000Z
This scraper requires processing of the fetched data.
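One possible way to do that processing is sketched below. It assumes the scraper can already extract the individual date parts and knows the location's UTC offset; the helper name `toIsoUtc` and its parameters are hypothetical.

```javascript
// Hypothetical helper: converts local wall-clock parts plus a UTC offset
// (in minutes) into an ISO 8601 string in UTC.
function toIsoUtc(parts, utcOffsetMinutes) {
  // Treat the local parts as if they were UTC, then subtract the offset.
  const localAsUtcMs = Date.UTC(
    parts.year, parts.month - 1, parts.day,
    parts.hour, parts.minute, parts.second
  );
  return new Date(localAsUtcMs - utcOffsetMinutes * 60 * 1000).toISOString();
}

// IST is UTC+05:30, i.e. 330 minutes ahead of UTC.
const isoDate = toIsoUtc(
  { year: 2017, month: 4, day: 6, hour: 15, minute: 14, second: 32 },
  330
);
console.log(isoDate); // 2017-04-06T09:44:32.000Z
```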
Issue type: Parent issue
As of now, this JS has to be bundled so that it can be used in other projects and even then, the functions have to be manually imported.
It would be good to have an API of the following type or similar -
import { loklakHarvester } from 'loklak_scrapers_js';
let myLoklakHarvester = loklakHarvester('http://api.loklak.org', 4)
  .onHarvestStart((backend, query) => {
    ...
  })
  .onHarvestComplete((backend, query, messages) => {
    ...
  })
  .onHarvestError((backend, query, error) => {
    ...
  })
  .onPushStart((backend, messages) => {
    ...
  })
  .onPushComplete((backend, messages) => {
    ...
  })
  .onPushError((backend, messages, error) => {
    ...
  })
  .onSuggestionFetch((backend, suggestions) => {
    ...
  })
  .onShutDown(() => {
    ...
  });
...
myLoklakHarvester.setBackend('http://backend.loklak.org');
myLoklakHarvester.setWorkers(3);
...
myLoklakHarvester.shutDown();
This would facilitate usage of loklak_scrapers_js in many projects and also allow an easy, plug-and-play interface for any website.
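A rough sketch of how the chainable part of such an API might be wired up follows. It only covers callback registration and the setters. The actual harvesting and pushing are left out, `getState()` is a hypothetical addition for inspection, and none of this exists in the repository yet.

```javascript
// Sketch only: registers callbacks and returns the same object for chaining.
function loklakHarvester(backend, workers) {
  const handlers = {};
  const state = { backend: backend, workers: workers };

  // Calls every callback registered for the given event.
  function emit(event, ...args) {
    (handlers[event] || []).forEach((fn) => fn(...args));
  }

  const harvester = {
    setBackend(url) { state.backend = url; return harvester; },
    setWorkers(count) { state.workers = count; return harvester; },
    getState() { return Object.assign({}, state); }, // hypothetical helper
    shutDown() { emit('ShutDown'); },
  };

  // Generate onHarvestStart, onHarvestComplete, ... as chainable methods.
  const events = [
    'HarvestStart', 'HarvestComplete', 'HarvestError',
    'PushStart', 'PushComplete', 'PushError',
    'SuggestionFetch', 'ShutDown',
  ];
  events.forEach((event) => {
    harvester['on' + event] = (fn) => {
      (handlers[event] = handlers[event] || []).push(fn);
      return harvester;
    };
  });
  return harvester;
}

let harvesterStopped = false;
const sketchHarvester = loklakHarvester('http://api.loklak.org', 4)
  .onShutDown(() => { harvesterStopped = true; });
sketchHarvester.setBackend('http://backend.loklak.org');
sketchHarvester.setWorkers(3);
sketchHarvester.shutDown();
console.log(harvesterStopped, sketchHarvester.getState());
```

Returning the same object from every `on...` method is what makes the chained style in the proposal possible.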
Create a BaseLoklakScraper class that is extended by all the other scrapers and provides an easy, uniform API for scraper implementations.
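As an illustration of the idea, a base class could fix the overall flow and leave URL building and parsing to subclasses. The method names (`buildUrl`, `parse`, `getScrapedData`), the injected `httpGet` function, and `ExampleScraper` are all assumptions made for this sketch and may differ from what the repository ends up with.

```javascript
// Hypothetical shape for the base class; all method names are assumptions.
class BaseLoklakScraper {
  constructor(name, httpGet) {
    this.name = name;
    this.httpGet = httpGet; // function(url) returning a Promise of the body
  }

  // Subclasses decide which URL to fetch for a query...
  buildUrl(query) {
    throw new Error('buildUrl() not implemented');
  }

  // ...and how to turn the response body into data.
  parse(body) {
    throw new Error('parse() not implemented');
  }

  // Shared flow: fetch, parse, wrap in a uniform result object.
  getScrapedData(query) {
    return this.httpGet(this.buildUrl(query)).then((body) => ({
      scraper: this.name,
      query: query,
      data: this.parse(body),
    }));
  }
}

// Toy subclass against a placeholder URL, with a fake HTTP function.
class ExampleScraper extends BaseLoklakScraper {
  constructor(httpGet) {
    super('example', httpGet);
  }
  buildUrl(query) {
    return 'https://example.com/search?q=' + encodeURIComponent(query);
  }
  parse(body) {
    return { length: body.length };
  }
}

const fakeHttpGet = (url) => Promise.resolve('<html>' + url + '</html>');
const examplePromise = new ExampleScraper(fakeHttpGet).getScrapedData('fossasia');
examplePromise.then((result) => console.log(result));
```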
Add a Github Profile Scraper.
Make the Quora profile scraper use base.js and enhance it to bring it on par with loklak_server's Quora scraper.
Currently, a valid profile (name of user) needs to be provided as a query parameter to scrape the details of a profile, and the GET request to obtain the HTML of the user profile is sent at the instantiation of a QuoraLoklakScraper object.
Rather than requiring a valid profile (name of user), the scraper should take a query as input and then run a profile search using https://www.quora.com/search?q=QUERY&type=profile. Links to profiles are obtained from that URL and then each profile is scraped. Finally, the scraped profile data is aggregated in a list and returned.
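The search-then-scrape flow described above might be sketched as follows. Only the search URL comes from this issue. `findProfileLinks` and `scrapeProfile` are hypothetical async functions passed in by the caller, and the fakes below stand in for real network calls.

```javascript
// Builds the profile search URL given in the issue for a free-text query.
function buildProfileSearchUrl(query) {
  return 'https://www.quora.com/search?q=' + encodeURIComponent(query) + '&type=profile';
}

// Finds profile links for the query, scrapes each one, and returns a list.
// `findProfileLinks(url)` resolves to an array of profile URLs;
// `scrapeProfile(link)` resolves to the data for one profile.
function scrapeProfiles(query, findProfileLinks, scrapeProfile) {
  return findProfileLinks(buildProfileSearchUrl(query))
    .then((links) => Promise.all(links.map((link) => scrapeProfile(link))));
}

// Fake implementations so the flow can be tried without network access.
const fakeFindLinks = (url) => Promise.resolve([url + '#a', url + '#b']);
const fakeScrapeProfile = (link) => Promise.resolve({ profile: link });

const profilesPromise = scrapeProfiles('open source', fakeFindLinks, fakeScrapeProfile);
profilesPromise.then((profiles) => console.log(profiles));
```

`Promise.all` keeps the results in the same order as the links, so the returned list lines up with the search results.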
Issue type: Enhancement
The current scraper takes the query parameter from a command line argument, and scraping is done while RedditLoklakScraper is instantiated. A new method should be created that takes the query parameter and provides the scraped data through a callback.
Add a scraper for instagram user profiles.
Extend base.js in the WordPress scraper.
Add a js scraper for reddit profiles
Issue type: Enhancement
Currently, the GitHub scraper fetches data using the GitHub API in a synchronous way. The data should instead be fetched asynchronously using request-promise-native. Along with that, the code should be converted to ES6 and the scraper should extend BaseLoklakScraper.
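A minimal sketch of the asynchronous version is shown below. Which endpoints the scraper actually needs is an assumption here (the user and repos endpoints of the GitHub API are used as examples), and `fetchGithubProfile` is a hypothetical name. The HTTP function is passed in so the flow can be tried without the network. With request-promise-native it could be created as shown in the comment.

```javascript
// Fetches user details and repositories concurrently instead of one by one.
// `get(uri)` must return a Promise that resolves to the parsed JSON body.
function fetchGithubProfile(username, get) {
  const base = 'https://api.github.com/users/' + encodeURIComponent(username);
  return Promise.all([get(base), get(base + '/repos')])
    .then(([user, repos]) => ({ user: user, repos: repos }));
}

// With request-promise-native, `get` could be (GitHub requires a User-Agent):
// const rp = require('request-promise-native');
// const get = (uri) => rp({
//   uri: uri,
//   json: true,
//   headers: { 'User-Agent': 'loklak_scraper_js' },
// });

// Fake `get` that records requested URLs and returns canned data.
const requestedUris = [];
const fakeGithubGet = (uri) => {
  requestedUris.push(uri);
  return Promise.resolve(uri.endsWith('/repos') ? [{ name: 'demo' }] : { login: 'fossasia' });
};

const githubPromise = fetchGithubProfile('fossasia', fakeGithubGet);
githubPromise.then((profile) => console.log(profile));
```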
Create templates for issue creation and pull request as in loklak_server and apps.loklak.org.
I am working on this.
Create a webpack.config.js file that can be used to bundle 3rd party NodeJS modules, so that the bundled files can be used on Java platforms (loklak_server and loklak_wok_android). Also, include Babel plugins in webpack.config.js for transpiling ES6.
The Twitter scraper currently does not harvest the last tweet; for more details, see #31 (comment)
The issue with last tweet does not seem to exist anymore.
Yes
Hello! I am new here and want to contribute.
I was thinking of adding a Stack Overflow scraper, to which we could pass a question as the query; the result would be related questions that have already been asked on SO, along with their answers.
Plus, I can also add a user details scraper.
I will start working on it as soon as you let me know.
Add a location-wise date and time scraper and a timeanddate scraper.
There are some public profiles on LinkedIn that can be scraped. Create a scraper that can scrape LinkedIn profiles.
1 : Extending base.js
2 : Proper final data format
Add a js scraper for wordpress blogs
As of now, TravisCI should do the following:
build
.twitter.js
file to loklak_wok_android, so that it can be used by LiquidCore for scraping tweets.
Add a scraper for scraping Quora.