saffron's Introduction

Saffron | News & announcements aggregation framework.

What is Saffron?

Saffron stands for Simple Abstract Framework For the Retrieval Of News.

As the name suggests, Saffron is a framework: an abstraction engine that helps you collect news and announcements from websites in a uniform way.

It supports several data collection methods, such as API endpoints and web scraping, and eases the integration of data sources by abstracting data collection into a few simple and powerful functions.

Architecture

Saffron's architecture is based on a main node that issues scraping instructions and several worker nodes that do the scraping and upload the data to the database.

The nodes communicate through the Grid, which emits events to communicate with other classes. Saffron supports remote nodes by using a socket.io server and clients as middleware to connect them to the main node.
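
The flow described above can be pictured with a small in-process event bus. This is only an illustration: the Grid class below, the event names job.issued and job.finished, and the Job shape are hypothetical and do not reflect Saffron's internal API or its socket.io transport.

```typescript
// Hypothetical shapes; Saffron's real job and grid types are not shown in this README.
type Job = { id: number; source: string };
type Listener = (job: Job) => void;

// A tiny event bus standing in for the Grid.
class Grid {
    private listeners = new Map<string, Listener[]>();

    on(event: string, listener: Listener): void {
        const current = this.listeners.get(event) ?? [];
        this.listeners.set(event, [...current, listener]);
    }

    emit(event: string, job: Job): void {
        for (const listener of this.listeners.get(event) ?? []) listener(job);
    }
}

const grid = new Grid();
const finished: string[] = [];

// Worker node: receives an instruction, "scrapes", then reports back through the grid.
grid.on("job.issued", (job) => {
    // A real worker would parse the source and upload the articles here.
    grid.emit("job.finished", job);
});

// Main node: keeps track of which jobs the workers have completed.
grid.on("job.finished", (job) => {
    finished.push(`${job.id}:${job.source}`);
});

// Main node issues two scraping instructions.
grid.emit("job.issued", { id: 1, source: "example-blog" });
grid.emit("job.issued", { id: 2, source: "example-news" });

console.log(finished);
```

Keeping the main node and the workers decoupled behind events is what allows a worker to live in another process: only the transport behind on and emit has to change.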

Installation

To install the latest release:

npm install @unistudents/saffron

To install a specific version:

npm install @unistudents/saffron@version

Initialization

Once you have installed the library and created your configuration:

import Saffron from "@unistudents/saffron";

const saffron = new Saffron();

// Initialize saffron
saffron.initialize({/* configuration */});

// Start scheduler and workers.
saffron.start();

Configuration

Read the configuration file for more information.

Parsers

To retrieve the desired information from websites we use parsers. There are five available parser types: wordpress-v2, rss, json (or xml), html and dynamic.

WordPress V2

Parser type: wordpress-v2

By default, WordPress-based websites have an open API for news retrieval. We use it to access the articles and categories of the website.

To quickly check whether a website supports the WordPress API, open <website-root-link>/wp-json/wp/v2/posts/ in your browser. If a valid JSON file containing the website's articles is displayed in the browser (or downloaded to your computer), you can safely use the wordpress-v2 parser.
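
The same check can be automated. The sketch below is not part of Saffron: the helper names wordpressPostsUrl, looksLikePostList and supportsWordPressApi are hypothetical, and it assumes a runtime with a global fetch (Node 18+ or a browser).

```typescript
// Build the posts endpoint for a website root, tolerating trailing slashes.
function wordpressPostsUrl(websiteRoot: string): string {
    return `${websiteRoot.replace(/\/+$/, "")}/wp-json/wp/v2/posts/`;
}

// A posts response is expected to be a JSON array of objects.
function looksLikePostList(body: unknown): boolean {
    return Array.isArray(body)
        && body.every((post) => typeof post === "object" && post !== null);
}

// Hypothetical helper: resolves to true when the endpoint answers with a list of posts.
async function supportsWordPressApi(websiteRoot: string): Promise<boolean> {
    try {
        const response = await fetch(wordpressPostsUrl(websiteRoot));
        if (!response.ok) return false;
        return looksLikePostList(await response.json());
    } catch {
        return false; // network error or a body that is not valid JSON
    }
}
```

A call such as supportsWordPressApi("https://example.com") then tells you whether the wordpress-v2 parser is worth trying for that site.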

RSS

Parser type: rss

Many websites provide an RSS feed. RSS allows users and applications to access updates to websites in a standardized, computer-readable format. A website that supports RSS usually displays the RSS feed icon.

JSON / XML

Parser type: json (or xml)

This parser is best suited to pages that load their data through API requests (e.g. lazy loading). The only prerequisite is that the API responses are in a structured JSON or XML format.

HTML

Parser type: html

This parser uses scraping tools like CheerioJS to scrape the website content and retrieve the displayed news. It works best when the HTML of the website is well structured. Websites with unstructured HTML and CSS are very difficult to scrape.

Dynamic

Parser type: dynamic

Unlike the other parsers, this parser uses JavaScript/TypeScript code to parse a website. The user decides all of the scraping logic by extending the class DynamicSourceFile.

Which to choose

We recommend a specific order for using the available parsers.

  • If the desired website is based on WordPress and the WordPress articles API is enabled, then choose the wordpress-v2 parser.
  • If the desired website supports an RSS feed, then choose the rss parser.
  • If the desired website loads data using API requests with structured responses (e.g. lazy loading), then choose the json or xml parser.
  • If the desired website has structured HTML, then use the html parser.
  • If none of the above is possible (bad HTML or custom API), then the dynamic parser is the last choice.
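
The recommended order can be written down as a small decision function. This is a sketch for illustration only: the SiteTraits interface and chooseParser are hypothetical names, not part of Saffron; only the parser type strings come from this document.

```typescript
type ParserType = "wordpress-v2" | "rss" | "json" | "xml" | "html" | "dynamic";

// Hypothetical description of what you found out about a website.
interface SiteTraits {
    wordpressApi?: boolean;          // /wp-json/wp/v2/posts/ returns articles
    rssFeed?: boolean;               // the site exposes an RSS feed
    structuredApi?: "json" | "xml";  // the page loads data from a structured API
    structuredHtml?: boolean;        // the markup is regular enough to scrape
}

// Walk through the parsers in the recommended order and return the first match.
function chooseParser(site: SiteTraits): ParserType {
    if (site.wordpressApi) return "wordpress-v2";
    if (site.rssFeed) return "rss";
    if (site.structuredApi) return site.structuredApi;
    if (site.structuredHtml) return "html";
    return "dynamic";
}

console.log(chooseParser({ rssFeed: true, structuredHtml: true })); // prints "rss"
```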

Article

We have created a universal format for the parsed news, and we named it Article.

Read the article file for more information.

Source files

What is a source file?

A source file is a JSON or JavaScript file that represents a website. These files are written by the user and tell Saffron how to parse a website.

Creating a source file

Read the source file for the common options, or the parser files WordPress V2, RSS, API, HTML or Dynamic for the scrape options.

Middleware

A middleware is a function that gets executed before the articles are passed to the newArticles function. Middleware functions can be useful for logging, article formatting or sorting.

Middleware functions are executed in the order in which they were registered. Each middleware function can be called more than once.
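
To make the ordering rule concrete, here is a minimal sketch of a registry that runs middleware in registration order. It is not Saffron's implementation: MiddlewareChain and its methods are hypothetical and only illustrate the behaviour described above.

```typescript
type Middleware<T> = (value: T) => T;

// Hypothetical registry; each entry keeps the name it was registered under.
class MiddlewareChain<T> {
    private registered: { name: string; fn: Middleware<T> }[] = [];

    use(name: string, fn: Middleware<T>): void {
        this.registered.push({ name, fn });
    }

    // Run every middleware registered under `name`, in registration order,
    // feeding the output of one into the next.
    run(name: string, value: T): T {
        return this.registered
            .filter((entry) => entry.name === name)
            .reduce((current, entry) => entry.fn(current), value);
    }
}

const chain = new MiddlewareChain<string[]>();
chain.use("articles", (titles) => titles.filter((title) => title !== ""));
chain.use("articles", (titles) => [...titles].sort());

const orderedTitles = chain.run("articles", ["b", "", "a"]);
console.log(orderedTitles); // the empty title is dropped first, then "a" sorts before "b"
```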

Register a middleware

saffron.use("name", (...args: any) => {
    //...
});

Format article

This middleware changes the contents of individual articles. It receives every article found by the parsers as a parameter and must return the same object after changing it.

saffron.use("article.format", (article: Article) => {
    // If possible set pubDate with milliseconds.
    let ms = new Date(article.pubDate).getTime();
    if (!isNaN(ms)) article.pubDate = ms;

    // Append source name before title for every article
    article.title = `[${article.getSource(saffron).name}] ${article.title}`;

    // Return the changed article.
    return article;
});

You can also access the source class of the article by calling article.getSource(). Note that any changes made to the source class will also affect the saved source.

Articles

This middleware can be used to edit the articles in bulk. You can sort or filter them as you want. The only requirement is to return an array (empty or not) of articles.

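saffron.use("articles", (articles: Article[]) => {
    // Drop articles without a title.
    const titled = articles.filter(
        (article) => article.title != null && article.title !== ""
    );
    // Sort the remaining articles by title; any comparator can be used here.
    return titled.sort((a, b) => a.title.localeCompare(b.title));
});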

Listeners

Saffron supports listeners for various events. Listeners can be used for logging or creating analytics.

Read the listeners file for more information.

Standalone

Saffron supports immediate parsing using the static function parse.

import {Saffron} from "@unistudents/saffron";

try {
    // await is harmless if parse returns a plain value and required if it returns a Promise.
    const result = await Saffron.parse({
        name: "source-name",
        url: ["Category 1", "https://example.com"],
        type: "html",
        // ...
        scrape: {
            // ...
        },
    }, null); // or pass a config

    console.log("Result:", result);
} catch (e) {
    console.log("Encountered an error during parsing:", e);
}

The result of the parse function is an array that contains one object for each url passed in the source file:

[
    {
        url: "https://example.com",
        aliases: ["Category 1"],
        articles: [/*Article*/, /*Article*/, /*Article*/, /*...*/]
    },
];
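
A result of this shape can be post-processed with ordinary array code. The sketch below is illustrative only: ParsedArticle, ParseResult and titlesByAlias are hypothetical names that mirror the fields shown above, not types exported by Saffron.

```typescript
// Hypothetical types mirroring the result shape shown above.
interface ParsedArticle {
    title: string;
}

interface ParseResult {
    url: string;
    aliases: string[];
    articles: ParsedArticle[];
}

// Group article titles by the first alias of each url, falling back to the url itself.
function titlesByAlias(results: ParseResult[]): Record<string, string[]> {
    const grouped: Record<string, string[]> = {};
    for (const result of results) {
        const key = result.aliases[0] ?? result.url;
        const titles = result.articles.map((article) => article.title);
        grouped[key] = [...(grouped[key] ?? []), ...titles];
    }
    return grouped;
}

const grouped = titlesByAlias([
    { url: "https://example.com", aliases: ["Category 1"], articles: [{ title: "First" }, { title: "Second" }] },
    { url: "https://example.com/other", aliases: [], articles: [{ title: "Third" }] },
]);
console.log(grouped);
```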

saffron's People

Contributors

constarg, cybergl1tch, donfn, jexsrs, nickskla, unistudents-bot

saffron's Issues

Enhance WordPress parser to include article's thumbnail

Solution

Use the endpoint /wp-json/wp/v2/posts?_embed to include the article's thumbnail in the response.

The thumbnail is in: [article]._embedded['wp:featuredmedia']['0'].media_details.sizes.thumbnail

If the article does not have a thumbnail, then [article]._embedded should be missing from the article object.

The expected response should be:

Article {
  id: '...',
  source: {
    ...
  },
  title: '...',
  content: '...',
  attachments: [
    ...
  ],
  categories: [
    ...
  ],
  hash: '...',
  thumbnail: 'https://example.com/image.png'
}
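
The lookup proposed in this issue could be sketched as follows. The function name extractThumbnail and the WordPressPost interface are hypothetical, and the sketch assumes that the thumbnail size entry carries its URL in a source_url field, as WordPress REST API media responses generally do.

```typescript
// Minimal shape of the parts of a WordPress post that matter here.
interface WordPressPost {
    _embedded?: {
        "wp:featuredmedia"?: Array<{
            media_details?: {
                sizes?: { thumbnail?: { source_url?: string } };
            };
        }>;
    };
}

// Returns the thumbnail URL, or undefined when the post has no featured media.
function extractThumbnail(post: WordPressPost): string | undefined {
    return post._embedded?.["wp:featuredmedia"]?.[0]
        ?.media_details?.sizes?.thumbnail?.source_url;
}

const withThumbnail: WordPressPost = {
    _embedded: {
        "wp:featuredmedia": [
            { media_details: { sizes: { thumbnail: { source_url: "https://example.com/image.png" } } } },
        ],
    },
};

console.log(extractThumbnail(withThumbnail)); // prints "https://example.com/image.png"
console.log(extractThumbnail({}));            // prints undefined
```

Optional chaining keeps the lookup safe for posts where _embedded or any intermediate field is missing, so the parser can leave the thumbnail field unset instead of throwing.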

Create a database module

Create an ES6 database module under /modules/database/index.js.
The module should export the following functions:
getArticles(),
pushArticles(),
updateArticles(),
deleteArticles().

The module should then handle the subsequent database driver loading with credentials etc...
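
A possible starting point for the module is sketched below, written in TypeScript to match the other examples. Everything beyond the four function names is an assumption: the StoredArticle shape, the parameters, the return values, and the in-memory Map that stands in for a real database driver. In the actual module each function would be exported.

```typescript
// Hypothetical article shape; only `id` is needed for this sketch.
interface StoredArticle {
    id: string;
    title: string;
}

// Stand-in for a real database driver and its credentials handling.
const store = new Map<string, StoredArticle>();

async function getArticles(): Promise<StoredArticle[]> {
    return Array.from(store.values());
}

async function pushArticles(articles: StoredArticle[]): Promise<void> {
    for (const article of articles) store.set(article.id, article);
}

// Only articles that already exist are updated; resolves to how many were changed.
async function updateArticles(articles: StoredArticle[]): Promise<number> {
    const existing = articles.filter((article) => store.has(article.id));
    for (const article of existing) store.set(article.id, article);
    return existing.length;
}

// Resolves to how many articles were actually removed.
async function deleteArticles(ids: string[]): Promise<number> {
    return ids.filter((id) => store.delete(id)).length;
}
```

Making every function asynchronous from the start means callers do not have to change when the Map is replaced by a driver that talks to a real database.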
