Git Product home page Git Product logo

similar-posts's Introduction

Similar Posts: A Plugin for Pelican

Build Status PyPI Version License

Similar Posts is a Pelican plugin that adds the similar_posts variable to every published article's context.

The inputs to the similarity measurement algorithm are article tags. Thus, for this plugin to be of any use, at least some of your articles must have a tags element in their metadata.

The similar_posts variable is a list of Article objects, or an empty list if no articles could be found with at least one tag in common with the given article. The list is sorted by descending similarity, then by descending date.

Requirements

This plugin requires Python 3.6 or later.

It depends on Gensim, which has its own dependencies such as NumPy, SciPy, and smart_open.

Installation

This plugin can be installed via:

python -m pip install pelican-similar-posts
As long as you have not explicitly added a `PLUGINS` setting to your Pelican settings file, then the newly-installed plugin should be automatically detected and enabled. Otherwise, you must add `similar_posts` to your existing `PLUGINS` list. For more information, please see the [How to Use Plugins](https://docs.getpelican.com/en/latest/plugins.html#how-to-use-plugins) documentation.

Configuration
-------------

By default, up to five articles are listed. You may customize this value by defining `SIMILAR_POSTS_MAX_COUNT` in your Pelican settings file. For example:

```python
SIMILAR_POSTS_MAX_COUNT = 10

You may also define SIMILAR_POSTS_MIN_SCORE in the settings file. It defaults to .0001. A value of 1.0 would restrict the list of similar posts to articles that have the same set of tags. Any value greater than 0.0 acts as a similarity threshold, but to play with this you'll probably have to find a proper value empirically. When running Pelican with the --debug option, extra messages show the scores of the similar posts.

You can output the similar_posts variable in your article template. This might look like the following:

{% if article.similar_posts %}
    <ul>
    {% for similar in article.similar_posts %}
        <li><a href="{{ SITEURL }}/{{ similar.url }}">{{ similar.title }}</a></li>
    {% endfor %}
    </ul>
{% endif %}

Similarity Score

The measure of similarity is based on the vector space model, which represents text documents as vectors. Each vector component corresponds to one of the terms that exists in the corpus. Thus the corpus may be represented as a matrix whose lines correspond to documents, and whose columns correspond to terms.

In this implementation, terms (tags) are weighted using the tf-idf model, which essentially means that terms that are rare across the whole corpus have greater values than those that are very common. The idea is that a term that is present in a great number of documents does not provide much specificity; it does not help relate a document to another in particular as much as a term that occurs in only a few documents.

Say we have a corpus with five terms. A document's vector might look like:

[.9, .1, .0, .0, .3]

This document has three terms. The first one has a high value, meaning it is relatively rare across the whole corpus. The second one has a low value, meaning it is much more common. The next two terms are absent from this document, while the last term is present and also somewhat common in the corpus, although not as common as the second term. If, for example, another document contains only the first and last terms, it should be considered more relevant to this document than another that would have just the first and second terms, or just the second and last terms.

We measure the similarity of two documents by computing the cosine of the angle between their unit vectors. The resulting "score" is bounded in [0, 1]. Two vectors with the same orientation have a cosine similarity of 1. The lower the value, the greater the angle between the vectors; the more "dissimilar" the documents are.

Comparison with the Related Posts plugin

The Related Posts plugin relates articles that have the greatest number of tags in common, without any tag weighting. If many articles match with the same number of tags, they all get the same score. On most web sites, the list of recommended articles is short, so the most relevant ones will often be left out if all have the same score.

Perhaps to circumvent this problem, the plugin allows one to manually link related posts by slug. However, that creates a content maintenance burden; old posts will not link to newer ones, unless they are manually edited to add them.

Contributing

Contributions are welcome and much appreciated. Every little bit helps. You can contribute by improving the documentation, adding missing features, and fixing bugs. You can also help out by reviewing and commenting on existing issues.

To start contributing to this plugin, review the Contributing to Pelican documentation, beginning with the Contributing Code section.

similar-posts's People

Contributors

botpub avatar davidlesieur avatar justinmayer avatar kdeldycke avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

zhangzq melsigl

similar-posts's Issues

Manual relations

Allow users to enter some relations manually if desired, with the similarity algorithm filling the rest (up to some configured number of relations). The Gensim dependency should be made optional for users who need manual relations only.

Consider compatibility with related-posts configurations, to ease an eventual merge of the two plugins.

(These ideas were initially mentioned in #3)

Release v1.0.0 on PyPi

Now that the plugin is properly packaged in its master branch since #5 , here is the remaining tasks to have a proper v1.0.0 release on PyPi:

  • Rename master branch to main.
  • Sync repository description with the one from pyproject.toml.
  • Setup CI/CD by the way of GitHub workflows.
  • Release v1.0.0 first revision on PyPi.

similar_posts: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated.

Hello!

I'm using the similar_posts plugin for Pelican and while doing html make I'm getting the following message:

FutureWarning: Conversion of the second argument of issubdtype from int to np.signedinteger is deprecated. In future, it will be treated as np.int64 == np.dtype(int).type.
if np.issubdtype(vec.dtype, np.int):

I tried to locate the source, but since I'm a beginner in Python, I could not.

can anybody help me? If the correction is easy, I can do it and send a PR.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.