
django-feed-reader's Introduction

Django Feed Reader

This is a simple Django module that lets you subscribe to RSS (and other) feeds.

This app has no UI; it just reads and stores the feeds for you to use as you see fit.

This app builds on top of the FeedParser library to provide feed management, storage, scheduling and so on.

Features

  • Consumes RSS, Atom and JSONFeed feeds.
  • Parses feeds liberally to try to accommodate simple errors.
  • Will attempt to bypass Cloudflare protection of feeds
  • Supports enclosure (podcast) discovery
  • Automatic feed scheduling based on frequency of updates

Installation

django-feed-reader is written in Python 3 and supports Django 2.2+

  • pip install django-feed-reader
  • Add feeds to your INSTALLED_APPS
  • Set up some values in settings.py so that your feed reader politely announces itself to servers
    • Set FEEDS_USER_AGENT to the name and (optionally) version of your service, e.g. "ExampleFeeder/1.2"
    • Set FEEDS_SERVER to the preferred web address of your service so that feed hosts can locate you if required, e.g. https://example.com
  • Set up a mechanism to periodically refresh the feeds (see below)
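
As a rough illustration, the required settings might look like this in settings.py. The app label feeds and the two setting names come from the list above; the values are the placeholder examples from that list, and the other INSTALLED_APPS entries are assumptions standing in for whatever your project already uses.

```python
# settings.py (fragment) -- values are placeholders, not recommendations

INSTALLED_APPS = [
    "django.contrib.contenttypes",  # assumed: your project's existing apps go here
    "django.contrib.auth",
    "feeds",  # django-feed-reader
]

# Sent to feed hosts so they can identify your service
FEEDS_USER_AGENT = "ExampleFeeder/1.2"

# Where feed hosts can find you if they need to get in touch
FEEDS_SERVER = "https://example.com"
```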

Optional Settings

  • FEEDS_VERIFY_HTTPS (Default True)
    • Older versions of this library did not verify https connections when fetching feeds. Set this to False to revert to the old behaviour.
  • KEEP_OLD_ENCLOSURES (Default False)
    • Some feeds (particularly podcasts with Dynamic Ad Insertion) will change their enclosure URLs between reads. By default, old enclosures are deleted and replaced with new ones. Set this to True to retain old enclosures; they will have their is_current flag set to False.
  • SAVE_JSON (Default False)
    • If set, Sources and Posts will store a JSON representation of all the data retrieved from the feed so that uncommon or custom attributes can be retrieved. Caution: this will dramatically increase the amount of space used in your database.
  • DRIPFEED_KEY (Default None)
    • If set to a valid Dripfeed API Key, then feeds that are blocked by Cloudflare will be automatically polled via Dripfeed instead.
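
For reference, the optional settings above could be written out as follows. Each value shown is the default stated in the list; reading the Dripfeed key from an environment variable named DRIPFEED_KEY is an assumption made for this sketch, not something the app requires.

```python
import os

# settings.py (fragment) -- each value below is the documented default
FEEDS_VERIFY_HTTPS = True    # set to False only to restore the old unverified behaviour
KEEP_OLD_ENCLOSURES = False  # True keeps replaced enclosures with is_current=False
SAVE_JSON = False            # True stores the raw feed data as JSON; expect a larger database

# Assumption: the key is supplied through an environment variable.
# os.environ.get returns None when the variable is unset, matching the default.
DRIPFEED_KEY = os.environ.get("DRIPFEED_KEY")
```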

Basic Models

A feed is represented by a Source object which has (among other things) a feed_url.

To start reading a feed, simply create a new Source with the desired feed_url.

Source objects have Post children which contain the content.

A Post may have one or more Enclosures, which is what podcasts use to deliver their audio. The app does not download enclosures; if you want to do that, you will need to do it in your project using the URL provided.

Refreshing feeds

To conserve resources with large feed lists, the module will adjust how often it polls feeds based on how often they are updated. The fastest it will poll a feed is every hour. The slowest it will poll is every 24 hours.

Sources that don't get updated are polled progressively more slowly until the 24 hour limit is reached. When a feed changes, its polling frequency increases.
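
To make the back-off behaviour concrete, here is a minimal sketch of one way such a schedule could work. It is not the module's actual algorithm: the function name next_interval and the halving/doubling factors are assumptions chosen for illustration. Only the one-hour and 24-hour bounds come from the text above.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(hours=1)   # fastest polling rate described above
MAX_INTERVAL = timedelta(hours=24)  # slowest polling rate described above


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the next polling interval for a feed (hypothetical helper).

    Assumption: halve the interval when the feed changed and double it when
    it did not. The real module may use different factors.
    """
    proposed = current / 2 if changed else current * 2
    # Clamp the result to the documented bounds
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

Under these assumptions, a quiet feed polled every 4 hours would drift to 8, 16 and then 24 hours, and a single change would bring it back to 12.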

You will need to decide how and when to run the poller. When the poller runs, it checks all feeds that are currently due. The ideal frequency to run it is every 5 to 10 minutes.

Polling with cron

Set up a job that calls python manage.py refreshfeeds on your desired schedule.

Be careful to ensure you're running out of the correct directory and with the correct python environment.

Polling with celery

Create a new celery task and schedule in your app (see the celery documentation for details). Your tasks.py should look something like this:

from celery import shared_task
from feeds.utils import update_feeds


@shared_task
def get_those_feeds():
    # The number is the maximum number of feeds to poll in one go
    update_feeds(30)

Tracking read/unread state of feeds

There are two ways to track the read/unread state of feeds depending on your needs.

Single User Installations

If your usage is just for a single user, then there are helper methods on a Source to track your read state.

All posts come in unread. You can get the current number of unread posts from Source.unread_count.

To get a ResultSet of all the unread posts from a feed, call Source.get_unread_posts.

To mark all posts on a feed as read, call Source.mark_read.

To get all of the posts in a feed regardless of read status, a page at a time, call Source.get_paginated_posts, which returns a tuple of (Posts, Paginator).

Multi-User Installations

To allow multiple users to follow the same feed with individual read/unread status, create a new Subscription for that Source and User.

Subscription has the same helper methods for retrieving posts and marking read as Source.

You can also arrange feeds into a folder-like hierarchy using Subscriptions. Every Subscription has an optional parent. Subscriptions with a None parent are considered to be at the root level. By convention, Subscriptions that are acting as parent folders should have a None source.

Subscriptions have a name field which by convention should be a display name if it is a folder, or the name of the Source it is tracking. However, this can be set to any value if you want to give a personally meaningful name to a feed whose name is cryptic.

There are two helper methods in the utils module to help manage subscriptions as folders. get_subscription_list_for_user will return all Subscriptions for a User where the parent is None. get_unread_subscription_list_for_user will do the same but only returns Subscriptions that are unread or that have unread children if they are a folder.
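
The folder convention can be pictured with a small stand-alone model. This is a sketch in plain Python, not the app's Django models: the Sub dataclass, its unread and children fields, and the two helper functions are hypothetical stand-ins that mirror the rules described above (a None parent means root level, a None source means folder).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids comparing parent/child cycles
class Sub:
    """Hypothetical stand-in for a Subscription row."""
    name: str
    source: Optional[str] = None    # None => this entry is acting as a folder
    parent: Optional["Sub"] = None  # None => this entry sits at the root level
    unread: int = 0                 # assumed per-subscription unread counter
    children: List["Sub"] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Register with the parent folder so the tree can be walked top-down
        if self.parent is not None:
            self.parent.children.append(self)

    def has_unread(self) -> bool:
        # A folder counts as unread if anything beneath it is unread
        return self.unread > 0 or any(c.has_unread() for c in self.children)


def root_list(subs: List[Sub]) -> List[Sub]:
    """Keep only root-level entries (the 'parent is None' rule)."""
    return [s for s in subs if s.parent is None]


def unread_root_list(subs: List[Sub]) -> List[Sub]:
    """Root-level entries that are unread or contain unread children."""
    return [s for s in root_list(subs) if s.has_unread()]


news = Sub(name="News")  # a folder: no source
daily = Sub(name="Daily", source="https://example.com/daily", parent=news, unread=3)
blog = Sub(name="A blog", source="https://example.com/blog")  # root feed, all read
everything = [news, daily, blog]

print([s.name for s in root_list(everything)])         # ['News', 'A blog']
print([s.name for s in unread_root_list(everything)])  # ['News']
```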

Cloudflare Busting

django-feed-reader has Dripfeed support built in. If a feed becomes blocked by Cloudflare it can be polled via Dripfeed instead. This requires a Dripfeed account and API key.

For more details see the full documentation.

django-feed-reader's People

Contributors

cspencer-eod, philgyford, xurble


django-feed-reader's Issues

Fetch error

Getting a fetch error after updating the feed with utils.update_feeds() after a number of tries.

Fetch error:HTTPSConnectionPool(host='www.thehindu.com', port=443): Max retries exceeded with url: /feeder/default.rss (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fcd9c31d4e0>: Failed to establish a new connection

InsecureRequestWarning when fetching feeds

With every feed I fetch I get this warning:

/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py:1045: InsecureRequestWarning:
Unverified HTTPS request is being made to host 'www.example.com'.
Adding certificate verification is strongly advised.
See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings

I think this is because in this line in utils.read_feed():

ret = requests.get(feed_url, headers=headers, verify=False, allow_redirects=False, timeout=20, proxies=proxies)

there's verify=False. Given the default is True, I'm wondering why it's set to False, which generates these warnings?

Warning when creating a Source with default due_poll value

If I have USE_TZ = True in my settings, and I create a new Source without setting its due_poll field:

source = Source.objects.create(feed_url="https://example.com/feed")

Then I get this warning:

/usr/local/lib/python3.10/site-packages/django/db/models/fields/__init__.py:1564: RuntimeWarning: DateTimeField Source.due_poll received a naive datetime (1900-01-01 00:00:00) while time zone support is active.
  warnings.warn(

Because the default value is set like this:

    due_poll      = models.DateTimeField(default=datetime.datetime(1900, 1, 1))

I'm not sure off the top of my head the best way to fix this, taking into account the value of USE_TZ and which version of Django is currently used, given the change from pytz to zoneinfo (available in python from 3.9) that happened with Django 4.0. e.g. https://stackoverflow.com/a/71823301/250962

And, fwiw, timezone support will be enabled by default with Django 5.0.
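
One possible direction, sketched here with only the standard library: give the sentinel date an explicit UTC offset using datetime.timezone.utc, which is available on every Python 3 version and so sidesteps the pytz versus zoneinfo question. Whether this suits the project's migrations, or a project running with USE_TZ = False, has not been checked, and the constant name NEVER_POLLED is made up for the example.

```python
import datetime

# Hypothetical replacement for the naive default quoted in the issue above
NEVER_POLLED = datetime.datetime(1900, 1, 1, tzinfo=datetime.timezone.utc)

naive = datetime.datetime(1900, 1, 1)

# A datetime is "aware" when utcoffset() returns something other than None
print(naive.utcoffset())         # None
print(NEVER_POLLED.utcoffset())  # 0:00:00
```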

Missing migration for Post.guid max_size change

There appears to be no new migration file added for the change to the Post model.
At least for me, when executing python manage.py makemigrations after updating to 1.0.8 a new migration was auto-generated.

Furthermore, applying the auto-generated migration fails with the following error message:
django.db.utils.OperationalError: (1071, 'Specified key was too long; max key length is 3072 bytes')

I guess max_length 1024 is not compatible with all DBs.

For reference: I am using MySQL 8.0.26.

Handling longer GUIDs

Hello! Thanks for all your work on django-feed-reader!

Is there any reason why models.Post.guid has max_length=512 (rather than eg 1024 -- I understand max_length has to be set to something)? I am reading an rss feed which occasionally has posts with guids longer than this. At the moment these feed posts cause an exception when utils.parse_feed_xml() tries to call p.save().

I am wondering what the best response might be, eg:

  1. increase max_length to 1024
  2. truncate offending guids to max_length
  3. catch the exception, discard the post and log an error

If you have a preference (I think mine would be 1 & 3), I'd be happy to submit a PR
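
For comparison, option 2 could look roughly like this in isolation. This is a stand-alone sketch, not code from utils.parse_feed_xml(): GUID_MAX_LENGTH, fit_guid and the hashing scheme are all hypothetical. Plain truncation could make two long guids collide, so this variant keeps a prefix and appends a digest of the full value.

```python
import hashlib
import logging

logger = logging.getLogger(__name__)

GUID_MAX_LENGTH = 512  # the current max_length mentioned above


def fit_guid(guid: str, max_length: int = GUID_MAX_LENGTH) -> str:
    """Return guid unchanged if it fits, otherwise a shortened form.

    Hypothetical helper. Assumes max_length is at least 64, the length of
    a SHA-256 hex digest.
    """
    if len(guid) <= max_length:
        return guid
    digest = hashlib.sha256(guid.encode("utf-8")).hexdigest()  # 64 hex characters
    logger.warning("guid of %d characters shortened to %d", len(guid), max_length)
    # Keep as much of the original as fits, then the digest of the whole guid
    return guid[: max_length - len(digest)] + digest
```

The same guid always maps to the same shortened value, so a lookup for an existing Post would still find it on the next fetch.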

Fatal error when fetching a feed that has an item with guid longer than 255 characters

Although the Post model has a limit of 512 characters for link, it only allows 255 characters for guid, which is often the same as the link.

I'm currently getting a fatal error when fetching https://feeds.feedburner.com/mcsweeneys because it has an item like this (I've removed the long description):

<item>
  <title>The Estate of Édouard Manet Wishes to Remind Museum Visitors That the Best Way to Prevent Climate Change Is to Throw Bucket After Bucket of Hot Sloppy Soup on the Eminently Mediocre Paintings of That Son of a Bitch Monet</title>
  <dc:creator>Chas Gillespie</dc:creator>
  <description>[REMOVED]</description>
  <pubDate>Mon, 31 Oct 2022 14:00:00 -0400</pubDate>
  <link>https://www.mcsweeneys.net/articles/the-estate-of-edouard-manet-wishes-to-remind-museum-visitors-that-the-best-way-to-prevent-climate-change-is-to-throw-bucket-after-bucket-of-hot-sloppy-soup-on-the-eminently-mediocre-paintings-of-that-son-of-a-bitch-monet</link>
  <guid>https://www.mcsweeneys.net/articles/the-estate-of-edouard-manet-wishes-to-remind-museum-visitors-that-the-best-way-to-prevent-climate-change-is-to-throw-bucket-after-bucket-of-hot-sloppy-soup-on-the-eminently-mediocre-paintings-of-that-son-of-a-bitch-monet</guid>
</item>

That guid is 256 characters! The traceback when doing a refreshfeeds is:

...
EXISTING https://www.mcsweeneys.net/articles/if-elected-i-promise-to-murder-you
EXISTING https://www.mcsweeneys.net/articles/horror-movie-titles-according-to-their-side-characters
NEW https://www.mcsweeneys.net/articles/the-estate-of-edouard-manet-wishes-to-remind-museum-visitors-that-the-best-way-to-prevent-climate-change-is-to-throw-bucket-after-bucket-of-hot-sloppy-soup-on-the-eminently-mediocre-paintings-of-that-son-of-a-bitch-monet
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/feeds/utils.py", line 515, in parse_feed_xml
    p  = Post.objects.filter(source=source_feed).filter(guid=guid)[0]
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 446, in __getitem__
    return qs._result_cache[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.StringDataRightTruncation: value too long for type character varying(255)


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/code/manage.py", line 22, in <module>
    main()
  File "/code/manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.10/site-packages/django/core/management/__init__.py", line 446, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.10/site-packages/django/core/management/__init__.py", line 440, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.10/site-packages/django/core/management/base.py", line 402, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.10/site-packages/django/core/management/base.py", line 448, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python3.10/site-packages/feeds/management/commands/refreshfeeds.py", line 11, in handle
    update_feeds(30, self.stdout)
  File "/usr/local/lib/python3.10/site-packages/feeds/utils.py", line 125, in update_feeds
    read_feed(src, output)
  File "/usr/local/lib/python3.10/site-packages/feeds/utils.py", line 350, in read_feed
    (ok,changed) = import_feed(source_feed=source_feed, feed_body=ret.content, content_type=content_type, output=output)
  File "/usr/local/lib/python3.10/site-packages/feeds/utils.py", line 387, in import_feed
    (ok,changed) = parse_feed_xml(source_feed, feed_body, output)
  File "/usr/local/lib/python3.10/site-packages/feeds/utils.py", line 534, in parse_feed_xml
    p.save()
  File "/usr/local/lib/python3.10/site-packages/django/db/models/base.py", line 812, in save
    self.save_base(
  File "/usr/local/lib/python3.10/site-packages/django/db/models/base.py", line 863, in save_base
    updated = self._save_table(
  File "/usr/local/lib/python3.10/site-packages/django/db/models/base.py", line 1006, in _save_table
    results = self._do_insert(
  File "/usr/local/lib/python3.10/site-packages/django/db/models/base.py", line 1047, in _do_insert
    return manager._insert(
  File "/usr/local/lib/python3.10/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 1790, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/usr/local/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1660, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 103, in execute
    return super().execute(sql, params)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 84, in _execute
    with self.db.wrap_database_errors:
  File "/usr/local/lib/python3.10/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DataError: value too long for type character varying(255)

Impossible to save a Source without setting last_success and last_change fields

The Source model has fields like this:

    last_success   = models.DateTimeField(null=True)
    last_change    = models.DateTimeField(null=True)

But if you're creating a Source in the Admin then what would you set these to, given the feed hasn't been fetched yet?

And if you create a Source by other means, which works, if you then want to edit the Source (e.g. because the Feed URL is wrong, and couldn't be fetched) then you can't save the updated version without setting these fields.

I assume they should both have blank=True?

Parsing a feed item with only updated dates, and no published, results in a Post with a created time of now

Having just added a load of Sources, and fetched their feeds, I've noticed that all of their Posts had created times of now.

It seems that a lot of feeds have an <updated> date for each <entry> but nothing like <published>.

But in these lines when fetching and parsing a feed, we only look for a published date:

                try:
                    p.created  = datetime.datetime.fromtimestamp(time.mktime(e.published_parsed)).replace(tzinfo=timezone.utc)
                except Exception as ex2:
                    output.write("CREATED ERROR:" + str(ex2))     
                    p.created  = timezone.now()

I think it would make sense that if that first attempt results in an exception we try the same thing with e.updated_parsed – which exists if the entry had an <updated> field – and only if that fails set created to now.

I'll do a PR for that tomorrow unless you have a better idea.
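
A rough sketch of that fallback, written against a plain object rather than the real parsing loop. The helper name entry_created is made up; published_parsed and updated_parsed are the entry attributes discussed above. Note one deliberate difference from the quoted code: calendar.timegm is used instead of time.mktime so that the struct_time is interpreted as UTC rather than local time.

```python
import calendar
import datetime
import time
from types import SimpleNamespace


def entry_created(entry) -> datetime.datetime:
    """Pick the best available timestamp for a feed entry (hypothetical helper)."""
    for attr in ("published_parsed", "updated_parsed"):
        parsed = getattr(entry, attr, None)
        if parsed is not None:
            # timegm treats the struct_time as UTC
            return datetime.datetime.fromtimestamp(
                calendar.timegm(parsed), tz=datetime.timezone.utc
            )
    # Neither date was present: fall back to the time of the fetch
    return datetime.datetime.now(datetime.timezone.utc)


# An entry that only carries an <updated> date
entry = SimpleNamespace(updated_parsed=time.gmtime(0))
print(entry_created(entry))  # 1970-01-01 00:00:00+00:00
```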

Migrations created due to DEFAULT_AUTO_FIELD

I have this in my project's settings (see DEFAULT_AUTO_FIELD docs):

DEFAULT_AUTO_FIELD = "django.db.models.BigAutoField"

But when I add django-feed-reader to my project, and run makemigrations, a new migration is created for the app (shown below).

I think django-feed-reader needs to either add this line to FeedsConfig in apps.py to prevent the migration:

default_auto_field = 'django.db.models.AutoField'

...or generate its own migration.

The migration generated (feeds/migrations/0008_alter_enclosure_id_alter_post_id_alter_source_id_and_more.py):

# Generated by Django 4.1.1 on 2022-09-30 13:25

from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ("feeds", "0007_auto_20210502_0716"),
    ]

    operations = [
        migrations.AlterField(
            model_name="enclosure",
            name="id",
            field=models.BigAutoField(
                auto_created=True, primary_key=True, serialize=False, verbose_name="ID"
            ),
        ),
        migrations.AlterField(
            model_name="post",
            name="id",
            field=models.BigAutoField(
                auto_created=True, primary_key=True, serialize=False, verbose_name="ID"
            ),
        ),
        migrations.AlterField(
            model_name="source",
            name="id",
            field=models.BigAutoField(
                auto_created=True, primary_key=True, serialize=False, verbose_name="ID"
            ),
        ),
        migrations.AlterField(
            model_name="webproxy",
            name="id",
            field=models.BigAutoField(
                auto_created=True, primary_key=True, serialize=False, verbose_name="ID"
            ),
        ),
    ]

Enclosure Deletion Bug

Hi,

Just wanted to alert you to a massive bug I discovered. I've been maintaining my separate branch for a project I'm working on. I have a lot of custom models linked to the Enclosure model by foreign key. Then the other day, I noticed all my records linked to enclosures for a specific source suddenly vanished.

It took me a while, but I think I've tracked it down to the parse_feed_json() method. You have logic there to delete old enclosures if their URL doesn't exist in the current feed. That's usually not a problem. However, many RSS feeds wrap the true podcast URL with a prefix URL to track downloads, and occasionally these tracker URLs change. When that happens, that causes this code to delete all the enclosure records, and, in my case, everything attached to them.

My patch for this is to remove the calls to delete(), so I never risk wiping out my database when some podcast server switches up its tracking links. I've also disabled the logic to create an enclosure if the URL is new, since that allows logical duplicates where the same MP3 url has two different tracking links prepended to it. It seems safe to only create a new enclosure if none exist, since 99.99% of the time, a post only ever has one enclosure.

I'd submit a PR, but my codebase has diverged so much from yours, it would be a little unwieldy. I just want to give you a heads up, so this problem doesn't bite you too.

Unable to add source

Browsing to /admin/feeds/source/add/ generates this traceback:

.env/lib/python3.7/site-packages/django/utils/timezone.py in is_aware, line 248
return value.utcoffset() is not None 
AttributeError at /admin/feeds/source/add/
'str' object has no attribute 'utcoffset'

Documentation

Hello,

Do you have any documentation about how to use it?
I would like to use your library to create a REST or GraphQL API.

Is it compatible with the latest RSS and Atom versions?
Can I parse any type of data in the RSS feed? Description, title and link are probably the basic ones, but if there are extra fields, can I parse them too?

Thank you
