mmcdole / gofeed
Parse RSS, Atom and JSON feeds in Go
License: MIT License
Start parsing content:encoded for Item.Content for RSS feeds.
Sample code:
package main

import (
	"fmt"
	"log"

	"github.com/mmcdole/gofeed"
)

func main() {
	feedData := `<?xml version="1.0" ?>
<rss version="2.0">
<channel>
<item>
<title>Test</title>
<description>Some description</description>
</item>
</channel>
</rss>`
	fp := gofeed.NewParser()
	feed, err := fp.ParseString(feedData)
	if err != nil {
		log.Fatalf("cannot parse feed: %v", err)
	}
	fmt.Printf("%s\n", feed.Items[0].Description)
}
If I run it: vendor/github.com/mmcdole/gofeed/detector.go:7:2: use of internal package not allowed
I also tried with a feed URL, with the same result.
The following example entry, included within a valid Atom feed, should create a gofeed.Item with the following content.
<atom:entry>
  <atom:title>Parsing Atom with gofeed</atom:title>
  <atom:link href="https://example.com/blog/2016/04/18/parsing-atom-with-gofeed" />
  <atom:updated>2016-04-18T00:00:00+00:00</atom:updated>
  <atom:id>https://example.com/blog/2016/04/18/parsing-atom-with-gofeed</atom:id>
  <atom:content type="html">
    <p>This is a directly included child element, no wrapping in a DIV element.</p>
    <div class="not-root"><p>This DIV is part of the post content, wholly unrelated to what RFC 4287 might say about DIVs.</p></div>
  </atom:content>
</atom:entry>
Expected output:
for _, item := range feed.Items {
	fmt.Println(item.Content)
}
// <p>This is a directly included child element, no wrapping in a DIV element.</p>\n\n<div class="not-root"><p>This DIV is part of the post content, wholly unrelated to what RFC 4287 might say about DIVs.</p></div>

Actual output:
for _, item := range feed.Items {
	fmt.Println(item.Content)
}
// <p>This DIV is part of the post content, wholly unrelated to what RFC 4287 might say about DIVs.</p>
The problematic feed is https://terinstock.com/atom.xml. The author is alright.
RFC 4287, section 4.1.3.3:
2. If the value of "type" is "html", the content of atom:content MUST NOT contain child elements and SHOULD be suitable for handling as HTML. The HTML markup MUST be escaped; for example, "<br>" as "&lt;br>". The HTML markup SHOULD be such that it could validly appear directly within an HTML <DIV> element. Atom Processors that display the content MAY use the markup to aid in displaying it.
3. If the value of "type" is "xhtml", the content of atom:content MUST be a single XHTML div element [XHTML] and SHOULD be suitable for handling as XHTML. The XHTML div element itself MUST NOT be considered part of the content. Atom Processors that display the content MAY use the markup to aid in displaying it. The escaped versions of characters such as "&" and ">" represent those characters, not markup.
Of course, a DIV is valid within a DIV, but it's not required for type "html". Even if the content were wrapped in a DIV, that DIV should be considered part of the content for the "html" type.
Expected: feed parsed.
Actual: http://www.larevuedudigital.com/feed/ -> Failed to detect feed type
Hey Matthew, thanks for your help over on the feedparser issue tracker!
Mark Pilgrim claimed copyright over the feedparser XML unit tests and released them under the 2-clause BSD license. In addition, I've spent the last six years cleaning up and adding new XML unit tests. However, I didn't see any copyright attribution or the text of the 2-clause BSD license, both of which are mandatory in order to use the XML unit tests.
Would you update the gofeed documentation so that it clearly identifies both Mark Pilgrim and myself as the XML unit test copyright owners and identify that the unit test files are released under the terms of the 2-clause BSD license?
Thanks!
This malformed feed has a self-closing feed tag at the beginning of it. Need to iterate through other sibling elements if the first element is empty?
The feed I am trying to parse has a few nodes that use a different namespace. The library causes the whole feed to be parsed into the extension map instead of just those few nodes. Ideally, I'd like to be able to access the content from the Item slice and discard the extension map.
The feed in question is http://feeds.feedburner.com/blogspot/RLXA
Expected: the feed parses properly, with extensions limited to only the single offending node.
Actual: the whole feed is put inside the extension map.
Parsing the following will cause the entry to be in the extension map.
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Foo</title>
  <subtitle type="html">Bar</subtitle>
  <author>
    <name>Foo</name>
  </author>
  <blah:link href="foo" rel="self" type="application/atom+xml" xmlns:blah="http://www.w3.org/2005/Atom"/>
  <entry>
    <id>Foo</id>
    <title type="text">title</title>
    <content type="html">foo</content>
  </entry>
</feed>
but removing the following line lets it unmarshal into the feed struct flawlessly:
<blah:link href="foo" rel="self" type="application/atom+xml" xmlns:blah="http://www.w3.org/2005/Atom"/>
Note: Please include any links to problem feeds, or the feed content itself!
Currently when the feed encounters an unexpected token, I get an error message such as this:
(master) % gotest f /Users/mmcdole/Downloads/feeds/1016.dat
Error: Expected StartTag or EndTag but got Text
I'd like to improve these error messages to give more context about the location and source of the error.
Attaching a broken feed file as an example.
1016.txt
Hi,
I'm trying to parse this feed: https://ctftime.org/event/list/upcoming/rss/
As you can see, each item has tags like weight, but I can't find a way to parse them; looking at the docs, I can't find anything about that.
The following feed is failing to parse due to issues with the <enclosure> tag. Need to look into this feed further.
I need to research how to get our code-coverage CI tool to include the aggregate coverage for all of gofeed's packages. Having it just be the coverage for the gofeed package doesn't seem as useful.
Any plans on supporting the JSON Feed format?
Obviously JSON is easy enough to parse on its own, but it would be great to be able to have it just work with this out of the box.
I expect feed.Categories to be ["News & Politics"].
feed.Categories is: [Jon Favreau POTUS Donald Trump politics Tommy Vietor Dan Pfeiffer Jon Lovett Pod Save America News & Politics].
Please see unit test here: https://gist.github.com/armhold/96be9635883fd417b6cb82ab445abddd
This feed comes from: http://feeds.feedburner.com/pod-save-america.
I haven't really dug into the code yet, so it's possible I'm doing something dumb here. But I think gofeed is accidentally adding the next line of the XML to feed.Categories.
Are you aware that http.Get can return a non-nil response that still needs closing even if err is non-nil as well? Not doing so leaks. http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/index.html#close_http_resp_body
I seem to have introduced a regression in parsing CDATA sections. It is including the CDATA prefix itself in the parsed content. I assume this was introduced when I fixed the naked markup issues.
I also need to add some more tests for both parsers around CDATA parsing.
If the connection to the given URL hangs, the httpClient should time out. Ideally the desired timeout could be an argument with a reasonable default.
Actual: the connection hangs.
To reproduce: find a URL with a really slow network and try parsing it.
The link that hung for me was http://rss.shanghaidaily.com/Portal/mainSite/Handler.ashx?i=7
The function that's hanging for me is ParseURL in parser.go.
The atom.Parser needs to be modified to support resolving relative URLs specified with xml:base. Add relative URL tests as well.
Expected: be able to build and run gofeed.
When I try to 'go build', I am seeing this error:
'package gofeed imports github.com/mmcdole/gofeed/internal/shared: use of internal package not allowed'
My go version is 'go version go1.5.1 darwin/amd64'.
To reproduce: run go build.
I am new to Go. Did I miss anything?
Parsing https://www.reddit.com/r/games/.rss should work with an appropriate delay between requests (Reddit asks for 2 seconds between bot requests).
To further describe the issue: this could be resolved if we had the option of defining our own User-Agent string (or any headers, for that matter) when calling gofeed.ParseURL(url string) or when constructing our parser with gofeed.NewParser().
Actual: returns 429 Too Many Requests, as Reddit filters requests that do not have User-Agent strings. The first request will work, after which Reddit will block all new requests for a period of time.
package main

import (
	"fmt"
	"time"

	"github.com/mmcdole/gofeed"
)

func main() {
	fp := gofeed.NewParser()
	feed, err := fp.ParseURL("https://www.reddit.com/r/games/.rss")
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	// This first request will work
	fmt.Println(feed.Title)
	time.Sleep(5 * time.Second)
	// This second request will fail because no user-agent string is defined for the request
	secondfeed, err := fp.ParseURL("https://www.reddit.com/r/games/.rss")
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	fmt.Println(secondfeed.Title)
}
The xml:lang attribute is not currently parsed to populate the Language field of atom.Parser. We are also missing Atom unit tests for the Language field.
Many times I get ERROR: EOF.
For example: http://basijnews.ir/fa/rss/39: EOF
The feed is apparently a valid Atom 1.0 feed according to the W3C. https://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fwww.qdep.org%2Ffeed%2Fatom%2F
Error message: "Failed to detect feed type".
Try parsing http://www.qdep.org/feed/atom/ on commit 1bc2cbeba25b7b594430cff43d7c9e9367cfdca0
xref : matrix-org/go-neb#187
Since all the structs are here, are there any plans to use them to create feeds?
Failed to parse the following feed:
http://feeds.feedburner.com/gestalten_tv
when i start parser with 30 feed url i have error:
socket: too many open files
Failed to parse the following feed:
http://feeds.feedburner.com/YalePressPodcastITunes
When calling ParseURL() on a feed URL, if the server returns an error (e.g. 404), ParseURL() should return this error.
Actual: ParseURL() returns "Failed to detect feed type".
Simple example: use ParseURL() on http://boinkor.net/atom.xml (should return 404).
Feeds are XML documents. Character data can contain both predefined entities (such as &lt;) and numerical character references (e.g. &#34;). These references must be decoded; for example, &#34; has to be decoded to ". encoding/xml does this properly.
Gofeed decodes predefined entities (internal/shared/parseutils.go: DecodeEntities) but ignores numerical character references. Lots of feeds include HTML data which is encoded since it is included in an XML document. For example, feed generators often encode " with &#34; instead of &quot;.
package main

import (
	"fmt"
	"log"

	"github.com/mmcdole/gofeed"
)

func main() {
	feedData := `<?xml version="1.0" ?>
<rss version="2.0">
<channel>
<item>
<title>Test</title>
<description>&lt;a&gt; &quot;b&quot; &#34;c&#34;</description>
</item>
</channel>
</rss>`
	fp := gofeed.NewParser()
	feed, err := fp.ParseString(feedData)
	if err != nil {
		log.Fatalf("cannot parse feed: %v", err)
	}
	fmt.Printf("%s\n", feed.Items[0].Description)
}
This prints '<a> "b" &#34;c&#34;' instead of '<a> "b" "c"'.
I'm not 100% sure about this, but I think the logic to parse the <itunes:image> element into ITunesFeedExtension.Image and ITunesItemExtension.Image may need an update.
All the feeds I'm parsing seem to be using a self-closing tag with the image URL contained in the href attribute. Currently the parser calls parseTextExtension("image", extensions), which returns an empty string.
I've modified a copy of gofeed I'm using to do something similar to the <itunes:owner> and <itunes:category> parsers:
func parseImage(extensions map[string][]Extension) (image string) {
	if extensions == nil {
		return
	}
	matches, ok := extensions["image"]
	if !ok || len(matches) == 0 {
		return
	}
	image = matches[0].Attrs["href"]
	return
}
Thanks so much for creating this library. It is fantastic.
rss_feed with this xml encoding detect error:
<title>RSS Title</title>

When parsing this RSS feed, the iTunesExt.Summary field should be correctly populated for each item in the feed.
Actual: the iTunesExt.Summary field is blank for every item.
Parse the feed and inspect the resulting rss.Item values. You could also look at the translated gofeed.Item values.
The problem appears to be in the parseExtensionElement function. The function takes an XML node (in this case an <itunes:summary> tag) and uses it to create a new ext.Extension. It iterates over any child nodes and, if the child is of type text, sets the Value of the new Extension to the text node's value. Note that if the parent node contains multiple child nodes of type text, only the final node's value is retained.
In this particular feed, the item-level <itunes:summary> tags all contain three text nodes. The first and last are blank while the middle node holds the actual text. Currently this text is being overwritten with the final blank string.
If you view the source for the feed you will see that there are extra line breaks around the text in the <itunes:summary> tags. These line breaks are not present on any other tags (all of which are being parsed correctly as far as I can tell). Perhaps the line breaks are causing the spurious text nodes.
I fixed this in my vendored version of the code by changing this line:
e.Value = strings.TrimSpace(p.Text)
to this:
e.Value += strings.TrimSpace(p.Text)
But I'm not familiar with the project. Maybe this quick fix isn't the best approach. Let me know what you think (I can submit a PR if you'd like).
Hello,
When a feed does not specify the author, the gofeed parser does not bother zeroing the item.Author.Name and item.Author.Email fields (item.Author is simply left nil).
Expected: item.Author.Name -> ""
Actual:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6842a1]
goroutine 1 [running]:
main.main()
/home/src/go/feedparser/main.go:52 +0x251
A feed which doesn't set the author field is the linux kernel feed "https://www.kernel.org/feeds/kdist.xml"
Would ActivityStreams support be considered? It's the basis for ActivityPub, which the Mastodon network is built on.
(Sorry, I originally posted this on #80 and realized it should probably be a separate item.)
Parse content
Check http://blog.octo.com/category/architecture-et-technologies/feed/
error in channel: XML syntax error on line 574: illegal character code U+000C
Take the chance to address the non-idiomatic method names while I still can, before hitting 1.0.
See the following reddit comment for details:
https://www.reddit.com/r/golang/comments/4e8say/gofeed_a_fast_and_robust_rss_and_atom_parser/d1y8vif
When I try to use item.PublishedParsed in a goroutine, an error occurs like:
panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x69193e]
item.PublishedParsed is a *time.Time type. But the official Go documentation (https://golang.org/pkg/time/#Time) says it should be time.Time:
"Programs using times should typically store and pass them as values, not pointers. That is, time variables and struct fields should be of type time.Time, not *time.Time. A Time value can be used by multiple goroutines simultaneously."
Is my problem related to this?
Thank you