nreadability's Introduction

NReadability

Description

NReadability cleans up hard-to-read articles on the Web. It's a tool for removing clutter from HTML pages so that they are more enjoyable to read.

The NReadability package consists of the .NET class library and a simple console application.

NReadability is a C# port of Arc90's Readability bookmarklet.

Installation

You can start using NReadability right away by installing the NuGet package:

PM> Install-Package NReadability

Getting Started

In order to transcode content downloaded from the Web:

var transcoder = new NReadabilityTranscoder();
string content;

using (var wc = new WebClient())
{
  content = wc.DownloadString("https://github.com/marek-stoj/NReadability");
}

bool success;

string transcodedContent =
  transcoder.Transcode(content, out success);

Or even simpler:

var transcoder = new NReadabilityWebTranscoder();
bool success;

string transcodedContent =
  transcoder.Transcode("https://github.com/marek-stoj/NReadability", out success);

nreadability's People

Contributors

marek-stoj

nreadability's Issues

Getting double the content, page is repeated twice in the TranscodingResult

When running NReadability on the following URL:
http://www.internationalreview.co.uk/music/johnny-hodges/

It returns the extracted content twice. If someone else could check whether they get the same result, so I know it isn't just me, that would be great.

var input = new TranscodingInput(htmlContent)
{
  DomSerializationParams = new DomSerializationParams
  {
    DontIncludeContentTypeMetaElement = true,
    DontIncludeDocTypeMetaElement = true,
    DontIncludeGeneratorMetaElement = true,
    DontIncludeMobileSpecificMetaElements = true,
    PrettyPrint = true
  }
};

var transcodedContent = transcoder.Transcode(input);

Where htmlContent is the HTML string downloaded from the URL and decoded as UTF-8: http://www.internationalreview.co.uk/music/johnny-hodges/

Many thanks...

Missing .cs files in solution

I'm not sure if I'm missing the obvious or not - but here goes:

I download the zipped file - unzip it
Navigate to "\marek-stoj-NReadability-4520e39\marek-stoj-NReadability-4520e39\Src\NReadability"
Open "NReadability.sln" in VS

I'm missing a total of 8 files - across two projects

NReadability:
TranscodingInput.cs
TranscodingResult.cs
WebTranscodingInput.cs
WebTranscodingResult.cs

NReadability.Tests:
FileBasedUrlFetcherStub.cs
NReadabilityTranscoderTests_Old.cs
NReadabilityWebTranscoderTests_Old.cs
SimpleUrlFetcherStub.cs

Can anyone point me in the right direction?

Many thanks in advance

Question/Issue: Getting the 'body' of readable page

With Transcode I get a full document (<!doctype html> with head, styles, etc.). Sometimes that isn't required and you only need the body of the content.

For now, as a workaround, I use HtmlAgilityPack to get the innerText of #readInner.

This would be a good asset for the framework.

Could not load type 'NReadability.NReadabilityWebTranscoder' from assembly 'NReadability

Hello,

I'm having the following issue while loading the library:

Could not load type 'NReadability.NReadabilityWebTranscoder' from assembly 'NReadability

I installed the lib via NuGet, and I'm using it as follows:

var input = new WebTranscodingInput("http://....");
var transcoder = new NReadabilityWebTranscoder();
var result = transcoder.Transcode(input);

But as soon as I start the compiled build, I get that error and the app stops.

Please find attached a screenshot of the error.

Could you help please? :(

Thanks
Nicholas

[Screenshot: Screen Shot 2013-02-28 at 16 18 35]

Hidden section returned instead of the main article body

Hi,

I've been using your component for some time with good results, but lately we have encountered more and more cases like this one: http://ir.tcfbank.com/file/Index?KeyFile=32068838. There, NR extracts the terms and conditions text (hidden, and shown in a popup when you click the "terms and conditions" link at the bottom of the article) instead of the actual article body.
Technically the decision is correct, as the "t&c" text is larger and more compact than the main article body.

In other cases the (now) omnipresent "this website uses cookies" text is chosen instead of the article on the same grounds.

Do you have any plans to address such issues in the near future?

For the moment we have resolved it by using an in-house modified version of NR in which we can tweak the algorithm's regexes on a case-by-case basis to exclude the irrelevant content.

Thanks and best regards,
Razvan Goga

NuGet package requested

I've tried NReadability and it works pretty smoothly. Thanks a lot for your effort, cool product :)

I'd like to request that NReadability be published as a NuGet package, so it's much more convenient to use than building from source.

Exception on certain URLs - {"The prefix '' cannot be redefined from '' to 'http://www.w3.org/1999/xhtml' within the same start element tag."}

When I try to Transcode this URL (http://www.rollingstone.com/politics/news/the-ten-worst-members-of-the-worst-congress-ever-20120112), I get the following exception:

Message: {"The prefix '' cannot be redefined from '' to 'http://www.w3.org/1999/xhtml' within the same start element tag."}
Stack Trace:
at System.Xml.XmlWellFormedWriter.PushNamespaceExplicit(String prefix, String ns)
at System.Xml.XmlWellFormedWriter.WriteEndAttribute()
at System.Xml.Linq.ElementWriter.WriteStartElement(XElement e)
at System.Xml.Linq.ElementWriter.WriteElement(XElement e)
at System.Xml.Linq.XElement.WriteTo(XmlWriter writer)
at System.Xml.Linq.XNode.GetXmlString(SaveOptions o)
at System.Xml.Linq.XNode.ToString(SaveOptions options)
at NReadability.DomExtensions.GetInnerHtml(XContainer container) in D:\src\Zaprica\Zaprica.net\ThirdParty\marek-stoj-NReadability-18ccfba\Src\NReadability\NReadability\DomExtensions.cs:line 232
at NReadability.NReadabilityTranscoder.KillBreaks(XElement element) in D:\src\Zaprica\Zaprica.net\ThirdParty\marek-stoj-NReadability-18ccfba\Src\NReadability\NReadability\NReadabilityTranscoder.cs:line 1259
at NReadability.NReadabilityTranscoder.PrepareArticleContentElement(XElement articleContentElement) in D:\src\Zaprica\Zaprica.net\ThirdParty\marek-stoj-NReadability-18ccfba\Src\NReadability\NReadability\NReadabilityTranscoder.cs:line 1096
at NReadability.NReadabilityTranscoder.ExtractArticleContent(XDocument document) in D:\src\Zaprica\Zaprica.net\ThirdParty\marek-stoj-NReadability-18ccfba\Src\NReadability\NReadability\NReadabilityTranscoder.cs:line 737
at NReadability.NReadabilityTranscoder.TranscodeToXml(String htmlContent, String url, Boolean& mainContentExtracted, String& nextPageUrl) in D:\src\Zaprica\Zaprica.net\ThirdParty\marek-stoj-NReadability-18ccfba\Src\NReadability\NReadability\NReadabilityTranscoder.cs:line 305
at NReadability.NReadabilityWebTranscoder.Transcode(String url, DomSerializationParams domSerializationParams, Boolean& mainContentExtracted) in D:\src\Zaprica\Zaprica.net\ThirdParty\marek-stoj-NReadability-18ccfba\Src\NReadability\NReadability\NReadabilityWebTranscoder.cs:line 115
at NReadability.NReadabilityWebTranscoder.Transcode(String url, Boolean& mainContentExtracted) in D:\src\Zaprica\Zaprica.net\ThirdParty\marek-stoj-NReadability-18ccfba\Src\NReadability\NReadability\NReadabilityWebTranscoder.cs:line 142
at Zaprica.Utilities.ReadabilityManager.TranscodeFromUrl(String url) in D:\src\Zaprica\Zaprica.net\Zaprica.Utilities\ReadabilityManager.cs:line 26
at Zaprica.Server.WcfService.transcode.Page_Load(Object sender, EventArgs e) in D:\src\Zaprica\Zaprica.net\Zaprica.Server.Transcoder\transcode.aspx.cs:line 23

Invalid character.

Hi there,

I seem to be having an issue with certain articles, which fail with an error like "'', hexadecimal value 0x08, is an invalid character."

An example article is http://www.lifehacker.com.au/2015/01/read-more-every-day-by-creating-reading-triggers/

This is running on ASP.NET 4.0, and the NReadability package was installed via NuGet.

Current running code is:
NReadability.NReadabilityWebTranscoder tc = new NReadability.NReadabilityWebTranscoder();
NReadability.WebTranscodingInput ti = new NReadability.WebTranscodingInput(url);
NReadability.WebTranscodingResult tcr = tc.Transcode(ti); //Exception thrown on this line.
Response.Write(tcr.ExtractedContent);

(However, I've tried several variations of this code, including the code from the readme.)

I realise this is due to invalid characters being read by the XML reader within NReadability; however, I do not seem to be able to work around it.

Suggestions?
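
One possible workaround, sketched here under assumptions rather than as a confirmed fix: download the HTML yourself, strip the characters that XML 1.0 does not allow (0x08 is one of them), and pass the cleaned string to the non-web transcoder. The class and method names below (XmlCharSanitizer, RemoveInvalidXmlChars) are hypothetical and not part of NReadability. If the offending character comes from a numeric entity such as &#8; that is decoded later during parsing, this filter alone will not catch it.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper, not part of NReadability.
public static class XmlCharSanitizer
{
    // Returns a copy of the input without characters that are invalid in XML 1.0
    // (for example the 0x08 backspace control character).
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        var sb = new StringBuilder(text.Length);

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            // Keep well-formed surrogate pairs (characters outside the BMP) intact.
            if (char.IsHighSurrogate(c)
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            // Anything else is dropped.
        }

        return sb.ToString();
    }
}
```

The cleaned string could then be wrapped in a TranscodingInput and passed to NReadabilityTranscoder.Transcode instead of letting NReadabilityWebTranscoder fetch the page itself.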

Space between adjacent <a..>s ignored

Currently, if I have adjacent links in the original text, ...<a href="/f">first <a href="/s">second..., the transcoder removes the meaningful space between the links and returns ...<a href="/f">first<a href="/s">second.... This is due to the option:

sgmlReader.WhitespaceHandling = WhitespaceHandling.None;

in the SgmlDomBuilder class. If changed to

sgmlReader.WhitespaceHandling = WhitespaceHandling.Significant;

the space between the links stays after transcoding.

This will require some changes, e.g. in the CollapseRedundantParagraphDivs method. However, the current behavior makes the text ugly and unreadable in some cases.

The XmlReader state should be Interactive

When trying to transcode the url: http://www.ad120.co.il/ it fails with:

System.InvalidOperationException occurred
Message=The XmlReader state should be Interactive.
Source=System.Xml.Linq
StackTrace:
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
at System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions options)
at System.Xml.Linq.XDocument.Load(XmlReader reader)
at NReadability.SgmlDomBuilder.LoadDocument(String htmlContent) in D:\Development\ExternalLibrary\NReadability\Src\NReadability\NReadability\SgmlDomBuilder.cs:line 105
InnerException:

It happens on many other non-English pages (namely Hebrew and Russian).

Only getting half the content

Trying NReadability on http://www.propertysearch4u.com.au/buyers-agent-sydney gives me just the following InnerHtml: http://pastebin.com/fuA2QJsH

The first line of content on the actual webpage is "The Buyers Agent Sydney Services that can help Sydneysiders, Interstate investors and Australian Expatriates cost effectively acquire their Sydney property." but the TranscodingResult.ExtractedContent I'm getting starts with "Bidding at auction can be intimidating." which is actually halfway down the page.

Here's how I'm calling it:

var transcodingInput = new WebTranscodingInput(strURL);
var transcoder = new NReadability.NReadabilityWebTranscoder();
var transcodingResult = transcoder.Transcode(transcodingInput);
if (!transcodingResult.ContentExtracted)
    throw new ArgumentNullException();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(transcodingResult.ExtractedContent);
var bodyNode = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");

A copy of the original page, in case it changes: http://pastebin.com/20nZ2qRa

Calculating a page "score" for readability

Hi Marek,
Given a URL, have you figured out a way to calculate how "readable" that page is (a score from 1 to 100, with 100 being fully readable)? The Safari browser on Mac shows the Reader button (on the right side of the address bar) for some pages.

Thanks
Nikhil

Java version

Hello, will you supply a Java version?
I'm a Java guy. If you do, please let me know.
Thanks...

Possible bugs with fixes, improvements

Not sure if you are going to continue the development, but recently I came across some fixes and improvements I made long ago - in 2011. The parts of code I modified back then look almost the same now, so the changes may still be applicable. I myself don't write in C# anymore and don't have the software installed, so I can't test it myself now.

The repository with my changes is at https://github.com/aplavin/nreadability (see last 3 commits).
