anglesharp.io's Introduction

AngleSharp

AngleSharp is a .NET library that lets you parse angle-bracket-based hypertexts such as HTML, SVG, and MathML. XML without validation is also supported. AngleSharp can also parse CSS. The included parser is built upon the official W3C specification. It produces a portable HTML5 DOM representation of the given source code and ensures compatibility with the results in evergreen browsers. Standard DOM features such as querySelector and querySelectorAll are available for tree traversal.

⚡⚡ Migrating from AngleSharp 0.9 to AngleSharp 0.10 or later (incl. 1.0)? Look at our migration documentation. ⚡⚡

Key Features

  • Portable (using .NET Standard 2.0)
  • Standards-compliant (works like evergreen browsers)
  • Great performance (outperforms similar parsers in most scenarios)
  • Extensible (extend with your own services)
  • Useful abstractions (type helpers, jQuery like construction)
  • Fully functional DOM (all the lists, iterators, and events you know)
  • Form submission (easily log in everywhere)
  • Navigation (a BrowsingContext is like a browser tab - control it from .NET!)
  • LINQ enhanced (use LINQ with DOM elements, naturally without wrappers)

The advantage over similar libraries like HtmlAgilityPack is that the exposed DOM uses the official W3C-specified API, so even things like querySelectorAll are available in AngleSharp. The parser also follows the HTML 5.1 specification, which defines error handling and element correction. The AngleSharp library focuses on standards compliance, interactivity, and extensibility. It therefore gives web developers working with C# the same possibilities they know from using the DOM in any modern browser.

The performance of AngleSharp is quite close to the performance of browsers. Even very large pages can be processed within milliseconds. AngleSharp tries to minimize memory allocations and reuses elements internally to avoid unnecessary object creation.

Simple Demo

This simple example retrieves data from Wikipedia.

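using System.Linq;
using AngleSharp;

// Enable the default loader so that documents can be fetched over the network.
var config = Configuration.Default.WithDefaultLoader();
var address = "https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes";
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(address);
// Select the third cell of every episode row, then project each cell to its text.
var cellSelector = "tr.vevent td:nth-child(3)";
var cells = document.QuerySelectorAll(cellSelector);
var titles = cells.Select(m => m.TextContent);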

Or the same with explicit types:

using System.Collections.Generic;
using System.Linq;
using AngleSharp;
using AngleSharp.Dom;

IConfiguration config = Configuration.Default.WithDefaultLoader();
string address = "https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes";
IBrowsingContext context = BrowsingContext.New(config);
IDocument document = await context.OpenAsync(address);
string cellSelector = "tr.vevent td:nth-child(3)";
IHtmlCollection<IElement> cells = document.QuerySelectorAll(cellSelector);
IEnumerable<string> titles = cells.Select(m => m.TextContent);

In the example we see:

  • How to set up the configuration to support document loading
  • How to asynchronously get the document in a new context using the configuration
  • How to perform a query to get all cells with the content of interest
  • That the whole DOM supports LINQ queries

Every collection in AngleSharp supports LINQ statements. AngleSharp also provides many useful extension methods for element collections that cannot be found in the official DOM.
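
As a small illustration of LINQ on a DOM collection, the following sketch parses an inline HTML string and filters list items by class. The markup is hypothetical and was chosen so that no network access is needed. The sketch assumes AngleSharp 0.10 or later and a C# version that supports top-level statements.

```csharp
using System;
using System.Linq;
using AngleSharp;

// Hypothetical inline markup; any HTML string would do.
var html = "<ul><li class='done'>Parse</li><li>Query</li><li class='done'>Project</li></ul>";

// No loader is required because the content is supplied directly.
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(html));

// QuerySelectorAll returns IHtmlCollection<IElement>, which is an
// IEnumerable<IElement>, so the standard LINQ operators apply directly.
var doneItems = document.QuerySelectorAll("li")
    .Where(li => li.ClassList.Contains("done"))
    .Select(li => li.TextContent)
    .ToList();

Console.WriteLine(string.Join(", ", doneItems));
```

Materializing the query with ToList is optional; it is done here only so that the result can be inspected more than once.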

Supported Platforms

AngleSharp has been created as a .NET Standard 2.0 compatible library. This includes, but is not limited to:

  • .NET Core (2.0 and later)
  • .NET Framework (4.6.2 and later)
  • Xamarin.Android (7.0 and 8.0)
  • Xamarin.iOS (10.0 and 10.14)
  • Xamarin.Mac (3.0 and 3.8)
  • Mono (4.6 and 5.4)
  • UWP (10.0 and 10.0.16299)
  • Unity (2018.1)

Documentation

The documentation of AngleSharp is located in the docs folder. More examples, best-practices, and general information can be found there. The documentation also contains a list of frequently asked questions.

More information is also available by following some of the hyperlinks mentioned in the Wiki. In-depth articles will be published on CodeProject, with links being placed in the Wiki at GitHub.

Use-Cases

  • Parsing HTML (incl. fragments)
  • Parsing CSS (incl. selectors, declarations, ...)
  • Constructing HTML (e.g., view-engine)
  • Minifying CSS, HTML, ...
  • Querying document elements
  • Crawling information
  • Gathering statistics
  • Web automation
  • Tools with HTML / CSS / ... support
  • Connection to page analytics
  • HTML / DOM unit tests
  • Automated JavaScript interaction
  • Testing other concepts, e.g., script engines
  • ...

Vision

The project aims to bring a solid implementation of the W3C DOM for HTML, SVG, MathML, and CSS to the CLR - all written in C#. The idea is that you can basically do everything with the DOM in C# that you can do in JavaScript (plus, of course, more).

Most parts of the DOM are included, even though some may still lack a fully specified or correct implementation. The goal for v1.0 is to have all practically relevant parts implemented according to the official W3C specification (with useful extensions by the WHATWG).

The API is close to the DOM4 specification; however, the naming has been adjusted to comply with .NET conventions. Nevertheless, to make AngleSharp really useful for, e.g., a JavaScript engine, attributes have been placed on the corresponding interfaces (and methods, properties, ...) to indicate the status of the field in the official specification. This allows automatic generation of DOM objects with the official API.

This is a long-term project which will eventually result in a state of the art parser for the most important angle bracket based hyper-texts.

Our hope is to build a community around web parsing and libraries from this project. So far we have had great contributions, but that goal has not been fully achieved. Want to help? Get in touch with us!

Participating in the Project

If you know some feature that AngleSharp is currently missing, and you are willing to implement the feature, then your contribution is more than welcome! Also if you have a really cool idea - do not be shy, we'd like to hear it.

If you have an idea how to improve the API (or what is missing) then posts / messages are also welcome. For instance there have been ongoing discussions about some styles that have been used by AngleSharp (e.g., HTMLDocument or HtmlDocument) in the past. In the end AngleSharp stopped using HTMLDocument (at least visible outside of the library). Now AngleSharp uses names like IDocument, IHtmlElement and so on. This change would not have been possible without such fruitful discussions.

The project is always searching for additional contributors. Even if you do not have any code to contribute, an idea for improvement, a bug report, or a note about a mistake in the documentation is welcome. These are the contributions that keep this project active.

Live discussions can take place in our Gitter chat, which supports using GitHub accounts.

More information is found in the contribution guidelines. All contributors can be found in the CONTRIBUTORS file.

This project has also adopted the code of conduct defined by the Contributor Covenant to clarify expected behavior in our community.

For more information see the .NET Foundation Code of Conduct.

Funding / Support

If you use AngleSharp frequently but do not have the time to support the project through active participation, you may still be interested in ensuring that the AngleSharp project keeps the lights on.

Therefore we created a backing model via Bountysource. Any donation is welcome and much appreciated. We will mostly spend the money on dedicated development time to improve AngleSharp where it needs to be improved, plus invest in the web utility eco-system in .NET (e.g., in JavaScript engines, other parsers, or a renderer for AngleSharp to mention some outstanding projects).

Visit Bountysource for more details.

Development

AngleSharp is written in the most recent version of C# and thus requires Roslyn as a compiler. Using an IDE like Visual Studio 2019+ is recommended on Windows. Alternatively, VSCode (with OmniSharp or another suitable Language Server Protocol implementation) should be the tool of choice on other platforms.

The code tries to be as clean as possible. Notably the following rules are used:

  • Use braces for any conditional / loop body
  • Use the -Async suffixed methods when available
  • Use VIP ("Var If Possible") style (in C++ called AAA: Almost Always Auto) to place types on the right

More important, however, is the proper usage of tests. Any new feature should come with a set of tests to cover the functionality and prevent regression.

Changelog

A very detailed changelog exists. If you are just interested in major releases then have a look at the GitHub releases.

.NET Foundation

This project is supported by the .NET Foundation.

License

AngleSharp is released using the MIT license. For more information see the license file.

anglesharp.io's People

Contributors

florianrappl, joelverhagen, zyano

anglesharp.io's Issues

Documentation

For the version v1 of this project (or partially when AngleSharp.Core reaches v1) some documentation is required.

The outline for the IO project is (roughly):

  • Introduction
  • New requesters
  • IO pipeline(s) and data caches
  • (Sample) use-cases w. (simple) examples
  • Core interfaces
  • Services and extensibility

HttpClientRequester does not use HttpClientHandler.AllowAutoRedirect

When using HttpClient, we can disable automatic redirects by using this:

var handler = new HttpClientHandler { AllowAutoRedirect = false };
var client = new HttpClient(handler);

However, when using HttpClientRequester, the request will still be redirected.

var requester = new HttpClientRequester(client);
var configuration = Configuration.Default.WithDefaultLoader(requesters: new[] { requester });
var context = BrowsingContext.New(configuration);
var document = await context.OpenAsync("url that will be redirected");
Console.WriteLine(document.Url); // this is a different url which proves that a redirect happened

Creating a Form results in infinite document awaiting from an https page

Bug Report

Prerequisites

  • Can you reproduce the problem in a MWE?
  • Are you running the latest version of AngleSharp?
  • Did you check the FAQs to see if that helps you?
    I haven't found the FAQ
  • Are you reporting to the correct repository? (there are multiple AngleSharp libraries, e.g., AngleSharp.Css for CSS support)
  • Did you perform a search in the issues?

For more information, see the CONTRIBUTING guide.

Description

This is a very strange bug I found. The code

var requester = new DefaultHttpRequester("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0");
var config = new Configuration().WithDefaultLoader().With(requester);
var document = await BrowsingContext.New(config).OpenAsync(urlAddress);

does not work if the Form has been created and urlAddress uses https.

Steps to Reproduce

  1. Download a project from here or:
     • Create a .NET 6 WinForms project
     • Get AngleSharp from NuGet
     • Paste this code
  2. Launch it. The "all ok" message does not appear
  3. Comment out the var form = new Form1(); line and launch again
  4. See the "all ok" message
  5. Uncomment the line and remove "https://" from the link
  6. See the "all ok" message

Expected behavior: [getting either document or exception]

Actual behavior: [infinite loop]

Environment details: [Windows 10, .NET 6]

Add FTP requester

Legacy pages may still have links to resources available on FTP servers. Therefore, the AngleSharp.Io library should provide a requester for this protocol.

Portable AngleSharp.Io?

Right now, it looks like the AngleSharp.Io package targets net45. Are there any plans to make this package (or a subset of its functionality) available on other TFMs?

In particular, I was wondering if the new .NET Platform standard will be supported. The piece that I worked on before (an adapter for HttpClient) should work on netstandard1.1, since this is the earliest version that supports HttpClient.

About scheme

AngleSharp.Io should also come with a special scheme (by default called about) that can be extended in an IDictionary<string, Func<IRequest, Task<IResponse>>> like manner.

Essentially, this offers the possibility to integrate any kind of about-like pages in whatever scenario.

For more on the about scheme see Wikipedia.
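
The dictionary-like extension point described above could be sketched as follows. This is only an illustration under simplifying assumptions: AboutPageRegistry is a hypothetical name that is not part of AngleSharp.Io, and the handlers take the raw address string and return HTML text instead of the IRequest and IResponse types mentioned in the proposal.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var registry = new AboutPageRegistry();
registry.Register("blank", _ => Task.FromResult("<!doctype html><title></title>"));
registry.Register("version", address => Task.FromResult($"<p>Requested {address}</p>"));

Console.WriteLine(registry.Supports("about:blank"));
Console.WriteLine(await registry.RequestAsync("about:version?verbose=1"));

// Hypothetical registry mapping about-page names to handlers.
public sealed class AboutPageRegistry
{
    private const string Prefix = "about:";

    private readonly Dictionary<string, Func<string, Task<string>>> _pages =
        new Dictionary<string, Func<string, Task<string>>>(StringComparer.OrdinalIgnoreCase);

    // Adds or replaces the handler for a page name such as "blank".
    public void Register(string name, Func<string, Task<string>> handler)
    {
        _pages[name] = handler;
    }

    // True if the address uses the about scheme and a handler is registered.
    public bool Supports(string address)
    {
        return TryGetName(address, out var name) && _pages.ContainsKey(name);
    }

    // Dispatches the address to its handler, or throws if none is registered.
    public Task<string> RequestAsync(string address)
    {
        if (TryGetName(address, out var name) && _pages.TryGetValue(name, out var handler))
        {
            return handler(address);
        }

        throw new NotSupportedException($"No about page registered for '{address}'.");
    }

    // Extracts the page name: the part after "about:" and before any query string.
    private static bool TryGetName(string address, out string name)
    {
        name = string.Empty;

        if (!address.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        name = address.Substring(Prefix.Length);
        var queryStart = name.IndexOf('?');

        if (queryStart >= 0)
        {
            name = name.Substring(0, queryStart);
        }

        return true;
    }
}
```

Keeping the lookup case-insensitive is a design choice of this sketch, not something the proposal specifies.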

.NET Core build

We should slowly start to align all (official) AngleSharp extensions. .NET Core should be used as the platform of choice, however, with different targets. Most likely, we add netstandard1.0 (or later) with potential external dependencies (e.g., x-plat filesystem). It is important that all features, e.g., an HTTP client as currently proposed, are available. In the end this will determine the starting target.

A connection attempt failed

Hi Florian,

I scrape a URL using AngleSharp.Io. I do this:
var handler = new HttpClientHandler { Proxy = new WebProxy($"{proxy?.Address}:{proxy?.Port}",false), PreAuthenticate = true, UseDefaultCredentials = false, };
var config = Configuration.Default.WithRequesters(handler).WithTemporaryCookies() .WithDefaultLoader(new LoaderOptions { IsResourceLoadingEnabled = true });

but I get the following error: "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."
I also tried this:
var client = new HttpClient(handler: handler, disposeHandler: true) { Timeout = TimeSpan.FromMinutes(2) };

but I still get the error.

IRequester example

Are there any code samples of using a custom IRequester?

I am trying to create multithreaded process where each thread will use a different proxy. I saw another issue that said this can be done with IRequester but I am not exactly sure how to implement it.

I can see it is a parameter of:

Configuration.Default.WithRequesters();

But I cannot see how I am supposed to initialize it with proxy information.

Thank you.
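
One possible approach, shown only as a sketch: instead of writing a custom IRequester, create one browsing context per proxy by passing a configured HttpClientHandler to WithRequesters, the same call that appears in other issues on this page. The proxy addresses and the CreateContext helper are hypothetical, and the sketch assumes the AngleSharp and AngleSharp.Io packages are referenced.

```csharp
using System;
using System.Net;
using System.Net.Http;
using AngleSharp;
using AngleSharp.Io;

// Hypothetical proxy addresses; replace them with real endpoints.
var proxies = new[] { "http://127.0.0.1:8080", "http://127.0.0.1:8081" };

foreach (var proxy in proxies)
{
    var context = CreateContext(proxy);
    // Each worker would call context.OpenAsync(...) on its own context.
    Console.WriteLine($"Context created for {proxy}: {context != null}");
}

// Builds an independent browsing context whose HTTP traffic uses one proxy.
static IBrowsingContext CreateContext(string proxyAddress)
{
    var handler = new HttpClientHandler
    {
        Proxy = new WebProxy(proxyAddress),
        UseProxy = true,
    };

    var config = Configuration.Default
        .WithRequesters(handler) // from AngleSharp.Io
        .WithDefaultLoader();    // from AngleSharp

    return BrowsingContext.New(config);
}
```

Because every context owns its handler, workers do not share proxy state. Whether this fits a given threading model is left open here.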

How to use both Proxy and Headers in AngleSharp.Io

I am trying to get AngleSharp to use both a proxy and set header properties like this:

        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy(ProxyMan.GetProxy()),
            UseProxy = true,
            PreAuthenticate = true,
            UseDefaultCredentials = false
        };
        var requester = new DefaultHttpRequester();
        requester.Headers["User-Agent"] = Tools.GetAgentString();
        requester.Headers["Accept-Language"] = "en-US";
        requester.Headers["Accept-Charset"] = "ISO-8859-1";
        requester.Headers["Content-Type"] = "text/html; charset=UTF-8";
        var config = Configuration.Default
            .WithRequesters(handler)
            .With(requester)
            .WithTemporaryCookies()
            .WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var doc = await context.OpenAsync(Url);

When I added the Header requester, it stopped the proxy handler from working. I know there is some conflict between the .WithRequesters() and the .With() (or WithRequester() ) because they are separately part of AngleSharp and AngleSharp.Io but I cannot locate the proper syntax for doing both in the same request.

Apologies for the 'bug' label. Should be a help request but I can't figure out how to change that now.

Thanks.

Request for Support / Sponsorship

Over the years this project has had great contributors and sponsors. The last year has shown that dedicated support (e.g., as provided by AWS and JetBrains) is crucial for allocating time to maintain the project and move it forward. I'd like to continue in this mode; not only sometimes cutting out some of my spare time, but actually being able to have dedicated time slots.

So with this sticky note I call to support this project. It would be really wonderful and there are still some plans for a potential v2 that would benefit a lot from additional support (as well as its ecosystem, esp. CSS / JS / ...).

Background

In the past we already had some great sponsors who brought this project forward.

By far the largest contribution came from AWS:

Closely behind (and still active) is the sponsorship from JetBrains:

And other companies:

We also had individuals (much appreciated!) that have been very gracious:

🙏 thanks to everyone!

Also to be clear; AngleSharp and all associated libraries will always be free and MIT licensed.

There is no consequence from no sponsors coming in - except that my time available for the project will definitely be less as compared to having some sponsorship.

One final remark: While GitHub sponsorships are potentially the best way of supporting the project there are also other ways, e.g., getting in touch regarding consulting on AngleSharp best practices if you use it or directly getting in touch regarding development of specific features.

(Original post / issue available at AngleSharp/AngleSharp#1163)

Introduce improved cookie container

The memory cookie provider from the core has way too many problems to be the ultimate solution. Therefore, AngleSharp.Io should start introducing a better version - handling all sorts of cookies and introducing (optional) save-to-disk capabilities.

For the latter the persistent vs. session cookie question is important. To be flexible AngleSharp.Io should work against a file system abstraction that comes with a default implementation. For the moment, this should be as lightweight as possible.

Writing the cookie handling from scratch may be tedious, but should be ultimately rewarding.

Unhandled exception crashes the application

Bug Report

An "unhandled exception" causes the whole application to crash. The configuration includes a requester and also allows JavaScript execution.

var requester = new HttpClientRequester(httpClient);
var config = Configuration.Default.WithRequester(requester).WithDefaultLoader(new LoaderOptions { IsResourceLoadingEnabled = true }).WithJs();

But a network issue causes the app to crash. It looks like the request occurs in an isolated thread and the program can't catch the exception.

Stack Trace:

Unhandled exception. System.Net.Http.HttpRequestException: Error while copying content to a stream.
 ---> System.IO.IOException:  Received an unexpected EOF or 0 bytes from the transport stream.
   at System.Net.Security.SslStream.<FillBufferAsync>g__InternalFillBufferAsync|215_0[TReadAdapter](TReadAdapter adap, ValueTask`1 task, Int32 min, Int32 initial)
   at System.Net.Security.SslStream.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer)
   at System.Net.Http.HttpConnection.FillAsync()
   at System.Net.Http.HttpConnection.CopyToContentLengthAsync(Stream destination, UInt64 length, Int32 bufferSize, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.ContentLengthReadStream.CompleteCopyToAsync(Task copyTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionResponseContent.SerializeToStreamAsync(Stream stream, TransportContext context, CancellationToken cancellationToken)
   at System.Net.Http.HttpContent.LoadIntoBufferAsyncCore(Task serializeToStreamTask, MemoryStream tempBuffer)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpContent.LoadIntoBufferAsyncCore(Task serializeToStreamTask, MemoryStream tempBuffer)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at AngleSharp.Io.Network.HttpClientRequester.PerformRequestAsync(Request request, CancellationToken cancel)
   at AngleSharp.Io.BaseRequester.RequestAsync(Request request, CancellationToken cancel)
   at AngleSharp.Io.BaseLoader.LoadAsync(Request request, CancellationToken cancel)
   at AngleSharp.Browser.DefaultNavigationHandler.NavigateAsync(DocumentRequest request, CancellationToken cancel)
   at AngleSharp.Dom.Document.LocationChanged(Object sender, ChangedEventArgs e)
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
   at System.Threading.QueueUserWorkItemCallback.<>c.<.cctor>b__6_0(QueueUserWorkItemCallback quwi)
   at System.Threading.ExecutionContext.RunForThreadPoolUnsafe[TState](ExecutionContext executionContext, Action`1 callback, TState& state)
   at System.Threading.QueueUserWorkItemCallback.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

Doesn't handle cookies where the Domain starts with a dot.

The AdvancedCookieProvider does not cater for cookies lacking the "www" on the front.

The website I am working with has cookies with the "www.websitename.com" domain and cookies with the ".websitename.com" domain.

On adding the cookies the code currently strips off the "." and when selecting cookies to send on the next request, ignores all these cookies because they don't match the host name.

Should the code be altering the Domain name of cookies??

I think it needs to leave the Domain alone when set, and do a comparison on the basis of the host and the cookie domain excluding a leading "www." or "."
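
The comparison could be sketched along the lines of the domain matching in RFC 6265, where a leading dot in the cookie domain is ignored and the host matches if it equals the domain or is a subdomain of it. This is a different rule from stripping a leading "www.", and it is only an illustration, not the AdvancedCookieProvider implementation. CookieDomain is a hypothetical helper, and the sketch does not treat IP address hosts specially.

```csharp
using System;

Console.WriteLine(CookieDomain.Matches("www.websitename.com", ".websitename.com"));  // True
Console.WriteLine(CookieDomain.Matches("websitename.com", ".websitename.com"));      // True
Console.WriteLine(CookieDomain.Matches("evilwebsitename.com", ".websitename.com"));  // False

// Hypothetical helper; not part of AngleSharp.Io.
public static class CookieDomain
{
    // Returns true if a cookie stored for cookieDomain may be sent to host.
    public static bool Matches(string host, string cookieDomain)
    {
        // Ignore a leading dot for the comparison only; the stored value stays as-is.
        var domain = cookieDomain.TrimStart('.');

        if (domain.Length == 0)
        {
            return false;
        }

        if (host.Equals(domain, StringComparison.OrdinalIgnoreCase))
        {
            return true;
        }

        // A subdomain must end with "." + domain, so that
        // "evilwebsitename.com" does not match "websitename.com".
        return host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }
}
```

Under this rule a cookie set for "www.websitename.com" would not be sent to the bare host "websitename.com", which may or may not be the behavior the site in question expects.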

Support .net standard 1.3 and .net46

New Feature Proposal

Currently this package doesn't support the same .NET Standard versions as the normal AngleSharp package.

Description

It would be nice if the IO package supported the same versions of .NET Framework and .NET Standard as the normal AngleSharp package. The issue seems to have been discussed in #9, but from looking at the csproj it currently only supports Windows NT and .NET Standard 2.0.

Background

Easier framework alignment with AngleSharp

Add File requester

Local webpages are opened using the file:// protocol. The relative links will also follow this scheme requiring a requester to understand the protocol and fetch local resources. Therefore, the AngleSharp.Io library should provide a requester for this protocol.

Binary data restore when opening an image url

Sample code:

var config = Configuration.Default.WithDefaultLoader().WithCookies();
var context = BrowsingContext.New(config);
await context.OpenAsync("http://www.xxx.com/");
var doc = await context.OpenAsync("http://www.xxx.com/abc.jpg");          
var data = doc.Body.TextContent;

When debugging the code, I can see that the binary data of the image has been converted to a UTF-8 string in doc.Body.TextContent. That means the image has been downloaded. However, I can't restore the image's binary data from the UTF-8 string because some data is lost during the conversion.

Is there a way to get the original binary data of the image? For some reason, I can't download the image in the regular way.
