nolanw / htmlreader

A WHATWG-compliant HTML parser in Objective-C.

License: Other

Objective-C 98.22% Ruby 0.13% Swift 0.26% Python 0.79% C 0.61%

htmlreader's Issues

Getting an element by ID

Can I somehow get an element by its ID with HTMLReader?
If I can, I can't find it in the code.
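
One possible approach, sketched under the assumption that the selector engine accepts CSS ID selectors of the form `#some-id`: build the selector from the ID and ask the document for the first match. `ElementWithID` is a hypothetical helper name, not part of HTMLReader.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper (not part of HTMLReader): returns the first element
// whose id attribute equals `identifier`, or nil when nothing matches.
static HTMLElement *ElementWithID(HTMLDocument *document, NSString *identifier) {
    // Assumption: `identifier` is a plain CSS identifier (letters, digits,
    // '-' or '_', not starting with a digit), so it needs no escaping.
    NSString *selector = [@"#" stringByAppendingString:identifier];
    return [document firstNodeMatchingSelector:selector];
}
```

Note that an unparseable selector raises an exception rather than returning nil, so IDs containing unusual characters would need escaping first.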

HTMLElement's `objectForKeyedSubscript:` is not imported into Swift as `subscript`

If you try e.g.

let e: HTMLElement = …
let value = e["key"]

you get

Type 'HTMLElement' has no subscript members

But if you do

extension HTMLElement {
    subscript(attributeName: String) -> String? {
        return objectForKeyedSubscript(attributeName) as? String
    }
}

you get

Subscript getter with Objective-C selector 'objectForKeyedSubscript:' conflicts with method 'objectForKeyedSubscript' with the same Objective-C selector

This works:

extension HTMLElement {
    @nonobjc subscript(attributeName: String) -> String? {
        return objectForKeyedSubscript(attributeName) as? String
    }
}

but I think it shouldn't be necessary? It seems like Swift usually maps -objectForKeyedSubscript: to subscript, so why is it failing here?

<HTMLReader/HTMLReader.h> file not found

How can I install the framework?

case-insensitive attribute option

I've run into a problem: I need to do a case-insensitive attribute match, as discussed here:
http://stackoverflow.com/questions/5671238/css-selector-case-insensitive-for-attributes

Shouldn't these be equivalent?

let nodes = doc.firstNodeMatchingSelector("link[rel='shortcut icon']")

and

let nodes = doc.firstNodeMatchingSelector("link[rel='Shortcut icon' i]")

iOS 9 / Xcode 7 compatibility issue

HTMLTreeEnumerator.m:20:17: Method override for the designated initializer of the superclass '-init' not found

HTMLPreprocessedInputStream.m:8:17: Method override for the designated initializer of the superclass '-init' not found

HTMLParser.m:50:17: Method override for the designated initializer of the superclass '-init' not found

HTMLSelector.m:790:17: Method override for the designated initializer of the superclass '-init' not found

HTMLTokenizer.m:2596:17: Method override for the designated initializer of the superclass '-init' not found

These are the warnings I am seeing in Xcode 7.

Parsed Document back to String

I've been looking through the source, trying to find a way to get the parsed document back as a string. I am trying to use the library to parse an HTML document, search for elements, and remove and/or replace some attributes so that the result will render within a WKWebView.

I have the changes working (by examining .attributes) but cannot find a way to turn the main document back into a string that I can then feed into the web view.

Any insight would be appreciated!

Thanks,
Rob
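
A minimal sketch of one way this might be done, assuming the `innerHTML` property that another report in this list calls on an HTMLDocument, plus `-removeAttributeWithName:` on HTMLElement. `HTMLByRemovingAttribute` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: parses `html`, strips `attributeName` from every
// element, and serializes the modified tree back into a string.
static NSString *HTMLByRemovingAttribute(NSString *html, NSString *attributeName) {
    HTMLDocument *document = [HTMLDocument documentWithString:html];
    // Walk every node in the tree; only elements carry attributes.
    for (HTMLNode *node in [document treeEnumerator]) {
        if ([node isKindOfClass:[HTMLElement class]]) {
            [(HTMLElement *)node removeAttributeWithName:attributeName];
        }
    }
    // innerHTML on the document serializes all of its children.
    return document.innerHTML;
}
```

The returned string could then be handed to the web view. The parser adds any missing html, head, and body elements, so the output may contain more markup than the input.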

Whitespace issue

In your example code you call stringByTrimmingCharactersInSet. For my parsed string it makes no difference whether I trim or not: the string always looks the same, with no whitespace characters at all.

I would like the opposite: parsed text that keeps its newline characters. Is it possible to get this text?

Here is the page: http://www.dwd.de/bvbw/appmanager/bvbw/dwdwwwDesktop?_nfpb=true&_state=maximized&_windowLabel=T43402027281174298183241&T43402027281174298183241gsbDocumentPath=Navigation%2FSchifffahrt%2FSeewetter%2FNavtex__Empfang__518__Emden__node.html%3F__nnn%3Dtrue&_pageLabel=_dwdwww_spezielle_nutzer_schiffffahrt_seewetter&switchLang=de

The node I am looking at is .blockBodyPre.

In Web Inspector the text looks the same as what HTMLReader returns, but Safari renders it with newlines.

Memory leaks on ARC

After adding log output to the init and dealloc methods of all the classes,
I found that dealloc is never called for some of them, which may indicate memory leaks.

The classes whose dealloc is not called are listed below.

HTMLDocument
HTMLDocumentType
HTMLComment
HTMLElement
HTMLMarker
HTMLTextNode

How can I fix this?

My test code looks like this.

#import "ViewController.h"
#import "HTMLDocument.h"
#import "AFHTTPRequestOperationManager.h"

@interface ViewController ()
@property (strong, nonatomic) HTMLDocument *htmlDoc;
@end

@implementation ViewController

- (void)viewDidAppear:(BOOL)animated {
    [super viewDidAppear:animated];
    NSLog(@"viewDidAppear");

    __weak ViewController *weakSelf = self;
    NSURL *URL = [NSURL URLWithString:@"http://www.wired.com"];
    NSURLRequest *request = [NSURLRequest requestWithURL:URL];
    AFHTTPRequestOperation *operation = [[AFHTTPRequestOperation alloc] initWithRequest:request];
    [operation setCompletionBlockWithSuccess:^(AFHTTPRequestOperation *operation, id responseObject) {
        [weakSelf loadComplete:operation];
    } failure:^(AFHTTPRequestOperation *operation, NSError *error) {
        [weakSelf loadFailed:operation];
    }];
    [operation start];
}

- (void)loadComplete:(AFHTTPRequestOperation *)operation {
    NSLog(@"loadComplete");
    NSString *htmlString = [[NSString alloc] initWithData:operation.responseObject encoding:NSUTF8StringEncoding];
//    NSLog(@"htmlString: %@", htmlString);
    self.htmlDoc = [HTMLDocument documentWithString:htmlString];
    NSLog(@"htmlDoc.bodyElement: %@", self.htmlDoc);
    self.htmlDoc = nil;
}

- (void)loadFailed:(AFHTTPRequestOperation *)operation {
    NSLog(@"loadFailed");
    NSLog(@"%@", operation.error);
}

@end

HTMLTextNode plaintext with comments

Hi there,
I noticed that if the text in an HTMLTextNode contains a comment ("<!-- ... -->"), the comment is also returned. Is this expected?

Thanks!

Broken pod

Since the headers are in the include/ folder, cocoapods/Xcode is unable to find them without adding include/ to pretty much all of the files.

HTMLParser will crash in initialParser.changeEncoding block if correctedString is nil

We discovered this issue in beta testing: we saw consistent crashes when parsing the URL http://buff.ly/1KQAykF

Failed with selector "h1:first" on webpage http://en.m.wikipedia.org/wiki/Katy_Perry

I'm getting the following exception:

Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: 'Attempted to use selector with error: Error Domain=HTMLSelectorErrorDomain Code=1 "Unrecognized pseudo class" UserInfo=0x7fd24389f2d0 {NSLocalizedFailureReason=Error near character 8: Unrecognized pseudo class

h1:first
^        , HTMLSelectorInputString=h1:first, HTMLSelectorLocation=8, NSLocalizedDescription=Unrecognized pseudo class}'

It is raised at line #2 below:
#1 HTMLDocument *document = [HTMLDocument documentWithString:markup];
#2 HTMLElement * element = [document firstNodeMatchingSelector:@"h1:first"];

self-enclosed tags incorrectly end up as child tags after parsing

We found that self-enclosed tags end up as child tags after loading the html into an HTMLDocument* then retrieving using either innerHTML or serializedFragment methods:

NSString* rawHtmlString = @"<html><body><div class=\"self-enclosed\"/><div class=\"also-self-enclosed\"/></body></html>";
HTMLDocument *document = [HTMLDocument documentWithString:rawHtmlString];
NSString* formattedHtmlString= [document innerHTML];
NSLog(@"rawHtmlString: %@", rawHtmlString);
NSLog(@"formattedHtmlString: %@", formattedHtmlString);

Log:
2015-08-04 10:25:56.976 App[2819:40244] rawHtmlString: <html><body><div class="self-enclosed"/><div class="also-self-enclosed"/></body></html>
2015-08-04 10:25:59.343 App[2819:40244] formattedHtmlString: <html><head></head><body><div class="self-enclosed"><div class="also-self-enclosed"></div></div></body></html>

Are we doing something wrong, or does the parser have an issue processing self-enclosed tags?

Importing files directly on project doesn't work anymore

From the README, one of the options of installation is:

Copy the files in the Code folder into your project.

I've added HTMLReader to a project back in version 0.7 using git submodule and imported the files directly into it, but now when I pulled the latest revision, the app won't build anymore because of the #import <HTMLReader/*> imports.

[closed]Analyse src content in html str

Hi,
I want to extract "some.jpg" from the HTML string below.
What should I do? Thank you.

<div class="box-border" style="height:700px;">
<ul id="Cont01">
<li style="display:block" id="projectContainer"><div class="spacing"></div><center><img src="some.jpg"></center><hr></li>
</ul>
</div>
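
A minimal sketch of one way to read that value: select the img element and look up its src attribute. `FirstImageSource` is a hypothetical helper name, and the selector assumes the image sits inside the element with ID projectContainer, as in the snippet above.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: returns the src attribute of the first <img> inside
// #projectContainer, or nil if there is no such image or attribute.
static NSString *FirstImageSource(NSString *html) {
    HTMLDocument *document = [HTMLDocument documentWithString:html];
    HTMLElement *image = [document firstNodeMatchingSelector:@"#projectContainer img"];
    // Messaging nil is safe in Objective-C, so a missing image yields nil.
    return image.attributes[@"src"];
}
```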

Parsing DOM

Hi,
This is more of a question than an issue.
How do I use this library to search for a specific div and store its child elements' values in model class instances?

As an example my HTML is a collection of divs as below.

<div class="rc">
<h3 class="r">
<a href="http://someurl">click here</a>
</h3>
</div>

Please help.
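
One way this could look, as a sketch only: match the links with a descendant selector and copy each link's text and href into a model object. `SearchResult` and `SearchResultsFromHTML` are hypothetical names made up for this example.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical model class holding the values taken from one div.rc block.
@interface SearchResult : NSObject
@property (copy, nonatomic) NSString *title;
@property (copy, nonatomic) NSString *href;
@end

@implementation SearchResult
@end

// Hypothetical helper: builds one SearchResult per link found at
// div.rc > ... > h3.r > ... > a (descendant combinators, so nesting may vary).
static NSArray *SearchResultsFromHTML(NSString *html) {
    HTMLDocument *document = [HTMLDocument documentWithString:html];
    NSMutableArray *results = [NSMutableArray array];
    for (HTMLElement *link in [document nodesMatchingSelector:@"div.rc h3.r a"]) {
        SearchResult *result = [SearchResult new];
        result.title = link.textContent;         // text inside the <a>
        result.href = link.attributes[@"href"];  // target URL
        [results addObject:result];
    }
    return results;
}
```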

Is there a way to grab the text in an element?

When I traverse the DOM tree using HTMLNode's treeEnumerator, I can see the text node, an HTMLTextNode. However, it is not exposed as public API, so I can't cast to it.

And the textContent of HTMLNode includes too much text, which makes it useless here.

I have a HTML file which contains the following code:

<dd>
  Some Description
  <dl>...</dl>
</dd>
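
A sketch of one possible approach, assuming a release in which HTMLTextNode and its `data` property are visible through the public headers (this report suggests that was not yet the case when it was filed): collect only the element's direct text children and ignore nested elements. `DirectTextOfElement` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: concatenates the text nodes that are direct children
// of `element`, skipping text that belongs to nested elements.
static NSString *DirectTextOfElement(HTMLElement *element) {
    NSMutableString *text = [NSMutableString string];
    for (HTMLNode *child in element.children) {
        // Assumption: HTMLTextNode is public and exposes its text as `data`.
        if ([child isKindOfClass:[HTMLTextNode class]]) {
            [text appendString:((HTMLTextNode *)child).data];
        }
    }
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    return [text stringByTrimmingCharactersInSet:whitespace];
}
```

For the dd element above, this would leave out everything inside the nested dl.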

Xcode Framework - needs headers to build

Including HTMLReader.framework in another Xcode project makes it fail to compile because some HTMLReader headers are not found. Making them public in the HTMLReader project fixes the issue.

Namely, the headers are
HTMLDocumentType.h
HTMLElement.h
HTMLNamespace.h
HTMLQuirksMode.h

Build failed when in testing of my OS X App.

Xcode 7.1 and OS X 10.11.1, OS X app.

I used Carthage to install HTMLReader. When I build my test target, this error occurs:

ld: framework not found HTMLReader for architecture x86_64

This error only occurs in the test bundle.

If I install HTMLReader through Git Submodule, everything is OK.

Remove some strings

Hi! Please help me. I read the docs but don't understand how to remove some strings. I have some HTML strings in which one part (aHirg7S8Zu0) varies:

<p><img src="//img.youtube.com/vi/aHirg7S8Zu0/0.jpg" height="505" width="640"></p>
<p>&nbsp;</p>
<h2 style="text-align: center;">Dear parents, I want say you...</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sit sane ista voluptas. Aliter autem vobis placet. Fortemne possumus dicere eundem illum Torquatum? Duo Reges: constructio interrete. Igitur neque stultorum quisquam beatus neque sapientium non beatus.<br>
<p>&nbsp;</p>

How can I delete the first line and every &nbsp; paragraph (the second line)?

1. <p><img src="//img.youtube.com/vi/aHirg7S8Zu0/0.jpg" height="505" width="640"></p> 
2. <p>&nbsp;</p>

Thank you very much
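
A rough sketch of one way to drop those paragraphs, assuming a release that lets you detach nodes with `-removeFromParentNode`. `HTMLByRemovingImageAndBlankParagraphs` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: removes every <p> that contains an <img> or that holds
// only whitespace (including non-breaking spaces), then returns the body HTML.
static NSString *HTMLByRemovingImageAndBlankParagraphs(NSString *html) {
    HTMLDocument *document = [HTMLDocument documentWithString:html];

    // Whitespace, newlines, and (explicitly) U+00A0, which &nbsp; decodes to.
    NSMutableCharacterSet *blank = [NSMutableCharacterSet whitespaceAndNewlineCharacterSet];
    [blank addCharactersInString:@"\u00A0"];

    // nodesMatchingSelector: returns an array, so detaching nodes while
    // looping over it does not disturb the iteration.
    for (HTMLElement *paragraph in [document nodesMatchingSelector:@"p"]) {
        BOOL hasImage = [paragraph firstNodeMatchingSelector:@"img"] != nil;
        NSString *trimmed = [paragraph.textContent stringByTrimmingCharactersInSet:blank];
        if (hasImage || trimmed.length == 0) {
            [paragraph removeFromParentNode];
        }
    }
    return [document firstNodeMatchingSelector:@"body"].innerHTML;
}
```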

Custom tags

Hi!

My HTML contains custom unclosed tags, such as <tag> or <tag/>. HTMLReader expects <tag></tag>, so the parsed document is not what I need.
Can I customize the parsing process to add support for my tags?

Thanks!

HTMLReader fails to find any children for SA's body

That was something of its original genesis, was it not?

NSString *page = [NSString stringWithContentsOfFile:@"/Users/jwilliams/sa_index.html" encoding:NSUTF8StringEncoding error:nil];
HTMLDocument *doc = [HTMLDocument documentWithString:page];
HTMLElementNode *html = doc.childNodes[0];
NSLog(@"%@", html.childNodes);

Copy of SA's main page I'm testing with: https://gist.github.com/SASinestro/6314661

Replace child node

How do I replace a child node in its parent node?

Parse attributed HTML?

How do I parse the attributes of an HTML element, such as name and value?
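
A minimal sketch, assuming the `attributes` dictionary on HTMLElement that other reports in this list mention: look up the element with a selector and read its attributes by name. `AttributesOfFirstMatch` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: returns the attribute dictionary (name -> value) of
// the first element matching `selector`, or nil when nothing matches.
static NSDictionary *AttributesOfFirstMatch(NSString *html, NSString *selector) {
    HTMLDocument *document = [HTMLDocument documentWithString:html];
    HTMLElement *element = [document firstNodeMatchingSelector:selector];
    return element.attributes;
}

// Example use: log every attribute of the first <input>.
static void LogInputAttributes(NSString *html) {
    NSDictionary *attributes = AttributesOfFirstMatch(html, @"input");
    for (NSString *name in attributes) {
        NSLog(@"%@ = %@", name, attributes[name]);
    }
}
```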

Potential for Null Dereference

Hello,
Our security team has identified potential security concerns in the following files:

HTMLSelector.m:(Line 647)
HTMLNode.m:(Line 167)

Impact:
Most null pointer issues result in general software reliability problems, but if an attacker can intentionally trigger a null pointer dereference, the attacker might be able to use the resulting exception to bypass security logic or to cause the application to reveal debugging information that will be valuable in planning subsequent attacks.

Recommendation:
Implement careful checks before dereferencing objects that might be null. When possible, abstract null checks into wrappers around code that manipulates resources to ensure that they are applied in all cases and to minimize the places where mistakes can occur.

Elements after self-closing tag will be added as child of previous tag

Hi,

In my experiment, this html <td><div style="clear: both;" /><img src="abc.jpg" /></td> will add image element as child of the div instead of td. Is this a bug?

Thanks for the awesome component!

Cheers,
Joe

issues

No I was not able to solve it. Probably a one-off issue with that
particular article

How to get a CSS file's style from Objective-C? Is there a demo project?

.i-AED{background-position:0 0;}.i-ARS{background-position:-16px 0;}

I want to get the background-position property of .i-AED from the CSS file in Objective-C code.

Mach-o error

I don't know how to fix this.

I have already imported the Foundation library.

When I use HTMLDocument, Xcode gives this error:

ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Strip HTML

This is a feature suggestion.

Since the HTML is already parsed, maybe it would be possible to add a method which strips the HTML but keeps the line breaks?

I'm especially thinking of watchOS 2 projects, where NSAttributedString can't be used to strip HTML, which was quite a popular solution.

Selector with or

Hey,

I'm trying to get some elements via the selector (@".class1, .class2"), to get elements that have either class1 or class2. But it doesn't work. I get the following parsing error:

Attempted to use selector with error: Error Domain=HTMLSelectorErrorDomain Code=1 "Expected a 
combinator here" UserInfo=0x6100000f2f80 {NSLocalizedFailureReason=Error near character 7:
Expected a combinator here

.class1, .class2
^       , HTMLSelectorInputString=.class1, .class2, HTMLSelectorLocation=7, NSLocalizedDescription=Expected a combinator here}

Is it simply not supported?

best regards,
Joscha
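
A possible workaround while a comma-separated selector is rejected, sketched here as an assumption rather than the library's intended answer: run each selector on its own and merge the results, dropping duplicates. `NodesMatchingAnySelector` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: returns every element under `root` that matches at
// least one of `selectors`. An element matching several selectors appears once.
static NSArray *NodesMatchingAnySelector(HTMLNode *root, NSArray *selectors) {
    // An ordered set keeps first-seen order and ignores repeated objects.
    NSMutableOrderedSet *matches = [NSMutableOrderedSet orderedSet];
    for (NSString *selector in selectors) {
        [matches addObjectsFromArray:[root nodesMatchingSelector:selector]];
    }
    return matches.array;
}
```

The results come back grouped by selector rather than in document order, so this is not a drop-in replacement for a real selector group.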

HTMLNode's `-nodesMatchingSelector:` return type is imported into Swift as `[AnyObject]`

Should be [HTMLElement]. Do the macros confuse the importer? Since we support SDKs that predate Objective-C generics, we can't just put the generics right in.

Possible to access video id?

I am pretty unfamiliar with HTML scraping but as far as the documentation goes, it covers mostly about imgs and text.

Does HTMLReader have the ability to scrape video urls such as this one?

 <video id="my_video_1_html5_api" class="vjs-tech" preload="auto" src="https://redirector.googlevideo.com/videoplayback?requiressl=yes&amp;id=45d2fdf73f5ea442&amp;itag=22&amp;source=picasa&amp;cmo=secure_transport%3Dyes&amp;ip=0.0.0.0&amp;ipbits=0&amp;expire=1438962730&amp;sparams=requiressl,id,itag,source,ip,ipbits,expire&amp;signature=A1870313E674D7D0FAAA420CB49BAC57C744A158.45144C1E44617AE5405CE7A27517A4B84DDAE50C&amp;key=lh1"></video>

Handle tagName with a : in it

I am getting back some html that has some namespaced elements in it, like:

<some-ns:some-tag /> and I am unable to build a selector that targets it, as it chokes on the :.

Creating self-enclosing HTMLElement

Hi!
I'm trying to create a self-enclosing HTMLElement - the one. Is there any way to do it?

Warnings on HTMLOrderedDictionary

HTMLOrderedDictionary gives warnings about "designated initializers" on the constructors from line 46 to 55.

This method declaration in the header file fixes this:

- (instancetype)initWithCapacity:(NSUInteger)numItems NS_DESIGNATED_INITIALIZER;

parse html..

I want to parse some HTML. Here is the HTML code:

<p>可愛的兔兔應該是繼汪星人和喵星人之外比較常見的家庭寵物，而在日本就有一隻垂耳兔PuiPui不只本身擁有超萌的高顏值，她的主人也很用心地幫PuiPui準備專屬服裝，並且在Instagram上分享牠的變裝日記，讓PuiPui成為超多粉絲追蹤的人氣時尚潮兔，快一起來認識PuiPui吧！</p>

<figure>
  <img src="http://images.900.tw/upload_file/33/content/dc38a115-5732-b707-34d2-83513508a273.jpg" />
  <figcaption>哥什麼造型都能消化！超卡哇伊又高顏值的萌兔PuiPui穿搭日記圖集...第12張簡直是撩妹高手呀！</figcaption>
</figure>


<p>▼穿上可愛的學生制服、萌兔PuiPui要在櫻花的目送下上學去啦！</p>
<a href="http://www.styletc.com/wp-content/uploads/2016/05/110.jpg">
<figure>
  <img src="http://www.styletc.com/wp-content/uploads/2016/05/110.jpg" />
  <figcaption>哥什麼造型都能消化！超卡哇伊又高顏值的萌兔PuiPui穿搭日記圖集...第12張簡直是撩妹高手呀！</figcaption>
</figure>
</a>

<p>▼到了秋天就換上福爾摩斯裝來襯托深沉的秋意！</p>

<figure>
  <img src="http://www.styletc.com/wp-content/uploads/2016/05/28.jpg" />
  <figcaption>哥什麼造型都能消化！超卡哇伊又高顏值的萌兔PuiPui穿搭日記圖集...第12張簡直是撩妹高手呀！</figcaption>
</figure>

I want to parse the 'p', 'figure.img', 'figure.figcaption', and 'a' elements, in their default order.
I don't know how to use HTMLDocument to parse this.

Could you help me?
Thanks

innerHTML in the xhtml

<img src = '...' /> is converted to <img src='...'>

so XHTML content displays incorrectly.

XHTML content requires tags to be closed.

How can I fix this?

Nullable to nonnull warning

I just cloned the repo and tried to build with Xcode 8 (8A218a). It fails with the error: Implicit conversion from nullable pointer 'NSURL * _Nullable' to non-nullable pointer type 'NSURL * _Nonnull'

// EncodingLabeler.m
static NSString * const EncodingLabelsURL = @"https://encoding.spec.whatwg.org/encodings.json";
NSData *data = [NSData dataWithContentsOfURL:[NSURL URLWithString:EncodingLabelsURL]];

I noticed that the CLANG_WARN_NULLABLE_TO_NONNULL_CONVERSION option is set to YES in the build settings. Since [NSURL URLWithString:..] returns a nullable result and [NSData dataWithContentsOfURL:...] expects a nonnull argument, this produces a warning. And because Treat Warnings as Errors is also set to YES, the build fails.

I suppressed the warning by adding a temporary variable then the build passed.

static NSString * const EncodingLabelsURL = @"https://encoding.spec.whatwg.org/encodings.json";
NSURL *url = [NSURL URLWithString:EncodingLabelsURL];
NSData *data = [NSData dataWithContentsOfURL:url];

Hope it helps.

Build failed

Hi,

I added HTMLReader through pod as instructed, but the project isn't compiling, I get the following message:

Undefined symbols for architecture x86_64:
  "_OBJC_CLASS_$_HTMLDocument", referenced from:
      objc-class-ref in FetchData.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Any ideas?

The test case testEncodingDetection fails, missing HTML5 test cases?

The testEncodingDetection case fails on my system:

Assertions: failed: caught "NSInternalInconsistencyException", "possible error listing test directory: Error Domain=NSCocoaErrorDomain Code=260 "The file “encoding” couldn’t be opened because there is no such file." UserInfo={NSURL=/Users/zoul/Code/HTMLReader/Tests/html5lib/encoding, NSFilePath=/Users/zoul/Code/HTMLReader/Tests/html5lib/encoding, NSUnderlyingError=0x10050bf40 {Error Domain=NSPOSIXErrorDomain Code=2 "No such file or directory"}}"
(
    0   CoreFoundation                      0x00007fff9dc7dae2 __exceptionPreprocess + 178
    1   libobjc.A.dylib                     0x00007fff9e162f7e objc_exception_throw + 48
    2   CoreFoundation                      0x00007fff9dc7d8ba +[NSException raise:format:arguments:] + 106
    3   Foundation                          0x00007fff92888d4a -[NSAssertionHandler handleFailureInFunction:file:lineNumber:description:] + 169
    4   Tests on OS X                       0x0000000105275f72 TestFileURLs + 610
    5   Tests on OS X                       0x00000001052759d6 -[HTMLEncodingTests testEncodingDetection] + 70
    6   CoreFoundation                      0x00007fff9db4817c __invoking___ + 140
    7   CoreFoundation                      0x00007fff9db47fce -[NSInvocation invoke] + 286
    8   XCTest                              0x0000000100022598 __24-[XCTestCase invokeTest]_block_invoke_2 + 159
    9   XCTest                              0x000000010005602e -[XCTestContext performInScope:] + 184
    10  XCTest                              0x00000001000224e8 -[XCTestCase invokeTest] + 169
    11  XCTest                              0x0000000100022983 -[XCTestCase performTest:] + 443
    12  XCTest                              0x0000000100020654 -[XCTestSuite performTest:] + 377
    13  XCTest                              0x0000000100020654 -[XCTestSuite performTest:] + 377
    14  XCTest                              0x0000000100020654 -[XCTestSuite performTest:] + 377
    15  XCTest                              0x000000010000e892 __25-[XCTestDriver _runSuite]_block_invoke + 51
    16  XCTest                              0x0000000100033a1b -[XCTestObservationCenter _observeTestExecutionForBlock:] + 611
    17  XCTest                              0x000000010000e7db -[XCTestDriver _runSuite] + 408
    18  XCTest                              0x000000010000f38a -[XCTestDriver _checkForTestManager] + 696
    19  XCTest                              0x000000010005729f _XCTestMain + 628
    20  xctest                              0x0000000100001dca xctest + 7626
    21  libdyld.dylib                       0x00007fff903ac5ad start + 1
)
  File: HTMLEncodingTests.m:156

Is it simply a symptom of some missing HTML5 testing resources? If so, could we skip the particular tests when the resources are not found?

License

https://github.com/nolanw/HTMLReader/blob/master/HTMLReader.podspec#L6 says public domain, https://github.com/nolanw/HTMLReader/blob/master/Code/HTMLAttribute.h#L6 says All rights reserved. Which is it?

With textContent .

Hi,
Thank you for your code; I'm now using it in my project.
I'm a beginner, so my code is fairly basic.
I have a problem: I can get the textContent (观音山商务区站) with 'NSArray *array = [document nodesMatchingSelector:@"a"];',
but I also want to get another textContent: (1辆开往).
How do I do that?
Please help.
The HTML looks like this:

        <div class="list-bus-station-content float-left">&nbsp;<a href="/RealtimeQuery?lineId=155&amp;direction=1&amp;station=%E8%A7%82%E9%9F%B3%E5%B1%B1%E5%95%86%E5%8A%A1%E5%8C%BA%E7%AB%99&amp;ordinal=2&amp;">观音山商务区站</a></div>
        <div class="list-bus-station-showBus float-left">
                <div style="padding-top:8px;"><div class="station-bus-status station-bus-way-l float-left"></div><div class="float-left" style="font-size: 11px;line-height: 12px;">&nbsp;1辆开往</div><div class="clear"></div></div>
             <div class="clear"></div>
        </div>
        <div class="float-right list-bus-station-gt">&gt;</div>
        <div class="clear"></div>
    </div>

Ability to insert/remove a node?

Will HTMLReader support manipulation of HTML documents in the future, such as inserting or deleting a node?

BTW, thanks for such brilliant parser, Nolan.

String to HTML is not complete

I can't get the right HTML string when the plain source text contains a newline character.

The resulting HTML contains "\n", not a "<br>" or "<p>" tag.

Access element via item prop

Is it possible to access an HTML element based on its itemprop, as used by schema.org? Thanks!
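
One way this might work, as a sketch: itemprop is an ordinary attribute, so an attribute selector in the same `tag[attr='value']` form used elsewhere in this list should be able to find it. `TextForItemprop` is a hypothetical helper name, and the tag name is passed in to keep the selector in that form.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: returns the text of the first `tagName` element whose
// itemprop attribute equals `itemprop`, or nil when nothing matches.
static NSString *TextForItemprop(HTMLDocument *document, NSString *tagName, NSString *itemprop) {
    // Assumption: `itemprop` contains no single quotes, so it can be placed
    // inside the quoted attribute value without escaping.
    NSString *selector = [NSString stringWithFormat:@"%@[itemprop='%@']", tagName, itemprop];
    return [document firstNodeMatchingSelector:selector].textContent;
}
```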

Parse contents of a web page

Is it possible to parse the contents of an HTML page based on its URL? Thanks

not detecting complete set of meta tags

Hi,

I am trying to parse meta tags using this code:

NSArray *metaNodes = [document nodesMatchingSelector:@"meta"];

I ran the code through this page:
http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html

and it only picked up 31 meta tags when there are clearly 50+

firstNodeMatchingSelector return nil when looking for node which exist

Hi
I'm trying to parse some HTML document to get two texts from tags:
"Some text to display.image_name_to_display.jpg"
so I use this code :
HTMLDocument *document = [HTMLDocument documentWithString:self.content]; // content is the HTML above
NSString *handAndImageStr = [document firstNodeMatchingSelector:@"hand"].textContent;
if (handAndImageStr) {
    NSString *imgStr = [document firstNodeMatchingSelector:@"image"].textContent;
}

and then imgStr is null instead of "image_name_to_display.jpg"

I'm using HTMLReader 0.9.4
