nolanw / HTMLReader
A WHATWG-compliant HTML parser in Objective-C.
License: Other
Can I somehow get an element by its ID with HTMLReader?
If I can, I can't find it in the code.
If you try e.g.
let e: HTMLElement = …
let value = e["key"]
you get
Type 'HTMLElement' has no subscript members
But if you do
extension HTMLElement {
    subscript(attributeName: String) -> String? {
        return objectForKeyedSubscript(attributeName) as? String
    }
}
you get
Subscript getter with Objective-C selector 'objectForKeyedSubscript:' conflicts with method 'objectForKeyedSubscript' with the same Objective-C selector
This works:
extension HTMLElement {
    @nonobjc subscript(attributeName: String) -> String? {
        return objectForKeyedSubscript(attributeName) as? String
    }
}
but I think it shouldn't be necessary? It seems like Swift usually maps -objectForKeyedSubscript: to subscript, so why is it failing here?
How can I install the framework?
I've encountered a problem when I need to do a case-insensitive attribute match, as discussed here:
http://stackoverflow.com/questions/5671238/css-selector-case-insensitive-for-attributes
Shouldn't these be equivalent?
let nodes = doc.firstNodeMatchingSelector("link[rel='shortcut icon']")
and
let nodes = doc.firstNodeMatchingSelector("link[rel='Shortcut icon' i]")
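If the selector parser in the installed version does not accept the `i` flag (an assumption; the post does not confirm it either way), one workaround is to select by tag name and compare the attribute value case-insensitively in code. This is a minimal Objective-C sketch rather than the library's own answer. The helper name FirstLinkWithRel is hypothetical, and the import assumes a framework-style install.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h> // assumes a framework-style install

// Hypothetical helper: returns the first <link> element whose rel attribute
// equals `rel`, ignoring case, or nil when there is no such element.
static HTMLElement *FirstLinkWithRel(HTMLDocument *document, NSString *rel) {
    for (HTMLElement *link in [document nodesMatchingSelector:@"link"]) {
        NSString *value = link.attributes[@"rel"];
        if (value && [value caseInsensitiveCompare:rel] == NSOrderedSame) {
            return link;
        }
    }
    return nil;
}
```

Calling FirstLinkWithRel(doc, @"shortcut icon") would then also match rel="Shortcut Icon".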
I used this library to parse an HTML file, but it didn't work with an ID selector:
NSURL *URL = [NSURL URLWithString:link];
NSURLSession *session = [NSURLSession sharedSession];
[[session dataTaskWithURL:URL completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
    NSString *contentType = nil;
    if ([response isKindOfClass:[NSHTTPURLResponse class]]) {
        NSDictionary *headers = [(NSHTTPURLResponse *)response allHeaderFields];
        contentType = headers[@"Content-Type"];
    }
    HTMLDocument *home = [HTMLDocument documentWithData:data
                                      contentTypeHeader:contentType];
    HTMLElement *div = [home firstNodeMatchingSelector:@"#view"];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSLog(@"%@ - %@", div.tagName, [div.textContent stringByTrimmingCharactersInSet:whitespace]);
}] resume];
The HTMLElement is always nil.
HTMLTreeEnumerator.m:20:17: Method override for the designated initializer of the superclass '-init' not found
HTMLPreprocessedInputStream.m:8:17: Method override for the designated initializer of the superclass '-init' not found
HTMLParser.m:50:17: Method override for the designated initializer of the superclass '-init' not found
HTMLSelector.m:790:17: Method override for the designated initializer of the superclass '-init' not found
HTMLTokenizer.m:2596:17: Method override for the designated initializer of the superclass '-init' not found
These are the warnings I am seeing in Xcode 7.
I've been looking through the source, trying to find a way to get the parsed document back as a string. I am using the library to parse an HTML document, search for elements, and remove and/or replace some attributes so that the result will render within a WKWebView.
I have the changes working (by examining .attributes) but cannot find a way to turn the main document back into a string that I can feed into the web view.
Any insight would be appreciated!
Thanks,
Rob
In your example code you call stringByTrimmingCharactersInSet. For my parsed string it makes no difference whether I trim or not: the string always looks the same, completely without any whitespace characters.
But I would like the opposite: parsed text that keeps its newline characters. Is it possible to get this text?
The node I am looking at is .blockBodyPre.
In Web Inspector the text is shown the same way HTMLReader parses it, but Safari renders it with newlines.
While adding log output to the init and dealloc methods of all the classes,
I found that dealloc is not called in some of the classes, which may indicate memory leaks.
The classes whose dealloc is not called are below.
How can I fix this?
My test code is like this.
#import "ViewController.h"
#import "HTMLDocument.h"
#import "AFHTTPRequestOperationManager.h"
@interface ViewController ()
@property (strong, nonatomic) HTMLDocument *htmlDoc;
@end
@implementation ViewController
- (void)viewDidAppear:(BOOL)animated {
    [super viewDidAppear:animated];
    NSLog(@"viewDidAppear");
    __weak ViewController *weakSelf = self;
    NSURL *URL = [NSURL URLWithString:@"http://www.wired.com"];
    NSURLRequest *request = [NSURLRequest requestWithURL:URL];
    AFHTTPRequestOperation *operation = [[AFHTTPRequestOperation alloc] initWithRequest:request];
    [operation setCompletionBlockWithSuccess:^(AFHTTPRequestOperation *operation, id responseObject) {
        [weakSelf loadComplete:operation];
    } failure:^(AFHTTPRequestOperation *operation, NSError *error) {
        [weakSelf loadFailed:operation];
    }];
    [operation start];
}
- (void)loadComplete:(AFHTTPRequestOperation *)operation {
    NSLog(@"loadComplete");
    NSString *htmlString = [[NSString alloc] initWithData:operation.responseObject encoding:NSUTF8StringEncoding];
    // NSLog(@"htmlString: %@", htmlString);
    self.htmlDoc = [HTMLDocument documentWithString:htmlString];
    NSLog(@"htmlDoc.bodyElement: %@", self.htmlDoc);
    self.htmlDoc = nil;
}
- (void)loadFailed:(AFHTTPRequestOperation *)operation {
    NSLog(@"loadFailed");
    NSLog(@"%@", operation.error);
}
@end
Hi there,
I noticed that if the text in an HTMLTextNode contains comments ("<!-- -->"), the comment is also returned. Is this expected?
Thanks!
Since the headers are in the include/ folder, cocoapods/Xcode is unable to find them without adding include/ to pretty much all of the files.
We discovered this issue in beta testing: we saw consistent crashes when parsing the URL http://buff.ly/1KQAykF
We get the following exception:
Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: 'Attempted to use selector with error: Error Domain=HTMLSelectorErrorDomain Code=1 "Unrecognized pseudo class" UserInfo=0x7fd24389f2d0 {NSLocalizedFailureReason=Error near character 8: Unrecognized pseudo class
h1:first
^ , HTMLSelectorInputString=h1:first, HTMLSelectorLocation=8, NSLocalizedDescription=Unrecognized pseudo class}'
The exception is raised at line #2 below:
#1 HTMLDocument *document = [HTMLDocument documentWithString:markup];
#2 HTMLElement * element = [document firstNodeMatchingSelector:@"h1:first"];
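The exception text says the parser does not recognise `:first`, which is a jQuery extension rather than a standard CSS pseudo-class for elements. Since firstNodeMatchingSelector: already returns only the first match, a plain `h1` selector is likely enough here. A minimal sketch under that assumption follows. The helper name FirstHeading is hypothetical, and the import assumes a framework-style install.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h> // assumes a framework-style install

// Hypothetical helper: returns the first <h1> in the markup, or nil if none.
// No pseudo-class is needed because only the first match is returned anyway.
static HTMLElement *FirstHeading(NSString *markup) {
    HTMLDocument *document = [HTMLDocument documentWithString:markup];
    return [document firstNodeMatchingSelector:@"h1"];
}
```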
We found that self-enclosed tags end up as child tags after loading the HTML into an HTMLDocument* and then retrieving it using either the innerHTML or serializedFragment methods:
NSString* rawHtmlString = @"<html><body><div class=\"self-enclosed\"/><div class=\"also-self-enclosed\"/></body></html>";
HTMLDocument *document = [HTMLDocument documentWithString:rawHtmlString];
NSString* formattedHtmlString= [document innerHTML];
NSLog(@"rawHtmlString: %@", rawHtmlString);
NSLog(@"formattedHtmlString: %@", formattedHtmlString);
Log:
2015-08-04 10:25:56.976 App[2819:40244] rawHtmlString: <html><body><div class="self-enclosed"/><div class="also-self-enclosed"/></body></html>
2015-08-04 10:25:59.343 App[2819:40244] formattedHtmlString: <html><head></head><body><div class="self-enclosed"><div class="also-self-enclosed"></div></div></body></html>
Are we doing something wrong, or does the parser have an issue processing self-enclosed tags?
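The logged output is consistent with the WHATWG parsing rules: outside SVG and MathML content, a trailing `/` on a non-void element such as div is ignored, so `<div/>` is treated as an opening tag and the next div becomes its child. Writing explicit closing tags in the input avoids the nesting. A minimal sketch that checks for the nesting follows. The helper name HasNestedDiv is hypothetical, and the import assumes a framework-style install.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h> // assumes a framework-style install

// Hypothetical helper: YES when the parsed document has a div inside another div.
static BOOL HasNestedDiv(NSString *markup) {
    HTMLDocument *document = [HTMLDocument documentWithString:markup];
    return [document firstNodeMatchingSelector:@"div div"] != nil;
}

// Self-closing syntax: the second div ends up inside the first, as in the log above.
//   HasNestedDiv(@"<div class=\"a\"/><div class=\"b\"/>")
// Explicit closing tags: the two divs stay siblings.
//   HasNestedDiv(@"<div class=\"a\"></div><div class=\"b\"></div>")
```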
From the README, one of the options of installation is:
Copy the files in the Code folder into your project.
I added HTMLReader to a project back in version 0.7 using a git submodule and imported the files directly into it, but now that I have pulled the latest revision, the app won't build anymore because of the #import <HTMLReader/*> imports.
Hi,
I want to extract "some.jpg" from the HTML string below.
What should I do? Thank you.
<div class="box-border" style="height:700px;">
<ul id="Cont01">
<li style="display:block" id="projectContainer"><div class="spacing"></div><center><img src="some.jpg"></center><hr></li>
</ul>
</div>
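One way to approach this, using only calls that appear elsewhere on this page (firstNodeMatchingSelector: and attributes), is to select the img element and read its src attribute. This is a minimal sketch, not a confirmed answer from the maintainer. The helper name FirstImageSource is hypothetical, and the import assumes a framework-style install.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h> // assumes a framework-style install

// Hypothetical helper: returns the src of the first <img> found inside a list
// item, or nil when the markup has no such image.
static NSString *FirstImageSource(NSString *markup) {
    HTMLDocument *document = [HTMLDocument documentWithString:markup];
    HTMLElement *image = [document firstNodeMatchingSelector:@"ul li img"];
    return image.attributes[@"src"];
}
```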
Hi,
This is more of a question than an issue.
How do I use this library to search for a specific div and store the values of its child elements in instances of a model class?
As an example my HTML is a collection of divs as below.
<div class="rc">
<h3 class="r">
<a href="http://someurl">click here</a>
</h3>
</div>
Please help.
When I traverse the DOM tree using the treeEnumerator of HTMLNode, I can see the text node class, HTMLTextNode. However, it is not exposed as public API, so I can't cast to it.
And the textContent of HTMLNode includes too much text, which makes it useless here.
I have a HTML file which contains the following code:
<dd>
Some Description
<dl>...</dl>
</dd>
Including HTMLReader.framework in another Xcode project results in an inability to compile because some HTMLReader headers are not found. Making them public in the HTMLReader project fixes the issue.
Namely, the headers are
HTMLDocumentType.h
HTMLElement.h
HTMLNamespace.h
HTMLQuirksMode.h
Xcode 7.1 and OS X 10.11.1, OS X app.
I used Carthage to install HTMLReader. When I build my test target, this error occurs:
ld: framework not found HTMLReader for architecture x86_64
This error only occurs in the test bundle.
If I install HTMLReader through Git Submodule, everything is OK.
Hi! Please help me. I read the docs but don't understand how to remove some strings. I have some HTML strings in which one part (aHirg7S8Zu0) differs from string to string:
<p><img src="//img.youtube.com/vi/aHirg7S8Zu0/0.jpg" height="505" width="640"></p>
<p> </p>
<h2 style="text-align: center;">Dear parents, I want say you...</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sit sane ista voluptas. Aliter autem vobis placet. Fortemne possumus dicere eundem illum Torquatum? Duo Reges: constructio interrete. Igitur neque stultorum quisquam beatus neque sapientium non beatus.<br>
<p> </p>
How can I delete the first line and all the nbsp paragraphs (the second line)?
1. <p><img src="//img.youtube.com/vi/aHirg7S8Zu0/0.jpg" height="505" width="640"></p>
2. <p> </p>
Thank you very much
Hi!
My HTML contains a custom tag that is not closed, written as <tag> or <tag/>. HTMLReader expects <tag></tag>, so the parsed document is not what I need.
Can I customize the parsing process to add support for my tags?
Thanks!
That was something of its original genesis, was it not?
NSString *page = [NSString stringWithContentsOfFile:@"/Users/jwilliams/sa_index.html" encoding:NSUTF8StringEncoding error:nil];
HTMLDocument *doc = [HTMLDocument documentWithString:page];
HTMLElementNode *html = doc.childNodes[0];
NSLog(@"%@", html.childNodes);
Copy of SA's main page I'm testing with: https://gist.github.com/SASinestro/6314661
How to replace child node in parent node?
Hello,
Our security team has identified potential security concerns in the following files:
HTMLSelector.m:(Line 647)
HTMLNode.m:(Line 167)
Impact:
Most null pointer issues result in general software reliability problems, but if an attacker can intentionally trigger a null pointer dereference, the attacker might be able to use the resulting exception to bypass security logic or to cause the application to reveal debugging information that will be valuable in planning subsequent attacks.
Recommendation:
Implement careful checks before dereferencing objects that might be null. When possible, abstract null checks into wrappers around code that manipulates resources to ensure that they are applied in all cases and to minimize the places where mistakes can occur.
Hi,
In my experiment, parsing this HTML: <td><div style="clear: both;" /><img src="abc.jpg" /></td>
adds the image element as a child of the div instead of the td. Is this a bug?
Thanks for the awesome component!
Cheers,
Joe
No, I was not able to solve it. Probably a one-off issue with that particular article.
.i-AED{background-position:0 0;}.i-ARS{background-position:-16px 0;}
I want to get the background-position property of .i-AED from the CSS file in Objective-C code.
I don't know how to sort this out.
I have already imported the Foundation library.
When using HTMLDocument, Xcode gives this error:
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
This is a feature suggestion.
Since the HTML is already parsed, maybe it would be possible to add a method which strips the HTML but keeps the line breaks?
I'm especially thinking about using this in watchOS 2 projects, where NSAttributedString can't be used to strip HTML, even though that was quite a popular solution.
Hey,
I'm trying to get some elements via the selector (@".class1, .class2"), to get elements that have either class1 or class2. But it doesn't work. I get the following parsing error:
Attempted to use selector with error: Error Domain=HTMLSelectorErrorDomain Code=1 "Expected a
combinator here" UserInfo=0x6100000f2f80 {NSLocalizedFailureReason=Error near character 7:
Expected a combinator here
.class1, .class2
^ , HTMLSelectorInputString=.class1, .class2, HTMLSelectorLocation=7, NSLocalizedDescription=Expected a combinator here}
Is it simply not supported?
best regards,
Joscha
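If the version in use rejects comma-separated selector groups, as the error above suggests, one workaround is to run each selector separately and merge the results. This is a minimal sketch under that assumption. The helper name NodesMatchingAnySelector is hypothetical, and the import assumes a framework-style install.

```objc
#import <Foundation/Foundation.h>
#import <HTMLReader/HTMLReader.h> // assumes a framework-style install

// Hypothetical helper: the union of the nodes matched by each selector.
// An ordered set drops nodes that match more than one selector.
static NSArray *NodesMatchingAnySelector(HTMLDocument *document, NSArray *selectors) {
    NSMutableOrderedSet *matches = [NSMutableOrderedSet orderedSet];
    for (NSString *selector in selectors) {
        [matches addObjectsFromArray:[document nodesMatchingSelector:selector]];
    }
    return matches.array;
}
```

Note that the result is grouped by selector rather than sorted in document order, so callers that depend on document order would need to sort it themselves.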
Should be [HTMLElement]. Do the macros confuse the importer? Since we support SDKs that predate Objective-C generics, we can't just put the generics right in.
I am pretty unfamiliar with HTML scraping, but as far as I can tell the documentation mostly covers images and text.
Does HTMLReader have the ability to scrape video urls such as this one?
<video id="my_video_1_html5_api" class="vjs-tech" preload="auto" src="https://redirector.googlevideo.com/videoplayback?requiressl=yes&id=45d2fdf73f5ea442&itag=22&source=picasa&cmo=secure_transport%3Dyes&ip=0.0.0.0&ipbits=0&expire=1438962730&sparams=requiressl,id,itag,source,ip,ipbits,expire&signature=A1870313E674D7D0FAAA420CB49BAC57C744A158.45144C1E44617AE5405CE7A27517A4B84DDAE50C&key=lh1"></video>
I am getting back some html that has some namespaced elements in it, like:
<some-ns:some-tag />
and I am unable to build a selector that targets it, as it chokes on the ":".
Hi!
I'm trying to create a self-enclosing HTMLElement - the one. Is there any way to do it?
HTMLOrderedDictionary gives warnings about "designated initializers" on constructors from line 46 to 55.
This method declaration in the header file fixes this:
Hi,
I want to parse some HTML. Here is the HTML code:
<p>可愛的兔兔應該是繼汪星人和喵星人之外比較常見的家庭寵物,而在日本就有一隻垂耳兔PuiPui不只本身擁有超萌的高顏值,她的主人也很用心地幫PuiPui準備專屬服裝,並且在Instagram上分享牠的變裝日記,讓PuiPui成為超多粉絲追蹤的人氣時尚潮兔,快一起來認識PuiPui吧!</p>
<figure>
<img src="http://images.900.tw/upload_file/33/content/dc38a115-5732-b707-34d2-83513508a273.jpg" />
<figcaption>哥什麼造型都能消化!超卡哇伊又高顏值的萌兔PuiPui穿搭日記圖集...第12張簡直是撩妹高手呀!</figcaption>
</figure>
<p>▼穿上可愛的學生制服、萌兔PuiPui要在櫻花的目送下上學去啦!</p>
<a href="http://www.styletc.com/wp-content/uploads/2016/05/110.jpg">
<figure>
<img src="http://www.styletc.com/wp-content/uploads/2016/05/110.jpg" />
<figcaption>哥什麼造型都能消化!超卡哇伊又高顏值的萌兔PuiPui穿搭日記圖集...第12張簡直是撩妹高手呀!</figcaption>
</figure>
</a>
<p>▼到了秋天就換上福爾摩斯裝來襯托深沉的秋意!</p>
<figure>
<img src="http://www.styletc.com/wp-content/uploads/2016/05/28.jpg" />
<figcaption>哥什麼造型都能消化!超卡哇伊又高顏值的萌兔PuiPui穿搭日記圖集...第12張簡直是撩妹高手呀!</figcaption>
</figure>
I want to parse the 'p', 'figure img', 'figure figcaption' and 'a' elements, keeping them in their original order.
I don't know how to use HTMLDocument to parse this.
Could you help me?
Thanks
<img src = '...' /> is converted to <img src='...'>,
so XHTML content displays an error.
XHTML content requires tags to be closed.
How can I fix it?
I just cloned the repo and tried to build it with Xcode 8 (8A218a). It fails with the error: Implicit conversion from nullable pointer 'NSURL * _Nullable' to non-nullable pointer type 'NSURL * _Nonnull'
// EncodingLabeler.m
static NSString * const EncodingLabelsURL = @"https://encoding.spec.whatwg.org/encodings.json";
NSData *data = [NSData dataWithContentsOfURL:[NSURL URLWithString:EncodingLabelsURL]];
I noticed that the CLANG_WARN_NULLABLE_TO_NONNULL_CONVERSION option is set to YES in the build settings. Since [NSURL URLWithString:..] returns a nullable result and [NSData dataWithContentsOfURL:...] expects a nonnull argument, the compiler produces a warning. And because Treat Warnings as Errors is also set to YES, the build fails.
I suppressed the warning by adding a temporary variable, and then the build passed.
static NSString * const EncodingLabelsURL = @"https://encoding.spec.whatwg.org/encodings.json";
NSURL *url = [NSURL URLWithString:EncodingLabelsURL];
NSData *data = [NSData dataWithContentsOfURL:url];
Hope it helps.
Hi,
I added HTMLReader through CocoaPods as instructed, but the project isn't compiling. I get the following message:
Undefined symbols for architecture x86_64:
"_OBJC_CLASS_$_HTMLDocument", referenced from:
objc-class-ref in FetchData.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Any idea?
The testEncodingDetection case fails on my system:
Assertions: failed: caught "NSInternalInconsistencyException", "possible error listing test directory: Error Domain=NSCocoaErrorDomain Code=260 "The file “encoding” couldn’t be opened because there is no such file." UserInfo={NSURL=/Users/zoul/Code/HTMLReader/Tests/html5lib/encoding, NSFilePath=/Users/zoul/Code/HTMLReader/Tests/html5lib/encoding, NSUnderlyingError=0x10050bf40 {Error Domain=NSPOSIXErrorDomain Code=2 "No such file or directory"}}"
(
0 CoreFoundation 0x00007fff9dc7dae2 __exceptionPreprocess + 178
1 libobjc.A.dylib 0x00007fff9e162f7e objc_exception_throw + 48
2 CoreFoundation 0x00007fff9dc7d8ba +[NSException raise:format:arguments:] + 106
3 Foundation 0x00007fff92888d4a -[NSAssertionHandler handleFailureInFunction:file:lineNumber:description:] + 169
4 Tests on OS X 0x0000000105275f72 TestFileURLs + 610
5 Tests on OS X 0x00000001052759d6 -[HTMLEncodingTests testEncodingDetection] + 70
6 CoreFoundation 0x00007fff9db4817c __invoking___ + 140
7 CoreFoundation 0x00007fff9db47fce -[NSInvocation invoke] + 286
8 XCTest 0x0000000100022598 __24-[XCTestCase invokeTest]_block_invoke_2 + 159
9 XCTest 0x000000010005602e -[XCTestContext performInScope:] + 184
10 XCTest 0x00000001000224e8 -[XCTestCase invokeTest] + 169
11 XCTest 0x0000000100022983 -[XCTestCase performTest:] + 443
12 XCTest 0x0000000100020654 -[XCTestSuite performTest:] + 377
13 XCTest 0x0000000100020654 -[XCTestSuite performTest:] + 377
14 XCTest 0x0000000100020654 -[XCTestSuite performTest:] + 377
15 XCTest 0x000000010000e892 __25-[XCTestDriver _runSuite]_block_invoke + 51
16 XCTest 0x0000000100033a1b -[XCTestObservationCenter _observeTestExecutionForBlock:] + 611
17 XCTest 0x000000010000e7db -[XCTestDriver _runSuite] + 408
18 XCTest 0x000000010000f38a -[XCTestDriver _checkForTestManager] + 696
19 XCTest 0x000000010005729f _XCTestMain + 628
20 xctest 0x0000000100001dca xctest + 7626
21 libdyld.dylib 0x00007fff903ac5ad start + 1
)
File: HTMLEncodingTests.m:156
Is it simply a symptom of some missing HTML5 testing resources? If so, could we skip the particular tests when the resources are not found?
https://github.com/nolanw/HTMLReader/blob/master/HTMLReader.podspec#L6 says public domain, https://github.com/nolanw/HTMLReader/blob/master/Code/HTMLAttribute.h#L6 says All rights reserved. Which is it?
Hi,
Thank you for your code.
I now use it in my project.
My code is fairly basic.
I have a problem: I can get one textContent (观音山商务区站) with 'NSArray *array = [document nodesMatchingSelector:@"a"];'
but I also want to get the other textContent (1辆开往).
How do I do that?
Please help me.
The HTML looks like this:
<div class="list-bus-station-content float-left"> <a href="/RealtimeQuery?lineId=155&direction=1&station=%E8%A7%82%E9%9F%B3%E5%B1%B1%E5%95%86%E5%8A%A1%E5%8C%BA%E7%AB%99&ordinal=2&">观音山商务区站</a></div>
<div class="list-bus-station-showBus float-left">
<div style="padding-top:8px;"><div class="station-bus-status station-bus-way-l float-left"></div><div class="float-left" style="font-size: 11px;line-height: 12px;"> 1辆开往</div><div class="clear"></div></div>
<div class="clear"></div>
</div>
<div class="float-right list-bus-station-gt">></div>
<div class="clear"></div>
</div>
Will HTMLReader support manipulation of HTML documents in the future, such as inserting or deleting a node?
BTW, thanks for such a brilliant parser, Nolan.
I can't get the right HTML string when the plain source text contains a newline character.
The resulting HTML contains "\n", not a "<br>" or "<p>" tag.
Is it possible to access an HTML element based on its itemprop, as used by schema.org? Thanks!
Is it possible to parse the contents of an HTML page given its URL? Thanks
Hi,
I am trying to parse meta tags using this code:
NSArray *metaNodes = [document nodesMatchingSelector:@"meta"];
I ran the code through this page:
http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html
and it only picked up 31 meta tags when there are clearly 50+.
Hi
I'm trying to parse some HTML document to get two texts from tags:
"Some text to display.image_name_to_display.jpg"
so I use this code :
HTMLDocument *document = [HTMLDocument documentWithString:self.content]; //content is html above
NSString *handAndImageStr = [document firstNodeMatchingSelector:@"hand"].textContent;
if (handAndImageStr) {
NSString *imgStr = [document firstNodeMatchingSelector:@"image"].textContent;
and then imgStr is null instead of "image_name_to_display.jpg"
I'm using HTMLReader 0.9.4