kostya / myhtml
Fast HTML5 Parser with css selectors for Crystal language
License: MIT License
Can it be replaced with the latest release version?
wget https://github.com/lexborisov/myhtml/archive/caa7d711847c02db8c4dd24855fb1caa8b728a9a.tar.gz
make: wget: No such file or directory
make: *** [myhtml-c] Error 1
Subj.
page = HTTP::Client.get "http://something.com"
anu = page.body
myhtml = Myhtml::Parser.new(anu)
links = myhtml.css(".mycssselector").map(&.attribute_by("href")).to_a
Hi, I'm a newbie in the Crystal language. I'm trying to use your package, but when I run crystal deps it returns this error:
crystal deps
Updating https://github.com/kemalcr/kemal.git
Updating https://github.com/luislavena/radix.git
Updating https://github.com/jeromegn/kilt.git
Updating https://github.com/RX14/multipart.cr.git
Updating https://github.com/kostya/myhtml.git
Updating https://github.com/kostya/modest.git
Using kemal (ac8ec0a07b5929dc1656006a135e08a2a2732f5e)
Using radix (0.3.5)
Using kilt (0.3.3)
Using multipart (0.1.1)
Installing myhtml (master)
Postinstall cd src/ext && make package
Failed cd src/ext && make package:
/bin/sh: 1: cd: can't cd to src/ext
Hi kostya!
It seems the latest tag name is not v0.13 but v.0.13. Could you release v0.13.1?
With this code:
html = Myhtml::Parser.new("&nbsp;")
puts html.body!.children.inspect
# Myhtml::Iterator::Children(@start_node=Myhtml::Node(:body), @current_node=Myhtml::Node(:_text, " "))
Is there a way I can ask if this node is an entity? Or do I just check that it's a _text node with a space?
Example:
html = <<-HTML
<tr><td>Hello</td></tr>
<tr><td>123</td><td>other</td></tr>
<tr><td>foo</td><td>columns</td></tr>
<tr><td>bar</td><td>are</td></tr>
<tr><td>xyz</td><td>ignored</td></tr>
HTML
myhtml = Myhtml::Parser.new(html)
puts myhtml.css("tr td").map(&.to_html).to_a
# => []
Is this possible somehow? It only works once I wrap the HTML in <table>.
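A minimal sketch of the workaround described above, assuming the rows are known to be table content: interpolate them into a <table> wrapper before parsing, so the rows are parsed in a table context. The variable name rows is only illustrative; the calls are the ones already used in this report.

```crystal
require "myhtml"

rows = <<-HTML
  <tr><td>Hello</td></tr>
  <tr><td>123</td><td>other</td></tr>
  HTML

# Wrap the bare rows in <table> before parsing, as the report says is needed.
myhtml = Myhtml::Parser.new("<table>#{rows}</table>")

# Same query as in the report, now run against the wrapped markup.
puts myhtml.css("tr td").map(&.to_html).to_a
```

The trade-off is that the caller has to know the fragment is table content before parsing it.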
The original HTML content is
origin = <<-HTML
<!doctype html>
<html lang="en">
<head>
<title></title>
</head>
<body> </body>
</html>
HTML
But, I can't get it to print out the DOCTYPE after parsing:
puts Myhtml::Parser.new(origin).root!.to_html
Is there a way to put out the full HTML with a doctype?
In the PR and create example, I find it's a little tedious to use (...) and node all the time.
div = tree.create_node(:div)
div.attribute_add("class", "red")
body.append_child(div)
Can be
div = tree.create_div
div.attribute_add "class", "red"
body.append_child div
What do you think?
Code:
require "myhtml"
require "modest"
require "http/client"
url = "http://academica.ru/vysshee-obrazovanie/negosudarstvennyj-vuz/stranitsa_1/"
response = HTTP::Client.get url
source = Myhtml::Parser.new(response.body)
source.css("li.sectionListItem").each do |node|
p node
end
Output:
laptop% crystal build parser.cr
laptop% ./parser
Invalid memory access (signal 11) at address 0x8
[4742053] *CallStack::print_backtrace:Int32 +117
[4710520] __crystal_sigfault_handler +56
[140079244742784] ???
[5443431] modest_finder_by_selectors_list +119
[5264191] *Modest::Finder#find<Myhtml::Node>:Myhtml::CollectionIterator +127
[5255785] *Myhtml::Parser#css<String>:Myhtml::CollectionIterator +233
[4657361] ???
[4710265] main +41
[140079232176785] __libc_start_main +241
[4655226] _start +42
[0] ???
laptop%
Version:
laptop% crystal --version
Crystal 0.20.5 (2017-01-25)
This stage gets stuck:
Installing myhtml (1.4.1)
Postinstall cd src/ext && make package
I tried everything but can't make this lib work with the HTTP::Client.get method. I converted the response to a string, but it still doesn't work.
While there are inner_text and to_html, neither of them achieves what I'm looking for: combined HTML for everything inside the node.
<div>
<a href="#">Link</a>
<p>Read this</p>
</div>
I want to get <a href="#">Link</a><p>Read this</p> as the complete inner_html of the node. I couldn't find a straightforward way of doing this.
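One possible side step, sketched with only the calls that appear elsewhere on this page (children, to_html, css): serialize each child of the node and join the pieces. The helper name inner_html is hypothetical and defined here, not taken from the shard.

```crystal
require "myhtml"

# Hypothetical helper (not part of the shard): approximate an inner_html by
# serializing every direct child of the node and concatenating the results.
def inner_html(node : Myhtml::Node) : String
  node.children.map(&.to_html).join
end

parser = Myhtml::Parser.new(%(<div><a href="#">Link</a><p>Read this</p></div>))
div = parser.css("div").first
puts inner_html(div)
```

Text nodes between the children, including whitespace, are children too, so they should be included in the result as well.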
Hello, first let me say thank you for this shard!
I am trying to convert my document back into HTML without adding html, head, body tags. I am parsing components and not full documents. Here is a nice example:
require "myhtml"
class HTMLTransformer
property key_attribute : String
property state_attribute : String
def initialize(key_attribute = "key", state_attribute = "state")
@key_attribute = key_attribute
@state_attribute = state_attribute
end
def add_state_to_html(component, html)
return if html.blank?
key, state = ["1234", "5678"]
transform_root(component, html) do |root|
root[key_attribute] = key
root[state_attribute] = state
end
end
private def transform_root(component, html)
fragment = Myhtml::Parser.new(html)
root = fragment.root!
yield root
puts fragment.to_html
fragment.to_html
end
end
class Test
def initialize
@hi = "hi"
@num = 1
end
end
test = Test.new
transformer = HTMLTransformer.new
html = <<-HTML
<div id="t1" class="red">
<a href="/#" data-motion="add">Link to site</a>
</div>
HTML
puts transformer.add_state_to_html(test, html) == <<-HTML
<div id="t1" class="red" key="1234" state="5678">
<a href="/#" data-motion="add">Link to site</a>
</div>
HTML
# Outputs:
# <html key="1234" state="5678"><head></head><body><div id="t1" class="red">
# <a href="/#" data-motion="add">Link to site</a>
# </div></body></html>
# false
Is there any way to do this? If not, do you have a recommended side step?
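A possible side step, as a sketch rather than a confirmed recommendation: let the parser add its html/head/body wrappers, modify the first child of body instead of the document root, and then serialize only the children of body. It assumes the component has a single root element with no leading text, and it uses attribute_add, body!, children and to_html as they appear in other reports on this page.

```crystal
require "myhtml"

html = %(<div id="t1" class="red"><a href="/#" data-motion="add">Link to site</a></div>)
fragment = Myhtml::Parser.new(html)
body = fragment.body!

# Assumption: the component's root element is the first child of <body>.
root = body.children.first
root.attribute_add("key", "1234")
root.attribute_add("state", "5678")

# Serialize only what is inside <body>, skipping the html/head/body wrappers.
puts body.children.map(&.to_html).join
```

If the component starts with whitespace or text, the first child will be a text node, so a real implementation would need to skip to the first element.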
When I try to parse the template tag, it just returns nil.
html = Myhtml::Parser.new("<template>test</template>")
body = html.body!
puts body.children.inspect
#=> Myhtml::Iterator::Children(@start_node=Myhtml::Node(:body), @current_node=nil)
The strange thing is that I can make up random tags like <jeremy> and it parses those fine.
html = Myhtml::Parser.new("<jeremy>test</jeremy>")
body = html.body!
puts body.children.inspect
#=> Myhtml::Iterator::Children(@start_node=Myhtml::Node(:body), @current_node=Myhtml::Node(:last_entry))
Whenever I try to include this as a dependency for a new project, the postinstall fails. When I try to make it manually, it fails with the same error. This wasn't the case in previous versions.
Here's the full error:
(base) Fishbowl:myhtml shark$ make
cd src/ext && make package
git clone https://github.com/lexborisov/Modest.git ./modest-c
Cloning into './modest-c'...
remote: Enumerating objects: 4945, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 4945 (delta 11), reused 15 (delta 6), pack-reused 4911
Receiving objects: 100% (4945/4945), 6.44 MiB | 25.38 MiB/s, done.
Resolving deltas: 100% (3556/3556), done.
cd modest-c && git reset --hard 393338d994c921705ff71dfbd1d98ceb31328f14
HEAD is now at 393338d Update includes.
cd modest-c && make static MyHTML_BUILD_SHARED=OFF MyCORE_BUILD_WITHOUT_THREADS=YES PROJECT_OPTIMIZATION_LEVEL=-O3 -j
sed -e 's,@version\@,0.0.6,g' -e 's,@prefix\@,/usr/local,g' -e 's,@exec_prefix\@,/usr/local,g' -e 's,@libdir\@,lib,g' -e 's,@includedir\@,include,g' -e 's,@cflags\@,-I$\{includedir}/modest -I$\{includedir}/mycore -I$\{includedir}/mycss -I$\{includedir}/myencoding -I$\{includedir}/myfont -I$\{includedir}/myhtml -I$\{includedir}/myunicode -I$\{includedir}/myurl,g' -e 's,@libname\@,modest,g' -e 's,@description\@,fast HTML renderer library with no outside dependency,g' modest.pc.in > modest.pc
mkdir -p bin lib test_suite
cc -Wall -Werror -pipe -pedantic -Isource -DMyCORE_BUILD_WITHOUT_THREADS -fPIC -O3 -Wno-unused-variable -Wno-unused-function -std=c99 -DMODEST_BUILD_OS=Darwin -DMODEST_PORT_NAME=posix -DMyCORE_OS_DARWIN -c -o ssource/mycss/selectors/serialization.c:183:69: error: cast to smaller integer type 'mycss_selectors_function_drop_type_t' (aka 'enum mycss_selectors_function_drop_type') from 'void *' [-Werror,-Wvoid-pointer-to-enum-cast]
mycss_selectors_function_drop_type_t drop_val = mycss_selector_value_drop(selector->value);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
source/mycss/selectors/value.h:28:41: note: expanded from macro 'mycss_selector_value_drop'
#define mycss_selector_value_drop(obj) ((mycss_selectors_function_drop_type_t)(obj))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
ource/myhtml/./tokenizer_end.o source/myhtml/./tokenizer_end.c
make[2]: *** [source/mycss/selectors/serialization.o] Error 1
make[2]: *** Waiting for unfinished jobs....
source/mycss/selectors/function_parser.c:469:57: error: cast to smaller integer type 'mycss_selectors_function_drop_type_t' (aka 'enum mycss_selectors_function_drop_type') from 'void *' [-Werror,-Wvoid-pointer-to-enum-cast]
mycss_selectors_function_drop_type_t drop_val = mycss_selector_value_drop(selector->value);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
source/mycss/selectors/value.h:28:41: note: expanded from macro 'mycss_selector_value_drop'
#define mycss_selector_value_drop(obj) ((mycss_selectors_function_drop_type_t)(obj))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
make[2]: *** [source/mycss/selectors/function_parser.o] Error 1
make[1]: *** [modest-c/lib/libmodest_static.a] Error 2
make: *** [src/ext/myhtml-c/lib/libmodest_static.a] Error 2
I have an html document like this:
<html>
<body>
<div id="greeting()"></div>
</body>
</html>
As far as I can tell, I am unable to retrieve that div by id because of the parentheses:
require "myhtml"
doc = Myhtml::Parser.new(html)
doc.css("#greeting()") # empty
Am I missing anything? Or is this a bug in https://github.com/lexborisov/myhtml?
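In CSS, parentheses are not valid in an unescaped #id selector, so one thing worth trying is an attribute selector with a quoted value. The sketch below is plain Crystal; id_selector is a hypothetical helper, and whether the selector engine behind css accepts this form is an assumption to verify.

```crystal
# Hypothetical helper (not part of myhtml): build an attribute selector that
# matches an element by id, even when the id contains characters such as
# parentheses that cannot appear in a bare `#id` selector.
def id_selector(id : String) : String
  # Escape backslashes and double quotes so the value stays inside the CSS string.
  escaped = id.gsub("\\", "\\\\").gsub("\"", "\\\"")
  %([id="#{escaped}"])
end

puts id_selector("greeting()") # prints [id="greeting()"]
```

With the document from the report, the call would then be doc.css(id_selector("greeting()")) instead of doc.css("#greeting()").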
@kostya is there any way to make it find <!-- <a>old comment</a> --> inside a doc?
I am trying to build this in Docker and get the following error:
> [6/6] RUN CRYSTAL_ENV=production crystal build --release src/worker.cr:
#10 154.0 _main.o: In function `initialize':
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:14: undefined reference to `myhtml_create'
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:15: undefined reference to `myhtml_init'
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:20: undefined reference to `myhtml_tree_create'
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:21: undefined reference to `myhtml_tree_init'
#10 154.0 _main.o: In function `parse':
#10 154.0 /app/lib/myhtml/src/myhtml/parser.cr:95: undefined reference to `myencoding_detect_and_cut_bom'
#10 154.0 /app/lib/myhtml/src/myhtml/parser.cr:116: undefined reference to `myhtml_parse'
#10 154.0 _main.o: In function `initialize':
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:24: undefined reference to `myhtml_destroy'
#10 154.0 _main.o: In function `free':
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:153: undefined reference to `myhtml_tree_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:154: undefined reference to `myhtml_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:153: undefined reference to `myhtml_tree_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:154: undefined reference to `myhtml_destroy'
#10 154.0 _main.o: In function `document!':
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:58: undefined reference to `myhtml_tree_get_document'
#10 154.0 _main.o: In function `initialize':
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:9: undefined reference to `mycss_create'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:11: undefined reference to `mycss_init'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:17: undefined reference to `mycss_entry_create'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:18: undefined reference to `mycss_entry_init'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:25: undefined reference to `modest_finder_create_simple'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:26: undefined reference to `mycss_entry_selectors'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:27: undefined reference to `mycss_selectors_parse'
#10 154.0 _main.o: In function `search_from':
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:36: undefined reference to `modest_finder_by_selectors_list'
#10 154.0 _main.o: In function `free':
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:43: undefined reference to `mycss_selectors_list_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:44: undefined reference to `modest_finder_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:45: undefined reference to `mycss_entry_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:46: undefined reference to `mycss_destroy'
#10 154.0 _main.o: In function `initialize':
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:13: undefined reference to `mycss_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:20: undefined reference to `mycss_entry_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:21: undefined reference to `mycss_destroy'
#10 154.0 _main.o: In function `free':
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:43: undefined reference to `mycss_selectors_list_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:44: undefined reference to `modest_finder_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:45: undefined reference to `mycss_entry_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:46: undefined reference to `mycss_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:43: undefined reference to `mycss_selectors_list_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:44: undefined reference to `modest_finder_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:45: undefined reference to `mycss_entry_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:46: undefined reference to `mycss_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:43: undefined reference to `mycss_selectors_list_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:44: undefined reference to `modest_finder_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:45: undefined reference to `mycss_entry_destroy'
#10 154.0 /app/lib/myhtml/src/myhtml/css_filter.cr:46: undefined reference to `mycss_destroy'
#10 154.0 _main.o: In function `free':
#10 154.0 /app/lib/myhtml/src/myhtml/iterator/collection.cr:44: undefined reference to `myhtml_collection_destroy'
#10 154.0 _main.o: In function `next':
#10 154.0 /app/lib/myhtml/src/myhtml/node/navigate.cr:2: undefined reference to `myhtml_node_next'
#10 154.0 _main.o: In function `parent':
#10 154.0 /app/lib/myhtml/src/myhtml/node/navigate.cr:2: undefined reference to `myhtml_node_parent'
#10 154.0 _main.o: In function `next':
#10 154.0 /app/lib/myhtml/src/myhtml/node/navigate.cr:2: undefined reference to `myhtml_node_next'
#10 154.0 _main.o: In function `child':
#10 154.0 /app/lib/myhtml/src/myhtml/node/navigate.cr:2: undefined reference to `myhtml_node_child'
#10 154.0 _main.o: In function `next':
#10 154.0 /app/lib/myhtml/src/myhtml/node/navigate.cr:2: undefined reference to `myhtml_node_next'
#10 154.0 _main.o: In function `tag_id':
#10 154.0 /app/lib/myhtml/src/myhtml/node.cr:24: undefined reference to `myhtml_node_tag_id'
#10 154.0 _main.o: In function `parent':
#10 154.0 /app/lib/myhtml/src/myhtml/node/navigate.cr:2: undefined reference to `myhtml_node_parent'
#10 154.0 _main.o: In function `next':
#10 154.0 /app/lib/myhtml/src/myhtml/node/navigate.cr:2: undefined reference to `myhtml_node_next'
#10 154.0 _main.o: In function `tag_text_slice':
#10 154.0 /app/lib/myhtml/src/myhtml/node.cr:65: undefined reference to `myhtml_node_text'
#10 154.0 _main.o: In function `tag_id':
#10 154.0 /app/lib/myhtml/src/myhtml/node.cr:24: undefined reference to `myhtml_node_tag_id'
#10 154.0 _main.o: In function `nodes':
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:78: undefined reference to `myhtml_get_nodes_by_tag_id'
#10 154.0 /app/lib/myhtml/src/myhtml/tree.cr:78: undefined reference to `myhtml_get_nodes_by_tag_id'
#10 154.0 _main.o: In function `attribute_by':
#10 154.0 /app/lib/myhtml/src/myhtml/node/attributes.cr:78: undefined reference to `myhtml_node_attribute_first'
#10 154.0 /app/lib/myhtml/src/myhtml/node/attributes.cr:81: undefined reference to `myhtml_attribute_next'
#10 154.0 _main.o: In function `attribute_name':
#10 154.0 /app/lib/myhtml/src/myhtml/node/attributes.cr:93: undefined reference to `myhtml_attribute_key'
#10 154.0 _main.o: In function `attribute_value':
#10 154.0 /app/lib/myhtml/src/myhtml/node/attributes.cr:99: undefined reference to `myhtml_attribute_value'
#10 154.0 collect2: error: ld returned 1 exit status
#10 154.0 Error: execution of command failed with code: 1: `cc "${@}" -o /app/worker -rdynamic -L/usr/bin/../lib/crystal/lib -lxml2 -lz `command -v pkg-config > /dev/null && pkg-config --libs --silence-errors libssl || printf %s '-lssl -lcrypto'` `command -v pkg-config > /dev/null && pkg-config --libs --silence-errors libcrypto || printf %s '-lcrypto'` /app/lib/myhtml/src/myhtml/../ext/modest-c/lib/libmodest_static.a -lyaml -lpcre -lm -lgc -lpthread /usr/share/crystal/src/ext/libcrystal.a -levent -lrt -ldl`
When I run on my mac directly, it works fine.
This is my dockerfile:
FROM crystallang/crystal:1.0.0
RUN mkdir /app
WORKDIR /app
COPY . /app
RUN shards install --production --ignore-crystal-version
RUN CRYSTAL_ENV=production crystal build --release src/worker.cr
I found that Myhtml::Iterator::Collection#empty? doesn't return the same value when called more than once. I don't know why this happens, but I don't think it is expected behavior.
Here is a short example. Are there any mistakes in my code?
require "myhtml"
html = <<-HTML
<html>
<meta>
<head>
<title>page title</title>
</head>
<body></body>
</html>
BODY
HTML
myhtml = Myhtml::Parser.new(html)
node = myhtml.css("title")
p node.size # => 1
p node.empty? # => true
p node.size # => 1
p node.empty? # => false
p node.size # => 1
p node.empty? # => false
btw, thank you for this library. myhtml is very helpful.
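A possible workaround, assuming the inconsistency comes from the result of css being a stateful iterator (the cause is not confirmed in this report): copy the matches into an Array with to_a once, then query the array.

```crystal
require "myhtml"

myhtml = Myhtml::Parser.new("<html><head><title>page title</title></head><body></body></html>")

# Materialize the result once; Array#size and Array#empty? do not advance
# any iterator state, so repeated calls agree with each other.
titles = myhtml.css("title").to_a
p titles.size
p titles.empty?
p titles.empty?
```

This costs one extra array allocation per query.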