Comments (8)
In [2]: from selectolax.parser import HTMLParser
...:
...: html = """
...: <p class="TweetTextSize js-tweet-text tweet-text" lang="es" data-aria-label-part="0">Quiero besar tus labios. <img
...: class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/2764.png" draggable="false" alt="❤"
...: title="Red heart" aria-label="Emoji: Red heart"><img class="Emoji Emoji--forText"
...: src="https://abs.twimg.com/emoji/v2/72x72/1f618.png" draggable="false" alt="😘" title="Face throwing a kiss"
...: aria-label="Emoji: Face throwing a kiss"><a href="https://t.co/I1dwxjT0Mp" class="twitter-timeline-link u-hidden"
...: data-pre-embedded="true" dir="ltr">pic.twitter.com/I1dwxjT0Mp</a></p>
...: """
In [3]: for node in HTMLParser(html).css(".tweet-text img[alt]"):
...:     print(node.attributes.get('alt', ''))
...:     print(node.parent.text())
...:
❤
Quiero besar tus labios. pic.twitter.com/I1dwxjT0Mp
😘
Quiero besar tus labios. pic.twitter.com/I1dwxjT0Mp
In [4]: for node in HTMLParser(html).css(".Emoji"):
...:     print(node.attributes.get('alt', ''))
...:     print(node.parent.text())
...:
❤
Quiero besar tus labios. pic.twitter.com/I1dwxjT0Mp
😘
Quiero besar tus labios. pic.twitter.com/I1dwxjT0Mp
In [5]:
Does that answer your question? If not, can you please elaborate a bit more?
from selectolax.
I did get that far. The issue is that I'd like the images to appear in their proper order relative to the inner text of the p tag.
That's where I'm lost: I can print them out, but I don't know how to preserve where the images were placed within p.text().
from selectolax.
Ok, I see. So you want to be able to iterate all nodes inside the p tag?
In [3]: node = HTMLParser(html).css_first("p")
...: for node in node.iter():
...:     print(node)
...:
<Node img>
<Node img>
<Node a>
I don't remember why the text node is ignored.
I can add it to the list too so that the output can be:
<Node text>
<Node img>
<Node img>
<Node a>
Let me know if you have better ideas.
from selectolax.
That would probably work well. The other thought I had was to be able to replace a node with text when iterating, so that the img would just become the image alt text. If you had a generic function that could replace a node with something else when iterating inside its parent, that would likely work. In this case, the image nodes would just be replaced with their alt text, and printing p_node.text() would then show the alt text, since the nodes were converted internally.
But your idea also works -- whatever you think would be easier. The ability to iterate (which is already there) would be more powerful if you could convert things in place (although I don't know how hard it would be to convert a node into text, etc.).
from selectolax.
Ok, I will think about it later this week.
For now, you can manually iterate over them:
In [4]: node = HTMLParser(html).css_first("p")
...:
...:
...: def yield_child_nodes(node):
...:     # Walk the direct children (text nodes included) in document order.
...:     n = node.child
...:     while n is not None:
...:         yield n
...:         n = n.next
...:
...:
...: for child in yield_child_nodes(node):
...:     if child.tag == '-text':
...:         # For a text node, .html holds the text itself.
...:         text = child.html
...:     else:
...:         text = child.attributes.get('alt', '')
...:     print(text)
...:
Quiero besar tus labios.
❤
😘
from selectolax.
No rush! Thanks for replying!
from selectolax.
Added a new replace_with method.
In [1]: from selectolax.parser import HTMLParser
In [2]: html = """<p class="TweetTextSize js-tweet-text tweet-text" lang="es" data-aria-label-part="0">Quiero besar tus labios. <img
...: class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/2764.png" draggable="false" alt="❤"
...: title="Red heart" aria-label="Emoji: Red heart"><img class="Emoji Emoji--forText"
...: src="https://abs.twimg.com/emoji/v2/72x72/1f618.png" draggable="false" alt="😘" title="Face throwing a kiss"
...: aria-label="Emoji: Face throwing a kiss"><a href="https://t.co/I1dwxjT0Mp" class="twitter-timeline-link u-hidden"
...: data-pre-embedded="true" dir="ltr">pic.twitter.com/I1dwxjT0Mp</a></p>"""
In [3]: html_parser = HTMLParser(html)
In [4]: for node in html_parser.css('p a'):
...:     node.decompose()
...:
In [5]: for node in html_parser.css('img'):
...:     node.replace_with("%s " % node.attributes.get('alt', ''))
...:
In [6]: html_parser.css_first('p').text().strip()
Out[6]: 'Quiero besar tus labios. ❤ 😘'
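For comparison, here is a rough sketch of the same transformation using only the standard library's html.parser instead of selectolax. It is not how selectolax works internally. The names AltTextExtractor, text_with_alt and skip_tags are made up for this illustration, and the sample markup is a trimmed copy of the tweet above. It assumes the skipped tags are non-void elements (such as a) that have a closing tag.

```python
from html.parser import HTMLParser as StdHTMLParser


class AltTextExtractor(StdHTMLParser):
    """Collect text in document order, substituting each <img> with its alt text."""

    def __init__(self, skip_tags=("a",)):
        super().__init__(convert_charrefs=True)
        # Hypothetical option: elements whose content is dropped entirely.
        # Assumed to be non-void tags, i.e. they have a matching end tag.
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag == "img" and self.skip_depth == 0:
            # attrs is a list of (name, value); value may be None.
            alt = dict(attrs).get("alt") or ""
            self.pieces.append(alt + " ")

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.pieces.append(data)


def text_with_alt(html, skip_tags=("a",)):
    """Return the text of `html` with images replaced by their alt text."""
    parser = AltTextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces).strip()


sample = (
    '<p class="tweet-text">Quiero besar tus labios. '
    '<img src="https://abs.twimg.com/emoji/v2/72x72/2764.png" alt="❤">'
    '<img src="https://abs.twimg.com/emoji/v2/72x72/1f618.png" alt="😘">'
    '<a href="https://t.co/I1dwxjT0Mp">pic.twitter.com/I1dwxjT0Mp</a></p>'
)
result = text_with_alt(sample)
print(result)
```

Because the parser emits events in document order, the alt text lands exactly where each image sat in the paragraph, which is the same effect replace_with gives in the session above.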
from selectolax.
This is great! Thanks so much -- this will definitely be very helpful.
from selectolax.