Comments (3)
Is it reproducible when you parse different HTML files?
Selectolax calls Modest directly when removing nodes:
selectolax/selectolax/modest/node.pxi
Lines 491 to 494 in 6b67223
I think your particular HTML file violates some of the standards and Modest can't process it properly.
If you have more examples, please upload them too.
I think the real problem is that your function can call decompose on nodes that have already been removed.
You can try using decompose(recursive=False).
Unfortunately, selectolax is a very thin wrapper over Modest and it does not check for such problems.
I think removing the same node multiple times corrupts memory.
Always keep in mind that traverse iterates over all objects, and some of them may already have been deleted or modified.
The nodes_to_remove array contains a parent and some of its children. With recursive decomposing, the child nodes are removed together with the parent. On the next iteration, you then try to remove a child object that no longer exists.
This is a common problem: lexbor/lexbor#132 (comment)
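One way to avoid the double removal is to filter nodes_to_remove before decomposing anything, so that no node is listed together with one of its ancestors. The sketch below only illustrates that filtering step. FakeNode and drop_nested are hypothetical names, not part of selectolax, and the identity check via id() is an assumption that works for these stand-in objects; whether selectolax node wrappers can be compared the same way is not confirmed here, so the check may need adapting.

```python
class FakeNode:
    """Hypothetical stand-in for a parsed node; it only tracks its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def drop_nested(nodes):
    """Return the nodes that have no ancestor in `nodes`, without duplicates.

    A recursive removal of the kept nodes already covers every dropped node,
    so each object is removed at most once.
    """
    selected = {id(node) for node in nodes}
    seen = set()
    kept = []
    for node in nodes:
        if id(node) in seen:
            continue  # the same node was listed twice
        seen.add(id(node))
        # Walk up the tree until we hit the root or another selected node.
        ancestor = node.parent
        while ancestor is not None and id(ancestor) not in selected:
            ancestor = ancestor.parent
        if ancestor is None:
            kept.append(node)
    return kept


# Tiny tree: body -> div -> span, and body -> aside.
body = FakeNode("body")
div = FakeNode("div", parent=body)
span = FakeNode("span", parent=div)
aside = FakeNode("aside", parent=body)

# span is a child of div, so removing div recursively would remove it too.
nodes_to_remove = [div, span, aside, div]
safe_to_remove = drop_nested(nodes_to_remove)
names = [node.name for node in safe_to_remove]
print(names)
```

With the filtered list, each remaining node can be decomposed recursively once, and no call touches a child that went away with its parent.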
Thank you, much clearer now!