Comments (6)
@Popolechien They will probably not appear after the next release of zimit with the blacklist (today or tomorrow); you should re-run the scrape then.
from zimit.
One of the problems is that we have these HTML redirects, which are there to make sure suggestions work fine.
I don't think there's much we can do about this here, but please share the ZIM so I can provide exact figures.
I believe this would be the task https://farm.youzim.it/pipeline/d1c2f201514f3da67f887df5
Thanks; the ZIM was gone, so I recreated one, which gave me the following preview on Kiwix 3.4.3.
The article count has changed drastically (about half) but is still above the limit.
This ZIM has the following:
- 16,532 ZIM articles
- 7,696 ZIM articles in `A/` namespace
- 1,147 HTML articles in `A/` namespace. I believe that's what Kiwix android displays.
And here's the breakdown of those HTML articles by host:
- 404.html: 1
- embed.ted.com: 1
- i0.wp.com: 4
- index.html: 1
- mesquartierschinois.wordpress.com: 475
- player.vimeo.com: 1
- public-api.wordpress.com: 2
- topFrame.html: 1
- twitter.com: 125
- vimeo.com: 1
- widgets.wp.com: 3
- www.benbest.com: 1
- www.facebook.com: 227
- www.google.com: 136
- www.pinterest.com: 124
- www.youtube.com: 44
We can clearly see the target website has 475 HTML articles, well below the 1,000 limit. The other HTML articles are embeds of some sort that Kiwix has no knowledge of and cannot tell apart from the site's own pages.
So there is nothing to be done here, except feeding these numbers into the article-count discussion for kiwix-lib.
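For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch, assuming you already have the list of HTML entry paths from the `A/` namespace as plain strings. The `breakdown_by_host` helper and the sample paths are hypothetical and not part of zimit or kiwix-lib; only the `reported` counts are taken from the figures above.

```python
from collections import Counter


def breakdown_by_host(paths):
    """Count entries by their first path segment (a host name or a top-level file)."""
    return Counter(path.split("/", 1)[0] for path in paths)


# Hypothetical sample paths, only to illustrate the grouping.
sample_paths = [
    "mesquartierschinois.wordpress.com/about/",
    "mesquartierschinois.wordpress.com/contact/",
    "www.facebook.com/plugins/like.php",
    "twitter.com/intent/tweet",
    "index.html",
]
counts = breakdown_by_host(sample_paths)
for host, count in sorted(counts.items()):
    print(f"{host}: {count}")

# Counts reported in this comment, used to check the totals.
reported = {
    "404.html": 1,
    "embed.ted.com": 1,
    "i0.wp.com": 4,
    "index.html": 1,
    "mesquartierschinois.wordpress.com": 475,
    "player.vimeo.com": 1,
    "public-api.wordpress.com": 2,
    "topFrame.html": 1,
    "twitter.com": 125,
    "vimeo.com": 1,
    "widgets.wp.com": 3,
    "www.benbest.com": 1,
    "www.facebook.com": 227,
    "www.google.com": 136,
    "www.pinterest.com": 124,
    "www.youtube.com": 44,
}
total = sum(reported.values())
target = reported["mesquartierschinois.wordpress.com"]
print(f"total HTML articles: {total}")
print(f"target site: {target}, embeds and others: {total - target}")
```

The reported counts add up to the 1,147 HTML articles listed earlier, of which 672 do not belong to the target site.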
Interesting, thanks.
twitter.com: 125
www.benbest.com: 1
www.facebook.com: 227
www.google.com: 136
www.pinterest.com: 124
None of these are (willingly) used/mentioned/quoted in the blog - I suspect they could be ads or ad trackers of some sort that get counted as items.