Comments (7)
Hi,
I'm sorry, but such filters are currently not planned: topic or semantic filtering is not trivial to do with high reliability, so I see this as a task that should be handled externally. However, if you mean filtering by checking whether an article's text contains one or more specific tokens, that could easily be added as another step of the pipeline. Please reopen the ticket if you're planning to implement that, and I'll be happy to support you should you have any questions.
Best,
Felix
from news-please.
I think I have to explain myself in more detail.
I've looked through the docs of referenced projects like newspaper. Newspaper3k has a feature called "Keyword extraction from text". Like the "Date Filter", which one can set up in config.cfg, it should be doable to filter out those articles right away, i.e. in the pipeline. That said, this type of filtering relies on the presence of such keywords within the site (that's what you called tokens, right?).
Isn't it doable to solve this issue through a parser?
I'm far from being a (good) coder; I'm looking at this more from a user perspective. Given the sheer (and growing) amount of "information", this should be a well-received feature in crawlers. That said, I can't estimate the complexity of such a task.
Regards
from news-please.
Keyword extraction from text sounds to me more like extracting terms that are representative of the document, e.g., terms with a high TF-IDF score or a similar term-scoring method. That is something different from "filtering". If I understand you correctly, the pipeline filter you would like to add should check whether a single article's main text (or title?) contains at least one (or all?) of the terms defined in a new parameter in the config file?
Or are you specifically referring to the keywords extracted by newspaper and want to check against them?
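To make the "at least one (or all?)" question concrete, here is a minimal sketch of such a term check. The function name `contains_terms` and the `require_all` flag are hypothetical and not part of news-please; the terms would presumably come from a new config parameter.

```python
import re


def contains_terms(text, terms, require_all=False):
    """Return True if `text` contains any (or, optionally, all) of `terms`.

    Hypothetical helper: matching is case-insensitive and on word
    boundaries, so "new" does not match inside "renewal".
    """
    hits = [
        re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None
        for term in terms
    ]
    return all(hits) if require_all else any(hits)
```

With `require_all=False` a single matching term is enough to keep the article; with `require_all=True` every configured term has to appear.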
from news-please.
The keywords are a JSON attribute extracted from the site's article, like "authors", "publish_date" or "text".
Direct quote from the newspaper3k docs (example site extracted):
`>>> article.keywords
['New Years', 'resolution', ...]`
Adding a parameter in the config file would make it possible to keep only articles containing the keyword 'New Years' (for example), like the "Date filter" does. Other articles would be dropped in the pipeline. That's one (lean) way, BUT it relies on the site containing these keywords.
Another way, I presume, is to do the task through a/the parser (to filter articles containing a given word). But that may be horribly wrong, or too complex.
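The first option above, comparing an article's extracted keyword list against keywords from the config, could be sketched roughly as follows. `matches_keywords` is a hypothetical name, and the sketch assumes the extractor actually provides a list of keyword strings.

```python
def matches_keywords(article_keywords, wanted_keywords):
    """Return True if the two keyword lists share at least one entry.

    Hypothetical helper: comparison ignores case and surrounding whitespace.
    """
    have = {k.strip().lower() for k in article_keywords}
    want = {k.strip().lower() for k in wanted_keywords}
    return not have.isdisjoint(want)
```

An article for which this returns False would be the one dropped in the pipeline.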
from news-please.
According to https://github.com/codelucas/newspaper the keywords are terms that are extracted when invoking article.nlp() so they don't need to be contained in the website but are generated by the newspaper library. I will close this issue, since we don't want any extractor-specific dependencies but keep news-please as general as possible. So far, I only know of newspaper extracting such "keywords".
I think a simple term filter, as I described above, would be more in line with the goals of news-please.
Anyway, if you want to add such a feature (I guess it's roughly 20-40 SLOC that need to be added), you're more than welcome to contribute, and I'll be happy to help you do that.
from news-please.
How would one begin to implement this keyword filtering step in the pipeline? I simply want to not save an article to disk if it doesn't contain a keyword in a set of keywords. Can I just stuff an if-statement somewhere before it saves?
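One possible starting point, as a rough sketch rather than the project's actual implementation: news-please pipeline steps follow the Scrapy item-pipeline pattern, where `process_item` either returns the item or raises `DropItem`. The class name `KeywordFilter`, its constructor arguments, and the `article_text` field name are assumptions here, and `DropItem` is redefined locally only so the sketch runs without Scrapy installed.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so this sketch runs on its own."""


class KeywordFilter:
    """Hypothetical pipeline step: drop items whose text has none of the keywords."""

    def __init__(self, keywords, text_field="article_text"):
        # Lowercase once so the check below is case-insensitive
        self.keywords = [k.lower() for k in keywords]
        # Assumed name of the item field that holds the extracted main text
        self.text_field = text_field

    def process_item(self, item, spider):
        text = (item.get(self.text_field) or "").lower()
        # Plain substring check: keep the item if any keyword occurs in the text
        if not any(k in text for k in self.keywords):
            raise DropItem("no configured keyword found")
        return item
```

In a real setup the exception would be imported from `scrapy.exceptions`, and the step would need to run after text extraction but before the storage steps, so that dropped articles never reach disk.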
from news-please.
See #101
from news-please.