Comments (6)
Currently, news-please does not support MySQL export out of the box, which is why it is not mentioned on the main page of the repository. We had that feature in one of the earlier versions, but to lower the project's maintenance cost (and since we didn't, and still don't, need MySQL) we decided to drop it. Instead, we added Elasticsearch support.
However, I imagine that except for some attribute renaming (e.g., the attribute formerly called html_title is simply called title in the latest stable version), MySQL support should run out of the box.
You can find the init-db.sql script here: https://github.com/fhamborg/news-please/blob/15abe7cfeb08b4a78e580f96215a9f651927a900/init-db.sql
Let me know if you run into any problems. In case they are minor I might be able to help you out. I would greatly appreciate if you get it running and create a pull request for that, so that the whole community can benefit from your efforts.
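For reference, initializing the database from that pinned script might look like the commands below. This is only a sketch: the database name (news-please) and the use of utf8mb4 are assumptions taken from this thread, not values the project mandates, and they must match whatever you put in config.cfg.

```
# Sketch only: load the pinned init-db.sql into a local MySQL server.
# 'news-please' is a placeholder database name from this thread.
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS \`news-please\` CHARACTER SET utf8mb4;"
mysql -u root -p news-please < init-db.sql
```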
from news-please.
I might be confusing two separate things. I'm trying to use MySQL to store the crawl history so I can cluster the news-please crawlers. I edited the configuration, but it didn't connect. I thought it might be because I didn't create a news-please table & schema in MySQL.
I agree with your decision 100%; I think Elasticsearch is a far better export target for a crawler than MySQL.
Okay, so I'm closing this. Regarding the cluster of news-please instances, I'd recommend separating them by domain; does that not work?
I apologize. I have overcomplicated a very simple question.
I would like news-please to save the crawl history to a MySQL database. I have updated config.cfg with the MySQL username & password, but it does not seem to be working: when I run news-please, it does not save any data to MySQL.
[MySQL]
# MySQL connection, required for saving meta-information
host = localhost
port = 3306
db = 'news-please'
username = 'crawler_root'
password = 'XXXXXXXXXXXX'
Are there steps missing from the documentation?
Do I need to create the news-please database & schema myself, or does the news-please app do that automatically?
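One generic pitfall worth checking here (news-please ships its own config loader, so its quoting rules may differ; the snippet below uses only Python's stock configparser as an illustration): with the standard parser, quote characters in INI values are kept as part of the value, and the original password line above ended with a stray backtick rather than a closing quote, which alone would break authentication.

```python
import configparser

# Minimal illustration with the stock parser; news-please's own loader
# may strip quotes, so treat this only as a way to inspect raw values.
cfg = configparser.ConfigParser()
cfg.read_string("""
[MySQL]
host = localhost
port = 3306
db = 'news-please'
""")

print(cfg["MySQL"]["db"])  # -> 'news-please' (the quotes are part of the value)
```

If the driver receives the quotes (or a trailing backtick) as literal characters of the database name or password, the connection fails even though the credentials look correct at a glance.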
Ah, now I got it :) The main reason why what you're planning does not work is that MySQL support was originally added to export crawled & extracted articles to MySQL (instead of to JSON files or Elasticsearch). The MySQL export was never intended to serve as a crawl history, even though one could of course use it as such. The other reason is that MySQL export support was dropped quite some time ago.
Anyway, what I wrote in my previous message still holds, so it should be easy for you to enable MySQL support. Simply initialize your MySQL database using the script I linked above. Afterwards, you need to add the MySQL pipeline task to your pipeline in the config file; the exact name can be found here, it is MySQLStorage. If you look at the history of commits to this repository, particularly the very first commits, you should find the full functionality of the MySQL export (in case anything other than what I mentioned earlier is required).
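If it helps, adding the pipeline task to config.cfg might look roughly like the fragment below. Only the class name MySQLStorage comes from this thread; the module path and the priority number are guesses based on how the repository's other pipeline entries are structured, so verify them against your version of the code.

```
[Scrapy]
# Keep your existing pipeline entries and append MySQLStorage.
# Module path and priority are assumptions, not verified values.
ITEM_PIPELINES = {
    'newsplease.pipeline.pipelines.MySQLStorage': 350,
}
```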
I see where I was confused. I only have another week or two to work on my project, so I don't have time to sort out MySQL now. I will take a look, though, and start to understand the code. :)
My backup plan is to connect the Docker host to an NFS share and then mount each container to its own folder. When news-please starts up, it will read its configuration from a subfolder on the NFS share and write its crawl history into its own subdirectory.
Do you think that would be a good workaround for now?
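That layout could be sketched in a Compose file like the one below. Every name in it is hypothetical (the image name, the NFS mount paths, and the in-container config location all depend on how your image is built and where news-please looks for its config directory):

```
# Hedged sketch of the NFS-backed per-container layout; all names hypothetical.
services:
  crawler-1:
    image: my-newsplease:latest          # hypothetical image name
    volumes:
      # each container mounts its own subfolder of the NFS-backed share
      - /mnt/nfs/newsplease/crawler-1:/root/news-please
  crawler-2:
    image: my-newsplease:latest
    volumes:
      - /mnt/nfs/newsplease/crawler-2:/root/news-please
```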