Comments (6)

fhamborg commented on August 15, 2024

Currently, news-please does not support MySQL export out of the box, which is why it is not mentioned on the main page of the repository. We had it in one of the earlier versions, but we decided to drop the feature to lower the project's maintenance costs and because we didn't, and still don't, need MySQL. Instead, we added Elasticsearch support.

However, I imagine that, apart from some attribute renaming (e.g., the attribute formerly called html_title is simply called title in the latest stable version), the old MySQL support should run largely unchanged.

You can find the init-db.sql script here: https://github.com/fhamborg/news-please/blob/15abe7cfeb08b4a78e580f96215a9f651927a900/init-db.sql

Let me know if you run into any problems. If they are minor, I might be able to help you out. I would greatly appreciate it if you got it running and created a pull request, so that the whole community can benefit from your efforts.

from news-please.

dustyny commented on August 15, 2024

I might be confusing two separate things. I'm trying to use MySQL to store the crawl history so I can cluster the news-please crawlers. I edited the configuration, but it didn't connect. I thought it might be because I didn't create a newsplease table & schema in MySQL.

I agree with your decision 100%. I think Elasticsearch is a far better export target for a crawler than MySQL.

fhamborg commented on August 15, 2024

Okay, so I'm closing this. Regarding the cluster of news-please instances, I'd recommend separating them by domain. Does that not work?
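One way to separate instances by domain is to assign each domain to exactly one instance with a stable hash. The following is only a minimal sketch of that idea, not something news-please provides: the function names `assign_instance` and `partition` are hypothetical, and it assumes you generate each instance's site list yourself from the resulting buckets.

```python
import hashlib


def assign_instance(domain: str, num_instances: int) -> int:
    """Map a domain to an instance index that stays the same across runs."""
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    # hashlib is used instead of hash() because hash() is salted per process.
    digest = hashlib.sha256(domain.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % num_instances


def partition(domains, num_instances):
    """Split domains into num_instances buckets, one bucket per crawler."""
    buckets = [[] for _ in range(num_instances)]
    for domain in domains:
        buckets[assign_instance(domain, num_instances)].append(domain)
    return buckets


if __name__ == "__main__":
    for index, bucket in enumerate(partition(["example.com", "example.org"], 2)):
        print(index, bucket)
```

Because each domain is crawled by only one instance, the instances do not need a shared crawl history. Note that changing the number of instances reassigns most domains.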

dustyny commented on August 15, 2024

I apologize. I have overcomplicated a very simple question.

I would like to have newsplease save the crawl history to a MySQL database. I have updated the config.cfg with the MySQL username & password. But it does not seem to be working. When I run newsplease it does not save any data to MySQL.

[MySQL]
# MySQL-Connection required for saving meta-informations
host = localhost
port = 3306
db = 'news-please'
username = 'crawler_root'
password = 'XXXXXXXXXXXX'

Are there steps that are missing from the documentation?
Do I need to create the newsplease database & schema or does the newsplease app do that automatically?
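Before looking at the database side, it can help to check how the [MySQL] section is actually parsed. The sketch below uses Python's standard `configparser`, which is an assumption made for illustration: news-please has its own config loader and may treat quotes differently. The helper names `load_mysql_section` and `strip_quotes` are hypothetical.

```python
import configparser

SAMPLE = """
[MySQL]
# MySQL-Connection required for saving meta-informations
host = localhost
port = 3306
db = 'news-please'
username = 'crawler_root'
password = 'XXXXXXXXXXXX'
"""


def load_mysql_section(text):
    """Return the raw [MySQL] options exactly as configparser reads them."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["MySQL"])


def strip_quotes(value):
    """Remove one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


raw = load_mysql_section(SAMPLE)
settings = {key: strip_quotes(value) for key, value in raw.items()}

if __name__ == "__main__":
    for key in raw:
        print(key, repr(raw[key]), "->", repr(settings[key]))
```

With `configparser`, quote characters are part of the value, and every value (including the port) is read as a string. A value whose opening and closing quotes do not match is left unchanged by `strip_quotes`, which makes such typos easy to spot.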

fhamborg commented on August 15, 2024

Ah, now I got it :) The main reason your plan does not work is that MySQL support was originally added to export crawled and extracted articles to MySQL (instead of JSON files or Elasticsearch). The MySQL export was never meant to be a crawl history, even though one could of course use it as such. The other reason is that support for MySQL export was dropped quite some time ago.

Anyway, what I wrote in my previous message still holds, so it should be easy for you to enable MySQL support. First, initialize your MySQL database using the script that I linked above. Afterwards, add the MySQL pipeline task to your pipeline in the config file; its exact name is MySQLStorage. If you look at the commit history of this repository, particularly the very first commits, you should find the full functionality of the MySQL export (in case anything beyond what I mentioned earlier is required).
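Scrapy-style pipeline settings are a mapping from class path to an integer that sets the order, so adding a storage step amounts to adding one entry with a number higher than the existing ones. The sketch below only illustrates that bookkeeping. The class paths are assumptions: only the name MySQLStorage comes from this thread, and the module prefix and the other two entries are hypothetical, so check them against the config file of the commit you restore.

```python
def add_pipeline(pipelines, class_path, step=100):
    """Return a copy of pipelines with class_path appended as the last step."""
    updated = dict(pipelines)
    if class_path in updated:
        return updated  # already configured, keep its existing order
    updated[class_path] = max(updated.values(), default=0) + step
    return updated


# Hypothetical class paths, used for illustration only.
current = {
    "newsplease.pipeline.pipelines.ArticleMasterExtractor": 100,
    "newsplease.pipeline.pipelines.JsonFileStorage": 200,
}
updated = add_pipeline(current, "newsplease.pipeline.pipelines.MySQLStorage")

if __name__ == "__main__":
    for path, order in sorted(updated.items(), key=lambda item: item[1]):
        print(order, path)
```

Placing the storage step last means it runs after extraction, so it receives the extracted article fields rather than the raw response.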

dustyny commented on August 15, 2024

I see where I was confused. I only have another week or two to work on my project, so I don't have time to sort out MySQL now. I will take a look, though, and start to understand the code. :)

My backup plan is to connect the Docker host to an NFS share and then mount each container to its own folder. When newsplease starts up, it will read its configuration from a subfolder on the NFS share and write its crawl history into a subdirectory.

Do you think that would be a good workaround for now?
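A layout like the one described could be prepared with a small script before the containers start. This is a minimal sketch under stated assumptions: the subfolder names `config` and `history`, the crawler names, and the function `prepare_crawler_dirs` are all hypothetical, and a temporary directory stands in for the NFS mount point.

```python
import tempfile
from pathlib import Path


def prepare_crawler_dirs(share_root, crawler_names):
    """Create one folder per crawler with config and history subfolders."""
    layout = {}
    for name in crawler_names:
        base = Path(share_root) / name
        config_dir = base / "config"
        history_dir = base / "history"
        # exist_ok=True makes the script safe to run again on restart.
        config_dir.mkdir(parents=True, exist_ok=True)
        history_dir.mkdir(parents=True, exist_ok=True)
        layout[name] = {"config": config_dir, "history": history_dir}
    return layout


if __name__ == "__main__":
    # A temporary directory stands in for the mounted share in this demo.
    with tempfile.TemporaryDirectory() as share:
        for name, dirs in prepare_crawler_dirs(share, ["crawler-1", "crawler-2"]).items():
            print(name, dirs["config"], dirs["history"])
```

Giving each container its own folder avoids two crawlers writing to the same files, which matters on NFS, where file locking can be unreliable.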
