
news-please's Introduction

news-please


news-please is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website to crawl it completely. news-please combines the power of multiple state-of-the-art libraries and tools, such as scrapy, Newspaper, and readability.

news-please also features a library mode, which allows Python developers to use the crawling and extraction functionality within their own programs. Moreover, news-please allows you to conveniently crawl and extract articles from the (very) large news archive at commoncrawl.org.

If you want to contribute to news-please, please first read here.

Announcements

10/11/2023: If you're interested in text annotation software, check out textada - an AI-powered text annotation tool. Add your documents and categories, do some manual annotations, and let the AI do the work for you. The university-based project is not open source, but free to use.

03/23/2021: If you're interested in sentiment classification in news articles, check out our large-scale dataset for target-dependent sentiment classification. We also publish an easy-to-use neural model that achieves state-of-the-art performance. Visit the project here.

06/01/2018: If you're interested in event extraction from news, you might also want to check out our new project, Giveme5W1H - a tool that extracts phrases answering the journalistic five W and one H questions to describe an article's main event, i.e., who did what, when, where, why, and how.

Extracted information

news-please extracts the following attributes from news articles. An exemplary JSON file as extracted by news-please can be found here.

  • headline
  • lead paragraph
  • main text
  • main image
  • name(s) of author(s)
  • publication date
  • language

Features

  • works out of the box: install with pip, add URLs of your pages, run :-)
  • run news-please conveniently using its CLI mode
  • use it as a library within your own software
  • extract articles from commoncrawl.org's news archive

Modes and use cases

news-please supports three main use cases, which are explained in more detail below.

CLI mode

  • stores extracted results in JSON files, PostgreSQL, Elasticsearch, or your own storage
  • simple but extensive configuration (if you want to tweak the results)
  • revisions: crawl articles multiple times and track changes

Library mode

  • crawl and extract information given a list of article URLs
  • to use news-please within your own Python code

News archive from commoncrawl.org

  • commoncrawl.org provides an extensive, free-to-use archive of news articles from small and major publishers worldwide
  • news-please enables users to conveniently download and extract articles from commoncrawl.org
  • you can optionally define filter criteria, such as news publisher(s) or the date range within which articles must have been published
  • clone the news-please repository, adapt the config section in newsplease/examples/commoncrawl.py (an illustrative sketch of that section follows below), and execute python3 -m newsplease.examples.commoncrawl
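
For orientation, the config section of newsplease/examples/commoncrawl.py exposes filter variables and a callback roughly along the following lines. This is an illustrative sketch only; the names (my_filter_valid_hosts, my_filter_start_date, my_filter_end_date, on_valid_article_extracted) are taken from the example script, but you should verify them against your checkout.

import datetime

# only extract articles from these hosts (empty list = no host filter)
my_filter_valid_hosts = ['example.com']
# only keep articles published within this date range (None = no limit)
my_filter_start_date = datetime.datetime(2016, 1, 1)
my_filter_end_date = datetime.datetime(2016, 12, 31)

def on_valid_article_extracted(article):
    # invoked for every article that passes the filters;
    # store or further process the extracted article object here
    print(article.title)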

Getting started

It's super easy, we promise!

Installation

news-please runs on Python 3.5+.

$ pip3 install news-please

Use within your own code (as a library)

You can access the core functionality of news-please, i.e. extraction of semi-structured information from one or more news articles, in your own code by using news-please in library mode. If you want to use news-please's full website extraction (given only the root URL) or continuous crawling mode (using RSS), you'll need to use the CLI mode, which is described later.

from newsplease import NewsPlease
article = NewsPlease.from_url('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html?hp')
print(article.title)

A sample of an extracted article can be found here (as a JSON file).
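
Besides the title, the returned article object exposes the other attributes listed under "Extracted information" above. As a minimal sketch (the attribute names below, such as maintext, authors, and date_publish, are assumptions based on commonly used field names in news-please's article object; verify them against your installed version):

from newsplease import NewsPlease

article = NewsPlease.from_url('https://www.example.com/some-article.html')  # hypothetical URL

print(article.title)         # headline
print(article.description)   # lead paragraph
print(article.maintext)      # main text
print(article.authors)       # name(s) of author(s)
print(article.date_publish)  # publication date
print(article.language)      # language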

If you want to crawl multiple articles at a time, optionally with a timeout in seconds

NewsPlease.from_urls([url1, url2, ...], timeout=6)

or if you have a file containing all URLs (each line containing a single URL)

NewsPlease.from_file(path)

or if you have raw HTML data (you can also provide the original URL to increase the accuracy of extracting the publishing date)

NewsPlease.from_html(html, url=None)

or if you have a WARC file (also check out our commoncrawl workflow, which provides convenient methods to filter commoncrawl's archive for specific news outlets and dates)

NewsPlease.from_warc(warc_record)

In library mode, news-please will attempt to download and extract information from each URL. The previously described functions are blocking, i.e., they return once news-please has attempted all URLs. The result contains all successfully extracted articles.
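
For example, assuming from_urls returns a mapping from each requested URL to its extracted article (with None for failed extractions; check the return type of your installed version), the results can be processed like this:

from newsplease import NewsPlease

urls = [
    'https://www.example.com/article-1.html',  # hypothetical URLs
    'https://www.example.com/article-2.html',
]

# assumption: from_urls returns a dict mapping URL -> article (or None on failure)
articles = NewsPlease.from_urls(urls, timeout=6)

for url, article in articles.items():
    if article is None:
        print('extraction failed for', url)
        continue
    print(url, '->', article.title)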

Finally, you can process the extracted information contained in the article object(s). For example, to export into a JSON format, you may use:

import json

with open("article.json", "w") as file:
    json.dump(article.get_serializable_dict(), file)

Run the crawler (via the CLI)

$ news-please

news-please will then start crawling a few example pages. To terminate the process, press CTRL+C. news-please will then shut down within 5-60 seconds. You can also press CTRL+C twice, which will immediately kill the process (not recommended, though).

The results are stored by default in JSON files in the data folder. In the default configuration, news-please also stores the original HTML files.

Crawl other pages

Most likely, you will not want to crawl the websites provided in our example configuration. Simply head over to the sitelist.hjson file and add the root URLs of the news outlets' web pages of your choice. news-please can also extract the most recent events from the GDELT project, see here.
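
As a rough sketch, an entry in sitelist.hjson might look like the following; the base_urls key and the url field follow the default file shipped with news-please, while the concrete URL is a hypothetical example (see the shipped file for additional per-site options, such as which crawler to use):

{
  base_urls : [
    {
      # root URL of a news outlet you want to crawl (hypothetical example)
      url: "https://www.example-news-site.com/"
    }
  ]
}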

Elasticsearch

news-please also supports export to Elasticsearch. Using Elasticsearch will also enable the versioning feature. First, enable it in config.cfg in the config directory, which is by default ~/news-please/config but can also be changed with the -c parameter to a custom location. In case the directory does not exist, a default directory will be created at the specified location.

[Scrapy]

ITEM_PIPELINES = {
                   'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
                   'newsplease.pipeline.pipelines.ElasticsearchStorage':350
                 }

That's it! Unless your Elasticsearch database is not located at http://localhost:9200, uses a different username/password, or requires CA-certificate authentication. In these cases, you will also need to change the following settings.

[Elasticsearch]

host = localhost
port = 9200    

...

# Credentials used for authentication (supports CA-certificates):

use_ca_certificates = False           # True if authentication needs to be performed
ca_cert_path = '/path/to/cacert.pem'  
client_cert_path = '/path/to/client_cert.pem'  
client_key_path = '/path/to/client_key.pem'  
username = 'root'  
secret = 'password'

PostgreSQL

news-please allows storing articles in a PostgreSQL database, including the versioning feature. To export to PostgreSQL, open the corresponding config file (config_lib.cfg for library mode and config.cfg for CLI mode), add the PostgresqlStorage module to the pipeline, and adjust the database credentials:

[Scrapy]
ITEM_PIPELINES = {
               'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
               'newsplease.pipeline.pipelines.PostgresqlStorage':350
             }

[Postgresql]
# PostgreSQL connection required for saving meta-information
host = localhost
port = 5432
database = 'news-please'
# schema = 'news-please'
user = 'user'
password = 'password'

If you plan to use news-please and its export to PostgreSQL in a production environment, we recommend uninstalling the psycopg2-binary package and installing psycopg2. We use the former since it does not require a C compiler in order to be installed. See here for more information on the differences between psycopg2 and psycopg2-binary and how to set up a production environment.
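
For example, the switch could look like this (assuming pip3 manages the environment in which news-please is installed, and that a C compiler and the PostgreSQL client headers are available for building psycopg2):

$ pip3 uninstall psycopg2-binary
$ pip3 install psycopg2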

What's next?

We have collected a bunch of useful information for both users and developers. As a user, you will most likely only deal with two files: sitelist.hjson (to define sites to be crawled) and config.cfg (probably only rarely, in case you want to tweak the configuration).

Support (also, how to open an issue)

You can find more information on usage and development in our wiki! Before contacting us, please check out the wiki. If you still have questions on how to use news-please, please create a new issue on GitHub. Please understand that we are not able to provide individual support via email. We think that help is more valuable if it is shared publicly so that more people can benefit from it. However, if you still require individual support, e.g., due to confidentiality of your project, we may be able to provide you with private consultation. Contact us for information about pricing and further details.

Issues

For bug reports, we ask you to use the Bug report template. Make sure you're using the latest version of news-please, since we cannot give support for older versions. As described earlier, we cannot give support for issues or questions sent by email.

Donation

Your donations are greatly appreciated! They will free us up to work on this project more, to take on tasks such as adding new features, bug-fix support, and addressing further concerns with the library.

Acknowledgements

This project would not have been possible without the contributions of the following students (ordered alphabetically):

  • Moritz Bock
  • Michael Fried
  • Jonathan Hassler
  • Markus Klatt
  • Kevin Kress
  • Sören Lachnit
  • Marvin Pafla
  • Franziska Schlor
  • Matt Sharinghousen
  • Claudio Spener
  • Moritz Steinmaier

We also thank all other contributors, whom you can find on the contributors page!

How to cite

If you are using news-please, please cite our paper (ResearchGate, Mendeley):

@InProceedings{Hamborg2017,
  author     = {Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela},
  title      = {news-please: A Generic News Crawler and Extractor},
  year       = {2017},
  booktitle  = {Proceedings of the 15th International Symposium of Information Science},
  location   = {Berlin},
  doi        = {10.5281/zenodo.4120316},
  pages      = {218--223},
  month      = {March}
}

You can find more information on this and other news projects on our website.

Contributions

Do you want to contribute? Great, we are always happy about any support for this project! We are particularly looking for pull requests that fix bugs, but we also welcome pull requests that contribute your own ideas.

By contributing to this project, you agree that your contributions will be licensed under the project's license.

Pull requests

We love contributions by our users! If you plan to submit a pull request, please open an issue first and describe the issue you want to fix or what you want to improve and how. This way, we can discuss whether your idea could be added to news-please in the first place and, if so, how it could best be implemented in order to fit into the architecture and coding style. In the issue, please state that you're planning to implement the described features.

Custom features

Unfortunately, we do not have the resources to implement features requested by users. Instead, we recommend that you implement the features you need and, if you'd like, open a pull request here so that the community can benefit from your improvements, too.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use news-please except in compliance with the License. A copy of the License is included in the project, see the file LICENSE.txt.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. The news-please logo is courtesy of Mario Hamborg.

Copyright 2016-2024 The news-please team

news-please's People

Contributors

abidibo, ahacad, amjltc295, andreierdoss, arcolife, cookieshake, d1mitriz, donglixp, eliias, fhamborg, frankier, jemisonf, jeromegill, jeyb88, lgov, loganamcnichols, medno, megatron-me-uk, moyid, mxab, ntlf, phdowling, sebastian-nagel, shangw-nvidia, somnathrakshit, stepinsilence, t1h0, thihara, tsoernes, yldoctrine


news-please's Issues

HTTP Error 505: HTTP Version not supported

Hi,
I tried to get started by running commoncrawl.py but encountered this error. I checked everything I know, but still no luck. Do you happen to know what this issue is about? Attached is the error log.
errorlog.txt

Thank you for your time.

Inconsistent beautifulsoup4 dependency for Python 2.7

The Python 2.7 version of news-please requires beautifulsoup4>=4.5.1 and newspaper>=0.0.9.8.

newspaper >=0.0.9.8 requires beautifulsoup4==4.3.2, creating a beautifulsoup4 dependency issue when trying to install news-please with pip.

Is there a way around this issue? Python3 is not an option.

Running Windows 7.

improve json export

  • config: where to store
  • config: format: pretty or compact
  • change file ending (currently .html.json)

How to limit the time spent on the same website

Hi, first of all I want to thank you for this project.
Now, what I need is to limit the time that the spider spends on the same website; for example, the last time I executed it, almost 1.5 GB had been downloaded from the same website.
For the moment I'm using the default configuration and only specify the URLs for the sites that I want to crawl.

Thank you very much.

Re-enable MySQL

I'm working on configuring newsplease to use MySQL for persistent storage. I'm running newsplease and MySQL in Docker containers on my workstation. newsplease is running in CLI mode. MySQL has been tested using a SQL client (Toad), and I can create databases, read from them, etc.

I updated the settings in my newsplease config file to use MySQL (updated username & password). When I start the Docker container, newsplease starts crawling and saving JSON & HTML files to the file system, but it is not writing to the MySQL database.

In the documentation, there is a mention of an init-db.sql script that can be used to set up the database. This doesn't seem to be in the repo.

Configuration:
To use this module you have to enter the address, the used port and if needed your user credentials into the MySQL section of newscrawler.cfg. There is also a setup script init-db.sql for a convenient creation of the used tables.

In main.py I see reset MySQL but there is no mention of this in the documentation.

Problem with using Elasticsearch

I have installed this docker image on my Mac, and it starts fine. Nothing changed configuration-wise.

When I run news-please, however, I get the following error:
`[newsplease.config:163|INFO] Loading config-file (/Users/lukaskawerau/news-please-repo/config/config.cfg)
[newsplease.config:163|INFO] Loading config-file (/Users/lukaskawerau/news-please-repo/config/config.cfg)
[newsplease.config:163|INFO] Loading config-file (/Users/lukaskawerau/news-please-repo/config/config.cfg)
[newsplease.config:163|INFO] Loading config-file (/Users/lukaskawerau/news-please-repo/config/config.cfg)
Unhandled error in Deferred:

Unhandled error in Deferred:

Unhandled error in Deferred:

[newsplease.main:270|INFO] Graceful stop called manually. Shutting down.`

Are there any steps I'm missing to get news-please to run with ES?
Running it with basic json/html-export works fine.

Any help appreciated!

assets

All resources stored here were created with the online tool draw.io.
If you want to change a file, visit www.draw.io and open the picture with their online tool.
The tool will recover the .xml version of the diagram and you can perform your edits.

Please be sure to include the .xml version when exporting the diagram into a picture. ;)

km4-article-extrator-class-diagram

mysql-er-diagram

news-please-class-diagram

news-please-flowchart

run news-please in a cluster

news-please has been very easy to set up & test. I've been getting excellent results during my testing, and now I'm considering putting it into production.

I'm trying to figure out the best way to run multiple instances in an AWS ECS cluster. If multiple News-please crawlers point to the same MySQL database, will this allow them to distribute the tasks across a cluster?

If not; do you have any thoughts on how to run news-please in a scalable cluster?

Can't run example

Hello, I am trying to work through the examples.

I installed it, simply running "pip install newsplease.zip"

After that I am running python downloadfromurl.py and am getting the following error:

Traceback (most recent call last):
File "downloadfromurl.py", line 16, in
with open(basepath + article['filename'] + '.json', 'w') as outfile:
TypeError: 'NewsArticle' object is not subscriptable

I changed basepath and url.

What am I missing?

"Unhandled Error in Deferred"

Hi,
I am working with Python 3.6.3 and pip 9.0.1 on Ubuntu 16.04.3 LTS.
news-please is not working right after the installation.

The installation process completed successfully.
After starting "news-please" in the CLI, an "Unhandled error in Deferred" pops up.
In Python programs, "newsplease" can be imported but not used.
The news-please call had to be made with sudo, otherwise a PermissionError appears.

You can see the stacktrace below.

Cheers,
Raphael

`raphael@raphael-Latitude-E6330:~$ sudo news-please
[sudo] password for raphael:
[newsplease.config:165|INFO] Loading config-file (/home/raphael/news-please-repo/config/config.cfg)
[newsplease.config:165|INFO] Loading config-file (/home/raphael/news-please-repo/config/config.cfg)
[main:253|INFO] Removed /home/raphael/news-please-repo/.resume_jobdir/f03a98d15778ac99eeb8c578aa8c224b since '--resume' was not passed to initial.py or this crawler was daemonized.
[newsplease.config:165|INFO] Loading config-file (/home/raphael/news-please-repo/config/config.cfg)
[newsplease.config:165|INFO] Loading config-file (/home/raphael/news-please-repo/config/config.cfg)
Unhandled error in Deferred:

[main:253|INFO] Removed /home/raphael/news-please-repo/.resume_jobdir/861e0b7ca3034017282d27dce656d520 since '--resume' was not passed to initial.py or this crawler was daemonized.
[main:253|INFO] Removed /home/raphael/news-please-repo/.resume_jobdir/5011d55eaa1b745eefb709134271e173 since '--resume' was not passed to initial.py or this crawler was daemonized.
Unhandled error in Deferred:

Unhandled error in Deferred:

[newsplease.main:270|INFO] Graceful stop called manually. Shutting down.
`

Use this as a Python library

How can I use news-please in my code as a Python library? I would like to use it to continuously crawl a list of sites stored in a file.

Merge articles spread on multiple pages

Example: http://www.zeit.de/2016/18/ttip-barack-obama-hannover-usa-widerstand Under the given URL only the first part of the article is shown. A (human) reader can either click on a link that points to the second page or can click on "Auf einer Seite lesen" to read all on one page.

What will be the output of the current workflow? Ideally of course multiple pages should be identified and crawled as a single article. However, as this requires actual processing of the article, I expect the system to crawl this article as two articles?
If so, is there any way to easily identify (e.g., during the actual article extraction performed by the km4 team) that two (or more) articles actually belong to only one?

Answer:

It depends on the crawler:

The sitemap and RSS crawler only find pages that are listed in the corresponding files. Thus, those crawlers only find the listed article, which might be the first page, all pages, the entire article or a combination.

The recursive crawlers on the other hand will find all pages as well as the entire article and, if the heuristics work for those, will save all of them.

For the latter, a possible way to identify whether articles belong together is to search for common text parts, since all pages should be part of the entire article.

For both, it would be possible to extract URLs with keywords like "continue reading" or "page x" etc.

Version conflict on python-dateutil

Hi there,
I'm currently running Python 2.7 on Ubuntu 14.04, installed news-please using pip.
Tried running news-please using the CLI and got this error:

Traceback (most recent call last):
File "/usr/local/bin/news-please", line 6, in
from pkg_resources import load_entry_point
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 3147, in
@_call_aside
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 3131, in _call_aside
f(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 3160, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 668, in _build_master
return cls._build_from_requirements(requires)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 681, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 875, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (python-dateutil 2.6.1 (/usr/local/lib/python2.7/dist-packages), Requirement.parse('python-dateutil==2.4.0'), set(['newspaper']))

Download article does not complete

When I try to download an article, my query

newsplease.NewsPlease.download_article('http://www.thehindu.com/todays-paper/tp-national/tp-otherstates/Elephant-destroys-three-houses-in-Meghalaya-village/article17026523.ece')

does not complete or time-out on Python 3.5.2. On Python 2.7.3, however, it returns a result, but also returns the following TypeError

Traceback (most recent call last):
File "/Users/mdmadhusudan/anaconda/envs/earthengine/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/mdmadhusudan/anaconda/envs/earthengine/lib/python2.7/site-packages/newsplease/pipeline/pipelines.py", line 365, in process_item
json.dump(ExtractedInformationStorage.extract_relevant_info(item), file_)
TypeError: unbound method extract_relevant_info() must be called with ExtractedInformationStorage instance as first argument (got NewscrawlerItem instance instead)

I am on macOS Sierra. Any suggestions on how to fix either? Thanks.

ValueError: bad marshal data (unknown type code)

On Ubuntu 16.10, I got the above error.

File "/home/yuyuan/anaconda3/lib/python3.5/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/yuyuan/anaconda3/lib/python3.5/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/home/yuyuan/anaconda3/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/yuyuan/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/setuptools/command/bdist_egg.py", line 209, in run
File "/home/yuyuan/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/setuptools/command/bdist_egg.py", line 245, in zip_safe
File "/home/yuyuan/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/setuptools/command/bdist_egg.py", line 355, in analyze_egg
File "/home/yuyuan/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/setuptools/command/bdist_egg.py", line 392, in scan_module
ValueError: bad marshal data (unknown type code

News please stopping in AWS EC2

In an AWS EC2 t2.micro instance with Ubuntu, I have set up news-please to test whether it works properly. But whenever I start news-please, it fails with the error "Unhandled error in Deferred:".

I tried starting in DEBUG mode also. I am attaching the stacktrace over here:

/usr/local/lib/python3.5/dist-packages/requests/init.py:80: RequestsDependencyWarning: urllib3 (1.13.1) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:165|INFO] Loading config-file (/home/ubuntu/news-please-repo/config/config.cfg)
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Crawler] default
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] url_input_file_name
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] working_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] local_data_directory
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [MySQL] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] ca_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_key_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_level
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_format
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_dateformat
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_encoding
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] jobdirname
[newsplease.config:267|DEBUG] Loading JSON-file (/home/ubuntu/news-please-repo/config/sitelist.hjson)
[newsplease.main:255|DEBUG] Calling Process: ['/usr/bin/python3', '/usr/local/lib/python3.5/dist-packages/newsplease/single_crawler.py', '/home/ubuntu/news-please-repo/config/config.cfg', '/home/ubuntu/news-please-repo/config/sitelist.hjson', '0', 'False', 'False']
[newsplease.main:255|DEBUG] Calling Process: ['/usr/bin/python3', '/usr/local/lib/python3.5/dist-packages/newsplease/single_crawler.py', '/home/ubuntu/news-please-repo/config/config.cfg', '/home/ubuntu/news-please-repo/config/sitelist.hjson', '1', 'False', 'False']
[newsplease.main:255|DEBUG] Calling Process: ['/usr/bin/python3', '/usr/local/lib/python3.5/dist-packages/newsplease/single_crawler.py', '/home/ubuntu/news-please-repo/config/config.cfg', '/home/ubuntu/news-please-repo/config/sitelist.hjson', '2', 'False', 'False']
/usr/local/lib/python3.5/dist-packages/requests/init.py:80: RequestsDependencyWarning: urllib3 (1.13.1) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
/usr/local/lib/python3.5/dist-packages/requests/init.py:80: RequestsDependencyWarning: urllib3 (1.13.1) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
/usr/local/lib/python3.5/dist-packages/requests/init.py:80: RequestsDependencyWarning: urllib3 (1.13.1) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:165|INFO] Loading config-file (/home/ubuntu/news-please-repo/config/config.cfg)
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Crawler] default
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] url_input_file_name
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] working_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] local_data_directory
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [MySQL] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] ca_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_key_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_level
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_format
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_dateformat
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_encoding
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] jobdirname
[main:88|DEBUG] Config initialized - Further initialisation.
[newsplease.config:267|DEBUG] Loading JSON-file (/home/ubuntu/news-please-repo/config/sitelist.hjson)
[newsplease.config:165|INFO] Loading config-file (/home/ubuntu/news-please-repo/config/config.cfg)
[main:192|DEBUG] Using crawler RecursiveCrawler for https://www.dig-in.com/.
[main:253|INFO] Removed /home/ubuntu/news-please-repo/.resume_jobdir/55ae9ff89e530b083b20633a558d116b since '--resume' was not passed to initial.py or this crawler was daemonized.
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Crawler] default
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] url_input_file_name
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] working_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] local_data_directory
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [MySQL] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] ca_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_key_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_level
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_format
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_dateformat
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_encoding
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] jobdirname
[main:88|DEBUG] Config initialized - Further initialisation.
[newsplease.config:267|DEBUG] Loading JSON-file (/home/ubuntu/news-please-repo/config/sitelist.hjson)
[newsplease.config:165|INFO] Loading config-file (/home/ubuntu/news-please-repo/config/config.cfg)
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Crawler] default
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] url_input_file_name
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] working_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] local_data_directory
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [MySQL] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] ca_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_key_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_level
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_format
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_dateformat
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_encoding
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] jobdirname
[main:88|DEBUG] Config initialized - Further initialisation.
[newsplease.config:267|DEBUG] Loading JSON-file (/home/ubuntu/news-please-repo/config/sitelist.hjson)
[main:192|DEBUG] Using crawler RecursiveCrawler for http://www.insurancejournal.com/.
[main:253|INFO] Removed /home/ubuntu/news-please-repo/.resume_jobdir/8294b9c2cc3db1cadb6b4a98109c8590 since '--resume' was not passed to initial.py or this crawler was daemonized.
[main:192|DEBUG] Using crawler RecursiveCrawler for http://www.dnaindia.com/.
[main:253|INFO] Removed /home/ubuntu/news-please-repo/.resume_jobdir/6906ed0b1a6ca7bd359e919a4fd74596 since '--resume' was not passed to initial.py or this crawler was daemonized.
Unhandled error in Deferred:

Unhandled error in Deferred:

Unhandled error in Deferred:

[newsplease.main:270|INFO] Graceful stop called manually. Shutting down.


Please help.

Error during installation of news-please

Ians-MacBook-Air:~ ianmackerracher$ sudo pip install news-please
Password:
The directory '/Users/ianmackerracher/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/ianmackerracher/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting news-please
Downloading news-please-1.0.25.tar.gz (46kB)
100% |████████████████████████████████| 51kB 1.0MB/s
Collecting Scrapy>=1.1.0 (from news-please)
Downloading Scrapy-1.3.2-py2.py3-none-any.whl (239kB)
100% |████████████████████████████████| 245kB 650kB/s
Collecting PyMySQL>=0.7.9 (from news-please)
Downloading PyMySQL-0.7.10-py2.py3-none-any.whl (78kB)
100% |████████████████████████████████| 81kB 1.9MB/s
Collecting hjson>=1.5.8 (from news-please)
Downloading hjson-2.0.2.tar.gz
Collecting elasticsearch>=2.4 (from news-please)
Downloading elasticsearch-5.2.0-py2.py3-none-any.whl (57kB)
100% |████████████████████████████████| 61kB 1.6MB/s
Collecting beautifulsoup4>=4.5.1 (from news-please)
Downloading beautifulsoup4-4.5.3-py2-none-any.whl (85kB)
100% |████████████████████████████████| 92kB 2.5MB/s
Collecting readability-lxml>=0.6.2 (from news-please)
Downloading readability-lxml-0.6.2.tar.gz
Collecting langdetect>=1.0.7 (from news-please)
Downloading langdetect-1.0.7.zip (998kB)
100% |████████████████████████████████| 1.0MB 536kB/s
Collecting python-dateutil>=2.4.0 (from news-please)
Downloading python_dateutil-2.6.0-py2.py3-none-any.whl (194kB)
100% |████████████████████████████████| 194kB 894kB/s
Collecting plac>=0.9.6 (from news-please)
Downloading plac-0.9.6-py2.py3-none-any.whl
Collecting newspaper (from news-please)
Downloading newspaper-0.0.9.8.tar.gz (248kB)
100% |████████████████████████████████| 256kB 1.1MB/s
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy>=1.1.0->news-please)
Requirement already satisfied: PyDispatcher>=2.0.5 in /Library/Python/2.7/site-packages (from Scrapy>=1.1.0->news-please)
Requirement already satisfied: Twisted>=13.1.0 in /Library/Python/2.7/site-packages (from Scrapy>=1.1.0->news-please)
Requirement already satisfied: pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Scrapy>=1.1.0->news-please)
Requirement already satisfied: queuelib in /Library/Python/2.7/site-packages (from Scrapy>=1.1.0->news-please)
Collecting w3lib>=1.15.0 (from Scrapy>=1.1.0->news-please)
Downloading w3lib-1.17.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from Scrapy>=1.1.0->news-please)
Downloading cssselect-1.0.1-py2.py3-none-any.whl
Collecting parsel>=1.1 (from Scrapy>=1.1.0->news-please)
Downloading parsel-1.1.0-py2.py3-none-any.whl
Collecting service-identity (from Scrapy>=1.1.0->news-please)
Downloading service_identity-16.0.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy>=1.1.0->news-please)
Downloading six-1.10.0-py2.py3-none-any.whl
Collecting urllib3<2.0,>=1.8 (from elasticsearch>=2.4->news-please)
Downloading urllib3-1.20-py2.py3-none-any.whl (111kB)
100% |████████████████████████████████| 112kB 1.5MB/s
Collecting chardet (from readability-lxml>=0.6.2->news-please)
Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
100% |████████████████████████████████| 184kB 1.7MB/s
Collecting Pillow==2.5.1 (from newspaper->news-please)
Downloading Pillow-2.5.1-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.whl (3.0MB)
100% |████████████████████████████████| 3.0MB 336kB/s
Collecting PyYAML==3.11 (from newspaper->news-please)
Downloading PyYAML-3.11.zip (371kB)
100% |████████████████████████████████| 378kB 1.4MB/s
Collecting nltk==2.0.5 (from newspaper->news-please)
Downloading nltk-2.0.5.tar.gz (954kB)
100% |████████████████████████████████| 962kB 821kB/s
Collecting requests==2.3.0 (from newspaper->news-please)
Downloading requests-2.3.0-py2.py3-none-any.whl (452kB)
100% |████████████████████████████████| 460kB 911kB/s
Collecting jieba==0.35 (from newspaper->news-please)
Downloading jieba-0.35.zip (7.4MB)
100% |████████████████████████████████| 7.4MB 137kB/s
Collecting feedparser==5.1.3 (from newspaper->news-please)
Downloading feedparser-5.1.3.zip (1.2MB)
100% |████████████████████████████████| 1.2MB 933kB/s
Collecting tldextract==1.5.1 (from newspaper->news-please)
Downloading tldextract-1.5.1.tar.gz (57kB)
100% |████████████████████████████████| 61kB 1.6MB/s
Collecting feedfinder2==0.0.1 (from newspaper->news-please)
Downloading feedfinder2-0.0.1.tar.gz
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Requirement already satisfied: constantly>=15.1 in /Library/Python/2.7/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Requirement already satisfied: incremental>=16.10.1 in /Library/Python/2.7/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Collecting pyasn1 (from service-identity->Scrapy>=1.1.0->news-please)
Downloading pyasn1-0.2.2-py2.py3-none-any.whl (51kB)
100% |████████████████████████████████| 61kB 3.6MB/s
Collecting pyasn1-modules (from service-identity->Scrapy>=1.1.0->news-please)
Downloading pyasn1_modules-0.0.8-py2.py3-none-any.whl
Collecting attrs (from service-identity->Scrapy>=1.1.0->news-please)
Downloading attrs-16.3.0-py2.py3-none-any.whl
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from tldextract==1.5.1->newspaper->news-please)
Installing collected packages: six, w3lib, cssselect, parsel, pyasn1, pyasn1-modules, attrs, service-identity, Scrapy, PyMySQL, hjson, urllib3, elasticsearch, beautifulsoup4, chardet, readability-lxml, langdetect, python-dateutil, plac, Pillow, PyYAML, nltk, requests, jieba, feedparser, tldextract, feedfinder2, newspaper, news-please
Found existing installation: six 1.4.1
DEPRECATION: Uninstalling a distutils installed project (six) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling six-1.4.1:
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 342, in run
prefix=options.prefix_path,
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 778, in install
requirement.uninstall(auto_confirm=True)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_install.py", line 754, in uninstall
paths_to_remove.remove(auto_confirm)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_uninstall.py", line 115, in remove
renames(path, new_path)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/init.py", line 267, in renames
shutil.move(old, new)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
copy2(src, real_dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
copystat(src, dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-YSDQTZ-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

%%%%%%%%%Try to install by ignoring already installed%%%%%%%%%

Ians-MacBook-Air:~ ianmackerracher$ pip install --ignore-installed news-please
Collecting news-please
Using cached news-please-1.0.25.tar.gz
Collecting Scrapy>=1.1.0 (from news-please)
Using cached Scrapy-1.3.2-py2.py3-none-any.whl
Collecting PyMySQL>=0.7.9 (from news-please)
Using cached PyMySQL-0.7.10-py2.py3-none-any.whl
Collecting hjson>=1.5.8 (from news-please)
Using cached hjson-2.0.2.tar.gz
Collecting elasticsearch>=2.4 (from news-please)
Using cached elasticsearch-5.2.0-py2.py3-none-any.whl
Collecting beautifulsoup4>=4.5.1 (from news-please)
Using cached beautifulsoup4-4.5.3-py2-none-any.whl
Collecting readability-lxml>=0.6.2 (from news-please)
Using cached readability-lxml-0.6.2.tar.gz
Collecting langdetect>=1.0.7 (from news-please)
Using cached langdetect-1.0.7.zip
Collecting python-dateutil>=2.4.0 (from news-please)
Using cached python_dateutil-2.6.0-py2.py3-none-any.whl
Collecting plac>=0.9.6 (from news-please)
Using cached plac-0.9.6-py2.py3-none-any.whl
Collecting newspaper (from news-please)
Using cached newspaper-0.0.9.8.tar.gz
Collecting lxml (from Scrapy>=1.1.0->news-please)
Using cached lxml-3.7.3.tar.gz
Collecting PyDispatcher>=2.0.5 (from Scrapy>=1.1.0->news-please)
Using cached PyDispatcher-2.0.5.tar.gz
Collecting Twisted>=13.1.0 (from Scrapy>=1.1.0->news-please)
Using cached Twisted-17.1.0.tar.bz2
Collecting pyOpenSSL (from Scrapy>=1.1.0->news-please)
Using cached pyOpenSSL-16.2.0-py2.py3-none-any.whl
Collecting queuelib (from Scrapy>=1.1.0->news-please)
Using cached queuelib-1.4.2-py2.py3-none-any.whl
Collecting w3lib>=1.15.0 (from Scrapy>=1.1.0->news-please)
Using cached w3lib-1.17.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from Scrapy>=1.1.0->news-please)
Using cached cssselect-1.0.1-py2.py3-none-any.whl
Collecting parsel>=1.1 (from Scrapy>=1.1.0->news-please)
Using cached parsel-1.1.0-py2.py3-none-any.whl
Collecting service-identity (from Scrapy>=1.1.0->news-please)
Using cached service_identity-16.0.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy>=1.1.0->news-please)
Using cached six-1.10.0-py2.py3-none-any.whl
Collecting urllib3<2.0,>=1.8 (from elasticsearch>=2.4->news-please)
Using cached urllib3-1.20-py2.py3-none-any.whl
Collecting chardet (from readability-lxml>=0.6.2->news-please)
Using cached chardet-2.3.0-py2.py3-none-any.whl
Collecting Pillow==2.5.1 (from newspaper->news-please)
Using cached Pillow-2.5.1-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.whl
Collecting PyYAML==3.11 (from newspaper->news-please)
Using cached PyYAML-3.11.zip
Collecting nltk==2.0.5 (from newspaper->news-please)
Using cached nltk-2.0.5.tar.gz
Collecting requests==2.3.0 (from newspaper->news-please)
Using cached requests-2.3.0-py2.py3-none-any.whl
Collecting jieba==0.35 (from newspaper->news-please)
Using cached jieba-0.35.zip
Collecting feedparser==5.1.3 (from newspaper->news-please)
Using cached feedparser-5.1.3.zip
Collecting tldextract==1.5.1 (from newspaper->news-please)
Using cached tldextract-1.5.1.tar.gz
Collecting feedfinder2==0.0.1 (from newspaper->news-please)
Using cached feedfinder2-0.0.1.tar.gz
Collecting zope.interface>=3.6.0 (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Using cached zope.interface-4.3.3-cp27-cp27m-macosx_10_11_x86_64.whl
Collecting constantly>=15.1 (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Using cached constantly-15.1.0-py2.py3-none-any.whl
Collecting incremental>=16.10.1 (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Using cached incremental-16.10.1-py2.py3-none-any.whl
Collecting Automat>=0.3.0 (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Using cached Automat-0.5.0-py2.py3-none-any.whl
Collecting cryptography>=1.3.4 (from pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached cryptography-1.7.2-cp27-cp27m-macosx_10_6_intel.whl
Collecting pyasn1 (from service-identity->Scrapy>=1.1.0->news-please)
Using cached pyasn1-0.2.2-py2.py3-none-any.whl
Collecting pyasn1-modules (from service-identity->Scrapy>=1.1.0->news-please)
Using cached pyasn1_modules-0.0.8-py2.py3-none-any.whl
Collecting attrs (from service-identity->Scrapy>=1.1.0->news-please)
Using cached attrs-16.3.0-py2.py3-none-any.whl
Collecting setuptools (from tldextract==1.5.1->newspaper->news-please)
Using cached setuptools-34.2.0-py2.py3-none-any.whl
Collecting idna>=2.0 (from cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached idna-2.2-py2.py3-none-any.whl
Collecting ipaddress (from cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached ipaddress-1.0.18-py2-none-any.whl
Collecting enum34 (from cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached enum34-1.1.6-py2-none-any.whl
Collecting cffi>=1.4.1 (from cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached cffi-1.9.1-cp27-cp27m-macosx_10_10_intel.whl
Collecting packaging>=16.8 (from setuptools->tldextract==1.5.1->newspaper->news-please)
Using cached packaging-16.8-py2.py3-none-any.whl
Collecting appdirs>=1.4.0 (from setuptools->tldextract==1.5.1->newspaper->news-please)
Using cached appdirs-1.4.0-py2.py3-none-any.whl
Collecting pycparser (from cffi>=1.4.1->cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached pycparser-2.17.tar.gz
Collecting pyparsing (from packaging>=16.8->setuptools->tldextract==1.5.1->newspaper->news-please)
Using cached pyparsing-2.1.10-py2.py3-none-any.whl
Installing collected packages: lxml, PyDispatcher, six, pyparsing, packaging, appdirs, setuptools, zope.interface, constantly, incremental, attrs, Automat, Twisted, idna, ipaddress, enum34, pyasn1, pycparser, cffi, cryptography, pyOpenSSL, queuelib, w3lib, cssselect, parsel, pyasn1-modules, service-identity, Scrapy, PyMySQL, hjson, urllib3, elasticsearch, beautifulsoup4, chardet, readability-lxml, langdetect, python-dateutil, plac, Pillow, PyYAML, nltk, requests, jieba, feedparser, tldextract, feedfinder2, newspaper, news-please
Running setup.py install for lxml ... error
Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-build-z_Uoxu/lxml/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-nxGrQI-record/install-record.txt --single-version-externally-managed --compile:
Building lxml version 3.7.3.
Building without Cython.
Using build configuration of libxslt 1.1.29
Building against libxml2/libxslt in the following directory: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/lib
running install
running build
running build_py
creating build
creating build/lib.macosx-10.12-intel-2.7
creating build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/init.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/_elementpath.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/builder.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/cssselect.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/doctestcompare.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/ElementInclude.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/pyclasslookup.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/sax.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/usedoctest.py -> build/lib.macosx-10.12-intel-2.7/lxml
creating build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/init.py -> build/lib.macosx-10.12-intel-2.7/lxml/includes
creating build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/init.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/_diffcommand.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/_html5builder.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/_setmixin.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/builder.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/clean.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/defs.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/diff.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/ElementSoup.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/formfill.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/html5parser.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/soupparser.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/usedoctest.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron
copying src/lxml/isoschematron/init.py -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron
copying src/lxml/lxml.etree.h -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/lxml.etree_api.h -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/includes/c14n.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/config.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/dtdvalid.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/etreepublic.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/htmlparser.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/relaxng.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/schematron.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/tree.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/uri.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xinclude.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xmlerror.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xmlparser.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xmlschema.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xpath.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xslt.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/etree_defs.h -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/lxml-version.h -> build/lib.macosx-10.12-intel-2.7/lxml/includes
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/rng
copying src/lxml/isoschematron/resources/rng/iso-schematron.rng -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/rng
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl
copying src/lxml/isoschematron/resources/xsl/RNG2Schtrn.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl
copying src/lxml/isoschematron/resources/xsl/XSD2Schtrn.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_abstract_expand.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_dsdl_include.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_message.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_skeleton_for_xslt1.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_svrl_for_xslt1.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/readme.txt -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
creating build/temp.macosx-10.12-intel-2.7
creating build/temp.macosx-10.12-intel-2.7/src
creating build/temp.macosx-10.12-intel-2.7/src/lxml
cc -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libxml2 -Isrc/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.12-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
cc -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -Wl,-F. build/temp.macosx-10.12-intel-2.7/src/lxml/lxml.etree.o -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/lib -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.12-intel-2.7/lxml/etree.so
building 'lxml.objectify' extension
cc -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libxml2 -Isrc/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.objectify.c -o build/temp.macosx-10.12-intel-2.7/src/lxml/lxml.objectify.o -w -flat_namespace
cc -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -Wl,-F. build/temp.macosx-10.12-intel-2.7/src/lxml/lxml.objectify.o -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/lib -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.12-intel-2.7/lxml/objectify.so
running install_lib
copying build/lib.macosx-10.12-intel-2.7/lxml/etree.so -> /Library/Python/2.7/site-packages/lxml
error: could not delete '/Library/Python/2.7/site-packages/lxml/etree.so': Permission denied


Command "/usr/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-build-z_Uoxu/lxml/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-nxGrQI-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-build-z_Uoxu/lxml/

ImportError: No module named _thread

import newsplease
Traceback (most recent call last):
File "", line 1, in
File "/home/hi-161/home/nes/ajay/local/lib/python2.7/site-packages/newsplease/init.py", line 6, in
from newsplease.single_crawler import SingleCrawler
File "/home/hi-161/home/nes/ajay/local/lib/python2.7/site-packages/newsplease/single_crawler.py", line 26, in
from _thread import start_new_thread
ImportError: No module named _thread

Version Conflict on 3.5

Hi again,
I have tried running news-please on Python 3.5 and Ubuntu 16.02 in the CLI.
A version conflict was raised. See stack trace below.

Cheers, Raphael

raphael@raphael-Latitude-E6330:~$ sudo news-please
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 635, in _build_master
ws.require(requires)
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 943, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 834, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (lxml 3.5.0 (/usr/lib/python3/dist-packages), Requirement.parse('lxml>=3.6.0'), {'newspaper3k'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/news-please", line 5, in
from pkg_resources import load_entry_point
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 2927, in
@_call_aside
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 2913, in _call_aside
f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 2940, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 637, in _build_master
return cls._build_from_requirements(requires)
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 650, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 834, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (lxml 3.5.0 (/usr/lib/python3/dist-packages), Requirement.parse('lxml>=3.6.0'), {'newspaper3k'})

improve ComparerText

ComparerText compares each extracted text with every other extracted text and calculates a similarity score. Based on this score, one of the most similar texts is returned. Several improvements are possible:

What happens when two texts are very similar and a third text is not similar to the others but is actually the best extraction? One of the similar texts would be chosen, because there is no check for extractor biases: if most extractors mishandle the same tag in an HTML file, their matching but wrong results win. A pure similarity measure therefore rewards a wrong result simply because it was extracted multiple times.

What happens when there are two similar but wrongly extracted texts and three less similar texts, each of which is better than either of the two very similar ones? Again, one of the two similar ones would be chosen. A method that compares every partition with the other partitions would produce more correct results, but you lose speed.

The similarity score was created by a group of students; there is existing research on text comparison, and there may be better ways to measure the similarity of texts.
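
A minimal sketch of the pairwise approach described above (not the actual ComparerText implementation; names are illustrative): score every candidate against all others with Python's difflib and pick the one with the highest average similarity. It also illustrates the bias problem, since identical wrong outputs outvote a single better one.

from difflib import SequenceMatcher

def pick_most_agreed_text(candidates):
    """Return the candidate that is, on average, most similar to the others."""
    best_text, best_score = None, -1.0
    for i, text in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return text
        avg = sum(SequenceMatcher(None, text, other).ratio() for other in others) / len(others)
        if avg > best_score:
            best_text, best_score = text, avg
    return best_text

# The two (identically wrong) short texts outvote the longer, better extraction.
print(pick_most_agreed_text(["short wrong text", "short wrong text", "the full, correct article body ..."]))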

lxml version requirements

I get an error installing news-please using pip:
No matching distribution found for lxml>=3.35 (from news-please)

It looks like the version number comparison should be 3.3.5 instead of 3.35.

Couldn't complete the installation

When I try to install this library, I get the following error.

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

    ----------------------------------------
Command "c:\python36-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\Nishara\\AppData\\Local\\Temp\\pip-install-9axz4aui\\Twisted\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\Nishara\AppData\Local\Temp\pip-record-_xo3iyqt\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Nishara\AppData\Local\Temp\pip-install-9axz4aui\Twisted\

My Python version is 3.6.

fix wiki links

It looks like many of them were broken when the repo was transferred.

AttributeError: 'module' object has no attribute 'request'

Hello!

After a fresh install I ran the example code from the readme file and it gave me the following error:

>>> from newsplease import NewsPlease
>>> article = NewsPlease.from_url('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html?hp')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gregor/anaconda/lib/python2.7/site-packages/newsplease/__init__.py", line 79, in from_url
    articles = NewsPlease.from_urls([url])
  File "/Users/gregor/anaconda/lib/python2.7/site-packages/newsplease/__init__.py", line 98, in from_urls
    html = SimpleCrawler.fetch_url(url)
  File "/Users/gregor/anaconda/lib/python2.7/site-packages/newsplease/crawler/simple_crawler.py", line 16, in fetch_url
    return SimpleCrawler._fetch_url(url, False)
  File "/Users/gregor/anaconda/lib/python2.7/site-packages/newsplease/crawler/simple_crawler.py", line 27, in _fetch_url
    req = urllib.request.Request(url, None, headers)
AttributeError: 'module' object has no attribute 'request'

Python 2.7.13 (on macOS 10.12.5 Sierra)
urllib is installed

Unfortunately I had no time for debugging. Might come back to it soon.

Gregor
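
For context, urllib.request only exists on Python 3; on Python 2 the equivalent lives in urllib2. A hedged sketch of a version-compatible fetch (illustrative only, not the fix applied in news-please):

import sys

# urllib.request exists only on Python 3; urllib2 is the Python 2 counterpart.
if sys.version_info[0] >= 3:
    from urllib.request import Request, urlopen
else:
    from urllib2 import Request, urlopen  # Python 2 fallback

def fetch(url, headers=None):
    # Build the request with optional headers and return the raw response body.
    req = Request(url, None, headers or {})
    return urlopen(req).read()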

Cut down on what is published per article

Hi guys, I love the tool. I was just looking for a way to cut down on the fields written to file and to change the field names. Is there a file where I can edit these settings?

Specifically, I want the output files to only contain:

'url': {'type': 'string', 'index': 'not_analyzed'},
'source': {'type': 'string', 'index': 'not_analyzed'},
'created_at#(renamed published_date)#': {'type': 'date', "format":"yyyy-MM-dd HH:mm:ss"},
'title': {'type': 'string'},
'text': {'type': 'string'},
'author': {'type': 'string'}
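
If the built-in config does not cover this, one hedged workaround (illustrative, not an official news-please option) is to post-process each extracted article dict and keep only the fields you want, renaming as needed. The field names below are assumptions based on typical news-please output; adjust them to the keys in your files.

# Hypothetical post-processing step: keep selected fields and rename one of them.
KEEP = {"url", "source_domain", "date_publish", "title", "maintext", "authors"}
RENAME = {"date_publish": "published_date"}  # assumed target name

def slim_article(article_dict):
    slim = {k: v for k, v in article_dict.items() if k in KEEP}
    for old, new in RENAME.items():
        if old in slim:
            slim[new] = slim.pop(old)
    return slim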

improve ComparerTitle

In order to compare the extracted titles, the comparer builds the Cartesian product of all titles and compares each pair of titles. This could become a performance problem if more extractors are added.
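
A hedged sketch of one possible optimization (not the current ComparerTitle code): compare unordered pairs via itertools.combinations instead of the full Cartesian product, which skips self-pairs and halves the number of comparisons.

from difflib import SequenceMatcher
from itertools import combinations

def pairwise_title_scores(titles):
    """Score each unordered pair of titles once: n*(n-1)/2 comparisons
    instead of n*n from a full Cartesian product."""
    return {
        (a, b): SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(titles, 2)
    }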

(adding) Pipeline / Filter - keyword(s) Filter

Hey there,

While almost all news sites structure their content thematically (and therefore broad thematic crawling is possible), and one could filter indirectly later on via Elasticsearch or another database, is it planned to add such a pipeline filter (so that non-keyword articles can be dropped during the crawl) in the future?

Or did I miss something? I have crawled through the docs but can't find anything in that regard.

Best wishes
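
Assuming such a filter is not built in, here is a minimal sketch of a keyword filter written as a Scrapy-style item pipeline (the class name, keyword set, and item field names are illustrative; check the news-please pipeline documentation for the actual hook and keys):

from scrapy.exceptions import DropItem

KEYWORDS = {"election", "toblerone"}  # example keywords

class KeywordFilterPipeline:
    """Drop articles whose title and text contain none of the keywords.

    This mimics the standard Scrapy item-pipeline interface; the exact field
    names used by news-please items may differ, so treat this as a sketch.
    """
    def process_item(self, item, spider):
        text = " ".join(str(item.get(field, "")) for field in ("title", "maintext")).lower()
        if not any(keyword in text for keyword in KEYWORDS):
            raise DropItem("no keyword match")
        return item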

add wiki doc for direct url crawl and extract

Hi Felix,

you just need to use the Download crawler and disable the heuristics.
Attached you will find a corresponding JSON.

Best regards
Sören

# Note: this is the actual config file, but by default it is just filled with examples.
{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      "crawler": "Download",
      "url": [
        # Cubs win Championship ~03.11.2016
        "http://www.dailymail.co.uk/news/article-3899956/Chicago-Cubs-win-World-Series-epic-Game-7-showdown-Cleveland.html",
        "http://www.mirror.co.uk/sport/other-sports/american-sports/chicago-cubs-win-world-series-9185077",
        "https://www.theguardian.com/sport/2016/nov/03/world-series-game-7-chicago-cubs-cleveland-indians-mlb",
        "http://www.telegraph.co.uk/baseball/2016/11/03/chicago-cubs-break-108-year-curse-of-the-billy-goat-winning-worl/",
        "https://www.thesun.co.uk/sport/othersports/2106710/chicago-cubs-win-world-series-hillary-clinton-bill-murray-and-barack-obama-lead-celebrations-as-cubs-end-108-year-curse/",
        "http://www.bbc.com/sport/baseball/37857919",
        "http://www.thetimes.co.uk/article/chicago-cubs-end-108-year-wait-for-world-series-win-g09t0kgfm",
        "http://www.independent.co.uk/sport/us-sport/major-league-baseball/world-series-chicago-cubs-cleveland-indians-108-year-title-drought-a7394706.html",
        "http://www.independent.co.uk/sport/us-sport/major-league-baseball/chicago-cubs-fans-celebrate-world-series-title-a7394736.html",
        "http://www.standard.co.uk/sport/other-sports/chicago-cubs-win-world-series-to-end-108year-curse-and-earn-invite-from-barack-obama-a3386411.html",
        "http://www.nytimes.com/2016/11/03/sports/baseball/chicago-cubs-beat-cleveland-indians-world-series-game-7.html?_r=0",
        "https://www.washingtonpost.com/sports/believe-it-chicago-cubs-win-classic-game-7-to-win-first-world-series-since-1908/2016/11/03/99cfc9c2-a0b3-11e6-a44d-cc2898cfab06_story.html",
        "https://www.washingtonpost.com/sports/nationals/you-knew-it-couldnt-come-easy-but-the-cubs-are-world-series-champions/2016/11/03/a4487ade-a0b3-11e6-a44d-cc2898cfab06_story.html",
        "http://www.usatoday.com/story/sports/ftw/2016/11/03/sports-world-reacts-to-the-chicago-cubs-winning-their-first-world-series-since-1908/93225730/",
        "http://www.wsj.com/articles/chicago-cubs-win-the-world-series-ending-108-year-drought-1478149589",
        "http://nypost.com/2016/11/03/cubs-end-drought-in-chaotic-epic-world-series-finale/",


        # FBI clears Clinton ~06.11-2016
        "http://www.dailymail.co.uk/wires/reuters/article-3910804/Trump-Clinton-focus-crucial-states-campaigns-final-days.html",
        "http://www.mirror.co.uk/news/world-news/hillary-clinton-cleared-fbi-over-9210739",
        "https://www.theguardian.com/us-news/2016/nov/06/fbi-director-hillary-clinton-email-investigation-criminal-james-comey",
        "http://www.bbc.com/news/election-us-2016-37892348",
        "https://www.thesun.co.uk/news/2130219/trumps-fury-after-fbi-says-hillary-clinton-has-committed-no-crime-in-email-scandal/",
        "http://www.thetimes.co.uk/article/clinton-off-the-hook-as-fbi-drops-investigation-into-emails-m9t6pnr0s",
        "http://www.express.co.uk/news/world/729545/hillary-clinton-chelsea-clinton-wedding-funds-wikileaks-rudy-giuliani",
        "http://www.telegraph.co.uk/news/2016/11/06/us-election-hillary-clinton-up-in-polls-as-hispanic-surge-threat/",
        "http://nypost.com/2016/11/06/fbi-stands-by-decision-not-to-charge-clinton-after-review-of-additional-emails/",
        "http://www.wsj.com/articles/fbis-comey-says-new-emails-dont-change-conclusions-about-hillary-clinton-1478464650",
        "http://www.usatoday.com/story/news/politics/elections/2016/2016/11/06/fbi-not-recommending-charges-over-new-clinton-emails/93395808/",
        "https://www.washingtonpost.com/blogs/post-partisan/wp/2016/11/06/comey-to-country-the-jury-will-disregard/?utm_term=.914ce12b2617",
        "http://www.nytimes.com/2016/11/06/us/politics/presidential-election.html",

        # China: rescue boy from well ~08.11.2016
        "http://www.bbc.com/news/world-asia-china-37906226",
        "http://www.bbc.com/news/world-asia-china-37946716",
        "http://www.dailymail.co.uk/news/article-3916560/Dramatic-footage-shows-rescuers-using-eighty-diggers-save-boy-fell-130ft-deep-picking-cabbages-Chinese-farm.html",
        "http://www.dailymail.co.uk/news/article-3923808/Mystery-Chinese-boy-fell-deep-massive-rescue-operation-involving-80-diggers-chute-empty.html",
        "http://www.thetimes.co.uk/article/millions-watch-as-rescue-effort-fails-to-save-boy-52c6dlpml",

        # Toblerone redesign outrage ~08.11.2016
        "http://www.standard.co.uk/news/uk/toblerone-bar-shape-change-sparks-anger-among-fans-a3389711.html",
        "http://www.independent.co.uk/news/uk/home-news/toblerone-new-shape-outrage-chocolate-scandal-a7404011.html",
        "https://www.thesun.co.uk/news/2138318/toblerone-fans-outraged-after-gap-between-triangles-is-increased-to-reduce-the-amount-of-chocolate-in-bars/",
        "http://www.thetimes.co.uk/article/toblerone-redesign-runs-into-a-mountain-of-trouble-82h785rfr",
        "http://www.telegraph.co.uk/news/2016/11/08/toblerone-faces-mountain-of-criticism-over-stupid-change-to-its/",
        "https://www.theguardian.com/business/2016/nov/08/toblerone-gets-more-gappy-but-its-fans-are-not-happy",
        "http://www.mirror.co.uk/news/uk-news/is-diet-version-outrage-toblerone-9217538",
        "http://www.dailymail.co.uk/news/article-3915960/Toblerone-increase-gaps-bar-s-iconic-peaks-make-lighter.html",
        "http://www.bbc.com/news/uk-37904703",
        "http://www.mirror.co.uk/news/weird-news/best-way-use-controversial-new-9233101",
        "http://www.express.co.uk/life-style/food/730108/Toblerone-shrinks-chocolate-bar-denies-Brexit-link",
        "http://nypost.com/2016/11/08/people-are-pissed-over-toblerones-new-candy-size/",
        "http://blogs.wsj.com/moneybeat/2016/11/08/mondelezs-toblerone-moves-mountain-to-hide-price-increase/",
        "http://www.usatoday.com/story/money/nation-now/2016/11/08/while-us-talks-election-uk-outraged-over-toblerone-chocolate/93465240/",
        "https://www.washingtonpost.com/news/worldviews/wp/2016/11/08/brits-blame-strange-new-toblerone-shape-on-brexit/",
        "http://www.nytimes.com/2016/11/09/world/europe/toblerone-triangle-change-uk.html",

        # Trump wins election
        # Anti-Trump protests ~09.11.2016
        "http://www.independent.co.uk/news/world/americas/us-elections/us-election-donald-trump-wins-protests-los-angeles-california-oregon-a7407521.html",
        "http://www.thetimes.co.uk/article/thousands-protest-against-election-result-in-us-cities-5v6ncl6pg",
        "http://www.telegraph.co.uk/news/2016/11/10/demonstrations-erupt-across-the-us-as-country-begins-to-imagine/",
        "http://www.express.co.uk/news/world/730363/protests-Donald-Trump-violence-US-election-Hillary-Clinton",
        "http://www.mirror.co.uk/news/world-news/trump-win-sparks-riots-across-9225317",
        "http://www.bbc.com/news/election-us-2016-37946231",
        "http://www.dailymail.co.uk/wires/ap/article-3920168/Trump-victory-sets-protests-California-Oregon.html",
        "https://www.theguardian.com/us-news/2016/nov/09/anti-donald-trump-protests-new-york-chicago-san-francisco",
        "http://www.nytimes.com/2016/11/10/us/trump-election-protest-berkeley-oakland.html",
        "https://www.washingtonpost.com/politics/trumps-white-house-win-promises-to-reshape-us-political-landscape/2016/11/09/62baa5e4-a66a-11e6-ba59-a7d93165c6d4_story.html"
        "http://www.usatoday.com/story/news/politics/2016/11/10/hundreds-protest-trump-downtown-milwaukee/93617960/",
        "http://www.wsj.com/articles/thousands-protest-outside-trump-tower-1478742884",
        "http://nypost.com/2016/11/09/protests-erupt-in-california-after-trump-victory/",
        "http://nypost.com/2016/11/11/anti-trump-protests-continue-in-wake-of-election/",

        # Lady Gaga protests ~09.11.2016
        "http://www.standard.co.uk/showbiz/celebrity-news/lady-gaga-stages-protest-outside-trump-towers-after-donald-trump-beats-hillary-clinton-a3391526.html",
        "http://www.mirror.co.uk/3am/celebrity-news/lady-gaga-protests-outside-trump-9228523",
        "http://www.independent.co.uk/news/people/president-donald-trump-lady-gaga-protest-tower-new-york-a7407081.html",
        "http://www.telegraph.co.uk/news/2016/11/09/lady-gaga-protests-outside-trump-tower/",
        "http://www.dailymail.co.uk/news/article-3918926/Hollywood-starts-panic-results-aren-t-going-Clinton-s-way.html",

        # Croydon tram accident
        "https://www.theguardian.com/uk-news/2016/nov/09/croydon-tram-crash-kills-at-least-seven-and-injures-more-than-50",
        "http://www.standard.co.uk/news/transport/croydon-tram-derailment-people-trapped-after-tram-overturns-in-at-sandilands-a3390796.html",
        "http://www.bbc.com/news/uk-england-london-37919658",
        "http://www.express.co.uk/news/uk/730639/Croydon-tram-crash-carnage-survivor-derailment-seven-dead",
        "http://www.mirror.co.uk/news/uk-news/huge-rescue-operation-sandilands-station-9226276",
        "http://www.dailymail.co.uk/wires/pa/article-3919284/Five-trapped-40-injured-tram-overturns-tunnel.html",
        "http://www.telegraph.co.uk/news/2016/11/10/croydon-tram-crash-police-check-drivers-mobile-phone-records/",
        "https://www.thesun.co.uk/news/2150294/croydon-tram-crash-derailment-cause/",
        "http://www.thetimes.co.uk/article/at-least-four-dead-and-dozens-injured-as-tram-derails-vqpsbrjb3",
        "http://www.independent.co.uk/news/uk/home-news/five-trapped-40-injured-after-tram-overturns-south-london-croydon-a7406496.html",
        "http://nypost.com/2016/11/09/several-dead-and-dozens-injured-after-tram-overturns-in-london/",
        "http://www.usatoday.com/story/news/2016/11/09/least-7-killed-tram-accident-south-london/93549248/",
        "http://www.nytimes.com/2016/11/10/world/europe/tram-derails-croydon-london.html",

        # Croydon tram accident follow-up
        "http://www.standard.co.uk/news/london/croydon-tram-crash-police-identify-all-seven-victims-killed-in-derailment-tragedy-a3394126.html",
        "http://www.independent.co.uk/news/uk/home-news/croydon-tram-crash-victims-named-last-london-derail-tributes-a7414006.html",
        "https://www.thesun.co.uk/news/2172708/croydon-tram-crash-victims-named/",
        "http://www.dailymail.co.uk/news/article-3929748/Croydon-tram-crash-carriages-carried-away-lorry-police-probe-claims-derailed-just-days-seven-people-died-tragedy.html",
        "http://www.thetimes.co.uk/article/young-father-killed-in-tram-crash-x3zh5dp0v",

        #shooting near protests ~10.11.2016
        "https://www.thesun.co.uk/news/2154878/shot-seattle-protests-donald-trump-election/",
        "http://www.express.co.uk/news/world/730721/Five-gunned-Seattle-shot-anti-Donald-Trump-US-President-Washington-victory-Republican",
        "http://www.dailymail.co.uk/news/article-3922446/Report-shooting-multiple-victims-near-Trump-protest-Seattle-PD.html",
        "http://www.wsj.com/articles/five-people-shot-in-seattle-unclear-if-connected-to-trump-protest-1478749638",

        # Trump meets Obama in the white house
        "http://www.bbc.com/news/election-us-2016-37932231",
        "http://www.dailymail.co.uk/news/article-3922932/Transition-Obama-Trump-meet-White-House.html",
        "http://www.standard.co.uk/news/world/barack-obama-describes-first-meeting-with-donald-trump-at-white-house-as-excellent-a3392866.html",
        "https://www.theguardian.com/us-news/live/2016/nov/10/donald-trump-barack-obama-white-house-us-election-live-updates",
        "http://www.independent.co.uk/news/world/americas/donald-trump-meets-barack-obama-body-language-president-president-elect-a7412186.html",
        "http://www.mirror.co.uk/news/world-news/donald-trump-barack-obama-hold-9234917",
        "http://www.thetimes.co.uk/article/rbtrump-3dc5hngts",
        "http://www.thetimes.co.uk/article/two-bitter-rivals-meet-at-the-white-house-br30bmkhn",
        "http://www.express.co.uk/news/world/730940/Donald-Trump-Barack-Obama-White-House-Washington-US-election-2016",
        "http://www.telegraph.co.uk/news/2016/11/10/donald-trump-and-barack-obamas-meeting-was-awkward-they-looked-l/",
        "https://www.thesun.co.uk/news/2159514/donald-trump-arrives-in-washington-ahead-of-power-transition-talks-with-president-barack-obama/",
        "https://www.washingtonpost.com/news/post-politics/wp/2016/11/10/obama-to-welcome-trump-to-white-house-for-first-meeting-since-election/",
        "http://www.usatoday.com/story/news/politics/elections/2016/11/10/obama-trump-white-house-transition/93581810/",
        "http://www.wsj.com/articles/trump-obama-set-to-begin-transition-1478787730",

        # german consulate attack ~10.11.2016
        "https://www.theguardian.com/world/2016/nov/10/taliban-attack-german-consulate-mazar-i-sharif-afghanistan-nato-airstrikes-kunduz",
        "http://www.bbc.com/news/world-asia-37944115",
        "http://www.independent.co.uk/news/world/middle-east/german-consulate-afghanistan-attacked-bomb-suicide-taliban-revenge-mazar-i-sharif-kunduz-attack-two-a7410746.html",
        "http://www.thetimes.co.uk/article/two-killed-in-bomb-attack-on-consulate-mttnh9pt9",
        "https://www.thesun.co.uk/news/2162467/taliban-suicide-bomber-truck-german-consulate-afghanistan-killing-two/amp/",
        "http://www.telegraph.co.uk/news/2016/11/10/taliban-attack-german-consulate-in-northern-afghan-city-of-mazar/",
        "http://www.express.co.uk/news/world/731052/German-consulate-explosion-gunfire-Afghanistan",
        "http://www.nytimes.com/2016/11/11/world/asia/taliban-strike-german-consulate-in-afghan-city-of-mazar-i-sharif.html?mtrref=query.nytimes.com&gwh=792F9F9ECEB17B00C71C4F8444293AD8&gwt=pay",
        "http://www.wsj.com/articles/german-consulate-in-afghanistan-attacked-1478817411"

        # Clinton blames FBI director Comey ~12.11.2016
        "http://www.independent.co.uk/news/people/hillary-clinton-blames-fbi-director-james-comeys-decision-to-to-reopen-email-probe-for-defeat-to-a7414021.html",
        "http://www.thetimes.co.uk/article/clinton-accuses-fbi-chief-of-costing-her-the-election-7czltxsxm",
        "http://www.bbc.com/news/election-us-2016-37963965",
        "https://www.thesun.co.uk/news/2168742/hillary-clinton-aide-blames-us-presidential-loss-to-donald-trump-on-fbi-chief-because-he-cleared-her-of-wrongdoing-over-weiner-emails/",
        "http://www.telegraph.co.uk/news/2016/11/12/hillary-clinton-blames-election-loss-on-fbis-james-comey-in-call/",
        "http://www.express.co.uk/news/world/731721/I-BLAME-COMEY-Bitter-Clinton-blames-FBI-chief-James-Comey-election-defeat-Donald-Trump",
        "http://www.mirror.co.uk/news/world-news/hillary-clinton-blames-fbi-director-9250867",
        "http://www.dailymail.co.uk/news/article-3930928/Hillary-Clinton-blames-FBI-Director-James-Comey-election-defeat.html",
        "https://www.theguardian.com/us-news/2016/nov/12/hillary-clinton-james-comey-letters-emails-election-defeat",
        "http://nypost.com/2016/11/12/clinton-blames-comeys-email-probe-for-her-defeat/",
        "http://www.wsj.com/articles/hillary-clinton-attributes-fbi-letters-as-factor-in-election-loss-1478994890",
        "https://www.washingtonpost.com/news/post-politics/wp/2016/11/12/hillary-clinton-blames-one-comey-letter-for-stopping-momentum-and-the-other-for-turning-out-trump-voters/",
        "http://www.nytimes.com/2016/11/13/us/politics/hillary-clinton-james-comey.html",

        # F1 Grand Prix Brazil ~13.11.2016
        # Hamilton wins
        "http://www.bbc.com/sport/formula1/37953887",
        "http://www.telegraph.co.uk/formula-1/2016/11/13/brazilian-grand-prix-live/",
        "http://www.thetimes.co.uk/article/champion-reigns-in-drenched-brazil-to-keep-title-hopes-alive-t922kw5hj",
        "https://www.thesun.co.uk/sport/2177105/lewis-hamilton-wins-the-brazilian-grand-prix-after-two-red-flags/",
        "https://www.theguardian.com/sport/live/2016/nov/13/f1-brazilian-grand-prix-live",

        # Verstappen avoids crash
        "https://www.theguardian.com/sport/blog/2016/nov/14/max-verstappen-brazilian-grand-prix-felipe-massa",
        "http://www.telegraph.co.uk/formula-1/2016/11/13/max-verstappen-even-stuns-his-dad-by-storming-home-into-third-pl/",
        "http://www.express.co.uk/sport/f1-autosport/731858/Max-Verstappen-avoids-crash-Kimi-Raikkonen-Brazilian-Grand-Prix-wet",
        "http://www.mirror.co.uk/sport/formula-1/red-bull-boss-christian-horner-9254708",
        "http://www.dailymail.co.uk/sport/formulaone/article-3932890/Max-Verstappen-amazes-Red-Bull-principal-Christian-Horner-performance-Brazil-witnessed-special.html",
        "http://www.dailymail.co.uk/sport/sportsnews/article-3934424/Formula-One-star-Max-Verstappen-shows-nerves-steel-avoid-accident.html",

        # Reikonnen / Massa crash
        "http://www.mirror.co.uk/sport/formula-1/brazilian-f1-grand-prix-riddled-9253267",
        "http://www.dailymail.co.uk/sport/formulaone/article-3932386/F1-legend-Felipe-Massa-makes-emotional-farewell-crashing-Brazil-Grand-Prix-Interlagos.html",
        "https://www.thesun.co.uk/sport/2177804/felipe-massa-retires-f1-legend-makes-a-very-emotional-farewell-after-crashing-in-his-last-home-race-in-brazil/",
        "http://www.dailymail.co.uk/sport/formulaone/article-3932252/Brazilian-Grand-Prix-thrown-chaos-Kimi-Raikkonen-accident-brings-red-flag-Sebastian-Vettel-fumes-stupid-conditions-mad.html",
        "http://www.standard.co.uk/sport/brazilian-grand-prix-redflagged-after-dramatic-kimi-raikkonen-crash-a3394411.html",
        "http://www.usatoday.com/story/sports/motor/formula1/2016/11/13/brazils-massa-crashes-but-gets-warm-farewell-at-home-gp/93771246/",
        "https://www.washingtonpost.com/sports/auto-racing/brazils-massa-crashes-but-gets-warm-farewell-at-home-gp/2016/11/13/007a15d4-a9e6-11e6-8f19-21a1c65d2043_story.html"

      ],

      "overwrite_heuristics": {
        "meta_contains_article_keyword": true,
        "og_type": false,
        "linked_headlines": false,
        "self_linked_headlines": false
      }
    }
  ]
}

Improve ComparerDescription, ComparerAuthor, ComparerDate, ComparerTopimage

These comparers return the result from newspaper if there is one. Since newspaper works pretty well, this is effective. However, if you implement further extractors, these comparers can be improved:

ComparerDate: There could be a check whether the extracted result is a valid date. Additionally, the extracted dates may be written in different ways but actually denote the same date; a good comparer would recognize that (a sketch follows below).

ComparerAuthor: A comparer that checks the similarity of extracted authors would be nice; this could be realized similarly to ComparerTitle. Sometimes an extractor extracts wrong authors, sometimes up to five authors. Maybe a limit of four authors would help.

ComparerDescription: A similarity measure would be great, as would an interaction with ComparerText. What happens if the result from ComparerDescription is None? You could take the first paragraph of the text extracted by ComparerText. Additionally, you could validate the result: is the result from ComparerDescription included in the main text or not? What happens if there is no description in the article? Should the result be None, or the first paragraph of the main text? And what happens when the first paragraph is the description? Would this create unnecessary redundancy that influences the further workflow?

ComparerTopimage: Sometimes there is no top image but a video. A method to deal with such cases would be great.
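
A hedged sketch of the date-normalization idea for ComparerDate (illustrative only; python-dateutil is already a news-please dependency): parse each candidate and compare the parsed values, so differently formatted strings that denote the same date agree.

from dateutil import parser as date_parser

def normalize_date(candidate):
    """Return a datetime for a candidate date string, or None if it is not a valid date."""
    try:
        return date_parser.parse(candidate)
    except (ValueError, OverflowError, TypeError):
        return None

# "November 8, 2016" and "2016-11-08" normalize to the same datetime, so a comparer
# working on parsed values would treat them as equal.
print(normalize_date("November 8, 2016") == normalize_date("2016-11-08"))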

RSS Crawler issues

Running a fresh install of Python 3.5.4 on Win 8.1 64-bit with a fresh install of news-please. The example CLI runs without a problem, but when I try to modify the config file, the RSS crawler throws an error that says no crawler was found. See the image below.
image

Error installing news-please

C:\Users\nithi>pip3 install news-please
Collecting news-please
Using cached news-please-1.2.35.tar.gz
Collecting Scrapy>=1.1.0 (from news-please)
Using cached Scrapy-1.4.0-py2.py3-none-any.whl
Collecting PyMySQL>=0.7.9 (from news-please)
Using cached PyMySQL-0.8.0-py2.py3-none-any.whl
Collecting hjson>=1.5.8 (from news-please)
Using cached hjson-3.0.1.tar.gz
Collecting elasticsearch>=2.4 (from news-please)
Using cached elasticsearch-6.0.0-py2.py3-none-any.whl
Collecting beautifulsoup4>=4.3.2 (from news-please)
Using cached beautifulsoup4-4.6.0-py3-none-any.whl
Collecting readability-lxml>=0.6.2 (from news-please)
Using cached readability-lxml-0.6.2.tar.gz
Collecting langdetect>=1.0.7 (from news-please)
Using cached langdetect-1.0.7.zip
Collecting python-dateutil>=2.4.0 (from news-please)
Using cached python_dateutil-2.6.1-py2.py3-none-any.whl
Collecting plac>=0.9.6 (from news-please)
Using cached plac-0.9.6-py2.py3-none-any.whl
Collecting dotmap>=1.2.17 (from news-please)
Using cached dotmap-1.2.20.tar.gz
Collecting PyDispatcher>=2.0.5 (from news-please)
Using cached PyDispatcher-2.0.5.tar.gz
Collecting warcio>=1.3.3 (from news-please)
Using cached warcio-1.5.1-py2.py3-none-any.whl
Collecting ago>=0.0.9 (from news-please)
Using cached ago-0.0.92.tar.gz
Collecting six>=1.10.0 (from news-please)
Using cached six-1.11.0-py2.py3-none-any.whl
Collecting lxml>=3.3.5 (from news-please)
Using cached lxml-4.1.1-cp36-cp36m-win32.whl
Collecting awscli>=1.11.117 (from news-please)
Using cached awscli-1.14.16-py2.py3-none-any.whl
Collecting hurry.filesize>=0.9 (from news-please)
Using cached hurry.filesize-0.9.tar.gz
Collecting newspaper3k (from news-please)
Using cached newspaper3k-0.2.5.tar.gz
Collecting pywin32>=220 (from news-please)
Could not find a version that satisfies the requirement pywin32>=220 (from news-please) (from versions: )
No matching distribution found for pywin32>=220 (from news-please)

C:\Users\nithi>pip install pypiwin32
Collecting pypiwin32
Downloading pypiwin32-220-cp36-none-win32.whl (8.3MB)
100% |████████████████████████████████| 8.3MB 69kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-220

C:\Users\nithi>pip3 install news-please
Collecting news-please
Using cached news-please-1.2.35.tar.gz
Collecting Scrapy>=1.1.0 (from news-please)
Using cached Scrapy-1.4.0-py2.py3-none-any.whl
Collecting PyMySQL>=0.7.9 (from news-please)
Using cached PyMySQL-0.8.0-py2.py3-none-any.whl
Collecting hjson>=1.5.8 (from news-please)
Using cached hjson-3.0.1.tar.gz
Collecting elasticsearch>=2.4 (from news-please)
Using cached elasticsearch-6.0.0-py2.py3-none-any.whl
Collecting beautifulsoup4>=4.3.2 (from news-please)
Using cached beautifulsoup4-4.6.0-py3-none-any.whl
Collecting readability-lxml>=0.6.2 (from news-please)
Using cached readability-lxml-0.6.2.tar.gz
Collecting langdetect>=1.0.7 (from news-please)
Using cached langdetect-1.0.7.zip
Collecting python-dateutil>=2.4.0 (from news-please)
Using cached python_dateutil-2.6.1-py2.py3-none-any.whl
Collecting plac>=0.9.6 (from news-please)
Using cached plac-0.9.6-py2.py3-none-any.whl
Collecting dotmap>=1.2.17 (from news-please)
Using cached dotmap-1.2.20.tar.gz
Collecting PyDispatcher>=2.0.5 (from news-please)
Using cached PyDispatcher-2.0.5.tar.gz
Collecting warcio>=1.3.3 (from news-please)
Using cached warcio-1.5.1-py2.py3-none-any.whl
Collecting ago>=0.0.9 (from news-please)
Using cached ago-0.0.92.tar.gz
Collecting six>=1.10.0 (from news-please)
Using cached six-1.11.0-py2.py3-none-any.whl
Collecting lxml>=3.3.5 (from news-please)
Using cached lxml-4.1.1-cp36-cp36m-win32.whl
Collecting awscli>=1.11.117 (from news-please)
Using cached awscli-1.14.16-py2.py3-none-any.whl
Collecting hurry.filesize>=0.9 (from news-please)
Using cached hurry.filesize-0.9.tar.gz
Collecting newspaper3k (from news-please)
Using cached newspaper3k-0.2.5.tar.gz
Collecting pywin32>=220 (from news-please)
Could not find a version that satisfies the requirement pywin32>=220 (from news-please) (from versions: )
No matching distribution found for pywin32>=220 (from news-please)

Error installing via pip under Windows

I experienced a problem with the installation via pip under Windows 10 (64-bit) and Python 3.5.

Collecting pywin32>=220 (from news-please)
  Could not find a version that satisfies the requirement pywin32>=220 (from news-please) (from versions: )
No matching distribution found for pywin32>=220 (from news-please)

Is there a quick solution to this problem?

Unable to use cli "news-please

Traceback (most recent call last):
File "/usr/local/bin/news-please", line 7, in
from newsplease.main import main
File "/usr/local/lib/python2.7/site-packages/newsplease/init.py", line 13, in
from dotmap import DotMap
ImportError: No module named dotmap

Timeout URL retrieval

Hi,

I wonder how to properly integrate a timeout for a URL retrieval. For example, the URL below keeps running forever on my machine (I'm on Windows, in case that's useful):

url = 'http://www.nasdaq.com/article/forex-eurusd-keeps-pushing-higher-cm232206?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nasdaq%2Fcategories+%28Articles+by+Category%29'
out = NewsPlease.from_url(url)

The problem is more general: I would like to say "try to retrieve this URL for at most N seconds, otherwise give up".

I went through the options outlined here without success: https://stackoverflow.com/questions/492519/timeout-on-a-function-call. The function just keeps running.

I couldn't find anything in the wiki or the config file; maybe I overlooked it. I hope somebody can help.

Many thanks,

Sam
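
One hedged workaround (not a built-in news-please option as far as the docs show) is to run the retrieval in a worker and bound the wait with concurrent.futures. Note that the worker thread itself cannot be killed: a stuck request may keep running in the background, and the process may still wait for it at exit.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

from newsplease import NewsPlease

def from_url_with_timeout(url, timeout_seconds=30):
    """Return the article or None if retrieval takes longer than timeout_seconds.

    shutdown(wait=False) only stops us from blocking on the worker; the stuck
    request may keep running in the background thread.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(NewsPlease.from_url, url)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return None
    finally:
        executor.shutdown(wait=False)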

news-please: CLI issue

I am a new Python user with a fresh install of Python 3.6.4 on a 64-bit Windows 8.1 laptop. I have installed elasticsearch, newspaper3k, and news-please using pip3. I am able to use the library commands, but I need the CLI to gather articles from news sites. I've run the CLI, but I keep getting the error shown in the attached image, with or without admin access for the command window. I've checked the config files as mentioned in the wiki. What am I missing?

Thank you very much for your help with this!

image

Error running commoncrawl.py

After running commoncrawl.py for about 15 minutes, it throws the following error:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): ads.civitasmedia.com
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?cc=1&auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): u.openx.net
DEBUG:urllib3.connectionpool:http://u.openx.net:80 "GET /w/1.0/sc?r=http%3A%2F%2Fads.civitasmedia.com%2Fw%2F1.0%2Fai%3Fcc%3D1%26auid%3D465268%26cs%3D517002e209b24%26cb%3D18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://u.openx.net:80 "GET /w/1.0/sc?cc=1&r=http%3A%2F%2Fads.civitasmedia.com%2Fw%2F1.0%2Fai%3Fcc%3D1%26auid%3D465268%26cs%3D517002e209b24%26cb%3D18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?mi=1bbd358a-0aa6-45e6-927b-7cc5cdbeab95&ma=1497001411&mr=1498211012&mn=1&mc=1&cc=1&auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 200 43
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 11jo8z152kaa38lham19pzzv.wpengine.netdna-cdn.com
DEBUG:urllib3.connectionpool:http://11jo8z152kaa38lham19pzzv.wpengine.netdna-cdn.com:80 "GET /images/civitasreverse.png HTTP/1.1" 200 16957
INFO:__main__:article discard (sunburynews.com; None; Sunbury News)
INFO:__main__:statistics
INFO:__main__:pass = 0, discard = 160, total = 160
INFO:__main__:extraction from current WARC file started 10 minutes, 41 seconds ago; 4.012312 s/article
INFO:__main__:article discard (istoe.com.br; 2016-08-08 09:53:00; Olimp\xc3\xadada tem quebra de sete recordes mundiais)
INFO:__main__:article discard (brejo.com; 2013-12-22 00:00:00; FOTOS: Col\xc3\xa9gio da Luz realiza a 10\xc2\xaa edi\xc3\xa7\xc3\xa3o do Auto do Natal Luz)
Traceback (most recent call last):
  File "commoncrawl.py", line 271, in <module>
    common_crawl.run()
  File "commoncrawl.py", line 237, in run
    self.__process_warc_gz_file(local_path_name)
  File "commoncrawl.py", line 199, in __process_warc_gz_file
    article = NewsPlease.from_warc(record)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/__init__.py", line 34, in from_warc
    article = NewsPlease.from_html(html, url)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/__init__.py", line 68, in from_html
    item = extractor.extract(item)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/pipeline/extractor/article_extractor.py", line 53, in extract
    article_candidates.append(extractor.extract(item))
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/pipeline/extractor/extractors/newspaper_extractor.py", line 30, in extract
    article.parse()
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newspaper/article.py", line 219, in parse
    meta_data = self.extractor.get_meta_data(self.clean_doc)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newspaper/extractors.py", line 514, in get_meta_data
    ref[part] = value
TypeError: 'int' object does not support item assignment

Improve ComparerLanguage

At the moment, the comparer just counts how often each language was detected, but there are several problems:

  • the comparer does not consider that different language detectors may use different shortcuts (language codes) for the same language.

  • if there is no single most frequently detected language, the comparer takes the language extracted by newspaper (because it is very accurate), but this may not be the best solution.
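
A hedged sketch of the first point (illustrative alias table, not the actual ComparerLanguage code): normalize detector-specific codes to a canonical code before counting votes, and fall back to another strategy on a tie.

from collections import Counter

# Illustrative alias table; real detectors may emit ISO 639-1, ISO 639-2, or full names.
CANONICAL = {"en": "en", "eng": "en", "english": "en",
             "de": "de", "ger": "de", "deu": "de", "german": "de"}

def majority_language(detected):
    """Count votes on normalized codes; return None if there is no clear winner."""
    votes = Counter(CANONICAL.get(code.lower(), code.lower()) for code in detected if code)
    if not votes:
        return None
    (top, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return None  # tie: fall back to another strategy (e.g. trust newspaper)
    return top

print(majority_language(["en", "eng", "de"]))  # -> "en"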
