signal-1m-tools's Introduction

Signal-1M-Tools

What is the Signal 1M Dataset?

The Signal Media One-Million News Articles Dataset was released by Signal Media to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.

The articles in the dataset were originally collected by Moreover Technologies (one of Signal's content providers) from a variety of news sources over a period of one month (1-30 September 2015). The dataset contains 1 million articles, mainly in English, along with some non-English and multilingual ones. Sources include major outlets, such as Reuters, as well as local news sources and blogs.

Getting Started

Downloading the dataset

To obtain the dataset, follow the download link here.

Elasticsearch

Elasticsearch is a powerful distributed RESTful search engine that can be used to store and index large amounts of data. At Signal, we use Elasticsearch to handle most of our search requests.

Installation

  1. Download Elasticsearch and unzip.
  2. Run bin/elasticsearch on Unix or bin/elasticsearch.bat on Windows.
  3. Run curl -X GET http://localhost:9200/ to check that the server responds.

At this point, Elasticsearch should be running locally on port 9200. More information about Elasticsearch can be found at their GitHub page.
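The check in step 3 can also be made from Python. The following is a minimal sketch that uses only the standard library; `cluster_info` and its `opener` parameter are hypothetical names introduced for illustration, and the URL assumes the default local setup described above.

```python
import json
from urllib.request import urlopen


def cluster_info(es_url="http://localhost:9200", opener=urlopen):
    """Fetch and decode the JSON document served at the Elasticsearch root URL."""
    # `opener` is injectable so the function can be exercised without a server.
    with opener(es_url.rstrip("/") + "/") as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `cluster_info()` while Elasticsearch is running should return a dictionary describing the node; a connection error suggests the server is not reachable on that port.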

We advise that you use a tool to interact with Elasticsearch, such as Sense, which is used in the examples below.

Creating an index

To store articles, first create an index named articles:

curl -X PUT 'http://localhost:9200/articles'

or in Sense:

PUT articles
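The same request can be sent from Python. This is a sketch rather than part of the repository: it uses the standard library instead of Requests, and `build_create_request` and `create_index` are hypothetical helper names.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_create_request(es_url="http://localhost:9200", index_name="articles"):
    """Build the PUT request that the curl command above sends."""
    return Request(es_url.rstrip("/") + "/" + index_name, method="PUT")


def create_index(es_url="http://localhost:9200", index_name="articles"):
    """Send the request and return the HTTP status code."""
    try:
        with urlopen(build_create_request(es_url, index_name)) as response:
            return response.status
    except HTTPError as err:
        # Raised for 4xx/5xx answers, for example if the index already exists.
        return err.code
```

Returning the status code instead of raising keeps the helper safe to call more than once.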

Indexing the million articles

To index the million articles into Elasticsearch using Python, first install Requests:

pip install requests

Then run:

python index_articles.py http://localhost:9200 ./million.jsonl

Term frequencies

The term and document frequencies are also available using these links. These values were calculated after routine tokenisation and stop-word removal.

These files are in edn format.

TREC

Signal-1M-Convert-To-TREC

A script to convert the Signal Media One-Million News Articles Dataset to TREC format. The TREC format allows researchers to index the dataset using popular Information Retrieval platforms such as Terrier (http://terrier.org).

Running the script

After obtaining the dataset through this form (http://goo.gl/forms/5i4KldoWIX), extract the JSONL file from the downloaded Gzip file. Then run the script like this:

python convert-to-trec.py -i <path to signalmedia-1m.jsonl> -o <path to your outputfile>
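To illustrate what such a conversion involves, here is a minimal sketch. It is not the repository's convert-to-trec.py: the field names `id`, `title` and `content`, the choice of TREC tags, and the helper names `to_trec` and `convert` are all assumptions made for the example.

```python
import json


def to_trec(article):
    """Render one article dict as a TREC-style <DOC> record."""
    # The field names `id`, `title` and `content` are assumptions about the JSONL schema.
    return (
        "<DOC>\n"
        "<DOCNO>{}</DOCNO>\n"
        "<TITLE>{}</TITLE>\n"
        "<TEXT>\n{}\n</TEXT>\n"
        "</DOC>\n"
    ).format(article["id"], article.get("title", ""), article.get("content", ""))


def convert(jsonl_path, trec_path):
    """Read one JSON object per line and write one <DOC> record per article."""
    with open(jsonl_path, encoding="utf-8") as fin, \
            open(trec_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():  # skip blank lines
                fout.write(to_trec(json.loads(line)))
```

Opening the output in text mode with an explicit encoding lets Python handle the UTF-8 conversion, so the records are written as ordinary strings.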

Indexing the dataset with Terrier

We recommend using the terrier.properties file included in this repository to index the dataset with Terrier. In your Terrier etc folder, add a text file "signal.spec" with one line containing the path to the file you created above (the TREC-formatted dataset).
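Creating the spec file can be scripted. The sketch below assumes you know where your Terrier etc folder is; `write_collection_spec` is a hypothetical helper, and writing an absolute path is a choice made here, not a stated requirement.

```python
import os


def write_collection_spec(terrier_etc_dir, trec_file):
    """Write signal.spec with a single line: the path of the TREC-formatted file."""
    spec_path = os.path.join(terrier_etc_dir, "signal.spec")
    with open(spec_path, "w", encoding="utf-8") as spec:
        # An absolute path is used so the spec does not depend on the working directory.
        spec.write(os.path.abspath(trec_file) + "\n")
    return spec_path
```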

signal-1m-tools's People

Contributors

bonzanini, dcorney, dyaaalbakour, sdewan64

signal-1m-tools's Issues

Relevance judgments for text document retrieval

Are there any ground-truth relevance judgments (query-article pairs) available for this dataset that could be used to evaluate an information retrieval system (a vector space model with a modified TF-IDF term weighting function)?

"TypeError: must be str, not bytes"

When I enter this command in cmd:

python convert-to-trec.py -i C:\Users\User\Desktop\signaldata\sample-1M.jsonl -o C:\Users\User\Desktop\signaldata\trec.jsonl

I get the following error message:

file "convert-to-trec.py", line 47, in main
fout.write(tecdoc.encode('utf-8'))
TypeError: must be str, not bytes

What causes this? I'm also not sure if the file I'm writing into should already exist; in my case it doesn't.
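In Python 3, a file opened in text mode accepts only str, so passing it the bytes returned by .encode('utf-8') raises exactly this TypeError. Whether the script opens its output in text mode is an assumption, since its source is not shown here. The sketch below shows two ways to write UTF-8 output that work in Python 3; the function names are hypothetical. In both cases the output file does not need to exist beforehand, because opening it for writing creates it.

```python
def write_text(path, document):
    """Text mode: pass str and let open() do the encoding."""
    with open(path, "w", encoding="utf-8") as fout:
        fout.write(document)


def write_bytes(path, document):
    """Binary mode: encode explicitly and write the resulting bytes."""
    with open(path, "wb") as fout:
        fout.write(document.encode("utf-8"))
```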

Also, another error appeared when trying to index the data into Elasticsearch: at some random number (e.g. 15000, 21000) the indexing script encounters an error and stops.

CMD log of the indexing error:

Added document: 21567
Added document: 21568
Added document: 21569
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "C:\Python34\lib\site-packages\urllib3\util\connection.py", line 80, in create_connection
raise err
File "C:\Python34\lib\site-packages\urllib3\util\connection.py", line 70, in create_connection
sock.connect(sa)
OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Python34\Lib\http\client.py", line 1088, in request
self._send_request(method, url, body, headers)
File "C:\Python34\Lib\http\client.py", line 1126, in _send_request
self.endheaders(body)
File "C:\Python34\Lib\http\client.py", line 1084, in endheaders
self._send_output(message_body)
File "C:\Python34\Lib\http\client.py", line 922, in _send_output
self.send(msg)
File "C:\Python34\Lib\http\client.py", line 857, in send
self.connect()
File "C:\Python34\lib\site-packages\urllib3\connection.py", line 181, in connect
conn = self._new_conn()
File "C:\Python34\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x030F96D0>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted

Indexing stops at 19170

Traceback (most recent call last):
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\util\connection.py", line 95, in create_connection
raise err
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\util\connection.py", line 85, in create_connection
sock.connect(sa)
OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "C:\Users\knott\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1253, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\knott\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1299, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\knott\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1248, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\knott\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1008, in _send_output
self.send(msg)
File "C:\Users\knott\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 948, in send
self.connect()
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\connection.py", line 205, in connect
conn = self._new_conn()
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x000002614F64BD90>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\requests\adapters.py", line 440, in send
resp = conn.urlopen(
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\connectionpool.py", line 785, in urlopen
retries = retries.increment(
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\urllib3\util\retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=9200): Max retries exceeded with url: /articles/article (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002614F64BD90>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\knott\Documents\stuff\Signal-1M-Tools-master\index_articles.py", line 10, in <module>
requests.post(url=es_url + '/articles/article', data=line)
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\requests\api.py", line 117, in post
return request('post', url, data=data, json=json, **kwargs)
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\requests\sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\requests\sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "C:\Users\knott\Documents\PythonKurs\uebung2\venv\lib\site-packages\requests\adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=9200): Max retries exceeded with url: /articles/article (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002614F64BD90>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted'))
PS C:\Users\knott\Documents\stuff\Signal-1M-Tools-master>
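WinError 10048 on outgoing connections is often reported when a client opens a very large number of short-lived connections and Windows runs out of free local ports; whether that is the cause here is an assumption. The traceback shows one requests.post call per article, which opens a new connection each time. A possible workaround, sketched below, is to send all requests through a single requests.Session so the connection is reused. The function names are hypothetical, the `/articles/article` path is copied from the traceback, and newer Elasticsearch versions that no longer support mapping types may need a different path.

```python
def index_articles(lines, post, es_url="http://localhost:9200"):
    """Post each non-empty JSONL line to the articles index; return the count."""
    url = es_url.rstrip("/") + "/articles/article"  # path taken from the traceback
    headers = {"Content-Type": "application/json"}
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        response = post(url, data=line.encode("utf-8"), headers=headers)
        response.raise_for_status()  # stop on the first rejected document
        count += 1
    return count


def index_file(jsonl_path, es_url="http://localhost:9200"):
    """Index a JSONL file through one shared requests.Session."""
    import requests  # third-party; installed with `pip install requests`

    # A Session keeps the TCP connection open between requests, so the client
    # does not need a fresh local port for every article.
    with requests.Session() as session, \
            open(jsonl_path, encoding="utf-8") as lines:
        return index_articles(lines, session.post, es_url)
```

For example, `index_file("./million.jsonl")` would index the file against a local server. Passing the `post` callable in separately keeps the loop independent of the HTTP client.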
