linwoodc3 / gdeltPyR

Python-based framework to retrieve Global Database of Events, Language, and Tone (GDELT) version 1.0 and version 2.0 data.

Home Page: https://linwoodc3.github.io/gdeltPyR/

License: GNU General Public License v3.0


gdeltpyr's Introduction


gdeltPyR

gdeltPyR is a Python-based framework to access and analyze Global Database of Events, Language, and Tone (GDELT) 1.0 or 2.0 data in a Python pandas or R dataframe. A user can enter a single date, a date range (a list of two strings), or individual dates (more than two in a list) and get back a tidy data set ready for scientific or data-driven exploration.

  • Python 2 is retiring. Because gdeltPyR depends on several libraries that are ending Python 2 support, it's only prudent that we do the same. gdeltPyR functionality under Python 2 will become buggy over the coming months; move to Python 3 for the best experience.

gdeltPyR retrieves GDELT data (version 1.0 or version 2.0) via parallel HTTP GET requests, and a method to access GDELT data directly via Google BigQuery is planned. The more cores you have, the less time it takes to pull data; the more RAM you have, the more data you can pull at once. For RAM-limited workflows, create a pipeline that pulls data, writes it to disk, and flushes memory (see the sketch under Known Issues below).

The GDELT Project advertises itself as the largest, most comprehensive, and highest-resolution open database of human society ever created. It monitors print, broadcast, and web news media in over 100 languages from every country in the world to stay continually updated on breaking developments anywhere on the planet. Its historical archives stretch back to January 1, 1979, and it captures the world's breaking events and reactions in near real time, as both the GDELT Event and Global Knowledge Graph tables update every 15 minutes. Visit the GDELT website to learn more about the project.

GDELT Facts

  • GDELT 1.0 is a daily dataset
    • 1.0 only has 'events' and 'gkg' tables
    • 1.0 posts the previous day's data at 6 AM EST the next day (i.e. Monday's data is available at 6 AM EST on Tuesday)
  • GDELT 2.0 is updated every 15 minutes
    • Some time intervals can have missing data; gdeltPyR warns when data is missing
    • 2.0 has 'events', 'gkg', and 'mentions' tables
    • 2.0 distinguishes between native-English and translated-to-English news
    • 2.0 has more columns

Project Concept and Evolution Plan

This project will evolve in two phases. If you want to contribute, this section can help you prioritize where to put your efforts.

Phase 1 focuses on providing consistent, stable, and reliable access to GDELT data.

gdeltPyR will serve data scientists, researchers, data enthusiasts, and curious Python coders in this phase. Most issues in this phase will therefore build out the main Search method of the gdelt class to return GDELT data (version 1.0 or version 2.0) or, equally important, to give a relevant error message when no data is returned. The project will also focus on building documentation, a unit testing framework (shooting for 90% coverage), and a helper class that provides information on column names and table descriptions.

Phase 2 brings analytics to gdeltPyR, expanding the library beyond simple data retrieval.

This phase is what will make gdeltPyR useful to a wider audience. The major addition will be an Analysis method of the gdelt class, which will analyze outputs of the Search method. For data-literate users (data scientists, researchers, students, data journalists, etc.), enhancements in this phase will save time by providing summary statistics and extraction methods for GDELT data, reducing the time a user spends writing code for routine data cleanup/analysis. For the non-technical audience (students, journalists, business managers, etc.), enhancements in this phase will provide outputs that summarize GDELT data, which can in turn be used in reports, articles, etc. Areas of focus include descriptive statistics (means, split-apply-combine stats, etc.), spatial analysis, and time series.

# Basic use and new schema method
import gdelt

gd = gdelt.gdelt()

# pull the events table for a single day as a geopandas dataframe
events = gd.Search(['2017 May 23'], table='events', output='gpd', normcols=True, coverage=False)

# new schema method
print(gd.schema('events'))

Coming Soon (in version 0.2, as of Oct 2023)

Installation

gdeltPyR can be installed via pip

pip install gdelt

It can also be installed using conda

conda install gdelt

Basic Examples

GDELT 1.0 Queries

import gdelt

# Version 1 queries
gd1 = gdelt.gdelt(version=1)

# pull single day, gkg table
results= gd1.Search('2016 Nov 01',table='gkg')
print(len(results))

# pull events table over a date range with full coverage
results = gd1.Search(['2016 Oct 31','2016 Nov 2'],coverage=True,table='events')
print(len(results))

GDELT 2.0 Queries

# Version 2 queries
gd2 = gdelt.gdelt(version=2)

# Single 15 minute interval pull, output to json format with mentions table
results = gd2.Search('2016 Nov 1',table='mentions',output='json')
print(len(results))

# Full day pull, output to pandas dataframe, events table
results = gd2.Search(['2016 11 01'],table='events',coverage=True)
print(len(results))

Output Options

gdeltPyR can output results directly in several formats (see the sketch after this list), which include:

  • pandas dataframe
  • csv
  • json
  • geopandas dataframe (as of version 0.1.10)
  • GeoJSON (coming soon version 0.1.11)
  • Shapefile (coming soon version 0.1.11)
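Until the native GeoJSON and Shapefile outputs land in version 0.1.11, the geopandas dataframe output can be written out with geopandas itself. A minimal sketch, assuming geopandas and its file-writing backend are installed; the file names are illustrative:

import gdelt

gd = gdelt.gdelt(version=2)

# output='gpd' returns a geopandas GeoDataFrame (version 0.1.10 and later)
gdf = gd.Search(['2017 May 23'], table='events', output='gpd', normcols=True)

# geopandas can serialize the result itself until native support lands
gdf.to_file('events.geojson', driver='GeoJSON')  # GeoJSON
gdf.to_file('events.shp')                        # ESRI Shapefile (default driver)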

Performance on a 4-core machine running MacOS Sierra 10.12 with 16 GB of RAM:

  • 900,000 by 61 (rows x columns) pandas dataframe returned in 36 seconds
    • data is a merged pandas dataframe of GDELT 2.0 events database data

gdeltPyR Parameters

gdeltPyR provides access to 1.0 and 2.0 data. Six parameters guide the query syntax:

  • version (integer): selects the version of GDELT data to query; defaults to version 2. Possible values: 1 or 2
  • date (string or list of strings): the date(s) to query, e.g. "2016 10 23" or "2016 Oct 23"
  • coverage (bool): for GDELT 2.0, pulls every 15 minute interval for the dates passed in the date parameter. Defaults to False/None, in which case gdeltPyR pulls the latest 15 minute interval for the current day or the last 15 minute interval for a historic day. Possible values: True, False, or None
  • translation (bool): for GDELT 2.0, whether to download the translated-to-English dataset instead of the native-English one. Possible values: True or False
  • table (string): the specific GDELT table to pull; defaults to the 'events' table. See the GDELT documentation page for more information. Possible values: 'events', 'mentions', or 'gkg'
  • output (string): the output type for the results. Possible values: 'json', 'csv', or 'gpd'

These parameter values can be mixed and matched to return the data you want; a sketch follows. The coverage parameter applies to GDELT version 2: when set to True, gdeltPyR queries all available 15 minute intervals for the dates passed. For the current day, the query returns the most recent 15 minute interval.
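For instance, a sketch combining several of the parameters above into one query (the date is illustrative):

import gdelt

gd2 = gdelt.gdelt(version=2)

# translated-to-English mentions for a historic day, all 15 minute intervals,
# returned as JSON
results = gd2.Search(['2016 11 01'],
                     table='mentions',
                     translation=True,
                     coverage=True,
                     output='json')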

Known Issues

  • "Running out of memory; need to cover wider timeframe"
    • Fix 1: Use Google BigQuery Method in gdeltPyR (coming soon)
      • Why: Drastically reduces the memory footprint as all processing is pushed to server side; returns subset of GDELT data but requires SQL familiarity
    • Fix 2: Use Version 1 data
    • Fix 3: Get more memory or write to disk, flush RAM, then continue iterating until done.
      • Why: If you MUST use Version 2 and pull full days of data, you need more memory because the gdeltPyR result is held in RAM. One day of GDELT Version 2 data can be 500 MB; more RAM means fewer problems. Alternatively, pull a day, write to disk, flush, then continue (see the sketch below).
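A minimal sketch of Fix 3, assuming each per-day pull succeeds and daily CSVs on disk are acceptable; the date range and file names are illustrative:

import gc
import gdelt
import pandas as pd

gd2 = gdelt.gdelt(version=2)

# one full day of GDELT 2.0 events can be ~500 MB in memory, so process
# one day at a time and release it before the next pull
for day in pd.date_range('2016-11-01', '2016-11-07'):
    df = gd2.Search(day.strftime('%Y %m %d'), table='events', coverage=True)
    df.to_csv('events_{}.csv'.format(day.strftime('%Y%m%d')), index=False)
    del df          # drop the reference...
    gc.collect()    # ...and flush RAM before the next iteration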

Coming Soon

Contributing to gdeltPyR

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

A detailed overview on how to contribute is forthcoming.

Our main requirement (and advice) is to make sure you write a unit test for your enhancement or addition (or just write one to help get the project to 90% test coverage!!!). Moreover, we can't accept a commit until the existing unit tests pass in Travis CI (OSX and Linux) and AppVeyor (Windows).

If you are simply looking to start working with the gdeltPyR codebase, navigate to the gdeltPyR's Issues tab and start looking through interesting issues. There are a number of issues listed where you could start out.

Or maybe, through using gdeltPyR, you have an idea of your own, or you are looking at something in the documentation and thinking 'this can be improved'... you can do something about it!

gdeltPyR Dev Environment

We follow the pandas instructions as a guide to build a gdeltPyR development environment. Windows users should look at the instructions below for environment set up.

An easy way to create a gdeltPyR development environment is as follows.

After completing all steps above, tell conda to create a new environment, named gdelt_dev, or any other name you would like for this environment, by running:

  • For Python 2.7
 conda create -n gdelt_dev python=2 -c conda-forge --file travis/requirements_all.txt
  • For Python 3.5
 conda create -n gdelt_dev python=3 -c conda-forge --file travis/requirements_all.txt
  • For Python 3.6
 conda create -n gdelt_dev python=3.6 -c conda-forge --file travis/requirements_all36.txt

Windows Dev Environment

For Windows, we will again follow the pandas documentation (let me know if this doesn't work for gdeltPyR). To build on Windows, you need to have compilers installed to build the extensions. You will need to install the appropriate Visual Studio compilers: VS 2008 for Python 2.7, VS 2010 for Python 3.4, and VS 2015 for Python 3.5 and 3.6.

For Python 2.7, you can install the mingw compiler which will work equivalently to VS 2008:

conda install -n gdelt_dev libpython

or use the Microsoft Visual Studio VC++ compiler for Python. Note that you have to check the x64 box to install the x64 extension building capability as this is not installed by default.

For Python 3.4, you can download and install the Windows 7.1 SDK. Read the references below as there may be various gotchas during the installation.

For Python 3.5 and 3.6, you can download and install the Visual Studio 2015 Community Edition.

Here are some references and blogs:

This will create the new environment, and not touch any of your existing environments, nor any existing Python installation. It will install all of the basic dependencies of gdeltPyR, as well as the development and testing tools. To enter this new environment:

  • On Windows
activate gdelt_dev
  • On Linux/Mac OS
source activate gdelt_dev

You will then see a confirmation message to indicate you are in the new development environment.

To view your environments:

conda info -e

To return to your home root environment in Windows:

deactivate

To return to your home root environment in OSX / Linux:

source deactivate

Building gdeltPyR

See the full conda docs here.

The last step is installing the gdeltPyR development source into this new environment. First, make sure that you cd into the gdeltPyR source directory. You have two options to build the code:

  1. The best way to develop gdeltPyR is to build the extensions in-place by running:
python setup.py build_ext --inplace

If you start up the Python interpreter in the gdeltPyR source directory, you will call the built C extensions.

  2. Another very common option is to do a develop install of gdeltPyR:
python setup.py develop

This makes a symbolic link that tells the Python interpreter to import gdelt from your development directory. Thus, you can always use the development version on your system without being inside the clone directory.

You should have a fully functional development environment!

Continuous Integration

pandas has a fantastic write-up on Continuous Integration (CI). Because gdeltPyR embraces the same CI concepts, please read pandas' introduction and explanation of CI if you have issues. All builds of your branch or pull request should pass green before it can be merged into the master branch.


Committing Your Code

There's no point in reinventing the wheel; read the pandas documentation on committing code for instructions on how to contribute to gdeltPyR.

If you completed everything above, you should be ready to contribute.

Styles for Submitting Issues/Pull Requests

We follow the pandas coding style for issues and pull requests. Use the following style:

  • ENH: Enhancement, new functionality
  • BUG: Bug fix
  • DOC: Additions/updates to documentation
  • TST: Additions/updates to tests
  • BLD: Updates to the build process/scripts
  • PERF: Performance improvement
  • CLN: Code cleanup

See this issue as an example.

gdeltpyr's People

Contributors

harman28, iltc, linwoodc3, pietermarsman, reed9999, smritigambhir


gdeltpyr's Issues

Add json output format

Add a simple ability to output the returned data in JSON format. In the end, we'll return csv, json, pandas dataframe, R dataframe, or HDF.
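Until a native json output exists, the pandas dataframe that Search returns can already be serialized with pandas; a sketch, assuming the pull succeeds:

import gdelt

gd = gdelt.gdelt(version=1)
df = gd.Search('2016 Nov 01', table='events')

# pandas handles JSON (and CSV) serialization of the returned dataframe
json_records = df.to_json(orient='records')
df.to_csv('events.csv', index=False)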

AssertionError: group argument must be None for now when querying gkg over a time period

When I try to search over a time period I get an AssertionError. Interestingly, it works when I run it with only one date, e.g. gd.Search('2016 10 19',coverage=True,table='gkg')

What I queried:
%time results = gd.Search(['2016 10 19','2016 10 22'],coverage=True,table='gkg')

The error:

File :1

File c:\Users\\anaconda3\envs\myenv\Lib\site-packages\gdelt\base.py:634, in gdelt.Search(self, date, table, coverage, translation, output, queryTime, normcols)
    630     downloaded_dfs = list(pool.imap_unordered(eventWork,
    631                                               self.download_list))
    632 else:
--> 634     pool = NoDaemonProcessPool(processes=cpu_count())
    635     downloaded_dfs = list(pool.imap_unordered(_mp_worker,
    636                                               self.download_list,
    637                                               ))
    638 pool.close()

File c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:215, in Pool.__init__(self, processes, initializer, initargs, maxtasksperchild, context)
    213 self._processes = processes
    214 try:
--> 215     self._repopulate_pool()
    216 except Exception:
    217     for p in self._pool:

File c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:306, in Pool._repopulate_pool(self)
    305 def _repopulate_pool(self):
--> 306     return self._repopulate_pool_static(self._ctx, self.Process,
    307                                         self._processes,
    308                                         self._pool, self._inqueue,
    309                                         self._outqueue, self._initializer,
    310                                         self._initargs,
    311                                         self._maxtasksperchild,
    312                                         self._wrap_exception)

File c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:322, in Pool._repopulate_pool_static(ctx, Process, processes, pool, inqueue, outqueue, initializer, initargs, maxtasksperchild, wrap_exception)
    318 """Bring the number of pool processes up to the specified number,
    319 for use after reaping workers which have exited.
    320 """
    321 for i in range(processes - len(pool)):
--> 322     w = Process(ctx, target=worker,
    323                 args=(inqueue, outqueue,
    324                       initializer,
    325                       initargs, maxtasksperchild,
    326                       wrap_exception))
    327     w.name = w.name.replace('Process', 'PoolWorker')
    328     w.daemon = True

File c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\process.py:82, in BaseProcess.__init__(self, group, target, name, args, kwargs, daemon)
...
--> 82     assert group is None, 'group argument must be None for now'
     83     count = next(_process_counter)
     84     self._identity = _current_process._identity + (count,)

AssertionError: group argument must be None for now

Add "get.data" function to download master list

This will reduce the load time and run time of the search function. Right now, for GDELT Version 2.0, a single-day query takes 45-50 s. With this new functionality, we'll only make calls for the latest 15 minute interval or the historical "get data" master list.

How to store the news data into csv?

Hello, thanks for this excellent package.

Could anyone let me know how to extract news data from GDELT using this package and store it in a .csv file?

Thank you very much!

DOC: Make documentation pages with sphinx

As this is my first module, I need to learn how to use Sphinx documentation. Make the page with a concept description, a section on how to contribute (asking for help from experienced folks), and information on CAMEO codes and how to use them.

Extract all locations from the gkg table

As a geospatial analyst,
I need to extract and classify all locations from the knowledge graph,
so that I can easily extract locations at the country, state, or city level.

coverage=True for gkg search error

Whenever I set coverage=True for a gkg search I receive the error below. However, with an events search I don't experience this error.

Code
gkg = gd.Search(['2017 May 23'],table='gkg',normcols=True,coverage=True)

Error

AssertionError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 gkg = gd.Search(['2017 May 23'],table='gkg',normcols=True,coverage=True)
      2 gkg.columns

/opt/miniconda3/envs/thesis/lib/python3.8/site-packages/gdelt/base.py in Search(self, date, table, coverage, translation, output, queryTime, normcols)
    632 else:
    633
--> 634     pool = NoDaemonProcessPool(processes=cpu_count())
    635     downloaded_dfs = list(pool.imap_unordered(_mp_worker,
    636                                               self.download_list,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
    210 self._processes = processes
    211 try:
--> 212     self._repopulate_pool()
    213 except Exception:
    214     for p in self._pool:

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in _repopulate_pool(self)
    301
    302 def _repopulate_pool(self):
--> 303     return self._repopulate_pool_static(self._ctx, self.Process,
    304                                         self._processes,
    305                                         self._pool, self._inqueue,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in _repopulate_pool_static(ctx, Process, processes, pool, inqueue, outqueue, initializer, initargs, maxtasksperchild, wrap_exception)
    317 """
    318 for i in range(processes - len(pool)):
--> 319     w = Process(ctx, target=worker,
    320                 args=(inqueue, outqueue,
    321                       initializer,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/process.py in __init__(self, group, target, name, args, kwargs, daemon)
     80 def __init__(self, group=None, target=None, name=None, args=(), kwargs={},
     81              *, daemon=None):
---> 82     assert group is None, 'group argument must be None for now'
     83     count = next(_process_counter)
     84     self._identity = _current_process._identity + (count,)

AssertionError: group argument must be None for now

BUG: Add exception handling for no data returned

gdeltPyR returns a non-intuitive error when no data is returned for a single 15 minute data pull. Need to add exception handling to make it clear to the user that no data was returned; right now it looks like gdeltPyR is broken.

Example to recreate

import gdelt
gd = gdelt.gdelt()
a=gd.Search('2017 July 27')

[Out]:
...
  File "/Users/linwood/projects/gdeltPyR/gdelt/base.py", line 597, in Search
    if len(results.columns) == 57:
AttributeError: 'NoneType' object has no attribute 'columns'

.idea Folder is Not Needed

The .idea folder is an artifact of the PyCharm editor. It should be removed and added to .gitignore to reduce clutter.

BUG: GDELT Version 2 not collecting the latest 15 minutes file

I've been using the same code to collect events every 15 minutes from the database for a few months now, but since yesterday I keep getting the error:

UserWarning: GDELT did not return data for date time 20200331120000 warnings.warn(message)

The code that I am using is:
gd2 = gdelt.gdelt(version=2)
results = gd2.Search('2020 03 31',table='events',output='json')

It works when collecting data for a date that is not the current date (31st March), so I think maybe, instead of collecting the latest 15 minutes, it is collecting whole-day files only?

Is there a way to fix this?
I'm using python 3.5 64 bit on windows.

EDIT: the issue seems to be the timezone: because the clocks changed in the UK on the 29th, the URL being requested from the database is one hour ahead of the data available, which is the issue.

EDIT: I've temporarily fixed it by changing line 174 in dateFuncs.py to subtract an hour instead of using datetime.now() directly, would it be possible to add a feature to be able to set this from the Search function itself rather than changing dateFuncs manually?

Thank you.

BUG: Proxy issue when importing

I get a proxy error when trying to import the module. This is problematic since you can't pass parameters when importing things (IIRC). Seems like this is the problem bit.

~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     80         '/utils/' \
     81         'schema_csvs/cameoCodes.json'
---> 82     codes = json.loads((requests.get(a).content.decode('utf-8')))
     83 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     74     codes = pd.read_json(os.path.join(BASE_DIR, 'data', 'cameoCodes.json'),
---> 75                          dtype=dict(cameoCode='str', GoldsteinScale=np.float64))
     76     codes.set_index('cameoCode', drop=False, inplace=True)

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression)
    421 
--> 422     result = json_reader.read()
    423     if should_close:

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in read(self)
    528         else:
--> 529             obj = self._get_object_parser(self.data)
    530         self.close()

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in _get_object_parser(self, json)
    545         if typ == 'frame':
--> 546             obj = FrameParser(json, **kwargs).parse()
    547 

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in parse(self)
    637         else:
--> 638             self._parse_no_numpy()
    639 

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in _parse_no_numpy(self)
    852             self.obj = DataFrame(
--> 853                 loads(json, precise_float=self.precise_float), dtype=None)
    854         elif orient == "split":

ValueError: Expected object or value

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    593             if is_new_proxy_conn:
--> 594                 self._prepare_proxy(conn)
    595 

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in _prepare_proxy(self, conn)
    814 
--> 815         conn.connect()
    816 

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
    323             # self._tunnel_host below.
--> 324             self._tunnel()
    325             # Mark this connection as not reusable

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py in _tunnel(self)
    910             raise OSError("Tunnel connection failed: %d %s" % (code,
--> 911                                                                message.strip()))
    912         while True:

OSError: Tunnel connection failed: 407 AuthorizedOnly

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    444                     retries=self.max_retries,
--> 445                     timeout=timeout
    446                 )

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    637             retries = retries.increment(method, url, error=e, _pool=self,
--> 638                                         _stacktrace=sys.exc_info()[2])
    639             retries.sleep()

~/gdelt/venv/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    397         if new_retry.is_exhausted():
--> 398             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    399 

MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /linwoodc3/gdeltPyR/master/utils/schema_csvs/cameoCodes.json (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 AuthorizedOnly')))

During handling of the above exception, another exception occurred:

ProxyError                                Traceback (most recent call last)
<ipython-input-1-b6a720b4b38d> in <module>()
----> 1 import gdelt

~/gdelt/venv/lib/python3.7/site-packages/gdelt/__init__.py in <module>()
      4 from __future__ import absolute_import
      5 
----> 6 from gdelt.base import gdelt
      7 
      8 __name__ = 'gdelt'

~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     80         '/utils/' \
     81         'schema_csvs/cameoCodes.json'
---> 82     codes = json.loads((requests.get(a).content.decode('utf-8')))
     83 
     84 ##############################

~/gdelt/venv/lib/python3.7/site-packages/requests/api.py in get(url, params, **kwargs)
     70 
     71     kwargs.setdefault('allow_redirects', True)
---> 72     return request('get', url, params=params, **kwargs)
     73 
     74 

~/gdelt/venv/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     56     # cases, and look like a memory leak in others.
     57     with sessions.Session() as session:
---> 58         return session.request(method=method, url=url, **kwargs)
     59 
     60 

~/gdelt/venv/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    510         }
    511         send_kwargs.update(settings)
--> 512         resp = self.send(prep, **send_kwargs)
    513 
    514         return resp

~/gdelt/venv/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
    620 
    621         # Send the request
--> 622         r = adapter.send(request, **kwargs)
    623 
    624         # Total elapsed time of the request (approximately)

~/gdelt/venv/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    505 
    506             if isinstance(e.reason, _ProxyError):
--> 507                 raise ProxyError(e, request=request)
    508 
    509             if isinstance(e.reason, _SSLError):

ProxyError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /linwoodc3/gdeltPyR/master/utils/schema_csvs/cameoCodes.json (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 AuthorizedOnly')))

Max retries

In an AWS machine, I keep getting a max retries message. I never get this message on my personal computer, so machines with really fast processors may be sending requests to the GDELT servers too fast. Need to add a synthetic delay.
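Until a synthetic delay is built in, a user-side retry wrapper is one workaround. A sketch, assuming the max-retries failure surfaces as an exception from Search; the retry counts and delays are illustrative:

import time
import gdelt

gd = gdelt.gdelt(version=2)

def search_with_backoff(date, retries=3, delay=5, **kwargs):
    """Retry gd.Search with an increasing pause between attempts."""
    for attempt in range(retries):
        try:
            return gd.Search(date, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; re-raise the last error
            time.sleep(delay * (attempt + 1))  # back off a little longer each time

results = search_with_backoff('2016 Nov 01', table='events')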

Not all available data is downloaded!!!

I get a lot of warnings saying that GDELT did not return data for certain dates. However, if I check manually, the data is available and I can download it. I have also checked the returned data, and this data is indeed missing there.

`/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201044500
warnings.warn(message)

/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201014500
warnings.warn(message)

/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201001500
warnings.warn(message)`
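To audit exactly which intervals were reported as empty, the UserWarnings that gdeltPyR emits can be captured with the standard library; a sketch:

import warnings
import gdelt

gd2 = gdelt.gdelt(version=2)

# record the warnings instead of letting them print to stderr
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    results = gd2.Search('2021 02 01', table='events', coverage=True)

missing = [str(w.message) for w in caught
           if 'did not return data' in str(w.message)]
print('\n'.join(missing))  # the 15 minute stamps to re-check manually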

ENH: add google bigquery interface

Use [pandas.io.gbq](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html#pandas.read_gbq)

First you will need to install the client: pip install --upgrade google-api-python-client
Here is a working query:

# load keys;  requires you to be registered
keys = json.load(open('/Users/linwood/Desktop/keysforapps/apikeys.txt'))

# setup google creds; not sure if this is required yet; but you need to do it once to authorize the api from your python ecosystem
from apiclient.discovery import build
service = build('bigquery', 'v2', developerKey=keys['google']['apikey']+"2")

# load query in proper SQL syntax as string
from pandas.io import gbq
q="""
SELECT MonthYear,count(*)c,count(IF(Actor1Code LIKE 'MUS',1,null)) c_up
FROM [gdelt-bq.full.events] WHERE EventRootCode = '19'
GROUP BY MonthYear ORDER BY MonthYear;"""


# run the query
df = gbq.read_gbq(q, project_id=<projectid>)
:[out]
Requesting query... ok.
Query running...
Query done.
Cache hit.

Retrieving results...
Got 461 rows.

Total time taken 0.75 s.
Finished at 2017-05-30 10:26:21

NameError: global name 'p' is not defined

Traceback (most recent call last):
File "D:\XXX\coding\gdelt\gdeltPyR.py", line 12, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "C:\Users\XXX\Anaconda2\lib\site-packages\gdelt\base.py", line 290, in Search
p
NameError: global name 'p' is not defined

Pull GDelt V2 GKG data for the full day

Hi!

When I pull GKG data with the following code, I only get the first 15mins of data. Is it possible to get the full day's worth of GKG data?

gd = gdelt.gdelt(version=2)
date = extract_date.strftime('%Y%m%d')
df = gd.Search(date, table='gkg', coverage=True)

Many thanks!

Missing requirements? (pytest-cov, geopandas)

I'm running tests for the first time, and I went through the usual process of pip3 install -r requirements.txt from a virtualenv. It seems like there might be necessary packages that are missing:

pytest-cov

I got some issues with py.test: error: unrecognized arguments: --cov --cov-repo=term-missing which turned out to be a different root cause (system pytest being used). Nevertheless, in troubleshooting I got the impression pytest-cov probably should be installed explicitly. It might just be installed by upgrading to a current python-pytest, not sure.

geopandas

Now that I can run the tests, everything seems to pass except a couple with ModuleNotFoundError: No module named 'geopandas'. I thought maybe geopandas would be installed in requirements_geo.txt but apparently not. It's unclear to me which of the requirements files should install it, or both.

run the example from readme.md failed

GDELT 1.0 Queries

import gdelt

# Version 1 queries
gd1 = gdelt.gdelt(version=1)

# pull single day, gkg table
results = gd1.Search('2016 Nov 01',table='gkg')
print(len(results))

# pull events table, range
results = gd1.Search(['2016 Oct 31','2016 Nov 2'],coverage=True,table='events')
print(len(results))

++++++++++++++++++++++++++++++++++++++++++++++++++++

~/ub16_prj % python demogdelt.py

187291
187291
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users//ub16_prj/demogdelt.py", line 13, in
results = gd1.Search(['2016 Oct 31','2016 Nov 2'],coverage=True,table='events')
File "/usr/local/lib/python3.8/site-packages/gdelt/base.py", line 629, in Search
pool = Pool(processes=cpu_count())
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
self._repopulate_pool()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

187291

Add geopandas geodataframe output

Add ability to output returned data in geopandas output; this will make it easy for another output style (shapefile) and geojson. Also makes it easy to do a choropleth, mapping a statistical variable (count of a particular type of CAMEO Code) to a map. Should add the world geopandas data set to this as well (need to find a small one).
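A minimal sketch of what that enables, assuming output='gpd' works for the query; the normalized column names are assumptions to verify against the returned .columns:

import matplotlib.pyplot as plt
import gdelt

gd = gdelt.gdelt(version=2)
points = gd.Search(['2017 May 23'], table='events', output='gpd', normcols=True)

# plot raw event locations; a full choropleth would additionally need country
# polygons (the world data set mentioned above) and a mapping from GDELT's
# FIPS country codes to the polygons' codes
ax = points.plot(markersize=1, figsize=(12, 6))
ax.set_title('GDELT events, 23 May 2017')
plt.show()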

Error on pulling dates older than 2013, version 1

>>> import gdelt
>>> gd = gdelt.gdelt(version=1)
>>> results = gd.Search('2013 2 20',table='events')
Traceback (most recent call last):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-368ad372ac85>", line 1, in <module>
    results = gd.Search('2013 2 20',table='events',version=1)
TypeError: Search() got an unexpected keyword argument 'version'
gd = gdelt.gdelt(version=1)
results = gd.Search('2013 2 20',table='events')
Traceback (most recent call last):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-17-e0d15ebbd9c9>", line 1, in <module>
    results = gd.Search('2013 2 20',table='events')
  File "/Users/linwood/PycharmProjects/gdeltPyR/gdelt/base.py", line 426, in Search
    else:
  File "/Users/linwood/PycharmProjects/gdeltPyR/gdelt/vectorizingFuncs.py", line 100, in urlBuilder
    if parse(dateString) < parse('2013 Apr 01'):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/dateutil/parser.py", line 1168, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/dateutil/parser.py", line 581, in parse
    ret = default.replace(**repl)
ValueError: month must be in 1..12

Unable to install using "pip install gdelt"

Hello.
When i tried to install i got the following:

Collecting gdelt
Using cached gdelt-0.1.10.6.1-py2.py3-none-any.whl (773 kB)
Discarding https://files.pythonhosted.org/packages/65/f9/a3d5111c8f17334b1752c32aedaab0d01ab4324bf26417bd41890d5b25d0/gdelt-0.1.10.6.1-py2.py3-none-any.whl (from https://pypi.org/simple/gdelt/): Requested gdelt from https://files.pythonhosted.org/packages/65/f9/a3d5111c8f17334b1752c32aedaab0d01ab4324bf26417bd41890d5b25d0/gdelt-0.1.10.6.1-py2.py3-none-any.whl has inconsistent version: expected '0.1.10.6.1', but metadata has '0.1.10.6'
Using cached gdelt-0.1.10.6.1.tar.gz (982 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/setup.py", line 39, in
read('CHANGES')),
File "/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/setup.py", line 15, in read
with codecs.open(os.path.join(cwd, filename), 'rb', 'utf-8') as h:
File "/usr/lib/python3.10/codecs.py", line 906, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/CHANGES'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

Travis builds are currently broken for >=3.6

0.36s$ pytest
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov-repo=term-missing
  inifile: /home/travis/build/linwoodc3/gdeltPyR/setup.cfg
  rootdir: /home/travis/build/linwoodc3/gdeltPyR
The command "pytest" exited with 4.
before_cache
0.01s$ rm -f $HOME/.cache/pip/log/debug.log
cache.2
store build cache

ENH: Add ability to pull specific time interval on date

Right now, gdeltPyR can pull date ranges for historical dates and current-day data. Add the ability to specify a specific time interval to pull data for. The historical 2.0 query pulls the last 15 minute interval of the day when coverage is set to False; we need to give more flexibility.

Cannot run sample code for GDELT v2

Hi, When I run the sample code provided for v2, the following error is received. v1 works fine. Please help and let me know why this could be happening? Thank you

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
File "", line 1, in
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 288, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "main.py", line 11, in
results = gd2.Search(['2016 11 01'],table='events',coverage=True)
File "/opt/anaconda3/lib/python3.9/site-packages/gdelt/base.py", line 635, in Search
pool = Pool(processes=cpu_count())
File "/opt/anaconda3/lib/python3.9/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
self._repopulate_pool()
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/anaconda3/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''

DOC: Add markdown file on contributing to `gdeltPyR`

  • Use the pandas contributing document as a guide.
  • Define versioning logic
  • Implement release plans with group of features to add before version number updated
  • Explain how to set up dev environment

Early contributing guidance:

  • I'm using http://semver.org/ and this Stack Overflow post as a model for versioning; I'm using a four-number scheme (0.0.0.0):

    • major version (changes when all planned features are added)
    • minor version (changes when new classes or modules are added that change the results or analysis on returned GDELT data)
    • minor-minor version (changes with smaller enhancements, like classes or modules that just return unaltered data, new parameters to existing classes/modules, etc.)
    • bug fixes: the last number is the bug-fix count for the current build; no changes to existing functionality, but it fixes a MAJOR bug that stops the entire module from working. Simple little bug fixes don't get counted. It resets to zero on a minor-minor change and stays at zero if there are no such bugs. Eventually this number will be dropped once the unit test suite has 80% coverage.
  • A small one- or two-line entry in CHANGES (gdeltPyR --> CHANGES): just a date line and a description that says something like "added support for translated GDELT data". You can add your GitHub username if you want, too.

  • Add short bullet to README.md and README.rst. (gdeltPyR --> README.md (rst)). This just announces the new feature on the first page.

This looks good. Thanks for adding the unittests.

Administrative changes before any merge.

  • A small one- or two-line entry in CHANGES explaining what you did (gdeltPyR --> CHANGES): just a date line and a description that says something like "Added support for translated GDELT data". You can add your GitHub username if you want, too.

  • Add short bullet to README.md and README.rst. (gdeltPyR --> README.md (rst)). This just announces the new feature on the first page.

  • Add an issue that defines the bug or feature you plan to work on. Reference the issue in your commits, and close it if your commit addresses the issue, whether it's an enhancement, bug fix, documentation update, etc.

  • (Optional) Add your name or GitHub username (or a combo) to AUTHORS.rst (gdeltPyR --> AUTHORS.rst). We keep track of all contributors to show the power of open source. Feel free to add your country as well.

Installing package doesn't work

Whenever I try installing the package I'm getting this error. This didn't happen earlier.

Collecting gdelt
Downloading gdelt-0.1.13.tar.gz (1.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 16.0 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (pyproject.toml) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.

BUG: Event search not working on windows 32 bit machine

import gdelt
import requests.packages.urllib3

requests.packages.urllib3.disable_warnings()
import platform
print(platform.architecture())
import gdelt

gd = gdelt.gdelt(version=2)

results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
print(results)


#output

D:\SUSHANT\pyt\python.exe C:/Users/sushant.s/PycharmProjects/testAGAIN/GDELT.py
('32bit', 'WindowsPE')
('32bit', 'WindowsPE')
('32bit', 'WindowsPE')
Traceback (most recent call last):
File "", line 1, in
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="mp_main")
File "D:\SUSHANT\pyt\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\SUSHANT\pyt\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\SUSHANT\pyt\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\sushant.s\PycharmProjects\testAGAIN\GDELT.py", line 11, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "D:\SUSHANT\pyt\lib\site-packages\gdelt\base.py", line 568, in Search
pool = Pool(processes=cpu_count())
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 168, in init
self._repopulate_pool()
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 233, in _repopulate_pool
w.start()
File "D:\SUSHANT\pyt\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\SUSHANT\pyt\lib\multiprocessing\popen_spawn_win32.py", line 33, in init
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
File "", line 1, in
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="mp_main")
File "D:\SUSHANT\pyt\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\SUSHANT\pyt\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\SUSHANT\pyt\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\sushant.s\PycharmProjects\testAGAIN\GDELT.py", line 11, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "D:\SUSHANT\pyt\lib\site-packages\gdelt\base.py", line 568, in Search
pool = Pool(processes=cpu_count())
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 168, in init
self._repopulate_pool()
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 233, in _repopulate_pool
w.start()
File "D:\SUSHANT\pyt\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\SUSHANT\pyt\lib\multiprocessing\popen_spawn_win32.py", line 33, in init
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

('32bit', 'WindowsPE')
('32bit', 'WindowsPE')

GDELT did not return data for any date time?

We are seeing a bunch of failing HTTP requests, but the URL seems to be valid and the file can be downloaded with another HTTP client.

How can we narrow this down? Some console outputs look like the requests are running into timeouts...

ENH: Add a new class that provides information on each table and column

This is part of Phase 1.

GDELT is a very complex data set and beginners will need to understand what is available. This is a multi-pronged issue as it is tied to #30.

The implementation is up to the coder who takes this on, but for consideration:

  • Create a class that is an "information" or "whatIs" class. The name of the class should be easy to understand and let the user know to use this specific class to learn more about tables and the column names in tables.

  • Each GDELT table (events, gkg, iatv, mentions, literature) should have a method that returns a description of the table. The GDELT Codebook descriptions may help give a generic overview of the tables.

  • Each table will need different descriptions for the different GDELT versions (version 1 and version 2). The main difference is that new columns or improvements should be highlighted in the description. For example, the Events 1 table has fewer columns than the Events 2 table; the description should briefly explain why (maybe one sentence at the beginning of the Events 2 description).

  • Each table will have a dataframe that provides a description of the columns. Each column will have a name, data type (integer, string, etc.), and a description.

  • Write a unit test for each table; start by writing failing unit tests first (to load the table), then go back and make the tables load with the descriptions. We must have a unit test for each table.

A potential tree is:
gdelt.info -> events -> columndescription

OR

gdelt.info(version=2) -> events -> tabledescription

The version should be set in gdelt.
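One minimal sketch of the proposed class; every name and description below is a placeholder, not the project's API, and the real text would come from the GDELT codebooks:

import pandas as pd

class WhatIs(object):
    """Hypothetical 'information' class: explains GDELT tables and columns."""

    # placeholder descriptions keyed by version, then table name
    _tables = {
        1: {'events': 'GDELT 1.0 events table (daily).',
            'gkg': 'GDELT 1.0 Global Knowledge Graph (daily).'},
        2: {'events': 'GDELT 2.0 events table; more columns than 1.0.',
            'gkg': 'GDELT 2.0 Global Knowledge Graph.',
            'mentions': 'GDELT 2.0 mentions table (new in 2.0).'},
    }

    def __init__(self, version=2):
        # version is set once here, mirroring gdelt.gdelt(version=...)
        self.version = version

    def tabledescription(self, table):
        return self._tables[self.version][table]

    def columndescription(self, table):
        # would return one row per column: name, data type, description
        return pd.DataFrame(columns=['name', 'dtype', 'description'])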

ENH: Add shapefile output

Add a method to convert geopandas output into a shapefile OR include an option that allows the user to write the gdeltPyR results directly to a shapefile.

FEATURE: Calculate day event was added

The events 2.0 codebook describes the fraction date. Here is code to convert a fraction date to the approximate date when the event happened, assuming a fraction date of 2020.2438 (a reusable version follows the snippet).

import datetime

datetime.datetime(day=1,month=1,year=2020) + datetime.timedelta(days=int(2438/9999 * 365))
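A reusable sketch generalizing the snippet above; the function name is illustrative:

import datetime

def fractiondate_to_date(fraction_date):
    """Approximate calendar date for a GDELT fractional date like 2020.2438."""
    year = int(fraction_date)
    # digits after the decimal point are the fraction of the year scaled to 9999
    fraction = fraction_date - year
    return (datetime.datetime(day=1, month=1, year=year)
            + datetime.timedelta(days=int(fraction * 365)))

print(fractiondate_to_date(2020.2438))  # ~ 2020-03-29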
