politwoops-tweet-collector's Introduction

Install Beanstalkd

http://kr.github.com/beanstalkd/download.html

Requires installing the libevent-dev package on apt-based systems.

Install Python dependencies

Install pip if you don't already have it, then run:

pip install -r requirements.txt

Edit config file

First:

cp conf/tweets-client.ini.example conf/tweets-client.ini

In the [tweets-client] section, add your Twitter account's username and password. All API requests are authenticated as this account.

In the [beanstalk] section, change "tweets_tube" and "screenshot_tube". The values don't matter much; they just need to be unique.

In the [database] section, update the "host", "port", "username", "password", and "database" settings with your own details if the defaults are not appropriate.

In the [aws] section, add your access key, secret access key, bucket name, and any path prefix inside the bucket you want to use. This is for archiving images and screenshots of tweeted links.
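
As a quick sanity check after editing the file, something like the following sketch can report settings that are still missing or empty. The section and option names are the ones listed above; the option names inside [aws] are not spelled out in this README, so only the presence of that section is checked. The helper names (missing_settings, check_config_file) are made up for this example and are not part of the project.

```python
import configparser

# Section and option names come from the README text above. The [aws] option
# names are not given there, so only the section itself is required here.
REQUIRED = {
    "tweets-client": ["username", "password"],
    "beanstalk": ["tweets_tube", "screenshot_tube"],
    "database": ["host", "port", "username", "password", "database"],
    "aws": [],
}


def missing_settings(config):
    """Return 'section' or 'section.option' names that are absent or empty."""
    problems = []
    for section, options in REQUIRED.items():
        if not config.has_section(section):
            problems.append(section)
            continue
        for option in options:
            if not config.get(section, option, fallback="").strip():
                problems.append(f"{section}.{option}")
    return problems


def check_config_file(path="conf/tweets-client.ini"):
    """Hypothetical helper: load the ini file and list what still needs filling in."""
    # interpolation=None so that a '%' in a password is read literally.
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)
    return missing_settings(config)
```

Calling check_config_file() from the repository root returns an empty list once every listed setting has a value.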

Running

Run tweets-client.py to start streaming items from Twitter into the beanstalk queue. Append the lib directory to the PYTHONPATH, either persistently or as part of the command:

PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/tweets-client.py

Then run politwoops-worker.py to start pulling the tweets out of beanstalk and loading them into MySQL:

PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/politwoops-worker.py --images

Finally, if you ran politwoops-worker.py with the --images option, run screenshot-worker.py to grab screenshots of webpages and mirror images linked in tweets:

PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/screenshot-worker.py

These three scripts all accept the following options:

  • --loglevel - Sets the verbosity of logging.
  • --output - Destination for log files.
  • --restart - Restart if the script encounters an error that cannot be handled.
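
For readers who want to see how three such options fit together, here is a minimal argparse sketch. It is not the scripts' actual parser: the defaults, the help strings, and the choice to make --restart a boolean flag are all assumptions for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the three shared options listed above."""
    parser = argparse.ArgumentParser(description="Illustrative option parser")
    # Default level is an assumption, not taken from the scripts.
    parser.add_argument("--loglevel", default="notice",
                        help="Sets the verbosity of logging.")
    # None here stands for 'no log file given'; also an assumption.
    parser.add_argument("--output", default=None,
                        help="Destination for log files.")
    # Modelled as an on/off flag; the real option may take a value.
    parser.add_argument("--restart", action="store_true",
                        help="Restart after an error that cannot be handled.")
    return parser
```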

politwoops-tweet-collector's People

Contributors

breyten, dwillis, kaitlin, konklone, plantfansam, timball

politwoops-tweet-collector's Issues

http error when running tweets-client.py

Running ./bin/tweets-client.py returns the following error:

send: 'GET /1/account/verify_credentials.json HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: api.twitter.com\r\nAuthorization: OAuth realm="", oauth_nonce="85857454", oauth_timestamp="1411050412", oauth_consumer_key="heLP4n2JMqICTTT7fWChOosK6", oauth_signature_method="HMAC-SHA1", oauth_version="1.0", oauth_token="2814994075-XEieBJFHmFwfZxKMHuJ14ODCC9QXcMIh7ibTSyg", oauth_signature="7VQa5SGFUnaGWD5j5mFc0B1gcGI%3D"\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: content-length: 52
header: content-type: application/json;charset=utf-8
header: date: Thu, 18 Sep 2014 14:26:53 UTC
header: server: tsa_a
header: set-cookie: guest_id=v1%3A141105041308460869; Domain=.twitter.com; Path=/; Expires=Sat, 17-Sep-2016 14:26:53 UTC
header: x-connection-hash: ac7127a6a042322fc951f53834b990cb
[2014-09-18 14:26] ERROR: tweets-client.py: [{u'message': u'SSL is required', u'code': 92}]
Traceback (most recent call last):
  File "./bin/tweets-client.py", line 201, in <module>
    sys.exit(main(args))
  File "./bin/tweets-client.py", line 182, in main
    return app.run()
  File "./bin/tweets-client.py", line 159, in run
    self.init_beanstalk()
  File "./bin/tweets-client.py", line 133, in init_beanstalk
    use=tweets_tube)
  File "/home1/geektrek/git/politwoops-tweet-collector/lib/politwoops/utils.py", line 41, in beanstalk
    beanstalk = beanstalkc.Connection(host=host, port=port)
  File "/home1/geektrek/python272/lib/python2.7/site-packages/beanstalkc.py", line 57, in __init__
    self.connect()
  File "/home1/geektrek/python272/lib/python2.7/site-packages/beanstalkc.py", line 61, in connect
    SocketError.wrap(self._socket.connect, (self.host, self.port))
  File "/home1/geektrek/python272/lib/python2.7/site-packages/beanstalkc.py", line 43, in wrap
    raise SocketError(e)
beanstalkc.SocketError: [Errno 111] Connection refused
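
The output above contains two separate failures: the credentials check is answered with "SSL is required", and the traceback itself ends in "Connection refused" while connecting to beanstalk, which usually means nothing is listening on the configured host and port. A small sketch for checking the second point before starting the client follows. The host and port defaults are assumptions (11300 is beanstalkd's usual default port); use the values from your own [beanstalk] section.

```python
import socket


def beanstalkd_reachable(host="localhost", port=11300, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing accepts connections at that address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False, start beanstalkd (or fix the host and port in the config) before running tweets-client.py.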

twitter api changed

Running

PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/tweets-client.py

returns

[2021-01-07 17:43:43.028120] ERROR: tweets-client.py: Failed to send request: Failed to parse: https://api.twitter.com/1.1/account/verify_credentials.json
Traceback (most recent call last):
  File "/home/skela/.local/lib/python3.8/site-packages/requests/models.py", line 382, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "/usr/lib/python3/dist-packages/urllib3/util/url.py", line 392, in parse_url
    return six.raise_from(LocationParseError(source_url), None)
  File "<string>", line 2, in raise_from
urllib3.exceptions.LocationParseError: Failed to parse: https://stream.twitter.com/1.1/statuses/filter.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/tweets-client.py", line 232, in <module>
    sys.exit(main(args))
  File "./bin/tweets-client.py", line 213, in main
    return app.run()
  File "./bin/tweets-client.py", line 194, in run
    self.stream_forever()
  File "./bin/tweets-client.py", line 183, in stream_forever
    stream.filter(follow=track_items)
  File "/home/skela/.local/lib/python3.8/site-packages/tweepy/streaming.py", line 474, in filter
    self._start(is_async)
  File "/home/skela/.local/lib/python3.8/site-packages/tweepy/streaming.py", line 389, in _start
    self._run()
  File "/home/skela/.local/lib/python3.8/site-packages/tweepy/streaming.py", line 320, in _run
    six.reraise(*exc_info)
  File "/home/skela/.local/lib/python3.8/site-packages/six.py", line 686, in reraise
    raise value
  File "/home/skela/.local/lib/python3.8/site-packages/tweepy/streaming.py", line 266, in _run
    resp = self.session.request('POST',
  File "/home/skela/.local/lib/python3.8/site-packages/requests/sessions.py", line 528, in request
    prep = self.prepare_request(req)
  File "/home/skela/.local/lib/python3.8/site-packages/requests/sessions.py", line 456, in prepare_request
    p.prepare(
  File "/home/skela/.local/lib/python3.8/site-packages/requests/models.py", line 316, in prepare
    self.prepare_url(url, params)
  File "/home/skela/.local/lib/python3.8/site-packages/requests/models.py", line 384, in prepare_url
    raise InvalidURL(*e.args)
requests.exceptions.InvalidURL: Failed to parse: https://stream.twitter.com/1.1/statuses/filter.json

This is probably caused by the change to the Twitter API (https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/migrate/standard-to-twitter-api-v2).

Http error when running tweets-client.py

Hello,

First of all, thanks for open sourcing the code of a great project like politwoops. Now to my problem: I have edited the config file with the appropriate parameters, I run PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/tweets-client.py and I get this error:

Traceback (most recent call last):
  File "./bin/tweets-client.py", line 150, in <module>
    sys.exit(main(args))
  File "./bin/tweets-client.py", line 133, in main
    return app.run()
  File "./bin/tweets-client.py", line 102, in run
    for tweet in stream:
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweetstream/streamclasses.py", line 165, in __iter__
    self._init_conn()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweetstream/streamclasses.py", line 99, in _init_conn
    self._conn = opener.open(req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    result = self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 890, in http_error_401
    url, req, headers)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 865, in http_error_auth_reqed
    response = self.retry_http_basic_auth(host, req, realm)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 878, in retry_http_basic_auth
    return self.parent.open(req, timeout=req.timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 406: Not Acceptable

I tried changing the username field in tweets-client.ini by putting the @ symbol before the username, and then I get a 401 authentication error:

Traceback (most recent call last):
  File "./bin/tweets-client.py", line 150, in <module>
    sys.exit(main(args))
  File "./bin/tweets-client.py", line 133, in main
    return app.run()
  File "./bin/tweets-client.py", line 102, in run
    for tweet in stream:
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweetstream/streamclasses.py", line 165, in __iter__
    self._init_conn()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweetstream/streamclasses.py", line 103, in _init_conn
    raise AuthenticationError("Access denied")
tweetstream.AuthenticationError: Access denied

Have you got any idea what's going wrong? Thanks in advance and keep up the good work.

TweetListener got bad status code: 406

Thanks for open sourcing this. I am excited to use it. I am having a similar issue as above, but I have a record in the database. Here is the output I get. Any help very much appreciated.

send: 'POST /1.1/statuses/filter.json?follow=barackobama&delimited=length HTTP/1.1\r\nHost: stream.twitter.com\r\nAccept-Encoding: identity\r\nContent-Length: 18\r\nContent-Type: application/x-www-form-urlencoded\r\nAuthorization: OAuth oauth_nonce="xxx", oauth_timestamp="1408564278", oauth_version="1.0", oauth_signature_method="HMAC-SHA1", oauth_consumer_key="xxx", oauth_token="xxx", oauth_signature="xxx"\r\n\r\nfollow=barackobama'
reply: 'HTTP/1.1 406 Not Acceptable\r\n'
header: connection: close
header: content-length: 105
header: Content-Type: text/html
header: date: Wed, 20 Aug 2014 19:51:15 UTC
header: server: tsa
header: x-connection-hash: 3d37c40128c8f55f2e4aacb232c424da
[2014-08-20 19:51] ERROR: tweets-client.py: TweetListener got bad status code: 406

TweetListener got bad status code: 406

I'm trying to run the collector but I get an error when trying to get the twitter stream into beanstalk.

I get this error:

...
[2014-02-02 14:09] NOTICE: tweets-client.py: Authenticated as retirolodichouy
send: 'POST /1.1/statuses/filter.json?delimited=length HTTP/1.1\r\nHost: stream.twitter.com\r\nAccept-Encoding: identity\r\nContent-Length: 0\r\nContent-type: application/x-www-form-urlencoded\r\nAuthorization: OAuth realm="", oauth_nonce="XXXX", oauth_timestamp="XXX", oauth_consumer_key="XXX", oauth_signature_method="HMAC-SHA1", oauth_version="1.0", oauth_token="XX", oauth_signature="XXX"\r\n\r\n'
reply: 'HTTP/1.1 406 Not Acceptable\r\n'
header: Content-Type: text/html
header: Transfer-Encoding: chunked
[2014-02-02 14:09] ERROR: tweets-client.py: TweetListener got bad status code: 406

Any idea what it could be or how to debug it?
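
One thing the two 406 logs have in common: in the first, the request carries follow=barackobama (a screen name), and in the second the request body is empty (Content-Length: 0), so no filter was sent at all. The v1.1 statuses/filter endpoint expected numeric user IDs in follow and at least one filter parameter, so either condition would plausibly explain the 406. This is an inference from the logs, not a confirmed diagnosis. A sketch of a check that could run before opening the stream (the function names are made up for this example):

```python
def invalid_follow_ids(follow):
    """Return the entries that are not purely numeric user IDs."""
    return [item for item in follow
            if not (str(item).isascii() and str(item).isdigit())]


def check_follow_list(follow):
    """Return a human-readable problem description, or None if the list looks usable."""
    if not follow:
        return "follow list is empty; the stream request would carry no filter"
    bad = invalid_follow_ids(follow)
    if bad:
        return "follow entries must be numeric user IDs, got: " + ", ".join(map(str, bad))
    return None
```

In this project the follow list comes from the database, so an empty or non-numeric result here points at the tracked-accounts table rather than at the credentials.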

Add a license

We at the OpenData group from Cordoba, Argentina (http://opendatacordoba.org/) are very interested in using this library to implement our own "politwoops" service, but it's not clear whether it is libre/open source software because it has no clearly defined license.

utils.Heart __init__ buggy...

Hi there,
the problem is in utils.py:

        try:
            directory = config.get('tweets-client', 'heartbeats_directory')
        except:
            logbook.warning("No heartbeats_directory configuration parameter, skipping heartbeat.")
            raise StopIteration
        if not os.path.isdir(directory):
            logbook.warning("The heartbeats_directory parameter ({0}) is not a directory.",
                             directory)
            raise StopIteration

This looks unremarkable, BUT in tweets-client.ini.example the heartbeats_directory is empty:

# The directory should otherwise be empty.
heartbeats_directory=

so the following condition (L179) will always be True:

if not os.path.isdir(directory):

That may be just as well, because if the directory were "./", self.filepath would be ./script.py, and the script itself would be overwritten:

        scriptname = os.path.basename(sys.argv[0])
        self.filepath = os.path.join(directory, scriptname)

        start_time = datetime.datetime.now().isoformat()
        self.pid = os.getpid()
        with file(self.filepath, 'w') as fil:
            fil.write(anyjson.serialize({
                'pid': self.pid,
                'started': start_time
            }))

I think this is both a bug and a security risk. Maybe the file extension should be stripped from the script name before the with statement, to avoid overwriting an existing file?
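
One way to address both points raised above, sketched under assumptions: treat an empty or invalid heartbeats_directory as "heartbeat disabled" instead of raising, and derive the heartbeat filename from the script name with a distinct suffix so it can never collide with the script itself. The function name and the ".heartbeat" suffix are invented for this example and are not what utils.py does.

```python
import os


def heartbeat_path(directory, script_path):
    """Return the heartbeat file path, or None if heartbeats should be skipped.

    An empty or non-directory setting disables the heartbeat instead of
    raising. The script's extension is replaced with '.heartbeat' so the
    heartbeat file cannot overwrite the script, even if both live in the
    same directory.
    """
    if not directory or not os.path.isdir(directory):
        return None
    stem = os.path.splitext(os.path.basename(script_path))[0]
    return os.path.join(directory, stem + ".heartbeat")
```

A caller would then write the pid and start time only when the returned path is not None.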

fix screenshot-worker bug

Because of the SSL POODLE issue, screenshot-worker.py no longer works.

It can be fixed, though. On line 255:

cmd = ["phantomjs", "js/rasterize.js", url, fil.name]

change to

cmd = ["phantomjs", "--ssl-protocol=any", "js/rasterize.js", url, fil.name]

(PhantomJS's default ssl-protocol is SSLv3.)

Sorry for creating an issue instead of a PR.
I forked this project to change it to collect Facebook feeds,
so I'm not sure what would happen if I clicked "edit" on GitHub (which auto-forks and then sends a PR).
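
The proposed change can be isolated in a small helper so the SSL protocol becomes a parameter rather than a hard-coded flag. This is a sketch only: the helper name is made up, and the argument order simply follows the command line quoted in this issue.

```python
def rasterize_command(url, out_path, ssl_protocol="any"):
    """Build the phantomjs argument list proposed in this issue.

    The list form (rather than one shell string) means the URL is passed
    as a single argument and is never interpreted by a shell.
    """
    return [
        "phantomjs",
        f"--ssl-protocol={ssl_protocol}",
        "js/rasterize.js",
        url,
        out_path,
    ]
```

The result can be handed to subprocess.run(cmd, check=True) in place of the original cmd list.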

table politicians doesn't exist

The first run of the command

PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/tweets-client.py

returns

[2021-01-07 17:06:42.824128] ERROR: tweets-client.py: Failed to send request: Failed to parse: https://api.twitter.com/1.1/account/verify_credentials.json
Traceback (most recent call last):
  File "./bin/tweets-client.py", line 232, in <module>
    sys.exit(main(args))
  File "./bin/tweets-client.py", line 213, in main
    return app.run()
  File "./bin/tweets-client.py", line 194, in run
    self.stream_forever()
  File "./bin/tweets-client.py", line 176, in stream_forever
    track_items = self.track.get_items()
  File "/home/skela/WORKSPACE/politwoops-tweet-collector/lib/tweetsclient/mysql_track.py", line 60, in get_items
    return self._get_trackings()
  File "/home/skela/WORKSPACE/politwoops-tweet-collector/lib/tweetsclient/mysql_track.py", line 52, in _get_trackings
    return self._query(conn, tbl, fld, cnd)
  File "/home/skela/WORKSPACE/politwoops-tweet-collector/lib/tweetsclient/mysql_track.py", line 44, in _query
    cursor.execute(q)
  File "/home/skela/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/skela/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/skela/.local/lib/python3.8/site-packages/MySQLdb/connections.py", line 259, in query
    _mysql.connection.query(self, query)
MySQLdb._exceptions.ProgrammingError: (1146, "Table 'tweets.politicians' doesn't exist")

This appears to be caused by the following query:

q = "SELECT `twitter_id`, `user_name`, `id` FROM `politicians` where status IN (1,2)"

A fix would be to add that table to database/schema.sql:

DROP TABLE IF EXISTS `politicians`;

CREATE TABLE `politicians` (
    `id` BIGINT UNSIGNED AUTO_INCREMENT,
    `twitter_id` VARCHAR(255),
    `user_name` VARCHAR(64),
    `status` TINYINT(1),
    PRIMARY KEY (`id`)
) DEFAULT CHARSET=UTF8;
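
To see that a table with these four columns is enough for the tracking query to run, the following sketch uses an in-memory SQLite database as a stand-in for MySQL, with types adapted accordingly. The rows are made-up sample data, and the meaning of the status values is not documented in this issue; the query only tells us that 1 and 2 are selected.

```python
import sqlite3

# SQLite stand-in for the proposed MySQL table; column names match the issue.
SCHEMA = """
CREATE TABLE politicians (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    twitter_id TEXT,
    user_name TEXT,
    status INTEGER
)
"""

# Same query as quoted above, without the MySQL backtick quoting.
QUERY = "SELECT twitter_id, user_name, id FROM politicians WHERE status IN (1, 2)"


def tracked_politicians(conn):
    """Run the tracking query and return all matching rows."""
    return conn.execute(QUERY).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO politicians (twitter_id, user_name, status) VALUES (?, ?, ?)",
    [("1001", "alice", 1), ("1002", "bob", 0), ("1003", "carol", 2)],
)
rows = tracked_politicians(conn)
```

Only the rows with status 1 or 2 come back, so an account has to be inserted with one of those values before the client will follow it.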

Error during initial setup

I am trying to get politwoops running.

I assume that politwoops-tweet-collector should point to the same database as the Rails app, right? (This is not clear from the README.)

I've installed and started beanstalkd and created the initial configuration as described in the README,

but I'm stuck with this error whenever I try starting the collector:

$ PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/tweets-client.py
[2013-03-21 15:20] WARNING: Generic: The heartbeats_directory parameter () is not a directory.
Traceback (most recent call last):
  File "./bin/tweets-client.py", line 150, in <module>
    sys.exit(main(args))
  File "./bin/tweets-client.py", line 133, in main
    return app.run()
  File "./bin/tweets-client.py", line 99, in run
    with politwoops.utils.Heart() as heart:
  File "/home/mahmoud/code/politwoops-tweet-collector/lib/politwoops/utils.py", line 160, in __init__
    raise StopIteration
StopIteration

Error on handling deletion

Hi again

When I try to delete an inserted tweet from twitter, I get back this error from politwoops-worker

[2012-10-15 15:19] NOTICE: politwoops-worker.py: Deleted tweet 257863095957999616
./bin/politwoops-worker.py:165: Warning: Out of range value for column 'id' at row 1
cursor.execute("""REPLACE INTO tweets (id, deleted, modified, created) VALUES(%s, 1, NOW(), NOW())""", (tweet['delete']['status']['id']))
[2012-10-15 15:19] ERROR: Generic: Unhandled exception of type <type 'exceptions.TypeError'>: 'NoneType' object has no attribute '__getitem__'
Traceback (most recent call last):
  File "./bin/politwoops-worker.py", line 308, in <module>
    sys.exit(main(args))
  File "./bin/politwoops-worker.py", line 280, in main
    return app.run()
  File "./bin/politwoops-worker.py", line 120, in run
    self.handle_tweet(job.body)
  File "./bin/politwoops-worker.py", line 145, in handle_tweet
    self.handle_deletion(tweet)
  File "./bin/politwoops-worker.py", line 169, in handle_deletion
    self.send_alert(ref_tweet[1], ref_tweet[4], ref_tweet[2])
TypeError: 'NoneType' object has no attribute '__getitem__'

Have you got any idea what the problem is? Thanks in advance.
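
The traceback shows that ref_tweet is None when it is indexed on line 169, which suggests that the lookup of the deleted tweet returned no row. The preceding "Out of range value for column 'id'" warning hints at why: the stored ID may not match the real tweet ID. Both readings are inferences from the log. A defensive sketch for the alert step follows; the function name is made up, and the column indices are the ones used in the traceback.

```python
def alert_for_deleted(ref_tweet, send_alert):
    """Send the deletion alert only if the referenced tweet row was found.

    ref_tweet is the row fetched for the deleted tweet, or None when the
    database has no matching row. Returns True if an alert was sent.
    """
    if ref_tweet is None:
        # Nothing to report on; the caller can log this and move on
        # instead of crashing the worker.
        return False
    send_alert(ref_tweet[1], ref_tweet[4], ref_tweet[2])
    return True
```

This only keeps the worker alive; the underlying mismatch between stored and real tweet IDs would still need fixing in the schema.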

duplicate tweets insertion in the database

When running politwoops-worker.py to read the tweets from beanstalkd and insert them into MySQL:
PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/politwoops-worker.py

the first tweet gets inserted, but it seems it is not deleted from beanstalkd. Shortly afterwards the script tries to insert it again, leading to duplicate errors:

$ PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/politwoops-worker.py 

[2013-03-28 15:25] NOTICE: politwoops-worker.py: Log level NOTICE
[2013-03-28 15:25] NOTICE: politwoops-worker.py: New tweet 317294908039901186 from user 14830552/modsaid
./bin/politwoops-worker.py:197: Warning: Out of range value for column 'id' at row 1
  cursor.execute("""INSERT INTO `tweets` (`id`, `user_name`, `politician_id`, `content`, `created`, `modified`, `tweet`) VALUES(%s, %s, %s, %s, NOW(), NOW(), %s)""", (tweet['id'], tweet['user']['screen_name'], self.users[tweet['user']['id']], tweet['text'], anyjson.serialize(tweet)))

[2013-03-28 15:27] NOTICE: politwoops-worker.py: New tweet 317296871301324800 from user 14830552/modsaid
[2013-03-28 15:27] ERROR: Generic: Unhandled exception of type <class '_mysql_exceptions.IntegrityError'>: (1062, "Duplicate entry '2147483647' for key 'PRIMARY'")
Traceback (most recent call last):
  File "./bin/politwoops-worker.py", line 308, in <module>
    sys.exit(main(args))
  File "./bin/politwoops-worker.py", line 280, in main
    return app.run()
  File "./bin/politwoops-worker.py", line 120, in run
    self.handle_tweet(job.body)
  File "./bin/politwoops-worker.py", line 148, in handle_tweet
    self.handle_new(tweet)
  File "./bin/politwoops-worker.py", line 197, in handle_new
    cursor.execute("""INSERT INTO `tweets` (`id`, `user_name`, `politician_id`, `content`, `created`, `modified`, `tweet`) VALUES(%s, %s, %s, %s, NOW(), NOW(), %s)""", (tweet['id'], tweet['user']['screen_name'], self.users[tweet['user']['id']], tweet['text'], anyjson.serialize(tweet)))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.IntegrityError: (1062, "Duplicate entry '2147483647' for key 'PRIMARY'")
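
A note on the number in the error: 2147483647 is 2^31 - 1, the largest value a signed 32-bit INT can hold. The log shows two different tweet IDs, each preceded or followed by an "Out of range value for column 'id'" report, which is consistent with MySQL (in non-strict mode) clamping both IDs to the column maximum, so that the second insert collides with the first. This assumes the tweets.id column is a signed INT; the schema is not shown in this issue. The arithmetic can be checked directly:

```python
SIGNED_INT_MAX = 2**31 - 1        # 2147483647, the value in the error message
UNSIGNED_BIGINT_MAX = 2**64 - 1   # upper bound of a BIGINT UNSIGNED column


def clamp_to_signed_int(value):
    """Mimic non-strict MySQL storing an out-of-range value in a signed INT."""
    return max(-2**31, min(value, SIGNED_INT_MAX))


# The two tweet IDs from the log above.
tweet_ids = [317294908039901186, 317296871301324800]
stored = [clamp_to_signed_int(t) for t in tweet_ids]

# Both IDs collapse to the same stored value, hence the duplicate key error.
assert stored[0] == stored[1] == SIGNED_INT_MAX
# Both IDs fit comfortably in an unsigned 64-bit column.
assert all(t <= UNSIGNED_BIGINT_MAX for t in tweet_ids)
```

If that assumption holds, declaring the id column as BIGINT UNSIGNED would remove both the warning and the duplicate entry error.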

Open to enhancements?

Hey, this looks like a neat tool!

Is this something you are actively using? Let me know, would love to do some work on this, like testing and/or dockerization

Thanks!
