search-tweets-python's People

Contributors

42b, andypiper, aureliaspecker, fionapigott, jeffakolb, jimmoffitt, jrmontag, lfsando, oihamza-zz, v11ncent

search-tweets-python's Issues

tweet_mode

How do I get the full text with tweet_mode=extended?

The tweets I get are truncated.

[QUESTION] How to make a rule to avoid retweets?

Hello, I'm getting every tweet this way:

rule = gen_rule_payload("rihana", results_per_call=100)
tweets = collect_results(rule, max_results=100, result_stream_args=cred)

But I'm not interested in retweets, so I'm wasting requests by filtering tweets after I get them.

I sincerely tried to understand the docs before asking, but I can't understand how PowerTrack rules really work, and whether they have anything to do with this.
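For reference, a sketch of an exclusion rule, assuming your product package includes the -is:retweet operator (a later issue in this list notes that it is not available to sandbox premium accounts):

rule = gen_rule_payload("rihana -is:retweet", results_per_call=100)
tweets = collect_results(rule, max_results=100, result_stream_args=cred)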

Consolidate API setup boilerplate into helper function

I'm noticing a pattern in the examples (and my own code) where I create the requisite YAML file - containing my endpoint, account, and user creds - and then I write a line or two of JSON/dict parsing to create a *_args object which goes into the collect_results() function. Here's an example from some code I'm working on right now:

with open(".twitter_keys.yaml") as f:
    creds = yaml.load(f)

search_endpoint = creds["search_api"]["endpoint"]
count_endpoint = change_to_count_endpoint(search_endpoint)

search_args = {"username": creds["search_api"]["username"],
               "password": creds["search_api"]["password"],
               #"bearer_token": creds["search_api"]["bearer_token"],
               "endpoint": search_endpoint,
               }
count_args = {"username": creds["search_api"]["username"],
              "password": creds["search_api"]["password"],
              # "bearer_token": creds["search_api"]["bearer_token"],
              "endpoint": count_endpoint,
             }    

rule = gen_rule_payload('cannes', from_date='2017-05-17', to_date='2017-05-29')

tweets = collect_results(rule, max_results=1000, result_stream_args=search_args)

It might be clearer to the user if we set strict expectations about the YAML contents and then used a helper function to hide the requisite YAML parsing and manipulation. I imagine replacing the above with the following:

with open(".twitter_keys.yaml") as f:
    creds = yaml.load(f)
    # this new dict has all the relevant keys for collect_results()
    api_args = get_api_args(creds)

rule = gen_rule_payload('cannes', from_date='2017-05-17', to_date='2017-05-29')

# and collect_results() could use a simple endpoint switch
tweets = collect_results(rule, max_results=1000, result_stream_args=api_args, endpoint='tweets')

where get_api_args() handles the YAML parsing, the string manipulation for the different endpoints, and the conditional logic for optional YAML keys.

This streamlines the user experience (fewer lines of code), but doesn't add much abstraction. Thoughts? If it sounds useful, I'm happy to take a first stab at it.
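A rough sketch of what such a helper might look like (get_api_args is the hypothetical name proposed above, and the YAML keys are assumed to match the example credentials file):

def get_api_args(creds):
    """Build a result_stream_args dict from a parsed credentials dict.

    Hypothetical helper sketched for this proposal; not part of the library.
    """
    api = creds["search_api"]
    api_args = {"endpoint": api["endpoint"]}
    if "bearer_token" in api:
        # prefer a bearer token when one is present
        api_args["bearer_token"] = api["bearer_token"]
    else:
        api_args["username"] = api["username"]
        api_args["password"] = api["password"]
    return api_args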

doesn't work for python 2.7

I was trying to use searchtweets in Python 2.7; just importing searchtweets returns something like:

return {**dict1, **dict2}
^
SyntaxError: invalid syntax

I changed to python 3.6 and it works fine.

PyYAML 5.1 warning

Currently, running searchtweets against the latest PyYAML release results in a warning on loading the configuration file:

credentials.py:34: YAMLLoadWarning: calling yaml.load() without Loader=… is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

More information in this issue.

This is a minor / low priority fix, as the syntax is currently only deprecated, and the code will still work despite the warning being issued.
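The standard remedy is to pass an explicit Loader or to use yaml.safe_load; against the line quoted above, that would be something like:

search_creds = yaml.load(f, Loader=yaml.SafeLoader)[yaml_key]
# or, equivalently:
search_creds = yaml.safe_load(f)[yaml_key]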

python version compatibility

The merge_dicts function is only compatible with Python 3.5+ due to the {**dict1, **dict2} syntax. It's a trivial fix to make it compatible with older versions of python 3.

Thanks @jimmoffitt for the note.
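For reference, a version that works on Python 2.7 as well as older Python 3 releases could look like this (a sketch; the actual fix in the library may differ):

def merge_dicts(*dicts):
    """Merge dicts left to right, later keys winning; avoids the 3.5+ literal."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged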

n_requests not consistent

I'm trying to retrieve data from the full-archive API. The following query returns 8 tweets correctly but 4 requests (rs.n_requests) are executed.

rs = ResultStream(rule_payload=gen_rule_payload("china from:realdonaldtrump",
                                                from_date='2015-06-01',
                                                to_date='2015-09-01',
                                                results_per_call=100),
                  max_results=100,
                  **premium_search_args)

data = list(rs.stream())

Since one request should return up to 100 tweets for sandbox accounts, why does this query consume 4 requests?

UPDATE: Sorry, this is no bug of the lib. This issue is discussed here https://twittercommunity.com/t/using-more-requests-than-expected-with-searchtweets-module-and-fullarchive-endpoint/106722/2

[KeyError: 'maxResults'] Error when creating count_rule

I tried connecting to the Counts API endpoint via the documentation here:

https://github.com/twitterdev/search-tweets-python#counts-endpoint

count_rule = gen_rule_payload("beyonce", count_bucket="day")

But this throws me the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-33-62bbdf3105fa> in <module>
----> 1 count_rule = gen_rule_payload('beyonce', count_bucket="day") # count_bucked may be "day", "hour" or "minute"

~\Anaconda3\lib\site-packages\searchtweets\api_utils.py in gen_rule_payload(pt_rule, results_per_call, from_date, to_date, count_bucket, tag, stringify)
    128         if set(["day", "hour", "minute"]) & set([count_bucket]):
    129             payload["bucket"] = count_bucket
--> 130             del payload["maxResults"]
    131         else:
    132             logger.error("invalid count bucket: provided {}"

KeyError: 'maxResults'

I ended up specifying results_per_call=500 in the gen_rule_payload arguments, which solved my problem. However, this seemed kind of scary when working with a paid API. I don't know if this is working as intended, but please either fix the documentation or the code :)
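For reference, the workaround described above in code form:

count_rule = gen_rule_payload("beyonce", results_per_call=500, count_bucket="day")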

ResultStream stream() function does not paginate

I'm using a ResultStream object with search args from a ./.twitter_keys.yaml config file that point at a premium FAS endpoint. Our subscription is for the maximum premium monthly rate limit of 2,500 requests.

My YAML file:

search_tweets_premium:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/:my_dev_env.json
  consumer_key: <my_key>
  consumer_secret: <my_secret>

My code (which I'm running from a Jupyter notebook):

from searchtweets import ResultStream, gen_rule_payload, load_credentials, collect_results

premium_search_args = load_credentials(filename="./.twitter_keys.yaml",
                 yaml_key="search_tweets_premium",
                 env_overwrite=False)

rule = gen_rule_payload("from:karenm_2000", results_per_call=500)
print(rule)

rs = ResultStream(rule_payload=rule,
                  max_results=500,
                  max_pages=100,
                  **premium_search_args)

tweets = list(rs.stream())

outputs...

{"query": "from:karenm_2000", "maxResults": 500}
searchtweets.result_stream - INFO - using bearer token for authentication
searchtweets.result_stream - INFO - ending stream at 38 tweets

After a careful manual review of @karenm_2000's timeline, we see that 38 is the number of tweets this user has published in the past 30 days (at the time of this writing). For whatever reason, ResultStream stops after 30 days instead of paginating through until it runs into one of the limiting args.

Query total tweet count

I think it would be really useful to add an option to get the total number of tweets for a given query. This can be done by pulling counts for that query and adding them up. It would allow you to check whether you have enough requests at the current premium/enterprise tier (500 tweets/request) and to calculate the total cost of running the query. This is especially useful if you need the premium API only for specific queries (e.g., analyzing tweets from users during a specific event).

This would be a similar feature to what I think GNIP PowerTrack used to have, to check the cost of the query before starting it.

I can work on a PR myself if this is considered non-priority internally, as I'm currently doing this manually and it's quite tiresome.
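A minimal sketch of how this can be done manually today with the counts endpoint (count_args here is assumed to hold credentials pointing at a /counts endpoint, as in the README; each bucket dict carries "count" and "timePeriod" keys):

count_rule = gen_rule_payload("my query", results_per_call=500, count_bucket="day")
counts = collect_results(count_rule, result_stream_args=count_args)
total_tweets = sum(bucket["count"] for bucket in counts)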

ResultStream raises a generic HTTPError

After reviewing result_stream.py I noticed that the retry decorator raises an HTTPError with no context surrounding the actual HTTP code encountered. See below:
def retry(func):
    """
    Decorator to handle API retries and exceptions. Defaults to three retries.

    Args:
        func (function): function for decoration

    Returns:
        decorated function

    """
    def retried_func(*args, **kwargs):
        max_tries = 3
        tries = 0
        while True:
            try:
                resp = func(*args, **kwargs)

            except requests.exceptions.ConnectionError as exc:
                exc.msg = "Connection error for session; exiting"
                raise exc

            except requests.exceptions.HTTPError as exc:
                exc.msg = "HTTP error for session; exiting"
                raise exc

            if resp.status_code != 200 and tries < max_tries:
                logger.warning("retrying request; current status code: {}"
                               .format(resp.status_code))
                tries += 1
                # mini exponential backoff here.
                time.sleep(tries ** 2)
                continue

            break

        if resp.status_code != 200:
            error_message = resp.json()["error"]["message"]
            logger.error("HTTP Error code: {}: {}".format(resp.status_code, error_message))
            logger.error("Rule payload: {}".format(kwargs["rule_payload"]))
            raise requests.exceptions.HTTPError

        return resp

    return retried_func
The Retry section of this page https://developer.twitter.com/en/docs/tweets/search/overview/enterprise.html indicates that requests should be throttled when a 503 is encountered. I'd also like to be able to respond to / log other HTTP error codes. Are there any plans to address this, or can I open a PR to do so?
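One possible shape for a more informative raise, sketched against the excerpt above (a sketch of the requested change, not the library's current behavior):

if resp.status_code != 200:
    error_message = resp.json()["error"]["message"]
    logger.error("HTTP Error code: {}: {}".format(resp.status_code, error_message))
    logger.error("Rule payload: {}".format(kwargs["rule_payload"]))
    # attach the message and response so callers can inspect exc.response.status_code
    raise requests.exceptions.HTTPError(
        "HTTP error {}: {}".format(resp.status_code, error_message),
        response=resp)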

Rate limit handling

From the community forums: it's worth considering an addition to handle rate limits for long-running requests or for other use cases.

Consider this more of a note for future discussion than for a specific implementation strategy, which could range from a time.sleep() call to auto-adjusting calls based on the type of environment (e.g., sandbox vs prod).
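As a starting point for discussion, the simplest form would be something like this inside the retry loop quoted in the previous issue (a placeholder sketch, not an agreed design):

if resp.status_code == 429:  # "Exceeded rate limit"
    logger.warning("rate limit hit; sleeping before retrying")
    time.sleep(60)  # placeholder interval; sandbox and paid tiers have different windows
    continue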

Simplify parsing version in setup.py?

If importing VERSION directly is out of the question, there seem to be a few other ways to simplify parsing the version code.

>>> import re
... 
... 
... def parse_version(str_):
...     """
...     Parses the program's version from a python variable declaration.
...     """
...     v = re.findall(r"\d+.\d+.\d+", str_)
...     if v:
...         return v[0]
...     else:
...         print("cannot parse string {}".format(str_))
...         raise KeyError
... 
>>> # original
... with open("./searchtweets/_version.py") as f:
...     _version_line = [line for line in f.readlines()
...                      if line.startswith("VERSION")][0].strip()
...     VERSION = parse_version(_version_line)
... 
>>> VERSION
'1.7.4'

>>> # no readlines
... with open("./searchtweets/_version.py") as f:
...     _version_line = [line for line in f
...                      if line.startswith("VERSION")][0].strip()
...     VERSION = parse_version(_version_line)
... 
>>> VERSION
'1.7.4'

>>> # use next instead of a list comp + zero index
... with open("./searchtweets/_version.py") as f:
...     _version_line = next(line for line in f
...                          if line.startswith("VERSION")).strip()
...     VERSION = parse_version(_version_line)
... 
>>> VERSION
'1.7.4'

>>> # remove unnecessary str.strip
... with open("./searchtweets/_version.py") as f:
...     _version_line = next(line for line in f
...                          if line.startswith("VERSION"))
...     VERSION = parse_version(_version_line)
... 
>>> VERSION
'1.7.4'

>>> # pass f.read() directly
... with open("./searchtweets/_version.py") as f:
...     VERSION = parse_version(f.read())
... 
>>> VERSION
'1.7.4'

>>> # get rid of parse_version function altogether
... with open("./searchtweets/_version.py") as f:
...     VERSION = re.search(r'\d+.\d+.\d+', f.read()).group()
... 
>>> VERSION
'1.7.4'

renaming package for pypi

I am renaming the package so we can distribute this on pypi. The new name is searchtweets and we should rename the repo as well.

I propose twitterdev/search-tweets-python for a new repo name, after @jimmoffitt's Ruby search API wrapper. Thoughts from @twitterdev/des-science ?

Is it possible to search for more than one user in a query?

Is it possible to search for tweets created by more than one user in a single query?

Going off the following provided example:

rule = gen_rule_payload("from:jack",
                        from_date="2017-09-01", #UTC 2017-09-01 00:00
                        to_date="2017-10-30",#UTC 2017-10-30 00:00
                        results_per_call=500)

I tried the following variations with no success:

rule = gen_rule_payload("from:jack", "from:bill",
rule = gen_rule_payload('"from:jack" OR "from:bill"', 
rule = gen_rule_payload("('from:jack', 'from:bill')",
rule = gen_rule_payload("('from:jack' OR 'from:bill',

warning ... YAMLLoadWarning ... deprecated

First-time user here. I get this red warning message:

~/env/.../python3.5/site-packages/searchtweets/credentials.py:34:
YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, 
as the default Loader is unsafe. 
Please read https://msg.pyyaml.org/load for full details.
  search_creds = yaml.load(f)[yaml_key]

when executing this code:

    premium_search_args = searchtweets.load_credentials("twitter_credentials.yaml",
                                       yaml_key="tweets_search_fullarchive",
                                       env_overwrite=False)

Full Archive Search - handling the next token for more than 500 results

Hi,

Thanks for this wonderful package. I'm using the enterprise full archive search feature for our company and was wondering how to grab more than 500 tweets. When a query matches more than 500 tweets, a next token is provided that leads to the next 500, right?

So how do I get all of the results of a full archive search, not just the first 500 tweets? Do the gen_rule_payload() and collect_results() functions give back all of the matching tweets, or only the first 500?

Thanks for the help!

Below is the full archive example script provided in the docs:

rule = gen_rule_payload("from:jack", 
    from_date="2017-09-01", #UTC 2017-09-01 00:0 
    to_date="2017-10-30",#UTC 2017-10-30 00:00
    results_per_call=500)

print(rule)

{"query":"from:jack","maxResults":500,"toDate":"201710300000","fromDate":"201709010000"}

tweets = collect_results(rule, max_results=500, result_stream_args=enterprise_search_args)
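For reference, ResultStream handles the next token and pagination for you; a minimal sketch, reusing enterprise_search_args from the example above:

rs = ResultStream(rule_payload=rule,
                  max_results=2000,   # stop after this many tweets
                  max_pages=10,       # or after this many pages/requests
                  **enterprise_search_args)
tweets = list(rs.stream())

collect_results() also paginates when max_results is larger than results_per_call; it wraps ResultStream internally.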

Unclear on use patterns: PATH script or local file in repo?

If I understand correctly, when the library is pip-installed, the current setup.py file copies the executable search_tweets.py into the bin/ dir of the relevant Python environment. In my experience, this is so the user can run e.g. (env)$ search_tweets.py from any location and the stand-alone script will still be on the $PATH. However, the current search_tweets.py doesn't have the #! line, so it leads to some unexpected errors - in both my case and this SO post, the resulting output comes from ImageMagick (of all places...).

In the README, the user is instructed to run the local, repo file as (env)$ python search_tweets.py from the tools/ directory. This requires the user to have the repo cloned locally.

I think it would be helpful to be more clear about which is the recommended way for the user to run the main search_tweets.py script (not when imported as a library).

My preference would be to enable the file to run by keeping it in the setup.py script and adding the appropriate shebang. Then, we could remove the language about using the repo version, thus removing the expectation that the user has downloaded or cloned the repo. But I'm happy to hear more about the relative trade-offs of the approaches here.
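For reference, the shebang in question is the usual one-liner at the top of the script:

#!/usr/bin/env python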

Add a log message to indicate endpoint switch to the user

The ResultStream constructor selects the /counts or /search endpoint by reading the JSON rule payload and checking whether there is a bucket key (here).

While convenient, this switch can also be confusing to the user. I'd like to propose a logging message if the endpoint is swapped. Perhaps a logging.warning so that it shows in a Jupyter notebook.
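Concretely, the proposal amounts to something like this next to the existing bucket check (a sketch of the proposal, not existing library code):

if isinstance(rule_payload, dict) and "bucket" in rule_payload:
    logger.warning("'bucket' key found in rule payload; "
                   "requests will be sent to the counts endpoint")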

Thoughts?

outputs 500 results despite changing config

Hi! I have set the max-results inside my config file to be something other than 500, but the output always has 500 results no matter how I change the config file, even if max-results is smaller than 500. I wonder if there is anything I have done wrong in this case.

truncated tweet does not contain `extended_tweet`

I am using the Premium API and, according to #62 (comment), I should be able to access extended_tweet, but that is not the case. Any ideas?

Here is one example of a tweet retrieved via Premium Search:

{'created_at': 'Tue Nov 26 14:44:42 +0000 2019', 'id': 1199338194476490755, 'id_str': '1199338194476490755', 'text': '@Twins Unfortunate that such a great looking uniform has been defaced with the mark of 🇺🇸 Cop hating, Communist lov… https://t.co/E39DoIaQUP', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'Twins', 'name': 'Minnesota Twins', 'id': 39397148, 'id_str': '39397148', 'indices': [0, 6]}], 'urls': [{'url': 'https://t.co/E39DoIaQUP', 'expanded_url': 'https://twitter.com/i/web/status/1199338194476490755', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': 1198979679136473088, 'in_reply_to_status_id_str': '1198979679136473088', 'in_reply_to_user_id': 39397148, 'in_reply_to_user_id_str': '39397148', 'in_reply_to_screen_name': 'Twins', 'user': {'id': 1358231077, 'id_str': '1358231077', 'name': 'Fishin🎣', 'screen_name': 'InDa906Eh', 'location': '', 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 33, 'friends_count': 306, 'listed_count': 0, 'created_at': 'Wed Apr 17 00:51:18 +0000 2013', 'favourites_count': 1714, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': False, 'statuses_count': 1111, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': '131516', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme14/bg.gif', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme14/bg.gif', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1181605743608381442/lGiYEUrc_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1181605743608381442/lGiYEUrc_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1358231077/1573519171', 'profile_link_color': '009999', 'profile_sidebar_border_color': 'EEEEEE', 'profile_sidebar_fill_color': 'EFEFEF', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'none'}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 0, 'favorite_count': 0, 'favorited': False, 'retweeted': False, 'lang': 'en'}

Command-line app refinement and testing

We need to ensure that the command line app works seamlessly for enterprise and premium users. I am also very open to redesigning the specification of arguments via either the command line or a file. Currently, the project is using configparser to pass args to the program, with overrides for redundant args passed via the command line.

Also, are there other use cases to support via the command line? Do we need to examine how the file saving or stdout usage occurs?

Add more specific metadata to request headers

For the purposes of tracking usage adoption, it would be nice to include some library and version information in the HTTP requests.

I believe the requests library under the hood of searchtweets sends something similar to 'User-Agent': 'python-requests/1.2.0' in its requests. Perhaps we could modify the headers like (roughly):

version = searchtweets.__version__
useragent = 'search-tweets-python/{}'.format(version)
headers = {'User-Agent': useragent}

I'm not sure if it would be best set in the request() method or in the make_session() method. The requests docs mention doing it in both places (former, latter).
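If it ends up in make_session(), the requests side of this is just a header update on the session, reusing the useragent variable from the snippet above (a sketch, assuming requests is imported):

session = requests.Session()
session.headers.update({"User-Agent": useragent})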

@binaryaaron @fionapigott @jeffakolb are there other metadata things that would be useful in the headers?

Error with load_credentials

Hello, after installing the package I created a YAML file as shown here:

search_tweets_premium:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json
  # if you have a bearer token, you can use it below. otherwise, swap the comment marks and use 
  # your app's consumer key/secret - the library will generate and use a bearer token for you. 
#bearer_token: <A_VERY_LONG_MAGIC_STRING> 
  consumer_key: <Sznpkb******************************>
  consumer_secret: <lP8Xr*****************************************************>

but when I run the code as shown here

premium_search_args = load_credentials("C:/Users/DRC/Desktop/.twitter_keys.yaml",
                                       yaml_key="search_tweets_premium",
                                       env_overwrite=False)

I got this error

cannot read file C:/Users/DRC/Desktop/.twitter_keys.yaml
Error parsing YAML file; searching for valid environment variables
Account type is not specified and cannot be inferred.
        Please check your credential file, arguments, or environment variables
        for issues. The account type must be 'premium' or 'enterprise'.
        

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-55-953b455c2646> in <module>
      1 premium_search_args = load_credentials("C:/Users/DRC/Desktop/.twitter_keys.yaml",
      2                                        yaml_key="search_tweets_premium",
----> 3                                        env_overwrite=False)

~\Anaconda3\lib\site-packages\searchtweets\credentials.py in load_credentials(filename, account_type, yaml_key, env_overwrite)
    187                    if env_overwrite
    188                    else merge_dicts(env_vars, yaml_vars))
--> 189     parsed_vars = _parse_credentials(merged_vars, account_type=account_type)
    190     return parsed_vars
    191 

~\Anaconda3\lib\site-packages\searchtweets\credentials.py in _parse_credentials(search_creds, account_type)
     80         """
     81         logger.error(msg)
---> 82         raise KeyError
     83 
     84     try:

KeyError:

If I modify the path in load_credentials and remove the dot before twitter_keys, the error message changes as shown here:

premium_search_args = load_credentials("C:/Users/DRC/Desktop/twitter_keys.yaml",
                                       yaml_key="search_tweets_premium",
                                       env_overwrite=False)
---------------------------------------------------------------------------
ScannerError                              Traceback (most recent call last)
<ipython-input-3-e34d8bf3703d> in <module>
      1 premium_search_args = load_credentials("C:/Users/DRC/Desktop/twitter_keys.yaml",
      2                                        yaml_key="search_tweets_premium",
----> 3                                        env_overwrite=False)

~\Anaconda3\lib\site-packages\searchtweets\credentials.py in load_credentials(filename, account_type, yaml_key, env_overwrite)
    179     filename = "~/.twitter_keys.yaml" if filename is None else filename
    180 
--> 181     yaml_vars = _load_yaml_credentials(filename=filename, yaml_key=yaml_key)
    182     if not yaml_vars:
    183         logger.warning("Error parsing YAML file; searching for "

~\Anaconda3\lib\site-packages\searchtweets\credentials.py in _load_yaml_credentials(filename, yaml_key)
     32     try:
     33         with open(os.path.expanduser(filename)) as f:
---> 34             search_creds = yaml.load(f)[yaml_key]
     35     except FileNotFoundError:
     36         logger.error("cannot read file {}".format(filename))

~\Anaconda3\lib\site-packages\yaml\__init__.py in load(stream, Loader)
     70     loader = Loader(stream)
     71     try:
---> 72         return loader.get_single_data()
     73     finally:
     74         loader.dispose()

~\Anaconda3\lib\site-packages\yaml\constructor.py in get_single_data(self)
     33     def get_single_data(self):
     34         # Ensure that the stream contains a single document and construct it.
---> 35         node = self.get_single_node()
     36         if node is not None:
     37             return self.construct_document(node)

~\Anaconda3\lib\site-packages\yaml\composer.py in get_single_node(self)
     34         document = None
     35         if not self.check_event(StreamEndEvent):
---> 36             document = self.compose_document()
     37 
     38         # Ensure that the stream contains no more documents.

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_document(self)
     53 
     54         # Compose the root node.
---> 55         node = self.compose_node(None, None)
     56 
     57         # Drop the DOCUMENT-END event.

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_node(self, parent, index)
     82             node = self.compose_sequence_node(anchor)
     83         elif self.check_event(MappingStartEvent):
---> 84             node = self.compose_mapping_node(anchor)
     85         self.ascend_resolver()
     86         return node

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_mapping_node(self, anchor)
    131             #    raise ComposerError("while composing a mapping", start_event.start_mark,
    132             #            "found duplicate key", key_event.start_mark)
--> 133             item_value = self.compose_node(node, item_key)
    134             #node.value[item_key] = item_value
    135             node.value.append((item_key, item_value))

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_node(self, parent, index)
     82             node = self.compose_sequence_node(anchor)
     83         elif self.check_event(MappingStartEvent):
---> 84             node = self.compose_mapping_node(anchor)
     85         self.ascend_resolver()
     86         return node

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_mapping_node(self, anchor)
    125         if anchor is not None:
    126             self.anchors[anchor] = node
--> 127         while not self.check_event(MappingEndEvent):
    128             #key_event = self.peek_event()
    129             item_key = self.compose_node(node, None)

~\Anaconda3\lib\site-packages\yaml\parser.py in check_event(self, *choices)
     96         if self.current_event is None:
     97             if self.state:
---> 98                 self.current_event = self.state()
     99         if self.current_event is not None:
    100             if not choices:

~\Anaconda3\lib\site-packages\yaml\parser.py in parse_block_mapping_key(self)
    426 
    427     def parse_block_mapping_key(self):
--> 428         if self.check_token(KeyToken):
    429             token = self.get_token()
    430             if not self.check_token(KeyToken, ValueToken, BlockEndToken):

~\Anaconda3\lib\site-packages\yaml\scanner.py in check_token(self, *choices)
    113     def check_token(self, *choices):
    114         # Check if the next token is one of the given types.
--> 115         while self.need_more_tokens():
    116             self.fetch_more_tokens()
    117         if self.tokens:

~\Anaconda3\lib\site-packages\yaml\scanner.py in need_more_tokens(self)
    147         # The current token may be a potential simple key, so we
    148         # need to look further.
--> 149         self.stale_possible_simple_keys()
    150         if self.next_possible_simple_key() == self.tokens_taken:
    151             return True

~\Anaconda3\lib\site-packages\yaml\scanner.py in stale_possible_simple_keys(self)
    287                 if key.required:
    288                     raise ScannerError("while scanning a simple key", key.mark,
--> 289                             "could not find expected ':'", self.get_mark())
    290                 del self.possible_simple_keys[level]
    291 

ScannerError: while scanning a simple key
  in "C:/Users/DRC/Desktop/twitter_keys.yaml", line 5, column 3
could not find expected ':'
  in "C:/Users/DRC/Desktop/twitter_keys.yaml", line 7, column 3

Is that because of the bearer token, or is there a mistake in my YAML file?
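For comparison, a credentials file with every key (and comment) indented consistently under the top-level key, in the shape the README documents, would look roughly like this (placeholder values, not real keys):

search_tweets_premium:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json
  # bearer_token: A_VERY_LONG_MAGIC_STRING
  consumer_key: YOUR_CONSUMER_KEY
  consumer_secret: YOUR_CONSUMER_SECRET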

Scraped tweets are the same for every request

I'm using the premium sandbox API for scraping tweets and can only get 100 tweets per request, so if I need to scrape 500 tweets from one day I need to make 5 requests. The problem is that each request returns the same 100 tweets. How can I make sure that the 100 tweets I get from one request are different from the 100 tweets returned by the other requests?

URL-encoded special character ($) returns 422 no viable character

Running:
search_tweets.py --credential-file .twitter_keys.yaml --max-results 100 --results-per-call 100 --filter-rule "%24tsla" --filename-prefix tsla_test --print-stream

Returns:
ERROR:searchtweets.result_stream:HTTP Error code: 422: There were errors processing your request: no viable alternative at character '%' (at position 1) ERROR:searchtweets.result_stream:Rule payload: {'query': '%24tsla', 'maxResults': 100}

Per Standard Operator docs, I would expect this call to succeed. FWIW, using tweepy with the same query string, the call succeeds.

Incorrectly getting 401 status code despite correct authentication parameters

I followed the README steps to install the library via pip, added the relevant code to a file (see below), ran it as a Python 3 script, and keep getting a 401 status code.

Relevant code invoking the library:

from searchtweets import ResultStream, gen_rule_payload, load_credentials, collect_results
premium_search_args = load_credentials("~/.twitter_keys.yaml",
                                   yaml_key="search_tweets_api",
                                   env_overwrite=False)
rule = gen_rule_payload("beyonce", results_per_call=100) # testing with a sandbox account
print(rule)
print("TOKEN is: '{}'".format(premium_search_args['bearer_token']))
tweets = collect_results(rule,
                     max_results=100,
                     result_stream_args=premium_search_args) # change this if you need to
[print(tweet.all_text, end='\n\n') for tweet in tweets[0:10]];

Confirmed that ~/.twitter_keys.yaml has good creds and that the relevant application setup is good - was able to use the bearer token printed from the library in a curl request as follows:

curl -X POST "https://api.twitter.com/1.1/tweets/search/30day/testing.json" -d '{"query": "beyonce", "maxResults": 100}' -H "Authorization: Bearer PASTED_TOKEN_HERE"

Using more requests than expected

Hi,

I'm not sure if this is a search-tweets module specific bug or if it is upstream - sorry if the latter. I posted on the twitter developer forum but no response yet (link to forum post at bottom).

I'm using the fullarchive endpoint with premium access and I've found I'm using more requests than intended (I'd like to use 1) when grabbing tweets for a handle over the last 90 days. Passing max_requests to ResultStream() also doesn't help. Finally, in this case only 88 tweets are matched by that search query. From my understanding of the docs, I should only ever be using 1 request (as max results is 500 and each request returns at most 500 tweets).

# imports assumed by this snippet; search_args and to_scrape are defined elsewhere
import pandas as pd
from searchtweets import ResultStream, gen_rule_payload

raw_data_test = {}

print("Starting: {} API calls used".format(ResultStream.session_request_counter))

def make_rule(handle, to_date, from_date):
    _rule_a = "from:"+handle
    rule = gen_rule_payload(_rule_a,
                        from_date=from_date,
                        to_date=to_date,
                        results_per_call=500)
    return rule

days_to_scrape = 90 
for indx_,handle_list, date in to_scrape[32:33]:
    
    to_datetime = pd.to_datetime(date)
    from_datetime = to_datetime - pd.Timedelta(days_to_scrape, unit='D')
    from_datestring = str(from_datetime.date())
    to_datestring = str(to_datetime.date())

    for handle in handle_list:
        #print(handle)
        print('collecting results...')
        search_query = make_rule(handle, to_datestring,from_datestring)

        print(search_query)
        rs = ResultStream(rule_payload=search_query,
                                      max_results=500,
                                      **search_args)

        results_list = list(rs.stream())
        print("You have used {} API calls".format(ResultStream.session_request_counter))
        raw_data_test[search_query] = results_list
        #time.sleep(2)

Output:
Starting: 0 API calls used
collecting results…
{"query": "from:xxxxxx", "maxResults": 500, "toDate": "201502050000", "fromDate": "201411070000"}
You have used 3 API calls

Please let me know if there is any further info that would help.

Thanks!

(https://twittercommunity.com/t/using-more-requests-than-expected-with-searchtweets-module-and-fullarchive-endpoint/106722)

Update credential doc for clarity and error handling

In working with @jimmoffitt, we noticed that credential handling setup is not 100% clear.

The flexibility introduced by #14 is perhaps not explicit about the default values of both the key file (.twitter_keys.yaml) and the keys within the file (search_tweets_api).

Also, there is no graceful error handling when a KeyError occurs while parsing the credential file.

Adding all premium operators in API

Is there a way we can extend the API params to support all operators provided by the Twitter Premium API (fullarchive or 30day)? I tried adding from: somename in the config.yaml file but it didn't work as expected.

results_per_call defaults to 100, not 500 as stated in the docs

I ran a collect_results query without setting the parameter results_per_call using a premium (and non-sandbox) endpoint, which supports 500 tweets per request. The docs say results_per_call defaults to the maximum allowed, which is 500.

max_results was set to 2500, so I expected 5 (or at most 6) requests. But according to my Twitter dashboard, 26 requests were used. I guess this corresponds approximately to 2500/100 = 25.
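Until the default is fixed (or the docs are updated), the workaround is to set it explicitly:

rule = gen_rule_payload("my query", results_per_call=500)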

ResultStream only returns first page ?

I use a Full Archive Premium account to search tweets from 2016. However, instead of receiving a list of tweets that are all different, I get a list that contains the same tweets n times. It may be a pagination problem; however, the number of unique tweets is 942, which is higher than 500. Here is my code:

from searchtweets import ResultStream, gen_rule_payload, load_credentials

premium_search_args = load_credentials(filename="./search_tweets_creds.yaml",
                 yaml_key="search_tweets_api",
                 env_overwrite=False)
rule = gen_rule_payload("(orlando shooting OR pulse) profile_country:US",
                        from_date="2016-06-14 15:00",
                        to_date="2016-06-14 16:00",
                        results_per_call=500)
rs = ResultStream(rule_payload=rule,
                  max_results=11800,
                  max_pages=24,
                  **premium_search_args)
tweets = list(rs.stream())

As a result, I get a list of 11800 tweets that looks like this (with example ids):

id | created_at
1 | Tue Jun 14 15:59:59 +0000 2016
2 | Tue Jun 14 15:59:58 +0000 2016
...
473 | Tue Jun 14 15:57:31 +0000 2016
1 | Tue Jun 14 15:59:59 +0000 2016
2 | Tue Jun 14 15:59:58 +0000 2016
...
844 | Tue Jun 14 15:54:53 +0000 2016
1 | Tue Jun 14 15:59:59 +0000 2016
...

I used the same code last month without problems; was there any change in the API recently that could cause this?

Search by coordinate

Hello! Is it possible to include coordinates in the gen_rule_payload argument? Maybe this already exists but I can't find a list of optional arguments. It also seems that you have to have something in the query field right? (Ideally I'd like to collect any tweets that occurred in a specific location over a specific time period, but is that possible with the premium API?) Thanks!

HTTP Rate Limit Error on Premium Full Archive Search?

I have 13 similar queries; 9 of them succeed, but 4 of them error out due to HTTP rate limiting.

Full Error Output:

ERROR:searchtweets.result_stream:HTTP Error code: 429: Exceeded rate limit
ERROR:searchtweets.result_stream:Rule payload: {
      'fromDate': '201708220000', 
      'maxResults': 500, 
      'toDate': '201709030000', 
      'query': 'from:1307298884 OR from:355574901 OR from:11996222 OR from:588649189 OR from:117808071 OR from:19401084 OR from:19401550 OR from:735608306585198592 OR from:241642996 OR from:27888518 OR from:56470183 OR from:14861004 OR from:106728683 OR from:19683725 OR from:229379349 OR from:16425419 OR from:20562924 OR from:76138084 OR from:76303 OR from:399111773 OR from:26774590 OR from:795296683533942784 OR from:159669473 OR from:345161879 OR from:4784677524 OR from:22654208 OR from:2227415623 OR from:139114806 OR from:34030321 OR from:140660078 OR from:247503175 OR from:14791386 OR from:3014668581 OR from:279136688 OR from:64798315 OR from:2247970886 OR from:704407090060730368 OR from:82960432 OR from:64523032 OR from:396045731 OR from:45564482 OR from:28645139 OR from:18205191 OR from:734824555051646976 OR from:357026180 OR from:44900997 OR from:14192680 OR from:4813084988 OR from:600958397 OR from:605603344 OR from:577537350 OR from:589419748 OR from:1429761 OR from:401527825 OR from:15937025 OR from:2788847458', 
'next': 'eyJtYXhJZCI6OTAwMjkyNzgzMzk0NjM1Nzc2fQ=='}

Relevant YAML configuration details:

search_params:
    results-per-call: 500
    max-results: 1000000

output_params:
    save_file: True
    results_per_file: 10000000

search_tweets_api:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/history.json

It also appears that these failed queries count against our paid requests, even though nothing gets written to the output file.

Failed to install by running `pip install searchtweets` on Mac

Hi there!

I ran into some problems when trying to install the searchtweets package. I ran pip install searchtweets but received this error:

....
long_description=open('README.rst', 'r', encoding="utf-8").read(),
    TypeError: 'encoding' is an invalid keyword argument for this function
------------------
Command "python setup.py egg_info" failed with error code 1 in .....

I'm on a Mac and I tried both Python 2 and Python 3, getting the same error each time. Does anyone have a hint about this? Or do I need to do anything else before installing?
Thank you!

problem with filter_rules

search_tweets.py --credential-file twitter_keys.yaml --max-results 100 --results-per-call 100 --filter-rule "HayatımınSırrı -is:retweet" --start-datetime "2017-08-21" --end-datetime "2017-08-23" --filename-prefix efsane5 --no-print-stream

I wrote this query because I want tweets from a specific date range without retweets, but when I add the -is:retweet parameter it gives the error below. My account is a sandbox premium account. How can I solve this problem?

HTTP Error code: 422: There were errors processing your request: Reference to invalid operator 'is:retweet'. Operator is not available in current product or product packaging

'Tweet' object has no attribute 'reply_count'

tweets = collect_results(rule, max_results=500, result_stream_args=premium_search_args)
for tweet in tweets: print(tweet.reply_count)

returns
Exception has occurred: AttributeError
'Tweet' object has no attribute 'reply_count'
but while debugging, I can clearly see the attribute reply_count in tweet. What gives?

Dates Are Wrong

Using the premium full archive and search-tweets, I have found a tweet (https://twitter.com/TheFatApple/status/29118205491) that has a date of Oct 29, 2010. However, the API says its 'created_at' attribute is November 4, 2010, which is wrong. This may be an API-related bug, but I'm bringing it up here just to see if anyone else has experienced something similar.

save as json

Hi, is there a way to save all queried tweets as JSON files first, before parsing them with the Tweet Parser?
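One way to do this today, assuming tweets came back from collect_results() or ResultStream (the Tweet objects from tweet_parser behave like dicts of the original payload), is to dump them line by line before any further parsing:

import json

with open("tweets.json", "w") as f:
    for tweet in tweets:
        f.write(json.dumps(tweet) + "\n")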

result_type parameter premium/enterprise search

Hi,

For standard search, I see a search parameter "result_type" that has the options: mixed, recent, popular. For developer/enterprise search do you guys know if there is a similar parameter to sort the tweet results?

Thanks!

ResultStream produces max_requests+1 requests

search_args = load_credentials()
rule = gen_rule_payload("rule", results_per_call=100)
rs = ResultStream(rule_payload=rule,
                  max_requests=1,
                  **search_args)
tweets = list(rs.stream())

This ResultStream produces 2 requests (and len(tweets) will be 200).

Specifying objects to return within json results

I can reliably return the full JSON with all the metadata with ResultStream and:

[print(tweet) for tweet in tweets]

However, I'm having trouble requesting just specific fields of interest (e.g. just the text) using the same methods you can use with the standard search API:

[print(tweet.full_text) for tweet in tweets]

Or trying to specify using the json structure:

[print(tweet.user.extended_tweet.full_text) for tweet in tweets]

Is there syntax specific to the premium endpoints that's required to trim some of the objects out of the json (ideally keeping the structure of those remaining)?
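For what it's worth, the Tweet objects returned by this library come from tweet_parser and expose convenience accessors rather than raw attribute chains; all_text, used elsewhere in these issues, handles truncated/extended and quoted tweets:

[print(tweet.all_text) for tweet in tweets]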

Python 2 / 3 compatibility

This is somewhat self explanatory, but this repo has only been built for python 3. Should we extend support to python 2.7? If so, this needs to be done.
