Comments (8)
Yes, I still haven't managed to find time to rehydrate the dataset. I will get to it this weekend.
@albertvillanova I've rehydrated the dataset but there are two problems:
- Half the tweets (50.74% of the 2,841,125 tweet IDs) can no longer be retrieved: they were either deleted or their authors went private. (I've heard this is a common problem with rehydration.)
- A lot of the retrieved tweets are actually retweets (32.71% of the 1,399,387 retrieved tweets) and are therefore truncated, like so:
'RT @COFMadrid: Con motivo del 30 aniversario, @FarmaSinFronter acerca el arte solidario a favor del área materno-infantil en la ciudad de T…'
(I'm not sure about the language distribution of the data either.)
We can't really do anything about Problem 1.
Problem 2 could be solved with a second pass through the data. Let me know if I should retrieve the original tweets. (On second thought, I suspect a lot of them are already part of the corpus and would therefore be duplicates of no interest to a language modeling task.)
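For what it's worth, here is a rough, untested sketch of what that second pass could look like, using the v2 referenced_tweets.id expansion (fetch_original_texts is just an illustrative name, and the bearer token is a placeholder):

import tweepy
from itertools import zip_longest

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)

def fetch_original_texts(retweet_ids):
    # Map retweet ID -> full text of the retweeted original
    texts = {}
    # The lookup endpoint takes at most 100 IDs per request
    batches = [iter(retweet_ids)] * 100
    for batch in zip_longest(*batches):
        ids = [i for i in batch if i is not None]  # drop padding in the last batch
        response = client.get_tweets(
            ids,
            expansions='referenced_tweets.id',  # pull in the retweeted originals
            tweet_fields=['referenced_tweets', 'text'],
        )
        # The expanded original tweets arrive in response.includes['tweets']
        originals = {t.id: t.text for t in response.includes.get('tweets', [])}
        for tweet in response.data or []:
            for ref in tweet.referenced_tweets or []:
                if ref.type == 'retweeted' and ref.id in originals:
                    texts[tweet.id] = originals[ref.id]
    return texts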
For reference, in case someone wants to rehydrate a tweets dataset later in the project, this is how I used Twitter API v2 to do it. Keep in mind that this ran for almost a full day (most of it spent sleeping, as it hit the rate limit 94 times, waiting around 780 seconds each time), so it might not be the best code:
import pickle
from itertools import zip_longest, chain

import tweepy
import pandas as pd

# https://docs.python.org/3/library/itertools.html#itertools-recipes
def batcher(iterable, n, fillvalue=19 * '1'):
    # Pad the last batch with a dummy 19-digit ID; the API reports
    # nonexistent IDs as errors and leaves them out of the data
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)
tweets_df = pd.read_csv('./coronatweetids.csv.gz', compression='gzip')

# The v2 tweet lookup endpoint takes at most 100 IDs per request
tweet_list = []
for batch in batcher(tweets_df['tweet_id'].tolist(), 100):
    ids = list(batch)
    tweet_list.append(client.get_tweets(ids).data or [])  # .data is None when nothing was found

tweets = list(chain.from_iterable(tweet_list))
with open('tweets.pkl', 'wb') as f:
    pickle.dump(tweets, f)

# Join the retrieved tweets back onto the ID frame; IDs that could not be
# rehydrated end up with NaN text
df = pd.DataFrame([dict(t) for t in tweets]).rename(columns={'id': 'tweet_id'})
tweets_df = tweets_df.merge(df, on='tweet_id', how='left')
tweets_df.to_csv('./coronatweets.csv')
CPU times: user 8min 4s, sys: 8.06 s, total: 8min 12s
Wall time: 23h 30min 47s
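A couple of pandas one-liners over the merged frame give the percentages above; a minimal sketch, assuming tweets_df carries the API's text column after the merge:

total = len(tweets_df)
retrieved = int(tweets_df['text'].notna().sum())  # rows the API actually returned
print(f'unretrievable: {100 * (total - retrieved) / total:.2f}% of {total}')

# Retweets come back truncated, with text of the form 'RT @user: ...'
retweets = int(tweets_df['text'].str.startswith('RT @', na=False).sum())
print(f'retweets: {100 * retweets / retrieved:.2f}% of {retrieved}')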
#self-assign
This dataset consists of three .xlsx files of Tweet IDs. Use of this dataset to rehydrate tweets is solely for non-commercial research purposes and subject to Twitter's terms, including: Twitter Terms of Service, Privacy Policy, Developer Agreement and Policy.
It is also a condition of use of the dataset that you provide attribution of the dataset to the Digital Observatory.
source: https://researchdatafinder.qut.edu.au/display/n10613
I think it is OK given these requirements.
Also see in the catalogue entry:
primary_license: Yes - the dataset curators have obtained consent from the source material owners
I consolidated all three Excel files into one .csv using:
import pandas as pd

# All three files keep their tweet IDs on "Sheet1"
t_0 = pd.read_excel('./coronatweetids0.xlsx', sheet_name='Sheet1')
t_1 = pd.read_excel('./coronatweetids1.xlsx', sheet_name='Sheet1')
t_2 = pd.read_excel('./coronatweetids2.xlsx', sheet_name='Sheet1')

# Keep track of which file each row came from, and its row index within that file
t_0['file'] = 'file_0'
t_1['file'] = 'file_1'
t_2['file'] = 'file_2'
df = pd.concat([t_0, t_1, t_2]).reset_index(drop=False).rename(columns={'index': 'file_index'})
df.to_csv('./dataset.csv')
Will rehydrate Tweets next.
I guess this dataset still needs the text content for each tweet; at the moment each record only holds an ID.
I have compressed the data file and checked that it loads OK; a sample record:
{'tweet_id': 1219627299085012992}
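For reference, a check along these lines confirms the load (a sketch; the file name ./dataset.csv.gz is my assumption based on the consolidation step above):

import pandas as pd

df = pd.read_csv('./dataset.csv.gz', compression='gzip')
print(df[['tweet_id']].iloc[0].to_dict())  # -> {'tweet_id': 1219627299085012992}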
Thanks @cakiki.
Let's keep this dataset out of the final LM scripts for the moment...