
Comments (8)

cakiki commented on June 2, 2024

Yes, I still haven't managed to find time to rehydrate the dataset. I will get to it this weekend.


cakiki commented on June 2, 2024

@albertvillanova I've rehydrated the dataset but there are two problems:

  1. Half of the tweets (50.74% of the 2,841,125 tweet IDs) can no longer be retrieved, because they were either deleted or their authors went private. (Apparently a common problem with rehydration.)
  2. A lot of the tweets (32.71% of the 1,399,387 that were retrieved) are actually retweets and are therefore truncated, like so:
'RT @COFMadrid: Con motivo del 30 aniversario, @FarmaSinFronter acerca el arte solidario a favor del área materno-infantil en la ciudad de T…'
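
(As a rough cross-check of those figures: 50.74% of 2,841,125 is about 1.44M unrecoverable IDs, leaving roughly 1.40M retrieved tweets, consistent with the 1,399,387 base above; 32.71% of that is about 458k truncated retweets.)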

(Not sure about the language distribution of the data either)

We can't really do anything about Problem 1.

Problem 2 could be solved with a second pass through the data. Let me know if I should retrieve the original tweets. (On second thought, I suspect that a lot of them are already part of the corpus and would therefore be duplicates of no interest for a language modeling task.)
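
If we do decide on that second pass, here is a minimal sketch of how the originals could be fetched (not code I have actually run): ask the v2 lookup to expand the referenced tweets, so the untruncated originals come back alongside the retweets. retweet_ids is a hypothetical list of the truncated retweets' IDs, and batcher is the helper from the snippet further down.

# Sketch only: fetch the original (untruncated) tweets referenced by retweets.
import tweepy

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)

originals = {}
for batch in batcher(retweet_ids, 100):  # retweet_ids: IDs of the truncated retweets (hypothetical)
    resp = client.get_tweets(
        list(batch),
        expansions=['referenced_tweets.id'],  # pulls the referenced tweets into resp.includes
    )
    for t in resp.includes.get('tweets', []):  # full text of the originals
        originals[t.id] = t.text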

For reference, in case someone wants to rehydrate a tweet dataset later in the project, this is how I used the Twitter API v2 (via tweepy) to do it.
Keep in mind that this ran for almost a full day (most of it spent sleeping, as it hit the rate limit 94 times and waited around 780 seconds each time), so it might not be the best code.

import pickle
from itertools import zip_longest, chain

import pandas as pd
import tweepy

# Group an iterable into tuples of n items, padding the last batch with a dummy
# 19-digit ID (https://docs.python.org/3/library/itertools.html#itertools-recipes)
def batcher(iterable, n, fillvalue=19 * '1'):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)
tweets_df = pd.read_csv('./coronatweetids.csv.gz', compression='gzip')

# The v2 tweet-lookup endpoint accepts at most 100 IDs per request.
tweet_list = []
for batch in batcher(tweets_df['tweet_id'].tolist(), 100):
    ids = list(batch)
    tweet_list.append(client.get_tweets(ids).data)  # Response.data is the list of Tweet objects
tweets = list(chain.from_iterable(tweet_list))

# Keep a raw copy of the Tweet objects before flattening into a DataFrame.
with open('tweets.pkl', 'wb') as f:
    pickle.dump(tweets, f)

# Flatten to a DataFrame and join the retrieved fields back onto the ID list.
df = pd.DataFrame([dict(t) for t in tweets]).rename(columns={'id': 'tweet_id'})
tweets_df = tweets_df.merge(df, on='tweet_id', how='left')
tweets_df.to_csv('./coronatweets.csv')

Timing output:
CPU times: user 8min 4s, sys: 8.06 s, total: 8min 12s
Wall time: 23h 30min 47s
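
For the record, 94 waits is about what one would expect: 2,841,125 IDs in batches of 100 is roughly 28,400 lookup requests, and assuming the v2 tweet-lookup app limit of 300 requests per 15-minute window, that is about 95 windows (roughly 23.7 h elapsed), which lines up with the 94 rate-limit sleeps and the ~23.5 h wall time above.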


cakiki commented on June 2, 2024

#self-assign


cakiki commented on June 2, 2024

This dataset consists of three .xlsx files of Tweet IDs. Use of this dataset to rehydrate tweets is solely for non-commercial research purposes and subject to Twitter's terms, including: Twitter Terms of Service, Privacy Policy, Developer Agreement and Policy.
It is also a condition of use of the dataset that you provide attribution of the dataset to the Digital Observatory.

source: https://researchdatafinder.qut.edu.au/display/n10613


albertvillanova commented on June 2, 2024

I think it is OK given these requirements.

Also see in the catalogue entry:

primary_license: Yes - the dataset curators have obtained consent from the source material owners


cakiki commented on June 2, 2024

Consolidated all three Excel sheets into one .csv using:

import pandas as pd

# Read the three Excel files and tag each row with its source file.
t_0 = pd.read_excel('./coronatweetids0.xlsx', sheet_name="Sheet1")
t_1 = pd.read_excel('./coronatweetids1.xlsx', sheet_name="Sheet1")
t_2 = pd.read_excel('./coronatweetids2.xlsx', sheet_name="Sheet1")
t_0["file"] = "file_0"
t_1["file"] = "file_1"
t_2["file"] = "file_2"

# Concatenate, keeping each row's position within its source file as 'file_index'.
df = pd.concat([t_0, t_1, t_2]).reset_index(drop=False).rename(columns={'index': 'file_index'})
df.to_csv('./dataset.csv')

https://huggingface.co/datasets/bigscience-catalogue-data/100_days_of_covid_19_in_the_australian_twittersphere

Will rehydrate Tweets next.


albertvillanova commented on June 2, 2024

I guess this dataset still needs the text content for each tweet (each record currently contains only the tweet ID).

I have compressed the data file and checked that it loads OK:

{'tweet_id': 1219627299085012992}
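
For anyone reproducing that step, a minimal sketch of one way to do the compress-and-check (file names here are assumptions, not the actual paths used):

import pandas as pd
from datasets import load_dataset

# Gzip-compress the CSV of tweet IDs (keeping only the ID column).
pd.read_csv('./dataset.csv')[['tweet_id']].to_csv('./tweet_ids.csv.gz', index=False, compression='gzip')

# Check that the compressed file loads cleanly.
ds = load_dataset('csv', data_files='./tweet_ids.csv.gz', split='train')
print(ds[0])  # e.g. {'tweet_id': 1219627299085012992}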


albertvillanova commented on June 2, 2024

Thanks @cakiki.

Let's keep this dataset out of the final LM scripts for the moment...

