
Comments (8)

cakiki commented on June 2, 2024

Yes, I still haven't managed to find time to rehydrate the dataset. I will get to it this weekend.


cakiki commented on June 2, 2024

@albertvillanova I've rehydrated the dataset but there are two problems:

  1. Half of the tweets (50.74% of the 2,841,125 tweet IDs) can no longer be retrieved, because they were either deleted or their authors went private. (Apparently a common problem with rehydration.)
  2. A lot of the tweets (32.71% of the 1,399,387 that were retrieved) are actually retweets and are therefore truncated, like so:
'RT @COFMadrid: Con motivo del 30 aniversario, @FarmaSinFronter acerca el arte solidario a favor del área materno-infantil en la ciudad de T…'
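
(As a rough cross-check of those figures: 50.74% of 2,841,125 is about 1.44M unrecoverable IDs, leaving roughly 1.40M retrieved tweets, consistent with the 1,399,387 base above; 32.71% of that is about 458k truncated retweets.)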

(Not sure about the language distribution of the data either)

We can't really do anything about Problem 1.

Problem 2 could be solved with a second pass through the data. Let me know if I should retrieve the original tweets. (On second thought, I suspect that a lot of them are already part of the corpus and would therefore be duplicates of no interest for a language modeling task.)
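
If we do decide on that second pass, here is a minimal sketch of how the originals could be fetched (not code I have actually run): ask the v2 lookup to expand the referenced tweets, so the untruncated originals come back alongside the retweets. retweet_ids is a hypothetical list of the truncated retweets' IDs, and batcher is the helper from the snippet further down.

# Sketch only: fetch the original (untruncated) tweets referenced by retweets.
import tweepy

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)

originals = {}
for batch in batcher(retweet_ids, 100):  # retweet_ids: IDs of the truncated retweets (hypothetical)
    resp = client.get_tweets(
        list(batch),
        expansions=['referenced_tweets.id'],  # pulls the referenced tweets into resp.includes
    )
    for t in resp.includes.get('tweets', []):  # full text of the originals
        originals[t.id] = t.text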

For reference, in case someone wants to rehydrate a tweet dataset later in the project, this is how I used the Twitter API v2 (via tweepy) to do it.
Keep in mind that this ran for almost a full day (most of it spent sleeping, as it hit the rate limit 94 times and waited around 780 seconds each time), so it might not be the best code.

import pickle
from itertools import zip_longest, chain

import pandas as pd
import tweepy

# Group an iterable into tuples of n items, padding the last batch with a dummy
# 19-digit ID (https://docs.python.org/3/library/itertools.html#itertools-recipes)
def batcher(iterable, n, fillvalue=19 * '1'):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)
tweets_df = pd.read_csv('./coronatweetids.csv.gz', compression='gzip')

# The v2 tweet-lookup endpoint accepts at most 100 IDs per request.
tweet_list = []
for batch in batcher(tweets_df['tweet_id'].tolist(), 100):
    ids = list(batch)
    tweet_list.append(client.get_tweets(ids).data)  # Response.data is the list of Tweet objects
tweets = list(chain.from_iterable(tweet_list))

# Keep a raw copy of the Tweet objects before flattening into a DataFrame.
with open('tweets.pkl', 'wb') as f:
    pickle.dump(tweets, f)

# Flatten to a DataFrame and join the retrieved fields back onto the ID list.
df = pd.DataFrame([dict(t) for t in tweets]).rename(columns={'id': 'tweet_id'})
tweets_df = tweets_df.merge(df, on='tweet_id', how='left')
tweets_df.to_csv('./coronatweets.csv')

Timing output:
CPU times: user 8min 4s, sys: 8.06 s, total: 8min 12s
Wall time: 23h 30min 47s
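
For the record, 94 waits is about what one would expect: 2,841,125 IDs in batches of 100 is roughly 28,400 lookup requests, and assuming the v2 tweet-lookup app limit of 300 requests per 15-minute window, that is about 95 windows (roughly 23.7 h elapsed), which lines up with the 94 rate-limit sleeps and the ~23.5 h wall time above.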


cakiki commented on June 2, 2024

#self-assign


cakiki commented on June 2, 2024

This dataset consists of three .xlsx files of Tweet IDs. Use of this dataset to rehydrate tweets is solely for non-commercial research purposes and subject to Twitter's terms, including: Twitter Terms of Service, Privacy Policy, Developer Agreement and Policy.
It is also a condition of use of the dataset that you provide attribution of the dataset to the Digital Observatory.

source: https://researchdatafinder.qut.edu.au/display/n10613


albertvillanova commented on June 2, 2024

I think it is OK given these requirements.

Also see in the catalogue entry:

primary_license: Yes - the dataset curators have obtained consent from the source material owners


cakiki commented on June 2, 2024

Consolidated all three Excel sheets into one .csv using:

import pandas as pd

# Read the three Excel files and tag each row with its source file.
t_0 = pd.read_excel('./coronatweetids0.xlsx', sheet_name="Sheet1")
t_1 = pd.read_excel('./coronatweetids1.xlsx', sheet_name="Sheet1")
t_2 = pd.read_excel('./coronatweetids2.xlsx', sheet_name="Sheet1")
t_0["file"] = "file_0"
t_1["file"] = "file_1"
t_2["file"] = "file_2"

# Concatenate, keeping each row's position within its source file as 'file_index'.
df = pd.concat([t_0, t_1, t_2]).reset_index(drop=False).rename(columns={'index': 'file_index'})
df.to_csv('./dataset.csv')

https://huggingface.co/datasets/bigscience-catalogue-data/100_days_of_covid_19_in_the_australian_twittersphere

Will rehydrate Tweets next.


albertvillanova commented on June 2, 2024

I guess this dataset still needs the text content for each tweet (each record currently contains only the tweet ID).

I have compressed the data file and checked that it loads OK:

{'tweet_id': 1219627299085012992}
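
For anyone reproducing that step, a minimal sketch of one way to do the compress-and-check (file names here are assumptions, not the actual paths used):

import pandas as pd
from datasets import load_dataset

# Gzip-compress the CSV of tweet IDs (keeping only the ID column).
pd.read_csv('./dataset.csv')[['tweet_id']].to_csv('./tweet_ids.csv.gz', index=False, compression='gzip')

# Check that the compressed file loads cleanly.
ds = load_dataset('csv', data_files='./tweet_ids.csv.gz', split='train')
print(ds[0])  # e.g. {'tweet_id': 1219627299085012992}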


albertvillanova commented on June 2, 2024

Thanks @cakiki.

Let's keep this dataset out of the final LM scripts for the moment...

