hammou2020 / covid-19-arabic-tweets-dataset

This project forked from sarahalqurashi/covid-19-arabic-tweets-dataset

The repository contains a collection of Arabic tweet IDs associated with the novel coronavirus COVID-19. The dataset contains tweet IDs from 2020-01-01 to 2020-04-15. The Twitter search API was used to gather real-time tweets that contained specific keywords in the Arabic language. The dataset contains almost eight and a half million Arabic tweets.

License: Other

Jupyter Notebook 100.00%

COVID-19-Arabic-Tweets-Dataset

The repository contains a collection of Arabic tweet IDs related to the novel coronavirus COVID-19. The dataset contains tweet IDs starting from January 2020. The Twitter search API was used to gather real-time tweets that contained specific keywords in the Arabic language. To comply with Twitter’s Terms of Service, only the IDs of the tweets are released. This dataset is for non-commercial research use only.

Data Organization

  • As of April 19, 2020, we have tweets from January 2020 until April 15, 2020. We plan to add more months in the coming days and to update this page continuously.
  • Tweet-ID files are stored in folders that indicate the year and month of collection.
  • The Tweet-ID files contain the tweet IDs. All file names share the same structure, with the prefix “COVID19-tweetID-year-month-day”.
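The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the helper names `tweet_id_filename` and `parse_tweet_id_filename` are hypothetical, and the zero-padded month and day are an assumption, since the README gives only the general pattern.

```python
import re
from datetime import date

# Pattern from the README: COVID19-tweetID-year-month-day
# Assumption: four-digit year; month and day may or may not be zero-padded.
_NAME_RE = re.compile(r"^COVID19-tweetID-(\d{4})-(\d{1,2})-(\d{1,2})")


def tweet_id_filename(day: date) -> str:
    """Build the file-name prefix for one collection day (zero-padding assumed)."""
    return f"COVID19-tweetID-{day.year:04d}-{day.month:02d}-{day.day:02d}"


def parse_tweet_id_filename(name: str) -> date:
    """Recover the collection date from a file name that starts with the prefix."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a Tweet-ID file name: {name!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)
```

Parsing the date back out of a name makes it easy to sort files chronologically or to select a date range before hydrating.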

Dataset collection

  • Only Arabic-language tweets were collected, from February 1, 2020 to April 15, 2020.
  • The keywords.txt file contains the updated keywords along with the date we began tracking each of them. The Hashtags.txt file lists the hashtags that we followed in our Twitter dataset, the number of tweets collected for each hashtag, and the date we began tracking it.
  • Because Twitter’s search API restricts the amount of data that can be retrieved, some hours of data are missing.
  • We provide preliminary statistics of the dataset in the paper associated with this repository. The preliminary statistics will be automatically updated with every update of the dataset.
  • To retrieve the full tweet objects, consider using tools such as Hydrator and twarc.
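Because tweet IDs created since late 2010 are "snowflake" IDs that embed a millisecond timestamp, the hours with no collected tweets can be estimated from the IDs alone, before hydrating anything. The sketch below is not part of the repository; the function names are hypothetical, and it only detects hours that contain no IDs at all, not hours that were partially collected.

```python
from datetime import datetime, timedelta, timezone

# Offset (in ms since the Unix epoch) used by Twitter's snowflake IDs.
TWITTER_EPOCH_MS = 1288834974657


def tweet_time(tweet_id: int) -> datetime:
    """Decode the UTC creation time embedded in a snowflake tweet ID."""
    millis = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


def missing_hours(tweet_ids):
    """Return the UTC hours, between the first and last ID, that contain no tweets."""
    seen = {
        tweet_time(int(tweet_id)).replace(minute=0, second=0, microsecond=0)
        for tweet_id in tweet_ids
    }
    if not seen:
        return []
    gaps = []
    current, last = min(seen), max(seen)
    while current <= last:
        if current not in seen:
            gaps.append(current)
        current += timedelta(hours=1)
    return gaps
```

Passing the IDs from one daily file to `missing_hours` gives a quick list of empty hours for that day, which can help when interpreting time-series results.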

Dataset Statistics

The following statistics are from tweets collected until April 15, 2020.
Number of tweets: 3,934,610
Number of original tweets: 3,934,235
Number of retweets: 375
Average number of tweets collected daily: 77,471
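
As a quick sanity check, the figures above can be cross-checked against each other. The snippet is only a worked calculation on the published numbers; the variable names are illustrative.

```python
# Figures copied from the statistics above.
total_tweets = 3_934_610
original_tweets = 3_934_235
retweets = 375
daily_average = 77_471

# Original tweets and retweets should add up to the total.
assert original_tweets + retweets == total_tweets

# Share of retweets in the collection.
retweet_share = retweets / total_tweets

# Number of collection days implied by the reported daily average.
implied_days = total_tweets / daily_average

print(f"Retweet share: {retweet_share:.4%}")
print(f"Days implied by the daily average: {implied_days:.1f}")
```

The implied number of days can be compared with the collection window stated in the Dataset collection section.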

Guideline to Hydrate

Using TWARC Notebook

To hydrate the tweet IDs from our COVID-19-Arabic-Tweets-Dataset GitHub repository, you can use our Hydrate_TweetIDs_Arabic_COVID19 notebook.

  • The notebook runs on Google Colab
  • You are required to have a Twitter developer account
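
For readers who want to see what hydration looks like outside the notebook, here is a minimal sketch using the twarc library (version 1 client). It is not the notebook's code. It assumes each Tweet-ID file is plain text with one ID per line, that twarc is installed (`pip install twarc`), and that the four credentials come from your own Twitter developer account; the function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def read_tweet_ids(path: str) -> Iterator[str]:
    """Yield tweet IDs from a text file (assumed layout: one ID per line)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tweet_id = line.strip()
            if tweet_id.isdigit():  # skip blank lines and any non-numeric text
                yield tweet_id


def hydrate_to_jsonl(
    tweet_ids: Iterable[str],
    out_path: str,
    consumer_key: str,
    consumer_secret: str,
    access_token: str,
    access_token_secret: str,
) -> int:
    """Hydrate IDs with twarc and write one JSON tweet per line; return the count."""
    from twarc import Twarc  # third-party dependency, imported only when needed

    client = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # hydrate() yields full tweet objects; deleted or protected tweets are skipped.
        for tweet in client.hydrate(tweet_ids):
            out.write(json.dumps(tweet, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Writing with `ensure_ascii=False` keeps the Arabic text readable in the output file instead of escaping it.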

For those who prefer a graphical user interface (GUI), we suggest using Hydrator.

Using Hydrator

To use Hydrator, follow the instructions in the Hydrator GitHub repository.

For an Arabic guide to both Hydrator and our Twarc notebook, see our دليل استعادة قاعدة بيانات التغريدات (Guide to Retrieving the Tweets Database).

Licensing

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to the terms of the LICENSE and to all of Twitter’s Terms of Service, and you agree to cite our paper: https://arxiv.org/abs/2004.04315

Contact

If you have any suggestions or questions, please reach out to saraa.alqurashi on Gmail or eaanazi(AT)uqu(dot)edu(dot)sa

covid-19-arabic-tweets-dataset's People

Contributors

sarahalqurashi
