
ita-twitter_analytics

Quick note: the backend needs some time to start, so don't worry if the website is not responding right away.

Requirements

Make sure to install all required packages first, i.e. run:

pip install -r requirements.txt

and please also run:

python -m spacy download de

Apart from that, you need access to the Twitter API to run the scraper.

Consumer-Based Decision Aid Of The Top 50 German Twitter Trends

Maximilian Schöneberger, [email protected]

Jan Straub, [email protected]

Paavo Streibich, [email protected]

Robin Viellieber, [email protected]

Milestone one

Project State

We implemented our pipeline and scraper for the German Twitter trends. For the next milestone, we are going to write the machine learning part and the website where we will present our results.

First, we created a scraper that collects n tweets for each of the top 50 German Twitter trends, where n is a hyperparameter. Our pipeline then takes the scraped tweets and first cleans the data of all non-alphabetical characters. Secondly, we use the CountVectorizer from sklearn to remove stop words and lower-case all letters. Thirdly, we tokenize the data and add the tokens to a dictionary. Then we lemmatize the words and count them again in a new dictionary, which yields the final preprocessed form of our data. Following our tutor's advice, we only used lemmatization and skipped stemming; we found that German lemmatization is possible via the spaCy package. We also followed his guidance to focus on the text analytics part and reserved sentiment analysis for later, in case we decide to invest more in that direction. We tested the pipeline on our data and everything seems to work as intended. The lemmatizer also appears to work properly, since most morphological variations were correctly lemmatized to the same base word.
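The steps above can be sketched as follows using only the standard library. The stop-word list here is a tiny illustrative placeholder (the real pipeline relies on CountVectorizer's stop-word handling), and the spaCy lemmatization step is shown only as a comment because it needs the downloaded German model:

```python
import re
from collections import Counter

# Tiny placeholder list; the actual pipeline uses a full German
# stop-word list via sklearn's CountVectorizer.
GERMAN_STOP_WORDS = {"der", "die", "das", "und", "in", "ist"}

def preprocess(tweets):
    """Clean, lowercase, tokenize, drop stop words, and count."""
    counts = Counter()
    for text in tweets:
        # remove all non-alphabetical characters (keep German umlauts)
        cleaned = re.sub(r"[^a-zA-ZäöüÄÖÜß ]", " ", text)
        tokens = cleaned.lower().split()
        counts.update(t for t in tokens if t not in GERMAN_STOP_WORDS)
    return counts

# Lemmatization step (requires `python -m spacy download de`):
#   nlp = spacy.load("de")
#   lemma_counts = Counter(tok.lemma_ for doc in nlp.pipe(texts)
#                          for tok in doc)
```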

Data Analysis

At first, we planned to use getdaytrends.com to access the top 50 Twitter trends, but after doing more research on the Twitter API, we realized that we can extract the trends directly from the API itself. We therefore discarded our original plan and instead query the Twitter API directly, which provides a specified number of German tweets for a given topic as a .json file. For each tweet, this file contains the creation timestamp, the tweet ID, the language abbreviation, and the tweet text itself. As an example:

[
  {
    "created_at": "2020-12-06T19:15:04.000Z",
    "id": "1335664071623004167",
    "lang": "de",
    "text": "Zwei Rennen in Bahrain, zweimal spektakul\u00e4r bis chaotisch. #F1 #Formel1"
  },
  ...
]
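A minimal sketch of reading this format with Python's json module; the field names (created_at, id, lang, text) come from the example above, and the variable names are illustrative:

```python
import json

# One tweet in the scraper's output format, taken from the example above.
sample = """[
  {"created_at": "2020-12-06T19:15:04.000Z",
   "id": "1335664071623004167",
   "lang": "de",
   "text": "Zwei Rennen in Bahrain, zweimal spektakul\\u00e4r bis chaotisch. #F1 #Formel1"}
]"""

tweets = json.loads(sample)
# keep only the German tweet texts for the pipeline
german_texts = [t["text"] for t in tweets if t["lang"] == "de"]
```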

[Figure: relationship model]

Contributions

  • Maximilian wrote the scraper and got in contact with Twitter to get access to their API.
  • Maximilian, Robin, Paavo, and Jan worked together to build the pipeline.
  • Robin and Paavo worked on the relationship model.


Issues

BUG: links filter does not work properly

The filter produces two links in one item, or faulty (incomplete) links.

EXAMPLE:

"links": {
"https://t.co/vTm0Mm": 19,
"https://t.co/tzx9bHvNLl": 1,
"https://t.co/af0xCB6KDC": 1,
"https://t.co/KbEOTWU3K6": 16,
"https://t.co/Xz7jGq5J4N": 1,
"https://t.co/zjCo4yW020": 1,
"https://t.co/jyWGuTTMIk": 1,
"https://t.co/kXyhG7Zy1J": 22,
"https://t.co/L5Mz2xLRqt": 202,
"https://t.co/ohGjRCZZjo https://t.c": 11,
"https://t.co/LzbW41lpMO": 1,
"https://t.co/G7YyYgIUfQ": 1,
"https://t.co/JGZDbzh1om": 1,
"https://t.co/QCwOzTg5QH": 1,
"https://t.co/xhZY2kqYD3": 1,
"https://t.co/uSNzvsIb9a": 1,
"https://t.co/3nZtCNcyeH": 1,
"https://t.co/94nsgNwXCC": 1,
"https://t.co/3vABucQxVp": 9,
"https://t.co/rWUVJpaltK": 2,
"https://t.c": 5,
"https://t.co/0UGtNXCdrs": 1,
"https://t.co/18wAwGSHgT": 1,
"https://t.co/6Bxf4ITai2": 1,
"https://t.co/BG22wVxI4f": 1,
"https://t.co/adaRUd78pt": 1,
"https://t.co/0lFct59FCv": 36,
"https://t.co/9npdVoY9dC": 1,
"https://t.co/qSwdW2PMVn https://t.co/ryfcgolChE": 1,
"https://t.co/qSwdW2PMVn https://t.co/TjL9j4V4y4": 1,
"https://t.co/GiCYdYHxjO": 1,
"https://t.co/qSwdW2PMVn": 1,
"https://t.co/jxiR16h7z8": 1,
"https://t.co/Sbdb6GkNiJ": 1,
"https://t.co/2QYV3hgf5z": 1,
"https://t.co/Sbdb6GCoah": 1,
"https://t.co/JpL7Nu6plr": 1,
"https://t.co/WuUZq6ee4w": 1,
"https://t.co/FDZFBeNIpw": 1,
"https://t.co/h6XKxCXRrP https://t.co/Igg4WUnXSN": 1,
"https://t.co/M1XPJ86Fma": 1,
"https://t.co/N43Y7QWqtl": 1,
"https://t.co/SNQw2k00fq": 1,
"https://t.co/t4D8dD2rMO": 1,
"https://t.co/nPYIda5dYW": 1,
"https://t.co/LOzRNZM1Ar": 1,
"https://t.co/XvrjF85YgW": 1,
"https://t.co/IAYn7bYPnF": 1,
"https://t.co/sCjroLIzOt": 1,
"https://t.co/vpzqhEfXyi": 2,
"https://t.co/mkOXC0OcJA": 1,
"https://t.co/yXf9b3Dfgl": 1,
"https://t.co/5kH6SAmaTf": 1,
"https://t.co/kmeqoEHnF3": 1,
"https://t.co/xv1CRGEuVw": 1,
"https://t.co/lmTwPic5Xv": 1,
"https://t.co/zi9qBZKzpd": 1,
"https://t.co/pQBZnjlTRc": 1,
"https://t.co/quKkt7wWVD": 1,
"https://t.co/Aqn9CATHu2": 1,
"https://t.co/MMUQsN7HOW": 1,
"https://t.co/UXDNZwQjTY": 5,
"https://t.co/IkGQm7JTvw https://t.co/vDQrdpcGSJ": 1,
"https://t.co/Te1lTWIKHI": 1,
"https://t.co/3CXs6xeZWG": 1,
"https://t.co/lHPS8KPO3u": 1,
"https://t.co/yPMe2rWDGr": 1,
"https://t.co/ltPb3Qjbti": 3,
"https://t.co/s1eGEHvOJk": 5,
"https://t.co/8Bxe5KoidB": 1,
"https://t.co/pK338jjm51": 1,
"https://t.co/V6vHxtP2Vy": 1,
"https://t.co/TGp0Bzm6FD https://t.co/zU0heH5kHA": 1,
"https://t.co/Bch0acy2CW": 2,
"https://t.co/fk4Vq1DjYW": 1,
"https://t.co/SjblD5Wtw7": 4,
"https://t.co/bIkNBtBy3T": 1,
"https://t.co/6c6WbqdteF": 1,
"https://t.co/4Gj3yLQYkP": 2,
"https://t.co/7OJW22pBoo": 1,
"https://t.co/lBxiCi6yNA": 13,
"https://t.co/iEBZO0QQdc": 1,
"https://t.co/RM4y2jjcmf": 1,
"https://t.co/4FsBrLgnTF": 1,
"https://t.co/ohGjRCZZjo https://t.co/xlBclP3fcw": 1,
"https://t.co/wZ4popHLVk": 2,
"https://t.co/krD7Ba0Mbg": 1,
"https://t.co/Q6kav3MQeU": 1,
"https://t.co/Rvw0mDvjEw": 1,
"https://t.co/aiczlEetct": 1,
"https://t.co/pSRWskyRXZ": 1,
"https://t.co/ZwOSawBkHy": 1,
"https://t.co/ivVMApZXja": 1,
"https://t.co/mdlmXRn2pS": 1,
"https://t.co/8Ym1wuIXkx": 1,
"https://t.co/apYUEaUwZu": 1,
"https://t.co/XYimLXRDAq": 1,
"https://t.co/dIfC2n15hM": 1,
"https://t.co/PCgJpb98va": 1,
"https://t.co/O1LsbTxxHN": 1,
"https://t.co/U7kpGvwZ9f": 1,
"https://t.co/itHzznNvYB": 1,
"https://t.co/iFEK0xox4i": 1,
"https://t.co/bx4vxk6Pe4": 1,
"https://t.co/KjpYf84G4t": 1,
"https://t.co/WYznFnzf6T": 1,
"https://t.co/ZnVu7zX4N5 https://t.co/zHs6odW446": 1,
"https://t.co/": 1,
"https://t.co/e7IGQj52Zo https://t.co/xSKZBrlGI6": 1,
"https://t.co/EE3dejrmaO": 1,
"https://t.co/fywky3UCBQ": 1,
"https://t.co/kUejkHGFNl": 1,
"https://t.co/WwDMquiELF": 1,
"https://t.co/SldEYCPINl": 1,
"https://t.co/TTdxv7QfZm": 1,
"https://t.co/jz0xFHpSdC": 1,
"https://t.co/qvIHREOsoR": 1,
"https://t.co/yeSxZMWjuB": 1,
"https://t.co/3i8yz4p8XA": 1,
"https://t.co/F4kfadSTZf": 1,
"https://t.co/sLy8P360Zi": 1,
"https://t.co/KIADgQdSFn": 1,
"https://t.co/lTLlrrn50d": 1,
"https://t.co/fYhTKhjR6b": 1,
"https://t.co/uRGzhtCajw": 1,
"https://t.co/2woUWsKhkN": 1,
"https://t.co/vTm0MmuI0a": 1,
"https://t.co/4W0y3H3AvQ": 1,
"https://t.co/2rWqjkhQ8g": 1,
"https://t.co/V65q0S5Y0t": 1,
"https://t.co/vP26bVD3eT": 1,
"https://t.co/YKvLXE10BR": 1,
"https://t.co/IznEsxafuU": 1,
"https://t.co/O9lCcHqjqC": 1,
"https://t.co/3lzejzXlSA": 1,
"https://t.co/2bcadlGDF5": 1,
"https://t.co/hsHTFQNwXd": 1,
"https://t.co/AUgOY9C0s2": 1,
"https://t.co/8SGnVwC6pr https://t.co/zlhcnFmi7B": 1,
"https://t.co/a7ujnQhpVz": 1,
"https://t.co/lduxKBApa4": 1,
"https://t.co/WziL4A5Bzr": 1,
"https://www..welt.de/politik/plus224367844/Quarantaenebrecher-Laender-schaffen-Zentralstellen-zur-Zwangseinweisung.html": 1,
"https://t.co/Cu1nGhWc2R": 1,
"https://t.co/IAwp2fJplY": 1,
"https://t.co/zRhKCvqQg5": 1,
"https://t.co/F2PD7V46vB": 1,
"https://t.co/fpu0caEpLY": 1,
"https://t.co/IkrnZBF3jj": 1,
"https://t.co/vtUvOKsnMe": 1,
"https://t.co/b3QCUoljQS": 1,
"https://t.co/OSholXt6n1": 1,
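One possible fix, sketched under the assumption that t.co short-link paths are purely alphanumeric: matching each link individually avoids gluing two links into one dictionary key and drops fragments that lack a path entirely (e.g. "https://t.c"). Links truncated mid-path upstream cannot be recovered this way.

```python
import re

# Assumption: t.co paths consist only of alphanumeric characters.
TCO_LINK = re.compile(r"https://t\.co/[A-Za-z0-9]+")

def extract_links(text):
    """Return each complete t.co short link in `text` separately."""
    return TCO_LINK.findall(text)
```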

rewrite pipeline

The CountVectorizer in the pipeline is not well suited to our case; remove it and perform the cleaning manually.

BUG: Pipeline filter does not filter out "rt"

In nearly all results, the token "rt" is not filtered out and is one of the most common tokens in each trend.
Additionally, check whether the pipeline filters out all (sequences of) special characters such as "_ ".
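A sketch of the filter this issue asks for, assuming the pipeline works on a token list; names are illustrative, not the pipeline's actual ones:

```python
import re

# A token consisting only of non-alphanumeric characters (e.g. "_ ").
ONLY_SPECIALS = re.compile(r"^[^a-zA-Z0-9äöüÄÖÜß]+$")

def filter_tokens(tokens):
    """Drop the retweet marker 'rt' and special-character-only tokens."""
    return [t for t in tokens
            if t.lower() != "rt" and not ONLY_SPECIALS.match(t)]
```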

empty trend issue

There can be trends that do not contain any German tweets (or fewer tweets than the first round downloads?), resulting in a crash during word-cloud creation.
TODO:
add handling for such cases
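A sketch of the requested handling; `make_wordcloud` stands in for the actual word-cloud call, which crashes when given an empty word list:

```python
# Hypothetical guard: skip word-cloud creation for trends that
# yielded no (German) tweets instead of crashing on them.
def safe_wordcloud(trend, word_counts, make_wordcloud=None):
    if not word_counts:
        # nothing to draw for this trend, so skip it
        return None
    return make_wordcloud(word_counts) if make_wordcloud else word_counts
```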
