A MongoDB database stores thousands of reviews for over 250 movies, along with all the information related to these movies.\
In this repo, I implemented an ETL pipeline to achieve the following:
- Collect reviews and stars from the database.
- Filter the text and apply the preprocessing steps: text normalization, noise filtering, and stemming.
- Store the data in JSON or CSV files.
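
The three steps above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this repo: the field names `review` and `stars`, the regex rules, and all function names are hypothetical, and the standard-library `csv` and `json` modules stand in for Pandas to keep the example small. The stemmer is passed in as a callable; `nltk_stemmer()` builds one from NLTK's `PorterStemmer`.

```python
import csv
import json
import re
from typing import Callable, Iterable, Iterator


def extract(collection) -> Iterator[dict]:
    """Yield review text and star rating from a pymongo-style collection."""
    # "review" and "stars" are assumed field names, not taken from the repo.
    for doc in collection.find({}, {"_id": 0, "review": 1, "stars": 1}):
        yield doc


def remove_noise(text: str) -> str:
    """Drop HTML tags, URLs, and every character that is not a letter."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"[^A-Za-z\s]", " ", text)


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def nltk_stemmer() -> Callable[[str], str]:
    """Return a word-level stemmer backed by NLTK's PorterStemmer."""
    from nltk.stem import PorterStemmer  # imported lazily; needs nltk installed

    return PorterStemmer().stem


def transform(doc: dict, stem: Callable[[str], str]) -> dict:
    """Clean one review and stem each of its tokens."""
    cleaned = normalize(remove_noise(doc["review"]))
    tokens = [stem(token) for token in cleaned.split()]
    return {"review": " ".join(tokens), "stars": doc["stars"]}


def load(rows: Iterable[dict], json_path: str, csv_path: str) -> None:
    """Write the processed rows to both a JSON file and a CSV file."""
    rows = list(rows)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["review", "stars"])
        writer.writeheader()
        writer.writerows(rows)
```

With a real collection, the steps would chain as `load((transform(d, nltk_stemmer()) for d in extract(collection)), "reviews.json", "reviews.csv")`, where both file names are placeholders.
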

Before the pipeline runs, an IMDb spider crawls reviews from the IMDb website for 250 different movies across different genres.
The crawled reviews are stored in MongoDB.
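
One common way to get crawled items into MongoDB is a Scrapy item pipeline. The sketch below is a minimal example of that pattern, not the repo's actual implementation: the class name, connection URI, database name, and collection name are all placeholders. The optional `collection` argument exists only so the class can be exercised without a running database.

```python
class MongoReviewPipeline:
    """Scrapy item pipeline that writes each crawled review to MongoDB."""

    # Placeholder connection details; all three values are assumptions.
    mongo_uri = "mongodb://localhost:27017"
    db_name = "imdb"
    collection_name = "reviews"

    def __init__(self, collection=None):
        self.client = None
        self.collection = collection

    def open_spider(self, spider):
        # Connect only when no collection was injected.
        if self.collection is None:
            from pymongo import MongoClient  # imported lazily; needs pymongo

            self.client = MongoClient(self.mongo_uri)
            self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item objects and plain dicts.
        self.collection.insert_one(dict(item))
        return item
```

Such a pipeline would be enabled through Scrapy's `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.MongoReviewPipeline": 300}`, where the module path is hypothetical.
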

The project uses the following libraries:

- Scrapy
- Pandas
- shutil
- re
- PyMongo
- json
- NLTK