In this project, we build a system on the Amazon Electronics dataset, which contains information on more than 60,000 electronic products. The dataset includes the product ID, review time, and review text for the products.
The system is programmed with PySpark and Hive, and the user interface is built in a Jupyter notebook connected to PySpark. After a product ID is typed in, the system outputs several positive and negative features and the relevance of these features. It also produces a graph that shows how the scores of the product change with time and with the number of reviews. Finally, it shows the LDA results for the positive and negative reviews.
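The score-over-time graph described above can be pictured with a minimal sketch in plain Python. It is not the code in productanalysis.ipynb or plot.py: the function name score_trend, the record fields product_id, review_time, and score, and the sample data are all hypothetical, and grouping by calendar month is an assumption made only for illustration.

```python
from collections import defaultdict
from datetime import date


def score_trend(reviews, product_id):
    """Return (month, mean score, cumulative review count) tuples for one product.

    `reviews` is an iterable of dicts with the hypothetical keys
    product_id, review_time (a datetime.date), and score.
    """
    by_month = defaultdict(list)
    for review in reviews:
        if review["product_id"] == product_id:
            # Bucket each review by calendar month, e.g. "2014-01".
            month = review["review_time"].strftime("%Y-%m")
            by_month[month].append(review["score"])

    trend = []
    total_reviews = 0
    for month in sorted(by_month):
        scores = by_month[month]
        total_reviews += len(scores)
        trend.append((month, sum(scores) / len(scores), total_reviews))
    return trend


# Hypothetical sample records, not taken from the dataset.
sample = [
    {"product_id": "P1", "review_time": date(2014, 1, 5), "score": 5.0},
    {"product_id": "P1", "review_time": date(2014, 1, 20), "score": 3.0},
    {"product_id": "P1", "review_time": date(2014, 3, 2), "score": 4.0},
    {"product_id": "P2", "review_time": date(2014, 1, 9), "score": 1.0},
]

for month, mean_score, count in score_trend(sample, "P1"):
    print(month, mean_score, count)
```

Each tuple gives one point for a score-versus-time plot and one for a score-versus-review-count plot. In the actual system this aggregation would presumably run in PySpark or Hive rather than in a Python loop.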
This folder is organized as follows.
proj/
├── algorithm analysis/
├── data/
├── scraper/
└── final script/
To run the final version of the script locally, download all scripts and data to the bin folder under pyspark, then run productanalysis.ipynb in a notebook connected to PySpark.
algorithm analysis
- feature_extraction_comparison.ipynb Compare different features
- feature_selection_model_analysis.ipynb Compare different feature selection models
- plot.ipynb Basic statistical analysis
data
- whitelist
- productinfo (generated by Hive)
- productname
final script
- productanalysis.ipynb (Final version)
- plot.py
scraper
- Extractinfo.ipynb (Final version)