getttttttt / hive-ngram-analysis
This repository contains a Hive project that analyzes the Google Books n-grams dataset. The project demonstrates how to install and deploy Hive on a local Hadoop pseudo-distributed cluster, merge data files on HDFS, compute the average number of occurrences per year for each bigram, and identify the top 20 bigrams by that yearly average.
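The aggregation described above can be illustrated with a small in-memory sketch. This is not the repository's Hive query. It is a plain-Python stand-in, and it rests on assumptions that are not stated in the description: each input row is taken to be a `(bigram, year, occurrences)` tuple, and "average occurrences per year" is taken to mean a bigram's total occurrences divided by the number of distinct years in which it appears. The function name `top_bigrams` and the sample rows are hypothetical.

```python
from collections import defaultdict


def top_bigrams(rows, n=20):
    """Return the n bigrams with the highest average yearly occurrences.

    rows: iterable of (bigram, year, occurrences) tuples (assumed layout).
    The average is total occurrences / number of distinct years (assumed definition).
    """
    totals = defaultdict(int)   # bigram -> summed occurrences
    years = defaultdict(set)    # bigram -> distinct years seen
    for bigram, year, occurrences in rows:
        totals[bigram] += occurrences
        years[bigram].add(year)

    averages = {b: totals[b] / len(years[b]) for b in totals}
    # Sort by average descending; break ties alphabetically for a stable result.
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample rows, for illustration only.
sample = [
    ("of the", 1990, 100),
    ("of the", 1991, 300),
    ("in the", 1990, 150),
    ("hive query", 1991, 10),
]

result = top_bigrams(sample, n=2)
print(result)  # [('of the', 200.0), ('in the', 150.0)]
```

In Hive, the same shape would typically be a grouped aggregate followed by a descending sort and a limit of 20. The sketch only makes the arithmetic concrete.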