daniefer / mapreduce-wordandarticlecount
MapReduce program for processing the Wikipedia data set. Each Wikipedia article occupies one line of a 32 GB text file. Each line is tab-separated into four fields: title, last-update date and time, content, and external links. The program has two modes: count the five most common words in articles whose title contains a supplied keyword, or count the number of articles that contain the supplied keyword.
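The two modes can be illustrated with a small, framework-free sketch. This is not the repository's code: the description does not state the implementation language or framework, so plain Python stands in for the map and reduce phases. All function names here are hypothetical, and so are three assumptions: keyword matching is a case-insensitive substring test, the article count matches against the content field, and words are lowercase runs of letters, digits, and apostrophes.

```python
import re
from collections import Counter


def parse_line(line):
    """Split one record into (title, updated, content, links)."""
    # maxsplit=3 keeps everything after the third tab in the links field.
    parts = line.rstrip("\n").split("\t", 3)
    # Pad short records so malformed lines do not raise (assumed policy).
    parts += [""] * (4 - len(parts))
    return tuple(parts)


def map_top_words(line, keyword):
    """Map phase, mode 1: emit (word, 1) for articles whose title matches."""
    title, _, content, _ = parse_line(line)
    if keyword.lower() in title.lower():
        # Assumed tokenization: lowercase runs of letters, digits, apostrophes.
        for word in re.findall(r"[a-z0-9']+", content.lower()):
            yield word, 1


def map_article_count(line, keyword):
    """Map phase, mode 2: emit one pair per article whose content matches."""
    _, _, content, _ = parse_line(line)
    if keyword.lower() in content.lower():
        yield "articles", 1


def reduce_counts(pairs):
    """Reduce phase: sum the emitted values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def top_five_words(lines, keyword):
    pairs = (p for line in lines for p in map_top_words(line, keyword))
    return reduce_counts(pairs).most_common(5)


def count_articles(lines, keyword):
    pairs = (p for line in lines for p in map_article_count(line, keyword))
    return reduce_counts(pairs)["articles"]
```

Both helpers accept any iterable of lines, so an open file handle can be passed directly and the 32 GB input is streamed rather than loaded into memory. In a real MapReduce job the map functions would run in parallel over input splits and the framework would group pairs by key before the reduce step.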