- TF-IDF is the product of two statistics: Term Frequency (TF) and Inverse Document Frequency (IDF).
- TF is the number of times a term (word) occurs in a document, here normalized by the total number of word occurrences in that document.
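- IDF is a numerical statistic that reflects how rare a word is across the whole collection of documents: words that appear in few documents get a high IDF, and words that appear in almost every document get an IDF close to zero.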
- Stop Words are common words (such as "the", "is", "and") that carry little significance for search queries.
- MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
- Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model.
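
To make the MapReduce model above more concrete, here is a minimal single-process sketch in Python that counts how often each word occurs in each document. It only imitates the map, shuffle, and reduce phases in memory and does not use Hadoop. The function names (`map_phase`, `shuffle`, `reduce_phase`, `count_terms`) and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit ((doc_id, word), 1) for every word in the document."""
    for word in text.lower().split():
        yield (doc_id, word), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one (doc_id, word) key."""
    return key, sum(values)


def count_terms(docs):
    """Run the three phases in memory over a dict of doc_id -> text."""
    pairs = [
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = {"d1": "the cat sat on the mat", "d2": "the dog"}
    for key, count in sorted(count_terms(sample).items()):
        print(key, count)
```

On a real cluster the map and reduce functions would run on different machines and the framework would handle the shuffle; the sketch keeps everything in one process so that the data flow is easy to follow.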

TFIDF = n/N * log(D/m)

- n is the number of times a word is in a document
- N is the sum of all n's of a document
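
The formula TFIDF = n/N * log(D/m) can be tried out with a small in-memory Python sketch. It assumes the usual reading of the remaining symbols, namely that D is the total number of documents and m is the number of documents containing the word, and it assumes the natural logarithm. The function name `tf_idf` and the pre-tokenized input format are hypothetical choices for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute n/N * log(D/m) for every word of every document.

    docs maps a document id to its list of tokens.
    Assumption: D = total number of documents,
                m = number of documents that contain the word.
    """
    D = len(docs)

    # m for each word: count each word at most once per document.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)   # n for each word in this document
        N = sum(counts.values())   # sum of all n's of this document
        scores[doc_id] = {
            word: (n / N) * math.log(D / doc_freq[word])
            for word, n in counts.items()
        }
    return scores


if __name__ == "__main__":
    sample = {"d1": ["cat", "sat", "cat"], "d2": ["dog", "sat"]}
    for doc_id, word_scores in tf_idf(sample).items():
        print(doc_id, word_scores)
```

A word that occurs in every document gets log(D/m) = log(1) = 0, so its score is zero no matter how often it appears.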