Git Product home page Git Product logo

text-retrieval-python's Introduction

#Text Retrieval Python# A collection of Python scripts with functions for Text Retrieval.

Search process:

  • Users perform a task that has an information need, from which a query is formulated.
  • A search engine processes textual queries and retrieves a ranked list of documents.
  • Users obtain a result and may reformulate (refine) their query, based on suggestions.

##Ranking## Search engines (used to) rank web documents based on features, such as:

  • Term Frequency (In URL, pagetitle, body and metatags)
  • TF-IDF
  • BM25
  • LSI
  • Keyword Density
  • Boolean models
  • Vector space models
  • PageRank
  • Number of inbound links
  • Number of outbound links
  • Link quality
  • Page quality
  • Site quality
  • Author quality
  • Document length
  • Number of slashes in URL
  • Length of URL
  • URL click count from user data
  • URL click count on result pages
  • URL dwell time
  • Reading level

###Term Frequency - Inverted Document Frequency (TF-IDF)### TF-IDF is the product of TF and IDF. A high TF-IDF means the term is highly related to a document.

TF or tf(t,d), can prevent bias to longer documents:

  • Raw Frequency: Simply count the number of times the term/query appears in the document.
  • Boolean Frequency: tf(t,d) = 1 if term in document, else tf(t,d) = 0
  • Logarithmically Scaled Frequency: tf(t,d) = 1 + log f(t,d) if term in document, else tf(t,d) = 0
  • Proportional Frequency: tf(t,d) = tf(t,d) / number of all terms in document
  • Normalized Frequency: tf(t,d) = tf(t,d) / maximum raw frequency of any term in the document

To illustrate TF. We have 2 documents A and B:

  • Document A has 10 words and Document B has 100 words.
  • The term "Alice" appears 1 time in document A. Proportional: tf(t,d) for document A = 1 / 10 = 0.1
  • The term "Alice" appears 5 times in document B. Proportional: tf(t,d) for document B = 5 / 100 = 0.05

IDF or idf(t, docs), prevents bias to common words and values unique words:

  • idf(t, docs): count all documents / Boolean frequency of the term in all documents.

To illustrate IDF. We have 20 documents. We have three words:

  • a very common word, appears in all documents: IDF = 20 / 20 = 1
  • a fairly common word, appears in half of the documents: IDF = 20 / 10 = 2
  • a very rare word, appears in only 10% of documents: IDF = 20 / 2 = 10

####tfidf.py#### Functions:

  • TF Raw
  • TF Proportional
  • IDF
  • Very simple tokenizer

Sample output:

Term: alice
First 5 document tokens: ['alice', 'wunderland', '1009', 'tree', 'lane']
Term count in document 1
Token count in document: 6
Number of documents: 3
Number of documents with term: 2
TF:         0.1666
IDF:        1.5
TFIDF:      0.25

##Evaluating Search Engine Results## What makes for a good search engine?

Mainly:

  • Speed
  • Size
  • Quality

Optionally:

  • Personalization
  • Localization
  • Customization
  • Simplicity / UX
  • Social
  • Privacy

###Measuring Quality### How de we assess the quality of a search engine result page (SERP)?

  • Ask users and volunteers directly for feedback
  • Hire SERP reviewers
  • Measure user engagement (Number of clicks on result, Bounce ratio)
  • A/B testing
  • Multi-variate testing

Quality feedback can be binary (0 for not relevant, 1 for relevant) or non-binary (0-3 scale).

Example binary feedback for 15 results:

[(1,1),(2,1),(3,1),(4,0),(5,1),(6,1),(7,1),(8,1),(9,1),(10,1),(11,0),(12,1),(13,0),(14,1),(15,0)]

(Rank 1, 2, 3 are relevant, rank 4 is irrelevant, rank 5 to 10 are relevant etc.)

####rankedresultseval.py#### Functions for:

  • Precision (P)
  • Precision at K (PaK)
  • Interpolated Precision (IP)
  • Average Precision (AP)
  • Recall (R)
  • Cumulative Gain (CG)
  • Discounted Cumulative Gain (DCG)
  • Ideal Discounted Cumulative Gain (IDCG)
  • Normalized Discounted Cumulative Gain (NDCG)

Sample output:

INPUT:	[(1,1),(2,1),(3,1),(4,0),(5,1),(6,1),(7,1),(8,1),(9,1),(10,1),(11,0),(12,1),(13,0),(14,1),(15,0)]

P: 		[1.0, 1.0, 1.0, 0.75, 0.8, 0.83, 0.85, 0.87, 0.88, 0.9, 0.81, 0.83, 0.76, 0.78, 0.73]
R: 		[0.09, 0.18, 0.27, 0.27, 0.36, 0.45, 0.54, 0.63, 0.72, 0.81, 0.81, 0.90, 0.90, 1.0, 1.0]
IP: 	[1.0, 1.0, 1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.83, 0.83, 0.78, 0.78, 0.73]
AP: 	[1.0, 1.0, 1.0, 0.93, 0.90, 0.89, 0.89, 0.88, 0.88, 0.89, 0.88, 0.87, 0.87, 0.86, 0.85]
PaK: 	0.8 // precision at 5
CG: 	11
DCG: 	6.69277
IDCG: 	6.95740
NDCG: 	0.96196

RANK	RELEVANT?	PRECISION	RECALL		INTERPOLATED PRECISION
1		1			1.00000		0.09091		1.00000
2		1			1.00000		0.18182		1.00000
3		1			1.00000		0.27273		1.00000
4		0			0.75000		0.27273		0.90000
5		1			0.80000		0.36364		0.90000
6		1			0.83333		0.45455		0.90000
7		1			0.85714		0.54545		0.90000
8		1			0.87500		0.63636		0.90000
9		1			0.88889		0.72727		0.90000
10		1			0.90000		0.81818		0.90000
11		0			0.81818		0.81818		0.83333
12		1			0.83333		0.90909		0.83333
13		0			0.76923		0.90909		0.78571
14		1			0.78571		1.00000		0.78571
15		0			0.73333		1.00000		0.73333

text-retrieval-python's People

Watchers

James Cloos avatar Jolin Xia avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.