Information extraction and retrieval from textual data: Search-Engine-Project-and-Homeworks
Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. The collection is generally a set of documents from one or several databases, described by their content or by the associated metadata. The first part describes the search engine we implemented, which can query strings or words provided by the user against a collection of text files.

The explosion in the amount of data during the 21st century paved the way for new techniques and models to extract relevant information from it. One such area is sentiment analysis, which refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service. Autoencoders have been introduced to address these tasks. They learn a representation (encoding) of a set of data, typically for dimensionality reduction, with the aim of making the learning task easier. The second part presents them in more detail and shows how they are used to learn sentiments, such as whether a movie review is positive or negative. Research on improving autoencoders is still ongoing; one result is the stacked denoising autoencoder, which adds noise to the input data to make the learned representation more robust.

The goal of this report is to present some tools and state-of-the-art methods in information retrieval and computational linguistics to students who are new to machine learning.
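As a rough illustration of what querying words in a collection of text files can look like, here is a minimal sketch based on an inverted index. It is not the implementation used in this project: the function names (`tokenize`, `build_index`, `search`), the sample documents, and the choice to return only documents that contain every query word are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    # index.get avoids creating empty entries for unknown words.
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical in-memory stand-in for a folder of text files.
documents = {
    "review1.txt": "A great movie with a great cast",
    "review2.txt": "A boring movie",
    "notes.txt": "Search engines index text files",
}
index = build_index(documents)
print(search(index, "great movie"))  # ['review1.txt']
print(search(index, "movie"))        # ['review1.txt', 'review2.txt']
```

In practice the dictionary would be filled by reading the files from disk, and a ranking scheme such as TF-IDF could replace the plain set intersection used here.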