
master_ds_tfm's Introduction

Automatic generation of sport news

Abstract

Natural Language Processing is currently one of the most active research fields in Computer Science and Machine Learning. Its techniques serve many applications, one of which is text summarization. The goal of text summarization is to produce a new, shorter text from a reference text while keeping its main information.

This work uses automatic text summarization techniques to create football news. Almost every online sports site provides both real-time information and text summaries about football matches. The main goal is therefore to design and implement a solution that builds a summary of a match from its real-time events.

As there is no open database that provides the required information, web scraping techniques are used to extract text from different online newspapers and sites. Each match is represented by a list of real-time events that describe match information and players, together with a summary that collects the most important events. The two text sets differ in vocabulary and structure, so they must be processed in different ways.

Three different strategies that build summaries from the real-time events are proposed. The first one selects the most important events (those that describe goals or red cards) and uses them to build a summary. The second strategy builds a conceptual graph and uses the relations between the different events and concepts to build a summary. Finally, the last solution treats the task as a ranking problem, where each event has a score that represents the likelihood of that event being included in the real summary.
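As an illustration of the first and last strategies, here is a minimal sketch, not the code in this repository. The keyword pattern, the function names `key_event_summary` and `rank_summary`, the example events, and the scores are all assumptions made for the example; in the real system the scores would come from a trained model.

```python
import re

# Hypothetical keyword pattern: the text only names goals and red cards
# as the "most important" events.
KEY_EVENT_PATTERN = re.compile(r"\b(goal|red card)\b", re.IGNORECASE)


def key_event_summary(events):
    """Strategy 1 (sketch): keep events that mention a goal or a red card."""
    return [event for event in events if KEY_EVENT_PATTERN.search(event)]


def rank_summary(events, scores, k):
    """Strategy 3 (sketch): keep the k highest-scoring events.

    `scores[i]` is assumed to be the likelihood that `events[i]` belongs
    in the reference summary. Selected events are returned in match order.
    """
    ranked = sorted(range(len(events)), key=lambda i: scores[i], reverse=True)
    return [events[i] for i in sorted(ranked[:k])]


# Made-up events and scores, for illustration only.
events = [
    "Kick-off at the stadium.",
    "Goal! The home striker scores from the penalty spot.",
    "Corner for the away team.",
    "Red card for the away defender after a late tackle.",
]
scores = [0.1, 0.9, 0.2, 0.7]

keyword_summary = key_event_summary(events)
ranked_summary = rank_summary(events, scores, k=2)
```

In this toy example both strategies select the goal and the red card. They would diverge as soon as the scoring model rates an event without those keywords highly.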

The generated summaries are evaluated in two different ways. The first one is ROUGE, one of the most widely used metrics for evaluating summaries, which is based on counting the words or word sequences that both texts have in common. Despite being widely used, this metric comes with a drawback: it cannot capture semantic relationships. To address this issue, a similarity measure based on word embeddings (SMS) is proposed.
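The n-gram counting behind ROUGE can be sketched as follows. This is a simplified ROUGE-N with whitespace tokenization, written for illustration; it is not the implementation in the `metrics` package, and the function names and example sentences are assumptions.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two texts."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up sentences: "nets" and "scores" mean the same thing here,
# but they count as a mismatch because only surface forms are compared.
reference = "messi scores the winning goal"
candidate = "messi nets the winning goal"

unigram = rouge_n(candidate, reference, n=1)  # 4 of 5 unigrams match
bigram = rouge_n(candidate, reference, n=2)   # 2 of 4 bigrams match
```

The example shows the drawback described above: a synonym is penalized exactly like a wrong word, which is what an embedding-based similarity is meant to correct.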

Code structure

  • analysis: notebooks with EDAs, model metrics, experiments...
  • scrapers: package with web scrapers
  • scripts: main package with most processes
    • data: unification of the data collected by the scrapers
    • experiments: abstract interface for experiments
    • extractive_summary: extractive summary scripts
    • metrics: evaluation metrics
    • models: model training and metrics definition
    • text: basic NLP processing

master_ds_tfm's People

Contributors

gscharly
