
phd-track-event-detection's Introduction

Event detection from news articles

We have a dataset of approximately 100 000 newspaper articles. The goal is to detect events from these articles.

Data cleaning

Data is first loaded and cleaned in data-exploration.ipynb.

The cleaned data is then exported to a new JSONL file named newspapers_filtered_{date_of_exportation}.jsonl.
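As an illustration of the export step, here is a minimal sketch that writes one JSON object per line. The function name `export_jsonl` and the ISO date format in the filename are assumptions; the notebook may format the date differently.

```python
import json
from datetime import date
from pathlib import Path


def export_jsonl(rows, out_dir, export_date=None):
    """Write one JSON object per line and return the path of the new file."""
    export_date = export_date or date.today()
    # Assumed date format: ISO 8601 (YYYY-MM-DD).
    path = Path(out_dir) / f"newspapers_filtered_{export_date.isoformat()}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps accented characters readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path
```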

As a second step, we remove from the dataset the rows whose titles share an overly long common prefix with other titles. This is done in remove_articles_with_common_prefixes.ipynb.

This gets rid of templated article titles, where the structure is the same and only the details change, such as:

Météo, prévisions en Normandie pour le lundi 7 mars
Météo, prévisions en Normandie pour le dimanche 3 avril
Météo, prévisions en Normandie pour le dimanche 10 avril
Météo, prévisions pour le mardi 19 avril en Normandie
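One possible way to implement this filter is sketched below. The function name, the threshold `min_prefix_len`, and the choice to drop every title in a matching group are assumptions, not necessarily what the notebook does.

```python
from os.path import commonprefix


def drop_shared_prefix_titles(titles, min_prefix_len=30):
    """Return titles that do not share a long prefix with a sorted neighbour."""
    order = sorted(range(len(titles)), key=lambda i: titles[i])
    flagged = set()
    # After sorting, titles with a common prefix sit next to each other,
    # so comparing adjacent pairs is enough.
    for a, b in zip(order, order[1:]):
        if len(commonprefix([titles[a], titles[b]])) >= min_prefix_len:
            flagged.update((a, b))
    return [t for i, t in enumerate(titles) if i not in flagged]
```

With a threshold of 30 characters, the first three example titles above are removed. The fourth shares only "Météo, prévisions " with them, so it is caught only with a lower threshold.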

TF-IDF

We use a classic TF-IDF method to extract the most important words from the articles. The code is in tf-idf.ipynb.
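For reference, the textbook formulation (term frequency multiplied by log(N / document frequency)) can be sketched in plain Python as follows. This is a simplified illustration with whitespace tokenization, not the notebook's code; libraries such as scikit-learn use smoothed variants of the same idea.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document, using whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores
```

A word that appears in every document gets a score of zero, while a word that is frequent in one document and rare elsewhere scores highest.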

word_importance_animation.gif

Embedding, Clustering, and generation of a unified title for each cluster

Details about the results are written up in this Notion page.

Embedding

We generate an embedding of each title using Mistral's embedding API, which gives us a 1024-dimensional vector representation of each title.

The embedding is done in embedding.ipynb. The resulting file is an exported numpy array.
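With around 100 000 titles, the API calls would typically be made in batches. The sketch below shows only the batching logic: `embed_batch` is a hypothetical callable that wraps the actual Mistral client call, and the batch size is an assumption.

```python
def embed_in_batches(titles, embed_batch, batch_size=64):
    """Embed titles chunk by chunk.

    `embed_batch` maps a list of strings to a list of vectors, in order.
    """
    vectors = []
    for start in range(0, len(titles), batch_size):
        chunk = titles[start:start + batch_size]
        chunk_vectors = embed_batch(chunk)
        if len(chunk_vectors) != len(chunk):
            raise ValueError("embedding call returned a different number of vectors")
        vectors.extend(chunk_vectors)
    return vectors
```

The returned list could then be converted with `numpy.asarray` and written to disk with `numpy.save`.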

Clustering

We use the HDBSCAN clustering algorithm to cluster the articles. The code is in hdbscan.ipynb.

The algorithm uses the embeddings and the date to group articles with similar titles and dates together.
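How the embeddings and dates are combined is not detailed here. One possible approach, sketched below, is to append a scaled day offset to each embedding vector; the function name and the `date_weight` value are assumptions.

```python
def build_features(embeddings, dates, date_weight=0.05):
    """Append a scaled day offset to each embedding vector."""
    first_day = min(dates)
    features = []
    for vector, day in zip(embeddings, dates):
        days_since_start = (day - first_day).days
        # date_weight controls how strongly time separates articles
        # relative to distance in embedding space.
        features.append(list(vector) + [date_weight * days_since_start])
    return features
```

The resulting matrix could then be passed to an HDBSCAN implementation such as `hdbscan.HDBSCAN` or `sklearn.cluster.HDBSCAN`, whose `fit_predict` returns one label per row, with -1 marking noise points.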

Title generation

For each cluster, we generate a unified title using an LLM.
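The prompt sent to the LLM is not shown here. As a rough sketch, it could be built from the titles of a cluster as below; the wording, the function name, and the cap on the number of titles are all hypothetical.

```python
def build_title_prompt(cluster_titles, max_titles=20):
    """Build a prompt asking for one headline that covers a cluster."""
    # Cap the number of titles so the prompt stays short for large clusters.
    sample = cluster_titles[:max_titles]
    bullet_list = "\n".join(f"- {title}" for title in sample)
    return (
        "The following newspaper titles describe the same event.\n"
        "Write one short title that summarizes all of them.\n\n"
        f"{bullet_list}"
    )
```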

API Keys

To reproduce the results, you need access to the Mistral API for embeddings and to the Anthropic API for title generation.

Create a .env file containing your API keys, in the following format:

MISTRAL_API_KEY=<your Mistral API key>
ANTHROPIC_API_KEY=<your Anthropic API key>
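To check that both keys are present before running the notebooks, a small standard-library sketch like the one below can help. It is an illustration only: the helper names are hypothetical, and the notebooks may load the file differently, for example with python-dotenv's `load_dotenv`.

```python
REQUIRED_KEYS = ("MISTRAL_API_KEY", "ANTHROPIC_API_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()))` returns an empty list when both keys are set.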
