job-posting-duplicate-detection

This project uses text embeddings and Milvus, a high-performance vector search engine, to detect duplicate job postings. It generates embeddings from job descriptions and uses Milvus to search those embeddings efficiently for duplicates.

Introduction

The project focuses on the following key tasks:

  1. Data Preprocessing: Explore and clean job postings data, handling missing values and anomalies.
  2. Generating Embeddings: Utilize a pre-trained model (Sentence Transformers) to generate embeddings for job descriptions.
  3. Milvus for Duplicate Detection: Set up a Milvus instance, insert embeddings, and implement a method to search for potential duplicates.
  4. Docker/Docker Compose Integration: Containerize the project for easy reproducibility.
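The preprocessing task listed above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the repository's actual code. The column names `title` and `description`, the helper `clean_postings`, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import re


def clean_postings(csv_file, text_column="description"):
    """Return rows with a non-empty, whitespace-normalized text column.

    The column name is a hypothetical default; adjust it to the real
    header of data/job_postings.csv.
    """
    cleaned = []
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").strip()
        if not text:
            continue  # skip postings with a missing description
        # Collapse runs of whitespace so formatting noise does not
        # affect the embeddings generated later.
        row[text_column] = re.sub(r"\s+", " ", text)
        cleaned.append(row)
    return cleaned


# Tiny in-memory sample standing in for data/job_postings.csv.
sample = io.StringIO(
    "title,description\n"
    "Data Engineer,Build   ETL\tpipelines\n"
    "Analyst,\n"
    "ML Engineer,  Train models  \n"
)
rows = clean_postings(sample)
```

Here the posting without a description is dropped and the remaining descriptions are normalized. To run it on the real file, pass an open file handle instead of the in-memory sample.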

Project Structure

/job-posting-duplicate-detection
|-- data/
|   |-- job_postings.csv
|-- embeddings/
|   |-- generate_embeddings.py
|-- milvus/
|   |-- milvus_setup.py
|   |-- duplicate_detection.py
|-- Dockerfile
|-- docker-compose.yml
|-- video_demo/
|   |-- demo_video.mp4
|-- README.md

Requirements

  • Python 3.x
  • PyTorch
  • Sentence Transformers
  • pymilvus

Install dependencies using:

pip install -r requirements.txt

Installation

  1. Clone the repository:

    git clone https://github.com/arasgungore/job-posting-duplicate-detection.git
  2. Navigate to the project directory:

    cd job-posting-duplicate-detection
  3. Install dependencies:

    pip install -r requirements.txt

Usage

  1. Data Preprocessing:

    Explore and clean the data in the data/job_postings.csv file.

  2. Generating Embeddings:

    Run the following command to generate embeddings:

    python embeddings/generate_embeddings.py
  3. Milvus for Duplicate Detection:

    • Set up Milvus instance:

      python milvus/milvus_setup.py
    • Run duplicate detection:

      python milvus/duplicate_detection.py
  4. Docker/Docker Compose Integration:

    • Build the Docker image, then start the services with Docker Compose:

      docker build -t job-posting-duplicate-detection .
      docker-compose up
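The contents of the scripts are not shown here, so the following is only a sketch of how steps 2 and 3 might fit together, not the repository's implementation. It assumes the `all-MiniLM-L6-v2` Sentence Transformers model, a Milvus instance reachable at `http://localhost:19530`, a collection named `job_postings`, and a similarity threshold of 0.9. The function names are hypothetical as well.

```python
def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts with Sentence Transformers (model name is an assumption)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer(model_name)
    # Unit-length vectors, converted to plain lists for insertion into Milvus.
    return model.encode(texts, normalize_embeddings=True).tolist()


def index_and_search(texts, uri="http://localhost:19530",
                     collection="job_postings", top_k=5):
    """Insert embeddings into Milvus and search each one against the rest.

    The URI and collection name are assumed values. Freshly inserted rows
    may take a moment to become searchable.
    """
    from pymilvus import MilvusClient  # imported lazily

    vectors = embed(texts)
    client = MilvusClient(uri=uri)
    client.create_collection(
        collection_name=collection,
        dimension=len(vectors[0]),
        metric_type="COSINE",
    )
    client.insert(
        collection_name=collection,
        data=[{"id": i, "vector": v} for i, v in enumerate(vectors)],
    )
    # One result list per query vector; each hit carries "id" and "distance".
    return client.search(collection_name=collection, data=vectors, limit=top_k)


def duplicate_pairs(results, threshold=0.9):
    """Turn per-query hits into unique (id_a, id_b) pairs above a threshold."""
    pairs = set()
    for query_id, hits in enumerate(results):
        for hit in hits:
            # With the COSINE metric a larger "distance" means more similar.
            if hit["id"] != query_id and hit["distance"] >= threshold:
                pairs.add(tuple(sorted((query_id, hit["id"]))))
    return sorted(pairs)


# Hand-written hits in the shape of search results, so the filtering
# step can be tried without a running Milvus instance.
fake_results = [
    [{"id": 0, "distance": 1.0}, {"id": 2, "distance": 0.95}],
    [{"id": 1, "distance": 1.0}, {"id": 0, "distance": 0.40}],
    [{"id": 2, "distance": 1.0}, {"id": 0, "distance": 0.95}],
]
pairs = duplicate_pairs(fake_results)
```

Each posting matches itself with a similarity of about 1.0, which is why `duplicate_pairs` skips hits whose id equals the query id. Sorting each pair keeps (0, 2) and (2, 0) from being reported twice.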

Results and Evaluation

Results and evaluation metrics are provided in the code comments of milvus/duplicate_detection.py. The effectiveness of the duplicate detection method can be assessed by measuring precision and recall at a given similarity threshold: raising the threshold generally trades recall for precision.
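As an illustration of how precision and recall depend on the similarity threshold, here is a small self-contained sketch. The scored pairs and the labeled duplicates are made-up values for demonstration, not results from this project, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for two sets of duplicate pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical scored pairs: (posting_a, posting_b) -> similarity.
scores = {(0, 2): 0.95, (1, 3): 0.85, (0, 1): 0.60, (2, 3): 0.30}
# Hypothetical hand-labeled true duplicates.
labeled = {(0, 2), (1, 3)}

for threshold in (0.9, 0.8, 0.5):
    # A pair is predicted as a duplicate when its score reaches the threshold.
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    p, r = precision_recall(predicted, labeled)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

With these toy numbers, a threshold of 0.9 misses one true duplicate, while 0.5 finds both but also flags a false positive. That is the trade-off to check when choosing a threshold on real data.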

Docker Integration

The project includes Docker and Docker Compose files (Dockerfile and docker-compose.yml) for containerization. This ensures a reproducible and isolated environment.

To build and run the Docker image, follow the instructions in the Usage section.

Video Demo

Watch the demo video (video_demo/demo_video.mp4) for a quick overview of the project.

Contributing

Contributions are welcome! Feel free to open issues or pull requests for any improvements or new features.

License

This project is licensed under the MIT License.

Author

👤 Aras Güngöre
