epitar.gz

Highly customizable archive and index framework for EPITA.

Get started

  • Create a new config.yml (see config.sample.yml) to configure the EPITA services you wish to archive by specifying the associated archive module.
  • Configure your sonic instance in sonic.cfg.
  • Start your sonic instance and a docconv container (which extracts words from PDF files) with the provided docker-compose.yml file.
  • Run ./epitar start to start archiving and indexing.

How it works

Archive modules

An archive module scrapes, downloads, or archives websites and services. These modules are highly customizable as they run in Docker containers.

Index

Archived files may be scanned to build a search index. Words are extracted from PDF files with regular text extraction, or with OCR for scanned documents.
The extracted words are then pushed to a sonic instance, which builds a fast search index.

UI & API

A UI is exposed along with an API to quickly search for files.

Contributing

Add an archive module

An archive module is highly customizable: it can be written in any programming language, as long as a valid Dockerfile is provided.
Your archive module must have a Dockerfile, a module.json and a README.md.

Dockerfile

Your Dockerfile can use any base image, but try to keep the image size small.

The output directory for archived files must be /output.

module.json

Your module.json must provide information about the website or service being archived. The logo field is optional.
Here is an example:

{
    "name": "Past-Exams",
    "slug": "past-exams",
    "url": "https://github.com/Epidocs/Past-Exams",
    "description": "Past subjects and other files, for the benefit of EPITA students.",
    "logo": "https://github.com/fluidicon.png",
    "authors": [
        {
            "name": "Aurele Oules",
            "email": "[email protected]"
        }
    ]
}
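
As a rough illustration of how a module.json with the fields above could be checked before a module is submitted, here is a minimal Go sketch. The Module and Author types and the ParseModule function are hypothetical names and are not part of epitar.gz; treating every field except logo as required is an assumption based on the example. Note that Go's encoding/json rejects comments, so the file has to be plain JSON.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Author mirrors one entry of the "authors" array (hypothetical type).
type Author struct {
	Name  string `json:"name"`
	Email string `json:"email"`
}

// Module mirrors the fields shown in the example module.json (hypothetical type).
type Module struct {
	Name        string   `json:"name"`
	Slug        string   `json:"slug"`
	URL         string   `json:"url"`
	Description string   `json:"description"`
	Logo        string   `json:"logo,omitempty"` // optional
	Authors     []Author `json:"authors"`
}

// ParseModule decodes a module.json document and checks that the fields
// assumed to be required (everything except logo) are present.
func ParseModule(data []byte) (*Module, error) {
	var m Module
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("invalid module.json: %w", err)
	}
	required := []struct{ field, value string }{
		{"name", m.Name},
		{"slug", m.Slug},
		{"url", m.URL},
		{"description", m.Description},
	}
	for _, r := range required {
		if strings.TrimSpace(r.value) == "" {
			return nil, errors.New("module.json: missing required field " + r.field)
		}
	}
	if len(m.Authors) == 0 {
		return nil, errors.New("module.json: at least one author is expected")
	}
	return &m, nil
}

func main() {
	// Placeholder author data, for illustration only.
	raw := []byte(`{
		"name": "Past-Exams",
		"slug": "past-exams",
		"url": "https://github.com/Epidocs/Past-Exams",
		"description": "Past subjects and other files.",
		"authors": [{"name": "Jane Doe", "email": "jane@example.org"}]
	}`)
	m, err := ParseModule(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s (%s), %d author(s)\n", m.Name, m.Slug, len(m.Authors))
}
```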

README.md

You must provide a simple README.md that explains how to use the module.
An archive module may take environment variables as options; if yours does, document them here.
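
To make the contract concrete, here is a minimal sketch of what an archive module's entry point could look like in Go: it reads its options from environment variables and writes its result under /output. The variable names SOURCE_URL and OUTPUT_DIR and the helper functions are assumptions made for this example; a real module would define and document its own options and would actually download content.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// getenv returns the value of an environment variable, or fallback when it
// is unset or empty.
func getenv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

// destPath builds the destination path inside outDir, keeping only the last
// path element of name so files cannot escape the output directory.
func destPath(outDir, name string) string {
	return filepath.Join(outDir, filepath.Base(name))
}

// archive writes content to a file named name inside outDir, creating the
// directory if needed, and returns the path that was written.
func archive(outDir, name string, content []byte) (string, error) {
	if err := os.MkdirAll(outDir, 0o755); err != nil {
		return "", err
	}
	dst := destPath(outDir, name)
	if err := os.WriteFile(dst, content, 0o644); err != nil {
		return "", err
	}
	return dst, nil
}

func main() {
	// Hypothetical option names; /output is the directory required above.
	source := getenv("SOURCE_URL", "https://example.org")
	outDir := getenv("OUTPUT_DIR", "/output")

	// A real module would fetch data from source here; this sketch only
	// records where the data would have come from.
	dst, err := archive(outDir, "index.txt", []byte("archived from "+source+"\n"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "archive failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote", dst)
}
```

Keeping the output directory overridable makes the module easy to run outside Docker while still defaulting to /output inside the container.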

Other files

You may add any other files in the module directory but try to keep it organized and only commit necessary files.

You must edit the config.sample.yml file to provide an example of how to use your archive module.

License

MIT - Aurèle Oulès

epitar.gz's Issues

CVE-2022-4643: RCE in pdf convert

CVE-2022-4643

Vulnerability: RCE
Impact: Critical
CVE: CVE-2022-4643
Dependency: docconv
Vector: Manipulating the path argument of docconv.{ConvertPDF,PDFHasImage} leads to OS command injection
Component: pdf_ocr.go,ConvertPDFImages,PDFHasImage (docconv)

Epitar.gz's vector

res, err := docconvClient.Convert(bytes.NewReader(data), filename)

This Convert() call reaches the docconv code affected by CVE-2022-4643.

Patch

Update code.sajari.com/docconv from v1.2.0 to v1.2.1, or to the latest release, v1.3.5.

code.sajari.com/docconv v1.2.0
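
Upgrading the dependency is the actual fix. As an additional, optional defense-in-depth measure, the filename could also be normalized before it is handed to Convert(). The sketch below is only an illustration under that assumption: SanitizeFilename is a hypothetical helper, not part of epitar.gz or docconv, and the allow-list it uses is deliberately conservative.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// unsafeChars matches every character outside a conservative allow-list of
// ASCII letters, digits, dot, underscore and hyphen.
var unsafeChars = regexp.MustCompile(`[^A-Za-z0-9._-]`)

// SanitizeFilename (hypothetical helper) drops directory components and
// replaces characters that could be interpreted by a shell with '_'.
func SanitizeFilename(name string) string {
	// Treat backslashes as separators too, then keep the last path element.
	base := filepath.Base(strings.ReplaceAll(name, "\\", "/"))
	clean := unsafeChars.ReplaceAllString(base, "_")
	// Avoid names that start with '-' (could be read as a command-line
	// option) or '.' (hidden files, "." and "..").
	clean = strings.TrimLeft(clean, "-.")
	if clean == "" {
		return "file"
	}
	return clean
}

func main() {
	for _, name := range []string{
		"exams/2021/algo.pdf",
		"$(id).pdf",
		"-rf",
	} {
		fmt.Printf("%q -> %q\n", name, SanitizeFilename(name))
	}
}
```

The call site would then pass SanitizeFilename(filename) instead of filename. This does not replace the dependency upgrade; it only narrows what an attacker-controlled name can contain.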

Recommended follow-ups

Set up periodic dependency scanning, for example with OSV-Scanner.

References:

https://github.com/sajari/docconv/pull/110/files#diff-6b4d1219c6eeac22eb3c89b2842dfa0fa6d774fcf29a57dd4b829ca76cdc69f4L114
https://pkg.go.dev/vuln/GO-2022-1184
https://nvd.nist.gov/vuln/detail/CVE-2022-4643