axa-group / parsr

5.7K 81.0 305.0 53.86 MB

Transforms PDF, Documents and Images into Enriched Structured Data

License: Apache License 2.0

JavaScript 93.28% Shell 0.03% TypeScript 6.20% Dockerfile 0.03% Python 0.46%

parsr pdf images extraction data typescript python ocr document nlp

parsr's Issues

Automatic LinesToParagraph parameters calculation

The current implementation of paragraph formation from words (in the LinesToParagraph.ts module) depends heavily on the values of its parameters (most importantly maxInterline) to determine whether lines can be merged.

This can be automated using the following steps:

Determine whether the page is rotated: take the smallest text elements (characters or words) and calculate their general direction with respect to their neighbors of the same type. If there is a common pattern (an angle alpha), the page is rotated.
Taking alpha into account, calculate the most common, the maximum, and the minimum vertical inter-line distance, and use these to derive the maxInterline value that the module needs to function correctly.
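
A minimal sketch of the second step, assuming the page is not rotated (alpha = 0) and that each line exposes only a top coordinate and a height. The LineBox shape and the interlineStats function are hypothetical names used for illustration, not part of LinesToParagraph.ts. How these statistics should map to maxInterline is left open.

```typescript
// Hypothetical minimal line shape: only vertical position and height are used.
interface LineBox {
  top: number;
  height: number;
}

interface InterlineStats {
  min: number;
  max: number;
  mostCommon: number;
}

// Measures the vertical gap between each pair of consecutive lines and
// returns the minimum, maximum and most common (rounded) gap.
// Assumes the page is not rotated (alpha = 0).
function interlineStats(lines: LineBox[]): InterlineStats | null {
  const sorted = lines.slice().sort((a, b) => a.top - b.top);
  const gaps: number[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const previousBottom = sorted[i - 1].top + sorted[i - 1].height;
    const gap = Math.round(sorted[i].top - previousBottom);
    if (gap >= 0) gaps.push(gap); // skip overlapping lines
  }
  if (gaps.length === 0) return null;

  // Count how often each rounded gap occurs and keep the most frequent one.
  const counts: Record<number, number> = {};
  let mostCommon = gaps[0];
  for (const gap of gaps) {
    counts[gap] = (counts[gap] || 0) + 1;
    if (counts[gap] > (counts[mostCommon] || 0)) mostCommon = gap;
  }
  return { min: Math.min(...gaps), max: Math.max(...gaps), mostCommon };
}
```

A large gap between the most common and the maximum distance usually marks a paragraph break, which is the information a derived maxInterline would need to capture.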

pdfminer does not always produce clean textual output

Summary
pdfminer sometimes omits characters in the textual output.

Environment

Reference commit/version: a7b4b0c

An error on table export to CSV format

Summary
The table export procedure fails for a particular table.
There seems to be an error somewhere in the pipeline where the table is converted to an array.

Steps To Reproduce
Steps to reproduce the behavior:

Make sure table-detection is turned on.
Pass the attached file through the pipeline.
Observe a TypeError in Table.ts.

Expected behavior
A seamless export of the table to CSV.

Actual behavior
The following error is returned:

[2019-09-09T04:43:06] ERROR (parsr): Cannot set property '0' of undefined
    TypeError: Cannot set property '0' of undefined
        at Table.toArray (/Users/me/Code/parsr/dist/src/types/DocumentRepresentation/Table.js:337:47)
        at /Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:89:47
        at Array.forEach (<anonymous>)
        at /Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:50:27
        at Array.forEach (<anonymous>)
        at MarkdownExporter.getMarkdown (/Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:49:24)
        at MarkdownExporter.export (/Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:44:48)
        at /Users/me/Code/parsr/dist/bin/index.js:130:107
        at process._tickCallback (internal/process/next_tick.js:68:7)
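
The root cause in Table.toArray is not shown here. One common way to get "Cannot set property '0' of undefined" is writing to grid[row][col] before grid[row] has been allocated. The sketch below is an assumption about that failure mode, not the actual Table.ts code; CellSketch and toArraySketch are hypothetical names.

```typescript
// Hypothetical cell shape: zero-based row/column indices plus the cell text.
interface CellSketch {
  row: number;
  col: number;
  text: string;
}

// Allocates every row up front, so that assigning grid[row][col] can never
// hit an undefined row, even when some rows of the table have no cells.
function toArraySketch(cells: CellSketch[]): string[][] {
  let rowCount = 0;
  let colCount = 0;
  for (const cell of cells) {
    rowCount = Math.max(rowCount, cell.row + 1);
    colCount = Math.max(colCount, cell.col + 1);
  }

  const grid: string[][] = [];
  for (let r = 0; r < rowCount; r++) {
    const row: string[] = [];
    for (let c = 0; c < colCount; c++) row.push("");
    grid.push(row);
  }

  for (const cell of cells) {
    grid[cell.row][cell.col] = cell.text;
  }
  return grid;
}
```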

Environment

Reference commit/version: 16a4e23
Other platform details: npm 6.9.0
OS: MacOS 10.14.5

Attached file
t2 (dragged).pdf

Vue - Allow next & prev selection of current element selected

We need a way in the UI to select the next/previous element of the same type (Word, Paragraph, Line, Heading, Table...) as the currently selected element.

I would like the "Element Inspector" to include a component with just a left arrow and a right arrow that select the previous or next element of the same type as the currently selected one.

Certain headings being detected as body text

Certain headings are not being detected as such: perhaps due to the small difference in font size compared to the body text?

It could be interesting to use different parameters as weights (bold, etc) for picking heading candidates in the HeadingDetectionModule.
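
A minimal sketch of weighted heading scoring, assuming each line exposes its font size and a few style flags. LineStyle, HeadingWeights and headingScore are hypothetical names, and the weights are placeholders that would need tuning; this is not the HeadingDetectionModule implementation.

```typescript
// Hypothetical per-line style summary.
interface LineStyle {
  fontSize: number;
  bold: boolean;
  allCaps: boolean;
}

// Hypothetical weights: how much each signal contributes to the score.
interface HeadingWeights {
  sizeRatio: number;
  bold: number;
  allCaps: number;
}

// Combines several weak signals into one score, so that a bold line only
// slightly larger than the body text can still qualify as a heading.
// Assumes bodyFontSize is greater than zero.
function headingScore(
  line: LineStyle,
  bodyFontSize: number,
  weights: HeadingWeights
): number {
  const sizeGain = Math.max(0, line.fontSize / bodyFontSize - 1);
  return (
    weights.sizeRatio * sizeGain +
    (line.bold ? weights.bold : 0) +
    (line.allCaps ? weights.allCaps : 0)
  );
}
```

A line would then be treated as a heading candidate when its score exceeds a configurable threshold.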

Source

Resulting Markdown

pdf2json's installation procedure to be changed for Arch Linux and Ubuntu

Summary
The installation procedure for pdf2json mentioned under Arch Linux assumes the package's presence in the official repositories, but it has recently been removed.
UPDATE: The package pdf2json has also been removed from the latest versions of Ubuntu.

Steps To Reproduce

Boot into a machine running Arch Linux.
In a terminal, execute as superuser: pacman -Ss pdf2json. No packages will be returned.

Expected behavior
An installation of pdf2json.

Actual behavior
The package is not found in the repositories.

Environment

OS: Arch Linux

Additional context

A new AUR package that builds pdf2json from source should be created and submitted to https://aur.archlinux.org/
The README files should be modified to reflect the new commands to install this package.

Add unit tests to the API

Add unit tests to the API and run them when we run npm test at the root of the project.

Automatic high performance Header/Footer detection

The current header/footer detection module, HeaderFooterDetectionModule, requires an estimate, as a percentage, of the maximal distance from the page edge within which headers and footers lie.
It would be great to have this module automatically detect headers and footers (using techniques like NLP, Vision, etc) without the need of such a parameter.
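
One parameter-free heuristic, sketched under the assumption that the first (or last) line of every page is available as plain text: lines that repeat on most pages, ignoring page numbers, are likely headers or footers. This is a simple repetition check rather than the NLP or vision techniques mentioned above, and the function names and the 0.6 share are hypothetical.

```typescript
// Replaces digit runs so that "Page 3" and "Page 4" compare equal.
function normalizeEdgeLine(text: string): string {
  return text.replace(/\d+/g, "#").trim().toLowerCase();
}

// edgeLines holds the first (or last) line of each page.
// Returns the normalized lines that occur on at least minShare of the pages.
function repeatedEdgeLines(edgeLines: string[], minShare = 0.6): string[] {
  const counts: Record<string, number> = Object.create(null);
  for (const line of edgeLines) {
    const key = normalizeEdgeLine(line);
    if (key.length > 0) counts[key] = (counts[key] || 0) + 1;
  }

  const repeated: string[] = [];
  for (const key of Object.keys(counts)) {
    if (counts[key] / edgeLines.length >= minShare) repeated.push(key);
  }
  return repeated;
}
```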

Keep the original order of elements in their properties

Currently, the TextOrderDetection module detects an order to the extracted text depending on the physical layout.

It could be interesting to keep the original text node order from PDF in the properties of the element, so that it can still be referred to, and this information is not lost.
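
A minimal sketch of the idea, assuming elements carry a free-form properties object. The originalOrder property name and the simple top-to-bottom sort are assumptions for illustration; the real TextOrderDetection logic is more involved.

```typescript
// Hypothetical text node with a properties bag.
interface TextNode {
  id: number;
  top: number;
  left: number;
  properties: { order?: number; originalOrder?: number };
}

// Records the extraction order before re-ordering by layout, so the
// original PDF node order remains available afterwards.
function assignOrders(nodes: TextNode[]): TextNode[] {
  nodes.forEach((node, index) => {
    node.properties.originalOrder = index;
  });
  const sorted = nodes
    .slice()
    .sort((a, b) => a.top - b.top || a.left - b.left);
  sorted.forEach((node, index) => {
    node.properties.order = index;
  });
  return sorted;
}
```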

Evaluate PDFMiner, pdf-extractor, pdfreader, and ghostscript with a possibility of including them as possible extractors

Recent tests have shown that PDFMiner outperforms pdf2json as a PDF extractor. Its pdf2txt.py -A -t xml -o <output.xml> <input.pdf> command produces XML output very similar in content to the output of pdf2json.
It could be nice to provide PDFMiner as an extraction option.

Note: PDFMiner's installation depends on python2 and not python3.

Other extractors to be evaluated:

https://www.npmjs.com/package/pdfreader
https://www.npmjs.com/package/pdf-extractor
Ghostscript: see https://stackoverflow.com/a/6189489

Add an architecture diagram to the project documentation

It could be nice to explain the whole Parsr pipeline diagrammatically for easier understanding and adaptation.
It would be great to store the diagram source files in the repository as well.

doc/architecture.md
doc/assets/architecture-drawing.xxx

Sample files

Provide some representative samples in the repository under samples/.

Accept .docx files on input

Currently, the system accepts either images or PDF files.
It would be a nice feature if the system could accept DOCX files as input, keeping the hierarchical structure and contents intact as a valid types/DocumentRepresentation instance.

Add an API endpoint to serve the list of modules, their documentation and default parameter values (defaultConfig.json)

Description
Currently, there is no way for the client (GUI, etc) to know which parameters are to be supplied for each module, what the default values are, what their max/min values are, etc.
In the deployed environment, the client needs to be aware of this information which inherently belongs to the server.
An API endpoint /modules with an object providing all this information would even render future server-side modifications replicable on the client automatically.
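
A possible shape for the /modules payload, sketched with placeholder values. The ParamSpec and ModuleSpec field names, the example module, and its parameter are all hypothetical; only the idea of serving name, description, defaults, and min/max comes from the description above.

```typescript
// Hypothetical description of one numeric module parameter.
interface ParamSpec {
  description: string;
  default: number;
  min?: number;
  max?: number;
}

// Hypothetical description of one module as served by /modules.
interface ModuleSpec {
  name: string;
  description: string;
  params: Record<string, ParamSpec>;
}

// Placeholder payload for illustration only.
const modulesResponse: ModuleSpec[] = [
  {
    name: "example-module",
    description: "Placeholder entry for illustration.",
    params: {
      exampleThreshold: {
        description: "Placeholder parameter.",
        default: 0.5,
        min: 0,
        max: 1,
      },
    },
  },
];

// A client can validate user input against the served bounds instead of
// hardcoding them.
function isValidValue(spec: ParamSpec, value: number): boolean {
  if (spec.min !== undefined && value < spec.min) return false;
  if (spec.max !== undefined && value > spec.max) return false;
  return true;
}
```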

Word level confidence export

Export confidence level corresponding to each word.

Given the hierarchical arrangement of element types, the order detected needs context

Paragraphs, lines and words follow an order value unique to their hierarchy level.
I.e., paragraphs are ordered 0, 1, 2, 3, while their respective lines are ordered 0, 1, 2, 3... in the first paragraph, 0, 1, 2, 3... in the second, and so on.
A consistent scheme where all elements follow the same continuous sequence would be ideal.
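
One way to get a continuous sequence is a depth-first walk with a single shared counter. The sketch below assumes a generic element tree; OrderedElement and assignContinuousOrder are hypothetical names.

```typescript
// Hypothetical generic element: a type label, an order, and children.
interface OrderedElement {
  type: string;
  order?: number;
  children: OrderedElement[];
}

// Walks the tree depth-first and gives every element the next value of one
// shared counter. Returns the next unused order value.
function assignContinuousOrder(elements: OrderedElement[], start = 0): number {
  let next = start;
  for (const element of elements) {
    element.order = next;
    next = assignContinuousOrder(element.children, next + 1);
  }
  return next;
}

// Two paragraphs with two lines each.
const page: OrderedElement[] = [
  {
    type: "paragraph",
    children: [
      { type: "line", children: [] },
      { type: "line", children: [] },
    ],
  },
  {
    type: "paragraph",
    children: [
      { type: "line", children: [] },
      { type: "line", children: [] },
    ],
  },
];
const nextFreeOrder = assignContinuousOrder(page);
```

With this scheme the first paragraph gets 0, its lines 1 and 2, the second paragraph 3, and its lines 4 and 5, so any two elements can be compared regardless of their level.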

Add support for exporting higher levels of granularity (line, paragraph)

Currently, the output JSON supports exporting text elements at the granularity levels of word and character, which are the two most fine-grained levels of detail.

It could be nice to make a more compact JSON export possible - keeping the finest granularity to either Line or Paragraph.
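
A minimal sketch of such an export, assuming every node already carries its own text content so that children can be dropped without losing text. ExportNode, Level and pruneBelow are hypothetical names and do not reflect the current JSON exporter.

```typescript
type Level = "paragraph" | "line" | "word" | "character";

// Depth of each level in the hierarchy, coarsest first.
const LEVEL_DEPTH: Record<Level, number> = {
  paragraph: 0,
  line: 1,
  word: 2,
  character: 3,
};

// Hypothetical export node that carries its own text content.
interface ExportNode {
  type: Level;
  content: string;
  children?: ExportNode[];
}

// Returns a copy of the tree in which nothing finer than `finest` remains.
function pruneBelow(node: ExportNode, finest: Level): ExportNode {
  if (LEVEL_DEPTH[node.type] >= LEVEL_DEPTH[finest] || !node.children) {
    return { type: node.type, content: node.content };
  }
  return {
    type: node.type,
    content: node.content,
    children: node.children.map((child) => pruneBelow(child, finest)),
  };
}
```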

Calculate text order (L->R or R->L) using character types from the document, then adapt the Reading Order algorithm to it

In an endeavor to support right-to-left languages, it could be useful to calculate the text order (L->R or R->L) using character types from the document, then adapt the Reading Order algorithm to it.
For example: if the majority of the text is in the Arabic script, adapt the XY-cut reading order algorithm to invert the X axis.
The same goes for top-down languages.
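
A rough sketch of the character-type count, assuming that Hebrew and Arabic code units are a good enough proxy for right-to-left text and basic Latin letters for left-to-right text. The ranges are approximate, top-down scripts are not handled, and the function names are hypothetical.

```typescript
type Direction = "ltr" | "rtl";

// Approximate right-to-left ranges in the Basic Multilingual Plane.
function isRtlCode(code: number): boolean {
  return (
    (code >= 0x0590 && code <= 0x05ff) || // Hebrew
    (code >= 0x0600 && code <= 0x06ff) || // Arabic
    (code >= 0x0750 && code <= 0x077f) // Arabic Supplement
  );
}

// Basic Latin letters only; digits and punctuation are treated as neutral.
function isLatinLetterCode(code: number): boolean {
  return (
    (code >= 0x41 && code <= 0x5a) || // A-Z
    (code >= 0x61 && code <= 0x7a) // a-z
  );
}

// Majority vote over all characters; ties and empty text default to "ltr".
function dominantDirection(text: string): Direction {
  let rtl = 0;
  let ltr = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (isRtlCode(code)) rtl++;
    else if (isLatinLetterCode(code)) ltr++;
  }
  return rtl > ltr ? "rtl" : "ltr";
}
```

The result could then decide whether the XY-cut step walks the X axis from left to right or from right to left.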

Propose a new demo Web Interface with a cleaner and more responsive design

The current basic web interface demo module (in demo/web-viewer) lacks several features (like file type export, source file preview, non-overlapping text elements) which could serve as a more effective demo for end-users.

Also to be included - a legend describing which colours mean what.

Handle PDFs composed of text and images that contain text

A possible solution could be to handle each file page by page: if the page has no text from the text extractor, we forward it to the OCR extractor.
The cleanest solution would be to extract images with mupdf, run an OCR on them and put it back into the Document.
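
The page-by-page fallback could look roughly like this. The two extractors are passed in as plain functions so the sketch stays independent of any real extractor; PageText, PageExtractor and extractWithOcrFallback are hypothetical names.

```typescript
// Hypothetical per-page result.
interface PageText {
  pageNumber: number;
  words: string[];
}

// Hypothetical extractor signature: returns the words found on one page.
type PageExtractor = (pageNumber: number) => PageText;

// Uses the text extractor first and falls back to OCR only for pages on
// which the text extractor found nothing.
function extractWithOcrFallback(
  pageCount: number,
  textExtractor: PageExtractor,
  ocrExtractor: PageExtractor
): PageText[] {
  const pages: PageText[] = [];
  for (let page = 1; page <= pageCount; page++) {
    const fromText = textExtractor(page);
    pages.push(fromText.words.length > 0 ? fromText : ocrExtractor(page));
  }
  return pages;
}
```

This does not cover pages that mix real text with images containing text, which is what the image-extraction approach would address.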

Error on docker-compose build under Windows

Summary
An error shows up when docker-compose build is launched in the root directory, causing the installation to fail.

Steps To Reproduce

Clone the repository.
In the root folder, type docker-compose build

Expected behavior
A working installation of Parsr.

Actual behavior
https://gist.github.com/aarohijohal/804196e1f308fd6054eac22201e7892f

Environment

OS: Windows
Other details can be extracted if need be

Handling formatting (bold, italics...) in the markdown export

Currently, all text content is outputted without any formatting (bold, italics) information.
If any formatting information is available, it should be used to enrich the markdown output.

This feature has been requested for NLP-based use cases, where bold, italics and other formatting information can be exploited to better identify a definition or a term, for example.
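
A minimal sketch of the enrichment, assuming each word carries optional bold and italic flags. StyledWord and toMarkdown are hypothetical names; the sketch wraps every word on its own and does not merge adjacent words with the same style into one run.

```typescript
// Hypothetical word with optional formatting flags.
interface StyledWord {
  text: string;
  bold?: boolean;
  italic?: boolean;
}

// Wraps each word in Markdown emphasis markers according to its flags.
function toMarkdown(words: StyledWord[]): string {
  return words
    .map((word) => {
      let out = word.text;
      if (word.italic) out = "*" + out + "*";
      if (word.bold) out = "**" + out + "**";
      return out;
    })
    .join(" ");
}
```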

Handle Password Protected PDFs

Password-protected PDF files should be processed with something like QPDF (to remove the password requirement) before they go through the extractor.
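
A sketch of the arguments such a pre-processing step could pass to the qpdf command line, which accepts --password=... together with --decrypt followed by the input and output paths. The helper name is hypothetical, and how the password reaches Parsr is left open.

```typescript
// Builds the argument list for: qpdf --password=<pw> --decrypt <in> <out>
// Returned as an array so it can be passed to a process spawner without
// going through a shell.
function qpdfDecryptArgs(
  password: string,
  inputPath: string,
  outputPath: string
): string[] {
  return ["--password=" + password, "--decrypt", inputPath, outputPath];
}
```

The resulting array could be handed to child_process.spawnSync("qpdf", args), and the decrypted output file would then be given to the extractor in place of the original.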

python3 command is just python sometimes

Summary
On Arch Linux and some other OSes (maybe Windows as well?), Python 3.x is run using the python command and Python 2.x using the python2 command. Here, we're assuming it's always python3:

Parsr/server/src/modules/TableDetectionModule/TableDetectionModule.ts

Line 35 in dc3bec7

const tableExtractor = child_process.spawnSync('python3', [

Steps To Reproduce
Steps to reproduce the behavior:

Have python on your PATH pointing to Python 3.x.
Remove python3 from your PATH.

Expected behavior
Table extraction should run.

Actual behavior
A node error: Error: spawnSync python3 ENOENT.

Environment

Reference commit/version: dc3bec7ae62b0db12a075f31c26ba8951a63c2f7
Other platform details: Python 3.x is just python, not python3
OS: Arch Linux, maybe others?
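
One possible fix is to probe a list of candidate commands and use the first one that works. In the sketch below the probe is injected as a function, so nothing is assumed about how it is implemented; pickPythonCommand and Probe are hypothetical names.

```typescript
// Returns true when the given command can be run on this machine.
type Probe = (command: string) => boolean;

// Tries the candidates in order and returns the first runnable one, or null
// when none of them is available.
function pickPythonCommand(
  canRun: Probe,
  candidates: string[] = ["python3", "python"]
): string | null {
  for (const candidate of candidates) {
    if (canRun(candidate)) return candidate;
  }
  return null;
}
```

A real probe could run the candidate with --version through child_process.spawnSync and check both for a spawn error and for a 3.x version, since a bare python command may still point to Python 2.x on some systems.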

Restructuring modules into respective subfolders

One folder per module would allow us to include the following 3 files for each functionality:

The source code
The default configuration JSON for the module.
The documentation reference

Text rendering mode 3 handling

Some documents are filled with text elements that have their rendering mode set to 3.

These blocks of text are supposed to be invisible and can thus spoil paragraph detection and other similar algorithms. They should not be removed, though, because they are sometimes used in OCR-enhanced documents, where the detected invisible text overlaps the image.
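
A sketch of one way to satisfy both constraints: keep the invisible elements in the document but separate them from the visible ones, so that layout algorithms can work on visible text only. TextElement and splitByVisibility are hypothetical names.

```typescript
// Hypothetical text element that remembers its PDF text rendering mode.
interface TextElement {
  text: string;
  renderingMode: number;
}

// In PDF, text rendering mode 3 means the text is neither filled nor stroked.
const INVISIBLE_RENDERING_MODE = 3;

// Splits elements into visible and invisible groups without discarding any.
function splitByVisibility(elements: TextElement[]): {
  visible: TextElement[];
  invisible: TextElement[];
} {
  const visible: TextElement[] = [];
  const invisible: TextElement[] = [];
  for (const element of elements) {
    if (element.renderingMode === INVISIBLE_RENDERING_MODE) {
      invisible.push(element);
    } else {
      visible.push(element);
    }
  }
  return { visible, invisible };
}
```

Paragraph detection could then run on the visible group, falling back to the invisible group for pages that contain no visible text at all, as in OCR-enhanced scans.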

Amazon Textract extractor

Make an extractor (or input module) based on the Amazon Textract API.

Error on docker-compose build

Summary
An error shows up when docker-compose build is launched in the root directory, causing the installation to fail.

Steps To Reproduce
Steps to reproduce the behavior:

Clone the repository.
In the root folder, type docker-compose build
Observe an output similar to: https://gist.github.com/aarohijohal/380f197cf990815ca49dc9919fbed211

Expected behavior
A working installation of Parsr.

Actual behavior
https://gist.github.com/aarohijohal/380f197cf990815ca49dc9919fbed211

Environment

Reference commit/version: [455799ad19e06e4ed4763033c985e279d2f00594](https://github.com/axa-group/Parsr/commit/455799ad19e06e4ed4763033c985e279d2f00594)
Other platform details: docker-compose version 1.24.1, build 4667896b
OS: MacOS Mojave 10.14.5

Documentation of the optional parameters for each module

The documentation for each module (residing in /docs/modules) does not yet reference the module's optional parameters. The only place to check them is the code itself.

It would increase usability if these values could be referred to in the documentation along with the descriptions.

Improper WordsToLine and LineToParagraph merge on a certain document

There's a problem with the formation of lines and paragraphs on this certain document.
It might be linked to the slanted nature of the text (slight rotation of the content on the input document).

Source File:

Output:

Add unit tests to getElementOfType()

This function is widely used throughout the code and is not straightforward.
It should be tested more thoroughly, including all the various edge cases one can think of.

Fails on openshift Kubernetes `EACCES: permission denied, mkdir '/opt/app-root/src/api/server/dist/output'`

Summary
Trying to run the container on Openshift 3 Kubernetes fails. By default openshift runs in non-root mode. The program tries to create a directory (dist/output) and this fails.

The code that writes this output directory is here:
https://github.com/axa-group/Parsr/blob/develop/api/server/src/api.ts#L55-L59

Steps To Reproduce

$ oc run parsr --image=axarev/parsr --expose=true --port=3001
...
$ oc logs parsr-1-6wchv
Starting par.sr API : node api/server/dist/index.js
fs.js:115
    throw err;
    ^

Error: EACCES: permission denied, mkdir '/opt/app-root/src/api/server/dist/output'
    at Object.mkdirSync (fs.js:753:3)
    at new ApiServer (/opt/app-root/src/api/server/dist/api.js:63:16)
    at Object.<anonymous> (/opt/app-root/src/api/server/dist/index.js:19:11)
    at Module._compile (internal/modules/cjs/loader.js:689:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
    at Module.load (internal/modules/cjs/loader.js:599:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
    at Function.Module._load (internal/modules/cjs/loader.js:530:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
    at startup (internal/bootstrap/node.js:283:19)

Expected behavior
The pod should be running. It does run properly in a default GKE cluster.

Actual behavior
The container fails to run, because it cannot create the directory.
It goes in a restart loop.

Environment

Docker hub version https://hub.docker.com/layers/axarev/parsr/latest/images/sha256-b4adf5d95317f4a9c3ce79bd708bd6bdf6abde7b4bc553880af711b3b69e410f
Other platform details: Openshift version 3.9

Additional context
The fact that openshift by default only allows non-root is documented e.g. here:
https://stackoverflow.com/questions/42363105/permission-denied-mkdir-in-container-on-openshift

I think it would be better to change the container set-up so that it can run as non-root.
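
Besides changing the image itself, the server could choose its output directory from a list of candidates instead of always creating dist/output next to the code. The sketch below only shows the selection logic; the writability check is injected, and the idea of a configurable directory is an assumption, not an existing Parsr option.

```typescript
// Returns the first candidate directory that is set and writable, or null.
// Candidates may be undefined, for example when an environment variable
// is not set.
function chooseOutputDir(
  candidates: Array<string | undefined>,
  isWritable: (dir: string) => boolean
): string | null {
  for (const dir of candidates) {
    if (dir !== undefined && dir !== "" && isWritable(dir)) return dir;
  }
  return null;
}
```

Candidates could be, in order, a directory taken from an environment variable, the current dist/output location, and a directory under the system temporary folder, with the check built on fs.accessSync.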

Include a module's description and name in each module's defaultConfig.json

Module documentation and name should also be present in each module's defaultConfig.json file.

Table detection with Tesseract

Find a way to have table detection with Tesseract.
Maybe Tesseract has some options to do it.
Maybe we can find a way to pass the bounding boxes and content to Camelot.

Add an API endpoint to serve the default server configuration

Description
Currently, the default server configuration is supplied along with the project source.
In a deployed environment, the client cannot be automatically notified if the server's defaults are changed.
An API endpoint /default-config could provide this information, which would remove the need to hardcode anything on the client/GUI side.

EDIT: changed the proposition from /defaultConfig to /default-config to follow api endpoint naming convention.

Input and output modules (to be named so after Issue #48 is treated) should each have a separate subfolder with documentation and a configuration file

Input and output modules should also be configurable from an external config.json file. For example, various details about the output file, such as the granularity of the output JSON, could be specified in this file.

Google Cloud Vision API extractor

Make an extractor (or input module) based on the Google Cloud Vision API.

Make the API documentation accessible via a route to `GET /`

Add a route to GET / on the API, that displays or links to the documentation of the API.

Accept .docx format as an output type

Several use-cases involve direct interpretation/manipulation of .docx files for translation, ingestion into databases, etc.
Along with markdown, raw text, JSON, it could be a nice feature to have DOCX as an output format.

Duplicates not removed because of slight translation in coordinates

If there are duplicates of certain textual elements with not the exact coordinates but slightly translated from one another, the duplicate element removal module does not treat them properly.
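
A tolerance-based comparison is one way to catch such near-duplicates. The sketch below treats two elements as duplicates when their text is identical and their positions differ by at most a small tolerance; PlacedText, removeNearDuplicates and the default tolerance of 2 units are hypothetical.

```typescript
// Hypothetical text element with its top-left position.
interface PlacedText {
  text: string;
  left: number;
  top: number;
}

// Keeps the first occurrence of each element and drops later ones that have
// the same text and lie within `tolerance` units on both axes.
function removeNearDuplicates(
  items: PlacedText[],
  tolerance = 2
): PlacedText[] {
  const kept: PlacedText[] = [];
  for (const item of items) {
    const isDuplicate = kept.some(
      (other) =>
        other.text === item.text &&
        Math.abs(other.left - item.left) <= tolerance &&
        Math.abs(other.top - item.top) <= tolerance
    );
    if (!isDuplicate) kept.push(item);
  }
  return kept;
}
```

The pairwise comparison is quadratic in the number of elements, which may need a spatial index for very dense pages.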

UPDATE:
After e5cba27 ,

Make Parsr available as a single PyPi package

Additional Ideas:

Split the current project into libraries before publishing them as individual libraries.
Add our types to @types/... http://definitelytyped.org/

UPDATE: Priority shifted to a python/pip packaging

Use ghostscript instead of convert to generate intermediate TIFF files (from PDF) before extracting using tesseract

Currently, with some specific documents, the PDF pre-processing step (a call to convert that generates TIFF files) can generate low-resolution images. These images are then passed to tesseract and the results are quite bad.

We should make sure that:

the target images are big enough to get good results from tesseract.
the processing time is fast enough.

We had a look at calling ghostscript directly.
300 dpi seems good enough, and Ghostscript is faster than convert.

$ gs -dNOPAUSE -q -sDEVICE=tiff48nc -dBATCH -sOutputFile=a.tiff -r300 a.pdf -c quit

Note that this creates some pretty huge files.

Missing Camelot and PDF Miner in README's Dependencies Section

The instructions to install Camelot and PDFMiner are missing from the README, leading to errors like these:

Refactor naming of the different stages of the pipeline

For easier understanding, the following renamings should be made:

extraction becomes 'input',
export becomes 'output', and
cleaning becomes 'processing'

[axarev/parsr:latest] @grpc/grpc-js only works on Node ^8.13.0 || >=10.10.0

Summary

The latest version of Parsr published on Docker Hub throws an error when trying to get the queue status for a document previously submitted to the /document endpoint.

Steps To Reproduce

Use the following docker-compose.yml:

version: '3.3'

services:
  duckling:
    image: axarev/duckling:latest
    ports:
      - 8000:8000

  parsr:
    image: axarev/parsr:latest
    ports:
      - 8080:3000
      - 3001:3001
    environment:
      DUCKLING_HOST: http://duckling:8000
      ABBYY_SERVER_URL:
    volumes:
      - ./pipeline/:/opt/app-root/src/demo/web-viewer/pipeline/

volumes:
  pipeline:
    driver: local

Run docker-compose up.
Call POST http://localhost:3001/api/document with a sample PDF and the "exempli gratia" config.json file.
Call GET http://localhost:3001/api/queue/{id} with the ID produced at step 3.

Expected behavior
Get an appropriate response.

Actual behavior
An HTTP 500 Internal Server Error is received with the following payload:

/opt/app-root/src/node_modules/@grpc/grpc-js/build/src/index.js:47
throw new Error(`@grpc/grpc-js only works on Node ${supportedNodeVersions}`);
^

Error: @grpc/grpc-js only works on Node ^8.13.0 || >=10.10.0
    at Object.<anonymous> (/opt/app-root/src/node_modules/@grpc/grpc-js/build/src/index.js:47:11)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)
    at Module.require (module.js:596:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/opt/app-root/src/node_modules/google-gax/build/src/grpc.js:37:14)
    at Module._compile (module.js:652:30)

Environment

Reference commit/version: axarev/parsr:latest
OS: Windows 10
Docker Desktop Community for Windows:
Version: 2.1.0.3 (38240)
Engine: 19.03.2
Compose: 1.24.1

Missing fields in the SVG-Line export into the output JSON format

Inside the document representation, the SvgLine object reads:

{
  _id: 27627,
  _box: BoundingBox {
    _left: 211,
    _top: 6711,
    _width: 4538,
    _height: 11
  },
  _metadata: [],
  _properties: {},
  _children: [],
  content: null,
  _thickness: 2,
  _fromX: 212,
  _fromY: 6725,
  _toX: 4750,
  _toY: 6725
}

Upon export, the same object reads:

{
  id: 27627,
  type: 'svg-line',
  properties: {},
  metadata: [],
  box: {
    l: 211,
    t: 6711,
    w: 4538,
    h: 11
  },
  fromX: undefined,
  fromY: undefined,
  toX: undefined,
  toY: 6725,
  thickness: 2
}

NOTE: the 'undefined' fields.
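
The cause of the undefined fields is not shown here. The sketch below only illustrates an export mapping that reads every underscore-prefixed field of the internal object shown above; the interface and function names are hypothetical and this is not the actual exporter code.

```typescript
// Subset of the internal SvgLine fields listed above.
interface SvgLineInternal {
  _id: number;
  _thickness: number;
  _fromX: number;
  _fromY: number;
  _toX: number;
  _toY: number;
}

// Subset of the exported JSON fields listed above.
interface SvgLineJson {
  id: number;
  type: "svg-line";
  fromX: number;
  fromY: number;
  toX: number;
  toY: number;
  thickness: number;
}

// Maps every internal coordinate field to its exported counterpart.
function exportSvgLine(line: SvgLineInternal): SvgLineJson {
  return {
    id: line._id,
    type: "svg-line",
    fromX: line._fromX,
    fromY: line._fromY,
    toX: line._toX,
    toY: line._toY,
    thickness: line._thickness,
  };
}
```

A unit test asserting that none of the exported values is undefined would catch this kind of regression.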

[Vue] Add persistence to configuration form

Each time a file is uploaded, the custom configuration is lost. During development, a custom configuration is required to test Parsr's behaviour, and losing it on every upload is a nightmare.

I would like to reuse the current custom configuration for multiple uploads, and I would like a "reset" button that restores the default configuration.

This feature can be done quickly by moving the custom configuration state from the component "/views/Upload" to the store "/vuex/Store".
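
A framework-free sketch of the state and the two mutations this would need, written in the (state, payload) shape that Vuex mutations use. The state fields and mutation names are hypothetical; wiring them into the actual store is not shown.

```typescript
type Config = Record<string, unknown>;

// Hypothetical store state: the server defaults plus the user's edits.
interface ConfigState {
  defaultConfig: Config;
  customConfig: Config;
}

// Deep copy via JSON, which is sufficient for plain configuration objects.
function cloneConfig(config: Config): Config {
  return JSON.parse(JSON.stringify(config));
}

function createState(defaultConfig: Config): ConfigState {
  return { defaultConfig, customConfig: cloneConfig(defaultConfig) };
}

const mutations = {
  // Keeps the custom configuration across uploads.
  setCustomConfig(state: ConfigState, config: Config): void {
    state.customConfig = cloneConfig(config);
  },
  // Backs the "reset" button.
  resetCustomConfig(state: ConfigState): void {
    state.customConfig = cloneConfig(state.defaultConfig);
  },
};
```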

In raw text extraction outputs, handle bulleted and numbered lists

Currently, the outputs of the pdf2json and tesseract extractors do not give us any information about numbered or bulleted lists.
The current bullet/numbered list detection module ListDetectionModule needs to be improved.

Current state
Source

Resulting Markdown

Refactor Exporter and API's FileManager to allow files to be on a remote location (S3 for example)

Currently, processed files are only stored locally, which can cause problems when using Docker.
The idea is to be able to use different storage backends, such as Amazon S3, Azure Storage, local files...
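
The refactoring could start from a small storage interface that the exporter and the FileManager depend on. The sketch below uses an in-memory implementation as a stand-in; FileStorage, InMemoryStorage and exportResult are hypothetical names, and an S3 or Azure backend would implement the same two methods.

```typescript
// Hypothetical storage abstraction. Methods are asynchronous because remote
// backends cannot answer synchronously.
interface FileStorage {
  save(key: string, data: string): Promise<void>;
  load(key: string): Promise<string | undefined>;
}

// Stand-in backend that keeps files in memory, useful for tests.
class InMemoryStorage implements FileStorage {
  private files: Record<string, string> = {};

  save(key: string, data: string): Promise<void> {
    this.files[key] = data;
    return Promise.resolve();
  }

  load(key: string): Promise<string | undefined> {
    return Promise.resolve(this.files[key]);
  }
}

// The exporter depends only on the interface, not on where files end up.
function exportResult(
  storage: FileStorage,
  docId: string,
  json: string
): Promise<void> {
  return storage.save(docId + ".json", json);
}
```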

Replace the usage of the `which` command for when Parsr is used under Windows

Summary
The file server/src/extractors/extract-fonts.ts:30 uses the which command, which is *nix-only.
For Windows-based clients, this needs to be replaced with where for Windows XP 32-bit and above.
A solution for older Windows clients needs to be integrated too.

Steps To Reproduce

Run the tool, and process any file.
Observe an error reporting that join cannot be read on null. This comes from the fact that spawning the which command produces a null output.

Expected behavior
The running of an input/extractor module.

Actual behavior
The following error is observed:

[2019-09-09T05:10:55] ERROR (parsr): Cannot read property 'join' of null
    TypeError: Cannot read property 'join' of null
        at /Users/me/Code/parsr/dist/src/extractors/extract-fonts.js:30:81
        at new Promise (<anonymous>)
        at Object.extractFonts (/Users/me/Code/parsr/dist/src/extractors/extract-fonts.js:29:12)
        at PdfJsonExtractor.run (/Users/me/Code/parsr/dist/src/extractors/pdf2json/PdfJsonExtractor.js:63:43)
        at Orchestrator.run (/Users/me/Code/parsr/dist/src/Orchestrator.js:47:31)
        at runOrchestrator (/Users/me/Code/parsr/dist/bin/index.js:97:14)
        at main (/Users/me/Code/parsr/dist/bin/index.js:88:5)
        at Object.<anonymous> (/Users/me/Code/parsr/dist/bin/index.js:246:1)
        at Module._compile (internal/modules/cjs/loader.js:775:14)
        at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)

Screenshots

Environment

OS: Windows (version to be confirmed)

Additional context
The following discussions on an alternative can be useful:

The following function can be used to detect the current OS:

https://nodejs.org/api/os.html#os_os_platform
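
A minimal sketch of the platform switch, assuming the platform string comes from os.platform(), which returns "win32" on Windows. The helper names are hypothetical. The second helper guards against the null output that caused the error above and takes the first line, since where can print several matches.

```typescript
// Chooses the executable-lookup command for the given os.platform() value.
function locatorCommand(platform: string): "where" | "which" {
  return platform === "win32" ? "where" : "which";
}

// Returns the first non-empty line of the command output, or null when the
// process produced no output (for example when the spawn itself failed).
function firstResolvedPath(stdout: string | null): string | null {
  if (stdout === null) return null;
  const lines = stdout.split(/\r?\n/).filter((line) => line.trim() !== "");
  return lines.length > 0 ? lines[0].trim() : null;
}
```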
