axa-group / parsr
Transforms PDF, Documents and Images into Enriched Structured Data
License: Apache License 2.0
The current implementation of paragraph formation from words (in the LinesToParagraph.ts module) depends heavily on the values of its parameters (most importantly maxInterline) to determine whether the lines can be merged.
This can be automatised using the following steps:
maxInterline value that the module vitally needs to function correctly.
Summary
pdfminer sometimes omits characters in the textual output.
There are some characters missing.
Environment
Summary
The table export procedure fails for a particular table.
There seems to be an error somewhere in the pipeline where the table is converted to an array.
Steps To Reproduce
Steps to reproduce the behavior:
Table.ts.
Expected behavior
A seamless export of the table to CSV.
Actual behavior
The following error is returned:
[2019-09-09T04:43:06] ERROR (parsr): Cannot set property '0' of undefined
TypeError: Cannot set property '0' of undefined
at Table.toArray (/Users/me/Code/parsr/dist/src/types/DocumentRepresentation/Table.js:337:47)
at /Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:89:47
at Array.forEach (<anonymous>)
at /Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:50:27
at Array.forEach (<anonymous>)
at MarkdownExporter.getMarkdown (/Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:49:24)
at MarkdownExporter.export (/Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:44:48)
at /Users/me/Code/parsr/dist/bin/index.js:130:107
at process._tickCallback (internal/process/next_tick.js:68:7)
Environment
Attached file*
t2 (dragged).pdf
We need a UI way to select the next/previous element of the same type (Word, Paragraph, Line, Heading, Table...) as the currently selected element.
I would like the "Element Inspector" to contain a component with just a left arrow and a right arrow that selects the previous or next element of the same type as the currently selected one.
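The lookup behind the two arrows can be sketched as a scan over the elements in document order. This is a minimal sketch, assuming a flat list of elements with an `id` and a `type`; the interface and function names are illustrative and not taken from the viewer's code.

```typescript
// Hypothetical minimal element shape; the real viewer model is richer.
interface InspectableElement {
  id: number;
  type: string; // e.g. "word", "line", "paragraph", "heading", "table"
}

// Returns the closest element of the same type before or after the current
// one, or null when there is none. `direction` is 1 for next, -1 for previous.
function siblingOfSameType(
  elements: InspectableElement[],
  currentId: number,
  direction: 1 | -1
): InspectableElement | null {
  let start = -1;
  for (let i = 0; i < elements.length; i++) {
    if (elements[i].id === currentId) {
      start = i;
      break;
    }
  }
  if (start === -1) return null;

  const wanted = elements[start].type;
  for (let i = start + direction; i >= 0 && i < elements.length; i += direction) {
    if (elements[i].type === wanted) return elements[i];
  }
  return null;
}
```

A null result is a natural signal for the component to disable the corresponding arrow.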
Summary
The installation procedure for pdf2json mentioned under Arch Linux assumes the package's presence in the official repositories, but it has recently been removed.
UPDATE: The package pdf2json has also been removed from the latest versions of Ubuntu.
Steps To Reproduce
pacman -Ss pdf2json. No packages will be returned.
Expected behavior
An installation of pdf2json.
Actual behavior
The package is not found in the repositories.
Environment
Additional context
Add unit tests to the API and run them when we run npm test at the root of the project.
The current header/footer detection module, HeaderFooterDetectionModule, requires an estimate (as a percentage) of the maximal distance from the page edge within which the headers and footers lie.
It would be great to have this module automatically detect headers and footers (using techniques like NLP, Vision, etc) without the need of such a parameter.
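One parameter-free starting point, simpler than the NLP or vision techniques mentioned above, is to look for lines that repeat at the top or bottom of most pages once digits are masked (so "Page 1" and "Page 2" match). This is a sketch of that swapped-in heuristic only; the function names and the `minShare` threshold are assumptions, and it works on plain strings rather than on Parsr's document model.

```typescript
// Masks digit runs so that page numbers compare equal across pages.
function normalizeEdgeLine(line: string): string {
  return line.trim().replace(/\d+/g, "#");
}

// `pages` holds the text lines of each page, top to bottom.
// Returns the normalized first/last lines that occur on at least
// `minShare` (0..1) of the pages, in order of first appearance.
function findRepeatedEdgeLines(pages: string[][], minShare: number): string[] {
  const counts: { [key: string]: number } = Object.create(null);
  const seenOrder: string[] = [];

  for (let p = 0; p < pages.length; p++) {
    const lines = pages[p];
    if (lines.length === 0) continue;
    const edges = [normalizeEdgeLine(lines[0])];
    if (lines.length > 1) edges.push(normalizeEdgeLine(lines[lines.length - 1]));
    for (let e = 0; e < edges.length; e++) {
      const key = edges[e];
      if (counts[key] === undefined) {
        counts[key] = 0;
        seenOrder.push(key);
      }
      counts[key]++;
    }
  }

  return seenOrder.filter(function (key) {
    return counts[key] / pages.length >= minShare;
  });
}
```

Only the outermost line on each side is inspected here; multi-line headers would need the scan extended inward.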
Currently, the TextOrderDetection module assigns an order to the extracted text based on the physical layout.
It could be interesting to keep the original text node order from the PDF in the properties of each element, so that it can still be referred to and this information is not lost.
Recent tests have shown that PDFMiner outperforms pdf2json as a PDF extractor. Its pdf2txt.py -A -t xml -o <output.xml> <input.pdf> command provides an XML output very similar in content to the output of pdf2json.
It could be nice to provide PDFMiner as an extraction option.
Note: PDFMiner's installation depends on python2 and not python3.
Other extractors to be evaluated:
It could be nice to explain the whole pipeline of Parsr diagrammatically for easier understanding and adaptation.
It would be great to store the diagram source files in the repository as well.
doc/architecture.md
doc/assets/architecture-drawing.xxx
Provide some representative samples in the repository under samples/.
Currently, the system accepts either images or PDF files.
It would be a nice feature if the system could also accept DOCX files as input, keeping the hierarchical structure and contents intact and producing a valid types/DocumentRepresentation instance.
Description
Currently, there is no way for the client (GUI, etc) to know which parameters are to be supplied for each module, what the default values are, what their max/min values are, etc.
In the deployed environment, the client needs to be aware of this information which inherently belongs to the server.
An API endpoint /modules returning an object with all this information would also let future server-side modifications propagate to the client automatically.
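To make the proposal concrete, here is one possible shape for the object such an endpoint could return, together with the kind of client-side check it enables. This is a sketch under stated assumptions: the field names (`moduleName`, `parameters`, `default`, `min`, `max`) are hypothetical and do not describe an existing Parsr schema.

```typescript
// All names below are illustrative assumptions, not Parsr's actual schema.
interface ParameterSpec {
  type: "number" | "boolean" | "string";
  default: number | boolean | string;
  min?: number;
  max?: number;
  description?: string;
}

interface ModuleSpec {
  moduleName: string;
  parameters: { [paramName: string]: ParameterSpec };
}

// Builds the payload a GET /modules handler could return, sorted by name
// so that the response is stable between calls.
function buildModulesPayload(specs: ModuleSpec[]): { modules: ModuleSpec[] } {
  const sorted = specs.slice().sort(function (a, b) {
    return a.moduleName.localeCompare(b.moduleName);
  });
  return { modules: sorted };
}

// Client-side check of a user-supplied value against the advertised spec.
function isValidValue(spec: ParameterSpec, value: unknown): boolean {
  if (typeof value !== spec.type) return false;
  if (typeof value === "number") {
    if (spec.min !== undefined && value < spec.min) return false;
    if (spec.max !== undefined && value > spec.max) return false;
  }
  return true;
}
```

With this in place a GUI can render inputs and validate them from the response alone, without hardcoding any module knowledge.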
Export confidence level corresponding to each word.
Paragraphs, lines and words follow an order value unique to their hierarchy level.
I.e., paragraphs are ordered 0, 1, 2, 3, while their respective lines are ordered 0, 1, 2, 3... in the first paragraph, 0, 1, 2, 3... in the second, and so on.
A consistent scheme where all elements follow the same continuous sequence would be ideal.
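One way to obtain such a continuous sequence is a depth-first walk that visits siblings by their existing local order and hands out a single counter. A minimal sketch; `OrderedNode` and the `globalOrder` field are hypothetical names, not part of the current document representation.

```typescript
interface OrderedNode {
  order: number; // order local to the parent, as described above
  children: OrderedNode[];
  globalOrder?: number; // hypothetical field filled in by this sketch
}

// Walks the tree depth-first, visiting siblings by their local order, and
// assigns one continuous counter across every hierarchy level.
// Returns the next unused value of the counter.
function assignGlobalOrder(nodes: OrderedNode[], start: number = 0): number {
  let next = start;
  const sorted = nodes.slice().sort(function (a, b) {
    return a.order - b.order;
  });
  for (let i = 0; i < sorted.length; i++) {
    sorted[i].globalOrder = next++;
    next = assignGlobalOrder(sorted[i].children, next);
  }
  return next;
}
```

In this variant a paragraph is numbered before its own lines, so a parent always precedes its children in the sequence; numbering only the leaves would be an equally valid choice.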
Currently, the output JSON supports exporting the text elements at the granularity levels of word and character, which are the two most fine-grained levels of detail.
It could be nice to make a more compact JSON export possible - keeping the finest granularity to either Line or Paragraph.
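A compact export could be produced by collapsing every subtree at the chosen level into its plain text. The sketch below assumes a simplified node whose `content` is either a string or a list of child nodes; that shape and the names `ExportNode`, `textOf` and `compactTo` are assumptions made for illustration.

```typescript
// Simplified, hypothetical export node: content is text or child nodes.
interface ExportNode {
  type: string; // "paragraph", "line", "word", "character", ...
  content: string | ExportNode[];
}

// Concatenates the text of a subtree. Characters are joined without a
// separator, every other level with a single space.
function textOf(node: ExportNode): string {
  if (typeof node.content === "string") return node.content;
  const children = node.content;
  const parts: string[] = [];
  let separator = " ";
  for (let i = 0; i < children.length; i++) {
    if (children[i].type === "character") separator = "";
    parts.push(textOf(children[i]));
  }
  return parts.join(separator);
}

// Returns a copy of the tree in which nodes of type `finest` carry their
// text directly instead of a list of finer-grained children.
function compactTo(node: ExportNode, finest: string): ExportNode {
  if (typeof node.content === "string") return node;
  if (node.type === finest) return { type: node.type, content: textOf(node) };
  const children = node.content;
  return {
    type: node.type,
    content: children.map(function (child) {
      return compactTo(child, finest);
    }),
  };
}
```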
In an endeavor to support right-to-left languages, it could be useful to determine the text direction (L->R or R->L) from the character types in the document, then adapt the Reading Order algorithm to it.
For example: if the majority of the text is in the Arabic script, adapt the reading order XY-cut algorithm to invert the X axis.
Same goes for top-down languages.
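The majority vote on character types could look like the sketch below. It counts UTF-16 code units in the Hebrew and Arabic blocks against cased letters (Latin, Greek, Cyrillic); that counting rule is a simplification chosen for this example and ignores uncased left-to-right scripts. `mirrorLeft` shows one way to invert X so that an unchanged left-to-right XY-cut visits columns right to left. All names are illustrative.

```typescript
type TextDirection = "ltr" | "rtl";

// True for code units in the Hebrew..Arabic Extended-A range (U+0590..U+08FF)
// and the Hebrew/Arabic presentation forms (U+FB1D..U+FDFF, U+FE70..U+FEFC).
function isRtlCodeUnit(code: number): boolean {
  return (
    (code >= 0x0590 && code <= 0x08ff) ||
    (code >= 0xfb1d && code <= 0xfdff) ||
    (code >= 0xfe70 && code <= 0xfefc)
  );
}

// Majority vote: right-to-left letters against cased (left-to-right) letters.
// Ties and empty input fall back to "ltr".
function detectDirection(text: string): TextDirection {
  let rtl = 0;
  let ltr = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (isRtlCodeUnit(text.charCodeAt(i))) {
      rtl++;
    } else if (ch.toLowerCase() !== ch.toUpperCase()) {
      ltr++;
    }
  }
  return rtl > ltr ? "rtl" : "ltr";
}

// Mirrors a box's left coordinate across the vertical centre of the page.
function mirrorLeft(left: number, width: number, pageWidth: number): number {
  return pageWidth - (left + width);
}
```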
The current basic web interface demo module (in demo/web-viewer) lacks several features (like file type export, source file preview, non-overlapping text elements) which would make it a more effective demo for end-users.
Also to be included: a legend describing which colours mean what.
A possible solution could be to handle each file page by page: if the page has no text from the text extractor, we forward it to the OCR extractor.
The cleanest solution would be to extract images with mupdf, run OCR on them and put the result back into the Document.
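The page-by-page fallback from the first proposal can be sketched as follows. Both extractors are injected as plain functions so the control flow stays visible; the types and names are hypothetical, and real extractors would be asynchronous.

```typescript
interface PageText {
  pageNumber: number;
  words: string[];
  source: "text" | "ocr"; // which extractor produced the words
}

// Stand-in for an extractor: returns the words found on a page.
type PageReader = (pageNumber: number) => string[];

// Uses the text extractor for every page and falls back to OCR only for
// pages where the text extractor found nothing.
function extractPages(pageCount: number, readText: PageReader, readOcr: PageReader): PageText[] {
  const pages: PageText[] = [];
  for (let n = 1; n <= pageCount; n++) {
    const words = readText(n);
    if (words.length > 0) {
      pages.push({ pageNumber: n, words: words, source: "text" });
    } else {
      pages.push({ pageNumber: n, words: readOcr(n), source: "ocr" });
    }
  }
  return pages;
}
```

Pages that mix real text with scanned images are not covered by this rule, which is where the image-extraction approach becomes relevant.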
Summary
An error shows up when docker-compose build is launched in the root directory, causing the installation to fail.
Steps To Reproduce
docker-compose build
Expected behavior
A working installation of Parsr.
Actual behavior
https://gist.github.com/aarohijohal/804196e1f308fd6054eac22201e7892f
Environment
Currently, all text content is output without any formatting information (bold, italics).
If any formatting information is available, it should be used to enrich the markdown output.
This feature has been requested for NLP-based use cases, where bold, italics and other formatting information can be exploited to better identify a definition or a term, for example.
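As an illustration of the markdown enrichment, the sketch below wraps runs of equally styled words in emphasis markers. The `bold` and `italic` flags on each word are an assumption about what the extractor could provide; they are not fields of the current word representation.

```typescript
// Hypothetical word shape carrying style flags from the extractor.
interface StyledWord {
  text: string;
  bold?: boolean;
  italic?: boolean;
}

// Wraps one run of text in the markdown markers matching its style.
function runToMarkdown(run: StyledWord): string {
  if (run.bold && run.italic) return "***" + run.text + "***";
  if (run.bold) return "**" + run.text + "**";
  if (run.italic) return "*" + run.text + "*";
  return run.text;
}

// Merges adjacent words with the same style into one run, so that the
// markers wrap the whole run instead of every single word.
function lineToMarkdown(words: StyledWord[]): string {
  const runs: StyledWord[] = [];
  for (let i = 0; i < words.length; i++) {
    const w = words[i];
    const last = runs.length > 0 ? runs[runs.length - 1] : undefined;
    if (last !== undefined && !!last.bold === !!w.bold && !!last.italic === !!w.italic) {
      last.text += " " + w.text;
    } else {
      runs.push({ text: w.text, bold: w.bold, italic: w.italic });
    }
  }
  return runs.map(runToMarkdown).join(" ");
}
```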
Password-protected PDF files should be treated with something like QPDF (to get rid of the password requirement) before letting them go through the extractor.
Summary
On Arch Linux and some other OSes (maybe Windows as well?), Python 3.x is run using the python command and Python 2.x using the python2 command. Here, we're assuming it's always python3:
Steps To Reproduce
Steps to reproduce the behavior:
1. Have python pointing to Python 3.x.
2. Remove python3 from your path.
Expected behavior
Table extraction should run.
Actual behavior
A node error: Error: spawnSync python3 ENOENT.
Environment
dc3bec7ae62b0db12a075f31c26ba8951a63c2f7
python, not python3
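Rather than hardcoding python3, the command could be resolved once by probing a list of candidates and keeping the first one that reports a 3.x version. This is a minimal sketch: `findPython3` is a hypothetical helper, and the probe is injected so the logic can run on its own. In Node the probe could wrap `child_process.spawnSync(cmd, ["--version"])` and return null when the command is not found.

```typescript
// Returns the first candidate command whose version banner starts with
// "Python 3.", or null when none qualifies.
// `probeVersion` returns the banner printed by `<cmd> --version`, or null
// when the command cannot be spawned (e.g. ENOENT).
function findPython3(
  candidates: string[],
  probeVersion: (cmd: string) => string | null
): string | null {
  for (let i = 0; i < candidates.length; i++) {
    const banner = probeVersion(candidates[i]);
    if (banner !== null && /^Python 3\./.test(banner.trim())) {
      return candidates[i];
    }
  }
  return null;
}
```

Calling it with `["python3", "python"]` keeps the current behaviour where python3 exists and falls back to python on systems set up like the one described above.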
One folder per module would allow us to include the following 3 files for each functionality:
Some documents are filled with text elements having their rendering mode set to 3.
These blocks of text are supposed to be invisible and can thus spoil paragraph detection and other similar algorithms. They should not be removed, though, because they are sometimes used for OCR-enhanced documents, where the detected invisible text overlaps the image.
Make an extractor (or input module) based on the Amazon Textract API.
Summary
An error shows up when docker-compose build is launched in the root directory, causing the installation to fail.
Steps To Reproduce
Steps to reproduce the behavior:
docker-compose build
Expected behavior
A working installation of Parsr.
Actual behavior
https://gist.github.com/aarohijohal/380f197cf990815ca49dc9919fbed211
Environment
[455799ad19e06e4ed4763033c985e279d2f00594](https://github.com/axa-group/Parsr/commit/455799ad19e06e4ed4763033c985e279d2f00594)
The documentation for each module (residing in /docs/modules) doesn't yet reference the module's optional parameters. The only place to check them is the code itself.
It would increase usability if these values could be referred to in the documentation along with their descriptions.
This function is widely used throughout the code and is not straightforward.
It should be tested more thoroughly, including all the various edge cases one can think of.
Summary
Trying to run the container on OpenShift 3 Kubernetes fails. By default, OpenShift runs containers in non-root mode. The program tries to create a directory (dist/output) and this fails.
The code that writes this output directory is here:
https://github.com/axa-group/Parsr/blob/develop/api/server/src/api.ts#L55-L59
Steps To Reproduce
$ oc run parsr --image=axarev/parsr --expose=true --port=3001
...
$ oc logs parsr-1-6wchv
Starting par.sr API : node api/server/dist/index.js
fs.js:115
throw err;
^
Error: EACCES: permission denied, mkdir '/opt/app-root/src/api/server/dist/output'
at Object.mkdirSync (fs.js:753:3)
at new ApiServer (/opt/app-root/src/api/server/dist/api.js:63:16)
at Object.<anonymous> (/opt/app-root/src/api/server/dist/index.js:19:11)
at Module._compile (internal/modules/cjs/loader.js:689:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
at Module.load (internal/modules/cjs/loader.js:599:32)
at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
at Function.Module._load (internal/modules/cjs/loader.js:530:3)
at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
at startup (internal/bootstrap/node.js:283:19)
Expected behavior
The pod should be running. It does run properly in a default GKE cluster.
Actual behavior
The container fails to run, because it cannot create the directory.
It goes in a restart loop.
Environment
Additional context
The fact that OpenShift by default only allows non-root is documented e.g. here:
https://stackoverflow.com/questions/42363105/permission-denied-mkdir-in-container-on-openshift
I think it would be better if we could change the set-up of the container so that it can run as non-root.
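Independently of the container set-up, the server could tolerate a read-only install directory by trying a list of output locations in turn. This is a sketch of that idea only, not the fix adopted in the project; `firstWritableDir` is a hypothetical helper and the directory creator is injected. In Node it could be `(dir) => fs.mkdirSync(dir, { recursive: true })`.

```typescript
// Tries to create each candidate directory in order and returns the first
// one for which creation succeeds. `mkdir` must throw on failure and
// succeed silently when the directory already exists.
function firstWritableDir(candidates: string[], mkdir: (dir: string) => void): string {
  for (let i = 0; i < candidates.length; i++) {
    try {
      mkdir(candidates[i]);
      return candidates[i];
    } catch (err) {
      // e.g. EACCES when running as a non-root user: try the next candidate
    }
  }
  throw new Error("no writable output directory among: " + candidates.join(", "));
}
```

A typical candidate list would start with the configured output path and end with a directory under the system temp folder.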
Module documentation and name should also be present in each module's defaultConfig.json file.
Find a way to have table detection with Tesseract.
Maybe Tesseract has some options to do it.
Maybe we can find a way to pass the bounding boxes and content to Camelot.
Related links
Description
Currently, the default server configuration is supplied along with the project source.
In a deployed environment, the client cannot be automatically notified if the server's defaults are changed.
An API endpoint /default-config could provide this information, which would remove the need to hardcode anything on the client/GUI side.
EDIT: changed the proposition from /defaultConfig to /default-config to follow the API endpoint naming convention.
Input and output modules should also be configurable from an external config.json file. For example, various details about the output file, such as the granularity of the output JSON, could be specified in this file.
Make an extractor (or input module) based on the Google Cloud Vision API.
Add a route to GET / on the API, that displays or links to the documentation of the API.
Several use-cases involve direct interpretation/manipulation of .docx files for translation, ingestion into databases, etc.
Along with markdown, raw text, JSON, it could be a nice feature to have DOCX as an output format.
If certain textual elements are duplicated not at exactly the same coordinates but slightly translated from one another, the duplicate element removal module does not treat them properly.
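Treating two elements as duplicates when their text matches and their positions differ by no more than a small tolerance would cover this case. A minimal sketch, assuming elements reduced to text plus top-left coordinates; the names and the tolerance parameter are illustrative and not the module's current API.

```typescript
interface PlacedText {
  text: string;
  left: number;
  top: number;
}

// Keeps the first occurrence of each element and drops later ones that
// have the same text and lie within `tolerance` units on both axes.
function removeNearDuplicates(items: PlacedText[], tolerance: number): PlacedText[] {
  const kept: PlacedText[] = [];
  for (let i = 0; i < items.length; i++) {
    const item = items[i];
    let duplicate = false;
    for (let k = 0; k < kept.length; k++) {
      if (
        kept[k].text === item.text &&
        Math.abs(kept[k].left - item.left) <= tolerance &&
        Math.abs(kept[k].top - item.top) <= tolerance
      ) {
        duplicate = true;
        break;
      }
    }
    if (!duplicate) kept.push(item);
  }
  return kept;
}
```

The pairwise comparison is quadratic in the number of elements, which is usually acceptable per page; a spatial index would be the next step for very dense pages.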
UPDATE:
After e5cba27 ,
Additional Ideas:
@types/...
http://definitelytyped.org/
UPDATE: Priority shifted to a python/pip packaging
Currently, with some specific documents, the PDF pre-processing step (a call to convert that generates TIFF files) can generate low-resolution images. These images are then passed to Tesseract and the results are quite bad.
We should make sure that:
We had a look at calling Ghostscript directly.
300 dpi seems good enough, and Ghostscript is faster than convert.
$ gs -dNOPAUSE -q -sDEVICE=tiff48nc -dBATCH -sOutputFile=a.tiff -r300 a.pdf -c quit
Note that this creates some pretty huge files.
For easier understanding, it would be worthwhile to make the following renamings:
Summary
The latest version of Parsr published on Docker Hub throws an error when trying to get the queue status for a document previously submitted to the /document endpoint.
Steps To Reproduce
1. Use this docker-compose.yml:
version: '3.3'
services:
  duckling:
    image: axarev/duckling:latest
    ports:
      - 8000:8000
  parsr:
    image: axarev/parsr:latest
    ports:
      - 8080:3000
      - 3001:3001
    environment:
      DUCKLING_HOST: http://duckling:8000
      ABBYY_SERVER_URL:
    volumes:
      - ./pipeline/:/opt/app-root/src/demo/web-viewer/pipeline/
volumes:
  pipeline:
    driver: local
2. Run docker-compose up.
3. POST http://localhost:3001/api/document with a sample PDF and the "exempli gratia" config.json file.
4. GET http://localhost:3001/api/queue/{id} with the ID produced at step 3.
Expected behavior
Get an appropriate response.
Actual behavior
An HTTP 500 Internal Server Error is received with the following payload:
/opt/app-root/src/node_modules/@grpc/grpc-js/build/src/index.js:47
throw new Error(`@grpc/grpc-js only works on Node ${supportedNodeVersions}`);
^
Error: @grpc/grpc-js only works on Node ^8.13.0 || >=10.10.0
at Object.<anonymous> (/opt/app-root/src/node_modules/@grpc/grpc-js/build/src/index.js:47:11)
at Module._compile (module.js:652:30)
at Object.Module._extensions..js (module.js:663:10)
at Module.load (module.js:565:32)
at tryModuleLoad (module.js:505:12)
at Function.Module._load (module.js:497:3)
at Module.require (module.js:596:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/opt/app-root/src/node_modules/google-gax/build/src/grpc.js:37:14)
at Module._compile (module.js:652:30)
Environment
Inside the document representation, the SvgLine object reads:
{
_id: 27627,
_box: BoundingBox {
_left: 211,
_top: 6711,
_width: 4538,
_height: 11
},
_metadata: [],
_properties: {},
_children: [],
content: null,
_thickness: 2,
_fromX: 212,
_fromY: 6725,
_toX: 4750,
_toY: 6725
}
Upon export, the same object reads:
{
id: 27627,
type: 'svg-line',
properties: {},
metadata: [],
box: {
l: 211,
t: 6711,
w: 4538,
h: 11
},
fromX: undefined,
fromY: undefined,
toX: undefined,
toY: 6725,
thickness: 2
}
NOTE: the 'undefined' fields.
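The exported object keeps toY and thickness but loses fromX, fromY and toX. One way to make that kind of slip hard to write is to map each private field explicitly into a typed export shape, so that a missing or misspelled field fails type checking. This is a minimal sketch; the interfaces and the function are illustrative and do not reproduce Parsr's exporter, whose code is not shown here.

```typescript
// Internal fields as they appear in the document representation above.
interface InternalSvgLine {
  _id: number;
  _thickness: number;
  _fromX: number;
  _fromY: number;
  _toX: number;
  _toY: number;
}

// Exported shape: every coordinate is required, so leaving one out of the
// mapping below is a compile-time error rather than a silent undefined.
interface ExportedSvgLine {
  id: number;
  type: "svg-line";
  fromX: number;
  fromY: number;
  toX: number;
  toY: number;
  thickness: number;
}

function exportSvgLine(line: InternalSvgLine): ExportedSvgLine {
  return {
    id: line._id,
    type: "svg-line",
    fromX: line._fromX,
    fromY: line._fromY,
    toX: line._toX,
    toY: line._toY,
    thickness: line._thickness,
  };
}
```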
Each time a file is uploaded, the custom configuration is lost. During development, a custom configuration is required to test Parsr's behaviour, and losing it on every upload is a nightmare.
I would like to reuse the current custom configuration for multiple uploads, and I would like to have a "reset" button that restores the default configuration.
This feature can be done quickly by moving the custom configuration state from the component "/views/Upload" to the store "/vuex/Store".
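The state move could look like the sketch below, written in the shape of a Vuex module (a state factory plus mutations that receive the state as first argument) but without importing Vuex so that it stands alone. The names `customConfig`, `setCustomConfig` and `resetCustomConfig` are hypothetical, and null is assumed to mean "use the default configuration".

```typescript
interface UploadConfigState {
  // null means: no custom configuration, fall back to the default one
  customConfig: { [key: string]: unknown } | null;
}

const uploadConfigModule = {
  state: function (): UploadConfigState {
    return { customConfig: null };
  },
  mutations: {
    // Called when the user edits the configuration; survives later uploads
    // because the state lives in the store and not in the Upload view.
    setCustomConfig: function (state: UploadConfigState, config: { [key: string]: unknown }): void {
      state.customConfig = config;
    },
    // Bound to the "reset" button.
    resetCustomConfig: function (state: UploadConfigState): void {
      state.customConfig = null;
    },
  },
};
```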
Currently, processed files are only stored locally, which can cause problems when using Docker.
The idea is to be able to use different storage backends such as Amazon S3, Azure Storage, local files...
Summary
The file server/src/extractors/extract-fonts.ts:30 uses the which command, which is *nix-only.
For Windows-based clients, this needs to be replaced with where for clients running WinXP 32-bit and above.
A solution for older Windows clients needs to be integrated too.
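A minimal sketch of choosing the lookup command by platform and of guarding against the null output behind the error reported below. Both helper names are hypothetical; the platform string is passed in so the functions stay pure, and in Node it would come from `process.platform`, which is "win32" on Windows.

```typescript
// Picks the executable-lookup command for the given platform string.
function lookupCommand(platform: string): "where" | "which" {
  return platform === "win32" ? "where" : "which";
}

// Joins the output chunks of a spawned process. Returns an empty string
// when the process could not be spawned and the output is null, instead
// of failing on a call to join.
function joinSpawnOutput(output: Array<string | null> | null): string {
  if (output === null) return "";
  let joined = "";
  for (let i = 0; i < output.length; i++) {
    const chunk = output[i];
    if (chunk !== null) joined += chunk;
  }
  return joined;
}
```

An empty result can then be reported as "command not found" with a clear message rather than surfacing as a TypeError.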
Steps To Reproduce
join of null cannot be made. This comes from the fact that the spawn of the command which produces a null output.
Expected behavior
The running of an input/extractor module.
Actual behavior
The following error is observed:
[2019-09-09T05:10:55] ERROR (parsr): Cannot read property 'join' of null
TypeError: Cannot read property 'join' of null
at /Users/me/Code/parsr/dist/src/extractors/extract-fonts.js:30:81
at new Promise (<anonymous>)
at Object.extractFonts (/Users/me/Code/parsr/dist/src/extractors/extract-fonts.js:29:12)
at PdfJsonExtractor.run (/Users/me/Code/parsr/dist/src/extractors/pdf2json/PdfJsonExtractor.js:63:43)
at Orchestrator.run (/Users/me/Code/parsr/dist/src/Orchestrator.js:47:31)
at runOrchestrator (/Users/me/Code/parsr/dist/bin/index.js:97:14)
at main (/Users/me/Code/parsr/dist/bin/index.js:88:5)
at Object.<anonymous> (/Users/me/Code/parsr/dist/bin/index.js:246:1)
at Module._compile (internal/modules/cjs/loader.js:775:14)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)
Environment
Additional context
The following discussions on an alternative can be useful:
The following function can be used to detect the current OS: