dod-advana / gamechanger-data

GAMECHANGER aspires to be the Department’s trusted solution for evidence-based, data-driven decision-making across the universe of DoD requirements

License: MIT License

Python 77.70% Shell 14.54% Dockerfile 3.01% Jupyter Notebook 2.23% PLpgSQL 2.52%

defense policy-as-code policy etl

gamechanger-data's Introduction

Data Engineering

gamechanger-data focuses on the data engineering work of gamechanger. To see all repositories, see gamechanger.

Important Note!

Configuration of the repo relies on access to advana-data-zone's S3 bucket. If you do not have access to that bucket, you will need to fill in your own values in the config scripts, such as topic_models (for ML features) and configure_app (Elasticsearch, Postgres, and Neo4j).
Once the venv is set up, set the DEPLOYMENT_ENV variable and run ./paasJobs/configure_repo.sh or paasJobs/configure_repo.bat
Example: DEPLOYMENT_ENV=local ./paasJobs/configure_repo.sh or, on Windows, set DEPLOYMENT_ENV=local and then run paasJobs\configure_repo.bat
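
The same invocation can be scripted. The following is a minimal sketch, not part of the repo: the helper name configure_repo_invocation is hypothetical, and it only builds the command and environment described above so that they can be passed to subprocess.run from the repository root.

```python
import os
from typing import Dict, List, Optional, Tuple


def configure_repo_invocation(
    deployment_env: str,
    windows: bool = False,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for running the configure script from the repo root.

    Hypothetical helper: it only mirrors the shell examples in this README.
    """
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["DEPLOYMENT_ENV"] = deployment_env
    if windows:
        argv = ["cmd", "/c", r"paasJobs\configure_repo.bat"]
    else:
        argv = ["./paasJobs/configure_repo.sh"]
    return argv, env
```

With the venv active and the repo root as the working directory, something like subprocess.run(argv, env=env, check=True) would then run the script and raise if it exits non-zero.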

(Linux) Dev/Prod Deployment Instructions

Clone fresh gamechanger-data repo
Set up a python3.8 venv with the packages in requirements.txt.
- Create python3.8 venv, e.g. python3 -m venv /opt/gc-venv-20210613
- Before installing packages, update pip/wheel/setuptools, e.g. <venv>/bin/pip install --upgrade pip setuptools wheel
- Install packages from requirements.txt, with no additional dependencies, e.g. <venv>/bin/pip install --no-deps -r requirements.txt
Set up symlink /opt/gc-venv-current to the freshly created venv, e.g. ln -s /opt/gc-venv-20210613 /opt/gc-venv-current
Pull in other dependencies and configure repo with env SCRIPT_ENV=<prod|dev> <repo>/paasJobs/configure_repo.sh
- Config script will let you know if everything was configured correctly and if all backends can be reached.
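
One detail worth noting in the steps above: ln -s fails if /opt/gc-venv-current already exists from a previous deployment. A minimal sketch of one way to handle that, assuming a POSIX filesystem, is shown below. The helper names dated_venv_path and repoint_symlink are hypothetical and are not part of the repo.

```python
import os
from datetime import date
from pathlib import Path


def dated_venv_path(base_dir: Path, day: date) -> Path:
    """Build a dated venv path such as /opt/gc-venv-20210613."""
    return base_dir / f"gc-venv-{day:%Y%m%d}"


def repoint_symlink(link: Path, target: Path) -> None:
    """Point `link` at `target`, replacing an existing symlink if present."""
    tmp = link.with_name(link.name + ".tmp")
    # Clear any leftover temporary link from an interrupted run.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(target)
    # Renaming over the old link swaps it in a single step on POSIX.
    os.replace(tmp, link)
```

Creating the new link under a temporary name and renaming it means there is no moment at which gc-venv-current is missing, which matters if jobs may start during a deployment.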

How to Setup Local Env for Development

MacOS / Linux

(Linux Only) Follow the instructions appropriate to your distribution to install ocrmypdf and its dependencies: https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-linux
(MacOS Only) Install "brew", then use it to install tesseract: brew install tesseract-lang
Install Miniconda or Anaconda (Miniconda is much smaller)
- https://docs.conda.io/en/latest/miniconda.html
Create gamechanger python3.8 environment, like so:
- conda create -n gc python=3.8
Clone the repo and change into that dir: git clone ...; cd gamechanger-data
Activate conda environment and install requirements:
- ‼️ Really important: make sure you change into the repo directory
- conda activate gc
- pip install --upgrade pip setuptools wheel
- pip install -e '.[dev]' (quoting around .[dev] is important)
That's it.
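
Since pip install -e '.[dev]' only works from the repo directory, a quick check before installing can save a confusing error. This is a rough sketch under one assumption: that the repo root contains a setup.py or pyproject.toml, which an editable install requires. The function name preflight_problems is hypothetical.

```python
import sys
from pathlib import Path
from typing import List, Tuple


def preflight_problems(
    cwd: Path, version: Tuple[int, int] = sys.version_info[:2]
) -> List[str]:
    """Return a list of human-readable problems; empty means ready to install."""
    problems = []
    if version != (3, 8):
        problems.append(f"expected Python 3.8, found {version[0]}.{version[1]}")
    # An editable install needs a project file in the current directory.
    if not any((cwd / name).is_file() for name in ("setup.py", "pyproject.toml")):
        problems.append(f"{cwd} has no setup.py or pyproject.toml; cd into the repo first")
    return problems
```

Calling preflight_problems(Path.cwd()) inside the activated gc environment should return an empty list when both the interpreter and the working directory are right.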

Windows (WSL Version)

Setup Windows Subsystem for Linux (WSL) environment
- https://docs.microsoft.com/en-us/windows/wsl/install-win10
(In WSL)
- Install ocrmypdf dependencies following ubuntu instructions here: https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-linux
- Install Miniconda or Anaconda (Miniconda is much smaller)
  - https://docs.conda.io/en/latest/miniconda.html
- Create gamechanger python3.8 environment, like so:
  - conda create -n gc python=3.8
- Clone the repo and change into that dir: git clone ...; cd gamechanger-data
- Activate conda environment and install requirements:
  - ‼️ Really important: make sure you change into the repo directory
  - conda activate gc
  - pip install --upgrade pip setuptools wheel
  - pip install -e '.[dev]' (quoting around .[dev] is important)
- That's it, just activate that conda env if you want to use it inside the terminal.

Windows

Create venv: python -m venv [venv-name]
Activate: [venv-name]\Scripts\activate
Update venv: python -m pip install --upgrade pip setuptools wheel
Install requirements: pip install --no-deps -r dev_tools\requirements\gc-venv-current.txt

Run Configure Repo (see the steps at the top of this README).

To-Do:

Convert .sh scripts to .bat to support Windows users

Docker

docker build -t gc-data --no-cache .
docker rm -f gc-data-test || true
docker run -it --name gc-data-test gc-data

Configure Repo

IDE SETUP

How to Setup PyCharm IDE

Note: If you're using containerized env, you'll need Pro version of PyCharm and separate set of instructions - here

Create new project by opening directory where you cloned the repository. PyCharm will tell you that it sees existing repo there, just accept that and proceed.
With your gc conda environment all good to go, change your "Preferences -> Project -> Python Interpreter" to the EXISTING gc conda env you created. https://www.jetbrains.com/help/pycharm/conda-support-creating-conda-virtual-environment.html
Now, change your "Preferences -> Build, Execution, Deployment -> Console -> Python Console interpreter" to your gc conda interpreter env that you added earlier.
That's it, you will now have correct env in Terminal, Python Console, and elsewhere in the IDE.

How to Setup Visual Studio Code IDE

Note: if you're using containerized env, you'll need setup like this

Open the cloned dir in new workspace and make sure to set your conda gc venv as the python venv https://code.visualstudio.com/docs/python/environments
That's it, when you start new integrated terminals, they'll activate the right environment and the syntax highlighting/autocompletion is going to work as it's supposed to.

Common Issues

My venv is broken somehow!

Delete the old conda environment and create a new one, following the steps above to reinstall it.

License & Contributions

See LICENSE.md (including licensing intent - INTENT.md) and CONTRIBUTING.md

gamechanger-data's Issues

pip fails to install parse-requirements.txt

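At the moment, pip install -r dev_tools/requirements/parse-requirements.txt, which sets up the virtual env for running common.document_parser, fails. The smart_open package declares python_requires '>=3.5.*', which is rejected as an invalid version specifier. The Python version in the conda virtual env is 3.8.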

# at <repo>
conda create -n gc-data-dev python=3.8
conda activate gc-data-dev
pip install -U pip setuptools wheel
pip install -r dev_tools/requirements/parse-requirements.txt
....
Collecting pyparsing>=2.0.2 (from packaging==20.9->-r dev_tools/requirements/parse-requirements.txt (line 34))
  Using cached pyparsing-3.1.1-py3-none-any.whl.metadata (5.1 kB)
Collecting smart-open<4.0.0,>=2.2.0 (from pathy==0.5.2->-r dev_tools/requirements/parse-requirements.txt (line 36))
  Using cached smart_open-3.0.0.tar.gz (113 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      error in smart_open setup command: 'python_requires' must be a string containing valid version specifiers; Invalid specifier: '>=3.5.*'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

So the question is: What is the correct working requirements.txt that I should use to run common.document_parser?
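
For context on the error above: under PEP 440, the ".*" wildcard suffix is only valid with the == and != operators, so '>=3.5.*' is not a valid specifier. The sketch below is only an illustration of that rule using the standard library. The helper names are hypothetical, and it does not replace a proper fix in the affected package's metadata.

```python
import re
from typing import List

# Matches an ordering or compatible-release operator followed by a
# version that ends in ".*", which PEP 440 does not allow.
_BAD_WILDCARD = re.compile(r"^\s*(>=|<=|>|<|~=)\s*([0-9]+(?:\.[0-9]+)*)\.\*\s*$")


def find_invalid_clauses(spec: str) -> List[str]:
    """Return the comma-separated clauses that misuse the '.*' wildcard."""
    return [c.strip() for c in spec.split(",") if _BAD_WILDCARD.match(c)]


def strip_invalid_wildcards(spec: str) -> str:
    """Drop the misplaced '.*' so '>=3.5.*' becomes '>=3.5'."""
    fixed = []
    for clause in spec.split(","):
        match = _BAD_WILDCARD.match(clause)
        fixed.append(f"{match.group(1)}{match.group(2)}" if match else clause.strip())
    return ",".join(fixed)
```

For example, find_invalid_clauses(">=3.6.*,<3.9") flags only the first clause, while "==3.8.*" passes through unchanged because the wildcard is allowed there.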

@drospond @amaruca @amaruca141

Missing requirements.txt file in rhel8.Dockerfile

The file requirements.txt has been removed from the repository or renamed, but rhel8.Dockerfile still references it:
https://github.com/dod-advana/gamechanger-data/blob/dev/dev_tools/docker/k8s/rhel8.Dockerfile#L191

Should it be the file rhel8.locked.requirements.txt instead?

FYI, using rhel8.locked.requirements.txt opens a Pandora's box of dependency entanglement.

CC: @jram930 @drospond

gc_crawler is missing

The tool gc_crawler is missing. From the document

https://github.com/dod-advana/gamechanger-data/blob/dev/dataPipelines/README.md

The many sub packages in <repo>/dataPipelines/gc_crawler/ are used to obtain raw data from web crawlers. - The process is broken up into figuring out which files to download (that's stuff in gc_crawler) and actually downloading them (using /repo/dataPipelines/gc_downloader/)

There is no directory named gc_crawler in /dataPipelines.

CC: @jram930 @drospond

gc_downloader is missing

The tool gc_downloader is missing. From the document

https://github.com/dod-advana/gamechanger-data/blob/dev/dataPipelines/README.md

The many sub packages in <repo>/dataPipelines/gc_crawler/ are used to obtain raw data from web crawlers. - The process is broken up into figuring out which files to download (that's stuff in gc_crawler) and actually downloading them (using /repo/dataPipelines/gc_downloader/)

There is no directory named gc_downloader in /dataPipelines.

@jram930 @drospond

`common.document_parser` fails to parse PDF, object has no attribute 'pageCount'

Summary: The current requirements file dev_tools/requirements/parse-requirements.txt cannot be installed by pip, so the environment for running common.document_parser cannot be built from it. See issue #294. By installing packages one at a time until common.document_parser could run, I arrived at a list of pip packages to install.

See the attached requirements.txt for a tentative requirements file for pip packages.

Even with those packages, it fails to parse PDF documents, with the error: object has no attribute 'pageCount'.

2023-11-07 17:05:59,186 - [INFO] - Finished Processing: <ForkProcess name='ForkPoolWorker-1' parent=952070 started daemon> - Filename: AFI 36-129_USAFE-AFAFRICASUP.pdf
2023-11-07 17:05:59,186 - [INFO] - Processing: <ForkProcess name='ForkPoolWorker-1' parent=952070 started daemon> - Filename: STP 10-92M15-SM-TG.pdf
running policy_analyics.parse on /mnt/extra/gamechanger-download-smallset/STP 10-92M15-SM-TG.pdf
ERROR in policy_analytics.parse: cannot open broken document
2023-11-07 17:05:59,187 - [INFO] - Finished Processing: <ForkProcess name='ForkPoolWorker-1' parent=952070 started daemon> - Filename: STP 10-92M15-SM-TG.pdf
Current Time = 17:05:59
2023-11-07 17:05:59,188 - [INFO] - Documents parsed (or attempted): 6

real	0m9.808s
user	0m9.351s
sys	0m2.399s
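
One possible workaround for the 'pageCount' error is sketched below. It rests on an assumption that is not confirmed in this issue: that the installed PyMuPDF is a newer release in which the camelCase pageCount attribute is no longer available and page_count is used instead. The helper name get_page_count is hypothetical.

```python
def get_page_count(doc) -> int:
    """Return the page count of a PyMuPDF-like document object.

    Tries the snake_case name first, then the older camelCase name,
    and finally falls back to len(doc).
    """
    for attr in ("page_count", "pageCount"):
        value = getattr(doc, attr, None)
        if value is not None:
            return int(value)
    return len(doc)
```

Pinning PyMuPDF to the version the parser was written against would be the alternative, but that version is not stated here.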

References:

Issue #294

CC: @drospond @amaruca @amaruca141

Corrupted PDFs cause mupdf to runtime error in Document Parser

I have many folders of PDFs downloaded from a legacy database, some of which are corrupted and cannot be opened with Adobe Acrobat. I cannot attach an example here, sorry. mupdf struggles with these files and raises an error. Is there some way this could be caught and noted in the output JSON's metadata instead of stopping the entire run?

Stack Trace:

2021-07-01 14:53:54,764 - [INFO] - Document Parser has started
Memory Hard Limit: -1 Soft Limit: -1 Maximum of percentage of memory use: 0.8
____________________ 12826868____________________
2021-07-01 14:54:01.331068: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-07-01 14:54:01.331137: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-07-01 14:54:19,395 - [INFO] - Parsing Multiple Documents: 3
Current Time = 14:54:19
2021-07-01 14:54:19,396 - [INFO] - Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
running policy_analyics.parse on /mnt/c/Users/nwagner/Downloads/test/Document.pdf
2021-07-01 14:54:19,717 - [INFO] - Finished Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
2021-07-01 14:54:19,717 - [INFO] - Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
running policy_analyics.parse on /mnt/c/Users/nwagner/Downloads/test/Document.pdf
mupdf: cannot recognize version marker
mupdf: no objects found
Traceback (most recent call last):
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/nwagner/gamechanger-data/common/document_parser/main.py", line 17, in <module>
    cli()
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/nwagner/gamechanger-data/common/document_parser/cli.py", line 181, in pdf_to_json_cmd_wrapper
    num_ocr_threads=num_ocr_threads
  File "/home/nwagner/gamechanger-data/common/document_parser/cli.py", line 71, in pdf_to_json
    num_ocr_threads=num_ocr_threads
  File "/home/nwagner/gamechanger-data/common/document_parser/process.py", line 189, in process_dir
    single_process(item)
  File "/home/nwagner/gamechanger-data/common/document_parser/process.py", line 112, in single_process
    num_ocr_threads=num_ocr_threads, force_ocr=force_ocr, out_dir=out_dir)
  File "/home/nwagner/gamechanger-data/common/document_parser/parsers/policy_analytics/parse.py", line 37, in parse
    doc_obj = pdf_reader.get_fitz_doc_obj(f_name)
  File "/home/nwagner/gamechanger-data/common/document_parser/lib/pdf_reader.py", line 8, in get_fitz_doc_obj
    doc = fitz.open(f_name)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/fitz/fitz.py", line 2494, in __init__
    _fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))
RuntimeError: no objects found
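
A rough sketch of the behaviour requested in this issue: catch the RuntimeError raised when opening a corrupted file and record it in the output instead of aborting the run. This is not the parser's actual code. The function name and the parse_status and parse_error keys are hypothetical, and the opener and parser are passed in as callables (for example, fitz.open for open_pdf) so the sketch stays independent of any particular PyMuPDF version.

```python
from typing import Any, Callable, Dict


def parse_or_record_failure(
    path: str,
    open_pdf: Callable[[str], Any],
    parse: Callable[[Any], Dict[str, Any]],
) -> Dict[str, Any]:
    """Parse one PDF, returning a failure record if it cannot be opened."""
    try:
        doc = open_pdf(path)
    except RuntimeError as err:
        # Same exception type as in the trace above; keep the run going
        # and note the failure in the output (key names are illustrative).
        return {"filename": path, "parse_status": "failed", "parse_error": str(err)}
    result = parse(doc)
    result.setdefault("parse_status", "ok")
    return result
```

Applied per file inside the directory loop, a corrupted PDF would then produce a small JSON record carrying the error message while the remaining documents are still processed.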

dev_tools/requirements/rhel8.locked.requirements.txt fails to build with existing requirements

The current gamechangerml package downloaded from GitHub, UOT-117914.zip, fails to build in Docker with the requirements file dev_tools/requirements/rhel8.locked.requirements.txt.

#19 45.04 Collecting gamechangerml@ https://github.com/dod-advana/gamechanger-ml/archive/refs/heads/task/UOT-117914.zip (from -r /tmp/requirements.txt (line 38))
#19 45.63   Downloading https://github.com/dod-advana/gamechanger-ml/archive/refs/heads/task/UOT-117914.zip
#19 50.85      - 23.8 MB 11.3 MB/s 0:00:05
#19 51.14   Preparing metadata (setup.py): started
#19 51.24   Preparing metadata (setup.py): finished with status 'error'
#19 51.25   error: subprocess-exited-with-error
#19 51.25   
#19 51.25   × python setup.py egg_info did not run successfully.
#19 51.25   │ exit code: 1
#19 51.25   ╰─> [1 lines of output]
#19 51.25       error in gamechangerml setup command: 'python_requires' must be a string containing valid version specifiers; Invalid specifier: '>=3.6.*'
#19 51.25       [end of output]
#19 51.25   
#19 51.25   note: This error originates from a subprocess, and is likely not a problem with pip.
#19 51.25 error: metadata-generation-failed
#19 51.25 
#19 51.25 × Encountered error while generating package metadata.
#19 51.25 ╰─> See above for output.
#19 51.25 
#19 51.25 note: This is an issue with the package mentioned above, not pip.

Determine the correct version of gamechangerml that can be built successfully in Docker

Summary: The current version, UOT-117914.zip, fails to be built successfully with dev_tools/requirements/rhel8.locked.requirements.txt.

Reason:

#19 45.04 Collecting gamechangerml@ https://github.com/dod-advana/gamechanger-ml/archive/refs/heads/task/UOT-117914.zip (from -r /tmp/requirements.txt (line 38))
#19 45.63   Downloading https://github.com/dod-advana/gamechanger-ml/archive/refs/heads/task/UOT-117914.zip
#19 50.85      - 23.8 MB 11.3 MB/s 0:00:05
#19 51.14   Preparing metadata (setup.py): started
#19 51.24   Preparing metadata (setup.py): finished with status 'error'
#19 51.25   error: subprocess-exited-with-error
#19 51.25   
#19 51.25   × python setup.py egg_info did not run successfully.
#19 51.25   │ exit code: 1
#19 51.25   ╰─> [1 lines of output]
#19 51.25       error in gamechangerml setup command: 'python_requires' must be a string containing valid version specifiers; Invalid specifier: '>=3.6.*'
#19 51.25       [end of output]
#19 51.25   
#19 51.25   note: This error originates from a subprocess, and is likely not a problem with pip.
#19 51.25 error: metadata-generation-failed
#19 51.25 
#19 51.25 × Encountered error while generating package metadata.
#19 51.25 ╰─> See above for output.
#19 51.25 
#19 51.25 note: This is an issue with the package mentioned above, not pip.

Other versions of gamechangerml that fail:

Version	Reason
`v1.8.0`	Package version conflict (1)
`v2.0.0`	Package version conflict (2)

(1) --- v1.8.0

#19 143.0 ERROR: Cannot install -r /tmp/requirements.txt (line 38) and gensim==3.8.3 because these package versions have conflicting dependencies.
#19 143.0 
#19 143.0 The conflict is caused by:
#19 143.0     The user requested gensim==3.8.3
#19 143.0     gamechangerml 1.7.0 depends on gensim==4.1.2

(2) --- v2.0.0

#19 322.2 Collecting awscli==1.27.53 (from gamechangerml@ git+https://github.com/dod-advana/[email protected]>-r /tmp/requirements.txt (line 38))
#19 322.3   Downloading awscli-1.27.53-py3-none-any.whl (4.0 MB)
#19 322.5      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 MB 15.3 MB/s eta 0:00:00
#19 322.7 INFO: pip is looking at multiple versions of gamechangerml to determine which version is compatible with other requirements. This could take a while.
#19 322.7 ERROR: Cannot install -r /tmp/requirements.txt (line 38) and boto3==1.15.18 because these package versions have conflicting dependencies.
#19 322.7 
#19 322.7 The conflict is caused by:
#19 322.7     The user requested boto3==1.15.18
#19 322.7     gamechangerml 1.10.0 depends on boto3~=1.26.50
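
Both conflicts above come from an exact pin in the requirements file disagreeing with what gamechangerml itself requires. The following is a minimal sketch of how such clashes could be spotted before a long Docker build. It only compares exact == pins, so a range such as boto3~=1.26.50 is deliberately not evaluated, and the helper names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches lines of the form "name==version"; other operators are skipped.
_PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;#]+)")


def parse_exact_pins(lines: Iterable[str]) -> Dict[str, str]:
    """Collect {package: version} for every exact '==' pin."""
    pins = {}
    for line in lines:
        match = _PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def pin_conflicts(
    user_lines: Iterable[str], dependency_lines: Iterable[str]
) -> List[Tuple[str, str, str]]:
    """Return (package, user_version, dependency_version) for clashing pins."""
    user = parse_exact_pins(user_lines)
    dep = parse_exact_pins(dependency_lines)
    shared = user.keys() & dep.keys()
    return sorted((n, user[n], dep[n]) for n in shared if user[n] != dep[n])
```

Fed the gensim lines from case (1), this reports gensim pinned at 3.8.3 against a required 4.1.2. Evaluating range specifiers such as the boto3 case in (2) would need a full specifier parser.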

configuration/app-config/docker.json needs updated postgres db name to match with web_schema setup

The current db names for these are set to postgres. However, they should be gc-orchestration and game_changer, because those are the db names required by the sequelize script in _postgres_config_step_1_setup_web_schema.

This config will be used by _postgres_config_step_2_setup_data_schema.

CC: @pathtovectoryconsulting

Using gamechangerml v1.8.0 causes package dependency conflict

Using gamechangerml package version 1.8.0, which actually reports itself as 1.7.0, causes a package dependency conflict.
Snippet from dev_tools/requirements/rhel8.locked.requirements.txt:

gamechangerml @ https://github.com/dod-advana/gamechanger-ml/archive/refs/tags/v1.8.0.tar.gz

Output from the script build.sh in gamechanger/deploy, run with BUILDKIT_PROGRESS=plain:

#19 138.5 Collecting zipp==3.6.0 (from -r /tmp/requirements.txt (line 154))
#19 138.5   Downloading zipp-3.6.0-py3-none-any.whl (5.3 kB)
#19 138.6 Requirement already satisfied: setuptools in /opt/app-root/lib/python3.8/site-packages (from arabic-reshaper==2.1.3->-r /tmp/requirements.txt (line 4)) (68.1.2)
#19 138.6 Requirement already satisfied: wheel<1.0,>=0.23.0 in /opt/app-root/lib/python3.8/site-packages (from astunparse==1.6.3->-r /tmp/requirements.txt (line 5)) (0.41.2)
#19 143.0 INFO: pip is looking at multiple versions of gamechangerml to determine which version is compatible with other requirements. This could take a while.
#19 143.0 ERROR: Cannot install -r /tmp/requirements.txt (line 38) and gensim==3.8.3 because these package versions have conflicting dependencies.
#19 143.0 
#19 143.0 The conflict is caused by:
#19 143.0     The user requested gensim==3.8.3
#19 143.0     gamechangerml 1.7.0 depends on gensim==4.1.2
#19 143.0 
#19 143.0 To fix this you could try to:
#19 143.0 1. loosen the range of package versions you've specified
#19 143.0 2. remove package versions to allow pip attempt to solve the dependency conflict
#19 143.0 
#19 143.0 ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
#19 ERROR: process "/bin/sh -c python3 -m venv \"${APP_VENV}\" --prompt app-root   &&  \"${APP_VENV}/bin/python\" -m pip install --upgrade --no-cache-dir pip setuptools wheel   &&  \"${APP_VENV}/bin/python\" -m pip install --no-cache-dir -r /tmp/requirements.txt   &&  chown -R \"${APP_UID}:${APP_GID}\" \"${APP_VENV}\"" did not complete successfully: exit code: 1
------
 > [15/19] RUN       python3 -m venv "/opt/app-root" --prompt app-root   &&  "/opt/app-root/bin/python" -m pip install --upgrade --no-cache-dir pip setuptools wheel   &&  "/opt/app-root/bin/python" -m pip install --no-cache-dir -r /tmp/requirements.txt   &&  chown -R "1000:1000" "/opt/app-root":
143.0 
143.0 The conflict is caused by:
143.0     The user requested gensim==3.8.3
143.0     gamechangerml 1.7.0 depends on gensim==4.1.2
143.0 
143.0 To fix this you could try to:
143.0 1. loosen the range of package versions you've specified
143.0 2. remove package versions to allow pip attempt to solve the dependency conflict
143.0 
143.0 ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
------
rhel8.Dockerfile:191

Using notation gamechangerml @ git+https://github.com/dod-advana/[email protected] causes the same error.