softwaresaved / jamie

Home Page: http://jamie.trenozoic.net

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 92.49%, HTML 3.72%, CSS 0.27%, JavaScript 3.52%
Topics: python, machine-learning, jobs, classification, rse

jamie's Introduction

jamie

Python 3.8 · Code style: black

Jobs Analysis using Machine Information Extraction (JAMIE) is a tool that aims to monitor and analyse the number of academic jobs, mainly in the UK, that require software skills.

Documentation · Contribution Guidelines · Machine Learning

There is a research software jobs tracker, an instance of jamie that monitors software jobs in UK universities.

Prerequisites

  1. OS. Any UNIX-based OS can be used to run jamie. Development was done on Debian 11 (testing, bullseye); Ubuntu 20.04 should work as well.

  2. Python. Development uses Python 3.8, though later versions should work as well.

  3. Database. Jamie uses MongoDB as the backing store for jobs data. Either install MongoDB locally or connect to a remote MongoDB database by setting a valid MongoDB connection URI (with username and password, if required) in the JAMIE_MONGO_URI environment variable (a connection sketch is shown after this list).

    The database uses the name jobsDB. If a database with that name already exists in MongoDB, either rename it or set a different database name using jamie config db.name <newname>.

  4. Setup. Run jamie setup. This (i) checks the database connection, (ii) downloads the NLTK datasets needed for text cleaning, and (iii) checks that a training set exists.
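
As a rough illustration of how the JAMIE_MONGO_URI setting from step 3 is used, the sketch below connects with pymongo directly. This is not Jamie's own code; the fallback URI and the jobsDB name simply mirror the defaults described above.

    import os
    from pymongo import MongoClient

    # Use JAMIE_MONGO_URI if set; otherwise fall back to a local MongoDB
    uri = os.environ.get("JAMIE_MONGO_URI", "mongodb://localhost:27017")
    client = MongoClient(uri)

    # Confirm the server is reachable before running `jamie setup`
    client.admin.command("ping")

    # Jamie stores job data in a database named jobsDB unless db.name is overridden
    print(client["jobsDB"].list_collection_names())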

Installation

To install using pip:

git clone https://github.com/softwaresaved/jamie.git
cd jamie
python3 -m venv .venv
source .venv/bin/activate
pip install .
pip install ".[dev,docs]"  # For development work (quoted so the shell does not expand the brackets)

How it works

The CLI tool jamie is a wrapper around the Jamie API (see the documentation). Working with Jamie is similar to running a standard machine learning pipeline: we first train a model and then use it to predict whether jobs are software jobs or not. The final step is generating the report.
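
Since the CLI is generated from the Jamie class (python-fire maps its methods to subcommands, as the tracebacks in the issues below show), the same pipeline can in principle be driven from Python. The sketch below rests on the assumption that the method names mirror the CLI subcommands; treat it as illustrative rather than documented API usage.

    import jamie

    # python-fire exposes the methods of jamie.Jamie as CLI subcommands,
    # so these calls are assumed to mirror `jamie scrape/load/train/predict/report`.
    j = jamie.Jamie()

    j.scrape()   # download job adverts to the filesystem
    j.load()     # import scraped jobs into MongoDB
    j.train()    # train a model on the latest training snapshot
    j.predict()  # classify jobs using the latest model snapshot
    j.report()   # generate a report from the latest prediction snapshot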

[Workflow diagram]

You can take a look at the detailed workflow along with the help for the command line interface, or look at how we built the model.

Concurrency. All the steps indicated above with snapshots support multiple snapshots, and independent snapshots can be worked on concurrently. Scraping writes to the filesystem and can be run independently of the other steps as well. Prediction requires read access to the database, so running it concurrently with the load step (which writes to the database) might not work or might result in unpredictable behaviour. This could be fixed by making prediction work from a database snapshot (not currently supported).

Reproducibility. Training the model should be reproducible; the random number seed is set automatically where needed. Scraping is inherently non-reproducible, but loading and cleaning the data should be reproducible (not tested yet). Prediction is non-reproducible as it relies on a mutable database, but generating reports from predictions is reproducible.
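
To illustrate what setting the seed buys, here is a generic scikit-learn sketch (not Jamie's code) where pinning random_state makes a training run repeatable:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy data stands in for the job features; the point is that pinning
    # random_state wherever randomness enters makes repeated runs identical.
    X, y = make_classification(n_samples=200, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # same score on every run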

Usage

Detailed usage can be found in the workflow document.

  1. Configuration: Show the configuration using jamie config, or set a configuration value using jamie config <configname> <value>.

  2. Download jobs: jamie scrape

  3. Load jobs into MongoDB: jamie load. Pass option --dry-run to test.

  4. Training snapshots: A training snapshot is needed to run the machine learning pipeline. First check that the snapshots folder location (jamie config common.snapshots) exists, then copy an existing training set CSV file into the training snapshot location. The file should be called training_set.csv (a quick sanity-check sketch is shown after this list):

    cd `jamie config common.snapshots`
    mkdir -p training/<date>  # date of snapshot
    cp /path/to/training_set.csv training/<date>
    
  5. Train the model: jamie train [<snapshot>]. If snapshot is not specified, uses the latest snapshot.

  6. Predict classification: The previous command creates model snapshots in <snapshots>/models, where <snapshots> is the snapshots location. You can now use these snapshots to make predictions: jamie predict [<snapshot>]. This saves the prediction snapshot in <snapshots>/predictions.

  7. Generate report corresponding to the prediction snapshot: jamie report. The report is created in <snapshots>/reports with the same name as the corresponding prediction snapshot. To view the report, run

    # If snapshot not specified, see latest report
    jamie view-report [<snapshot>]
    

    This will start a local webserver for viewing the report. The report snapshot folder is self-contained and can also be served using standard web servers.
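
Before training on a new snapshot (step 4 above), it can be worth sanity-checking the training CSV. The sketch below assumes pandas is available; the in_uk column is mentioned in the issues further down, the example date is arbitrary, and no other column names are assumed.

    import pandas as pd

    # Path follows the layout from step 4: <snapshots>/training/<date>/training_set.csv
    df = pd.read_csv("training/2020-08-14/training_set.csv")

    print(df.shape)             # number of labelled jobs and columns
    print(df.columns.tolist())  # check the expected columns are present
    print(df["in_uk"].value_counts(dropna=False))  # should all be True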

jamie's People

Contributors

abhidg, marioa, npch, oliph

jamie's Issues

Check sustainability of third-party prerequisites

Do a check of dependent non-core Python (and other) packages to ensure they are in a sustainable, active state of development. Where you can, identify any problematic ones and either replace or remove them if remaining effort allows, or highlight them as a potential future issue (e.g. as you mentioned, the deprecated pytoml package can be removed and a basic JSON config file put in place).
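
As a rough sketch of the suggested replacement, a plain JSON config can be read with the standard library alone. The file name here is hypothetical; the keys follow the jamie config names used in the README.

    import json
    from pathlib import Path

    # Hypothetical config file, shown only to illustrate dropping the pytoml dependency
    config = json.loads(Path("config.json").read_text())

    snapshots_dir = config.get("common.snapshots", "snapshots")
    db_name = config.get("db.name", "jobsDB")
    print(snapshots_dir, db_name)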

Unable to load scraped jobs into MongoDB

Platform: Mac OS X Mojave (10.14.3), Python 3.8.0, MongoDB Community Edition 4.4

Process so far:

  • Installed MongoDB via homebrew, all defaults, enabled as local service, verified connection string (mongodb://localhost)
  • Cloned the jamie repo, and followed the installation instructions (the jamie-snapshots repo is deployed within the default snapshots directory within 'jamie')
  • Run 'jamie setup' successfully
  • Run 'jamie config scrape.njobs 40' to set the number of scraped jobs to a locally manageable number for testing
  • Run 'jamie scrape', with 40 jobs successfully appearing in the 'input' directory
  • Run 'jamie load', at which point I encountered the following error:

2020-08-13T11:21:22+0100 INFO [importer] Already recorded jobs: 0
Traceback (most recent call last):
File "/Users/user/tmp/jamie/.venv/bin/jamie", line 6, in
fire.Fire(jamie.Jamie)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/init.py", line 74, in load
jamie.data.importer.main(self.cf, dry_run=dry_run)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/data/importer.py", line 66, in main
for data in _import_iterator(config["scrape.folder"], skip=recorded_jobs):
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/data/importer.py", line 36, in _import_iterator
job = JobFile(filename).parse()
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/scrape/process.py", line 386, in parse
self.parse_json()
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/scrape/process.py", line 345, in parse_json
self.data[k] = self._get_nested_data(v)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/scrape/process.py", line 331, in _get_nested_data
return get_nested_key(self.data, keys)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/scrape/process.py", line 29, in get_nested_key
o = o[k]
TypeError: list indices must be integers or slices, not str

Any ideas?
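
Not the project's actual code, but the traceback suggests the scraped JSON contains a list where a dict was expected, so indexing it with a string key raises the TypeError. A defensive version of the nested lookup might look like the sketch below; the name and behaviour are assumptions for illustration only.

    def get_nested_key_safe(data, keys):
        """Walk a sequence of string keys through nested dicts, returning
        None instead of raising when the structure is not as expected."""
        o = data
        for k in keys:
            if isinstance(o, list) and len(o) == 1:
                # Scraped pages sometimes wrap a single mapping in a list;
                # unwrap it rather than indexing the list with a string key.
                o = o[0]
            if not isinstance(o, dict) or k not in o:
                return None
            o = o[k]
        return o

    print(get_nested_key_safe({"a": [{"b": 1}]}, ["a", "b"]))  # 1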

Add technical prerequisites section to README

e.g. required packages and their versions (including that of Python), as well as any specifics about the expected state of things like a pre-installed MongoDB (such as port numbers and so on), recommended operating systems, etc.

Add thoughts on how to extend/develop testing

In a markdown file, add your thoughts on how the pytest testing suite could or should be extended and developed, any critical areas that deserve particular testing attention, and things that need to be taken into account in future tests due to the nature of the ML application.

Precision warnings when running 'jamie train'

Platform: Mac OS X Mojave (10.14.3), Python 3.8.0, MongoDB Community Edition 4.4

Process so far:

  • Installed MongoDB via homebrew, all defaults, enabled as local service, verified connection string (mongodb://localhost)
  • Cloned the jamie repo, and followed the installation instructions (the jamie-snapshots repo is deployed within the default snapshots directory within 'jamie')
  • Run 'jamie setup' successfully
  • Run 'jamie config scrape.njobs 500' to set the number of scraped jobs to a locally manageable number for testing
  • Run 'jamie scrape', with 500 jobs successfully appearing in the 'input' directory
  • Run 'jamie load'
  • Run 'jamie train', during the process I get the following warnings:

(.venv) jamie$ jamie train
2020-08-14T09:25:48+0100 INFO [models] Snapshot rse_2020-08-14T09-25-48_0.1
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)
Positive Labels: 321
2020-08-14T09:26:51+0100 INFO [models] Saving features
2020-08-14T09:26:51+0100 INFO [models] Features: 1255 jobs, 28307 features
2020-08-14T09:26:51+0100 INFO [models] Nested cross validation
Adding model: SVC
Adding model: LogReg
Adding model: RandomForest
Adding model: CART
Adding model: GradientBoosting
2020-08-14T09:26:51+0100 INFO [models] [SVC] Model training (rse_2020-08-14T09-25-48_0.1)
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))

Are these anything to worry about? Perhaps they are a result of using a smaller test job data set? In any event, it may be an idea to document this and its cause, along with any remedial action that could be taken.
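
For context, this warning usually means that in some cross-validation fold the classifier predicted no positive samples at all, which is more likely with a small scraped sample; precision is then 0/0 and scikit-learn substitutes 0.0. A minimal reproduction, using the zero_division parameter the warning points to:

    from sklearn.metrics import precision_score

    y_true = [1, 0, 0, 1]
    y_pred = [0, 0, 0, 0]  # no positive predictions, as in the failing folds

    # Emits UndefinedMetricWarning and returns 0.0 by default;
    # zero_division=0 makes the substitution explicit and silences the warning.
    print(precision_score(y_true, y_pred, zero_division=0))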

Clarify role of include_in_study

The old script used include_in_study to filter incoming jobs from jobs.ac.uk so that the jobs are UK-only and academic (not PhD or Masters level). Currently we use training_set.csv, which already contains the in_uk tag, all of whose values are true. There are some (~60) PhD level jobs, some of which have a positive label. There are only 2 Masters level jobs.

My take is that we keep these for training, as (i) they are a minority (<20%) of the training set, and (ii) classifying a job as a software job or not should not depend on the level of the job. For prediction, we currently remove the PhD and Masters jobs before running prediction (see the sketch below).
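
A minimal sketch of the pre-prediction filter described above; the job_level field name and its values are assumptions for illustration, not Jamie's actual schema.

    # Hypothetical records; "job_level" is an assumed field name
    jobs = [
        {"title": "Research Software Engineer", "job_level": "Academic"},
        {"title": "PhD Studentship in HPC", "job_level": "PhD"},
        {"title": "Course Administrator", "job_level": "Masters"},
    ]

    EXCLUDED_LEVELS = {"PhD", "Masters"}

    # Keep PhD and Masters jobs in the training set, but drop them before prediction
    jobs_to_predict = [j for j in jobs if j.get("job_level") not in EXCLUDED_LEVELS]
    print(jobs_to_predict)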

Deprecate MySQL access

Bob is a separate program which runs the data collection (softwaresaved/training-set-collector). Since it seems to have lost much of its information, the canonical training set is now in https://github.com/softwaresaved/jobs-analysis/blob/master/jobAnalysis/dataPrediction/data/training_set/training_set.csv, which has been cleaned by Ania and converted to a JSON format: https://github.com/softwaresaved/jobs-analysis/blob/jamie/jamie/data/tags_summary.json

For simplicity, we will deprecate direct MySQL access to Bob and instead use this JSON output to train the model.

Explain concurrent execution limitations in documentation

You presented an excellent overview of the limitations and possibilities for running Jamie concurrently for different pipeline stages. It would be great to capture this in the documentation (perhaps in a separate markdown file linked from the README), in case there's a need to automate the running of pipeline stages in the future (e.g. as a nightly process).
