softwaresaved / jamie

Home Page: http://jamie.trenozoic.net

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 92.49%, HTML 3.72%, CSS 0.27%, JavaScript 3.52%
Topics: python, machine-learning, jobs, classification, rse

jamie's Introduction

jamie

Python 3.8 · Code style: black

Jobs Analysis using Machine Information Extraction (JAMIE) is a tool that aims to monitor and analyse the number of academic jobs, mainly in the UK, that require software skills.

Documentation · Contribution Guidelines · Machine Learning

There is a research software jobs tracker, an instance of jamie that monitors software jobs in UK universities.

Prerequisites

  1. OS. Any UNIX-based OS can be used to run jamie. Development was done on Debian 11 (testing, bullseye); Ubuntu 20.04 should work as well.

  2. Python. Development uses Python 3.8, though later versions should work as well.

  3. Database. Jamie uses MongoDB as the backing store for jobs data. Either install MongoDB locally or connect to a remote MongoDB database by setting a valid MongoDB connection URI (with username and password, if required) in the JAMIE_MONGO_URI environment variable (a connection sketch is shown after this list).

    The database uses the name jobsDB. If a database with that name already exists in MongoDB, either rename it or set a different database name using jamie config db.name <newname>.

  4. Setup. Run jamie setup. This (i) checks the database connection, (ii) downloads the NLTK datasets needed for text cleaning, and (iii) checks that a training set exists.
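
As a rough illustration of how the JAMIE_MONGO_URI setting from step 3 is used, the sketch below connects with pymongo directly. This is not Jamie's own code; the fallback URI and the jobsDB name simply mirror the defaults described above.

    import os
    from pymongo import MongoClient

    # Use JAMIE_MONGO_URI if set; otherwise fall back to a local MongoDB
    uri = os.environ.get("JAMIE_MONGO_URI", "mongodb://localhost:27017")
    client = MongoClient(uri)

    # Confirm the server is reachable before running `jamie setup`
    client.admin.command("ping")

    # Jamie stores job data in a database named jobsDB unless db.name is overridden
    print(client["jobsDB"].list_collection_names())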

Installation

To install using pip:

git clone https://github.com/softwaresaved/jamie.git
cd jamie
python3 -m venv .venv
source .venv/bin/activate
pip install .
pip install ".[dev,docs]"  # For development work (quoted so the shell does not expand the brackets)

How it works

The CLI tool jamie is a wrapper around the Jamie API (see the documentation). Working with Jamie is similar to running a standard machine learning pipeline: we first train a model and then use it to predict whether jobs are software jobs or not. The final step is generating the report.
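
Since the CLI is generated from the Jamie class (python-fire maps its methods to subcommands, as the tracebacks in the issues below show), the same pipeline can in principle be driven from Python. The sketch below rests on the assumption that the method names mirror the CLI subcommands; treat it as illustrative rather than documented API usage.

    import jamie

    # python-fire exposes the methods of jamie.Jamie as CLI subcommands,
    # so these calls are assumed to mirror `jamie scrape/load/train/predict/report`.
    j = jamie.Jamie()

    j.scrape()   # download job adverts to the filesystem
    j.load()     # import scraped jobs into MongoDB
    j.train()    # train a model on the latest training snapshot
    j.predict()  # classify jobs using the latest model snapshot
    j.report()   # generate a report from the latest prediction snapshot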

[Workflow diagram]

You can take a look at the detailed workflow along with the help for the command line interface, or look at how we built the model.

Concurrency. All the steps indicated above with snapshots support multiple snapshots, and independent snapshots can be worked on concurrently. Scraping writes to the filesystem and can be run independently of the other steps as well. Prediction requires read access to the database, so running it concurrently with the load step (which writes to the database) might not work or might result in unpredictable behaviour. This could be fixed by making prediction work from a database snapshot (not currently supported).

Reproducibility. Training the model should be reproducible; the random number seed is set automatically where needed. Scraping is inherently non-reproducible, but loading and cleaning the data should be reproducible (not tested yet). Prediction is non-reproducible as it relies on a mutable database, but generating reports from predictions is reproducible.
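
To illustrate what setting the seed buys, here is a generic scikit-learn sketch (not Jamie's code) where pinning random_state makes a training run repeatable:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy data stands in for the job features; the point is that pinning
    # random_state wherever randomness enters makes repeated runs identical.
    X, y = make_classification(n_samples=200, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # same score on every run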

Usage

Detailed usage can be found in the workflow document.

  1. Configuration: Show the configuration using jamie config, or set a configuration value using jamie config <configname> <value>.

  2. Download jobs: jamie scrape

  3. Load jobs into MongoDB: jamie load. Pass option --dry-run to test.

  4. Training snapshots: A training snapshot is needed to run the machine learning pipeline. First check that the snapshots folder location (jamie config common.snapshots) exists, then copy an existing training set CSV file into the training snapshot location. The file should be called training_set.csv (a quick sanity-check sketch is shown after this list):

    cd `jamie config common.snapshots`
    mkdir -p training/<date>  # date of snapshot
    cp /path/to/training_set.csv training/<date>
    
  5. Train the model: jamie train [<snapshot>]. If snapshot is not specified, uses the latest snapshot.

  6. Predict classification: The previous command creates model snapshots in <snapshots>/models, where <snapshots> is the snapshots location. You can now use these snapshots to make predictions: jamie predict [<snapshot>]. This saves the prediction snapshot in <snapshots>/predictions.

  7. Generate report corresponding to the prediction snapshot: jamie report. The report is created in <snapshots>/reports with the same name as the corresponding prediction snapshot. To view the report, run

    # If snapshot not specified, see latest report
    jamie view-report [<snapshot>]
    

    This will start a local webserver for viewing the report. The report snapshot folder is self-contained and can also be served using standard web servers.
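
Before training on a new snapshot (step 4 above), it can be worth sanity-checking the training CSV. The sketch below assumes pandas is available; the in_uk column is mentioned in the issues further down, the example date is arbitrary, and no other column names are assumed.

    import pandas as pd

    # Path follows the layout from step 4: <snapshots>/training/<date>/training_set.csv
    df = pd.read_csv("training/2020-08-14/training_set.csv")

    print(df.shape)             # number of labelled jobs and columns
    print(df.columns.tolist())  # check the expected columns are present
    print(df["in_uk"].value_counts(dropna=False))  # should all be True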

jamie's People

Contributors

abhidg, marioa, npch, oliph

jamie's Issues

Check sustainability of third-party prerequisites

Do a check of dependent non-core Python (and other) packages to ensure they are in a sustainable, active state of development. Where you can, identify any problematic ones and either replace or remove them if remaining effort allows, or highlight them as a potential future issue (e.g. as you mentioned, the deprecated pytoml package can be removed and a basic JSON config file put in place).
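
As a rough sketch of the suggested replacement, a plain JSON config can be read with the standard library alone. The file name here is hypothetical; the keys follow the jamie config names used in the README.

    import json
    from pathlib import Path

    # Hypothetical config file, shown only to illustrate dropping the pytoml dependency
    config = json.loads(Path("config.json").read_text())

    snapshots_dir = config.get("common.snapshots", "snapshots")
    db_name = config.get("db.name", "jobsDB")
    print(snapshots_dir, db_name)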

Unable to load scraped jobs into MongoDB

Platform: Mac OS X Mojave (10.14.3), Python 3.8.0, MongoDB Community Edition 4.4

Process so far:

  • Installed MongoDB via homebrew, all defaults, enabled as local service, verified connection string (mongodb://localhost)
  • Cloned the jamie repo, and followed the installation instructions (the jamie-snapshots repo is deployed within the default snapshots directory within 'jamie')
  • Run 'jamie setup' successfully
  • Run 'jamie config scrape.njobs 40' to set the number of scraped jobs to a locally manageable number for testing
  • Run 'jamie scrape', with 40 jobs successfully appearing in the 'input' directory
  • Run 'jamie load', at which point I encountered the following error:

2020-08-13T11:21:22+0100 INFO [importer] Already recorded jobs: 0
Traceback (most recent call last):
File "/Users/user/tmp/jamie/.venv/bin/jamie", line 6, in
fire.Fire(jamie.Jamie)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/init.py", line 74, in load
jamie.data.importer.main(self.cf, dry_run=dry_run)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/data/importer.py", line 66, in main
for data in _import_iterator(config["scrape.folder"], skip=recorded_jobs):
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/data/importer.py", line 36, in _import_iterator
job = JobFile(filename).parse()
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/scrape/process.py", line 386, in parse
self.parse_json()
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/scrape/process.py", line 345, in parse_json
self.data[k] = self._get_nested_data(v)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/scrape/process.py", line 331, in _get_nested_data
return get_nested_key(self.data, keys)
File "/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/jamie/scrape/process.py", line 29, in get_nested_key
o = o[k]
TypeError: list indices must be integers or slices, not str

Any ideas?
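
Not the project's actual code, but the traceback suggests the scraped JSON contains a list where a dict was expected, so indexing it with a string key raises the TypeError. A defensive version of the nested lookup might look like the sketch below; the name and behaviour are assumptions for illustration only.

    def get_nested_key_safe(data, keys):
        """Walk a sequence of string keys through nested dicts, returning
        None instead of raising when the structure is not as expected."""
        o = data
        for k in keys:
            if isinstance(o, list) and len(o) == 1:
                # Scraped pages sometimes wrap a single mapping in a list;
                # unwrap it rather than indexing the list with a string key.
                o = o[0]
            if not isinstance(o, dict) or k not in o:
                return None
            o = o[k]
        return o

    print(get_nested_key_safe({"a": [{"b": 1}]}, ["a", "b"]))  # 1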

Add technical prerequisites section to README

e.g. required packages and their versions (including that of Python), as well as any specifics about the expected state of things like a pre-installed MongoDB (such as port numbers and so on), recommended operating systems, etc.

Add thoughts on how to extend/develop testing

In a markdown file, add your thoughts on how the pytest testing suite could or should be extended and developed, any critical areas that deserve particular testing attention, and things that need to be taken into account in future tests due to the nature of the ML application.

Precision warnings when running 'jamie train'

Platform: Mac OS X Mojave (10.14.3), Python 3.8.0, MongoDB Community Edition 4.4

Process so far:

  • Installed MongoDB via homebrew, all defaults, enabled as local service, verified connection string (mongodb://localhost)
  • Cloned the jamie repo, and followed the installation instructions (the jamie-snapshots repo is deployed within the default snapshots directory within 'jamie')
  • Run 'jamie setup' successfully
  • Run 'jamie config scrape.njobs 500' to set the number of scraped jobs to a locally manageable number for testing
  • Run 'jamie scrape', with 500 jobs successfully appearing in the 'input' directory
  • Run 'jamie load'
  • Run 'jamie train', during the process I get the following warnings:

(.venv) jamie$ jamie train
2020-08-14T09:25:48+0100 INFO [models] Snapshot rse_2020-08-14T09-25-48_0.1
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)
Positive Labels: 321
2020-08-14T09:26:51+0100 INFO [models] Saving features
2020-08-14T09:26:51+0100 INFO [models] Features: 1255 jobs, 28307 features
2020-08-14T09:26:51+0100 INFO [models] Nested cross validation
Adding model: SVC
Adding model: LogReg
Adding model: RandomForest
Adding model: CART
Adding model: GradientBoosting
2020-08-14T09:26:51+0100 INFO [models] [SVC] Model training (rse_2020-08-14T09-25-48_0.1)
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/Users/user/tmp/jamie/.venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))

Are these anything to worry about? Perhaps they are a result of using a smaller test job data set? In any event, it may be an idea to document this and its cause, along with any remedial action that could be taken.
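
For context, this warning usually means that in some cross-validation fold the classifier predicted no positive samples at all, which is more likely with a small scraped sample; precision is then 0/0 and scikit-learn substitutes 0.0. A minimal reproduction, using the zero_division parameter the warning points to:

    from sklearn.metrics import precision_score

    y_true = [1, 0, 0, 1]
    y_pred = [0, 0, 0, 0]  # no positive predictions, as in the failing folds

    # Emits UndefinedMetricWarning and returns 0.0 by default;
    # zero_division=0 makes the substitution explicit and silences the warning.
    print(precision_score(y_true, y_pred, zero_division=0))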

Clarify role of include_in_study

The old script used include_in_study to filter incoming jobs from jobs.ac.uk so that the jobs are UK-only and academic (not PhD or Masters level). Currently we use training_set.csv, which already contains the in_uk tag, all of whose values are true. There are some (~60) PhD level jobs, some of which have a positive label. There are only 2 Masters level jobs.

My take is that we keep these for training, as (i) they are a minority (<20%) of the training set, and (ii) classifying a job as a software job or not should not depend on the level of the job. For prediction, we currently remove the PhD and Masters jobs before running prediction (see the sketch below).
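
A minimal sketch of the pre-prediction filter described above; the job_level field name and its values are assumptions for illustration, not Jamie's actual schema.

    # Hypothetical records; "job_level" is an assumed field name
    jobs = [
        {"title": "Research Software Engineer", "job_level": "Academic"},
        {"title": "PhD Studentship in HPC", "job_level": "PhD"},
        {"title": "Course Administrator", "job_level": "Masters"},
    ]

    EXCLUDED_LEVELS = {"PhD", "Masters"}

    # Keep PhD and Masters jobs in the training set, but drop them before prediction
    jobs_to_predict = [j for j in jobs if j.get("job_level") not in EXCLUDED_LEVELS]
    print(jobs_to_predict)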

Deprecate MySQL access

Bob is a separate program which runs the data collection (softwaresaved/training-set-collector). Since it seems to have lost much of its information, the canonical training set is now in https://github.com/softwaresaved/jobs-analysis/blob/master/jobAnalysis/dataPrediction/data/training_set/training_set.csv, which has been cleaned by Ania and converted to a JSON format: https://github.com/softwaresaved/jobs-analysis/blob/jamie/jamie/data/tags_summary.json

For simplicity, we will deprecate direct MySQL access to Bob and instead use this JSON output to train the model.

Explain concurrent execution limitations in documentation

You presented an excellent overview of the limitations and possibilities for running Jamie concurrently for different pipeline stages. It would be great to capture this in the documentation (perhaps in a separate markdown file linked from the README), in case there's a need to automate the running of pipeline stages in the future (e.g. as a nightly process).
