
keras-english-resume-parser-and-analyzer's Introduction

keras-english-resume-parser-and-analyzer

Deep learning project that parses and analyzes English resumes.

The objective of this project is to use Keras and deep learning models such as CNNs and recurrent neural networks to automate the task of parsing an English resume.

Overview

Parser Features

  • English NLP using NLTK
  • Extracts English text from PDF and DOCX files using pdfminer.six and python-docx
  • A rule-based resume parser has been implemented.

Deep Learning Features

  • Tkinter-based GUI tool to generate and annotate deep learning training data from PDF and DOCX files
  • Deep learning multi-class classification using recurrent and CNN networks for
    • line type: classifies each line of text extracted from a PDF or DOCX file as a header, meta-data, or content
    • line label: classifies each line of text extracted from a PDF or DOCX file as implying experience, education, etc.

The deep learning models that classify each line in the resume files include:

  • cnn.py

    • 1-D CNN with Word Embedding
    • Multi-Channel CNN with categorical cross-entropy loss function
  • cnn_lstm.py

    • 1-D CNN + LSTM with Word Embedding
  • lstm.py

    • LSTM with categorical cross-entropy loss function
    • Bi-directional LSTM/GRU with categorical cross-entropy loss function

Usage 1: Rule-based English Resume Parser

The sample code below shows how to scan all the resumes (in PDF and DOCX formats) in the demo/data/resume_samples folder and print out a summary from the resume parser if the extracted information is available:

from keras_en_parser_and_analyzer.library.rule_based_parser import ResumeParser
from keras_en_parser_and_analyzer.library.utility.io_utils import read_pdf_and_docx


def main():
    data_dir_path = './data/resume_samples' # directory to scan for any pdf and docx files
    collected = read_pdf_and_docx(data_dir_path)
    for file_path, file_content in collected.items():

        print('parsing file: ', file_path)

        parser = ResumeParser()
        parser.parse(file_content)
        print(parser.raw) # print out the raw contents extracted from pdf or docx files

        if not parser.unknown:
            print(parser.summary())

        print('++++++++++++++++++++++++++++++++++++++++++')

    print('count: ', len(collected))


if __name__ == '__main__':
    main()

IMPORTANT: the parser rules are implemented in parser_rules.py. Each of these rules is applied to every line of text in the resume file and returns the extracted target accordingly (or returns None if nothing is found in a line). As these rules are a very naive implementation, you may want to customize them further based on the resumes that you are working with.
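
For illustration, a minimal custom rule might look like the sketch below. The function name extract_email and its exact signature are assumptions made here for illustration only; check parser_rules.py for the signatures actually used:

import re

# Hypothetical rule (name and signature assumed, not taken from
# parser_rules.py): applied to one line of resume text, returns the
# extracted email address, or None if the line contains no email.
def extract_email(line):
    match = re.search(r'[\w.+-]+@[\w-]+\.[\w.]+', line)
    return match.group(0) if match else None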

Usage 2: Deep Learning Resume Parser

Step 1: training data generation and annotation

A training data generation and annotation tool is provided in the demo folder. It allows resume deep learning training data to be generated from any PDF and DOCX files stored in the demo/data/resume_samples folder. To launch this tool, run the following command from the root directory of the project:

cd demo
python create_training_data.py

This will parse the PDF and DOCX files in the demo/data/resume_samples folder and, for each of these files, launch a Tkinter-based GUI form in which the user annotates the individual text lines of the PDF or DOCX file (click the "Type: ..." and "Label: ..." buttons multiple times to select the correct annotation for each line). When each form is closed, the generated and annotated data are saved to a text file in the demo/data/training_data folder. Each line in the text file has the following format:

line_type   line_label  line_content

line_type and line_label have the following mappings to the actual class labels:

line_labels = {0: 'experience', 1: 'knowledge', 2: 'education', 3: 'project', 4: 'others'}
line_types = {0: 'header', 1: 'meta', 2: 'content'}
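
As a minimal sketch, an annotated training file can be read back and decoded as shown below. The file name is hypothetical and a tab separator between the three fields is assumed; the loader actually used by the trainer may differ:

line_labels = {0: 'experience', 1: 'knowledge', 2: 'education', 3: 'project', 4: 'others'}
line_types = {0: 'header', 1: 'meta', 2: 'content'}

# Read one annotated training file and decode the numeric
# line_type / line_label columns into their class names.
with open('./data/training_data/some_resume.txt', encoding='utf-8') as f:
    for row in f:
        line_type, line_label, line_content = row.rstrip('\n').split('\t', 2)
        print(line_types[int(line_type)], line_labels[int(line_label)], line_content)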

Step 2: train the resume parser

After the training data is generated and annotated, one can train the resume parser by running the following command:

cd demo
python dl_based_parser_train.py

Below is the code for dl_based_parser_train.py:

import numpy as np
import os
import sys 


def main():
    random_state = 42
    np.random.seed(random_state)

    current_dir = os.path.dirname(__file__)
    current_dir = current_dir if current_dir != '' else '.'
    output_dir_path = current_dir + '/models'
    training_data_dir_path = current_dir + '/data/training_data'
    
    # add keras_en_parser_and_analyzer module to the system path
    sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
    from keras_en_parser_and_analyzer.library.dl_based_parser import ResumeParser

    classifier = ResumeParser()
    batch_size = 64
    epochs = 20
    history = classifier.fit(training_data_dir_path=training_data_dir_path,
                             model_dir_path=output_dir_path,
                             batch_size=batch_size, epochs=epochs,
                             test_size=0.3,
                             random_state=random_state)


if __name__ == '__main__':
    main()

Upon completion of training, the trained models will be saved in the demo/models/line_label and demo/models/line_type folders.

The default line label and line type classifier used in the deep learning ResumeParser is WordVecBidirectionalLstmSoftmax, but other classifiers can be used by adding the following lines, for example:

from keras_en_parser_and_analyzer.library.dl_based_parser import ResumeParser
from keras_en_parser_and_analyzer.library.classifiers.cnn_lstm import WordVecCnnLstm

classifier = ResumeParser()
classifier.line_label_classifier = WordVecCnnLstm()
classifier.line_type_classifier = WordVecCnnLstm()
...

(Do make sure that the dependencies in requirements.txt are satisfied in your Python environment.)

Step 3: parse resumes using trained parser

After the trained models are saved in the demo/models folder, one can use the resume parser to parse the resumes in demo/data/resume_samples by running the following command:

cd demo
python dl_based_parser_predict.py

Below is the code for dl_based_parser_predict.py:

import os
import sys 


def main():
    current_dir = os.path.dirname(__file__)
    current_dir = current_dir if current_dir != '' else '.'
    sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
    
    from keras_en_parser_and_analyzer.library.dl_based_parser import ResumeParser
    from keras_en_parser_and_analyzer.library.utility.io_utils import read_pdf_and_docx
    
    data_dir_path = current_dir + '/data/resume_samples' # directory to scan for any pdf and docx files

    def parse_resume(file_path, file_content):
        print('parsing file: ', file_path)

        parser = ResumeParser()
        parser.load_model('./models')
        parser.parse(file_content)
        print(parser.raw)  # print out the raw contents extracted from pdf or docx files

        if not parser.unknown:
            print(parser.summary())

        print('++++++++++++++++++++++++++++++++++++++++++')

    collected = read_pdf_and_docx(data_dir_path, command_logging=True,
                                  callback=lambda index, file_path, file_content: parse_resume(file_path, file_content))

    print('count: ', len(collected))


if __name__ == '__main__':
    main()

Configure to run on GPU on Windows

  • Step 1: Change tensorflow to tensorflow-gpu in requirements.txt and install tensorflow-gpu
  • Step 2: Download and install the CUDA® Toolkit 9.0 (Please note that currently CUDA® Toolkit 9.1 is not yet supported by tensorflow, therefore you should download CUDA® Toolkit 9.0)
  • Step 3: Download and unzip cuDNN 7.4 for CUDA® Toolkit 9.0 and add the bin folder of the unzipped directory to the $PATH of your Windows environment
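
After these steps, a quick sanity check that TensorFlow can see the GPU is the snippet below. This uses the TensorFlow 1.x device_lib API, matching the CUDA 9.0 era assumed here; newer TensorFlow versions expose tf.config.list_physical_devices instead:

# List the devices TensorFlow can see; a working GPU setup should
# show a device of type GPU in addition to the CPU.
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())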

keras-english-resume-parser-and-analyzer's People

Contributors

chen0040


keras-english-resume-parser-and-analyzer's Issues

"Header" and "Meta" labels' meaning

Hello,
First of all thank you for sharing your work!
I have a question.
What is the meaning of the Header and Meta labels?
For example, the "Content" label is the one we are looking for (education, skills etc.), but for the two labels above, I can't understand how they could contribute to a more accurate result (they are not relevant to the output for the end user, so I suppose they could help the deep learning model better understand the resume's content).
The Header label is not even printed by the summary function.

Can we add more items to the dictionaries line_type and line_label?

I need to add more labels to line_type and line_label.
Currently they have the following mappings to the actual class labels:

line_labels = {0: 'experience', 1: 'knowledge', 2: 'education', 3: 'project', 4: 'others'}
line_types = {0: 'header', 1: 'meta', 2: 'content'}

How can I modify this?

Word from a line annotation issue

Thanks!! The parser is really good.
I just have a question. In the annotation GUI we annotate a line with our predefined labels. I have one case, i.e. a line with "having 4 years of experience in iOS application": I only want "4 years" annotated as "Experience", but I will get the entire line, which will then make it difficult for me to do further analysis with the data...
The case is the same for "Skills": if the document has the line "proficient with objective c, swift", it will again annotate the whole line, whereas I need only "objective c" and "swift".

So my query is whether it is possible to annotate a particular word from a line, or if anyone can suggest a way to do that with the current parser!!

Thank you for the help!!

Failing to load parser while predicting

After training the resume parser with the sample data, when I tried to run dl_based_parser_predict I got an issue loading the ResumeParser.
The issue was:
2019-11-28 15:25:02.165654: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-28 15:25:02.187682: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1800000000 Hz
2019-11-28 15:25:02.188639: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x654a2d0 executing computations on platform Host. Devices:
2019-11-28 15:25:02.188660: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2019-11-28 15:25:02.224289: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Traceback (most recent call last):
File "/python3.6/site-packages/numpy/lib/npyio.py", line 453, in load
pickle_kwargs=pickle_kwargs)
File "/python3.6/site-packages/numpy/lib/format.py", line 722, in read_array
raise ValueError("Object arrays cannot be loaded when "
ValueError: Object arrays cannot be loaded when allow_pickle=False

I am stuck on this. Please help.

Runtime Error on running training python file.

I have Python version 3.7 on OSX Catalina.

I am getting this error:

 File "dl_based_parser_train.py", line 30, in <module>
    main()
  File "dl_based_parser_train.py", line 9, in main
    from keras_en_parser_and_analyzer.library.dl_based_parser import ResumeParser
  File "../keras_en_parser_and_analyzer/library/dl_based_parser.py", line 2, in <module>
    from keras_en_parser_and_analyzer.library.classifiers.lstm import WordVecBidirectionalLstmSoftmax
  File "../keras_en_parser_and_analyzer/library/classifiers/lstm.py", line 1, in <module>
    from keras.callbacks import ModelCheckpoint
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/keras/__init__.py", line 3, in <module>
    from . import utils
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/keras/utils/__init__.py", line 6, in <module>
    from . import conv_utils
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/keras/utils/conv_utils.py", line 9, in <module>
    from .. import backend as K
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/keras/backend/__init__.py", line 1, in <module>
    from .load_backend import epsilon
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/keras/backend/load_backend.py", line 90, in <module>
    from .tensorflow_backend import *
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 5, in <module>
    import tensorflow as tf
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/tensorflow/__init__.py", line 101, in <module>
    from tensorflow_core import *
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/tensorflow_core/__init__.py", line 40, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 959, in _find_and_load_unlocked
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/tensorflow/__init__.py", line 50, in __getattr__
    module = self._load()
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/tensorflow/__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/tensorflow_core/python/__init__.py", line 64, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/tensorflow_core/core/framework/graph_pb2.py", line 7, in <module>
    from google.protobuf import descriptor as _descriptor
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/google/protobuf/__init__.py", line 37, in <module>
    __import__('pkg_resources').declare_namespace(__name__)
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/pkg_resources/__init__.py", line 84, in <module>
    __import__('pkg_resources.extern.packaging.requirements')
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/pkg_resources/_vendor/packaging/requirements.py", line 9, in <module>
    from pkg_resources.extern.pyparsing import stringStart, stringEnd, originalTextFor, ParseException
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py", line 4756, in <module>
    _escapedPunc = Word( _bslash, r"\[]-*.$+^?()~ ", exact=2 ).setParseAction(lambda s,l,t:t[0][1])
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py", line 1284, in setParseAction
    self.parseAction = list(map(_trim_arity, list(fns)))
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py", line 1066, in _trim_arity
    this_line = extract_stack(limit=2)[-1]
  File "/Volumes/Macintosh HD/keras-parser/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py", line 1050, in extract_stack
    frame_summary = traceback.extract_stack(limit=-offset+limit-1)[offset]
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/traceback.py", line 211, in extract_stack
    stack = StackSummary.extract(walk_stack(f), limit=limit)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/traceback.py", line 363, in extract
    f.line
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/traceback.py", line 285, in line
    self._line = linecache.getline(self.filename, self.lineno).strip()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/linecache.py", line 16, in getline
    lines = getlines(filename, module_globals)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/linecache.py", line 48, in getlines
    for mod in sys.modules.values():
RuntimeError: dictionary changed size during iteration

Change language

Hello, is it possible (and how) to change the language, so it will work for other languages (e.g. Russian)?

Access text content based on Labels

Hey @chen0040 ,

First of all, thank you for such a clean and crisp tool! I highly appreciate your time. I've been playing with the tool for a few days now, and I've got the hang of the flow and the code. I was wondering whether there is currently any way to access specific text content by its label, say the label 'Education' or the label 'Experience'. Currently I was only able to retrieve the raw content from the resume; the result is pretty good, but I'm just wondering if that could be extended further. I understand that adding such an extension is not too straightforward. If you don't already have such a service, I'm willing to contribute. Let me know what you think.

Training data requirement

Hello there

Great work! I would just like to know how much training data would be sufficient to train the model, and whether there is already a training data repo available that I could use for my project?

Thanks in Advance

Farooq

ModuleNotFoundError: No module named 'exceptions'

Hi,

When running create_training_data.py, I got this error. I came across a closed issue similar to this and tried that step, but that step is specified for Windows, whereas I am using macOS.

I would really appreciate a solution to this, or a pointer if I missed anything.

macOS: Big Sur
Python: 3.7.6


Thank you.

Zero Size Error Reduction Operation

I am getting the following error while running the training file.

2020-03-16 22:42:48.232059: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-16 22:42:48.255069: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f8cebe02f00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-16 22:42:48.255094: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "dl_based_parser_train.py", line 30, in <module>
    main()
  File "dl_based_parser_train.py", line 26, in main
    random_state=random_state)
  File "../keras_en_parser_and_analyzer/library/dl_based_parser.py", line 42, in fit
    random_state=random_state)
  File "../keras_en_parser_and_analyzer/library/dl_based_parser.py", line 66, in fit_line_label
    random_state=random_state)
  File "../keras_en_parser_and_analyzer/library/classifiers/lstm.py", line 360, in fit
    Y = np_utils.to_categorical(ys, len(self.labels))
  File "/Users/rajabose/.pyenv/versions/keras-parser/lib/python3.7/site-packages/keras/utils/np_utils.py", line 49, in to_categorical
    num_classes = np.max(y) + 1
  File "<__array_function__ internals>", line 6, in amax
  File "/Users/rajabose/.pyenv/versions/keras-parser/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2668, in amax
    keepdims=keepdims, initial=initial, where=where)
  File "/Users/rajabose/.pyenv/versions/keras-parser/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 90, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity

ValueError: You are passing a target array of shape (7, 1) while using as loss `categorical_crossentropy`. `categorical_crossentropy` expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

I think I am not able to train the resume parser. Can anyone provide the sample files? I have tried multiple PDF and DOCX files but it is not working for any of them. Also, while training I am not able to navigate through the text below; I tried the scrolling suggestion from the issues panel, but that is also not working. Is there any completed code with sample data?

Cannot train model

ValueError: You are passing a target array of shape (3, 1) while using as loss categorical_crossentropy. categorical_crossentropy expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

from keras.utils import to_categorical
y_binary = to_categorical(y_int)

Alternatively, you can use the loss function sparse_categorical_crossentropy instead, which does expect integer targets.

corrected sentence

Dear Xianshun
I have a dataset in which speech has been converted to text, and the resulting sentences are wrong and must be corrected. Can your parser repair my wrong text?
