reeboot-box-plot

Educational Resource Strategies: Multi-label, multi-class prediction


A) Problem description:

According to Driven Data, budgets for schools and school districts are huge, complex, and unwieldy. It's no easy task to digest where and how schools are using their resources. Education Resource Strategies is a non-profit that tackles just this task with the goal of letting districts be smarter, more strategic, and more effective in their spending. Our task is a multi-class-multi-label classification problem with the goal of attaching canonical labels to the freeform text in budget line items. These labels let ERS understand how schools are spending money and tailor their strategy recommendations to improve outcomes for students, teachers, and administrators.

In order to compare budget or expenditure data across districts, ERS assigns every line item to certain categories in a comprehensive financial spending framework. For instance, Object_Type describes what the spending "is"—Base Salary/Compensation, Benefits, Stipends & Other Compensation, Equipment & Equipment Lease, Property Rental, and so on. Other categories describe what the spending "does," which groups of students benefit, and where the funds come from. Once this process is complete, we can finally offer cross-district insight into a partner's finances. We might observe that a particular partner spends more on facilities and maintenance than peer districts, or staffs teaching assistants more richly. These findings are not in themselves good or bad—they depend on the context, goals, and strategy of the partner district.

This task (which we call financial coding) is very time and labor-intensive.

B) Data description:

Data source: Driven Data

Feature list:

  • FTE (float) - If an employee, the percentage of full-time that the employee works.
  • Facility_or_Department - If expenditure is tied to a department/facility, that department/facility.
  • Function_Description - A description of the function the expenditure was serving.
  • Fund_Description - A description of the source of the funds.
  • Job_Title_Description - If this is an employee, a description of that employee's job title.
  • Location_Description - A description of where the funds were spent.
  • Object_Description - A description of what the funds were used for.
  • Position_Extra - Any extra information about the position that we have.
  • Program_Description - A description of the program that the funds were used for.
  • SubFund_Description - More detail on Fund_Description
  • Sub_Object_Description - More detail on Object_Description
  • Text_1 - Any additional text supplied by the district.
  • Text_2 - Any additional text supplied by the district.
  • Text_3 - Any additional text supplied by the district.
  • Text_4 - Any additional text supplied by the district.
  • Total (float) - The total cost of the expenditure.

Label list:

  • Function:
    • Aides Compensation
    • Career & Academic Counseling
    • Communications
    • Curriculum Development
    • Data Processing & Information Services
    • Development & Fundraising
    • Enrichment
    • Extended Time & Tutoring
    • Facilities & Maintenance
    • Facilities Planning
    • Finance, Budget, Purchasing & Distribution
    • Food Services
    • Governance
    • Human Resources
    • Instructional Materials & Supplies
    • Insurance
    • Legal
    • Library & Media
    • NO_LABEL
    • Other Compensation
    • Other Non-Compensation
    • Parent & Community Relations
    • Physical Health & Services
    • Professional Development
    • Recruitment
    • Research & Accountability
    • School Administration
    • School Supervision
    • Security & Safety
    • Social & Emotional
    • Special Population Program Management & Support
    • Student Assignment
    • Student Transportation
    • Substitute Compensation
    • Teacher Compensation
    • Untracked Budget Set-Aside
    • Utilities
  • Object_Type:
    • Base Salary/Compensation
    • Benefits
    • Contracted Services
    • Equipment & Equipment Lease
    • NO_LABEL
    • Other Compensation/Stipend
    • Other Non-Compensation
    • Rent/Utilities
    • Substitute Compensation
    • Supplies/Materials
    • Travel & Conferences
  • Operating_Status:
    • Non-Operating
    • Operating, Not PreK-12
    • PreK-12 Operating
  • Position_Type:
    • (Exec) Director
    • Area Officers
    • Club Advisor/Coach
    • Coordinator/Manager
    • Custodian
    • Guidance Counselor
    • Instructional Coach
    • Librarian
    • NO_LABEL
    • Non-Position
    • Nurse
    • Nurse Aide
    • Occupational Therapist
    • Other
    • Physical Therapist
    • Principal
    • Psychologist
    • School Monitor/Security
    • Sec/Clerk/Other Admin
    • Social Worker
    • Speech Therapist
    • Substitute
    • TA
    • Teacher
    • Vice Principal
  • Pre_K:
    • NO_LABEL
    • Non PreK
    • PreK
  • Reporting:
    • NO_LABEL
    • Non-School
    • School
  • Sharing:
    • Leadership & Management
    • NO_LABEL
    • School Reported
    • School on Central Budgets
    • Shared Services
  • Student_Type:
    • Alternative
    • At Risk
    • ELL
    • Gifted
    • NO_LABEL
    • Poverty
    • PreK
    • Special Education
    • Unspecified
  • Use:
    • Business Services
    • ISPD
    • Instruction
    • Leadership
    • NO_LABEL
    • O&M
    • Pupil Services & Enrichment
    • Untracked Budget Set-Aside

C) Files:

  1. Initial data preparation.ipynb: Create files train.csv, test.csv, labels.csv
  2. Create multi column text data set.ipynb: Preprocess text features to save train_multi_column_text.csv and test_multi_column_text.csv
  3. logistic base model.ipynb: Create online logistic one vs rest models for each category of labels

D) Evaluation metric:

Multi-multiclass log loss:

    logloss = (1/K) * sum_{k=1..K} [ -(1/N) * sum_{i=1..N} sum_{j=1..C_k} y_{k,i,j} * log(p_{k,i,j}) ]

In this calculation, K is the number of dependent variables (label columns), N is the number of rows being evaluated, and C_k is the number of classes the k-th variable can take on; y_{k,i,j} is 1 if class j is the true class of row i for variable k (and 0 otherwise), and p_{k,i,j} is the corresponding predicted probability. If the predicted probabilities for a row do not sum to 1, they are normalized before scoring. The goal is to minimize this multi-multiclass log loss.
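As a concrete illustration of the metric (a sketch only: the function name and the list-of-arrays layout are assumptions, not the competition's actual scoring code), the column-averaged log loss can be computed like this:

```python
import numpy as np

def multi_multiclass_log_loss(y_true_list, y_pred_list, eps=1e-15):
    """Average column-wise log loss over K dependent variables.

    y_true_list / y_pred_list: length-K lists of (N, C_k) arrays holding
    one-hot true labels and predicted probabilities, respectively.
    """
    losses = []
    for y_true, y_pred in zip(y_true_list, y_pred_list):
        # Normalize rows so probabilities sum to 1, then clip for stability.
        y_pred = y_pred / y_pred.sum(axis=1, keepdims=True)
        y_pred = np.clip(y_pred, eps, 1 - eps)
        losses.append(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))
    return float(np.mean(losses))
```

A perfect prediction scores (near) zero, and a uniform guess over two classes scores log(2) ≈ 0.693, which gives a feel for the score range in the models table below.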

E) Final Model Description:

E.1) Missing values:

Even though the data set contains many input columns, it is important to check each of them for missing values.

[missing-values heatmap]

The missing-values heatmap above makes it clear that not all feature columns carry enough information. In fact, apart from Object_Description and Total, the other features have too many missing values to impute. Hence, we are forced to drop many of the features, such as Facility_or_Department, Text_1, Text_2, Text_3, Text_4, SubFund_Description, and Position_Extra.

E.2) Data Preparation:

In the data preparation step, we clean the text features. First we expand contracted words: for example, "don't" becomes "do not", "wouldn't" becomes "would not", and so on. After this we remove all special characters, punctuation marks, and stopwords. Since this cleaning can leave some samples as empty strings, we fill any empty string with the word "empty".
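The cleaning steps above can be sketched as follows (the stopword set and contraction map here are small illustrative subsets, not the repository's full lists):

```python
import re

# Illustrative subsets -- the real notebook likely uses a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in"}
CONTRACTIONS = {"don't": "do not", "wouldn't": "would not", "can't": "can not"}

def clean_text(text):
    text = str(text).lower()
    # Expand contractions before stripping punctuation, while the
    # apostrophe is still present to match against.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop special characters and punctuation, keeping letters and digits.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Samples emptied by cleaning are filled with the placeholder "empty".
    return " ".join(tokens) if tokens else "empty"
```

For example, `clean_text("Don't buy the supplies!")` yields `"do not buy supplies"`, and a line item consisting only of punctuation becomes `"empty"`.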

E.3) Final model:

After trying many different feature-engineering approaches (entropy features, numeric-feature binning, single-column text vectorization, etc.), the final model uses TF-IDF vectorization of the multiple text feature columns as given in the original data set. Alongside these text features, normalized and standardized numeric features are also used.
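A minimal sketch of this featurization, assuming sklearn's TfidfVectorizer and StandardScaler (the function name `build_features` and the column names in the usage example are illustrative):

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

def build_features(df, text_cols, numeric_cols):
    """TF-IDF each text column separately, scale numerics, stack sparse."""
    blocks, vectorizers = [], {}
    for col in text_cols:
        # One vocabulary per text column, mirroring the multi-column setup.
        vec = TfidfVectorizer()
        blocks.append(vec.fit_transform(df[col]))
        vectorizers[col] = vec
    scaler = StandardScaler()
    numeric = scaler.fit_transform(df[numeric_cols].fillna(0.0))
    blocks.append(csr_matrix(numeric))
    return hstack(blocks).tocsr(), vectorizers, scaler
```

Fitting one vectorizer per column (rather than concatenating all text first) keeps each column's vocabulary and IDF weights separate, which is what distinguishes this setup from the "condensed features" variant that did not work out.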

After trying many different modeling approaches, the final model is a simple online logistic-regression classifier built with sklearn's SGDClassifier, which uses stochastic gradient descent. The reason for custom-building an online classifier is that sklearn's LogisticRegression fits the model on the entire data set at once, which makes it slower and harder to run on large data. Stochastic gradient descent, by contrast, is easy to fit and fast on large data sets.

Early stopping is an important feature when building the classifier, as it helps avoid overfitting. Using a tolerance on the iterative improvement (here 0.001), we can stop the training iterations early once the loss stops improving by more than the tolerance.

Analogous to sklearn's LogisticRegression parameter max_iter, a max_epoch parameter is set to 1000 epoch iterations.

Finally, before providing data to the model, all features are stacked together using scipy's sparse hstack, which yields the stacked data as a sparse SciPy matrix. This is done because the online classifier makes much faster passes over a sparse matrix, which speeds up the model even further.

Instead of using pure stochastic gradient descent, we use mini-batch gradient descent. At this link, the author provides an iter_minibatches function that takes chunksize, X, and y as parameters and yields randomly selected x_chunk, y_chunk pairs of the given chunk size.
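The linked function is not reproduced here; the following is a hedged reimplementation of the described behavior (random selection, fixed chunk size):

```python
import numpy as np

def iter_minibatches(chunksize, X, y):
    """Yield random (x_chunk, y_chunk) pairs of at most the given size."""
    n = X.shape[0]
    # Shuffle once, then walk the permutation in chunks so that every
    # row is visited exactly once per epoch.
    order = np.random.permutation(n)
    for start in range(0, n, chunksize):
        idx = order[start:start + chunksize]
        # Fancy row indexing also works on scipy CSR matrices.
        yield X[idx], y[idx]
```

Each epoch of the mini-batch training then becomes `for x_chunk, y_chunk in iter_minibatches(chunksize, X, y): clf.partial_fit(x_chunk, y_chunk, classes=classes)`.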

Last but not least, we run this mini-batch logistic-regression model one-vs-rest on every class of every label. Across the 9 label columns there are 104 classes in total, so the model is trained 104 times.

Models:

| Model description | Score |
| --- | --- |
| Logistic, hashing vectorizer without numeric features | 0.8965 |
| Numeric features with NA filled by mean | 0.8201 |
| Logistic, hashing vectorizer + numeric features + separate entropy features for each label | 0.7934 |
| Logistic, TF-IDF vectorizer + numeric features + entropy features | 0.5952 |
| Same as above with multi-text columns | 0.5582 |
| Logistic, multi-text TF-IDF + original numeric features with scaling + feature drop | 0.5381 |
| Same as previous model with 3 more features dropped | 0.5314 |

Things that did not work out:

  • We used condensed features, meaning all the text features were collapsed into a single feature
  • Used numeric feature binning
  • Used target encoded features
  • Used entropy based features
  • Fit mini-batch model on labels instead of classes of labels.
