reeboot-box-plot

Educational Resource Strategies: Multi-label, multi-class prediction


A) Problem description:

According to Driven Data, budgets for schools and school districts are huge, complex, and unwieldy. It's no easy task to digest where and how schools are using their resources. Education Resource Strategies is a non-profit that tackles just this task with the goal of letting districts be smarter, more strategic, and more effective in their spending. Our task is a multi-class-multi-label classification problem with the goal of attaching canonical labels to the freeform text in budget line items. These labels let ERS understand how schools are spending money and tailor their strategy recommendations to improve outcomes for students, teachers, and administrators.

In order to compare budget or expenditure data across districts, ERS assigns every line item to certain categories in a comprehensive financial spending framework. For instance, Object_Type describes what the spending "is"—Base Salary/Compensation, Benefits, Stipends & Other Compensation, Equipment & Equipment Lease, Property Rental, and so on. Other categories describe what the spending "does," which groups of students benefit, and where the funds come from. Once this process is complete, we can finally offer cross-district insight into a partner's finances. We might observe that a particular partner spends more on facilities and maintenance than peer districts, or staffs teaching assistants more richly. These findings are not in themselves good or bad—they depend on the context, goals, and strategy of the partner district.

This task (which we call financial coding) is very time and labor-intensive.

B) Data description:

Data source: Driven Data

Feature list:

  • FTE (float) - If an employee, the percentage of full-time that the employee works.
  • Facility_or_Department - If expenditure is tied to a department/facility, that department/facility.
  • Function_Description - A description of the function the expenditure was serving.
  • Fund_Description - A description of the source of the funds.
  • Job_Title_Description - If this is an employee, a description of that employee's job title.
  • Location_Description - A description of where the funds were spent.
  • Object_Description - A description of what the funds were used for.
  • Position_Extra - Any extra information about the position that we have.
  • Program_Description - A description of the program that the funds were used for.
  • SubFund_Description - More detail on Fund_Description
  • Sub_Object_Description - More detail on Object_Description
  • Text_1 - Any additional text supplied by the district.
  • Text_2 - Any additional text supplied by the district.
  • Text_3 - Any additional text supplied by the district.
  • Text_4 - Any additional text supplied by the district.
  • Total (float) - The total cost of the expenditure.

Label list:

  • Function:
    • Aides Compensation
    • Career & Academic Counseling
    • Communications
    • Curriculum Development
    • Data Processing & Information Services
    • Development & Fundraising
    • Enrichment
    • Extended Time & Tutoring
    • Facilities & Maintenance
    • Facilities Planning
    • Finance, Budget, Purchasing & Distribution
    • Food Services
    • Governance
    • Human Resources
    • Instructional Materials & Supplies
    • Insurance
    • Legal
    • Library & Media
    • NO_LABEL
    • Other Compensation
    • Other Non-Compensation
    • Parent & Community Relations
    • Physical Health & Services
    • Professional Development
    • Recruitment
    • Research & Accountability
    • School Administration
    • School Supervision
    • Security & Safety
    • Social & Emotional
    • Special Population Program Management & Support
    • Student Assignment
    • Student Transportation
    • Substitute Compensation
    • Teacher Compensation
    • Untracked Budget Set-Aside
    • Utilities
  • Object_Type:
    • Base Salary/Compensation
    • Benefits
    • Contracted Services
    • Equipment & Equipment Lease
    • NO_LABEL
    • Other Compensation/Stipend
    • Other Non-Compensation
    • Rent/Utilities
    • Substitute Compensation
    • Supplies/Materials
    • Travel & Conferences
  • Operating_Status:
    • Non-Operating
    • Operating, Not PreK-12
    • PreK-12 Operating
  • Position_Type:
    • (Exec) Director
    • Area Officers
    • Club Advisor/Coach
    • Coordinator/Manager
    • Custodian
    • Guidance Counselor
    • Instructional Coach
    • Librarian
    • NO_LABEL
    • Non-Position
    • Nurse
    • Nurse Aide
    • Occupational Therapist
    • Other
    • Physical Therapist
    • Principal
    • Psychologist
    • School Monitor/Security
    • Sec/Clerk/Other Admin
    • Social Worker
    • Speech Therapist
    • Substitute
    • TA
    • Teacher
    • Vice Principal
  • Pre_K:
    • NO_LABEL
    • Non PreK
    • PreK
  • Reporting:
    • NO_LABEL
    • Non-School
    • School
  • Sharing:
    • Leadership & Management
    • NO_LABEL
    • School Reported
    • School on Central Budgets
    • Shared Services
  • Student_Type:
    • Alternative
    • At Risk
    • ELL
    • Gifted
    • NO_LABEL
    • Poverty
    • PreK
    • Special Education
    • Unspecified
  • Use:
    • Business Services
    • ISPD
    • Instruction
    • Leadership
    • NO_LABEL
    • O&M
    • Pupil Services & Enrichment
    • Untracked Budget Set-Aside

C) Files:

  1. Initial data preparation.ipynb: Create files train.csv, test.csv, labels.csv
  2. Create multi column text data set.ipynb: Preprocess text features to save train_multi_column_text.csv and test_multi_column_text.csv
  3. logistic base model.ipynb: Create online logistic one vs rest models for each category of labels

D) Evaluation metric:

Multi-multiclass log loss:

    logloss = (1/K) * sum_{k=1..K} [ -(1/N) * sum_{i=1..N} sum_{j=1..C_k} y_{k,i,j} * log(p_{k,i,j}) ]

In this calculation, K is the number of dependent variables (label columns), N is the number of rows being evaluated, and C_k is the number of classes the k-th variable can take on; y_{k,i,j} is 1 if class j is the true class of row i for variable k (and 0 otherwise), and p_{k,i,j} is the corresponding predicted probability. If the predicted probabilities for a row do not sum to 1, they are normalized before scoring. The goal is to minimize this multi-multiclass log loss.
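As a concrete illustration of the metric (a sketch only: the function name and the list-of-arrays layout are assumptions, not the competition's actual scoring code), the column-averaged log loss can be computed like this:

```python
import numpy as np

def multi_multiclass_log_loss(y_true_list, y_pred_list, eps=1e-15):
    """Average column-wise log loss over K dependent variables.

    y_true_list / y_pred_list: length-K lists of (N, C_k) arrays holding
    one-hot true labels and predicted probabilities, respectively.
    """
    losses = []
    for y_true, y_pred in zip(y_true_list, y_pred_list):
        # Normalize rows so probabilities sum to 1, then clip for stability.
        y_pred = y_pred / y_pred.sum(axis=1, keepdims=True)
        y_pred = np.clip(y_pred, eps, 1 - eps)
        losses.append(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))
    return float(np.mean(losses))
```

A perfect prediction scores (near) zero, and a uniform guess over two classes scores log(2) ≈ 0.693, which gives a feel for the score range in the models table below.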

E) Final Model Description:

E.1) Missing values:

Even though the data set contains many input columns, it is important to check each of them for missing values.

[missing-values heatmap]

The missing-values heatmap above makes it clear that not all feature columns carry enough information. In fact, apart from Object_Description and Total, the other features have too many missing values to impute. Hence, we are forced to drop many of the features, such as Facility_or_Department, Text_1, Text_2, Text_3, Text_4, SubFund_Description, and Position_Extra.

E.2) Data Preparation:

In the data preparation step, we clean the text features. First we expand contracted words: for example, "don't" becomes "do not", "wouldn't" becomes "would not", and so on. After this we remove all special characters, punctuation marks, and stopwords. Since this cleaning can leave some samples as empty strings, we fill any empty string with the word "empty".
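The cleaning steps above can be sketched as follows (the stopword set and contraction map here are small illustrative subsets, not the repository's full lists):

```python
import re

# Illustrative subsets -- the real notebook likely uses a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in"}
CONTRACTIONS = {"don't": "do not", "wouldn't": "would not", "can't": "can not"}

def clean_text(text):
    text = str(text).lower()
    # Expand contractions before stripping punctuation, while the
    # apostrophe is still present to match against.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop special characters and punctuation, keeping letters and digits.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Samples emptied by cleaning are filled with the placeholder "empty".
    return " ".join(tokens) if tokens else "empty"
```

For example, `clean_text("Don't buy the supplies!")` yields `"do not buy supplies"`, and a line item consisting only of punctuation becomes `"empty"`.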

E.3) Final model:

After trying many different feature-engineering approaches (entropy features, numeric-feature binning, single-column text vectorization, etc.), the final model uses TF-IDF vectorization of the multiple text feature columns as given in the original data set. Alongside these text features, normalized and standardized numeric features are also used.
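A minimal sketch of this featurization, assuming sklearn's TfidfVectorizer and StandardScaler (the function name `build_features` and the column names in the usage example are illustrative):

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

def build_features(df, text_cols, numeric_cols):
    """TF-IDF each text column separately, scale numerics, stack sparse."""
    blocks, vectorizers = [], {}
    for col in text_cols:
        # One vocabulary per text column, mirroring the multi-column setup.
        vec = TfidfVectorizer()
        blocks.append(vec.fit_transform(df[col]))
        vectorizers[col] = vec
    scaler = StandardScaler()
    numeric = scaler.fit_transform(df[numeric_cols].fillna(0.0))
    blocks.append(csr_matrix(numeric))
    return hstack(blocks).tocsr(), vectorizers, scaler
```

Fitting one vectorizer per column (rather than concatenating all text first) keeps each column's vocabulary and IDF weights separate, which is what distinguishes this setup from the "condensed features" variant that did not work out.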

After trying many different modeling approaches, the final model is a simple online logistic-regression classifier built with sklearn's SGDClassifier, which uses stochastic gradient descent. The reason for custom-building an online classifier is that sklearn's LogisticRegression fits the model on the entire data set at once, which makes it slower and harder to run on large data. Stochastic gradient descent, by contrast, is easy to fit and fast on large data sets.

Early stopping is an important feature when building the classifier, as it helps avoid overfitting. Using a tolerance on the iterative improvement (here 0.001), we can stop the training iterations early once the loss stops improving by more than the tolerance.

Analogous to sklearn's LogisticRegression parameter max_iter, a max_epoch parameter is set to 1000 epoch iterations.

Finally, before providing data to the model, all features are stacked together using scipy's sparse hstack, which yields the stacked data as a sparse SciPy matrix. This is done because the online classifier makes much faster passes over a sparse matrix, which speeds up the model even further.

Instead of using pure stochastic gradient descent, we use mini-batch gradient descent. At this link, the author provides an iter_minibatches function that takes chunksize, X, and y as parameters and yields randomly selected x_chunk, y_chunk pairs of the given chunk size.
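The linked function is not reproduced here; the following is a hedged reimplementation of the described behavior (random selection, fixed chunk size):

```python
import numpy as np

def iter_minibatches(chunksize, X, y):
    """Yield random (x_chunk, y_chunk) pairs of at most the given size."""
    n = X.shape[0]
    # Shuffle once, then walk the permutation in chunks so that every
    # row is visited exactly once per epoch.
    order = np.random.permutation(n)
    for start in range(0, n, chunksize):
        idx = order[start:start + chunksize]
        # Fancy row indexing also works on scipy CSR matrices.
        yield X[idx], y[idx]
```

Each epoch of the mini-batch training then becomes `for x_chunk, y_chunk in iter_minibatches(chunksize, X, y): clf.partial_fit(x_chunk, y_chunk, classes=classes)`.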

Last but not least, we run this mini-batch logistic-regression model one-vs-rest on every class of every label. Across the 9 label columns there are 104 classes in total, so the model is trained 104 times.

Models:

| Model description | Score |
| --- | --- |
| Logistic, hashing vectorizer without numeric features | 0.8965 |
| Numeric features with NA filled by mean | 0.8201 |
| Logistic, hashing vectorizer + numeric features + separate entropy features for each label | 0.7934 |
| Logistic, TF-IDF vectorizer + numeric features + entropy features | 0.5952 |
| Same as above with multi-text columns | 0.5582 |
| Logistic, multi-text TF-IDF + original numeric features with scaling + feature drop | 0.5381 |
| Same as previous model with 3 more features dropped | 0.5314 |

Things that did not work out:

  • We used condensed features, meaning all the text features were collapsed into a single feature
  • Used numeric feature binning
  • Used target encoded features
  • Used entropy based features
  • Fit mini-batch model on labels instead of classes of labels.
