Kaggle avito

Feature Engineering and Procedures

From Public Kernel

  • first, separate the label y from the train set
  • combine the train set and test set, so all feature engineering can be done in one pass
  • price: Log transform
  • price: Fill NA with mean value
  • image_top_1: fill NA with -999
  • create Time Variables:
    • weekday
    • week of year (not so useful)
    • day of month (not so useful)
  • derive the training index and test index from the date
  • Label Encoding for categorical features
    • including: ["user_id","region","city","parent_category_name","category_name","user_type","image_top_1","param_1","param_2","param_3"]
  • collect the text features in a list: textfeats = ["description", "title"]
  • create a Description Punctuation feature from "description" (using string.punctuation)
  • clean the "title": (i) lowercase, (ii) remove some punctuation and symbols
  • text preprocess:
    • fillna with "missing"
    • lowercase
    • create "num_words" features for both "description" & "title"
    • create "num_unique_words" features for both "description" & "title"
    • create "words_vs_unique" features for both "description" & "title" (unique_words/words)
  • TF-IDF: (TODO)
    • use Russian stopwords
    • ngram range: (1, 2)
    • max_features (TODO)
  • create a text feature df to hold all the TF-IDF features
  • oddly, the kernel passes all columns (including numeric and categorical features) through the TF-IDF vectorizer: vectorizer.transform(df.to_dict('records'))
  • collect the names of all TF-IDF features via "get_feature_names"
  • add Ridge Regression predictions as one of the features
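
A minimal sketch of the non-TF-IDF steps above, on a toy frame with made-up values. Assumptions not stated in the list: the date column is named activation_date, the log transform is np.log1p, pd.factorize stands in for the label encoder, the punctuation feature is a character count, and the names basic_features and desc_punc are hypothetical. Only "region" is encoded here to keep the example short.

```python
import string

import numpy as np
import pandas as pd


def basic_features(df):
    """Apply a few of the listed steps to a combined train + test frame."""
    df = df.copy()

    # price: log transform (log1p is an assumption), then fill NA with the mean
    df["price"] = np.log1p(df["price"])
    df["price"] = df["price"].fillna(df["price"].mean())

    # image_top_1: fill NA with -999
    df["image_top_1"] = df["image_top_1"].fillna(-999)

    # time variables from the (assumed) activation_date column
    df["weekday"] = df["activation_date"].dt.weekday
    df["week_of_year"] = df["activation_date"].dt.isocalendar().week.astype(int)
    df["day_of_month"] = df["activation_date"].dt.day

    # label encoding; pd.factorize with sort=True stands in for a label encoder
    for col in ["region"]:
        df[col] = pd.factorize(df[col].astype(str), sort=True)[0]

    # punctuation count, taken before the text is filled and lowercased
    df["desc_punc"] = df["description"].apply(
        lambda s: sum(ch in string.punctuation for ch in str(s))
    )

    # text preprocessing and word-count features for both text columns
    for col in ["description", "title"]:
        df[col] = df[col].fillna("missing").str.lower()
        words = df[col].str.split()
        df[col + "_num_words"] = words.apply(len)
        df[col + "_num_unique_words"] = words.apply(lambda w: len(set(w)))
        df[col + "_words_vs_unique"] = (
            df[col + "_num_unique_words"] / df[col + "_num_words"]
        )
    return df


# Toy stand-in for the combined frame (values are made up).
df = pd.DataFrame({
    "price": [100.0, None, 2500.0],
    "image_top_1": [12.0, None, 7.0],
    "activation_date": pd.to_datetime(["2017-03-15", "2017-03-20", "2017-04-12"]),
    "region": ["A", "B", "A"],
    "title": ["Red Bike!", None, "iPhone 6, 16GB"],
    "description": ["Good bike, good price.", None, "Works fine!!!"],
})
out = basic_features(df)
```

Note that an empty text field would give 0 words and a NaN ratio; filling with "missing" first avoids that for NA values only.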

Our features

Image features

  • Image Dullness Score
  • Average Pixel Width (check if the image's pixel color is very uniform - i.e. the image is very 'flat')
  • Blurriness
  • Width of the Image
  • Height of the Image
  • Size of the Image
  • Neural Net Object Prediction
  • image quality by using: https://github.com/titu1994/neural-image-assessment (not in this notebook)
  • Keypoint extraction
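
How these scores were computed is not stated here. As a rough illustration only, the sketch below derives width, height, a flatness measure (standard deviation of the grayscale pixels) and a blur proxy (variance of a 4-neighbour Laplacian, a common sharpness heuristic) from an RGB array using NumPy. The function name simple_image_features and the keys pixel_std and laplacian_var are hypothetical, and these proxies may differ from the features actually used.

```python
import numpy as np


def simple_image_features(img):
    """Compute basic features for an H x W x 3 uint8 RGB array."""
    height, width = img.shape[:2]
    gray = img.astype(np.float64).mean(axis=2)

    # 4-neighbour Laplacian on interior pixels; low variance suggests blur
    lap = (
        gray[:-2, 1:-1] + gray[2:, 1:-1]
        + gray[1:-1, :-2] + gray[1:-1, 2:]
        - 4.0 * gray[1:-1, 1:-1]
    )
    return {
        "width": width,
        "height": height,
        "pixel_std": float(gray.std()),      # near 0 for a very 'flat' image
        "laplacian_var": float(lap.var()),   # near 0 for a blurry or flat image
    }


# A uniform gray image versus random noise (synthetic inputs).
flat = np.full((4, 6, 3), 128, dtype=np.uint8)
noisy = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
flat_feats = simple_image_features(flat)
noisy_feats = simple_image_features(noisy)
```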

Previous Deal Probability

  • first, get all the previous deal probabilities in the train set
  • for the test set, take the mean deal probability of each repeated user in the train set (for users repeated only once, the deal probability is discounted by 1/2)
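
One possible reading of the test-set step, sketched with pandas on made-up data. Assumptions: the columns are named user_id and deal_probability, "repeated only once" means the user has a single row in the train set, and users absent from the train set are left as NaN. The feature name prev_deal_prob is hypothetical.

```python
import pandas as pd

# Toy train/test frames (values are made up).
train = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "deal_probability": [0.2, 0.6, 0.8, 0.0],
})
test = pd.DataFrame({"user_id": ["u1", "u2", "u4"]})

# Per-user mean and number of train rows
stats = train.groupby("user_id")["deal_probability"].agg(["mean", "count"])

# Keep the mean for users with more than one row, halve it otherwise
stats["prev_deal_prob"] = stats["mean"].where(stats["count"] > 1, stats["mean"] * 0.5)

# Map onto the test set; unseen users get NaN
test["prev_deal_prob"] = test["user_id"].map(stats["prev_deal_prob"])
```

Here u1 keeps its mean of 0.4, u2 is halved from 0.8 to 0.4, and u4 has no history.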

Pong's features (TODO)

  • one ratio and zero ratio
  • .....

Stratified K-fold

  • 5 folds
  • stratified by deal prob range:
    y_4_bins = pd.cut(y,[-0.01,0.01,0.163, 0.76, 1], labels=['0', '0.01', '0.163', '0.76'], right=True) 
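
A minimal sketch of how these bins could drive a 5-fold stratified split, assuming scikit-learn's StratifiedKFold. The target values, shuffle=True and random_state=0 are made up for illustration; only the bin edges and labels come from the line above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic target: 50 samples falling in each of the four bins.
y = pd.Series(np.tile([0.0, 0.1, 0.5, 0.9], 50))

y_4_bins = pd.cut(
    y, [-0.01, 0.01, 0.163, 0.76, 1],
    labels=["0", "0.01", "0.163", "0.76"], right=True,
)

# Stratify on the integer bin codes rather than the continuous target
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(np.zeros(len(y)), y_4_bins.cat.codes))

for train_idx, valid_idx in folds:
    # each validation fold keeps the same bin proportions as the full data
    counts = y_4_bins.iloc[valid_idx].value_counts()
```

Binning is needed because StratifiedKFold expects class labels and cannot stratify directly on a continuous deal probability.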
