View Code? Open in Web Editor
NEW
This project forked from deep-learning-hk/kaggle_avito
kaggle_avito's Introduction
Feature Engineering and Procedures
- first, take label y out from train set
- combine train set and test set, so we can do all feature engineering at one shot
- price: Log transform
- price: Fill NA with mean value
- image_top_1: fill NA with -999
- create Time Variables:
- weekday
- week of year (not so useful)
- day of month (not so useful)
- get training index and test index based on the date
- Label Encoding for categorical features
- including: ["user_id","region","city","parent_category_name","category_name","user_type","image_top_1","param_1","param_2","param_3"]
- group all text features by a list: textfeats = ["description", "title"]
- create Description Punctuation feature from "description" (by string.punctuation)
- clean the "title" by:
i. lowercase
ii. remove some punctuation and signs
- text preprocess:
- fillna with "missing"
- lowercase
- create "num_words" features for both "description" & "title"
- create "num_unique_words" features for both "description" & "title"
- create "words_vs_unique" features for both "description" & "title" (unique_words/words)
- TFIDF: (TODO)
- use russian stopwords
- ngram: 1, 2
- max_features (TODO)
- create a text feature df to contain all the tfidf features
- weirdly, it fits all columns (even numeric and categorical features) into TFIDF, [ vectorizer.transform(df.to_dict('records')) ]
- create a list of "get_feature_names" for all TFIDF features
- make Ridge Regression as one of the feature
- Image Dullness Score
- Average Pixel Width (check if the image's pixel color is very uniform - i.e. the image is very 'flat')
- Blurness
- Width of the Image
- Height of the Image
- Size of the Image
- Neural Net Object Prediction
- image quality by using: https://github.com/titu1994/neural-image-assessment (not in this notebook)
- Keypoint extraction
Previous Deal Probability
- first, get all the previous deal probability, in train set
- get all the mean deal probability from the repeated user in train set for test set (the only once repeated user's deal prob. will discount by 1/2)
- one ratio and zero ratio
- .....
- 5 folds
- stratified by deal prob range:
y_4_bins = pd.cut(y,[-0.01,0.01,0.163, 0.76, 1], labels=['0', '0.01', '0.163', '0.76'], right=True)
kaggle_avito's People
Watchers