본문 바로가기
ML | DL | Big data/Projects

kaggle 주택 가격 예측(2) - Data preprocess / Feature engineering

by 썽하 2020. 8. 17.

 

저번 글에는 데이터를 탐구하기만 했다면, 이번 글에는 탐구했던 내용들을 바탕으로 데이터를 전처리하고 추가적인 feature들을 생성해보자. 이번 글을 다 읽고 나면 머신러닝에서 Feature들이 어떻게 뻥튀기되고, 버려지기도 하며, 역 추산되는지 알게 될 것이다.

 

 

House Prices: Advanced Regression Techniques

Predict sales prices and practice feature engineering, RFs, and gradient boosting

www.kaggle.com

 

kaggle 주택 가격 예측(1) - 포괄적인 데이터 탐색 분석 / EDA

xgboost를 활용한 실전 실습을 무엇으로 해볼까 kaggle을 구경하다가 많은 사람들의 튜토리얼 compete으로 이용되고 있는 주택 가격 예측으로 진행하기로 결정했다. House Prices: Advanced Regression Technique..

dining-developer.tistory.com


 

 

 

 

 

Data Preprocessing / Feature engineering with Ames dataset

In [1]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error, make_scorer
from scipy import stats
from scipy.stats import skew, norm
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns

from impyute.imputation.cs import mice

import warnings
def ignore_warn(*args, **kwargs): pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

%matplotlib inline

pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 500)

sns.set_style("whitegrid")
In [2]:
# 데이터 읽어오기
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")
print("train : " + str(train.shape))
print("test : " + str(test.shape))
 
train : (1460, 81)
test : (1459, 80)
In [3]:
# 중복확인
idsUnique = len(set(train.Id))
idsTotal = train.shape[0]
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")

# Id 제거
Id_test = test.Id # test의 id는 캐글 제출시 필요하므로 keep
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
 
There are 0 duplicate IDs for 1460 total entries
 

Preprocessing

In [4]:
# GrLivArea / SalePirce간 scatter plot 확인
var = 'GrLivArea'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

train = train[(train.GrLivArea < 4000) | (train.SalePrice > 700000)]

data = pd.concat([train['SalePrice'], train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
 
 
 

오른쪽 하단에 두개의 엄청 큰 집이 정말 싸게 팔렸다. 모델을 망칠수 있으니 outlier라고 간주하고 제거한다.

In [5]:
# Log transform
sns.distplot(train['SalePrice'], fit=norm);
fig = plt.figure()

train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice

sns.distplot(train['SalePrice'], fit=norm);
fig = plt.figure()
 
 
 
<Figure size 432x288 with 0 Axes>
In [6]:
print(test.shape)
print(train.shape)
 
(1459, 79)
(1458, 80)
In [7]:
#누락데이터 확인
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data[missing_data.Total > 0].head(-1)
Out[7]:
  Total Percent
PoolQC 1452 0.996
MiscFeature 1404 0.963
Alley 1367 0.938
Fence 1177 0.807
FireplaceQu 690 0.473
LotFrontage 259 0.178
GarageType 81 0.056
GarageCond 81 0.056
GarageFinish 81 0.056
GarageQual 81 0.056
GarageYrBlt 81 0.056
BsmtFinType2 38 0.026
BsmtExposure 38 0.026
BsmtQual 37 0.025
BsmtCond 37 0.025
BsmtFinType1 37 0.025
MasVnrArea 8 0.005
MasVnrType 8 0.005
In [8]:
# 정의에 의해 누락 데이터가 '누락'을 의미하는게 아닌 아닌 '없음'을 의미(ex. 수영장 없음)하는 경우0과 같은 데이터로 대체 해준다.
# 데이터 정의에 써져있다.
def replace_na(df) :
    # Alley : data description says NA means "no alley access"
    df.loc[:, "Alley"] = df.loc[:, "Alley"].fillna("None")
    # BedroomAbvGr : NA most likely means 0
    df.loc[:, "BedroomAbvGr"] = df.loc[:, "BedroomAbvGr"].fillna(0)
    # BsmtQual etc : data description says NA for basement features is "no basement"
    df.loc[:, "BsmtQual"] = df.loc[:, "BsmtQual"].fillna("No")
    df.loc[:, "BsmtCond"] = df.loc[:, "BsmtCond"].fillna("No")
    df.loc[:, "BsmtExposure"] = df.loc[:, "BsmtExposure"].fillna("No")
    df.loc[:, "BsmtFinType1"] = df.loc[:, "BsmtFinType1"].fillna("No")
    df.loc[:, "BsmtFinType2"] = df.loc[:, "BsmtFinType2"].fillna("No")
    df.loc[:, "BsmtFullBath"] = df.loc[:, "BsmtFullBath"].fillna(0)
    df.loc[:, "BsmtHalfBath"] = df.loc[:, "BsmtHalfBath"].fillna(0)
    df.loc[:, "BsmtUnfSF"] = df.loc[:, "BsmtUnfSF"].fillna(0)
    # CentralAir : NA most likely means No
    df.loc[:, "CentralAir"] = df.loc[:, "CentralAir"].fillna("N")
    # Condition : NA most likely means Normal
    df.loc[:, "Condition1"] = df.loc[:, "Condition1"].fillna("Norm")
    df.loc[:, "Condition2"] = df.loc[:, "Condition2"].fillna("Norm")
    # EnclosedPorch : NA most likely means no enclosed porch
    df.loc[:, "EnclosedPorch"] = df.loc[:, "EnclosedPorch"].fillna(0)
    # External stuff : NA most likely means average
    df.loc[:, "ExterCond"] = df.loc[:, "ExterCond"].fillna("TA")
    df.loc[:, "ExterQual"] = df.loc[:, "ExterQual"].fillna("TA")
    # Fence : data description says NA means "no fence"
    df.loc[:, "Fence"] = df.loc[:, "Fence"].fillna("No")
    # FireplaceQu : data description says NA means "no fireplace"
    df.loc[:, "FireplaceQu"] = df.loc[:, "FireplaceQu"].fillna("No")
    df.loc[:, "Fireplaces"] = df.loc[:, "Fireplaces"].fillna(0)
    # Functional : data description says NA means typical
    df.loc[:, "Functional"] = df.loc[:, "Functional"].fillna("Typ")
    # GarageType etc : data description says NA for garage features is "no garage"
    df.loc[:, "GarageType"] = df.loc[:, "GarageType"].fillna("No")
    df.loc[:, "GarageFinish"] = df.loc[:, "GarageFinish"].fillna("No")
    df.loc[:, "GarageQual"] = df.loc[:, "GarageQual"].fillna("No")
    df.loc[:, "GarageCond"] = df.loc[:, "GarageCond"].fillna("No")
    df.loc[:, "GarageArea"] = df.loc[:, "GarageArea"].fillna(0)
    df.loc[:, "GarageCars"] = df.loc[:, "GarageCars"].fillna(0)
    # HalfBath : NA most likely means no half baths above grade
    df.loc[:, "HalfBath"] = df.loc[:, "HalfBath"].fillna(0)
    # HeatingQC : NA most likely means typical
    df.loc[:, "HeatingQC"] = df.loc[:, "HeatingQC"].fillna("TA")
    # KitchenAbvGr : NA most likely means 0
    df.loc[:, "KitchenAbvGr"] = df.loc[:, "KitchenAbvGr"].fillna(0)
    # KitchenQual : NA most likely means typical
    df.loc[:, "KitchenQual"] = df.loc[:, "KitchenQual"].fillna("TA")
    # LotFrontage : NA most likely means no lot frontage
    df.loc[:, "LotFrontage"] = df.loc[:, "LotFrontage"].fillna(0)
    # LotShape : NA most likely means regular
    df.loc[:, "LotShape"] = df.loc[:, "LotShape"].fillna("Reg")
    # MasVnrType : NA most likely means no veneer
    df.loc[:, "MasVnrType"] = df.loc[:, "MasVnrType"].fillna("None")
    df.loc[:, "MasVnrArea"] = df.loc[:, "MasVnrArea"].fillna(0)
    # MiscFeature : data description says NA means "no misc feature"
    df.loc[:, "MiscFeature"] = df.loc[:, "MiscFeature"].fillna("No")
    df.loc[:, "MiscVal"] = df.loc[:, "MiscVal"].fillna(0)
    # OpenPorchSF : NA most likely means no open porch
    df.loc[:, "OpenPorchSF"] = df.loc[:, "OpenPorchSF"].fillna(0)
    # PavedDrive : NA most likely means not paved
    df.loc[:, "PavedDrive"] = df.loc[:, "PavedDrive"].fillna("N")
    # PoolQC : data description says NA means "no pool"
    df.loc[:, "PoolQC"] = df.loc[:, "PoolQC"].fillna("No")
    df.loc[:, "PoolArea"] = df.loc[:, "PoolArea"].fillna(0)
    # SaleCondition : NA most likely means normal sale
    df.loc[:, "SaleCondition"] = df.loc[:, "SaleCondition"].fillna("Normal")
    # ScreenPorch : NA most likely means no screen porch
    df.loc[:, "ScreenPorch"] = df.loc[:, "ScreenPorch"].fillna(0)
    # TotRmsAbvGrd : NA most likely means 0
    df.loc[:, "TotRmsAbvGrd"] = df.loc[:, "TotRmsAbvGrd"].fillna(0)
    # Utilities : NA most likely means all public utilities
    df.loc[:, "Utilities"] = df.loc[:, "Utilities"].fillna("AllPub")
    # WoodDeckSF : NA most likely means no wood deck
    df.loc[:, "WoodDeckSF"] = df.loc[:, "WoodDeckSF"].fillna(0)
    return df

train = replace_na(train)
test = replace_na(test)
In [9]:
def cat_to_num(df):
    # 숫자 데이터가 범주형 데이터로 기록되어있는 경우 숫자형 데이터로 변경해준다.
    df = df.replace({"MSSubClass" : {20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45", 
                                       50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75", 
                                       80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120", 
                                       150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"},
                       "MoSold" : {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun",
                                   7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"}
                      })
    
    # 범주형 데이터가 수치적 의미를 갖고있는 경우 숫자형 데이터로 변경해준다.
    df = df.replace({"Alley" : {"Grvl" : 1, "Pave" : 2},
                       "BsmtCond" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "BsmtExposure" : {"No" : 0, "Mn" : 1, "Av": 2, "Gd" : 3},
                       "BsmtFinType1" : {"No" : 0, "Unf" : 1, "LwQ": 2, "Rec" : 3, "BLQ" : 4, 
                                         "ALQ" : 5, "GLQ" : 6},
                       "BsmtFinType2" : {"No" : 0, "Unf" : 1, "LwQ": 2, "Rec" : 3, "BLQ" : 4, 
                                         "ALQ" : 5, "GLQ" : 6},
                       "BsmtQual" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5},
                       "ExterCond" : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5},
                       "ExterQual" : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5},
                       "FireplaceQu" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "Functional" : {"Sal" : 1, "Sev" : 2, "Maj2" : 3, "Maj1" : 4, "Mod": 5, 
                                       "Min2" : 6, "Min1" : 7, "Typ" : 8},
                       "GarageCond" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "GarageQual" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "HeatingQC" : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "KitchenQual" : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "LandSlope" : {"Sev" : 1, "Mod" : 2, "Gtl" : 3},
                       "LotShape" : {"IR3" : 1, "IR2" : 2, "IR1" : 3, "Reg" : 4},
                       "PavedDrive" : {"N" : 0, "P" : 1, "Y" : 2},
                       "PoolQC" : {"No" : 0, "Fa" : 1, "TA" : 2, "Gd" : 3, "Ex" : 4},
                       "Street" : {"Grvl" : 1, "Pave" : 2},
                       "Utilities" : {"ELO" : 1, "NoSeWa" : 2, "NoSewr" : 3, "AllPub" : 4}}
                     )
    return df

train = cat_to_num(train)
test = cat_to_num(test)
In [10]:
def make_has_feature(df):
    # 있는지 없는지 정도만 판단해도 충분히 좋은 Feature가 된다.
    df['haspool'] = df['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
    df['has2ndfloor'] = df['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
    df['hasgarage'] = df['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
    df['hasbsmt'] = df['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
    df['hasfireplace'] = df['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
    return df

train = make_has_feature(train)
test = make_has_feature(test)
 

중간 점검으로 필요없는 feature가 있는지 확인해보자.

corr가 0.05 이하인 아이들 중에 skew가 높은 녀석들부터 보자.

In [11]:
numerical_features = train.select_dtypes(exclude = ["object"]).columns
skew_features = train[numerical_features].apply(lambda x: skew(x)).sort_values(ascending=True)
corr_features = train.corr().SalePrice
skew_corr_features = skew_features.to_frame().join(corr_features)
skew_corr_features.columns = ['skew', 'corr']
print(skew_corr_features[(skew_corr_features['corr'] < 0.05) & (abs(skew_corr_features['skew']) > 5)])
 
                skew   corr
Utilities    -38.144 0.013 
LowQualFinSF 8.996   -0.038
MiscVal      24.435  -0.020
In [12]:
# 한개 빼고 모두 0이다. 샘플수가 너무적어 오히려 학습에 방해 될 듯하다.
feature = 'Utilities'
train.groupby(feature).count()['SalePrice']
train = train.drop(columns=[feature])
test = test.drop(columns=[feature])
In [13]:
# 관련이 없는듯 보이지만 0이 아닌 것들의 비율이 낮을뿐 여타시설의 달러 가치 정도를 남겨두자. 로그 변환해서 사용해보자.
feature = 'MiscVal'
train.groupby(feature).count()['SalePrice']
Out[13]:
MiscVal
0        1406
54       1   
350      1   
400      11  
450      4   
480      2   
500      8   
560      1   
600      4   
620      1   
700      5   
800      1   
1150     1   
1200     2   
1300     1   
1400     1   
2000     4   
2500     1   
3500     1   
8300     1   
15500    1   
Name: SalePrice, dtype: int64
In [14]:
# 저품질 타일의 수인데, 이건 어느정도 영향을 미칠듯하다. 로그 변환해서 사용해보자.
feature = 'LowQualFinSF'
train.groupby(feature).count()['SalePrice']
Out[14]:
LowQualFinSF
0      1432
53     1   
80     3   
120    1   
144    1   
156    1   
205    1   
232    1   
234    1   
360    2   
371    1   
384    1   
390    1   
392    1   
397    1   
420    1   
473    1   
479    1   
481    1   
513    1   
514    1   
515    1   
528    1   
572    1   
Name: SalePrice, dtype: int64
 

3가지 방식으로 새로운 피쳐를 만들어주자.

  1. 이미 있는 feature를 단순화시키기
  2. 기존 feature들을 조합시키기
  3. 기존 feature의 다항식 표현
In [15]:
def simplify_feature(df) :
    # 1. 이미 있는 feature를 단순화시키기
    df["SimplOverallQual"] = df.OverallQual.replace({1 : 1, 2 : 1, 3 : 1, # bad
                                                    4 : 2, 5 : 2, 6 : 2, # average
                                                    7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
                                                    })
    df["SimplOverallCond"] = df.OverallCond.replace({1 : 1, 2 : 1, 3 : 1, # bad
                                                    4 : 2, 5 : 2, 6 : 2, # average
                                                    7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
                                                    })
    df["SimplPoolQC"] = df.PoolQC.replace({1 : 1, 2 : 1, # average
                                            3 : 2, 4 : 2 # good
                                            })
    df["SimplGarageCond"] = df.GarageCond.replace({1 : 1, # bad
                                                   2 : 1, 3 : 1, # average
                                                   4 : 2, 5 : 2 # good
                                                        })
    df["SimplGarageQual"] = df.GarageQual.replace({1 : 1, # bad
                                                   2 : 1, 3 : 1, # average
                                                   4 : 2, 5 : 2 # good
                                                        })
    df["SimplFireplaceQu"] = df.FireplaceQu.replace({1 : 1, # bad
                                                     2 : 1, 3 : 1, # average
                                                     4 : 2, 5 : 2 # good
                                                          })
    df["SimplFireplaceQu"] = df.FireplaceQu.replace({1 : 1, # bad
                                                     2 : 1, 3 : 1, # average
                                                     4 : 2, 5 : 2 # good
                                                          })
    df["SimplFunctional"] = df.Functional.replace({1 : 1, 2 : 1, # bad
                                                   3 : 2, 4 : 2, # major
                                                   5 : 3, 6 : 3, 7 : 3, # minor
                                                   8 : 4 # typical
                                                        })
    df["SimplKitchenQual"] = df.KitchenQual.replace({1 : 1, # bad
                                                     2 : 1, 3 : 1, # average
                                                     4 : 2, 5 : 2 # good
                                                          })
    df["SimplHeatingQC"] = df.HeatingQC.replace({1 : 1, # bad
                                                 2 : 1, 3 : 1, # average
                                                 4 : 2, 5 : 2 # good
                                                      })
    df["SimplBsmtFinType1"] = df.BsmtFinType1.replace({1 : 1, # unfinished
                                                       2 : 1, 3 : 1, # rec room
                                                       4 : 2, 5 : 2, 6 : 2 # living quarters
                                                       })
    df["SimplBsmtFinType2"] = df.BsmtFinType2.replace({1 : 1, # unfinished
                                                       2 : 1, 3 : 1, # rec room
                                                       4 : 2, 5 : 2, 6 : 2 # living quarters
                                                       })
    df["SimplBsmtCond"] = df.BsmtCond.replace({1 : 1, # bad
                                               2 : 1, 3 : 1, # average
                                               4 : 2, 5 : 2 # good
                                               })
    df["SimplBsmtQual"] = df.BsmtQual.replace({1 : 1, # bad
                                               2 : 1, 3 : 1, # average
                                               4 : 2, 5 : 2 # good
                                               })
    df["SimplExterCond"] = df.ExterCond.replace({1 : 1, # bad
                                                 2 : 1, 3 : 1, # average
                                                 4 : 2, 5 : 2 # good
                                                 })
    df["SimplExterQual"] = df.ExterQual.replace({1 : 1, # bad
                                                 2 : 1, 3 : 1, # average
                                                 4 : 2, 5 : 2 # good
                                                 })
    return df

def combine_feature(df) :
    # 2* 기존 feature들을 조합시키기
    # Overall quality of the house
    df["OverallGrade"] = df["OverallQual"] * df["OverallCond"]
    # Overall quality of the garage
    df["GarageGrade"] = df["GarageQual"] * df["GarageCond"]
    # Overall quality of the exterior
    df["ExterGrade"] = df["ExterQual"] * df["ExterCond"]
    # Overall kitchen score
    df["KitchenScore"] = df["KitchenAbvGr"] * df["KitchenQual"]
    # Overall fireplace score
    df["FireplaceScore"] = df["Fireplaces"] * df["FireplaceQu"]
    # Overall garage score
    df["GarageScore"] = df["GarageArea"] * df["GarageQual"]
    # Overall pool score
    df["PoolScore"] = df["PoolArea"] * df["PoolQC"]
    # Simplified overall quality of the house
    df["SimplOverallGrade"] = df["SimplOverallQual"] * df["SimplOverallCond"]
    # Simplified overall quality of the exterior
    df["SimplExterGrade"] = df["SimplExterQual"] * df["SimplExterCond"]
    # Simplified overall pool score
    df["SimplPoolScore"] = df["PoolArea"] * df["SimplPoolQC"]
    # Simplified overall garage score
    df["SimplGarageScore"] = df["GarageArea"] * df["SimplGarageQual"]
    # Simplified overall fireplace score
    df["SimplFireplaceScore"] = df["Fireplaces"] * df["SimplFireplaceQu"]
    # Simplified overall kitchen score
    df["SimplKitchenScore"] = df["KitchenAbvGr"] * df["SimplKitchenQual"]
    # Total number of bathrooms
    df["TotalBath"] = df["BsmtFullBath"] + (0.5 * df["BsmtHalfBath"]) + \
    df["FullBath"] + (0.5 * df["HalfBath"])
    # Total SF for house (incl. basement)
    df["AllSF"] = df["GrLivArea"] + df["TotalBsmtSF"]
    # Total SF for 1st + 2nd floors
    df["AllFlrsSF"] = df["1stFlrSF"] + df["2ndFlrSF"]
    # Total SF for porch
    df["AllPorchSF"] = df["OpenPorchSF"] + df["EnclosedPorch"] + \
    df["3SsnPorch"] + df["ScreenPorch"]
    # Has masonry veneer or not
    df["HasMasVnr"] = df.MasVnrType.replace({"BrkCmn" : 1, "BrkFace" : 1, "CBlock" : 1,  "Stone" : 1, "None" : 0})
    # House completed before sale or not
    df["BoughtOffPlan"] = df.SaleCondition.replace({"Abnorml" : 0, "Alloca" : 0, "AdjLand" : 0, "Family" : 0, "Normal" : 0, "Partial" : 1})
    
    return df

train = simplify_feature(train)
test = simplify_feature(test)
 

숫자형과 범주형 피쳐를 나누어주자.

In [16]:
# Differentiate numerical features (minus the target) and categorical features
categorical_features = train.select_dtypes(include = ["object"]).columns
numerical_features = train.select_dtypes(exclude = ["object"]).columns
numerical_features = numerical_features.drop("SalePrice")
print("Numerical features : " + str(len(numerical_features)))
print("Categorical features : " + str(len(categorical_features)))

train_num = train[numerical_features]
train_cat = train[categorical_features]
test_num = test[numerical_features]
test_cat = test[categorical_features]
 
Numerical features : 71
Categorical features : 27
In [17]:
# Find most important features relative to target
print("Find most important features relative to target")
corr = train.corr()
corr.sort_values(["SalePrice"], ascending = False, inplace = True)
print(corr.SalePrice)
 
Find most important features relative to target
SalePrice           1.000 
OverallQual         0.821 
GrLivArea           0.725 
SimplOverallQual    0.707 
ExterQual           0.682 
GarageCars          0.681 
KitchenQual         0.670 
GarageArea          0.656 
TotalBsmtSF         0.648 
SimplExterQual      0.635 
1stFlrSF            0.621 
BsmtQual            0.617 
SimplKitchenQual    0.609 
FullBath            0.596 
SimplBsmtQual       0.592 
YearBuilt           0.587 
YearRemodAdd        0.566 
FireplaceQu         0.547 
GarageYrBlt         0.542 
TotRmsAbvGrd        0.538 
SimplFireplaceQu    0.514 
hasfireplace        0.510 
Fireplaces          0.492 
HeatingQC           0.474 
MasVnrArea          0.431 
SimplHeatingQC      0.397 
BsmtFinSF1          0.392 
GarageQual          0.363 
GarageCond          0.357 
BsmtExposure        0.338 
BsmtFinType1        0.335 
WoodDeckSF          0.334 
OpenPorchSF         0.325 
hasgarage           0.323 
2ndFlrSF            0.320 
HalfBath            0.314 
SimplGarageQual     0.311 
PavedDrive          0.305 
SimplBsmtFinType1   0.300 
SimplGarageCond     0.298 
LotArea             0.261 
BsmtFullBath        0.237 
BsmtUnfSF           0.222 
BedroomAbvGr        0.209 
SimplBsmtCond       0.202 
hasbsmt             0.200 
LotFrontage         0.183 
has2ndfloor         0.151 
SimplFunctional     0.137 
Functional          0.136 
ScreenPorch         0.121 
SimplBsmtFinType2   0.104 
PoolQC              0.086 
SimplPoolQC         0.081 
haspool             0.077 
PoolArea            0.074 
Street              0.057 
3SsnPorch           0.055 
ExterCond           0.049 
BsmtFinType2        0.014 
BsmtFinSF2          0.005 
BsmtHalfBath        -0.005
MiscVal             -0.020
SimplOverallCond    -0.030
OverallCond         -0.037
YrSold              -0.037
LowQualFinSF        -0.038
LandSlope           -0.039
SimplExterCond      -0.044
KitchenAbvGr        -0.148
EnclosedPorch       -0.149
LotShape            -0.288
Name: SalePrice, dtype: float64
In [18]:
li_col_high_corr = list(corr[corr['SalePrice'] > 0.5][['SalePrice']].index.values)
li_col_high_corr.remove('SalePrice')
print(li_col_high_corr)
 
['OverallQual', 'GrLivArea', 'SimplOverallQual', 'ExterQual', 'GarageCars', 'KitchenQual', 'GarageArea', 'TotalBsmtSF', 'SimplExterQual', '1stFlrSF', 'BsmtQual', 'SimplKitchenQual', 'FullBath', 'SimplBsmtQual', 'YearBuilt', 'YearRemodAdd', 'FireplaceQu', 'GarageYrBlt', 'TotRmsAbvGrd', 'SimplFireplaceQu', 'hasfireplace']
In [19]:
# 새로운 feature 만들기
# 관계도가 높은 feature의 다항식 표현
# 여러개 모두다 쓰려고 했으나, 모델 제작시 동일 뿌리에서 나온 feature들이 줄줄이 높은 importance를 기록했다.
# 하여 한가지 feature로 인해 overfit 되는걸 방지 하기위해 생성한 feature 여러개 중 가장 관련도가 높은 것 하나만 사용하기로 했다.
def polynomial_feature(df) :
    for colname in numerical_features :
        if colname not in numerical_features: continue
        train[colname + "-s2"] = train[colname] ** 2
        train[colname + "-s3"] = train[colname] ** 3
        train[colname + "-s4"] = train[colname] ** 4
        train[colname + "-Sq"] = np.sqrt(train[colname])

        li_colnames = []
        li_colnames.append(colname)
        li_colnames.append(colname + "-s2")
        li_colnames.append(colname + "-s3")
        li_colnames.append(colname + "-s4")
        li_colnames.append(colname + "-Sq")

        c = train[["SalePrice"] + li_colnames].corr()
        c.sort_values(["SalePrice"], ascending = False, inplace = True)
        li_colnames.remove(c["SalePrice"].head().index[1])

        train.drop(columns=li_colnames)
    return df

train_num = polynomial_feature(train_num)
test_num = polynomial_feature(test_num)
In [20]:
# skew가 0,5보다 크거나 -0.5보다 작은것에 한하여 로그 변환을 해준다.
# 보통 0.5보다 크면 skew가 어느정도 존재한다고 판단하여 변환시켜준다.
positive_skewed_features = []
negavtive_skewed_features = []
def log_transforam(df_train, df_test) :
    for colname in train_num.columns :
        if "-" in colname : continue
            
        if df_train[colname].skew() > 0.5 :
            min_val = min(df_train[colname].min(), df_test[colname].min())
            positive_skewed_features.append([colname, min_val])
            df_train[colname + '-log'] = np.log1p(df_train[colname] - min_val)
            df_test[colname + '-log'] = np.log1p(df_test[colname] - min_val)
            
        elif df_train[colname].skew() < -0.5 and colname != 'GarageQual':
            max_val = max(df_train[colname].max(), df_test[colname].max())
            negavtive_skewed_features.append([colname, max_val])
            df_train[colname + '-log'] = np.log1p( + max_val - df_train[colname])
            df_test[colname + '-log'] = np.log1p( + max_val - df_test[colname])
    
    return df_train, df_test

train_num, test_num = log_transforam(train_num, test_num)
In [21]:
# 숫자형 feature를 추가 생성하면서 나온 NA값을 역추산해준다.
print("NAs for numerical features in train : " + str(train_num.isnull().values.sum()))

imputer = KNNImputer(n_neighbors=5).fit(train_num.values)

np_train_imputed = imputer.transform(train_num)
train_num_imputed = pd.DataFrame(np_train_imputed, columns = train_num.columns, index=train_num.index)
np_test_imputed = imputer.transform(test_num)
test_num_imputed = pd.DataFrame(np_test_imputed, columns = test_num.columns, index=test_num.index)

print("Remaining NAs for numerical features in train : " + str(train_num_imputed.isnull().values.sum()))
train_num = train_num_imputed
test_num = test_num_imputed
 
NAs for numerical features in train : 162
Remaining NAs for numerical features in train : 0
In [22]:
# 범주형 feature를 onehot 인코딩 시킨다. 이 과정에서 null 값도 없어진다.
print("NAs for categorical features in train : " + str(train_cat.isnull().values.sum()))
train_cat_num = len(train_cat)
dataset = pd.concat(objs=[train_cat, test_cat], axis=0)
dataset_preprocessed = pd.get_dummies(dataset)
train_cat = dataset_preprocessed[:train_cat_num]
test_cat = dataset_preprocessed[train_cat_num:]
print("Remaining NAs for categorical features in train : " + str(train_cat.isnull().values.sum()))
 
NAs for categorical features in train : 1
Remaining NAs for categorical features in train : 0
In [23]:
# 범주형과 숫자형 feature들을 합쳐준다.
X_train = pd.concat([train_num, train_cat], axis = 1)
X_test = pd.concat([test_num, test_cat], axis = 1)
y_train = y
Id_test
print("New number of features : " + str(X_train.shape[1]))
 
New number of features : 333
In [24]:
#이후 사용을 위해 저장해주자
X_train.to_csv("../data/X_train.csv")
X_test.to_csv("../data/X_test.csv")
y_train.to_csv("../data/y_train.csv")
Id_test.to_csv("../data/Id_train.csv")
In [25]:
from IPython.core.display import display, HTML
display(HTML("<style>.container {width:110% !important;}</style>"))
 
 

 


데이터를 전처리하고 Feature를 추가 생성해서 머신러닝에 입력하기 직전의 데이터를 생성해보았다.

다음 글에서는 저장된 데이터로 Linear Regression, Lasso Regression, Ridge Regrssion, ElasticNet 네 종류 모델을 학습시키고 캐글에 제출해보자.

 

Reference

댓글