
Stepwise Feature Selection (python)

by signature95 2022. 1. 14.

This post follows the earlier post on another wrapper method, Backward Feature Selection (backward elimination, python).



2022.01.13 - [Study/Modeling] - Backward Feature Selection (backward elimination) python





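Feature selection methods fall into three broad categories.

  1. Filter Method (based on correlations among features)
  2. Wrapper Method (fits models while varying the feature set and selects features based on predictive performance)
  3. Embedded Method (features are selected during model optimization, that is, while the regression coefficients are estimated)

This post covers stepwise selection, one of the wrapper methods.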



The data is the same as in the previous post, so only the stepwise code changes.

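import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

data = pd.read_csv('https://raw.githubusercontent.com/signature95/tistory/main/dataset/ToyotaCorolla.csv')

# Separate features and target
df = data.drop(columns=['Price', 'Id', 'Model'])
target = data['Price']

# Split into train and test sets (8 : 2)
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, shuffle=True, random_state=34)

# The fuel variable has object dtype, so dummy-encode it (3 types).
# dtype=float keeps the design matrix numeric for statsmodels.
X_train = pd.get_dummies(X_train, dtype=float)
# Align the test columns with the training columns in case a level is missing
X_test = pd.get_dummies(X_test, dtype=float).reindex(columns=X_train.columns, fill_value=0)

# Fit OLS on all features as a baseline
X_train = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train).fit()
print(model.summary())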

Output

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.906
Model:                            OLS   Adj. R-squared:                  0.904
Method:                 Least Squares   F-statistic:                     327.2
Date:                Fri, 14 Jan 2022   Prob (F-statistic):               0.00
Time:                        15:00:52   Log-Likelihood:                -9659.8
No. Observations:                1148   AIC:                         1.939e+04
Df Residuals:                    1114   BIC:                         1.956e+04
Df Model:                          33                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
Age_08_04         -123.0813      3.882    -31.709      0.000    -130.697    -115.465
Mfg_Month          -99.3575     10.030     -9.906      0.000    -119.037     -79.678
Mfg_Year             1.9901      0.818      2.432      0.015       0.385       3.595
KM                  -0.0161      0.001    -12.730      0.000      -0.019      -0.014
HP                  23.7477      3.362      7.064      0.000      17.151      30.344
Met_Color          -27.6293     75.421     -0.366      0.714    -175.612     120.353
Automatic          425.3048    148.485      2.864      0.004     133.963     716.647
cc                  -0.0869      0.078     -1.116      0.265      -0.240       0.066
Doors               85.0947     39.542      2.152      0.032       7.510     162.679
Cylinders           -0.0330      0.002    -15.142      0.000      -0.037      -0.029
Gears              165.4233    202.827      0.816      0.415    -232.543     563.389
Quarterly_Tax       13.0932      1.778      7.363      0.000       9.604      16.583
Weight               8.4347      1.211      6.967      0.000       6.059      10.810
Mfr_Guarantee      217.9115     73.397      2.969      0.003      73.900     361.923
BOVAG_Guarantee    429.2382    123.585      3.473      0.001     186.754     671.723
Guarantee_Period    66.0271     14.265      4.629      0.000      38.038      94.016
ABS               -363.5526    126.413     -2.876      0.004    -611.586    -115.519
Airbag_1           171.9706    243.669      0.706      0.480    -306.132     650.073
Airbag_2           -25.7073    128.554     -0.200      0.842    -277.943     226.528
Airco              206.1457     87.896      2.345      0.019      33.686     378.606
Automatic_airco   2269.8082    189.464     11.980      0.000    1898.062    2641.554
Boardcomputer     -336.0481    115.845     -2.901      0.004    -563.347    -108.749
CD_Player          245.9519     97.957      2.511      0.012      53.751     438.152
Central_Lock      -177.1558    139.214     -1.273      0.203    -450.306      95.995
Powered_Windows    500.6008    140.942      3.552      0.000     224.059     777.143
Power_Steering      47.5105    280.255      0.170      0.865    -502.377     597.398
Radio              457.9836    657.484      0.697      0.486    -832.064    1748.031
Mistlamps           14.8378    108.159      0.137      0.891    -197.380     227.056
Sport_Model        377.4451     86.339      4.372      0.000     208.041     546.850
Backseat_Divider  -259.3296    128.036     -2.025      0.043    -510.548      -8.111
Metallic_Rim       187.7640     94.034      1.997      0.046       3.260     372.268
Radio_cassette    -605.8313    657.744     -0.921      0.357   -1896.388     684.726
Tow_Bar           -214.7932     78.290     -2.744      0.006    -368.406     -61.181
Fuel_Type_CNG    -1029.3034    213.488     -4.821      0.000   -1448.186    -610.420
Fuel_Type_Diesel   137.5537    175.440      0.784      0.433    -206.676     481.783
Fuel_Type_Petrol   891.7414    181.901      4.902      0.000     534.835    1248.648
Omnibus:                       95.104   Durbin-Watson:                   2.068
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              497.925
Skew:                           0.130   Prob(JB):                    7.53e-109
Kurtosis:                       6.216   Cond. No.                     1.10e+16

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.87e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
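
Note [2] above warns about multicollinearity or a singular design matrix. One plausible contributor (an assumption; the post does not diagnose the cause) is that all three fuel dummies are kept alongside a constant term, which makes the columns exactly linearly dependent. The sketch below uses a small made-up list of fuel types, not the actual dataset, and plain Python lists in place of `pd.get_dummies`.

```python
# Hypothetical fuel types for six cars (illustrative data, not from the dataset)
fuel = ["Petrol", "Diesel", "CNG", "Petrol", "Petrol", "Diesel"]
levels = sorted(set(fuel))  # ['CNG', 'Diesel', 'Petrol']

# One 0/1 dummy column per level, mirroring a full one-hot encoding
dummies = {lvl: [1 if f == lvl else 0 for f in fuel] for lvl in levels}
const = [1] * len(fuel)  # the intercept column

# Row by row, the dummy columns add up to exactly the intercept column,
# so [const, CNG, Diesel, Petrol] is perfectly collinear.
row_sums = [sum(dummies[lvl][i] for lvl in levels) for i in range(len(fuel))]
print(row_sums == const)  # True

# After dropping one level (the reference category), the remaining
# dummies no longer sum to the intercept column.
reduced = {lvl: col for lvl, col in dummies.items() if lvl != levels[0]}
reduced_sums = [sum(col[i] for col in reduced.values()) for i in range(len(fuel))]
print(reduced_sums == const)  # False
```

In pandas, the same effect can be obtained with `pd.get_dummies(..., drop_first=True)`, which drops the first level of each encoded column.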


Applying the stepwise method

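Briefly, stepwise selection starts from the null model, in which no variable is selected, and fits the regression while adding variables one at a time. After each addition it also tries removing the variables added so far, so the fit alternates between forward and backward steps. Here it is applied with OLS regression.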
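
Before the p-value-based implementation, the forward/backward alternation can be seen in isolation in a dependency-free sketch. This is not the author's method: it swaps the p-value thresholds for a partial F statistic with fixed cut-offs (for a single coefficient F equals t squared, and a cut-off near 4 roughly matches the 0.05 level when the residual degrees of freedom are large). The function names, the cut-offs `f_enter` and `f_remove`, and the synthetic data are all assumptions made for illustration.

```python
import random

def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def partial_f(data, y, base, name):
    """F statistic for adding `name` to the model that already contains `base`."""
    full = base + [name]
    rss_full = rss([data[v] for v in full], y)
    rss_base = rss([data[v] for v in base], y)
    df_resid = len(y) - len(full) - 1
    return (rss_base - rss_full) / (rss_full / df_resid)

def stepwise(data, y, f_enter=4.0, f_remove=3.9, max_iter=100):
    selected = []
    for _ in range(max_iter):  # guard against cycling
        remaining = [v for v in data if v not in selected]
        if not remaining:
            break
        # Forward step: enter the candidate with the largest partial F
        scores = {v: partial_f(data, y, selected, v) for v in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < f_enter:
            break
        selected.append(best)
        # Backward step: drop variables whose partial F fell below f_remove
        while len(selected) > 1:
            drops = {v: partial_f(data, y, [s for s in selected if s != v], v)
                     for v in selected}
            worst = min(drops, key=drops.get)
            if drops[worst] >= f_remove:
                break
            selected.remove(worst)
    return selected

# Synthetic data: y depends on x1 and x2 only; x3 and x4 are pure noise
random.seed(0)
n = 80
data = {name: [random.gauss(0, 1) for _ in range(n)] for name in ["x1", "x2", "x3", "x4"]}
y = [3 * a - 2 * b + random.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]
selected = stepwise(data, y)
print(selected)
```

Setting `f_remove` slightly below `f_enter` means a variable that has just entered is not removed in the same step, which is a common way to avoid a variable being added and dropped repeatedly.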

def stepwise_feature_selection(X_train, y_train, variables=None):
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    if variables is None:
        variables = X_train.columns.tolist()
    y = y_train  # response variable

    selected_variables = []  # variables selected so far
    sl_enter = 0.05  # significance level to enter
    sl_remove = 0.05  # significance level to remove
    sv_per_step = []  # variables selected at each step
    adjusted_r_squared = []  # adjusted R-squared at each step
    steps = []  # step indices
    step = 0
    while len(variables) > 0:
        remainder = list(set(variables) - set(selected_variables))
        pval = pd.Series(index=remainder, dtype=float)  # p-value of each candidate
        # Fit a linear model with the already selected variables
        # plus each remaining candidate, one at a time.
        for col in remainder:
            X = X_train[selected_variables + [col]]
            X = sm.add_constant(X)
            model = sm.OLS(y, X).fit()
            pval[col] = model.pvalues[col]
        min_pval = pval.min()
        if min_pval < sl_enter:  # enter the candidate if its p-value is below the threshold
            selected_variables.append(pval.idxmin())
            # Backward step: among the selected variables,
            # choose which ones to remove.
            while len(selected_variables) > 0:
                selected_X = X_train[selected_variables]
                selected_X = sm.add_constant(selected_X)
                # Skip the first entry (the intercept, when add_constant prepends one)
                selected_pval = sm.OLS(y, selected_X).fit().pvalues.iloc[1:]
                max_pval = selected_pval.max()
                if max_pval >= sl_remove:  # remove if the largest p-value is at or above the threshold
                    remove_variable = selected_pval.idxmax()
                    selected_variables.remove(remove_variable)
                else:
                    break
            step += 1
            steps.append(step)
            adj_r_squared = sm.OLS(y, sm.add_constant(X_train[selected_variables])).fit().rsquared_adj
            adjusted_r_squared.append(adj_r_squared)
            sv_per_step.append(selected_variables.copy())
        else:
            break  # no remaining variable meets the entry threshold

    fig = plt.figure(figsize=(100, 10))
    font_size = 15
    plt.xticks(steps, [f'step {s}\n' + '\n'.join(sv_per_step[i]) for i, s in enumerate(steps)], fontsize=12)
    plt.plot(steps, adjusted_r_squared, marker='o')
    plt.ylabel('Adjusted R Squared', fontsize=font_size)
    plt.show()

    return selected_variables

selected_variables = stepwise_feature_selection(X_train, y_train)

Output (plot of adjusted R-squared at each step)


model = sm.OLS(y_train, sm.add_constant(X_train[selected_variables])).fit()
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.906
Model:                            OLS   Adj. R-squared:                  0.904
Method:                 Least Squares   F-statistic:                     470.2
Date:                Fri, 14 Jan 2022   Prob (F-statistic):               0.00
Time:                        15:11:44   Log-Likelihood:                -9663.6
No. Observations:                1148   AIC:                         1.938e+04
Df Residuals:                    1124   BIC:                         1.950e+04
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
Cylinders           -0.0319      0.002    -16.216      0.000      -0.036      -0.028
Mfg_Year             2.4462      0.667      3.670      0.000       1.138       3.754
Automatic_airco   2294.4187    178.972     12.820      0.000    1943.263    2645.575
HP                  23.3496      3.249      7.187      0.000      16.975      29.724
Weight               8.3921      1.198      7.005      0.000       6.041      10.743
KM                  -0.0162      0.001    -12.893      0.000      -0.019      -0.014
Powered_Windows    368.2483     82.493      4.464      0.000     206.391     530.106
Quarterly_Tax       13.0804      1.719      7.609      0.000       9.707      16.453
Fuel_Type_Petrol   799.8869    280.208      2.855      0.004     250.098    1349.676
Guarantee_Period    65.0485     13.921      4.673      0.000      37.735      92.362
BOVAG_Guarantee    447.1854    121.293      3.687      0.000     209.198     685.173
Sport_Model        388.0867     82.999      4.676      0.000     225.236     550.937
Tow_Bar           -232.3991     76.233     -3.049      0.002    -381.975     -82.824
Airco              203.1230     84.356      2.408      0.016      37.610     368.636
ABS               -346.3486     98.151     -3.529      0.000    -538.928    -153.769
Mfr_Guarantee      207.1914     72.093      2.874      0.004      65.739     348.644
Boardcomputer     -314.2060    113.963     -2.757      0.006    -537.809     -90.602
Fuel_Type_CNG    -1148.0595    338.728     -3.389      0.001   -1812.671    -483.448
CD_Player          254.8536     94.546      2.696      0.007      69.346     440.361
Automatic          395.4738    146.668      2.696      0.007     107.700     683.248
Mfg_Month          -99.4782      9.950     -9.998      0.000    -119.001     -79.955
Age_08_04         -121.8330      3.736    -32.612      0.000    -129.163    -114.503
Backseat_Divider  -230.7848    115.695     -1.995      0.046    -457.787      -3.782
Metallic_Rim       201.2032     88.447      2.275      0.023      27.664     374.743
Doors               80.0415     38.681      2.069      0.039       4.146     155.937
Omnibus:                       94.295   Durbin-Watson:                   2.069
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              487.180
Skew:                           0.133   Prob(JB):                    1.62e-106
Kurtosis:                       6.180   Cond. No.                     2.96e+18

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.07e-25. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
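
Both summaries report the same R-squared (0.906) and adjusted R-squared (0.904) to three decimals, even though the stepwise model has 23 model degrees of freedom instead of 33. As a reminder of how the adjustment penalizes model size, here is a minimal sketch of the standard adjusted R-squared formula; the helper name `adjusted_r2` is made up for illustration. Because the printed R-squared is rounded, plugging in 0.906 reproduces the reported values only approximately.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the two summaries above (R-squared is rounded to 3 decimals there)
full = adjusted_r2(0.906, 1148, 33)     # all features, Df Model = 33
reduced = adjusted_r2(0.906, 1148, 23)  # stepwise-selected features, Df Model = 23
print(round(full, 4), round(reduced, 4))
```

For the same R-squared, the model with fewer predictors gets the slightly higher adjusted value, since it explains as much variance with fewer terms.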


The next post covers Lasso regression, an embedded feature selection method.

2022.04.19 - [Study/Machine Learning] - Lasso Feature Selection (Embedded method) python


