This post continues from the previous wrapper-method post on Backward Feature Selection (backward elimination, python).
2022.01.13 - [공부/모델링] - Backward Feature Selection (후진제거법) python
Feature selection methods fall into three broad categories:
- Filter Method (based on correlations between features)
- Wrapper Method (fits models on candidate feature subsets and selects features based on predictive performance)
- Embedded Method (features are selected while the prediction model itself is optimized, e.g. during coefficient estimation)
This post covers stepwise selection, another wrapper method. Since the data are the same as in the previous post, only the stepwise code changes.
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('https://raw.githubusercontent.com/signature95/tistory/main/dataset/ToyotaCorolla.csv')
# Separate features and target
df = data.drop(columns=['Price', 'Id', 'Model'])
target = data['Price']
# Split into train and test sets (80 : 20)
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, shuffle=True, random_state=34)
# Fuel_Type is an object column, so one-hot encode it (3 categories)
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
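Note that calling pd.get_dummies on the train and test sets separately can produce mismatched columns when a category appears in only one of the splits. A minimal guard, using the training columns as the reference:
# Align the test dummies to the training columns; any dummy missing
# from the test split becomes an all-zero column instead of disappearing.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)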
import statsmodels.api as sm
# Baseline: fit OLS with an intercept on the full feature set
X_train = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train).fit()
print(model.summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable: Price R-squared: 0.906
Model: OLS Adj. R-squared: 0.904
Method: Least Squares F-statistic: 327.2
Date: Fri, 14 Jan 2022 Prob (F-statistic): 0.00
Time: 15:00:52 Log-Likelihood: -9659.8
No. Observations: 1148 AIC: 1.939e+04
Df Residuals: 1114 BIC: 1.956e+04
Df Model: 33
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Age_08_04 -123.0813 3.882 -31.709 0.000 -130.697 -115.465
Mfg_Month -99.3575 10.030 -9.906 0.000 -119.037 -79.678
Mfg_Year 1.9901 0.818 2.432 0.015 0.385 3.595
KM -0.0161 0.001 -12.730 0.000 -0.019 -0.014
HP 23.7477 3.362 7.064 0.000 17.151 30.344
Met_Color -27.6293 75.421 -0.366 0.714 -175.612 120.353
Automatic 425.3048 148.485 2.864 0.004 133.963 716.647
cc -0.0869 0.078 -1.116 0.265 -0.240 0.066
Doors 85.0947 39.542 2.152 0.032 7.510 162.679
Cylinders -0.0330 0.002 -15.142 0.000 -0.037 -0.029
Gears 165.4233 202.827 0.816 0.415 -232.543 563.389
Quarterly_Tax 13.0932 1.778 7.363 0.000 9.604 16.583
Weight 8.4347 1.211 6.967 0.000 6.059 10.810
Mfr_Guarantee 217.9115 73.397 2.969 0.003 73.900 361.923
BOVAG_Guarantee 429.2382 123.585 3.473 0.001 186.754 671.723
Guarantee_Period 66.0271 14.265 4.629 0.000 38.038 94.016
ABS -363.5526 126.413 -2.876 0.004 -611.586 -115.519
Airbag_1 171.9706 243.669 0.706 0.480 -306.132 650.073
Airbag_2 -25.7073 128.554 -0.200 0.842 -277.943 226.528
Airco 206.1457 87.896 2.345 0.019 33.686 378.606
Automatic_airco 2269.8082 189.464 11.980 0.000 1898.062 2641.554
Boardcomputer -336.0481 115.845 -2.901 0.004 -563.347 -108.749
CD_Player 245.9519 97.957 2.511 0.012 53.751 438.152
Central_Lock -177.1558 139.214 -1.273 0.203 -450.306 95.995
Powered_Windows 500.6008 140.942 3.552 0.000 224.059 777.143
Power_Steering 47.5105 280.255 0.170 0.865 -502.377 597.398
Radio 457.9836 657.484 0.697 0.486 -832.064 1748.031
Mistlamps 14.8378 108.159 0.137 0.891 -197.380 227.056
Sport_Model 377.4451 86.339 4.372 0.000 208.041 546.850
Backseat_Divider -259.3296 128.036 -2.025 0.043 -510.548 -8.111
Metallic_Rim 187.7640 94.034 1.997 0.046 3.260 372.268
Radio_cassette -605.8313 657.744 -0.921 0.357 -1896.388 684.726
Tow_Bar -214.7932 78.290 -2.744 0.006 -368.406 -61.181
Fuel_Type_CNG -1029.3034 213.488 -4.821 0.000 -1448.186 -610.420
Fuel_Type_Diesel 137.5537 175.440 0.784 0.433 -206.676 481.783
Fuel_Type_Petrol 891.7414 181.901 4.902 0.000 534.835 1248.648
==============================================================================
Omnibus: 95.104 Durbin-Watson: 2.068
Prob(Omnibus): 0.000 Jarque-Bera (JB): 497.925
Skew: 0.130 Prob(JB): 7.53e-109
Kurtosis: 6.216 Cond. No. 1.10e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.87e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
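Note [2] flags a nearly singular design matrix, i.e. strong multicollinearity. That is plausible here: Age_08_04 is essentially determined by Mfg_Year and Mfg_Month, and the three Fuel_Type dummies always sum to one. A quick supplementary check (not in the original post) is the variance inflation factor from statsmodels:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per column of the design matrix; very large or infinite values
# flag (near-)perfect collinearity.
X_mat = X_train.values.astype(float)
vif = pd.DataFrame({
    'feature': X_train.columns,
    'VIF': [variance_inflation_factor(X_mat, i) for i in range(X_mat.shape[1])]
})
print(vif.sort_values('VIF', ascending=False).head(10))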
Applying the stepwise method
Briefly, stepwise selection starts from a null model with no variables and adds variables one at a time, refitting the regression at each step. After each addition, it also tries removing the variables selected so far one by one, so forward and backward steps alternate until no variable meets the entry or removal criterion. Here it is implemented with OLS regression: a variable enters when its p-value falls below sl_enter (0.05) and is removed when its p-value reaches sl_remove (0.05).
def stepwise_feature_selection(X_train, y_train, variables=None):
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    import warnings
    warnings.filterwarnings("ignore")

    if variables is None:
        ## Default: all columns are candidates (excluding the intercept column added earlier)
        variables = X_train.columns.drop('const', errors='ignore').tolist()
    y = y_train                ## response variable
    selected_variables = []    ## variables selected so far
    sl_enter = 0.05            ## significance level to enter
    sl_remove = 0.05           ## significance level to remove
    sv_per_step = []           ## variables selected at each step
    adjusted_r_squared = []    ## adjusted R-squared at each step
    steps = []                 ## step indices
    step = 0

    while len(variables) > 0:
        remainder = list(set(variables) - set(selected_variables))
        pval = pd.Series(index=remainder, dtype=float)  ## p-value per candidate
        ## Fit a linear model for each remaining candidate added to
        ## the variables already selected.
        for col in remainder:
            X = X_train[selected_variables + [col]]
            X = sm.add_constant(X)
            model = sm.OLS(y, X).fit()
            pval[col] = model.pvalues[col]

        min_pval = pval.min()
        if min_pval < sl_enter:  ## add the candidate with the smallest p-value
            selected_variables.append(pval.idxmin())
            ## Backward step: check whether any selected variable
            ## should now be removed.
            while len(selected_variables) > 0:
                selected_X = X_train[selected_variables]
                selected_X = sm.add_constant(selected_X)
                selected_pval = sm.OLS(y, selected_X).fit().pvalues[1:]  ## skip the intercept
                max_pval = selected_pval.max()
                if max_pval >= sl_remove:  ## drop the variable with the largest p-value
                    remove_variable = selected_pval.idxmax()
                    selected_variables.remove(remove_variable)
                else:
                    break

            step += 1
            steps.append(step)
            adj_r_squared = sm.OLS(y, sm.add_constant(X_train[selected_variables])).fit().rsquared_adj
            adjusted_r_squared.append(adj_r_squared)
            sv_per_step.append(selected_variables.copy())
        else:
            break

    fig = plt.figure(figsize=(100, 10))
    fig.set_facecolor('white')
    font_size = 15
    plt.xticks(steps, [f'step {s}\n' + '\n'.join(sv_per_step[i]) for i, s in enumerate(steps)], fontsize=12)
    plt.plot(steps, adjusted_r_squared, marker='o')
    plt.ylabel('Adjusted R Squared', fontsize=font_size)
    plt.grid(True)
    plt.show()

    return selected_variables
selected_variables = stepwise_feature_selection(X_train, y_train)
Output: the function plots the adjusted R-squared at each step, with the selected variables as x-axis labels (figure omitted here), and returns the selected variables. Refitting OLS on those variables:
model = sm.OLS(y_train, sm.add_constant(X_train[selected_variables])).fit()
print(model.summary())
>>>
OLS Regression Results
==============================================================================
Dep. Variable: Price R-squared: 0.906
Model: OLS Adj. R-squared: 0.904
Method: Least Squares F-statistic: 470.2
Date: Fri, 14 Jan 2022 Prob (F-statistic): 0.00
Time: 15:11:44 Log-Likelihood: -9663.6
No. Observations: 1148 AIC: 1.938e+04
Df Residuals: 1124 BIC: 1.950e+04
Df Model: 23
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Cylinders -0.0319 0.002 -16.216 0.000 -0.036 -0.028
Mfg_Year 2.4462 0.667 3.670 0.000 1.138 3.754
Automatic_airco 2294.4187 178.972 12.820 0.000 1943.263 2645.575
HP 23.3496 3.249 7.187 0.000 16.975 29.724
Weight 8.3921 1.198 7.005 0.000 6.041 10.743
KM -0.0162 0.001 -12.893 0.000 -0.019 -0.014
Powered_Windows 368.2483 82.493 4.464 0.000 206.391 530.106
Quarterly_Tax 13.0804 1.719 7.609 0.000 9.707 16.453
Fuel_Type_Petrol 799.8869 280.208 2.855 0.004 250.098 1349.676
Guarantee_Period 65.0485 13.921 4.673 0.000 37.735 92.362
BOVAG_Guarantee 447.1854 121.293 3.687 0.000 209.198 685.173
Sport_Model 388.0867 82.999 4.676 0.000 225.236 550.937
Tow_Bar -232.3991 76.233 -3.049 0.002 -381.975 -82.824
Airco 203.1230 84.356 2.408 0.016 37.610 368.636
ABS -346.3486 98.151 -3.529 0.000 -538.928 -153.769
Mfr_Guarantee 207.1914 72.093 2.874 0.004 65.739 348.644
Boardcomputer -314.2060 113.963 -2.757 0.006 -537.809 -90.602
Fuel_Type_CNG -1148.0595 338.728 -3.389 0.001 -1812.671 -483.448
CD_Player 254.8536 94.546 2.696 0.007 69.346 440.361
Automatic 395.4738 146.668 2.696 0.007 107.700 683.248
Mfg_Month -99.4782 9.950 -9.998 0.000 -119.001 -79.955
Age_08_04 -121.8330 3.736 -32.612 0.000 -129.163 -114.503
Backseat_Divider -230.7848 115.695 -1.995 0.046 -457.787 -3.782
Metallic_Rim 201.2032 88.447 2.275 0.023 27.664 374.743
Doors 80.0415 38.681 2.069 0.039 4.146 155.937
==============================================================================
Omnibus: 94.295 Durbin-Watson: 2.069
Prob(Omnibus): 0.000 Jarque-Bera (JB): 487.180
Skew: 0.133 Prob(JB): 1.62e-106
Kurtosis: 6.180 Cond. No. 2.96e+18
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.07e-25. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
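Compared with the full model (Df Model 33, AIC 1.939e+04), the stepwise fit keeps adjusted R-squared at 0.904 with ten fewer model degrees of freedom and slightly lower AIC/BIC. As a closing sanity check that is not part of the original post, the selected model can also be scored on the 20% hold-out set (a minimal sketch, assuming X_test was aligned to the training dummy columns as above):
from sklearn.metrics import mean_squared_error

# Predict with the refitted stepwise model on the held-out test set
X_test_sel = sm.add_constant(X_test[selected_variables])
y_pred = model.predict(X_test_sel)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f'Test RMSE: {rmse:.2f}')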
The next post in the feature selection series covers Lasso regression, an embedded method.
2022.04.19 - [공부/머신러닝] - Lasso Feature Selection (Embedded method) python