Overview
CatBoost is said to be a GBDT algorithm that is faster and more accurate than XGBoost and LightGBM. This post records a small pitfall I ran into during installation, plus a basic usage example.
1. Installation
First install the dependency packages six and NumPy (this assumes you already have Python 3.6 or later):
pip install six
Because the NumPy that ships with Ubuntu 16.04 is fairly old, pin the NumPy version to 1.16.0 or later:
pip install numpy==1.16.0
Otherwise you will get an error like:
numpy.ufunc size changed, may indicate binary incompatibility. expected 216, got 192
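To double-check which numpy version is actually active in your environment, a quick check (plain Python, nothing CatBoost-specific):
import numpy
print(numpy.__version__)   # should be 1.16.0 or later to avoid the error above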
I recently hit the same error with TensorFlow 2.0, where my numpy version was 1.17.1; installing numpy 1.16.0 also resolved that TensorFlow 2.0 error.
Then install CatBoost with pip or conda; I usually use pip:
pip install catboost
Install the visualization tools:
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
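With ipywidgets enabled, CatBoost can draw an interactive training chart inside a Jupyter notebook via the plot=True flag of fit. A minimal sketch with made-up toy data, just to show the flag:
from catboost import CatBoostClassifier

X = [[0, 1], [1, 0], [1, 1], [0, 0]]   # toy features, for illustration only
y = [1, 0, 1, 0]                       # toy labels

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(X, y, plot=True)             # renders a live metric chart in Jupyter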
2. Binary classification demo
Below is a demo of a binary classification task.
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool
### Load the training, validation, and test sets
df_train = pd.read_csv('train_dataset.csv')
df_val = pd.read_csv('validation_dataset.csv')
df_test = pd.read_csv('test_dataset.csv')
### Build the feature list (dropping the label column, which is not a feature)
features_list = list(df_train.columns)
features_list.remove('label')
### Split into features and labels
train_x = df_train[features_list]
train_y = list(df_train.label)
val_x = df_val[features_list]
val_y = list(df_val.label)
test_x = df_test[features_list]
test_y = list(df_test.label)
At this point the features of all three datasets are plain DataFrames. With xgboost you would still have to convert them to DMatrix format, whereas CatBoost can train directly on DataFrame data, which is very convenient.
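If you do need to mark categorical feature columns, CatBoost's Pool wrapper (already imported above) handles that. A hedged sketch, where 'city' stands in for a hypothetical categorical column in your own data:
train_pool = Pool(data=train_x, label=train_y, cat_features=['city'])  # 'city' is hypothetical
val_pool = Pool(data=val_x, label=val_y, cat_features=['city'])
# model.fit(train_pool, eval_set=val_pool) then works just like the DataFrame version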
### Set the parameters
model = CatBoostClassifier(iterations=800,          # number of trees, i.e. boosting rounds
                           depth=3,                 # tree depth
                           learning_rate=1,         # learning rate
                           loss_function='Logloss', # loss function
                           eval_metric='AUC',       # evaluation metric
                           random_seed=696,         # random seed
                           reg_lambda=3,            # L2 regularization coefficient
                           #bootstrap_type='Bayesian',
                           verbose=True)
### Train the model
model.fit(train_x, train_y, eval_set=(val_x, val_y), early_stopping_rounds=10)
The output looks like this:
0: test: 0.6155599 best: 0.6155599 (0) total: 124ms remaining: 1m 38s
1: test: 0.6441688 best: 0.6441688 (1) total: 188ms remaining: 1m 15s
2: test: 0.6531472 best: 0.6531472 (2) total: 252ms remaining: 1m 6s
3: test: 0.6632070 best: 0.6632070 (3) total: 315ms remaining: 1m 2s
4: test: 0.6675112 best: 0.6675112 (4) total: 389ms remaining: 1m 1s
5: test: 0.6728938 best: 0.6728938 (5) total: 454ms remaining: 1m
6: test: 0.6770872 best: 0.6770872 (6) total: 512ms remaining: 58s
7: test: 0.6779621 best: 0.6779621 (7) total: 574ms remaining: 56.9s
8: test: 0.6794292 best: 0.6794292 (8) total: 636ms remaining: 55.9s
9: test: 0.6799766 best: 0.6799766 (9) total: 695ms remaining: 54.9s
···
70: test: 0.7060854 best: 0.7060854 (70) total: 4.54s remaining: 46.6s
71: test: 0.7066276 best: 0.7066276 (71) total: 4.58s remaining: 46.4s
72: test: 0.7071572 best: 0.7071572 (72) total: 4.63s remaining: 46.1s
73: test: 0.7066621 best: 0.7071572 (72) total: 4.68s remaining: 45.9s
74: test: 0.7058151 best: 0.7071572 (72) total: 4.74s remaining: 45.8s
75: test: 0.7057014 best: 0.7071572 (72) total: 4.78s remaining: 45.5s
76: test: 0.7056642 best: 0.7071572 (72) total: 4.82s remaining: 45.3s
77: test: 0.7054756 best: 0.7071572 (72) total: 4.86s remaining: 45s
78: test: 0.7064983 best: 0.7071572 (72) total: 4.91s remaining: 44.8s
79: test: 0.7060492 best: 0.7071572 (72) total: 4.96s remaining: 44.6s
80: test: 0.7057876 best: 0.7071572 (72) total: 5.02s remaining: 44.6s
81: test: 0.7058538 best: 0.7071572 (72) total: 5.09s remaining: 44.6s
82: test: 0.7063121 best: 0.7071572 (72) total: 5.16s remaining: 44.6s
Stopped by overfitting detector (10 iterations wait)
bestTest = 0.7071571623
bestIteration = 72
Shrink model to first 73 iterations.
Each round shows the current result and the best round so far. For a 180,000 × 500 training set, a single training run takes only about 30 s on CPU.
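The best round and its score can also be read back programmatically after training; get_best_iteration and get_best_score are part of the CatBoost model API:
print(model.get_best_iteration())   # 72 in the run above
print(model.get_best_score())       # best metric values per eval dataset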
We can then predict on the test set:
preds_proba = model.predict_proba(test_x)
For each sample we get two probabilities: column 0 is the probability of the negative class and column 1 is the probability of the positive class.
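To score the test set with the same AUC metric used during training, one option is scikit-learn's roc_auc_score on the positive-class column (a sketch, assuming scikit-learn is installed):
from sklearn.metrics import roc_auc_score

test_auc = roc_auc_score(test_y, preds_proba[:, 1])   # column 1 holds P(positive)
print(test_auc)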
We can also look at the feature importances:
model.feature_importances_
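feature_importances_ is an array aligned with the training columns, so pairing it with the feature names and sorting is straightforward (a small sketch using the variables defined above):
importances = pd.Series(model.feature_importances_, index=features_list)
print(importances.sort_values(ascending=False).head(10))   # top 10 features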
To save the model:
model.save_model('catboost_model.model')
To load it back (loading into a fresh instance rather than overwriting the trained one):
my_model = CatBoostClassifier()
my_model.load_model('catboost_model.model')
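A quick sanity check is that the reloaded model reproduces the original predictions (a sketch reusing the variables above; np was imported at the top of the demo):
assert np.allclose(model.predict_proba(test_x), my_model.predict_proba(test_x))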
This post mainly draws on the official documentation: CatBoost