GTX_AI posted on 2020-8-28 13:13:12

xgboost

xgboost:
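Learning task parameters (for the native xgboost API, not the sklearn wrapper for xgboost)
1. objective
    a: Selects the objective function. The default is mean squared error loss; there are many others, and the main ones are listed here
    b: reg:squarederror       mean squared error
    c: reg:logistic         logistic loss; see logistic regression
    d: binary:logistic      logistic regression for binary classification, outputs probabilities
    e: binary:hinge         hinge loss for binary classification; outputs 0 or 1 instead of probabilities
    f: multi:softmax          multiclass softmax loss; the num_class parameter must be set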

2. eval_metric
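    a: The metric used to evaluate the model. It mainly depends on the objective, and custom metrics are also possible; some common ones are listed below
    b: rmse: root mean square error      square root of the mean of the squared errors
    c: mae: mean absolute error      mean of the absolute errors
    d: auc: area under curve         area under the ROC curve
    e: aucpr: area under the PR curve    area under the precision-recall curve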
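
To make the first two metrics concrete, here is a small sketch that computes rmse and mae by hand in plain Python. The function names and the sample numbers are made up for illustration; xgboost computes these internally when the metric is selected through eval_metric.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean of the squared errors."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute errors."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Illustrative values only (not taken from the post's dataset)
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

print("rmse =", rmse(y_true, y_pred))
print("mae  =", mae(y_true, y_pred))
```

Because rmse squares each error before averaging, the single error of 2 weighs more heavily in it than in mae.
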
pip install xgboost -i https://pypi.douban.com/simple --trusted-host pypi.douban.com

import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
from torch.utils.data import Dataset, DataLoader
import torch.optim as torch_optim
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
import xgboost as xgb
from sklearn.metrics import accuracy_score  # accuracy
from datetime import datetime
#from dataset import ShelterOutcomeDataset, get_default_device
from dataset4 import ToArray

# Load Data
train = pd.read_csv(r'train.csv')
print("Shape:", train.shape)
train.head()

# Data preprocessing
train_X = train.drop(columns= ['OutcomeType', 'OutcomeSubtype', 'AnimalID'])
Y = train['OutcomeType']

# Collect the features to preprocess (only the training set is used here)
stacked_df = train_X
stacked_df = stacked_df.drop(columns=['DateTime'])
stacked_df.head()

# dropping columns with too many nulls
for col in stacked_df.columns:
    null_count = stacked_df[col].isnull().sum()
    if null_count > 10000:
        print("dropping", col, null_count)
        stacked_df = stacked_df.drop(columns=[col])
stacked_df.head()

# label encoding: fill missing values per column, then encode each column as integers
for col in stacked_df.columns:
    if stacked_df[col].dtype == "object":
        stacked_df[col] = stacked_df[col].fillna("NA")
    else:
        stacked_df[col] = stacked_df[col].fillna(0)
    stacked_df[col] = LabelEncoder().fit_transform(stacked_df[col])

# making all variables categorical
for col in stacked_df.columns:
    stacked_df[col] = stacked_df[col].astype('category')

# preprocessed features
X = stacked_df
# Encoding target
Y = LabelEncoder().fit_transform(Y)

XY = ToArray(X, Y)
X = XY.x
Y = XY.y

#train-valid split
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.10, random_state=0)

# xgboost
# algorithm parameters
params = {
      'objective': 'reg:squarederror',
      'max_depth': 6,
      'eta': 1.0
      }
plst = list(params.items())

dtrain = xgb.DMatrix(X_train, y_train)  # build the DMatrix dataset format
num_rounds = 100
model = xgb.train(plst, dtrain, num_rounds)  # train the xgboost model

# predict on the validation set
dval = xgb.DMatrix(X_val)
y_pred = model.predict(dval)

model.save_model('xgb.model')

# compute accuracy
#accuracy = accuracy_score(y_val, y_pred)
#print('accuracy:%.2f%%'%(accuracy*100))

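# the feature values are empty here in the post; xgb.DMatrix expects a 2-D array (one row per sample)
x_input = np.array([]).astype(np.float32)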
dtest = xgb.DMatrix(x_input)
y_test_pred = model.predict(dtest)
print('y_test_pred = ', y_test_pred)

# reload the saved model and predict on the same input
tar = xgb.Booster(model_file='xgb.model')
x_test = xgb.DMatrix(x_input)
pre = tar.predict(x_test)
print('pre = ', pre)




