|
xgboost:
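Learning task parameters (for the native xgboost API, not the sklearn wrapper for xgboost)
1. objective [default: reg:squarederror (squared error)]
a: Selects the objective function. The default is the squared-error loss; many others are available, and the main ones are listed here.
b: reg:squarederror: squared-error loss for regression
c: reg:logistic: logistic (log-odds) loss; see logistic regression
d: binary:logistic: logistic regression for binary classification; outputs probabilities
e: binary:hinge: hinge loss for binary classification; outputs 0 or 1 instead of probabilities
f: multi:softmax: softmax loss for multiclass classification; requires the num_class parameter to be set
2. eval_metric [default: depends on objective]
a: The metric used to evaluate model performance. The default depends on objective, and custom metrics can also be supplied. Some common ones are listed below.
b: rmse: root mean square error, i.e. the square root of the mean of the squared errors
c: mae: mean absolute error, i.e. the mean of the absolute errors
d: auc: area under the ROC curve
e: aucpr: area under the precision-recall (PR) curve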
- pip install xgboost -i https://pypi.douban.com/simple --trusted-host pypi.douban.com
- import pandas as pd
- import numpy as np
- from collections import Counter
- from sklearn.model_selection import train_test_split
- from sklearn.preprocessing import LabelEncoder
- import torch
- from torch.utils.data import Dataset, DataLoader
- import torch.optim as torch_optim
- import torch.nn as nn
- import torch.nn.functional as F
- from torchvision import models
- import xgboost as xgb
- from sklearn.metrics import accuracy_score  # accuracy
- from datetime import datetime
- #from dataset import ShelterOutcomeDataset, get_default_device
- from dataset4 import ToArray
- # Load Data
- train = pd.read_csv(r'train.csv')
- print("Shape:", train.shape)
- train.head()
- # Data preprocessing
- train_X = train.drop(columns= ['OutcomeType', 'OutcomeSubtype', 'AnimalID'])
- Y = train['OutcomeType']
- # Preprocess the training features (this script does not stack a test set)
- stacked_df = train_X
- stacked_df = stacked_df.drop(columns=['DateTime'])
- stacked_df.head()
- # dropping columns with too many nulls
- for col in stacked_df.columns:
-     if stacked_df[col].isnull().sum() > 10000:
-         print("dropping", col, stacked_df[col].isnull().sum())
-         stacked_df = stacked_df.drop(columns=[col])
- stacked_df.head()
- # label encoding
- for col in stacked_df.columns:
-     if stacked_df.dtypes[col] == "object":
-         stacked_df[col] = stacked_df[col].fillna("NA")
-     else:
-         stacked_df[col] = stacked_df[col].fillna(0)
-     stacked_df[col] = LabelEncoder().fit_transform(stacked_df[col])
- # making all variables categorical
- for col in stacked_df.columns:
-     stacked_df[col] = stacked_df[col].astype('category')
- # splitting back train and test
- X = stacked_df[0:26729]
- # Encoding target
- Y = LabelEncoder().fit_transform(Y)
- XY = ToArray(X, Y)
- X = XY.x
- Y = XY.y
- #train-valid split
- X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.10, random_state=0)
- # xgboost
- # Algorithm parameters
- params = {
-     'objective': 'reg:squarederror',
-     'max_depth': 6,
-     'eta': 1.0
- }
- plst = list(params.items())
- dtrain = xgb.DMatrix(X_train, y_train)  # build the DMatrix dataset format
- num_rounds = 100
- model = xgb.train(plst, dtrain, num_rounds)  # train the xgboost model
- # Predict on the validation set
- dval = xgb.DMatrix(X_val)
- y_pred = model.predict(dval)
- model.save_model('xgb.model')
- # Compute accuracy (left disabled: reg:squarederror returns continuous values,
- # while accuracy_score expects discrete class labels)
- #accuracy = accuracy_score(y_val, y_pred)
- #print('accuracy:%.2f%%'%(accuracy*100))
- # Predict a single hand-written sample with the in-memory model
- x_input = np.array([[2351.0, 1.01, 3.0, 5.0, 1221.0, 130.0]]).astype(np.float32)
- dtest = xgb.DMatrix(x_input)
- y_test_pred = model.predict(dtest)
- print('y_test_pred = ', y_test_pred[0])
- # Reload the saved model from disk and predict the same sample
- tar = xgb.Booster(model_file='xgb.model')
- x_test = xgb.DMatrix(x_input)
- pre=tar.predict(x_test)
- print('pre = ', pre[0])
|
|