Wednesday, December 20, 2017

Saving an xgboost model as a PMML file

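An xgboost binary model file can be converted to a PMML file with https://github.com/jpmml/jpmml-xgboost,
but how do you generate the feature map?
The only official explanation on this point is Tianqi Chen's reply:
https://github.com/dmlc/xgboost/issues/256
The key point is in that reply.
The open-source converter mentioned above expects a feature map file like this: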
0 cyl int
1 disp q
2 hp q
3 drat q
4 wt q
5 qsec q
6 vs=0 i
7 vs=1 i
8 am=0 i
9 am=1 i
10 gear int
11 carb int
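A few points to note:
1. The fields on each line must be separated by a tab (\t).
2. int marks an integer feature.
3. q marks a quantitative (float) feature.
4. i marks a binary indicator feature, i.e. the 0/1 integer column produced by binarizing a categorical variable.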
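
The rules above can be sketched in a few lines of base R. This is only an illustration of the file layout, not the code r2pmml uses internally; `make_fmap_lines` and the `demo` data frame are hypothetical names introduced here, and the mapping (integer column to int, other numeric column to q, one i line per factor level) is an assumption based on the sample file.

```r
# Hypothetical helper: build feature-map lines from a data frame.
# Assumed mapping: integer column -> "int", other numeric column -> "q",
# factor column -> one "name=level" line of type "i" per level.
make_fmap_lines <- function(df) {
  lines <- character(0)
  idx <- 0  # feature indices start at 0
  for (name in names(df)) {
    col <- df[[name]]
    if (is.factor(col)) {
      for (lev in levels(col)) {
        lines <- c(lines, paste(idx, paste0(name, "=", lev), "i", sep = "\t"))
        idx <- idx + 1
      }
    } else if (is.integer(col)) {
      lines <- c(lines, paste(idx, name, "int", sep = "\t"))
      idx <- idx + 1
    } else if (is.numeric(col)) {
      lines <- c(lines, paste(idx, name, "q", sep = "\t"))
      idx <- idx + 1
    } else {
      stop("unsupported column type: ", name)
    }
  }
  lines
}

# Small made-up data frame, loosely modelled on mtcars
demo <- data.frame(
  cyl  = c(6L, 4L, 8L),
  disp = c(160, 108, 360),
  vs   = factor(c(0, 1, 0))
)

fmap_lines <- make_fmap_lines(demo)

# Write the tab-separated lines to a temporary file
demo_path <- file.path(tempdir(), "demo.fmap")
writeLines(fmap_lines, demo_path)
```

The factor column vs expands into two indicator lines (vs=0 and vs=1), which is why the sample file has more lines than the data has columns.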

To test this, the following code shows how such a file is generated:
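# macOS only: make the R.app GUI use an English UTF-8 locale
system('defaults write org.R-project.R force.LANG en_US.UTF-8')
install.packages("devtools")
library("devtools")
# GitHub no longer serves the unauthenticated git:// protocol, so use https://
install_git("https://github.com/jpmml/r2pmml.git")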

library("r2pmml")
library("xgboost")

data(mtcars)

# Convert selected columns from numeric datatype to integer or factor datatypes
mtcars$cyl = as.integer(mtcars$cyl)
mtcars$vs = as.factor(mtcars$vs)
mtcars$am = as.factor(mtcars$am)
mtcars$gear = as.integer(mtcars$gear)
mtcars$carb = as.integer(mtcars$carb)

mpg_y = mtcars[, 1]
mpg_X = mtcars[, 2:ncol(mtcars)]

# Generate feature map
mpg.fmap = r2pmml::genFMap(mpg_X)
r2pmml::writeFMap(mpg.fmap, "xgboost.fmap")

# Generate DMatrix
mpg.dmatrix = r2pmml::genDMatrix(mpg_y, mpg_X, "xgboost.svm")

set.seed(31)

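# Train a boosted regression model with the squared-error objective
# (newer xgboost releases renamed "reg:linear" to "reg:squarederror")
mpg.xgb = xgboost(data = mpg.dmatrix, objective = "reg:linear", nrounds = 7)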

# Save the model in XGBoost proprietary binary format
xgb.save(mpg.xgb, "xgboost.model")

# Dump the model in text format
xgb.dump(mpg.xgb, "xgboost.model.txt", fmap = "xgboost.fmap")
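
Before handing a feature map to the converter, it can help to confirm that the file really follows the rules listed earlier. The check below is a minimal sketch in base R; `check_fmap` is a hypothetical helper, and the requirements it enforces (exactly three tab-separated fields, indices running 0, 1, 2, ..., and a type of int, q or i) are assumptions taken from the sample file, not a specification of what jpmml-xgboost accepts.

```r
# Hypothetical validator for a feature-map file.
# Returns TRUE only if every line has three tab-separated fields,
# the indices count up from 0, and each type is one of int, q, i.
check_fmap <- function(path) {
  lines <- readLines(path)
  fields <- strsplit(lines, "\t", fixed = TRUE)
  if (!all(lengths(fields) == 3)) {
    return(FALSE)
  }
  idx <- suppressWarnings(as.integer(vapply(fields, `[`, character(1), 1)))
  types <- vapply(fields, `[`, character(1), 3)
  identical(idx, seq_along(lines) - 1L) &&
    all(types %in% c("int", "q", "i"))
}

# A tab-separated file passes ...
good_path <- file.path(tempdir(), "good.fmap")
writeLines(c("0\tcyl\tint", "1\tdisp\tq", "2\tvs=0\ti", "3\tvs=1\ti"), good_path)
good_ok <- check_fmap(good_path)

# ... while the same content separated by spaces does not
bad_path <- file.path(tempdir(), "bad.fmap")
writeLines(c("0 cyl int", "1 disp q"), bad_path)
bad_ok <- check_fmap(bad_path)
```

Running `check_fmap("xgboost.fmap")` on the file written by the script above would be the natural use, assuming that script has been run in the same working directory.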

Displaying Chinese text in ggplot2 plots on macOS

library(readr)
segsimcat <- read_delim("~/segsimcat.txt",
                        "\t", escape_double = FALSE, trim_ws = TRUE)
View(segsimcat)


colnames(segsimcat)
library(ggplot2)
library(dplyr)
unique(segsimcat$name)
ggplot(filter(segsimcat,name=="美食"), aes(sim)) +
  geom_histogram(bins = 100)

ggplot(segsimcat, aes(sim, colour = name)) +
  geom_freqpoly(bins = 100)+theme(text = element_text(family = "STHeiti"))
# binwidth = 500
ggplot(segsimcat, aes(sim, fill = name)) +
  geom_histogram(bins = 100)+theme(text = element_text(family = "STHeiti"))


# Draw one histogram per category, using the category name as the title
for (x in unique(segsimcat$name)) {
  print(
    ggplot(filter(segsimcat, name == x), aes(sim)) +
      geom_histogram(bins = 100) +
      ggtitle(x) +
      theme(text = element_text(family = "STHeiti"))
  )
}



Chinese display issue: http://not.farbox.com/post/chinese-character-font-mac-rlan


Chinese text issues when rendering rmarkdown to PDF:
https://github.com/yixuan/showtext
https://statr.me/2014/01/using-system-fonts-in-r-graphs/
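
As a rough sketch of what the showtext approach from these links looks like, the snippet below registers a font file and turns on showtext rendering. The helper name `enable_cjk_font`, the family name "cjk", and the font path are all assumptions made for illustration; the path is a typical macOS location for STHeiti and may differ on your machine. The function returns FALSE instead of failing when showtext, sysfonts, or the font file is missing.

```r
# Hypothetical helper: register a CJK font file and enable showtext.
# Returns TRUE on success, FALSE if the packages or the font file are missing.
enable_cjk_font <- function(font_path, family = "cjk") {
  if (!requireNamespace("showtext", quietly = TRUE) ||
      !requireNamespace("sysfonts", quietly = TRUE) ||
      !file.exists(font_path)) {
    return(FALSE)
  }
  # Register the font file under the given family name
  sysfonts::font_add(family, regular = font_path)
  # Render text in subsequent plots through showtext
  showtext::showtext_auto()
  TRUE
}

# Assumed font location on macOS; adjust to a font that exists locally
enabled <- enable_cjk_font("/System/Library/Fonts/STHeiti Medium.ttc")
```

If `enabled` is TRUE, plots can then refer to the registered family, for example `theme(text = element_text(family = "cjk"))` in ggplot2. In an rmarkdown document, setting the knitr chunk option `fig.showtext = TRUE` is the usual way to apply showtext to figures in that chunk.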