Wednesday, December 20, 2017

Saving an xgboost model as a PMML file

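An xgboost binary model file can be converted to a PMML file with https://github.com/jpmml/jpmml-xgboost,
but how is the feature map generated?
The only official explanation on this point is Tianqi Chen's reply:
https://github.com/dmlc/xgboost/issues/256
The key point is in that reply.
When the open-source converter above accepts a feature map, it expects a file like this: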
0 cyl int
1 disp q
2 hp q
3 drat q
4 wt q
5 qsec q
6 vs=0 i
7 vs=1 i
8 am=0 i
9 am=1 i
10 gear int
11 carb int
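A few points to note:
1. Fields must be separated by a tab (\t).
2. int marks an integer feature.
3. q marks a float (quantitative) feature.
4. i marks a binary indicator feature, i.e. an int produced by binarizing a categorical column.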
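
The format rules above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not part of jpmml-xgboost or r2pmml: the function name build_fmap and the (name, kind) column description are hypothetical, and a categorical column is assumed to expand into one indicator line per level.

```python
def build_fmap(columns):
    """Build feature map lines from (name, kind) pairs.

    kind is "int", "float", or a list of category levels (hypothetical
    convention for this sketch). Fields are joined with tabs.
    """
    lines = []
    index = 0
    for name, kind in columns:
        if kind == "int":
            lines.append(f"{index}\t{name}\tint")
            index += 1
        elif kind == "float":
            lines.append(f"{index}\t{name}\tq")
            index += 1
        else:
            # Categorical column: one indicator ("i") feature per level.
            for level in kind:
                lines.append(f"{index}\t{name}={level}\ti")
                index += 1
    return lines


def write_fmap(columns, path):
    """Write the feature map to path, one feature per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_fmap(columns)) + "\n")
```

For example, build_fmap([("cyl", "int"), ("disp", "float"), ("vs", [0, 1])]) gives four lines, with vs expanded into vs=0 and vs=1.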

To test this, run the following code:
system('defaults write org.R-project.R force.LANG en_US.UTF-8')
install.packages("devtools")
library("devtools")
install_git("git://github.com/jpmml/r2pmml.git")

library("r2pmml")
library("xgboost")

data(mtcars)

# Convert selected columns from numeric datatype to integer or factor datatypes
mtcars$cyl = as.integer(mtcars$cyl)
mtcars$vs = as.factor(mtcars$vs)
mtcars$am = as.factor(mtcars$am)
mtcars$gear = as.integer(mtcars$gear)
mtcars$carb = as.integer(mtcars$carb)

mpg_y = mtcars[, 1]
mpg_X = mtcars[, 2:ncol(mtcars)]

# Generate feature map
mpg.fmap = r2pmml::genFMap(mpg_X)
r2pmml::writeFMap(mpg.fmap, "xgboost.fmap")

# Generate DMatrix
mpg.dmatrix = r2pmml::genDMatrix(mpg_y, mpg_X, "xgboost.svm")

set.seed(31)

# Train a linear regression model
mpg.xgb = xgboost(data = mpg.dmatrix, objective = "reg:linear", nrounds = 7)

# Save the model in XGBoost proprietary binary format
xgb.save(mpg.xgb, "xgboost.model")

# Dump the model in text format
xgb.dump(mpg.xgb, "xgboost.model.txt", fmap = "xgboost.fmap")

Displaying Chinese text in ggplot2 plots on a Mac

library(readr)
segsimcat <- read_delim("~/segsimcat.txt",
                        "\t", escape_double = FALSE, trim_ws = TRUE)
View(segsimcat)


colnames(segsimcat)
library(ggplot2)
library(dplyr)
unique(segsimcat$name)
ggplot(filter(segsimcat,name=="美食"), aes(sim)) +
  geom_histogram(bins = 100)

ggplot(segsimcat, aes(sim, colour = name)) +
  geom_freqpoly(bins = 100)+theme(text = element_text(family = "STHeiti"))
# binwidth = 500
ggplot(segsimcat, aes(sim, fill = name)) +
  geom_histogram(bins = 100)+theme(text = element_text(family = "STHeiti"))


for(x in unique(segsimcat$name)){
print(ggplot(filter(segsimcat,name==x), aes(sim)) +
  geom_histogram(bins = 100)+ggtitle(x)+theme(text = element_text(family = "STHeiti"))
  )}



Chinese font display issue: http://not.farbox.com/post/chinese-character-font-mac-rlan


Chinese text when rendering rmarkdown to PDF:
https://github.com/yixuan/showtext
https://statr.me/2014/01/using-system-fonts-in-r-graphs/

Friday, September 15, 2017

word2vec compiler warnings

gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:29:109: warning: unused variable ‘ch’ [-Wunused-variable]
   char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                             ^
chmod +x *.sh

Friday, August 11, 2017

Connecting to a Google Cloud VM (other hosts are similar)

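1. ssh-keygen -C 'email address' -f ./aaa
This generates aaa and aaa.pub in the current directory.
2. Running ssh-keygen with no arguments generates id_rsa and id_rsa.pub under .ssh.
3. Create the file authorized_keys under .ssh and paste the contents of aaa.pub (the public key generated above) into it.
4. Open Xshell; under user authentication choose Public Key, and for the user key select the local file aaa (the same aaa as above).
5. You can then connect directly; WinSCP works the same way.
6. You also need to edit the Google Cloud VM and add the public key from aaa.pub under SSH keys.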

Saturday, July 29, 2017

Setting up shadowsocks

See
https://xfabs.github.io/2016/08/02/shadowsocks/


The main steps: apt-get install python-pip
pip install shadowsocks


{
    "server":"0.0.0.0",
    "server_port":1111,
    "local_address": "127.0.0.1",
    "local_port":1080,
    "password":"mima",
    "timeout":300,
    "method":"aes-256-cfb",
    "fast_open": false
}
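
A mistyped key name (a stray character inside the quotes, say) is still valid JSON, so a parser will not complain about it. A small check like the one below can catch that before starting the server. It is only a sketch: the function name unexpected_keys is hypothetical, and the set of expected keys is taken from the sample config in this post, not from the shadowsocks documentation.

```python
import json

# Keys used in the sample config of this post; treat the set as an assumption.
EXPECTED_KEYS = {
    "server", "server_port", "local_address", "local_port",
    "password", "timeout", "method", "fast_open",
}


def unexpected_keys(config_text):
    """Return the sorted keys of a JSON config that are not in EXPECTED_KEYS."""
    config = json.loads(config_text)
    return sorted(set(config) - EXPECTED_KEYS)
```

An empty list means every key is one of the names listed above; anything else is worth a second look.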


ssserver -c /etc/ss.json -d start
ssserver -c /etc/ss.json -d stop

https://blog.jamespan.me/posts/enable-tcp-bbr-on-google-cloud

Be sure to enable BBR as described in the link above:
sudo bash -c 'echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf'
sudo bash -c 'echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf'
sudo sysctl -p

Tuesday, July 25, 2017

Friday, July 21, 2017

Quicksort in Go

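package main

import "fmt"

// qsort sorts data in place, using the first element as the pivot.
func qsort(data []int) {
	if len(data) <= 1 {
		return
	}
	mid := data[0]
	head, tail := 0, len(data)-1
	for i := 1; i <= tail; {
		if data[i] > mid {
			// Larger than the pivot: move it to the tail region.
			data[i], data[tail] = data[tail], data[i]
			tail--
		} else {
			// Not larger: move it in front of the pivot.
			data[i], data[head] = data[head], data[i]
			head++
			i++
		}
	}
	data[head] = mid
	qsort(data[:head])
	qsort(data[head+1:])
}

func main() {
	fmt.Println("Hello, 世界")

	a := []int{1, 2, 3}
	qsort(a)
	// range yields (index, value); print the values, not the indices.
	for _, x := range a {
		fmt.Println(x)
	}
}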

Thursday, July 6, 2017

Scala error

java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef

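This is a Scala version mismatch: typically the code was compiled against a newer Scala version (for example 2.11) than the scala-library it runs on (for example 2.10).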

Tuesday, June 27, 2017

SSE operations

https://msdn.microsoft.com/en-us/library/ybhzf6dk(v=vs.71).aspx
https://msdn.microsoft.com/en-us/library/yd9wecaa(v=vs.120).aspx
https://msdn.microsoft.com/en-us/library/2x1se8ha(VS.80).aspx
https://msdn.microsoft.com/en-us/library/22kbk6t9(VS.80).aspx
https://msdn.microsoft.com/en-us/library/c9848chc(v=vs.90).aspx
https://msdn.microsoft.com/en-us/library/tk1t2tbz(v=vs.80).aspx

Thursday, June 22, 2017

Compiling libffm

Environment: CentOS 6.5 with the stock gcc 4.4.7

Compiling libffm requires C++11 support, so install the following:
GCC 5.2
wget https://copr.fedoraproject.org/coprs/hhorak/devtoolset-4-rebuild-bootstrap/repo/epel-6/hhorak-devtoolset-4-rebuild-bootstrap-epel-6.repo -O /etc/yum.repos.d/devtools-4.repo
yum install devtoolset-4-gcc devtoolset-4-binutils devtoolset-4-gcc-c++ -y
Before compiling, set these temporarily:
export CC=/opt/rh/devtoolset-4/root/usr/bin/gcc
export CPP=/opt/rh/devtoolset-4/root/usr/bin/cpp
export CXX=/opt/rh/devtoolset-4/root/usr/bin/c++

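Running make directly showed that the new g++ was not used: an assignment inside a Makefile takes precedence over an environment variable of the same name.
So edit the Makefile, changing
CXX = g++
to
CXX = /opt/rh/devtoolset-4/root/usr/bin/g++
After that the build succeeded.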

Common commands for running Java

java -cp HelloWorld.jar org.test.HelloWorld
java -jar *.jar

Monday, June 5, 2017

Using thrift

http://blog.csdn.net/lener0611/article/details/50546569


http://sunliwen.com/2012/02/apache-thrift-on-ubuntu-10-04/

Wednesday, April 26, 2017

MongoDB downloads

Download location for every 64-bit MongoDB version:
http://dl.mongodb.org/dl/win32/x86_64

Saturday, April 22, 2017

eclipse package explorer hierarchical

Hierarchical display

http://stackoverflow.com/questions/3915961/how-to-view-hierarchical-package-structure-in-eclipse-package-explorer

See the link above; the shortcut is Ctrl+F10.

Opening multiple projects in IntelliJ IDEA

http://blog.csdn.net/zht666/article/details/47831893

Ways to run a jar with Java

http://www.cnblogs.com/adolfmc/archive/2012/10/07/2713562.html


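After packaging with OneJar, java -jar ****.jar always runs the designated main method. The project now has many other main methods, and I want to run a specific one. The following command does that: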

java -classpath ****.jar ****.****.className [args]

Thursday, April 6, 2017

Adding your own modules to Python

  1. How to locate a module's file
import codecs
codecs.__file__

Print the Python system path
import sys
sys.path

  2. Add your own modules

There are a few ways. One of the simplest is to create a my-paths.pth file. This is just a file with the extension .pth that you put into your system site-packages directory.

In other words, add a file my-paths.pth to site-packages and list the directories of your own modules in it, one per line.
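
A minimal sketch of that step in Python, assuming you pass in the site-packages directory yourself. The helper name add_pth_entry is hypothetical; all it does is append one directory per line to the .pth file and skip entries that are already there.

```python
import os


def add_pth_entry(site_dir, module_dir, pth_name="my-paths.pth"):
    """Append module_dir to a .pth file in site_dir and return the file's path."""
    pth_path = os.path.join(site_dir, pth_name)
    entry = os.path.abspath(module_dir)

    # Read existing entries so the same directory is not listed twice.
    existing = []
    if os.path.exists(pth_path):
        with open(pth_path, encoding="utf-8") as handle:
            existing = [line.strip() for line in handle]

    if entry not in existing:
        with open(pth_path, "a", encoding="utf-8") as handle:
            handle.write(entry + "\n")
    return pth_path
```

On many installations site.getsitepackages() lists candidate site-packages directories, though that may not hold in every environment. The new path is picked up the next time the interpreter starts.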




Extending Python's built-in functions

Tuesday, March 28, 2017

Recovering a dropped Hive table and its files

1. For the difference between Hive external and internal (managed) tables, see
http://stackoverflow.com/questions/38318513/does-drop-partition-delete-data-from-external-table-in-hive

In short, when an internal table is dropped, its files are deleted as well.

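That is what happened to me: I created an internal table, dropped it, and the files were deleted too. To recover them, copy or mv them out of /user/<user>/.Trash. I was operating as root, so <user> becomes root.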

For details see http://stackoverflow.com/questions/20114597/hive-files-on-hdfs-not-being-deleted-when-managed-not-external-table-is-droppe

2. We read feedback logs from Kafka with Spark Streaming and write them to HDFS as ORC files.
CREATE EXTERNAL TABLE test (
    foo string,
    time string,
    bar string
)  PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';
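The table is created as above. For ORC files I used TERMINATED BY ',', and a STORED AS ORC clause must be added as well.
Each day's partition then has to be added like this before it can be queried:
ALTER TABLE test
    ADD PARTITION (dt='2014-03-05')
    location 's3://test.com/2014-03-05';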

Method taken from http://stackoverflow.com/questions/22220837/how-to-add-partition-using-hive-by-a-specific-date
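
Since one partition has to be added per day, the statements can be generated rather than typed. Below is a rough Python sketch: the function name and its parameters are hypothetical, the dt=YYYY-MM-DD layout is assumed from the example above, and IF NOT EXISTS is my addition so that re-running the statements is harmless.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_location, start, days):
    """Return one ALTER TABLE ... ADD PARTITION statement per day."""
    base = base_location.rstrip("/")
    statements = []
    for offset in range(days):
        dt = (start + timedelta(days=offset)).isoformat()
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (dt='{dt}') "
            f"LOCATION '{base}/{dt}';"
        )
    return statements
```

The resulting strings can be written to a file and run with the hive CLI, or passed to whatever client the daily job already uses.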



ref
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_alter_table.html
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Tuesday, March 14, 2017

Installing keras, tensorflow, theano

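Environment: Windows 10
1. Install tensorflow: just download the official whl file. The Python version must be 3.5; Anaconda3-4.2.0-Windows-x86_64 works.
2. Install theano: first run conda install libpython, which creates libpython35.dll.a under C:\Anaconda3\libs. After that, pip install keras installs theano automatically.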

Monday, February 20, 2017

Using liblinear

http://u.cs.biu.ac.il/~89-680/liblinear/using_liblinear.html


http://u.cs.biu.ac.il/~89-680/liblinear/liblin.py

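The liblinear introduction above is written in a very accessible way and deepened my understanding of the library. When I visited the site itself I could not read the text, so I looked it up on Baidu: "Bar-Ilan University was founded in 1955 in Ramat Gan and is currently the second-largest academic institution in Israel."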


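On the penalty parameter C:

The larger C is, the more prone the model is to overfitting.
The smaller C is, the more prone the model is to underfitting.

The exact formula and derivation of the penalty term still deserve a careful read.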
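
To make the role of C concrete, here is a small Python sketch of one of the objectives liblinear solves, the L2-regularized hinge-loss SVM: 0.5 * w·w + C * Σ max(0, 1 - y_i * w·x_i). The function name is hypothetical and this is only the objective value, not liblinear's solver.

```python
def l2_svm_objective(w, samples, C):
    """Return 0.5 * ||w||^2 + C * sum of hinge losses.

    samples is a list of (x, y) pairs with y in {-1, +1}.
    """
    regularizer = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in samples:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        hinge_total += max(0.0, 1.0 - margin)
    return regularizer + C * hinge_total
```

With a large C the loss term dominates, so the optimizer works harder to fit every training point (risk of overfitting). With a small C the regularizer dominates and w is kept small (risk of underfitting).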



Q: How to select the regularization parameter C?
After version 2.0, an option -C is provided to find C. For example, you can run
> train -C data_file  
to find the C value with the best CV rate.


There is also this article:

http://dataunion.org/15669.html

Effect of C and σ² on an SVM:
  • Large C (small λ): may overfit (low bias, high variance);
  • Small C (large λ): may underfit (high bias, low variance);
  • Large σ²: high bias, low variance;
  • Small σ²: low bias, high variance.

Sunday, February 19, 2017

A ConfigParser pitfall under Python 2.6

import ConfigParser
import codecs
import os

#category_cfg_file="category.cfg"

category_cfg = ConfigParser.ConfigParser()
with codecs.open('feature_common.config', 'r',encoding='utf-8') as configfile:
    category_cfg.readfp(configfile)
   
   
features = category_cfg.sections()


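Under Python 2.7 the code above returns features in the same order as the config file. The server, however, runs Python 2.6, where the returned order turned out to be arbitrary. It took a long time to track this down and it nearly drove me crazy.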

http://stackoverflow.com/questions/1134071/keep-configparser-output-files-sorted
I found the answer there; the fix is:

In Python 2.7, you can from collections import OrderedDict and use ConfigParser(dict_type=OrderedDict) to make the parser use an ordered dictionary. Based on a minimal amount of testing, I think this should serve you well.
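
The same idea, sketched for Python 3, where the module is called configparser (an assumption on my part, since the post itself is about Python 2). The helper name ordered_sections is hypothetical. Passing dict_type=OrderedDict makes the ordering explicit instead of relying on the default.

```python
import configparser
from collections import OrderedDict


def ordered_sections(config_text):
    """Parse config_text and return section names in file order."""
    parser = configparser.ConfigParser(dict_type=OrderedDict)
    parser.read_string(config_text)
    return parser.sections()
```

Calling ordered_sections("[zeta]\na = 1\n[alpha]\nb = 2\n") returns the sections in the order they appear in the text, not sorted by name.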

Wednesday, February 15, 2017

Passing parameters in shell

http://stackoverflow.com/questions/415677/how-to-replace-placeholders-in-a-text-file

https://my.oschina.net/iuranus/blog/279985

Monday, February 13, 2017

Installing MySQLdb

## CentOS
sudo yum install mysql-devel
sudo pip install mysql-python


## Ubuntu
Starting with a vanilla Lucid install, install pip and upgrade to the latest version:
apt-get install python-pip
pip install -U pip
Next, install the required development packages:
apt-get install python-dev libmysqlclient-dev
then
pip install MySQL-python
should complete successfully.


Installing on macOS
$ brew uninstall mysql
$ brew install mysql-connector-c
$ brew unlink mysql-connector-c
$ brew install mysql
$ pip install mysql-python

Saturday, January 7, 2017

Scala: the first deep pitfall

http://stackoverflow.com/questions/23722048/intellij-idea-13-new-scala-sbt-project-hasnt-src-directory-structure-generated


2. Open the terminal in IntelliJ:
Alt+F12

Thursday, January 5, 2017

A redis-py-cluster pitfall

I ran the example from the docs and got an error; plain redis-py on its own gave an error too.
>>> from rediscluster import StrictRedisCluster

>>> # Requires at least one node for cluster discovery. Multiple nodes is recommended.
>>> startup_nodes = [{"host": "127.0.0.1", "port": "7000"}]

>>> # Note: decode_responses must be set to True when used with python3
>>> rc = StrictRedisCluster(startup_nodes=startup_nodes, decode_responses=True)

>>> rc.set("foo", "bar")
True
>>> print(rc.get("foo"))
'bar'

The cluster error was:
ResponseError: unknown command 'CONFIG'

Maddening. I could not find an answer anywhere.
https://www.digitalocean.com/community/tutorials/how-to-secure-your-redis-installation-on-ubuntu-14-04

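After reading this article I realized that the server can be configured to disable certain commands. I then searched the redis-py-cluster source and found this: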

If your redis instance is configured to not have the `CONFIG ...` comannds enabled due to security reasons you need to pass this into the client object `skip_full_coverage_check=True`. Benefits is that the client class no longer requires the `CONFIG ...` commands to be enabled on the server. Downsides is that you can't use the option in your redis server and still use the same feature in this client.

Adding skip_full_coverage_check=True fixed it.
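
Put together, the working client setup looks roughly like this. It is a sketch: the helper names cluster_kwargs and connect are hypothetical, the host and port are the ones from the docs example, and StrictRedisCluster is the class name used by the redis-py-cluster version in this post.

```python
def cluster_kwargs(hosts, skip_coverage_check=True):
    """Build keyword arguments for StrictRedisCluster from (host, port) pairs."""
    startup_nodes = [{"host": host, "port": str(port)} for host, port in hosts]
    return {
        "startup_nodes": startup_nodes,
        "decode_responses": True,
        # Avoids the CONFIG command, which the server may have disabled.
        "skip_full_coverage_check": skip_coverage_check,
    }


def connect(hosts):
    """Create the cluster client; needs redis-py-cluster and a reachable cluster."""
    from rediscluster import StrictRedisCluster  # imported lazily on purpose
    return StrictRedisCluster(**cluster_kwargs(hosts))
```

Usage would be rc = connect([("127.0.0.1", 7000)]), followed by rc.set("foo", "bar") as in the docs example.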