Tuesday, December 26, 2017

Clearing formatting in a blog post

1. Open the post in HTML view.
2. Click the text "x" (remove formatting) button at the top right.
Reference: http://www.cookistry.com/2013/01/blogger-tip-get-rid-of-weird-formatting.html

Wednesday, December 20, 2017

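Saving an XGBoost model as a PMML file

An XGBoost binary model file can be converted to a PMML file with https://github.com/jpmml/jpmml-xgboost.

But how do you generate the feature map?

The only official explanation on this point is a reply from Tianqi Chen: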

https://github.com/dmlc/xgboost/issues/256

The key point is in that reply.

The open-source converter above expects a feature map file like this:

0 cyl int

1 disp q

2 hp q

3 drat q

4 wt q

5 qsec q

6 vs=0 i

7 vs=1 i

8 am=0 i

9 am=1 i

10 gear int

11 carb int

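A few notes:

1. Fields must be separated by a tab (\t).

2. int denotes an integer feature.

3. q denotes a quantitative (float) feature.

4. i denotes a binary indicator, i.e. an int produced by binarizing a categorical feature.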
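The format described in these notes can be sketched in a few lines of Python. This is only an illustration: the helper names format_fmap and write_fmap are hypothetical and are not part of XGBoost or jpmml-xgboost. They just emit "index, name, type" separated by tabs.

```python
# Hypothetical helpers (not part of any library): build an XGBoost-style
# feature map, one "<index>\t<name>\t<type>" line per feature.
# Types follow the notes above: "int", "q" (float) or "i" (binary indicator).

VALID_TYPES = {"int", "q", "i"}


def format_fmap(features):
    """Return the feature map text for a list of (name, type) pairs."""
    lines = []
    for index, (name, ftype) in enumerate(features):
        if ftype not in VALID_TYPES:
            raise ValueError("unknown feature type: %r" % ftype)
        # The separator must be a tab, not spaces.
        lines.append("%d\t%s\t%s" % (index, name, ftype))
    return "\n".join(lines) + "\n"


def write_fmap(features, path):
    """Write the feature map text to the given path."""
    with open(path, "w") as handle:
        handle.write(format_fmap(features))


example = [("cyl", "int"), ("disp", "q"), ("vs=0", "i"), ("vs=1", "i")]
print(format_fmap(example), end="")
```

Keeping the formatting separate from the file write makes it easy to check for the tab separator before handing the file to the converter.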

The format above can be verified with the following R code, which generates a feature map:

system('defaults write org.R-project.R force.LANG en_US.UTF-8')

install.packages("devtools")

library("devtools")

install_github("jpmml/r2pmml")

library("r2pmml")

library("xgboost")

data(mtcars)

# Convert selected columns from numeric datatype to integer or factor datatypes

mtcars$cyl = as.integer(mtcars$cyl)

mtcars$vs = as.factor(mtcars$vs)

mtcars$am = as.factor(mtcars$am)

mtcars$gear = as.integer(mtcars$gear)

mtcars$carb = as.integer(mtcars$carb)

mpg_y = mtcars[, 1]

mpg_X = mtcars[, 2:ncol(mtcars)]

# Generate feature map

mpg.fmap = r2pmml::genFMap(mpg_X)

r2pmml::writeFMap(mpg.fmap, "xgboost.fmap")

# Generate DMatrix

mpg.dmatrix = r2pmml::genDMatrix(mpg_y, mpg_X, "xgboost.svm")

set.seed(31)

# Train a linear regression model

mpg.xgb = xgboost(data = mpg.dmatrix, objective = "reg:linear", nrounds = 7)

# Save the model in XGBoost proprietary binary format

xgb.save(mpg.xgb, "xgboost.model")

# Dump the model in text format

xgb.dump(mpg.xgb, "xgboost.model.txt", fmap = "xgboost.fmap")

library(readr)
segsimcat <- read_delim("~/segsimcat.txt",
"\t", escape_double = FALSE, trim_ws = TRUE)
View(segsimcat)

colnames(segsimcat)
library(ggplot2)
library(dplyr)
unique(segsimcat$name)
ggplot(filter(segsimcat,name=="美食"), aes(sim)) +
geom_histogram(bins = 100)

ggplot(segsimcat, aes(sim, colour = name)) +
geom_freqpoly(bins = 100)+theme(text = element_text(family = "STHeiti"))
# binwidth = 500
ggplot(segsimcat, aes(sim, fill = name)) +
geom_histogram(bins = 100)+theme(text = element_text(family = "STHeiti"))

for(x in unique(segsimcat$name)){
print(ggplot(filter(segsimcat,name==x), aes(sim)) +
geom_histogram(bins = 100)+ggtitle(x)+theme(text = element_text(family = "STHeiti"))
)}

Key references: for loops https://www.r-bloggers.com/how-to-write-the-first-for-loop-in-r/

http://www.reed.edu/data-at-reed/resources/R/loops_with_ggplot2.html

Adding a title: https://stackoverflow.com/questions/40675778/center-plot-title-in-ggplot2

ggplot2 inside a for loop: https://stackoverflow.com/questions/15678261/r-ggplot-does-not-work-if-it-is-inside-a-for-loop-although-it-works-outside-of

Displaying Chinese characters: http://not.farbox.com/post/chinese-character-font-mac-rlan

Chinese text when rendering R Markdown to PDF:
https://github.com/yixuan/showtext
https://statr.me/2014/01/using-system-fonts-in-r-graphs/

Friday, September 15, 2017

word2vec compiler warnings

gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
char ch;
^
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
char ch;
^
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:29:109: warning: unused variable ‘ch’ [-Wunused-variable]
char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
^
chmod +x *.sh

Friday, August 11, 2017

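Connecting to a Google Cloud VM (other hosts are similar)

1. Run ssh-keygen -C 'email address' -f ./aaa
This creates aaa (the private key) and aaa.pub (the public key) in the current directory.
2. Running ssh-keygen without -f creates id_rsa and id_rsa.pub under .ssh.
3. On the server, create the file authorized_keys under .ssh and paste in the contents of aaa.pub from above.
4. In Xshell, under user authentication choose Public Key, then select the local file aaa (the private key generated above) as the user key.
5. You can then connect directly; WinSCP works similarly.
6. Edit the Google Cloud VM and add the public key from aaa.pub under SSH keys.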
Saturday, July 29, 2017

Setting up shadowsocks

See
https://xfabs.github.io/2016/08/02/shadowsocks/

The main steps:
apt-get install python-pip
pip install shadowsocks

{
"server":"0.0.0.0",
"server_port":1111,
"local_address": "127.0.0.1",
"local_port":1080,
"password":"mima",
"timeout":300,
"method":"aes-256-cfb",
"fast_open": false
}
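A stray character inside a key name (for example a full-width colon) still produces valid JSON, so it is easy to miss. The sketch below checks a config's key names against the set this example is meant to use. The function name unexpected_keys and the EXPECTED_KEYS set are assumptions made for illustration, not part of shadowsocks.

```python
import json

# Assumption: the key names the example config is meant to contain.
EXPECTED_KEYS = {
    "server", "server_port", "local_address", "local_port",
    "password", "timeout", "method", "fast_open",
}


def unexpected_keys(config_text):
    """Return the sorted list of keys that are not in EXPECTED_KEYS."""
    config = json.loads(config_text)
    return sorted(set(config) - EXPECTED_KEYS)


# A key with a full-width colon inside the quotes is reported as unexpected.
print(unexpected_keys('{"server_port": 1111, "local_port：": 1080}'))
```

Running this against /etc/ss.json before starting the server makes mistyped keys visible instead of leaving them silently unused.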

ssserver -c /etc/ss.json -d start
ssserver -c /etc/ss.json -d stop

https://blog.jamespan.me/posts/enable-tcp-bbr-on-google-cloud

Be sure to enable BBR as described in the link above:

sudo bash -c 'echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf'
sudo bash -c 'echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf'
sudo sysctl -p

Tuesday, July 25, 2017

spark-shell client mode

spark-shell --master yarn --deploy-mode client

Friday, July 21, 2017

Quicksort in Go

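package main

import "fmt"

// qsort sorts data in place, using the first element as the pivot.
func qsort(data []int) {
    if len(data) <= 1 {
        return
    }
    mid, i := data[0], 1
    head, tail := 0, len(data)-1
    for i <= tail {
        if data[i] > mid {
            // Larger than the pivot: move it to the tail region.
            data[i], data[tail] = data[tail], data[i]
            tail--
        } else {
            // Not larger: swap with the pivot slot and advance both.
            data[i], data[head] = data[head], data[i]
            head++
            i++
        }
    }
    data[head] = mid
    qsort(data[:head])
    qsort(data[head+1:])
}

func main() {
    fmt.Println("Hello, 世界")

    a := []int{1, 2, 3}
    qsort(a)
    // range yields (index, value); print the values, not the indices.
    for _, x := range a {
        fmt.Println(x)
    }
}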
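For comparison, here is the same partition scheme as a Python sketch. It is a direct transliteration for illustration, using index bounds instead of slices; the lo and hi parameters are additions that the Go version does not have.

```python
def qsort(data, lo=0, hi=None):
    """Sort data[lo..hi] in place, using data[lo] as the pivot."""
    if hi is None:
        hi = len(data) - 1
    if hi - lo < 1:  # zero or one element: nothing to do
        return
    pivot = data[lo]
    head, i, tail = lo, lo + 1, hi
    while i <= tail:
        if data[i] > pivot:
            # Larger than the pivot: move it to the tail region.
            data[i], data[tail] = data[tail], data[i]
            tail -= 1
        else:
            # Not larger: swap with the pivot slot and advance both.
            data[i], data[head] = data[head], data[i]
            head += 1
            i += 1
    # The pivot now sits at index head; sort both sides.
    qsort(data, lo, head - 1)
    qsort(data, head + 1, hi)


values = [3, 1, 2, 3, 0, -5, 8]
qsort(values)
print(values)
```

Using an unsorted input with duplicates exercises both branches of the loop, which the already sorted slice {1, 2, 3} does not.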

Tuesday, July 11, 2017

Fixing production Spark and Sqoop issues

The production Sqoop log showed the error discussed here:
https://stackoverflow.com/questions/38649071/getting-an-file-exists-error-while-import-into-hive-using-sqoop

In short, when Sqoop imports data it first writes the data to a temporary path and then moves it to the destination. If that temporary path already exists, the import fails, even when the overwrite option is set.

Spark error

http://blog.csdn.net/a921122/article/details/45095845
https://stackoverflow.com/questions/32820087/spark-multiple-spark-submit-in-parallel

Thursday, July 6, 2017

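Scala error

java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef

This is a Scala version mismatch: the code was compiled against a different Scala version than the one on the runtime classpath.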

Tuesday, June 27, 2017

SSE operations

https://msdn.microsoft.com/en-us/library/ybhzf6dk(v=vs.71).aspx
https://msdn.microsoft.com/en-us/library/yd9wecaa(v=vs.120).aspx
https://msdn.microsoft.com/en-us/library/2x1se8ha(VS.80).aspx
https://msdn.microsoft.com/en-us/library/22kbk6t9(VS.80).aspx
https://msdn.microsoft.com/en-us/library/c9848chc(v=vs.90).aspx
https://msdn.microsoft.com/en-us/library/tk1t2tbz(v=vs.80).aspx

Thursday, June 22, 2017

Compiling libffm

Environment: CentOS 6.5 with GCC 4.4.7.

Compiling libffm requires C++11 support, so install a newer GCC as follows:

GCC 5.2
wget https://copr.fedoraproject.org/coprs/hhorak/devtoolset-4-rebuild-bootstrap/repo/epel-6/hhorak-devtoolset-4-rebuild-bootstrap-epel-6.repo -O /etc/yum.repos.d/devtools-4.repo
yum install devtoolset-4-gcc devtoolset-4-binutils devtoolset-4-gcc-c++ -y

Before compiling, set these temporarily:
export CC=/opt/rh/devtoolset-4/root/usr/bin/gcc
export CPP=/opt/rh/devtoolset-4/root/usr/bin/cpp
export CXX=/opt/rh/devtoolset-4/root/usr/bin/c++

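Running make directly showed that the new g++ was not picked up, because the Makefile sets CXX itself.

Edit the Makefile and change

CXX = g++

to

CXX = /opt/rh/devtoolset-4/root/usr/bin/g++

After that, the build succeeded.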

Common commands for running Java

java -cp HelloWorld.jar org.test.HelloWorld

java -jar *.jar

Monday, June 5, 2017

Using Thrift

http://blog.csdn.net/lener0611/article/details/50546569

http://sunliwen.com/2012/02/apache-thrift-on-ubuntu-10-04/

Wednesday, May 17, 2017

Spark Word2Vec cosine similarity comes out greater than 1

Spark MLLib's Word2Vec cosine similarity greater than 1

http://stackoverflow.com/questions/33360201/spark-mllibs-word2vec-cosine-similarity-greater-than-1
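A true cosine similarity always lies in [-1, 1], so a value above 1 means at least one vector was not divided by its norm. The sketch below shows the difference in plain Python. It does not reproduce Spark's implementation, and the function half_normalized is a hypothetical name used only to show the failure mode.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; always in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def half_normalized(a, b):
    """Dot product divided by the norm of b only; NOT a cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(y * y for y in b))


a = [3.0, 4.0]
b = [6.0, 8.0]
print(cosine_similarity(a, b))  # parallel vectors
print(half_normalized(a, b))    # scales with the norm of a, so it can exceed 1
```

If a library returns values above 1, normalizing the vectors yourself and recomputing as in cosine_similarity is a simple cross-check.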

Wednesday, April 26, 2017

MongoDB downloads

Download location for all 64-bit MongoDB builds:

http://dl.mongodb.org/dl/win32/x86_64

Saturday, April 22, 2017

eclipse package explorer hierarchical

Hierarchical display

http://stackoverflow.com/questions/3915961/how-to-view-hierarchical-package-structure-in-eclipse-package-explorer

See the link above; the shortcut is Ctrl+F10.

Opening multiple projects in IntelliJ IDEA

http://blog.csdn.net/zht666/article/details/47831893

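Ways to run a jar with Java

http://www.cnblogs.com/adolfmc/archive/2012/10/07/2713562.html

After packaging with OneJar, java -jar ****.jar always runs the configured main method. When the project contains other main methods and you want to run a specific one, use the following command: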

java -classpath ****.jar ****.****.className [args]


Wednesday, April 12, 2017

Class imbalance and SMOTE

1.http://geek.csdn.net/news/detail/189985?url_type=39&object_type=webpage&pos=1

2.http://mp.weixin.qq.com/s?__biz=MzIxNjA2ODUzNg==&mid=403428182&idx=1&sn=2bd376b0eded604211f922b7ec6df5bf#rd

3.http://blog.csdn.net/Yaphat/article/details/52463304?locationNum=7

4.http://blog.csdn.net/yaphat/article/details/60347968

Thursday, April 6, 2017

Adding your own modules in Python

How to locate a module's file:

import codecs

codecs.__file__

Print the Python system path:

import sys

sys.path

Adding your own module

There are a few ways. One of the simplest is to create a my-paths.pth file (as described here). This is just a file with the extension .pth that you put into your system site-packages

In other words, add a file named my-paths.pth to site-packages and list the directories of your own modules in it.

See http://stackoverflow.com/questions/12257747/adding-a-file-path-to-sys-path-in-python

https://docs.python.org/3/library/site.html
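The .pth mechanism can be tried without touching the real site-packages. This sketch uses temporary directories as stand-ins for site-packages and for a module directory (both are assumptions for the demo) and calls site.addsitedir, which processes the .pth files in the directory it is given.

```python
import os
import site
import sys
import tempfile

# Stand-ins for a real site-packages directory and a real module directory.
site_dir = tempfile.mkdtemp()
module_dir = os.path.abspath(tempfile.mkdtemp())

# A .pth file lists one directory per line.
with open(os.path.join(site_dir, "my-paths.pth"), "w") as handle:
    handle.write(module_dir + "\n")

# addsitedir() adds site_dir to sys.path and processes its .pth files;
# listed directories are only added if they exist.
site.addsitedir(site_dir)

print(module_dir in sys.path)
```

In a real installation the interpreter does this automatically at startup for every .pth file in site-packages, so no call to addsitedir is needed.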

Further reading: Python built-in functions

https://docs.python.org/2/library/functions.html

Tuesday, March 28, 2017

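Recovering a dropped Hive table and its files

1. For the difference between Hive external tables and managed (internal) tables, see

http://stackoverflow.com/questions/38318513/does-drop-partition-delete-data-from-external-table-in-hive

In short, when a managed table is dropped, its files are deleted as well.

I had created a managed table and then dropped it, so the files were deleted too. To recover them, copy or mv them out of /user/<user>/.Trash. Since I ran the operation as root, <user> is root in my case.

For details, see http://stackoverflow.com/questions/20114597/hive-files-on-hdfs-not-being-deleted-when-managed-not-external-table-is-droppe

2. Spark Streaming reads feedback logs from Kafka and writes them to HDFS as ORC files.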

CREATE EXTERNAL TABLE test (
    foo string,
    time string,
    bar string
)  PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';

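The table is created as above. For ORC files, of course, use TERMINATED BY ',' and add a STORED AS ORC clause.

Each daily partition then has to be added like this before it can be queried:

ALTER TABLE test
    ADD PARTITION (dt='2014-03-05')
    location 's3://test.com/2014-03-05';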


Method reference: http://stackoverflow.com/questions/22220837/how-to-add-partition-using-hive-by-a-specific-date
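Since one such statement is needed per day, it can help to generate them. The following is a minimal sketch; the function name add_partition_statements is hypothetical, and it assumes each day's files live under a subdirectory named after the date, as in the example above. The IF NOT EXISTS clause is an addition so that re-running the statements is harmless.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_location, start, days):
    """Build one ALTER TABLE ... ADD PARTITION statement per day."""
    base = base_location.rstrip("/")
    statements = []
    for offset in range(days):
        dt = (start + timedelta(days=offset)).isoformat()
        statements.append(
            "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (dt='%s') "
            "LOCATION '%s/%s';" % (table, dt, base, dt)
        )
    return statements


for statement in add_partition_statements("test", "s3://test.com/", date(2014, 3, 5), 3):
    print(statement)
```

The printed statements can be saved to a file and run with the Hive CLI, or scheduled daily with days=1.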

ref

https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_alter_table.html

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Tuesday, March 14, 2017

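Installing Keras, TensorFlow, and Theano

Environment: Windows 10
1. Install TensorFlow by downloading the official whl file. Python must be version 3.5; Anaconda3-4.2.0-Windows-x86_64 works.
2. To install Theano, first run conda install libpython, which creates libpython35.dll.a under C:\Anaconda3\libs. After that, pip install keras installs Theano automatically.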

Monday, February 20, 2017

Using liblinear

http://u.cs.biu.ac.il/~89-680/liblinear/using_liblinear.html

http://u.cs.biu.ac.il/~89-680/liblinear/liblin.py

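The article above introducing liblinear is clear and easy to follow, and it deepened my understanding of the library. When I opened the site I could not read the text, so I looked it up on Baidu: "Bar-Ilan University, founded in 1955 in Ramat Gan, is currently the second-largest academic institution in Israel."

On the penalty parameter C:

The larger C is, the easier it is to overfit.

The smaller C is, the easier it is to underfit.

The exact formula and derivation of the penalty term are still worth studying carefully.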

Q: How to select the regularization parameter C?

After version 2.0, an option -C is provided to find C. For example, you can run

> train -C data_file

to find the C value with the best CV rate.

There is also this article:

http://dataunion.org/15669.html

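Effect of C and σ² on an SVM:

When C is large (λ small), the model may overfit (low bias, high variance);
When C is small (λ large), the model may underfit (high bias, low variance);
When σ² is large: high bias, low variance;
When σ² is small: low bias, high variance.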
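The relation between C and λ can be made explicit. The block below writes out the L2-regularized objective in the form LIBLINEAR's documentation uses, with the hinge loss as one example of the loss term, and then the equivalent λ form obtained by dividing by C. The symbols l (number of training examples) and ξ (per-example loss) are notation chosen here.

```latex
% C form: C weights the training loss against the regularizer
\min_{w} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi(w; x_i, y_i),
\qquad \xi(w; x_i, y_i) = \max\bigl(0,\; 1 - y_i w^{T} x_i\bigr)

% Dividing by C > 0 does not change the minimizer; set \lambda = 1/C
\min_{w} \; \frac{\lambda}{2} w^{T} w + \sum_{i=1}^{l} \xi(w; x_i, y_i),
\qquad \lambda = \frac{1}{C}
```

A large C (small λ) makes the loss term dominate, so the model fits the training data more closely and may overfit. A small C (large λ) makes the regularizer dominate and may underfit.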

Sunday, February 19, 2017

A ConfigParser pitfall in Python 2.6

import ConfigParser
import codecs
import os

#category_cfg_file="category.cfg"

category_cfg = ConfigParser.ConfigParser()
with codecs.open('feature_common.config', 'r',encoding='utf-8') as configfile:
    category_cfg.readfp(configfile)

features = category_cfg.sections()

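Under Python 2.7, the code above returns features in the same order as the config file. The server, however, runs Python 2.6, where the result comes back unordered. It took me a long time to track this down.

http://stackoverflow.com/questions/1134071/keep-configparser-output-files-sorted
That is where I found the answer. The fix is as follows: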

In Python 2.7, you can from collections import OrderedDict and use ConfigParser(dict_type=OrderedDict) to make the parser use an ordered dictionary. Based on a minimal amount of testing, I think this should serve you well.
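The quoted fix can be sketched as follows. This version targets Python 3, where the module is named configparser, and reads from an in-memory string instead of feature_common.config; the section names are made up for the demo.

```python
import configparser  # named ConfigParser in Python 2
from collections import OrderedDict

# Hypothetical config content; section names are deliberately not alphabetical.
CONFIG_TEXT = """
[zeta]
weight = 1

[alpha]
weight = 2

[mid]
weight = 3
"""

# dict_type=OrderedDict makes the parser keep sections in file order.
parser = configparser.ConfigParser(dict_type=OrderedDict)
parser.read_string(CONFIG_TEXT)

features = parser.sections()
print(features)
```

The section list comes back in file order (zeta, alpha, mid) rather than sorted or hashed order, which is what code that depends on feature order needs.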

Wednesday, February 15, 2017

Passing parameters in shell

http://stackoverflow.com/questions/415677/how-to-replace-placeholders-in-a-text-file

https://my.oschina.net/iuranus/blog/279985

Monday, February 13, 2017

Installing MySQLdb

## CentOS
sudo yum install mysql-devel
sudo pip install mysql-python

## Ubuntu

Starting with a vanilla Lucid install [1], install pip and upgrade to the latest version:

apt-get install python-pip
pip install -U pip

Next, install the required development packages:

apt-get install python-dev libmysqlclient-dev

then

pip install MySQL-python

should complete successfully.

Installing on macOS
$ brew uninstall mysql
$ brew install mysql-connector-c
$ brew unlink mysql-connector-c
$ brew install mysql
$ pip install mysql-python

Saturday, January 7, 2017

The first deep Scala pitfall

http://stackoverflow.com/questions/23722048/intellij-idea-13-new-scala-sbt-project-hasnt-src-directory-structure-generated

2. Open the terminal in IntelliJ:
Alt+F12

Thursday, January 5, 2017

A redis-py-cluster pitfall

I tried the example from the documentation and got an error; using redis-py on its own gave an error too.

>>> from rediscluster import StrictRedisCluster

>>> # Requires at least one node for cluster discovery. Multiple nodes is recommended.
>>> startup_nodes = [{"host": "127.0.0.1", "port": "7000"}]

>>> # Note: decode_responses must be set to True when used with python3
>>> rc = StrictRedisCluster(startup_nodes=startup_nodes, decode_responses=True)

>>> rc.set("foo", "bar")
True
>>> print(rc.get("foo"))
'bar'

The cluster error was:

ResponseError: unknown command 'CONFIG'

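This was extremely frustrating; I could not find an answer anywhere.

https://www.digitalocean.com/community/tutorials/how-to-secure-your-redis-installation-on-ubuntu-14-04

That article made me realize the server can be configured to disable certain commands. I then searched the redis-py-cluster source and found this passage: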

If your redis instance is configured to not have the `CONFIG ...` comannds enabled due to security reasons you need to pass this into the client object `skip_full_coverage_check=True`. Benefits is that the client class no longer requires the `CONFIG ...` commands to be enabled on the server. Downsides is that you can't use the option in your redis server and still use the same feature in this client.

Adding skip_full_coverage_check=True fixed it.