Wednesday, August 31, 2016

On precision and recall in sklearn





                Actual
                1         0
Predicted  1    760       290
           0    1937      12488

Of these, 1050 samples are predicted positive and 14425 are predicted negative.


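From the matrix above it is easy to work out that for class 0 the precision is 12488/14425 and the recall is 12488/12778.
For class 1 the precision is 760/1050 and the recall is 760/2697.
These match the per-class rows of the classification report, but where does the avg / total row come from? I could not figure it out.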

I found these:
http://stackoverflow.com/questions/31169874/what-does-the-last-raw-mean-in-classification-report-in-scikit-learn

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

http://stats.stackexchange.com/questions/117654/what-does-the-numbers-in-the-classification-report-of-sklearn-mean

The relevant source code is:
# compute averages
values = [last_line_heading]
for v in (np.average(p, weights=s),
          np.average(r, weights=s),
          np.average(f1, weights=s)):
    values += ["{0:0.{1}f}".format(v, digits)]
values += ['{0}'.format(np.sum(s))]


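So it turns out to be a weighted average, with each class weighted by its support (its number of actual samples),
like this:
precision = (0.87*12778 + 0.72*2697) / (12778 + 2697)
recall = (0.98*12778 + 0.28*2697) / (12778 + 2697)
Go figure.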
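
As a quick check, the weighted average can be recomputed straight from the confusion matrix counts. This is only a sketch in plain Python 3: the variable names are mine, and it does not call sklearn, it just repeats the arithmetic above with unrounded per-class values.

```python
# Counts taken from the confusion matrix in this post.
tp, fp = 760, 290      # predicted 1: actually 1 / actually 0
fn, tn = 1937, 12488   # predicted 0: actually 1 / actually 0

# Per-class precision, recall, and support (number of actual samples per class).
precision = {0: tn / (tn + fn), 1: tp / (tp + fp)}
recall = {0: tn / (tn + fp), 1: tp / (tp + fn)}
support = {0: tn + fp, 1: tp + fn}

total = sum(support.values())


def weighted_average(metric):
    """Average a per-class metric, weighting each class by its support."""
    return sum(metric[c] * support[c] for c in support) / total


avg_precision = weighted_average(precision)
avg_recall = weighted_average(recall)

print("precision: %.2f" % avg_precision)
print("recall:    %.2f" % avg_recall)
```

With these counts the script prints 0.84 and 0.86. Note that the support-weighted recall is simply (12488 + 760) / 15475, i.e. the overall accuracy.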

Wednesday, August 24, 2016

Developing and deploying Shiny apps

For installing Shiny Server, see the official documentation.
Here are the specific problems I ran into.

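1. Garbled characters in Shiny tables. I was using the DT package with Shiny; in the related documentation, the answer given by Xie is to load library(htmltools). On Windows you can also add Sys.setlocale(, 'Chinese') to the .Rprofile file.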

2. When using ggplot2 in Shiny, axis labels, legends, and text inside the plot are displayed as garbled characters.

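My first thought was an encoding problem. I followed the method in http://blog.csdn.net/wlwlwlwl015/article/details/51482065 and struggled with it all evening without success. In the morning I came across http://cos.name/cn/topic/408656/ but did not know where that snippet was supposed to go. In the end I went back to Xie's approach: it tested fine on Windows, but still failed on CentOS. See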
http://shiny.rstudio.com/gallery/unicode-characters.html


The error thrown was:
sh: locate: command not found
rm(font_home)
downloaded 0 bytes

Error in utils::download.file(url, method = method, ...) : 
  cannot download all files
In addition: Warning message:
In utils::download.file(url, method = method, ...) :
  URL 'https://github-cloud.s3.amazonaws.com/releases/13722949/aca3ad44-1298-11e4-9499-2c2c4b1a1e74.ttc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAISTNZFOVBIJMK3TQ%2F20160824%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20160824T024339Z&X-Amz-Expires=300&X-Amz-Signature=657ebda9fa7aec563592de13c160420b4013027a15f607a667a9e93159c2d7af&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dwqy-zenhei.ttc&response-content-type=application%2Foctet-stream': status was '403 Forbidden


Since this is an intranet environment, the cause was that no proxy had been configured. In R, a proxy is set with:
Sys.setenv(http_proxy="http://ip:port")

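But this only takes effect for sessions started from a terminal; the deployed app still did not work. What then?
I went through this piece of Xie's code: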
font_home <- function(path = '') file.path('~', '.fonts', path)
if (Sys.info()[['sysname']] == 'Linux' &&
      system('locate wqy-zenhei.ttc') != 0 &&
      !file.exists(font_home('wqy-zenhei.ttc'))) {
  if (!file.exists('wqy-zenhei.ttc'))
    shiny:::download(
      'https://github.com/rstudio/shiny-examples/releases/download/v0.10.1/wqy-zenhei.ttc',
      'wqy-zenhei.ttc'
    )
  dir.create(font_home())
  file.copy('wqy-zenhei.ttc', font_home())
  system2('fc-cache', paste('-f', font_home()))
}
rm(font_home)
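This does essentially the same work as the csdn blog post above. Since GitHub was unreachable, I downloaded the font file elsewhere, uploaded it to the server, and then ran this code; as if by magic, the text displayed correctly.
3. Log location: /var/log/shiny-server
4. Deployment: just put the app files under /srv/shiny-server.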
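
The manual workaround from item 2 (copy an already-downloaded font into ~/.fonts, then refresh the font cache) can also be sketched outside R. The Python 3 version below is only an illustration: install_font and its font_dir parameter are names I made up, and it assumes the font file has already been uploaded to the machine.

```python
import os
import shutil
import subprocess


def install_font(font_file, font_dir=None):
    """Copy an already-downloaded font into font_dir and refresh the font cache."""
    if font_dir is None:
        # Same location the R snippet uses: ~/.fonts
        font_dir = os.path.join(os.path.expanduser("~"), ".fonts")
    if not os.path.isdir(font_dir):
        os.makedirs(font_dir)
    target = os.path.join(font_dir, os.path.basename(font_file))
    if not os.path.exists(target):
        shutil.copy(font_file, target)
    # fc-cache only exists where fontconfig is installed; skip it otherwise.
    if shutil.which("fc-cache"):
        subprocess.call(["fc-cache", "-f", font_dir])
    return target
```

It would be called as install_font('wqy-zenhei.ttc') from the directory holding the uploaded file, using the account whose home directory should receive the font.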

Installation is as follows:
https://www.rstudio.com/products/shiny/download-server/

sudo yum install R
sudo su - \
-c "R -e \"install.packages('shiny', repos='https://cran.rstudio.com/')\""
$ wget https://download3.rstudio.org/centos5.9/x86_64/shiny-server-1.5.3.838-rh5-x86_64.rpm
$ sudo yum install --nogpgcheck shiny-server-1.5.3.838-rh5-x86_64.rpm


http://docs.rstudio.com/shiny-server/

Management:
$ sudo start shiny-server
$ sudo stop shiny-server
To restart the server you can run:
$ sudo stop shiny-server
$ sudo start shiny-server

Friday, August 19, 2016

Removing Docker



How to Remove Docker Images and Containers


In our previous tutorials you learned how to install Docker Engine on CentOS/RHEL and Ubuntu, pull images from Docker Hub, and create containers from those images. This tutorial will help you remove unnecessary Docker images and containers from your host machine.

Remove Docker Images

To remove an image, Docker provides the rmi command. With it we can delete any Docker image from the local system. For example, run the command below, replacing <IMAGE ID> with your Docker image ID.
# docker rmi  <IMAGE ID>
To list all images on your system, use the following command. It will help you find the image IDs.
root@tecadmin:~# docker images

REPOSITORY     TAG        IMAGE ID            CREATED          VIRTUAL SIZE
centos         latest     2933d50b9f77        11 days ago      196.6 MB
ubuntu         latest     36248ae4a9ac        11 days ago      188 MB

Remove Docker Containers

To remove a container, Docker provides the rm command. With it we can delete any Docker container from the local system. For example, run the command below, replacing <CONTAINER ID> with your Docker container ID.
# docker rm  <CONTAINER ID>
To list containers, use the ps command. By default ps shows only running containers, so add the -a parameter to view all containers.
root@tecadmin:~# docker ps -a

CONTAINER ID   IMAGE    COMMAND      CREATED       STATUS      PORTS  NAMES
f2582758af13   ubuntu   "/bin/bash"  2 hours ago   Up 2 hours         first_ubuntu
2b199b9976c4   centos   "/bin/bash"  3 days ago    Up 3 hours         thirsty_yalow

Stop & Remove All Docker Containers

If you want to remove all Docker containers, you can simply use the following commands. The first command stops all running Docker containers and the second deletes them.
# docker stop $(docker ps -a -q)
# docker rm $(docker ps -a -q)

Tuesday, August 9, 2016

Installing xgboost on CentOS and Windows

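0. An addendum on some preparation work
Last time, after getting everything installed on a CentOS 6.6 server, I installed glibc while setting up TensorFlow and modified the system's libc.so.6, which took the server down. Only afterwards did I realize that this is a core system library (embarrassing).
Now I needed xgboost again, but it simply would not install, and gcc was to blame.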

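CentOS 6.6 is old, but unfortunately all our servers run it. To build anything that needs C++11 or later, the system's gcc 4.4.7 is not enough. xgboost officially recommends 4.7 or later, but in my own tests neither 4.7.2 nor 4.8.2 worked; the issue https://github.com/dmlc/xgboost/issues/1319 mentions the same problem. I could not find the 4.8.3 version mentioned in that issue, so I had to install version 4.9.4. The installation method is as follows [reference 1]: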
wget https://copr.fedoraproject.org/coprs/rhscl/devtoolset-3/repo/epel-6/rhscl-devtoolset-3-epel-6.repo -O /etc/yum.repos.d/rhscl-devtoolset-3-epel-6.repo
$ yum --disablerepo='*' --enablerepo='rhscl-devtoolset-3' list
$ yum --disablerepo='*' --enablerepo='rhscl-devtoolset-3' install devtoolset-3-gcc devtoolset-3-gcc-c++
Then run [reference 2]:
export CC=/opt/rh/devtoolset-3/root/usr/bin/gcc
export CPP=/opt/rh/devtoolset-3/root/usr/bin/cpp
export CXX=/opt/rh/devtoolset-3/root/usr/bin/c++






With this in place, both routes go smoothly: running make -j4 on a git clone, or simply running pip install xgboost -i https://pypi.douban.com/simple.
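
Before starting a long build it can help to confirm which compiler CC actually points to. The sketch below parses the output of `gcc -dumpversion` and compares it with a minimum version. The helper names are mine, and the default threshold of 4.9 is an assumption taken from my own experience above (4.7.2 and 4.8.2 failed, the devtoolset-3 compiler worked), not an official xgboost requirement.

```python
import os
import re
import subprocess


def parse_gcc_version(text):
    """Turn `gcc -dumpversion` output such as '4.9.2' or '8' into a 3-tuple."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part or 0) for part in match.groups())


def is_new_enough(version_text, minimum=(4, 9, 0)):
    """Assumed threshold: 4.9, based on the trial and error described in this post."""
    return parse_gcc_version(version_text) >= minimum


def compiler_version(compiler=None):
    """Ask the compiler named by CC (or plain gcc) for its version string."""
    compiler = compiler or os.environ.get("CC", "gcc")
    return subprocess.check_output([compiler, "-dumpversion"]).decode()
```

After the export lines above, is_new_enough(compiler_version()) should return True if CC really resolves to the devtoolset compiler.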

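But there were still plenty of hidden pitfalls. Where? The intranet goes through a proxy, and the package download source is genuinely slow, so the speed was unbearable and the downloads failed again and again.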


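When the command runs, it lists the packages that need to be installed, as shown above; the ones in bold have already been downloaded. Each time I re-ran yum --disablerepo='*' --enablerepo='rhscl-devtoolset-3' install devtoolset-3-gcc devtoolset-3-gcc-c++
I noticed that it resumes interrupted downloads, and that it only starts installing once everything has been downloaded (a lesson learned from several failed attempts). So there had to be a temporary directory where these files are kept. But where?
I found this post: http://superuser.com/questions/385712/where-does-yum-save-the-rpms-it-downloads
Following its example, I found /var/cache/yum/x86_64/6/rhscl-devtoolset-3/packages, which is indeed where the files are stored, and the names of all the required dependencies were there too. That made things easy.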

Just google the package names, then download them with Xunlei (迅雷):
https://www.softwarecollections.org/repos/rhscl/devtoolset-3/epel-6-x86_64/devtoolset-3-3.1-12.el6/
https://www.softwarecollections.org/repos/rhscl/devtoolset-3/epel-6-x86_64/devtoolset-3-gcc-4.9.2-6.el6/
https://www.softwarecollections.org/repos/rhscl/devtoolset-3/epel-6-x86_64/devtoolset-3-binutils-2.24-18.el6/

Once everything is downloaded, copy the files into the directory above and run this again:
yum --disablerepo='*' --enablerepo='rhscl-devtoolset-3' install devtoolset-3-gcc devtoolset-3-gcc-c++


yum reported that the dependencies were already available and happily started installing.
And then: success.

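This attempt was full of deep pitfalls, but I solved them one step at a time and learned quite a bit along the way. Now I can happily train models again.
Finally, here is the earliest guide I followed, for installing gcc 4.7.2 and 4.8.2: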
http://superuser.com/questions/381160/how-to-install-gcc-4-7-x-4-8-x-on-centos

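I also tried doing all this locally with docker. It was just as slow, so I will skip the details and only record one docker command, for copying a file from inside a container to the host:
In order to copy a file from a container to the host, you can use the command docker cp <containerId>:/file/path/within/container /host/path/target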

docker pull centos:6.6
docker commit containerid name
docker run -it images

P.S. After the install succeeded on the test machine, I copied the relevant files from /anaconda2/lib/python2.7/site-packages to the production machine, and import xgboost raised an error.


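I found https://github.com/dmlc/xgboost/issues/1786, but the fix described there did not work. I uninstalled and reinstalled directly with pip install, which succeeded, but the import still failed. Comparing the two machines, the test machine used anaconda 4.1.1 and production used anaconda 4.3; after downgrading, it worked. fasttext had the same problem earlier, so the anaconda upgrade seems to have brought plenty of trouble. The cause is unknown.
I need xgboost, and since I develop locally on Windows while the server runs CentOS, I have to install it on both.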

1. Installing on CentOS

1. Since yum on CentOS depends on python2.6, I installed Python 2.7 alongside it with yum install python27; invoke it as python2.7.

Running pip2.7 install xgboost completes the xgboost installation.
But importing it raised the following error:
>>> import xgboost
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python2.7/lib/python2.7/site-packages/xgboost/__init__.py", line 13, in <module>
    from .sklearn import XGBModel, XGBClassifier, XGBRegressor
  File "/usr/local/python2.7/lib/python2.7/site-packages/xgboost/sklearn.py", line 10, in <module>
    from .compat import (SKLEARN_INSTALLED, XGBModelBase,
ImportError: cannot import name LabelEncoder

Later I found a reply from 刘知远 (Liu Zhiyuan) in an xgboost issue: install sklearn by running pip2.7 install scikit-learn

After that, the import works.
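
A small pre-flight check makes this kind of missing dependency visible before importing xgboost. The sketch below is my own; the REQUIRED list is an assumption based only on the packages that came up in this post (scikit-learn here, scipy and numpy in the Windows section), not an official dependency list.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed list for illustration; scikit-learn is imported as "sklearn".
REQUIRED = ["numpy", "scipy", "sklearn"]

if __name__ == "__main__":
    print("missing:", missing_modules(REQUIRED))
```

An empty list means the modules are importable; anything listed can be installed with pip before retrying import xgboost.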

2. Installing on Windows

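Searching online turned up this article: http://blog.csdn.net/bon_mot/article/details/51742869#0-tsina-1-68989-397232819ff9a47a7b7e80a40613cfe1 . However, xgboost has changed since then: the VS project files no longer exist, and the windows directory is gone from the git repository.

I followed the instructions quoted below for the installation, using git bash throughout.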
Note that as of the most recent release the Microsoft Visual Studio instructions no longer seem to apply as this link returns a 404 error:
You can read more about the removal of the MSVC build from Tianqi Chen's comment here.
So here's what I did to finish a 64-bit build on Windows:
  1. Download and install MinGW-64: http://sourceforge.net/projects/mingw-w64/
  2. On the first screen of the install prompt make sure you set the Architecture to x86_64 and the Threads to win32
  3. I installed to C:\mingw64 (to avoid spaces in the file path) so I added this to my PATH environment variable: C:\mingw64\mingw64\bin
  4. I also noticed that the make utility that is included in bin\mingw64 is called mingw32-make so to simplify things I just renamed this to make
  5. Open a Windows command prompt and type gcc. You should see something like "fatal error: no input file"
  6. Next type make. You should see something like "No targets specified and no makefile found"
  7. Type git. If you don't have git, install it and add it to your PATH.
These should be all the tools you need to build the xgboost project. To get the source code run these lines:
  1. cd c:\
  2. git clone --recursive https://github.com/dmlc/xgboost
  3. cd xgboost
  4. git submodule init
  5. git submodule update
  6. cp make/mingw64.mk config.mk
  7. make -j4
Note that I ran this part from a Cygwin shell. If you are using the Windows command prompt you should be able to change cp to copy and arrive at the same result. However, if the build fails on you for any reason I would recommend trying again using cygwin.
If the build finishes successfully, you should have a file called xgboost.exe located in the project root. To install the Python package, do the following:
  1. cd python-package
  2. python setup.py install
Now you should be good to go. Open up Python, and you can import the package with:
import xgboost as xgb
To test the installation, I went ahead and ran the basic_walkthrough.py file that was included in the demo/guide-python folder of the project and didn't get any errors.
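After the installation finished, the import still raised an error.
It told me scipy was required. Running pip install scipy directly failed, and I had no time to sort that out; ever since switching from 32-bit to 64-bit Python I had suspected something was off. The simplest approach is still to download the package directly from this maintainer's site, http://www.lfd.uci.edu/~gohlke/pythonlibs/ . While trying again, I still hit the error below.
While searching for it, I noticed this passage: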
Many binaries depend on numpy-1.11+mkl and the Microsoft Visual C++ 2008 (x64, x86, and SP1 for CPython 2.6 and 2.7), Visual C++ 2010 (x64, x86, for CPython 3.3 and 3.4), or the Visual C++ 2015 (x64 and x86 for CPython 3.5) redistributable packages.
Install numpy+mkl before other packages that depend on it.
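In other words, many modules depend on numpy, but the modules precompiled by this maintainer depend on numpy+mkl. I had installed numpy directly through pip, so it could not be used. After downloading and installing this maintainer's numpy build instead, everything worked.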
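
When picking a precompiled package by hand, the file has to match both the Python version and the 32-bit or 64-bit build of the interpreter, which is easy to get wrong right after switching between the two. The sketch below reports both; wheel_platform_tag is a hypothetical helper name of mine, while win32 and win_amd64 are the platform tags that appear in Windows wheel file names.

```python
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


def wheel_platform_tag():
    """Windows platform tag that a matching wheel file name would carry."""
    return "win_amd64" if interpreter_bits() == 64 else "win32"


print("Python %d.%d, %d-bit, look for files tagged %s" % (
    sys.version_info[0], sys.version_info[1],
    interpreter_bits(), wheel_platform_tag()))
```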

References:
1.http://www.hi-linux.com/posts/25767.html
2.https://www.zhangfangzhou.cn/centos6-devtoolset-gcc.html


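Installing on Mac OS: I had installed it successfully before, but after switching to a new machine it no longer worked.
Current OS: 10.13.4
Normally, following https://machinelearningmastery.com/install-xgboost-python-macos/ works without problems; the official instructions are at https://xgboost.readthedocs.io/en/latest/build.html#python-package-installation
Problem encountered: clang: error: unsupported option '-fopenmp'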
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
I switched to gcc-7 and g++-7, following the documentation.

Download the Command Line Tools from https://developer.apple.com/download/more/
Upgrade Xcode.
Reinstall gcc 7:
sudo chown -R $(whoami):admin /usr/local
Password:
chown: /usr/local: Operation not permitted (this did not work)
Correct. /usr/local can no longer be chown'd in High Sierra. Instead use
sudo chown -R $(whoami) $(brew --prefix)/*

lrwxr-xr-x  1   admin        29 Jun 29 21:23 g++-7 -> ../Cellar/gcc/7.2.0/bin/g++-7
lrwxr-xr-x  1   admin        29 Jun 29 21:23 gcc-7 -> ../Cellar/gcc/7.2.0/bin/gcc-7

ln -s /usr/local/bin/g++-7 /usr/local/Cellar/gcc@7/7.3.0/bin/g++-7
ln -s /usr/local/bin/gcc-7 /usr/local/Cellar/gcc@7/7.3.0/bin/gcc-7
Use -sf to force the link (it replaces an existing one).

brew unlink gcc
brew link gcc
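This never got resolved, so I installed gcc-8 directly with brew upgrade gcc,
changed the compiler in the build config file to gcc-8,
and the build succeeded and installed normally.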



When every build attempt fails:
conda install -c conda-forge xgboost
This is the only thing left to rely on.

Wednesday, August 3, 2016

Python environment setup and installing pip

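I had always used python(x,y) and really liked the feeling of coding and executing as you go with ctrl+enter, but recently it started failing. When connecting to mongo and switching between multiple nodes, it kept reporting memory errors, so I finally made up my mind to uninstall the 32-bit Python and install the 64-bit one. After that, many packages stopped working and pip stopped cooperating as well.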



1. Install Python, Microsoft Visual C++ Compiler for Python 2.7, and pip


2. Configure the PATH environment variable
Add C:\Python27;C:\Python27\Scripts

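After using Python for this long, I only just realized that the executables, such as pip and jupyter, all live here. No wonder nothing worked without the environment variable configured.

3. With that, everything is done.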
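
To confirm that step 2 took effect, one option is to ask Python where each command resolves on PATH. This is a sketch of my own using shutil.which, which needs Python 3.3 or later (so not the 2.7 install above); find_on_path is a made-up helper name.

```python
import shutil


def find_on_path(commands, path=None):
    """Map each command name to the executable found on PATH, or None."""
    return {name: shutil.which(name, path=path) for name in commands}


for name, location in find_on_path(["python", "pip", "jupyter"]).items():
    print("%-8s %s" % (name, location or "not found on PATH"))
```

If pip or jupyter shows up as not found, the Scripts directory is most likely still missing from PATH; open a new command prompt after changing the variable before checking again.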





From now on I will use this post to record the environment pitfalls I run into while using Python.