Configuring an Ubuntu VM on Azure and deploying Scrapy

Basic environment setup:

sudo apt-get update

sudo apt-get install build-essential python-dev python-pip

Install the databases, MongoDB and Redis:

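For installing MongoDB, see http://blog.csdn.net/like_a_fool/article/details/14105871 For MongoDB permission (authentication) issues, this article is recommended: http://blog.csdn.net/lxpbs8851/article/details/7569852 One more thing to watch is the MongoDB dbpath setting: the data directory it points to must exist and be writable by the user that runs mongod.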

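For installing Redis, see http://redis.io/download redis-server has to keep running as a background process. To do that, edit the configuration file redis.conf and change daemonize to yes.

Then start the server with that configuration file:

src/redis-server redis.conf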
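
Before starting the server it can help to confirm that the edit to redis.conf actually took effect. Below is a minimal Python sketch for that, not part of the original setup. The function name daemonize_enabled is made up for illustration, and the sketch assumes that if the directive appears more than once, the last occurrence is the one that counts.

```python
def daemonize_enabled(conf_text):
    """Return True if the last uncommented `daemonize` directive is `yes`."""
    value = None
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if parts[0].lower() == "daemonize" and len(parts) >= 2:
            value = parts[1].lower()
    return value == "yes"


if __name__ == "__main__":
    # Sample text standing in for the contents of redis.conf.
    sample = "# daemonize no\nport 6379\ndaemonize yes\n"
    print(daemonize_enabled(sample))
```

To check a real file, read redis.conf into a string and pass it to the function.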

Deploying Scrapy:

1. Install the Python client libraries for MongoDB and Redis

sudo pip install pymongo redis

2. To install Scrapy, see https://pypi.python.org/pypi/Scrapy

sudo pip install scrapy

This command may fail with the following error: error: command 'gcc' failed with exit status 1

For the fix, see http://stackoverflow.com/questions/10927492/getting-gcc-failed-error-while-installing-scrapy The missing pieces are the libxml2 and libxslt development headers:

sudo apt-get install libxml2-dev libxslt-dev
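
After the installs above, a quick check that the packages can actually be imported saves some debugging later. The following is only a sketch: the helper name missing_modules is made up for illustration, and it assumes a Python 3 interpreter (importlib.util.find_spec is not available on Python 2).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The three packages installed in the steps above.
    missing = missing_modules(["pymongo", "redis", "scrapy"])
    if missing:
        print("not importable: " + ", ".join(missing))
    else:
        print("all packages importable")
```

An empty list means everything is in place; anything listed needs another pip install.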

The rest of these notes cover the additional setup for the specific Scrapy project:

sudo apt-get install git

utils4scrapy: https://github.com/linhaobuaa/utils4scrapy.git
weibopy: https://github.com/linhaobuaa/weibopy.git
scrapy-redis: https://github.com/darkrho/scrapy-redis

Clone these three dependencies and install each of them.

The project itself: https://github.com/linhaobuaa/scrapy_weibo_v1.git

The configuration mainly consists of importing the MongoDB data that holds the token, apikey, and apisecret information:

1. scp the MongoDB data dumped on 60 (the source machine) to the cloud VM and import it into the local MongoDB. See http://www.cnblogs.com/jiangzhichao/archive/2011/08/12/2135899.html

2. Run tk_maintain.py from utils4scrapy to import the authorization information into Redis. You may need to run sudo pip install logbook first.

3. Add the following two scripts as background jobs in /etc/crontab:
https://github.com/linhaobuaa/utils4scrapy/blob/master/utils4scrapy/auto_reset_ip_req_count.py
https://github.com/linhaobuaa/utils4scrapy/blob/master/utils4scrapy/auto_calibration.py

0 * * * * root cd /home/azureuser/taolei/utils4scrapy/utils4scrapy;python auto_reset_ip_req_count.py
*/2 * * * * root cd /home/azureuser/taolei/utils4scrapy/utils4scrapy;python auto_calibration.py
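
The two entries above use the system crontab format, which has a user field between the five time fields and the command. The Python sketch below is only an illustration of that layout, not a tool used by the project; CronEntry and parse_system_crontab_line are made-up names.

```python
from collections import namedtuple

# Field order of a line in /etc/crontab: five time fields, the user, the command.
CronEntry = namedtuple(
    "CronEntry",
    "minute hour day_of_month month day_of_week user command",
)


def parse_system_crontab_line(line):
    """Split one /etc/crontab job line into its seven fields."""
    # Split on whitespace at most 6 times so the command keeps its own spaces.
    fields = line.split(None, 6)
    if len(fields) != 7:
        raise ValueError("expected 5 time fields, a user, and a command")
    return CronEntry(*fields)


if __name__ == "__main__":
    entry = parse_system_crontab_line(
        "*/2 * * * * root cd /home/azureuser/taolei/utils4scrapy/utils4scrapy;"
        "python auto_calibration.py"
    )
    print(entry.minute, entry.user)
    print(entry.command)
```

Read this way, the first entry runs auto_reset_ip_req_count.py at minute 0 of every hour, and the second runs auto_calibration.py every two minutes, both as root.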

Running Scrapy in the background:

nohup command > myout.file 2>&1 &

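In the example above, the numbers are file descriptors: 0 is stdin (standard input), 1 is stdout (standard output), and 2 is stderr (standard error).
The shell applies the redirections from left to right: > myout.file first points standard output (1) at the file myout.file, then 2>&1 points standard error (2) at wherever standard output currently goes, so both streams end up in myout.file. The trailing & puts the command in the background, and nohup keeps it running after you log out.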
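
The same idea can be approximated from Python when the crawler is launched by a script rather than typed into a shell. This is a rough sketch under a few assumptions: run_in_background is a made-up helper, start_new_session works on POSIX systems only, and detaching into a new session is similar to nohup but not identical (nohup ignores SIGHUP, while this removes the process from the terminal's session).

```python
import subprocess
import sys


def run_in_background(cmd, log_path):
    """Start cmd detached, sending stdout and stderr to the same log file."""
    # "wb" truncates the file, like `> myout.file` in the shell.
    log = open(log_path, "wb")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # no terminal input
            stdout=log,                 # like `> myout.file`
            stderr=subprocess.STDOUT,   # like `2>&1`
            start_new_session=True,     # detach from the terminal's session
        )
    finally:
        # The child keeps its own copy of the file descriptor.
        log.close()
    return proc


if __name__ == "__main__":
    # Harmless demo command; for a crawler this would be something like
    # ["scrapy", "crawl", "myspider"], where "myspider" is a placeholder name.
    demo = [sys.executable, "-c", "print('hello from the background')"]
    process = run_in_background(demo, "myout.file")
    print("started pid", process.pid)
```

The function returns immediately with the Popen object, so the calling script can exit while the command keeps writing to the log file.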