The big, big world is full of big data, and I want to capture it, analyze it, and make something real out of it. If I could crawl the data and see what it is and what it looks like, that would be really amazing.

In this article, I'll give you a simple guide to installing the packages a crawler requires, along with some very simple usage examples, and show you where to find crawler resources for further study.

Requirements

Set up a virtualenv named clair (why that name? hhhh, maybe because it's homophonic :P)

mkdir env && cd env                 # keep virtual environments in one place
virtualenv clair                    # create a virtualenv named clair
source clair/bin/activate           # activate it
cd yourproject                      # switch to your project directory (adjust the path)
pip install -r requirements.txt    # install the crawler's dependencies

requirements.txt

requests>=2.22.0
aiohttp>=3.6.2
pyquery>=1.4.0

Alternatively, you can install the same packages one by one with pip (for example, pip install requests aiohttp pyquery).

Basic Usage and Simple Test Cases

Requests

import requests

url = "https://cn.bing.com/"
page = requests.get(url)
print(page)
# <Response [200]>
print(page.text)
# <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
# <html>
# ...
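
requirements.txt also lists pyquery and aiohttp, which get no example above. Here is a minimal sketch of parsing the fetched page with pyquery, using the same Bing URL:

from pyquery import PyQuery as pq
import requests

page = requests.get("https://cn.bing.com/")
doc = pq(page.text)            # build a pyquery document from the HTML
print(doc("title").text())     # extract the <title> text with a CSS selector

And the same fetch done asynchronously with aiohttp, again just a sketch:

import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:   # one session per crawl
        async with session.get(url) as resp:
            return await resp.text()

html = asyncio.get_event_loop().run_until_complete(fetch("https://cn.bing.com/"))
print(html[:60])               # first few characters of the page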

Selenium

from selenium import webdriver
chrome = webdriver.Chrome()
# requires chromedriver on PATH
# a blank Chrome browser window opens

from selenium import webdriver
firefox = webdriver.Firefox()
# requires geckodriver on PATH
# Haha, I got: Message: TypeError: Given browserName [object String] "firefox", but my name is [object String] "waterfox"

from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get('https://cn.bing.com/')
print(browser.current_url)
# https://cn.bing.com/
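
If you don't want a browser window popping up every time, here is a minimal sketch of running Chrome headless. It assumes chromedriver is on your PATH and a reasonably recent Selenium that accepts the options keyword:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")            # run Chrome without a visible window
driver = webdriver.Chrome(options=options)    # still requires chromedriver
driver.get("https://cn.bing.com/")
print(driver.title)
driver.quit()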

Tesserocr

# pip3 install tesserocr pillow
## Ubuntu
# sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev
# tesseract --list-langs
# tesseract image.png result -l eng && cat result.txt
# test image: https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png
import tesserocr
from PIL import Image

image = Image.open('image.png')
print(tesserocr.image_to_text(image))       # OCR from a PIL Image object
print(tesserocr.file_to_text('image.png'))  # OCR straight from the file path
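
If the recognition result is noisy, a common trick (my addition, not part of the original setup) is to convert the image to grayscale and binarize it before handing it to tesserocr; the threshold of 127 below is an arbitrary assumption:

import tesserocr
from PIL import Image

image = Image.open('image.png').convert('L')           # grayscale
image = image.point(lambda x: 255 if x > 127 else 0)   # crude binarization, threshold 127 is arbitrary
print(tesserocr.image_to_text(image))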

Redis

import redis
print(redis.VERSION)   # version of the redis-py client library
# (2, 10, 6)
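
To check that the client can actually reach a server, a minimal sketch assuming a local Redis running on the default port 6379:

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)
r.set('greeting', 'hello')   # write a key
print(r.get('greeting'))     # b'hello'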

Tornado

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")

def make_app():
    return tornado.web.Application([
        (r"/", MainHandler),
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
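
With the server running, you can check it from another shell using the requests library installed earlier (port 8888 as configured above):

import requests
print(requests.get("http://localhost:8888/").text)   # Hello, world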

Flask

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()

Pyspider and Scrapy

import pyspider
# start all pyspider components with: pyspider all
import scrapy
# if the Twisted installation fails, try the 32-bit build instead
# run the scrapy command to check the installation
# pip3 install scrapy-redis
# pip3 install 'requests[socks]'
# pip3 install requests_oauthlib
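
Scrapy only gets an import above, so here is a minimal spider sketch; the spider name, file name, and selector are my own assumptions. Save it as bing_spider.py and run it with scrapy runspider bing_spider.py:

import scrapy

class BingSpider(scrapy.Spider):
    name = "bing"                              # hypothetical spider name
    start_urls = ["https://cn.bing.com/"]      # same test URL as the other examples

    def parse(self, response):
        # pull the page title out with a CSS selector
        yield {"title": response.css("title::text").extract_first()}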

Treasures

Finally, the crawler resources worth studying:

Requests

Requests is an elegant and simple HTTP library for Python, built for human beings.

PyQuery

Selenium

Tesserocr

Redis

Web frameworks (Tornado, Flask)

Awesome-crawler