The big, big world is full of big data, and I want to capture it, analyze it, and make something real out of it. If I could crawl the data and see what it is and what it looks like, that would be really amazing.

In this article, I'll give you a simple guide to installing the packages a crawler requires, along with some very simple usage examples, and show you where to find crawler resources for further study.

Requirements

Set up a virtualenv named clair (why that name? hhhh, maybe because it's homophonic :P)

mkdir env && cd env                 # keep virtual environments in one place
virtualenv clair                    # create a virtualenv named clair
source clair/bin/activate           # activate it
cd yourproject                      # switch to your project directory (adjust the path)
pip install -r requirements.txt    # install the crawler's dependencies

requirements.txt

requests>=2.22.0
aiohttp>=3.6.2
pyquery>=1.4.0

Alternatively, you can install the same packages one by one with pip (for example, pip install requests aiohttp pyquery).

Basic Usage and Simple Test Cases

Requests

import requests

url = "https://cn.bing.com/"
page = requests.get(url)
print(page)
# <Response [200]>
print(page.text)
# <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
# <html>
# ...
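
requirements.txt also lists pyquery and aiohttp, which get no example above. Here is a minimal sketch of parsing the fetched page with pyquery, using the same Bing URL:

from pyquery import PyQuery as pq
import requests

page = requests.get("https://cn.bing.com/")
doc = pq(page.text)            # build a pyquery document from the HTML
print(doc("title").text())     # extract the <title> text with a CSS selector

And the same fetch done asynchronously with aiohttp, again just a sketch:

import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:   # one session per crawl
        async with session.get(url) as resp:
            return await resp.text()

html = asyncio.get_event_loop().run_until_complete(fetch("https://cn.bing.com/"))
print(html[:60])               # first few characters of the page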

Selenium

from selenium import webdriver
chrome = webdriver.Chrome()
# requires chromedriver on PATH
# a blank Chrome browser window opens

from selenium import webdriver
firefox = webdriver.Firefox()
# requires geckodriver on PATH
# Haha, I got: Message: TypeError: Given browserName [object String] "firefox", but my name is [object String] "waterfox"

from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get('https://cn.bing.com/')
print(browser.current_url)
# https://cn.bing.com/
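
If you don't want a browser window popping up every time, here is a minimal sketch of running Chrome headless. It assumes chromedriver is on your PATH and a reasonably recent Selenium that accepts the options keyword:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")            # run Chrome without a visible window
driver = webdriver.Chrome(options=options)    # still requires chromedriver
driver.get("https://cn.bing.com/")
print(driver.title)
driver.quit()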

Tesserocr

# pip3 install tesserocr pillow
## Ubuntu
# sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev
# tesseract --list-langs
# tesseract image.png result -l eng && cat result.txt
# test image: https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png
import tesserocr
from PIL import Image

image = Image.open('image.png')
print(tesserocr.image_to_text(image))       # OCR from a PIL Image object
print(tesserocr.file_to_text('image.png'))  # OCR straight from the file path
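
If the recognition result is noisy, a common trick (my addition, not part of the original setup) is to convert the image to grayscale and binarize it before handing it to tesserocr; the threshold of 127 below is an arbitrary assumption:

import tesserocr
from PIL import Image

image = Image.open('image.png').convert('L')           # grayscale
image = image.point(lambda x: 255 if x > 127 else 0)   # crude binarization, threshold 127 is arbitrary
print(tesserocr.image_to_text(image))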

Redis

import redis
print(redis.VERSION)   # version of the redis-py client library
# (2, 10, 6)
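
To check that the client can actually reach a server, a minimal sketch assuming a local Redis running on the default port 6379:

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)
r.set('greeting', 'hello')   # write a key
print(r.get('greeting'))     # b'hello'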

Tornado

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")

def make_app():
    return tornado.web.Application([
        (r"/", MainHandler),
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
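
With the server running, you can check it from another shell using the requests library installed earlier (port 8888 as configured above):

import requests
print(requests.get("http://localhost:8888/").text)   # Hello, world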

Flask

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()

Pyspider and Scrapy

import pyspider
# start all pyspider components with: pyspider all
import scrapy
# if the Twisted installation fails, try the 32-bit build instead
# run the scrapy command to check the installation
# pip3 install scrapy-redis
# pip3 install 'requests[socks]'
# pip3 install requests_oauthlib
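
Scrapy only gets an import above, so here is a minimal spider sketch; the spider name, file name, and selector are my own assumptions. Save it as bing_spider.py and run it with scrapy runspider bing_spider.py:

import scrapy

class BingSpider(scrapy.Spider):
    name = "bing"                              # hypothetical spider name
    start_urls = ["https://cn.bing.com/"]      # same test URL as the other examples

    def parse(self, response):
        # pull the page title out with a CSS selector
        yield {"title": response.css("title::text").extract_first()}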

Treasures

Finally, the crawler resources worth studying:

Requests

Requests is an elegant and simple HTTP library for Python, built for human beings.

PyQuery

Selenium

Tesserocr

Redis

Web frameworks (Tornado, Flask)

Awesome-crawler