Python Web Crawler Vol.1 - How to Install
The big, big world holds a huge amount of data, and I want to capture and analyze it, so I set out to make that a reality. Being able to crawl the data and see what it is and what it looks like would be really amazing.
In this article, I'll give you a simple guide to installing the packages a crawler requires, along with some very simple usage examples and pointers to resources for studying crawlers.
Requirements
Set up a virtualenv named clair (why that name? Haha, maybe because it is a homophone :P)
mkdir env
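The same environment can also be created from Python itself. The following is a minimal sketch that uses the standard-library venv module instead of the virtualenv tool; the helper name create_env is an assumption for illustration, and the env directory and the clair name simply reuse the ones mentioned above.

```python
import venv
from pathlib import Path


def create_env(base_dir, name="clair", with_pip=True):
    """Create a virtual environment at base_dir/name and return its path."""
    base = Path(base_dir)
    # Equivalent of `mkdir env`; does nothing if the directory already exists.
    base.mkdir(parents=True, exist_ok=True)
    env_path = base / name
    # with_pip=True bootstraps pip inside the new environment.
    venv.EnvBuilder(with_pip=with_pip).create(str(env_path))
    return env_path
```

Calling create_env("env") should produce env/clair, which on Linux or macOS can then be activated with `source env/clair/bin/activate`.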
requirements.txt
requests>=2.22.0
Alternatively, you can install the packages directly with pip.
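After installing, it can help to confirm that each package is actually importable. Here is a minimal sketch using only the standard library; the helper name missing_modules and the module list are assumptions based on the packages used later in this article.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Import names (not necessarily the pip package names) used in this article.
CRAWLER_MODULES = [
    "requests", "selenium", "tesserocr", "redis",
    "tornado", "flask", "pyspider",
]

missing = missing_modules(CRAWLER_MODULES)
print("missing:", ", ".join(missing) if missing else "none")
```

Anything printed after "missing:" still needs to be installed in the active environment.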
Basic Usage and Simple Test Case
Requests
import requests
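As a minimal sketch of what a first Requests test might look like: the helper names build_request and fetch_text, the httpbin.org URL, and the 10-second timeout are all assumptions for illustration. The request is only prepared here, not sent, so the snippet runs without network access.

```python
import requests


def build_request(url, params=None):
    """Prepare a GET request without sending it, so the final URL can be inspected."""
    return requests.Request("GET", url, params=params).prepare()


def fetch_text(url, params=None, timeout=10):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


prepared = build_request("https://httpbin.org/get", params={"q": "crawler"})
print(prepared.method, prepared.url)
```

Once the prepared URL looks right, fetch_text with the same arguments would perform the real request.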
Selenium
from selenium import webdriver
Tesserocr
# pip3 install tesserocr pillow
Redis
import redis
Tornado
import tornado.ioloop
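A minimal sketch of a Tornado test application, assuming a single handler on the root path. The names MainHandler, make_app and main, the response text, and port 8888 are assumptions for illustration; main() is defined but not called, so nothing starts listening until you call it.

```python
import tornado.ioloop
import tornado.web


class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # Write a plain-text body for GET /
        self.write("Hello, crawler")


def make_app():
    """Map the root URL to MainHandler."""
    return tornado.web.Application([(r"/", MainHandler)])


def main(port=8888):
    """Start the server and block until the process is stopped."""
    app = make_app()
    app.listen(port)
    tornado.ioloop.IOLoop.current().start()
```

Running main() and opening http://localhost:8888/ in a browser should show the greeting if Tornado is installed correctly.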
Flask
from flask import Flask
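A minimal sketch of a Flask test case, assuming one JSON route at the root path; the route, the index function, and the {"status": "ok"} payload are assumptions for illustration. The route is exercised with Flask's built-in test client, so no server or port is needed.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/")
def index():
    # Return a small JSON document so the result is easy to check.
    return jsonify(status="ok")


# Call the route in-process with the test client; no server is started.
with app.test_client() as client:
    response = client.get("/")
    print(response.status_code, response.get_json())
```

To serve the same app over HTTP instead, app.run() starts Flask's development server.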
Pyspider
import pyspider
Treasures
Requests
Requests is an elegant and simple HTTP library for Python, built for human beings.
- Website: https://requests.kennethreitz.org/en/master/
- Doc: https://docs.python-requests.org/zh_CN/latest
- aiohttp: https://aiohttp.readthedocs.io/en/stable/
Query
- https://www.crummy.com/software/BeautifulSoup/bs4/doc
- http://pyquery.readthedocs.io
- https://pypi.org/project/pyquery/
Selenium
- http://selenium-python.readthedocs.io
- http://selenium-python-zh.readthedocs.io
- https://chromedriver.storage.googleapis.com/index.html
- https://github.com/mozilla/geckodriver/releases
- http://phantomjs.org
Tesserocr
- https://github.com/tesseract-ocr/tesseract/wiki/Documentation
- https://github.com/tesseract-ocr/tesseract
- https://github.com/sirfz/tesserocr
- https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.01.exe
Redis
- https://github.com/MSOpenTech/redis/releases
- https://redisdesktop.com
- http://www.runoob.com/redis/redis-tutorial.html
WEB
Frameworks
- http://docs.pyspider.org/en/latest/tutorial
- http://demo.pyspider.org
- http://scrapy-chs.readthedocs.io
- https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
Awesome-crawler
All articles in this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.