spider's Introduction

Spider

Request libraries

1- urllib library

  • urllib.request: request module
  • urllib.error: exception handling module
  • urllib.parse: URL parsing module
  • urllib.robotparser: robots.txt parsing module
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
# read() returns bytes, so decode it to get a str.

import urllib.parse
data = bytes(urllib.parse.urlencode({'word':'hello'}), encoding='utf-8')
# data must be passed as bytes
r = urllib.request.urlopen('http://httpbin.org/post', data=data)
# URL encoding: parse.quote percent-encodes a string
>>> urllib.parse.quote('{}')
'%7B%7D'
# URL decoding: parse.unquote reverses the encoding
>>> urllib.parse.unquote('%7B%7D')
'{}'

##  Response  ##
response.read().decode('utf-8')	# read() returns bytes; decode them as UTF-8 to get a str
response.status		# HTTP status code
response.getheaders()	# list of (header, value) tuples

##  Request  ##
url = "http://httpbin.org/post"
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0'
}
form = {    # named form rather than dict, to avoid shadowing the built-in type
    'name': 'mike'
}
data = bytes(urllib.parse.urlencode(form), encoding='utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
r = urllib.request.urlopen(req)
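
Before sending a request it can help to check what was actually built. The following is a minimal sketch that inspects a `Request` object without touching the network; the shortened `Mozilla/5.0` user agent is an assumption made for brevity, not the value used above.

```python
import urllib.parse
import urllib.request

form = {'name': 'mike'}
req = urllib.request.Request(
    url='http://httpbin.org/post',
    data=urllib.parse.urlencode(form).encode('utf-8'),  # the body must be bytes
    headers={'user-agent': 'Mozilla/5.0'},  # shortened UA string, for illustration only
    method='POST',
)

# Nothing has been sent yet; these calls only read back the prepared request.
print(req.get_method())              # POST
print(req.data)                      # b'name=mike'
print(req.get_header('User-agent'))  # Mozilla/5.0
print(req.full_url)                  # http://httpbin.org/post
```

Note that `Request` stores header names capitalized, so the header passed as `user-agent` is read back as `User-agent`.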

##  Setting a proxy  ##
url = 'http://httpbin.org/ip'
proxy = {'http':'39.134.108.89:8080','https':'39.134.108.89:8080'}
proxies = urllib.request.ProxyHandler(proxy) # create the proxy handler
opener = urllib.request.build_opener(proxies,urllib.request.HTTPHandler) # build an opener that uses the handler
urllib.request.install_opener(opener) # install the opener globally, so that urlopen() also goes through the proxy
data = urllib.request.urlopen(url)
print(data.read().decode())
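
Installing an opener globally changes every later `urlopen()` call in the process. As an alternative, here is a rough sketch that keeps the proxy scoped to one opener and calls `opener.open()` directly. The helper names `make_proxy_opener` and `fetch_via_proxy` are hypothetical, and the proxy address is the one from the snippet above, which may no longer be reachable.

```python
import urllib.request


def make_proxy_opener(proxy):
    """Build an opener that uses `proxy` without installing it globally."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxy))


def fetch_via_proxy(url, proxy, timeout=10):
    """Fetch `url` through `proxy` and return the decoded body (performs network I/O)."""
    opener = make_proxy_opener(proxy)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode()


# Building the opener does not send anything; only opener.open() does.
opener = make_proxy_opener({'http': '39.134.108.89:8080', 'https': '39.134.108.89:8080'})
print([type(h).__name__ for h in opener.handlers])
```

Calling `fetch_via_proxy('http://httpbin.org/ip', proxy)` would then send a single request through the proxy while leaving plain `urllib.request.urlopen()` untouched.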

##  Setting cookies  ##
import urllib.request
import http.cookiejar

url = 'https://www.jianshu.com'
# Create a CookieJar object
cookie_jar = http.cookiejar.CookieJar()
# Create a cookie handler with HTTPCookieProcessor and build an opener from it
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener=urllib.request.build_opener(handler)
# Install the opener globally
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)
print(cookie_jar)

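### urllib.error

The two most commonly used exception classes in urllib.error are URLError and HTTPError. HTTPError is a subclass of URLError.

A URLError is usually raised when the network cannot be reached or the server does not exist, for example when requesting a URL that does not exist.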
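
Because HTTPError is a subclass of URLError, the more specific exception has to be caught first. The sketch below shows one way to arrange the handlers; the helper name `fetch` and the nonexistent `file://` path are assumptions chosen so that the example raises URLError without using the network.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=5):
    """Return the response body as bytes, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print(f'HTTP error {exc.code}: {exc.reason}')
    except urllib.error.URLError as exc:
        # No usable response at all: DNS failure, refused connection, missing file.
        print(f'Failed to reach {url}: {exc.reason}')
    return None


# A file:// URL that does not exist triggers URLError without any network access.
body = fetch('file:///no/such/path/spider-example.txt')
print(body)
```

If the two `except` clauses were swapped, the URLError branch would also swallow every HTTPError and the status code would never be reported.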

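### urllib.parse

  • urllib.parse.urljoin: joins a base URL and a relative URL
  • urllib.parse.urlencode: converts a dict to a query string
  • urllib.parse.quote: percent-encodes a string for use in a URL
  • urllib.parse.unquote: decodes a percent-encoded string

URLs are written in ASCII rather than Unicode, so non-ASCII characters must be percent-encoded.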
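
The four functions can be tried together in a short, offline sketch. The sample values (the `page` parameter and the string `中文`) are illustrative assumptions, not taken from the original notes.

```python
from urllib.parse import quote, unquote, urlencode, urljoin

# urljoin resolves a relative link against the URL of the page it was found on.
absolute = urljoin('http://httpbin.org/a/b', 'c')
print(absolute)  # http://httpbin.org/a/c

# urlencode turns a dict into a query string.
query = urlencode({'word': 'hello', 'page': 2})
print(query)     # word=hello&page=2

# quote percent-encodes the UTF-8 bytes of non-ASCII characters; unquote reverses it.
encoded = quote('中文')
decoded = unquote(encoded)
print(encoded)   # %E4%B8%AD%E6%96%87
print(decoded)   # 中文
```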

## Parsing libraries

#### 1- lxml library

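Syntax: XPath is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.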

html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Chapter11_example_2</title>
</head>
<body>
<ul class="item">
    <li class="name">无人机</li>
    <li class="price">1亿</li>
</ul>
<ul class="item">
    <li class="name">火箭炮</li>
</ul>
</body>
</html>
"""
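import lxml.html
selector = lxml.html.fromstring(html)
# Get the text of every li with class "name" inside a ul with class "item".
# Attribute names must be prefixed with @.
>>> selector.xpath('//ul[@class="item"]/li[@class="name"]/text()')
['无人机', '火箭炮']

# The class attribute value of every li tag
>>> selector.xpath('//li/@class')
['name', 'price', 'name']

# The html contains two ul tags, so select the enclosing block first and then drill down into it
>>> ul = selector.xpath('//ul[@class="item"]')	 # select the enclosing elements first
>>> ul[0].xpath('li[@class="price"]/text()')
['1亿']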
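
The "enclosing block first, details second" approach extends naturally to a loop over every block. The following sketch assumes lxml is installed and reuses the two `ul` blocks from the sample page; `extract_items` is a hypothetical helper name. It also handles the second block, which has no price.

```python
import lxml.html

page = """
<ul class="item">
    <li class="name">无人机</li>
    <li class="price">1亿</li>
</ul>
<ul class="item">
    <li class="name">火箭炮</li>
</ul>
"""


def extract_items(source):
    """Return one dict per <ul class="item">, with None for any missing field."""
    selector = lxml.html.fromstring(source)
    items = []
    for ul in selector.xpath('//ul[@class="item"]'):
        # These paths have no leading slash, so they are relative to the current ul.
        names = ul.xpath('li[@class="name"]/text()')
        prices = ul.xpath('li[@class="price"]/text()')
        items.append({
            'name': names[0] if names else None,
            'price': prices[0] if prices else None,
        })
    return items


print(extract_items(page))
```

Querying each block separately keeps a name paired with its own price, which two flat lists of names and prices would not guarantee when a field is missing.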

text = """
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>"""
html = lxml.html.fromstring(text)   # parse the string first; a plain str has no xpath() method
# Get the href attribute of the a tag inside the last li tag
result = html.xpath('//li[last()]/a/@href')

# Get the tag name of the element whose class is bold
result = html.xpath('//*[@class="bold"]')[0].tag

# Get all <span> tags under <li> tags
result = html.xpath('//li/span')    # wrong: / selects only direct children, and <span> is not a direct child of <li>, so use a double slash
result = html.xpath('//li//span')

2- pyquery library

spider's People

Contributors

liyufeng0813
