How to parse a web page after scraping it with Python - 图灵python

Published: 2025-05-09 10:51:20

I. Opening a website with webbrowser.open():

>>> import webbrowser
>>> webbrowser.open('http://i.firefoxchina.cn/?from=worldindex')
True

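Example: opening a web page from a script.

By convention, the first line of a Python script is a shebang line beginning with #! python, which tells the computer to run the program with Python. (I tried leaving this line out and the script still ran, so it seems to be a convention rather than a requirement.)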

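1. Read the command-line arguments from sys.argv: open a new file editor window, enter the code below, and save it as map.py.

2. Read the clipboard contents:

3. Call the webbrowser.open() function to open the external browser: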

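#! python3
import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Join the command-line arguments into a single address string.
    mapAddress = ' '.join(sys.argv[1:])
else:
    # No arguments were given, so use the clipboard contents instead.
    mapAddress = pyperclip.paste()
webbrowser.open('http://map.baidu.com/?newmap=1&ie=utf-8&s=s%26wd%3D' + mapAddress)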

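Note: if sys.argv or the join() method is unfamiliar, look them up in the Python documentation. sys.argv is a list of strings, so passing it to the join() method returns a single string.

Now select the text 天安门广场 (Tiananmen Square), copy it, and double-click the program on your desktop. You can also run the program from the command line and type the location after its name.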
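The script concatenates the address onto the URL without escaping it. The sketch below shows one way to percent-encode the address first, using urllib.parse.quote from the standard library. The names build_map_url and BASE_URL are hypothetical and introduced only for illustration, and the URL prefix is an assumption modeled on map.py.

```python
from urllib.parse import quote

# Assumed prefix, modeled on map.py; adjust it to the map service you use.
BASE_URL = 'http://map.baidu.com/?newmap=1&ie=utf-8&s=s%26wd%3D'


def build_map_url(args, clipboard_text):
    """Return the map URL for the address in args, or for the clipboard text."""
    if len(args) > 1:
        # args[0] is the script name; the remaining items are the address words.
        address = ' '.join(args[1:])
    else:
        address = clipboard_text
    # Percent-encode the address so spaces and non-ASCII text are URL-safe.
    return BASE_URL + quote(address.strip())


print(build_map_url(['map.py', 'Tiananmen', 'Square'], ''))
print(build_map_url(['map.py'], '天安门广场'))
```

Keeping the URL construction in a function that takes the argument list and clipboard text as parameters makes it easy to check without opening a browser; the real script would pass sys.argv and pyperclip.paste().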

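II. Downloading files from the web with the requests module:

Python does not ship with the requests module; install it from the command line with pip install requests. Without a proxy the installation may fail on some networks, in which case the package can be installed manually.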

>>> import requests
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex')  # pass a URL to get()
>>> type(res)  # the Response object
<class 'requests.models.Response'>
>>> print(res.status_code)  # the HTTP status code
200
>>> res.text  # the returned text

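requests offers many ways to inspect downloaded content. They are best explained when they are actually needed, so I will not go through them one by one here. When downloading a file, call the raise_for_status() method to make sure the download really succeeded before the program goes on to do anything else; it raises an exception if the response carries an error status code (4xx or 5xx).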

import requests
res = requests.get('http://i.firefoxchina.cn/?from=worldindex')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
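To make the behavior of raise_for_status() concrete, the sketch below exercises it without touching the network. The Response objects are built by hand purely for illustration (real code gets them from requests.get()), and is_ok is a hypothetical helper name.

```python
import requests


def is_ok(res):
    """Return True if the response has no error status, False otherwise."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        print('There was a problem: %s' % exc)
        return False
    return True


# Hand-built responses, for illustration only.
good = requests.Response()
good.status_code = 200
missing = requests.Response()
missing.status_code = 404

print(is_ok(good))
print(is_ok(missing))
```

Catching requests.exceptions.HTTPError instead of a bare Exception keeps unrelated bugs visible. Connection failures and timeouts are raised by requests.get() itself, so handling them would require placing the get() call inside the try block as well.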

III. Saving the downloaded file locally:

>>> import requests
>>> res = requests.get('http://tech.firefox.sina.com/17/0820/10/6DKQALVRW5JHGE.html##0-tsina-1-13074-397232819ff9a47a7b7e80a40613cfe1')
>>> res.raise_for_status()
>>> # Open the file in write-binary mode so that the response bytes, and with them the text's Unicode encoding, are saved unchanged.
>>> file = open('1.txt', 'wb')
>>> # iter_content() returns one chunk of bytes per loop iteration; its argument sets how many bytes each chunk may contain.
>>> for word in res.iter_content(100000):
...     file.write(word)
...
16997
>>> file.close()
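The write loop can also be wrapped in a with statement so that the file is closed even if a write fails. This is a minimal sketch, assuming the chunks come from something like res.iter_content(100000); save_chunks is a hypothetical helper, and the byte strings below merely stand in for a real download.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path; return the total bytes written."""
    total = 0
    with open(path, 'wb') as f:  # binary mode: the bytes are stored unchanged
        for chunk in chunks:
            total += f.write(chunk)  # write() returns the number of bytes written
    return total


# Stand-in data; with requests this would be res.iter_content(100000).
fake_chunks = [b'<html>', b'<body>hello</body>', b'</html>']
target = os.path.join(tempfile.mkdtemp(), '1.txt')
written = save_chunks(fake_chunks, target)
print(written)
```

The number printed in the interactive session (16997) is the return value of write(), that is, the number of bytes written in that iteration.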

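IV. Parsing HTML with the BeautifulSoup module: install it from the command line with pip install beautifulsoup4.

1. The bs4.BeautifulSoup() function can parse the HTML of a page fetched with requests.get(), and it can also parse an HTML file saved locally, opened directly with open().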

>>> import requests, bs4
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex')
>>> res.raise_for_status()
>>> soup = bs4.BeautifulSoup(res.text)

Warning (from warnings module):
  File "C:\Users\King\AppData\Local\Programs\Python\Python36-32\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\__init__.py", line 181
    markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

>>> soup = bs4.BeautifulSoup(res.text, 'html.parser')
>>> type(soup)
<class 'bs4.BeautifulSoup'>

The first call produced the warning shown above, so I added the second argument, 'html.parser'.

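>>> import bs4
>>> html = open('C:\\Users\\King\\Desktop\\1.htm')
>>> # Pass the parser name on the first call: a file object can be read only once, so a second BeautifulSoup() call on the same object would parse nothing.
>>> exampleSoup = bs4.BeautifulSoup(html, 'html.parser')
>>> type(exampleSoup)
<class 'bs4.BeautifulSoup'>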
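The same steps can be written with a with statement so that the file is closed automatically. This is a minimal sketch, assuming beautifulsoup4 is installed; the temporary file and its contents are made up for illustration and stand in for a saved page such as 1.htm.

```python
import os
import tempfile

import bs4

# Create a small HTML file to stand in for a page saved to disk.
path = os.path.join(tempfile.mkdtemp(), 'example.htm')
with open(path, 'w', encoding='utf-8') as f:
    f.write('<html><head><title>Example page</title></head>'
            '<body><p>Hello</p></body></html>')

# Name the parser explicitly so no "No parser was explicitly specified" warning appears.
with open(path, encoding='utf-8') as html:
    example_soup = bs4.BeautifulSoup(html, 'html.parser')

print(type(example_soup))
print(example_soup.title.string)
```

Passing encoding='utf-8' to open() is a choice made for this sketch; a page saved in another encoding would need that encoding instead.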

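2. Finding elements with the select() method: pass a string that serves as a CSS selector, and select() returns the matching elements of the web page, for example: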

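soup.select('p'): all elements named <p>;

soup.select('#author'): the element whose id attribute is author;

soup.select('.notice'): all elements whose CSS class attribute is notice;

soup.select('p span'): all <span> elements inside a <p> element;

soup.select('input[name]'): all <input> elements that have a name attribute, whatever its value;

soup.select('input[type="button"]'): all <input> elements whose type attribute has the value button.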

Many more selector patterns exist; the Beautiful Soup documentation describes the ones that select() supports.
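To see the selectors listed above in action, here is a small self-contained sketch. The HTML snippet is invented for illustration, and beautifulsoup4 is assumed to be installed.

```python
import bs4

html = '''
<html><body>
<p>By <span id="author">Al</span></p>
<p class="notice">Read <span>this</span> first.</p>
<div class="notice">Footer notice</div>
<form>
  <input name="q" type="text"/>
  <input type="button" value="Go"/>
</form>
</body></html>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')

# select() returns a list of Tag objects (an empty list if nothing matches).
print(len(soup.select('p')))                     # elements named <p>
print(soup.select('#author')[0].get_text())      # the element with id="author"
print(len(soup.select('.notice')))               # elements with class "notice"
print(len(soup.select('p span')))                # <span> elements inside a <p>
print(len(soup.select('input[name]')))           # <input> elements with a name attribute
print(len(soup.select('input[type="button"]')))  # <input> elements with type="button"
print(soup.select('#missing'))                   # a selector that matches nothing
```

Because select() returns a list even for an id selector, index the result (for example with [0]) before reading text or attributes, and check that the list is not empty first.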

>>> import requests, bs4
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex')
>>> res.raise_for_status()
>>> soup = bs4.BeautifulSoup(res.text, 'html.parser')
>>> author = soup.select('#author')
>>> print(author)
[]
>>> type(author)
<class 'list'>
>>> link = soup.select('link')
>>> print(link)
[<link href="css/mozMainStyle-min.css?v=20170705" rel="stylesheet" type="text/css"/>, <link href="" id="moz-skin" rel="stylesheet" type="text/css"/>, <link href="" id="moz-dir" rel="stylesheet" type="text/css"/>, <link href="" id="moz-ver" rel="stylesheet" type="text/css"/>]
>>> type(link)
<class 'list'>
>>> len(link)
4
>>> type(link[0])
<class 'bs4.element.Tag'>
>>> link[0]
<link href="css/mozMainStyle-min.css?v=20170705" rel="stylesheet" type="text/css"/>
>>> link[0].attrs
{'rel': ['stylesheet'], 'type': 'text/css', 'href': 'css/mozMainStyle-min.css?v=20170705'}

3. Getting data from an element's attributes: this continues from the code above.

>>> link[0].get('href')
'css/mozMainStyle-min.css?v=20170705'
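A Tag offers more than one way to read an attribute, and the ways differ in how a missing attribute is handled. This is a minimal sketch on an invented <link> element, assuming beautifulsoup4 is installed.

```python
import bs4

soup = bs4.BeautifulSoup(
    '<link href="css/main.css" rel="stylesheet" type="text/css"/>',
    'html.parser')
link = soup.select('link')[0]

print(link.get('href'))         # value of the href attribute
print(link['href'])             # same value, dictionary-style access
print(link.get('id'))           # None: get() does not raise for a missing attribute
print(link.get('id', 'no-id'))  # a default can be supplied instead

try:
    link['id']                  # dictionary-style access raises for a missing attribute
except KeyError:
    print('link has no id attribute')

print(link.get('rel'))          # multi-valued attributes such as rel come back as a list
```

get() is the safer choice when scraping pages whose markup is not under your control, since an absent attribute yields None rather than an exception.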
