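Script requirement: find all phone numbers and email addresses in a web page.
Task breakdown and implementation:
1. Copy the web page to the clipboard
Open the page "https://nostarch.com/contactus", then press Ctrl+A and Ctrl+C.
2. Use the pyperclip library to get the text from the clipboard
>>> import pyperclip, re
>>> text = str(pyperclip.paste())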
>>> text
'Got It\r\nThis website usesCOOKIEs to improve your experience. Learn More\r\nSkip to maincontent\r\n
Home\r\nSearch form\r\nSearch\r\nCatalog\r\nBlog\r\nMedia\r\nWritefor Us\r\nAbout Us\r\nContact Us\r\n
We are currently shipping with some delays.Please see our FAQ.\r\n\r\nTopics\r\nArt & Design\r\nGeneralComputing\r\nHacking & Computer Security\r\n
Hardware /DIY\r\nKids\r\nLEGO®\r\nLinux &BSD\r\nManga\r\nProgramming\r\nPython\r\n
Science & Math\r\nScratch\r\nSystemAdministration\r\nEarly Access\r\nFree ebook edition with every print bookpurchased from nostarch.com!\r\nShopping cart\r\n
0 Items\tTotal: $0.00\r\nUserlogin\r\nLog in\r\nCreate account\r\nContact Us\r\n\r\n
No Starch Press,Inc.\r\n245 8th Street\r\nSan Francisco, CA 94103 USA\r\nPhone: 800.420.7240 or+1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)\r\nFax: +1415.863.9950\r\n\r\n
Reach Us by Email\r\n\r\nGeneral inquiries:info@nostarch.com\r\n
Media requests: media@nostarch.com\r\nAcademic requests:academic@nostarch.com (Further information)\r\nHelp with your order:info@nostarch.com\r\nReach Us on SocialMedia\r\nTwitter\r\nFacebook\r\nInstagram\r\nLinkedin\r\nPinterest\r\n\r\n
Navigation\r\nMyaccount\r\nWant sweet deals?\r\nSign up for our newsletter.\r\n\r\n\r\nAboutUs | Jobs! | Sales and Distribution | Rights | Media | Academic Requests | Conferences | FAQ | Contact Us | Write for Us | Privacy\r\n
Copyright 2020. No Starch Press,Inc\r\n\r\n'
3. Create the regular expression objects phoneRegex and emailRegex
# Create phone regex.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?   # area code
    (\s|-|\.)?           # separator
    (\d{3})              # first 3 digits
    (\s|-|\.)            # separator
    (\d{4})              # last 4 digits
    )''', re.VERBOSE)
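(\d{3}|\(\d{3}\))? matches an optional three-digit area code, with or without parentheses; (\s|-|\.)? matches an optional separator (whitespace, hyphen, or dot); (\d{3}) matches three digits. The second separator is required, and (\d{4}) matches the last four digits.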
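To see how the optional and required parts of the phone pattern behave, here is a minimal sketch that runs the same pattern against a few sample strings. The sample strings are made up for illustration and do not come from the web page, and the name phone_regex is just a local stand-in for phoneRegex.

```python
import re

# Same pattern as phoneRegex above, rebuilt so this snippet runs on its own.
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?   # area code, optional
    (\s|-|\.)?           # separator, optional
    (\d{3})              # first 3 digits
    (\s|-|\.)            # separator, required
    (\d{4})              # last 4 digits
    )''', re.VERBOSE)

# Hypothetical inputs, chosen to exercise each optional part.
samples = ['415-555-1234', '(415) 555-1234', '555.4242', 'no number here']

found = []
for s in samples:
    m = phone_regex.search(s)
    # group(0) is the whole match; None means the pattern did not match.
    found.append(m.group(0) if m else None)

print(found)
# ['415-555-1234', '(415) 555-1234', '555.4242', None]
```

The third sample matches even though it has no area code, because both the area code and the first separator are optional.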
# Create email regex.
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+        # username
    @                        # @ symbol
    [a-zA-Z0-9.-]+           # domain name
    (\.[a-zA-Z]{2,4}){1,2}   # dot-something
    )''', re.VERBOSE)
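[a-zA-Z0-9._%+-]+ matches the username, [a-zA-Z0-9.-]+ matches the domain name, and (\.[a-zA-Z]{2,4}){1,2} matches the suffix: one or two dot-separated parts of 2 to 4 letters each, such as .com or .co.uk.
As you can see, writing a regular expression across several commented lines with re.VERBOSE makes it very readable.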
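The email pattern can be checked the same way. The sketch below is only an illustration: the second and third sample strings are invented, and email_regex is a local stand-in for emailRegex.

```python
import re

# Same pattern as emailRegex above, rebuilt so this snippet runs on its own.
email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+        # username
    @                        # @ symbol
    [a-zA-Z0-9.-]+           # domain name
    (\.[a-zA-Z]{2,4}){1,2}   # dot-something
    )''', re.VERBOSE)

# The first address appears on the page; the other two are hypothetical.
samples = [
    'info@nostarch.com',
    'first.last+tag@mail.example.co.uk',
    'not-an-email',
]

found = []
for s in samples:
    m = email_regex.search(s)
    found.append(m.group(0) if m else None)

print(found)
# ['info@nostarch.com', 'first.last+tag@mail.example.co.uk', None]
```

Note that this pattern is deliberately simple. It accepts typical addresses but is not a full validator for every address that the email standards allow.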
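4. Find all phone numbers in the text and save them.
matches = []
for groups in phoneRegex.findall(text):
    matches.append(groups[0])
Here phoneRegex.findall(text) returns a list of tuples, one tuple per match and one item per group; groups[0] is the outermost group, that is, the whole phone number.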
>>> phoneRegex.findall(text)
[('800.420.7240', '800', '.', '420', '.', '7240'),
 ('415.863.9900', '415', '.', '863', '.', '9900'),
 ('415.863.9950', '415', '.', '863', '.', '9950')]
The phone numbers end up in the list matches, as shown below:
>>> matches
['800.420.7240', '415.863.9900','415.863.9950']
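The tuple results come from how findall treats capture groups: with groups in the pattern it returns one tuple per match, and without groups it returns plain strings. The following sketch uses two simplified patterns (assumptions made for illustration, not the phoneRegex above) to show the difference.

```python
import re

text = 'Phone: 800.420.7240 or +1 415.863.9900'

# With capture groups: findall returns one tuple per match.
grouped = re.compile(r'((\d{3})\.(\d{3})\.(\d{4}))')
grouped_result = grouped.findall(text)
print(grouped_result)
# [('800.420.7240', '800', '420', '7240'), ('415.863.9900', '415', '863', '9900')]

# Without capture groups: findall returns the matched strings directly.
plain = re.compile(r'\d{3}\.\d{3}\.\d{4}')
plain_result = plain.findall(text)
print(plain_result)
# ['800.420.7240', '415.863.9900']

# finditer gives match objects, so group(0) is always the full match.
via_finditer = [m.group(0) for m in grouped.finditer(text)]
print(via_finditer)
# ['800.420.7240', '415.863.9900']
```

Wrapping the whole pattern in an outer group, as the script does, is what makes groups[0] the complete phone number.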
5. Find all email addresses in the text and save them.
for groups in emailRegex.findall(text):
    matches.append(groups[0])
6. Print the results
if len(matches) > 0:
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
The output is:
800.420.7240
415.863.9900
415.863.9950
info@nostarch.com
media@nostarch.com
academic@nostarch.com
info@nostarch.com
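The output lists info@nostarch.com twice because the address appears twice on the page. As one possible refinement, the steps can be wrapped in a function that also removes duplicates while keeping the original order. This is a sketch, not part of the original script: the function name extract_contacts and the shortened sample text are assumptions made for illustration.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?   # area code
    (\s|-|\.)?           # separator
    (\d{3})              # first 3 digits
    (\s|-|\.)            # separator
    (\d{4})              # last 4 digits
    )''', re.VERBOSE)

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+        # username
    @                        # @ symbol
    [a-zA-Z0-9.-]+           # domain name
    (\.[a-zA-Z]{2,4}){1,2}   # dot-something
    )''', re.VERBOSE)


def extract_contacts(text):
    """Return phone numbers, then email addresses, without duplicates."""
    matches = [groups[0] for groups in phone_regex.findall(text)]
    matches += [groups[0] for groups in email_regex.findall(text)]
    # dict.fromkeys keeps the first occurrence of each item, in order.
    return list(dict.fromkeys(matches))


# Shortened stand-in for the clipboard text (hypothetical sample).
sample = (
    'Phone: 800.420.7240 or +1 415.863.9900\n'
    'Fax: +1 415.863.9950\n'
    'General inquiries: info@nostarch.com\n'
    'Media requests: media@nostarch.com\n'
    'Help with your order: info@nostarch.com\n'
)

contacts = extract_contacts(sample)
print('\n'.join(contacts) if contacts
      else 'No phone numbers or email addresses found.')
```

To run it against the real page, pass the clipboard text instead of the sample, for example extract_contacts(pyperclip.paste()) after importing pyperclip.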