
Scraping an entire table from Wikipedia with BeautifulSoup and loading it into pandas

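I am currently scraping the following wiki page: https://en.wikipedia.org/wiki/Cargo_aircraft. I am starting with just one table, the comparison table. I am trying to scrape the whole table and output it to pandas. I know how to add the initial column, Aircraft, but I am having trouble scraping the columns from Volume onward.

How do I add all the rows of the table to a DataFrame or to columns? I am not sure which approach is better.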


from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the Wikipedia page
page = requests.get('https://en.wikipedia.org/wiki/Cargo_aircraft')

# Create the BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', attrs={'class': 'wikitable sortable'})
tabledata = table.find_all('tbody')
links = table.find_all('a')

# Collect the title attribute of every link in the table
aircraft = []
for link in links:
    aircraft.append(link.get('title'))
print(aircraft)

# Build the DataFrame; only the Aircraft column is filled so far
df = pd.DataFrame()
df['Aircraft'] = aircraft
df['Test'] = 'test'
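
To stay with BeautifulSoup, one option is to walk the table row by row and keep every cell, not just the links. The following is a minimal sketch, not the asker's code: the function name `table_to_dataframe` and its `table_class` parameter are hypothetical, and it assumes the first row holds the headers and that no cell uses rowspan or colspan.

```python
from bs4 import BeautifulSoup
import pandas as pd


def table_to_dataframe(html, table_class="wikitable"):
    """Return the first table carrying `table_class` as a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any table whose class list contains table_class
    table = soup.find("table", class_=table_class)
    if table is None:
        raise ValueError(f"no table with class {table_class!r} found")

    rows = []
    for tr in table.find_all("tr"):
        # Header cells are <th>, data cells are <td>; keep both, in order
        cells = tr.find_all(["th", "td"])
        if cells:
            rows.append([cell.get_text(" ", strip=True) for cell in cells])

    header, *body = rows
    # Assumption: every data row has as many cells as the header row
    return pd.DataFrame(body, columns=header)
```

With the page fetched as in the snippet above, `table_to_dataframe(page.text)` should give one string column per table column, which can then be cleaned in pandas.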



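Use pandas.read_html

  • Skip BeautifulSoup and read the table directly into pandas.

  • pandas.read_html reads HTML tables into a list of DataFrame objects.

    • In this case, the table is at index [1].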



import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
# df view
    Aircraft                 Volume    Payload                  Cruise             Range                  Usage
0   Airbus A400M             270 m³    37,000 kg (82,000 lb)    780 km/h (420 kn)  6,390 km (3,450 nmi)   Military
1   Airbus A300-600F         391.4 m³  48,000 kg (106,000 lb)   –                  7,400 km (4,000 nmi)   Commercial
2   Airbus A330-200F         475 m³    70,000 kg (154,000 lb)   871 km/h (470 kn)  7,400 km (4,000 nmi)   Commercial
3   Airbus Beluga            1210 m³   47,000 kg (104,000 lb)   –                  4,632 km (2,500 nmi)   Commercial
4   Airbus Beluga XL         2615 m³   53,000 kg (117,000 lb)   –                  4,074 km (2,200 nmi)   Commercial
5   Antonov An-124           1028 m³   150,000 kg (331,000 lb)  800 km/h (430 kn)  5,400 km (2,900 nmi)   Both
6   Antonov An-225           1300 m³   250,000 kg (551,000 lb)  800 km/h (430 kn)  15,400 km (8,316 nmi)  Commercial
7   Boeing C-17              –         77,519 kg (170,900 lb)   830 km/h (450 kn)  4,482 km (2,420 nmi)   Military
8   Boeing 737-700C          107.6 m³  18,200 kg (40,000 lb)    931 km/h (503 kn)  5,330 km (2,880 nmi)   Commercial
9   Boeing 757-200F          239 m³    39,780 kg (87,700 lb)    955 km/h (516 kn)  5,834 km (3,150 nmi)   Commercial
10  Boeing 747-8F            854.5 m³  134,200 kg (295,900 lb)  908 km/h (490 kn)  8,288 km (4,475 nmi)   Commercial
11  Boeing 747 LCF           1840 m³   83,325 kg (183,700 lb)   878 km/h (474 kn)  7,800 km (4,200 nmi)   Commercial
12  Boeing 767-300F          438.2 m³  52,700 kg (116,200 lb)   850 km/h (461 kn)  6,025 km (3,225 nmi)   Commercial
13  Boeing 777F              653 m³    103,000 kg (227,000 lb)  896 km/h (484 kn)  9,070 km (4,900 nmi)   Commercial
14  Bombardier Dash 8-100    39 m³     4,700 kg (10,400 lb)     491 km/h (265 kn)  2,039 km (1,100 nmi)   Commercial
15  Lockheed C-5             –         122,470 kg (270,000 lb)  919 km/h           4,440 km (2,400 nmi)   Military
16  Lockheed C-130           –         20,400 kg (45,000 lb)    540 km/h (292 kn)  3,800 km (2,050 nmi)   Military
17  Douglas DC-10-30         –         77,000 kg (170,000 lb)   908 km/h (490 kn)  5,790 km (3,127 nmi)   Commercial
18  McDonnell Douglas MD-11  440 m³    91,670 kg (202,100 lb)   945 km/h (520 kn)  7,320 km (3,950 nmi)   Commercial
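
A hard-coded index such as [1] depends on how many tables precede the target on the page, so it may shift if the article is edited. One possible alternative is the `match` argument of `pandas.read_html`, which keeps only tables containing text that matches a pattern. The sketch below is an illustration under that assumption; the helper name `read_table_by_text` is hypothetical and not part of the original answer.

```python
from io import StringIO

import pandas as pd


def read_table_by_text(html, text):
    """Return the first HTML table whose content matches `text`."""
    # match= keeps only tables containing text that matches the pattern,
    # so the result no longer depends on the table's position in the page.
    # read_html raises ValueError when no table matches.
    tables = pd.read_html(StringIO(html), match=text)
    return tables[0]
```

For the page in question, something like `read_table_by_text(requests.get(url).text, "Payload")` would be expected to pick the comparison table, provided no earlier table also contains the word "Payload".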

You can try:

import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
# Keep the first token of each cell (the metric value); '–' marks a missing value
df['Volume'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Volume'].str.split()]).astype(float)
# Payload and Range use thousands separators, so strip the commas before converting
df['Payload'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Payload'].str.split()]).astype(int)
df['Cruise'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Cruise'].str.split()]).astype(float)
df['Range'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Range'].str.split()]).astype(int)

Result:

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
Aircraft    19 non-null object
Volume      15 non-null float64
Payload     19 non-null int64
Cruise      16 non-null float64
Range       19 non-null int64
Usage       19 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 1.0+ KB

print(df)

    Aircraft                 Volume  Payload  Cruise  Range  Usage
0   Airbus A400M              270.0    37000   780.0   6390  Military
1   Airbus A300-600F          391.4    48000     NaN   7400  Commercial
2   Airbus A330-200F          475.0    70000   871.0   7400  Commercial
3   Airbus Beluga            1210.0    47000     NaN   4632  Commercial
4   Airbus Beluga XL         2615.0    53000     NaN   4074  Commercial
5   Antonov An-124           1028.0   150000   800.0   5400  Both
6   Antonov An-225           1300.0   250000   800.0  15400  Commercial
7   Boeing C-17                 NaN    77519   830.0   4482  Military
8   Boeing 737-700C           107.6    18200   931.0   5330  Commercial
9   Boeing 757-200F           239.0    39780   955.0   5834  Commercial
10  Boeing 747-8F             854.5   134200   908.0   8288  Commercial
11  Boeing 747 LCF           1840.0    83325   878.0   7800  Commercial
12  Boeing 767-300F           438.2    52700   850.0   6025  Commercial
13  Boeing 777F               653.0   103000   896.0   9070  Commercial
14  Bombardier Dash 8-100      39.0     4700   491.0   2039  Commercial
15  Lockheed C-5                NaN   122470   919.0   4440  Military
16  Lockheed C-130              NaN    20400   540.0   3800  Military
17  Douglas DC-10-30            NaN    77000   908.0   5790  Commercial
18  McDonnell Douglas MD-11   440.0    91670   945.0   7320  Commercial
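
The same cleanup can also be written without list comprehensions, using pandas string methods. This is a sketch of an alternative, not the answer's method: the helper name `leading_number` is hypothetical, and it assumes each cell either starts with the metric value or holds a placeholder such as '–'.

```python
import pandas as pd


def leading_number(column):
    """Extract the number at the start of each string, e.g. '37,000 kg (82,000 lb)' -> 37000."""
    # Capture digits, thousands separators and an optional decimal part at the start
    first = column.str.extract(r"^\s*([\d,]+(?:\.\d+)?)", expand=False)
    # Cells without a leading number (such as '–') stay NaN after conversion
    return pd.to_numeric(first.str.replace(",", "", regex=False), errors="coerce")
```

It could be applied with a loop such as `for col in ['Volume', 'Payload', 'Cruise', 'Range']: df[col] = leading_number(df[col])`. Unlike `.astype(int)`, this does not fail on a column that contains missing values; such a column simply comes back as float64.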
