Author: 起飞吧和谐号况 | Source: Internet | 2023-09-10 13:59
I'm trying to parse all the company names from this webpage. There are around 2431
companies in there. However, the approach I've tried below only fetches 1000
results.
This is what I can see about the number of results in response while going through dev tools:
hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431 <------------------------
nbPages: 1
page: 0
How can I get the rest of the results using requests?
This is what I've tried so far:
import requests
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'
params = {
'x-algolia-agent': 'Algolia for Javascript (3.35.1); Browser; JS Helper (3.1.0)',
'x-algolia-application-id': '45BWZJ1SGC',
'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
payload = {"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    r = s.post(url, params=params, json=payload)
    # Prints 1000 even though nbHits reports 2431
    print(len(r.json()['results'][0]['hits']))
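Answer
As a workaround, you can simulate searches by using the letters of the alphabet as search patterns. With the code below you get all 2431 companies as a dictionary, with the ID as key and the full company data dictionary as value.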
import requests
import string
params = {
'x-algolia-agent': 'Algolia for Javascript (3.35.1); Browser; JS Helper (3.1.0)',
'x-algolia-application-id': '45BWZJ1SGC',
'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'
result = dict()
for letter in string.ascii_lowercase:
    print(letter)
    # One search per letter; each search is still capped at 1000 hits
    payload = {
        "requests": [{
            "indexName": "YCCompany_production",
            "params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
        }]
    }
    r = requests.post(url, params=params, json=payload)
    # Keying by id drops companies that match more than one letter
    result.update({h['id']: h for h in r.json()['results'][0]['hits']})
print(len(result))
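Because every single-letter search is itself limited to 1000 hits, it may be worth checking whether any letter was truncated. The following is only a sketch that works on already-parsed responses; `merge_hits` is a hypothetical helper name, and the two responses are made-up stand-ins, not real data from the index. It relies on the `nbHits` field that the question shows in the response.

```python
def merge_hits(result, response_json):
    """Merge the hits of one parsed /queries response into `result`, keyed by 'id'.

    Returns the nbHits value the server reported for that query.
    """
    first = response_json['results'][0]
    result.update({h['id']: h for h in first['hits']})
    return first['nbHits']


# Made-up responses for an offline demonstration (assumption: same shape as the real ones).
fake_a = {'results': [{'nbHits': 2, 'hits': [{'id': 1, 'name': 'Alpha'},
                                             {'id': 2, 'name': 'Beta'}]}]}
fake_b = {'results': [{'nbHits': 2, 'hits': [{'id': 2, 'name': 'Beta'},
                                             {'id': 3, 'name': 'Bravo'}]}]}

result = {}
truncated = []  # queries that matched more records than were returned
for query, resp in [('a', fake_a), ('b', fake_b)]:
    nb_hits = merge_hits(result, resp)
    if nb_hits > len(resp['results'][0]['hits']):
        truncated.append(query)

print(len(result), truncated)
```

If `truncated` is not empty for the real index, those letters would need a narrower pattern (for example two-letter prefixes) to be covered completely.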
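Answer
Update 01-04-2021
After reviewing the "fine print" in the Algolia API documentation, I found that the paginationLimitedTo parameter cannot be used in a query. It can only be set by the data owner when configuring the index.
It looks as if you can use query together with offset and length like this: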
payload = {"requests":[{"indexName":"YCCompany_production",
"params": "query=&offset=1000&length=500&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit"
"%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}
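The hand-encoded `params` string is an ordinary URL-encoded query string whose `facets` value is a JSON array. As a sketch, it can be generated with the standard library instead of being typed by hand; `build_params` and `FACETS` are hypothetical names introduced here for illustration.

```python
import json
from urllib.parse import urlencode

# Facet names taken from the payload in the question
FACETS = ['top100', 'isHiring', 'nonprofit', 'batch',
          'industries', 'subindustry', 'status', 'regions']


def build_params(**search_params):
    """Return the URL-encoded 'params' string for one Algolia request."""
    fields = dict(search_params)
    # Compact separators so the JSON array has no spaces, as in the original string
    fields['facets'] = json.dumps(FACETS, separators=(',', ':'))
    fields['tagFilters'] = ''
    return urlencode(fields)


params = build_params(query='', offset=1000, length=500)
payload = {'requests': [{'indexName': 'YCCompany_production', 'params': params}]}
print(params)
```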
Unfortunately, the paginationLimitedTo value set on the index by its owner will not let you retrieve more than 1000 records through the API:
"hits": [],
"nbHits": 2432,
"offset": 1000,
"length": 500,
"message": "you can only fetch the 1000 hits for this query. You can extend the number of hits returned via the paginationLimitedTo index parameter or use the browse method. You can read our FAQ for more details about browsing: https://www.algolia.com/doc/faq/index-configuration/how-can-i-retrieve-all-the-records-in-my-index",
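A script that walks offsets can recognise this refusal from the parsed result: `hits` is empty although `nbHits` is positive, and a `message` field is present. This is a minimal sketch that assumes the response shape shown above; `hit_limit_reached` is a hypothetical helper, and the message text in the sample is abbreviated.

```python
def hit_limit_reached(result):
    """Return True when a parsed result looks like the refusal shown above.

    That is: no hits came back even though the server reports matches
    (nbHits > 0) and attached an explanatory 'message'.
    """
    return (not result.get('hits')
            and result.get('nbHits', 0) > 0
            and 'message' in result)


# Sample dicts modelled on the response fragment above (message abbreviated)
refused = {
    'hits': [],
    'nbHits': 2432,
    'offset': 1000,
    'length': 500,
    'message': 'you can only fetch the 1000 hits for this query. ...',
}
accepted = {'hits': [{'id': 1}], 'nbHits': 2432, 'offset': 0, 'length': 500}

print(hit_limit_reached(refused), hit_limit_reached(accepted))
```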
The browse workaround mentioned in that message requires the ApplicationID and the AdminAPIKey.
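For an index owner who does hold such a key, browsing could look roughly like the sketch below. Everything specific in it is an assumption on my part rather than something confirmed in this thread: the `/1/indexes/<index>/browse` path, the two header names, and the `cursor` field in the response should all be checked against the current Algolia documentation. `browse_all` is a hypothetical helper; the fake transport only exists so the loop can be exercised offline.

```python
def browse_all(post, app_id, api_key, index_name):
    """Collect all records of an index by following the browse cursor.

    `post` is any callable with the calling convention of requests.post.
    ASSUMPTION: URL path, header names and the 'cursor' field follow
    Algolia's REST browse endpoint; verify against the official docs.
    """
    url = 'https://{}-dsn.algolia.net/1/indexes/{}/browse'.format(app_id, index_name)
    headers = {
        'X-Algolia-Application-Id': app_id,
        'X-Algolia-API-Key': api_key,
    }
    body = {'params': 'hitsPerPage=1000'}
    records = []
    while True:
        data = post(url, headers=headers, json=body).json()
        records.extend(data['hits'])
        cursor = data.get('cursor')
        if not cursor:  # no cursor in the response: last batch reached
            break
        body = {'cursor': cursor}
    return records


class FakeResponse:
    """Stand-in for a requests.Response, for the offline demo only."""
    def __init__(self, data):
        self._data = data

    def json(self):
        return self._data


def fake_post(url, headers=None, json=None):
    # First call returns a cursor, the follow-up call returns the final batch
    if 'cursor' not in json:
        return FakeResponse({'hits': [{'id': 1}, {'id': 2}], 'cursor': 'next'})
    return FakeResponse({'hits': [{'id': 3}]})


records = browse_all(fake_post, 'APPID', 'hypothetical-admin-key', 'YCCompany_production')
print(len(records))
```

With real credentials the call would be `browse_all(requests.post, ...)`. The search-only key from the question is not expected to work here.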
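Original post
According to the Algolia API documentation, the number of hits a query can return is limited to 1000.
The documentation lists several ways to override or bypass this limit.
One part of the API is paginationLimitedTo, which is set to 1000 by default for performance and "scraping protection".
The syntax is: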
'paginationLimitedTo': number_of_records
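Another method mentioned in the documentation is to set the parameters offset and length.
offset lets you specify the starting hit (or record); it is zero-based
length sets the number of records returned
You could use these parameters to walk through the records, which potentially does not hurt your scraping performance.
For example, you could scrape in blocks of 500.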
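- Records 1-500 (offset=0 and length=500)
- Records 501-1000 (offset=500 and length=500)
- Records 1001-1500 (offset=1000 and length=500)
- and so on...
or
- Records 1-500 (offset=0 and length=500)
- Records 500-999 (offset=499 and length=500)
- Records 999-1498 (offset=998 and length=500)
- and so on...
The latter produces one duplicate at each block boundary, which is easy to drop when adding the records to an in-memory store (list, dictionary, dataframe).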
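The two block schemes above can be sketched as a small generator. `offset_windows` is a hypothetical helper, and the list of records is a local stand-in for the index, so this only illustrates the arithmetic and the de-duplication, not a real request.

```python
def offset_windows(total, length, overlap=0):
    """Yield (offset, length) pairs that cover `total` records.

    Offsets are zero-based. With overlap > 0, each window re-reads the
    last `overlap` records of the previous window.
    """
    if not 0 <= overlap < length:
        raise ValueError('overlap must satisfy 0 <= overlap < length')
    if total <= 0:
        return
    step = length - overlap
    offset = 0
    while True:
        size = min(length, total - offset)
        yield offset, size
        if offset + size >= total:  # this window reached the last record
            break
        offset += step


# Local stand-in for the index: 1500 records with ids 1..1500
records = [{'id': i} for i in range(1, 1501)]

seen = {}
for offset, length in offset_windows(len(records), 500, overlap=1):
    for rec in records[offset:offset + length]:
        seen[rec['id']] = rec  # keying by id drops the boundary duplicates

print(list(offset_windows(1500, 500)))
print(len(seen))
```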
----------------------------------------
My system information
----------------------------------------
Platform: macOS
Python: 3.8.0
Requests: 2.25.1
----------------------------------------