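Although the first part could indeed be solved in "text mode" with regular expressions, or with a more complete DOM implementation in JavaScript, the second part (the height calculation) needs a real, full browser or a headless engine such as PhantomJS.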
PhantomJS is a command-line tool that packs and embeds WebKit.
Literally it acts like any other WebKit-based web browser, except that
nothing gets displayed to the screen (thus, the term headless). In
addition to that, PhantomJS can be controlled or scripted using its
Javascript API.
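Below is a rough outline of the steps (untested, I admit).
In your modification script (for example, modify-html-file.js), open an HTML page, modify its DOM tree, and console.log the HTML of the root element: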
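    var page = require('webpage').create();
    var system = require('system');

    // system.args[0] is the script name; the HTML file path is the first real argument
    page.open(encodeURI('file://' + system.args[1]), function (status) {
        if (status === 'success') {
            // the callback runs inside the page, so `document` is the loaded DOM
            var html = page.evaluate(function () {
                // your DOM manipulation here
                return document.documentElement.outerHTML;
            });
            console.log(html);
        }
        // always exit, otherwise phantomjs keeps running
        phantom.exit();
    });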
Next, save the new HTML by redirecting the script's output to a file:
    #!/bin/bash
    mkdir -p modified
    for i in *.html; do
        # pass an absolute path so that 'file://' + path forms a valid URL
        phantomjs modify-html-file.js "$PWD/$i" > "modified/$i"
    done
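
For the height calculation, the body of the page.evaluate callback could look roughly like the sketch below. It is only an illustration: the helper name measureHeights and the selector argument are made up here, and the use of offsetHeight is an assumption about which height is wanted. The document is passed in as a parameter so the function can also be tried outside a browser with a stub.

```javascript
// Hypothetical helper (not part of PhantomJS): returns the rendered height,
// in pixels, of every element matching `selector` in the given document.
function measureHeights(doc, selector) {
    var nodes = doc.querySelectorAll(selector);
    var heights = [];
    for (var i = 0; i < nodes.length; i++) {
        // offsetHeight depends on layout, which is why a real rendering
        // engine is needed rather than plain text processing
        heights.push(nodes[i].offsetHeight);
    }
    return heights;
}
```

Because page.evaluate runs its callback inside the page and does not see variables from the outer script, the function would have to be defined inside that callback (or its body inlined there) and called as measureHeights(document, 'div').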