Author: mobiledu2502861593 | Source: Internet | 2023-09-25 01:04
I have an XML file which contains multiple XML declarations, like the following:
Stefan
42
Shirt
3000
Damon
32
Jeans
4000
When I tried to load the XML with
$data = simplexml_load_file("testdoc.xml") or die("Error: Cannot create object");
it gives me the following errors:
Warning: simplexml_load_file(): testdoc.xml:11: parser error : XML declaration allowed only at the start of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): testdoc.xml:12: parser error : Extra content at the end of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Error: Cannot create object
Please let me know how to parse this XML, or how to split it into a number of XML files so that I can read it. The file size is around 1 GB.
2 solutions
4
The second XML declaration (the second line starting with <?xml) needs to be removed. Only one XML declaration is allowed in any file, and it must be on the first line.
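Before editing a file of this size, it can help to confirm how many declarations it contains and where they sit. A minimal sketch using grep, assuming each declaration starts with the literal text <?xml followed by a space; the function names are hypothetical and not part of the original question:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a concatenated XML file.

# Print the number of lines that contain an XML declaration.
# In grep's default (basic) regex syntax a bare ? is a literal character.
count_xml_decls() {
  grep -c '<?xml ' "$1"
}

# Print the line numbers of the first few declarations (default: 5).
first_xml_decl_lines() {
  grep -n '<?xml ' "$1" | head -n "${2:-5}" | cut -d: -f1
}
```

For a file like the one in the question, count_xml_decls testdoc.xml would print a number greater than 1, and first_xml_decl_lines testdoc.xml shows where the extra declarations begin. Both commands stream the file, so they should cope with inputs of around 1 GB.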
Strictly speaking, you also need a single root element (though I've seen lenient parsers). Just wrap the contents in a pseudo tag, so that the file consists of one XML declaration followed by a single root element that encloses all the records.
Solution for (very) large files:
Use sed to eliminate the offending XML declarations and printf to add a single XML declaration plus a unique root element. A sequence of bash commands follows:
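# "root" is a placeholder wrapper name; add encoding="..." to the declaration if the input is not UTF-8
printf "<?xml version=\"1.0\"?>\n<root>\n" >out.xml
# a bare ? is literal in sed's basic regex, so this deletes every line containing a declaration
sed '/<?xml /d' in.xml >>out.xml
printf "</root>\n" >>out.xml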
in.xml denotes your original file, out.xml the purged result.
printf prints a single XML declaration and the opening/closing tags. sed is a tool that edits a file line by line, performing actions contingent on regex pattern matches. The pattern to match is the start of the XML declaration (the literal text <?xml followed by a space); the action to perform is to delete that line.
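Notes:
- The backslashes in the commands escape symbols that have special semantics at the position where they occur. One caveat for sed: a bare ? is already literal in its basic regular expressions, and GNU sed reads \? as an "optional" quantifier, so a pattern written as <\?xml would match any line containing xml followed by a space.
- sed is available for Windows/macOS too.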
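The sed step above deletes every line that contains a declaration, which is only safe when each declaration sits on a line of its own. A minimal sketch of a variant that strips just the declaration text instead, assuming each declaration has the usual <?xml ... ?> form and fits on a single line; the function name merge_xml_docs and the default wrapper name root are placeholders:

```shell
#!/usr/bin/env bash
# merge_xml_docs SRC DST [ROOT]
# Hypothetical helper: writes DST as one declaration, one root element,
# and the contents of SRC with all embedded XML declarations stripped.
merge_xml_docs() {
  local src=$1 dst=$2 root=${3:-root}   # "root" is an arbitrary wrapper name

  # A declaration without an encoding makes parsers assume UTF-8;
  # add encoding="..." here if the input uses a different one.
  printf '<?xml version="1.0"?>\n<%s>\n' "$root" > "$dst"

  # Basic regex: ? is literal, [^>]* spans the attributes up to the closing ?>.
  # Substituting (rather than deleting the line) keeps markup sharing the line.
  sed 's/<?xml [^>]*?>//g' "$src" >> "$dst"

  printf '</%s>\n' "$root" >> "$dst"
}
```

Called as merge_xml_docs in.xml out.xml, this leaves an empty line wherever a declaration occupied a whole line, which is harmless whitespace between elements.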
Alternate solution
Another option is to split the file into individual well-formed files (taken from this SO answer):
csplit -z -f 'temp' -b 'out%03d.xml' in.xml '/<\?xml /' {*}
which produces files named tempout000.xml, tempout001.xml, ... (the -f prefix followed by the -b suffix). You should know at least the order of magnitude of the number of individual files that were concatenated into your input file to be safe with the autonumbering (though you could of course take the byte count of the input file as an upper bound on that magnitude, using -b 'out%09d.xml' in the above command).
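The -z and -b options and the {*} repeat count are GNU extensions to csplit, so the command above may not work with the csplit shipped on other systems. A minimal sketch of the same split using awk instead (a substitute technique, not the command from the linked answer), assuming every document starts on a line containing <?xml followed by a space; the function name split_xml_docs and the part%05d.xml naming are placeholders:

```shell
#!/usr/bin/env bash
# split_xml_docs SRC OUTDIR
# Hypothetical helper: starts a new file in OUTDIR at every line that
# contains an XML declaration, so each piece is one complete document.
split_xml_docs() {
  local src=$1 outdir=$2
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    /<[?]xml / {                      # [?] matches a literal question mark
      if (out != "") close(out)       # avoid running out of open files
      out = sprintf("%s/part%05d.xml", dir, n++)
    }
    out != "" { print > out }         # lines before the first declaration are skipped
  ' "$src"
}
```

Called as split_xml_docs in.xml parts, this writes parts/part00000.xml, parts/part00001.xml, and so on. Five digits leave room for up to 100000 pieces before the zero padding stops lining up; widen the format if the input holds more documents than that.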