Author: mobile user 2502901575_836 | Source: Internet | 2023-07-16 08:14
I'm trying to use Python to parse lines of C++ source code. The only thing I am interested in is include directives.
#include "header.hpp"
I want it to be flexible and still work with poor coding styles like:
# include"header.hpp"
I have gotten to the point where I can read lines and trim whitespace before and after the #. However, I still need to find out which directive it is by reading the string until a non-alphabetic character is encountered, regardless of whether it is a space, quote, tab, or angle bracket.
So basically my question is: How can I split a string starting with alphas until a non alpha is encountered?
I think I might be able to do this with regex, but I have not found anything in the documentation that looks like what I want.
Also if anyone has advice on how I would get the file name inside the quotes or angled brackets that would be a plus.
7 solutions
1
The two options mentioned by others that are best in my opinion are re.split and re.findall:
>>> import re
>>> re.split(r'\W+', '#include "header.hpp"')
['', 'include', 'header', 'hpp', '']
>>> re.findall(r'\w+', '#include "header.hpp"')
['include', 'header', 'hpp']
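Note that \w also matches digits and underscores, so neither call above stops strictly at the first non-letter. If you only want the leading run of letters (the directive name, once the # and surrounding whitespace are trimmed), a minimal sketch could look like the following. The helper name leading_alpha is made up for illustration, and it assumes ASCII letters only:

```python
import re

# Hypothetical helper (not from the original answer): split a string into
# its leading run of ASCII letters and whatever follows that run.
_ALPHA_RUN = re.compile(r'[A-Za-z]*')

def leading_alpha(text):
    # The pattern can match the empty string, so match() never returns None.
    match = _ALPHA_RUN.match(text)
    return match.group(), text[match.end():]

print(leading_alpha('include"header.hpp"'))  # ('include', '"header.hpp"')
print(leading_alpha('include <vector>'))     # ('include', ' <vector>')
```

The second element of the result is the untouched remainder, so the quote or angle bracket that ended the run is still available for further parsing.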
A quick benchmark:
>>> import timeit
>>> setup = "import re; word_pattern = re.compile(r'\w+'); sep_pattern = re.compile(r'\W+')"
>>> iterations = 10**6
>>> timeit.timeit("re.findall(r'\w+', '#header foo bar!')", setup=setup, number=iterations)
3.000092029571533
>>> timeit.timeit("word_pattern.findall('#header foo bar!')", setup=setup, number=iterations)
1.5247418880462646
>>> timeit.timeit("re.split(r'\W+', '#header foo bar!')", setup=setup, number=iterations)
3.786440134048462
>>> timeit.timeit("sep_pattern.split('#header foo bar!')", setup=setup, number=iterations)
2.256173849105835
The functional difference is that re.split keeps empty tokens. That's usually not useful for tokenization purposes, but the following should be identical to the re.findall solution:
>>> list(filter(bool, re.split(r'\W+', '#include "header.hpp"')))
['include', 'header', 'hpp']
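As for getting the file name inside the quotes or angle brackets: tokenizing on \W+ splits header.hpp into two pieces, so a single pattern with capture groups may be a better fit. The following is only a sketch. The names INCLUDE_RE and parse_include are hypothetical, and it assumes the directive sits on one line and ignores comments and macro-expanded includes:

```python
import re

# Assumed pattern: optional whitespace, '#', optional whitespace, the word
# 'include', optional whitespace, then a name in "quotes" or <angle brackets>.
INCLUDE_RE = re.compile(r'\s*#\s*include\s*(?:"([^"]+)"|<([^>]+)>)')

def parse_include(line):
    """Return the included file name, or None if line is not an #include."""
    match = INCLUDE_RE.match(line)
    if match is None:
        return None
    # Exactly one of the two groups is set, depending on the delimiter used.
    return match.group(1) or match.group(2)

print(parse_include('#include "header.hpp"'))    # header.hpp
print(parse_include('  #   include"header.hpp"'))  # header.hpp
print(parse_include('#include <vector>'))        # vector
print(parse_include('#define X 1'))              # None
```

Because \s* is allowed on both sides of the #, the sloppy spacing from the question is handled without trimming the line first.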