
General approaches to parsing text files


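I have read the code of a few programs that parse text or source files. When writing a lexer, the most common approach seems to be reading the input one character at a time, collecting the characters into a token, and then classifying it as a keyword, identifier, number, and so on. Is that because regular expressions would be slower, or are there parts that regular expressions cannot handle? And how does that differ from splitting the text with regular expressions first and then classifying the pieces as keyword, identifier, number, and so on?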
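For reference, the character-by-character approach described above might look roughly like the sketch below. It is only an illustration: the function name `tokenize_by_char`, the `KEYWORDS` set, and the token kinds are made up for this example and do not come from any particular program.

```python
# Hypothetical keyword set, chosen only for this illustration.
KEYWORDS = {"if", "else", "while", "return"}


def tokenize_by_char(text):
    """Yield (kind, value) pairs by examining one character at a time."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1  # skip whitespace between tokens
        elif ch.isalpha() or ch == "_":
            # Collect a word, then decide whether it is a keyword.
            start = i
            while i < n and (text[i].isalnum() or text[i] == "_"):
                i += 1
            word = text[start:i]
            yield ("KEYWORD" if word in KEYWORDS else "IDENTIFIER", word)
        elif ch.isdigit():
            # Collect the integer part of a number.
            start = i
            while i < n and text[i].isdigit():
                i += 1
            # Optional fraction: a dot followed by at least one digit.
            if i + 1 < n and text[i] == "." and text[i + 1].isdigit():
                i += 1
                while i < n and text[i].isdigit():
                    i += 1
            yield ("NUMBER", text[start:i])
        elif ch in "+-*/=":
            i += 1
            yield ("OPERATOR", ch)
        elif ch == ";":
            i += 1
            yield ("END_STMNT", ch)
        else:
            raise ValueError("unexpected character %r at offset %d" % (ch, i))
```

Every token kind needs its own hand-written loop here, which is the part a regular expression would otherwise describe in a single pattern.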

Asked December 1, 2012 by Xerath, Private First Class (312 reputation)

1 answer


Lexers nowadays are built on regular expressions too, for example:

# http://www.evanfosmark.com/2009/02/sexy-lexing-with-python/

import re
 
 
class UnknownTokenError(Exception):
    """ This exception is raised when an unknown token is encountered
        in the token stream. It holds the line number and the
        offending token.
    """
    def __init__(self, token, lineno):
        self.token = token
        self.lineno = lineno
 
    def __str__(self):
        return "Line #%s, Found token: %s" % (self.lineno, self.token)
 
 
class _InputScanner(object):
    """ This class manages the scanning of a specific input. An instance of it is
        returned when scan() is called. It is built to be great for iteration. This is
        mainly to be used by the Lexer and ideally not directly.
    """
 
    def __init__(self, lexer, input):
        """ Put the lexer into this instance so the callbacks can reference it 
            if needed.
        """
        self._position = 0
        self.lexer = lexer
        self.input = input
 
    def __iter__(self):
        """ All of the code for iteration is controlled by the class itself.
            This and __next__() (named next() in Python 2) are so syntax
            like `for token in lexer.scan(...):` is valid and works.
        """
        return self
 
    def __next__(self):
        """ Used for iteration. It returns token after token until there
            are no more tokens. (rename this to next(self) if using Python 2)
        """
        token = self.scan_next()
        if token is None:
            raise StopIteration
        return token

    def done_scanning(self):
        """ A simple boolean function that returns true if scanning is
            complete and false if it isn't.
        """
        return self._position >= len(self.input)

    def scan_next(self):
        """ Retrieve the next token from the input, or None when the input
            is exhausted. If the flag `omit_whitespace` is set to True,
            then it will skip over the whitespace characters present.
        """
        if self.lexer.omit_whitespace:
            match = self.lexer.ws_regexc.match(self.input, self._position)
            if match:
                self._position = match.end()
        # Check for the end only after skipping whitespace, so that trailing
        # whitespace does not lead to an IndexError in the error path below.
        if self.done_scanning():
            return None
        match = self.lexer.regexc.match(self.input, self._position)
        if match is None:
            lineno = self.input[:self._position].count("\n") + 1
            raise UnknownTokenError(self.input[self._position], lineno)
        self._position = match.end()
        value = match.group(match.lastgroup)
        if match.lastgroup in self.lexer._callbacks:
            value = self.lexer._callbacks[match.lastgroup](self, value)
        return match.lastgroup, value
 
 
class Lexer(object):
    """ A lexical scanner. It takes in an input and a set of rules based
        on regular expressions. It then scans the input and returns the
        tokens one-by-one. It is meant to be used through iterating.
    """
 
    def __init__(self, rules, case_sensitive=True, omit_whitespace=True):
        """ Set up the lexical scanner. Build and compile the regular expression
            and prepare the whitespace searcher.
        """
        self._callbacks = {}
        self.omit_whitespace = omit_whitespace
        self.case_sensitive = case_sensitive
        parts = []
        for name, rule in rules:
            if not isinstance(rule, str):
                rule, callback = rule
                self._callbacks[name] = callback
            parts.append("(?P<%s>%s)" % (name, rule))
        if self.case_sensitive:
            flags = re.M
        else:
            flags = re.M|re.I
        self.regexc = re.compile("|".join(parts), flags)
        # Raw string: "\s" in a plain string literal is an invalid escape.
        self.ws_regexc = re.compile(r"\s*", re.MULTILINE)
 
    def scan(self, input):
        """ Return a scanner built for matching through the `input` field. 
            The scanner that it returns is built well for iterating.
        """
        return _InputScanner(self, input)


if __name__ == '__main__':
    def stmnt_callback(scanner, token):
        """ This is just an example of providing a function to run the
        token through.
        """
        return ""
 
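    rules = [
        ("IDENTIFIER", r"[a-zA-Z_]\w*"),
        # "/" is added on the assumption that division was intended; the
        # literal backslash from the original pattern is kept as well.
        ("OPERATOR",   r"\+|\-|\\|/|\*|\="),
        # Non-capturing group, so the only capturing group is the named one.
        ("DIGIT",      r"[0-9]+(?:\.[0-9]+)?"),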
        ("END_STMNT",  (";", stmnt_callback)), 
        ]
 
    lex = Lexer(rules, case_sensitive=True)
    for token in lex.scan("foo = 5 * 30; bar = bar - 60;"):
        print(token)
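
On the last part of the question (cutting the text up with a regular expression and then classifying the pieces), one possible shape is sketched below. This is not taken from the linked article: `tokenize_with_regex`, `KEYWORDS`, and the group names are hypothetical, and only documented `re` features are used (`re.VERBOSE`, named groups, `finditer`, `Match.lastgroup`). The regular expression finds the token boundaries, and keywords are separated from identifiers afterwards with a set lookup.

```python
import re

# Hypothetical keyword set, chosen only for this illustration.
KEYWORDS = {"if", "else", "while", "return"}

# One alternation with a named group per token kind. MISMATCH is a
# catch-all so that finditer() never silently skips a character.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>    [0-9]+(?:\.[0-9]+)? )
  | (?P<WORD>      [A-Za-z_]\w*        )
  | (?P<OPERATOR>  [+\-*/=]            )
  | (?P<END_STMNT> ;                   )
  | (?P<SKIP>      \s+                 )
  | (?P<MISMATCH>  .                   )
""", re.VERBOSE)


def tokenize_with_regex(text):
    """Yield (kind, value) pairs; words are classified after matching."""
    for m in TOKEN_RE.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue  # whitespace carries no token
        if kind == "MISMATCH":
            raise ValueError(
                "unexpected character %r at offset %d" % (value, m.start()))
        if kind == "WORD":
            # The regex only finds the word; the keyword check is a lookup.
            kind = "KEYWORD" if value in KEYWORDS else "IDENTIFIER"
        yield kind, value
```

Looking keywords up after the match keeps the pattern short and avoids having to order keyword alternatives ahead of the identifier rule.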
Answered December 1, 2012 by Poppy, Private First Class (395 reputation)