Python Symbol Table
2015-07-18 15:16
495 查看
点击打开链接
Eli
Bendersky's website
About
Code
Archives
September 18, 2010 at 08:03 Tags Articles , Python
internals
This article is the first in a short series in which I intend to explain how CPython [1] implements
and uses symbol tables in its quest to compile Python source code into bytecode. In this part I will explain what a symbol table is and show how the general concepts apply to Python. In the second part I will delve into the implementation of symbol tables
in the core of CPython.
As usual, it's hard to beat Wikipedia for
a succinct definition:
In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is associated with information relating to its declaration or appearance in the source,
such as its type, scope level and sometimes its location.
Symbol tables are used by practically all compilers. They're especially important in statically typed languages where all variables have types and type checking by the compiler is an important part of the front-end.
Consider this C code:
There are two distinct assignments of the form bb = *aa here,
but only the second one is legal. The compiler throws the following error when it sees the first one:
How does the compiler know that the * operator
is given an argument of type int,
which is invalid for this operator? The answer is: the symbol table. The compiler sees *aa and
asks itself what the type of aa is.
To answer this question it consults the symbol table it constructed earlier. The symbol table contains the declared types for all variables the compiler encountered in the code.
This example demonstrates another important concept - for most languages a single symbol table with information about all variables won't do. The second assignment is valid, because in the
internal scope created by the curly bracesaa is
redefined to be of a pointer type. Thus, to correctly compile such code the C compiler has to keep a separate symbol table per scope [2].
So far I've been using the term "variable" liberally. Just to be on the safe side, let's clarify what is meant by variable in Python. Formally, Python doesn't really have variables in the sense
C has. Rather, Python has symbolic names bound to objects:
In this code, aa is
a name bound to a list object. bb is
a name bound to the same object. The third line modifies the list through aa,
and if we print out bb we'll
see the modified list as well.
Now, once this is understood, I will still use the term "variable" from time to time since it's occasionally convenient and everybody's used to it anyway.
Alright, so symbol tables are very useful for type checking. But Python doesn't have compile-time type checking (duck typing FTW!), so what does CPython need symbol tables for?
The CPython compiler still has to resolve what kinds of variables are used in the code. Variables in Python can be local, global or even bound by a lexically enclosing scope. For example:
The function inner uses
three variables: aa, bb and cc.
They're all different from Python's point of view: aa is
lexically bound in outer, bb is
locally bound in inner itself,
and cc is not bound anywhere
in sight, so it's treated as global. The bytecode generated for inner shows
clearly the different treatment of these variables:
As you can see, different opcodes are used for loading the variables onto the stack prior to applying BINARY_ADD.LOAD_DEREF is
used for aa, LOAD_FAST for bb and LOAD_GLOBAL for cc.
At this point, there are three different directions we can pursue on our path to deeper understanding of Python:
Figure out the exact semantics of variables in Python - when are they local, when are they global and what exactly makes them lexically bound.
Understand how the CPython compiler knows the difference.
Learn about the different bytecode opcodes for these variables and how they affect the way the VM runs code.
I won't even try going into (1) since it's a broad topic completely out of the scope of this article. There are plenty of resources online - start with the official and
continue Googling until you're fully enlightened. (3) is also out of scope as I'm currently focusing on the front-end of CPython. If you're interested, there's an excellent series of in-depth articles on Python focusing on the back-end, with a
nice treatment of this very issue.
To answer (2) we need to understand how CPython uses symbol tables, which is what this series of articles is about.
A high-level view of the front-end of CPython is:
Parse source code into a parse tree
Transform parse tree into an Abstract Syntax Tree
Transform AST into a Control Flow Graph
Emit bytecode based on the Control Flow Graph
Symbol tables are created in step 3. The compiler builds a symbol table from the AST representing the Python source code. This symbol table, in conjunction with the AST is then used to generate the control flow
graph (CFG) and ultimately the bytecode.
CPython does a great job exposing some of its internals via standard-library modules [3].
Symbol tables is yet another internal data structure that can be explored from the outside in pure Python code, with the help of the symtable module.
From its description:
Symbol tables are generated by the compiler from AST just before bytecode is generated. The symbol table is responsible for calculating the scope of every identifier in the code. symtable provides an interface to examine these tables.
The symtable module
provides a lot of information on the various identifiers encountered in Python code. Apart from telling their scope, it allows us to find out which variables are referenced in their scope, assigned in their scope, define new namespaces (like functions) and
so on. To help with exploring the symbol table I've written the following function that simplifies working with the module:
Let's see what it has to say about the inner function
from the example above:
Indeed, we see that the symbol table marks aa as
lexically bound, or "free" (more on this in the next section), bb as
local and cc as global. It
also tells us that all these variables are referenced in the scope of inner and
that bb is assigned in that
scope.
The symbol table contains other useful information as well. For example, it can help distinguish between explicitly declared globals and implicit globals:
In this code both ff and gg are
global to outer, but only gg was
explicitly declared global.
The symbol table knows this - the output of describe_symbol for
this function is:
Unfortunately, there's a shorthand in the core of Python that may initially confuse readers as to exactly what constitutes a "free" variable. Fortunately, it's a very slight confusion that's easy to put in order.
The execution model reference says:
If a variable is used in a code block but not defined there, it is a free variable.
This is consistent with the formal
definition. In the source, however, "free" is actually used as a shorthand for "lexically bound free variable" (i.e. variables for which a binding has been found in an enclosing scope), with "global" being used to refer to all remaining free variables.
So when reading the CPython source code it is important to remember that the full set of free variables includes both the variables tagged specifically as "free", as well as those tagged as "global".
Thus, to avoid a confusion I say "lexically bound" when I want to refer to the variables actually treated in CPython as free.
Although Python is duck-typed, some things can still be enforced at compile-time. The symbol table is a powerful tool allowing the compiler to catch some errors [4].
For example, it's not allowed to declare function parameters as global:
When compiling this function, the error is caught while constructing the symbol table:
The symbol table is useful here since it knows that aa is
a parameter in the scope of outer and
when global aa is
encountered it's a sure sign of an error.
Other errors are handled by the symbol table: duplicate argument names in functions, using import * inside
functions, returning values inside generators, and a few more.
This article serves mainly as an introduction for the next one, where I plan to explore the actual implementation of symbol tables in the core of CPython. A symbol table is a tool designed to solve some problems
for the compiler, and I hope this article did a fair job describing a few of these problems and related terminology.
Special thanks to Nick Coghlan for reviewing this article.
© 2003-2015 Eli Bendersky
Back
to top
Eli
Bendersky's website
About
Code
Archives
Python internals: Symbol
tables, part 1
September 18, 2010 at 08:03 Tags Articles , Pythoninternals
Introduction
This article is the first in a short series in which I intend to explain how CPython [1] implementsand uses symbol tables in its quest to compile Python source code into bytecode. In this part I will explain what a symbol table is and show how the general concepts apply to Python. In the second part I will delve into the implementation of symbol tables
in the core of CPython.
So what is a symbol table?
As usual, it's hard to beat Wikipedia fora succinct definition:
In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is associated with information relating to its declaration or appearance in the source,
such as its type, scope level and sometimes its location.
Symbol tables are used by practically all compilers. They're especially important in statically typed languages where all variables have types and type checking by the compiler is an important part of the front-end.
Consider this C code:
int main() { int aa, bb; bb = *aa; { int* aa; bb = *aa; } return 0; }
There are two distinct assignments of the form bb = *aa here,
but only the second one is legal. The compiler throws the following error when it sees the first one:
error: invalid type argument of ‘unary *’ (have ‘int’)
How does the compiler know that the * operator
is given an argument of type int,
which is invalid for this operator? The answer is: the symbol table. The compiler sees *aa and
asks itself what the type of aa is.
To answer this question it consults the symbol table it constructed earlier. The symbol table contains the declared types for all variables the compiler encountered in the code.
This example demonstrates another important concept - for most languages a single symbol table with information about all variables won't do. The second assignment is valid, because in the
internal scope created by the curly bracesaa is
redefined to be of a pointer type. Thus, to correctly compile such code the C compiler has to keep a separate symbol table per scope [2].
A digression: "variables" in Python
So far I've been using the term "variable" liberally. Just to be on the safe side, let's clarify what is meant by variable in Python. Formally, Python doesn't really have variables in the senseC has. Rather, Python has symbolic names bound to objects:
aa = [1, 2, 3] bb = aa aa[0] = 666
In this code, aa is
a name bound to a list object. bb is
a name bound to the same object. The third line modifies the list through aa,
and if we print out bb we'll
see the modified list as well.
Now, once this is understood, I will still use the term "variable" from time to time since it's occasionally convenient and everybody's used to it anyway.
Symbol tables for Python code
Alright, so symbol tables are very useful for type checking. But Python doesn't have compile-time type checking (duck typing FTW!), so what does CPython need symbol tables for?The CPython compiler still has to resolve what kinds of variables are used in the code. Variables in Python can be local, global or even bound by a lexically enclosing scope. For example:
def outer(aa): def inner(): bb = 1 return aa + bb + cc return inner
The function inner uses
three variables: aa, bb and cc.
They're all different from Python's point of view: aa is
lexically bound in outer, bb is
locally bound in inner itself,
and cc is not bound anywhere
in sight, so it's treated as global. The bytecode generated for inner shows
clearly the different treatment of these variables:
5 0 LOAD_CONST 1 (1) 3 STORE_FAST 0 (bb) 6 6 LOAD_DEREF 0 (aa) 9 LOAD_FAST 0 (bb) 12 BINARY_ADD 13 LOAD_GLOBAL 0 (cc) 16 BINARY_ADD 17 RETURN_VALUE
As you can see, different opcodes are used for loading the variables onto the stack prior to applying BINARY_ADD.LOAD_DEREF is
used for aa, LOAD_FAST for bb and LOAD_GLOBAL for cc.
At this point, there are three different directions we can pursue on our path to deeper understanding of Python:
Figure out the exact semantics of variables in Python - when are they local, when are they global and what exactly makes them lexically bound.
Understand how the CPython compiler knows the difference.
Learn about the different bytecode opcodes for these variables and how they affect the way the VM runs code.
I won't even try going into (1) since it's a broad topic completely out of the scope of this article. There are plenty of resources online - start with the official and
continue Googling until you're fully enlightened. (3) is also out of scope as I'm currently focusing on the front-end of CPython. If you're interested, there's an excellent series of in-depth articles on Python focusing on the back-end, with a
nice treatment of this very issue.
To answer (2) we need to understand how CPython uses symbol tables, which is what this series of articles is about.
Where symbol tables fit in
A high-level view of the front-end of CPython is:Parse source code into a parse tree
Transform parse tree into an Abstract Syntax Tree
Transform AST into a Control Flow Graph
Emit bytecode based on the Control Flow Graph
Symbol tables are created in step 3. The compiler builds a symbol table from the AST representing the Python source code. This symbol table, in conjunction with the AST is then used to generate the control flow
graph (CFG) and ultimately the bytecode.
Exploring the symbol table
CPython does a great job exposing some of its internals via standard-library modules [3].Symbol tables is yet another internal data structure that can be explored from the outside in pure Python code, with the help of the symtable module.
From its description:
Symbol tables are generated by the compiler from AST just before bytecode is generated. The symbol table is responsible for calculating the scope of every identifier in the code. symtable provides an interface to examine these tables.
The symtable module
provides a lot of information on the various identifiers encountered in Python code. Apart from telling their scope, it allows us to find out which variables are referenced in their scope, assigned in their scope, define new namespaces (like functions) and
so on. To help with exploring the symbol table I've written the following function that simplifies working with the module:
def describe_symbol(sym): assert type(sym) == symtable.Symbol print("Symbol:", sym.get_name()) for prop in [ 'referenced', 'imported', 'parameter', 'global', 'declared_global', 'local', 'free', 'assigned', 'namespace']: if getattr(sym, 'is_' + prop)(): print(' is', prop)
Let's see what it has to say about the inner function
from the example above:
Symbol: aa is referenced is free Symbol: cc is referenced is global Symbol: bb is referenced is local is assigned
Indeed, we see that the symbol table marks aa as
lexically bound, or "free" (more on this in the next section), bb as
local and cc as global. It
also tells us that all these variables are referenced in the scope of inner and
that bb is assigned in that
scope.
The symbol table contains other useful information as well. For example, it can help distinguish between explicitly declared globals and implicit globals:
def outer(): global gg return ff + gg
In this code both ff and gg are
global to outer, but only gg was
explicitly declared global.
The symbol table knows this - the output of describe_symbol for
this function is:
Symbol: gg is referenced is global is declared_global Symbol: ff is referenced is global
Free variables
Unfortunately, there's a shorthand in the core of Python that may initially confuse readers as to exactly what constitutes a "free" variable. Fortunately, it's a very slight confusion that's easy to put in order.The execution model reference says:
If a variable is used in a code block but not defined there, it is a free variable.
This is consistent with the formal
definition. In the source, however, "free" is actually used as a shorthand for "lexically bound free variable" (i.e. variables for which a binding has been found in an enclosing scope), with "global" being used to refer to all remaining free variables.
So when reading the CPython source code it is important to remember that the full set of free variables includes both the variables tagged specifically as "free", as well as those tagged as "global".
Thus, to avoid a confusion I say "lexically bound" when I want to refer to the variables actually treated in CPython as free.
Catching errors
Although Python is duck-typed, some things can still be enforced at compile-time. The symbol table is a powerful tool allowing the compiler to catch some errors [4].For example, it's not allowed to declare function parameters as global:
def outer(aa): global aa
When compiling this function, the error is caught while constructing the symbol table:
Traceback (most recent call last): File "symtab_1.py", line 33, in <module> table = symtable.symtable(code, '<string>', 'exec') File "symtable.py", line 13, in symtable raw = _symtable.symtable(code, filename, compile_type) File "<string>", line 2 SyntaxError: name 'aa' is parameter and global
The symbol table is useful here since it knows that aa is
a parameter in the scope of outer and
when global aa is
encountered it's a sure sign of an error.
Other errors are handled by the symbol table: duplicate argument names in functions, using import * inside
functions, returning values inside generators, and a few more.
Conclusion
This article serves mainly as an introduction for the next one, where I plan to explore the actual implementation of symbol tables in the core of CPython. A symbol table is a tool designed to solve some problemsfor the compiler, and I hope this article did a fair job describing a few of these problems and related terminology.
Special thanks to Nick Coghlan for reviewing this article.
[1] | You'll note that in this article I'm using the terms Python and CPython interchangeably. They're not the same - by Python I mean the language (version 3.x) and by CPython I mean the official C implementation of the compiler + VM. There are several implementations of the Python language in existence, and while they all implement the same specification they may do it differently. |
[2] | It's even more complex than that, but we're not here to talk about C compilers. In the next article I will explain exactly the structure of symbol tables used by CPython. |
[3] | I've previously discussed how to use the ast module to tap into the compilation process of CPython. |
[4] | Actually, if you grep the CPython source, you'll find out that a good proportion of SyntaxError exceptions thrown by the compiler are from Python/symtable.c. |
Comments
© 2003-2015 Eli BenderskyBack
to top
相关文章推荐
- Python通过正则表达式选取callback的方法
- python RSA签名和解签
- python 实现DES加密 ECB模式
- python Tkinter图形用户界面组件(鼠标、键盘事件)
- Python字典key值查询效率低的问题
- Python的Django框架中URLconf相关的一些技巧整理
- Python中的编码与解码(转)
- 同时安装使用Python 2.X和 Python 3.X
- python 多行注释
- 数据挖掘python,java
- 数据挖掘python,java
- Apriori
- python编写的自动获取代理IP列表的爬虫-chinaboywg-ChinaUnix博客
- Python第一个基本教程4章 词典: 当指数不工作时也
- Python文件读取切割
- 利用Dnspod api批量更新添加DNS解析【python脚本】 - 推酷
- python递归删除目录文件
- python:任意输入3个数,判断能否组成三角形
- Python3.4 tkinter GUI编程示例
- python深拷贝与浅拷贝