您的位置：首页 > 编程语言 > Python开发

Python Symbol Table

2015-07-18 15:16 495 查看

点击打开链接

Eli
Bendersky's website

About
Code
Archives

Python internals: Symbol
tables, part 1

September 18, 2010 at 08:03 Tags Articles , Python
internals

Introduction

This article is the first in a short series in which I intend to explain how CPython [1] implements
and uses symbol tables in its quest to compile Python source code into bytecode. In this part I will explain what a symbol table is and show how the general concepts apply to Python. In the second part I will delve into the implementation of symbol tables
in the core of CPython.

So what is a symbol table?

As usual, it's hard to beat Wikipedia for
a succinct definition:

In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is associated with information relating to its declaration or appearance in the source,
such as its type, scope level and sometimes its location.
Symbol tables are used by practically all compilers. They're especially important in statically typed languages where all variables have types and type checking by the compiler is an important part of the front-end.
Consider this C code:

int main()
{
int aa, bb;

bb = *aa;

{
int* aa;
bb = *aa;
}

return 0;
}

There are two distinct assignments of the form bb = *aa here,
but only the second one is legal. The compiler throws the following error when it sees the first one:

error: invalid type argument of ‘unary *’ (have ‘int’)

How does the compiler know that the * operator
is given an argument of type int,
which is invalid for this operator? The answer is: the symbol table. The compiler sees *aa and
asks itself what the type of aa is.
To answer this question it consults the symbol table it constructed earlier. The symbol table contains the declared types for all variables the compiler encountered in the code.
This example demonstrates another important concept - for most languages a single symbol table with information about all variables won't do. The second assignment is valid, because in the
internal scope created by the curly bracesaa is
redefined to be of a pointer type. Thus, to correctly compile such code the C compiler has to keep a separate symbol table per scope [2].

A digression: "variables" in Python

So far I've been using the term "variable" liberally. Just to be on the safe side, let's clarify what is meant by variable in Python. Formally, Python doesn't really have variables in the sense
C has. Rather, Python has symbolic names bound to objects:

aa = [1, 2, 3]
bb = aa
aa[0] = 666

In this code, aa is
a name bound to a list object. bb is
a name bound to the same object. The third line modifies the list through aa,
and if we print out bb we'll
see the modified list as well.
Now, once this is understood, I will still use the term "variable" from time to time since it's occasionally convenient and everybody's used to it anyway.

Symbol tables for Python code

Alright, so symbol tables are very useful for type checking. But Python doesn't have compile-time type checking (duck typing FTW!), so what does CPython need symbol tables for?
The CPython compiler still has to resolve what kinds of variables are used in the code. Variables in Python can be local, global or even bound by a lexically enclosing scope. For example:

def outer(aa):
def inner():
bb = 1
return aa + bb + cc
return inner

The function inner uses
three variables: aa, bb and cc.
They're all different from Python's point of view: aa is
lexically bound in outer, bb is
locally bound in inner itself,
and cc is not bound anywhere
in sight, so it's treated as global. The bytecode generated for inner shows
clearly the different treatment of these variables:

5           0 LOAD_CONST               1 (1)
3 STORE_FAST               0 (bb)

6           6 LOAD_DEREF               0 (aa)
9 LOAD_FAST                0 (bb)
12 BINARY_ADD
13 LOAD_GLOBAL              0 (cc)
16 BINARY_ADD
17 RETURN_VALUE

As you can see, different opcodes are used for loading the variables onto the stack prior to applying BINARY_ADD.LOAD_DEREF is
used for aa, LOAD_FAST for bb and LOAD_GLOBAL for cc.
At this point, there are three different directions we can pursue on our path to deeper understanding of Python:

Figure out the exact semantics of variables in Python - when are they local, when are they global and what exactly makes them lexically bound.
Understand how the CPython compiler knows the difference.
Learn about the different bytecode opcodes for these variables and how they affect the way the VM runs code.

I won't even try going into (1) since it's a broad topic completely out of the scope of this article. There are plenty of resources online - start with the official and
continue Googling until you're fully enlightened. (3) is also out of scope as I'm currently focusing on the front-end of CPython. If you're interested, there's an excellent series of in-depth articles on Python focusing on the back-end, with a
nice treatment of this very issue.
To answer (2) we need to understand how CPython uses symbol tables, which is what this series of articles is about.

Where symbol tables fit in

A high-level view of the front-end of CPython is:

Parse source code into a parse tree
Transform parse tree into an Abstract Syntax Tree
Transform AST into a Control Flow Graph
Emit bytecode based on the Control Flow Graph

Symbol tables are created in step 3. The compiler builds a symbol table from the AST representing the Python source code. This symbol table, in conjunction with the AST is then used to generate the control flow
graph (CFG) and ultimately the bytecode.

Exploring the symbol table

CPython does a great job exposing some of its internals via standard-library modules [3].
Symbol tables is yet another internal data structure that can be explored from the outside in pure Python code, with the help of the symtable module.
From its description:

Symbol tables are generated by the compiler from AST just before bytecode is generated. The symbol table is responsible for calculating the scope of every identifier in the code. symtable provides an interface to examine these tables.
The symtable module
provides a lot of information on the various identifiers encountered in Python code. Apart from telling their scope, it allows us to find out which variables are referenced in their scope, assigned in their scope, define new namespaces (like functions) and
so on. To help with exploring the symbol table I've written the following function that simplifies working with the module:

def describe_symbol(sym):
assert type(sym) == symtable.Symbol
print("Symbol:", sym.get_name())

for prop in [
'referenced', 'imported', 'parameter',
'global', 'declared_global', 'local',
'free', 'assigned', 'namespace']:
if getattr(sym, 'is_' + prop)():
print('    is', prop)

Let's see what it has to say about the inner function
from the example above:

Symbol: aa
is referenced
is free
Symbol: cc
is referenced
is global
Symbol: bb
is referenced
is local
is assigned

Indeed, we see that the symbol table marks aa as
lexically bound, or "free" (more on this in the next section), bb as
local and cc as global. It
also tells us that all these variables are referenced in the scope of inner and
that bb is assigned in that
scope.
The symbol table contains other useful information as well. For example, it can help distinguish between explicitly declared globals and implicit globals:

def outer():
global gg
return ff + gg

In this code both ff and gg are
global to outer, but only gg was
explicitly declared global.
The symbol table knows this - the output of describe_symbol for
this function is:

Symbol: gg
is referenced
is global
is declared_global
Symbol: ff
is referenced
is global

Free variables

Unfortunately, there's a shorthand in the core of Python that may initially confuse readers as to exactly what constitutes a "free" variable. Fortunately, it's a very slight confusion that's easy to put in order.
The execution model reference says:

If a variable is used in a code block but not defined there, it is a free variable.
This is consistent with the formal
definition. In the source, however, "free" is actually used as a shorthand for "lexically bound free variable" (i.e. variables for which a binding has been found in an enclosing scope), with "global" being used to refer to all remaining free variables.
So when reading the CPython source code it is important to remember that the full set of free variables includes both the variables tagged specifically as "free", as well as those tagged as "global".
Thus, to avoid a confusion I say "lexically bound" when I want to refer to the variables actually treated in CPython as free.

Catching errors

Although Python is duck-typed, some things can still be enforced at compile-time. The symbol table is a powerful tool allowing the compiler to catch some errors [4].
For example, it's not allowed to declare function parameters as global:

def outer(aa):
global aa

When compiling this function, the error is caught while constructing the symbol table:

Traceback (most recent call last):
File "symtab_1.py", line 33, in <module>
table = symtable.symtable(code, '<string>', 'exec')
File "symtable.py", line 13, in symtable
raw = _symtable.symtable(code, filename, compile_type)
File "<string>", line 2
SyntaxError: name 'aa' is parameter and global

The symbol table is useful here since it knows that aa is
a parameter in the scope of outer and
when global aa is
encountered it's a sure sign of an error.
Other errors are handled by the symbol table: duplicate argument names in functions, using import * inside
functions, returning values inside generators, and a few more.

Conclusion

This article serves mainly as an introduction for the next one, where I plan to explore the actual implementation of symbol tables in the core of CPython. A symbol table is a tool designed to solve some problems
for the compiler, and I hope this article did a fair job describing a few of these problems and related terminology.

Special thanks to Nick Coghlan for reviewing this article.

[1]

You'll note that in this article I'm using the terms Python and CPython interchangeably. They're not the same - by Python I mean the language (version 3.x) and by CPython I mean the official C implementation of the compiler
+ VM. There are several implementations of the Python language in existence, and while they all implement the same specification they may do it differently.

[2]	It's even more complex than that, but we're not here to talk about C compilers. In the next article I will explain exactly the structure of symbol tables used by CPython.

[3]	I've previously discussed how to use the ast module to tap into the compilation process of CPython.

[4]	Actually, if you grep the CPython source, you'll find out that a good proportion of SyntaxError exceptions thrown by the compiler are from Python/symtable.c.

Comments

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航