
GCC-3.4.6 Source Code Study Notes (77)


5.6. Preparing the Parser

We are now about to parse the source file, and what the parser consumes must be C++ tokens. The component that supplies these tokens is called the lexer. It is worth noting that GCC has no separate preprocessing pass: a function such as cpp_get_token directly returns tokens that have already been preprocessed, so this function is itself an important part of the lexer. As the first step of parsing the source file, the parser and its attendant lexer must be prepared.
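For example (a made-up two-line fragment, not taken from GCC's sources), by the time the parser looks at the declaration below, the macro has already been expanded inside cpp_get_token; the parser is handed the tokens int a [ 4 ] ; and never sees the name N at all.

#define N 4
int a[N];   /* the parser receives:  int  a  [  4  ]  ;  */

Parsing of the whole translation unit is driven by c_parse_file: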

15112 void
15113 c_parse_file (void) in parser.c
15114 {
15115 bool error_occurred;
15116
15117 the_parser = cp_parser_new ();
15118 push_deferring_access_checks (flag_access_control
15119 ? dk_no_deferred : dk_no_check);
15120 error_occurred = cp_parser_translation_unit (the_parser);
15121 the_parser = NULL;
15122 }

The data structure of the C++ parser, cp_parser, is defined as follows. Note that objects of this type are managed by GCC's garbage collector, and a new parser object is created for every translation unit.

1170 typedef struct cp_parser GTY(()) in parser.c
1171 {
1172 /* The lexer from which we are obtaining tokens. */
1173 cp_lexer *lexer;
1174
1175 /* The scope in which names should be looked up. If NULL_TREE, then
1176 we look up names in the scope that is currently open in the
1177 source program. If non-NULL, this is either a TYPE or
1178 NAMESPACE_DECL for the scope in which we should look.
1179
1180 This value is not cleared automatically after a name is looked
1181 up, so we must be careful to clear it before starting a new look
1182 up sequence. (If it is not cleared, then `X::Y' followed by `Z'
1183 will look up `Z' in the scope of `X', rather than the current
1184 scope.) Unfortunately, it is difficult to tell when name lookup
1185 is complete, because we sometimes peek at a token, look it up,
1186 and then decide not to consume it. */
1187 tree scope;
1188
1189 /* OBJECT_SCOPE and QUALIFYING_SCOPE give the scopes in which the
1190 last lookup took place. OBJECT_SCOPE is used if an expression
1191 like "x->y" or "x.y" was used; it gives the type of "*x" or "x",
1192 respectively. QUALIFYING_SCOPE is used for an expression of the
1193 form "X::Y"; it refers to X. */
1194 tree object_scope;
1195 tree qualifying_scope;
1196
1197 /* A stack of parsing contexts. All but the bottom entry on the
1198 stack will be tentative contexts.
1199
1200 We parse tentatively in order to determine which construct is in
1201 use in some situations. For example, in order to determine
1202 whether a statement is an expression-statement or a
1203 declaration-statement we parse it tentatively as a
1204 declaration-statement. If that fails, we then reparse the same
1205 token stream as an expression-statement. */
1206 cp_parser_context *context;
1207
1208 /* True if we are parsing GNU C++. If this flag is not set, then
1209 GNU extensions are not recognized. */
1210 bool allow_gnu_extensions_p;
1211
1212 /* TRUE if the `>' token should be interpreted as the greater-than
1213 operator. FALSE if it is the end of a template-id or
1214 template-parameter-list. */
1215 bool greater_than_is_operator_p;
1216
1217 /* TRUE if default arguments are allowed within a parameter list
1218 that starts at this point. FALSE if only a gnu extension makes
1219 them permissible. */
1220 bool default_arg_ok_p;
1221
1222 /* TRUE if we are parsing an integral constant-expression. See
1223 [expr.const] for a precise definition. */
1224 bool integral_constant_expression_p;
1225
1226 /* TRUE if we are parsing an integral constant-expression -- but a
1227 non-constant expression should be permitted as well. This flag
1228 is used when parsing an array bound so that GNU variable-length
1229 arrays are tolerated. */
1230 bool allow_non_integral_constant_expression_p;
1231
1232 /* TRUE if ALLOW_NON_CONSTANT_EXPRESSION_P is TRUE and something has
1233 been seen that makes the expression non-constant. */
1234 bool non_integral_constant_expression_p;
1235
1236 /* TRUE if we are parsing the argument to "__offsetof__". */
1237 bool in_offsetof_p;
1238
1239 /* TRUE if local variable names and `this' are forbidden in the
1240 current context. */
1241 bool local_variables_forbidden_p;
1242
1243 /* TRUE if the declaration we are parsing is part of a
1244 linkage-specification of the form `extern string-literal
1245 declaration'. */
1246 bool in_unbraced_linkage_specification_p;
1247
1248 /* TRUE if we are presently parsing a declarator, after the
1249 direct-declarator. */
1250 bool in_declarator_p;
1251
1252 /* TRUE if we are presently parsing a template-argument-list. */
1253 bool in_template_argument_list_p;
1254
1255 /* TRUE if we are presently parsing the body of an
1256 iteration-statement. */
1257 bool in_iteration_statement_p;
1258
1259 /* TRUE if we are presently parsing the body of a switch
1260 statement. */
1261 bool in_switch_statement_p;
1262
1263 /* TRUE if we are parsing a type-id in an expression context. In
1264 such a situation, both "type (expr)" and "type (type)" are valid
1265 alternatives. */
1266 bool in_type_id_in_expr_p;
1267
1268 /* If non-NULL, then we are parsing a construct where new type
1269 definitions are not permitted. The string stored here will be
1270 issued as an error message if a type is defined. */
1271 const char *type_definition_forbidden_message;
1272
1273 /* A list of lists. The outer list is a stack, used for member
1274 functions of local classes. At each level there are two sub-list,
1275 one on TREE_VALUE and one on TREE_PURPOSE. Each of those
1276 sub-lists has a FUNCTION_DECL or TEMPLATE_DECL on their
1277 TREE_VALUE's. The functions are chained in reverse declaration
1278 order.
1279
1280 The TREE_PURPOSE sublist contains those functions with default
1281 arguments that need post processing, and the TREE_VALUE sublist
1282 contains those functions with definitions that need post
1283 processing.
1284
1285 These lists can only be processed once the outermost class being
1286 defined is complete. */
1287 tree unparsed_functions_queues;
1288
1289 /* The number of classes whose definitions are currently in
1290 progress. */
1291 unsigned num_classes_being_defined;
1292
1293 /* The number of template parameter lists that apply directly to the
1294 current declaration. */
1295 unsigned num_template_parameter_lists;
1296 } cp_parser;
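
The context stack described at lines 1197-1206 is what tentative parsing relies on. As an illustration (ordinary C++ source, not GCC code), the two statements below have exactly the same token shape, yet one is a declaration-statement and the other an expression-statement; the parser first tries the declaration interpretation and, if that fails, re-parses the same tokens as an expression.

struct A {};
int a = 1, b = 2;

int main ()
{
  A * p = 0;   // `A' names a type, so the tentative parse as a
               //   declaration-statement succeeds: `p' is a pointer to A
  a * b;       // `a' names a variable, so the tentative parse as a
               //   declaration fails; the tokens are re-parsed as an
               //   expression-statement and the product is discarded
  return p == 0 ? 0 : 1;
}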

The function cp_parser_new, which creates the cp_parser instance, is defined as follows:

2230 static cp_parser *
2231 cp_parser_new (void) in parser.c
2232 {
2233 cp_parser *parser;
2234 cp_lexer *lexer;
2235
2236 /* cp_lexer_new_main is called before calling ggc_alloc because
2237 cp_lexer_new_main might load a PCH file. */
2238 lexer = cp_lexer_new_main ();

The lexer created by cp_lexer_new_main has the following definition; it too is a GC-managed type. Note that in this definition every pointer member carries a GTY annotation (the buffer is traced through its length option, while the other token pointers are marked skip so the collector does not walk them), except for the next field at line 212, which has no marker at all. The presence of this next field reveals that, unlike the parser, of which there is exactly one per translation unit, additional lexers can be created temporarily and chained together.

166 typedef struct cp_lexer GTY (()) in parser.c
167 {
168 /* The memory allocated for the buffer. Never NULL. */
169 cp_token * GTY ((length ("(%h.buffer_end - %h.buffer)"))) buffer;
170 /* A pointer just past the end of the memory allocated for the buffer. */
171 cp_token * GTY ((skip (""))) buffer_end;
172 /* The first valid token in the buffer, or NULL if none. */
173 cp_token * GTY ((skip (""))) first_token;
174 /* The next available token. If NEXT_TOKEN is NULL, then there are
175 no more available tokens. */
176 cp_token * GTY ((skip (""))) next_token;
177 /* A pointer just past the last available token. If FIRST_TOKEN is
178 NULL, however, there are no available tokens, and then this
179 location is simply the place in which the next token read will be
180 placed. If LAST_TOKEN == FIRST_TOKEN, then the buffer is full.
181 When the LAST_TOKEN == BUFFER, then the last token is at the
182 highest memory address in the BUFFER. */
183 cp_token * GTY ((skip (""))) last_token;
184
185 /* A stack indicating positions at which cp_lexer_save_tokens was
186 called. The top entry is the most recent position at which we
187 began saving tokens. The entries are differences in token
188 position between FIRST_TOKEN and the first saved token.
189
190 If the stack is non-empty, we are saving tokens. When a token is
191 consumed, the NEXT_TOKEN pointer will move, but the FIRST_TOKEN
192 pointer will not. The token stream will be preserved so that it
193 can be reexamined later.
194
195 If the stack is empty, then we are not saving tokens. Whenever a
196 token is consumed, the FIRST_TOKEN pointer will be moved, and the
197 consumed token will be gone forever. */
198 varray_type saved_tokens;
199
200 /* The STRING_CST tokens encountered while processing the current
201 string literal. */
202 varray_type string_tokens;
203
204 /* True if we should obtain more tokens from the preprocessor; false
205 if we are processing a saved token cache. */
206 bool main_lexer_p;
207
208 /* True if we should output debugging information. */
209 bool debugging_p;
210
211 /* The next lexer in a linked list of lexers. */
212 struct cp_lexer *next;
213 } cp_lexer;
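
The buffer, first_token, next_token and last_token fields implement a circular token buffer. The standalone sketch below (illustrative only, not GCC code; names and the buffer size are invented) mimics the arrangement: tokens are appended at last_token, handed out at next_token, and both pointers wrap around at buffer_end. The real lexer also grows the buffer when it fills up and keeps first_token pinned while tokens are being saved; both refinements are omitted here.

#include <cstdio>

struct toy_token { int kind; };

const int  BUFFER_SIZE = 4;
toy_token  buffer[BUFFER_SIZE];
toy_token *buffer_end  = buffer + BUFFER_SIZE;
toy_token *first_token = buffer;   // oldest token still kept in the buffer
toy_token *next_token  = buffer;   // next token to hand to the parser
toy_token *last_token  = buffer;   // one past the newest token

// Append a token, wrapping around at the end of the buffer.
void add_token (toy_token t)
{
  *last_token = t;
  if (++last_token == buffer_end)
    last_token = buffer;
}

// Consume a token; since nothing is being saved, FIRST_TOKEN moves as well.
toy_token consume_token ()
{
  toy_token t = *next_token;
  if (++next_token == buffer_end)
    next_token = buffer;
  first_token = next_token;
  return t;
}

int main ()
{
  add_token (toy_token{1});
  add_token (toy_token{2});
  printf ("%d %d\n", consume_token ().kind, consume_token ().kind);   // prints "1 2"
  return 0;
}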

In earlier chapters we saw that tokens are represented by the type cpp_token; however, that type was designed for the preprocessor. After preprocessing, macros, assertions, #include directives and the other preprocessing constructs no longer exist, so cpp_token is no longer appropriate; post-preprocessing tokens are instead represented by the cp_token shown below.

69 typedef struct cp_token GTY (()) in parser.c
70 {
71 /* The kind of token. */
72 ENUM_BITFIELD (cpp_ttype) type : 8;
73 /* If this token is a keyword, this value indicates which keyword.
74 Otherwise, this value is RID_MAX. */
75 ENUM_BITFIELD (rid) keyword : 8;
76 /* Token flags. */
77 unsigned char flags;
78 /* The value associated with this token, if any. */
79 tree value;
80 /* The location at which this token was found. */
81 location_t location;
82 } cp_token;

Comparing the two, their definitions are quite similar.

5.6.1. Creating the Main Lexer

Every translation unit has one main lexer accompanying the parser. This main lexer is created by the following function.

301 static cp_lexer *
302 cp_lexer_new_main (void) in parser.c
303 {
304 cp_lexer *lexer;
305 cp_token first_token;
306
307 /* It's possible that lexing the first token will load a PCH file,
308 which is a GC collection point. So we have to grab the first
309 token before allocating any memory. */
310 cp_lexer_get_preprocessor_token (NULL, &first_token);
311 c_common_no_more_pch ();
312
313 /* Allocate the memory. */
314 lexer = ggc_alloc_cleared (sizeof (cp_lexer));
315
316 /* Create the circular buffer. */
317 lexer->buffer = ggc_calloc (CP_TOKEN_BUFFER_SIZE, sizeof (cp_token));
318 lexer->buffer_end = lexer->buffer + CP_TOKEN_BUFFER_SIZE;
319
320 /* There is one token in the buffer. */
321 lexer->last_token = lexer->buffer + 1;
322 lexer->first_token = lexer->buffer;
323 lexer->next_token = lexer->buffer;
324 memcpy (lexer->buffer, &first_token, sizeof (cp_token));
325
326 /* This lexer obtains more tokens by calling c_lex. */
327 lexer->main_lexer_p = true;
328
329 /* Create the SAVED_TOKENS stack. */
330 VARRAY_INT_INIT(lexer->saved_tokens, CP_SAVED_TOKENS_SIZE, "saved_tokens");
331
332 /* Create the STRINGS array. */
333 VARRAY_TREE_INIT (lexer->string_tokens, 32, "strings");
334
335 /* Assume we are not debugging. */
336 lexer->debugging_p = false;
337
338 return lexer;
339 }

Note that up to this point we have read in the main input file and any headers injected by -include, but we have not yet begun tokenizing the source file. The call to cp_lexer_get_preprocessor_token at line 310 above therefore triggers the lexing of the source file's first token. Under GCC's current implementation and requirements, each source file may use only one precompiled header, and that precompiled header must be the first file included. Consequently, if the current source file uses a precompiled header, this call will read it in (recall that when the #include directive is first seen, run_directive invokes the handler do_include; that handler calls _cpp_stack_include, which in turn calls c_common_read_pch to read in the PCH file). Inside ggc_pch_read, which c_common_read_pch calls, a garbage collection is triggered if the host operating system uses paged memory management.

580 static void
581 cp_lexer_get_preprocessor_token (cp_lexer *lexer ATTRIBUTE_UNUSED,       in parser.c
582                                  cp_token *token)
583 {
584   bool done;
585
586   /* If this not the main lexer, return a terminating CPP_EOF token. */
587   if (lexer != NULL && !lexer->main_lexer_p)
588     {
589       token->type = CPP_EOF;
590       token->location.line = 0;
591       token->location.file = NULL;
592       token->value = NULL_TREE;
593       token->keyword = RID_MAX;
594
595       return;
596     }
597
598   done = false;
599   /* Keep going until we get a token we like. */
600   while (!done)
601     {
602       /* Get a new token from the preprocessor. */
603       token->type = c_lex_with_flags (&token->value, &token->flags);
604       /* Issue messages about tokens we cannot process. */
605       switch (token->type)
606         {
607         case CPP_ATSIGN:
608         case CPP_HASH:
609         case CPP_PASTE:
610           error ("invalid token");
611           break;
612
613         default:
614           /* This is a good token, so we exit the loop. */
615           done = true;
616           break;
617         }
618     }
619   /* Now we've got our token. */
620   token->location = input_location;
621
622   /* Check to see if this token is a keyword. */
623   if (token->type == CPP_NAME
624       && C_IS_RESERVED_WORD (token->value))
625     {
626       /* Mark this token as a keyword. */
627       token->type = CPP_KEYWORD;
628       /* Record which keyword. */
629       token->keyword = C_RID_CODE (token->value);
630       /* Update the value. Some keywords are mapped to particular
631          entities, rather than simply having the value of the
632          corresponding IDENTIFIER_NODE. For example, `__const' is
633          mapped to `const'. */
634       token->value = ridpointers[token->keyword];
635     }
636   else
637     token->keyword = RID_MAX;
638 }

cp_lexer_get_preprocessor_token is the lexer's low-level routine; its job is to deliver post-preprocessing tokens to the lexer. Clearly '#', '##' and '@' (the latter, at line 607, comes from Objective-C) are not valid post-preprocessing tokens. Among the post-preprocessing tokens are identifiers and the various constants, but C++ reserves certain identifiers as keywords, and those must be recognized here (refer to the section on initializing keywords for C++).
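
The keyword check at lines 623-634 can be pictured with the following standalone sketch (a simplification, not GCC's mechanism: GCC marks reserved words on the identifier node when the keyword tables are initialized, so the real check is a flag test rather than a search).

#include <cstring>

enum toy_rid { RID_IF, RID_WHILE, RID_CONST, RID_MAX };

struct toy_keyword { const char *name; toy_rid code; };

// A tiny stand-in for the reserved-word table and C_RID_CODE.
const toy_keyword keywords[] = {
  { "if", RID_IF }, { "while", RID_WHILE },
  { "const", RID_CONST }, { "__const", RID_CONST },   // alternate spelling maps to the same keyword
};

toy_rid classify (const char *ident)
{
  for (const toy_keyword &k : keywords)
    if (!strcmp (ident, k.name))
      return k.code;             // a keyword: record which one
  return RID_MAX;                // an ordinary identifier
}

int main ()
{
  return (classify ("__const") == RID_CONST && classify ("foo") == RID_MAX) ? 0 : 1;
}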

5.6.1.1. Obtaining Post-Preprocessing Tokens

5.6.1.1.1. Identifiers

Post-preprocessing tokens are represented by cp_token. For a numeric token, the classification that cpplib computes (and that c_lex_with_flags consults below) is a combination of the following flag values.

619 #define CPP_N_CATEGORY 0x000F in cpplib.h
620 #define CPP_N_INVALID 0x0000
621 #define CPP_N_INTEGER 0x0001
622 #define CPP_N_FLOATING 0x0002
623
624 #define CPP_N_WIDTH 0x00F0
625 #define CPP_N_SMALL 0x0010 /* int, float. */
626 #define CPP_N_MEDIUM 0x0020 /* long, double. */
627 #define CPP_N_LARGE 0x0040 /* long long, long double. */
628
629 #define CPP_N_RADIX 0x0F00
630 #define CPP_N_DECIMAL 0x0100
631 #define CPP_N_HEX 0x0200
632 #define CPP_N_OCTAL 0x0400
633
634 #define CPP_N_UNSIGNED 0x1000 /* Properties. */
635 #define CPP_N_IMAGINARY 0x2000

These values fall into several groups (category, width, radix, and properties), which are OR-ed together into the flags word. For example, for the token 0x50u, flags is set to CPP_N_INTEGER, CPP_N_SMALL, CPP_N_HEX and CPP_N_UNSIGNED. The type, value and flags of a post-preprocessing token are all obtained by c_lex_with_flags.
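
As a minimal standalone sketch (plain C++ reusing the constant values quoted above, not GCC's own code), such a classification can be decoded like this:

#include <cstdio>

// The CPP_N_* values quoted above from cpplib.h.
const unsigned CPP_N_CATEGORY = 0x000F, CPP_N_INTEGER   = 0x0001, CPP_N_FLOATING = 0x0002;
const unsigned CPP_N_WIDTH    = 0x00F0, CPP_N_SMALL     = 0x0010, CPP_N_MEDIUM   = 0x0020, CPP_N_LARGE = 0x0040;
const unsigned CPP_N_RADIX    = 0x0F00, CPP_N_DECIMAL   = 0x0100, CPP_N_HEX      = 0x0200, CPP_N_OCTAL = 0x0400;
const unsigned CPP_N_UNSIGNED = 0x1000, CPP_N_IMAGINARY = 0x2000;

// Print a human-readable description of a classification word.
void describe (unsigned flags)
{
  printf ("%s, %s, %s%s%s\n",
          (flags & CPP_N_CATEGORY) == CPP_N_INTEGER ? "integer" : "floating",
          (flags & CPP_N_WIDTH) == CPP_N_SMALL ? "small"
            : (flags & CPP_N_WIDTH) == CPP_N_MEDIUM ? "medium" : "large",
          (flags & CPP_N_RADIX) == CPP_N_HEX ? "hex"
            : (flags & CPP_N_RADIX) == CPP_N_OCTAL ? "octal" : "decimal",
          (flags & CPP_N_UNSIGNED) ? ", unsigned" : "",
          (flags & CPP_N_IMAGINARY) ? ", imaginary" : "");
}

int main ()
{
  // What cpp_classify_number reports for the literal 0x50u.
  describe (CPP_N_INTEGER | CPP_N_SMALL | CPP_N_HEX | CPP_N_UNSIGNED);   // integer, small, hex, unsigned
  return 0;
}

c_lex_with_flags itself begins as follows: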

315 int
316 c_lex_with_flags (tree *value, unsigned char *cpp_flags) in c-lex.c
317 {
318 const cpp_token *tok;
319 location_t atloc;
320 static bool no_more_pch;
321
322 retry:
323 tok = get_nonpadding_token ();

The core of get_nonpadding_token is cpp_get_token. As we saw in earlier chapters, this function is where preprocessing actually happens: macro definitions are digested directly into cpp_macro objects, macro invocations are expanded in place through argument substitution (when needed), the other directives are taken care of by their respective handlers, and the preprocessing operators are evaluated.

302 static inline const cpp_token *
303 get_nonpadding_token (void) in c-lex.c
304 {
305 const cpp_token *tok;
306 timevar_push (TV_CPP);
307 do
308 tok = cpp_get_token (parse_in);
309 while (tok->type == CPP_PADDING);
310 timevar_pop (TV_CPP);
311
312 return tok;
313 }

Note that get_nonpadding_token still returns a cpp_token, not a cp_token.

c_lex_with_flags (continued)

325  retry_after_at:
326   switch (tok->type)
327     {
328     case CPP_NAME:
329       *value = HT_IDENT_TO_GCC_IDENT (HT_NODE (tok->val.node));
330       break;
331
332     case CPP_NUMBER:
333       {
334         unsigned int flags = cpp_classify_number (parse_in, tok);
335
336         switch (flags & CPP_N_CATEGORY)
337           {
338           case CPP_N_INVALID:
339             /* cpplib has issued an error. */
340             *value = error_mark_node;
341             break;
342
343           case CPP_N_INTEGER:
344             *value = interpret_integer (tok, flags);
345             break;
346
347           case CPP_N_FLOATING:
348             *value = interpret_float (tok, flags);
349             break;
350
351           default:
352             abort ();
353           }
354       }
355       break;

At line 328, a token of type CPP_NAME is an identifier; HT_IDENT_TO_GCC_IDENT converts the hash-table node associated with the token into the corresponding tree node.
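
HT_IDENT_TO_GCC_IDENT works because the preprocessor's hash node is embedded inside the larger identifier tree node, so the enclosing node can be recovered by subtracting a fixed offset. The standalone sketch below shows this container-of pattern with an invented layout (not GCC's actual structures).

#include <cstddef>
#include <cassert>

struct hash_node { unsigned hash; };      // stand-in for the preprocessor's hash-table node

struct identifier_node                    // stand-in for the front end's identifier tree node
{
  int       tree_bits;                    // stand-in for the leading tree_common part
  hash_node node;                         // the embedded hash node
};

// Recover the enclosing identifier_node from a pointer to its embedded hash_node.
identifier_node *to_identifier (hash_node *h)
{
  return (identifier_node *) ((char *) h - offsetof (identifier_node, node));
}

int main ()
{
  identifier_node id = { 42, { 7u } };
  assert (to_identifier (&id.node) == &id);
  return 0;
}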