
NLP Study Notes

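2014-06-30 15:11
I have only just started with NLP and am not yet familiar with many of its concepts. The short summer term has just begun, so I spent this afternoon in the library going over some basic concepts. My study notes follow.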

1. The Kleene star operation

Given a set $V$, define

$V^0 = \{\varepsilon\}$ (the language consisting only of the empty string), $V^1 = V$, and define recursively the set

$V^{i+1} = \{ wv : w \in V^i \text{ and } v \in V \}$ for each $i > 0$.

If $V$ is a formal language, then $V^i$, the $i$-th power of the set $V$, is a shorthand for the concatenation of the set $V$ with itself $i$ times. That is, $V^i$ can be understood to be the set of all strings that can be represented as the concatenation of $i$ strings in $V$.

The definition of Kleene star on $V$ is [2]

$$V^* = \bigcup_{i \ge 0} V^i = V^0 \cup V^1 \cup V^2 \cup \cdots$$
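The recursive definition of the powers of V translates almost directly into code. The following is a minimal sketch, not part of the original notes: the function names `power` and `kleene_star` are made up for illustration, the empty Python string stands for ε, and since the full Kleene star is infinite whenever V contains a non-empty string, only the union up to a chosen power is computed.

```python
def power(V, i):
    """Return V^i: all concatenations of exactly i strings from V."""
    if i == 0:
        return {""}  # V^0 = {ε}; the empty string stands for ε
    # V^i = { wv : w in V^(i-1) and v in V }
    return {w + v for w in power(V, i - 1) for v in V}


def kleene_star(V, max_power):
    """Finite approximation of V*: the union of V^0 .. V^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(V, i)
    return result


V = {"a", "b"}
print(sorted(power(V, 2)))  # ['aa', 'ab', 'ba', 'bb']
print(sorted(kleene_star(V, 2), key=lambda s: (len(s), s)))
```

The second call lists the empty string first, then the strings of length 1 and 2.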



2. The Kleene plus operation

In some formal language studies (e.g. AFL theory), a variation on the Kleene star operation called the Kleene plus is used. The Kleene plus omits the $V^0$ term in the above union. In other words, the Kleene plus on $V$ is

$$V^+ = \bigcup_{i \ge 1} V^i = V^1 \cup V^2 \cup V^3 \cup \cdots$$
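As a small worked example (the set V = {a, b} is chosen here purely for illustration), the two operations differ only in the empty string:

```latex
\begin{align*}
V   &= \{a, b\} \\
V^+ &= V^1 \cup V^2 \cup \cdots = \{a, b, aa, ab, ba, bb, \dots\} \\
V^* &= V^0 \cup V^+ = \{\varepsilon\} \cup V^+
\end{align*}
```

Note that $\varepsilon \in V^+$ only when $\varepsilon$ is already a member of $V$.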



3. Production rule

A grammar is defined by production rules (or just "productions") that specify which symbols may replace which other symbols; these rules may be used to generate strings, or to parse them. Each such rule has a head, or left-hand side, which consists of the string that may be replaced, and a body, or right-hand side, which consists of a string that may replace it. Rules are often written in the form head → body; e.g., the rule z0 → z1 specifies that z0 can be replaced by z1.

In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s, [1][2] a grammar $G$ consists of the following components:

(1) A finite set $N$ of nonterminal symbols.
(2) A finite set $\Sigma$ of terminal symbols that is disjoint from $N$.
(3) A finite set $P$ of production rules, each rule of the form

$$(\Sigma \cup N)^* N (\Sigma \cup N)^* \to (\Sigma \cup N)^*$$

where $*$ is the Kleene star operator and $\cup$ denotes set union, so $(\Sigma \cup N)^*$ represents zero or more symbols, and $N$ means one nonterminal symbol. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. In the case that the body consists solely of the empty string (i.e., it contains no symbols at all), it may be denoted with a special notation (often $\Lambda$, $e$ or $\varepsilon$) in order to avoid confusion.
(4) A distinguished symbol $S \in N$ that is the start symbol.

A grammar is formally defined as the ordered quadruple $(N, \Sigma, P, S)$. Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature. [3][4]
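The constraint on rule shape (the head must contain at least one nonterminal, the body may be any string of symbols) can be checked mechanically. The sketch below is only an illustration: the helper `is_valid_rule`, the tuple-of-symbols encoding, and the small example grammar are assumptions made here, not something taken from the notes.

```python
def is_valid_rule(head, body, nonterminals, terminals):
    """Check a rule head -> body of the form (Σ ∪ N)* N (Σ ∪ N)* -> (Σ ∪ N)*."""
    symbols = nonterminals | terminals
    all_known = all(s in symbols for s in head + body)
    head_has_nonterminal = any(s in nonterminals for s in head)
    return all_known and head_has_nonterminal


N = {"S", "B"}
SIGMA = {"a", "b", "c"}
# Rules are (head, body) pairs of symbol tuples; () is the empty string.
P = [
    (("S",), ("a", "B", "S", "c")),
    (("S",), ("a", "b", "c")),
    (("B", "a"), ("a", "B")),
    (("B", "b"), ("b", "b")),
]

assert N.isdisjoint(SIGMA)  # terminals and nonterminals must not overlap
print(all(is_valid_rule(h, b, N, SIGMA) for h, b in P))  # True
print(is_valid_rule(("a",), ("b",), N, SIGMA))  # False: no nonterminal in the head
```

Heads such as `("B", "a")` are longer than one symbol, which is allowed in this general setting.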

4. Terminal symbols and nonterminal symbols

Terminal symbols are literal symbols which may appear in the inputs to or outputs from the production rules of a formal grammar and which cannot be changed using the rules of the grammar.

Nonterminal symbols are those symbols which can be replaced. They may also be called simply syntactic variables.

5. Context-free grammar
    Definition:

In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form

$V \to w$

where $V$ is a single nonterminal symbol, and $w$ is a string of terminals and/or nonterminals ($w$ can be empty). A formal grammar is considered "context free" when its production rules can be applied regardless of the context of a nonterminal. No matter which symbols surround it, the single nonterminal on the left-hand side can always be replaced by the right-hand side.

    Formal definitions:

A context-free grammar $G$ is defined by the 4-tuple [3]

$$G = (V, \Sigma, R, S)$$

where

(1) $V$ is a finite set; each element $v \in V$ is called a non-terminal character or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by $G$.
(2) $\Sigma$ is a finite set of terminals, disjoint from $V$, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar $G$.
(3) $R$ is a finite relation from $V$ to $(V \cup \Sigma)^*$, where the asterisk represents the Kleene star operation. The members of $R$ are called the (rewrite) rules or productions of the grammar. ($R$ is also commonly symbolized by a $P$.)
(4) $S$ is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of $V$.

Production rule notation

A production rule in $R$ is formalized mathematically as a pair $(\alpha, \beta) \in R$, where $\alpha \in V$ is a non-terminal and $\beta \in (V \cup \Sigma)^*$ is a string of variables and/or terminals; rather than using ordered pair notation, production rules are usually written using an arrow operator with $\alpha$ as its left-hand side and $\beta$ as its right-hand side: $\alpha \to \beta$.

It is allowed for $\beta$ to be the empty string, and in this case it is customary to denote it by $\varepsilon$. The form $\alpha \to \varepsilon$ is called an ε-production. [4]

It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. Rules $\alpha \to \beta_1$ and $\alpha \to \beta_2$ can hence be written as $\alpha \to \beta_1 \mid \beta_2$.
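To make the link between the arrow-and-pipe notation and the set of pairs concrete, here is a minimal sketch of a parser for that notation. It assumes symbols are separated by whitespace, that `->` stands for the arrow, and that the empty body is written as ε; the function name `parse_rules` is made up for this illustration.

```python
def parse_rules(text):
    """Parse lines like 'S -> a S b | ε' into a set of (head, body) pairs."""
    rules = set()
    for line in text.strip().splitlines():
        head, _, alternatives = line.partition("->")
        for alt in alternatives.split("|"):
            # An alternative consisting only of ε becomes the empty tuple
            body = tuple(s for s in alt.split() if s != "ε")
            rules.add((head.strip(), body))
    return rules


R = parse_rules("""
S -> a S b | ε
""")
print(sorted(R))  # [('S', ()), ('S', ('a', 'S', 'b'))]
```

Writing the alternatives on one line with `|` or on separate lines gives the same set of pairs.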

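Rule application

For any strings $u, v \in (V \cup \Sigma)^*$, we say $u$ directly yields $v$, written as $u \Rightarrow v$, if $\exists (\alpha, \beta) \in R$ with $\alpha \in V$ and $u_1, u_2 \in (V \cup \Sigma)^*$ such that $u = u_1 \alpha u_2$ and $v = u_1 \beta u_2$. Thus, $v$ is the result of applying the rule $(\alpha, \beta)$ to $u$.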
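A single rule application can be sketched in a few lines. The code below is an illustration under assumed conventions: strings are tuples of symbols, rules are (head, body) pairs, and the grammar S → a S b | ε is an example chosen here rather than one from the notes.

```python
def direct_yields(u, rules):
    """Return every v with u => v: replace one occurrence of a head α by a body β."""
    results = set()
    for i, symbol in enumerate(u):
        for head, body in rules:
            if symbol == head:
                # u = u1 α u2  becomes  v = u1 β u2
                results.add(u[:i] + body + u[i + 1:])
    return results


R = {("S", ("a", "S", "b")), ("S", ())}
u = ("a", "S", "b")
print(sorted(direct_yields(u, R)))  # [('a', 'a', 'S', 'b', 'b'), ('a', 'b')]
```

A string with no nonterminal left, such as `("a", "b")`, yields nothing further.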

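Repetitive rule application

For any strings $u, v \in (V \cup \Sigma)^*$, we say $u$ yields $v$, written as $u \Rightarrow^* v$ (or $u \Rightarrow\Rightarrow v$ in some textbooks), if there exist $k \ge 1$ and $u_1, \dots, u_k \in (V \cup \Sigma)^*$ such that $u = u_1 \Rightarrow u_2 \Rightarrow \cdots \Rightarrow u_k = v$.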
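As a worked example, assume the grammar with the two rules $S \to aSb$ and $S \to \varepsilon$ (chosen here only for illustration). Chaining three direct steps gives a derivation of $aabb$:

```latex
\begin{align*}
S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb
\qquad\text{hence}\qquad S \Rightarrow^{*} aabb
\end{align*}
```

The first two steps apply $S \to aSb$ and the last step applies $S \to \varepsilon$.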



Context-free language

The language of a grammar $G = (V, \Sigma, R, S)$ is the set

$$L(G) = \{ w \in \Sigma^* : S \Rightarrow^* w \}$$

A language $L$ is said to be a context-free language (CFL) if there exists a CFG $G$ such that $L = L(G)$.
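Because the language of a grammar is usually infinite, it can only be enumerated up to some bound. The following is a rough sketch of a breadth-first search over derivations, assuming the same tuple encoding as an illustration; the function name and both bounds (`max_len` on the number of terminals, `max_form_len` on the length of intermediate strings) are choices made here so that the search always terminates, which means the result may be incomplete if `max_form_len` is set too low.

```python
from collections import deque


def language_up_to(rules, start, terminals, max_len, max_form_len):
    """Collect words w over the terminals with S =>* w, within the given bounds."""
    seen = {(start,)}
    queue = deque([(start,)])
    words = set()
    while queue:
        u = queue.popleft()
        if all(s in terminals for s in u):
            words.add("".join(u))  # u contains only terminals, so it is a word
            continue
        for i, symbol in enumerate(u):
            for head, body in rules:
                if symbol != head:
                    continue
                v = u[:i] + body + u[i + 1:]
                too_long = len(v) > max_form_len
                too_many_terminals = sum(s in terminals for s in v) > max_len
                if not (too_long or too_many_terminals) and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return words


R = {("S", ("a", "S", "b")), ("S", ())}
words = language_up_to(R, "S", {"a", "b"}, max_len=4, max_form_len=5)
print(sorted(words, key=len))  # ['', 'ab', 'aabb']
```

For this example grammar the words found are exactly those of the form a...ab...b with equal counts and length at most 4.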

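6. PCFG

A probabilistic context-free grammar (PCFG) is a 5-tuple $(N, \Sigma, S, R, P)$:
(1) a set of nonterminals $N$
(2) a set of terminals $\Sigma$
(3) a start nonterminal $S \in N$
(4) a set of production rules $R$
(5) a probability $P(r)$ for every production $r \in R$
A PCFG is an extension of a CFG. A PCFG rule is written as $A \to \alpha \quad p$, where $A$ is a nonterminal and $p$ is the probability that $A$ derives $\alpha$, i.e. $p = P(A \to \alpha)$. This probability distribution must satisfy the following condition:
$$\sum_{\alpha} P(A \to \alpha) = 1$$
In other words, the probabilities of the productions that share the same left-hand side are normalized.
The probability of a parse tree is the product of the probabilities of all the rules used in it.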
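Both conditions, normalization per left-hand side and the parse-tree probability as a product of rule probabilities, are easy to check in code. The sketch below uses a toy grammar with made-up probabilities; the tree encoding (a tuple whose first element is the label, with plain strings as leaves) and the function names are assumptions made for this illustration.

```python
from collections import defaultdict
from math import isclose, prod

# Rule probabilities P(A -> α), keyed by (head, body); the values are made up.
P = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}


def is_normalized(probs):
    """Check that rules sharing a left-hand side have probabilities summing to 1."""
    totals = defaultdict(float)
    for (head, _), p in probs.items():
        totals[head] += p
    return all(isclose(t, 1.0) for t in totals.values())


def tree_probability(tree, probs):
    """tree = (label, child, ...); a leaf is a plain string (a terminal)."""
    label, *children = tree
    body = tuple(c if isinstance(c, str) else c[0] for c in children)
    subtrees = [c for c in children if not isinstance(c, str)]
    # Probability of the rule used at this node times that of every subtree
    return probs[(label, body)] * prod(tree_probability(c, probs) for c in subtrees)


tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(is_normalized(P))           # True
print(tree_probability(tree, P))  # 1.0 * 0.6 * 0.7 * 1.0 * 0.4, about 0.168
```

`math.prod` requires Python 3.8 or later.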

7. CNF

In formal language theory, a context-free grammar is said to be in Chomsky normal form (invented by Noam Chomsky) [1][2] if all of its production rules are of the form:

$A \to BC$ or
$A \to a$ or
$S \to \varepsilon$,

where $A$, $B$ and $C$ are nonterminal symbols, $a$ is a terminal symbol (a symbol that represents a constant value), $S$ is the start symbol, and $\varepsilon$ is the empty string. Also, neither $B$ nor $C$ may be the start symbol, and the third production rule can only appear if $\varepsilon$ is in $L(G)$, namely, the language produced by the context-free grammar $G$.
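The three permitted rule shapes can be verified with a short checker. This is a minimal sketch under assumed conventions (rules as (head, body) pairs of symbols, hypothetical function name `is_cnf`), and the two example grammars are constructed here for illustration.

```python
def is_cnf(rules, nonterminals, terminals, start):
    """Check that every rule is A -> B C, A -> a, or S -> ε."""
    for head, body in rules:
        if head not in nonterminals:
            return False
        if len(body) == 2:
            # A -> B C: both nonterminals, neither one the start symbol
            ok = all(s in nonterminals and s != start for s in body)
        elif len(body) == 1:
            ok = body[0] in terminals  # A -> a
        else:
            ok = len(body) == 0 and head == start  # S -> ε
        if not ok:
            return False
    return True


# A grammar written in CNF, with start symbol S0
cnf_rules = {
    ("S0", ()), ("S0", ("A", "T")), ("S0", ("A", "B")),
    ("S", ("A", "T")), ("S", ("A", "B")),
    ("T", ("S", "B")), ("A", ("a",)), ("B", ("b",)),
}
print(is_cnf(cnf_rules, {"S0", "S", "T", "A", "B"}, {"a", "b"}, "S0"))  # True

# S -> a S b | ε is not in CNF: the first body has three symbols
plain_rules = {("S", ("a", "S", "b")), ("S", ())}
print(is_cnf(plain_rules, {"S"}, {"a", "b"}, "S"))  # False
```

The checker only inspects the shape of the rules; it does not convert a grammar into Chomsky normal form.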

