NLP Study Notes
2014-06-30 15:11
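I have only recently started with NLP and am not yet familiar with many of its concepts. The short summer term has just begun, so this afternoon I sat in the library and went through some of the basic concepts. My study notes follow.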
1. Kleene star operation
Given a set $V$, define $V^0 = \{\varepsilon\}$ (the language consisting only of the empty string) and $V^1 = V$, and define recursively the set $V^{i+1} = \{ wv : w \in V^i \text{ and } v \in V \}$ for each $i > 0$.
If $V$ is a formal language, then $V^i$, the $i$-th power of the set $V$, is shorthand for the concatenation of $V$ with itself $i$ times. That is, $V^i$ is the set of all strings that can be represented as the concatenation of $i$ strings in $V$.
The definition of Kleene star on $V$ is [2]
$$V^* = \bigcup_{i \ge 0} V^i = V^0 \cup V^1 \cup V^2 \cup \cdots$$
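Since V* is infinite whenever V contains a non-empty string, it can only be enumerated up to some bound. The following is a minimal Python sketch of the recursive definition of the powers of V, assuming strings are ordinary Python strings and the empty string `""` stands for ε. The names `power`, `kleene_star` and `max_power` are made up for this illustration.

```python
def power(V, i):
    """Return V^i: every concatenation of exactly i strings taken from V."""
    result = {""}  # V^0 = {ε}; the empty string plays the role of ε
    for _ in range(i):
        # One more step of the recursion: append one more string from V.
        result = {w + v for w in result for v in V}
    return result


def kleene_star(V, max_power):
    """Return V^0 ∪ V^1 ∪ ... ∪ V^max_power, a finite slice of V*."""
    out = set()
    for i in range(max_power + 1):
        out |= power(V, i)
    return out


V = {"a", "b"}
print(sorted(power(V, 2)))
print(sorted(kleene_star(V, 2), key=lambda s: (len(s), s)))
```

For V = {a, b}, the slice up to the second power holds 1 + 2 + 4 = 7 strings: ε, the two strings of length one, and the four strings of length two.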
2. Kleene plus operation
In some formal language studies (e.g. AFL theory), a variation on the Kleene star operation called the Kleene plus is used. The Kleene plus omits the $V^0$ term in the above union. In other words, the Kleene plus on $V$ is
$$V^+ = \bigcup_{i \ge 1} V^i = V^1 \cup V^2 \cup V^3 \cup \cdots$$
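The difference from the Kleene star is easy to see in code: the union simply starts at the first power instead of the zeroth. This is a small sketch under the same assumptions as before (Python strings, `""` for ε, a bound `max_power` because the full set is infinite); the function names are invented for the example.

```python
def power(V, i):
    """Return V^i: every concatenation of exactly i strings taken from V."""
    result = {""}
    for _ in range(i):
        result = {w + v for w in result for v in V}
    return result


def kleene_plus(V, max_power):
    """Return V^1 ∪ V^2 ∪ ... ∪ V^max_power, a finite slice of V+."""
    out = set()
    for i in range(1, max_power + 1):  # starts at 1: V^0 is left out
        out |= power(V, i)
    return out


V = {"ab", "c"}
plus = kleene_plus(V, 2)
print(sorted(plus))
print("" in plus)  # ε is absent here because ε is not a member of V
```

Note that V+ contains ε exactly when V itself contains ε, since every string in V+ is a concatenation of at least one string from V.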
3. Production rule
A grammar is defined by production rules (or just 'productions') that specify which symbols may replace which other symbols; these rules may be used to generate strings, or to parse them. Each such rule has a head, or left-hand side, which consists of the string that may be replaced, and a body, or right-hand side, which consists of a string that may replace it. Rules are often written in the form head → body; e.g., the rule z0 → z1 specifies that z0 can be replaced by z1.
In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s,[1][2] a grammar $G$ consists of the following components:
(1) A finite set $N$ of nonterminal symbols.
(2) A finite set $\Sigma$ of terminal symbols that is disjoint from $N$.
(3) A finite set $P$ of production rules, each rule of the form
$$(\Sigma \cup N)^* \, N \, (\Sigma \cup N)^* \to (\Sigma \cup N)^*$$
where $*$ is the Kleene star operator and $\cup$ denotes set union, so $(\Sigma \cup N)^*$ represents zero or more symbols, and $N$ means one nonterminal symbol. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. If the body consists solely of the empty string (i.e., it contains no symbols at all), it may be denoted with a special notation (often $\Lambda$, $e$ or $\varepsilon$) in order to avoid confusion.
(4) A distinguished symbol $S \in N$ that is the start symbol.
A grammar is formally defined as the ordered quadruple $(N, \Sigma, P, S)$. Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature.[3][4]
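To make the quadruple concrete, here is a rough Python sketch that stores a grammar as four fields and applies one rule at a time. Everything in it is an assumption made for illustration: each symbol is a single character, the class `Grammar` and the helpers `is_well_formed` and `apply_rule` are invented names, and the toy grammar (a textbook-style grammar for strings of the form a^n b^n c^n) does not come from the notes above.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    productions: tuple       # P, as (head, body) pairs of strings
    start: str               # S


def is_well_formed(g):
    """Check the conditions listed for the quadruple (N, Σ, P, S)."""
    if g.nonterminals & g.terminals:      # Σ must be disjoint from N
        return False
    if g.start not in g.nonterminals:     # S must be a nonterminal
        return False
    symbols = g.nonterminals | g.terminals
    for head, body in g.productions:
        if not all(ch in symbols for ch in head + body):
            return False
        # The head must contain at least one nonterminal symbol.
        if not any(ch in g.nonterminals for ch in head):
            return False
    return True


def apply_rule(string, head, body):
    """Replace the leftmost occurrence of head by body; None if head is absent."""
    pos = string.find(head)
    if pos < 0:
        return None
    return string[:pos] + body + string[pos + len(head):]


# Hypothetical toy grammar, one character per symbol.
g = Grammar(
    nonterminals=frozenset("SB"),
    terminals=frozenset("abc"),
    productions=(("S", "aBSc"), ("S", "abc"), ("Ba", "aB"), ("Bb", "bb")),
    start="S",
)

form = g.start
for head, body in [("S", "aBSc"), ("S", "abc"), ("Ba", "aB"), ("Bb", "bb")]:
    form = apply_rule(form, head, body)
    print(form)
```

The loop walks through the derivation S, aBSc, aBabcc, aaBbcc, aabbcc. Rules such as Ba → aB have more than one symbol in the head, which is allowed in this general definition but not in a context-free grammar.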
4. Terminal symbols and nonterminal symbols
Terminal symbols are literal symbols which may appear in the inputs to or outputs from the production rules of a formal grammar and which cannot be changed using the rules of the grammar.
Nonterminal symbols are those symbols which can be replaced. They may also be called simply syntactic variables.
5. Context-free grammar
Definition:
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form $V \to w$, where $V$ is a single nonterminal symbol and $w$ is a string of terminals and/or nonterminals ($w$ can be empty). A formal grammar is considered "context free" when its production rules can be applied regardless of the context of a nonterminal. No matter which symbols surround it, the single nonterminal on the left-hand side can always be replaced by the right-hand side.
Formal definitions:
A context-free grammar $G$ is defined by the 4-tuple [3] $G = (V, \Sigma, R, S)$, where
(1) $V$ is a finite set; each element $v \in V$ is called a non-terminal character or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by $G$.
(2) $\Sigma$ is a finite set of terminals, disjoint from $V$, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar $G$.
(3) $R$ is a finite relation from $V$ to $(V \cup \Sigma)^*$, where the asterisk represents the Kleene star operation. The members of $R$ are called the (rewrite) rules or productions of the grammar. ($R$ is also commonly symbolized by a $P$.)
(4) $S$ is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of $V$.
Production rule notation:
A production rule in $R$ is formalized mathematically as a pair $(\alpha, \beta) \in R$, where $\alpha \in V$ is a non-terminal and $\beta \in (V \cup \Sigma)^*$ is a string of variables and/or terminals; rather than using ordered pair notation, production rules are usually written using an arrow operator with $\alpha$ as its left-hand side and $\beta$ as its right-hand side: $\alpha \to \beta$.
It is allowed for $\beta$ to be the empty string, and in this case it is customary to denote it by ε. The form $\alpha \to \varepsilon$ is called an ε-production.[4]
It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. Rules $\alpha \to \beta_1$ and $\alpha \to \beta_2$ can hence be written as $\alpha \to \beta_1 \mid \beta_2$.
Rule application:
For any strings $u, v \in (V \cup \Sigma)^*$, we say $u$ directly yields $v$, written as $u \Rightarrow v$, if $\exists (\alpha, \beta) \in R$ with $\alpha \in V$ and $u_1, u_2 \in (V \cup \Sigma)^*$ such that $u = u_1 \alpha u_2$ and $v = u_1 \beta u_2$. Thus, $v$ is the result of applying the rule $(\alpha, \beta)$ to $u$.
Repetitive rule application:
For any $u, v \in (V \cup \Sigma)^*$, we say $u$ yields $v$, written as $u \Rightarrow^* v$ (or $u \Rightarrow\Rightarrow v$ in some textbooks), if $\exists k \ge 1$ and $u_1, \ldots, u_k \in (V \cup \Sigma)^*$ such that $u = u_1 \Rightarrow u_2 \Rightarrow \cdots \Rightarrow u_k = v$.
Context-free language:
The language of a grammar $G = (V, \Sigma, R, S)$ is the set
$$L(G) = \{ w \in \Sigma^* : S \Rightarrow^* w \}$$
A language $L$ is said to be a context-free language (CFL) if there exists a CFG $G$ such that $L = L(G)$.
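The "yields" relation and the language of a grammar can be explored with a small breadth-first search over sentential forms. The sketch below is only an illustration under several assumptions: each symbol is one character, the rules are stored in a dict from a nonterminal to its list of right-hand sides (with `""` standing for ε), only leftmost derivations are followed, and intermediate forms longer than `max_len + slack` are discarded so the search terminates. That last cut-off is a simplification and could miss words in grammars whose derivations need long intermediate forms. The function name `language_up_to` and the toy grammar S → aSb | ε are made up for the example.

```python
from collections import deque


def language_up_to(rules, start, max_len, slack=2):
    """Collect the words of L(G) with at most max_len symbols (bounded search)."""
    nonterminals = set(rules)
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if there is one.
        idx = next((i for i, ch in enumerate(form) if ch in nonterminals), None)
        if idx is None:
            words.add(form)  # only terminals left: S yields this word
            continue
        for body in rules[form[idx]]:
            # One "directly yields" step: u1 α u2  =>  u1 β u2
            new = form[:idx] + body + form[idx + 1:]
            n_terminals = sum(ch not in nonterminals for ch in new)
            if n_terminals > max_len or len(new) > max_len + slack:
                continue  # prune forms that cannot fit the bound
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return words


rules = {"S": ["aSb", ""]}  # S → aSb | ε
print(sorted(language_up_to(rules, "S", 4), key=len))
```

For this grammar the search finds ε, ab and aabb, the words of the form a^n b^n with at most four symbols.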
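6. PCFG
A probabilistic context-free grammar (PCFG) is a 5-tuple $(N, \Sigma, S, R, P)$:
(1) a set of nonterminals $N$
(2) a set of terminals $\Sigma$
(3) a start nonterminal $S \in N$
(4) a set of production rules $R$
(5) a probability $P(r)$ for every production $r \in R$
A PCFG is an extension of a CFG. Its rules are written in the form $A \to \alpha \;\; p$, where $A$ is a nonterminal and $p$ is the probability that $A$ derives $\alpha$, i.e. $p = P(A \to \alpha)$. This probability distribution must satisfy the following condition:
$$\sum_{\alpha} P(A \to \alpha) = 1$$
In other words, the probabilities of the productions that share the same left-hand side are normalized: they sum to 1.
The probability of a parse tree is the product of the probabilities of all the rules used in it.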
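Both conditions, normalization per left-hand side and the tree probability as a product of rule probabilities, can be checked in a few lines of Python. This is a sketch only: the toy grammar, its probabilities, the nested-tuple tree encoding `(label, child, ...)` and the function names `is_normalized` and `tree_probability` are all assumptions made for the example.

```python
import math
from collections import defaultdict

# Hypothetical toy PCFG: (head, body) -> probability.
probs = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("fish",)): 0.6,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}


def is_normalized(probs, tol=1e-9):
    """Check that the rule probabilities sum to 1 for every left-hand side."""
    totals = defaultdict(float)
    for (head, _body), p in probs.items():
        totals[head] += p
    return all(math.isclose(t, 1.0, abs_tol=tol) for t in totals.values())


def tree_probability(tree, probs):
    """Multiply the probabilities of all rules used in the parse tree."""
    if isinstance(tree, str):  # a leaf is a terminal and uses no rule
        return 1.0
    head, *children = tree
    # The rule body is the sequence of child labels (or terminal words).
    body = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = probs[(head, body)]
    for child in children:
        p *= tree_probability(child, probs)
    return p


tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(is_normalized(probs))
print(tree_probability(tree, probs))  # 1.0 * 0.4 * 0.7 * 1.0 * 0.6, about 0.168
```

In practice the product is usually computed as a sum of log probabilities to avoid underflow on large trees; the plain product is kept here to mirror the definition.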
7. CNF
In formal language theory, a context-free grammar is said to be in Chomsky normal form (invented by Noam Chomsky)[1][2] if all of its production rules are of the form:
$A \to BC$, or
$A \to a$, or
$S \to \varepsilon$,
where $A$, $B$ and $C$ are nonterminal symbols, $a$ is a terminal symbol (a symbol that represents a constant value), $S$ is the start symbol, and $\varepsilon$ is the empty string. Also, neither $B$ nor $C$ may be the start symbol, and the third production rule can only appear if $\varepsilon$ is in $L(G)$, namely, the language produced by the context-free grammar $G$.
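The three permitted rule shapes translate directly into a checker. The following Python sketch assumes a grammar is given as a list of `(head, body)` pairs with the body as a tuple of symbols (the empty tuple standing for ε); `is_cnf` and the example grammars are hypothetical names and data chosen for illustration.

```python
def is_cnf(rules, nonterminals, terminals, start):
    """Return True if every rule is A -> B C, A -> a, or S -> ε."""
    for head, body in rules:
        if head not in nonterminals:
            return False
        if len(body) == 2:
            # A -> B C: both symbols are nonterminals other than the start symbol.
            if not all(s in nonterminals and s != start for s in body):
                return False
        elif len(body) == 1:
            # A -> a: a single terminal symbol.
            if body[0] not in terminals:
                return False
        elif len(body) == 0:
            # Only the start symbol may produce the empty string.
            if head != start:
                return False
        else:
            return False  # bodies with three or more symbols are not allowed
    return True


nonterminals = {"S", "A", "B"}
terminals = {"a", "b"}

cnf_rules = [("S", ("A", "B")), ("S", ()), ("A", ("a",)), ("B", ("b",))]
non_cnf_rules = [("S", ("a", "S", "b")), ("S", ())]

print(is_cnf(cnf_rules, nonterminals, terminals, "S"))
print(is_cnf(non_cnf_rules, nonterminals, terminals, "S"))
```

The condition that S → ε may appear only when ε is in L(G) needs no separate test here, because the presence of that rule already puts ε in the language.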