Word Embedding in NLP: One-Hot Encoding and Skip-Gram Neural Network

2020-08-23 21:27

I’m a poet-turned-programmer who has just begun learning about the wonderful world of natural language processing. In this post, I’ll be sharing what I’ve come to understand about word embedding, with the focus on two embedding methods: one-hot encoding and skip-gram neural network model.

Last year, OpenAI released a (restricted) version of GPT-2, an AI system that generates texts. When the user inputs a prompt — a word, sentence, or even paragraph — the system “predicts” and produces the next words. What is striking about GPT-2 is that the resulting passages, more often than not, could easily pass for texts written by a (rather rambling) human.

It is not too hard to find poetry generators that output random words with or without regard to syntactic logic. One can write a fairly simple algorithm that selects a randomized word from a pool of nouns, then selects a randomized word from a pool of verbs, and so on. (Some say poetry generation is relatively “easy” because of the inherent literary license of the genre.)

But how do machines generate coherent sentences that seem to know about their surrounding context, much as humans do?

Enter natural language processing (NLP). NLP is a branch of artificial intelligence that aims to make sense of everyday (thus natural) human languages. Numerous applications of NLP have been around for quite a while now, from text autocompletion and chatbots to voice assistants and spot-on music recommendations.

As NLP models are showing state-of-the-art performance more than ever, it might be worthwhile to take a closer look at one of the most commonly used methods in NLP: word embedding.

What is Word Embedding?

At its core, word embedding is a means of turning texts into numbers. We do this because machine learning algorithms can only understand numbers, not plain text.

In order for a computer to be able to read texts, they have to be encoded as continuous vectors of numeric values. You might be familiar with the concept of a vector in Euclidean geometry, where it denotes an object with magnitude and direction. In computer science, a vector means a one-dimensional array.

(Note: an array is a type of data structure notated with square brackets ([]). An example of a one-dimensional array could be something like [.44, .26, .07, -.89, -.15]. A nested array could look like [0, 1, [2, 3], 4]. Arrays hold a sequence of values stored contiguously.)

Okay, then how do we embed — or vectorize — words? The simplest method is called one-hot encoding, also known as “1-of-N” encoding (meaning the vector is composed of a single one and a number of zeros).

An Approach: One-Hot Encoding

Let’s take a look at the following sentence: “I ate an apple and played the piano.” We can begin by indexing each word’s position in the given vocabulary set.

Position of each word in the vocabulary

The word “I” is at position 1, so its one-hot vector representation would be [1, 0, 0, 0, 0, 0, 0, 0]. Similarly, the word “ate” is at position 2, so its one-hot vector would be [0, 1, 0, 0, 0, 0, 0, 0]. The number of words in the source vocabulary signifies the number of dimensions — here we have eight.

The one-hot embedding matrix for the example text would look like this:

One-hot vector representation of each word in the vocabulary
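
As a concrete illustration, here is a minimal Python sketch (standard library only) that builds these one-hot vectors for the example sentence; the whitespace tokenization and word order are assumptions made purely for this toy example.

```python
# Build one-hot vectors for the toy vocabulary "I ate an apple and played the piano".
# Assumes simple whitespace tokenization; every word in this sentence is unique.
words = "I ate an apple and played the piano".split()
word_to_index = {word: i for i, word in enumerate(words)}

def one_hot(word):
    """Return the one-hot vector for `word`: a single 1 at its index, 0 everywhere else."""
    vector = [0] * len(words)
    vector[word_to_index[word]] = 1
    return vector

for word in words:
    print(f"{word:>6}: {one_hot(word)}")
# "I"   -> [1, 0, 0, 0, 0, 0, 0, 0]
# "ate" -> [0, 1, 0, 0, 0, 0, 0, 0]
# ...and so on for the remaining six words.
```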

Problems with One-Hot Encoding

There are two major issues with this approach for word embedding.

The first issue is the curse of dimensionality, which refers to all sorts of problems that arise with high-dimensional data. Even with a relatively small eight dimensions, most of our example matrix is taken up by zeros, so the useful data becomes sparse. Imagine we have a vocabulary of 50,000 words. (There are roughly a million words in the English language.) Each word is represented with 49,999 zeros and a single one, and the full matrix needs 50,000 squared = 2.5 billion units of memory; the storage cost grows quadratically with the vocabulary size. Not computationally efficient.

The second issue is that it is hard to extract meaning. Each word is embedded in isolation, as a single one and N−1 zeros, where N is the number of dimensions. The resulting set of vectors says little about how the words relate to one another. If our vocabulary had “orange,” “banana,” and “watermelon,” we could see the similarity between those words, such as the fact that they are types of fruit, or that they usually follow some form of the verb “eat.” We can easily form a mental map or cluster where these words sit close to each other. But with one-hot vectors, all words are an equal distance apart.
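
A small sketch of both problems at once, using indices from the toy vocabulary above: the matrix is almost entirely zeros, and every pair of distinct words sits exactly the same distance apart.

```python
import math

# One-hot vectors from the 8-word toy vocabulary (indices as in the table above).
apple = [0, 0, 0, 1, 0, 0, 0, 0]
ate   = [0, 1, 0, 0, 0, 0, 0, 0]
piano = [0, 0, 0, 0, 0, 0, 0, 1]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Any two distinct one-hot vectors differ in exactly two positions,
# so every pairwise distance is sqrt(2); "apple" is no closer to "ate" than to "piano".
print(euclidean(apple, ate))    # 1.4142...
print(euclidean(apple, piano))  # 1.4142...

# Sparsity: a 50,000-word vocabulary needs a 50,000 x 50,000 matrix,
# i.e. 2.5 billion stored values, of which only 50,000 are non-zero.
print(50_000 * 50_000)          # 2500000000
```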

Another Approach: Skip-Gram Neural Network

There are different ways to address the two issues mentioned above, but in this article, we’ll be looking at the skip-gram neural network model. “Skip-gram” is about guessing — given an input word — its context words (words that occur nearby), and “neural network” is about a network of nodes and the weights (strength of connection) between those nodes.

This model of word embedding helps reduce dimensionality and retain information on contextual similarity, to which we’ll get back later. The famous Word2vec developed by Tomas Mikolov uses this model (along with another one called Continuous Bag of Words or CBOW model, which does the opposite of skip-gram by guessing a word based on its context words).
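
For reference, this is roughly what training such a model looks like with the gensim library (the parameter names below follow gensim 4.x, where `sg=1` selects skip-gram; the tiny corpus and the settings are placeholders, not a serious training setup):

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences.
sentences = [
    "I ate an apple and played the piano".lower().split(),
    "I ate a banana".lower().split(),
    "I ate an orange".lower().split(),
]

# sg=1 selects the skip-gram architecture (sg=0 would be CBOW).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["apple"].shape)                 # (50,) -- the learned embedding for "apple"
print(model.wv.similarity("apple", "banana"))  # cosine similarity between two learned vectors
```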

(Reference: Tomas Mikolov et al., “Efficient Estimation of Word Representations in Vector Space”)

To illustrate the skip-gram method, let’s go back to our sample text, “I ate an apple and played the piano.” We’ll use the word “ate” as an input to our model. Again, skip-gram is about predicting context words appearing near the input word. Say we define the window size to be two words (the window size can vary) and look at two words that come before the input word and two words that come after it. Then we’ll be looking at “I”, “an”, and “apple”. From here, our training sample of input-output pairs would be (“ate”, “I”), (“ate”, “an”), and (“ate”, “apple”). See the following image for reference:

(Source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
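
The sketch below generates these (input word, context word) training pairs for a window size of two; the pair format is just an illustrative choice.

```python
# Generate skip-gram (input word, context word) training pairs with a window size of 2.
sentence = "I ate an apple and played the piano".split()
window_size = 2

pairs = []
for i, center in enumerate(sentence):
    # Look at up to `window_size` words before and after the center word.
    for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# For the input word "ate" this yields ("ate", "I"), ("ate", "an"), ("ate", "apple").
print([p for p in pairs if p[0] == "ate"])
```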

Imagine we are starting off with a model that has not been trained yet and therefore does not know the context words for each word. When we take the first pair of our training samples, (“ate”, “I”), and input “ate” into our model expecting to see “I,” our model might randomly spit out “played.” More precisely, what’s actually happening is that the model is outputting a prediction vector with a probability of each word in the vocabulary to occur near “ate”, then selecting the word with the highest probability (in this case, “played’). At the early training stage, it’s pretty much a random prediction.

How do we tune the prediction to be more accurate? The fact that the model thought the word “played” would probably appear near the word “ate” tells us something about the relationship between “played” and “ate” — they currently have a relatively strong connection, or weight, between them, even though that does not reflect their actual relationship. How do we know that this connection should have been weaker? Because our sample pairs only had the context words “I”, “an”, and “apple”, and not “played”.

Note that, for this particular pair (“ate”, “I”), our target output should look like this: a probability of 1 at “I” and 0 at every other word, namely [1, 0, 0, 0, 0, 0, 0, 0]. We get a vector of error values by comparing the actual output (where “played” has the value closest to 1) with the expected target output (where “I” should have the value closest to 1), and use those values in back-propagation to adjust the weights between “ate” and the rest.
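
As a toy numerical illustration (the predicted probabilities below are made up; a real model would produce them with a softmax over the output layer):

```python
# Toy illustration of the error vector for the training pair ("ate", "I").
vocab = ["I", "ate", "an", "apple", "and", "played", "the", "piano"]

target    = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]          # "I" should appear with probability 1
predicted = [0.10, 0.05, 0.15, 0.10, 0.05, 0.40, 0.05, 0.10]   # an untrained model favoring "played"

errors = [p - t for p, t in zip(predicted, target)]
print(errors)
# Large negative error at "I" and positive error at "played": back-propagation uses these
# values to strengthen the "ate"->"I" weights and weaken the "ate"->"played" weights.
```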

Plainly put, “I” should have occurred with 100% probability near “ate” but it only occurred with (for instance) 10% probability, so we adjust the “ate-to-I” weight to be closer to 100%, using the 90% difference. As you might have guessed, since there are two other words (“an” and “apple”) that also appear near “ate”, adding these pairs into the calculation will bring the weight of “I” occurring near “ate” closer to 33%.

The resulting weights at the end of the training are the word embeddings we are looking for — we want to get the tool that predicts relationships, not necessarily what each individual input returns.

The hidden layer of weights is the word embedding matrix we get as a result of skip-gram training. (Source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
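
For the curious, here is a deliberately minimal NumPy sketch of that training loop, using a full softmax output layer (production implementations such as Word2vec use tricks like negative sampling instead); the embedding size, learning rate, and epoch count are arbitrary choices for this toy corpus.

```python
import numpy as np

# Toy corpus, vocabulary, and (center, context) index pairs with a window of 2.
corpus = "I ate an apple and played the piano".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
pairs = [(w2i[corpus[i]], w2i[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - 2), min(len(corpus), i + 3)) if j != i]

V, D = len(vocab), 5                        # vocabulary size, embedding dimension (arbitrary)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input-to-hidden weights: these become the embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden-to-output weights

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

learning_rate = 0.05
for epoch in range(500):
    for center, context in pairs:
        h = W_in[center]                    # hidden layer: the current embedding of the input word
        probs = softmax(h @ W_out)          # predicted probability of every vocabulary word
        errors = probs.copy()
        errors[context] -= 1.0              # prediction minus the one-hot target
        grad_h = W_out @ errors             # gradient flowing back into the hidden layer
        W_out -= learning_rate * np.outer(h, errors)
        W_in[center] -= learning_rate * grad_h

print("Embedding for 'ate':", W_in[w2i["ate"]])
```

After training, each row of W_in is the dense embedding for one vocabulary word, which is exactly the hidden layer of weights the diagram refers to.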

In fact, this is the basic idea of neural network training. It is similar to how we remember things by association, in that certain associations become stronger with repeated co-occurrence (e.g. “New” and “York”), while others become weaker as they are used less and less together. The neurons in our brain keep adjusting weights between themselves to better represent reality.

We’ve now seen how the skip-gram neural network model works. As mentioned before, this has two advantages over one-hot embedding: dimensionality reduction and context similarity.

Dimensionality reduction

Without going into the details, I’d like to bring our attention back to the diagram above, in particular the hidden layer in the middle. To give you some context, the source article uses a vocabulary of 10,000 words. You’ll see that the hidden layer has 300 neurons, which means it produces an embedding of size 300, compared to the 10,000 dimensions a one-hot encoding of that vocabulary would need.

Remember how one-hot encoding created vectors composed of as many values as there are words in the vocabulary, mostly filled with zeros? Instead of doing that, the skip-gram neural network model selects a smaller number of features, or neurons, that say something more useful about that word. Think of features as character traits — we are made of multiple traits (e.g. introverted vs. extroverted), each with a value on a spectrum. We only retain what describes us best.
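
In other words, multiplying a one-hot input vector by the hidden-layer weight matrix simply selects that word’s row, collapsing a 10,000-dimensional input into a 300-dimensional dense vector. A tiny sketch of that idea (the matrix here is random; in practice it is the trained weight matrix):

```python
import numpy as np

V, D = 10_000, 300                    # vocabulary size, hidden-layer ("feature") size
W = np.random.rand(V, D)              # stands in for the trained hidden-layer weight matrix

one_hot = np.zeros(V)
one_hot[42] = 1                       # pretend the word at index 42 is "apple"

embedding = one_hot @ W               # (10,000,) @ (10,000, 300) -> (300,)
print(embedding.shape)                # (300,)
print(np.allclose(embedding, W[42]))  # True: the multiplication is just a row lookup
```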

As a side note, I can’t stress enough how much I love this quote from Pathmind:

“Just as Van Gogh’s painting of sunflowers is a two-dimensional mixture of oil on canvas that represents vegetable matter in a three-dimensional space in Paris in the late 1880s, so 500 numbers arranged in a vector can represent a word or group of words. Those numbers locate each word as a point in 500-dimensional vectorspace.”

Context similarity

The skip-gram model lets you keep information about the context of each word based on their proximity. In the example of “I ate an apple,” “ate” is a context word of “apple.” Context allows for grouping of words based on their syntactic and/or semantic similarity. If we were given additional input of “I ate a banana” and “I ate an orange,” we will soon find out that “ate” is also a context word of “banana” and “orange,” and therefore can infer that “apple,” “banana” and “orange” must share some commonality.

The vectors of “apple,” “banana,” and “orange,” since they have similar context, are then adjusted to be closer to each other, forming a cluster on some multidimensional geometric space. On this note, linguist J.R. Firth said, “you shall know a word by the company it keeps.”
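
Closeness in that space is usually measured with cosine similarity; here is a quick sketch with made-up three-dimensional vectors (real embeddings would be learned and far higher-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors purely for illustration.
apple  = np.array([0.9, 0.1, 0.2])
banana = np.array([0.8, 0.2, 0.1])
piano  = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(apple, banana))  # ~0.99: similar contexts ("I ate a/an ...")
print(cosine_similarity(apple, piano))   # ~0.30: different contexts
```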

(A caveat: context isn’t always most strongly correlated with the closest words; think of a noun or verb phrase, separated by a lot of miscellaneous words in between. This is only one way of language modeling.)

Summary

In this article, we’ve looked at the concept of word embedding in NLP as I’ve come to understand it, along with two common embedding methods: one-hot encoding and skip-gram.

I also briefly touched upon the advantages of one word embedding method over the other, namely dimensionality reduction and context similarity.

Moving forward, I’d like to gain a better understanding of transformers in language modeling and the workings of GPT-2 (as well as the most recently released GPT-3) which initially prompted this writing.

Closing Note

The idea that machines can compose texts, and increasingly sensible ones at that, still awes me. I have a long journey ahead to get a fuller picture of how different NLP models work, and there will probably be (many) misunderstandings along the way. I’d very much appreciate any kind of feedback on how I could be a better learner and writer.

For those who are not familiar with the topic, I hope this piece has sparked your interest in the field of NLP, the way the following articles did mine.

Sources & Credits:

Translated from: https://towardsdatascience.com/word-embedding-in-nlp-one-hot-encoding-and-skip-gram-neural-network-81b424da58f2
