How to Use Google's Protocol Buffers in Python

2020-08-21 02:16

When people who speak different languages get together and talk, they try to use a language that everyone in the group understands.

To achieve this, everyone has to translate their thoughts, which are usually in their native language, into the language of the group. This “encoding and decoding” of language, however, leads to a loss of efficiency, speed, and precision.The same concept is present in computer systems and their components. Why should we send data in XML, JSON, or any other human-readable format if there is no need for us to understand what they are talking about directly? As long as we can still translate it into a human-readable format if explicitly needed.Protocol Buffers are a way to encode data before transportation, which efficiently shrinks data blocks and therefore increases speed when sending it. It abstracts data into a language- and platform-neutral format.

Why Protocol Buffers?

The initial purpose of Protocol Buffers was to simplify the work with request/response protocols. Before ProtoBuf, Google used a different format which required additional handling of marshaling for the messages sent.

In addition to that, new versions of the previous format required the developers to make sure that new versions were understood before replacing old ones, making it a hassle to work with.

This overhead motivated Google to design an interface that solves precisely those problems.

ProtoBuf allows changes to the protocol to be introduced without breaking compatibility. Also, servers can pass around the data and execute read operations on the data without modifying its content.

Since the format is somewhat self-describing, ProtoBuf is used as a base for automatic code generation for Serializers and Deserializers.

Another interesting use case is how Google uses it for short-lived Remote Procedure Calls (RPC) and to persistently store data in Bigtable. Due to their specific use case, they integrated RPC interfaces into ProtoBuf. This allows for quick and straightforward code stub generation that can be used as starting points for the actual implementation. (More on ProtoBuf RPC.)

Other examples of where ProtoBuf can be useful are for IoT devices that are connected through mobile networks in which the amount of sent data has to be kept small or for applications in countries where high bandwidths are still rare. Sending payloads in optimized, binary formats can lead to noticeable differences in operation cost and speed.

Using gzip compression in your HTTPS communication can further improve those metrics.
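To get a feel for what gzip adds on top, here is a minimal sketch using only Python's standard library. The payload and its size are made up purely for illustration:

```python
import gzip
import json

# Hypothetical payload: 100 todo items with the repetitive keys typical of JSON
payload = json.dumps(
    [{"state": "TASK_OPEN", "task": f"Task {i}", "due_date": "31.10.2019"} for i in range(100)]
).encode("utf-8")

compressed = gzip.compress(payload)
print(len(payload), len(compressed))  # the compressed form is considerably smaller
```

Highly repetitive, text-based formats like JSON compress particularly well, which is one reason the gap between compressed JSON and compressed ProtoBuf narrows in practice.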

What are Protocol buffers and how do they work?

Generally speaking, Protocol Buffers are a defined interface for the serialization of structured data. It defines a normalized way to communicate, utterly independent of languages and platforms.

Google advertises its ProtoBuf like this:

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once …

The ProtoBuf interface describes the structure of the data to be sent. Payload structures are defined as “messages” in so-called Proto-Files. Those files always end with a .proto extension. For example, the basic structure of a todolist.proto file looks like this. We will also look at a complete example in the next section.

syntax = "proto3";

// Not necessary for Python, but should still be declared to avoid name collisions
// in the Protocol Buffers namespace and in non-Python languages
package protoblog;

message TodoList {
  // Elements of the todo list will be defined here
  ...
}

Those files are then used to generate integration classes or stubs for the language of your choice using code generators within the protoc compiler. The current version, Proto3, already supports all the major programming languages. The community supports many more in third-party open-source implementations.

Generated classes are the core elements of Protocol Buffers. They allow the creation of elements by instantiating new messages based on the .proto files, which are then used for serialization. We’ll look at how this is done with Python in detail in the next section.

Independent of the language for serialization, the messages are serialized into a non-self-describing, binary format that is pretty useless without the initial structure definition.
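We can see this opacity directly in Python. Using the serialized todo-list bytes that appear later in this article, the raw string values survive in the stream, but none of the field names do:

```python
# The serialized TodoList bytes shown later in this article
data = b'\x08\xd2\t\x12\x03Tim\x1a(\x08\x04\x12\x18Test ProtoBuf for Python\x1a\n31.10.2019'

# Raw string values are present in the byte stream...
print(b"Tim" in data)          # True
# ...but field names like "owner_name" are nowhere to be found;
# without the .proto definition, the stream is essentially opaque.
print(b"owner_name" in data)   # False
```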

The binary data can then be stored, sent over the network, and used in any other way human-readable data like JSON or XML is. After transmission or storage, the byte-stream can be deserialized and restored using any language-specific, compiled ProtoBuf class we generate from the .proto file. Using Python as an example, the process could look something like this:

First, we create a new todo list and fill it with some tasks. This todo list is then serialized and sent over the network, saved in a file, or persistently stored in a database.

The sent byte stream is deserialized using the parse method of our language-specific, compiled class. Most current architectures and infrastructures, especially microservices, are based on REST, WebSockets, or GraphQL communication. However, when speed and efficiency are essential, low-level RPCs can make a huge difference.

Instead of high overhead protocols, we can use a fast and compact way to move data between the different entities into our service without wasting many resources.

But why isn’t it used everywhere yet?

Protocol Buffers are a bit more complicated than other, human-readable formats. This makes them comparatively harder to debug and integrate into your applications.

Iteration times in engineering also tend to increase since updates in the data require updating the proto files before usage.

Careful considerations have to be made since ProtoBuf might be an over-engineered solution in many cases.

What alternatives do I have?

Several projects take a similar approach to Google’s Protocol Buffers.

Google’s Flatbuffers and a third-party implementation called Cap’n Proto are more focused on removing the parsing and unpacking step that is necessary to access the actual data when using ProtoBuf. They have been designed explicitly for performance-critical applications, making them even faster and more memory-efficient than ProtoBuf. When focusing on the RPC capabilities of ProtoBuf (used with gRPC), there are projects from other large companies like Facebook (Apache Thrift) or Microsoft (Bond protocols) that can offer alternatives.

Python and Protocol Buffers

Python already provides some ways of data persistence using pickling. Pickling is useful in Python-only applications. It’s not well suited for more complex scenarios where data sharing with other languages or changing schemas is involved. Protocol Buffers, in contrast, are developed for exactly those scenarios. The .proto files, which we’ve quickly covered before, allow the user to generate code for many supported languages.
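For contrast, a minimal pickling round trip looks like this. The dictionary is just a hypothetical stand-in for our todo list, and the resulting bytes can only sensibly be read back by Python:

```python
import pickle

# A plain Python stand-in for our todo list
todo_list = {
    "owner_id": 1234,
    "owner_name": "Tim",
    "todos": [{"state": "TASK_DONE", "task": "Test ProtoBuf for Python"}],
}

blob = pickle.dumps(todo_list)  # opaque, Python-specific bytes
restored = pickle.loads(blob)
assert restored == todo_list
```

This is convenient inside a single Python codebase, but the pickle format is neither language-neutral nor schema-aware, which is exactly the gap ProtoBuf fills.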

To compile the .proto file to the language class of our choice, we use protoc, the proto compiler. If you don’t have the protoc compiler installed, there are excellent guides on how to do that:

Once we’ve installed protoc on our system, we can use an extended example of our todo list structure from before and generate the Python integration class from it.

syntax = "proto3";

// Not necessary for Python, but should still be declared to avoid name collisions
// in the Protocol Buffers namespace and in non-Python languages
package protoblog;

// The style guide prefers prefixing enum values instead of surrounding
// them with an enclosing message
enum TaskState {
  TASK_OPEN = 0;
  TASK_IN_PROGRESS = 1;
  TASK_POST_PONED = 2;
  TASK_CLOSED = 3;
  TASK_DONE = 4;
}

message TodoList {
  int32 owner_id = 1;
  string owner_name = 2;

  message ListItems {
    TaskState state = 1;
    string task = 2;
    string due_date = 3;
  }

  repeated ListItems todos = 3;
}

Let’s take a more detailed look at the structure of the .proto file to understand it. In the first line of the proto file, we define whether we’re using Proto2 or Proto3. In this case, we’re using Proto3.

The most uncommon elements of proto files are the numbers assigned to each entity of a message. Those dedicated numbers make each attribute unique and are used to identify the assigned fields in the binary encoded output.

One important concept to grasp is that only field numbers 1-15 are encoded with one byte less than higher numbers, which is useful to understand so we can assign the low numbers to frequently used entities and higher numbers to the less frequently used ones. The numbers define neither the order of encoding nor the position of the given attribute in the encoded message.
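We can sketch why this is the case. According to the ProtoBuf wire format, every field is preceded by a key that is the base-128 varint of (field_number << 3) | wire_type. The following is a small, hand-rolled illustration, not the official library:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a ProtoBuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def tag_size(field_number: int, wire_type: int = 2) -> int:
    """Size in bytes of the key that precedes a field's payload."""
    return len(encode_varint((field_number << 3) | wire_type))

print(tag_size(1))   # 1 byte
print(tag_size(15))  # 1 byte
print(tag_size(16))  # 2 bytes
```

Field numbers 1-15 shifted left by three bits still fit into the seven payload bits of a single varint byte; from 16 upward the key spills into a second byte.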

The package definition helps prevent name clashes. In Python, packages are defined by their directory. Therefore providing a package attribute doesn’t have any effect on the generated Python code.

Please note that this should still be declared to avoid protocol buffer related name collisions and for other languages like Java.

Enumerations are simple listings of possible values for a given variable. In this case, we define an enum for the possible states of each task on the todo list. We’ll see how to use them in a bit when we look at the usage in Python. As we can see in the example, we can also nest messages inside messages. If we, for example, want to have a list of todos associated with a given todo list, we can use the repeated keyword, which is comparable to dynamically sized arrays.

To generate usable integration code, we use the proto compiler, which compiles a given .proto file into language-specific integration classes. In our case, we use the --python_out argument to generate Python-specific code.

protoc -I=. --python_out=. ./todolist.proto

In the terminal, we invoke the protocol compiler with three parameters:

  1. -I: defines the directory where we search for any dependencies (we use . which is the current directory)

  2. --python_out: defines the location we want to generate a Python integration class in (again we use . which is the current directory)

  3. The last unnamed parameter defines the .proto file that will be compiled (we use the todolist.proto file in the current directory)

This creates a new Python file called <name_of_proto_file>_pb2.py. In our case, it is todolist_pb2.py. When taking a closer look at this file, we won’t be able to understand much about its structure immediately.

This is because the generator doesn’t produce direct data access elements, but further abstracts away the complexity using metaclasses and descriptors for each attribute. They describe how a class behaves instead of each instance of that class. The more exciting part is how to use this generated code to create, build, and serialize data. A straightforward integration done with our recently generated class is seen in the following:

import todolist_pb2 as TodoList

my_list = TodoList.TodoList()
my_list.owner_id = 1234
my_list.owner_name = "Tim"

first_item = my_list.todos.add()
first_item.state = TodoList.TaskState.Value("TASK_DONE")
first_item.task = "Test ProtoBuf for Python"
first_item.due_date = "31.10.2019"

print(my_list)

It merely creates a new todo list and adds one item to it. We then print the todo list element itself and can see the non-binary, non-serialized version of the data we just defined in our script.

owner_id: 1234
owner_name: "Tim"
todos {
  state: TASK_DONE
  task: "Test ProtoBuf for Python"
  due_date: "31.10.2019"
}

Each Protocol Buffer class has methods for reading and writing messages using a Protocol Buffer-specific encoding that encodes messages into binary format. Those two methods are SerializeToString() and ParseFromString().

import todolist_pb2 as TodoList

my_list = TodoList.TodoList()
my_list.owner_id = 1234

# ...

with open("./serializedFile", "wb") as fd:
    fd.write(my_list.SerializeToString())

my_list = TodoList.TodoList()
with open("./serializedFile", "rb") as fd:
    my_list.ParseFromString(fd.read())

print(my_list)

In the code example above, we write the serialized string of bytes into a file using the wb flag.

Since we have already written the file, we can read the content back and parse it using ParseFromString: we open the file again with the rb flag, read the serialized bytes, and parse them into a new instance of our compiled class.

If we serialize this message and print it in the console, we get the byte representation which looks like this.

b'\x08\xd2\t\x12\x03Tim\x1a(\x08\x04\x12\x18Test ProtoBuf for Python\x1a\n31.10.2019'

Note the b in front of the quotes. This indicates that the following string is composed of byte octets in Python.
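A quick illustration of how Python treats such byte strings. The sample mirrors the length-prefixed owner_name value from the stream above:

```python
raw = b'\x03Tim'

print(type(raw))         # <class 'bytes'>
print(raw[0])            # 3 -- indexing bytes yields integers, not characters
print(raw[1:].decode())  # Tim -- decoding turns byte octets back into text
```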

If we directly compare this to, e.g., XML, we can see the impact ProtoBuf serialization has on the size.

<todolist>
  <owner_id>1234</owner_id>
  <owner_name>Tim</owner_name>
  <todos>
    <todo>
      <state>TASK_DONE</state>
      <task>Test ProtoBuf for Python</task>
      <due_date>31.10.2019</due_date>
    </todo>
  </todos>
</todolist>

The JSON representation, non-uglified, would look like this.

{
  "todoList": {
    "ownerId": "1234",
    "ownerName": "Tim",
    "todos": [
      {
        "state": "TASK_DONE",
        "task": "Test ProtoBuf for Python",
        "dueDate": "31.10.2019"
      }
    ]
  }
}

Judging the different formats only by the total number of bytes used, ignoring the memory needed for the overhead of formatting it, we can of course see the difference. But in addition to the memory used for the data, we also have 12 extra bytes in ProtoBuf for formatting serialized data. Compared to that, XML needs 171 extra bytes for formatting serialized data.

Without a schema, JSON needs 136 extra bytes for formatting serialized data.
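We can reproduce the gist of this comparison in Python. The ProtoBuf bytes are the ones shown above; the JSON is minified with json.dumps, so the exact byte counts differ slightly from the pretty-printed version:

```python
import json

# The serialized ProtoBuf message from above
proto_bytes = b'\x08\xd2\t\x12\x03Tim\x1a(\x08\x04\x12\x18Test ProtoBuf for Python\x1a\n31.10.2019'

# The same data as a JSON document
json_doc = {
    "todoList": {
        "ownerId": "1234",
        "ownerName": "Tim",
        "todos": [
            {"state": "TASK_DONE", "task": "Test ProtoBuf for Python", "dueDate": "31.10.2019"}
        ],
    }
}
json_bytes = json.dumps(json_doc, separators=(",", ":")).encode("utf-8")

# Even minified, the JSON representation is noticeably larger
print(len(proto_bytes), len(json_bytes))
```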

If we’re talking about several thousands of messages sent over the network or stored on disk, ProtoBuf can make a difference.

However, there is a catch. The platform Auth0.com created an extensive comparison between ProtoBuf and JSON. It shows that, when compressed, the size difference between the two can be marginal (only around 9%).

If you’re interested in the exact numbers, please refer to the full article, which gives a detailed analysis of several factors like size and speed.

An interesting side note is that each data type has a default value. If attributes are not assigned or changed, they will maintain the default values. In our case, if we don’t change the TaskState of a ListItem, it has the state of “TASK_OPEN” by default. The significant advantage of this is that non-set values are not serialized, saving additional space.

If we, for example, change the state of our task from TASK_DONE to TASK_OPEN, it will not be serialized.

owner_id: 1234
owner_name: "Tim"
todos {
  task: "Test ProtoBuf for Python"
  due_date: "31.10.2019"
}

b'\x08\xd2\t\x12\x03Tim\x1a&\x12\x18Test ProtoBuf for Python\x1a\n31.10.2019'
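Comparing the two byte strings from this article directly shows the omission. \x08\x04 is the encoded state field: the key for field 1 with the varint wire type, followed by the value 4 for TASK_DONE:

```python
# Serialized bytes with state TASK_DONE (from earlier in the article)...
with_state = b'\x08\xd2\t\x12\x03Tim\x1a(\x08\x04\x12\x18Test ProtoBuf for Python\x1a\n31.10.2019'
# ...and with the default state TASK_OPEN (the message directly above)
default_state = b'\x08\xd2\t\x12\x03Tim\x1a&\x12\x18Test ProtoBuf for Python\x1a\n31.10.2019'

# The encoded state field simply disappears when it holds its default
# value, shaving two bytes off the payload
print(b'\x08\x04' in with_state)             # True
print(b'\x08\x04' in default_state)          # False
print(len(with_state) - len(default_state))  # 2
```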

Final Notes

As we have seen, Protocol Buffers are quite handy when it comes to speed and efficiency when working with data. Due to its powerful nature, it can take some time to get used to the ProtoBuf system, even though the syntax for defining new messages is straightforward.

As a last note, I want to point out that there were, and still are, discussions going on about whether Protocol Buffers are “useful” for regular applications. They were developed explicitly for problems Google had in mind. If you have any questions or feedback, feel free to reach out to me on any social media like Twitter or email :)

Translated from: https://www.freecodecamp.org/news/googles-protocol-buffers-in-python/
