LZ77 Algorithm
2014-03-14 08:59
LZ77 and LZ78 are the two lossless
data compression algorithms published in papers by Abraham
Lempel and Jacob Ziv in 1977[1] and
1978.[2] They
are also known as LZ1 and LZ2 respectively.[3] These
two algorithms form the basis for many variations including LZW, LZSS, LZMA and
others. Besides their academic influence, these algorithms formed the basis of several ubiquitous compression schemes, including GIF and
the DEFLATE algorithm used in PNG.
They are both theoretically dictionary coders. LZ77 maintains a sliding
window during compression. This was later shown to be equivalent to the explicit dictionary constructed by LZ78—however, they are only equivalent when the entire data is intended to be decompressed. LZ78 decompression
allows random access to the input as long as the entire dictionary is available, while
LZ77 decompression must always start at the beginning of the input.
The algorithms were named an IEEE Milestone in 2004.[4]
Theoretical efficiency
In the second of the two papers that introduced these algorithms, they are analyzed as encoders defined by finite-state machines. A measure analogous to information entropy is developed for individual sequences (as opposed to probabilistic ensembles). This measure gives a bound on the compression ratio that can be achieved. It is then shown that there exist finite lossless encoders for every sequence that achieve this bound as the length of the sequence grows to infinity. In this sense an algorithm based
on this scheme produces asymptotically optimal encodings. This result can be proved more directly, as for example in notes by Peter Shor.[5]
LZ77
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the characters exactly distance characters behind it in the uncompressed stream". (The "distance"
is sometimes called the "offset" instead.)
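Applying such a pair is simply a forward copy from `distance` bytes back in the already-decoded data. A minimal Python sketch (the function name is illustrative, not from any real codec):

```python
def apply_pair(output: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes, each copied from `distance` bytes back.

    This is the literal reading of a length-distance pair: each of the
    next `length` bytes equals the byte exactly `distance` bytes behind
    it in the uncompressed stream.
    """
    start = len(output) - distance
    for i in range(length):
        output.append(output[start + i])

buf = bytearray(b"abcabc")
apply_pair(buf, 3, 3)   # the next 3 bytes equal the bytes 3 back
# buf is now b"abcabcabc"
```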
To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes
called sliding window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. The larger the sliding window is, the longer back
the encoder may search for creating references.
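How the window bounds the search can be sketched as a naive linear scan (the function name and the toy `window` default are illustrative; real encoders use much larger windows and faster data structures):

```python
def longest_match(data: bytes, pos: int, window: int = 16) -> tuple[int, int]:
    """Return (distance, length) of the longest match for data[pos:]
    that starts within the last `window` bytes before `pos`."""
    best_dist, best_len = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return best_dist, best_len

# A repetition further back than `window` bytes cannot be referenced:
print(longest_match(b"abcabcab", 3))            # (3, 5)
print(longest_match(b"abcabcab", 3, window=1))  # (0, 0) -- "abc" is out of reach
```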
It is not only acceptable but frequently useful to allow length-distance pairs to specify a length that actually exceeds the distance. As a copy command, this is puzzling: "Go back four characters and copy ten characters
from that position into the current position". How can ten characters be copied over when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as a byte is copied over, it may be fed again
as input to the copy command. When the copy-from position makes it to the initial destination position, it is consequently fed data that was pasted from the beginning of the copy-from position. The operation is thus equivalent
to the statement "copy the data you were given and repetitively paste it until it fits". As this type of pair repeats a single copy of data multiple times, it can be used to incorporate a flexible and easy form of run-length
encoding.
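The byte-at-a-time copy described above can be demonstrated directly; this hypothetical helper reproduces the "go back four, copy ten" example:

```python
def copy_back(buf: bytearray, distance: int, length: int) -> None:
    # Byte at a time: a byte written by this copy is immediately
    # available as a source for the following iterations, so
    # `length` may exceed `distance` without any special handling.
    for _ in range(length):
        buf.append(buf[-distance])

data = bytearray(b"xabcd")
copy_back(data, 4, 10)     # "go back 4 characters and copy 10"
# data is now b"xabcdabcdabcdab": "abcd" pasted repeatedly until it fits
```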
Another way to see things is as follows: While encoding, for the search pointer to continue finding matched pairs past the end of the search window, all characters from the first match at offset D and forward to the end of the search window must
have matched input, and these are the (previously seen) characters that comprise a single run unit of length LR, which must equal D. Then as the search pointer proceeds past the search window and forward, as far as the run
pattern repeats in the input, the search and input pointers will be in sync and match characters until the run pattern is interrupted. Then L characters have been matched in total, L>D, and the code is [D,L,c].
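A toy greedy encoder emitting codes in the [D, L, c] style described here might look as follows (illustrative only: production encoders use hash chains or trees rather than this linear scan, and a real bitstream format rather than Python tuples). Note that the match comparison runs past the current position, so L can exceed D, which is exactly how runs get encoded:

```python
def lz77_encode(data: bytes, window: int = 255) -> list[tuple[int, int, int]]:
    """Greedy toy encoder emitting (distance, length, next_char) triples."""
    out, pos = [], 0
    while pos < len(data):
        best_dist, best_len = 0, 0
        for cand in range(max(0, pos - window), pos):
            length = 0
            # The match may extend past `pos` itself, so L can exceed D.
            while (pos + length < len(data) - 1
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = pos - cand, length
        out.append((best_dist, best_len, data[pos + best_len]))
        pos += best_len + 1
    return out

print(lz77_encode(b"aaaa"))   # [(0, 0, 97), (1, 2, 97)] -- L=2 > D=1 encodes the run
```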
Upon decoding [D,L,c], again, D=LR. When the first LR characters are read to the output, this corresponds to a single run unit appended to the output buffer. At this
point, the read pointer could be thought of as only needing to return ⌈L/LR⌉ times (that is, int(L/LR), plus one more when L mod LR ≠ 0) to the start of that single buffered run unit, read
LR characters (or maybe fewer on the last return), and repeat until a total of L characters are read. But mirroring the encoding process, since the pattern is repetitive, the read pointer need only trail in sync
with the write pointer by a fixed distance equal to the run length LR until L characters have been copied to output in total.
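That trailing read pointer is just a copy loop that always reads D bytes behind the write position; a minimal decoder sketch for [D, L, c] triples (the tuple format mirrors the text; the helper name is illustrative):

```python
def lz77_decode(triples) -> bytes:
    """Decode a sequence of (distance, length, next_char) triples."""
    out = bytearray()
    for dist, length, ch in triples:
        for _ in range(length):
            # The read position trails the write position by `dist`
            # bytes, so run patterns (length > dist) regenerate
            # themselves from the bytes just written.
            out.append(out[-dist])
        out.append(ch)
    return bytes(out)

# (0,0,'a') emits a literal 'a'; (1,2,'a') copies it twice, then a literal:
print(lz77_decode([(0, 0, 97), (1, 2, 97)]))  # b'aaaa'
```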
Considering the above, especially if the compression of data runs is expected to predominate, the window search should begin at the end of the window and proceed backwards, since run patterns, if they exist, will be found first and allow the search to terminate early: absolutely, if the current maximum matching sequence length is reached, or judiciously, if a sufficient length is reached; more recent data is also likely to correlate better with the next input.
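The backward, nearest-first search with early termination can be sketched as follows (a naive scan; the function name and the `max_len` cutoff are illustrative):

```python
def find_match_backwards(data: bytes, pos: int, window: int, max_len: int):
    """Scan candidates from nearest (end of window) to farthest,
    stopping as soon as a match of `max_len` is found -- so adjacent
    run patterns, when present, terminate the search immediately."""
    best_dist, best_len = 0, 0
    for cand in range(pos - 1, max(0, pos - window) - 1, -1):
        length = 0
        while (length < max_len and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:          # ties keep the nearer candidate
            best_dist, best_len = pos - cand, length
            if best_len == max_len:
                break                  # absolute termination: cannot do better
    return best_dist, best_len

# The run "ab" repeating at distance 2 is found and capped at max_len:
print(find_match_backwards(b"ababab", 2, 16, 4))  # (2, 4)
```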