
Data Compression and Codec Optimization


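Types of Encoding

Encoding, in the cognitive sense, is a basic perceptual process of interpreting incoming stimuli. Technically, it is a complex, multi-stage conversion from relatively objective sensory input (such as light or sound) to subjectively meaningful experience.

  Character encoding is a set of rules that pairs a collection of characters of a natural language (such as an alphabet or a syllabary) with a collection of something else (such as numbers or electrical pulses).

  Text encoding uses a markup language to tag the structure and other features of a text so that a computer can process it more easily.

  Semantics encoding: encoding formal language A in formal language B is a way of expressing every term of language A (such as programs or descriptions) in language B.

  Electronic encoding converts a signal into a code that has been optimized for transmission or storage. The conversion is usually done by a codec.

  Neural encoding refers to the way information is represented in neurons.

  Memory encoding is the process of converting sensations into memories.

  Encryption is the process of transforming information in order to keep it secret.

  Transcoding is the process of converting an encoding from one format to another.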

 

 

Reducing redundancy in a data stream: count the high-frequency words and replace them
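A minimal sketch of the word-replacement idea follows. The function names, the `#n` token format, and the assumption that words are separated by single spaces and never start with `#` are all hypothetical choices for the example, not part of any particular compressor.

```python
from collections import Counter

def build_codebook(text, top_n=2):
    """Map the top_n most frequent words to short placeholder tokens."""
    counts = Counter(text.split())
    return {word: f"#{i}" for i, (word, _) in enumerate(counts.most_common(top_n))}

def substitute(text, codebook):
    """Replace every word that has a codebook entry with its token."""
    return " ".join(codebook.get(word, word) for word in text.split())

def restore(text, codebook):
    """Invert substitute(); assumes no original word looks like a token."""
    reverse = {token: word for word, token in codebook.items()}
    return " ".join(reverse.get(word, word) for word in text.split())

sample = "the cat and the dog and the bird"
codebook = build_codebook(sample)
encoded = substitute(sample, codebook)
print(encoded)
print(restore(encoded, codebook) == sample)
```

The codebook has to travel with the encoded text, so the saving only pays off when the frequent words repeat often enough to cover that overhead.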

 

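A Mathematical Game

Designing a concrete compression algorithm is often more like a mathematical game. The developer first has to find a way to count or estimate, as accurately as possible, the probability of each symbol appearing in the message, and then has to design a set of coding rules that describes each symbol with the shortest possible code. Statistics serves the first task well: so far, people have implemented static models, semi-static models, adaptive models, Markov models, prediction by partial matching (PPM) models, and other probabilistic models. By comparison, the development of coding methods has followed a more winding path.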
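The link between symbol probabilities and code length can be made concrete with Shannon entropy, which gives the average number of bits per symbol that an ideal code would need under a given model. The sketch below assumes the simplest possible model, a plain frequency count over the whole input; the function names are hypothetical.

```python
import math
from collections import Counter

def symbol_probabilities(data):
    """Estimate each symbol's probability from its frequency in data."""
    counts = Counter(data)
    total = len(data)
    return {symbol: n / total for symbol, n in counts.items()}

def entropy_bits_per_symbol(data):
    """Average ideal code length in bits per symbol under the frequency model."""
    probs = symbol_probabilities(data)
    return -sum(p * math.log2(p) for p in probs.values())

print(entropy_bits_per_symbol("aabb"))
print(entropy_bits_per_symbol("abcd"))
```

A better model assigns higher probabilities to the symbols that actually occur, which lowers this bound; the coder's job is then to get as close to the bound as it can.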

 

Huffman coding: "A Method for the Construction of Minimum-Redundancy Codes" (David A. Huffman, 1952)
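As a rough sketch of how Huffman codes are built, the following repeatedly merges the two least frequent subtrees and prefixes a bit to every code in each. The function name and the dictionary-per-subtree representation are choices made for brevity here, not taken from the paper.

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Return {symbol: bitstring} by repeatedly merging the two rarest subtrees."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}); the unique
    # tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(weight, i, {symbol: ""}) for i, (symbol, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
print(codes)
```

Frequent symbols end up near the root and get short codes, and because every symbol sits at a leaf, no code is a prefix of another.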

 

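A Tale of Outsiders

Thinking in reverse has always been a winning move in science and technology. While most people were racking their brains to improve Huffman or arithmetic coding, hoping for a "perfect" code that balanced speed and compression ratio, Jacob Ziv and Abraham Lempel took a different route and created a family of compression algorithms that is more effective than Huffman coding and faster than arithmetic coding: the LZ family.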

In chronological order, the LZ family developed roughly as follows:

 

1977: "A Universal Algorithm for Sequential Data Compression" (LZ77)

1978: "Compression of Individual Sequences via Variable-Rate Coding" (LZ78)

1984: "A Technique for High-Performance Data Compression" (the LZW algorithm)

 

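New Technical Features

Oracle9i introduced table compression, but with heavy restrictions: only data loaded by bulk operations (direct-path loads, CTAS, and so on) could be compressed, and data written by ordinary DML could not. Presumably the problem of compressing on write had not been solved; it lingered until Oracle11g, which finally addressed the write performance of relational data compression. Oracle's table compression works at the block level, and the main technique is much the same as in Oracle9i: a symbol table is added to the block, and values repeated within the block are represented by a single symbol-table entry. Oracle compresses a block in batches rather than on every write to the block, which keeps the impact of compression on DML performance as low as possible. This suggests that there should be a new block-level parameter that triggers compression once the amount of uncompressed data in a block reaches some threshold.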
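The block-level symbol table and batched compression described above can be pictured with a toy model. This is purely illustrative and is not Oracle's implementation: the `Block` class, its fields, and the `compress_threshold` parameter are hypothetical names invented for the sketch.

```python
class Block:
    """Toy block: rows are buffered uncompressed and compressed in batches."""

    def __init__(self, compress_threshold=4):
        self.symbols = []        # symbol table: each distinct value stored once
        self.symbol_index = {}   # value -> position in the symbol table
        self.compressed = []     # rows stored as tuples of symbol-table indexes
        self.pending = []        # rows written since the last batch compression
        self.compress_threshold = compress_threshold  # hypothetical knob

    def insert(self, row):
        """Ordinary write: cheap append, compress only when the batch is full."""
        self.pending.append(tuple(row))
        if len(self.pending) >= self.compress_threshold:
            self._compress_pending()

    def _compress_pending(self):
        for row in self.pending:
            refs = []
            for value in row:
                if value not in self.symbol_index:
                    self.symbol_index[value] = len(self.symbols)
                    self.symbols.append(value)
                refs.append(self.symbol_index[value])
            self.compressed.append(tuple(refs))
        self.pending.clear()

    def rows(self):
        """Return all rows, expanding compressed ones through the symbol table."""
        decoded = [tuple(self.symbols[i] for i in refs) for refs in self.compressed]
        return decoded + list(self.pending)

block = Block(compress_threshold=2)
block.insert(("CN", "Beijing"))
block.insert(("CN", "Shanghai"))
print(block.symbols, block.compressed)
```

Most writes pay only for an append; the cost of building symbol-table references is paid once per batch, which is the trade-off the paragraph attributes to the batched design.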

 

LZW

http://marknelson.us/1989/10/01/lzw-data-compression/

The routines shown here belong in any programmer's toolbox. For example, a program that has a few dozen help screens could easily chop 50K bytes off by compressing the screens. Or 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed down to 10% of their original size. Once the tools are available, the applications for compression will show up on a regular basis.


LZW Fundamentals

The algorithm is surprisingly simple. In a nutshell, LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.

The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit codes. This means codes 0-255 refer to individual bytes, while codes 256-4095 refer to substrings.
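Since 12-bit codes do not line up with byte boundaries, the output stage has to pack them. The sketch below shows one way to do that; the function name and the most-significant-bit-first order are assumptions for the example and may differ from the article's sample program.

```python
def pack_codes(codes, width=12):
    """Pack fixed-width integer codes into bytes, most significant bit first."""
    buffer = 0   # bits waiting to be written
    bits = 0     # number of valid bits currently in buffer
    out = bytearray()
    for code in codes:
        if not 0 <= code < (1 << width):
            raise ValueError(f"code {code} does not fit in {width} bits")
        buffer = (buffer << width) | code
        bits += width
        while bits >= 8:
            bits -= 8
            out.append((buffer >> bits) & 0xFF)
        buffer &= (1 << bits) - 1   # keep only the bits not yet written
    if bits:
        # Pad the final partial byte with zero bits on the right.
        out.append((buffer << (8 - bits)) & 0xFF)
    return bytes(out)

print(pack_codes([0x041, 0x100]).hex())
print((1 << 12) - 256, "codes are available for strings")
```

With a 12-bit width there are 4096 codes in total, so 3840 of them (256 through 4095) are left for strings once the single-byte codes are reserved.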

 


Compression

The LZW compression algorithm in its simplest form is shown in Figure 1. A quick examination of the algorithm shows that LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table.

Routine LZW_COMPRESS

  STRING = get input character
  WHILE there are still input characters DO
      CHARACTER = get input character
      IF STRING+CHARACTER is in the string table THEN
          STRING = STRING+CHARACTER
      ELSE
          output the code for STRING
          add STRING+CHARACTER to the string table
          STRING = CHARACTER
      END of IF
  END of WHILE
  output the code for STRING

 

 


The Compression Algorithm
Figure 1
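The routine in Figure 1 translates almost line for line into Python. The sketch below is one possible rendering, not the article's sample program: the function name is hypothetical, and the choice to simply stop adding strings once code 4095 has been assigned is an assumption, since the figure does not say what happens when the table is full.

```python
def lzw_compress(data, max_code=4095):
    """Return the list of integer codes for a bytes input, following Figure 1."""
    # Codes 0-255 are preassigned to the single-byte strings.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    if not data:
        return codes
    string = data[:1]
    for i in range(1, len(data)):
        character = data[i:i + 1]
        candidate = string + character
        if candidate in table:
            string = candidate
        else:
            codes.append(table[string])
            if next_code <= max_code:   # assumption: freeze the table when full
                table[candidate] = next_code
                next_code += 1
            string = character
    codes.append(table[string])
    return codes

print(lzw_compress(b"/WED/WE/WEE/WEB/WET"))
```

Values below 256 in the result are literal bytes, and values from 256 upward refer to strings that were added to the table during compression.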

 

A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the '/' character. Stepping through the start of the algorithm for this string, you can see that on the first pass through the loop, a check is performed to see if the string "/W" is in the table. Since it isn't, the code for '/' is output, and the string "/W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, 'E', has been read in, the second string code, "WE", is added to the table, and the code for letter 'W' is output. This continues until, in the second word, the characters '/' and 'W' are read in, matching string number 256. In this case, the code 256 is output, and a three-character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.

 

 

 

 


 

Input String = /WED/WE/WEE/WEB/WET

Character Input   Code Output   New code value   New String
/W                /             256              /W
E                 W             257              WE
D                 E             258              ED
/                 D             259              D/
WE                256           260              /WE
/                 E             261              E/
WEE               260           262              /WEE
/W                261           263              E/W
EB                257           264              WEB
/                 B             265              B/
WET               260           266              /WET
EOF               T

 

 

The Compression Process
Figure 2

The sample output for the string is shown in Figure 2 along with the resulting string table. As can be seen, the string table fills up rapidly, since a new string is added to the table each time a code is output. In this highly redundant input, 5 code substitutions were output, along with 7 characters. If we were using 9 bit codes for output, the 19 character input string would be reduced to a 13.5 byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn't begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.
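The figures quoted above can be checked with a few lines of arithmetic. The list below is copied from the Code Output column of Figure 2, with integers standing for string codes and one-character strings standing for literal characters; the variable names are just for this example.

```python
# Code Output column of Figure 2, in order.
outputs = ["/", "W", "E", "D", 256, "E", 260, 261, 257, "B", 260, "T"]

substitutions = sum(1 for item in outputs if isinstance(item, int))
literals = len(outputs) - substitutions

code_width_bits = 9
compressed_bytes = len(outputs) * code_width_bits / 8
original_bytes = len("/WED/WE/WEE/WEB/WET")

print(substitutions, "code substitutions and", literals, "characters")
print(original_bytes, "bytes in,", compressed_bytes, "bytes out")
```

Twelve 9-bit codes come to 108 bits, which is where the 13.5 byte figure for the 19 character input comes from.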

 

 

Dictionary

http://www.codeproject.com/KB/recipes/Patterns.aspx

 

 

