I am looking for a scripting (or higher level programming) language (or e.g. modules for Python or similar languages) for effortlessly analyzing and manipulating binary data in files (e.g. core dumps), much like Perl allows manipulating text files very smoothly.


Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), converting data from one endianness to another, etc. That is, things you would normally use C or assembly for, but I'm looking for a language that allows writing tiny pieces of code for highly specific, one-time purposes very quickly.


Any suggestions?

Well, while it may seem counter-intuitive, I found Erlang extremely well suited for this, namely due to its powerful support for pattern matching, even on bytes and bits (called "Erlang Bit Syntax"). This makes it very easy to write even quite advanced programs that inspect and manipulate data at the byte level and even at the bit level:


Since 2001, the functional language Erlang comes with a byte-oriented datatype (called binary) and with constructs to do pattern matching on a binary.


And to quote informIT.com:


(Erlang) Pattern matching really starts to get fun when combined with the binary type. Consider an application that receives packets from a network and then processes them. The four bytes in a packet might be a network byte-order packet type identifier. In Erlang, you would just need a single processPacket function that could convert this into a data structure for internal processing. It would look something like this:


processPacket(<<1:32/big, RestOfPacket/binary>>) ->
    % Process type one packets
    {type_one, RestOfPacket};
processPacket(<<2:32/big, RestOfPacket/binary>>) ->
    % Process type two packets
    {type_two, RestOfPacket}.

So Erlang, with its built-in support for pattern matching and its functional style, is pretty expressive. See, for example, this implementation of uuencode in Erlang:


uuencode(BitStr) ->
    << <<(X+32):8>> || <<X:6>> <= BitStr >>.
uudecode(Text) ->
    << <<(X-32):6>> || <<X:8>> <= Text >>.

For an introduction, see Bit-level Binaries and Generalized Comprehensions in Erlang. You may also want to check out some of the following pointers:


  Parsing Binaries with erlang, lamers inside
  More File Processing with Erlang
  Learning Erlang and Adobe Flash format same time
  Large Binary Data is (not) a Weakness of Erlang
  Programming Efficiently with Binaries and Bit Strings
  Erlang bit syntax and network programming
  erlang, the language for network programming (1)
  Erlang, the language for network programming Issue 2: binary pattern matching
  An Erlang MIDI File Reader/Writer
  Erlang Bit Syntax
  Comprehending endianness
  Playing with Erlang
  Erlang: Pattern Matching Declarations vs Case Statements/Other
  A Stream Library using Erlang Binaries
  Bit-level Binaries and Generalized Comprehensions in Erlang
  Applications, Implementation and Performance Evaluation of Bit Stream Programming in Erlang


Perl's pack and unpack?



The Python bitstring module was written for this purpose. It lets you take arbitrary slices of binary data and offers a number of different interpretations through Python properties. It also gives plenty of tools for constructing and modifying binary data.

For example:

>>> from bitstring import BitArray, ConstBitStream
>>> s = BitArray('0x00cf')                           # 16 bits long
>>> print(s.hex, s.bin, s.int)                       # Some different views
00cf 0000000011001111 207
>>> s[2:5] = '0b001100001'                           # slice assignment
>>> s.replace('0b110', '0x345')                      # find and replace
2                                                    # 2 replacements made
>>> s.prepend([1])                                   # Add 1 bit to the start
>>> s.byteswap()                                     # Byte reversal
>>> ordinary_string = s.bytes                        # Back to Python string

There are also functions for bit-wise reading and navigation in the bitstring, much like in files; in fact this can be done straight from a file without reading it into memory:


>>> s = ConstBitStream(filename='somefile.ext')
>>> hex_code, a, b = s.readlist('hex:32, uint:7, uint:13')
>>> s.find('0x0001')         # Seek to next occurrence, if found

There are also views with different endiannesses as well as the ability to swap endianness and much more - take a look at the manual.


Take a look at python bitstring, it looks like exactly what you want :)


I use 010 Editor to view binary files all the time. It's especially geared to work with binary files.

It has an easy-to-use C-like scripting language to parse binary files and present them in a very readable way (as a tree, fields coded by color, stuff like that). There are some example scripts to parse ZIP and BMP files.


Whenever I create a binary file format, I always make a little script for 010 Editor to view the files. If you've got some header files with some structs, making a reader for binary files is a matter of minutes.



Any high-level programming language with pack/unpack functions will do. Perl, Python and Ruby can all do it; it's a matter of personal preference. I wrote a bit of binary parsing in each of these and felt that Ruby was the easiest and most elegant for this task.
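
To give a rough idea of what the pack/unpack approach looks like, here is a minimal Python sketch using the standard struct module. The header layout (a 4-byte magic value, a 16-bit version, a 32-bit length) and the names HEADER_FORMAT, parse_header and swap_endianness_u32 are made up for this example and do not describe any real file format.

```python
import struct

# Hypothetical header layout, invented for illustration only:
# 4-byte magic, unsigned 16-bit version, unsigned 32-bit payload length.
# ">" selects big-endian with no padding; "<" would select little-endian.
HEADER_FORMAT = ">4sHI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def parse_header(data):
    """Unpack the hypothetical header from the start of a bytes object."""
    magic, version, length = struct.unpack_from(HEADER_FORMAT, data, 0)
    return {"magic": magic, "version": version, "length": length}


def swap_endianness_u32(value):
    """Return a 32-bit unsigned integer with its four bytes reversed."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


if __name__ == "__main__":
    # Build a sample record in memory instead of reading a real file.
    raw = struct.pack(HEADER_FORMAT, b"DUMP", 2, 0x12345678) + b"payload"
    header = parse_header(raw)
    print(header)
    print(hex(swap_endianness_u32(header["length"])))
```

Perl's pack/unpack and Ruby's Array#pack/String#unpack follow the same idea of describing the layout with a short format string.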


Why not use a C interpreter? I always used them to experiment with snippets, but you could use one to script something like you describe without too much trouble.


I have always liked EiC. It was dead, but the project has been resurrected lately. EiC is surprisingly capable and reasonably quick. There is also CINT. Both can be compiled for different platforms, though I think CINT needs Cygwin on Windows.


Python's standard library has some of what you require -- the array module in particular lets you easily read parts of binary files, swap endianness, etc; the struct module allows for finer-grained interpretation of binary strings. However, neither is quite as rich as you require: for example, to present the same data as bytes or halfwords, you need to copy it between two arrays (the numpy third-party add-on is much more powerful for interpreting the same area of memory in several different ways), and, for example, to display some bytes in hex there's nothing much "bundled" beyond a simple loop or list comprehension such as [hex(b) for b in thebytes[start:stop]]. I suspect there are reusable third-party modules to facilitate such tasks yet further, but I can't point you to one...
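
As a small sketch of the standard-library route described above, the following uses only the array module plus a comprehension for hex output. The sample bytes and the helper names hex_dump and as_halfwords are invented for illustration; a real script would read the bytes from a file opened in binary mode.

```python
import array
import sys

# Sample data standing in for a chunk read from a binary file.
data = bytes([0x00, 0x01, 0x02, 0x03, 0xCA, 0xFE, 0xBA, 0xBE])


def hex_dump(chunk):
    """Return the bytes of chunk as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in chunk)


def as_halfwords(chunk, byteorder="big"):
    """Copy chunk into unsigned 16-bit values read in the given byte order.

    Assumes type code "H" is 2 bytes wide, which holds on common platforms,
    and that len(chunk) is a multiple of 2.
    """
    halfwords = array.array("H")
    halfwords.frombytes(chunk)      # interpreted in the machine's native order
    if byteorder != sys.byteorder:
        halfwords.byteswap()        # swap in place to the requested order
    return list(halfwords)


if __name__ == "__main__":
    print(hex_dump(data[4:8]))
    print([hex(h) for h in as_halfwords(data[4:8], "big")])
    print([hex(h) for h in as_halfwords(data[4:8], "little")])
```

Note that as_halfwords copies the data, which is the limitation mentioned above; numpy can reinterpret the same buffer without copying.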


Forth can also be pretty good at this, but it's a bit arcane.



Well, if speed is not a consideration, and you want Perl, then translate each line of binary into a line of chars: 0s and 1s. Yes, I know there are no linefeeds in binary :) but presumably you have some fixed size, e.g. a byte or some other unit, with which you can break up the binary blob.

Then just use Perl's string processing on that data :)
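
The same trick is not specific to Perl. Here is a rough Python sketch of the idea: turn the bytes into a string of '0' and '1' characters, then use ordinary string and regex tools on it. The function names to_bit_string, from_bit_string and find_bit_pattern are made up for this example, and the approach is slow and memory-hungry (one character per bit), so it only suits small inputs.

```python
import re


def to_bit_string(data):
    """Render each byte as eight '0'/'1' characters, most significant bit first."""
    return "".join(f"{byte:08b}" for byte in data)


def from_bit_string(bits):
    """Convert a '0'/'1' string back to bytes; its length must be a multiple of 8."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def find_bit_pattern(data, pattern):
    """Return the bit offsets of every match of a '0'/'1' regex, overlaps included."""
    bits = to_bit_string(data)
    # The lookahead makes matches zero-width, so overlapping hits are reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", bits)]


if __name__ == "__main__":
    blob = b"\x00\xcf"
    print(to_bit_string(blob))
    print(find_bit_pattern(blob, "110"))
    print(from_bit_string(to_bit_string(blob)) == blob)
```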



If you're doing binary level processing, it is very low level and likely needs to be very efficient and have minimal dependencies/install requirements.


So I would go with C - handles bytes well - and you can probably google for some library packages that handle bytes.

Going with something like Erlang introduces inefficiencies, dependencies, and other baggage you probably don't want with a low-level library.


