
Reverse engineering a partially known binary format

Reverse Engineering · Asked by S. Kalabukha on January 6, 2021

I have files with binary data whose format description is very vague and incomplete. E.g., it states that records start with a header byte, like (hex) FA, followed by a datetime (accurate down to milliseconds) and other data fields, but there is no indication of field lengths, least-significant-bit (LSB) values, or even the byte endianness of the record fields. Overall, the files should represent some sort of message log, and I need to decode them properly into meaningful data.

Given the vagueness, incompleteness, and possible errors (see below) in the format description, my only hope of achieving the goal is a table that I have, which roughly describes what's in the binary files. E.g., I know that some field in a specific file must decode to a value near 2700, another field must be -8.77, etc. There is at most one such reference record per file.

I first read this question, but I'm not sure which of those tools can help in my situation. So I translated my input binaries into text files, simply displaying the initial data in hex representation, all in one big string. Splitting it on the header byte yielded a weird picture in which each record seemed to have a different length in bytes. Further investigation showed that there are more types of headers (I call them sub-headers) than stated in the format description. Also, the first 1-byte field seems to indicate how many additional internal 22-byte blocks of data a record has. This first field is out of place: it should have been the datetime, judging by the format description. So the description is not that accurate or trustworthy, but at least it pushed me (seemingly) in the right direction. A sketch of my length survey follows below.
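For illustration, here is roughly how that survey can be done in Python (log.bin is a placeholder filename, and 0xFA is the header byte taken from the unreliable description). Note that a naive scan like this also matches 0xFA bytes occurring inside record data, which may be part of why the lengths looked so inconsistent:

    # Minimal sketch of the record-length survey described above.
    # Assumptions: 0xFA is the documented header byte and "log.bin" is
    # a placeholder filename. A naive scan like this also matches 0xFA
    # bytes inside record data, so lengths will look noisy.
    from collections import Counter

    data = open("log.bin", "rb").read()

    starts = [i for i in range(len(data)) if data[i] == 0xFA]
    lengths = [b - a for a, b in zip(starts, starts[1:])]
    print(Counter(lengths))  # widely varied lengths hint at sub-headers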

I’m totally new to reverse engineering, so my questions may be rather bad, but please bear with me:

  1. Is my task even possible to do, given the described situation?

  2. If it is, how should I try and find a decoding method? What tools could help find the correct field lengths, LSBs, and semantics (i.e., which data field is which, as I don't trust that format description too much anymore)?

EDIT: Additional information on findings

Here are some examples of internal 22-byte blocks. One of the records has 7 blocks:

0018001E030825411C004303076D000D230000013802
0018002B020B56010C001C030011000D22065D011601
0018003103166A0052001803000A000D22065D011601
00187F7301197440390017030779000D22065D011701
0018002B02230540390019030779000D22065D011E01
00187F7E032578004A0024030009000D22065D012B01
00180038012B2501040028030010000D230000013101

They are prefixed by 'FE070F600710', where '07' says that there are 7 of them, and '0F600710' seems to be repeated in such prefixes throughout the file. Here is an example of a different, 8-block record:

00187F4C020614414E0030030767000D230000012001
00187F4E000669414E0031030767000D230000012301
00180014030E3B004A0028030009000D230000012601
0018002B0110694042001B030778000D230000011C01
00187F620321080052001203000A000D230000011601
0018000B00254440390028030779000D230000012E02
0018001601345C00420018030008000D230000012401
0018002B013923404A0010030777000D230000011E01

As we can see, they all start with '0018', so that may be another sub-header rather than data. That leaves exactly five 4-byte floats, or two 8-byte doubles plus 4 extra bytes.

Some columns of '00's can be seen, and '0D' also seems to repeat in a column pattern. There's a '03' that is always present as well. If we treat these as additional delimiters, fields of 7, 1, 2, and 6 bytes can be guessed, which mostly doesn't look like standard single- or double-precision floats. That's why in the initial statement I assumed the real numbers were encoded as integers with some unknown LSB.
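For anyone who wants to reproduce this, here is a small Python sketch based on my guesses above (the 6-byte prefix layout and the 22-byte block size are assumptions, not confirmed format facts):

    # Sketch: parse one record per the structure guessed above -- a 6-byte
    # prefix 'FE' + <block count> + '0F600710', then <count> 22-byte blocks.
    record = bytes.fromhex(
        "FE070F600710"
        "0018001E030825411C004303076D000D230000013802"
        "0018002B020B56010C001C030011000D22065D011601"
        "0018003103166A0052001803000A000D22065D011601"
        "00187F7301197440390017030779000D22065D011701"
        "0018002B02230540390019030779000D22065D011E01"
        "00187F7E032578004A0024030009000D22065D012B01"
        "00180038012B2501040028030010000D230000013101"
    )

    assert record[0] == 0xFE                          # observed sub-header
    count = record[1]                                 # number of 22-byte blocks
    assert record[2:6] == bytes.fromhex("0F600710")   # repeated constant
    assert len(record) == 6 + 22 * count

    blocks = [record[6 + 22 * i : 6 + 22 * (i + 1)] for i in range(count)]
    for b in blocks:
        print(b.hex().upper())  # columns line up for eyeballing patterns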

3 Answers

Edit:

I'll leave my previous post/edits for historical purposes, but given this comment

Also, I'd like to try solve it myself as much as possible with your help, not you solving it FOR me, pretty much.

I guess I won't continue trying to make progress on the format, though I do have some additional ideas based on my previous observations.

So to directly answer the original 2-part question:

  1. Is my task even possible to do, given the described situation?

It may or may not be possible, depending on what the final goal is, and what resources are available.

If you have enough data samples, with matching knowledge of the inputs that created those samples, then it may be possible to figure out the parts of the format that represent those inputs, if that's all you require. It likely helps that you have the format description, even if it's imprecise or inaccurate.

But if the goal is a complete understanding of the data format (for example, to write an implementation that's 100% compatible), then in my (novice) opinion it's unlikely you will manage that without access to something that reads/writes the files, if for no other reason than you need a way to validate assumptions. It might be possible if you have a large number of data samples with adequate variation across all fields, but I think it would be an uphill struggle, with a high likelihood that your understanding would fall short of 100%.

  2. If it is, how should I try and find a decoding method? What tools could help find correct field length, LSB and semantic (i.e., which data field is which, as I don't trust that format description too much anymore)?

In my opinion, there aren't tools to do this, because this is the human part of reverse engineering. Sure, there are hex editors, and tools like 010 Editor, Kaitai Struct, or binary diff tools can help you do the human part, but actually figuring out what everything represents and how it all fits together isn't (as far as I know) something a tool can do, particularly when you only have data files and not machine code. (There are tools for automated analysis of executable code, but my impression is that data files are a different class of problem.)

Good luck to you, I hope you get it figured out.


Previous:

With the caveat that I'm still a novice with regard to RE, I've made some observations based on the posted samples.

It would be helpful if you could look at the other data samples you have and validate/disprove the assumptions below. I'll make updates as you respond and as I make further progress.

Observations and assumptions so far:

(Byte offsets start from 0)

Bytes 02-03: 16-bit int. Notable is the juxtaposition of small positive values and values near INT16_MAX, with nothing in between. This leads me to wonder whether the original value might have been negative but had its sign bit stripped during a conversion. Alternatively, there wasn't any conversion issue and the data is simply bimodal.
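One quick way to probe that hypothesis (purely an assumption at this point) is to restore the sign bit on the suspiciously large values and see whether the results look plausible:

    import struct

    # Sketch of the "stripped sign bit" hypothesis for bytes 02-03.
    # If a value sits near INT16_MAX, re-set the sign bit and reinterpret
    # it as a signed 16-bit int; e.g. 0x7F4C -> 0xFF4C == -180.
    def restore_sign(raw: bytes) -> int:
        (u,) = struct.unpack(">H", raw)
        if u >= 0x7000:  # heuristic cut-off for "suspiciously large"
            (s,) = struct.unpack(">h", struct.pack(">H", u | 0x8000))
            return s
        return u

    for pair in ("002B", "7F4C", "7F7E"):
        print(pair, "->", restore_sign(bytes.fromhex(pair)))
    # 002B -> 43, 7F4C -> -180, 7F7E -> -130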

Aside: if you can give more detail on what the logs are supposed to represent and/or what is generating the logs, it would be helpful. As would more information on the expected values (e.g., you said "near 2700" and "must be -8.77") and what they represent. In general, context is often helpful. More samples may be helpful as well.

Byte 04: 8-bit int. May represent an enum; values always seem to be in the range 0x00-0x03.

Bytes 05-06: Byte 05 appears to monotonically increase within a group of records. The step is variable, so it's likely not a counter, but it could indicate a timestamp or time offset of some sort. My current thinking is that bytes 05-06 could be "milliseconds since T", where T is a reference time found elsewhere in the file. If the header before the group is supposed to contain a timestamp, then it could be relative to that.

However, the fact that the field is only 16 bits wide would mean a new reference timestamp is needed roughly every minute (a 16-bit millisecond count overflows after about 65 seconds). Do the data samples you have reflect that kind of behavior?
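Here is a small sketch, using three of the posted blocks, that extracts bytes 05-06 as a big-endian 16-bit value and looks at the deltas:

    import struct

    # Check the "bytes 05-06 = time offset" hypothesis: extract the field
    # from consecutive blocks and inspect the deltas.
    blocks = [bytes.fromhex(h) for h in (
        "00187F4C020614414E0030030767000D230000012001",
        "00187F4E000669414E0031030767000D230000012301",
        "00180014030E3B004A0028030009000D230000012601",
    )]

    values = [struct.unpack(">H", b[5:7])[0] for b in blocks]
    deltas = [b - a for a, b in zip(values, values[1:])]
    print(values)  # [1556, 1641, 3643] -- monotonically increasing
    print(deltas)  # [85, 2002] -- variable step, so not a simple counter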

That's all I have for the moment. I'll check back later.

Correct answer by Bill B on January 6, 2021

22 bytes: a simple guess. If each block contained a double-precision float printed as text in scientific notation, X.XXXXXXXXXXXXXXXe+XXX, that would be exactly 22 bytes.
Maybe this is a bit too simple, so can you give us some examples of your 22-byte blocks?

Just a comment after reading the interesting answer from Bill B: there is no byte value greater than 0x7F anywhere in the posted blocks, which seems unlikely if values such as -8.77 were stored as floats, since a negative float's sign bit would set the top bit of its leading byte.
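Both points are easy to sanity-check in Python (just a sketch; the -8.77 is the asker's reference value):

    import struct

    # 1) A double printed in scientific notation with a sign is 22 ASCII
    #    characters, matching the 22-byte block size.
    s = f"{-8.77:.15e}"
    print(s, len(s))  # -8.770000000000000e+00 22

    # 2) Any IEEE-754 encoding of a negative number has its sign bit set,
    #    so its leading (big-endian) byte exceeds 0x7F -- yet no byte in
    #    the posted blocks does.
    print(hex(struct.pack(">f", -8.77)[0]))  # 0xc1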

Answered by Gordon Freeman on January 6, 2021

I'm working on some tooling for automatic reverse engineering.

Having messages of varying length makes it much easier to determine which fields relate to the overall message length. It also makes it much easier to identify the 'header' portion, as it will have a consistent format and precede the variable-length portion.

The more data you have, and the more diverse that data is, the easier it is to infer a format. Many times I've seen datasets generated by holding everything constant and altering only a single value in memory. Those make it easier for humans to spot checksums, but harder for finding general field boundaries.
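As one example of such a heuristic, here is a sketch that measures per-offset value diversity across a few of the posted blocks; offsets with few distinct values look like tags or constants, while runs of high-diversity offsets look like data fields:

    # Sketch of a field-boundary heuristic: per-offset value diversity
    # across blocks. Few distinct values => tag/constant; many => data.
    blocks = [bytes.fromhex(h) for h in (
        "00187F4C020614414E0030030767000D230000012001",
        "00187F4E000669414E0031030767000D230000012301",
        "00180014030E3B004A0028030009000D230000012601",
        "0018002B0110694042001B030778000D230000011C01",
        "00187F620321080052001203000A000D230000011601",
    )]

    diversity = [len({b[i] for b in blocks}) for i in range(22)]
    for offset, n in enumerate(diversity):
        print(f"offset {offset:2d}: {n} distinct value(s)")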

Here's my best guess at the format given the data. It looks like it's big-endian, with byte 3 looking like a tag. The |'s indicate places where there's a heuristic field boundary.

    TTTTTTTT ?? FFFFFFFF | ???? | ?????? | ?????? TTTTTTTT | ??
    --
    00187F4C 02 0614414E | 0030 | 030767 | 000D23 00000120 | 01
    00187F4E 00 0669414E | 0031 | 030767 | 000D23 00000123 | 01
    00180014 03 0E3B004A | 0028 | 030009 | 000D23 00000126 | 01
    0018002B 01 10694042 | 001B | 030778 | 000D23 0000011C | 01
    00187F62 03 21080052 | 0012 | 03000A | 000D23 00000116 | 01
    0018000B 00 25444039 | 0028 | 030779 | 000D23 0000012E | 02
    00180016 01 345C0042 | 0018 | 030008 | 000D23 00000124 | 01
    0018002B 01 3923404A | 0010 | 030777 | 000D23 0000011E | 01
    --
    0 T  BE TIMESTAMP 32
    1 ? UNKNOWN TYPE 1 BYTE(S)
    2 F  BE FLOAT 32
    3 ? UNKNOWN TYPE 2 BYTE(S)
    4 ? UNKNOWN TYPE 3 BYTE(S)
    5 ? UNKNOWN TYPE 3 BYTE(S)
    6 T  BE TIMESTAMP 32
    7 ? UNKNOWN TYPE 1 BYTE(S)

I think there's some sort of sequence in section 4 (likely it's just the last 2 bytes).
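If you want to experiment with this layout, here is a Python sketch that slices a 22-byte block accordingly (all field meanings are unconfirmed guesses):

    import struct

    # Slice one 22-byte block per the guessed layout above (big-endian;
    # all field purposes are unconfirmed).
    def parse_block(b: bytes) -> dict:
        assert len(b) == 22
        return {
            "ts1": struct.unpack(">I", b[0:4])[0],    # BE TIMESTAMP 32?
            "u1":  b[4],                              # unknown, 1 byte (tag?)
            "f":   struct.unpack(">f", b[5:9])[0],    # BE FLOAT 32?
            "u2":  struct.unpack(">H", b[9:11])[0],   # unknown, 2 bytes
            "u3":  b[11:14].hex(),                    # unknown, 3 bytes
            "u4":  b[14:17].hex(),                    # unknown, 3 bytes
            "ts2": struct.unpack(">I", b[17:21])[0],  # BE TIMESTAMP 32?
            "u5":  b[21],                             # unknown, 1 byte
        }

    block = bytes.fromhex("00187F4C020614414E0030030767000D230000012001")
    print(parse_block(block))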

Answered by pythonpython on January 6, 2021
