TransWikia.com

Algorithm for decoding database like file structure

Reverse Engineering Asked by user1038502 on May 3, 2021

I am working on an automatic decode tool for a database like file structure. The file consists of a header and a body. Information that is known before trying to decode the file: Number of rows and size of the body. The files are generally pretty small. 1 to 35 columns and 1 to to 4000 rows.

The fields can be the following:

  • String : one byte for size, followed by the actual string, not zero terminated
  • StringAcii : one byte for size, followed by the actual string, not zero terminated
  • OptionalString : one byte which can be 0 or 1, if 1, its followed by a string
  • OptionalStringAcii : one byte which can be 0 or 1, if 1, its followed by a string
  • Number : 32 bits
  • Bool : Byte with value 0 or 1

There is also a meta data file for each file which contains the number of columns.

My current approach is this: Compute all possible column types, but exit early if I find a combination that never can be possible while looking at the first row. Eg column 1 = bool, where value is 4.

For each possible column combination that is at least valid for the first row, try to parse the whole table. If no errors and all byte are consumed at the end we have a possible candidate.

This works pretty well, but it is way too slow if a table has 25+ columns.I also often get multiple possible schamas. What would be a better way to address this problem? Anything I can do to reduce the number of possible combinations?

One Answer

I would start by computing all valid column types for the first byte of every message. First start with the smallest message though. That will restrict your possible sequence of fields the most.

Take the intersection across those types. The correct column type for the first column is in there.

For each column type in the intersection, take the data, remove that column from the start and recurse.

I'd think carefully about what is a "valid" field. Can there be 0 valued bytes in a string or an ascii string?

Post a couple of these files. That'll help us look at ways to infer the format.

Another smart thing you could do... for each row, calculate the number of times you can successfully find a location in that row which is a valid field of type t.

Then for each type t, across all rows, find the minimum number of successful application types. For instance if row 1 has 3 strings and 2 bools, and row two has 2 strings and 3 bools, you would search with the assumption that you can only use up to 2 strings and 2 bool fields.

Answered by pythonpython on May 3, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP