TransWikia.com

Are there existing automated approaches to reverse engineering the data types in a binary data stream?

Reverse Engineering Asked by J. Tylka on February 20, 2021

Consider a stream of data packets of a known and consistent size, i.e., N bytes per packet.
Are there existing tools that automatically detect (or estimate) the various data types present and their arrangement in the packet?
My goal is to convert the data stream into a handful of time-series data signals by

  1. deducing the arrangement and types of the data in the stream and
  2. extracting each variable into its own signal array.

For example, the packet might consist of:

[double double int32 single int8 int8 int8 int8]

but all I know is that the packet is 28 bytes long.
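
To make the goal concrete: if the layout were already known, step 2 would be a few lines with Python's standard `struct` module. The sketch below assumes little-endian byte order and no padding between fields, neither of which is stated in the question, and the name `split_into_signals` is purely illustrative.

```python
import struct

# Assumed layout from the example: little-endian ("<"), no padding.
# d = double, i = int32, f = single, b = int8
PACKET_FORMAT = "<ddifbbbb"
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)  # 8+8+4+4+1+1+1+1 = 28 bytes


def split_into_signals(stream: bytes, fmt: str = PACKET_FORMAT) -> list:
    """Return one list per field, holding that field's value in every packet."""
    size = struct.calcsize(fmt)
    usable = len(stream) - len(stream) % size  # ignore a trailing partial packet
    packets = struct.iter_unpack(fmt, stream[:usable])
    # zip(*packets) turns a sequence of rows (packets) into columns (signals).
    return [list(column) for column in zip(*packets)]
```

The hard part of the question is recovering `PACKET_FORMAT` when only `PACKET_SIZE` is known.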
Let’s assume that the only possible data types in the stream are: double, single, int32, int16, or uint8.
(I don’t particularly care if I can disambiguate between char, int8, or uint8.)
Let’s also assume everything is stored in bytes (no single bit flags or anything).
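
Under these assumptions, the size of the search space can be counted directly. The following is a minimal sketch that counts the ordered type sequences filling a packet exactly. The names `TYPE_SIZES` and `count_layouts` are illustrative, and `single` and `int32` are counted as distinct types even though both are 4 bytes wide.

```python
from functools import lru_cache

# Byte widths of the candidate types named above.
TYPE_SIZES = {"double": 8, "single": 4, "int32": 4, "int16": 2, "uint8": 1}


@lru_cache(maxsize=None)
def count_layouts(n_bytes: int) -> int:
    """Number of ordered type sequences that fill exactly n_bytes."""
    if n_bytes == 0:
        return 1  # the empty layout
    # The first field can be any type that fits; the rest fills what remains.
    return sum(
        count_layouts(n_bytes - size)
        for size in TYPE_SIZES.values()
        if size <= n_bytes
    )
```

For a 28-byte packet, `count_layouts(28)` is well over a million, so any exhaustive search needs strong pruning.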

Some ideas

What I’ve tried so far is to exhaustively enumerate every possible combination and arrangement of data types and then apply heuristics to eliminate candidates.
For example, I’ll cast each set of 8 bytes as a double, shifting over 1 byte at a time, and then compute the variance of the resulting signal.
If the variance of the signal is >1e100, then we can probably safely say we’ve misidentified the type.
This approach runs into two problems:

  1. the heuristics are hard coded and not dependent on the data
  2. the algorithm can easily mis-classify things, e.g., it can’t necessarily differentiate between an int32 and a pair of int16.
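
The second problem is easy to reproduce. In the small illustration below, the same four bytes decode without error either as a pair of int16 values or as one int32, and both readings look plausible. The byte values are made up and little-endian order is assumed.

```python
import struct

# Four bytes from a hypothetical packet (little-endian assumed).
raw = struct.pack("<hh", 1200, -3)

as_pair = struct.unpack("<hh", raw)     # two int16 values: (1200, -3)
as_int32 = struct.unpack("<i", raw)[0]  # one int32 value: -195408

print(as_pair, as_int32)
```

Nothing in a single packet distinguishes the two readings. Any disambiguation has to come from how the values behave across many packets.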

I’m also aware of these two related questions: How to analyze binary file? and Tools to help reverse engineer binary file formats,
but these answers seem to only give manual tools that require the user to then play around with the data and make some guesses about its configuration.
So, I’m particularly interested in automatic approaches or tools for this job.

This seems related to this question and its answer, but the methods mentioned there seem to be aimed at a more general task of inferring the communication protocol, and not necessarily(?) the datatypes of the packet’s payload, so I’m not sure how suitable these programs are to the task I’ve described.

One Answer

IMO there is not much to add to the linked answer (even though it's focused on network protocols, your task sounds pretty similar).

One more thing you can try is to look at papers citing the mentioned tools to find more/related approaches or tools.

Answered by Igor Skochinsky on February 20, 2021
