Turning an unknown audio data stream into wav or similar format

Reverse Engineering Asked by user6916458 on April 14, 2021

I am trying to extract the commentary (the casters’ voices) from a Dota 2 game file. I’ve managed to parse the game file and select what I believe is the voice data. It arrives in an unfamiliar message type (CSVCMsg_VoiceData), which has the following struct:

type CSVCMsg_VoiceData struct {
    Client                   *int32            `protobuf:"varint,1,opt,name=client" json:"client,omitempty"`
    Proximity                *bool             `protobuf:"varint,2,opt,name=proximity" json:"proximity,omitempty"`
    Xuid                     *uint64           `protobuf:"fixed64,3,opt,name=xuid" json:"xuid,omitempty"`
    AudibleMask              *int32            `protobuf:"varint,4,opt,name=audible_mask" json:"audible_mask,omitempty"`
    VoiceData                []byte            `protobuf:"bytes,5,opt,name=voice_data" json:"voice_data,omitempty"`
    Caster                   *bool             `protobuf:"varint,6,opt,name=caster" json:"caster,omitempty"`
    Format                   *VoiceDataFormatT `protobuf:"varint,7,opt,name=format,enum=VoiceDataFormatT,def=1" json:"format,omitempty"`
    SequenceBytes            *int32            `protobuf:"varint,8,opt,name=sequence_bytes" json:"sequence_bytes,omitempty"`
    SectionNumber            *uint32           `protobuf:"varint,9,opt,name=section_number" json:"section_number,omitempty"`
    UncompressedSampleOffset *uint32           `protobuf:"varint,10,opt,name=uncompressed_sample_offset" json:"uncompressed_sample_offset,omitempty"`
    XXX_unrecognized         []byte            `json:"-"`
}

This seems to work when reading the data. Logically I’m probably looking for the VoiceData part of the struct when given this:

"format":0,"voice_data":"uz+ACgEAEAELgD4EQgEWAKV4mxnepfmhxKCQxAnKVNaHhKRXPIsmAH5RjXmJV0u+WTmrvgyCKxcraehjo/ZeKcFjksXQZEeOju4hLNv/MAB9KA7ww14Vc0ndYPB7dDXoXTexuxcW0Jg/diMgdH5ijWhe02Ch48KX86qJZYFyZV81AH76qCgh9AXliMdyWEgWTMbRD6xMX37WJALrXlSnxymIloSq2KGwXCcMXzQiSQIrcLVNfqdNJACCluFOIRKPmugUvsLZmnD04X0xhpAuNkwJECK4t51MBOWNWJlCAIDyZlJwWI45EPTjBB6yKyGOclu96qBV2MhFAh1d2J7WDZwe6YxOVu/BGkGcur9qTP85ZRfjANoiQxQrWvpoHFBFBy0AfX6k8XvbSwrk2nUAEP3P6kcmXORKUNKeu8HDnOUflQqtA5AkkTiun77fZrqnimIfWg==","sequence_bytes":23598094,"section_number":1,"sample_rate":16000

I’m able to pull the voice data out like so:
uz+ACgEAEAELgD4EQgEWAKV4mxnepfmhxKCQxAnKVNaHhKRXPIsmAH5RjXmJV0u+WTmrvgyCKxcraehjo/ZeKcFjksXQZEeOju4hLNv/MAB9KA7ww14Vc0ndYPB7dDXoXTexuxcW0Jg/diMgdH5ijWhe02Ch48KX86qJZYFyZV81AH76qCgh9AXliMdyWEgWTMbRD6xMX37WJALrXlSnxymIloSq2KGwXCcMXzQiSQIrcLVNfqdNJACCluFOIRKPmugUvsLZmnD04X0xhpAuNkwJECK4t51MBOWNWJlCAIDyZlJwWI45EPTjBB6yKyGOclu96qBV2MhFAh1d2J7WDZwe6YxOVu/BGkGcur9qTP85ZRfjANoiQxQrWvpoHFBFBy0AfX6k8XvbSwrk2nUAEP3P6kcmXORKUNKeu8HDnOUflQqtA5AkkTiun77fZrqnimIfWg==

However, this is where I’m hitting a wall: the data is in an unknown format. From some research, I found that Steam started using the SILK codec for voice data in 2011. But when I write this data to a file and try to open it with an Opus decoder (which I believe supports SILK), the decoder tells me it can’t open the file, so I’m not 100% convinced it is SILK. Recognising audio data isn’t something I have much experience with, so any advice would be great.

I have noticed there’s a VoiceDataFormatT part of the struct but the only definition I can find for it is this:

type VoiceDataFormatT int32

Which doesn’t seem too helpful! :/

EDIT 1:
As per advice from user Ian Cook I’ve decoded the data from base64 into the following (as hex dump):

BB 3F 80 0A 01 00 10 01 0B 80 3E 04 42 01 16 00 A5 78 9B 19 DE A5 F9 A1 C4 A0 90 C4 09 CA 54 D6 87 84 A4 57 3C 8B 26 00 7E 51 8D 79 89 57 4B BE 59 39 AB BE 0C 82 2B 17 2B 69 E8 63 A3 F6 5E 29 C1 63 92 C5 D0 64 47 8E 8E EE 21 2C DB FF 30 00 7D 28 0E F0 C3 5E 15 73 49 DD 60 F0 7B 74 35 E8 5D 37 B1 BB 17 16 D0 98 3F 76 23 20 74 7E 62 8D 68 5E D3 60 A1 E3 C2 97 F3 AA 89 65 81 72 65 5F 35 00 7E FA A8 28 21 F4 05 E5 88 C7 72 58 48 16 4C C6 D1 0F AC 4C 5F 7E D6 24 02 EB 5E 54 A7 C7 29 88 96 84 AA D8 A1 B0 5C 27 0C 5F 34 22 49 02 2B 70 B5 4D 7E A7 4D 24 00 82 96 E1 4E 21 12 8F 9A E8 14 BE C2 D9 9A 70 F4 E1 7D 31 86 90 2E 36 4C 09 10 22 B8 B7 9D 4C 04 E5 8D 58 99 42 00 80 F2 66 52 70 58 8E 39 10 F4 E3 04 1E B2 2B 21 8E 72 5B BD EA A0 55 D8 C8 45 02 1D 5D D8 9E D6 0D 9C 1E E9 8C 4E 56 EF C1 1A 41 9C BA BF 6A 4C FF 39 65 17 E3 00 DA 22 43 14 2B 5A FA 68 1C 50 45 07 2D 00 7D 7E A4 F1 7B DB 4B 0A E4 DA 75 00 10 FD CF EA 47 26 5C E4 4A 50 D2 9E BB C1 C3 9C E5 1F 95 0A AD 03 90 24 91 38 AE 9F BE DF 66 BA A7 8A 62 1F 5A
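The base64-to-hex step can be reproduced with the Python standard library. This is a minimal sketch: the helper name to_hex_dump is made up for illustration, and the input is only the first 20 base64 characters of the sample above (15 bytes once decoded).

```python
import base64


def to_hex_dump(b64_text: str) -> str:
    """Decode base64 text and format the bytes as space-separated upper-case hex."""
    raw = base64.b64decode(b64_text)
    return " ".join(f"{byte:02X}" for byte in raw)


if __name__ == "__main__":
    # First 20 base64 characters of the voice_data sample shown earlier.
    print(to_hex_dump("uz+ACgEAEAELgD4EQgEW"))
```

The printed bytes should line up with the start of the hex dump, which is a quick way to confirm that the decoding step itself is not what is corrupting the audio.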

I’m still at a loss as to what this data is. I’ve tried converting it to a WAV file using ffmpeg (assuming it is PCM), but it still comes out as white noise.

EDIT 2:
So it’s occurred to me that it might help if I include more samples of the data – the decoded hex of the data can be found here (each sample separated by a new line character):

pastebin

I’ve noticed that each one seems to start with the following hex:

BB 3F 80 0A 01 00 10 01 0B 80 3E 04

Which, read as text, is mostly unprintable: »?€ followed by control characters, then €>

I’m still at a loss as to how to convert this to audio data.

EDIT 3:
I’ve uploaded some more data dumps to the following pastebin (More data). It’s not a full dump, as that is roughly 15 MB and pastebin crashed when I tried to paste it!

The data file is a Dota 2 demo file (extension .dem), which is a collection of protobuf messages that I parse using Go and the Manta replay parser (found here). This lets me pull out any type of message, and I select OnCSVCMsg_VoiceData, which returns m.Audio.VoiceData in the form of a CSVCMsg_VoiceData (the struct shown above).

EDIT 4

Here’s (finally) the link to the file with the concatenated voiceData messages.

And here’s the link to the original file of protobuf messages

2 Answers

TL;DR

  1. Each section n indicates a separate stream of data
  2. The sequence_bytes value indicates the order that the frames should be placed in when decoding.
  3. The voice_data is base64-encoded
    1. The decoded data is a SILK-encoded frame, except for the following parts, which are not SILK data:
      1. The first 14 bytes (a per-frame header)
      2. The last 4 bytes (a trailer)
  4. To decode the data, you must do the following:
    1. For each section n, order n's structs in ascending order based on the value of sequence_bytes
    2. De-base64 each struct's voice_data
    3. Extract the SILK payload from each struct (i.e. remove the first 14 bytes and the last 4 bytes) and concatenate them all together (again, must be in order based on sequence_bytes)
    4. Prepend the resulting file with #!SILK_V3 (the SILK header)
    5. You now have a valid SILK file that can be decoded (details below)

Long version

Using the sample data you posted, first thing I had to do was replace the final comma with a ] to make it valid JSON.

I originally used shell scripts to convert the structs from JSON to SILK, but in the interest of efficiency, I re-implemented the conversion in Python.

import json
import base64
import sys

def main():
    if len(sys.argv) < 2:
        print("Usage: python3", sys.argv[0], "<CSVCMsg_VoiceData json file>")
        sys.exit(1)

    with open(sys.argv[1], 'r') as infile:
        json_data = json.load(infile)

    # Create dictionary with section number as the key and list of
    # that section's structs as the value
    section_dict = {}
    for obj in json_data:
        sec_num = obj['section_number']
        if sec_num not in section_dict:
            section_dict[sec_num] = []
        section_dict[sec_num].append(obj)

    # Create SILK file for each section number stream
    for section in section_dict.keys():
        filename = f"section_{section}.slk"
        print(f"Generating SILK file {filename} for section {section}...")
        with open(filename, 'wb') as outfile:
            # SILK header
            outfile.write(b"#!SILK_V3")
            # Sort frames in ascending order based on sequence_bytes value
            for frame in sorted(section_dict[section], key=lambda x : x['sequence_bytes']):
                decoded = base64.b64decode(frame['voice_data'])
                # strip first 14 bytes and last 4 bytes before writing
                outfile.write(decoded[14:-4])

if __name__ == '__main__':
    main()

To decode SILK, I used the official SDK (that's what the decoder linked by Gordon Freeman is built on top of). The SDK can be downloaded from this link, which I found from this page.

After I downloaded the SDK, I extracted it, went into the directory named SILK_SDK_SRC_FIX_v1.0.9, and ran make (I'm on Kali, but pretty much any Linux variant should be fine).

Once make completes, you're left with a couple executables; the only one we care about is decoder.

Simply run decoder on the SILK payloads generated above, and you'll get a pcm file you can do whatever you want with. For example, ./decoder section_12.slk section_12.pcm. The output file is at 22050 Hz.
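If a WAV file is the goal, the raw PCM can be wrapped in a WAV container with Python's standard wave module. This is a minimal sketch, not part of the original workflow: the function name pcm_to_wav is hypothetical, and it assumes the decoder wrote 16-bit little-endian mono samples. The sample rate passed in must match the rate the decoder actually produced, so check it against your decoder's settings.

```python
import wave


def pcm_to_wav(pcm_path: str, wav_path: str, sample_rate: int) -> None:
    """Wrap raw PCM samples in a WAV container without resampling."""
    with open(pcm_path, "rb") as pcm_file:
        pcm_data = pcm_file.read()
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)            # assumption: mono
        wav_file.setsampwidth(2)            # assumption: 16-bit samples
        wav_file.setframerate(sample_rate)  # must match the decoder's output rate
        wav_file.writeframes(pcm_data)
```

For example, pcm_to_wav("section_12.pcm", "section_12.wav", 22050) would use the rate quoted above. If playback sounds too fast or too slow, the rate is the first thing to revisit.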

Hat tip to @Gordon Freeman for pointing out that the header isn't 18 bytes like I originally suspected and that the last 4 bytes aren't part of the SILK payload.

Old shell scripts

For posterity, here's how I converted the JSON to SILK files with shell scripts.

I used the following script to extract the data, de-base64 it, and put each struct's data in its own file.

#!/bin/bash

# Write each decoded VoiceData to a file with the naming convention
# <sequence_bytes>_<section_number>
write_data ()
{
    filename=$(echo "$1" | cut -d_ -f1,2)
    data=$(echo "$1" | cut -d_ -f3)
    echo -n "$data" | base64 -d > "$filename"
}
export -f write_data
jq -r '.[] | "\(.sequence_bytes)_\(.section_number)_\(.voice_data)"' dota2CasterParse.json | xargs -I '{}' bash -c "write_data '{}'"

I then used the following script to create a SILK file for each section:

#!/bin/bash

section_numbers=$(ls [0-9]*_[0-9]* | cut -d_ -f2 | sort -u)

for section in $section_numbers; do
    output="section_${section}_voiceData.slk"
    echo -n '#!SILK_V3' > $output
    for i in $(ls *_${section} | sort -n); do
        dd bs=1 skip=14 count=$(($(stat -c "%s" $i)-18)) if=$i of=$output conv=notrunc oflag=append
    done
done

Answered by hairlessbear on April 14, 2021

There are 3 types of "frame" header, which I guess correspond to 3 casters:
BB 3F 80 0A 01 00 10 01 0B 80 3E 04 42 01 (@ 0x0)
D8 76 DD 02 01 00 10 01 0B 80 3E 04 FA 01 (@ 0x5f0)
67 7D 11 05 01 00 10 01 0B 80 3E 04 7E 01 (@ 0x44ccf)
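One way to see which header bytes are fixed and which vary is to compare the headers position by position. The sketch below does that for the three headers listed above; the function name constant_positions is made up for illustration.

```python
def constant_positions(frames):
    """Return the byte offsets whose value is identical in every frame."""
    shortest = min(len(frame) for frame in frames)
    return [
        offset
        for offset in range(shortest)
        if len({frame[offset] for frame in frames}) == 1
    ]


# The three 14-byte headers quoted above.
HEADERS = [
    bytes.fromhex("BB 3F 80 0A 01 00 10 01 0B 80 3E 04 42 01"),
    bytes.fromhex("D8 76 DD 02 01 00 10 01 0B 80 3E 04 FA 01"),
    bytes.fromhex("67 7D 11 05 01 00 10 01 0B 80 3E 04 7E 01"),
]

if __name__ == "__main__":
    print(constant_positions(HEADERS))
```

For these three headers, offsets 0 to 3 and offset 12 vary while the rest are constant, which fits the reading of the first four bytes as an identifier and the last two as a length.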

Example for the first one:
BB 3F 80 0A: identifier of the caster
01: number of channels (mono)
80 3E = 0x3E80 = 16000: the sample rate
42 01 = 0x0142: the size of the SILK data

After the size field, the following 0x142 bytes are the SILK data.
Just prepend the SILK header #!SILK_V3 to it:
23 21 53 49 4C 4B 5F 56 33
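The layout described above can be sketched as a small Python parser. It follows this answer's reading of the header (identifier in bytes 0 to 3, channel count in byte 4, little-endian sample rate at offset 9, little-endian payload size at offset 12), which is an interpretation rather than a documented format. The names parse_frame and SILK_MAGIC are hypothetical.

```python
import struct

SILK_MAGIC = b"#!SILK_V3"  # file header expected by the SILK decoder


def parse_frame(frame: bytes) -> dict:
    """Split one decoded voice_data blob into header fields, payload and trailer."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than the 14-byte header")
    (sample_rate,) = struct.unpack_from("<H", frame, 9)
    (payload_len,) = struct.unpack_from("<H", frame, 12)
    payload = frame[14:14 + payload_len]
    if len(payload) != payload_len:
        raise ValueError("frame is shorter than its declared payload size")
    return {
        "caster_id": frame[0:4],       # assumption: per-caster identifier
        "channels": frame[4],          # assumption: channel count
        "sample_rate": sample_rate,
        "payload": payload,            # the SILK data
        "trailer": frame[14 + payload_len:],  # whatever follows the payload
    }
```

Using the size field instead of a fixed "drop the last 4 bytes" rule makes it easy to check that the trailer really is 4 bytes long for every frame before writing SILK_MAGIC followed by the concatenated payloads.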

I use silk_v3_decoder.exe (there may also be Python scripts that can do this):
silk_v3_decoder.exe in.hex out.pcm -Fs_API 16000
then
ffmpeg -f s16le -ar 16000 -ac 1 -i out.pcm out.wav

A frame covers only a short span of time, so the data from all frames must be concatenated
(as hairlessbear said)

Note: the last 4 bytes of each "frame" could be a checksum
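The checksum guess can be tested against real frames. The sketch below checks one specific hypothesis only: that the trailer is a little-endian CRC-32 computed over all preceding bytes of the frame. Both the algorithm and the byte range are assumptions, and the function name trailer_matches_crc32 is made up for illustration.

```python
import zlib


def trailer_matches_crc32(frame: bytes) -> bool:
    """Return True if the last 4 bytes equal the little-endian CRC-32 of the rest."""
    if len(frame) < 4:
        return False
    body, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(body).to_bytes(4, "little") == trailer
```

A result of False for real frames would not rule out a checksum altogether; it would only mean this particular algorithm, byte order or coverage is wrong.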

Answered by Gordon Freeman on April 14, 2021
