XOR Encryption, Decryption, and Cracking in Python

Question

I have recently been working through a number of the Cryptopals Cryptography Challenges to try and improve my knowledge and understanding of both cryptography and Python. As five of the first six challenges are XOR related problems I thought it would be a good idea to compile my work into a single program capable of encrypting, decrypting, and cracking using the XOR cipher. The program is capable of both single-byte and multi-byte encryption modes and can employ statistical analysis to guess a key when none is given.
I have previously asked for reviews on my Ceasar and Vigenere implementations/crackers and have included all of them together as a small suite for these fun little ciphers which I have uploaded to a repository on GitHub. I won't include all the code here but if possible I would like to know what improvements I could make to the overall structure of the project as I am trying to learn about how to organise a project like this with the intention of growing it with more and more encryption tools in the future. Due to folder structure dependencies I would suggest you clone the GitHub repository if you intend to run this code although all the relevant code will be posted below.
What I Would Like Feedback On

Correctness of my implementations. Please let me know if there are any errors in my code.
Readability, Pythonic-ness, style, and documentation. I am learning Python with the intention of working with large teams on projects in the future and I feel these will be important for collaborative work.
There is an issue with the method predictKeySize() in XORAnalysis.py in which it will have a strong bias towards short keys if it is allowed to guess short keys. As such it is currently hard coded to only guess lengths greater than 6 meaning my program is incapable of cracking keys between two and five characters long. Any ideas about how to improve this would be much appreciated
Performance improvements and memory usage reductions. Not as important as other areas as the program isn't particularly slow or resource intensive but still nice to know about.

The Code
xor.py
#!/usr/bin/python3

"""

Filename:   xor.py
    Author:     Jess Turner
    Date:       15/07/20
    Licence:    GNU GPL V3
    
    Multipurpose XOR Encryption tool, can encrypt and decrypt text using a specified single-byte or multi-byte key or attempt to decrypt an input without a given key by using statistical analysis

Options:
        --encrypt           Enable encryption mode (Default)
        --decrypt           Enable decryption mode
        --key               Specify the encryption key
        --guess             Attempt to guess the encryption key by statistical analysis
        --single-byte       Enable single-byte XOR mode (Default)
        --multi-byte        Enable multi-byte XOR mode

"""

import argparse
import string
import codecs
import sys
from itertools import cycle

from internal.XORAnalysis import predictKeySize, multiByteXORCrack, multiByteXOR, repeatingByteXOR, repeatingByteXORCrack

def initialiseParser():
    parser = argparse.ArgumentParser(description = "Encrypt, decrypt, or crack a message using the XOR Cipher")

parser.add_argument("--key", "-k", help = "The encryption key to be used (if relevant)", type = str)
    parser.add_argument("--guess", "-g", help = "Perform statistical analysis to estimate the most likely value of the encryption key", action = "store_true")
    parser.add_argument("--single-byte", "--single", "-s", help = "Enable single-byte key mode", action = "store_true")
    parser.add_argument("--multi-byte", "--multi", "-m", help = "Enable multi-byte key mode", action = "store_true")
    parser.add_argument("--decrypt", "-d", help = "Enable decryption mode", action = "store_true")

return parser

def main():
    parser = initialiseParser()
    args = parser.parse_args()
    inputString = sys.stdin.read().encode()

if args.decrypt or args.guess:
        inputString = codecs.decode(inputString, "base-64")

if args.guess:
        if args.multi_byte:
            print("[+] Selecting multi-byte key mode...", file = sys.stderr)
            print("[+] Predicting key length...", file = sys.stderr) # At this point we have the entire decoded input in memory, all that is left is to crack it

keyLength = predictKeySize(inputString)

print("[-] Got length of {}...n[+] Attempting to crack key...".format(keyLength), file = sys.stderr)

crack = multiByteXORCrack(inputString, keyLength)
            key = crack['key']
        else:
            print("[+] Selecting single-byte key mode...", file = sys.stderr)
            print("[+] Attempting to crack key...", file = sys.stderr)

crack = repeatingByteXORCrack(inputString)
            key = chr(crack['key'])

print("[-] Got key: "{}" !n[+] Decrypting message...".format(key), file = sys.stderr)

output = crack['message']
    elif args.key != None:
        if len(args.key) > 1 and not args.multi_byte:
            print("[+] Single-byte mode selected but multi-byte key was given. Defaulting to multi-byte mode...", file = sys.stderr)
            args.multi_byte = True

output = multiByteXOR(inputString, [ord(c) for c in args.key]) if args.multi_byte else repeatingByteXOR(inputString, ord(args.key))
            
    else:
        print("[-] Error: No key given!", file = sys.stderr)
        return

if not args.decrypt and not args.guess:
        output = codecs.encode(output.encode(), "base-64").decode()

print(output, end = "")

if __name__ == "__main__":
    main()

XORAnalysis.py
"""

Filename:   XORAnalysis.py
    Author:     Jess Turner
    Date:       19/06/20
    Licence:    GNU GPL V3
    
    A collection of analysis functions and pieces of information required byciphertools programs which implement XOR-based algorithms
    
"""

from itertools import cycle
import string

from .Strings import alphanumeric_characters, buildSubStrings

# XOR analysis functions

def letterRatio(inputString):
    return sum([x in alphanumeric_characters for x in inputString]) / len(inputString)

def probablyText(inputString):
    return letterRatio(inputString) > 0.7

# Functions for single-byte key XOR

def repeatingByteXOR(inputString, byte):
    return "".join(chr(c ^ byte) for c in inputString)

def repeatingByteXORCrack(inputString):
    best = None

for byte in range(256):
        currentString = repeatingByteXOR(inputString.strip(), byte)
        num_chars = sum([x in alphanumeric_characters for x in currentString])

if best == None or num_chars > best['num_chars']:
            best = { 'message': currentString, 'num_chars': num_chars, 'key': byte }

return best

# Functions for multi-byte key XOR

def multiByteXORCrack(inputString, keyLength):
    key = "".join(chr(repeatingByteXORCrack(string.strip())['key']) for string in buildSubStrings(inputString, keyLength))
    message = multiByteXOR(inputString, key.encode())

return { 'message': message, 'key': key }

def multiByteXOR(inputString, key):
    return "".join(chr(c ^ byte) for c, byte in zip(inputString, cycle(key)))

# Functions for multi-byte XOR key length prediction

def XORStrings(first, second):
    return bytes([i ^ j for i, j in zip(first, second)]) # Convert two byte strings to their xor product

def hammingDistance(first, second):
    return bin(int.from_bytes(XORStrings(first, second), "little")).count("1") # Calculate the bit difference between two strings

def predictKeySize(inputString):
    bestKey = 0
    bestDistance = 10000

for i in range(6, 40): # Set to a lower bound of 6 because otherwise it always guesses a really short key. Will try and fix in later version.
        distance = 0
        blocks = len(inputString) // i - 1

for x in range(blocks):
            distance += hammingDistance(inputString[i * x:i * (x + 2) - 1], inputString[i * (x + 2):i * (x + 4) - 1])

distance /= i
        distance /= blocks

if distance < bestDistance:
            bestDistance = distance
            bestKey = i

return bestKey

Strings.py
"""

Filename:   strings.py
    Author:     Jess Turner
    Date:       28/09/19
    Licence:    GNU GPL V3

A collection of functions for the modification of strings required by multiple programs in the ciphertools suite

"""

import re

alphanumeric_characters = "1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz "

english = { 'monogram-frequencies': [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966, 0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987, 6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074 ],
            'bigram-frequencies': [] }

def stringPrepare(string, preserveSpacing): # Strip all non alphabetic characters from a string and convert to upper case
    return re.compile("[^A-Zs]" if preserveSpacing else "[^A-Z]").sub("", string.upper())

def buildSubStrings(string, separation): # Build a list of substrings required to analyse the ciphertext
    return [string[i::separation] for i in range(separation)]

Reinderien · Answer

Nomenclature
By PEP8, initialiseParser should be initialise_parser, and similarly for inputString, etc.
String interpolation
print("[-] Got length of {}...n[+] Attempting to crack key...".format(keyLength), file = sys.stderr)

is simpler as
print(
    f"[-] Got length of {key_length}...n"
    "Attempting to crack key...",
    file=sys.stderr,
)

Type hints
For example,
def probablyText(inputString):

can be
def probably_text(input_string: str) -> bool:

Sum without a comprehension
sum([x in alphanumeric_characters for x in currentString])

should use the generator directly instead of making a list; i.e.
sum(x in alphanumeric_characters for x in current_string)

The same goes for
return bytes([i ^ j for i, j in zip(first, second)]) # Convert two byte strings to their xor product

Strongly-typed results
best = { 'message': currentString, 'num_chars': num_chars, 'key': byte }

If you're only doing this because you need to return multiple things, idiomatic Python is to simply return them as a tuple, i.e.
best = current_string, num_chars, byte
# ...
return best

But this would be better-represented by a named tuple, or (better) a @dataclass with type hints. Just not a dictionary.
Combined division
    distance /= i
    distance /= blocks

can be
distance /= i * blocks

Sums rather than successive addition
    for x in range(blocks):
        distance += hammingDistance(inputString[i * x:i * (x + 2) - 1], inputString[i * (x + 2):i * (x + 4) - 1])

can be
distance = sum(
    hamming_distance(
        input_string[i*x     : i*(x+2)-1],
        input_string[i*(x+2) : i*(x+4)-1],
    )
    for x in range(blocks)
)

Drop dictionaries to variables
Given your current code,
english = { 'monogram-frequencies': [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966, 0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987, 6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074 ],
            'bigram-frequencies': [] }

should simply be a monogram variable and a bigram variable.

XOR Encryption, Decryption, and Cracking in Python

One Answer

Nomenclature

String interpolation

Type hints

Sum without a comprehension

Strongly-typed results

Combined division

Sums rather than successive addition

Drop dictionaries to variables

Add your own answers!

Ask a Question