TransWikia.com

Avoiding Python tricks to find anagrams

Code Review Asked by shubhamprashar on December 21, 2020

What are anagrams?

An anagram of a string is another string that contains the same characters, only the order of characters can be different. For example, “abcd” and “dabc” are an anagram of each other

def anagram_check(string1,string2):
    list1 = [char1.lower() for char1 in string1 if char1 != " "]
    list2 = [char2.lower() for char2 in string2 if char2 != " "]
    if len(list1) != len(list2):
        return False
    else:
        list1.sort()
        list2.sort()
        return list1 == list2

print(anagram_check('clint eastwood','old west action'))

Question

I have started to learn data structures and algorithms, and my question is: Am I using too many Python tricks to solve this problem or do I need to think of a more basic approach to solve this?

4 Answers

To answer your specific question, I only see a few "tricks" being used in your code:

  1. Converting to lower case.
  2. Eliminating spaces.
  3. Comparing the lengths as a shortcut
  4. Sorting the lists to quickly compare them.

None of these are particularly "Python-centric". The only caveat would be that Python strings are $O(1)$ to compute the length, while for example C strings are $O(n)$ to compute the length.

(Note: while Python's standard sort() is faster than C's standard qsort(), this is a library thing and not your responsibility.)

One "obvious" thing would be to use collections.Counter to generate an equivalence class that would determine whether two strings were anagrams. That definitely would be Python-centric, since most other languages don't have a direct equivalent. (Although it's trivial to write in a language that already has dict/hash/assoc-array support.)

That said, let's look at your code from a Python standpoint.

Does it pep8?

I see a couple of problems with your function from a PEP-8 standpoint.

  1. Your function has no docstring.

  2. Use spaces after commas.

Is it quick-like?

You have done a fair job of optimizing the code without becoming "too Pythonic." I would suggest some of the following, however:

  1. Only call .lower() once. You call char1.lower() for char1 in string1 but you could instead use char1 for char1 in string1.lower() which would call .lower() once per string, rather than once per character. This will become significant when someone passes you a 60,000 character string. ;-)

  2. Trust the snake! Instead of checking the length, trust Python to do that! You can simplify your code even further with that.

Fix that name!

Unless you have some kind of requirement that you must use the name, change that name! The name anagram_check sounds like a procedure (do something) not like a function (return something). But you're returning a boolean value. So make it a boolean-sounding name:

is_anagram_of
are_anagrams
has_anagram
compare_equal_when_sorted_with_spaces_removed

Correct answer by Austin Hastings on December 21, 2020

I'll focus on performance. Let's first benchmark removing spaces and lower-casing:

string1 = 'clint eastwood'; string2 = 'old west action'

6.45 μs  list1 = [char1.lower() for char1 in string1 if char1 != " "]; list2 = [char2.lower() for char2 in string2 if char2 != " "]
0.42 μs  string1.replace(' ', ''); string2.replace(' ', '')
0.25 μs  string1.lower(); string2.lower()
0.63 μs  string1.replace(' ', '').lower(); string2.replace(' ', '').lower()
0.71 μs  string1.lower().replace(' ', ''); string2.lower().replace(' ', '')

Your preprocessing is a lot slower than using string methods. And unsurprisingly, it's faster to remove spaces before lower-casing. Both @M_Juckes and @hjpotter92 did it the other way around.

Now let's check length, sorting and counting of the preprocessed strings:

string1 = 'clinteastwood'; string2 = 'oldwestaction'

0.16 μs  len(string1); len(string2)
1.65 μs  sorted(string1); sorted(string2)
6.59 μs  Counter(string1); Counter(string2)

Getting the length is far cheaper than sorting and counting, which are also a lot slower than the space-removal and lower-casing and thus make up most of the time. So for performance it's a good idea to do your length check and not sort/count at all if the lengths already differ. Also, counting is far more expensive than sorting. Probably not for a lot longer strings, but I find it unrealistic to check whether two very long strings are anagrams.

Let's turn what we learned into an optimized solution:

def anagram_check(string1, string2):
    s = string1.replace(' ', '')
    t = string2.replace(' ', '')
    if len(s) != len(t):
        return False
    return sorted(s.lower()) == sorted(t.lower())

First I only remove the spaces, so that the length check can avoid unnecessary lowering as well. Then lower-case and use sorting. I kept the longer parameter names for clarity but internally switched to more convenient short variable names.

Benchmarks on full solutions, first with your original string pair:

('clint eastwood', 'old west action')

7.97 μs  original
2.69 μs  heap_overflow
3.71 μs  M_Juckes
2.45 μs  hjpotter92

Mine is a bit slower than hjpotter92's, as the length check does take a little time. Next let's modify the second string a bit so they're not anagrams:

('clint eastwood', 'new west action')

7.74 μs  original
2.62 μs  heap_overflow
3.63 μs  M_Juckes
2.50 μs  hjpotter92

Pretty much the same. Now let's make the second string a bit longer:

('clint eastwood', 'older west action')

6.94 μs  original
0.77 μs  heap_overflow
4.00 μs  M_Juckes
2.64 μs  hjpotter92

Now the length check pays off. In my solution more than in yours, as I avoid not only the sorting but also the lower-casing. It's by far the fastest now, as expected from the earlier timings of the individual steps.

I don't know what data you're using this on, but if the length check fails let's say 50% of the time, then it's clearly worth doing, and my solution is fastest on average. I'd also say it's better style than those long one-liners.

Full benchmark code:

from collections import Counter
from functools import partial
from timeit import repeat

tests = [
    ('clint eastwood','old west action'),
    ('clint eastwood','new west action'),
    ('clint eastwood','older west action'),
    ]

#----------------------------------------------------------------------------

normers = [
    'list1 = [char1.lower() for char1 in string1 if char1 != " "]; list2 = [char2.lower() for char2 in string2 if char2 != " "]',
    "string1.replace(' ', ''); string2.replace(' ', '')",
    "string1.lower(); string2.lower()",
    "string1.replace(' ', '').lower(); string2.replace(' ', '').lower()",
    "string1.lower().replace(' ', ''); string2.lower().replace(' ', '')",
    ]

for test in tests[:1]:
    setup = f'string1 = {test[0]!r}; string2 = {test[1]!r}'
    print(setup)
    tss = [[] for _ in normers]
    for _ in range(3):
        for normer, ts in zip(normers, tss):
            t = min(repeat(normer, setup, number=10**5)) * 10
            ts.append(t)
    for normer, ts in zip(normers, tss):
        print('%.2f μs ' % min(ts), normer)
    print()

#----------------------------------------------------------------------------

normers = [
    "len(string1); len(string2)",
    "sorted(string1); sorted(string2)",
    "Counter(string1); Counter(string2)",
    ]

for test in tests[:1]:
    string1 = test[0].replace(' ', '').lower()
    string2 = test[1].replace(' ', '').lower()
    setup = f'string1 = {string1!r}; string2 = {string2!r}'
    print(setup)
    setup += '; from collections import Counter'
    tss = [[] for _ in normers]
    for _ in range(3):
        for normer, ts in zip(normers, tss):
            t = min(repeat(normer, setup, number=10**5)) * 10
            ts.append(t)
    for normer, ts in zip(normers, tss):
        print('%.2f μs ' % min(ts), normer)
    print()

#----------------------------------------------------------------------------

def original(string1,string2):
    list1 = [char1.lower() for char1 in string1 if char1 != " "]
    list2 = [char2.lower() for char2 in string2 if char2 != " "]
    if len(list1) != len(list2):
        return False
    else:
        list1.sort()
        list2.sort()
        return list1 == list2
    
def heap_overflow(string1, string2):
    s = string1.replace(' ', '')
    t = string2.replace(' ', '')
    if len(s) != len(t):
        return False
    return sorted(s.lower()) == sorted(t.lower())

def M_Juckes(*args : 'Two or more strings') -> 'True if all strings are anagrams':
    return len( { tuple(sorted(x.lower().replace(' ', ''))) for x in args } ) == 1

def hjpotter92(string1, string2):
    return sorted(string1.lower().replace(' ', '')) == sorted(string2.lower().replace(' ', ''))

funcs = original, heap_overflow, M_Juckes, hjpotter92

for test in tests:
    print(test)
    print()
    tss = [[] for _ in funcs]
    for _ in range(3):
        for func, ts in zip(funcs, tss):
            t = min(repeat(partial(func, *test), number=10**5)) * 10
            ts.append(t)
    for func, ts in zip(funcs, tss):
        print('%.2f μs ' % min(ts), func.__name__)
    print()

Answered by Kelly Bundy on December 21, 2020

Taking advantage of the work previously submitted by @hjpotter92 and adding an extra twist:

def anagram_check(*args ):
    return len( { tuple(sorted(x.lower().replace(' ', ''))) for x in args } ) == 1

This, potentially, gains an additional pythonicity score by avoiding repetition of the sorted(x.lower().replace( ...) construct. Instead, the construct is applied to both arguments in the set comprehension, and the test to see if the resulting set has a single element provides the answer. This implementation also generalizes the API to cover more that two strings if wanted, returning True if all strings given are anagrams of each other (thanks to Heap Overflow for pointing this out).

If you really want a function which is limited to two strings, you might prefer:

def anagram_check(string1, string2):
    return len({tuple(sorted(x.lower().replace(' ', ''))) for x in [string1,string2]}) == 1

The code can be made less terse and more legible by defining a function to act on each string:

def make_unique(x): 
     return tuple( sorted( x.lower().replace(' ', '') ) )

def anagram_check(*args): 
     return len( { make_unique(x) for x in args } ) == 1

Answered by M Juckes on December 21, 2020

Welcome to Code Review.

Since you're looking for Python tricks here, there are great many things which can both make it Pythonic, maintainable, shorter, etc. Depending on your understanding, as well the target audience for the code in near future, you may choose individual methods. A few pointers to get you started:

  1. You can use a filter to remove all whitespace characters:

    filter(lambda char: char != ' ', string.lower())
    
  2. You can use replace to replace all whitespaces:

    string.lower().replace(' ', '')
    
  3. You can use collections.Counter to verify that count of each character in both strings is equal, although list equality works as well.

  4. You can validate (or simply return) equality of a sorted call on the filtered strings.


def anagram_check(string1, string2):
    return sorted(string1.lower().replace(' ', '')) == sorted(string2.lower().replace(' ', ''))

The one-liner above is perfectly Pythonic, imo.

Answered by hjpotter92 on December 21, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP