TransWikia.com

Distinguish English and Spanish with regular expressions

Code Golf Asked on November 30, 2021

The task is to compete for the shortest regex (in bytes) in your preferred programming language which can distinguish between English and Spanish with at least 90% accuracy (raised from the original 60%).

Silvio Mayolo's submission (pinned as Best Answer) has secured his spot as the winner of the original contest beyond any chance of being contested. To leave room for further submissions, he has generously allowed the scoring requirement to be raised to 90% accuracy.

Links to wordlists have been replaced due to concerns voiced in the comments.

The following word lists (based on these) must be used: English, Spanish

The Spanish wordlist is already transliterated into ASCII, and no word appears in both lists.

A naive approach to distinguishing Spanish from English might be to match if the word ends in a vowel:

[aeiou]$ with the i flag (9 bytes)

Here's an example in JavaScript, where 6 of 8 words are successfully identified, for 75% accuracy:

const regex = /[aeiou]$/i;

const words = [
  'hello',
  'hola',
  'world',
  'mundo',
  'foo',
  'tonto',
  'bar',
  'barra'
];

words.forEach(word => {
  const match = word.match(regex);
  const langs = ['English', 'Spanish'];
  const lang = langs[+!!match];
  console.log(word, lang);
});
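The 75% figure can be checked by scoring each word against its expected language. This is only a minimal sketch: the language labels below are an assumption (the snippet above does not label its words), and `samples` and `accuracy` are names introduced here for illustration.

```javascript
// Labels are an assumption for illustration; the original list is unlabeled.
const samples = [
  ['hello', 'English'], ['hola', 'Spanish'],
  ['world', 'English'], ['mundo', 'Spanish'],
  ['foo', 'English'],   ['tonto', 'Spanish'],
  ['bar', 'English'],   ['barra', 'Spanish'],
];

// A word counts as correct when "regex matches" agrees with "word is Spanish".
function accuracy(regex, labelled) {
  const correct = labelled.filter(
    ([word, lang]) => regex.test(word) === (lang === 'Spanish')
  ).length;
  return { correct, total: labelled.length };
}

const { correct, total } = accuracy(/[aeiou]$/i, samples);
console.log(`${correct} / ${total} (${(100 * correct / total).toFixed(2)}%)`);
// prints "6 / 8 (75.00%)"
```

The two misses are hello and foo, which end in a vowel despite being labelled English here.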

2 Answers

50 bytes, 90.02% accurate

(a(d?|is|r|se?)|dor|eis|ese|je|n|[ns]te|os?|res?)$

For 18,004 out of the 20,000 words in es_clean.json and en_clean.json, this regex matches iff the input word is Spanish.
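Both the byte count and the scoring rule can be checked mechanically. The sketch below uses JavaScript to match the example in the question. `score` is a hypothetical helper, the two short word lists are made up for illustration, and the idea that each file is a JSON array of strings is an assumption rather than something this answer states.

```javascript
const pattern = '(a(d?|is|r|se?)|dor|eis|ese|je|n|[ns]te|os?|res?)$';
const regex = new RegExp(pattern);

// Byte length of the pattern when encoded as UTF-8.
const bytes = new TextEncoder().encode(pattern).length;
console.log(bytes); // 50

// Correct when the regex matches a Spanish word or fails to match an English one.
function score(re, english, spanish) {
  const right =
    english.filter(w => !re.test(w)).length +
    spanish.filter(w => re.test(w)).length;
  return right / (english.length + spanish.length);
}

// Tiny illustrative lists, not taken from the contest files.
console.log(score(regex, ['world', 'night'], ['hola', 'mundo'])); // 1

// With the real files (assumed here to be JSON arrays of strings):
// const fs = require('fs');
// const en = JSON.parse(fs.readFileSync('en_clean.json', 'utf8'));
// const es = JSON.parse(fs.readFileSync('es_clean.json', 'utf8'));
// console.log(score(regex, en, es));
```

Because the pattern is anchored only at the end with $, it classifies purely by suffix, so the scorer never needs more than a single `test` call per word.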

Answered by Lynn on November 30, 2021

Any Language, 0.3677 (60.6064%, 1 byte)

a

No, I'm not joking. The single-character regular expression a successfully identifies Spanish words over English given your input files 60.6064% of the time, which makes it a valid submission.

Here's a complete, runnable Perl script that computes the accuracy of this regular expression, assuming you've downloaded english.json and spanish.json into the same folder as the script.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

my @english;
my @spanish;

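my $fh;
# Collect the first quoted word on each line of the English list.
open $fh, '<', 'english.json' or die "Cannot open english.json: $!";
while (<$fh>) {
    push @english, $1 if /"(\w+)"/;
}
close $fh;

# Same for the Spanish list.
open $fh, '<', 'spanish.json' or die "Cannot open spanish.json: $!";
while (<$fh>) {
    push @spanish, $1 if /"(\w+)"/;
}
close $fh;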

my $correct = 0;
my $total = 0;

my $re = qr/a/;

for (@english) {
    $total++;
    $correct++ unless /$re/;
}
for (@spanish) {
    $total++;
    $correct++ if /$re/;
}

say "$correct / $total (@{[100*$correct/$total]}%)";

Answered by Silvio Mayolo on November 30, 2021
