TransWikia.com

How to solve this UTF-8 encoding C problem?

Stack Overflow Asked by Eamon Ryan on December 5, 2021

In my class, we were given this problem. I have no clue how to solve it.

"The program below counts the number of characters in a file, assuming the file is encoded as ASCII. Modify the program so that it counts the number of characters in a file encoded as UTF-8"

#include <stdbool.h>
#include <stdio.h>
typedef unsigned char BYTE;
int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: ./count INPUTn");
        return 1;
    }
    FILE *file = fopen(argv[1], "r");
    if (!file)
    {
        printf("Could not open file.n");
        return 1;
    }
    int count = 0;
    while (true)
    {
        BYTE b;
        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }
    printf("Number of characters: %in", count);
}

Can anyone help me solve this?

One Answer

UTF-8 is designed such that this is trivial. There's a property that's common to all continuation bytes (the bytes you want to ignore), and only found in continuation bytes. What is it?

First     Last      Number of
Code      Code      bytes in   Byte 1    Byte 2    Byte 3    Byte 4
Point     Point     encoding 
--------  --------  ---------  --------  --------  --------  --------
U+000000  U+00007F          1  0xxxxxxx
U+000080  U+0007FF          2  110xxxxx  10xxxxxx
U+000800  U+00FFFF          3  1110xxxx  10xxxxxx  10xxxxxx
U+010000  U+10FFFF          4  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

Then, it's simply a question of doing some bit arithmetic. Bitwise-AND can be used to isolate the bits you want to check. C has an operator for that.

Answered by ikegami on December 5, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP