How to solve this UTF-8 encoding C problem?

Question

In my class, we were given this problem. I have no clue how to solve it.

"The program below counts the number of characters in a file, assuming the file is encoded as ASCII. Modify the program so that it counts the number of characters in a file encoded as UTF-8."

#include <stdbool.h>
#include <stdio.h>
typedef unsigned char BYTE;
int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: ./count INPUT\n");
        return 1;
    }
    FILE *file = fopen(argv[1], "r");
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }
    int count = 0;
    while (true)
    {
        BYTE b;
        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }
    printf("Number of characters: %i\n", count);
}

Can anyone help me solve this?


ikegami · Answer

UTF-8 is designed such that this is trivial. There's a property that's common to all continuation bytes (the bytes you want to ignore), and only found in continuation bytes. What is it?
First     Last      Number of
Code      Code      bytes in   Byte 1    Byte 2    Byte 3    Byte 4
Point     Point     encoding 
--------  --------  ---------  --------  --------  --------  --------
U+000000  U+00007F          1  0xxxxxxx
U+000080  U+0007FF          2  110xxxxx  10xxxxxx
U+000800  U+00FFFF          3  1110xxxx  10xxxxxx  10xxxxxx
U+010000  U+10FFFF          4  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

Then, it's simply a question of doing some bit arithmetic. Bitwise-AND can be used to isolate the bits you want to check. C has an operator for that.
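For illustration, here is a minimal sketch of that check. It assumes the input is valid UTF-8 and that "character" means code point (so a letter followed by a combining accent counts as two). The names is_continuation_byte, count_code_points and count_code_points_in_file are made up for this example and are not part of the assignment. The table shows that every continuation byte starts with the bits 10, and no other byte does, so masking with 0xC0 (11000000) and comparing against 0x80 (10000000) identifies them.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns nonzero if b has the form 10xxxxxx.
   0xC0 is 11000000, so the AND keeps only the top two bits;
   0x80 is 10000000, the pattern shared by all continuation bytes. */
int is_continuation_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Counts code points in a buffer by counting every byte that is
   NOT a continuation byte. Assumes the buffer holds valid UTF-8. */
size_t count_code_points(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
    {
        if (!is_continuation_byte(buf[i]))
        {
            count++;
        }
    }
    return count;
}

/* The same idea applied to an open stream, one byte at a time.
   fgetc returns EOF at end of file or on a read error. */
size_t count_code_points_in_file(FILE *file)
{
    size_t count = 0;
    int c;
    while ((c = fgetc(file)) != EOF)
    {
        if (!is_continuation_byte((unsigned char)c))
        {
            count++;
        }
    }
    return count;
}
```

In the program from the question, the equivalent change would be to increment count only when (b & 0xC0) != 0x80. Opening the file with "rb" instead of "r" is probably safer too, since text mode can alter bytes on some platforms.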
