Is it defined in C to access "extra" union space via pointer to char?

Question

The C standard permits accessing an object via a pointer to character type (§6.3):

An object shall have its stored value accessed only by an lvalue that has one of the following
types:
[...]

a character type

This allows functions like memcpy() and fwrite() to work.
Say I have a union used to implement a variant type (a.k.a. a tagged union):
#include <stdint.h>  /* for uint8_t and uint32_t */

union var_uint
{
    uint8_t  n1;
    uint32_t n4;
};

enum kind_t
{
    KIND_1,
    KIND_4
};

struct tagged_uint
{
    enum  kind_t   kind;
    union var_uint value;
};

The C standard also says:

the size of a union is the size of its largest member, and there may be unnamed padding at the end of the union
type punning (accessing a union member via another member that's a char array) is implementation-defined (I think it should also be legal to access it via a char array that isn't a member, as per the rule quoted above?)

Is it defined behaviour to access the full size of the union type via a char pointer, even if you don't actually have any logic that depends on the values of the padding bytes? For example:
union var_uint number;
number.n1 = 127;
struct tagged_uint tagged_number = { KIND_1, number };
fwrite(&tagged_number, sizeof tagged_number, 1, my_stream);

// Later.

struct tagged_uint other_tagged_number;
fread(&other_tagged_number, sizeof other_tagged_number, 1, my_reopened_stream);

Here the padding bytes must be accessed to be written to the stream, even though it makes no difference to the logic of the code later (assuming it checks the kind field before accessing the var_uint member).
I only have the C90 standard with me right now, but I'd be interested in what the other standards say too.
(I am not actually serialising data to disk in this way.)

supercat · Answer

The C89 Standard as a whole was, like the language it was written to describe, based around an abstraction model in which every addressable object of type T will, based on the contents of the sizeof (T) bytes at its address, hold either a value of its type or a trap representation.  Attempting to read the object when it holds a value will yield that value; the consequences of attempting to read an object when it holds a trap representation are outside the Standard's jurisdiction.  Note that under this model, anything which is said about the behavior of writing one member of a union and reading another would apply equally to code which e.g. writes an object with a pointer of one type, converts the pointer to a different type, and then reads the object using that.
A lot of the vagueness about structures and unions was intended to accommodate implementations where writing to larger areas of memory might be faster than writing to smaller ones.  For example, if a little-endian platform supports 8-bit and 32-bit reads and writes, but not 16-bit ones, and one has a structure type:
struct foo {
  uint16_t a,b,c; // Followed by 16 bits of padding
  uint32_t d;
} *p;

it might be faster to process p->c = 23; by storing 23 to the 32-bit word holding p->c, than by using a pair of 8-bit writes.  If, however, there were also a structure type:
struct bar {
  uint16_t a,b,c,d;
} my_bar;

and one were to perform p->c = 23; while p pointed to my_bar, such an action would corrupt the value of my_bar.d.  Although there would be some situations where it would be useful to be able to read and write common-initial-sequence members, allowing CIS members to be written as well as read would have made it necessary for a compiler processing p->c = 23; to use a pair of byte stores rather than a word store, possibly incurring a major speed penalty.  The authors of the Standard likely figured that implementations that would have no reason to process writes in that fashion could and would extend the language by processing CIS writes in a way that avoids corrupting nearby objects, regardless of whether the Standard would require that they do so, and thus there was no need to have the Standard address the behavior of CIS writes.
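To make the overlap concrete, here is a small sketch that computes where the members land. It assumes a typical implementation (8-bit bytes, a 2-byte uint16_t, a 4-byte uint32_t aligned to 4 bytes); the two helper functions are hypothetical names of my own, and other layouts would give different numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct foo {
  uint16_t a, b, c; /* typically followed by 16 bits of padding */
  uint32_t d;
};

struct bar {
  uint16_t a, b, c, d;
};

/* Number of padding bytes between the end of foo.c and the start of foo.d. */
size_t foo_padding_after_c(void)
{
  size_t end_of_c = offsetof(struct foo, c) + sizeof(uint16_t);
  assert(offsetof(struct foo, d) >= end_of_c);
  return offsetof(struct foo, d) - end_of_c;
}

/* Returns 1 if a 32-bit store beginning at foo.c's offset would also
   cover the first byte of bar.d, i.e. if the "fast" word store used for
   p->c would clobber my_bar.d when p actually points at a struct bar. */
int word_store_to_c_would_hit_bar_d(void)
{
  size_t store_begin = offsetof(struct foo, c);
  size_t store_end = store_begin + sizeof(uint32_t);
  size_t d_begin = offsetof(struct bar, d);
  return d_begin >= store_begin && d_begin < store_end;
}
```

On the assumed layout, c sits at offset 4 in both structures, so a word store covering bytes 4 through 7 is harmless for struct foo (bytes 6 and 7 are padding) but overwrites d in struct bar.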
Although the Standard seems to hint that for every object of type T, there exists a char[sizeof (T)] object which occupies the same storage, and that actions may be performed using the latter object to access the storage associated with the former with the semantics implied by the original abstraction model, it doesn't actually say that such an object exists.  Given e.g.
union foo {
  unsigned char x[2];
  int y;
} it;

the Standard never makes clear whether the pointer expression (unsigned char*)&it yields a pointer to the first element of the char[sizeof (union foo)] object which implicitly overlays it, rather than e.g. a pointer to the first element of the unsigned char[2] object it.x (which would of course share the same address).
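The two candidate interpretations can be put side by side in code. This is only a sketch of what mainstream compilers can be expected to do, not a claim about what the Standard guarantees; the function names are hypothetical, and it assumes sizeof (int) is greater than 2 so that the whole-union walk goes past the end of it.x.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

union foo {
  unsigned char x[2];
  int y;
};

/* Returns 1 if the pointer to the whole union, converted to
   unsigned char *, compares equal to the pointer to the member array. */
int whole_and_member_pointers_compare_equal(void)
{
  union foo it;
  unsigned char *whole = (unsigned char *)&it;
  unsigned char *member = it.x;
  return whole == member;
}

/* Copies all sizeof (union foo) bytes through the whole-union pointer,
   which indexes beyond the two elements of x, and reports whether the
   int member survives the copy. */
int bytewise_copy_preserves_y(int value)
{
  union foo src, dst;
  unsigned char *from = (unsigned char *)&src;
  unsigned char *to = (unsigned char *)&dst;
  size_t i;

  src.y = value;
  for (i = 0; i < sizeof src; i++)
    to[i] = from[i];

  assert(memcmp(&src, &dst, sizeof src) == 0);
  return dst.y == value;
}
```

The pointers compare equal either way, so the comparison cannot tell the interpretations apart; only the bytes from index 2 onward depend on the pointer being treated as addressing the whole union.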
Note that such ambiguities don't create situations where a piece of code might have two different defined behaviors in some circumstances.  In nearly all such cases, it's clear that behavior is either defined unambiguously or not defined at all.  If compilers treat constructs as having the unambiguously-defined behaviors in cases where doing so would be sensible, regardless of whether the Standard requires them to do so, it won't matter if the Standard leaves ambiguous the question of whether the behaviors are defined.  Unfortunately, compiler writers and programmers are often at odds about when it is "sensible" to treat as UB an action that could have only one possible defined meaning.
I don't think any of today's compilers are apt to behave nonsensically in the circumstances you describe, but I don't think the Standard would forbid it.  Instead, it relies upon compiler writers to exercise common sense, and in this particular situation, so far as I can tell, they haven't thrown common sense out the window.

Chris Dodd · Answer

6.2.6.1 of the C99 spec has two relevant paragraphs:

When a value is stored in an object of structure or union type, including in a member
object, the bytes of the object representation that correspond to any padding bytes take
unspecified values. The values of padding bytes shall not affect whether the value of
such an object is a trap representation. Those bits of a structure or union object that are
in the same byte as a bit-field member, but are not part of that member, shall similarly not
affect whether the value of such an object is a trap representation.

When a value is stored in a member of an object of union type, the bytes of the object
representation that do not correspond to that member but do correspond to other members
take unspecified values, but the value of the union object shall not thereby become a trap
representation.

Thus accessing the padding is not undefined behavior, as long as you use a type (such as char) that cannot have any trap representations.
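As a rough illustration of that conclusion, the sketch below copies a whole tagged union through an unsigned char buffer (standing in for the fwrite()/fread() pair in the question) and then reads back only the member selected by the tag. The types are the ones from the question; roundtrip_n1 is a hypothetical helper name. The bytes that do not belong to n1 hold unspecified values, and the code copies them without ever interpreting them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

union var_uint {
  uint8_t  n1;
  uint32_t n4;
};

enum kind_t { KIND_1, KIND_4 };

struct tagged_uint {
  enum kind_t    kind;
  union var_uint value;
};

/* Stores n in the one-byte member, copies the whole object out to a
   byte buffer and back in, then returns the one-byte member again. */
unsigned roundtrip_n1(uint8_t n)
{
  struct tagged_uint in, out;
  unsigned char bytes[sizeof (struct tagged_uint)];

  in.kind = KIND_1;
  in.value.n1 = n;  /* the other bytes of in.value are unspecified */

  memcpy(bytes, &in, sizeof in);   /* reads every byte, padding included */
  memcpy(&out, bytes, sizeof out);

  assert(out.kind == KIND_1);      /* check the tag before using the value */
  return out.value.n1;
}
```

Calling roundtrip_n1(127) should give back 127 no matter what the unused bytes happened to contain.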
