Applied Pokology - A blog about GNU poke

Dealing with alternatives - Unions in Poke
[18-10-2019]

by Jose E. Marchesi

The Poke type definitions can be seen as a sort of declarative specifications for decoding and encoding procedures. You specify the structure of the data you want to operate on, and poke uses that information to automagically decode and encode the data for you. Under this perspective, struct types correspond to sequences of instructions, array types to repetitions or loops, and union types to alternatives or conditionals.

Decode-Compute-Encode

Computing with data whose form is not the most convenient way to be manipulated, like is often the case in unstructured binary data, requires performing a preliminary step that transforms the data into a more convenient representation, usually featuring a higher level of abstraction. This step is known in computer jargon as unmarshalling, when the data is fetch from some storage or transmission media or, more generally, decoding.

Once the computation has been performed, the result should be transformed back to the low-level representation to be stored or transmitted. This is performed in a closing step known as marshalling or, more generally, encoding.

Consider the following C program whose purpose is to read a 32-bit signed integer from a byte-oriented storage media at a given offset, multiply it by two, and store the result at the same offset.

void double_number (int fd, off_t offset, int endian)
{
   int number, i;
   int b[4];

   /* Decode.  */
   lseek (fd, offset, SEEK_SET);
   for (i = 0; i < 4; ++i)
      read (fd, &b[i], 1);

   if (endian == BIG)
     number = b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3];
   else                                                      
     number = b[3] << 24 | b[2] << 16 | b[1] << 8 | b[0];
                                                      
   /* Compute.  */
   number = number * 2;

   /* Encode.  */
   if (endian == BIG)
   {
     b[0] = (number >> 24) & 0xff;
     b[1] = (number >> 16) & 0xff;
     b[2] = (number >> 8) & 0xff;
     b[3] = number & 0xff;                            
   }
   else
   {
     b[3] = (number >> 24) & 0xff;
     b[2] = (number >> 16) & 0xff;
     b[1] = (number >> 8) & 0xff;
     b[0] = number & 0xff;                            
   }
                                                      
   lseek (fd, offset, SEEK_SET);
   for (i = 0; i < 4; ++i)
      write (fd, &b[i], 1);
}

As we can see, decoding takes care of fetching the data from the storage in simple units, bytes. Then it mounts the more abstract entity on which the computation will be performed, in this case a signed 32-bit integer. Considerations like endianness, negative encoding (which is assumed to be two's complement in this example and handled automatically by C) and error conditions (omitted in this example for clarity) should be handled properly.

Conversely, encoding turns the signed 32-bit integer into a sequence of bytes and then writes them out to the storage at the desired offset. Again, this requires taking endianness into account and handling error conditions.

This example may look simplistic and artificial, and it is, but too often the computation proper (like multiplying the integer by two) is way more straightforward than the decoding and encoding of the data used for the computation.

Generally speaking, decoding and encoding binary data is laborious and error prone. Think about sequences of elements, variable-length and clever compact encodings, elements not aligned to byte boundaries, the always bug-prone endianness, and a long etc. Dirty business, sometimes risky, and always boring.

Describe-Compute

This is where poke comes into play. Basically, it allows you to describe the characteristics of the data you want to compute on, and then decodes and encodes it for you, taking care of the gory details. That way you can concentrate your energy on the fun part: computing on the data at your pleasure.

Of course, you are still required to provide a description of the data. In Poke, these descriptions take the form of type definitions, which are declarative: you specify what you want, and poke extracts the how from that.

For example, consider the following Poke type definition:

type Packet =
  struct
  {
    uint<16> magic : magic == 0xef;
    uint<32> size;
    byte[size] data @ 64#B;
  };

This tells poke that, in order to decode a Packet, it should perform the following procedure (a similar procedure is implied for encoding):

Read four bytes from the IO space, mount them into an unsigned 16-bit integer using whatever current endianness, and put it in magic.
Read eight bytes, mount them into an unsigned 32-bit integer using the same endianness, and put it in size.
Seek the IO space to advance 16 bits.
Do size times:
- Read one byte and mount it into an unsigned 8-bit integer.
- Put the integer in the proper place in the data array.

If during this procedure an end of file is encountered, or a constraint fails, an appropriate error is raised.

In the procedure sketched above we find a sequence of operations, implied by the struct type, and a loop, implied by the array type. As we shall see below, it is also possible to decode conditionally. Union types are used for that purpose.

BSON

BSON is a binary encoding for JSON documents. The top-level entity described by the spec is the document, which contains the total size of the document, a sequence of elements, and a trailing null byte.

We can describe the above in Poke with the following type definition:

type BSON_Doc =
  struct
  {
    offset<int32,B> size;
    BSON_Elem[size - size'size - 1#B] elements;
    byte endmark : endmark == 0x0;
  };

BSON elements come in different kinds, which correspond to the different types of JSON entities: 32-bit integers, 64-bit integers, strings, arrays, timestamps, and so on. Every element starts with a tag, which is a 8-bit unsigned integer that identifies it's kind, and a name encoded as a NULL-terminated string. What comes next depends on the kind of element.

The following Poke type definition describes a subset of BSON elements, namely integers, big integers and strings:

type BSON_Elem =
  struct
  {
    byte tag;
    string name;

    union
    {
      int32 integer32 : tag == 0x10;
      int64 integer64 : tag == 0x12;
      BSON_String str : tag == 0x02;
    } value;
  };

The union in BSON_Elem corresponds to the variable part. When poke decodes an union, it tries to decode each alternative (union field) in turn. The first alternative that is successfully decoded without raising a constraint violation exception is the chosen one. If no alternative can be decoded, a constraint violation exception is raised.

To see this process in action, let's use the BSON corresponding to the following little JSON document ¹:

{
    "name" : "Jose E. Marchesi",
    "age" : 39,
    "big" : 1076543210012345
}

Let's take a look to the different elements ²:

(poke) .load bson.pk
(poke) var d = BSON_Doc @ 0#B
(poke) d.elements'length
0x3UL
(poke) d.elements[0]
BSON_Elem {tag=0x2UB,name="name",value=struct {str=BSON_String {size=0x11,value="Jose E. Marchesi",chars=[0x4aUB,0x6fUB,0x73UB,0x65UB,0x20UB,0x45UB,0x2eUB,0x20UB,0x4dUB,0x61UB,0x72UB,0x63UB,0x68UB,0x65UB,0x73UB,0x69UB,0x0UB]}}}
(poke) d.elements[1]
BSON_Elem {tag=0x10UB,name="age",value=struct {integer32=0x27}}
(poke) d.elements[2]
BSON_Elem {tag=0x12UB,name="big",value=struct {integer64=0x3d31c3f9e3eb9L}}

Note how unions decode into structs featuring different fields. What field is available depends on the alternative chosen while decoding.

In the session above, d.elements[1].value contains an integer32 field, whereas d.elements[2].value contains an integer64 field. Let's see what happens if we try to access the wrong field:

(poke) d.elements[1].value.integer64
unhandled invalid element exception

We get a run-time exception. This kind of errors cannot be catched at compile time, since both integer32 and integer64 are potentially valid fields in the union value.

Unions are Tagged

Unlike in C, Poke unions are tagged. Unions have their own lexical closures, and it is the capured values that determine what field is chosen at every time. Wherever the union goes, its tag accompanies it.

To see this more clearly, consider the following alternative Poke description of the BSON elements:

type BSON_Elem =
  union
  {
    struct
    {
      byte tag : tag == 0x10;
      string name;
      int32 value;
    } integer32;

    struct
    {
      byte tag : tag == 0x12;
      string name;
      int64 value;
    } integer64;

    struct
    {
      byte tag : tag == 0x12;
      string name;
      BSON_String value;
    } str;
  };

This description is way more verbose, but it shows a few important properties of Poke unions.

First, the constraints guiding the decoding are not required to appear in the union itself: it is a recursive process. In this example, BSON_String could have constraints on it's own, and these constraints will also impact the decoding of the union.

Second, there are generally many different ways to express the same plain binary using different type structures. This is no different than getting different parse trees from the same sequence of tokens using different grammars denoting the same language.

See how different a BSON element looks (and feels) using this alternative description:

(poke) d.elements[0]
BSON_Elem {str=struct {tag=0x2UB,name="name",value=BSON_String {size=0x11,value="Jose E. Marchesi",chars=[0x4aUB,0x6fUB,0x73UB,0x65UB,0x20UB,0x45UB,0x2eUB,0x20UB,0x4dUB,0x61UB,0x72UB,0x63UB,0x68UB,0x65UB,0x73UB,0x69UB,0x0UB]}}}
(poke) d.elements[1]
BSON_Elem {integer32=struct {tag=0x10UB,name="age",value=0x27}}
(poke) d.elements[2]
BSON_Elem {integer64=struct {tag=0x12UB,name="big",value=0x3d31c3f9e3eb9L}}

What is the best way? It certainly depends on the kind of data you want to manipulate, and the level of abstraction you want to achieve. Ultimately, it is up to you.

Generally speaking, the best structuring is the one that allows you to manipulate the data in terms of the structured abstractions as naturally as possible. That's the art and craft of writing good pickles.

GNU poke is evolving to facilitate this task as much as possible. Features like struct flattening and unification of union fields (i.e. to be able to have different alternatives with the same name but potentially different types) are being worked on, so stay tuned.

Happy poking! :)

Footnotes:

[1] To generate the BSON from the JSON I used libbson and its little example program json2bson

[2] The colorized poke sessions in this article have been generated using poke --color=html. A similar styling is also performed in the terminal, and it is fully customizable by the user. This is courtesy of Bruno Haible and his fantastic libtextstyle.