Code Page 437: 2015

Wednesday, 13 May 2015

cat unknown

Postgresql can store timestamps. It prefers to store timestamps with timezone information. You can store a timestamp, date, time or interval.

Postgresql also lets you use strings when inserting or selecting the date, which you can then cast. For example:

select 'Mon May 11 11:21:31 SAST 2015'::date;

This works, but only for some timezones:

select 'Mon May 11 11:21:31 CAT 2015'::date;
ERROR: invalid input syntax for type date: "Mon May 11 11:21:31 CAT 2015"
LINE 1: select 'Mon May 11 11:21:31 CAT 2015'::date ;

CAT is unknown.

SAST is South African Standard Time. CAT is Central African Time.

africa behind

My first reaction was that the African locales had been left behind again. I've encountered this before. For example: Botswana's Pula currency, or Zambia's currency, in Java 1.4 and Java 1.6.

It turns out that this isn't the reason. It's just part of the reason.

cat isn't always a cat

Postgresql exposes its timezone lookups through two system views: pg_timezone_names and pg_timezone_abbrevs.

pg_timezone_names works with the full timezone names, e.g. Asia/Ho_Chi_Minh. pg_timezone_abbrevs works with the abbreviations. However, pg_timezone_abbrevs must do forward and reverse lookups, and it turns out that timezone abbreviations aren't unique.

For example, at Wikipedia you'll find that ACT can mean Acre Time, or ASEAN Common Time. It's simply not possible for Postgresql to know what every abbreviation means.

As a note: there is only one CAT in that table.

problems

os, java and postgresql

It's frustrating that the OS, Java and Postgresql do not share locale information. It would be nice if all three services actually provided exactly the same information, but they don't.

Zambia's currency changed in January 2014. Was the OS updated? Java? Postgresql? At the same time?

testing

Testing for these strings is hard. Do you know if all your applications work with the input?

africa really is left behind

There is only one CAT, yet Postgresql does not add it, and will not add it (I filed a bug report, #13267). Yet Postgresql does add CST, which has five different meanings. And it adds the North American -- and only the North American -- definition.

solutions

early input sanitation

That string should never have reached the database. Input should be sanitized and converted away from ambiguous strings as early as possible. This is just good programming practice.

late output representation

The output should be converted to human readable strings as late as possible. Again, this is good programming practice.

packaging

There is a very real packaging problem here. OS, Postgresql and Java definitely should have the same information. There are likely other packages with their own information.

unconfirmed

I haven't checked which Linux (or Windows, or MacOS X) distributions have different information for these services. I have checked that some (RedHat Enterprise 5 and 6, Postgresql 9.2 and 9.3, Java 1.4, 1.6 and 1.7) have different information for some of the values.

Monday, 27 April 2015

-fsigned-char

Ancient History

I maintain a proprietary Cobol machine. It was originally written with DOS and mainframes in mind. It was ported across to Xenix, Unix, and Linux (in the kernel 1.2 days) and recently to 64-bit Linux.

The code includes an editor, a compiler, and a runtime. The code is compiled to its own machine. The runtime is much like a Java runtime: a VM that executes code (as opposed to a VM that runs another OS).

The code is in K&R C.

Yesterday I ported it to ARM. To a Raspberry Pi, specifically.

A Raspberry Pi has about 1,000 times more horsepower than the machines this code used to run on, so it's not so far from its origins.

Wild free() chase

The C code would compile and run, but it would also crash soon after starting a VM and executing some code.

This was expected. When porting the code to 64-bit, I had some similar issues. And I expect similar issues when porting to big endian machines.

So I set off with gdb and valgrind -- wonderful tools -- and soon found a bunch of use-after-free.

However, this doesn't happen on Intel. And use-after-free is an unexpected crash.

unsigned char

The error was actually triggered by code similar to:

char c = -1; // -1
int i = c; // -1
call_something(i /* -1 */);

The problem is caused by a peculiar characteristic of C: the signedness of plain char is implementation-defined. char can be signed or unsigned, depending on the compiler and the platform.

On x86-32 and x86-64 char is signed. On ARM char is unsigned. On ARM, the code becomes:

char c = -1; // 255
int i = c; // 255
call_something(i /* 255 */);

The difference arises because C's integer promotion preserves the value: a signed char holding -1 becomes the int -1, while an unsigned char holding the same bits holds 255 and becomes the int 255.

ebcdic

C also doesn't specify if char is ASCII or EBCDIC, or another character set. If it is ASCII it doesn't specify if it's codepage 437 or 850.

char-ed heap

The code is actually using a char[] as a heap, and putting and pulling variables on that heap.
It's common for code to use its own stack, but a heap is a bit less common.

This means that char[1234] might have different uses.

(signed char)

Casting is a possibility, but it's very hard to do right in practice. I'll just give one example:

Implicit Promotion

When porting to 64-bit, I had to take care of variable argument functions. In modern C this is done with

#include <stdarg.h>

void foo(char *fmt, ...){
    va_list ap;
    va_start(ap, fmt);
    int i = va_arg(ap, int);
    ...
    va_end(ap);
}

This code will behave like:

void foo(char *fmt, int i){ }

However, it will auto-promote chars. You cannot use va_arg(ap, char). va_list and friends are macros. So you have to be very careful if you think you'll grab all instances of char.

Are you confident you'll (unsigned char) everywhere?

s/char/signed char/g

The next reaction is to just replace char with signed char in the variable declarations. This has its own set of problems.

Some calls are defined as taking char. Feeding them specifically signed or unsigned chars can upset them.

However, this is the long-term solution. Define the data correctly, and unambiguously.

I have 1304 instances to evaluate.

typedef char

The more correct way, in this case, is to typedef some structures to signed or unsigned chars. This is indeed what should be done, and has been done, for some of the data to make it portable.

In other words, where we use char as a byte, we want to use sbyte and ubyte, and not char. char should be used for characters.

-fsigned-char

gcc comes with a flag: -fsigned-char.

This will make the code behave as if char is signed. It's specifically meant for this. To make this work, you can try the following in configure.ac:

AC_C_CHAR_UNSIGNED
if test $ac_cv_c_char_unsigned = yes && test "$GCC" = yes; then
    CFLAGS="$CFLAGS -fsigned-char"
fi

(void *)-1

NULL is a pointer

In C, NULL is usually a special case pointer. It can mean the end of a string or list, or it can mean an error.

List Terminator

// list of strings
char * list[] = {
    "apple",
    "pear",
    NULL
};

Then we can use the following code to loop over the list:

char ** s;
for (s = list; *s; ++s) {
    printf("%s\n", *s);
}

Error

if (malloc(-1) == NULL) return (-1);

Valid Pointer

NULL is also a pointer. NULL is 0x0. It points to the very first block of RAM. On the 8086 this would be the interrupt vector table. On the Commodore 64 you might get the processor port data direction.

Unless you're the kernel and interested in hardware you're unlikely to care.

Today we care.

Today we want to use pointers to mark specific conditions that don't have a specific meaning in C. For example, in the list of strings above we might want to warn about an uninitialized string.

In particular, today I had to build a list recursively, and then loop over it and free() the elements. However, it was possible to have an empty entry.

We can't have:

// list of strings
char * list[] = {
    "apple",
    "pear",
    NULL,
    "banana",
    NULL
};

With this list we'd never get to "banana". We'd never free it. There would be a memory leak.

malloc() == NULL => success

First, we'll discuss when NULL is valid.

Under one condition malloc() will return NULL, but it will also be a success. If malloc() returns 0x0 as a genuine pointer to the beginning of RAM, the allocation has actually succeeded, but the caller will read the result as an error.

This can happen when we do malloc(sizeof(everything)). Everything might be all RAM, but is more likely to be all map-able memory, which will include swap. It will be even bigger if overcommit is enabled.

When this happens, the memory allocated must start at 0. If it starts at 0x1, it misses the first byte, so it hasn't allocated everything.

Therefore malloc(sizeof(everything)) can only ever return 0x0 -- on both success and failure.

If 0x0 points to a hardware-specific area, the beginning of RAM might be remapped away, and malloc(sizeof(everything)) might point to the start of RAM, say 0x100. However, if this is the case, 0x0 is still valid: it points to a hardware-specific area.

0xdeadbeef

When debugging programs it can help to initialise variables to something easily visible. The default in C is uninitialized: whatever is in RAM. The default when using bzero() is 0x0.

The default is 0x0 because it initializes all memory to NULL, which will terminate a string or a list in C. It also appeals to psychology: unused memory is empty.

When debugging it helps to initialize memory to something easily visible in a debugger. Unfortunately there are lots of valid zeroes. So if we initialize to 0x0, we don't know if the variable was never used, or actually is set to zero.

Debuggers also speak hex, so we want a non-zero value that's easy to spot in hex. Hence: 0xdeadbeef, which is easy to spot and almost English.

(void *)-1

Just as 0x0 is an interesting answer, so is -1, or 0xffffffff (on 32-bit machines).

Is (void *)-1 a valid pointer?

(void *)-1 can only ever point to a one-byte block of RAM. It can only ever point to malloc(1). It cannot contain more than one byte.

(void *)-1 is not valid on any machine that requires memory to be word or page or int aligned.

(void *)-1 might be valid in a memory allocator that allocates top-down, but even then it's likely to skip, to allow for easy realloc().

In Use

If you need a list of strings in which an entry could be empty, you might use:

char * list[] = {
    "apple",
    "pear",
    (void *)-1,
    "banana",
    NULL
};

Now we can check for (void *)-1 to know that the list isn't finished (not NULL), but that the contents at this index are uninitialized.