[crossfire] Data File (Maps, Archetypes) Encodings
Mark Wedel
mwedel at sonic.net
Tue Feb 6 02:03:49 CST 2007
Christian Hujer wrote:
> Hello dear co-devs.
>
>
> We have a common problem with the text encodings of data files.
>
> Examples:
> * Daimonin used (until a few minutes ago) the ISO paragraph character 0xA7 for
> separating a map's sound spec from its name.
> * Daimonin uses the ISO degree character 0xB0 for highlights in messages.
> * Crossfire uses the a circumflex character 0xE2 for the name of a wine in
> map /maps/scorn/houses/house3.bas2.
Not sure if still the case, but at one time there were some objects that also
used special characters - Mjølnir comes to mind.
>
> This leads to some problems.
> * Crossfire x11 client displays 0xE2 as a circumflex.
> * Crossfire gtk client displays 0xE2 as ? (tested by Ragnor).
And it appears that the GTK2 client won't draw the line/message at all if it
contains the bad character.
>
> For both projects, it makes sense to rethink the file formats. I see three
> possible solutions:
>
> 1. Use US-ASCII text only.
> That means, only data files with bytes 0x13, 0x20-0x7E are valid.
> Pro: easy
> Pro: stable
> Pro: no changes required.
> Con: very limited solution
And one that is not actually in effect right now: as demonstrated, some
non-ASCII characters have already made their way in.
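A rough sketch of how one could find such bytes in a map or archetype file.
This is only an illustration, not an existing Crossfire tool: the function name
find_non_ascii is made up, the sample line is hypothetical, and I'm assuming
the 0x13 in option 1 is meant to be the line feed byte 0x0A.

```python
def find_non_ascii(data: bytes):
    """Return (line, column, byte) for every byte outside option 1's set.

    Assumption: the allowed set is line feed (0x0A) plus printable
    US-ASCII (0x20-0x7E).
    """
    allowed = {0x0A} | set(range(0x20, 0x7F))
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte not in allowed:
                hits.append((lineno, col, byte))
    return hits


if __name__ == "__main__":
    # Hypothetical object name containing a raw 0xE2 byte.
    sample = b"name Ch\xe2teau\n"
    for lineno, col, byte in find_non_ascii(sample):
        print(f"line {lineno}, column {col}: byte 0x{byte:02X}")
```

Running something like this over the maps and archetypes would give a list of
the places that need attention, whichever option is picked.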
>
> 2. Use ISO-8859-15 text.
> That means, bytes 0x13, 0x20-0x7E, 0xA0-0xFF are valid.
> Pro: easy
> Con: clients need special handling for non-ascii chars if they are UTF-8 aware
> and run on UTF-8 systems (e.g. gtk client).
> Con: limited solution
>
> 3. Use UTF-8 text.
> That means, only valid UTF-8 streams with Unicodes u0013, u0020-u007E,
> u00A0-... are valid.
> Pro: future-proof
> Pro: Allows full unicode (e.g. Chinese chars if somebody likes, or even
> klingon if the underlying system supports it).
> Con: clients need special handling.
> Con: Windows users or users of other ancient OS editions with no good UTF-8
> support will have more problems than with ISO-8859-15.
>
> I see two places, where the encoding needs to be specified:
> * Data files
> * Network protocol
>
> My favorite solution would be 3. UTF-8, followed by 1. US-ASCII. I dislike 2.
> ISO-8859-15 very much.
#3 probably makes the most sense, and at least for the gtk2 client, it looks
like it would actually be handled properly (the error the client reports for
the wine bottle is about an invalid UTF-8 character).
Also, I'm not sure how easy #2 is - it is easy for a person writing the maps
or archetypes, but as demonstrated, pretty much all clients would have to do
special string handling.
#3 does make it harder for people putting the strings in (I'd think the map
editors could try to do the right thing in those cases and convert ISO-8859-15
characters to Unicode).
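As a minimal sketch of what that conversion could look like (the function name
latin9_to_utf8 is hypothetical, and this assumes the existing files are either
already valid UTF-8 or else ISO-8859-15):

```python
def latin9_to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8.

    Data that already decodes as UTF-8 (pure ASCII included) is returned
    unchanged, so running the conversion twice is harmless. Anything else
    is assumed to be ISO-8859-15.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode("iso8859_15").encode("utf-8")


if __name__ == "__main__":
    # 0xF8 is the ISO-8859-15 byte for the o-with-stroke in Mjølnir.
    print(latin9_to_utf8(b"Mj\xf8lnir"))
```

The one case this gets wrong is ISO-8859-15 text whose bytes happen to form a
valid UTF-8 sequence. That should be rare in practice, but it means the result
is worth a look by a person rather than being trusted blindly.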
So I'd vote Unicode. I'd suspect that for clients that don't support UTF-8,
things won't really be any more broken than right now - the client would display
a funky character instead of the correct one. But I don't believe that would
break any portion of the clients or protocol.
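To illustrate the "funky character" case, here is a small sketch of both
sides. The function names are made up, and treating a non-UTF-8 client as one
that reads each byte as a Latin-1 character is an assumption for the sake of
the example, not a description of any particular client's code.

```python
def legacy_client_view(raw: bytes) -> str:
    """What a client unaware of UTF-8 might show: one character per byte.

    Assumption: the client treats the bytes as Latin-1.
    """
    return raw.decode("latin-1")


def tolerant_utf8_view(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for invalid bytes.

    This keeps the rest of the line instead of dropping the message.
    """
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    utf8_name = "Mj\u00f8lnir".encode("utf-8")  # two bytes for the ø
    print(legacy_client_view(utf8_name))        # ø shows up as two characters
    print(tolerant_utf8_view(b"Ch\xe2teau"))    # lone 0xE2 becomes U+FFFD
```

Either way the string still gets through; only the one character looks wrong.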