CharacterEncoding

From Gnash Project Wiki

Dealing with different encodings is a necessary feature.

SWF5 supports three different 8-bit character encodings and can deal with multibyte characters using native methods.

SWF6 and above support Unicode, although the System.useCodepage setting can change this.

Gnash's present implementation uses a home-made UTF-8-to-wide-character conversion. All normal string methods work on the resulting wide string, but it cannot do case conversion.
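As a rough illustration of what such a conversion involves, here is a minimal sketch of a lenient UTF-8 decoder. It is not Gnash's actual code: the name decodeUtf8 is hypothetical, and std::u32string is used instead of std::wstring only so that the example behaves the same where wchar_t is 16 bits wide. Malformed bytes are passed through as single code points rather than rejected, and overlong forms are not checked.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not Gnash's implementation): decode UTF-8 bytes into
// code points. Malformed input is kept byte-by-byte instead of being rejected.
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;   // continuation bytes announced by the lead byte
        char32_t cp = b;
        if (b >= 0xC0 && b <= 0xDF)      { extra = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF7) { extra = 3; cp = b & 0x07; }

        // The sequence is usable only if all continuation bytes are present
        // and have the form 10xxxxxx.
        bool ok = (i + extra < in.size()) || extra == 0;
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { cp = b; extra = 0; }   // fall back to the raw byte

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Once decoded, operations such as length or indexing work on whole characters, but nothing here knows how to map upper case to lower case.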

Case Conversion

Case conversion is not an aspect of Unicode. There is in fact no way to convert correctly from upper to lower case (assuming a language has exactly two cases) without an understanding of the language and the meaning of the text in that language. The commonest example is the Greek capital letter Σ, which as a minuscule can be ς, but also σ, depending on its position in the word.

Any automatic conversion is consequently imperfect, relying on an approximate mapping of upper- and lower-case letters. It is unlikely that any two implementations will agree, but this is probably not important.
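The sigma case can be sketched as follows. This is a deliberately simplified illustration, not a proposal for Gnash: lowerGreek and isGreekLetter are hypothetical names, only the basic Greek capitals are handled, and the "end of word" test is much cruder than the rule a full library applies.

```cpp
#include <cstddef>
#include <string>

// Very rough letter test: only the basic Greek alphabet counts as a letter.
bool isGreekLetter(char32_t c)
{
    return (c >= 0x0391 && c <= 0x03A9) || (c >= 0x03B1 && c <= 0x03C9);
}

// Hypothetical helper: lower-case basic Greek capitals. Capital sigma becomes
// final sigma (U+03C2) at the end of a word and ordinary sigma (U+03C3) elsewhere.
std::u32string lowerGreek(const std::u32string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (c == 0x03A3) {
            const bool letterBefore = i > 0 && isGreekLetter(in[i - 1]);
            const bool letterAfter = i + 1 < in.size() && isGreekLetter(in[i + 1]);
            out.push_back(letterBefore && !letterAfter ? 0x03C2 : 0x03C3);
        } else if (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2) {
            // The remaining basic capitals map to lower case at a fixed offset.
            out.push_back(static_cast<char32_t>(c + 0x20));
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

A simple one-to-one table cannot express this, because the same input character has two possible outputs depending on its neighbours.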

C++

Standard C++ offers no way to do case conversion (sensibly, as there is no standard way to do it). If an available system locale offers case conversion, C++ can use it, but there is neither any way of knowing which locales are available, nor any way to know which ones of them can convert particular character sets.
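A minimal sketch of the locale route, assuming a helper named toUpperWithLocale (hypothetical). It uses the standard std::ctype<wchar_t> facet and reports failure when the requested locale cannot be constructed, which is the only point at which a program learns that a locale is unavailable.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: upper-case `s` in place using the named system locale.
// Returns false, leaving `s` untouched, if that locale is not available.
bool toUpperWithLocale(std::wstring& s, const char* localeName)
{
    try {
        const std::locale loc(localeName);   // throws if the name is unknown
        const std::ctype<wchar_t>& facet =
            std::use_facet<std::ctype<wchar_t> >(loc);
        if (!s.empty()) {
            facet.toupper(&s[0], &s[0] + s.size());
        }
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Even when the call succeeds, which characters are actually converted depends on the locale data installed on the machine, so results can differ from one system to the next.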

The only cross-platform way of implementing case conversion is an external library. Whichever library is chosen, its use in Gnash should take the following draft into account.

Drafts for a new implementation in Gnash

Possible external libraries are:

  1. libICU: 14MB, so quite large, C++ interface, many more features than necessary.
  2. glib / cairo: Cairo additionally provides advanced text rendering capabilities.

Points to take into account:

  1. Encoding depends on SWF version. String handling must use the VM version to decide how to treat strings.
  2. If you feed SWF5 a UTF-8 string, it will treat it as nonsensical characters. It can only interpret 7- or 8-bit character strings, so there must be a way of treating UTF-8 and Unicode differently (one way would be to convert UTF-8 to a nonsensical string and then store that internally as UTF-8).
  3. Every string operation, such as searching, splitting, or finding the length, needs a decoding step (in this draft, a call to an external library), because the byte count of a UTF-8 string does not give its character count.
  4. Strings do not change state once read in. That is, if System.useCodepage is changed, it only affects new strings loaded externally.
  5. libICU does not match the Flash interpretation of UTF-8. Where Flash will happily return illegal characters, libICU refuses.
    • This may be solvable by providing custom data for libICU. On the other hand, it may not.
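The dependence on SWF version can be sketched roughly as below. The names Encoding, encodingFor and lengthInCharacters are hypothetical, and the rule that System.useCodepage switches SWF6+ back to 8-bit handling is an assumption made for illustration. The length is counted by hand here only to show why the same bytes give different answers; the draft expects a library to do the decoding.

```cpp
#include <cstddef>
#include <string>

enum class Encoding { EightBit, Utf8 };

// Assumed rule: SWF6+ strings are UTF-8 unless System.useCodepage is set;
// SWF5 and below always use an 8-bit interpretation.
Encoding encodingFor(int swfVersion, bool useCodepage)
{
    return (swfVersion >= 6 && !useCodepage) ? Encoding::Utf8
                                             : Encoding::EightBit;
}

// Number of characters in `s` under the given interpretation.
std::size_t lengthInCharacters(const std::string& s, Encoding enc)
{
    if (enc == Encoding::EightBit) return s.size();   // one byte, one character

    std::size_t count = 0;
    for (unsigned char b : s) {
        // Continuation bytes (10xxxxxx) belong to the preceding character.
        if ((b & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the two bytes C3 A9, this reports one character under SWF6 and two under SWF5, which is the "nonsensical characters" behaviour described in point 2.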

At present, every string is stored as a std::string, either in UTF-8 or 8-bit (Latin-1) form. On every string operation, the entire string is converted to a std::wstring so that the operation can be carried out. The result is then converted back to the appropriate std::string form.

This is a big performance hit. It can be optimized to stop when enough characters have been read to carry out the operation. This applies to charAt(), indexOf() etc, but certain operations still require the whole string to be converted, for instance:

  • length
  • lastIndexOf
  • split

A real-world case of poor performance is the Twitter badge SWF, which makes several million calls to std::string::const_iterator's increment operator. This is because it calls charAt many times on a fairly long string, causing the string to be converted back and forth several thousand times. Even after a recent optimization (stopping when the index is reached), performance is poor.

Performance drops when conversion happens too much. Conversion is necessary between IO (logging, reading in strings), which requires std::string, and string operations (length, editing, finding glyphs). At present, the core requires std::strings for property names and values (string_table).

A good implementation will keep the amount of conversion to a minimum.

Implementation 1

Central CharacterDecoder class

Advantages:

  • Requires few changes
  • Easy to swap for a different implementation
  • Can keep track of VM version

Disadvantages:

  • Does not solve the performance problem
  • Cannot store state of strings (codepage or not).
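One possible shape for such a class is sketched below. Everything in it is an assumption for illustration: the member names, the use of std::u32string as the decoded form, the treatment of SWF5 strings as one byte per character, and the lenient UTF-8 decoding that does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical central decoder: all conversions go through one object that
// knows the VM version, so the implementation can be swapped in one place.
class CharacterDecoder
{
public:
    explicit CharacterDecoder(int swfVersion) : _version(swfVersion) {}

    // Decode stored bytes into code points for string operations.
    std::u32string decode(const std::string& in) const
    {
        std::u32string out;
        for (std::size_t i = 0; i < in.size();) {
            const unsigned char b = static_cast<unsigned char>(in[i]);
            std::size_t len = 1;   // SWF5, ASCII and stray bytes: one byte each
            if (_version >= 6) {
                if (b >= 0xC0 && b <= 0xDF) len = 2;
                else if (b >= 0xE0 && b <= 0xEF) len = 3;
                else if (b >= 0xF0 && b <= 0xF7) len = 4;
            }
            if (len == 1 || i + len > in.size()) {
                out.push_back(b);   // single byte, or truncated sequence kept raw
                ++i;
                continue;
            }
            char32_t cp = b & (0xFF >> (len + 1));   // payload bits of lead byte
            for (std::size_t k = 1; k < len; ++k) {
                cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
            }
            out.push_back(cp);
            i += len;
        }
        return out;
    }

    // Encode code points back into the byte form used for storage and IO.
    std::string encode(const std::u32string& in) const
    {
        std::string out;
        for (char32_t c : in) {
            if (c < 0x80 || (_version < 6 && c < 0x100)) {
                out += static_cast<char>(c);
            } else if (_version < 6) {
                out += '?';   // not representable in an 8-bit string
            } else if (c < 0x800) {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += static_cast<char>(0xE0 | (c >> 12));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (c >> 18));
                out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    // Every operation pays for a full decode, as in the current design.
    std::size_t length(const std::string& s) const { return decode(s).size(); }

private:
    int _version;
};
```

The sketch makes both disadvantages visible: length() still decodes the whole string on every call, and the decoder sees only the VM version, so it cannot tell whether a particular string was loaded while System.useCodepage was set.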

Implementation 2

Custom String class

Advantages:

  • Can keep track of string state (codepage or not).
  • Should fix performance problems.
  • Can be implementation-swapped if carefully designed.

Disadvantages:

  • Requires large changes to Gnash core.
  • Implementation is not trivial.
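A rough sketch of how a custom string class could address both points. The class name GnashString and its members are hypothetical, and well-formed UTF-8 is assumed. Each string remembers the encoding it was read in with, keeps its original bytes for IO, and lazily builds an index of character start offsets once, so later length and charAt calls need no further scanning.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical string type: bytes plus per-string encoding state.
class GnashString
{
public:
    enum class Encoding { EightBit, Utf8 };

    GnashString(std::string bytes, Encoding enc)
        : _bytes(std::move(bytes)), _enc(enc), _indexed(false) {}

    // Raw bytes for logging and other IO; no conversion needed.
    const std::string& bytes() const { return _bytes; }

    // Fixed when the string is read in; later codepage changes do not apply.
    Encoding encoding() const { return _enc; }

    std::size_t length() const
    {
        ensureIndex();
        return _starts.size();
    }

    // Bytes of the character at `i`, or an empty string if out of range.
    std::string charAt(std::size_t i) const
    {
        ensureIndex();
        if (i >= _starts.size()) return std::string();
        const std::size_t begin = _starts[i];
        const std::size_t end =
            (i + 1 < _starts.size()) ? _starts[i + 1] : _bytes.size();
        return _bytes.substr(begin, end - begin);
    }

private:
    // Record the byte offset at which each character starts, once per string.
    void ensureIndex() const
    {
        if (_indexed) return;
        for (std::size_t i = 0; i < _bytes.size(); ++i) {
            const unsigned char b = static_cast<unsigned char>(_bytes[i]);
            if (_enc == Encoding::EightBit || (b & 0xC0) != 0x80) {
                _starts.push_back(i);
            }
        }
        _indexed = true;
    }

    std::string _bytes;
    Encoding _enc;
    mutable std::vector<std::size_t> _starts;   // lazily built offset index
    mutable bool _indexed;
};
```

The index costs extra memory per string, and anything in the core that currently expects a plain std::string (such as the string_table) would need to change, which is where the large changes come from.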