[Lazarus] UTF8 RTL for Windows

Hans-Peter Diettrich DrDiettrich1 at aol.com
Tue Nov 25 20:45:14 CET 2014


Mattias Gaertner wrote:
> On Tue, 25 Nov 2014 11:53:00 +0100
> Hans-Peter Diettrich <DrDiettrich1 at aol.com> wrote:
> 
>> [...]
>>> Correction: *This* Char type needs to be extended.
>> Please specify.
> 
> The ThousandSeparator type is "Char", which does not work with
> Russian in UTF-8. Well, at least if you want the non breakable space
> instead of the normal space.
> There are many cases where Char is enough.

So you admit that there are cases where Char is not enough :-]


>>> There is a Pos overload for strings. Where is the flaw in Pos?
>> The flaw is the added overload with a Char parameter.
> 
> I use that a lot. It is faster than the string variant.
> Why is that a flaw?

With a single-byte character set (SBCS) you can assume that a Char
holds any complete character. That is not true for a multibyte
character set (MBCS) such as UTF-8.

With CP_ACP set to UTF-8 you cannot assign 'ä' to a Char and then
search for it. Depending on your exact code, the compiler may not
detect that this assignment is invalid because it stores only *part* of
a multibyte sequence. A subsequent Pos with that partial character
cannot always yield the *expected* result; it might match an 'ö' or 'ü'
as well. In detail, that Pos overload has no indication of the codepage
of the Char, and consequently cannot enforce a conversion to the
encoding of the string parameter where one is required. The same
considerations apply to any StringReplace (or similar) overloads.
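
A minimal sketch of that failure, written in Python only because its
byte strings make the UTF-8 bytes visible. It is not RTL code:
pos_bytes and pos_byte are hypothetical helpers that mimic a 1-based
Pos with a string needle and with a one-byte (AnsiChar-like) needle.

```python
def pos_bytes(needle: bytes, haystack: bytes) -> int:
    """Mimic Pascal Pos: 1-based index of the first match, 0 if absent."""
    return haystack.find(needle) + 1


def pos_byte(needle: int, haystack: bytes) -> int:
    """Same search, but the needle is a single byte, like an AnsiChar."""
    return pos_bytes(bytes([needle]), haystack)


text = "grün".encode("utf-8")      # contains 'ü' but no 'ä'
a_umlaut = "ä".encode("utf-8")     # two bytes: C3 A4
partial = a_umlaut[0]              # all that fits into a one-byte Char: C3

print(pos_bytes(a_umlaut, text))   # whole sequence: not found
print(pos_byte(partial, text))     # partial char: hits the lead byte of 'ü'
```

The second search reports a match because 'ä', 'ö' and 'ü' all share
the lead byte C3 in UTF-8.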

Delphi users may think, like you, that a Char is sufficient in such
cases. They are right insofar as in Unicode Delphi a Char is a WideChar
and a String is a UnicodeString, so such optimizations work with BMP
characters. [Users of MBCS/non-BMP character sets already know that Char
is quite useless for text processing]
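
The BMP limit can be seen by counting 16-bit code units. This is a
sketch in Python, not Delphi code, and utf16_units is a hypothetical
helper; the assumption is that a WideChar corresponds to exactly one
UTF-16 code unit.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


bmp_char = "ä"               # U+00E4, inside the BMP
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP

print(utf16_units(bmp_char))      # 1: fits into a single WideChar
print(utf16_units(non_bmp_char))  # 2: needs a surrogate pair
```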

But when such code is compiled with FPC/Lazarus and the new RTL, where
String is AnsiString and the default encoding is UTF-8, the same code
will not work properly. That's why I consider Char (=AnsiChar) dangerous
in the new RTL: it causes obscure program errors.

Removing Char, perhaps in some special compiler mode, would make it
possible to identify all *possibly* wrong uses of the *generic* Char.
The code can then be fixed in various ways, e.g. by replacing Char with
WideChar or UCS4Char (4 bytes), removing overloads with Char parameters,
or whatever else prevents inadvertent misuse of constants, variables,
fields or parameters of Char type.

Please note that Delphi compatibility is not a valid argument as long
as FPC/Lazarus differs in the declaration of the generic String and Char
types. Delphi made the Unicode move in *one* step, retyping both String
and Char at the same time and (effectively) deprecating AnsiString. This
at least makes legacy code applicable to BMP text, where a WideChar is
sufficient to hold any character value, while legacy MBCS code continues
working without surprises.


>> Furthermore the Pos arguments should never be subject to automatic 
>> conversion, otherwise the returned index will be useless.
> 
> You can argue the same way in the direction: If it does not
> automatically convert it will find crap.

That's why the *original* declaration, with both parameters of type
String, *allows* all required conversions to be identified and
performed. A Char type, without an encoding indicator, prevents such
checks and conversions both at compiler level (in translating the call)
and inside the function code.
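
Why an index from a silently converted argument is useless can be
sketched as follows. This is Python rather than Pascal, pos_bytes is a
hypothetical stand-in for a 1-based Pos, and CP1252 is only assumed as
the example ANSI codepage.

```python
def pos_bytes(needle: bytes, haystack: bytes) -> int:
    """Mimic Pascal Pos: 1-based index of the first match, 0 if absent."""
    return haystack.find(needle) + 1


word = "Grüße"
ansi = word.encode("cp1252")   # one byte per character: 5 bytes
utf8 = word.encode("utf-8")    # 'ü' and 'ß' take two bytes each: 7 bytes

print(pos_bytes(b"e", ansi))   # 5
print(pos_bytes(b"e", utf8))   # 7
```

An index computed on the converted copy does not address the same
character in the caller's original string.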


>>>> In the best case Char could be retyped into an string (substring),
>>> That would be wrong in 99.9% of the cases.
>> Please give at least one example.
> 
> Retype "Char" to "String" and the compiler will bark. For example in
> Graphics.

How is *graphic* information related to *text*? Using Char for Byte,
only because strings offer some coding convenience, is another flaw.

Delphi has long discouraged the use of strings for holding anything
but text. The continued abuse of strings for other types of information
will now result in errors whenever an (implicit) string conversion
occurs in some library routine, as can easily happen with encoded
AnsiStrings. Unfortunately Delphi missed the chance to simply add an
"unencoded" AnsiString encoding, which would have prevented any
conversion of such string variables. The RawByteString type, despite
its name, was added for quite a different purpose, *not* as a way to
safely store arbitrary bytes in such strings.
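
What such a conversion does to non-text data can be sketched like this.
It is Python, not Delphi or FPC code, and it assumes one particular
implicit conversion (bytes taken as Latin-1 text and converted to
UTF-8); the actual codepages involved depend on the program.

```python
# First four bytes of a PNG file, kept in a "string" as raw data.
raw = bytes([0x89, 0x50, 0x4E, 0x47])

# Assumed implicit conversion: Latin-1 text converted to UTF-8.
converted = raw.decode("latin-1").encode("utf-8")

print(raw.hex())         # 89504e47
print(converted.hex())   # c289504e47
```

The byte 89 has become the two bytes C2 89, so the data is one byte
longer and no longer a valid PNG signature.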

DoDi



