Text handling

One of the most important classes in .NET is System.String; its C# counterpart is the "string" keyword, an alias for the same class.

Key characteristics:

it is a reference type
it may contain null characters ('\0')
it is immutable
the == comparison operator is overloaded, so it compares values rather than references (two strings stored at different locations can therefore be equal; see also: equality of reference and value types)
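The value-based == comparison and the embedded null character can be checked with a short sketch. The class name StringEqualityDemo and the helper Rebuild are hypothetical names introduced only for this illustration.

```csharp
using System;

public static class StringEqualityDemo
{
    // Hypothetical helper: copies the characters into a new string object at run time,
    // so for a non-empty input the result is a different instance with the same value.
    public static string Rebuild(string s) => new string(s.ToCharArray());

    public static void Main()
    {
        string a = "apple";
        string b = Rebuild(a);

        Console.WriteLine(a == b);                        // True: == compares the values
        Console.WriteLine(object.ReferenceEquals(a, b));  // False: two distinct objects

        string withNull = "ab\0cd";
        Console.WriteLine(withNull.Length);               // 5: '\0' counts as an ordinary character
    }
}
```

The second string is built at run time on purpose: two identical literals may be interned by the compiler and then share a single instance, which would hide the difference between value and reference equality.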

Immutable means that once a value has been assigned, the value stored in memory cannot be modified.

    string s;
    s = "My name is Gipsz Jakab";
    s = s + " Márton.";           // a new instance is created in memory, holding the value
                                  // concatenated with the + operator; s no longer refers
                                  // to the original value, and a string that nothing references
                                  // any more can be reclaimed by the garbage collector.
    System.Console.WriteLine(s);  // My name is Gipsz Jakab Márton.

Individual characters of a string can be read through its indexer (get), but they cannot be modified directly (set):

    string s = "Árvizes fúrópajzs";
    for ( int i = 0; i < s.Length; ++i )
    {
        System.Console.Write( s[i] + " " );
        //s[i] = '_'; // assignment is not allowed
        //compiler error CS0200: Property or indexer 'string.this[int]' cannot be assigned to -- it is read only
    }

Difference between the String and StringBuilder classes…✘
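A minimal sketch of the practical difference, assuming the typical use case of building one text from many pieces. The class and method names (ConcatDemo, JoinWithString, JoinWithBuilder) are hypothetical and serve only as illustration.

```csharp
using System;
using System.Text;

public static class ConcatDemo
{
    // Each += creates a new string holding the old content plus the new piece;
    // the previous intermediate strings become garbage.
    public static string JoinWithString(string[] parts)
    {
        string result = "";
        foreach (string part in parts)
        {
            result += part;
        }
        return result;
    }

    // StringBuilder appends into an internal, growable buffer and
    // produces the final string once, in ToString().
    public static string JoinWithBuilder(string[] parts)
    {
        var builder = new StringBuilder();
        foreach (string part in parts)
        {
            builder.Append(part);
        }
        return builder.ToString();
    }

    public static void Main()
    {
        string[] parts = { "My name is ", "Gipsz ", "Jakab ", "Márton." };
        Console.WriteLine(JoinWithString(parts));
        Console.WriteLine(JoinWithBuilder(parts));
    }
}
```

Both methods return the same text. The difference lies in the number of intermediate objects, which tends to matter only when many pieces are appended, for example in a loop.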

Encoding

A character is an abstract concept; the physical storage of a character can take several binary forms. The mapping between a character and its bit pattern is defined by code pages (character encodings). Encoding is the character→byte conversion, and decoding is the conversion of bytes back into characters. The .NET Framework and the CLR runtime represent text in UTF-16, whose code units are 2 bytes long. Most characters fit into a single code unit (one Char); characters outside the Basic Multilingual Plane need two code units, and the documentation calls such a pair a surrogate pair.

Encodings are handled by the abstract Encoding class in the System.Text namespace.
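The encode/decode round trip can be sketched with the built-in Encoding.UTF8 and Encoding.Unicode (UTF-16 little endian) instances. The class name EncodingDemo and the helper RoundTrip are hypothetical; the sample text is written with \u escapes so that the result does not depend on how the source file is saved.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: encodes the text to bytes, then decodes the bytes back.
    public static string RoundTrip(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);   // encoding: characters -> bytes
        return encoding.GetString(bytes);         // decoding: bytes -> characters
    }

    public static void Main()
    {
        string text = "\u00C1rv\u00EDz";   // "Árvíz": 5 characters, two of them accented

        Console.WriteLine(Encoding.UTF8.GetBytes(text).Length);     // 7: the accented letters take 2 bytes each
        Console.WriteLine(Encoding.Unicode.GetBytes(text).Length);  // 10: 2 bytes per UTF-16 code unit
        Console.WriteLine(RoundTrip(Encoding.UTF8, text) == text);  // True
    }
}
```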

Storage

Storage in the COM era (and in .NET 1.0):

✘The string is stored as a BSTR: the length comes first, followed by the payload in Unicode format (on a PC: 16-bit, little-endian byte order), and finally a null character. The total memory footprint of a string is therefore: 16 B + length × 2 B + 2 B

MSDN source

The following quotation comes from the MSDN help. Be careful, it can be a little misleading…

Remarks

A string is a sequential collection of Unicode characters that is used to represent text. A String object is a sequential collection of System.Char objects that represent a string. The value of the String object is the content of the sequential collection, and that value is immutable.

A String object is called immutable (read-only) because its value cannot be modified once it has been created. Methods that appear to modify a String object actually return a new String object that contains the modification. If it is necessary to modify the actual contents of a string-like object, use the System.Text.StringBuilder class.

Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.

A single Char object usually represents a single code point; that is, the numeric value of the Char equals the code point. However, a code point might require more than one encoded element. For example, a Unicode supplementary code point (a surrogate pair) is encoded with two Char objects.

Indexes

An index is the position of a Char object, not a Unicode character, in a String. An index is a zero-based, nonnegative number starting from the first position in the string, which is index position zero. Consecutive index values might not correspond to consecutive Unicode characters because a Unicode character might be encoded as more than one Char object. To work with each Unicode character instead of each Char object, use the System.Globalization.StringInfo class.
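The difference between Char positions and Unicode characters described above can be observed with a character outside the Basic Multilingual Plane. This is a minimal sketch: the class name SurrogateDemo and the helper Sample are hypothetical, and U+1D11E (musical symbol G clef) is just one convenient supplementary code point.

```csharp
using System;
using System.Globalization;

public static class SurrogateDemo
{
    // Hypothetical helper: "a", then U+1D11E (encoded as a surrogate pair), then "b".
    public static string Sample() => "a" + char.ConvertFromUtf32(0x1D11E) + "b";

    public static void Main()
    {
        string s = Sample();

        Console.WriteLine(s.Length);                                // 4: number of Char objects
        Console.WriteLine(char.IsHighSurrogate(s[1]));              // True: first half of the pair
        Console.WriteLine(char.IsLowSurrogate(s[2]));               // True: second half of the pair
        Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 3: number of text elements
        Console.WriteLine(char.ConvertToUtf32(s, 1) == 0x1D11E);    // True: the pair decodes to one code point
    }
}
```

Indexing s[1] or s[2] on its own yields only half of the character, which is why StringInfo is the safer tool when the text has to be processed one Unicode character at a time.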

msdn

Internally, the .NET Framework stores text as Unicode UTF-16. An encoder transforms this text data to a sequence of bytes. A decoder transforms a sequence of bytes into this internal format. An encoding describes the rules by which an encoder or decoder operates. For example, the UTF8Encoding class describes the rules for encoding to and decoding from a sequence of bytes representing text as Unicode UTF-8. Encoding and decoding can also include certain validation steps. For example, the UnicodeEncoding class checks all surrogates to make sure they constitute valid surrogate pairs. Both of these classes inherit from the Encoding class.