HDK
UT_Unicode Class Reference

Helper functions for Unicode and the UTF-8 variable length encoding.

#include <UT_Unicode.h>

Classes

class  iterator
 
class  transform
 

Static Public Member Functions

static const utf8 * convert (const utf8 *str, utf32 &cp)
 
static int convert (utf32 cp, utf8 *str, exint buflen)
 
static const utf8 * next (const utf8 *current)
 
static utf8 * next (utf8 *current)
 
static const utf8 * prev (const utf8 *start, const utf8 *current)
 
static utf8 * prev (const utf8 *start, utf8 *current)
 
static const utf8 * nextWord (const utf8 *start, const utf8 *current)
 
static const utf8 * prevWord (const utf8 *start, const utf8 *current)
 
static bool fixpos (const utf8 *start, const utf8 *&current)
 
static bool fixpos (const utf8 *start, utf8 *&current)
 
static exint count (const utf8 *start, const utf8 *end=0)
 Returns the number of code points this variable encoding represents.
 
static exint length (const utf8 *start, const utf8 *end=0)
 
static utf8 * duplicate (const utf8 *start, const utf8 *end=0)
 
static const utf8 * find (utf32 cp, const utf8 *start, const utf8 *end=0)
 
static const utf8 * find (const utf8 *str, const utf8 *start, const utf8 *end=0)
 
static const utf16 * convert (const utf16 *str, utf32 &cp, bool big_endian=false)
 
static int convert (utf32 cp, utf16 *str, exint buflen)
 
static utf32 replacementCodePoint ()
 
static bool isSurrogatePair (utf32 cp)
 
static bool isFromSupplementaryPlane (utf32 cp)
 
static bool isValidCodePoint (utf32 cp)
 
static bool isControlChar (utf32 cp)
 
static bool isASCII (utf32 cp)
 
static bool isLatin1 (utf32 cp)
 
static bool isSpace (utf32 cp, bool break_only=true)
 
static bool isDigit (utf32 cp)
 
static bool isAlpha (utf32 cp)
 
static bool isAlnum (utf32 cp)
 
static bool isPunct (utf32 cp)
 
static bool isUpper (utf32 cp)
 
static bool isLower (utf32 cp)
 
static bool isCJK (utf32 cp)
 
static utf32 toLower (utf32 cp)
 
static utf32 toUpper (utf32 cp)
 
static bool isWordDelimiter (utf32 cp)
 
static bool isUTF8 (utf8 octet)
 

Detailed Description

Helper functions for Unicode and the UTF-8 variable length encoding.

Definition at line 28 of file UT_Unicode.h.

Member Function Documentation

const utf8 * UT_Unicode::convert ( const utf8 * str,
utf32 & cp
)
inline static

Parses one code point from a UTF-8 encoded sequence and stores it in cp. Returns a pointer to the next encoded sequence if the current one was decoded successfully. If decoding fails, it returns NULL and sets cp to zero.
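
The decoding contract described above (code point written through a reference, pointer to the next sequence returned, NULL and zero on failure) can be illustrated with a stand-alone sketch. This is not the HDK implementation: the name decodeUtf8 and the use of unsigned char instead of the utf8 type are assumptions made so the example compiles without HDK headers. It also rejects every malformed sequence outright and never substitutes a replacement character.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-alone decoder; NOT the HDK implementation.
// Decodes one code point at `s` into `cp` and returns a pointer to the next
// sequence. On malformed input it returns nullptr and sets `cp` to 0.
// Assumes `s` points into a NUL-terminated buffer.
inline const unsigned char *decodeUtf8(const unsigned char *s, std::uint32_t &cp)
{
    assert(s != nullptr);
    cp = 0;
    if (s[0] < 0x80)                 // single-byte (ASCII) sequence
    {
        cp = s[0];
        return s + 1;
    }
    int extra = 0;                   // continuation bytes expected
    std::uint32_t min = 0;           // smallest value legal for this length
    if ((s[0] & 0xE0) == 0xC0)      { cp = s[0] & 0x1F; extra = 1; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; min = 0x10000; }
    else return nullptr;             // stray continuation or invalid lead byte
    for (int i = 1; i <= extra; ++i)
    {
        // A NUL or any non-continuation byte ends the scan before we run
        // past the terminator.
        if ((s[i] & 0xC0) != 0x80) { cp = 0; return nullptr; }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    {
        cp = 0;
        return nullptr;
    }
    return s + extra + 1;
}
```

Calling the function repeatedly with its own return value walks a string one code point at a time until it yields a zero code point or nullptr.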

Definition at line 81 of file UT_UnicodeImpl.h.

int UT_Unicode::convert ( utf32  cp,
utf8 * str,
exint  buflen
)
inline static

Converts a code point to its UTF-8 encoding. If no buffer is given, returns the number of characters needed to store the resulting encoded sequence. Does not write out a terminating zero but moves the pointer to where the next character after the sequence should be written.
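
A stand-alone sketch of the encoding direction, including the "no buffer means size query" convention, might look as follows. The name encodeUtf8, the unsigned char buffer type, and the choice to return 0 for unencodable values are assumptions for illustration; the HDK function may behave differently in those details.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone encoder; NOT the HDK implementation.
// Returns the number of bytes the UTF-8 form of `cp` needs, or 0 if `cp` is
// a surrogate or lies beyond U+10FFFF. Bytes are written only when `buf` is
// non-null and `buflen` is large enough. No terminating zero is written.
inline int encodeUtf8(std::uint32_t cp, unsigned char *buf, std::size_t buflen)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    const int n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (!buf || buflen < static_cast<std::size_t>(n))
        return n;                    // size query, or buffer too small
    if (n == 1)
    {
        buf[0] = static_cast<unsigned char>(cp);
        return n;
    }
    // Lead-byte prefixes for 2-, 3- and 4-byte sequences.
    static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
    // Fill continuation bytes from the end, six payload bits each.
    for (int i = n - 1; i > 0; --i)
    {
        buf[i] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    assert(cp < 0x20);               // remaining bits fit in the lead byte
    buf[0] = static_cast<unsigned char>(lead[n] | cp);
    return n;
}
```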

0x10FFFF is the greatest code point value allowed by Unicode and

Definition at line 151 of file UT_UnicodeImpl.h.

const utf16 * UT_Unicode::convert ( const utf16 * str,
utf32 & cp,
bool  big_endian = false
)
inline static

Parses one code point from a UTF-16 encoded sequence and stores it in cp. Returns a pointer to the next encoded sequence if the current one was decoded successfully. If decoding fails, it returns NULL and sets cp to zero. Set big_endian to true if the incoming UTF-16 string is encoded as big endian (UTF-16BE).
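
To show how the big_endian flag and surrogate pairs interact, here is a stand-alone sketch that reads UTF-16 from raw bytes so it does not depend on host byte order. The name decodeUtf16 and the byte-pointer interface are assumptions; the HDK function takes utf16 pointers and its error handling may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-alone decoder; NOT the HDK implementation.
// Reads one code point from UTF-16 bytes at `p` into `cp` and returns a
// pointer to the next sequence, or nullptr (with cp = 0) when a surrogate
// is unpaired. Assumes the buffer ends with a zero code unit.
inline const unsigned char *decodeUtf16(const unsigned char *p,
                                        std::uint32_t &cp,
                                        bool big_endian = false)
{
    assert(p != nullptr);
    // Assemble one 16-bit code unit from two bytes in the requested order.
    auto unit = [big_endian](const unsigned char *q) -> std::uint32_t {
        return big_endian ? (std::uint32_t(q[0]) << 8) | q[1]
                          : (std::uint32_t(q[1]) << 8) | q[0];
    };
    cp = 0;
    const std::uint32_t hi = unit(p);
    if (hi < 0xD800 || hi > 0xDFFF)  // ordinary BMP code unit
    {
        cp = hi;
        return p + 2;
    }
    if (hi >= 0xDC00)
        return nullptr;              // low surrogate with no preceding high
    const std::uint32_t lo = unit(p + 2);
    if (lo < 0xDC00 || lo > 0xDFFF)
        return nullptr;              // high surrogate not followed by a low
    cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return p + 4;
}
```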

Definition at line 214 of file UT_UnicodeImpl.h.

int UT_Unicode::convert ( utf32  cp,
utf16 * str,
exint  buflen
)
inline static

Converts a code point to its UTF-16LE encoding into the buffer given. If no buffer is given, or if the buffer size is too small, returns the number of bytes needed to store the resulting encoded sequence. buflen should be given in bytes, and not number of utf16 entries. Does not write out a terminating zero but moves the pointer to where the next character after the sequence should be written.
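
The surrogate arithmetic and the "buflen is in bytes" convention can be sketched as below. The name encodeUtf16LE and the byte-buffer interface are assumptions chosen to keep the example independent of HDK types; returning 0 for unencodable values is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone encoder; NOT the HDK implementation.
// Writes `cp` as little-endian UTF-16 and returns the byte count (2 or 4),
// or 0 if `cp` cannot be encoded. `buflen` is measured in bytes. With a null
// or too-small buffer only the required byte count is returned.
inline int encodeUtf16LE(std::uint32_t cp, unsigned char *buf, std::size_t buflen)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    std::uint16_t units[2] = {0, 0};
    int n = 1;
    if (cp < 0x10000)
    {
        units[0] = static_cast<std::uint16_t>(cp);
    }
    else
    {
        const std::uint32_t v = cp - 0x10000;   // 20 significant bits
        assert(v <= 0xFFFFF);
        units[0] = static_cast<std::uint16_t>(0xD800 | (v >> 10));    // high surrogate
        units[1] = static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF));  // low surrogate
        n = 2;
    }
    const int bytes = n * 2;
    if (!buf || buflen < static_cast<std::size_t>(bytes))
        return bytes;
    for (int i = 0; i < n; ++i)
    {
        buf[2 * i]     = static_cast<unsigned char>(units[i] & 0xFF);  // low byte first
        buf[2 * i + 1] = static_cast<unsigned char>(units[i] >> 8);
    }
    return bytes;
}
```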

Definition at line 246 of file UT_UnicodeImpl.h.

exint UT_Unicode::count ( const utf8 * start,
const utf8 * end = 0
)
inline static

Returns the number of code points this variable encoding represents.
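
For well-formed UTF-8, the number of code points equals the number of bytes that are not continuation bytes (bytes of the form 10xxxxxx). The following stand-alone sketch uses that property. The name countCodePoints is hypothetical, the function does no validation, and it is not claimed to match how the HDK counts malformed input.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-alone counter; NOT the HDK implementation.
// Counts code points in [start, end), or up to the terminating NUL when
// `end` is null. Assumes the input is well-formed UTF-8.
inline std::ptrdiff_t countCodePoints(const char *start, const char *end = nullptr)
{
    assert(start != nullptr);
    assert(end == nullptr || end >= start);
    std::ptrdiff_t n = 0;
    for (const char *p = start; end ? p < end : *p != '\0'; ++p)
    {
        // Every code point has exactly one byte that is not a continuation.
        if ((static_cast<unsigned char>(*p) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

The byte count of the same range (what length reports in octets) is simply the pointer distance, so the two values differ whenever the string contains non-ASCII characters.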

Definition at line 526 of file UT_UnicodeImpl.h.

utf8 * UT_Unicode::duplicate ( const utf8 * start,
const utf8 * end = 0
)
inline static

Duplicates the string using malloc. Use free() to free the resulting string. If a NULL pointer is passed, a NULL pointer is returned.

Definition at line 557 of file UT_UnicodeImpl.h.

const utf8 * UT_Unicode::find ( utf32  cp,
const utf8 * start,
const utf8 * end = 0
)
inline static

Find a code point in a variable length string and return a pointer to it. An optional end point can be supplied, which delineates a search range. Otherwise the string is searched up to the terminating NUL.

Definition at line 581 of file UT_UnicodeImpl.h.

const utf8 * UT_Unicode::find ( const utf8 * str,
const utf8 * start,
const utf8 * end = 0
)
inline static

Finds a UTF-8 encoded string in another UTF-8 encoded string and returns a pointer to the start of the match. Returns NULL if the string was not found.

Definition at line 625 of file UT_UnicodeImpl.h.

bool UT_Unicode::fixpos ( const utf8 * start,
const utf8 *&  current
)
inline static

Given a pointer inside of a string representing variable length encoding, moves the pointer so that it points to the beginning of the encoding, if not there already. Returns false if it was unable to fix the position and true if successful or the position was already valid.
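
Re-synchronizing a pointer works because UTF-8 continuation bytes are always of the form 10xxxxxx, so stepping backwards over them reaches the lead byte. A stand-alone sketch of that idea follows. The name fixPosition is hypothetical, and the sketch only checks the continuation-byte pattern; it does not verify that the lead byte announces the right sequence length, which the HDK function may do.

```cpp
#include <cassert>

// Hypothetical stand-alone helper; NOT the HDK implementation.
// Moves `current` back to the first byte of the sequence it points into.
// Returns true if `current` was already valid or could be fixed, false if
// no lead byte was found within the buffer or within four bytes.
inline bool fixPosition(const char *start, const char *&current)
{
    assert(start != nullptr && current != nullptr);
    if (current < start)
        return false;
    const char *p = current;
    // A UTF-8 sequence has at most three continuation bytes, so at most
    // four positions need to be inspected.
    for (int back = 0; back < 4; ++back, --p)
    {
        if ((static_cast<unsigned char>(*p) & 0xC0) != 0x80)
        {
            current = p;             // found a non-continuation byte
            return true;
        }
        if (p == start)
            return false;            // ran out of buffer
    }
    return false;                    // too many continuation bytes
}
```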

Definition at line 517 of file UT_UnicodeImpl.h.

bool UT_Unicode::fixpos ( const utf8 * start,
utf8 *&  current
)
inline static

Definition at line 511 of file UT_UnicodeImpl.h.

bool UT_Unicode::isAlnum ( utf32  cp)
inline static

Definition at line 687 of file UT_UnicodeImpl.h.

bool UT_Unicode::isAlpha ( utf32  cp)
inline static

Definition at line 681 of file UT_UnicodeImpl.h.

static bool UT_Unicode::isASCII ( utf32  cp)
inline static

Definition at line 177 of file UT_Unicode.h.

bool UT_Unicode::isCJK ( utf32  cp)
inline static

Returns true if the character is from any of the Unicode CJK Unified Ideographs blocks.

Definition at line 713 of file UT_UnicodeImpl.h.

static bool UT_Unicode::isControlChar ( utf32  cp)
inline static

Definition at line 171 of file UT_Unicode.h.

bool UT_Unicode::isDigit ( utf32  cp)
inline static

Definition at line 675 of file UT_UnicodeImpl.h.

static bool UT_Unicode::isFromSupplementaryPlane ( utf32  cp)
inline static

Definition at line 156 of file UT_Unicode.h.

static bool UT_Unicode::isLatin1 ( utf32  cp)
inline static

Definition at line 182 of file UT_Unicode.h.

bool UT_Unicode::isLower ( utf32  cp)
inline static

Definition at line 707 of file UT_UnicodeImpl.h.

bool UT_Unicode::isPunct ( utf32  cp)
inline static

Definition at line 693 of file UT_UnicodeImpl.h.

bool UT_Unicode::isSpace ( utf32  cp,
bool  break_only = true 
)
inline static

Definition at line 667 of file UT_UnicodeImpl.h.

static bool UT_Unicode::isSurrogatePair ( utf32  cp)
inline static

Returns true if the given code point is a surrogate. A surrogate is a valid UTF-16 code unit, since pairs of them are used to encode code points greater than 0xFFFF. It is not a valid UTF-32 code point, however.
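
The range tests behind predicates such as isSurrogatePair, isFromSupplementaryPlane and isValidCodePoint follow from the layout of the Unicode code space. The definitions below are a sketch based on the Unicode standard, with hypothetical names; they are not a transcription of the HDK source, whose exact boundaries are not shown on this page.

```cpp
#include <cstdint>

// Hypothetical stand-alone predicates; NOT the HDK implementation.

// Surrogates occupy U+D800..U+DFFF and are reserved for UTF-16 pairs.
constexpr bool isSurrogate(std::uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// Supplementary planes hold everything above the Basic Multilingual Plane.
constexpr bool isSupplementary(std::uint32_t cp)
{
    return cp >= 0x10000 && cp <= 0x10FFFF;
}

// A valid scalar value is at most U+10FFFF and is not a surrogate.
constexpr bool isValidScalar(std::uint32_t cp)
{
    return cp <= 0x10FFFF && !isSurrogate(cp);
}
```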

Definition at line 151 of file UT_Unicode.h.

bool UT_Unicode::isUpper ( utf32  cp)
inline static

Definition at line 699 of file UT_UnicodeImpl.h.

bool UT_Unicode::isUTF8 ( utf8  octet)
inline static

Definition at line 74 of file UT_UnicodeImpl.h.

static bool UT_Unicode::isValidCodePoint ( utf32  cp)
inline static

Definition at line 162 of file UT_Unicode.h.

bool UT_Unicode::isWordDelimiter ( utf32  cp)
inline static

Definition at line 743 of file UT_UnicodeImpl.h.

exint UT_Unicode::length ( const utf8 * start,
const utf8 * end = 0
)
inline static

Returns the number of octets for this variable encoding. One octet is the same as a byte for UTF-8 encodings.

Definition at line 542 of file UT_UnicodeImpl.h.

const utf8 * UT_Unicode::next ( const utf8 * current)
inline static

Given a current location in a buffer, moves to the next character. If the location is inside a UTF-8 multi-byte encoding (i.e. not at the beginning of one), it moves to the start of the next encoded character. If the current location is already at the terminating NUL character, the function does nothing and just returns the current pointer. If it is unable to move successfully to the next encoded character (e.g. it's already at the end of the string, or the encoding is garbage and no recovery is possible), the function returns NULL.
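
Forward stepping can be sketched by advancing one byte and then skipping any continuation bytes. The name nextChar is hypothetical. This sketch returns the current pointer when already at the terminating NUL and otherwise never returns NULL, because it does no validation; the error cases described above are not modelled.

```cpp
#include <cassert>

// Hypothetical stand-alone helper; NOT the HDK implementation.
// Returns a pointer to the start of the next encoded character. Works from
// a lead byte or from the middle of a sequence. Assumes a NUL-terminated
// buffer.
inline const char *nextChar(const char *current)
{
    assert(current != nullptr);
    if (*current == '\0')
        return current;              // already at the terminator
    ++current;
    // Skip continuation bytes (10xxxxxx) until a lead byte or NUL.
    while ((static_cast<unsigned char>(*current) & 0xC0) == 0x80)
        ++current;
    return current;
}
```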

Definition at line 279 of file UT_UnicodeImpl.h.

utf8 * UT_Unicode::next ( utf8 * current)
inline static

Definition at line 273 of file UT_UnicodeImpl.h.

const utf8 * UT_Unicode::nextWord ( const utf8 * start,
const utf8 * current
)
inline static

Given a location in a buffer, moves after the end of the word. This is done by grouping characters that are considered continuous. There are 4 types of groups:

  1. space
  2. alphanumeric: ASCII, including _@
  3. punctuation: {}[]();,. and \n\r
  4. other: symbols (including non-ASCII)

Punctuation always forms a single-character group and is never grouped with another character. Also, when a dot is sandwiched between digits (e.g., 1.1), it is considered continuous. Note that this function uses different rules from isWordDelimiter.
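
The grouping rules listed above can be sketched as a small byte classifier plus a forward scan. Everything here is an illustration under assumptions: the names Group, classify and nextWordEnd are hypothetical, the space group is assumed to contain only space and tab, and the scan works on bytes, treating every non-ASCII byte as "other". The HDK function may classify characters differently.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Hypothetical stand-alone sketch; NOT the HDK implementation.
enum class Group { Space, Alnum, Punct, Other };

inline Group classify(unsigned char c)
{
    if (c != 0 && std::strchr("{}[]();,.\n\r", c)) return Group::Punct;
    if (c == ' ' || c == '\t') return Group::Space;   // assumed space set
    if (c < 0x80 && (std::isalnum(c) || c == '_' || c == '@')) return Group::Alnum;
    return Group::Other;             // symbols and all non-ASCII bytes
}

// Returns a pointer just past the group that starts at `current`.
inline const char *nextWordEnd(const char *current)
{
    assert(current != nullptr);
    if (*current == '\0')
        return current;
    const Group g = classify(static_cast<unsigned char>(*current));
    const char *p = current + 1;
    if (g == Group::Punct)
        return p;                    // punctuation is never grouped
    for (; *p != '\0'; ++p)
    {
        // A dot between two digits stays inside an alphanumeric group.
        const bool decimalDot = g == Group::Alnum && *p == '.'
            && std::isdigit(static_cast<unsigned char>(p[-1]))
            && std::isdigit(static_cast<unsigned char>(p[1]));
        if (!decimalDot && classify(static_cast<unsigned char>(*p)) != g)
            break;
    }
    return p;
}
```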

Definition at line 499 of file UT_UnicodeImpl.h.

const utf8 * UT_Unicode::prev ( const utf8 * start,
const utf8 * current
)
inline static

Given a location in a buffer, moves to the previous character, unless already at the beginning of the string, as defined by 'start'. If the location is inside a UTF-8 multi-byte encoding, it moves to the beginning of that encoding. If going back lands on an invalid character, encounters bad encoding (e.g. too many continuation bytes), or the location is already at the start, the function returns NULL.
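
Backward stepping can be sketched by moving back one byte and then continuing over continuation bytes until a lead byte is reached. The name prevChar is hypothetical, and the sketch only checks the continuation-byte pattern; it does not validate that the lead byte matches the sequence length, which the HDK function may do.

```cpp
#include <cassert>

// Hypothetical stand-alone helper; NOT the HDK implementation.
// Returns the start of the previous character, or the start of the current
// one when `current` points into the middle of a sequence. Returns nullptr
// when already at `start` or when no lead byte is found within four bytes.
inline const char *prevChar(const char *start, const char *current)
{
    assert(start != nullptr && current != nullptr);
    if (current <= start)
        return nullptr;
    const char *p = current;
    int steps = 0;
    // Step back at least once, then keep going over continuation bytes.
    do
    {
        --p;
        ++steps;
    } while (p > start
             && (static_cast<unsigned char>(*p) & 0xC0) == 0x80
             && steps < 4);
    if ((static_cast<unsigned char>(*p) & 0xC0) == 0x80)
        return nullptr;              // too many continuation bytes
    return p;
}
```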

Definition at line 342 of file UT_UnicodeImpl.h.

utf8 * UT_Unicode::prev ( const utf8 * start,
utf8 * current
)
inline static

Definition at line 336 of file UT_UnicodeImpl.h.

const utf8 * UT_Unicode::prevWord ( const utf8 * start,
const utf8 * current
)
inline static

Given a location in a buffer, moves to the beginning of the word. This is done by grouping characters that are considered continuous. There are 4 types of groups:

  1. space
  2. alphanumeric: ASCII, including _@
  3. punctuation: {}[]();,. and \n\r
  4. other: symbols (including non-ASCII)

Punctuation always forms a single-character group and is never grouped with another character. Also, when a dot is sandwiched between digits (e.g., 1.1), it is considered continuous. Note that this function uses different rules from isWordDelimiter.

Definition at line 505 of file UT_UnicodeImpl.h.

static utf32 UT_Unicode::replacementCodePoint ( )
inline static

Returns the replacement character, which the convert functions return when they encounter an invalid but recoverable encoding.

Definition at line 143 of file UT_Unicode.h.

utf32 UT_Unicode::toLower ( utf32  cp)
inline static

Definition at line 722 of file UT_UnicodeImpl.h.

utf32 UT_Unicode::toUpper ( utf32  cp)
inline static

Definition at line 733 of file UT_UnicodeImpl.h.


The documentation for this class was generated from the following files:

UT_Unicode.h
UT_UnicodeImpl.h