A Unicode character can be encoded as a sequence of 1 to 4 bytes under the UTF-8 encoding; an ASCII character always encodes as a single byte.

UTF-8 is a variable-width encoding for Unicode. It uses 1 to 4 bytes per character, depending on the code point. UTF-8 has the following properties:

  • The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that contain only ASCII characters have the same encoding under both ASCII and UTF-8.
  • It is easy to convert between UTF-8 and the fixed-width UCS-2 and UCS-4 representations of characters.
  • The lexicographic sorting order of UCS-4 strings is preserved.
  • All possible 2^21 UCS codes can be encoded using UTF-8.

Generally, programs should validate UTF-8 data before performing other checks. The following table lists the well-formed UTF-8 byte sequences.

Bits of code point   First code point   Last code point   Bytes in sequence   Byte 1     Byte 2     Byte 3     Byte 4
7                    U+0000             U+007F            1                   0xxxxxxx
11                   U+0080             U+07FF            2                   110xxxxx   10xxxxxx
16                   U+0800             U+FFFF            3                   1110xxxx   10xxxxxx   10xxxxxx
21                   U+10000            U+1FFFFF          4                   11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
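
The table translates directly into code. The following sketch is illustrative only (the function name utf8_encode and its interface are assumptions, not part of any cited source); it packs a code point's bits into the lead and continuation bytes shown above:

#include <stddef.h>

/* Hypothetical sketch: encode one code point (up to U+1FFFFF) as UTF-8
 * following the table above. Returns the number of bytes written to buf
 * (which must hold at least 4 bytes), or 0 if cp is out of range. */
size_t utf8_encode(unsigned long cp, unsigned char *buf) {
  if (cp <= 0x7F) {                 /* 7 bits:  0xxxxxxx */
    buf[0] = (unsigned char)cp;
    return 1;
  } else if (cp <= 0x7FF) {         /* 11 bits: 110xxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xC0 | (cp >> 6));
    buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  } else if (cp <= 0xFFFF) {        /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xE0 | (cp >> 12));
    buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  } else if (cp <= 0x1FFFFF) {      /* 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;  /* beyond the 21-bit code space */
}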

Although UTF-8 originated with the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. Many systems that claim "Unicode" support likewise handle only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].

According to RFC 2279: UTF-8, a transformation format of ISO 10646 [Yergeau 1998],

Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.
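
To make the quoted attack concrete, consider this sketch of the kind of careless two-byte decoder the RFC warns about (hypothetical code, for illustration only): it merges the payload bits without ever asking whether a shorter encoding exists.

/* Hypothetical flawed decoder for a two-byte sequence: it extracts the
 * payload bits but never rejects overlong forms. */
unsigned long naive_decode2(const unsigned char *p) {
  return ((unsigned long)(p[0] & 0x1F) << 6) | (unsigned long)(p[1] & 0x3F);
}
/* naive_decode2 applied to C0 80 yields 0x0000 (NUL), and applied to
 * C0 AE yields 0x002E ('.'), so both overlong sequences slip past any
 * earlier check that looked only for the bytes 00 or 2E. */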

Following are more specific recommendations.

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing potentially dangerous input to have multiple representations (a shortest-form check is sketched below). For example,

  1. Process A performs security checks but does not check for nonshortest UTF-8 forms.
  2. Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible nonshortest forms.
  3. The UTF-16 text may contain characters that should have been filtered out by process A and can potentially be dangerous. These "nonshortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS Web server.

Corrigendum #1: UTF-8 Shortest Form to the Unicode Standard [Unicode 2006] describes modifications made to version 3.0 of the Unicode Standard to forbid the interpretation of the nonshortest forms.
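
One way to enforce the shortest form, sketched here with a hypothetical helper, is to decode the code point first and then verify that it actually needed the number of bytes consumed; the minimum values come straight from the table of well-formed sequences above:

/* Hypothetical sketch: nonzero if code point 'cp', decoded from a
 * 'len'-byte sequence, is in shortest form (minimums taken from the
 * table of well-formed UTF-8 sequences above). */
int utf8_is_shortest(unsigned long cp, int len) {
  switch (len) {
    case 1: return cp <= 0x7F;
    case 2: return cp >= 0x80;      /* rejects C0 80 (cp == 0x00)    */
    case 3: return cp >= 0x800;     /* rejects E0 80 AE and the like */
    case 4: return cp >= 0x10000;
    default: return 0;
  }
}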

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering invalid input. The following are several ways a UTF-8 decoder might behave when it encounters an invalid byte sequence. Note that each of these behaviors carries its own security considerations.

  1. Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available (see the sketch after this list).
  2. Ignore the bytes (for example, delete the invalid bytes before validation; see Unicode Technical Report #36, Section 3.5, "Deletion of Code Points," for more information).
  3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger XSS and so are potentially dangerous).
  4. Fail to notice the error and decode the bytes as if they were some similar, valid UTF-8 sequence.
  5. Stop decoding and report an error.
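
As a sketch of the first option, the following hypothetical decoder (the name utf8_decode_replace and its interface are assumptions) substitutes U+FFFD for any byte that cannot begin or continue a sequence. To stay short, it does not check shortest form or surrogates, which a production decoder must also do:

#include <stddef.h>

/* Hypothetical sketch of behavior 1: decode a NUL-terminated byte string,
 * writing one code point per element to 'out'; any byte that cannot begin
 * or continue a sequence is replaced with U+FFFD. Returns the number of
 * code points produced. 'out' must hold at least strlen(s) elements. */
size_t utf8_decode_replace(const unsigned char *s, unsigned long *out) {
  size_t n = 0;
  while (*s) {
    unsigned long cp;
    int len, i, ok = 1;
    if (*s < 0x80)                { cp = *s;        len = 1; }
    else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; len = 2; }
    else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; len = 3; }
    else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; len = 4; }
    else { out[n++] = 0xFFFD; s++; continue; }   /* invalid lead byte */
    for (i = 1; i < len; i++) {
      if ((s[i] & 0xC0) != 0x80) { ok = 0; break; }  /* bad or missing
                                                        continuation byte */
      cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (ok) { out[n++] = cp;     s += len; }
    else    { out[n++] = 0xFFFD; s += i; }  /* resync at offending byte */
  }
  return n;
}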

The following function, from John Viega's "Protecting Sensitive Data in Memory" [Viega 2003], detects invalid character sequences in a string but does not reject nonshortest forms. It returns 1 if the string is composed only of legitimate sequences; otherwise, it returns 0.

int spc_utf8_isvalid(const unsigned char *input) {
  int nb, i;
  const unsigned char *c;

  for (c = input; *c; c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                /* ASCII byte: 0xxxxxxx        */
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte     */
    else if ((*c & 0xe0) == 0xc0) nb = 1;    /* lead byte 110xxxxx          */
    else if ((*c & 0xf0) == 0xe0) nb = 2;    /* lead byte 1110xxxx          */
    else if ((*c & 0xf8) == 0xf0) nb = 3;    /* lead byte 11110xxx          */
    else if ((*c & 0xfc) == 0xf8) nb = 4;    /* obsolete 5-byte form        */
    else if ((*c & 0xfe) == 0xfc) nb = 5;    /* obsolete 6-byte form        */
    else return 0;                           /* 0xFE and 0xFF never appear
                                                in UTF-8                    */
    /* Each following byte must be a continuation byte (10xxxxxx); checking
     * forward stops at the string terminator before reading past it. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}
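
A hypothetical call site, just to show the intended use as a gate before further processing (the byte strings are illustrative):

#include <stdio.h>

int main(void) {
  const unsigned char ok[]  = "ascii and caf\xC3\xA9";  /* well-formed UTF-8 */
  const unsigned char bad[] = "\xC0\x80";               /* overlong NUL      */

  printf("%d\n", spc_utf8_isvalid(ok));   /* prints 1 */
  printf("%d\n", spc_utf8_isvalid(bad));  /* also prints 1: nonshortest
                                             forms are not rejected */
  return 0;
}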

Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
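
Surrogate code points U+D800 through U+DFFF occupy a fixed range in the three-byte pattern: they would encode as ED A0 80 through ED BF BF. A validator can therefore reject them by inspecting the first two bytes of a three-byte sequence, as in this hypothetical helper:

/* Hypothetical sketch: given the first two bytes of a three-byte sequence,
 * returns nonzero if it would decode to a surrogate code point
 * (U+D800..U+DFFF), which is never legal in UTF-8. */
int utf8_is_encoded_surrogate(unsigned char b1, unsigned char b2) {
  return b1 == 0xED && b2 >= 0xA0 && b2 <= 0xBF;
}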

Risk Assessment

Failing to properly handle UTF-8-encoded data can result in a data integrity violation or denial-of-service attack.

Recommendation   Severity   Likelihood   Remediation Cost   Priority   Level
MSC10-C          Medium     Unlikely     High               P2         L3

Automated Detection

Tool              Version   Checker        Description
LDRA tool suite   9.7.1     176 S, 376 S   Partially implemented

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.
