


While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

Specifying the document's character encoding


There are two general ways to specify which character encoding is used in the document.

First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:[1]

Content-Type: text/html; charset=utf-8

This method gives the HTTP server a convenient way to alter a document's encoding according to content negotiation; certain HTTP server software can do this, for example Apache with the module mod_charset_lite.[2]
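
As a rough illustration of how a client might read the charset parameter from such a header, the following Python sketch uses the standard library's email.message.Message class, which parses MIME-style header parameters. The helper name charset_from_content_type is hypothetical and is not part of HTTP or HTML.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))
```

A header without a charset parameter yields None, in which case a client has to fall back on other sources of encoding information.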

Second, a declaration can be included within the document itself.

For HTML it is possible to include this information inside the head element near the top of the document:[3]

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 also allows the following syntax to mean exactly the same:[3]

<meta charset="utf-8">

XHTML documents have a third option: to express the character encoding via XML declaration, as follows:[4]

<?xml version="1.0" encoding="utf-8"?>

With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an ASCII extension then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as UTF-16BE and UTF-16LE, a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.
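
The difference between ASCII extensions and other encodings can be checked directly. The sketch below encodes the same declaration with several codecs; the codec names are Python's, and the choice of sample encodings is an assumption made for illustration.

```python
DECLARATION = '<meta charset="utf-8">'


def matches_ascii(text, codec):
    """True if `codec` encodes `text` to the same bytes as ASCII does."""
    return text.encode(codec) == text.encode("ascii")


# ASCII extensions leave the bytes of the declaration unchanged.
for codec in ("utf-8", "cp1252", "iso-8859-1"):
    print(codec, matches_ascii(DECLARATION, codec))

# UTF-16 uses two bytes per character, so a byte-level search for the
# ASCII spelling of the tag does not find it.
utf16 = DECLARATION.encode("utf-16-le")
print("utf-16-le", matches_ascii(DECLARATION, "utf-16-le"), b"<meta" in utf16)
```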

Encoding detection algorithm


As of HTML5 the recommended charset is UTF-8.[3] An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:

  1. Explicit user instruction
  2. An explicit meta tag within the first 1024 bytes of the document
  3. A byte order mark (BOM) within the first three bytes of the document
  4. The HTTP Content-Type or other transport layer information
  5. Analysis of the document bytes looking for specific sequences or ranges of byte values,[5] and other tentative detection mechanisms.
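
A heavily simplified sketch of how several of these sources might be combined is shown below. It is not the specification's algorithm: the function name sniff_encoding, the order in which the sources are consulted (user override, BOM, HTTP header, meta prescan, fallback), the loose regular expression and the windows-1252 fallback are all assumptions made for illustration.

```python
import re

# Byte order marks and the encodings they signal.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)

# Deliberately loose pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_encoding(data, http_charset=None, user_override=None,
                   fallback="windows-1252"):
    """Guess the encoding of an HTML byte stream (illustrative only)."""
    if user_override:                      # explicit user instruction
        return user_override.lower()
    for bom, name in BOMS:                 # byte order mark
        if data.startswith(bom):
            return name
    if http_charset:                       # HTTP Content-Type charset
        return http_charset.lower()
    match = META_CHARSET.search(data[:1024])   # meta tag in first 1024 bytes
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback                        # assumed default for this sketch


print(sniff_encoding(b'<!DOCTYPE html><meta charset="Shift_JIS">'))
print(sniff_encoding(b"\xef\xbb\xbf<!DOCTYPE html>", http_charset="ISO-8859-1"))
```

The last step of the list (statistical analysis of byte values) is omitted here; a real implementation would need it, along with the many edge cases the specification describes.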

When the encoding is identified incorrectly, characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean (CJK) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an incorrect charset label manually as well.

It is increasingly common for multilingual websites and websites in non-Western languages to use UTF-8, which allows use of the same encoding for all languages. UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
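
The size difference for ASCII-heavy text is easy to measure. In the sketch below, the sample markup and the helper name encoded_sizes are assumptions chosen for illustration; the little-endian codec variants are used so that no byte order mark is added to the output.

```python
SAMPLE = "<p>Hello, world</p>"


def encoded_sizes(text):
    """Return the encoded length in bytes of `text` for each encoding."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-le", "utf-32-le")}


for codec, size in encoded_sizes(SAMPLE).items():
    print(f"{codec}: {size} bytes")
```

For text consisting only of ASCII characters, UTF-16 takes twice and UTF-32 four times as many bytes as UTF-8.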

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
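
What a reader with different assumptions sees can be reproduced by decoding the same bytes in two ways. In this sketch the author is assumed to have saved the page as UTF-8 without labelling it, while one reader's software assumes windows-1252 (Python codec cp1252); both choices are illustrative.

```python
def view_as(raw_bytes, assumed_encoding):
    """Decode bytes the way a reader assuming `assumed_encoding` would."""
    return raw_bytes.decode(assumed_encoding)


page_bytes = "café".encode("utf-8")   # what the author actually saved

print(view_as(page_bytes, "utf-8"))    # reader sharing the author's assumption
print(view_as(page_bytes, "cp1252"))   # reader assuming windows-1252
```

The first reader sees the intended text; the second sees two unrelated characters in place of the accented letter, even though the bytes are identical.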

Permitted encodings


The WHATWG Encoding Standard, referenced by recent HTML standards (the current WHATWG HTML Living Standard, as well as the formerly competing W3C HTML 5.0 and 5.1) specifies a list of encodings which browsers must support. The HTML standards forbid support of other encodings.[6][7][8] The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use UTF-8 exclusively.[9]

Besides UTF-8, the HTML standard itself explicitly lists further encodings, with reference to the Encoding Standard.[8] The following notes apply to some of them:

  1. windows-874: also specified for TIS-620, ISO-8859-11 and related labels.[9]
  2. windows-1252: also specified for ASCII, ISO-8859-1 and related labels.[9]
  3. windows-1254: also specified for ISO-8859-9 and related labels.[9]
  4. gb18030: specified with 0xA3A0 as a duplicate encoding of the ideographic space (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).[10][11] Also specified with 0x80 accepted as an alternative encoding of the euro sign (U+20AC; see Windows-936).[12] Otherwise, it follows the mappings from the 2005 standard.[11]
  5. Big5: the Hong Kong Supplementary Character Set variant,[13] although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.[14]
  6. Shift_JIS: the specification includes IBM and NEC extensions,[15] and is more precisely Windows-31J.[13]
  7. ISO-2022-JP: the specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. it includes NEC extensions. Half-width kana is converted to fullwidth by the encoder,[16] but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.[17] Shift Out and Shift In (0x0E and 0x0F) are excluded entirely to prevent attacks.[17][18]
  8. EUC-KR: actually Unified Hangul Code (Windows-949), a superset that covers the entire Hangul Syllables block.[13][19]
  9. UTF-16BE: specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in UTF-8.[20]
  10. UTF-16LE: for compatibility with deployed content, also specified for the plain UTF-16 label,[21] although a byte order mark (BOM), if present, takes priority over any label.[22] Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in UTF-8.[20]
  11. x-user-defined: maps 0x00 through 0x7F to U+0000 through U+007F, and 0x80 through 0xFF to U+F780 through U+F7FF (a Private Use Area range), such that the low 8 bits of the code point always match the original byte.[23]
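
The byte-to-code-point mapping described in the last note above (the encoding that reference [23] calls x-user-defined) is simple enough to sketch in a few lines. The function name decode_x_user_defined is hypothetical; Python has no built-in codec of that name, so the mapping is written out by hand.

```python
def decode_x_user_defined(data):
    """Map each byte to a code point as described in the note above."""
    chars = []
    for byte in data:
        if byte < 0x80:
            chars.append(chr(byte))                   # 0x00-0x7F -> U+0000-U+007F
        else:
            chars.append(chr(0xF780 + byte - 0x80))   # 0x80-0xFF -> U+F780-U+F7FF
    return "".join(chars)


decoded = decode_x_user_defined(b"A\x80\xff")
# The low 8 bits of each code point reproduce the original byte.
print([hex(ord(ch) & 0xFF) for ch in decoded])
```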

Additional encodings are listed in the Encoding Standard, and support for them is therefore also required.[9] The following notes apply to some of them:

  1. ISO-8859-8-I: uses the same encoder and decoder as ISO-8859-8, but is not subject to the visual-order behaviour which is used for documents labelled as ISO-8859-8.[24]
  2. KOI8-U: titled KOI8-U and specified for both KOI8-U and KOI8-RU labels;[9] follows KOI8-RU in positions 0xAE and 0xBE (i.e. includes Ў/ў)[25][26] but KOI8-U in positions 0x93–9F.[25]
  3. GBK: also specified for GB2312 and related labels. Handled the same as GB 18030 for decoding purposes.[27] For encoding purposes, labelling as GBK (or GB 2312) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.[10]
  4. EUC-JP: the specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. it includes NEC extensions. JIS X 0212 is included for decoding only.[28]

The HTML standard names the following as explicit examples of forbidden encodings: CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC and UTF-32.[8]

The standard also defines a "replacement" decoder, which maps all content labelled as certain encodings to the replacement character (U+FFFD), refusing to process it at all. This is intended to prevent attacks (e.g. cross-site scripting) which may exploit a difference between the client and server in what encodings are supported in order to mask malicious content.[29] Although the same security concern applies to ISO-2022-JP and UTF-16, which also allow sequences of ASCII bytes to be interpreted differently, this approach was not seen as feasible for them since they are comparatively more frequently used in deployed content.[30] The following encodings receive this treatment: ISO-2022-CN, ISO-2022-CN-EXT, ISO-2022-KR and HZ-GB-2312.[31]

Character references


In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.

HTML character references


A numeric character reference in HTML refers to a character by its Universal Character Set/Unicode code point, and uses the format

&#nnnn;

or

&#xhhhh;

where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros. The hhhh may mix uppercase and lowercase, though uppercase is the usual style.
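
Both forms can be produced and resolved with Python's standard library. In the sketch below, numeric_references is a hypothetical helper; html.unescape is the standard-library function that resolves character references in a string.

```python
import html


def numeric_references(ch):
    """Return the decimal and hexadecimal references for one character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"


print(numeric_references("λ"))

# Leading zeros and mixed-case hexadecimal digits resolve to the same character.
print(html.unescape("&#955; &#0955; &#x3BB; &#x3bb;"))
```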

Not all web browsers or email clients used by receivers of HTML documents, or text editors used by authors of HTML documents, will be able to render all HTML characters. Most modern software is able to display most or all of the characters for the user's language, and will draw a box or other clear indicator for characters it cannot render.

For codes from 0 to 127, the original 7-bit ASCII standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using character entity names. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.

Character entity references can also have the format &name; where name is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as &lambda; in an HTML document. The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup. This notably did not include XML's &apos; (') entity prior to HTML5. For a list of all named HTML character entity references along with the versions in which they were introduced, see List of XML and HTML character entity references.
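
Named references can be explored with Python's html.entities module, whose name2codepoint table covers the HTML 4 entity names and whose html5 table covers the HTML5 names. The sketch below is illustrative only; the helper name resolve is an assumption.

```python
import html
from html.entities import html5, name2codepoint


def resolve(name):
    """Resolve a named character reference such as 'lambda' to its character."""
    return html.unescape(f"&{name};")


# Names are case-sensitive: lambda and Lambda are different characters.
print(resolve("lambda"), resolve("Lambda"))
print(name2codepoint["lambda"], name2codepoint["Lambda"])

# apos is absent from the HTML 4 table but present in the HTML5 table.
print("apos" in name2codepoint, "apos;" in html5)
```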

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native Unicode encoding like UTF-8 is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as cross-site scripting. If HTML attributes are left unquoted, certain characters, most importantly whitespace, such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
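
A minimal sketch of escaping the markup-delimiting characters before inserting untrusted text into a page is shown below. It uses the standard library's html.escape; the function name render_comment and the surrounding markup are hypothetical, and real applications would normally rely on a templating system that escapes by context.

```python
import html


def render_comment(text):
    """Wrap untrusted text in a paragraph, escaping <, >, & and quotes."""
    return "<p>" + html.escape(text, quote=True) + "</p>"


print(render_comment('<script>alert("x")</script>'))
print(render_comment("Fish & chips"))
```

The escaped output contains no markup delimiters from the input, so a browser displays the text rather than interpreting it.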

XML character references


Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:[32]

Entity    Character    Name                 Code point
&amp;     &            ampersand            U+0026
&lt;      <            less-than sign       U+003C
&gt;      >            greater-than sign    U+003E
&quot;    "            quotation mark       U+0022
&apos;    '            apostrophe           U+0027

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.
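
These rules can be observed with an XML parser. The sketch below uses Python's xml.etree.ElementTree, which raises ParseError for input that is not well-formed; the helper name is_well_formed and the sample documents are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "predefined entities": "<p>&amp;&lt;&gt;&quot;&apos;</p>",
    "undefined entity": "<p>&eacute;</p>",
    "lowercase x": "<p>&#xE9;</p>",
    "uppercase X": "<p>&#XE9;</p>",
}

for label, document in samples.items():
    print(label, is_well_formed(document))
```

Only the documents using the predefined entities and the lowercase hexadecimal form are accepted; the other two are rejected by the parser.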

References

  1. Fielding, R.; Reschke, J. (June 2014), "Content-Type", in Fielding, R; Reschke, J (eds.), Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content, IETF, doi:10.17487/RFC7231, S2CID 14399078, retrieved 30 July 2014
  2. "Apache Module mod_charset_lite".
  3. "Specifying the document's character encoding", HTML5, World Wide Web Consortium, 14 December 2017, retrieved 28 May 2018
  4. Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Prolog and Document Type Declaration", XML, W3C, retrieved 8 March 2010
  5. "HTML5 prescan a byte stream to determine its encoding".
  6. "8.2.2.3. Character encodings". HTML 5.1 Standard. W3C.
  7. "8.2.2.3. Character encodings". HTML 5 Standard. W3C.
  8. "12.2.3.3 Character encodings". HTML Living Standard. WHATWG.
  9. van Kesteren, Anne. "4.2: Names and labels". Encoding Standard. WHATWG.
  10. van Kesteren, Anne. "10.2.2. gb18030 encoder". Encoding Standard. WHATWG.
  11. van Kesteren, Anne. "5. Indexes (§ index gb18030)". Encoding Standard. WHATWG.
  12. van Kesteren, Anne. "10.2.1. gb18030 decoder". Encoding Standard. WHATWG.
  13. Mozilla Foundation. "Notable Differences from IANA Naming". Crate encoding_rs. docs.rs.
  14. van Kesteren, Anne. "5. Indexes (§ index Big5 pointer)". Encoding Standard. WHATWG.
  15. van Kesteren, Anne. "5. Indexes (§ Index jis0208)". Encoding Standard. WHATWG.
  16. van Kesteren, Anne. "5. Indexes (§ Index ISO-2022-JP katakana)". Encoding Standard. WHATWG.
  17. van Kesteren, Anne. "12.2.1. ISO-2022-JP decoder". Encoding Standard. WHATWG.
  18. van Kesteren, Anne. "12.2.2. ISO-2022-JP encoder". Encoding Standard. WHATWG.
  19. van Kesteren, Anne. "5. Indexes (§ index EUC-KR)". Encoding Standard. WHATWG.
  20. van Kesteren, Anne. "4.3. Output encodings". Encoding Standard. WHATWG.
  21. van Kesteren, Anne. "14.4. UTF-16LE". Encoding Standard. WHATWG.
  22. van Kesteren, Anne. "6. Hooks for standards (§ decode)". Encoding Standard. WHATWG.
  23. van Kesteren, Anne. "14.5. x-user-defined". Encoding Standard. WHATWG.
  24. van Kesteren, Anne. "9. Legacy single-byte encodings (§ Note)". Encoding Standard. WHATWG.
  25. van Kesteren, Anne. "index KOI8-U visualization". Encoding Standard. WHATWG.
  26. "Bug 17053: Support KOI8-RU mapping for KOI8-U". W3C Bugzilla. 19 August 2015.
  27. van Kesteren, Anne. "10.1. GBK". Encoding Standard. WHATWG.
  28. van Kesteren, Anne. "5. Indexes (§ Index jis0212)". Encoding Standard. WHATWG.
  29. van Kesteren, Anne. "14.1: replacement". Encoding Standard. WHATWG.
  30. van Kesteren, Anne. "2: Security background". Encoding Standard. WHATWG.
  31. van Kesteren, Anne. "4.2: Names and labels (§ replacement)". Encoding Standard. WHATWG.
  32. Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Character and Entity References", XML, W3C, retrieved 8 March 2010