XML

Word count: 667 · 2021-10-04

Extensible Markup Language

Documents

Each XML document has both a logical and a physical structure.

physical
- entities
logical
- declarations
- elements
- comments
- character references
- processing instructions
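
As a rough illustration of the logical structure, the sketch below uses Python's standard xml.dom.minidom to list the top-level parts of a small document. The sample document and the helper name logical_parts are made up for these notes. minidom does not expose the XML declaration as a child node, so it does not show up in the result.

```python
from xml.dom import minidom

# Hypothetical sample: declaration, comment, processing instruction,
# and one element whose content holds a character reference.
SAMPLE = """<?xml version="1.0"?>
<!-- a comment in the prolog -->
<?render mode="fast"?>
<note lang="en">2 &#xD7; 3</note>
"""


def logical_parts(xml_text):
    """Return the top-level logical parts of a document as tuples."""
    doc = minidom.parseString(xml_text)
    parts = []
    for node in doc.childNodes:
        if node.nodeType == node.COMMENT_NODE:
            parts.append(("comment", node.data.strip()))
        elif node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            parts.append(("processing instruction", node.target))
        elif node.nodeType == node.ELEMENT_NODE:
            # Join the text children; the character reference is already resolved.
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            parts.append(("element", node.tagName, text))
    return parts


for part in logical_parts(SAMPLE):
    print(part)
```

The character reference `&#xD7;` reaches the application as the single character ×, which is why it is listed among the logical rather than the physical parts.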

Characters

Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.

entity
- text
  - markup
    - character
  - data
    - character
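
To make the markup versus character data split concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The helper name character_data and the sample strings are assumptions for illustration. Escaped characters such as `&lt;` are written with markup in the entity's text but arrive as plain character data after parsing.

```python
import xml.etree.ElementTree as ET


def character_data(xml_text):
    """Parse a one-element document and return its character data."""
    return ET.fromstring(xml_text).text


# "&lt;" and "&#xD7;" are markup in the source text;
# the parser hands them over as the characters "<" and "×".
print(character_data("<p>x &lt; y &#xD7; z</p>"))

# Escaped angle brackets stay character data and do not create a child element.
root = ET.fromstring("<p>&#60;b&#62;</p>")
print(root.text, len(root))
```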

Unicode

Entities

Applications

[1] document ::= prolog element Misc*
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
[3] S ::= (#x20 | #x9 | #xD | #xA)+
[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*
[6] Names ::= Name (#x20 Name)*
[7] Nmtoken ::= (NameChar)+
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*
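
The productions above translate fairly directly into regular expressions. The following is a sketch in Python, not a full XML parser: the names NAME_START, NAME_CHAR, is_name, is_nmtoken and is_char_data are invented here, and the character classes are transcribed by hand from productions [2], [4], [4a], [5] and [7].

```python
import re

# Production [4]: NameStartChar, written as the inside of a character class.
NAME_START = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# Production [4a]: NameChar adds "-", ".", digits, and a few combining ranges.
NAME_CHAR = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")  # production [5]
NMTOKEN_RE = re.compile(f"[{NAME_CHAR}]+")  # production [7]

# Production [2]: Char, any number of legal XML characters.
CHAR_RE = re.compile(
    r"[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def is_name(text):
    return NAME_RE.fullmatch(text) is not None


def is_nmtoken(text):
    return NMTOKEN_RE.fullmatch(text) is not None


def is_char_data(text):
    return CHAR_RE.fullmatch(text) is not None


for candidate in ["note", "xml:lang", "_id-1.a", "1abc", "-x", "a b"]:
    print(candidate, is_name(candidate), is_nmtoken(candidate))
```

A name may not start with a digit, "-" or ".", but an Nmtoken may, which is the only difference between productions [5] and [7].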

Unicode

[#xC0-#xD6] [#xD8-#xF6] [#xF8-#x0FF] - Latin-1 supplement characters
- #xD7 - × (multiplication sign, excluded)
- #xF7 - ÷ (division sign, excluded)
[#x0100 - #x02FF] - Latin Extended-A/B, IPA Extensions, Spacing Modifier Letters
[#x0300 - #x036F] - Combining Diacritical Marks (not in NameStartChar, allowed in NameChar)
[#x370-#x37D] [#x37F-#x3FF] - Greek and Coptic
- #x37E - ; (Greek question mark, excluded)
[#x400 - #x1FFF] - Cyrillic … Arabic … Tibetan … Runic … Greek Extended
[#x2000 - #206F] - General Punctuation
- #x200C - “
- #x200D - ”
[#x2070 - #x218F] - Superscripts and Subscripts … Currency Symbols … Number Forms
[#x2190 - #x2BFF]
- Arrows
- Mathematical Operators
- …
- Block Elements
- [#x2B00 - #x2BFF] - Miscellaneous Symbols and Arrows
[#x2C00 - #x2FEF]
- [#x2C00 - #x2E7F] - Glagolitic … Coptic … Tifinagh … Supplemental Punctuation
- [#x2E80 - #x2EFF] - CJK Radicals Supplement
- [#x2F00 - #x2FDF] - Kangxi Radicals
- [#x2FE0 - #x2FEF] - Unassigned (no block)
- [#x2FF0 - #x2FFF] - Ideographic Description Characters (outside the range, excluded)
[#x3001 - #xD7FF]
- [#x3000 - #x303F] - CJK Symbols and Punctuation
  - #x3000 - 　 - Ideographic Space
- Hiragana
- Katakana
- Bopomofo
- …
- [#x4E00 - #x9FFF] - CJK Unified Ideographs
- [#xAC00 - #xD7AF] - Hangul Syllables
- [#xD7B0 - #xD7FF] - Hangul Jamo Extended-B
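[#xD800 - #xF8FF]
- [#xD800 - #xDB7F] - High Surrogates - leading code units of a UTF-16 surrogate pair
- [#xDB80 - #xDBFF] - High Private Use Surrogates
- [#xDC00 - #xDFFF] - Low Surrogates - trailing code units of a UTF-16 surrogate pair
- [#xE000 - #xF8FF] - Private Use Area - will not be assigned characters by the Unicode Consortium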
[#xF900 - #xFDCF] | [#xFDF0 - #xFFFD]
- [#xF900 - #xFAFF] - CJK Compatibility Ideographs
- [#xFB00 - #xFB4F] - Alphabetic Presentation Forms
- [#xFB50 - #xFDFF] - Arabic Presentation Forms-A
  - [#xFDD0 - #xFDEF] - Noncharacters (excluded)
- …
- [#xFFF0 - #xFFFF] - Specials
  - #xFFFD - � - Replacement Character
  - #xFFFE, #xFFFF - Noncharacters (excluded)
[#x10000 - #xEFFFF]
- [#x10000 - #x1007F] - Linear B Syllabary
- [#x10080 - #x100FF] - Linear B Ideograms
- [#x10100 - #x1013F] - Aegean Numbers
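
The individual code points called out in this list can be checked with Python's standard unicodedata module. This is a small sketch: the NAME_START_RANGES table is copied by hand from production [4], and the helper names is_name_start and describe are invented for these notes.

```python
import unicodedata

# Ranges of production [4] NameStartChar, as inclusive (low, high) pairs.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_name_start(code_point):
    """True if the code point may begin an XML name."""
    return any(low <= code_point <= high for low, high in NAME_START_RANGES)


def describe(code_point):
    """Format a code point with its Unicode name and NameStartChar status."""
    # Surrogates, private use and noncharacters have no name, hence the default.
    name = unicodedata.name(chr(code_point), "<no name>")
    status = "allowed" if is_name_start(code_point) else "excluded"
    return f"U+{code_point:04X} {name}: {status}"


for cp in (0xD7, 0xF7, 0x37E, 0x200C, 0x200D, 0x3000, 0xFFFD):
    print(describe(cp))
```

The gaps in production [4] line up with symbols and punctuation that sit inside otherwise alphabetic blocks, such as × and ÷ in Latin-1 Supplement.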

https://jrgraphix.net/research/unicode_blocks.php

https://en.wikipedia.org/wiki/Unicode_block

https://www.fileformat.info/info/unicode/char/1F746/index.htm

Everson Mono
https://fontlibrary.org/en/font/symbola
https://unifoundry.com/unifont/
https://github.com/unicode-org/last-resort-font