xml

字数 667 · 2021-10-04

Extensible Markup Language

Documents

Each XML document has both a logical and a physical structure

  • physical
    • entities
  • logical
    • declarations
    • elements
    • comments
    • character references
    • processing instructions

Characters

Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.

  • entity
    • text
      • markup
        • character
      • data
        • character

Unicode

Entities

Applications

  • RSS
  • SVG

[1] document ::= prolog element Misc*
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
[3] S ::= (#x20 | #x9 | #xD | #xA)+
[4] NameStartChar ::= : | [A-Z] | _ | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | “-“ | “.” | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*
[6] Names ::= Name (#x20 Name)*
[7] Nmtoken ::= (NameChar)+
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*

Unicode

  • [#xC0-#xD6] [#xD8-#xF6] [#xF8-#x0FF] - Latin-1 supplement characters
    • #xD7 - ×
    • #xF7 - ÷
  • [#x0100 - #x2FF] - Latin Extended-A/B & IPA Extensions
  • [#x0300 - #x033F] - CJK Symbols and Punctuation
  • [#x370-#x37D] [#x37F-#x3FF] - Greek and Coptic
    • #x37E - ;
  • [#x400 - #x1FFF] - Cyrillic … Arabic … Tibetan … Runic … Greek Extended
  • [#x2000 - #206F] - General Punctuation
    • #x200C -
    • #x200D -
  • [#x2070 - #x218F] - Superscripts and Subscripts …Currency Symbols … Number Forms
  • [#x2190 - #x2BFF]
    • Arrows
    • Mathematical Operators
    • Block Elements
    • [2B00 - 2BFF] - Miscellaneous Symbols and Arrows
  • [#x2C00 - #x2FEF]
    • [#x2C00 - #x2E7F] - Unknown
    • [#x2E80 - #x2EFF] - CJK Radicals Supplement
    • [#x2F00 - #x2FDF] - Kangxi Radicals
    • [#x2FE0 - #x2FEF] - Unknown
    • [#x2FF0 - #x2FFF] - Ideographic Description Characters
  • [#x3001 - #xD7FF]
    • [#x3000 - #x303F] - CJK Symbols and Punctuation
      • #x3000 -   - Ideographic Space
    • Hiragana
    • Katakana
    • Bopomofo
    • [4E00 — 9FFF] - CJK Unified Ideographs
    • [AC00 — D7AF] - Hangul Syllables
    • [D7B0 - D7FF] - Unknown
  • [#xD800 - #xF8FF]
    • High Surrogates - leading bytes
    • High Private Use Surrogates
    • Low Surrogates - trailing bytes
    • [E000 — F8FF] - Private Use Area - will not be assigned characters by the Unicode Consortium
  • [#xF900 - #xFDCF]|[#xFDF0-#xFFFD]
    • [F900 — FAFF] - CJK Compatibility Ideographs
    • [FB00 — FB4F] - Alphabetic Presentation Forms
    • [FB50 — FDFF] - Arabic Presentation Forms-A
      • [FDD0 - FDEF] - Undefined
    • [FFF0 — FFFF] - Specials
      • FFFD - - Replacement Character
      • FFFE FFFF - Undefined
  • [#x10000-#xEFFFF]
    • [10000 — 1007F] - Linear B Syllabary
    • [10080 — 100FF] - Linear B Ideograms
    • [10100 — 1013F] - Aegean Numbers

https://jrgraphix.net/research/unicode_blocks.php

https://en.wikipedia.org/wiki/Unicode_block

https://www.fileformat.info/info/unicode/char/1F746/index.htm

Everson Mono
https://fontlibrary.org/en/font/symbola
https://unifoundry.com/unifont/
https://github.com/unicode-org/last-resort-font