Unicode and HTML
Web pages authored in HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set. Central to the relationship between Unicode and HTML is the distinction between the "document character set", which defines the set of characters that may appear in an HTML document and assigns a number to each, and the "external character encoding", or "charset", which is used to encode a given document as a sequence of bytes.
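The distinction between the document character set and the external encoding can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name `code_point_label` and the choice of the Greek capital letter Delta as the sample character are assumptions made for the example, not part of any HTML specification. It shows that the character's number stays fixed while the bytes depend on the chosen charset, and that a character missing from the chosen charset can still be written as a numeric character reference.

```python
import html


def code_point_label(ch: str) -> str:
    """Hypothetical helper: format a character's number as U+XXXX."""
    return f"U+{ord(ch):04X}"


sample = "\u0394"  # GREEK CAPITAL LETTER DELTA

# The document character set assigns the number. It does not depend on
# how the document is later stored or transmitted.
label = code_point_label(sample)

# The external character encoding ("charset") decides the bytes.
utf8_bytes = sample.encode("utf-8")
utf16be_bytes = sample.encode("utf-16-be")

# ISO-8859-1 has no Delta, so the encoder falls back to a numeric
# character reference, which is plain ASCII and therefore representable.
latin1_bytes = sample.encode("iso-8859-1", errors="xmlcharrefreplace")

# A reader reverses both steps: decode the bytes with the declared
# charset, then resolve character references in the resulting text.
round_trip = html.unescape(latin1_bytes.decode("iso-8859-1"))

print(label, utf8_bytes, utf16be_bytes, latin1_bytes, round_trip)
```

All three byte sequences denote the same character; only the charset declared for the document tells a browser which interpretation to apply.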
- Date
- 3 November 2007
- Has abstract
- Web pages authored in HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set. Central to the relationship between Unicode and HTML is the distinction between the "document character set", which defines the set of characters that may appear in an HTML document and assigns a number to each, and the "external character encoding", or "charset", which is used to encode a given document as a sequence of bytes. In RFC 1866, the initial HTML 2.0 standard, the document character set was defined as ISO-8859-1 (later HTML standards default to the Windows-1252 encoding). RFC 2070 extended it to ISO 10646 (which is essentially equivalent to Unicode). The document character set does not vary between documents in different languages or created on different platforms. The external character encoding is chosen by the author of the document (or by the software the author uses to create it) and determines how the bytes used to store or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references. The relationship between Unicode and HTML tends to be a difficult topic for computer professionals, document authors, and web users alike. Accurately representing text from different natural languages and writing systems in web pages is complicated by the details of character encoding, markup language syntax, fonts, and varying levels of support in web browsers.
- Is primary topic of
- Unicode and HTML
- Label
- Unicode and HTML
- Link from a Wikipage to an external page
- www.hotpeachpages.net/a/characters.html
- www.pinyin.info/tools/converter/chars2uninumbers.html
- www.alanwood.net/unicode/cjk_compatibility_ideographs.html
- www.w3.org/TR/REC-html40/HTMLlat1.ent
- www.w3.org/TR/REC-html40/HTMLspecial.ent
- www.w3.org/TR/REC-html40/HTMLsymbol.ent
- unicode.coeurlumiere.com/
- www.alanwood.net/unicode/
- www.unicode.org/charts/
- www.unicodemap.org/
- www.w3.org/TR/unicode-xml/
- web.archive.org/web/20071103125951/http://unicode.coeurlumiere.com/
- www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm
- scripts.sil.org/cms/scripts/page.php?site_id=nrsi
- web.archive.org/web/20110924073701/http://www.w3.org/TR/html5/semantics.html#charset
- Link from a Wikipage to another Wikipage
- 7 (number)
- A
- Abstraction
- Arabic alphabet
- ASCII
- Basic Multilingual Plane
- Bit
- Byte
- Byte order mark
- Category:HTML
- Category:Unicode
- Character (computing)
- Character encoding
- Character encodings in HTML
- Character reference
- Charset detection
- CJK Unified Ideographs
- Code2000
- Comparison of Unicode encodings
- Computer font
- Computer network
- Computer storage
- Cyrillic script
- Decimal
- Delta (letter)
- Document Type Definition
- Em dash
- Endianness
- Face with Tears of Joy emoji
- Fe (rune)
- File system
- Ge'ez alphabet
- Grapheme
- Greek alphabet
- Háček
- Hangul
- Hebrew alphabet
- Hexadecimal
- Hiragana
- HTML
- HTML5
- HTML email
- HTTP
- IEC 8859-1
- Internet Explorer
- Internet Explorer 6
- ISO 10646
- ISO 8859-1
- Latin alphabet
- List of typefaces
- List of XML and HTML character entity references
- Malayalam alphabet
- Markup language
- Mem
- Meta:Help:Special characters
- MIME
- Mozilla Firefox
- Natural language
- Netscape Navigator
- Numerical digit
- Numeric character reference
- Numeric character references
- Octet (computing)
- Opera (web browser)
- Operating system
- Programming language
- Qha
- Qoph
- Runic alphabet
- Safari (web browser)
- Short I
- Simplified Chinese characters
- ß
- Syllable
- Thai alphabet
- Thorn (letter)
- Traditional Chinese characters
- Unicode
- Unicode block
- Unicode Transformation Format
- Universal Character Set
- UTF-16
- UTF-16BE
- UTF-16LE
- UTF-32BE
- UTF-32LE
- UTF-8
- Web browser
- Web page
- Windows-1251
- Windows-1252
- World Wide Web
- Writing system
- XHTML
- XML
- SameAs
- 3GmQD
- m.07vv9
- Q3549946
- Subject
- Category:HTML
- Category:Unicode
- Url
- https://web.archive.org/web/20071103125951/http://unicode.coeurlumiere.com/
- WasDerivedFrom
- Unicode and HTML?oldid=1116218032&ns=0
- WikiPageLength
- 22301
- Wikipage page ID
- 31985
- Wikipage revision ID
- 1116218032
- WikiPageUsesTemplate
- Template:Citation needed
- Template:Essay-like
- Template:Html series
- Template:IETF RFC
- Template:Main
- Template:Multiple issues
- Template:Primary sources
- Template:Refimprove
- Template:Reflist
- Template:Rewrite
- Template:Short description
- Template:Snd
- Template:SpecialChars
- Template:Toomanylinks
- Template:U+
- Template:Unicode navigation
- Template:Webarchive