Data Types (OCR A-Level Computer Science): Revision Notes
Character Sets
Overview
In computing, a character set is a standardised way to represent text and symbols in binary so that computers can process and display them correctly. Each character (e.g., letters, digits, punctuation) is assigned a unique binary code. Understanding how character sets like ASCII and Unicode work is crucial for handling text data across different systems and languages.
What is a Character Set?
- A character set maps characters (letters, numbers, symbols) to specific binary values.
- This enables computers to store, transmit, and display text correctly, even across different devices and platforms.
- Each character is assigned a unique numeric code, which is then converted to binary.
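The character-to-code-to-binary mapping described above can be sketched in Python. This is only an illustration: the helper name `describe` is made up for these notes, and it relies on Python's built-in `ord` and `chr`, which work with Unicode code points (identical to ASCII for the first 128 characters).

```python
def describe(char: str) -> tuple[int, str]:
    """Return the numeric code and binary string for one character."""
    code = ord(char)                # character -> numeric code
    binary = format(code, "08b")    # numeric code -> binary, padded to at least 8 bits
    return code, binary

for ch in ("A", "a", "7", "!"):
    code, binary = describe(ch)
    # chr(code) reverses the mapping: numeric code -> character
    print(ch, code, binary, chr(code))
```

Note that `chr(ord(ch))` always gives back the original character, which is the property that makes a character set usable for storing and retrieving text.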
ASCII (American Standard Code for Information Interchange)
Overview: One of the earliest character sets, ASCII uses 7 bits to represent characters.
7-bit ASCII
Can represent 128 characters (2^7 = 128), including:
- Uppercase and lowercase English letters (A–Z, a–z)
- Digits (0–9)
- Punctuation and special symbols (e.g., !, @, #)
- Control characters (e.g., newline, tab)
8-bit ASCII (Extended ASCII)
Extends the set to 256 characters (2^8 = 256), adding support for additional symbols and simple graphical characters.
Usage: Suitable for English text and basic symbols but limited for international use.
Example:
- Character: A
- ASCII Code: 65
- Binary: 01000001
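To see why 7-bit ASCII is limited for international use, here is a minimal Python sketch. The function name `fits_in_ascii` is hypothetical, chosen for these notes; the check simply tests whether every character's code falls in the range 0 to 127.

```python
# Number of distinct codes available with 7 and 8 bits.
print(2 ** 7, 2 ** 8)

def fits_in_ascii(text: str) -> bool:
    """True if every character has a code in the 7-bit range 0-127."""
    return all(ord(ch) < 128 for ch in text)

print(fits_in_ascii("Hello!"))
print(fits_in_ascii("café"))    # 'é' lies outside the 7-bit range

# Python's own ASCII encoder refuses text it cannot represent.
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("Cannot encode as ASCII:", err.reason)
```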
Unicode
Overview: A more comprehensive character set designed to support a wide range of characters from multiple languages and scripts.
- 16-bit Unicode: Early versions of Unicode used a fixed 16 bits per character, giving 65,536 codes. UTF-16 builds on this by using pairs of 16-bit units for characters beyond that range.
- UTF-8 Encoding: Variable-length encoding that uses 1 to 4 bytes, ensuring compatibility with ASCII for the first 128 characters.
Why Unicode?
- Supports thousands of characters, including non-Latin scripts (e.g., Chinese, Arabic).
- Includes emojis, mathematical symbols, and more.
Usage: Essential for global applications, such as web development, where diverse languages must be supported.
Example:
- Character: € (Euro symbol)
- Unicode Code Point: U+20AC
- Binary (UTF-8): 11100010 10000010 10101100
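The variable-length behaviour of UTF-8 can be checked with a short Python sketch. The helper `utf8_bits` is a made-up name for these notes; it uses the standard `str.encode("utf-8")` method and prints each byte as 8 bits.

```python
def utf8_bits(char: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in char.encode("utf-8"))

# One example each of a 1-, 2-, 3- and 4-byte character.
# "\U0001F600" is the escape for the grinning-face emoji.
for ch in ("A", "¢", "€", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s):", utf8_bits(ch))
```

The line for € should match the binary shown in the example above, and the line for A is the same single byte that ASCII uses.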
Differences Between ASCII and Unicode
| Feature | ASCII | Unicode |
|---|---|---|
| Bit Length | 7-bit (or 8-bit extended) | 8 to 32 bits per character in UTF-8 (variable length) |
| Character Support | 128 (7-bit) or 256 (8-bit) | Over 1 million code points available (1,114,112) |
| Scope | English and basic symbols | Global, covers most of the world's writing systems |
| Compatibility | Not suitable for international use | Backward-compatible with ASCII |
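The backward-compatibility row can be illustrated with a small Python sketch. The sample strings are invented for illustration; the point is that ASCII-only text produces identical bytes under both encodings, while other characters need extra bytes in UTF-8.

```python
text = "Data Types"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)    # same bytes for ASCII-only text

# Every one of the 128 ASCII codes is a single identical byte in UTF-8.
print(all(chr(i).encode("utf-8") == bytes([i]) for i in range(128)))

# With a non-ASCII character, the character count and byte count differ.
mixed = "Price: €5"
print(len(mixed), "characters,", len(mixed.encode("utf-8")), "bytes")
```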
Why Character Sets Matter
- Data Interoperability: Ensures consistent representation of text across different systems.
- Internationalisation: Unicode enables software to support multiple languages and scripts.
- Storage and Transmission: Efficient storage of text data in binary format, critical for data processing and network communication.
Examples
Example 1: ASCII to Binary
Convert the ASCII character B to binary.
ASCII value of B = 66.
Binary representation: 01000010.
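Example 2: Binary to Character (Unicode)
Given the binary sequence 11000010 10100010 (UTF-8), determine the character.
Convert each byte to hexadecimal: C2 A2.
The first byte starts with 110, which marks a 2-byte sequence; the second byte starts with 10, which marks a continuation byte.
Remove those marker bits and join what remains: 00010 followed by 100010 gives 00010100010, which is A2 in hexadecimal.
The code point U+00A2 is ¢ (cent symbol).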
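Decoding a 2-byte UTF-8 sequence by hand can be sketched in Python as follows. The function `decode_two_byte_utf8` is a hypothetical helper written for these notes; it handles only the 2-byte case and skips some checks a full decoder would make (for example, rejecting overlong encodings).

```python
def decode_two_byte_utf8(bits: str) -> str:
    """Decode a 2-byte UTF-8 sequence written as '110xxxxx 10xxxxxx'."""
    first, second = (int(group, 2) for group in bits.split())
    # Check the marker bits: 110 on the lead byte, 10 on the continuation byte.
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
    code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
    return chr(code_point)

char = decode_two_byte_utf8("11000010 10100010")
print(char, f"U+{ord(char):04X}")

# Cross-check against Python's built-in UTF-8 decoder.
print(bytes([0xC2, 0xA2]).decode("utf-8"))
```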
Note Summary
Common Mistakes
- Confusing ASCII and Unicode: ASCII is limited to basic English characters, while Unicode supports global characters.
- Assuming Fixed Length for Unicode: Unicode uses variable-length encoding (e.g., UTF-8), where different characters may take 1 to 4 bytes.
- Incorrect Binary Conversion: Ensure the correct binary length is used for ASCII (7 or 8 bits) or Unicode (variable).
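The binary-length mistake is easy to make in code too. As a small Python illustration (the helper name `from_binary` is made up for these notes), the built-in `bin` drops leading zeros, so the width has to be requested explicitly:

```python
code = ord("B")

print(bin(code))              # '0b1000010': a prefix plus only 7 digits
print(format(code, "07b"))    # 7-bit ASCII form
print(format(code, "08b"))    # 8-bit form, with a leading zero

def from_binary(bits: str) -> str:
    """Interpret a 7- or 8-bit binary string as a single character code."""
    return chr(int(bits, 2))

# Both widths describe the same character.
print(from_binary("1000010"), from_binary("01000010"))
```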
Key Takeaways
- ASCII: Efficient for English text but limited in scope.
- Unicode: A versatile character set that supports most languages and symbols.
- Conversions: Be able to convert characters to binary and vice versa, and understand the encoding format (ASCII or Unicode) being used.
- Purpose: Character sets are essential for consistent text representation in computing systems worldwide.