Hashing (OCR A-Level Computer Science): Revision Notes
Hashing
Overview
Hashing is a process that converts data into a fixed-size string of characters, typically using a mathematical algorithm known as a hash function. The output, called a hash value or hash code, is unique to the original data. Hashing is commonly used for data storage, data retrieval, and ensuring data integrity. It is particularly useful for securely storing sensitive information, such as passwords.
Purpose and Need for Hashing
- Data Integrity: Hashing helps verify that data has not been altered, as even a tiny change in the original data produces a completely different hash. This makes it useful for detecting data corruption or tampering.
- Efficient Data Retrieval: In databases, hashing enables fast data access by assigning unique hash values to data entries. This allows direct access to data without needing to search through all entries.
- Security for Sensitive Data: Hashing protects sensitive information, such as passwords, by storing only the hash of the data rather than the data itself. Even if an attacker gains access to the hash, they cannot retrieve the original data directly.
How Hashing Works
- Hash Function: A hash function takes an input (like a password or document) and returns a fixed-length hash value. Common hash functions include MD5 (Message Digest Algorithm 5), SHA-1 (Secure Hash Algorithm 1), and SHA-256 (Secure Hash Algorithm 256).
- Deterministic Output: The same input always produces the same hash output, ensuring consistency for data retrieval and comparison.
- Fixed Length: Regardless of the input size, hash values have a fixed length (e.g., 256 bits for SHA-256), making them ideal for efficient storage and indexing.
Properties of Hashing
- Uniqueness: Ideally, each unique input produces a unique hash value. Although not guaranteed, a good hash function minimises the probability of two different inputs producing the same hash (a collision).
- Irreversibility: Hash functions are one-way functions, meaning it's computationally infeasible to derive the original data from the hash value alone. This property is crucial for storing sensitive information like passwords.
- Speed: Hash functions are designed to process data quickly, making them efficient for large datasets and rapid lookups.
Uses of Hashing
Storing Passwords Securely
- Purpose: Rather than storing plain text passwords, systems store the hash of the password. When a user enters their password, the system hashes it and compares the result to the stored hash. This ensures that even if the hash database is compromised, the original passwords remain secure.
- Password Verification Process:
- User enters a password.
- The system hashes the entered password.
- The resulting hash is compared with the stored hash. If they match, the password is correct.
Example:
A user's password, "MySecret123", is hashed to produce a hash value such as b2f5ff47436671b6e533d8dc3614845d.
The hash value, not the actual password, is stored in the system.
Data Retrieval in Hash Tables
Purpose: Hash tables use hashing to store and retrieve data quickly by associating each data item with a unique hash. This provides constant time complexity, making it highly efficient for large datasets.
Example: A hash table for storing student IDs might use the student's name as input, generating a hash value that determines the storage location of the student's record.
When the student's name is entered, the hash is recalculated, allowing the system to quickly retrieve the associated record.
Data Integrity Verification
Purpose: Hashing verifies that files or messages haven't been altered, ensuring data integrity. A hash of the original data is created and compared to a recalculated hash when the data is received.
Example: When a file is downloaded, a hash value provided by the source can be recalculated and compared with the hash of the downloaded file.
If the hashes match, the file is unaltered; if they don't, it indicates potential corruption or tampering.
Common Hash Functions
MD5 (Message Digest Algorithm 5)
- Produces a 128-bit hash.
- Generally used for basic checksums but is now considered weak due to vulnerabilities that allow hash collisions (where two different inputs produce the same hash).
SHA-1 (Secure Hash Algorithm 1)
- Produces a 160-bit hash.
- Previously popular, but like MD5, it is now considered insecure for sensitive data due to collision vulnerabilities.
SHA-256 (Secure Hash Algorithm 256)
- Part of the SHA-2 family, producing a 256-bit hash.
- Currently one of the most secure and widely used hashing algorithms, especially for applications needing strong data integrity and security.
Example of Hashing in Python
Below is an example of hashing a password using Python's hashlib library and the SHA-256 algorithm.
import hashlib
def hash_password(password):
# Encode the password and hash it using SHA-256
hash_object = hashlib.sha256(password.encode())
hashed_password = hash_object.hexdigest()
return hashed_password
# Example usage
password = "MySecurePassword123"
hashed = hash_password(password)
print("Original Password:", password)
print("Hashed Password:", hashed)
Explanation of the Example
- Encoding: The password is encoded to a byte format compatible with the hash function.
- Hashing: The SHA-256 algorithm hashes the password.
- Result: The hashed password is stored as a hexadecimal string, which is much shorter and more secure to store than the original password.
Benefits and Drawbacks of Hashing
Benefits:
- Data Security: Hashing provides a secure way to store sensitive information by transforming it into a fixed-length value that cannot easily be reversed.
- Efficient Data Retrieval: Hash tables use hashing to enable fast and efficient data lookup, which is ideal for large databases.
- Data Integrity: Hashing ensures data integrity by verifying that data has not been altered, as any changes will produce a different hash.
Drawbacks:
- Collisions: Although rare, different inputs can occasionally produce the same hash value, resulting in a collision. Stronger hashing algorithms like SHA-256 minimise this risk.
- Irreversibility: Hashing is a one-way function, meaning the original data cannot be retrieved from the hash. This is beneficial for security but limits its use in situations where data needs to be recovered.
- Vulnerability to Brute Force Attacks: Simple hashing without additional security measures (like salting) is vulnerable to brute force or dictionary attacks.
Note Summary
Key Takeaways
- Hashing is a process that converts data into a fixed-size hash value using a hash function, making it essential for data integrity and security.
- Uses of Hashing: Commonly used for secure password storage, quick data retrieval in hash tables, and data integrity verification.
- Hash Functions: Popular algorithms include MD5, SHA-1, and SHA-256, with SHA-256 being the preferred choice for security-sensitive applications.
- Hashing is irreversible, making it secure for storing sensitive data, although additional measures like salting are often used to enhance security for applications such as password storage.