24 Oct 2024

Privacy-Preserving Data Publishing - Part 2: Understanding Mondrian

The Mondrian algorithm provides an efficient way to achieve k-anonymity through multi-dimensional partitioning. Let's dive into how it works.

What is the Mondrian Algorithm?

The Mondrian algorithm, inspired by the geometric paintings of Piet Mondrian, achieves k-anonymity through recursive partitioning of data along different dimensions. Think of it as creating a kd-tree for privacy preservation.

Key Components

  1. Multi-dimensional space
    • Each quasi-identifier becomes a dimension
    • Records are points in this space
    • Partitions are hyper-rectangles
  2. Recursive partitioning
    • Choose dimension with widest normalized range
    • Split at median
    • Recurse until any further split would violate k-anonymity
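
To make the split rule concrete, here is a toy single-step walkthrough with made-up, already-normalized values and k = 2:

# Four records over two quasi-identifiers, normalized to [0, 1]:
# (age, zipcode). All values here are hypothetical.
toy = [(0.10, 0.40), (0.20, 0.55), (0.80, 0.45), (0.90, 0.50)]

age_range = max(p[0] for p in toy) - min(p[0] for p in toy)  # 0.80
zip_range = max(p[1] for p in toy) - min(p[1] for p in toy)  # 0.15

# Age has the widest normalized range, so we cut there at the median.
toy.sort(key=lambda p: p[0])
median = len(toy) // 2
left, right = toy[:median], toy[median:]
# Both halves hold at least k = 2 records, so the split is allowed;
# with k = 3 all four records would stay in a single partition.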

Implementation Guide

Step 1: Data Preparation

# GLOBAL_MINS and GLOBAL_MAXS map each quasi-identifier to its minimum
# and maximum over the whole dataset, computed once before partitioning.

class DataPoint:
    def __init__(self, attributes):
        self.attributes = attributes
        self.normalized = self.normalize_attributes()
    
    def normalize_attributes(self):
        # Scale each attribute to the [0, 1] range (min-max normalization)
        normalized = {}
        for attr, value in self.attributes.items():
            min_val = GLOBAL_MINS[attr]
            max_val = GLOBAL_MAXS[attr]
            span = max_val - min_val
            # Guard against constant attributes (zero range)
            normalized[attr] = (value - min_val) / span if span else 0.0
        return normalized
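
Before any DataPoint can be constructed, the global bounds must be computed in a single pass over the raw records. A minimal sketch of that setup; the compute_global_bounds helper, the attribute names, and the sample records are all illustrative, not part of the original paper:

QUASI_IDENTIFIERS = ["age", "zipcode"]  # hypothetical quasi-identifiers

def compute_global_bounds(records):
    """One pass over the raw records to find per-attribute min and max."""
    mins = {attr: min(r[attr] for r in records) for attr in QUASI_IDENTIFIERS}
    maxs = {attr: max(r[attr] for r in records) for attr in QUASI_IDENTIFIERS}
    return mins, maxs

raw_records = [
    {"age": 29, "zipcode": 53703},
    {"age": 34, "zipcode": 53706},
    {"age": 57, "zipcode": 53703},
    {"age": 62, "zipcode": 53711},
]
GLOBAL_MINS, GLOBAL_MAXS = compute_global_bounds(raw_records)
points = [DataPoint(r) for r in raw_records]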

Step 2: Core Mondrian Algorithm

def mondrian_partition(data_points, k):
    """Recursively split data_points into k-anonymous groups.

    This is the "relaxed" Mondrian variant: it splits at the index
    median, so records sharing a value on the cut dimension may land
    in different partitions.
    """
    # Too few records to yield two partitions of size >= k
    if len(data_points) < 2 * k:
        return [data_points]
        
    # Find dimension with widest normalized range
    widest_dim = find_widest_dimension(data_points)
    
    # Sort points by chosen dimension
    sorted_points = sorted(data_points, 
                           key=lambda x: x.normalized[widest_dim])
    
    # Split into two groups at the median index
    median_idx = len(sorted_points) // 2
    left_partition = sorted_points[:median_idx]
    right_partition = sorted_points[median_idx:]
    
    # Recurse if both partitions maintain k-anonymity
    if len(left_partition) >= k and len(right_partition) >= k:
        return (mondrian_partition(left_partition, k) + 
                mondrian_partition(right_partition, k))
    
    return [data_points]
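
Once the recursion stops, each partition is published with its quasi-identifiers generalized to the partition's value ranges. The summarize_partition helper below is our own illustration of that final step, building on the sketches above:

def summarize_partition(partition):
    """Replace exact values with the partition's [min, max] range."""
    summary = {}
    for attr in QUASI_IDENTIFIERS:
        values = [p.attributes[attr] for p in partition]
        summary[attr] = (min(values), max(values))
    return summary

partitions = mondrian_partition(points, k=2)
for partition in partitions:
    ranges = summarize_partition(partition)
    # Every record in this partition is published with these ranges,
    # so it is indistinguishable from at least k - 1 others.
    print(len(partition), ranges)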

Optimization Techniques

1. Dimension Selection

def find_widest_dimension(points):
    ranges = {}
    for dim in QUASI_IDENTIFIERS:
        values = [p.normalized[dim] for p in points]
        ranges[dim] = max(values) - min(values)
    
    return max(ranges.items(), key=lambda x: x[1])[0]

2. Efficient Median Finding

def quick_select_median(points, dim):
    """QuickSelect gives expected O(n) median finding, versus
    O(n log n) for a full sort; quick_select is sketched below."""
    k = len(points) // 2
    return quick_select(points, 0, len(points) - 1, k, dim)
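
The quick_select helper is not defined in the original post; one possible implementation, a randomized Lomuto-partition QuickSelect that rearranges the list in place, might look like this:

import random

def quick_select(points, left, right, k, dim):
    """Return the k-th smallest point (by normalized[dim]) in
    points[left..right], rearranging the list in place."""
    if left == right:
        return points[left]
    # A random pivot avoids the O(n^2) worst case on sorted input
    pivot_idx = random.randint(left, right)
    points[pivot_idx], points[right] = points[right], points[pivot_idx]
    pivot = points[right].normalized[dim]
    store = left
    for i in range(left, right):
        if points[i].normalized[dim] < pivot:
            points[i], points[store] = points[store], points[i]
            store += 1
    points[store], points[right] = points[right], points[store]
    if k == store:
        return points[store]
    elif k < store:
        return quick_select(points, left, store - 1, k, dim)
    else:
        return quick_select(points, store + 1, right, k, dim)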

Common Implementation Challenges

  1. Handling Categorical Data (see the sketch after this list)
    • Create hierarchical generalization structures
    • Convert to numerical scales where appropriate
    • Use domain-specific distance metrics
  2. Dealing with Skewed Distributions
    • Consider using adaptive splitting points
    • Implement density-aware partitioning
    • Apply pre-processing techniques
  3. Performance Optimization
    • Use efficient data structures
    • Implement parallel processing for large datasets
    • Cache intermediate results
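
As promised above, here is a hedged sketch of the categorical case; the education scale and ZIP hierarchy are made-up examples of an ordinal encoding and a generalization tree:

# Hypothetical ordinal encoding: education levels have a natural order,
# so they can be mapped onto a numeric scale and partitioned like any
# other dimension.
EDUCATION_SCALE = {
    "primary": 0, "secondary": 1, "bachelor": 2, "master": 3, "phd": 4,
}

# Hypothetical generalization hierarchy for truly unordered categories:
# when a partition is released, each value is replaced by its parent.
ZIP_HIERARCHY = {
    "53703": "537**", "53706": "537**",   # city level
    "537**": "5****",                     # region level
}

def generalize_zip(value, levels=1):
    """Walk `levels` steps up the hierarchy (hypothetical helper)."""
    for _ in range(levels):
        value = ZIP_HIERARCHY.get(value, value)
    return value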

Best Practices

  1. Pre-processing
    • Remove outliers that could skew partitioning
    • Handle missing values appropriately
    • Normalize numerical attributes
  2. Partition Management
    • Keep track of partition boundaries
    • Maintain summary statistics
    • Implement efficient storage structures
  3. Quality Assurance (see the sketch after this list)
    • Verify k-anonymity after partitioning
    • Measure information loss
    • Track partition sizes
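
A minimal quality-assurance sketch, reusing the partitions and QUASI_IDENTIFIERS from the examples above; the range-based loss metric is one illustrative choice among many (e.g., discernibility or normalized certainty penalty):

def verify_k_anonymity(partitions, k):
    """Check that every released partition contains at least k records."""
    return all(len(p) >= k for p in partitions)

def average_range_loss(partitions):
    """Mean normalized range width per partition and attribute:
    0.0 means exact values survive, 1.0 means fully generalized."""
    losses = []
    for partition in partitions:
        for attr in QUASI_IDENTIFIERS:
            values = [pt.normalized[attr] for pt in partition]
            losses.append(max(values) - min(values))
    return sum(losses) / len(losses)

assert verify_k_anonymity(partitions, k=2)
print(f"Average normalized range loss: {average_range_loss(partitions):.3f}")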

Real-world Example: The Netflix Prize Dataset

In 2006, Netflix released an anonymized dataset containing 100 million movie ratings from 480,000 customers as part of their $1 million Netflix Prize competition to improve their recommendation algorithm. Their anonymization approach included several key techniques:

  1. Data Preprocessing
    • Removed customer identifying information
    • Replaced actual user IDs with random numbers
    • Deliberately modified some ratings slightly
    • Removed obvious temporal correlations
  2. What Went Wrong
    Despite these precautions, researchers Arvind Narayanan and Vitaly Shmatikov demonstrated in 2008 that:
    • Users could be identified by matching Netflix rating patterns with public IMDb reviews
    • With just 8 movie ratings and dates (±2 weeks), they could uniquely identify 99% of users in the dataset
    • Even if the ratings were stripped of dates, 84% of users could still be uniquely identified
  3. Key Lessons
    • Simple removal of identifiers isn’t enough
    • Auxiliary data sources must be considered (in this case, IMDb)
    • Even seemingly innocent data like movie ratings can be identifying
    • Temporal data adds significant re-identification risk

This case fundamentally changed how we approach data anonymization, leading to:

  • More rigorous privacy guarantees like differential privacy
  • Better understanding of the mosaic effect (combining multiple data sources)
  • Increased focus on formal privacy models rather than ad-hoc anonymization

Source: This case was documented in “Robust De-anonymization of Large Sparse Datasets” by Narayanan and Shmatikov, presented at the 2008 IEEE Symposium on Security and Privacy. The research led Netflix to cancel a planned second competition and settle a privacy lawsuit.

What’s Next?

In Part 3, we'll explore generalization techniques and how they complement the Mondrian algorithm in balancing data utility against privacy.

Preview of Coming Topics

  1. Generalization hierarchies
  2. Information loss metrics
  3. Hybrid approaches
  4. Advanced optimization techniques

References

  1. LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian Multidimensional K-Anonymity. In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE).
  2. Narayanan, A., & Shmatikov, V. (2008). Robust De-anonymization of Large Sparse Datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy.
  3. The implementation examples in this post are inspired by LeFevre et al.'s original Mondrian paper, adapted for clarity and modern Python practices.