Privacy-Preserving Data Publishing - Part 2: Understanding Mondrian
The Mondrian algorithm provides an efficient way to achieve k-anonymity through multi-dimensional partitioning. Let's dive into how it works.

What is the Mondrian Algorithm?
The Mondrian algorithm, inspired by the geometric paintings of Piet Mondrian, achieves k-anonymity through recursive partitioning of data along different dimensions. Think of it as creating a kd-tree for privacy preservation.
Key Components
- Multi-dimensional space
  - Each quasi-identifier becomes a dimension
  - Records are points in this space
  - Partitions are hyper-rectangles
- Recursive partitioning
  - Choose the dimension with the widest normalized range
  - Split at the median
  - Continue until further splitting would violate k-anonymity

For intuition: with k = 2, six records, and the quasi-identifiers age and ZIP code, the first cut splits at the median of whichever attribute has the wider normalized range, leaving two groups of three; each group is split again only if both halves would keep at least two records.
Implementation Guide
Step 1: Data Preparation
```python
class DataPoint:
    def __init__(self, attributes):
        self.attributes = attributes
        self.normalized = self.normalize_attributes()

    def normalize_attributes(self):
        """Scale each attribute to the [0, 1] range using global bounds."""
        normalized = {}
        for attr, value in self.attributes.items():
            # GLOBAL_MINS / GLOBAL_MAXS hold per-attribute bounds computed
            # once over the whole dataset (see the setup sketch below).
            min_val = GLOBAL_MINS[attr]
            max_val = GLOBAL_MAXS[attr]
            # Guard against zero-width ranges (constant attributes).
            span = max_val - min_val
            normalized[attr] = (value - min_val) / span if span else 0.0
        return normalized
```
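The class above assumes `GLOBAL_MINS` and `GLOBAL_MAXS` already exist. A minimal setup sketch (the dataset and attribute names here are invented for illustration):

```python
# Hypothetical toy dataset; in practice these rows come from your source table.
records = [
    {"age": 34, "zipcode": 53711},
    {"age": 29, "zipcode": 53703},
    {"age": 47, "zipcode": 53562},
    {"age": 41, "zipcode": 53705},
]

QUASI_IDENTIFIERS = ["age", "zipcode"]
GLOBAL_MINS = {a: min(r[a] for r in records) for a in QUASI_IDENTIFIERS}
GLOBAL_MAXS = {a: max(r[a] for r in records) for a in QUASI_IDENTIFIERS}

points = [DataPoint(r) for r in records]
```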
Step 2: Core Mondrian Algorithm
```python
def mondrian_partition(data_points, k):
    # A partition smaller than 2k cannot be split into two k-sized halves.
    if len(data_points) < 2 * k:
        return [data_points]

    # Find the dimension with the widest normalized range
    widest_dim = find_widest_dimension(data_points)

    # Sort points by the chosen dimension
    sorted_points = sorted(data_points,
                           key=lambda x: x.normalized[widest_dim])

    # Split at the median point
    median_idx = len(sorted_points) // 2
    left_partition = sorted_points[:median_idx]
    right_partition = sorted_points[median_idx:]

    # Recurse only if both halves remain k-anonymous
    if len(left_partition) >= k and len(right_partition) >= k:
        return (mondrian_partition(left_partition, k) +
                mondrian_partition(right_partition, k))
    return [data_points]
```
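The partitioner returns groups of points but stops short of the release step. Here is a minimal sketch of the final generalization pass (the `generalize` helper is an illustration, not code from the original paper), which replaces each record's exact quasi-identifiers with its partition's per-attribute ranges:

```python
def generalize(partitions):
    """Summarize each partition by per-attribute (min, max) ranges;
    these ranges are published in place of the exact values."""
    released = []
    for part in partitions:
        ranges = {}
        for attr in QUASI_IDENTIFIERS:
            values = [p.attributes[attr] for p in part]
            ranges[attr] = (min(values), max(values))
        released.append({"count": len(part), "ranges": ranges})
    return released

partitions = mondrian_partition(points, k=2)
for group in generalize(partitions):
    print(group)
```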
Optimization Techniques
1. Dimension Selection
```python
def find_widest_dimension(points):
    """Return the quasi-identifier with the largest normalized spread."""
    ranges = {}
    for dim in QUASI_IDENTIFIERS:
        values = [p.normalized[dim] for p in points]
        ranges[dim] = max(values) - min(values)
    return max(ranges.items(), key=lambda x: x[1])[0]
```
2. Efficient Median Finding
```python
def quick_select_median(points, dim):
    """Find the median along `dim` with QuickSelect in expected O(n) time,
    instead of the O(n log n) full sort used in mondrian_partition."""
    k = len(points) // 2
    return quick_select(points, 0, len(points) - 1, k, dim)
```
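The `quick_select` helper is not defined in the snippet above. Here is one minimal sketch that matches the call signature (a randomized, in-place Lomuto-partition variant; treat it as an illustrative assumption rather than the canonical implementation):

```python
import random

def quick_select(points, lo, hi, k, dim):
    """Return the point whose normalized value along `dim` has rank k
    (0-indexed) within points[lo..hi]; reorders `points` in place."""
    while True:
        if lo == hi:
            return points[lo]
        # A random pivot avoids the O(n^2) worst case on sorted input.
        pivot_idx = random.randint(lo, hi)
        points[pivot_idx], points[hi] = points[hi], points[pivot_idx]
        pivot_val = points[hi].normalized[dim]
        store = lo
        for i in range(lo, hi):
            if points[i].normalized[dim] < pivot_val:
                points[i], points[store] = points[store], points[i]
                store += 1
        # Move the pivot into its final position.
        points[store], points[hi] = points[hi], points[store]
        if k == store:
            return points[store]
        elif k < store:
            hi = store - 1
        else:
            lo = store + 1
```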
Common Implementation Challenges
- Handling Categorical Data
  - Create hierarchical generalization structures
  - Convert to numerical scales where appropriate (see the sketch after this list)
  - Use domain-specific distance metrics
- Dealing with Skewed Distributions
  - Consider using adaptive splitting points
  - Implement density-aware partitioning
  - Apply pre-processing techniques
- Performance Optimization
  - Use efficient data structures
  - Implement parallel processing for large datasets
  - Cache intermediate results
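As a sketch of the categorical-data point above (the attribute name and its ordering are hypothetical), one simple approach maps an ordered categorical attribute onto a numeric scale so Mondrian can treat it like any other dimension:

```python
# Hypothetical ordinal encoding for an 'education' quasi-identifier.
# The ordering is a modeling assumption; truly unordered categories
# need a generalization hierarchy instead.
EDUCATION_ORDER = {
    "no-diploma": 0,
    "high-school": 1,
    "associate": 2,
    "bachelor": 3,
    "master": 4,
    "doctorate": 5,
}

def encode_education(record):
    """Replace the categorical value with its ordinal position."""
    record = dict(record)  # copy so the original record is untouched
    record["education"] = EDUCATION_ORDER[record["education"]]
    return record
```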
Best Practices
- Pre-processing
  - Remove outliers that could skew partitioning
  - Handle missing values appropriately
  - Normalize numerical attributes
- Partition Management
  - Keep track of partition boundaries
  - Maintain summary statistics
  - Implement efficient storage structures
- Quality Assurance
  - Verify k-anonymity after partitioning (see the sketch below)
  - Measure information loss
  - Track partition sizes
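A minimal sketch of the quality-assurance checks above. The discernibility penalty shown is one common information-loss measure from the k-anonymity literature (smaller is better); it is a choice, not the only option:

```python
def verify_k_anonymity(partitions, k):
    """Every released partition must contain at least k records."""
    return all(len(part) >= k for part in partitions)

def discernibility_penalty(partitions):
    """Each record is charged the size of its own partition, so a
    result with many small partitions loses less information."""
    return sum(len(part) ** 2 for part in partitions)

partitions = mondrian_partition(points, k=2)
assert verify_k_anonymity(partitions, k=2)
print("discernibility:", discernibility_penalty(partitions))
```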
Real-world Example: The Netflix Prize Dataset
In 2006, Netflix released an anonymized dataset containing 100 million movie ratings from 480,000 customers as part of their $1 million Netflix Prize competition to improve their recommendation algorithm. Their anonymization approach included several key techniques:
- Data Preprocessing
  - Removed customer identifying information
  - Replaced actual user IDs with random numbers
  - Deliberately modified some ratings slightly
  - Removed obvious temporal correlations

What Went Wrong
Despite these precautions, researchers Arvind Narayanan and Vitaly Shmatikov demonstrated in 2008 that:
- Users could be identified by matching Netflix rating patterns with public IMDb reviews
- With just 8 movie ratings and dates (±2 weeks), they could uniquely identify 99% of users in the dataset
- Even if the ratings were stripped of dates, 84% of users could still be uniquely identified
Key Lessons
- Simple removal of identifiers isn’t enough
- Auxiliary data sources must be considered (in this case, IMDb)
- Even seemingly innocent data like movie ratings can be identifying
- Temporal data adds significant re-identification risk
This case fundamentally changed how we approach data anonymization, leading to:
- More rigorous privacy guarantees like differential privacy
- Better understanding of the mosaic effect (combining multiple data sources)
- Increased focus on formal privacy models rather than ad-hoc anonymization
Source: This case was documented in “Robust De-anonymization of Large Sparse Datasets” by Narayanan and Shmatikov, presented at the 2008 IEEE Symposium on Security and Privacy. The research led Netflix to cancel a planned second competition and settle a privacy lawsuit.
What’s Next?
In Part 3, we’ll explore generalization techniques and how they complement the Mondrian algorithm for achieving optimal data utility while maintaining privacy.
Preview of Coming Topics
- Generalization hierarchies
- Information loss metrics
- Hybrid approaches
- Advanced optimization techniques
References
- LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian Multidimensional K-Anonymity. IEEE International Conference on Data Engineering (ICDE).
- The implementation examples in this post are inspired by LeFevre et al.’s original Mondrian paper, adapted for clarity and modern Python practices.