# Partition Evolution
Iceberg supports partition evolution — changing the partition layout of a table without rewriting existing data. New data is written with the new partition scheme while old data retains its original partitioning.
## Get Current Partition Spec
Calling `get_partition_spec(spark, "drls.db.events")` returns:
```python
{
    "success": True,
    "table": "drls.db.events",
    "partition_fields": [
        {"field": "ts", "transform": "day"},
    ],
}
```
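The `partition_fields` list in this return value can be rendered into Iceberg's `transform(column)` notation. A minimal pure-Python sketch; the identity `region` field is a hypothetical addition for illustration:

```python
# Return value shaped like get_partition_spec's output shown above;
# the "region" identity field is a hypothetical extra for illustration.
spec = {
    "success": True,
    "table": "drls.db.events",
    "partition_fields": [
        {"field": "ts", "transform": "day"},
        {"field": "region", "transform": None},
    ],
}

def describe(fields):
    # transform(column) for transformed fields, bare column name for identity.
    return [
        f"{f['transform']}({f['field']})" if f.get("transform") else f["field"]
        for f in fields
    ]

print(describe(spec["partition_fields"]))  # → ['day(ts)', 'region']
```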
## Add Partition Field
Add a new partition field with an optional transform:
```python
from drls.iceberg import add_partition_field

# Identity partition (no transform)
add_partition_field(spark, "drls.db.events", field="region")

# With a transform
add_partition_field(spark, "drls.db.events", field="ts", transform="month")
```
Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| `spark` | `SparkSession` | Yes | Active Spark session |
| `table` | `str` | Yes | Fully qualified table name |
| `field` | `str` | Yes | Column name to partition by |
| `transform` | `str \| None` | No | Partition transform (see below) |
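These helpers most likely wrap Iceberg's Spark SQL extensions (`ALTER TABLE ... ADD PARTITION FIELD ...`). A hypothetical sketch of how the call arguments could map onto that statement; the actual implementation may differ:

```python
from typing import Optional

def add_partition_field_sql(table: str, field: str,
                            transform: Optional[str] = None) -> str:
    # Map this page's transform notation onto Iceberg SQL:
    #   None         -> identity:            region
    #   "month"      -> named transform:     month(ts)
    #   "bucket[16]" -> parameterized form:  bucket(16, id)
    if transform is None:
        target = field
    elif "[" in transform:
        name, n = transform.rstrip("]").split("[")
        target = f"{name}({n}, {field})"
    else:
        target = f"{transform}({field})"
    return f"ALTER TABLE {table} ADD PARTITION FIELD {target}"

print(add_partition_field_sql("drls.db.events", "ts", "month"))
# → ALTER TABLE drls.db.events ADD PARTITION FIELD month(ts)
```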
### Supported Transforms
| Transform | Description | Example |
|---|---|---|
| `year` | Extract year from timestamp/date | `year(ts)` |
| `month` | Extract month from timestamp/date | `month(ts)` |
| `day` | Extract day from timestamp/date | `day(ts)` |
| `hour` | Extract hour from timestamp | `hour(ts)` |
| `bucket[N]` | Hash into N buckets | `bucket[16](id)` |
| `truncate[N]` | Truncate to width N | `truncate[4](name)` |
| (none) | Identity (use column value as-is) | `region` |
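The value-to-partition mapping behind `truncate[N]` can be illustrated in plain Python: Iceberg truncates strings to their first N characters and rounds integers down to a multiple of N. (`bucket[N]` applies a 32-bit Murmur3 hash modulo N, which is not reproduced here.) A sketch:

```python
def truncate(width, value):
    # Iceberg's truncate transform: first `width` characters for strings,
    # round down to a multiple of `width` for integers.
    if isinstance(value, str):
        return value[:width]
    return value - (value % width)

print(truncate(4, "warehouse"))  # → 'ware'
print(truncate(10, 57))          # → 50
```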
## Drop Partition Field
Remove a partition field:
```python
from drls.iceberg import drop_partition_field

drop_partition_field(spark, "drls.db.events", field="region")

# Drop a transformed partition field
drop_partition_field(spark, "drls.db.events", field="ts", transform="month")
```
> **Note:** Dropping a partition field does not rewrite existing data. Old files retain their original partitioning, while new files are written without the dropped partition.
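By symmetry with adding a field, the drop helper presumably issues Iceberg's `ALTER TABLE ... DROP PARTITION FIELD ...` statement. A hypothetical sketch covering the two call shapes shown above; the real implementation may differ:

```python
from typing import Optional

def drop_partition_field_sql(table: str, field: str,
                             transform: Optional[str] = None) -> str:
    # Identity fields drop by bare column name; transformed fields
    # drop by transform(column), e.g. month(ts).
    target = f"{transform}({field})" if transform else field
    return f"ALTER TABLE {table} DROP PARTITION FIELD {target}"

print(drop_partition_field_sql("drls.db.events", "ts", "month"))
# → ALTER TABLE drls.db.events DROP PARTITION FIELD month(ts)
```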
## Example: Evolving Partitions
```python
import ray
import drls
from drls.iceberg import add_partition_field, drop_partition_field, get_partition_spec

ray.init()
spark = drls.init_spark("partition-demo", 1, 1, "512M",
                        iceberg_catalog="hadoop",
                        iceberg_warehouse="/tmp/warehouse")

spark.sql("""
    CREATE TABLE drls.db.logs (
        ts TIMESTAMP, level STRING, message STRING
    ) USING iceberg
""")

# Start with daily partitioning
add_partition_field(spark, "drls.db.logs", field="ts", transform="day")

# Later, switch to hourly partitioning for higher granularity
drop_partition_field(spark, "drls.db.logs", field="ts", transform="day")
add_partition_field(spark, "drls.db.logs", field="ts", transform="hour")

# Check the result
spec = get_partition_spec(spark, "drls.db.logs")
print(spec)

drls.stop_spark()
ray.shutdown()
```