Partition Evolution

Iceberg supports partition evolution — changing the partition layout of a table without rewriting existing data. New data is written with the new partition scheme while old data retains its original partitioning.

Get Current Partition Spec

from drls.iceberg import get_partition_spec

spec = get_partition_spec(spark, "drls.db.events")

Returns:

{
    "success": True,
    "table": "drls.db.events",
    "partition_fields": [
        {"field": "ts", "transform": "day"},
    ]
}
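The returned dict is plain data, so it is easy to post-process. As a minimal sketch, a hypothetical `format_spec` helper (not part of drls) could render the partition fields in the familiar `transform(column)` notation, assuming the return shape shown above:

```python
def format_spec(spec):
    """Render partition fields like 'day(ts)', or just the column name for identity."""
    parts = []
    for f in spec["partition_fields"]:
        transform = f.get("transform")
        parts.append(f"{transform}({f['field']})" if transform else f["field"])
    return ", ".join(parts)

spec = {
    "success": True,
    "table": "drls.db.events",
    "partition_fields": [{"field": "ts", "transform": "day"}],
}
print(format_spec(spec))  # day(ts)
```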

Add Partition Field

Add a new partition field with an optional transform:

from drls.iceberg import add_partition_field

# Identity partition (no transform)
add_partition_field(spark, "drls.db.events", field="region")

# With transform
add_partition_field(spark, "drls.db.events", field="ts", transform="month")

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| spark | SparkSession | Yes | Active Spark session |
| table | str | Yes | Fully qualified table name |
| field | str | Yes | Column name to partition by |
| transform | str \| None | No | Partition transform (see below) |

Supported Transforms

| Transform | Description | Example |
|-------------|-----------------------------------|--------------------|
| year | Extract year from timestamp/date | year(ts) |
| month | Extract month from timestamp/date | month(ts) |
| day | Extract day from timestamp/date | day(ts) |
| hour | Extract hour from timestamp | hour(ts) |
| bucket[N] | Hash into N buckets | bucket[16](id) |
| truncate[N] | Truncate to width N | truncate[4](name) |
| (none) | Identity (use column value as-is) | region |
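Iceberg's own Spark DDL writes these transforms as `ADD PARTITION FIELD day(ts)`, `bucket(16, id)`, `truncate(4, name)`, and so on. Assuming the `bucket[N]`/`truncate[N]` strings above map onto that DDL (the mapping helper below is illustrative, not part of drls), the translation can be sketched as:

```python
import re

def transform_clause(field, transform=None):
    """Translate a (field, transform) pair into Iceberg's PARTITION FIELD clause."""
    if transform is None:
        return field  # identity: the bare column name
    m = re.fullmatch(r"(bucket|truncate)\[(\d+)\]", transform)
    if m:
        # bucket[16] on column id -> bucket(16, id)
        return f"{m.group(1)}({m.group(2)}, {field})"
    return f"{transform}({field})"  # year/month/day/hour

print(transform_clause("region"))            # region
print(transform_clause("id", "bucket[16]"))  # bucket(16, id)
print(transform_clause("ts", "day"))         # day(ts)
```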

Drop Partition Field

Remove a partition field:

from drls.iceberg import drop_partition_field

drop_partition_field(spark, "drls.db.events", field="region")

# Drop a transformed partition
drop_partition_field(spark, "drls.db.events", field="ts", transform="month")

Note

Dropping a partition field does not rewrite existing data. Old files retain their original partitioning, while new files are written without the dropped partition.
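This matches Iceberg's metadata-only `ALTER TABLE ... DROP PARTITION FIELD` DDL. Assuming `drop_partition_field` wraps that statement (a sketch, not the confirmed implementation), the generated SQL would look like:

```python
def drop_partition_sql(table, field, transform=None):
    """Build the Iceberg DDL that a drop_partition_field call would issue."""
    clause = f"{transform}({field})" if transform else field
    return f"ALTER TABLE {table} DROP PARTITION FIELD {clause}"

print(drop_partition_sql("drls.db.events", "region"))
# ALTER TABLE drls.db.events DROP PARTITION FIELD region
print(drop_partition_sql("drls.db.events", "ts", "month"))
# ALTER TABLE drls.db.events DROP PARTITION FIELD month(ts)
```

Because the statement only rewrites table metadata, it returns immediately regardless of table size.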

Example: Evolving Partitions

import ray
import drls
from drls.iceberg import add_partition_field, drop_partition_field, get_partition_spec

ray.init()
spark = drls.init_spark("partition-demo", 1, 1, "512M",
                        iceberg_catalog="hadoop",
                        iceberg_warehouse="/tmp/warehouse")

spark.sql("""
    CREATE TABLE drls.db.logs (
        ts TIMESTAMP, level STRING, message STRING
    ) USING iceberg
""")

# Start with daily partitioning
add_partition_field(spark, "drls.db.logs", field="ts", transform="day")

# Later, switch to hourly for higher granularity
drop_partition_field(spark, "drls.db.logs", field="ts", transform="day")
add_partition_field(spark, "drls.db.logs", field="ts", transform="hour")

# Check the result
spec = get_partition_spec(spark, "drls.db.logs")
print(spec)

drls.stop_spark()
ray.shutdown()