Dremio Blog

40 minute read · May 31, 2023

How to Convert JSON Files Into an Apache Iceberg Table with Dremio

Alex Merced, Head of DevRel, Dremio

This article has been revised and updated from its original version published in 2022 to reflect the latest Dremio and Apache Iceberg capabilities.

JSON is the lingua franca of APIs, event streams, and NoSQL databases, but its nested, schema-on-read nature makes it challenging for analytical queries. Converting JSON data to Apache Iceberg tables flattens nested structures into optimized columnar format, adds ACID transactions, and enables file-level pruning that makes queries orders of magnitude faster.

JSON-to-Iceberg conversion is especially valuable for modern data architectures where event data from APIs, IoT sensors, and application logs arrives in JSON format and needs to be made analytically accessible. Dremio simplifies this conversion by supporting JSON files as a native data source: no external ETL tools, no custom parsers, just SQL.

This guide covers the complete process of converting JSON files to optimized Iceberg tables, including handling nested structures, schema evolution, streaming JSON ingestion, and advanced transformation patterns with Dremio.

The Spark Approach: JSON to Iceberg

The traditional way to convert JSON to Iceberg uses Apache Spark's built-in JSON reader (for the Dremio SQL equivalents used later in this guide, refer to the Dremio SQL documentation):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("JSON to Iceberg") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "s3://warehouse/") \
    .getOrCreate()

# Read JSON with automatic schema inference
json_df = spark.read.json("s3://data-landing/events/*.json")

# Flatten nested structures; months() is the Iceberg partition transform used when writing
from pyspark.sql.functions import col, months
flat_df = json_df.select(
    col("event_id"),
    col("user.id").alias("user_id"),
    col("user.name").alias("user_name"),
    col("metadata.source").alias("event_source"),
    col("timestamp").cast("timestamp").alias("event_ts")
)

# Write as partitioned Iceberg table
flat_df.writeTo("iceberg.db.events") \
    .partitionedBy(months("event_ts")) \
    .create()

Spark's JSON reader handles automatic schema inference: it samples the JSON files, detects data types, and constructs a DataFrame schema. For nested JSON, Spark creates StructType columns that you flatten using dot notation (e.g., col("user.id")).

While Spark provides fine-grained control, it requires cluster infrastructure, Python/Scala expertise, and careful dependency management. Dremio handles all of this through its SQL interface with similar schema inference capabilities.


Dremio vs Spark: JSON Conversion

| Factor | Dremio SQL | Spark PySpark |
|---|---|---|
| Nested field access | user.name in SELECT | col("user.name").alias("name") |
| Array handling | FLATTEN() function | explode() function |
| Schema inference | Automatic on source connect | spark.read.json() |
| Type casting | CAST(field AS TYPE) | .cast("type") |
| Partitioning | PARTITION BY (month(ts)) in CTAS | .partitionedBy(months("ts")) |
| Post-conversion | OPTIMIZE TABLE | rewrite_data_files procedure |
| Acceleration | Reflections | Not available |

When to use Spark: You have complex nested JSON requiring custom transformations (custom UDFs, complex array operations, cross-document joins during conversion), or you need to process multi-GB JSON files that benefit from distributed parallel parsing.

When to use Dremio: For most enterprise JSON-to-Iceberg workflows (API event data, webhook payloads, application logs, IoT telemetry), Dremio's SQL-based approach is faster to implement, easier to maintain, and includes built-in acceleration through Reflections.

JSON Format Variants

Dremio handles multiple JSON formats:

| Format | Structure | Dremio Support |
|---|---|---|
| Standard JSON | Single JSON object or array | Full support |
| JSON Lines (JSONL/NDJSON) | One JSON object per line | Full support (auto-detected) |
| Multiline JSON | Pretty-printed JSON objects | Supported with format options |
| Nested arrays | Arrays of objects with sub-arrays | FLATTEN() function |
| Mixed-type fields | Fields with varying types across records | COALESCE() + CAST() |

JSON Lines (NDJSON) is the most common format for log data and streaming outputs. Each line is a self-contained JSON object, enabling efficient parallel processing. Dremio auto-detects this format and parses each line independently.
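
For illustration, here are two hypothetical JSONL records matching the event schema used throughout this guide (field names and values are made up). Each line is a complete JSON document, which is what allows Dremio to parse the file line by line:

{"event_id": "e-1001", "user": {"id": 42, "name": "Ada", "email": "ada@example.com"}, "metadata": {"source": "web", "device": "ios"}, "timestamp": "2024-03-01T08:15:00Z", "action": "purchase", "value": 19.99}
{"event_id": "e-1002", "user": {"id": 7, "name": "Grace", "email": "grace@example.com"}, "metadata": {"source": "api", "device": "server"}, "timestamp": "2024-03-01T08:16:12Z", "action": "login", "value": 0}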

Why Convert JSON to Iceberg?

| Capability | JSON Files | Iceberg Tables |
|---|---|---|
| Schema enforcement | None (schema-on-read) | Strict (typed columns, schema evolution) |
| Nested data | Native (deep nesting) | Flattened for analytics, or preserved as STRUCT |
| Query performance | Full parse on every read | Column pruning and file pruning |
| Compression | Poor (~10x larger than Parquet) | Excellent (columnar + Zstd) |
| Update/Delete | Full rewrite | Row-level operations |
| Time travel | Not possible | Query any historical version |
| ACID transactions | None | Full serializable isolation |

Step 1: Connect JSON Data to Dremio

Dremio reads JSON files from any connected source (S3, GCS, ADLS, or NAS):

Single JSON File

SELECT * FROM object_storage."events/2024-03.json" LIMIT 10;

JSON Lines (JSONL/NDJSON) Format

For line-delimited JSON (one JSON object per line), Dremio auto-detects the format:

SELECT * FROM object_storage."logs/application.jsonl" LIMIT 10;

Nested JSON Querying

Dremio supports dot notation for nested fields:

SELECT 
  event_id,
  user.name as user_name,
  user.email as user_email,
  metadata.source as event_source,
  timestamp
FROM object_storage."events/2024-03.json";

Step 2: Flatten and Transform

JSON data typically requires flattening nested structures for analytics:

Basic Flattening

-- Flatten nested user and metadata objects
SELECT 
  event_id,
  user.id as user_id,
  user.name as user_name,
  user.email as user_email,
  metadata.source as event_source,
  metadata.device as event_device,
  CAST(timestamp AS TIMESTAMP) as event_timestamp,
  action,
  CAST(value AS DECIMAL(12,2)) as event_value
FROM object_storage."events/2024-03.json";

Handling Arrays

For JSON arrays, use Dremio's FLATTEN function:

-- Flatten an array of items within each order JSON
SELECT 
  order_id,
  FLATTEN(items) as item
FROM object_storage."orders/2024-03.json";
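
FLATTEN emits one row per array element, with the element itself returned as a struct. To pull individual fields out of each element, wrap the FLATTEN in a subquery. The sketch below assumes hypothetical item keys (sku, quantity, price); substitute the keys in your own payloads:

-- Expand the items array, then extract fields from each element
-- (sku, quantity, and price are hypothetical keys in the items objects)
SELECT
  flat.order_id,
  flat.item.sku AS sku,
  flat.item.quantity AS quantity,
  CAST(flat.item.price AS DECIMAL(12,2)) AS item_price
FROM (
  SELECT order_id, FLATTEN(items) AS item
  FROM object_storage."orders/2024-03.json"
) flat;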

Step 3: Create the Iceberg Table

Use Dremio's CTAS with the flattened schema:

-- Convert flattened JSON to Iceberg
CREATE TABLE iceberg_catalog.db.events
PARTITION BY (month(event_timestamp))
AS
SELECT 
  event_id,
  user.id as user_id,
  user.name as user_name,
  action,
  CAST(value AS DECIMAL(12,2)) as event_value,
  CAST(timestamp AS TIMESTAMP) as event_timestamp,
  metadata.source as event_source
FROM object_storage."events/2024-03.json";

Dremio writes the data as Parquet with hidden partitioning by month, automatically creating column-level statistics in the manifest metadata.
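
To see the benefit of hidden partitioning, filter on the raw timestamp column; Dremio resolves the month() transform behind the scenes, so a query like the following should scan only the matching monthly partitions rather than every file:

-- Only data files for March 2024 need to be read (partition + file pruning)
SELECT event_source, COUNT(*) AS events
FROM iceberg_catalog.db.events
WHERE event_timestamp >= TIMESTAMP '2024-03-01 00:00:00'
  AND event_timestamp <  TIMESTAMP '2024-04-01 00:00:00'
GROUP BY event_source;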

Step 4: Incremental JSON Loading

For streaming or daily JSON data, use INSERT INTO for incremental loads:

-- Append new JSON events
INSERT INTO iceberg_catalog.db.events
SELECT 
  event_id,
  user.id as user_id,
  user.name as user_name,
  action,
  CAST(value AS DECIMAL(12,2)) as event_value,
  CAST(timestamp AS TIMESTAMP) as event_timestamp,
  metadata.source as event_source
FROM object_storage."events/2024-04.json";

Each INSERT creates a new snapshot, enabling time travel to compare event data across loads.
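
As a rough sketch of what that time travel looks like in Dremio SQL (the snapshot ID below is a placeholder; list real ones with the table_history metadata function):

-- List snapshots and when they became current
SELECT snapshot_id, made_current_at
FROM TABLE(table_history('iceberg_catalog.db.events'));

-- Query the table as of an earlier snapshot (replace with a real snapshot ID)
SELECT COUNT(*) FROM iceberg_catalog.db.events AT SNAPSHOT '1234567890123456789';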

Step 5: Optimize and Maintain

After loading, optimize for query performance:

-- Compact small files from multiple JSON loads
OPTIMIZE TABLE iceberg_catalog.db.events;

-- Clean up old snapshots
VACUUM TABLE iceberg_catalog.db.events 
EXPIRE SNAPSHOTS older_than = '2024-01-01 00:00:00';
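
To verify that compaction did something useful, you can inspect the table's data files before and after OPTIMIZE with Dremio's table_files metadata function (a hedged sketch; exact column names may vary by version):

-- Count data files and their average size; run before and after OPTIMIZE
SELECT COUNT(*) AS data_files, AVG(file_size_in_bytes) AS avg_file_bytes
FROM TABLE(table_files('iceberg_catalog.db.events'));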

Handling Complex JSON Scenarios

Schema Evolution Across JSON Files

When JSON schemas change over time (new fields added, fields removed), Iceberg handles this gracefully:

-- Add new columns when the JSON schema evolves
ALTER TABLE iceberg_catalog.db.events ADD COLUMNS (
  campaign_id VARCHAR,
  experiment_variant VARCHAR
);

-- New JSON files with extra fields load without issues
INSERT INTO iceberg_catalog.db.events
SELECT 
  event_id, user_id, user_name, action,
  event_value, event_timestamp, event_source,
  campaign_id, experiment_variant
FROM object_storage."events/2024-05.json";

Old data files return NULL for newly added columns, no backfill required.
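
If downstream queries prefer defaults over NULLs for the newly added fields, COALESCE handles that at read time (the default values here are just placeholders):

-- Default the evolved columns for rows loaded before the schema change
SELECT
  event_id,
  COALESCE(campaign_id, 'organic') AS campaign_id,
  COALESCE(experiment_variant, 'control') AS experiment_variant
FROM iceberg_catalog.db.events;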

Deeply Nested JSON to Star Schema

Convert complex nested JSON into an analytics-friendly star schema:

-- Fact table: events
CREATE TABLE iceberg_catalog.db.fact_events AS
SELECT 
  event_id,
  user.id as user_id,
  CAST(timestamp AS TIMESTAMP) as event_ts,
  action,
  CAST(value AS DECIMAL(12,2)) as value
FROM object_storage."events/*.json";

-- Dimension table: users (deduplicated, keeping each user's latest record)
-- Note: DISTINCT ON is PostgreSQL-specific; a window function does the same job portably
CREATE TABLE iceberg_catalog.db.dim_users AS
SELECT user_id, user_name, user_email, user_region
FROM (
  SELECT
    user.id as user_id,
    user.name as user_name,
    user.email as user_email,
    user.region as user_region,
    ROW_NUMBER() OVER (PARTITION BY user.id ORDER BY timestamp DESC) as rn
  FROM object_storage."events/*.json"
) ranked
WHERE rn = 1;

Real-World Pipeline: API Event Processing

A SaaS platform processes 10 million API events daily, arriving as JSON files:

  1. Events arrive as JSONL files on S3 every 15 minutes
  2. Dremio reads JSON files through the S3 source connector
  3. CTAS/INSERT INTO flattens nested structures and loads into Iceberg
  4. OPTIMIZE TABLE runs hourly to compact 15-minute batches into larger files
  5. Reflections provide sub-second dashboard queries on event analytics
  6. VACUUM TABLE runs weekly to manage snapshot growth

Result: latency from JSON landing to queryable Iceberg data is under 5 minutes (including Reflection refresh), with query response times under 1 second for BI dashboards.

Dremio's Modern Ingestion Methods for JSON Data

Beyond CTAS and INSERT INTO, Dremio provides three additional ingestion methods optimized for JSON workflows:

Method 1: COPY INTO, Bulk Load JSON into Existing Iceberg Tables

The COPY INTO command loads JSON files from a configured source directly into an existing Iceberg table:

-- Load all JSON files from a directory into an existing table
COPY INTO nessie.db.events
FROM '@s3_source/data-landing/events/'
FILE_FORMAT 'json'
(ON_ERROR 'continue', TRIM_SPACE 'true');

JSON-specific COPY INTO options:

| Option | Purpose | Example |
|---|---|---|
| ON_ERROR | Skip malformed JSON records instead of aborting ('abort' is the default) | 'continue' |
| TRIM_SPACE | Strip whitespace from string values | 'true' |
| EMPTY_AS_NULL | Treat empty strings as NULL | 'true' |
| DATE_FORMAT | Custom date parsing for date fields | 'YYYY-MM-DD' |
| TIMESTAMP_FORMAT | Custom timestamp parsing | 'YYYY-MM-DD HH24:MI:SS' |
| NULL_IF | Strings to treat as NULL | ('null', 'N/A', 'none') |

-- Load specific JSON files with regex matching
COPY INTO nessie.db.events
FROM '@s3_source/data-landing/events/'
REGEX '2024-03-.*.json'
FILE_FORMAT 'json'
(ON_ERROR 'continue', NULL_IF ('null', 'N/A'));

Column matching: Dremio matches JSON keys to Iceberg column names (case-insensitive). Extra JSON keys are ignored; missing columns get NULL values.

Method 2: CREATE PIPE, Automated Continuous JSON Ingestion

For real-time event data arriving as JSON files on S3, CREATE PIPE provides fully automated ingestion:

-- Create an auto-ingest pipe for continuous JSON loading
CREATE PIPE json_events_pipe
AS COPY INTO nessie.db.events
FROM '@s3_source/data-landing/event-stream/'
FILE_FORMAT 'json'
(ON_ERROR 'continue', TRIM_SPACE 'true');

How auto-ingest works for JSON:

  1. API/service writes JSON files to S3 (e.g., every 5 minutes)
  2. S3 event notification triggers via AWS SQS
  3. Dremio's pipe detects the new file notification
  4. COPY INTO executes automatically with JSON-specific parsing
  5. Deduplication ensures each file is loaded exactly once
  6. Micro-batching optimizes engine utilization

This replaces custom Kafka consumers, Lambda functions, or Airflow DAGs for JSON ingestion pipelines.

Method 3: Upload, Drag-and-Drop JSON to Iceberg

For ad-hoc JSON analysis (up to 500 MB files), Dremio Cloud provides direct upload:

  1. Navigate to Datasets in the Dremio UI
  2. Click Add Data → Upload File
  3. Drag and drop your JSON or JSONL file
  4. Dremio creates an Iceberg table automatically

This is ideal for data scientists exploring API response dumps, ML training data, or one-off JSON exports without setup overhead.

Choosing the Right JSON Ingestion Method

| Method | Best For | Latency | Automation |
|---|---|---|---|
| CTAS | Initial table creation with flattening | Minutes | Manual |
| INSERT INTO | Scheduled batch appends | Minutes | Script/schedule |
| COPY INTO | Bulk loads with error tolerance | Minutes | Manual or scripted |
| CREATE PIPE | Continuous event stream from S3 | Near real-time | Fully automatic |
| Upload | Ad-hoc exploration (≤500 MB) | Instant | Manual |

Frequently Asked Questions

What about JSON data from Kafka or streaming sources?

For real-time JSON ingestion, consider using Apache Flink or Spark Structured Streaming to write JSON events directly to Iceberg tables. Dremio can then query the resulting tables immediately.

How does JSON compression compare to Iceberg/Parquet?

JSON is a verbose text format. A typical JSON dataset converts to Parquet at a 10-30x compression ratio: 100 GB of JSON typically occupies 3-10 GB as an Iceberg table with Zstd compression, a substantial storage saving on cloud object storage.

Can I preserve JSON arrays as Iceberg LIST types?

Yes. Instead of flattening arrays, you can map them to Iceberg's LIST data type for later UNNEST operations. This is useful when the array cardinality is important for analytics.
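
As a sketch of that approach, you can keep the array column as-is during CTAS and expand it only when a query needs item-level rows (table and column names follow the earlier orders example):

-- Preserve the items array in the Iceberg table
CREATE TABLE iceberg_catalog.db.orders_raw AS
SELECT order_id, items
FROM object_storage."orders/2024-03.json";

-- Expand the array only when item-level analysis is needed
SELECT order_id, FLATTEN(items) AS item
FROM iceberg_catalog.db.orders_raw;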

Enterprise JSON Conversion Patterns

Pattern 1: API Webhook Aggregation

Many SaaS platforms send event data via webhooks. Each webhook payload is a JSON document that lands in S3:

-- Aggregate webhook events from multiple SaaS tools
CREATE TABLE iceberg_catalog.db.all_events AS
SELECT *, 'stripe' as source FROM object_storage."webhooks/stripe/*.json"
UNION ALL
SELECT *, 'hubspot' as source FROM object_storage."webhooks/hubspot/*.json"
UNION ALL
SELECT *, 'segment' as source FROM object_storage."webhooks/segment/*.json";

Pattern 2: IoT Sensor Data Normalization

IoT devices often produce JSON with inconsistent schemas. Normalize during conversion:

CREATE TABLE iceberg_catalog.db.sensor_readings
PARTITION BY (day(reading_time))
AS
SELECT 
  COALESCE(device_id, sensor_id) as device_id,
  CAST(COALESCE(temperature, temp, reading) AS DOUBLE) as temperature,
  CAST(timestamp AS TIMESTAMP) as reading_time,
  location.latitude as lat,
  location.longitude as lon
FROM object_storage."iot/sensors/*.json";

Pattern 3: Building a Dremio Semantic Layer

After JSON conversion, build a semantic layer with Dremio views:

-- Base view: direct Iceberg table access
CREATE VIEW analytics.events_base AS
SELECT * FROM iceberg_catalog.db.events;

-- Business view: enriched with business logic
CREATE VIEW analytics.user_activity AS
SELECT 
  user_id,
  COUNT(*) as total_events,
  COUNT(DISTINCT action) as unique_actions,
  MAX(event_timestamp) as last_active,
  SUM(CASE WHEN action = 'purchase' THEN event_value ELSE 0 END) as total_spend
FROM analytics.events_base
GROUP BY user_id;

This semantic layer allows BI tools and AI agents to query well-defined business metrics without understanding the underlying JSON structure.
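
A BI tool or agent then only needs a plain SQL query against the view, for example (the filter values are arbitrary):

-- Top spenders active since March, straight from the semantic layer
SELECT user_id, total_events, total_spend
FROM analytics.user_activity
WHERE last_active >= DATE '2024-03-01' AND total_spend > 500
ORDER BY total_spend DESC
LIMIT 100;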

Streaming vs Batch JSON Ingestion

| Factor | Batch (Dremio CTAS/INSERT) | Streaming (Flink/Spark Structured Streaming) |
|---|---|---|
| Latency | Minutes (batch load) | Seconds (micro-batch or continuous) |
| Complexity | Simple (SQL only) | Complex (streaming framework + deployment) |
| Cost | Low (on-demand compute) | Higher (always-on streaming cluster) |
| Data quality | Validation in SQL | Requires custom logic |
| Best for | Hourly/daily feeds | Real-time dashboards |

For most enterprise use cases, batch JSON ingestion through Dremio provides the best balance of simplicity, cost, and latency. For sub-minute latency requirements, consider streaming to Iceberg with Flink or Kinesis.

Performance: JSON vs Iceberg/Parquet

Converting JSON to Iceberg provides dramatic performance improvements:

| Metric | JSON on S3 | Iceberg (Parquet) on S3 |
|---|---|---|
| Storage size (100M events) | 50 GB | 5 GB (10x compression) |
| Query time (full scan) | 120 seconds | 12 seconds |
| Query time (filtered) | 120 seconds (no pruning) | 0.5 seconds (partition + file pruning) |
| S3 costs per query | $0.25 | $0.001 |
| Schema enforcement | None | Full type safety |

With Dremio Reflections on top, filtered queries drop to sub-second response times.
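
As a hedged sketch, an aggregate Reflection on the events table might be defined roughly like this (Reflection DDL varies by Dremio version and edition, and Reflections can also be enabled from the UI):

-- Pre-aggregate events by user, action, and source to accelerate dashboard queries
ALTER TABLE iceberg_catalog.db.events
CREATE AGGREGATE REFLECTION events_by_user_action
USING
  DIMENSIONS (user_id, action, event_source)
  MEASURES (event_value (SUM, COUNT));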


Free Resources to Continue Your Iceberg Journey

Iceberg Lakehouse Books from Dremio Authors


Legacy Content

Apache Iceberg is an open table format that enables robust, affordable, and quick analytics on the data lakehouse and is poised to change the data industry in ways we can only begin to imagine. Check out our Apache Iceberg 101 course to learn all the nuts and bolts about Iceberg. By storing your data in Apache Iceberg tables, you can run your workloads on the data without needing to duplicate it in a data warehouse while also keeping the data accessible so a variety of tools can utilize it. 

This article demonstrates how Dremio Cloud can be employed to transform JSON files into an Iceberg table quickly. This enables faster query execution, running of DML transactions, and time-traveling of the dataset straight from your data lakehouse storage.

This article uses a single JSON file for simplicity’s sake, but you can follow the same steps for a dataset consisting of multiple JSON files. 

Summary of the Steps

Dremio Cloud is an ideal solution for this task because it provides an access layer that is compatible with most sources, enabling you to access data stored in object storage (CSV, JSON, Parquet, etc.), relational databases, or a metastore for Apache Iceberg tables. Furthermore, Dremio allows for DML operations to be run on Iceberg tables, making it easy to convert data from any source into an Iceberg table.

You simply need to:

  1. Connect your Iceberg catalog.
  2. Upload your JSON file or connect a source with a JSON file (like S3).

  3. CTAS the JSON file into a new Iceberg table in your catalog (or use COPY INTO to copy the file directly into an existing Iceberg table).

Step 1 – Get a Dremio Account

The first step is to get a Dremio Cloud account which offers a standard tier free of software and licensing costs (so the only cost would be AWS costs for any clusters you use to run your queries). You can get started with Dremio Cloud in minutes by following the steps highlighted in the video on the getting started page.

Step 2 – Connect Your Iceberg Catalog

You currently have three choices for your Iceberg catalog when using Dremio Cloud to write to Iceberg tables: AWS Glue, AWS S3, and Project Nessie. These can all be connected by clicking the “Add Source” button in the bottom left corner of the Dremio UI and selecting the source you want to connect.

If using an AWS Glue source:

In the source settings under “Advanced Options” add a property called “hive.metastore.warehouse.dir”. This will determine the S3 folder any Iceberg tables you create in the catalog will be written to.

If using an S3 source:

In the source settings under “Advanced Options” make sure to set the root path to the S3 directory you want the source to default to; this is where Iceberg tables will be created in this catalog.

If using a Dremio Arctic source:

If using Dremio’s Arctic catalog as your Apache Iceberg catalog, a bucket associated with your Dremio Sonar project’s cloud will be assigned automatically (this can be changed in the source settings if you want).

Now you have two approaches you can use going forward to load your JSON data into an Apache Iceberg table:

  1. CTAS – The JSON must first be a dataset in your Dremio account; you can then create a new Iceberg table from that JSON via a "CREATE TABLE AS" statement.
  2. COPY INTO – You tell Dremio to copy the contents of JSON files into an existing Apache Iceberg table. A big benefit of this approach is that the data from the JSON file will be coerced into the Iceberg table's schema.

Step 3 – Load Your JSON Data

There are two ways to bring a JSON file into Dremio. You can either connect a cloud object storage source like S3 that has JSON files in it or you can upload the file directly into your Dremio account.

To keep it simple, if you click on “add source” you’ll see a “Sample Source” under the object storage category that has all sorts of data you can use for demonstration.

You should now have a source called “Samples”. There is a JSON file “zips.json” in this sample source that you will want to promote to a dataset. To do so, click on the “Format File” button.

Make sure “JSON” is selected for Format and click “Save”.

Step 4 – Option 1 – Converting a JSON File Into an Iceberg Table

After formatting the JSON file into a dataset, it should automatically bring you to this dataset in the UI. When viewing the dataset, you’ll see the SQL Editor for editing and running SQL statements, along with the following options:

  • Data – This is the current screen with the SQL Editor
  • Details – Here you can write and view documentation on the dataset’s wiki, along with other information
  • Graph – Here you can see the dataset’s lineage graph, showing where that dataset comes from within Dremio
  • Reflections – Here you can enable data reflections to accelerate queries on a particular dataset further

Click the “Run” button to run the “SELECT *” query and confirm that your data was promoted correctly (all the column headings should be there).

Off to the right, you’ll see an icon you can click to expand the panel that allows you to easily drag and drop your data into your queries, saving you time from having to type long namespaces.

Then you can use a CREATE TABLE AS statement to take the data from the JSON file and write it to a new table in your desired Iceberg catalog.

CREATE TABLE awss3.zips_from_json AS SELECT * FROM Samples."samples.dremio.com"."zips.json"

This example uses an S3 source, but when using any of the three catalogs, the namespace for your new table should look like the following:

  • S3: name_of_s3_source.new_table_name
  • AWS Glue: name_of_glue_source.existing_glue_database.new_table_name
  • Project Nessie: name_of_nessie_source.folder_name.new_table_name

Once you run the query, you’ll get a confirmation message that tells you where the table was written.

Step 4 – Option 2 – Using the COPY INTO Approach

The COPY INTO command allows you to copy the contents of CSV, JSON, and other files into an Apache Iceberg table. This can be from a single file or a directory of files. This is particularly useful when adding data from multiple files or making multiple additions at different times since it does not create a new table like the CTAS statement. COPY INTO will also take the values from JSON data and coerce them into the schema of the target table, saving you from doing type conversions manually.

Let’s make an empty table with our desired schema.

CREATE TABLE awss3.zips_from_json_2 (city VARCHAR, loc ARRAY(DOUBLE), pop INT, state VARCHAR, _id VARCHAR);

Then you can use the COPY INTO command to copy the contents of the JSON file into the existing table (make sure to specify the 'json' file format for this particular file).

COPY INTO awss3.zips_from_json_2
FROM '@Samples/samples.dremio.com/zips.json'
FILE_FORMAT 'json';

A couple of benefits of the COPY INTO approach:

  • It will use the schema of the destination table, whereas CTAS won’t know the desired schema and will coerce the fields based on standard JSON data types
  • You can add data from a single file as well as from a directory of files

The Iceberg Advantage

Now that your data is in an Iceberg table you can take advantage of full DML from Dremio to run updates, deletes, and upserts on your data.

UPDATE awss3."zips_from_json" SET pop=11111 WHERE city='AGAWAM';
DELETE FROM awss3."zips_from_json" WHERE city='AGAWAM';
MERGE INTO awss3."zips_from_json" z
USING (SELECT * FROM zips_staging) s
ON z._id = s._id
WHEN MATCHED THEN UPDATE SET city = s.city, loc = s.loc, pop = s.pop, state = s.state
WHEN NOT MATCHED THEN INSERT (city, loc, pop, state, _id) VALUES (s.city, s.loc, s.pop, s.state, s._id);
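
Because every DML statement above creates a new Iceberg snapshot, you can also time travel on this table from Dremio. A minimal sketch (the timestamp is a placeholder; substitute a time before your update ran):

-- Inspect the table as it looked before the UPDATE and DELETE above
SELECT * FROM awss3."zips_from_json" AT TIMESTAMP '2023-05-31 00:00:00.000' WHERE city = 'AGAWAM';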

Conclusion

Not only can you query this new Iceberg table from your Dremio account, but you can also work with it using Apache Spark, Apache Flink, and any other tool that supports Apache Iceberg and connects to your catalog. This gives you open access to your data, with metadata that enables smarter query planning to speed up performance and lower your cloud compute bill.

Get started with Dremio Cloud today!
