
CQLAI Parquet Support

CQLAI provides comprehensive support for Apache Parquet format, enabling efficient data exchange between Cassandra and modern analytics ecosystems.

Why Parquet?

| Benefit | Description |
|---|---|
| Efficient Storage | Columnar format with 50-80% smaller file sizes than CSV |
| Fast Analytics | Optimized for analytical queries in Spark, Presto, DuckDB |
| Type Preservation | Maintains Cassandra data types including UDTs and collections |
| ML Ready | Direct compatibility with pandas, PyArrow, and ML frameworks |
| Streaming | Memory-efficient for large datasets |

Quick Start

Export to Parquet

-- Basic export (format auto-detected from extension)
COPY users TO 'users.parquet';

-- With compression
COPY events TO 'events.parquet' WITH COMPRESSION='ZSTD';

-- Explicit format specification
COPY orders TO 'data.parquet' WITH FORMAT='PARQUET';

-- Export specific columns
COPY users (id, name, email) TO 'users_partial.parquet';

Import from Parquet

-- Basic import
COPY users FROM 'users.parquet';

-- Explicit format
COPY analytics FROM 'data.parquet' WITH FORMAT='PARQUET';

Export Options

COPY TO Syntax

COPY table_name [(column_list)] TO 'file.parquet'
[WITH option=value [AND option=value ...]]

Available Options

| Option | Default | Description |
|---|---|---|
| FORMAT | auto | PARQUET or CSV (auto-detected from extension) |
| COMPRESSION | SNAPPY | Compression algorithm |
| CHUNKSIZE | 10000 | Rows per row group |
| PAGESIZE | 1000 | Rows to fetch per page from Cassandra |

Compression Algorithms

| Algorithm | Speed | Compression Ratio | Use Case |
|---|---|---|---|
| SNAPPY | Fast | Good | General purpose (default) |
| ZSTD | Medium | Best | Archival, best compression |
| GZIP | Slow | Great | Compatibility |
| LZ4 | Fastest | Moderate | High-speed requirements |
| NONE | N/A | None | When compression is not needed |

Examples

-- Best compression for archival
COPY events TO 'events_archive.parquet' WITH COMPRESSION='ZSTD';

-- Fast compression for streaming
COPY metrics TO 'metrics.parquet' WITH COMPRESSION='LZ4';

-- Large table with custom chunk size
COPY large_table TO 'data.parquet' WITH CHUNKSIZE=50000;

Import Options

COPY FROM Syntax

COPY table_name [(column_list)] FROM 'file.parquet'
[WITH option=value [AND option=value ...]]

Available Options

| Option | Default | Description |
|---|---|---|
| FORMAT | auto | PARQUET or CSV |
| MAXROWS | -1 | Maximum rows to import (-1 = unlimited) |
| MAXINSERTERRORS | 1000 | Max insert errors before stopping |
| MAXBATCHSIZE | 20 | Rows per batch insert |
| CHUNKSIZE | 5000 | Rows between progress updates |

Examples

-- Import with row limit
COPY users FROM 'users.parquet' WITH MAXROWS=10000;

-- Import with error tolerance
COPY data FROM 'data.parquet' WITH MAXINSERTERRORS=100;

Supported Data Types

CQLAI handles all Cassandra data types in Parquet format:

Primitive Types

| Cassandra Type | Parquet Type | Notes |
|---|---|---|
| int | INT32 | |
| bigint | INT64 | |
| smallint | INT32 | |
| tinyint | INT32 | |
| float | FLOAT | |
| double | DOUBLE | |
| decimal | BYTE_ARRAY (string) | Preserves precision |
| varint | BYTE_ARRAY (string) | |
| boolean | BOOLEAN | |
| text | BYTE_ARRAY (UTF8) | |
| varchar | BYTE_ARRAY (UTF8) | |
| ascii | BYTE_ARRAY (ASCII) | |
| blob | BYTE_ARRAY | |
| uuid | FIXED_LEN_BYTE_ARRAY[16] | |
| timeuuid | FIXED_LEN_BYTE_ARRAY[16] | |
| timestamp | INT64 (TIMESTAMP_MILLIS) | |
| date | INT32 (DATE) | |
| time | INT64 (TIME_MICROS) | |
| duration | BYTE_ARRAY (string) | ISO-8601 format |
| inet | BYTE_ARRAY (string) | IP address string |

Collection Types

| Cassandra Type | Parquet Type |
|---|---|
| `list<T>` | LIST |
| `set<T>` | LIST |
| `map<K,V>` | MAP |
| `frozen<...>` | Same as inner type |

Complex Types

| Cassandra Type | Parquet Type |
|---|---|
| User-Defined Type (UDT) | GROUP |
| `tuple<...>` | GROUP |
| `vector<T, N>` | LIST |
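
The mappings above can be confirmed directly on an exported file with PyArrow; the sketch below reads only the footer metadata, so it is cheap even for large files (the file name is illustrative).

import pyarrow.parquet as pq

# Read only the Parquet footer; no row data is loaded.
schema = pq.read_schema("users.parquet")

# Print each column's type, e.g. a uuid column appears as
# fixed_size_binary[16] and a timestamp column as timestamp[ms].
for field in schema:
    print(f"{field.name}: {field.type}")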

Working with Analytics Tools

Python (pandas / PyArrow)

import pandas as pd

# Read Parquet exported from CQLAI
df = pd.read_parquet('users.parquet')

# Process data
active_users = df[df['status'] == 'active']

# Write back to Parquet for CQLAI import
active_users.to_parquet('active_users.parquet', index=False)

Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CassandraAnalytics").getOrCreate()

# Read Parquet
df = spark.read.parquet("events.parquet")

# Process
df.groupBy("event_type").count().show()

# Write for CQLAI import
df.write.parquet("processed_events.parquet")

DuckDB

-- Query Parquet directly
SELECT * FROM 'users.parquet' WHERE status = 'active';

-- Aggregate
SELECT date_trunc('month', created_at) as month, COUNT(*)
FROM 'events.parquet'
GROUP BY 1;

Apache Drill

SELECT *
FROM dfs.`/path/to/users.parquet`
WHERE status = 'active';

Best Practices

Choosing Compression

| Scenario | Recommended |
|---|---|
| General purpose | SNAPPY |
| Long-term storage | ZSTD |
| Maximum compatibility | GZIP |
| Real-time streaming | LZ4 or NONE |

Optimizing Chunk Size

-- Small tables (< 100K rows)
COPY small_table TO 'data.parquet' WITH CHUNKSIZE=5000;

-- Large tables (> 1M rows)
COPY large_table TO 'data.parquet' WITH CHUNKSIZE=50000;

Larger chunk sizes mean:

- Better compression ratio
- Faster reads for analytics
- Higher memory usage during export
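
Because CHUNKSIZE controls rows per row group, its effect can be verified on an exported file; a small PyArrow sketch that inspects only the footer metadata (file name illustrative):

import pyarrow.parquet as pq

# Footer metadata only; the row data itself is not read.
meta = pq.ParquetFile("data.parquet").metadata

print(f"row groups: {meta.num_row_groups}")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"  group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes (uncompressed)")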

Handling Large Exports

For very large tables:

-- Use larger page size for faster export
COPY huge_table TO 'data.parquet' WITH PAGESIZE=5000 AND CHUNKSIZE=100000;

Data Type Considerations

Timestamps: CQLAI writes timestamps as INT64 TIMESTAMP_MILLIS, which matches Cassandra's native millisecond timestamp precision, so no precision is lost on export.

Decimals: Stored as strings to preserve arbitrary precision.

UUIDs: Stored as 16-byte fixed-length arrays for efficient storage.
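
Because of these mappings, code that consumes an export may need to convert decimal strings and raw UUID bytes back into native types; a hedged pandas sketch (the balance and id column names are hypothetical):

import uuid
from decimal import Decimal

import pandas as pd

df = pd.read_parquet("users.parquet")

# decimal columns arrive as strings; rebuild exact Decimal values
# (null handling omitted for brevity).
df["balance"] = df["balance"].map(Decimal)

# uuid columns arrive as 16-byte values; rebuild uuid.UUID objects.
df["id"] = df["id"].map(lambda b: uuid.UUID(bytes=bytes(b)))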


Troubleshooting

Import Type Mismatches

If column types do not match:

-- Check table schema
DESCRIBE TABLE users;

-- Verify the Parquet schema with an external tool such as parquet-tools
parquet-tools schema users.parquet
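
If parquet-tools is not available, the same information can be read from Python with PyArrow, and a few rows can be previewed to spot obvious mismatches (a minimal sketch):

import pyarrow.parquet as pq

pf = pq.ParquetFile("users.parquet")

# Same information as `parquet-tools schema users.parquet`.
print(pf.schema_arrow)

# Preview a handful of rows before retrying the full COPY FROM.
batch = next(pf.iter_batches(batch_size=5))
print(batch.to_pandas())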

Memory Issues with Large Files

For large files, adjust CQLAI memory settings:

{
  "maxMemoryMB": 100
}

Null Value Handling

Parquet preserves null values. When importing, make sure the target columns can accept them; in Cassandra, regular columns may be null but primary key columns cannot:

-- Check column definitions
DESCRIBE TABLE my_table;
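
It can also help to check which columns in the file actually contain nulls before importing; a simple pandas sketch (assumes the file fits in memory; the file name is illustrative):

import pandas as pd

df = pd.read_parquet("my_table.parquet")

# Nulls per column; primary key columns must report zero,
# since Cassandra does not allow null primary key values.
print(df.isna().sum())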

Examples

ETL Workflow

-- 1. Export from Cassandra
COPY events TO 'raw_events.parquet' WITH COMPRESSION='SNAPPY';

-- 2. Process in Python/Spark (external; see the sketch below)

-- 3. Import processed data
COPY processed_events FROM 'enriched_events.parquet';
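
Step 2 happens outside CQLAI; a minimal pandas sketch of what it might look like (the created_at column and the enrichment itself are purely illustrative):

import pandas as pd

# Read the raw export from step 1.
events = pd.read_parquet("raw_events.parquet")

# Illustrative enrichment only; real processing logic goes here.
events["event_day"] = pd.to_datetime(events["created_at"]).dt.date

# Write the file that step 3 imports with COPY FROM.
events.to_parquet("enriched_events.parquet", index=False)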

Backup and Restore

-- Backup
COPY users TO 'backup/users_20240115.parquet' WITH COMPRESSION='ZSTD';
COPY orders TO 'backup/orders_20240115.parquet' WITH COMPRESSION='ZSTD';

-- Restore (to new cluster)
COPY users FROM 'backup/users_20240115.parquet';
COPY orders FROM 'backup/orders_20240115.parquet';

Analytics Export

-- Export for analytics team
COPY events TO 'analytics/events.parquet'
  WITH COMPRESSION='SNAPPY' AND CHUNKSIZE=100000;

Next Steps