Data Streaming and Transformation

Not sure you’re ready?

Take the ~3-minute readiness diagnostic and see where you stand.

Imagine a hydroelectric dam. If you attempt to capture the torrential force of millions of gallons of water per minute using a standard collection bucket, the bucket shatters. The same physical reality applies to modern cloud architectures. When millions of clickstream events, IoT sensor telemetry, or application logs flood into your system every second, a traditional database cannot simply absorb the impact. You must construct a digital river—a streaming architecture—that can continuously ingest, buffer, process, and transform this kinetic data in real time, long before it ever rests in a persistent data lake.

A massive hydroelectric dam illustrates the concept of real-time streaming architectures: both must dynamically channel and continuously process torrential flows of input rather than attempting to statically contain them.

Before engineering a data stream, we must distinguish a stream from a queue. When engineers first encounter high-throughput systems, they often reach for Amazon SQS, which is designed primarily for asynchronous message queuing and point-to-point application decoupling.

Think of SQS like a ticket dispenser at a busy deli. You take a ticket, you are served, and that ticket is thrown away. In SQS, a message is typically deleted from the queue immediately after successful processing by a consumer. This is perfect for task offloading, but terrible for massive data telemetry where multiple independent analytical systems need to observe the exact same historical data.

This is where real-time streaming services enter the picture. Amazon Kinesis Data Streams is a highly scalable real-time data streaming service behaving less like a deli line and more like an immutable ledger. Amazon Kinesis Data Streams continuously retains data records for a specified period regardless of successful consumer processing. The default data retention period for records in Amazon Kinesis Data Streams is 24 hours, but if your compliance or operational replay requirements demand it, the maximum possible data retention period for records is 365 days.

Because the data persists, Amazon Kinesis Data Streams allows multiple independent consumers to read the exact same data stream concurrently. A fraud detection algorithm, a real-time dashboard, and a long-term storage archiver can all read the identical stream of transaction data simultaneously without destroying the data for the others.

To understand Kinesis, you must understand its fundamental unit of scale. Amazon Kinesis Data Streams organizes streaming data into ordered units called shards.

A stream is simply a logical grouping of shards. Therefore, the overall data throughput capacity of a Kinesis Data Stream is directly determined by the number of active shards. When sizing a stream, you calculate the required number of shards based on these strict physical limits:

A single Kinesis Data Stream shard supports up to 1 MB per second of data ingestion.
A single Kinesis Data Stream shard supports up to 1,000 records per second of data ingestion.
A single Kinesis Data Stream shard provides a data emission rate of up to 2 MB per second for standard consumers.

The Partition Key and the "Hot Shard" Problem

When a producer pushes a data record to Kinesis, how does Kinesis know which shard should receive it? It uses a partition key provided by the producer. A Kinesis Data Stream partition key dictates which specific shard will receive a given data record by hashing that key.

Architectural Warning: Using a Kinesis partition key with low cardinality can create a hot shard by overwhelming a specific shard with traffic.

If your partition key is a boolean value (e.g., is_mobile_user), you only have two unique keys. Even if you provision 100 shards, all data will hash to just two of them. Those two shards will max out at 1 MB/s, creating a massive bottleneck, while 98 shards sit idle. You must choose a highly unique partition key—like a user_id or device_id—to distribute the data uniformly.

A hash function maps partition keys to specific shards. If producers use partition keys with low cardinality, disproportionate volumes of data will hash to the exact same location, creating a "hot shard" bottleneck while leaving other shards idle.

Scaling Strategies

Traffic is rarely perfectly uniform. To handle unpredictable workloads, Amazon Kinesis Data Streams offers an on-demand capacity mode. Kinesis Data Streams on-demand capacity mode automatically scales shard capacity in response to fluctuating data traffic, removing the need to manually split or merge shards.

Increasing the number of active Kinesis shards is a form of horizontal scaling. Overall stream throughput is expanded by adding more parallel processing units (shards) rather than increasing the physical capacity limits of a single shard.

As your stream grows, you will likely add more consumer applications. In standard consumption, all consumers share the 2 MB/s read capacity of a single shard. If you have five consumers, they fight for that bandwidth. To solve this, Amazon Kinesis Data Streams supports an Enhanced Fan-Out feature for consumer applications. Enhanced Fan-Out provides a dedicated throughput of 2 MB per second per shard for each registered consumer application, effectively eliminating the "noisy neighbor" problem.

Kinesis is native to AWS, but the broader open-source world largely runs on Apache Kafka. If you are migrating an existing corporate architecture, rewriting applications to use the Kinesis API might be prohibitively expensive.

Amazon MSK is a fully managed AWS service for provisioning and running Apache Kafka clusters. Crucially, Amazon MSK provides native compatibility with existing open-source Apache Kafka applications and ecosystem tools. Your developers don't need to change their code; they just point their Kafka clients at the MSK brokers. Furthermore, if you want Kafka's ecosystem but dread cluster management, Amazon MSK Serverless automatically provisions and scales Apache Kafka compute and storage resources on demand.

Streams are ephemeral. Eventually, this rapid-fire data must be parked in persistent storage for deep analytics. This is the exact purpose of Amazon Data Firehose, a fully managed service for delivering real-time streaming data to persistent storage destinations.

Unlike Kinesis Data Streams, Amazon Data Firehose requires absolutely zero manual provisioning of shards or underlying compute resources. It operates dynamically out of the box. Instead of immediate real-time read access, Amazon Data Firehose buffers incoming data based on size or time thresholds (e.g., 5 MB or 60 seconds) before delivering the batch to a destination.

Amazon Data Firehose natively supports delivering streaming data to Amazon S3 (for data lakes), Amazon Redshift (for data warehousing), and Amazon OpenSearch Service (for log analytics and search).

If your raw streaming data is messy or contains sensitive PII, Amazon Data Firehose integrates directly with AWS Lambda to transform incoming streaming data records before final delivery.

Why does data format matter? If you dump raw JSON or CSV files into Amazon S3, you are paying a massive "laziness tax."

Transforming row-based data formats like CSV into columnar formats like Apache Parquet significantly reduces cloud storage costs. Furthermore, columnar storage formats improve analytical query performance by allowing query engines to scan only the necessary columns. If you have a 100-column CSV and your query only asks for user_region and total_spend, a row-based format forces the query engine to read the entire file. A columnar format (like Parquet or ORC) allows the engine to surgically extract only those two columns, reducing the data scanned by 98%.

Amazon Data Firehose supports automatic format conversion of incoming JSON data to Apache Parquet format, as well as automatic format conversion of incoming JSON data to Apache ORC format. However, Firehose cannot do this blindly; it needs to know what the columns are supposed to look like. Thus, Amazon Data Firehose format conversion features require defining a target schema within the AWS Glue Data Catalog.

A visual representation of a database schema. To automatically convert incoming JSON into columnar formats, Amazon Data Firehose relies on the AWS Glue Data Catalog to provide this exact type of structural blueprint.

AWS Glue is a fully serverless data integration service used for extract, transform, and load (ETL) operations. It is the central nervous system of modern AWS data architectures.

AWS Glue automates the conventional Extract, Transform, Load (ETL) pipeline, providing serverless compute to ingest raw data streams, transform their schema or format, and load the refined data into centralized storage.

To populate the Data Catalog (which Firehose relies on), we use AWS Glue Crawlers. These crawlers automatically scan external data stores (like S3 buckets or JDBC databases) to infer data schemas and partition structures. Upon scanning, AWS Glue Crawlers populate the centralized AWS Glue Data Catalog with discovered table metadata. The AWS Glue Data Catalog then serves as a persistent metadata store for services like Amazon Athena and Amazon EMR, giving them an index of where the data lives and how it is structured.

Executing Transformations in Glue

When raw data lands in S3, AWS Glue provides several ways to transform it:

Managed Apache Spark: AWS Glue can execute managed Apache Spark scripts to perform complex and distributed big data transformations.
Python Shell: If you don't need distributed processing, AWS Glue Python shell jobs are optimized for running lightweight data integration and transformation scripts without Spark.
DynamicFrames: Real-world data is messy and schema changes are frequent. To handle this, AWS Glue DynamicFrames provide a schema-flexible data structure designed specifically for processing semi-structured data where the schema might change record by record.
Job Bookmarks: Running batch ETL daily means you might accidentally process yesterday's data again. AWS Glue Job Bookmarks prevent the redundant reprocessing of old data during subsequent extract, transform, and load runs by maintaining state across executions.

As a Solutions Architect, your most frequent challenge will be selecting the appropriate compute engine for the specific nature of your streaming or ETL workload. Let's strictly classify the proper tools.

Compute Option	Operational Sweet Spot & Constraints
AWS Lambda	AWS Lambda is highly suitable for executing lightweight and event-driven data transformations on real-time streaming data. However, AWS Lambda functions have a hard maximum execution duration limit of 15 minutes. This 15-minute execution limit makes AWS Lambda inherently unsuitable for long-running batch data transformations.
AWS Glue	The premier choice for serverless batch or micro-batch ETL. AWS Glue is generally preferred over Amazon EMR for data processing workloads requiring a completely serverless operational model, eliminating cluster management overhead.
Amazon EMR	Amazon EMR is a managed cluster platform that simplifies running big data frameworks like Apache Hadoop and Apache Spark. Use it when you need fine-grained hardware control; Amazon EMR provides granular configuration control over the underlying Amazon EC2 compute instances used for data processing (e.g., using specific GPU instances or Spot Fleets to lower costs).
Amazon Managed Flink	Amazon Managed Service for Apache Flink enables real-time stateful processing and analysis of streaming data. While Firehose delivers batches, Managed Flink continuously queries data directly from Amazon Kinesis Data Streams, allowing you to perform complex rolling aggregations, anomaly detection, or sliding time-window analytics on data while it is still moving.

Amazon EMR simplifies the deployment and management of multi-node clusters, providing granular configuration control over the underlying infrastructure for heavy data processing workloads like Apache Hadoop and Apache Spark.

Designing a resilient, cost-effective data architecture requires recognizing that no single service is a panacea. The master stroke of cloud architecture is assembling these primitives—Kinesis for ingestion, Firehose for delivery, the Glue Data Catalog for metadata mapping, and the precise compute engine (Lambda, Glue Spark, EMR, or Flink) to forge raw data into analytical gold.