Enrich Streaming Data with Batched Data

Streaming data can be enriched in the following scenarios:

  1. Static References
  2. Dynamic Data Sets
  3. Another Streaming Data Source

This post covers the first two scenarios: static references and dynamic data sets.

First, set up the streaming reader:

outputPath = f'{working_dir}/output' # working_dir is a predefined path
outputPathBronze = f'{outputPath}/bronze'

deviceStream = (spark
   .readStream
   .format('delta')
   .load(outputPathBronze))

Next, load the static reference data:

def loadStaticData(path):
  return spark.read.format('delta').load(path)

# lookupSourcePath is path to the static reference data
deviceReferenceDF = loadStaticData(lookupSourcePath)

Define a function that joins the stream with the reference data and writes the result to a silver Delta table:

from pyspark.sql.functions import col

def bronzeToSilver(deviceStreamReader, silverPath, streamName, lookupDF):
  devicesStream = (deviceStreamReader
   .withColumn('device_id', col('params.device_id')) # Extract the join key from the nested params struct
   .join(lookupDF, ['device_id'], 'left') # Join with the static reference data
   .select(col('device_id'),
           col('eventName'),
           col('params.client_event_time').alias('client_event_time'),
           col('eventDate'),
           col('deviceType')) # deviceType comes from the reference data
  )

  return (devicesStream
    .writeStream
    .format('delta')
    .outputMode('append')
    .queryName(streamName)
    .option('checkpointLocation', f'{silverPath}_checkpoint')
    .start(silverPath)) # Return the StreamingQuery so callers can monitor or stop it
