Thursday, September 23, 2021

How to access Azure Open Dataset from Spark

Goal:

This article explains how to access Azure Open Dataset from Spark.

Env:

spark-3.1.1-bin-hadoop2.7

Solution:

Microsoft Azure Open Dataset is curated and cleansed data - including weather, census, and holidays - that you can use with minimal preparation to enrich ML models.

If we want to access it from a local Spark environment, we need two jars:

  • azure-storage-<version>.jar
  • hadoop-azure-<version>.jar

My Spark is built on Hadoop 2.7, so I have to use a relatively old hadoop-azure jar.

In this example, I downloaded the following two jars:

1. Add the above two jars to the Spark classpath.

spark.executor.extraClassPath
spark.driver.extraClassPath
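
As a minimal sketch of this step, the snippet below builds the two classpath settings as lines that could be pasted into spark-defaults.conf. The directory /opt/spark-extra-jars and the helper name classpath_conf are hypothetical, and the jar file names keep the <version> placeholder because the exact versions depend on your Hadoop build.

```python
import os


def classpath_conf(jar_dir, jar_names):
    """Return the driver and executor classpath settings for the given jars."""
    # Join the full jar paths with the platform classpath separator
    # (":" on Linux/macOS, ";" on Windows).
    classpath = os.pathsep.join(os.path.join(jar_dir, name) for name in jar_names)
    return {
        "spark.driver.extraClassPath": classpath,
        "spark.executor.extraClassPath": classpath,
    }


# Hypothetical location; replace <version> with the versions you downloaded.
conf = classpath_conf(
    "/opt/spark-extra-jars",
    ["azure-storage-<version>.jar", "hadoop-azure-<version>.jar"],
)

# Print the settings in spark-defaults.conf format: "<key> <value>".
for key, value in conf.items():
    print(f"{key} {value}")
```

The same value is used for both settings so that the driver and the executors load identical jar versions.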

2. Add the Azure Blob Storage related Hadoop configs.

For example, I chose to add them directly in a Jupyter notebook (or you can add them to core-site.xml):

sc._jsc.hadoopConfiguration().set("fs.azure","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.Wasbs")
sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
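
For the core-site.xml alternative, a rough sketch is to render the same properties as XML. The snippet below assumes that only the un-prefixed wasbs keys are needed there, since the "spark.hadoop." prefix is a Spark-side convention for passing Hadoop properties through Spark config. The helper name render_core_site is hypothetical.

```python
import xml.etree.ElementTree as ET

# Keys and values copied from the notebook calls above. The "spark.hadoop."
# prefixed variants are left out on the assumption that core-site.xml only
# needs the plain Hadoop property names.
WASBS_PROPERTIES = {
    "fs.wasbs.impl": "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
    "fs.AbstractFileSystem.wasbs.impl": "org.apache.hadoop.fs.azure.Wasbs",
}


def render_core_site(properties):
    """Render a dict of Hadoop properties as a core-site.xml style document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


core_site_xml = render_core_site(WASBS_PROPERTIES)
print(core_site_xml)
```

If a core-site.xml already exists, merge the generated property elements into its configuration element rather than overwriting the file.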

3. Use PySpark commands to access the Azure Open Dataset.

For example, the PySpark commands are here for accessing "NYC Taxi - Yellow" Azure Open Dataset.
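
The commands themselves are not reproduced in this post. A minimal sketch, modeled on the public Azure Open Datasets sample for this dataset, could look like the following. The account name, container name, relative path, and empty SAS token are assumptions taken from that sample and should be verified against the dataset page. The helper names are hypothetical.

```python
# Assumed values from the public "NYC Taxi - Yellow" sample; verify before use.
BLOB_ACCOUNT_NAME = "azureopendatastorage"
BLOB_CONTAINER_NAME = "nyctlc"
BLOB_RELATIVE_PATH = "yellow"
BLOB_SAS_TOKEN = ""  # assumed empty because the dataset is publicly readable


def build_wasbs_path(container, account, relative_path):
    """Build the wasbs:// URL for a path inside a Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def sas_conf_key(container, account):
    """Build the config key under which the SAS token is registered."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


def read_open_dataset(spark):
    """Register the SAS token and read the dataset as a Spark DataFrame."""
    spark.conf.set(
        sas_conf_key(BLOB_CONTAINER_NAME, BLOB_ACCOUNT_NAME), BLOB_SAS_TOKEN
    )
    return spark.read.parquet(
        build_wasbs_path(BLOB_CONTAINER_NAME, BLOB_ACCOUNT_NAME, BLOB_RELATIVE_PATH)
    )


wasbs_path = build_wasbs_path(
    BLOB_CONTAINER_NAME, BLOB_ACCOUNT_NAME, BLOB_RELATIVE_PATH
)
print(wasbs_path)
```

In the notebook, calling df = read_open_dataset(spark) followed by df.printSchema() is a quick way to confirm that the jars and Hadoop configs from steps 1 and 2 are picked up.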






