Goal:

This article explains how to access an Azure Open Dataset from Spark.

Env:

spark-3.1.1-bin-hadoop2.7

Solution:

Microsoft Azure Open Datasets provides curated and cleansed data, including weather, census, and holidays, that you can use with minimal preparation to enrich ML models.

To access it from a local Spark environment, we need two jars:

azure-storage-<version>.jar
hadoop-azure-<version>.jar

My Spark is built on Hadoop 2.7, so I have to use a relatively old hadoop-azure jar.

In this example, I downloaded the following two jars:

1. Add the two jars above to the Spark classpath by setting:

spark.executor.extraClassPath
spark.driver.extraClassPath
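
One place to set these two properties is conf/spark-defaults.conf. The sketch below only builds the two lines to paste into that file. The jar directory is a hypothetical placeholder, <version> is left for you to fill in, and the ":" separator assumes Linux or macOS.

```python
# Hypothetical jar locations: replace the directory and <version> with your own.
JARS = [
    "/opt/spark/extra-jars/azure-storage-<version>.jar",
    "/opt/spark/extra-jars/hadoop-azure-<version>.jar",
]


def classpath_conf_lines(jars, sep=":"):
    """Return spark-defaults.conf lines that put the jars on both classpaths."""
    classpath = sep.join(jars)
    return [
        f"spark.driver.extraClassPath {classpath}",
        f"spark.executor.extraClassPath {classpath}",
    ]


# Print the lines so they can be copied into conf/spark-defaults.conf.
for line in classpath_conf_lines(JARS):
    print(line)
```

Both properties get the same value so that the driver and the executors load the same Azure classes.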

2. Add the Azure Blob Storage related Hadoop configs

For example, I chose to add them directly in a Jupyter notebook (you can also add them to core-site.xml):

# Hadoop configuration of the running SparkContext (sc)
hadoop_conf = sc._jsc.hadoopConfiguration()
# Map the wasbs:// scheme to the Azure Blob Storage file system classes
hadoop_conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoop_conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoop_conf.set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.Wasbs")
# The "spark.hadoop." prefix is normally a SparkConf convention; these two are kept from my original setup
hadoop_conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoop_conf.set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
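
For the core-site.xml route, the same wasbs settings can be written as Hadoop <property> entries. The following is a minimal sketch that generates such a fragment with the Python standard library. It assumes only the three fs.* keys are needed and leaves out the two keys that carry the "spark.hadoop." prefix. The helper name to_core_site_xml is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The fs.* settings from the notebook example, without the "spark.hadoop." keys.
WASBS_PROPERTIES = {
    "fs.azure": "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
    "fs.wasbs.impl": "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
    "fs.AbstractFileSystem.wasbs.impl": "org.apache.hadoop.fs.azure.Wasbs",
}


def to_core_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent exists only on Python 3.9+; older versions emit a single line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(to_core_site_xml(WASBS_PROPERTIES))
```

The printed fragment can be merged into an existing core-site.xml, so the notebook no longer has to set these values in every session.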

3. Run PySpark commands to access the Azure Open Dataset

For example, Microsoft publishes PySpark commands for accessing the "NYC Taxi - Yellow" Azure Open Dataset.
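
A minimal sketch of such commands follows. The storage account, container, and relative path are assumptions based on Microsoft's published sample for this dataset, so verify them against the dataset's page. The function read_yellow_taxi is a hypothetical helper that expects a SparkSession already configured as in steps 1 and 2.

```python
# Assumed location of the public "NYC Taxi - Yellow" dataset (verify before use).
BLOB_ACCOUNT_NAME = "azureopendatastorage"
BLOB_CONTAINER_NAME = "nyctlc"
BLOB_RELATIVE_PATH = "yellow"
BLOB_SAS_TOKEN = ""  # assumed empty because the dataset is publicly readable


def wasbs_path(container, account, relative_path):
    """Build the wasbs:// URL for a path inside a blob container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def sas_conf_key(container, account):
    """Build the config key under which the SAS token for a container is set."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


def read_yellow_taxi(spark):
    """Return a lazily evaluated DataFrame over the dataset's Parquet files."""
    spark.conf.set(
        sas_conf_key(BLOB_CONTAINER_NAME, BLOB_ACCOUNT_NAME), BLOB_SAS_TOKEN
    )
    path = wasbs_path(BLOB_CONTAINER_NAME, BLOB_ACCOUNT_NAME, BLOB_RELATIVE_PATH)
    return spark.read.parquet(path)
```

In a notebook where spark is the active session, df = read_yellow_taxi(spark) followed by df.printSchema() is a quick check that the jars and Hadoop settings are picked up, since reading the schema already touches the remote storage.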

Thursday, September 23, 2021

How to access Azure Open Dataset from Spark