Goal:

This article explains how to access Azure Open Dataset from Spark.

Env:

spark-3.1.1-bin-hadoop2.7

Solution:

Microsoft Azure Open Dataset is curated and cleansed data - including weather, census, and holidays - that you can use with minimal preparation to enrich ML models.

To access it from a local Spark environment, we need two jars:

azure-storage-<version>.jar
hadoop-azure-<version>.jar

My Spark is built on Hadoop 2.7, so I have to use a relatively old hadoop-azure jar that matches that Hadoop version.

In this example, I downloaded the following two jars:

1. Add the two jars above to the Spark classpath through these properties:

spark.executor.extraClassPath
spark.driver.extraClassPath
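
As a minimal sketch of this step, the helper below builds the two lines you could place in spark-defaults.conf. The function name classpath_conf_lines and the /opt/jars paths are hypothetical, and the <version> placeholders are left as-is because they depend on the jars you actually downloaded.

```python
import os


def classpath_conf_lines(jar_paths, sep=os.pathsep):
    """Return spark-defaults.conf lines that put the jars on both classpaths."""
    # Classpath entries are joined with the platform separator (":" on Linux/macOS).
    classpath = sep.join(jar_paths)
    return [
        f"spark.driver.extraClassPath {classpath}",
        f"spark.executor.extraClassPath {classpath}",
    ]


# Hypothetical locations; substitute the jars you actually downloaded.
jars = [
    "/opt/jars/azure-storage-<version>.jar",
    "/opt/jars/hadoop-azure-<version>.jar",
]

for line in classpath_conf_lines(jars):
    print(line)
```

Writing the properties to spark-defaults.conf (or passing them on the command line) is a deliberate choice here: in client mode the driver JVM is already running by the time application code executes, so the driver classpath generally has to be set before launch.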

2. Add the Azure Blob Storage related Hadoop configs

For example, I chose to add them directly in a Jupyter notebook (you can also add them to core-site.xml):

sc._jsc.hadoopConfiguration().set("fs.azure","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.Wasbs")
sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
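
If you prefer the core-site.xml route, the sketch below renders the same settings as Hadoop property entries. It is only an illustration: the function name core_site_xml is hypothetical, and only the fs.* keys are included, on the assumption that the spark.hadoop. prefix is a Spark-side convention and not part of the Hadoop property name.

```python
import xml.etree.ElementTree as ET

# The fs.* settings from the notebook snippet above.
WASBS_SETTINGS = {
    "fs.azure": "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
    "fs.wasbs.impl": "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
    "fs.AbstractFileSystem.wasbs.impl": "org.apache.hadoop.fs.azure.Wasbs",
}


def core_site_xml(settings):
    """Render a dict of Hadoop settings as a core-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent is only available on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(core_site_xml(WASBS_SETTINGS))
```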

3. Run PySpark commands to access the Azure Open Dataset

For example, the PySpark commands are here for accessing "NYC Taxi - Yellow" Azure Open Dataset.
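
As a rough sketch of what such commands can look like, the helpers below build a wasbs:// URL and read it as Parquet. The helper names are hypothetical, and the storage account "azureopendatastorage", container "nyctlc", folder "yellow", and empty SAS token are assumptions based on Microsoft's published sample for this dataset, so verify them against the dataset page before relying on them.

```python
def wasbs_url(container, account, relative_path):
    """Build the wasbs:// URL for a path inside an Azure Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def sas_conf_key(container, account):
    """Name of the hadoop-azure property that holds the SAS token for a container."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


def read_open_dataset(spark, container, account, relative_path, sas_token=""):
    """Register the SAS token (assumed empty for public data) and read Parquet."""
    spark.conf.set(sas_conf_key(container, account), sas_token)
    return spark.read.parquet(wasbs_url(container, account, relative_path))


# Usage in a notebook where `spark` already exists (values are assumptions):
# df = read_open_dataset(spark, "nyctlc", "azureopendatastorage", "yellow")
# df.printSchema()
```

The Spark session is passed in as an argument, so the URL and property-name helpers can be checked on their own without a running cluster.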

Thursday, September 23, 2021

How to access Azure Open Dataset from Spark