Goal:
This article explains how to access Azure Open Datasets from a local Spark environment.
Env:
spark-3.1.1-bin-hadoop2.7
Solution:
Microsoft Azure Open Datasets are curated and cleansed datasets - including weather, census, and holidays - that you can use with minimal preparation to enrich ML models.
To access them from a local Spark environment, we need two JARs:
- azure-storage-<version>.jar
- hadoop-azure-<version>.jar
My Spark is built on Hadoop 2.7, so I have to use a relatively older hadoop-azure JAR.
For this example, I downloaded versions of these two JARs that are compatible with Hadoop 2.7.
1. Add the above two JARs to the Spark classpath via the following properties (see the sketch after this step):
spark.executor.extraClassPath
spark.driver.extraClassPath
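For example, a minimal spark-defaults.conf sketch. The directory and the <version> placeholders below are assumptions; substitute the paths and versions of the JARs you actually downloaded:

# conf/spark-defaults.conf -- placeholder paths, adjust to your environment
spark.driver.extraClassPath    /path/to/jars/hadoop-azure-<version>.jar:/path/to/jars/azure-storage-<version>.jar
spark.executor.extraClassPath  /path/to/jars/hadoop-azure-<version>.jar:/path/to/jars/azure-storage-<version>.jar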
2. Add the Azure Blob Storage related Hadoop configs.
For example, I chose to set them directly in the Jupyter notebook (or you can add them to core-site.xml):
sc._jsc.hadoopConfiguration().set("fs.azure","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.Wasbs")
sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
3. Use PySpark commands to access the Azure Open Dataset.
For example, the PySpark commands below access the "NYC Taxi - Yellow" Azure Open Dataset.
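A minimal sketch, assuming the public blob location documented for this dataset (storage account azureopendatastorage, container nyctlc, folder yellow); verify these names against the current Azure Open Datasets documentation:

# Public blob location of the "NYC Taxi - Yellow" Open Dataset
# (account/container/path taken from the Azure Open Datasets docs; verify before use)
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "yellow"
blob_sas_token = ""  # the container is publicly readable, so an empty SAS token works

wasbs_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (
    blob_container_name, blob_account_name, blob_relative_path)

# Allow Spark to read from this container without credentials
spark.conf.set(
    "fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name),
    blob_sas_token)

# Read the Parquet files and take a quick look
df = spark.read.parquet(wasbs_path)
df.printSchema()
df.show(5)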