Open Knowledge Base: Using Spark job to upload files to AWS S3 with Server Side Encryption enabled

Goal:

This article provides example Java code for using a Spark job to upload files to AWS S3 with Server Side Encryption (SSE) enabled.

Env:

MapR 5.1 with Hadoop 2.7.0 (which ships with aws-java-sdk-1.7.4.jar)
Spark 1.5.2

Solution:

1. Download my source code from GitHub

git clone git@github.com:viadea/Spark_Upload_S3.git

Please note that in AWS SDK 1.7.4, SSE is enabled by calling the method "setServerSideEncryption" of the Java class "ObjectMetadata":

objectMetadata.setServerSideEncryption("AES256");

In later versions of the AWS SDK, say 1.7.15, this method was replaced by the method "setSSEAlgorithm":

objectMetadata.setSSEAlgorithm(ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION);

So please make sure the method you call matches the AWS SDK version that is on the classpath when the job runs; otherwise, the job may fail at run time with java.lang.NoSuchMethodError.
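If the same jar may end up running against more than one SDK version, one possible workaround is to look up the setter by name at run time instead of compiling against a fixed one. The following is only a minimal sketch using plain Java reflection. The class "SseSetter" and its method "callFirstAvailable" are hypothetical names made up for this illustration; they are not part of the AWS SDK or of the Spark_Upload_S3 project.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

/** Hypothetical helper, for illustration only. */
final class SseSetter {

    private SseSetter() {}

    /**
     * Calls the first public method in methodNames that exists on target
     * and takes a single String argument. Returns the name that was used.
     */
    static String callFirstAvailable(Object target, String value, String... methodNames) {
        for (String name : methodNames) {
            Method method;
            try {
                method = target.getClass().getMethod(name, String.class);
            } catch (NoSuchMethodException e) {
                continue; // this SDK version does not have it, try the next name
            }
            try {
                method.invoke(target, value);
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new IllegalStateException("Calling " + name + " failed", e);
            }
            return name;
        }
        throw new IllegalStateException(
                "None of the candidate methods exist on " + target.getClass().getName());
    }
}
```

With an "ObjectMetadata" instance this could be called as SseSetter.callFirstAvailable(objectMetadata, "AES256", "setSSEAlgorithm", "setServerSideEncryption"). Pinning the SDK version as described in step 2 is still the simpler option.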

2. Compile using maven

mvn clean package

Please note that in pom.xml, I am using AWS Java SDK 1.7.4 as the dependency because Hadoop 2.7.0 ships with the same version, aws-java-sdk-1.7.4.jar.
This keeps the libraries used by the Spark application in sync with those on the Hadoop cluster:

        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
            <version>1.7.4</version>
        </dependency>
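
To double check which jar actually supplies a class at run time, the class can be asked where it was loaded from. This is a minimal sketch using only standard Java APIs. "JarLocator" is a hypothetical helper name, and the placeholder strings it returns are arbitrary choices for this illustration.

```java
import java.security.CodeSource;

/** Hypothetical helper, for illustration only. */
final class JarLocator {

    static final String NOT_FOUND = "(not on classpath)";
    static final String JDK_RUNTIME = "(JDK runtime, no code source)";

    private JarLocator() {}

    /** Returns the location (usually a jar URL) that the named class was loaded from. */
    static String locationOf(String className) {
        Class<?> clazz;
        try {
            clazz = Class.forName(className);
        } catch (ClassNotFoundException e) {
            return NOT_FOUND;
        }
        // Classes that belong to the JDK itself normally have no code source.
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return JDK_RUNTIME;
        }
        return source.getLocation().toString();
    }
}
```

For example, printing JarLocator.locationOf("com.amazonaws.services.s3.model.ObjectMetadata") from inside the job should point at aws-java-sdk-1.7.4.jar if the versions are in sync.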

3. Run the Spark job

/opt/mapr/spark/spark-1.5.2/bin/spark-submit \
  --class example.uploads3.UploadS3 \
  --master yarn \
  /mapr/my2.cluster.com/github/Spark_Upload_S3/target/spark_upload_s3-1.0.jar \
  /user/mapr/input/data.txt

This sample job uploads data.txt to the S3 bucket named "haos3" with the key name "test/byspark.txt".

4. Confirm that the file is SSE encrypted.

On the AWS S3 web page, click "Properties" for this file. SSE should be shown as enabled with the "AES-256" algorithm.
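
The same check can also be made outside the web page: S3 reports the algorithm in the "x-amz-server-side-encryption" response header of a HEAD or GET request on the object, with the value "AES256" for this kind of encryption. The following is only a minimal sketch that inspects headers which have already been collected into a map. How the request is sent and signed is left out, and "SseHeaderCheck" is a hypothetical helper name.

```java
import java.util.Map;

/** Hypothetical helper, for illustration only. */
final class SseHeaderCheck {

    static final String SSE_HEADER = "x-amz-server-side-encryption";

    private SseHeaderCheck() {}

    /** Returns true if the response headers report AES256 server side encryption. */
    static boolean isAes256(Map<String, String> headers) {
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            String name = entry.getKey();
            // HTTP header names are case insensitive, so compare accordingly.
            if (name != null && name.equalsIgnoreCase(SSE_HEADER)) {
                return "AES256".equals(entry.getValue());
            }
        }
        return false; // header absent: the object was not reported as SSE encrypted
    }
}
```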

Reference:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/ObjectMetadata.html

Thursday, September 15, 2016
