Friday, June 12, 2015

Spark fails to parse a json object with multiple lines

Env:

Spark 1.3.1

Symptom:

Spark fails to parse a json object with multiple lines.

This issue can happen when either creating a DataFrame using:
val people = sqlContext.jsonFile(path)
or creating a temp table using Spark SQL:
CREATE TEMPORARY TABLE jsonTable2
USING org.apache.spark.sql.json
OPTIONS (
  path "/xxx/test2.json"
);
Sample error message is:
java.lang.RuntimeException: Failed to parse record   "array" : [ {. 
Please make sure that each line of the file (or each string in the RDD) is a valid JSON object or an array of JSON objects.

Root Cause:

As mentioned in Spark Documentation:Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

Solution:

Convert the json object from multiple lines to a single line.
For example, convert below json object:
{
  "array" : [ {
    "count" : "site1",
    "sitename" : "sitename1"
  }, {
    "count" : "site2",
    "sitename" : "sitename2"
  } ]
}
to below:
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
After that, Spark can successfully parse it. For example:
spark-sql> CREATE TEMPORARY TABLE jsonTable
         > USING org.apache.spark.sql.json
         > OPTIONS (
         >   path "/xxx/test_spark.json"
         > )
         > ;
Time taken: 3.738 seconds

spark-sql> select * from jsonTable ;
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
Time taken: 1.4 seconds, Fetched 1 row(s)

spark-sql> select array[0].count,array[0].sitename from jsonTable;
site1 sitename1
Time taken: 0.184 seconds, Fetched 1 row(s)

You can  also put multiple single-line json objects into one file.
For example:
# cat test_spark_multiple.json
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
The query result would be:
spark-sql>  CREATE TEMPORARY TABLE jsonTable3
         > USING org.apache.spark.sql.json
         > OPTIONS (
         >   path "/xxx/test_spark_multiple.json"
         > );
Time taken: 0.234 seconds

spark-sql> select * from jsonTable3 ;
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
Time taken: 0.153 seconds, Fetched 5 row(s)

1 comment:

  1. can you please tell me how we can convert the multi line json object in to single line json object

    ReplyDelete

Popular Posts