Env:
Spark 1.3.1Symptom:
Spark fails to parse a json object with multiple lines.This issue can happen when either creating a DataFrame using:
val people = sqlContext.jsonFile(path)or creating a temp table using Spark SQL:
CREATE TEMPORARY TABLE jsonTable2 USING org.apache.spark.sql.json OPTIONS ( path "/xxx/test2.json" );Sample error message is:
java.lang.RuntimeException: Failed to parse record "array" : [ {.
Please make sure that each line of the file (or each string in the RDD) is a valid JSON object or an array of JSON objects.
Root Cause:
As mentioned in Spark Documentation:Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.Solution:
Convert the json object from multiple lines to a single line.For example, convert below json object:
{
"array" : [ {
"count" : "site1",
"sitename" : "sitename1"
}, {
"count" : "site2",
"sitename" : "sitename2"
} ]
}
to below:{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
After that, Spark can successfully parse it. For example:spark-sql> CREATE TEMPORARY TABLE jsonTable
> USING org.apache.spark.sql.json
> OPTIONS (
> path "/xxx/test_spark.json"
> )
> ;
Time taken: 3.738 seconds
spark-sql> select * from jsonTable ;
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
Time taken: 1.4 seconds, Fetched 1 row(s)
spark-sql> select array[0].count,array[0].sitename from jsonTable;
site1 sitename1
Time taken: 0.184 seconds, Fetched 1 row(s)
You can also put multiple single-line json objects into one file.
For example:
# cat test_spark_multiple.json
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
The query result would be:spark-sql> CREATE TEMPORARY TABLE jsonTable3
> USING org.apache.spark.sql.json
> OPTIONS (
> path "/xxx/test_spark_multiple.json"
> );
Time taken: 0.234 seconds
spark-sql> select * from jsonTable3 ;
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
Time taken: 0.153 seconds, Fetched 5 row(s)
can you please tell me how we can convert the multi line json object in to single line json object
ReplyDeleteThis is just a workaround, if I receive a Json, I cannot stay everytime to convert it into a single line json...
ReplyDelete