Thursday, May 7, 2015

Understanding Drill's timestamp and timezone

Goal:

This article explains the behavior of Apache Drill's timestamp and timezone.

Env:

Drill 0.9

Solution:

1. Drill uses OS's timezone by default.

a. If the OS's timezone is "America/New_York":

# cat /etc/sysconfig/clock
ZONE="America/New_York"
After drillbits start, the BOOT configuration "user.timezone" is set to "America/New_York".
> select * from sys.options where name like '%timezone%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| user.timezone | STRING     | BOOT       | null       | "America/New_York" | null       | null       |
+------------+------------+------------+------------+------------+------------+------------+

b. If the OS's timezone is "UTC":

# cat /etc/sysconfig/clock
ZONE="UTC"
After drillbits start, the BOOT configuration "user.timezone" is set to "UTC".
> select * from sys.options where name like '%timezone%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| user.timezone | STRING     | BOOT       | null       | "UTC"      | null       | null       |
+------------+------------+------------+------------+------------+------------+------------+

c. If the OS's timezone is "America/New_York", we can set "user.timezone" to UTC to override it.

Put "-Duser.timezone=UTC" in DRILL_JAVA_OPTS in drill-env.sh and then restart drillbits.
Then:
>  select * from sys.options where name like '%timezone%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| user.timezone | STRING     | BOOT       | null       | "UTC"      | null       | null       |
+------------+------------+------------+------------+------------+------------+------------+

2. Invalid timestamp for daylight saving.

a. If Drill's timezone is "America/New_York", below timestamp is invalid due to daylight saving.

> select to_timestamp('2015-03-08 02:58:51','YYYY-MM-dd HH:mm:ss') from sys.version;
Error: exception while executing query: Failure while executing query. (state=,code=0)
Query failed: SYSTEM ERROR: Unexpected exception during fragment initialization: Internal error: Error while applying rule ReduceExpressionsRule_Project, args [rel#2186:LogicalProject.NONE.ANY([]).[](input=rel#2185:Subset#0.ENUMERABLE.ANY([]).[],EXPR$0=TO_TIMESTAMP('2015-03-08 02:58:51', 'YYYY-MM-dd HH:mm:ss'))]
From drillbit.log, below error shows up:
Caused by: org.joda.time.IllegalInstantException: 
Cannot parse "2015-03-08 02:58:51": 
Illegal instant due to time zone offset transition (America/New_York)

b. There is no such issue if Drill's timezone is set to UTC.

> select to_timestamp('2015-03-08 02:58:51','YYYY-MM-dd HH:mm:ss') from sys.version;
+------------+
|   EXPR$0   |
+------------+
| 2015-03-08 02:58:51.0 |
+------------+
1 row selected (0.109 seconds)

> select * from sys.options where name like '%timezone%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| user.timezone | STRING     | BOOT       | null       | "UTC"      | null       | null       |
+------------+------------+------------+------------+------------+------------+-

3. Function to_timestamp() expresses the data in Drill's timezone.

a. If the source data has no timezone, function to_timestamp() treats it in UTC.

For example, if Drill's timezone is "America/New_York", the timestamp data "2012-02-04 04:59:59" will be treated as "2012-02-04 04:59:59 UTC";
Then Drill will express it to its timezone then it will become "2012-02-03 23:59:59" America/New_York.
That is the reason for below behavior:
> select * from sys.options where name like '%timezone%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| user.timezone | STRING     | BOOT       | null       | "America/New_York" | null       | null       |
+------------+------------+------------+------------+------------+------------+------------+

> select to_date(to_timestamp('2012-02-04 05:00:00','YYYY-MM-dd HH:mm:ss' )) from sys.version;
+------------+
|   EXPR$0   |
+------------+
| 2012-02-04 |
+------------+

>  select to_date(to_timestamp('2012-02-04 04:59:59','YYYY-MM-dd HH:mm:ss' )) from sys.version;
+------------+
|   EXPR$0   |
+------------+
| 2012-02-03 |
+------------+
If Drill's timezone is "UTC", then there is no such behavior because Drill treats the timestamp data as UTC, and will also express it as UTC. Things are consistent.

b. If the source data has timezone, function to_timestamp() with 'Z' option can treat it in data's timezone. 

For example, if Drill's timezone is "America/New_York", the timestamp data "2012-02-04 05:00:00 -0500" matches Drill's timezone(UTC-5), so the output of to_timestamp() is consistent;
Otherwise, Drill will convert the data's timezone to Drill's timezone.
For example:
> select * from sys.options where name like '%timezone%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| user.timezone | STRING     | BOOT       | null       | "America/New_York" | null       | null       |
+------------+------------+------------+------------+------------+------------+------------+

> select to_timestamp('2012-02-04 05:00:00 -0500','YYYY-MM-dd HH:mm:ss Z' ) from sys.version;
+------------+
|   EXPR$0   |
+------------+
| 2012-02-04 05:00:00.0 |
+------------+

> select to_timestamp('2012-02-04 05:00:00 -0700','YYYY-MM-dd HH:mm:ss Z' ) from sys.version;
+------------+
|   EXPR$0   |
+------------+
| 2012-02-04 07:00:00.0 |
+------------+
Same thing if Drill's timezone is "UTC", the data with timezone can be converted to Drill's timezone -- "UTC".
> select * from sys.options where name like '%timezone%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| user.timezone | STRING     | BOOT       | null       | "UTC"      | null       | null       |
+------------+------------+------------+------------+------------+------------+------------+

> select to_timestamp('2012-02-04 05:00:00 -0500','YYYY-MM-dd HH:mm:ss Z' ) from sys.version;
+------------+
|   EXPR$0   |
+------------+
| 2012-02-04 10:00:00.0 |
+------------+

> select to_timestamp('2012-02-04 05:00:00 -0700','YYYY-MM-dd HH:mm:ss Z' ) from sys.version;
+------------+
|   EXPR$0   |
+------------+
| 2012-02-04 12:00:00.0 |
+------------+

c. If the source data has no timezone and it needs to be converted to other timezone, you need to append the timezone and then use function to_timestamp() with 'Z' option to read it.

For example, if the source data is "2012-02-04 05:00:00" and it needs to be treated in "UTC-5" timezone, you need to append the timezone to the source data as below:
> select * from sys.options where name like '%timezone%';
+------------+------------+------------+------------+------------+------------+------------+
|    name    |    kind    |    type    |  num_val   | string_val |  bool_val  | float_val  |
+------------+------------+------------+------------+------------+------------+------------+
| user.timezone | STRING     | BOOT       | null       | "UTC"      | null       | null       |
+------------+------------+------------+------------+------------+------------+------------+

> select columns[0] from dfs.drill.`test.csv`;
+------------+
|   EXPR$0   |
+------------+
| 2012-02-04 05:00:00 |
+------------+

> SELECT to_timestamp(CONCAT(columns[0], ' -0500'),'YYYY-MM-dd HH:mm:ss Z') from dfs.drill.`test.csv`;
+------------+
|   EXPR$0   |
+------------+
| 2012-02-04 10:00:00.0 |
+------------+

==

3 comments:

  1. Using Drill 1.9, how can I obtain the timezone value Drill is using ? The query `select * from sys.options where name like '%timezone%'` doesn't return results.

    ReplyDelete
    Replies
    1. In latest version, say Drill 1.9, some options are moved to sys.boot from sys.options.
      So:

      > select * from sys.boot where name like '%timezone%';
      +----------------+---------+-------+---------+----------+---------------------+-----------+------------+
      | name | kind | type | status | num_val | string_val | bool_val | float_val |
      +----------------+---------+-------+---------+----------+---------------------+-----------+------------+
      | user.timezone | STRING | BOOT | BOOT | null | "America/New_York" | null | null |
      +----------------+---------+-------+---------+----------+---------------------+-----------+------------+
      1 row selected (0.181 seconds)

      Delete
    2. Using Drill 1.11 - there's no user.timezone neither in sys.options nor in sys.boot.
      Could someone advice on this, please?

      Delete

Popular Posts