Thursday, February 25, 2016

Understanding NullableVarCharHolder and VarCharHolder in Drill UDF code

Goal:

Understanding the differences between NullableVarCharHolder and VarCharHolder in Drill UDF code.

Env:

Drill 1.4

Root Cause:

When writing custom Drill UDF code, here are 2 input types for varchar:
org.apache.drill.exec.expr.holders.VarCharHolder
org.apache.drill.exec.expr.holders.NullableVarCharHolder

Take function to_int(varchar,bigint) for example, and let's focus on the 1st input parameter in varchar type.
If the function is applied on a column of json file, the column is nullable(NullableVarCharHolder), eg:
select to_int(col1,999) from dfs.drill.`a.json`;
If it is applied on a constant value, then it is for sure not-nullable(VarCharHolder),eg:
select to_int('111', 999) from sys.version;

One major difference between NullableVarCharHolder and VarCharHolder is:
NullableVarCharHolder has an attribute "isSet". "isSet=0" means this object is null, vice versa.

Since VarCharHolder does not have that attribute, so if the function's null handling is "nulls = NullHandling.INTERNAL" instead of "nulls = NullHandling.NULL_IF_NULL", you need to implement the same function twice -- one for "NullableVarCharHolder" and one for "VarCharHolder".

Sample Code:

Here are the sample code for  to_int(varchar,bigint) function:
1. For input type = NullableVarCharHolder
https://github.com/viadea/DrillUDF/blob/master/src/main/java/openkb/drill/udf/GetInt.java

2. For input type =  VarCharHolder
https://github.com/viadea/DrillUDF/blob/master/src/main/java/openkb/drill/udf/GetInt2.java

Test

1.  For input type = NullableVarCharHolder
> select col1 from dfs.drill.`a.json`;
+-------+
| col1  |
+-------+
| 123   |
| null  |
+-------+
2 rows selected (0.246 seconds)
> select to_int(col1,999) from dfs.drill.`a.json`;
+---------+
| EXPR$0  |
+---------+
| 123     |
| 999     |
+---------+
2 rows selected (0.23 seconds)

2. For input type =  VarCharHolder
> select to_int('111', 999) from sys.version;
+---------+
| EXPR$0  |
+---------+
| 111     |
+---------+
1 row selected (0.537 seconds)
> select to_int('wrongnumber', 999) from sys.version;
+---------+
| EXPR$0  |
+---------+
| 999     |
+---------+
1 row selected (0.374 seconds)

3 comments:

  1. I learned that the hard way. I find this impossible to live with if I have just 2 more arguments for my UDF to handle ! Why can't the UDF version receiving 'NullableVarCharHolder' also accept constants with the 'isSet' flag always set to '1' ?

    ReplyDelete
    Replies
    1. This is a design issue. I would suggest you open a Apache Drill JIRA for this.

      Delete

Popular Posts