[DRILL-6074] Corrections to UDF tutorial documentation page - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Documentation
Labels:
- doc-impacting

Description

Consider the UDF Tutorial. Some of the details are a bit off.

Step 3:

The function will be generated dynamically, as you can see in the DrillSimpleFuncHolder, and the input parameters and output holders are defined using holders by annotations. Define the parameters using the @Param annotation.

Better: Drill uses your function template to in-line your function code into Drill's own generated code. The @Param annotation identifies the input arguments. The order of the annotated fields indicates the order of the function parameters. Each parameter field must be one of Drill's holder types.

Use a holder classes to provide a buffer to manage larger objects in an efficient way: VarCharHolder or NullableVarCharHolder.

Better: Our function template tells Drill to handle nulls, so all three of our arguments can be declared using the VarCharHolder type.

(Then, fix the code to use that type. The bit about larger objects is probably obsolete: holders are the only way to work with any value: large or otherwise.)

NOTE: Drill doesn’t actually use the Java heap for data being processed in a query but instead keeps this data off the heap and manages the life-cycle for us without using the Java garbage collector.

Better: NOTE: VARCHAR data is stored in direct memory. The DrillBuf object in the VarCharHolder provides access to the data for the VARCHAR.

(For context: simple types, such as INT, are stored on the heap when passed to a UDF, so we don't want to make a blanket statement.)

Step 4.

Also, using the @Output annotation, define the returned value as VarCharHolder type. Because you are manipulating a VarChar, you also have to inject a buffer that Drill uses for the output.

Better: Identify the function's return value using the @Output annotation. Like parameters, the output must be a holder type. Drill, however, does not provide the output buffer; we have to request one using the @Inject annotation. The injected field must be of type DrillBuf. Then, in our code, we set the output holder to point to the injected buffer.

Step 5. The code is inefficient and not a good example. Replace this:

    out.end = outputValue.getBytes().length;
    buffer.setBytes(0, outputValue.getBytes());

With this:

    byte[] result = outputValue.getBytes();
    out.end = result.length;
    buffer.setBytes(0, result);

(But see comments for additional changes.)
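To make the pattern concrete outside of Drill, here is a minimal sketch that runs on the JDK alone. It encodes the string once, copies the bytes into a buffer, then points an output holder at them. HolderSketch and the method names are hypothetical stand-ins, and java.nio.ByteBuffer stands in for DrillBuf; none of this is Drill's actual API. The explicit UTF-8 charset is also an assumption added here; the snippet above calls getBytes() with no argument.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class OutputCopySketch {
    /** Hypothetical stand-in for an output holder: public fields only, no methods. */
    static final class HolderSketch {
        int start;
        int end;
        ByteBuffer buffer;
    }

    /** Encodes the string once, copies it into the buffer, and points the holder at it. */
    static void writeOutput(String outputValue, ByteBuffer buffer, HolderSketch out) {
        // Encode exactly once and reuse the array for both the length and the copy.
        byte[] result = outputValue.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < result.length; i++) {
            buffer.put(i, result[i]); // absolute put: does not move the buffer position
        }
        out.buffer = buffer;
        out.start = 0;
        out.end = result.length;
    }

    /** Reads the bytes the holder points at back into a String. */
    static String readBack(HolderSketch holder) {
        byte[] bytes = new byte[holder.end - holder.start];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = holder.buffer.get(holder.start + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        HolderSketch out = new HolderSketch();
        // A direct buffer loosely mirrors the off-heap storage of VARCHAR data.
        writeOutput("****WORD", ByteBuffer.allocateDirect(64), out);
        System.out.println(out.end + " " + readBack(out)); // prints "8 ****WORD"
    }
}
```

The point of the sketch is that the length stored in the holder and the bytes copied into the buffer come from the same array, so the two cannot disagree.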

While we are at it, we might as well make another line a bit more readable.

    String outputValue = (new StringBuilder(maskSubString)).append(stringValue.substring(numberOfCharToReplace)).toString();

Should be rewritten as:

    String outputValue = new StringBuilder(maskSubString)
        .append(stringValue.substring(numberOfCharToReplace))
        .toString();
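For reviewers who want to check the string logic without a Drill build, the masking step can be sketched as plain Java. MaskSketch and maskPrefix are hypothetical names, and the mask is repeated with a simple loop so the sketch needs only the JDK. The clamp against negative counts is an addition that is not in the tutorial code.

```java
final class MaskSketch {
    /** Replaces the first toReplace characters of value with copies of mask. */
    static String maskPrefix(String value, String mask, int toReplace) {
        // Never replace more characters than the string has; the lower bound of 0
        // is an added guard so substring() cannot receive a negative index.
        int numberOfCharToReplace = Math.max(0, Math.min(toReplace, value.length()));

        StringBuilder outputValue = new StringBuilder();
        for (int i = 0; i < numberOfCharToReplace; i++) {
            outputValue.append(mask);
        }
        return outputValue
            .append(value.substring(numberOfCharToReplace))
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(maskPrefix("PASSWORD", "*", 4)); // prints ****WORD
    }
}
```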

Then in the list of steps:

Gets the number of character to replace

The word "character" should be "characters" (plural)

And:

Creates and populates the output buffer

Better:

Copies the new string into the temporary DrillBuf
Sets up the output holder to point to the data in the DrillBuf

Then:

Even to a seasoned Java developer, the eval() method might look a bit strange because Drill generates the final code on the fly to fulfill a query request. This technique leverages Java’s just-in-time (JIT) compiler for maximum speed.

Better: Even to a seasoned Java developer, the eval() method might look a bit strange. It is best to think of the UDF declaration as a Domain-Specific Language (DSL) that Drill uses to describe the function. Drill uses the declaration to in-line your function into generated code. That is, Drill does not call your function code; instead Drill extracts the code and copies it into Drill's own generated code.

(Note: the bit about the JIT compiler is plain wrong. Drill's code generation has nothing to do with Java's JIT compiler.)

Basic Coding Rules

To leverage Java’s just-in-time (JIT) compiler for maximum speed, you need to adhere to some basic rules.

Better: Drill's code generation mechanism supports a restricted subset of Java, meaning that you must adhere to some basic rules.

Do not use imports. Instead, use the fully qualified class name as required by the Google Guava API packaged in Apache Drill and as shown in "Step 3: Declare input parameters".

(This mixes up a couple of ideas.) Better: Do not use imports. Instead, use the fully qualified class name.

Manipulate the ValueHolders classes, for example VarCharHolder and IntHolder, as structs by calling helper methods, such as getStringFromVarCharHolder and toStringFromUTF8 as shown in "Step 5: Implement the eval() function".

Do not call methods such as toString because this causes serious problems.

Better: Do not call any methods on the holder classes. The holders will be optimized away by Drill's scalar replacement mechanism.

Some additional restrictions:

All class fields (member variables) must be preceded by one of the three annotations discussed above (@Param, @Output or @Inject), or by the @Workspace annotation, which identifies internal temporary fields. (If you omit the annotations, then queries using your function will fail at runtime.)
Do not use static fields (for example, to declare constants). If you must declare constants, declare them in a class other than the UDF class.
Do not pass holders to other functions; all references must be within your UDF.
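The two field rules are easy to get wrong, so the page might benefit from showing them side by side. The sketch below is one way to illustrate them: a small reflection check that flags static fields and unannotated fields. The annotations here are local stand-ins with the same simple names as Drill's, declared only so the sketch compiles on the JDK alone; the field types are placeholders (real UDF fields must be holder types or DrillBuf); and the checker itself is hypothetical, not something Drill provides.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class UdfFieldRules {
    // Local stand-ins for Drill's annotations (assumption: same simple names only).
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface Param {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface Output {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface Inject {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface Workspace {}

    /** Follows both rules: every field is annotated and none is static. */
    static final class GoodUdf {
        @Param Object input;      // placeholder type
        @Output Object out;       // placeholder type
        @Inject Object buffer;    // placeholder type
        @Workspace int scratch;   // internal temporary field
    }

    /** Breaks both rules: one unannotated field and one static constant. */
    static final class BadUdf {
        @Param Object input;
        int counter;                    // missing annotation
        static final String MASK = "*"; // static field
    }

    /** Returns one message per field that breaks a rule. */
    static List<String> violations(Class<?> udfClass) {
        List<String> problems = new ArrayList<>();
        for (Field f : udfClass.getDeclaredFields()) {
            if (f.isSynthetic()) {
                continue; // skip compiler-generated fields
            }
            boolean annotated = f.isAnnotationPresent(Param.class)
                || f.isAnnotationPresent(Output.class)
                || f.isAnnotationPresent(Inject.class)
                || f.isAnnotationPresent(Workspace.class);
            if (Modifier.isStatic(f.getModifiers())) {
                problems.add(f.getName() + ": static field");
            } else if (!annotated) {
                problems.add(f.getName() + ": missing annotation");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(violations(GoodUdf.class));
        System.out.println(violations(BadUdf.class));
    }
}
```

GoodUdf yields no messages, while BadUdf yields one message for counter and one for MASK.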

Prepare the Package

Because Drill generates the source, ...

Better: Because Drill copies your code into its own generated code, ...

Basic Coding Rules
Build and Deploy the Function
Test the New Function

The above three lines probably want to be headings; they appear as normal text.

Add the JAR files to Drill, by copying them to the following location: <Drill installation directory>/jars/3rdparty

Perhaps add the following: Be sure to copy the jars into the above folder each time you rebuild, reinstall or upgrade Drill. If running in a cluster, copy the jars to the Drill installation on every node.

As an alternative, you can create a site directory as described (need link; do we describe this anywhere except in the Drill-on-YARN PR?). Copy your files into the $DRILL_SITE/jars folder. This way, you need not remember to copy the jars each time you reinstall Drill.
