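Hive Notes, Part 1

Some notes on things I have looked into about Hive, in no particular order.

Default partition

First, I look into how Hive handles the default partition that gets created during a dynamic partition insert, starting from the implementation.

Source code

I work in a clone of https://github.com/cocoatomo/hive, which is a fork of the Hive repository mirror on GitHub (https://github.com/apache/hive). As a starting point, I search for where the default partition name is defined.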

cd .../hive
grep -nIRl '__HIVE_DEFAULT_PARTITION__' ./*
./common/src/java/org/apache/hadoop/hive/common/FileUtils.java
./common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
./hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/FileOutputCommitterContainer.java
./hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/HCatFileUtil.java
./ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
./ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
./ql/src/test/queries/clientnegative/default_partition_name.q
./ql/src/test/queries/clientpositive/alter_partition_change_col.q
./ql/src/test/queries/clientpositive/alter_partition_coltype.q
./ql/src/test/queries/clientpositive/alter_table_cascade.q
./ql/src/test/queries/clientpositive/annotate_stats_part.q
./ql/src/test/queries/clientpositive/default_partition_name.q
./ql/src/test/queries/clientpositive/dynamic_partition_skip_default.q
./ql/src/test/queries/clientpositive/dynpart_sort_opt_vectorization.q
./ql/src/test/queries/clientpositive/dynpart_sort_optimization.q
./ql/src/test/queries/clientpositive/extrapolate_part_stats_full.q
./ql/src/test/queries/clientpositive/extrapolate_part_stats_partial.q
./ql/src/test/queries/clientpositive/insert_into_with_schema.q
./ql/src/test/results/beelinepositive/default_partition_name.q.out
./ql/src/test/results/beelinepositive/load_dyn_part14.q.out
./ql/src/test/results/clientnegative/default_partition_name.q.out
./ql/src/test/results/clientpositive/alter_partition_change_col.q.out
./ql/src/test/results/clientpositive/alter_partition_coltype.q.out
./ql/src/test/results/clientpositive/alter_table_cascade.q.out
./ql/src/test/results/clientpositive/analyze_table_null_partition.q.out
./ql/src/test/results/clientpositive/annotate_stats_part.q.out
./ql/src/test/results/clientpositive/autoColumnStats_1.q.out
./ql/src/test/results/clientpositive/autoColumnStats_2.q.out
./ql/src/test/results/clientpositive/default_partition_name.q.out
./ql/src/test/results/clientpositive/dynamic_partition_skip_default.q.out
./ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out
./ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out
./ql/src/test/results/clientpositive/extrapolate_part_stats_full.q.out
./ql/src/test/results/clientpositive/extrapolate_part_stats_partial.q.out
./ql/src/test/results/clientpositive/insert_into_with_schema.q.out
./ql/src/test/results/clientpositive/llap/dynpart_sort_opt_vectorization.q.out
./ql/src/test/results/clientpositive/llap/dynpart_sort_optimization.q.out
./ql/src/test/results/clientpositive/llap/stats_only_null.q.out
./ql/src/test/results/clientpositive/llap/vector_non_string_partition.q.out
./ql/src/test/results/clientpositive/llap_partitioned.q.out
./ql/src/test/results/clientpositive/load_dyn_part14.q.out
./ql/src/test/results/clientpositive/load_dyn_part14_win.q.out
./ql/src/test/results/clientpositive/spark/load_dyn_part14.q.out
./ql/src/test/results/clientpositive/spark/stats_only_null.q.out
./ql/src/test/results/clientpositive/stats_only_null.q.out
./ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out
./ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out
./ql/src/test/results/clientpositive/tez/stats_only_null.q.out
./ql/src/test/results/clientpositive/tez/vector_non_string_partition.q.out
./ql/src/test/results/clientpositive/vector_non_string_partition.q.out

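Everything after the first six lines looks like test queries and their expected output, so I ignore it.

HiveConf.java looks like the place where the name is defined, so I look there.

So the code uses 2-space indentation. That seems a little unusual to me.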

public static enum ConfVars {
  // ...
  DEFAULTPARTITIONNAME("hive.exec.default.partition.name", "__HIVE_DEFAULT_PARTITION__",
      "The default partition name in case the dynamic partition column value is null/empty string or any other values that cannot be escaped. \n" +
      "This value must not contain any special character used in HDFS URI (e.g., ':', '%', '/' etc). \n" +
      "The user has to be aware that the dynamic partition value should not contain this value to avoid confusions."),
  // ...
}

Presumably the rest of the code refers to this value through the name DEFAULTPARTITIONNAME.
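To make the enum entry above concrete, here is a minimal sketch of the general pattern it suggests: an enum constant that carries a property name and a default value, plus a lookup that prefers a user override. The class ConfVarsSketch, the getVar helper, and the override value __MY_DEFAULT__ are hypothetical names made up for illustration. This is not Hive's actual HiveConf implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the pattern seen in HiveConf.ConfVars; not Hive's real class.
final class ConfVarsSketch {

  enum ConfVars {
    DEFAULTPARTITIONNAME("hive.exec.default.partition.name", "__HIVE_DEFAULT_PARTITION__");

    final String varname;    // property key a user would set
    final String defaultVal; // value used when the key is not set

    ConfVars(String varname, String defaultVal) {
      this.varname = varname;
      this.defaultVal = defaultVal;
    }
  }

  // Returns the user-supplied override if present, otherwise the built-in default.
  static String getVar(Map<String, String> overrides, ConfVars var) {
    return overrides.getOrDefault(var.varname, var.defaultVal);
  }

  public static void main(String[] args) {
    Map<String, String> overrides = new HashMap<>();
    // No override yet, so the default from the enum is used.
    System.out.println(getVar(overrides, ConfVars.DEFAULTPARTITIONNAME));

    // Assumed override value, purely for illustration.
    overrides.put("hive.exec.default.partition.name", "__MY_DEFAULT__");
    System.out.println(getVar(overrides, ConfVars.DEFAULTPARTITIONNAME));
  }
}
```

With a pattern like this, callers only name the enum constant, and the string literal lives in one place.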

Just to be safe, I look at the other occurrences too.

grep -nIR '__HIVE_DEFAULT_PARTITION__' --include '*.java' ./*
./common/src/java/org/apache/hadoop/hive/common/FileUtils.java:275:        // __HIVE_DEFAULT_PARTITION__ was the return value for escapePathName
./common/src/java/org/apache/hadoop/hive/common/FileUtils.java:276:        return "__HIVE_DEFAULT_PARTITION__";
./common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:509:    DEFAULTPARTITIONNAME("hive.exec.default.partition.name", "__HIVE_DEFAULT_PARTITION__",
./hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/FileOutputCommitterContainer.java:705:      dynPathSpec = dynPathSpec.replaceAll("__HIVE_DEFAULT_PARTITION__", "*");
./hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/HCatFileUtil.java:71:        sb.append("__HIVE_DEFAULT_PARTITION__");
./ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:1201:        // if length of (prefix/ds=__HIVE_DEFAULT_PARTITION__/000000_0) is greater than max key prefix
./ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:1205:        // Now that (prefix/ds=__HIVE_DEFAULT_PARTITION__) is hashed to a smaller prefix it will
./ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java:187:          // to the special partition, __HIVE_DEFAULT_PARTITION__.

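I expected the name to appear only in comments, but the string literal is also used directly in code (FileUtils.java:276, FileOutputCommitterContainer.java:705, and HCatFileUtil.java:71). Those places would presumably not pick up a changed hive.exec.default.partition.name, so I wonder whether that is OK.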
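The concern can be illustrated with a small sketch. It assumes a simplified rule taken from the property description (a null or empty dynamic partition value falls back to the configured default name) and compares a check against the hard-coded literal with a check against the configured name. All names here (DefaultPartitionSketch, partitionDir, isDefaultByLiteral, isDefaultByConfig, __MY_DEFAULT__) are hypothetical. None of this is Hive's code, and I have not confirmed that Hive actually misbehaves this way.

```java
// Hypothetical illustration of a hard-coded literal drifting from a configured value.
final class DefaultPartitionSketch {

  static final String LITERAL = "__HIVE_DEFAULT_PARTITION__";

  // Builds "col=value", substituting defaultName when the value is null or empty.
  // Assumption: values that cannot be escaped are not handled in this sketch.
  static String partitionDir(String col, String value, String defaultName) {
    String v = (value == null || value.isEmpty()) ? defaultName : value;
    return col + "=" + v;
  }

  // Recognizes the default partition by comparing against the hard-coded literal.
  static boolean isDefaultByLiteral(String dir) {
    return dir.endsWith("=" + LITERAL);
  }

  // Recognizes the default partition by comparing against the configured name.
  static boolean isDefaultByConfig(String dir, String configuredName) {
    return dir.endsWith("=" + configuredName);
  }

  public static void main(String[] args) {
    String configured = "__MY_DEFAULT__"; // assumed user override
    String dir = partitionDir("ds", null, configured);

    System.out.println(dir);                                // ds=__MY_DEFAULT__
    System.out.println(isDefaultByLiteral(dir));            // false
    System.out.println(isDefaultByConfig(dir, configured)); // true
  }
}
```

As long as nobody overrides the property, both checks agree, which may be why literals like these go unnoticed.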

That's all for today.