Hive CookBook

Query Configuration

Set the job name: set mapreduce.job.name=xxx_yourname;

Specify the job queue: set mapreduce.job.queuename=offline.data;

Print a header row in the query output: set hive.cli.print.header=true;

Set the job priority: set mapreduce.job.priority=HIGH;

Priority notes:
Set mapreduce.job.priority to adjust a job's priority. Supported values:

Value             YARN priority   Description
NORMAL (default)  1               Normal jobs
HIGH              2               Nearline jobs
VERY_HIGH         3               Online jobs
CRITICAL          6               Critical jobs
VERY_CRITICAL     9               Extremely critical jobs
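
Putting these together, a typical session preamble might look like the following (the job name and queue are placeholders; substitute your own):

set mapreduce.job.name=xxx_yourname;
set mapreduce.job.queuename=offline.data;
set hive.cli.print.header=true;
set mapreduce.job.priority=HIGH;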

Reduce-stage memory configuration (keep the JVM heap -Xmx below the container size set in mapreduce.reduce.memory.mb):

set mapreduce.reduce.java.opts=-Xmx3072m;
set mapreduce.reduce.memory.mb=4096;

Map-stage memory configuration (the same heap-versus-container rule applies):

set mapreduce.map.java.opts=-Xmx3072m;
set mapreduce.map.memory.mb=4096;

Too many map tasks (raising the split size reduces the number of splits, and therefore mappers):

set mapred.min.split.size=536870912;
set mapred.max.split.size=536870912;
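
On Hadoop 2.x, the mapred.* names above are deprecated aliases; the current equivalents are:

set mapreduce.input.fileinputformat.split.minsize=536870912;
set mapreduce.input.fileinputformat.split.maxsize=536870912;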

Querying Data

Computing per-group Top N or top percentiles with RANK()

Docs: Rank()

RANK() adds a new column to every row of the original data, holding that row's rank within its group. You can then keep the Top N rows by filtering on the rank, or, by comparing the rank against the group's total row count, keep the top n% of rows (a percentile sketch follows the example below).

Example:

SELECT
  *
FROM
  (
    SELECT
      ut,
      uid,
      gid,
      it,
      RANK() OVER (
        PARTITION BY gid
        ORDER BY
          it
      ) AS gir
    FROM
      is_daily
    WHERE
      `date` = "20170803"
      AND `is` = "f"
  ) AS tmp_t
WHERE
  gir < 10
LIMIT
  100
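
For the top-percentile case, PERCENT_RANK() computes each row's relative rank within its group directly, so no comparison against the group's row count is needed. A minimal sketch keeping the top 10% per group (the 0.1 cutoff is an arbitrary example):

SELECT
  *
FROM
  (
    SELECT
      ut,
      uid,
      gid,
      it,
      PERCENT_RANK() OVER (
        PARTITION BY gid
        ORDER BY
          it
      ) AS pr
    FROM
      is_daily
    WHERE
      `date` = "20170803"
      AND `is` = "f"
  ) AS tmp_t
WHERE
  pr <= 0.1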

Fast approximate counting with NDV()

Docs: NDV()

Faster and more memory-efficient than COUNT(DISTINCT ...), but with some error (it is based on HyperLogLog). For very large distinct counts where only the order of magnitude needs to be right, e.g. monthly active users (MAU), consider NDV() in place of COUNT(DISTINCT ...).

NDV() can only count a single column, so counting over a combination of columns requires concatenating them first.

Example:

-- approximate: NDV over the concatenated columns
SELECT
  ndv(
    concat(
      cast(uid AS STRING),
      ',',
      cast(user_uid AS STRING)
    )
  )
FROM
  default.isd
WHERE
  `date` = "20170803"
  AND `is` = "f";

-- exact equivalent, slower and more memory-hungry
SELECT
  count(DISTINCT uid, user_uid)
FROM
  default.isd
WHERE
  `date` = "20170803"
  AND `is` = "f";

Flattening nested array structures with LATERAL VIEW and explode()

Split a string into an array, then explode the array.

Example: it is a string, with sample value 'refresh,push'

select
  uid,
  ut,
  tb.it
from
  t LATERAL VIEW explode(split(it, ',')) tb as it
where
  `date` = '20161229'
limit
  100
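
A plain LATERAL VIEW drops source rows whose array is empty; LATERAL VIEW OUTER keeps them, yielding NULL for the exploded column. A minimal sketch on the same table:

select
  uid,
  ut,
  tb.it
from
  t LATERAL VIEW OUTER explode(split(it, ',')) tb as it
where
  `date` = '20161229'
limit
  100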

Flattening nested map structures with LATERAL VIEW and explode()

Filter on the map values and keep the keys that qualify (a concrete value filter is sketched after the example below).

Example: lt_lk_keyword is a map, with sample value {123: 0.1234, 234: 0.2345}

select
  uid,
  ut,
  key,
  value
from
  x LATERAL VIEW explode(lt_lk_keyword) tb as key, value
where
  `date` = '20151229'
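
To actually filter on the value, add a predicate on the exploded column; the 0.2 threshold below is an arbitrary example:

select
  uid,
  ut,
  key,
  value
from
  x LATERAL VIEW explode(lt_lk_keyword) tb as key, value
where
  `date` = '20151229'
  and value > 0.2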

Handling JSON with LATERAL VIEW and json_tuple()

Example:

SELECT
  *
FROM
  (
    select
      *
    from
      -- json_tuple aliases are positional: "gs" -> gs, "uid" -> uid
      x LATERAL VIEW json_tuple(extra, "gs", "uid") t1 as gs, uid
  ) g
  JOIN default.web_article_item_dict as i on g.date = i.date
  and g.kid = iid
where
  g.date = "20171016"
limit
  10
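
When only one or two fields are needed, Hive's get_json_object() works without a lateral view; json_tuple() is preferable for several keys, since it parses the JSON once per row. A minimal sketch over the same extra column:

select
  get_json_object(extra, '$.uid') as uid,
  get_json_object(extra, '$.gs') as gs
from
  x
limit
  10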

Exporting Data

Saving query results to a Hive table

Create a table directly from a query result:

CREATE TABLE test.tb_abc AS
SELECT DISTINCT uit, uid
FROM default.isd
WHERE `date` = "20150803"
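
CTAS only works if the target table does not exist yet; to refill an existing table, use INSERT OVERWRITE instead. A minimal sketch against the same source:

INSERT OVERWRITE TABLE test.tb_abc
SELECT DISTINCT uit, uid
FROM default.isd
WHERE `date` = "20150803"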

Data Definition

Create an empty partitioned Parquet table

CREATE EXTERNAL TABLE test.tmp_users1(
  ut tinyint,
  uid bigint
)
PARTITIONED BY (
  `date` string
)
STORED AS PARQUET;
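
To populate one partition of this table from a query (the source table and filter are reused from earlier examples as placeholders):

INSERT OVERWRITE TABLE test.tmp_users1
PARTITION (`date` = '20170803')
SELECT ut, uid
FROM default.isd
WHERE `date` = '20170803';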

Create an external table over JSON data

Docs: CreateTable

Example: create a temporary external table with two fields.

HDFS data: xx.json. The directory may contain multiple files. JSON file content:

{"ut": 12, "uid": 8}
{"ut": 14, "uid": 12345}

CREATE EXTERNAL TABLE test.tmp_users1(
  ut tinyint,
  uid bigint
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ('ignore.malformed.json'='true')
STORED AS TEXTFILE
LOCATION '/locate';
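
The openx JsonSerDe is a third-party SerDe; if the jar is not already on the cluster's classpath, add it before creating the table (the path below is a placeholder):

ADD JAR /path/to/json-serde-with-dependencies.jar;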

Create an external table over CSV data (data on the local filesystem)

Instead of putting the data on HDFS first, load it into the table straight from the local filesystem:

-- test.csv:
-- 12,674930234
-- 14,759234924

-- step 1: create the table
create external table csv_load_test(
  ut bigint,
  uid bigint
)
row format delimited fields terminated by ','
stored as textfile;

-- step 2: load the data
load data local inpath './test.csv' overwrite into table csv_load_test;
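
load data local inpath copies the file from the client machine into the table's storage directory, and overwrite replaces any existing contents. A quick sanity check afterwards:

select * from csv_load_test limit 10;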

Create an external table over CSV data (data on HDFS)

Docs: CreateTable

Example: create a temporary external table with two fields.

HDFS data: /users.csv. The directory may contain multiple files. CSV file content, comma-separated:

To sync the table from Hive to Impala, run in Impala: invalidate metadata test.tmp_users2;

-- /users.csv:
-- 12,8
-- 14,12345

CREATE EXTERNAL TABLE test.tmp_users2(
  ut tinyint,
  uid bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/locate';

Create a temporary table with map and array columns:

create external table test.impala_complex_type_json(
  uid string,
  properties map<string,string>,
  ids array<int>
)
ROW FORMAT DELIMITED
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
STORED AS TEXTFILE;

-- local file './a': fields separated by tabs, map entries written as key:value
-- u1	name:xiaoming,age:18,gender:male	1,2,3
-- u2	name:dagou,age:18,gender:female	4,5,6

-- CLI command
load data local inpath './a' overwrite into table test.impala_complex_type_json;
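
Map and array columns are then addressed with bracket syntax; a minimal query against the table above:

select
  uid,
  properties['name'] as name,
  ids[0] as first_id
from
  test.impala_complex_type_json;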