An Introduction to HIVE DDL & DML
Written by donaldhan, 2020-03-03 23:04
Introduction
In the previous article, on setting up a standalone HIVE environment, we built both an HA and a standalone version of Hive. Today we use the standalone instance to walk through Hive's DDL and DML syntax.
Contents
Standalone setup: creating tables, loading data from local files, and loading data from HDFS.
We will use Beeline to test the syntax below. First, open the command line:
donaldhan@pseduoDisHadoop:~$ beeline
beeline> !connect jdbc:hive2://pseduoDisHadoop:10000
Connecting to jdbc:hive2://pseduoDisHadoop:10000
Enter username for jdbc:hive2://pseduoDisHadoop:10000: hadoop
Enter password for jdbc:hive2://pseduoDisHadoop:10000: ******
Connected to: Apache Hive (version 2.3.4)
Driver: Hive JDBC (version 2.3.4)
Transaction isolation: TRANSACTION_REPEATABLE_READ
DDL
Creating a database
The simplest form:
create database test;
Check where the database is stored:
0: jdbc:hive2://pseduoDisHadoop:10000> desc database test;
+----------+----------+----------------------------------------------------+-------------+-------------+-------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+----------+----------------------------------------------------+-------------+-------------+-------------+
| test | | hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test.db | donaldhan | USER | |
+----------+----------+----------------------------------------------------+-------------+-------------+-------------+
1 row selected (0.275 seconds)
Since we did not specify a warehouse path when creating the database, it defaults to a folder under the HDFS warehouse directory: /user/hive/warehouse/<database name>.db.
To create a database at an explicit storage location:
0: jdbc:hive2://pseduoDisHadoop:10000> create database test1 location '/db/test1';
No rows affected (0.689 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc database test1;
+----------+----------+---------------------------------------+-------------+-------------+-------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+----------+---------------------------------------+-------------+-------------+-------------+
| test1 | | hdfs://pseduoDisHadoop:9000/db/test1 | donaldhan | USER | |
+----------+----------+---------------------------------------+-------------+-------------+-------------+
1 row selected (0.278 seconds)
Create the database only if it does not already exist:
0: jdbc:hive2://pseduoDisHadoop:10000> create database if not exists test1;
No rows affected (0.247 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc database test1
. . . . . . . . . . . . . . . . . . .> ;
+----------+----------+---------------------------------------+-------------+-------------+-------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+----------+---------------------------------------+-------------+-------------+-------------+
| test1 | | hdfs://pseduoDisHadoop:9000/db/test1 | donaldhan | USER | |
+----------+----------+---------------------------------------+-------------+-------------+-------------+
1 row selected (0.264 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> create database if not exists test2;
No rows affected (0.424 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc database test2;
+----------+----------+----------------------------------------------------+-------------+-------------+-------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+----------+----------------------------------------------------+-------------+-------------+-------------+
| test2 | | hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test2.db | donaldhan | USER | |
+----------+----------+----------------------------------------------------+-------------+-------------+-------------+
1 row selected (0.289 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Create a database with a comment and custom properties:
0: jdbc:hive2://pseduoDisHadoop:10000> create database test3 comment 'my test3 db' with dbproperties ('creator'='donaldhan','date'='2020-02-25');
No rows affected (0.524 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc database extended test3;
+----------+--------------+----------------------------------------------------+-------------+-------------+---------------------------------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+--------------+----------------------------------------------------+-------------+-------------+---------------------------------------+
| test3 | my test3 db | hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test3.db | donaldhan | USER | {date=2020-02-25, creator=donaldhan} |
+----------+--------------+----------------------------------------------------+-------------+-------------+---------------------------------------+
1 row selected (0.271 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Listing databases
List all databases:
0: jdbc:hive2://pseduoDisHadoop:10000> show databases;
+----------------+
| database_name |
+----------------+
| default |
| test |
+----------------+
2 rows selected (3.638 seconds)
List databases matching a pattern:
0: jdbc:hive2://pseduoDisHadoop:10000> show databases like 'test*';
+----------------+
| database_name |
+----------------+
| test |
| test1 |
| test2 |
| test3 |
+----------------+
4 rows selected (0.225 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Describe a specific database:
0: jdbc:hive2://pseduoDisHadoop:10000> desc database test3;
+----------+--------------+----------------------------------------------------+-------------+-------------+-------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+--------------+----------------------------------------------------+-------------+-------------+-------------+
| test3 | my test3 db | hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test3.db | donaldhan | USER | |
+----------+--------------+----------------------------------------------------+-------------+-------------+-------------+
1 row selected (0.31 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc database extended test3;
+----------+--------------+----------------------------------------------------+-------------+-------------+---------------------------------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+--------------+----------------------------------------------------+-------------+-------------+---------------------------------------+
| test3 | my test3 db | hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test3.db | donaldhan | USER | {date=2020-02-25, creator=donaldhan} |
+----------+--------------+----------------------------------------------------+-------------+-------------+---------------------------------------+
1 row selected (0.271 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
As shown above, desc database extended test3 returns the same columns as desc database test3, but the extended form also populates the dbproperties in the parameters column; in the HIVE CLI the extended form likewise prints more detail.
Modifying a database
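The transcript below shows test3 carrying a new modifier property, but the ALTER statement itself was not captured. Reconstructed from the output, it would have been something along these lines (a sketch; the value rain is taken from the parameters column in the output):

```sql
-- Hypothetical statement, reconstructed from the desc output:
-- add (or overwrite) a key in test3's dbproperties.
alter database test3 set dbproperties ('modifier'='rain');
```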
0: jdbc:hive2://pseduoDisHadoop:10000> desc database extended test3;
+----------+--------------+----------------------------------------------------+-------------+-------------+----------------------------------------------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+--------------+----------------------------------------------------+-------------+-------------+----------------------------------------------------+
| test3 | my test3 db | hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test3.db | donaldhan | USER | {date=2020-02-25, creator=donaldhan, modifier=rain} |
+----------+--------------+----------------------------------------------------+-------------+-------------+----------------------------------------------------+
1 row selected (0.234 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Dropping a database
To drop database test3, there are two forms:
Drop directly:
drop database test3;
Drop only if it exists:
drop database if exists test3;
0: jdbc:hive2://pseduoDisHadoop:10000> drop database test3;
No rows affected (0.819 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> drop database if exists test3;
No rows affected (0.043 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> show databases like 'test*';
+----------------+
| database_name |
+----------------+
| test |
| test1 |
| test2 |
+----------------+
3 rows selected (0.248 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> drop database test3;
Error: Error while compiling statement: FAILED: SemanticException [Error 10072]: Database does not exist: test3 (state=42000,code=10072)
Dropping a database that does not exist reports:
Error: Error while compiling statement: FAILED: SemanticException [Error 10072]: Database does not exist: test3
A database that contains one or more tables cannot be dropped directly; drop its tables first (or use cascade, below), otherwise you get:
InvalidOperationException(message:Database test3 is not empty. One or more tables exist.)
0: jdbc:hive2://pseduoDisHadoop:10000> use test1;
No rows affected (0.255 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> create table `user`(
. . . . . . . . . . . . . . . . . . .> id int,
. . . . . . . . . . . . . . . . . . .> name string)
. . . . . . . . . . . . . . . . . . .> row format delimited fields terminated by '\t';
No rows affected (0.808 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> drop database test1;
Error: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database test1 is not empty. One or more tables exist.)
...
To drop a database together with the tables it contains, add cascade for a cascading delete (use with caution):
0: jdbc:hive2://pseduoDisHadoop:10000> drop database if exists test1 cascade;
No rows affected (1.36 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> show databases like 'test*';
+----------------+
| database_name |
+----------------+
| test |
+----------------+
1 row selected (0.201 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Creating a table
create table `emp`(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t';
0: jdbc:hive2://pseduoDisHadoop:10000> use test;
No rows affected (0.227 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> create table `emp`(
. . . . . . . . . . . . . . . . . . .> empno int,
. . . . . . . . . . . . . . . . . . .> ename string,
. . . . . . . . . . . . . . . . . . .> job string,
. . . . . . . . . . . . . . . . . . .> mgr int,
. . . . . . . . . . . . . . . . . . .> hiredate string,
. . . . . . . . . . . . . . . . . . .> sal double,
. . . . . . . . . . . . . . . . . . .> comm double,
. . . . . . . . . . . . . . . . . . .> deptno int
. . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . .> row format delimited fields terminated by '\t';
No rows affected (1.406 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> show tables;
+-----------+
| tab_name |
+-----------+
| emp |
+-----------+
1 row selected (0.207 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc emp;
+-----------+------------+----------+
| col_name | data_type | comment |
+-----------+------------+----------+
| empno | int | |
| ename | string | |
| job | string | |
| mgr | int | |
| hiredate | string | |
| sal | double | |
| comm | double | |
| deptno | int | |
+-----------+------------+----------+
8 rows selected (0.289 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc extended emp;
+-----------------------------+----------------------------------------------------+-----------------+
| col_name | data_type | comment |
+-----------------------------+----------------------------------------------------+-----------------+
| empno | int | |
| ename | string | |
| job | string | |
| mgr | int | |
| hiredate | string | |
| sal | double | |
| comm | double | |
| deptno | int | |
| | NULL | NULL |
| Detailed Table Information | Table(tableName:emp, dbName:test, owner:donaldhan, createTime:1582640961, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:empno, type:int, comment:null), FieldSchema(name:ename, type:string, comment:null), FieldSchema(name:job, type:string, comment:null), FieldSchema(name:mgr, type:int, comment:null), FieldSchema(name:hiredate, type:string, comment:null), FieldSchema(name:sal, type:double, comment:null), FieldSchema(name:comm, type:double, comment:null), FieldSchema(name:deptno, type:int, comment:null)], location:hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test.db/emp, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format= | , field.delim= |
+-----------------------------+----------------------------------------------------+-----------------+
10 rows selected (0.391 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Loading data from a local file
Create the emp data file as tab-delimited text:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ vim emp.txt
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ cat emp.txt
123 donald lawer 4568 2019-02-26 8000.0 3000.0 7896
456 rain teacher 1314 2018-01-25 5000.1 2000.2 4654
789 jamel cleaner 20170609 3000.0 1000.3 7895
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$
Note that this content cannot simply be copy-pasted; a plain copy can leave the loaded data empty. In vim, use :set list to display tab characters as ^I and line endings as $ (so you can tell whether a trailing character is a tab or a space). Two consecutive tabs mean an empty field, which Hive loads as NULL.
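A reliable alternative to hand-typing tabs in vim is to generate the file with printf, which writes literal tab characters (a sketch; the path is relative and the values mirror the sample data above):

```shell
# Write the three sample rows with real tab delimiters.
# The third row leaves the mgr field empty (two consecutive tabs),
# which Hive will load as NULL.
printf '123\tdonald\tlawer\t4568\t2019-02-26\t8000.0\t3000.0\t7896\n' >  emp.txt
printf '456\train\tteacher\t1314\t2018-01-25\t5000.1\t2000.2\t4654\n' >> emp.txt
printf '789\tjamel\tcleaner\t\t20170609\t3000.0\t1000.3\t7895\n'      >> emp.txt
# Show tabs as ^I and line ends as $ to verify the delimiters (GNU cat).
cat -A emp.txt
```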
Load the local file into the table:
0: jdbc:hive2://pseduoDisHadoop:10000> load data local inpath '/bdp/hive/hiveLocalTables/emp.txt' overwrite into table emp;
No rows affected (4.38 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
3 rows selected (4.226 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Copying a table
- Copy the schema only
0: jdbc:hive2://pseduoDisHadoop:10000> create table emp2 like emp;
No rows affected (1.005 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc emp2;
+-----------+------------+----------+
| col_name | data_type | comment |
+-----------+------------+----------+
| empno | int | |
| ename | string | |
| job | string | |
| mgr | int | |
| hiredate | string | |
| sal | double | |
| comm | double | |
| deptno | int | |
+-----------+------------+----------+
8 rows selected (0.4 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp2;
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
| emp2.empno | emp2.ename | emp2.job | emp2.mgr | emp2.hiredate | emp2.sal | emp2.comm | emp2.deptno |
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
No rows selected (0.419 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
- Copy both the schema and the data (this runs a MapReduce job)
0: jdbc:hive2://pseduoDisHadoop:10000> create table emp3 as select * from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (357.932 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp3;
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
| emp3.empno | emp3.ename | emp3.job | emp3.mgr | emp3.hiredate | emp3.sal | emp3.comm | emp3.deptno |
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
3 rows selected (0.522 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Note that because this runs a MapReduce job, Hadoop's YARN must be started:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /bdp/hadoop/hadoop-2.7.1/logs/yarn-donaldhan-resourcemanager-pseduoDisHadoop.out
localhost: starting nodemanager, logging to /bdp/hadoop/hadoop-2.7.1/logs/yarn-donaldhan-nodemanager-pseduoDisHadoop.out
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ jps
2962 RunJar
3655 ResourceManager
3113 RunJar
2570 DataNode
3787 NodeManager
2428 NameNode
2765 SecondaryNameNode
4333 Jps
- Copy selected columns: create emp4 from emp, keeping only three columns: empno, ename, and job
0: jdbc:hive2://pseduoDisHadoop:10000> create table emp4 as select empno,ename,job from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (78.651 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp4;
+-------------+-------------+-----------+
| emp4.empno | emp4.ename | emp4.job |
+-------------+-------------+-----------+
| 123 | donald | lawer |
| 456 | rain | teacher |
| 789 | jamel | cleaner |
+-------------+-------------+-----------+
3 rows selected (0.332 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
This kind of copy also runs an MR job. Note that Hive-on-MR is deprecated in HIVE 2 and may be removed in future versions; consider a different execution engine (Spark, Tez) or the HIVE 1.x releases.
Inspecting tables
List all tables in the current database:
0: jdbc:hive2://pseduoDisHadoop:10000> show tables;
+-----------+
| tab_name |
+-----------+
| emp |
| emp2 |
| emp3 |
| emp4 |
+-----------+
4 rows selected (0.282 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> show tables 'emp*';
+-----------+
| tab_name |
+-----------+
| emp |
| emp2 |
| emp3 |
| emp4 |
+-----------+
4 rows selected (0.24 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Describe a table's structure:
0: jdbc:hive2://pseduoDisHadoop:10000> desc emp;
+-----------+------------+----------+
| col_name | data_type | comment |
+-----------+------------+----------+
| empno | int | |
| ename | string | |
| job | string | |
| mgr | int | |
| hiredate | string | |
| sal | double | |
| comm | double | |
| deptno | int | |
+-----------+------------+----------+
8 rows selected (0.297 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> desc extended emp;
+-----------------------------+----------------------------------------------------+-----------------+
| col_name | data_type | comment |
+-----------------------------+----------------------------------------------------+-----------------+
| empno | int | |
| ename | string | |
| job | string | |
| mgr | int | |
| hiredate | string | |
| sal | double | |
| comm | double | |
| deptno | int | |
| | NULL | NULL |
| Detailed Table Information | Table(tableName:emp, dbName:test, owner:donaldhan, createTime:1582640961, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:empno, type:int, comment:null), FieldSchema(name:ename, type:string, comment:null), FieldSchema(name:job, type:string, comment:null), FieldSchema(name:mgr, type:int, comment:null), FieldSchema(name:hiredate, type:string, comment:null), FieldSchema(name:sal, type:double, comment:null), FieldSchema(name:comm, type:double, comment:null), FieldSchema(name:deptno, type:int, comment:null)], location:hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test.db/emp, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format= | , field.delim= |
+-----------------------------+----------------------------------------------------+-----------------+
10 rows selected (0.448 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Compared with desc, desc extended shows more detail about the table, such as its storage location, owning database, owner, creation time, and last access time.
Show the statement used to create the table:
0: jdbc:hive2://pseduoDisHadoop:10000> show create table emp;
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `emp`( |
| `empno` int, |
| `ename` string, |
| `job` string, |
| `mgr` int, |
| `hiredate` string, |
| `sal` double, |
| `comm` double, |
| `deptno` int) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'='\t', |
| 'serialization.format'='\t') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://pseduoDisHadoop:9000/user/hive/warehouse/test.db/emp' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1582724633') |
+----------------------------------------------------+
22 rows selected (8.453 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Altering a table
Rename table emp4 to emp_bak:
0: jdbc:hive2://pseduoDisHadoop:10000> alter table emp4 rename to emp_bak;
No rows affected (0.704 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> show tables;
+-----------+
| tab_name |
+-----------+
| emp |
| emp2 |
| emp3 |
| emp_bak |
+-----------+
4 rows selected (0.245 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp_bak;
+----------------+----------------+--------------+
| emp_bak.empno | emp_bak.ename | emp_bak.job |
+----------------+----------------+--------------+
| 123 | donald | lawer |
| 456 | rain | teacher |
| 789 | jamel | cleaner |
+----------------+----------------+--------------+
3 rows selected (0.327 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Dropping a table
Drop table emp_bak:
0: jdbc:hive2://pseduoDisHadoop:10000> drop table if exists emp_bak;
No rows affected (4.845 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> show tables;
+-----------+
| tab_name |
+-----------+
| emp |
| emp2 |
| emp3 |
+-----------+
3 rows selected (0.23 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Truncate a table (remove all of its rows):
0: jdbc:hive2://pseduoDisHadoop:10000> truncate table emp3;
No rows affected (0.521 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp3;
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
| emp3.empno | emp3.ename | emp3.job | emp3.mgr | emp3.hiredate | emp3.sal | emp3.comm | emp3.deptno |
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
+-------------+-------------+-----------+-----------+----------------+-----------+------------+--------------+
No rows selected (0.246 seconds)
DML
This part mainly covers operations on tables.
Managed and external tables
HIVE provides two kinds of tables; let's look at each in turn.
Managed (internal) tables
Create a managed table emp_managed:
0: jdbc:hive2://pseduoDisHadoop:10000> create table emp_managed as select * from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (65.511 seconds)
Check whether emp_managed exists in HDFS (/user/hive/warehouse/test.db/emp_managed):
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -ls /user/hive/warehouse/test.db
Found 4 items
drwxrwxrwx - donaldhan supergroup 0 2020-02-26 21:43 /user/hive/warehouse/test.db/emp
drwxrwxrwx - donaldhan supergroup 0 2020-02-26 21:50 /user/hive/warehouse/test.db/emp2
drwxrwxrwx - donaldhan supergroup 0 2020-02-26 22:50 /user/hive/warehouse/test.db/emp3
drwxrwxrwx - donaldhan supergroup 0 2020-02-26 23:09 /user/hive/warehouse/test.db/emp_managed
donaldhan@pseduoDisHadoop
It exists.
Log in to the MySQL metastore database (the TBLS table holds all of Hive's tables) and inspect Hive's metadata:
mysql> use single_hive_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select * from tbls;
+--------+-------------+-------+------------------+-----------+-----------+-------+-------------+---------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+-----------+-----------+-------+-------------+---------------+--------------------+--------------------+--------------------+
| 3 | 1582640961 | 6 | 0 | donaldhan | 0 | 3 | emp | MANAGED_TABLE | NULL | NULL | |
| 6 | 1582725013 | 6 | 0 | donaldhan | 0 | 6 | emp2 | MANAGED_TABLE | NULL | NULL | |
| 7 | 1582725780 | 6 | 0 | donaldhan | 0 | 7 | emp3 | MANAGED_TABLE | NULL | NULL | |
| 9 | 1582729774 | 6 | 0 | donaldhan | 0 | 9 | emp_managed | MANAGED_TABLE | NULL | NULL | |
+--------+-------------+-------+------------------+-----------+-----------+-------+-------------+---------------+--------------------+--------------------+--------------------+
4 rows in set (0.00 sec)
mysql>
The emp_managed metadata is present.
Drop table emp_managed:
0: jdbc:hive2://pseduoDisHadoop:10000> drop table if exists emp_managed;
No rows affected (0.875 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
In MySQL, the Hive metadata has been removed:
mysql> select * from tbls;
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
| 3 | 1582640961 | 6 | 0 | donaldhan | 0 | 3 | emp | MANAGED_TABLE | NULL | NULL | |
| 6 | 1582725013 | 6 | 0 | donaldhan | 0 | 6 | emp2 | MANAGED_TABLE | NULL | NULL | |
| 7 | 1582725780 | 6 | 0 | donaldhan | 0 | 7 | emp3 | MANAGED_TABLE | NULL | NULL | |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
3 rows in set (0.01 sec)
mysql>
The HDFS folder for emp_managed has also been deleted:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -ls /user/hive/warehouse/test.db
Found 3 items
drwxrwxrwx - donaldhan supergroup 0 2020-02-26 21:43 /user/hive/warehouse/test.db/emp
drwxrwxrwx - donaldhan supergroup 0 2020-02-26 21:50 /user/hive/warehouse/test.db/emp2
drwxrwxrwx - donaldhan supergroup 0 2020-02-26 22:50 /user/hive/warehouse/test.db/emp3
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$
External tables
Create an external table:
create external table `emp_external`(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/external/emp';
0: jdbc:hive2://pseduoDisHadoop:10000> create external table `emp_external`(
. . . . . . . . . . . . . . . . . . .> empno int,
. . . . . . . . . . . . . . . . . . .> ename string,
. . . . . . . . . . . . . . . . . . .> job string,
. . . . . . . . . . . . . . . . . . .> mgr int,
. . . . . . . . . . . . . . . . . . .> hiredate string,
. . . . . . . . . . . . . . . . . . .> sal double,
. . . . . . . . . . . . . . . . . . .> comm double,
. . . . . . . . . . . . . . . . . . .> deptno int
. . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . .> row format delimited fields terminated by '\t'
. . . . . . . . . . . . . . . . . . .> location '/user/hive/warehouse/external/emp';
No rows affected (1.055 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Check Hive's metadata in MySQL:
mysql> select * from tbls;
+--------+-------------+-------+------------------+-----------+-----------+-------+--------------+----------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+-----------+-----------+-------+--------------+----------------+--------------------+--------------------+--------------------+
| 3 | 1582640961 | 6 | 0 | donaldhan | 0 | 3 | emp | MANAGED_TABLE | NULL | NULL | |
| 6 | 1582725013 | 6 | 0 | donaldhan | 0 | 6 | emp2 | MANAGED_TABLE | NULL | NULL | |
| 7 | 1582725780 | 6 | 0 | donaldhan | 0 | 7 | emp3 | MANAGED_TABLE | NULL | NULL | |
| 10 | 1582730557 | 6 | 0 | donaldhan | 0 | 10 | emp_external | EXTERNAL_TABLE | NULL | NULL | |
+--------+-------------+-------+------------------+-----------+-----------+-------+--------------+----------------+--------------------+--------------------+--------------------+
4 rows in set (0.00 sec)
mysql>
The table type is EXTERNAL_TABLE.
The newly created emp_external table has no data yet; upload emp.txt to HDFS under /user/hive/warehouse/external/emp:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -put /bdp/hive/hiveLocalTables/emp.txt /user/hive/warehouse/external/emp
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -ls /user/hive/warehouse/external/emp
Found 1 items
-rw-r--r-- 1 donaldhan supergroup 151 2020-02-26 23:29 /user/hive/warehouse/external/emp/emp.txt
Once the upload finishes, emp_external has data; query it with SQL:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp_external;
+---------------------+---------------------+-------------------+-------------------+------------------------+-------------------+--------------------+----------------------+
| emp_external.empno | emp_external.ename | emp_external.job | emp_external.mgr | emp_external.hiredate | emp_external.sal | emp_external.comm | emp_external.deptno |
+---------------------+---------------------+-------------------+-------------------+------------------------+-------------------+--------------------+----------------------+
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+---------------------+---------------------+-------------------+-------------------+------------------------+-------------------+--------------------+----------------------+
3 rows selected (0.45 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Now let's drop the table. It is an external table, so pay attention to how this differs from a managed table.
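The drop command itself was elided from the transcript; it is the same statement used for the managed table:

```sql
drop table if exists emp_external;
```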
In Hive, emp_external no longer appears:
0: jdbc:hive2://pseduoDisHadoop:10000> show tables;
+-----------+
| tab_name |
+-----------+
| emp |
| emp2 |
| emp3 |
+-----------+
3 rows selected (0.22 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
In MySQL, the metadata has been removed:
mysql> select * from tbls;
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
| 3 | 1582640961 | 6 | 0 | donaldhan | 0 | 3 | emp | MANAGED_TABLE | NULL | NULL | |
| 6 | 1582725013 | 6 | 0 | donaldhan | 0 | 6 | emp2 | MANAGED_TABLE | NULL | NULL | |
| 7 | 1582725780 | 6 | 0 | donaldhan | 0 | 7 | emp3 | MANAGED_TABLE | NULL | NULL | |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
3 rows in set (0.01 sec)
The file in HDFS is not deleted:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -ls /user/hive/warehouse/external/emp
Found 1 items
-rw-r--r-- 1 donaldhan supergroup 151 2020-02-26 23:29 /user/hive/warehouse/external/emp/emp.txt
Summary: the difference between managed and external tables. Dropping a managed (internal) table removes both the metadata in MySQL and the data in HDFS. Dropping an external table removes only the metadata in MySQL; the data in HDFS is left intact.
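As an aside, a table's kind is just a table property, so an existing managed table can be switched to external (or back) without recreating it. This is a sketch, not from the original walkthrough; note that in Hive the property value must be the upper-case string 'TRUE':

```sql
-- Convert a managed table to external: dropping it will now keep the HDFS data.
alter table emp3 set tblproperties ('EXTERNAL'='TRUE');
-- Convert it back to managed.
alter table emp3 set tblproperties ('EXTERNAL'='FALSE');
```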
Loading table data with LOAD
The syntax is:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL: load from the server's local filesystem; without it, the path is resolved against HDFS. OVERWRITE: replace the table's existing data instead of appending.
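A sketch of both variants (the HDFS source path here is illustrative):

```sql
-- LOCAL: copies a file from the server's local filesystem into the table.
load data local inpath '/bdp/hive/hiveLocalTables/emp.txt' overwrite into table emp;
-- Without LOCAL, the path is an HDFS path, and the source file is *moved*
-- into the table's directory rather than copied.
load data inpath '/tmp/emp.txt' into table emp;
```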
There are two ways to load data into a table: from a local file, or from a file already in HDFS. Let's look at the first.
Loading from a local file
Load data from the local file emp.txt into the emp table:
0: jdbc:hive2://pseduoDisHadoop:10000> load data local inpath '/bdp/hive/hiveLocalTables/emp.txt' overwrite into table emp;
No rows affected (4.38 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
3 rows selected (4.226 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> load data local inpath '/bdp/hive/hiveLocalTables/emp.txt' into table emp;
No rows affected (4.266 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
6 rows selected (4.651 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> load data local inpath '/bdp/hive/hiveLocalTables/emp.txt' overwrite into table emp;
No rows affected (4.046 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
3 rows selected (0.444 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
As shown above, with OVERWRITE the existing data is removed before the new data is loaded. Now let's look at loading from an HDFS file.
Loading data from HDFS
First, upload the file to HDFS:
donaldhan@pseduoDisHadoop:~$ hdfs dfs -mkdir -p /user/hive/data
donaldhan@pseduoDisHadoop:~$ hdfs dfs -put /bdp/hive/hiveLocalTables/emp.txt /user/hive/data
donaldhan@pseduoDisHadoop:~$ hdfs dfs -ls /user/hive/data
Found 1 items
-rw-r--r-- 1 donaldhan supergroup 151 2020-02-27 22:07 /user/hive/data/emp.txt
donaldhan@pseduoDisHadoop:~$
Load the data into the emp table:
0: jdbc:hive2://pseduoDisHadoop:10000> load data inpath '/user/hive/data/emp.txt' overwrite into table emp;
No rows affected (1.28 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
3 rows selected (0.383 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> load data inpath '/user/hive/data/emp.txt' into table emp;
Error: Error while compiling statement: FAILED: SemanticException Line 1:17 Invalid path ''/user/hive/data/emp.txt'': No files matching path hdfs://pseduoDisHadoop:9000/user/hive/data/emp.txt (state=42000,code=40000)
0: jdbc:hive2://pseduoDisHadoop:10000>
Note that when loading from HDFS, Hive moves the source file into the table's directory rather than copying it, so the file no longer exists at its original path — which is why the second LOAD above failed. With OVERWRITE, the table's previous data files are moved to the HDFS trash.
Inserting data
Insert a row into the emp table:
insert into emp(empno,ename,job) values(1001,'TOM','MANAGER');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (110.63 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 1001 | TOM | MANAGER | NULL | NULL | NULL | NULL | NULL |
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
4 rows selected (0.327 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
An INSERT actually runs as a MapReduce job.
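Since each INSERT statement launches its own MapReduce job, inserting rows one at a time is expensive. Hive's VALUES clause accepts multiple rows, so a batch like the following runs as a single job (the rows here are made up for illustration):

```sql
insert into emp(empno,ename,job) values
  (1002, 'JERRY', 'CLERK'),
  (1003, 'ANNA',  'ANALYST');
```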
Querying data
Employees in department 10:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp where deptno=10;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
No rows selected (0.884 seconds)
The employee named SMITH:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp where ename='SMITH';
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
No rows selected (0.31 seconds)
Employees with an employee number less than or equal to 7766:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp where empno <= 7766;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 1001 | TOM | MANAGER | NULL | NULL | NULL | NULL | NULL |
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
4 rows selected (0.287 seconds)
Employees with a salary between 1000 and 1500:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp where sal between 1000 and 1500;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
No rows selected (0.297 seconds)
The first 5 records:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp limit 5;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 1001 | TOM | MANAGER | NULL | NULL | NULL | NULL | NULL |
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
4 rows selected (0.315 seconds)
Employees named SCOTT or MARTIN:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp where ename in ('SCOTT','MARTIN');
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
No rows selected (0.266 seconds)
Employees with a commission:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from emp where comm is not null;
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
| 123 | donald | lawer | 4568 | 2019-02-26 | 8000.0 | 3000.0 | 7896 |
| 456 | rain | teacher | 1314 | 2018-01-25 | 5000.1 | 2000.2 | 4654 |
| 789 | jamel | cleaner | NULL | 20170609 | 3000.0 | 1000.3 | 7895 |
+------------+------------+----------+----------+---------------+----------+-----------+-------------+
3 rows selected (0.357 seconds)
Count the employees in department 10:
0: jdbc:hive2://pseduoDisHadoop:10000> select count(*) from emp where deptno=10;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+------+
| _c0 |
+------+
| 0 |
+------+
1 row selected (72.948 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
COUNT runs as a MapReduce job.
The maximum, minimum, average, and total salary:
0: jdbc:hive2://pseduoDisHadoop:10000> select max(sal),min(sal),avg(sal),sum(sal) from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+---------+--------------------+----------+
| _c0 | _c1 | _c2 | _c3 |
+---------+---------+--------------------+----------+
| 8000.0 | 3000.0 | 5333.366666666667 | 16000.1 |
+---------+---------+--------------------+----------+
1 row selected (65.214 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Departments with an average salary above 2000:
0: jdbc:hive2://pseduoDisHadoop:10000> select deptno,avg(sal) from emp group by deptno having avg(sal) > 2000;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+---------+
| deptno | _c1 |
+---------+---------+
| 4654 | 5000.1 |
| 7895 | 3000.0 |
| 7896 | 8000.0 |
+---------+---------+
3 rows selected (60.831 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Employee names and salary grades, displayed according to the following rules:
select ename, sal,
case
when sal < 1000 then 'lower'
when sal > 1000 and sal <= 2000 then 'middle'
when sal > 2000 and sal <= 4000 then 'high'
else 'highest' end
from emp;
0: jdbc:hive2://pseduoDisHadoop:10000> select ename, sal,
. . . . . . . . . . . . . . . . . . .> case
. . . . . . . . . . . . . . . . . . .> when sal < 1000 then 'lower'
. . . . . . . . . . . . . . . . . . .> when sal > 1000 and sal <= 2000 then 'middle'
. . . . . . . . . . . . . . . . . . .> when sal > 2000 and sal <= 4000 then 'high'
. . . . . . . . . . . . . . . . . . .> else 'highest' end
. . . . . . . . . . . . . . . . . . .> from emp;
+---------+---------+----------+
| ename | sal | _c2 |
+---------+---------+----------+
| TOM | NULL | highest |
| donald | 8000.0 | highest |
| rain | 5000.1 | highest |
| jamel | 3000.0 | high |
+---------+---------+----------+
4 rows selected (0.263 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
As shown above, aggregate operations all require a MapReduce job.
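As another example of an aggregate that triggers a MapReduce job, counting employees per department and ordering by the count can be done in one statement (a sketch; the result depends on the data currently in emp):

```sql
select deptno, count(*) as cnt
from emp
group by deptno
order by cnt desc;
```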
Exporting data
Syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
The LOCAL keyword works the same way as in LOAD: with LOCAL the data is exported to the local file system, and without it to HDFS. Note that OVERWRITE is required here — the target directory's contents are deleted before the export.
Exporting data to the local file system
Export the data in the emp table to the local directory /bdp/hive/data/tmp:
0: jdbc:hive2://pseduoDisHadoop:10000> insert overwrite local directory '/bdp/hive/data/tmp' row format delimited fields terminated by '\t' select * from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (50.128 seconds)
Check the files:
donaldhan@pseduoDisHadoop:~$ cd /bdp/hive/data/tmp/
donaldhan@pseduoDisHadoop:/bdp/hive/data/tmp$ ll
total 16
drwxrwxr-x 2 donaldhan donaldhan 4096 Feb 27 22:41 ./
drwxrwxr-x 3 donaldhan donaldhan 4096 Feb 27 22:41 ../
-rw-r--r-- 1 donaldhan donaldhan 185 Feb 27 22:41 000000_0
-rw-r--r-- 1 donaldhan donaldhan 12 Feb 27 22:41 .000000_0.crc
donaldhan@pseduoDisHadoop:/bdp/hive/data/tmp$ cat 000000_0
1001 TOM MANAGER \N \N \N \N \N
123 donald lawer 4568 2019-02-26 8000.0 3000.0 7896
456 rain teacher 1314 2018-01-25 5000.1 2000.2 4654
789 jamel cleaner \N 20170609 3000.0 1000.3 7895
donaldhan@pseduoDisHadoop:/bdp/hive/data/tmp$
Run it again:
0: jdbc:hive2://pseduoDisHadoop:10000> insert overwrite local directory '/bdp/hive/data/tmp' row format delimited fields terminated by '\t' select * from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (46.835 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
donaldhan@pseduoDisHadoop:/bdp/hive/data/tmp$ ll
total 16
drwxrwxr-x 2 donaldhan donaldhan 4096 Feb 27 22:45 ./
drwxrwxr-x 3 donaldhan donaldhan 4096 Feb 27 22:41 ../
-rw-r--r-- 1 donaldhan donaldhan 185 Feb 27 22:45 000000_0
-rw-r--r-- 1 donaldhan donaldhan 12 Feb 27 22:45 .000000_0.crc
donaldhan@pseduoDisHadoop:/bdp/hive/data/tmp$ cat 000000_0
1001 TOM MANAGER \N \N \N \N \N
123 donald lawer 4568 2019-02-26 8000.0 3000.0 7896
456 rain teacher 1314 2018-01-25 5000.1 2000.2 4654
789 jamel cleaner \N 20170609 3000.0 1000.3 7895
donaldhan@pseduoDisHadoop:/bdp/hive/data/tmp$
The file contents are overwritten.
Exporting data to HDFS
Export the data in the emp table to the HDFS directory /user/hive/data:
0: jdbc:hive2://pseduoDisHadoop:10000> insert overwrite directory '/user/hive/data' row format delimited fields terminated by '\t' select * from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (54.917 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Check the file in HDFS:
donaldhan@pseduoDisHadoop:~$ hdfs dfs -ls /user/hive/data
Found 1 items
-rwxr-xr-x 1 donaldhan supergroup 185 2020-02-27 22:48 /user/hive/data/000000_0
donaldhan@pseduoDisHadoop:~$ hdfs dfs -cat /user/hive/data/000000_0
1001 TOM MANAGER \N \N \N \N \N
123 donald lawer 4568 2019-02-26 8000.0 3000.0 7896
456 rain teacher 1314 2018-01-25 5000.1 2000.2 4654
789 jamel cleaner \N 20170609 3000.0 1000.3 7895
donaldhan@pseduoDisHadoop:~$
donaldhan@pseduoDisHadoop:~$
Run it again:
0: jdbc:hive2://pseduoDisHadoop:10000> insert overwrite directory '/user/hive/data' row format delimited fields terminated by '\t' select * from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (55.794 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
donaldhan@pseduoDisHadoop:~$ hdfs dfs -ls /user/hive/data
Found 1 items
-rwxr-xr-x 1 donaldhan supergroup 185 2020-02-27 22:51 /user/hive/data/000000_0
donaldhan@pseduoDisHadoop:~$ hdfs dfs -cat /user/hive/data/000000_0
1001 TOM MANAGER \N \N \N \N \N
123 donald lawer 4568 2019-02-26 8000.0 3000.0 7896
456 rain teacher 1314 2018-01-25 5000.1 2000.2 4654
789 jamel cleaner \N 20170609 3000.0 1000.3 7895
donaldhan@pseduoDisHadoop:~$
The data is overwritten.
Partitioned tables
Hive supports partitioned tables, which are mainly used to address the performance problems caused by a single table holding too much data. Hive has two kinds of partitioning: static and dynamic.
Static partitioning
Static partitioning comes in two forms: single-level and multi-level. Let's look at each.
- Single-level partitioning
Create a partitioned order table:
create table `order_partition`(
order_number string,
event_time string
)
partitioned by (event_month string)
row format delimited fields terminated by '\t';
0: jdbc:hive2://pseduoDisHadoop:10000> create table `order_partition`(
. . . . . . . . . . . . . . . . . . .> order_number string,
. . . . . . . . . . . . . . . . . . .> event_time string
. . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . .> partitioned by (event_month string)
. . . . . . . . . . . . . . . . . . .> row format delimited fields terminated by '\t';
No rows affected (1.356 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
0: jdbc:hive2://pseduoDisHadoop:10000> show tables;
+------------------------+
| tab_name |
+------------------------+
| emp |
| emp2 |
| emp3 |
| order_partition |
| values__tmp__table__1 |
+------------------------+
5 rows selected (0.251 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Let's first load data into the partitioned table from a local file.
- Loading a local file into a partitioned table. Load the data in order.txt into the order_partition table:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ vim order.txt
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ pwd
/bdp/hive/hiveLocalTables
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ cat order.txt
1 2020-02-27 22:59:23
2 2020-02-27 22:59:24
3 2020-02-27 22:59:25
0: jdbc:hive2://pseduoDisHadoop:10000> load data local inpath '/bdp/hive/hiveLocalTables/order.txt' overwrite into table order_partition partition (event_month='2020-02');
No rows affected (2.185 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from order_partition;
+-------------------------------+-----------------------------+------------------------------+
| order_partition.order_number | order_partition.event_time | order_partition.event_month |
+-------------------------------+-----------------------------+------------------------------+
| 1 | 2020-02-27 22:59:23 | 2020-02 |
| 2 | 2020-02-27 22:59:24 | 2020-02 |
| 3 | 2020-02-27 22:59:25 | 2020-02 |
+-------------------------------+-----------------------------+------------------------------+
3 rows selected (0.34 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Looking at HDFS, the order_partition directory now contains a subdirectory named after the partition column and value (event_month=2020-02), and the file inside it is our local order.txt:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -ls /user/hive/warehouse/test.db/order_partition
Found 1 items
drwxrwxrwx - donaldhan supergroup 0 2020-02-27 23:02 /user/hive/warehouse/test.db/order_partition/event_month=2020-02
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -ls /user/hive/warehouse/test.db/order_partition/event_month=2020-02
Found 1 items
-rwxrwxrwx 1 donaldhan supergroup 66 2020-02-27 23:02 /user/hive/warehouse/test.db/order_partition/event_month=2020-02/order.txt
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -cat /user/hive/warehouse/test.db/order_partition/event_month=2020-02/order.txt
1 2020-02-27 22:59:23
2 2020-02-27 22:59:24
3 2020-02-27 22:59:25
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$
- Loading data with the Hadoop shell
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -rm -r /user/hive/warehouse/test.db/order_partition
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -rm -r /user/hive/warehouse/test.db/order_partition
20/02/27 23:19:26 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hive/warehouse/test.db/order_partition
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -ls /user/hive/warehouse/test.db/order_partition
ls: `/user/hive/warehouse/test.db/order_partition': No such file or directory
We accidentally deleted the whole table directory instead of just the partition subdirectory — but that's fine, this is what we wanted to demonstrate; deleting only the partition subdirectory under the table directory would also have worked.
Query the data; the table is now empty:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from order_partition;
+-------------------------------+-----------------------------+------------------------------+
| order_partition.order_number | order_partition.event_time | order_partition.event_month |
+-------------------------------+-----------------------------+------------------------------+
+-------------------------------+-----------------------------+------------------------------+
No rows selected (0.32 seconds)
Recreate the partition directory and upload the local data file order.txt to the corresponding partition directory /user/hive/warehouse/test.db/order_partition/event_month=2020-02:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -mkdir /user/hive/warehouse/test.db/order_partition
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -mkdir /user/hive/warehouse/test.db/order_partition/event_month=2020-02
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -put /bdp/hive/hiveLocalTables/order.txt /user/hive/warehouse/test.db/order_partition/event_month=2020-02
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -ls /user/hive/warehouse/test.db/order_partition/event_month=2020-02
Found 1 items
-rw-r--r-- 1 donaldhan supergroup 66 2020-02-27 23:22 /user/hive/warehouse/test.db/order_partition/event_month=2020-02/order.txt
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$
Check the data:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from order_partition;
+-------------------------------+-----------------------------+------------------------------+
| order_partition.order_number | order_partition.event_time | order_partition.event_month |
+-------------------------------+-----------------------------+------------------------------+
| 1 | 2020-02-27 22:59:23 | 2020-02 |
| 2 | 2020-02-27 22:59:24 | 2020-02 |
| 3 | 2020-02-27 22:59:25 | 2020-02 |
+-------------------------------+-----------------------------+------------------------------+
3 rows selected (0.306 seconds)
The data is back. If the data does not show up, we can repair the partition metadata with the following command:
msck repair table order_partition;
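Besides msck repair, which rescans the whole table directory, a single hand-created partition directory can also be registered explicitly (a sketch for the partition above):

```sql
-- Register one partition without scanning the whole table directory
alter table order_partition add if not exists
partition (event_month='2020-02');
```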
Next, create another partition directory, event_month=2020-01, under the Hive warehouse directory and upload the data file order.txt to it:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -mkdir /user/hive/warehouse/test.db/order_partition/event_month=2020-01
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ hdfs dfs -put /bdp/hive/hiveLocalTables/order.txt /user/hive/warehouse/test.db/order_partition/event_month=2020-01
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$
This time, querying again does not show the new data; after repairing the partition metadata, it appears:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from order_partition;
+-------------------------------+-----------------------------+------------------------------+
| order_partition.order_number | order_partition.event_time | order_partition.event_month |
+-------------------------------+-----------------------------+------------------------------+
| 1 | 2020-02-27 22:59:23 | 2020-02 |
| 2 | 2020-02-27 22:59:24 | 2020-02 |
| 3 | 2020-02-27 22:59:25 | 2020-02 |
+-------------------------------+-----------------------------+------------------------------+
3 rows selected (0.299 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> msck repair table order_partition;
No rows affected (0.776 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from order_partition;
+-------------------------------+-----------------------------+------------------------------+
| order_partition.order_number | order_partition.event_time | order_partition.event_month |
+-------------------------------+-----------------------------+------------------------------+
| 1 | 2020-02-27 22:59:23 | 2020-01 |
| 2 | 2020-02-27 22:59:24 | 2020-01 |
| 3 | 2020-02-27 22:59:25 | 2020-01 |
| 1 | 2020-02-27 22:59:23 | 2020-02 |
| 2 | 2020-02-27 22:59:24 | 2020-02 |
| 3 | 2020-02-27 22:59:25 | 2020-02 |
+-------------------------------+-----------------------------+------------------------------+
6 rows selected (0.329 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
For partitioned tables, a bare select * is not recommended — it performs poorly. Add a partition condition to the query so that Hive reads data only from the specified partition:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from order_partition where event_month='2020-02';
+-------------------------------+-----------------------------+------------------------------+
| order_partition.order_number | order_partition.event_time | order_partition.event_month |
+-------------------------------+-----------------------------+------------------------------+
| 1 | 2020-02-27 22:59:23 | 2020-02 |
| 2 | 2020-02-27 22:59:24 | 2020-02 |
| 3 | 2020-02-27 22:59:25 | 2020-02 |
+-------------------------------+-----------------------------+------------------------------+
3 rows selected (0.483 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
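To confirm that such a filter really prunes partitions, you can inspect the query plan; the paths it lists should only mention the event_month=2020-02 directory (a quick check, not part of the original transcript):

```sql
explain select * from order_partition where event_month='2020-02';
```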
mysql> select * from partitions;
+---------+-------------+------------------+---------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+---------------------+-------+--------+
| 1 | 1582815779 | 0 | event_month=2020-02 | 17 | 16 |
| 2 | 1582817419 | 0 | event_month=2020-01 | 18 | 16 |
+---------+-------------+------------------+---------------------+-------+--------+
2 rows in set
mysql> select * from partition_keys;
+--------+--------------+-------------+-----------+-------------+
| TBL_ID | PKEY_COMMENT | PKEY_NAME | PKEY_TYPE | INTEGER_IDX |
+--------+--------------+-------------+-----------+-------------+
| 16 | NULL | event_month | string | 0 |
+--------+--------------+-------------+-----------+-------------+
1 row in set
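The same partition information can be read from Hive itself, without connecting to the metastore database:

```sql
-- Lists one line per partition, matching the PART_NAME values
-- in the partitions table above
show partitions order_partition;
```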
- Multi-level partitioning
Create the table order_multi_partition:
create table `order_multi_partition`(
order_number string,
event_time string
)
partitioned by (event_month string, step string)
row format delimited fields terminated by '\t';
Load data into order_multi_partition:
0: jdbc:hive2://pseduoDisHadoop:10000> create table `order_multi_partition`(
. . . . . . . . . . . . . . . . . . .> order_number string,
. . . . . . . . . . . . . . . . . . .> event_time string
. . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . .> partitioned by (event_month string, step string)
. . . . . . . . . . . . . . . . . . .> row format delimited fields terminated by '\t';
No rows affected (1.521 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> load data local inpath '/bdp/hive/hiveLocalTables/order.txt' overwrite into table order_multi_partition partition (event_month='2020-02',step=1);
No rows affected (4.165 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from order_multi_partition;
+-------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
| order_multi_partition.order_number | order_multi_partition.event_time | order_multi_partition.event_month | order_multi_partition.step |
+-------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
| 1 | 2020-02-27 22:59:23 | 2020-02 | 1 |
| 2 | 2020-02-27 22:59:24 | 2020-02 | 1 |
| 3 | 2020-02-27 22:59:25 | 2020-02 | 1 |
+-------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
3 rows selected (4.09 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Change step to 2 and load the data again:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from order_multi_partition;
+-------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
| order_multi_partition.order_number | order_multi_partition.event_time | order_multi_partition.event_month | order_multi_partition.step |
+-------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
| 1 | 2020-02-27 22:59:23 | 2020-02 | 1 |
| 2 | 2020-02-27 22:59:24 | 2020-02 | 1 |
| 3 | 2020-02-27 22:59:25 | 2020-02 | 1 |
| 1 | 2020-02-27 22:59:23 | 2020-02 | 2 |
| 2 | 2020-02-27 22:59:24 | 2020-02 | 2 |
| 3 | 2020-02-27 22:59:25 | 2020-02 | 2 |
+-------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
6 rows selected (0.51 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Check the directory structure in HDFS:
donaldhan@pseduoDisHadoop:~$ hdfs dfs -ls /user/hive/warehouse/test.db/order_multi_partition/event_month=2020-02
Found 2 items
drwxrwxrwx - donaldhan supergroup 0 2020-03-02 22:34 /user/hive/warehouse/test.db/order_multi_partition/event_month=2020-02/step=1
drwxrwxrwx - donaldhan supergroup 0 2020-03-02 22:36 /user/hive/warehouse/test.db/order_multi_partition/event_month=2020-02/step=2
donaldhan@pseduoDisHadoop:~$
As shown above, the only difference between single-level and multi-level partitioning is that a multi-level partition maps to a nested directory hierarchy in HDFS.
Dynamic partitioning
Hive defaults to static partitioning. To use dynamic partitioning, the following parameters must be set. Here they are set for the session only; you can also put them in the configuration file (hive-site.xml) to make them permanent.
Enable dynamic partitioning (false by default):
set hive.exec.dynamic.partition=true;
Set the dynamic partition mode. The default is strict, which requires at least one partition column to be given a static value; nonstrict allows all partition columns to be dynamic:
set hive.exec.dynamic.partition.mode=nonstrict;
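In strict mode, the leading partition column must be given a static value and only trailing columns may be dynamic. A sketch with a hypothetical table logs partitioned by (dt, hour) — the table and column names here are made up for illustration:

```sql
-- dt is static, hour is dynamic: allowed even in strict mode
insert into table logs partition (dt='2020-03-01', hour)
select msg, hour from staging_logs;
```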
Create the student table:
create table `student`(
id int,
name string,
tel string,
age int
)
row format delimited fields terminated by '\t';
The contents of student.txt are:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ vim student.txt
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ cat student.txt
1 donald 15965839766 23
2 rain 13697082376 18
3 jamel 15778566988 36
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$
Load the contents of student.txt into the student table:
load data local inpath '/bdp/hive/hiveLocalTables/student.txt' overwrite into table student;
0: jdbc:hive2://pseduoDisHadoop:10000> load data local inpath '/bdp/hive/hiveLocalTables/student.txt' overwrite into table student;
No rows affected (1.649 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from student;
+-------------+---------------+--------------+--------------+
| student.id | student.name | student.tel | student.age |
+-------------+---------------+--------------+--------------+
| 1 | donald | 15965839766 | 23 |
| 2 | rain | 13697082376 | 18 |
| 3 | jamel | 15778566988 | 36 |
+-------------+---------------+--------------+--------------+
3 rows selected (0.414 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Create the partitioned table stu_age_partition:
create table `stu_age_partition`(
id int,
name string,
tel string
)
partitioned by (age int)
row format delimited fields terminated by '\t';
Insert the data from the student table into stu_age_partition, partitioned by age. If student held many rows, inserting them into each partition one INSERT at a time would be impractical — this is exactly what Hive's dynamic partitioning is for:
insert into table stu_age_partition partition (age) select id,name,tel,age from student;
0: jdbc:hive2://pseduoDisHadoop:10000> insert into table stu_age_partition partition (age) select id,name,tel,age from student;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (104.21 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from stu_age_partition;
+-----------------------+-------------------------+------------------------+------------------------+
| stu_age_partition.id | stu_age_partition.name | stu_age_partition.tel | stu_age_partition.age |
+-----------------------+-------------------------+------------------------+------------------------+
| 2 | rain | 13697082376 | 18 |
| 1 | donald | 15965839766 | 23 |
| 3 | jamel | 15778566988 | 36 |
+-----------------------+-------------------------+------------------------+------------------------+
3 rows selected (0.473 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
When querying, add a partition condition for better performance:
select * from stu_age_partition;
select * from stu_age_partition where age > 20;
0: jdbc:hive2://pseduoDisHadoop:10000> select * from stu_age_partition where age > 20;
+-----------------------+-------------------------+------------------------+------------------------+
| stu_age_partition.id | stu_age_partition.name | stu_age_partition.tel | stu_age_partition.age |
+-----------------------+-------------------------+------------------------+------------------------+
| 1 | donald | 15965839766 | 23 |
| 3 | jamel | 15778566988 | 36 |
+-----------------------+-------------------------+------------------------+------------------------+
2 rows selected (0.998 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from stu_age_partition;
+-----------------------+-------------------------+------------------------+------------------------+
| stu_age_partition.id | stu_age_partition.name | stu_age_partition.tel | stu_age_partition.age |
+-----------------------+-------------------------+------------------------+------------------------+
| 2 | rain | 13697082376 | 18 |
| 1 | donald | 15965839766 | 23 |
| 3 | jamel | 15778566988 | 36 |
+-----------------------+-------------------------+------------------------+------------------------+
3 rows selected (0.337 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
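You can confirm that the partition filter actually prunes input paths by inspecting the query plan (a sketch; the plan output differs between Hive versions):

```sql
-- The plan for the filtered query should list only the age=23 and
-- age=36 partition directories as inputs, not age=18
explain select * from stu_age_partition where age > 20;
```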
Operations on complex data types
For this part we recommend connecting through HiveServer2 (Beeline). HiveServer2 removes the single point of failure of HiveServer1 and supports a highly available Hive setup, and Beeline's tabular output is more intuitive and readable than the old Hive CLI (similar to the MySQL shell).
Array
Create a table tb_array containing an array column:
create table `tb_array`(
name string,
work_locations array<string>
)
row format delimited fields terminated by '\t'
collection items terminated by ',';
0: jdbc:hive2://pseduoDisHadoop:10000> desc tb_array;
+-----------------+----------------+----------+
| col_name | data_type | comment |
+-----------------+----------------+----------+
| name | string | |
| work_locations | array<string> | |
+-----------------+----------------+----------+
2 rows selected (0.399 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Load data into tb_array. The contents of hive_array.txt are as follows:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ vim hive_array.txt
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ cat hive_array.txt
jamel guangzhou,hangzhou
rain fuyang,tongling,nanchang
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$
The load command:
load data local inpath '/bdp/hive/hiveLocalTables/hive_array.txt' overwrite into table tb_array;
Table contents:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from tb_array;
+----------------+-----------------------------------+
| tb_array.name | tb_array.work_locations |
+----------------+-----------------------------------+
| jamel | ["guangzhou","hangzhou"] |
| rain | ["fuyang","tongling","nanchang"] |
+----------------+-----------------------------------+
2 rows selected (4.112 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Query jamel's first work location (array indices start at 0):
select name,work_locations[0] from tb_array where name='jamel';
0: jdbc:hive2://pseduoDisHadoop:10000> select name,work_locations[0] from tb_array where name='jamel';
+--------+------------+
| name | _c1 |
+--------+------------+
| jamel | guangzhou |
+--------+------------+
1 row selected (1.883 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Query the number of work locations per person using size():
select name,size(work_locations) from tb_array;
0: jdbc:hive2://pseduoDisHadoop:10000> select name,size(work_locations) from tb_array;
+--------+------+
| name | _c1 |
+--------+------+
| jamel | 2 |
| rain | 3 |
+--------+------+
2 rows selected (0.434 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
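Beyond indexing and size(), an array column can be flattened into one row per element with `explode` and a lateral view; a sketch against the same table (with the sample data this yields 2 rows for jamel and 3 for rain):

```sql
-- Flatten each person's work_locations array into one row per location
select name, location
from tb_array
lateral view explode(work_locations) loc_view as location;
```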
Map
Create a table tb_map containing a map column:
create table `tb_map`(
name string,
scores map<string,int>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
0: jdbc:hive2://pseduoDisHadoop:10000> desc tb_map;
+-----------+------------------+----------+
| col_name | data_type | comment |
+-----------+------------------+----------+
| name | string | |
| scores | map<string,int> | |
+-----------+------------------+----------+
2 rows selected (0.348 seconds)
Load data into tb_map. The contents of hive_map.txt are as follows:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ vim hive_map.txt
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ cat hive_map.txt
jamel math:100,chinese:88,english:96
rain math:89,chinese:68,english:78
The load command:
load data local inpath '/bdp/hive/hiveLocalTables/hive_map.txt' overwrite into table tb_map;
0: jdbc:hive2://pseduoDisHadoop:10000> load data local inpath '/bdp/hive/hiveLocalTables/hive_map.txt' overwrite into table tb_map;
No rows affected (2.021 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000> select * from tb_map;
+--------------+-----------------------------------------+
| tb_map.name | tb_map.scores |
+--------------+-----------------------------------------+
| jamel | {"math":100,"chinese":88,"english":96} |
| rain | {"math":89,"chinese":68,"english":78} |
+--------------+-----------------------------------------+
2 rows selected (0.364 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Query every student's English score:
select name,scores['english'] from tb_map;
0: jdbc:hive2://pseduoDisHadoop:10000> select name,scores['english'] from tb_map;
+--------+------+
| name | _c1 |
+--------+------+
| jamel | 96 |
| rain | 78 |
+--------+------+
2 rows selected (0.344 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
Query every student's English and math scores:
select name,scores['english'],scores['math'] from tb_map;
0: jdbc:hive2://pseduoDisHadoop:10000> select name,scores['english'],scores['math'] from tb_map;
+--------+------+------+
| name | _c1 | _c2 |
+--------+------+------+
| jamel | 96 | 100 |
| rain | 78 | 89 |
+--------+------+------+
2 rows selected (0.323 seconds)
0: jdbc:hive2://pseduoDisHadoop:10000>
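Hive also provides built-in functions that operate on the map as a whole, such as `map_keys` and `map_values`; a sketch against the same table:

```sql
-- map_keys returns the subject names as an array, map_values the scores
select name, map_keys(scores), map_values(scores) from tb_map;
-- size() works on maps too, counting key/value pairs
select name, size(scores) from tb_map;
```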
Struct
Create a table tb_struct containing a struct column.
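The CREATE TABLE statement was not captured in the transcript. Reconstructing it from the desc output below and the '#'/':' delimiters visible in hive_struct.txt, it would look roughly like this (a sketch, not the original statement):

```sql
create table `tb_struct`(
ip string,
userinfo struct<name:string,age:int>
)
row format delimited fields terminated by '#'
collection items terminated by ':';
```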
0: jdbc:hive2://pseduoDisHadoop:10000> desc tb_struct;
+-----------+------------------------------+----------+
| col_name | data_type | comment |
+-----------+------------------------------+----------+
| ip | string | |
| userinfo | struct<name:string,age:int> | |
+-----------+------------------------------+----------+
2 rows selected (0.354 seconds)
Load the data from hive_struct.txt into tb_struct. The file contents are as follows:
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ vim hive_struct.txt
donaldhan@pseduoDisHadoop:/bdp/hive/hiveLocalTables$ cat hive_struct.txt
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70
The load command:
load data local inpath '/bdp/hive/hiveLocalTables/hive_struct.txt' overwrite into table tb_struct;
Table contents:
0: jdbc:hive2://pseduoDisHadoop:10000> select * from tb_struct;
+---------------+-------------------------------+
| tb_struct.ip | tb_struct.userinfo |
+---------------+-------------------------------+
| 192.168.1.1 | {"name":"zhangsan","age":40} |
| 192.168.1.2 | {"name":"lisi","age":50} |
| 192.168.1.3 | {"name":"wangwu","age":60} |
| 192.168.1.4 | {"name":"zhaoliu","age":70} |
+---------------+-------------------------------+
4 rows selected (0.27 seconds)
Query the name and age fields of the struct:
select userinfo.name,userinfo.age from tb_struct;
0: jdbc:hive2://pseduoDisHadoop:10000> select userinfo.name,userinfo.age from tb_struct;
+-----------+------+
| name | age |
+-----------+------+
| zhangsan | 40 |
| lisi | 50 |
| wangwu | 60 |
| zhaoliu | 70 |
+-----------+------+
4 rows selected (0.375 seconds)
Exit Beeline:
0: jdbc:hive2://pseduoDisHadoop:10000> !quit
Closing: 0: jdbc:hive2://pseduoDisHadoop:10000
That concludes our walkthrough of Hive tables with complex data types.
Summary
Since we did not specify a warehouse path when creating the database, it defaults to a folder under the HDFS warehouse directory: user/hive/warehouse/<database name>.db.
A database that still contains tables cannot be dropped directly; drop the tables first, then the database. To drop a database together with its tables in one step, append cascade for a cascading delete (use with caution).
Differences between internal (managed) and external tables:
Dropping an internal table deletes both the metadata in MySQL and the data in HDFS.
Dropping an external table deletes only the metadata in MySQL; the data in HDFS is left intact.
There are two ways to load file data into a Hive table: from the local filesystem or from HDFS. With the LOCAL keyword, data is loaded from the local filesystem; without it (the default), data is loaded from HDFS. Adding the OVERWRITE keyword replaces the existing table data, i.e. the old data is deleted first and the new data is then loaded.
Note that when loading from HDFS, the source file is removed from its original location (moved into the HDFS trash).
An insert actually runs as a MapReduce job, and so do aggregate operations (max, min, avg, count).
Exporting data uses LOCAL and OVERWRITE much like loading does: with LOCAL the export goes to the local filesystem, without it (the default) to HDFS; with OVERWRITE the target file's existing data is replaced, i.e. deleted first, then exported.
When loading a local file into a partitioned table, the file is actually placed under the table's folder in HDFS (e.g. order_partition) inside a subfolder named after the partition column and value (e.g. event_month=2020-02).
The only difference between single-level and multi-level partitioning is that multi-level partitions map to nested directories in HDFS.
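The internal-vs-external distinction above is visible directly in the DDL; a sketch (the table name and location here are made up for illustration):

```sql
-- External table: Hive only records metadata; dropping the table
-- leaves the files under /db/ext_orders in HDFS untouched
create external table `ext_orders`(
id int,
amount double)
row format delimited fields terminated by '\t'
location '/db/ext_orders';
```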
Appendix
References
Hive DDL, DML and SQL operations: https://blog.csdn.net/HG_Harvey/article/details/77488314
LanguageManual DDL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
LanguageManual DML: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
HiveServer2 Clients: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineHiveCommands
Error while compiling statement: FAILED: ParseException
Error: Error while compiling statement: FAILED: ParseException line 1:13 cannot recognize input near 'user' '(' 'id' in table name (state=42000,code=40000)
Solution
The parser cannot recognize `user` as a table name because it is a reserved word; wrap the table name in backticks:
create table `user`(
id int,
name string)
row format delimited fields terminated by '\t';
Columns load as NULL from a file whose field delimiter is a tab
This happens when Hive cannot recognize the tab characters in the file (for example, the editor replaced tabs with spaces); see the following links for details:
Tab-delimited Hive table does not correctly parse an uploaded file: https://blog.csdn.net/u012443641/article/details/80021226
Using vim's set command: https://www.jianshu.com/p/97d34b62d40d
Dynamic partition strict mode requires at least one static partition column
0: jdbc:hive2://pseduoDisHadoop:10000> insert into table stu_age_partition partition (age) select id,name,tel,age from student;
Error: Error while compiling statement: FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict (state=42000,code=10096)
Solution
Hive defaults to static partitioning. To use dynamic partitioning you need the settings below. We set them for the current session only; you can also put them in the configuration file (hive-site.xml) to make them permanent. The session-level settings are:
Enable dynamic partitioning (false by default, i.e. disabled):
set hive.exec.dynamic.partition=true;
Set the dynamic partition mode. The default, strict, requires at least one static partition column; nonstrict allows all partition columns to be dynamic:
set hive.exec.dynamic.partition.mode=nonstrict;