尚硅谷大数据技术之Hive(新)第8章 压缩和存储
8.5.5 主流文件存储格式对比实验
从存储文件的压缩比和查询速度两个角度对比。
存储文件的压缩比测试:
- 测试数据
- https://share.weiyun.com/5WwhRJH
-
2.TextFile
(1)创建表,存储数据格式为TEXTFILE
create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as textfile ;
(2)向表中加载数据
hive (default)> load data local inpath '/opt/module/datas/log.data' into table log_text ;
(3)查看表中数据大小
hive (default)> dfs -du -h /user/hive/warehouse/log_text;
18.1 M /user/hive/warehouse/log_text/log.data
3.ORC
(1)创建表,存储数据格式为ORC
create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc ;
(2)向表中加载数据
hive (default)> insert into table log_orc select * from log_text ;
(3)查看表中数据大小
hive (default)> dfs -du -h /user/hive/warehouse/log_orc/ ;
2.8 M /user/hive/warehouse/log_orc/000000_0
4.Parquet
(1)创建表,存储数据格式为parquet
create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet ;
(2)向表中加载数据
hive (default)> insert into table log_parquet select * from log_text ;
(3)查看表中数据大小
hive (default)> dfs -du -h /user/hive/warehouse/log_parquet/ ;
13.1 M /user/hive/warehouse/log_parquet/000000_0
存储文件的压缩比总结:
ORC > Parquet > textFile
存储文件的查询速度测试:
1.TextFile
hive (default)> select count(*) from log_text;
_c0
100000
Time taken: 21.54 seconds, Fetched: 1 row(s)
Time taken: 21.08 seconds, Fetched: 1 row(s)
Time taken: 19.298 seconds, Fetched: 1 row(s)
2.ORC
hive (default)> select count(*) from log_orc;
_c0
100000
Time taken: 20.867 seconds, Fetched: 1 row(s)
Time taken: 22.667 seconds, Fetched: 1 row(s)
Time taken: 18.36 seconds, Fetched: 1 row(s)
3.Parquet
hive (default)> select count(*) from log_parquet;
_c0
100000
Time taken: 22.922 seconds, Fetched: 1 row(s)
Time taken: 21.074 seconds, Fetched: 1 row(s)
Time taken: 18.384 seconds, Fetched: 1 row(s)
存储文件的查询速度总结:查询速度相近。
8.6 存储和压缩结合
8.6.1 修改Hadoop集群具有Snappy压缩方式
1.查看hadoop checknative命令使用
[atguigu@hadoop104 hadoop-2.7.2]$ hadoop
checknative [-a|-h] check native hadoop and compression libraries availability
2.查看hadoop支持的压缩方式
[atguigu@hadoop104 hadoop-2.7.2]$ hadoop checknative
17/12/24 20:32:52 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
17/12/24 20:32:52 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: false
lz4: true revision:99
bzip2: false
3.将编译好的支持Snappy压缩的hadoop-2.7.2.tar.gz包导入到hadoop102的/opt/software中
4.解压hadoop-2.7.2.tar.gz到当前路径
[atguigu@hadoop102 software]$ tar -zxvf hadoop-2.7.2.tar.gz
5.进入到/opt/software/hadoop-2.7.2/lib/native路径可以看到支持Snappy压缩的动态链接库
[atguigu@hadoop102 native]$ pwd
/opt/software/hadoop-2.7.2/lib/native
[atguigu@hadoop102 native]$ ll
-rw-r--r--. 1 atguigu atguigu 472950 9月 1 10:19 libsnappy.a
-rwxr-xr-x. 1 atguigu atguigu 955 9月 1 10:19 libsnappy.la
lrwxrwxrwx. 1 atguigu atguigu 18 12月 24 20:39 libsnappy.so -> libsnappy.so.1.3.0
lrwxrwxrwx. 1 atguigu atguigu 18 12月 24 20:39 libsnappy.so.1 -> libsnappy.so.1.3.0
-rwxr-xr-x. 1 atguigu atguigu 228177 9月 1 10:19 libsnappy.so.1.3.0
6.拷贝/opt/software/hadoop-2.7.2/lib/native里面的所有内容到开发集群的/opt/module/hadoop-2.7.2/lib/native路径上
[atguigu@hadoop102 native]$ cp ../native/* /opt/module/hadoop-2.7.2/lib/native/
7.分发集群
[atguigu@hadoop102 lib]$ xsync native/
8.再次查看hadoop支持的压缩类型
[atguigu@hadoop102 hadoop-2.7.2]$ hadoop checknative
17/12/24 20:45:02 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
17/12/24 20:45:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: true /opt/module/hadoop-2.7.2/lib/native/libsnappy.so.1
lz4: true revision:99
bzip2: false
9.重新启动hadoop集群和hive
8.6.2 测试存储和压缩
官网:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
ORC存储方式的压缩:
表6-12
Key
Default
Notes
orc.compress
ZLIB
high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size
262,144
number of bytes in each compression chunk
orc.stripe.size
67,108,864
number of bytes in each stripe
orc.row.index.stride
10,000
number of rows between index entries (must be >= 1000)
orc.create.index
true
whether to create row indexes
orc.bloom.filter.columns
""
comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp
0.05
false positive probability for bloom filter (must >0.0 and <1.0)
1.创建一个非压缩的的ORC存储方式
(1)建表语句
create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE");
(2)插入数据
hive (default)> insert into table log_orc_none select * from log_text ;
(3)查看插入后数据
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_none/ ;
7.7 M /user/hive/warehouse/log_orc_none/000000_0
2.创建一个SNAPPY压缩的ORC存储方式
(1)建表语句
create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");
(2)插入数据
hive (default)> insert into table log_orc_snappy select * from log_text ;
(3)查看插入后数据
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_snappy/ ;
3.8 M /user/hive/warehouse/log_orc_snappy/000000_0
3.上一节中默认创建的ORC存储方式,导入数据后的大小为
2.8 M /user/hive/warehouse/log_orc/000000_0
比Snappy压缩的还小。原因是orc存储文件默认采用ZLIB压缩。比snappy压缩的小。
4.存储方式和压缩总结
在实际的项目开发当中,hive表的数据存储格式一般选择:orc或parquet。压缩方式一般选择snappy,lzo。