尚硅谷 Big Data Technology: Hive (New), Chapter 8: Compression and Storage

8.5.5 Comparison Experiment on Mainstream File Storage Formats

We compare the storage formats from two angles: the compression ratio of the stored files and query speed.

Compression ratio test of the stored files:

  1. Test data: https://share.weiyun.com/5WwhRJH

  2. TextFile

    (1) Create a table stored as TEXTFILE

    create table log_text (
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    row format delimited fields terminated by '\t'
    stored as textfile;

    (2) Load data into the table

    hive (default)> load data local inpath '/opt/module/datas/log.data' into table log_text;

    (3) Check the size of the data in the table

    hive (default)> dfs -du -h /user/hive/warehouse/log_text;

    18.1 M  /user/hive/warehouse/log_text/log.data

  3. ORC

    (1) Create a table stored as ORC

    create table log_orc(
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    row format delimited fields terminated by '\t'
    stored as orc;

    (2) Load data into the table

    hive (default)> insert into table log_orc select * from log_text;

    (3) Check the size of the data in the table

    hive (default)> dfs -du -h /user/hive/warehouse/log_orc/;

    2.8 M  /user/hive/warehouse/log_orc/000000_0

  4. Parquet

    (1) Create a table stored as PARQUET

    create table log_parquet(
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    row format delimited fields terminated by '\t'
    stored as parquet;

    (2) Load data into the table

    hive (default)> insert into table log_parquet select * from log_text;

    (3) Check the size of the data in the table

    hive (default)> dfs -du -h /user/hive/warehouse/log_parquet/;

    13.1 M  /user/hive/warehouse/log_parquet/000000_0

    Summary of the compression ratios of the stored files:

    ORC > Parquet > TextFile (ORC produces the smallest files, TextFile the largest)
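Plugging the sizes measured above into a quick calculation (a minimal sketch; the figures are exactly those reported by dfs -du -h) makes the ranking concrete:

```python
# File sizes reported by `dfs -du -h` in the tests above, in MB.
sizes_mb = {"TextFile": 18.1, "ORC": 2.8, "Parquet": 13.1}

baseline = sizes_mb["TextFile"]
for fmt, size in sizes_mb.items():
    # Size relative to the uncompressed TextFile baseline.
    print(f"{fmt}: {size} MB ({size / baseline:.0%} of TextFile)")
```

On this dataset, ORC shrinks the data to roughly 15% of the TextFile size, while Parquet only reaches about 72%.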

    Query speed test of the stored files:

    1. TextFile

    hive (default)> select count(*) from log_text;

    _c0

    100000

    Time taken: 21.54 seconds, Fetched: 1 row(s)

    Time taken: 21.08 seconds, Fetched: 1 row(s)

    Time taken: 19.298 seconds, Fetched: 1 row(s)

    2. ORC

    hive (default)> select count(*) from log_orc;

    _c0

    100000

    Time taken: 20.867 seconds, Fetched: 1 row(s)

    Time taken: 22.667 seconds, Fetched: 1 row(s)

    Time taken: 18.36 seconds, Fetched: 1 row(s)

    3. Parquet

    hive (default)> select count(*) from log_parquet;

    _c0

    100000

    Time taken: 22.922 seconds, Fetched: 1 row(s)

    Time taken: 21.074 seconds, Fetched: 1 row(s)

    Time taken: 18.384 seconds, Fetched: 1 row(s)

    Summary of query speed: the three formats have similar query speeds.
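Averaging the three timed runs for each format (a quick sketch using the timings listed above) shows just how close they are:

```python
# The three "Time taken" values recorded above for each format, in seconds.
times_s = {
    "TextFile": [21.54, 21.08, 19.298],
    "ORC": [20.867, 22.667, 18.36],
    "Parquet": [22.922, 21.074, 18.384],
}

for fmt, runs in times_s.items():
    # Mean query time over the recorded runs.
    print(f"{fmt}: avg {sum(runs) / len(runs):.2f} s over {len(runs)} runs")
```

All three averages fall within about 0.2 s of one another, which supports the "similar speeds" conclusion; note that the dataset is only 100,000 rows, so larger tables and column-pruned queries would be needed to show the columnar formats' advantages.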

8.6 Combining Storage and Compression

8.6.1 Enabling Snappy Compression on the Hadoop Cluster

    1. Check the usage of the hadoop checknative command

    [atguigu@hadoop104 hadoop-2.7.2]$ hadoop

       checknative [-a|-h]  check native hadoop and compression libraries availability

    2. Check which compression codecs Hadoop currently supports

    [atguigu@hadoop104 hadoop-2.7.2]$ hadoop checknative

    17/12/24 20:32:52 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version

    17/12/24 20:32:52 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

    Native library checking:

    hadoop:  true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so

    zlib:    true /lib64/libz.so.1

    snappy:  false

    lz4:     true revision:99

    bzip2:   false

    3. Copy the pre-built hadoop-2.7.2.tar.gz (compiled with Snappy support) to /opt/software on hadoop102

    4. Extract hadoop-2.7.2.tar.gz in the current directory

    [atguigu@hadoop102 software]$ tar -zxvf hadoop-2.7.2.tar.gz

    5. Enter /opt/software/hadoop-2.7.2/lib/native, where you can see the shared libraries that provide Snappy support

    [atguigu@hadoop102 native]$ pwd

    /opt/software/hadoop-2.7.2/lib/native

    [atguigu@hadoop102 native]$ ll

    -rw-r--r--. 1 atguigu atguigu  472950 Sep  1 10:19 libsnappy.a

    -rwxr-xr-x. 1 atguigu atguigu     955 Sep  1 10:19 libsnappy.la

    lrwxrwxrwx. 1 atguigu atguigu      18 Dec 24 20:39 libsnappy.so -> libsnappy.so.1.3.0

    lrwxrwxrwx. 1 atguigu atguigu      18 Dec 24 20:39 libsnappy.so.1 -> libsnappy.so.1.3.0

    -rwxr-xr-x. 1 atguigu atguigu  228177 Sep  1 10:19 libsnappy.so.1.3.0

    6. Copy everything in /opt/software/hadoop-2.7.2/lib/native to /opt/module/hadoop-2.7.2/lib/native on the development cluster

    [atguigu@hadoop102 native]$ cp ../native/* /opt/module/hadoop-2.7.2/lib/native/

    7. Distribute the libraries to the rest of the cluster

    [atguigu@hadoop102 lib]$ xsync native/

    8. Check the supported compression codecs again

    [atguigu@hadoop102 hadoop-2.7.2]$ hadoop checknative

    17/12/24 20:45:02 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version

    17/12/24 20:45:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

    Native library checking:

    hadoop:  true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so

    zlib:    true /lib64/libz.so.1

    snappy:  true /opt/module/hadoop-2.7.2/lib/native/libsnappy.so.1

    lz4:     true revision:99

    bzip2:   false

    9. Restart the Hadoop cluster and Hive

8.6.2 Testing Storage with Compression

    Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

    Compression options for the ORC storage format:

    Table 6-12

    Key                       Default     Notes
    orc.compress              ZLIB        high level compression (one of NONE, ZLIB, SNAPPY)
    orc.compress.size         262,144     number of bytes in each compression chunk
    orc.stripe.size           67,108,864  number of bytes in each stripe
    orc.row.index.stride      10,000      number of rows between index entries (must be >= 1000)
    orc.create.index          true        whether to create row indexes
    orc.bloom.filter.columns  ""          comma separated list of column names for which bloom filter should be created
    orc.bloom.filter.fpp      0.05        false positive probability for bloom filter (must >0.0 and <1.0)

    1. Create a table using ORC storage without compression

    (1) Create-table statement

    create table log_orc_none(
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    row format delimited fields terminated by '\t'
    stored as orc tblproperties ("orc.compress"="NONE");

    (2) Insert data

    hive (default)> insert into table log_orc_none select * from log_text;

    (3) Check the data size after the insert

    hive (default)> dfs -du -h /user/hive/warehouse/log_orc_none/;

    7.7 M  /user/hive/warehouse/log_orc_none/000000_0

    2. Create a table using ORC storage with SNAPPY compression

    (1) Create-table statement

    create table log_orc_snappy(
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    row format delimited fields terminated by '\t'
    stored as orc tblproperties ("orc.compress"="SNAPPY");

    (2) Insert data

    hive (default)> insert into table log_orc_snappy select * from log_text;

    (3) Check the data size after the insert

    hive (default)> dfs -du -h /user/hive/warehouse/log_orc_snappy/;

    3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0

    3. The ORC table created with default settings in the previous section holds, after loading the data:

    2.8 M  /user/hive/warehouse/log_orc/000000_0

    This is even smaller than the Snappy-compressed table. The reason is that ORC files use ZLIB compression by default, and ZLIB typically achieves a better compression ratio (smaller files, at higher CPU cost) than Snappy.

    4. Summary of storage formats and compression

    In real-world project development, the storage format for Hive tables is usually ORC or Parquet, and the compression codec is usually Snappy or LZO.