Atguigu Big Data Technology: Hadoop (MapReduce) (New), Chapter 4: Hadoop Data Compression

4.6.2 Compressing Map Output

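Even if the input and output files of your MapReduce job are uncompressed, you can still compress the intermediate output of the map tasks. That output is written to local disk and then transferred over the network to the reduce nodes, so compressing it reduces both disk and network I/O and can improve job performance considerably. Only two properties need to be set, and the driver code in this section shows how.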
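To get a rough feel for why this helps, the following standalone sketch compresses a simulated map output and compares sizes. It does not use Hadoop at all: `java.util.zip.Deflater` stands in for a Hadoop codec (Hadoop's `DefaultCodec` is also DEFLATE/zlib based), and the class name `MapOutputSizeSketch`, the helper methods, and the sample words are all hypothetical. Real map output is stored in a binary format, so actual ratios will differ.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Standalone illustration only: java.util.zip stands in for a Hadoop codec.
class MapOutputSizeSketch {

    // Builds a fake text map output: one "word<TAB>1" record per word, repeated.
    static byte[] fakeMapOutput(String[] words, int repeats) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeats; i++) {
            for (String w : words) {
                sb.append(w).append('\t').append(1).append('\n');
            }
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses a byte array with DEFLATE and returns the compressed bytes.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "spark", "hive", "atguigu"};
        byte[] raw = fakeMapOutput(words, 1000);
        byte[] packed = deflate(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
    }
}
```

Word-count style output is highly repetitive, which is why it tends to compress well. Less repetitive intermediate data would shrink less, and the CPU cost of compressing it has to be weighed against the I/O saved.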

1. The Hadoop source provided with this course supports the following compression codecs: BZip2Codec and DefaultCodec

package com.atguigu.mapreduce.compress;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration configuration = new Configuration();

        // Enable compression of the map output
        configuration.setBoolean("mapreduce.map.output.compress", true);
        // Set the codec used to compress the map output
        configuration.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(configuration);

        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);

        // Exit with 0 on success and 1 on failure
        System.exit(result ? 0 : 1);
    }
}

2. The Mapper stays unchanged

package com.atguigu.mapreduce.compress;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused output key and value objects
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1. Get one line
        String line = value.toString();

        // 2. Split the line into words
        String[] words = line.split(" ");

        // 3. Emit (word, 1) for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}

3. The Reducer stays unchanged

package com.atguigu.mapreduce.compress;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reused output value object
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {

        int sum = 0;

        // 1. Sum the counts for this key
        for (IntWritable value : values) {
            sum += value.get();
        }
        v.set(sum);

        // 2. Emit (word, total)
        context.write(key, v);
    }
}
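The Mapper and Reducer can stay unchanged because the framework compresses the intermediate data after the map side writes it and decompresses it before the reduce side reads it. The sketch below imitates that flow in plain Java, without Hadoop: `java.util.zip` stands in for the codec, and the class `TransparentCompressionSketch`, its methods, and the text record format are hypothetical simplifications rather than Hadoop's actual shuffle implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Standalone illustration only: no Hadoop classes are involved.
class TransparentCompressionSketch {

    // "Map side": split each line on spaces and emit "word<TAB>1" records.
    static String mapPhase(String[] lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                sb.append(word).append("\t1\n");
            }
        }
        return sb.toString();
    }

    // "Reduce side": sum the counts per word.
    static Map<String, Integer> reducePhase(String records) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records.split("\n")) {
            if (record.isEmpty()) {
                continue;
            }
            String[] kv = record.split("\t");
            counts.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return counts;
    }

    // Compresses the intermediate data with DEFLATE.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restores the intermediate data before the reduce side reads it.
    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String[] lines = {"atguigu atguigu", "hadoop spark hadoop"};
        String mapOutput = mapPhase(lines);
        byte[] shuffled = compress(mapOutput.getBytes(StandardCharsets.UTF_8));
        String received = new String(decompress(shuffled), StandardCharsets.UTF_8);
        System.out.println(reducePhase(received));
    }
}
```

Neither `mapPhase` nor `reducePhase` knows that compression happened in between, which mirrors why only the driver's configuration changes in this section.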

