Log Analysis with MapReduce
Published: 2019-05-25


I. Overview


  • Requirement: clean the e-commerce access log and compute the number of page views (PV) for each product or URL.

II. Test Data


  1. The test data looks like this (excerpt, one request per line):
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89912 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89923 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89933 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89933 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89944 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89944 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89912 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89923 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89933 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
    niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89933 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
  2. Other sample data sets can also be downloaded (link in the original post).

III. Approach


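  • Approach
    1. Follow the same structure as WordCount. The only difference is finding an identifier that uniquely distinguishes each product, such as the URL or the id it carries.
    2. One option is to locate the target URL with a regular expression.
    3. Another option is to cut the target URL out of the line with plain string operations.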
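The two extraction options above can be sketched in plain Java, without any Hadoop dependency. The class name `ProductIdExtractor` and its two methods are hypothetical names chosen for illustration; the regular expression is the one the mapper in this post uses, and the substring variant assumes the id ends at the next space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProductIdExtractor {

    // Same pattern as the mapper in this post: '=' followed by digits and lowercase letters.
    private static final Pattern ID_PATTERN = Pattern.compile("=[0-9a-z]*");

    /** Regex approach: find "=<id>" and drop the leading '='. Returns null if nothing matches. */
    public static String byRegex(String line) {
        Matcher m = ID_PATTERN.matcher(line);
        return m.find() ? m.group(0).substring(1) : null;
    }

    /** Substring approach: take the text between "id=" and the next space. Returns null if "id=" is absent. */
    public static String bySubstring(String line) {
        int start = line.indexOf("id=");
        if (start < 0) {
            return null;
        }
        start += "id=".length();
        int end = line.indexOf(' ', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }

    public static void main(String[] args) {
        // First record of the test data
        String line = "niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "
                + "\"GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89912 HTTP/1.0\" "
                + "200 4391 \"-\" \"ApacheBench/2.3\" \"-\"";
        System.out.println(byRegex(line));
        System.out.println(bySubstring(line));
    }
}
```

Both calls print `402857036a2831e001kshdksdsdk89912` for this record. Note that the regex variant picks up the first `=` in the line, so it relies on `id` being the first query parameter, as it is in the sample data.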

IV. Implementation Steps


  1. Create a Maven project in IntelliJ IDEA or Eclipse.

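  2. Add the Hadoop dependencies to pom.xml:

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>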
  3. Add a log4j.properties file to the resources directory with the following content:

    ### Root logger ###
    log4j.rootLogger = debug,console,fileAppender

    ### Console output ###
    log4j.appender.console = org.apache.log4j.ConsoleAppender
    log4j.appender.console.Target = System.out
    log4j.appender.console.layout = org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern = %d{ABSOLUTE} %5p %c:%L - %m%n

    ### File output ###
    log4j.appender.fileAppender = org.apache.log4j.FileAppender
    log4j.appender.fileAppender.File = logs/logs.log
    log4j.appender.fileAppender.Append = false
    # Threshold takes a single level; DEBUG lets DEBUG and everything above it through
    log4j.appender.fileAppender.Threshold = DEBUG
    log4j.appender.fileAppender.layout = org.apache.log4j.PatternLayout
    log4j.appender.fileAppender.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss} [ %t:%r ] - [ %p ] %m%n
  4. Write the mapper for the text input, LogMapper:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Pattern to search for in each line: '=' followed by the product id
        String pattern = "\\=[0-9a-z]*";
        // Compile the pattern once and reuse it for every record
        Pattern r = Pattern.compile(pattern);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Sample input line:
            // niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89912 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
            String data = value.toString();
            // Create the matcher for this line
            Matcher m = r.matcher(data);
            if (m.find()) {
                // The match includes the leading '='; strip it to get the id
                String idStr = m.group(0);
                String id = idStr.substring(1);
                // Emit (id, 1), as WordCount does for each word
                context.write(new Text(id), new IntWritable(1));
            }
        }
    }
  5. Write the reducer class:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    import java.io.IOException;

    public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the 1s emitted by the mapper for this product id
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
  6. Write the Driver class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogJob {

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration());
            job.setJarByClass(LogJob.class);

            // Mapper and its output types
            job.setMapperClass(LogMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            // Reducer and the final output types
            job.setReducerClass(LogReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Local input file and output directory (the output directory must not exist yet)
            FileInputFormat.setInputPaths(job, new Path("F:\\NIIT\\access2.log"));
            FileOutputFormat.setOutputPath(job, new Path("F:\\NIIT\\logs\\002"));

            // Wait for the job and report success or failure through the exit code
            boolean completion = job.waitForCompletion(true);
            System.exit(completion ? 0 : 1);
        }
    }
  7. Run the code locally and check that the result is correct. The expected output is:

    402857036a2831e001kshdksdsdk89912    2
    402857036a2831e001kshdksdsdk89923    2
    402857036a2831e001kshdksdsdk89933    4
    402857036a2831e001kshdksdsdk89944    2
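As a quick cross-check of the reference result, the same counting can be reproduced in plain Java without Hadoop. This is only a sketch under assumptions: `LocalPvCheck` and `countPv` are hypothetical names, the ten records are rebuilt from the test data in section II, and a `TreeMap` stands in for the sorted, grouped keys that the shuffle phase hands to the reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocalPvCheck {

    // Same pattern as LogMapper: '=' followed by the product id
    private static final Pattern ID_PATTERN = Pattern.compile("=[0-9a-z]*");

    /** Counts how often each product id occurs; the TreeMap keeps the ids sorted. */
    public static Map<String, Integer> countPv(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = ID_PATTERN.matcher(line);
            if (m.find()) {
                // "map" step: one (id, 1) pair per matching line
                String id = m.group(0).substring(1);
                // "reduce" step: add up the 1s per id
                counts.merge(id, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String prefix = "niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "
                + "\"GET /shop/detail.html?id=402857036a2831e001kshdksdsdk899";
        String suffix = " HTTP/1.0\" 200 4391 \"-\" \"ApacheBench/2.3\" \"-\"";
        // Last two digits of the ids in the ten sample records, in file order
        List<String> tails = Arrays.asList("12", "23", "33", "33", "44", "44", "12", "23", "33", "33");

        List<String> lines = new ArrayList<>();
        for (String tail : tails) {
            lines.add(prefix + tail + suffix);
        }
        for (Map.Entry<String, Integer> e : countPv(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Running it prints the four ids with the counts 2, 2, 4 and 2, matching the reference result above.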

V. Package and Run on the Cluster (for reference only; adjust to your environment)


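  1. Once the local run produces the correct result, replace the hard-coded paths in the Driver class with command-line arguments, so that the input and output locations come from the submit command:

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));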

  2. Update the database-related settings in the Job (if any).

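  3. Package the program as a jar. This requires a packaging plugin in pom.xml:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>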

    Then build the jar as shown in the screenshots.

    [Images: packaging steps]

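  4. Submit the job to the cluster with a command like the following (replace the jar name, main class, and input and output paths with your own):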

    hadoop jar packagedemo-1.0-SNAPSHOT.jar  com.niit.mr.EmpJob /datas/emp.csv /output/emp/

    That completes all the steps. Give it a try, and good luck!

Reposted from: http://gkpti.baihongyu.com/
