
Using Java to implement MapReduce word frequency statistics sample code

Preface

In this blog, we will learn how to implement a MapReduce program in Java on Hadoop, and understand how MapReduce works through a WordCount example. MapReduce is a popular large-scale data processing model that is widely used in distributed computing environments.

1. Main text

1. Code structure

We will implement the core functionality of MapReduce in the following three files:

  • Map.java: implements the Mapper class, responsible for splitting the input text into words.
  • Reduce.java: implements the Reducer class, responsible for summing the occurrences of each word.
  • WordCount.java: sets up the Job configuration and manages the running of Map and Reduce.

Next, we will analyze each of these files in turn.

2. Map.java: Mapper implementation

First, let’s take a look at the code implementation of the Mapper class:

package demo1;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.StringTokenizer;


// public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1); // Counter: the constant 1
    private Text word = new Text(); // Stores the word currently being processed

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split each line of text into words; split() can achieve the same effect
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken()); // Get the next word
            context.write(word, one); // Emit the word with its count of 1
        }
    }
}

Functional interpretation:

  • The role of the Mapper
    The Mapper class reads the input data line by line and processes the content of each line. In this example, its task is to split a line of text into words and mark each word with an initial count of 1.

  • Important methods and variables

    • LongWritable key: the offset of the input data, i.e. the position of each line of text within the file.
    • Text value: the line of text that was read.
    • context.write(word, one): takes the split word as the key (Text) and the value 1 (IntWritable), and emits the pair to the framework for the next stage.

Notes:

   StringTokenizer is used to split each line of text into words.

   context.write(word, one) emits the result to the Reducer, where it is aggregated: every occurrence of the same word contributes a 1, and these 1s are later summed.

Generic definition of Mapper class

A typical Mapper class is defined as follows:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

This means Mapper is a generic class with four type parameters, each corresponding to a different data type in the Mapper task. Let's explain these generic parameters one by one:

  • KEYIN (type of the input key)

    • This is the type of the input data's key. In MapReduce programs, the input usually comes from a file or another data source, and KEYIN is the key identifying an input data fragment.
    • It is usually an offset into the file (the byte position within the file), so Hadoop typically uses LongWritable to represent it.

    Common type: LongWritable, indicating the line's offset in the input file.

  • VALUEIN (type of the input value)

    • This is the type of the input data's value. VALUEIN is the actual data passed to the Mapper, usually a line of text.
    • Since it is usually the content of a file, such as one line of text, Text is commonly used to represent it.

    Common type: Text, representing a line of text in the input file.

  • KEYOUT (type of the output key)

    • This is the type of the key in the Mapper's output. The output of a Mapper is a set of key-value pairs, and KEYOUT is the type of the output key.
    • For example, in a word counting program the output key is usually a word, so Text is commonly used.

    Common type: Text, representing the processed word (in the word counting program).

  • VALUEOUT (type of the output value)

    • This is the type of the value in the Mapper's output. VALUEOUT is the type of value associated with each output key.
    • In a word counting program the output value is usually a number representing the occurrences of a word, so IntWritable is commonly used.

    Common type: IntWritable, indicating the count of a word (1). A contrasting sketch with different generics follows this list.
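
For contrast, here is a minimal, hypothetical Mapper (not part of this article's WordCount job) whose KEYOUT and VALUEOUT differ from the word-count case, showing that the four parameters are chosen per task:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// Emits <first word of the line, rest of the line>: KEYOUT = Text, VALUEOUT = Text
public class FirstWordMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text firstWord = new Text();
    private final Text rest = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split into at most two parts: the first word and everything after it
        String[] parts = value.toString().split("\\s+", 2);
        if (parts.length == 2) {
            firstWord.set(parts[0]);
            rest.set(parts[1]);
            context.write(firstWord, rest);
        }
    }
}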

3. Reduce.java: Reducer implementation

Next, we implement the Reducer:

package demo1;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // Accumulate the occurrences of the word
        }
        result.set(sum); // Store the aggregated count
        context.write(key, result); // Emit the word and its total count
    }
}

Functional interpretation:

  • The role of the Reducer
    The Reducer class summarizes the words and counts output by the Mapper. For each word it aggregates all the 1s, producing the total count of that word across the entire input.

  • Important methods and variables

    • Text key: the word.
    • Iterable<IntWritable> values: all counts associated with the word (a collection of 1s).
    • sum: accumulates the number of times the word appears.
    • context.write(key, result): emits the word and its total number of occurrences.

Notes:

  for (IntWritable val : values) iterates over all the count values and accumulates the total for the word.

The result is output as <word, number of occurrences> and stored in the final output file.

Generic definition of Reducer class

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

The Reducer class also has four generic parameters, each corresponding to a different data type in the Reducer task.

  • KEYIN (type of the input key):

    • KEYIN is the type of key the Reducer receives, which is the type of the key output by the Mapper.
    • For example, in a word counting program the Mapper's output key is a word, so KEYIN is usually Text.

    Common type: Text, representing a word (in the word counting program).

  • VALUEIN (type of the input value):

    • VALUEIN is the type of value the Reducer receives; it matches the type of the Mapper's output values. For each KEYIN, the Reducer receives a list of values associated with that key.
    • For example, in a word counting program the Mapper's output value is the count of each word (an IntWritable with the value 1), so VALUEIN is usually IntWritable.

    Common type: IntWritable, indicating the number of times a word appears.

  • KEYOUT (type of the output key):

    • KEYOUT is the type of key the Reducer outputs.
    • In the word counting program the Reducer's output key is still a word, so KEYOUT is usually Text as well.

    Common type: Text, representing a word.

  • VALUEOUT (type of the output value):

    • VALUEOUT is the type of value the Reducer outputs; it is the result of the reduce step.
    • In the word counting program the Reducer's output value is the total number of times each word appears, so VALUEOUT is usually IntWritable.

    Common type: IntWritable, representing the total count of a word. A contrasting sketch with a different VALUEOUT follows this list.
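
For contrast, a minimal, hypothetical Reducer (not part of this article's job) with a different VALUEOUT: it averages the incoming counts instead of summing them, so the output value becomes DoubleWritable:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

// KEYIN = Text, VALUEIN = IntWritable, KEYOUT = Text, VALUEOUT = DoubleWritable
public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private final DoubleWritable avg = new DoubleWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0, n = 0;
        for (IntWritable val : values) {
            sum += val.get(); // Total of the values for this key
            n++;              // How many values there were
        }
        avg.set((double) sum / n); // reduce() is only called with at least one value
        context.write(key, avg);
    }
}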

4. WordCount.java: Job configuration and execution

Finally, we write the main program to configure and start the MapReduce job:

package demo1;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

public class WordCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration(); // Configuration object
        Job job = Job.getInstance(conf, "word count"); // Create a new job
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(1); // Input and output path check
        }

        job.setJarByClass(WordCount.class); // Set the main class
        job.setMapperClass(Map.class); // Set the Mapper class
        job.setReducerClass(Reduce.class); // Set the Reducer class

        // Set the key-value pair types of the Map and Reduce output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths
        FileInputFormat.addInputPath(job, new Path(otherArgs[0])); // Input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // Output path

        // Submit the job and wait until it completes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Functional interpretation:

  • Configuration and job initialization
    • Configuration conf = new Configuration(): creates the Hadoop configuration object that stores the job's settings.
    • Job job = Job.getInstance(conf, "word count"): creates a MapReduce job and names it "word count".
  • Job settings
    • job.setJarByClass(): sets the main class used at runtime.
    • job.setMapperClass() and job.setReducerClass(): set the Mapper and Reducer classes respectively.
  • Input and output paths
    • FileInputFormat.addInputPath(): specifies the path of the input data.
    • FileOutputFormat.setOutputPath(): specifies the path for the output results.
  • Job submission and execution
    • System.exit(job.waitForCompletion(true) ? 0 : 1): submits the job and waits for completion, returning 0 on success and 1 on failure.

Notes:

   GenericOptionsParser is used to parse the command line and obtain the input and output paths.

After the job is submitted, the Hadoop framework automatically runs the Mapper and Reducer according to the configuration and writes the results to the specified path. (Typically the program is packaged into a jar and launched with something like: hadoop jar wordcount.jar demo1.WordCount <input path> <output path>.)

2. Knowledge review and supplement

Data offset?

The data offset, i.e. the LongWritable key, is the key of the Mapper's input in a MapReduce program. It represents the starting byte position of each line of text within the input data file.

Example: suppose we have a text file:

0: Hello world
12: Hadoop MapReduce
29: Data processing

The number in front of each line (0, 12, 29) is that line's offset in the file, i.e. the distance from the beginning of the file to the first byte of that line (each newline character counts as one byte here). The LongWritable key represents this offset.

In the Mapper, the input is provided as the key-value pair <offset, line of text>. Although the offset is unimportant in word frequency statistics, in some applications, such as file processing and log parsing, it helps track the location of the data, as the sketch below illustrates.
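
For illustration, here is a minimal, hypothetical Mapper (not part of this article's WordCount job) that keeps the offset, as one might in log parsing, by passing it through as the output key:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// Emits <offset, line> so downstream steps can locate each record in the source file
public class OffsetMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // key is the starting byte position of this line; pass it through unchanged
        context.write(key, value);
    }
}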

Context

Context is a very important class in the MapReduce framework, available in both the Mapper and the Reducer. It provides the way to interact with the framework.

The main roles of Context

  • Writing output: in the Mapper and the Reducer, context.write(key, value) outputs results to the framework. The framework automatically processes these results, turning the Mapper's output into the Reducer's input, and saving the Reducer's final output to HDFS.

    • In the Mapper, context.write(word, one) passes each word and its initial count 1 to the framework.
    • In the Reducer, context.write(key, result) outputs each word and its total occurrences to the final result.
  • Configuration access: through Context you can access the job's configuration parameters (the Configuration), to obtain environment variables or job parameters.

  • Counters: Context provides counter support to count certain events in a job (such as the number of errors, or of records satisfying a specific condition); see the sketch after this list.

  • Status reporting: Context can report the job's execution status, helping developers track progress or debug.
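
A minimal sketch of these roles in one place, assuming a hypothetical job parameter name ("wordcount.case") and an illustrative counter group/name; only context.write() comes from the article's own code:

package demo1;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MapWithContext extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Configuration access: read a job parameter (the name is hypothetical)
        String caseMode = context.getConfiguration().get("wordcount.case", "lower");
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                // Counter: count skipped empty tokens (group and name are illustrative)
                context.getCounter("WordCount", "EmptyTokens").increment(1);
                continue;
            }
            word.set("lower".equals(caseMode) ? token.toLowerCase() : token);
            context.write(word, one); // Writing output
        }
    }
}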

What type is Iterable<IntWritable> values?

In the Reducer stage, Iterable<IntWritable> values represents the set of IntWritable values associated with the same key (i.e. the same word).

  • Type interpretation

    • Iterable represents a collection that can be iterated, meaning it can be traversed.
    • IntWritable is a wrapper class defined by Hadoop, used to encapsulate values of type int.
  • In the word frequency statistics example
    For each word, the Mapper outputs multiple <word, 1> pairs, so in the Reducer each key (i.e. each word) comes with multiple 1s as its collection of values, namely values. The Reducer's task is to accumulate these 1s to compute the total number of occurrences of the word.

Other traversal methods

Apart from the enhanced for loop for (IntWritable val : values), we can also traverse the Iterable with an explicit Iterator (java.util.Iterator):

Iterator<IntWritable> iterator = values.iterator();
while (iterator.hasNext()) {
    IntWritable val = iterator.next();
    sum += val.get(); // Process each value
}

Iterator provides hasNext() to check whether more elements remain, and next() to return the current element and advance to the next one.

What is the function of Configuration conf = new Configuration()?

The Configuration class loads and stores the runtime configuration of a Hadoop application. It is a core component of Hadoop's configuration system, allowing you to define and access runtime parameters. Every MapReduce job depends on a Configuration to initialize its job configuration.

The specific functions of Configuration:

  • Reading configuration files

    • By default it loads the system's Hadoop configuration files, such as core-site.xml and hdfs-site.xml. These files contain information about the Hadoop cluster (such as the HDFS address, the job scheduler, etc.).
    • If needed, these parameters can be added or overridden manually in code.
  • Passing custom parameters

    • You can pass custom parameters to a MapReduce job through the Configuration. For example, you can write control logic into a configuration file, or set specific parameters directly in code and access them in the Mapper or Reducer via context.getConfiguration(); see the sketch after this list.
    • Example (the parameter name here is a placeholder):
      Configuration conf = new Configuration();
      conf.set("custom.parameter", "some value");

  • Providing the basis for job settings

    • Configuration is the foundation on which a Hadoop job runs: it provides the Job with its context, including input and output formats, the job name, runtime dependency libraries, and so on.
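
A minimal runnable sketch of passing a custom parameter (the name "wordcount.min.length" is a made-up example): the driver stores it with set(), and a task would read it back through context.getConfiguration(); here we read from the same conf object directly:

import org.apache.hadoop.conf.Configuration;

public class ConfigDemo {
    public static void main(String[] args) {
        // Driver side: store a custom parameter in the job configuration
        Configuration conf = new Configuration();
        conf.set("wordcount.min.length", "3");

        // Task side: a Mapper/Reducer would obtain the same object via
        // context.getConfiguration() and read the parameter back, with a default
        int minLen = conf.getInt("wordcount.min.length", 1);
        System.out.println("minimum word length = " + minLen); // prints 3
    }
}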

Why is Configuration needed?

In MapReduce applications the cluster is large, and many configuration parameters (such as file system paths and task scheduler settings) are stored in external configuration files. The Configuration class loads these configurations dynamically, avoiding hard coding.

Using split() to achieve the default tokenization

If you want behavior similar to StringTokenizer's default (splitting on whitespace characters), you can use the regular expression \\s+, which matches one or more whitespace characters (the same as StringTokenizer's default behavior).

Sample code:

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Use split() to divide the string, with whitespace as the delimiter
        String[] tokens = value.toString().split("\\s+");

        // Traverse the resulting tokens
        for (String token : tokens) {
            word.set(token);
            context.write(word, one);
        }
    }

IntWritable

IntWritable is a class provided by Hadoop in the org.apache.hadoop.io package. It is a writable wrapper class used within the Hadoop framework to encapsulate data of type int.

In Hadoop MapReduce, data types need to implement the Writable interface (and, for keys, Comparable as well) so that they can be transmitted between nodes over the network. IntWritable, one of Hadoop's basic data types, provides convenient methods for storing and processing int data.

The role of the IntWritable class:

In MapReduce, all Hadoop data types must implement the Writable interface so they can be transmitted over the network in a distributed system. IntWritable encapsulates a Java int for use in Hadoop's input and output key-value pairs.
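
A minimal sketch of what Writable provides: the encapsulated value is serialized with write() and restored with readFields(), round-tripped here through an in-memory byte stream instead of the network:

import org.apache.hadoop.io.IntWritable;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class IntWritableDemo {
    public static void main(String[] args) throws IOException {
        // Serialize: write() comes from the Writable interface
        IntWritable out = new IntWritable(42);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        out.write(new DataOutputStream(bytes));

        // Deserialize: readFields() restores the value on the receiving side
        IntWritable in = new IntWritable();
        in.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(in.get()); // prints 42
    }
}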

Main uses

  • As a value type in MapReduce: IntWritable is commonly used to represent the output values of the Mapper and Reducer.
  • Supports serialization and deserialization: it implements the Writable interface, so it can be serialized and deserialized efficiently in a distributed environment.

How to use IntWritable

IntWritable provides constructors, as well as methods to set and get the encapsulated int value.

1. Create an IntWritable object

You can create objects directly through the constructors:

  • Default constructor: creates an IntWritable with the value 0.
  • Parameterized constructor: sets the initial value directly.

// Create an IntWritable object with the default value 0
IntWritable writable1 = new IntWritable();

// Create an IntWritable object with the value 10
IntWritable writable2 = new IntWritable(10);

2. Set and get the value

The set() method sets the value, and the get() method retrieves the int encapsulated by the IntWritable.

IntWritable writable = new IntWritable();

// Set the value to 42
writable.set(42);

// Get the value
int value = writable.get(); // value == 42

3. Use in MapReduce

In a MapReduce task, IntWritable is typically used for output values. For example, in the word counting program, the IntWritable value is set to 1, indicating one occurrence of a word.

Example: IntWritable in MapReduce

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1); // IntWritable with the value 1
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");

        // Traverse each word and write it to the context
        for (String token : tokens) {
            word.set(token);
            context.write(word, one); // Output <word, 1>
        }
    }
}

In this MapReduce example:

  • The value corresponding to each word is 1, encapsulated by IntWritable one = new IntWritable(1).
  • context.write() outputs the Text and IntWritable objects as a key-value pair: the key is a word and the value is 1.

Summary:

IntWritable is used in Hadoop instead of the native int type because:

Hadoop requires types that can be transmitted over the network. IntWritable implements the Writable interface and can be serialized and deserialized.

  IntWritable implements the Comparable interface, so it can be used in Hadoop's sorting operations, as the sketch below shows.
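
A minimal sketch of the Comparable side: because IntWritable implements WritableComparable, standard sorting works on it directly:

import org.apache.hadoop.io.IntWritable;

import java.util.Arrays;

public class CompareDemo {
    public static void main(String[] args) {
        IntWritable[] counts = { new IntWritable(3), new IntWritable(1), new IntWritable(2) };
        Arrays.sort(counts); // valid because IntWritable implements Comparable
        System.out.println(Arrays.toString(counts)); // prints [1, 2, 3]
    }
}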

What does Job job = Job.getInstance(conf, "word count"); mean?

Job job = Job.getInstance(conf, "word count");

This line of code creates and configures a new MapReduce job instance.

  • conf: the Hadoop Configuration object containing the job's configuration information.
  • "word count": the name of the job, which can be any string; it is mainly used to identify and record the job.

In the Driver, why are only the output key-value types set? Why not the input types?

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

1. Key-value types of the input data

        These are determined by the InputFormat (such as TextInputFormat), which by default reads each line's offset and content as the key-value pair passed to the Mapper.

Hadoop MapReduce uses an InputFormat class to read the input data files. The default input format is TextInputFormat, which automatically parses the input file into key-value pairs, so you do not need to explicitly specify the input types in the Driver.

  • The output of TextInputFormat (i.e. the input passed to the Mapper) is:
    • key: the byte offset of each line of text within the file, of type LongWritable.
    • value: the content of each line, of type Text.

So the Mapper's input key-value types are already controlled by the InputFormat, and you do not need to specify them manually in the Driver.

2. Key-value types of the final output

        These you need to set explicitly in the Driver, because they are the data types written to HDFS.

The role of setOutputKeyClass and setOutputValueClass

In the Driver, what you must specify are the key-value types of the final output, i.e. the Reducer's output types, because these are the data types written to HDFS.

  • job.setOutputKeyClass(Text.class): specifies that the final output key type is Text.
  • job.setOutputValueClass(IntWritable.class): specifies that the final output value type is IntWritable.

These two settings explicitly tell Hadoop the types of the keys and values in the result files ultimately stored in HDFS.
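
As a side note, and only a sketch beyond this article's job: when a Mapper's intermediate output types differ from the final output types, Hadoop provides job.setMapOutputKeyClass() and job.setMapOutputValueClass() to declare them separately. In WordCount both stages emit <Text, IntWritable>, so these extra calls are unnecessary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;

public class MapOutputTypesDemo {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration(), "demo");

        // Intermediate (Mapper) output types, when they differ from the final ones
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final (Reducer) output types, written to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}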

Summary

This concludes the walkthrough of implementing MapReduce word frequency statistics in Java: the Mapper splits each line into <word, 1> pairs, the Reducer sums the counts for each word, and the Driver configures the job and ties the two stages together.