Preface
In this blog, we will learn how to implement a Hadoop MapReduce program in Java and understand how MapReduce works through a WordCount example. MapReduce is a popular large-scale data processing model that is widely used in distributed computing environments.
1. Main text
1. Code structure
We will implement the core functionality of MapReduce in the following three files:

- Map.java: implements the Mapper class, responsible for splitting the input text into words.
- Reduce.java: implements the Reducer class, responsible for summing up the occurrences of each word.
- WordCount.java: sets up the Job configuration and manages the running of Map and Reduce.
Next, let's analyze each of these files in turn.
2. Map.java: Mapper implementation
First, let’s take a look at the code implementation of the Mapper class:
```java
package demo1;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1); // the count for each word
    private Text word = new Text(); // stores the word currently being processed

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each line of text into words; split() could achieve the same result
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken()); // get the next word
            context.write(word, one);        // output the word with its count of 1
        }
    }
}
```
Functional interpretation:

- The role of the Mapper: the `Mapper` class reads the input data line by line and processes the content of each line. In this example, its task is to split a line of text into words and mark each word with an initial count of 1.
- Important methods and variables:
  - `LongWritable key`: the offset of the input data, i.e., the position of each line of text within the file.
  - `Text value`: the line of text that was read.
  - `context.write(word, one)`: outputs each split word as the key (`Text`) with the value 1 (`IntWritable`) to the framework for the next stage.
- Notes:
  - `StringTokenizer` is used to split each line of text into individual words.
  - `context.write(word, one)` sends the result on to the `Reducer`, which aggregates it: every occurrence of the same word contributes another 1, and these 1s are later summed into a total.
Generic definition of Mapper class
A typical `Mapper` class is defined as follows:
```java
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
```
This means `Mapper` is a generic class with four type parameters, each corresponding to a different data type in the Mapper task. Let's explain the meaning of these generic parameters one by one:
- KEYIN (type of the input key):
  - The type of the key of the input data. In MapReduce programs, the input usually comes from a file or another data source, and `KEYIN` identifies each input record. It is usually the offset within the file (the byte position), so Hadoop commonly uses `LongWritable` to represent it.
  - Common type: `LongWritable`, representing the line number or offset in the input file.
- VALUEIN (type of the input value):
  - The type of the value of the input data. `VALUEIN` is the actual data passed to the `Mapper`, usually the content of a file such as a line of text, so it is commonly represented by `Text`.
  - Common type: `Text`, representing a line of text in the input file.
- KEYOUT (type of the output key):
  - The type of the key in the data the `Mapper` outputs. The output of a Mapper is a key-value pair, and `KEYOUT` is the type of its key. In a word counting program, the output key is usually a word, so `Text` is commonly used.
  - Common type: `Text`, representing the processed word (in the word counting program).
- VALUEOUT (type of the output value):
  - The type of the value in the data the `Mapper` outputs. `VALUEOUT` is the type of the value paired with each output key. In a word counting program, the output value is usually a number representing the occurrences of a word, so `IntWritable` is commonly used.
  - Common type: `IntWritable`, representing the word's count (1).
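To make these four parameters concrete, here is a minimal hypothetical Mapper (not part of our WordCount program) whose task is different, so its generic signature changes accordingly: it emits each line's length as the key and the line itself as the value.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// KEYIN = LongWritable (offset), VALUEIN = Text (line),
// KEYOUT = IntWritable (line length), VALUEOUT = Text (the line itself)
public class LineLengthMap extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final IntWritable length = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        length.set(value.getLength()); // byte length of the line
        context.write(length, value);  // emit <length, line>
    }
}
```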
3. Reduce.java: Reducer implementation
Next, we implement the Reducer:
```java
package demo1;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // accumulate the number of times the word appears
        }
        result.set(sum);            // store the aggregated result
        context.write(key, result); // output the word and its total count
    }
}
```
Functional interpretation:

- The role of the Reducer: the `Reducer` class summarizes the words and counts output by the `Mapper`. It aggregates all the 1s for each word to obtain the word's total count across the entire input.
- Important methods and variables:
  - `Text key`: the word.
  - `Iterable<IntWritable> values`: all counts associated with that word (a collection of 1s).
  - `sum`: accumulates the number of times the word appears.
  - `context.write(key, result)`: outputs the word and its total number of occurrences.
- Notes:
  - `for (IntWritable val : values)` iterates over all count values and accumulates the word's total.
  - The result is output as `<word, number of occurrences>` pairs and stored in the final output file.
Generic definition of Reducer class
```java
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
```
The `Reducer` class also has four generic parameters, each corresponding to a different data type in the Reducer task.
- KEYIN (type of the input key):
  - `KEYIN` is the type of the key received by the Reducer, which matches the type of the key output by the `Mapper`. For example, in a word counting program the Mapper's output key is a word, so `KEYIN` is usually of type `Text`.
  - Common type: `Text`, representing a word (in the word counting program).
- VALUEIN (type of the input value):
  - `VALUEIN` is the type of the values received by the Reducer; it matches the type of the Mapper's output values. For each `KEYIN`, the `Reducer` receives a list of values associated with that key. In a word counting program, the Mapper's output value is the count for each occurrence (an `IntWritable` with the value 1), so `VALUEIN` is usually `IntWritable`.
  - Common type: `IntWritable`, representing the number of times a word appears.
- KEYOUT (type of the output key):
  - `KEYOUT` is the type of the key output by the `Reducer`. In the word counting program, the output key is still a word, so `KEYOUT` is usually `Text` as well.
  - Common type: `Text`, representing a word.
- VALUEOUT (type of the output value):
  - `VALUEOUT` is the type of the value output by the `Reducer`, i.e., the result of the reduce processing. In the word counting program, the output value is the total number of times each word appears, so `VALUEOUT` is usually `IntWritable`.
  - Common type: `IntWritable`, representing a word's total count.
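The same applies on the Reducer side: the output types follow the task, not the input types. As a minimal sketch (a hypothetical class, not part of WordCount), a Reducer that averages its input counts would keep KEYIN/VALUEIN matched to the Mapper's output but declare a different VALUEOUT:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// KEYIN = Text, VALUEIN = IntWritable (matching the Mapper's output),
// but VALUEOUT = DoubleWritable (an average, not a sum)
public class AverageReduce extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private final DoubleWritable average = new DoubleWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (IntWritable val : values) {
            sum += val.get();
            count++;
        }
        average.set((double) sum / count); // count is at least 1 for any key seen here
        context.write(key, average);
    }
}
```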
4. WordCount.java: Job configuration and execution
Finally, we write the main program to configure and start the MapReduce job:
```java
package demo1;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration(); // configuration items
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(1); // input and output path check
        }
        Job job = Job.getInstance(conf, "word count"); // create a new job

        job.setJarByClass(WordCount.class); // set the main class
        job.setMapperClass(Map.class);      // set the Mapper class
        job.setReducerClass(Reduce.class);  // set the Reducer class

        // Set the key-value types of the final (Reduce) output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // output path

        // Submit the job and wait until it completes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Functional interpretation:

- Configuration and job initialization:
  - `Configuration conf = new Configuration()`: creates the Hadoop configuration object that stores the job's configuration.
  - `Job job = Job.getInstance(conf, "word count")`: creates a MapReduce job named "word count".
- Job settings:
  - `job.setJarByClass(WordCount.class)`: sets the main class used at runtime.
  - `job.setMapperClass(Map.class)` and `job.setReducerClass(Reduce.class)`: set the `Mapper` and `Reducer` classes respectively.
- Input and output paths:
  - `FileInputFormat.addInputPath()`: specifies the path of the input data.
  - `FileOutputFormat.setOutputPath()`: specifies the path of the output results.
- Job submission and operation:
  - `System.exit(job.waitForCompletion(true) ? 0 : 1)`: submits the job and waits for completion; it exits with 0 on success and 1 on failure.
- Notes:
  - `GenericOptionsParser` parses the command-line input to obtain the input and output paths.
  - After the job is submitted, the Hadoop framework automatically runs the Mapper and Reducer according to the configuration and writes the results to the specified path.
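One optional addition, not part of the original driver above: because this Reduce class simply sums counts, it can also be registered as a combiner to pre-aggregate map output locally before it is sent across the network:

```java
// Optional optimization: reuse Reduce as a combiner.
// This is safe here because summing is associative and commutative.
job.setCombinerClass(Reduce.class);
```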
2. Knowledge review and supplement
Data offset?
The data offset, i.e., the `LongWritable key`, is the input key of the `Mapper` in a MapReduce program. It represents the starting byte position of each line of text in the input data file.
Example: suppose we have a text file:

```
0:  Hello world
12: Hadoop MapReduce
29: Data processing
```

The number (0, 12, 29) in front of each line is the offset of the corresponding line in the file (assuming `\n` line endings), indicating the distance from the beginning of the file to the starting byte of that line. The `LongWritable` key represents this offset.
In the `Mapper`, the input is provided as `<offset, text line>` key-value pairs. Although the offset is not important in word-frequency counting, in some applications, such as file processing and log parsing, it can help track the location of data.
Context
`Context` is a very important class for both `Mapper` and `Reducer` in the MapReduce framework; it provides the means for them to interact with the framework.
The main roles of Context:

- Writing output: in `Mapper` and `Reducer`, `context.write(key, value)` outputs results to the framework. The framework automatically processes these outputs, feeding the `Mapper`'s output to the `Reducer` as input, or saving the final `Reducer` output to HDFS.
  - In the `Mapper`, `context.write(word, one)` passes each word and its initial count of 1 to the framework.
  - In the `Reducer`, `context.write(key, result)` outputs each word and its total occurrences to the final result.
- Configuration access: through `Context` you can access the job's configuration parameters (the `Configuration`), which helps obtain environment variables or job parameters.
- Counters: `Context` provides counter support for counting certain events in a job (such as the number of errors, or of records meeting specific conditions).
- Status reporting: `Context` can report the execution status of the job, helping developers track its progress or debug it. (A sketch of these non-output uses follows below.)
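As a small sketch of these non-output features (the counter group, parameter name, and output types below are hypothetical, chosen only for illustration), a Mapper might use `Context` like this:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextDemoMap extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Configuration access: read a job parameter (hypothetical name)
        Configuration conf = context.getConfiguration();
        String marker = conf.get("demo.marker", "");

        // Counter: count empty lines across the whole job (hypothetical counter)
        if (value.toString().trim().isEmpty()) {
            context.getCounter("Demo", "EMPTY_LINES").increment(1);
            return;
        }

        // Status reporting: update this task's status string
        context.setStatus("processing offset " + key.get());

        context.write(new Text(marker + value.toString()), NullWritable.get());
    }
}
```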
What type is Iterable&lt;IntWritable&gt; values?
In the `Reducer` stage, `Iterable<IntWritable> values` represents the set of all `IntWritable` values associated with the same key (i.e., the same word).

- Type interpretation:
  - `Iterable` represents a collection that can be iterated over, meaning it can be traversed.
  - `IntWritable` is a wrapper class defined by Hadoop, used to encapsulate values of type `int`.
- In the word-frequency example: for each word, the `Mapper` outputs multiple `<word, 1>` pairs, so in the `Reducer` each key (i.e., word) comes with multiple 1s as its collection of values, namely `values`. The `Reducer`'s task is to accumulate these 1s to compute the word's total number of occurrences.
Other traversal methods: besides the enhanced for loop `for (IntWritable val : values)`, we can also traverse the `Iterable` with an explicit `Iterator`:

```java
Iterator<IntWritable> iterator = values.iterator();
while (iterator.hasNext()) {
    IntWritable val = iterator.next();
    sum += val.get(); // process each value
}
```
`Iterator` provides the `hasNext()` method to check whether there are more elements; the `next()` method returns the current element and advances to the next one.
What is the function of Configuration conf = new Configuration()?
The `Configuration` class loads and stores configuration information for the runtime of a Hadoop application. It is a core component of the Hadoop configuration system that allows you to define and access runtime parameters. Every MapReduce job depends on a `Configuration` to initialize its job configuration.
The specific functions of `Configuration`:
- Reading configuration files:
  - By default it loads the system's Hadoop configuration files, such as core-site.xml, hdfs-site.xml, and mapred-site.xml. These files contain information about the Hadoop cluster (such as the HDFS address, the job scheduler, and so on).
  - If needed, these parameters can be added or overridden manually in code.
- Passing custom parameters:
  - You can pass custom parameters to a MapReduce job through the `Configuration`. For example, you can write some control logic into a configuration file, or set specific parameters directly in code, and then access them in the `Mapper` or `Reducer` via `context.getConfiguration()`.
  - Example (the key name here is illustrative):

```java
Configuration conf = new Configuration();
conf.set("my.custom.param", "some value"); // hypothetical key name
```
- Providing the basis for job setup:
  - `Configuration` is the foundation on which Hadoop jobs run; it provides the `Job` with its context, including input and output formats, the job name, runtime dependency libraries, and so on.
Why is `Configuration` needed?

In MapReduce applications the cluster is large, and many configuration parameters (such as file system paths and task scheduler settings) are stored in external configuration files. The `Configuration` class can load these configurations dynamically, avoiding hard-coding.
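On the task side, such a parameter is read back by asking the `Context` for the job's `Configuration`, typically in the `setup()` method. A minimal sketch, reusing the hypothetical key `my.custom.param` from the example above:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfiguredMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private String customParam;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once per task before map(); read the job parameter here
        Configuration conf = context.getConfiguration();
        customParam = conf.get("my.custom.param", "default value");
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use customParam while processing each line ...
    }
}
```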
Using the split() method to split on the default separators

If you want behavior similar to `StringTokenizer`'s default (splitting on whitespace characters), you can use the regular expression `\\s+`, which matches one or more whitespace characters (the same as `StringTokenizer`'s default behavior).
Sample code:
```java
@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Use split() to divide the line, with whitespace as the delimiter
    String[] tokens = value.toString().split("\\s+");
    // iterate over the resulting tokens
    for (String token : tokens) {
        word.set(token);
        context.write(word, one);
    }
}
```
IntWritable
`IntWritable` is a class provided by Hadoop in the org.apache.hadoop.io package. It is a writable wrapper class used in the Hadoop framework to encapsulate data of type `int`.
In Hadoop MapReduce, data types must implement the `Writable` interface (key types additionally implement `Comparable`) so that they can be transmitted between nodes over the network. `IntWritable`, one of Hadoop's basic data types, provides convenient ways to store and process `int` data.
The role of the `IntWritable` class: in MapReduce, all Hadoop data types must implement the `Writable` interface so that they can be transmitted over the network in a distributed system. `IntWritable` encapsulates a Java `int` for use in Hadoop's input and output key-value pairs.
Main uses:

- As a value type in MapReduce: `IntWritable` is commonly used to represent the output values of the Mapper and Reducer.
- Support for serialization and deserialization: it implements the `Writable` interface, so it can be serialized and deserialized efficiently in a distributed environment.
How to use IntWritable
`IntWritable` provides constructors and methods for setting and retrieving the wrapped `int` value.
1. Create an IntWritable object
Objects can be created directly through the constructors:

- Default constructor: creates an `IntWritable` with the value 0.
- Parameterized constructor: sets the initial value directly.
```java
// Create an IntWritable object with the default value 0
IntWritable writable1 = new IntWritable();

// Create an IntWritable object with the value 10
IntWritable writable2 = new IntWritable(10);
```
2. Set value and get value
The value can be set with the `set()` method, and the `int` encapsulated by the `IntWritable` can be retrieved with the `get()` method.
```java
IntWritable writable = new IntWritable();
writable.set(42);           // set the value to 42
int value = writable.get(); // value == 42
```
3. Use in MapReduce
In MapReduce tasks, `IntWritable` is usually used for output values. For example, in a word counting program, the `IntWritable` value is set to 1, representing one occurrence of a word.
Example: IntWritable in MapReduce
```java
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1); // IntWritable with the value 1
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        // traverse each word and write it to the context
        for (String token : tokens) {
            word.set(token);
            context.write(word, one); // output <word, 1>
        }
    }
}
```
In this MapReduce example:

- The value corresponding to each word is 1, and `IntWritable one = new IntWritable(1)` encapsulates this integer.
- `context.write()` outputs the `Text` and `IntWritable` objects as a key-value pair: the key is the word, and the value is 1.
Summary: `IntWritable` is used in Hadoop instead of the native `int` type because:

- Hadoop needs types that can be transmitted over the network; `IntWritable` implements the `Writable` interface and can be serialized and deserialized.
- `IntWritable` implements the `Comparable` interface, so it can be used in Hadoop's sorting operations.
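Both points can be verified with a small standalone program. A minimal sketch (plain Java, no cluster required) that exercises `write()`, `readFields()`, and `compareTo()`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class IntWritableDemo {
    public static void main(String[] args) throws IOException {
        IntWritable original = new IntWritable(42);

        // Writable: serialize to bytes, as Hadoop would when shuffling data
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // ... and deserialize back into a fresh object
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(restored.get()); // 42

        // Comparable: used by Hadoop when sorting keys
        System.out.println(new IntWritable(1).compareTo(new IntWritable(2))); // negative
    }
}
```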
What does Job job = Job.getInstance(conf, "word count") mean?
```java
Job job = Job.getInstance(conf, "word count");
```
This line of code creates and configures a new MapReduce job instance.
- conf: the Hadoop `Configuration` object containing the job's configuration information.
- "word count": the name of the job; it can be any string and is mainly used to identify and record the job.
In the Driver, why is only the output key-value type set, and not the input?
```java
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
```
1. The key-value types of the input data

They are determined by the InputFormat (such as `TextInputFormat`), which by default reads the offset and content of each line as the key-value pair passed to the `Mapper`.

Hadoop MapReduce uses an InputFormat class to read the input data files. The default input format is `TextInputFormat`, which automatically parses the input files into key-value pairs, so you do not need to specify the input types explicitly in the Driver.
- The output of `TextInputFormat` (i.e., the input passed to the `Mapper`) is:
  - key: the byte offset of each line of text within the file, of type `LongWritable`.
  - value: the content of each line, of type `Text`.

Therefore, the `Mapper`'s input key-value types are already controlled by the InputFormat and do not need to be specified manually in the Driver, as the sketch below illustrates.
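A sketch of how the InputFormat choice drives the Mapper's input types; `TextInputFormat` is what this WordCount job uses by default, and `KeyValueTextInputFormat` is shown only as a contrast:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatSketch {
    static void configure(Job job) {
        // Default: each record is <byte offset, line>,
        // so the Mapper is declared as Mapper<LongWritable, Text, ...>
        job.setInputFormatClass(TextInputFormat.class);

        // Contrast: KeyValueTextInputFormat splits each line at the first tab
        // into <Text, Text>, so a Mapper reading it would be declared
        // as Mapper<Text, Text, ...>
        // job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}
```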
2. The final output key-value types

These must be set explicitly in the Driver, because they are the data types written to HDFS.

The role of setOutputKeyClass and setOutputValueClass

What you specify in the Driver is the key-value type of the final output result, i.e., the `Reducer`'s output key-value types, because these are the types written to HDFS.
- `job.setOutputKeyClass(Text.class)`: specifies that the type of the final output key is `Text`.
- `job.setOutputValueClass(IntWritable.class)`: specifies that the type of the final output value is `IntWritable`.

These two settings explicitly tell Hadoop the types of the keys and values in the result file that is finally stored in HDFS.
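One related setting this WordCount driver does not need: when the Mapper's output types differ from the Reducer's final output types, the intermediate types must be declared separately. A hypothetical driver fragment (the `DoubleWritable` output is illustrative, not part of this program):

```java
// Hypothetical job where the Mapper emits <Text, IntWritable>
// but the Reducer emits <Text, DoubleWritable>:
job.setMapOutputKeyClass(Text.class);          // intermediate (map-side) key type
job.setMapOutputValueClass(IntWritable.class); // intermediate (map-side) value type
job.setOutputKeyClass(Text.class);             // final key type
job.setOutputValueClass(DoubleWritable.class); // final value type
```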
Summary:

This concludes this article on implementing MapReduce word-frequency counting in Java. For more on Java MapReduce word counting, please search my previous articles or continue browsing the related articles below. I hope you will continue to support me!