
Writing Hadoop MapReduce Program in Native Python

In the field of big data processing, Hadoop MapReduce is a widely used framework for processing and generating large-scale data sets. It achieves efficient processing by decomposing a job into many small tasks (map and reduce) and running them in parallel across a cluster. Although Hadoop primarily supports the Java programming language, Hadoop Streaming lets us write MapReduce programs in other languages such as Python.

This article explains in detail how to write a Hadoop MapReduce program in native Python, illustrated with a simple example.

Introduction to Hadoop Streaming

Hadoop Streaming is a tool provided by Hadoop that allows users to use any executable script or program as a mapper and reducer. This allows non-Java programmers to use the power of Hadoop for data processing. Hadoop Streaming communicates with external programs through standard input (stdin) and standard output (stdout), so any language that can read stdin and write stdout can be used to write MapReduce programs.
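
To make this contract concrete, here is a minimal sketch of a "do-nothing" Streaming script: it simply copies records from stdin to stdout. Any program that honors this pattern, in any language, can be plugged in as a mapper or reducer.

#!/usr/bin/env python
# Identity script: copies standard input to standard output unchanged.
# This read-stdin/write-stdout loop is the entire contract that a
# Hadoop Streaming mapper or reducer must honor.
import sys

for line in sys.stdin:
    sys.stdout.write(line)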

Python environment preparation

Make sure Python is installed in your environment. If the nodes of your Hadoop cluster do not have Python preinstalled, you need to install it on every node.
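
A quick sanity check on a node (the scripts below use an env-based shebang, so whichever interpreter python resolves to will be used):

which python && python --version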

Example: Word Count

We will demonstrate how to write a Hadoop MapReduce program in Python using the classic "word count" example. This program counts the number of occurrences of each word in a given text file.

1. Mapper script

Create a file named mapper.py with the following content:

#!/usr/bin/env python
import sys

# Read each line from standard input
for line in sys.stdin:
    # Remove line breaks and whitespace at the ends of the line
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Output a (word, 1) pair for each word
    for word in words:
        print(f'{word}\t1')
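
Because the mapper is just an ordinary program reading standard input, you can try it directly in the shell before involving Hadoop at all, for example:

echo "hello world hello" | ./mapper.py

This prints three lines: hello, world, and hello again, each followed by a tab and the count 1.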

2. Reducer script

Create a file named reducer.py with the following content:

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Read each line from standard input
for line in sys.stdin:
    # Remove line breaks and whitespace at the ends of the line
    line = line.strip()
    # Parse the (word, count) pair produced by the mapper
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # If count is not a number, ignore this line
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Output (word, count)
            print(f'{current_word}\t{current_count}')
        current_count = count
        current_word = word

# Output the last word (if it exists)
if current_word == word:
    print(f'{current_word}\t{current_count}')
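
Before touching the cluster, the whole job can be simulated locally. Hadoop sorts the mapper output by key before it reaches the reducer, which is exactly what the current_word logic above relies on, so the local pipeline must include a sort step. A sketch, assuming a small test file named input.txt:

cat input.txt | ./mapper.py | sort | ./reducer.py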

3. Run the MapReduce job

Assuming you already have an input text file in HDFS, you can run the MapReduce job with the following command:

hadoop jar /path/to/hadoop-streaming.jar \
    -file ./mapper.py -mapper ./mapper.py \
    -file ./reducer.py -reducer ./reducer.py \
    -input /path/to/input -output /path/to/output

Here, /path/to/hadoop-streaming.jar is the path to the Hadoop Streaming JAR file, which you need to replace according to your installation. The -input and -output parameters specify the input and output directories, respectively.
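
Since the scripts are invoked directly as ./mapper.py and ./reducer.py, they must be executable (see also the notes at the end of this article):

chmod +x mapper.py reducer.py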

Through Hadoop Streaming, we can write Hadoop MapReduce programs in scripting languages such as Python, without writing any Java code. This approach not only lowers the barrier to entry but also improves development efficiency. I hope this article helps you better understand and use Hadoop Streaming for big data processing.

To recap before the supplementary material: Hadoop Streaming is the tool that makes all of this possible. It lets users create and run MapReduce jobs whose mapper and reducer read and write data through standard input and output streams, which is why languages other than Java, including Python, can be used.

Method supplement

Here is a simple MapReduce program using native Python that counts the number of occurrences of each word in a text file.

1. Environmental preparation

Make sure Hadoop is installed in your environment and configured correctly so that you can run Hadoop commands. You also need a working Python environment.

2. Write Mapper scripts

The Mapper script is responsible for processing input data and generating key-value pairs. In this example, we output each word as a key and the number 1 as a value.

#!/usr/bin/env python
import sys

def read_input(file):
    # Yield each input line as a list of words
    for line in file:
        yield line.strip().split()

def main():
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print(f"{word}\t1")

if __name__ == "__main__":
    main()

Save the above code as mapper.py.

3. Write Reducer Scripts

The Reducer script receives key-value pairs from the Mapper and summarizes the values of the same key. Here we will count the total number of occurrences of each word.

#!/usr/bin/env python
import sys

def read_input(file):
    # Yield each input line as a (word, count) pair
    for line in file:
        yield line.strip().split('\t', 1)

def main():
    current_word = None
    current_count = 0
    word = None

    for line in sys.stdin:
        word, count = next(read_input([line]))
        try:
            count = int(count)
        except ValueError:
            continue

        if current_word == word:
            current_count += count
        else:
            if current_word:
                print(f"{current_word}\t{current_count}")
            current_count = count
            current_word = word

    if current_word == word:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    main()

Save the above code as reducer.py.
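
The reducer can be exercised on its own as well. Note that it expects its input already sorted by word, as Hadoop guarantees between the map and reduce phases. For example:

printf 'fun\t1\nfun\t1\nhello\t1\n' | python reducer.py

This should print fun with a count of 2, followed by hello with a count of 1.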

4. Prepare input data

Suppose we have a text file called input.txt with the following content:

hello world
hello hadoop
mapreduce is fun
fun with hadoop
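
Before the job can read this file, it has to be uploaded to HDFS. A minimal sketch using the standard HDFS shell commands (the target directory /path/to/ is a placeholder, matching the command below):

hadoop fs -mkdir -p /path/to
hadoop fs -put input.txt /path/to/input.txt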

5. Run the MapReduce job

Use the Hadoop Streaming command to run this MapReduce job. First, make sure that you have the corresponding input files in your Hadoop cluster. Then execute the following command:

hadoop jar /path/to/hadoop-streaming.jar \
    -file ./mapper.py    -mapper "python mapper.py" \
    -file ./reducer.py   -reducer "python reducer.py" \
    -input /path/to/input.txt \
    -output /path/to/output

Here, /path/to/hadoop-streaming.jar is the path to the Hadoop Streaming JAR file, which you need to replace according to your installation. Likewise, /path/to/input.txt and /path/to/output need to be replaced with your actual HDFS paths.

6. View the results

After the job is completed, you can view the results in the specified output directory. For example, use the following command to view the output:

hadoop fs -cat /path/to/output/part-00000

This will display a list of each word and its occurrences.
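
For the sample input.txt above, the listing should look like this (the shuffle phase sorts keys, so the words appear in alphabetical order):

fun	2
hadoop	2
hello	2
is	1
mapreduce	1
with	1
world	1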

The above is a basic example of writing a Hadoop MapReduce program in native Python. In this way, you can take advantage of Python's conciseness and powerful library support to handle big data tasks. As covered above, the mechanism that makes this possible is Hadoop Streaming, which allows the Mapper and Reducer of a MapReduce job to be any executable file or script (Python, Perl, etc.).

Hadoop Streaming Principle

Hadoop Streaming works by passing input data to the Mapper script via standard input (stdin) and collecting the Mapper's output via standard output (stdout). Similarly, the Reducer script receives the Mapper's output, sorted by key by the framework, through standard input and emits the final result through standard output.

Sample MapReduce written in Python

Suppose we want to count the number of times each word appears in a text file. Here is how to write a MapReduce program like this in Python:

1. Mapper script (mapper.py)

#!/usr/bin/env python
import sys

# Read standard input
for line in sys.stdin:
    # Remove line breaks and whitespace at the ends of the line
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Output a (word, 1) pair for each word
    for word in words:
        print(f"{word}\t1")

2. Reducer script (reducer.py)

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Read data from standard input
for line in sys.stdin:
    line = line.strip()
    # Parse the (word, count) pair produced by the mapper
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # If count is not a number, ignore this line
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Output (word, count)
            print(f"{current_word}\t{current_count}")
        current_count = count
        current_word = word

# Output the last word (if needed)
if current_word == word:
    print(f"{current_word}\t{current_count}")

3. Run the MapReduce job

To run this MapReduce job, you need to make sure your Hadoop cluster is set up and you have permission to submit the job. You can submit the job using the following command:

hadoop jar /path/to/hadoop-streaming.jar \
    -file ./mapper.py    -mapper ./mapper.py \
    -file ./reducer.py   -reducer ./reducer.py \
    -input /path/to/input/files \
    -output /path/to/output

Here, /path/to/hadoop-streaming.jar is the path to the Hadoop Streaming JAR file. The -file parameters specify local files that need to be shipped to the Hadoop cluster; the -mapper and -reducer parameters specify the Mapper and Reducer scripts, respectively; and the -input and -output parameters specify the input and output directories.
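
On recent Hadoop versions, the per-file -file option is deprecated in favor of the generic -files option, which takes a comma-separated list and must appear before the streaming-specific options. An equivalent submission would look roughly like this (paths are placeholders, as above):

hadoop jar /path/to/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /path/to/input/files \
    -output /path/to/output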

Things to note

Make sure your Python scripts have executable permissions, which can be set via chmod +x mapper.py reducer.py.

When processing large amounts of data, consider the data skew problem and design key-value pairs sensibly so that no single reducer is overloaded (one mitigation is sketched below).
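
A common mitigation for aggregations like word count is a combiner, which pre-aggregates mapper output on each node before the shuffle. Because word-count reduction is associative and commutative, the reducer script can itself serve as the combiner; a sketch using Streaming's -combiner option:

hadoop jar /path/to/hadoop-streaming.jar \
    -file ./mapper.py  -mapper ./mapper.py \
    -file ./reducer.py -reducer ./reducer.py \
    -combiner ./reducer.py \
    -input /path/to/input/files \
    -output /path/to/output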

When testing the Mapper and Reducer scripts, debug them first with small-scale data in a local environment, as in the shell pipeline shown earlier.

The above is the detailed content of writing a Hadoop MapReduce program in native Python. For more information about Python and Hadoop MapReduce, please see my other related articles!