In the field of big data processing, Hadoop MapReduce is a widely used framework for processing and generating large-scale data sets. It enables efficient data processing by decomposing a job into many small tasks (map and reduce) that run in parallel on a cluster. Although Hadoop mainly supports the Java programming language, through the Hadoop Streaming facility we can use other languages such as Python to write MapReduce programs.
This article will explain in detail how to write a Hadoop MapReduce program using native Python, and illustrate its application with a simple example.
Introduction to Hadoop Streaming
Hadoop Streaming is a tool provided by Hadoop that allows users to use any executable script or program as a mapper and reducer. This lets non-Java programmers harness Hadoop for data processing. Hadoop Streaming communicates with external programs through standard input (stdin) and standard output (stdout), so any language that can read from stdin and write to stdout can be used to write MapReduce programs.
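Because the contract is nothing more than stdin and stdout, a streaming job can be simulated on the command line before touching a cluster. A minimal sketch, assuming the mapper.py and reducer.py scripts developed in the example below and a local sample file input.txt:

cat input.txt | python mapper.py | sort -k1,1 | python reducer.py

Here sort stands in for Hadoop's shuffle/sort phase, which groups identical keys together before they reach the reducer.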
Python environment preparation
Make sure Python is installed in your environment. If your Hadoop cluster does not ship with Python preinstalled, you must install it on every node, since the mapper and reducer scripts execute on the worker nodes.
Example: Word Count
We will demonstrate how to write a Hadoop MapReduce program in Python with the classic "word count" example, which counts the number of occurrences of each word in a given text file.
1. Mapper script
Create a file named mapper.py with the following content:
#!/usr/bin/env python
import sys

# Read each line from standard input
for line in sys.stdin:
    # Remove the line break at the end of the line
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Output (word, 1)
    for word in words:
        print(f'{word}\t1')
2. Reducer script
Create a file named reducer.py with the following content:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Read each line from standard input
for line in sys.stdin:
    # Remove the line break at the end of the line
    line = line.strip()
    # Parse the (word, count) pair emitted by the mapper
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # If count is not a number, ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Output (word, count)
            print(f'{current_word}\t{current_count}')
        current_count = count
        current_word = word

# Output the last word (if it exists)
if current_word == word:
    print(f'{current_word}\t{current_count}')
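Note that this current_word bookkeeping only works because Hadoop sorts the mapper output by key before it reaches the reducer, so all lines for a given word arrive consecutively. With some hypothetical intermediate data, the reducer would see and emit:

Reducer stdin (sorted by the framework):
hadoop	1
hello	1
hello	1

Reducer stdout:
hadoop	1
hello	2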
3. Run the MapReduce job
Assuming your input text file is already in HDFS, you can run the MapReduce job with the following command:
hadoop jar /path/to/hadoop-streaming.jar \
    -file ./mapper.py -mapper ./mapper.py \
    -file ./reducer.py -reducer ./reducer.py \
    -input /path/to/input -output /path/to/output
Here, /path/to/hadoop-streaming.jar is the path to the Hadoop Streaming JAR file, which you need to replace to match your installation. The -input and -output parameters specify the HDFS input and output directories, respectively.
Through Hadoop Streaming, we can write Hadoop MapReduce programs in scripting languages such as Python without writing any Java code. This approach not only lowers the barrier to entry, but also improves development efficiency. I hope this article helps you better understand and use Hadoop Streaming for big data processing.
In the Hadoop ecosystem, MapReduce is a programming model for processing and generating large data sets. Although Hadoop primarily supports Java for writing MapReduce programs, they can also be written in other languages, including Python, through Hadoop Streaming. Hadoop Streaming is a tool that allows users to create and run MapReduce jobs that read and write data through standard input and output streams.
Method supplement
Here is a simple MapReduce program using native Python that counts the number of occurrences of each word in a text file.
1. Environmental preparation
Make sure Hadoop is installed in your environment and configured correctly so that the hadoop commands can be run. In addition, ensure that a Python environment is available.
2. Write Mapper scripts
The Mapper script is responsible for processing input data and generating key-value pairs. In this example, we output each word as a key and the number 1 as a value.
#!/usr/bin/env python
import sys

def read_input(file):
    for line in file:
        # Strip the trailing newline and split the line into words
        yield line.strip().split()

def main():
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print(f"{word}\t1")

if __name__ == "__main__":
    main()
Save the above code as mapper.py.
3. Write Reducer Scripts
The Reducer script receives key-value pairs from Mapper and summarizes the values of the same key. Here we will count the total number of occurrences of each word.
#!/usr/bin/env python
import sys

def read_input(file):
    for line in file:
        # Strip the trailing newline and split on the tab separator
        yield line.strip().split('\t')

def main():
    current_word = None
    current_count = 0
    word = None
    for line in sys.stdin:
        word, count = next(read_input([line]))
        try:
            count = int(count)
        except ValueError:
            # If count is not a number, ignore this line
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print(f"{current_word}\t{current_count}")
            current_count = count
            current_word = word
    # Output the last word (if it exists)
    if current_word == word:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    main()
Save the above code as reducer.py.
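As a side note, the same reducer logic can be written more idiomatically with itertools.groupby, which takes care of the "current word" bookkeeping. This is a sketch of an alternative, not part of the original example:

#!/usr/bin/env python
import sys
from itertools import groupby
from operator import itemgetter

def read_pairs(file):
    for line in file:
        # Each input line is "word<TAB>count"
        yield line.strip().split('\t', 1)

def main():
    # groupby relies on the input already being sorted by key,
    # which Hadoop's shuffle phase guarantees
    for word, group in groupby(read_pairs(sys.stdin), key=itemgetter(0)):
        try:
            total = sum(int(count) for _, count in group)
        except ValueError:
            # Skip malformed lines
            continue
        print(f"{word}\t{total}")

if __name__ == "__main__":
    main()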
4. Prepare the input data
Suppose we have a text file called input.txt with the following content:
hello world
hello hadoop
mapreduce is fun
fun with hadoop
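Before the job can read this file, it must be uploaded to HDFS. Assuming the hypothetical HDFS path /path/to/input used in the command below, the upload might look like this:

hadoop fs -mkdir -p /path/to/input
hadoop fs -put input.txt /path/to/input/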
5. Run the MapReduce job
Use the Hadoop Streaming command to run this MapReduce job. First, make sure the input file has been uploaded to your Hadoop cluster, as shown above. Then execute the following command:
hadoop jar /path/to/hadoop-streaming.jar \
    -file ./mapper.py -mapper "python mapper.py" \
    -file ./reducer.py -reducer "python reducer.py" \
    -input /path/to/input \
    -output /path/to/output
Here, /path/to/hadoop-streaming.jar is the path to the Hadoop Streaming JAR file; replace it to match your installation. Likewise, /path/to/input and /path/to/output need to be replaced with your actual HDFS paths.
6. View the results
After the job is completed, you can view the results in the specified output directory. For example, use the following command to view the output:
hadoop fs -cat /path/to/output/part-00000
This will display a list of each word and its occurrences.
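For the sample input.txt above, the listing should look like this (with a single reducer, Streaming output is sorted by key):

fun	2
hadoop	2
hello	2
is	1
mapreduce	1
with	1
world	1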
The above is a basic example of writing a Hadoop MapReduce program using native Python. In this way, you can take advantage of Python's simplicity and rich library support to handle big data tasks. As noted, this approach rests entirely on Hadoop Streaming, which allows Mappers and Reducers to be written as any executable or script (Python, Perl, and so on); the next section looks at how that works.
Hadoop Streaming Principle
Hadoop Streaming works by passing input data to the Mapper script via standard input (stdin) and collecting the Mapper's output via standard output (stdout). Similarly, the Reducer script receives the Mapper's output, sorted by key by the framework, on its standard input and emits the final results on standard output.
Sample MapReduce written in Python
Suppose we want to count the number of times each word appears in a text file. Here is how to write a MapReduce program like this in Python:
1. Mapper script (mapper.py)
#!/usr/bin/env python
import sys

# Read standard input
for line in sys.stdin:
    # Remove the line break at the end of the line
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Output (word, 1)
    for word in words:
        print(f"{word}\t1")
2. Reducer script (reducer.py)
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Read data from standard input
for line in sys.stdin:
    line = line.strip()
    # Parse the input pair from the mapper
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # If count is not a number, ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Output (word, count)
            print(f"{current_word}\t{current_count}")
        current_count = count
        current_word = word

# Output the last word (if needed)
if current_word == word:
    print(f"{current_word}\t{current_count}")
3. Run the MapReduce job
To run this MapReduce job, you need to make sure your Hadoop cluster is set up and you have permission to submit the job. You can submit the job using the following command:
hadoop jar /path/to/hadoop-streaming.jar \
    -file ./mapper.py -mapper ./mapper.py \
    -file ./reducer.py -reducer ./reducer.py \
    -input /path/to/input/files \
    -output /path/to/output
Here, /path/to/hadoop-streaming.jar is the path to the Hadoop Streaming JAR file. The -file parameter specifies a local file that needs to be shipped to the Hadoop cluster. The -mapper and -reducer parameters specify the Mapper and Reducer scripts, respectively. The -input and -output parameters specify the input and output directories.
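As an aside, on recent Hadoop releases the -file option is deprecated in favor of the generic -files option, which takes a comma-separated list of local files to ship with the job. An equivalent invocation would look roughly like this, using the same hypothetical paths as above:

hadoop jar /path/to/hadoop-streaming.jar \
    -files ./mapper.py,./reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /path/to/input/files \
    -output /path/to/output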
Things to note
Make sure your Python scripts have executable permissions, which can be set via chmod +x mapper.py reducer.py.
When processing large amounts of data, keep the data skew problem in mind and design your key-value pairs so that no individual reducer carries an excessive load.
When testing Mapper and Reducer scripts, first debug them with small-scale data in a local environment, as shown below.
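For example, a quick local sanity check might look like this; sort stands in for Hadoop's shuffle phase (a sketch, assuming the mapper.py and reducer.py scripts above):

echo "foo foo bar" | python mapper.py | sort | python reducer.py
# Expected output:
# bar	1
# foo	2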
The above is the detailed content of writing Hadoop MapReduce programs using native Python. For more information about Python and Hadoop MapReduce, please follow my other related articles!