Preface
One day my friend A complained to me: his boss had asked him to take the resume information from hundreds of filled-in Word forms and organize it into Excel. I watched him copy the name, age, and so on out of each Word form and paste it into Excel one field at a time, silently cursing his boss all the while. But as a newcomer he could hardly defy the boss and say "I won't do it, take it or leave it", so he came to me for help. I told him this was an easy one: a little Python would solve it, and Python is simple to pick up. With that, on to the main topic.
Idea: first analyze a single Word form
How can Python get at the information inside a Word table? My first idea was to convert the Word table into a web page. Having dabbled in web crawlers for years, I find it fairly easy to pull information out of a page with regular expressions. But converting Word to web pages turned out to be maddening: opening hundreds of documents and saving each one as a web page is a lot of manual labor too. After a long search online, I found that a docx file is itself a compressed archive, and that opening it as a zip reveals a file dedicated to storing the document's body text.
Open the archive and you will find that all the information we want is stored in a file named document.xml, inside the word/ folder.
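To see this for yourself, a minimal sketch like the one below lists the files stored inside a docx archive. So that it runs without a real resume on disk, it first builds a tiny stand-in archive in memory. The helper name list_docx_members and the sample XML are assumptions made up for illustration, and a real docx contains more parts than this.

```python
import io
import zipfile


def list_docx_members(docx_file):
    """Return the names of the files stored inside a .docx archive."""
    with zipfile.ZipFile(docx_file, "r") as z:
        return z.namelist()


# Build a tiny stand-in archive in memory (a real .docx has more parts).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<w:document><w:t>Name</w:t></w:document>")
buffer.seek(0)

members = list_docx_members(buffer)
print(members)
```

With a real file you would pass its path instead of the in-memory buffer.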
So the basic process is:
1. Open the docx archive
2. Get the body text of the Word document
3. Match the information we want using regular expressions
4. Store the information in a txt file (txt can be opened in Excel)
5. Call the above process in a batch to extract all 10,000 resumes
6. (Check for errors or missing data)
0x01 Getting docx information
Use Python's zipfile library together with the re library to process the files inside the docx archive.
import zipfile
import re

def get_document(filepath):
    # A .docx file is a zip archive; the body text lives in word/document.xml
    with zipfile.ZipFile(filepath, "r") as z:
        text = z.read("word/document.xml").decode("UTF-8")
    text = re.sub(r"<.*?>", "", text)  # Remove all XML tags
    ### If multiple resumes are in the same Word file ###
    # table_list = text.split("XX Resume")[1:]  # Split each resume by its title
    # return table_list
    return text
Print text to check the result.
At this point, all the relevant information in the resume has been extracted.
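One thing to keep in mind: stripping every tag with a regular expression also glues neighbouring cells together with no separator between them. As an alternative sketch (not the method used in this article), the standard-library XML parser can return the text of each table cell separately. The function name get_cell_texts and the sample XML are assumptions for illustration; with nested tables, the inner text would also show up in the outer cell.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_cell_texts(document_xml):
    """Return the text of every table cell (w:tc) in document order."""
    root = ET.fromstring(document_xml)
    cells = []
    for cell in root.iter(W + "tc"):
        # A cell's text is spread over one or more w:t elements.
        cells.append("".join(t.text or "" for t in cell.iter(W + "t")))
    return cells


sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:tbl><w:tr>"
    "<w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>Zhang</w:t></w:r><w:r><w:t> San</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl></w:body></w:document>"
)

cells = get_cell_texts(sample_xml)
print(cells)
```

If the form always puts a label cell next to its value cell, the values can then be picked out by position instead of by pattern.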
0x02 Grab the value of each field
Next, grab the value of each field from this text. Each pattern captures whatever sits between one field label and the next.
import re

def get_field_value(text):
    value_list = []
    # The labels below must match the field labels in your form exactly
    m = re.findall(r"Name(.*?)Gender", text)
    value_list.append(m)
    m = re.findall(r"Gender(.*?)Education", text)
    value_list.append(m)
    m = re.findall(r"Ethnicity(.*?)Health status", text)
    value_list.append(m)
    '''
    Omit other field matches here
    '''
    return value_list
This returns the content matched by each field as a list.
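To make the matching step concrete, here is a small sketch run against a made-up line of text. The helper extract_between and the sample labels are hypothetical, not part of the code in this article. It returns an empty string when a label is missing, which avoids an index error on an empty match list.

```python
import re


def extract_between(text, start_label, end_label):
    """Return the text between two labels, or "" if the labels are not found."""
    pattern = re.escape(start_label) + r"(.*?)" + re.escape(end_label)
    m = re.search(pattern, text, re.S)  # re.S lets "." match newlines too
    return m.group(1).strip() if m else ""


sample_text = "Name Zhang San Gender Male Ethnicity Han Health status Good"

name = extract_between(sample_text, "Name", "Gender")
gender = extract_between(sample_text, "Gender", "Ethnicity")
missing = extract_between(sample_text, "Phone", "Email")
print(name, gender, repr(missing))
```

The non-greedy (.*?) matters here: a greedy (.*) would run on to the last occurrence of the end label instead of stopping at the first.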
0x03 Write contents to file
Next, write the contents of this list to a txt file.
str1 = ""
for value in value_list:
    str1 = str1 + str(value[0]) + "\t"  # Each field value is separated by a tab \t
str1 = str1 + "\n"
with open("", "a+") as f:  # Append to the output txt file; the file name is left blank here, fill in your own
    f.write(str1)
The above converts one Word file to txt.
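If a field value can itself contain a tab or a line break, plain string concatenation will shift the columns in Excel. One possible alternative, sketched below with the standard csv module, quotes such values automatically. The function write_row and the sample values are assumptions for illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def write_row(f, values):
    """Write one resume as a tab-separated line."""
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(values)


out = io.StringIO()  # stands in for open(path, "a", encoding="utf-8", newline="")
write_row(out, ["Zhang San", "Male", "Han"])
write_row(out, ["Li Si", "Female", "line1\tline2"])
result = out.getvalue()
print(result)
```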
Now just batch-process every file in the folder and you're done!
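As a rough sketch of the file-collection part of that batch step, pathlib can gather the docx files in a folder. The function find_docx_files is a hypothetical helper, and the temporary folder with empty files exists only so the example runs on its own. Skipping names that start with ~$ is there because Word leaves such lock files next to documents that are open.

```python
import tempfile
from pathlib import Path


def find_docx_files(folder):
    """Return the .docx files directly inside folder, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".docx"
        and not p.name.startswith("~$")  # skip Word's temporary lock files
    )


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.docx", "b.DOCX", "notes.txt", "~$a.docx"):
        (Path(tmp) / name).write_bytes(b"")
    found = [p.name for p in find_docx_files(tmp)]

print(found)
```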
0x04 Batch Processing Complete Code
The full code is attached below
import re
import zipfile
import os

def get_document(filepath):
    # A .docx file is a zip archive; the body text lives in word/document.xml
    with zipfile.ZipFile(filepath, "r") as z:
        text = z.read("word/document.xml").decode("UTF-8")
    text = re.sub(r"<.*?>", "", text)  # Remove all XML tags
    ### If multiple resumes are in the same Word file ###
    table_list = text.split("XX Resume")[1:]  # Split each resume by its title
    return table_list

def get_field_value(text):
    value_list = []
    # The labels below must match the field labels in your form exactly
    m = re.findall(r"Name(.*?)Gender", text)
    value_list.append(m)
    m = re.findall(r"Gender(.*?)Education", text)
    value_list.append(m)
    m = re.findall(r"Ethnicity(.*?)Health status", text)
    value_list.append(m)
    '''
    Omit other field matches here
    '''
    return value_list

cv_list = []
for i in os.listdir(os.getcwd()):
    path = os.path.join(os.getcwd(), i)  # Full path of each file in the current directory
    a = os.path.splitext(path)  # Split into (root, extension)
    if a[1] == '.docx':  # If the file extension is .docx
        print(path)
        cv_list = cv_list + get_document(path)  # Each resume becomes one list element

for i in cv_list:
    value_list = get_field_value(i)
    str1 = ""
    for value in value_list:
        str1 = str1 + str(value[0]) + "\t"
    str1 = str1 + "\n"
    with open("", "a+") as f:  # Output txt file name is left blank here, fill in your own
        f.write(str1)
With the 10,000 Word-form resumes converted to txt, just open the txt file in Excel.
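For the optional checking step in the process list, a sketch along the following lines could flag rows with missing or empty fields before you open the file in Excel. It assumes the row format produced above, where every field is followed by a tab. The function name find_incomplete_rows and the sample rows are made up for illustration.

```python
def find_incomplete_rows(lines, expected_fields):
    """Return (line_number, fields) for rows with missing or empty fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        # Every field is followed by a tab, so drop the empty piece after the last tab.
        fields = line.rstrip("\n").split("\t")[:-1]
        if len(fields) != expected_fields or any(not f.strip() for f in fields):
            problems.append((number, fields))
    return problems


sample_lines = [
    "Zhang San\tMale\tHan\t\n",
    "Li Si\t\tHan\t\n",
    "Wang Wu\tMale\t\n",
]
problems = find_incomplete_rows(sample_lines, expected_fields=3)
print(problems)
```

With a real output file you would pass the open file object as lines.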
Supplement: some Python operations on Word tables
Data format (datas): a list of lists
aa = [[1, 2, 3, 4, 5], [6, 7, 8, 9], [], …]
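Because the inner lists can have different lengths, it may help to bring every row to the table's column count before filling the table. The sketch below pads short rows with empty strings and rejects rows that are too long. The function normalize_rows and the column count of 5 are assumptions chosen for the example.

```python
def normalize_rows(datas, cols):
    """Pad short rows with "" so each row has exactly cols values."""
    rows = []
    for index, row in enumerate(datas):
        if len(row) > cols:
            raise ValueError(
                f"row {index} has {len(row)} values, table has {cols} columns"
            )
        rows.append([str(v) for v in row] + [""] * (cols - len(row)))
    return rows


aa = [[1, 2, 3, 4, 5], [6, 7, 8, 9], []]
rows = normalize_rows(aa, cols=5)
print(rows)
```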
import os
import requests
import json
import datetime
from docx import Document
from docx.shared import Inches, Pt, Cm
from docx.oxml.ns import qn
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

def create_insert_word_table(datas, stday, etday, s):
    """Create a Word table and insert the data"""
    doc = Document()
    doc.styles['Normal'].font.name = 'Calibri'  # Font used for Western (Latin) text
    doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'SimSun')  # Font used for Chinese text (SimSun, i.e. the Song typeface)
    # doc.styles['Normal'].font.size = Pt(14)  # Set all text font size to 14
    distance = Inches(0.5)
    sec = doc.sections[0]  # sections corresponds to the "sections" in the document
    sec.left_margin = distance  # Set the left, right, top, and bottom page margins, in that order
    sec.right_margin = distance
    sec.top_margin = distance
    sec.bottom_margin = distance
    sec.page_width = Inches(11.7)  # Set the page width
    # sec.page_height = Inches(9)  # Set the page height
    # doc.add_heading()  # Sets a heading, but it did not meet my needs, so I used p.add_run('I am text') as below
    p = doc.add_paragraph()  # Add a paragraph
    p.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER  # Center the paragraph
    run = p.add_run('I am words')
    run.font.size = Pt(22)
    doc.add_paragraph()  # Add an empty paragraph
    # Add the table
    table = doc.add_table(rows=1, cols=10, style='Table Grid')
    table.style = 'Table Grid'
    table.style.font.size = Pt(14)
    table.rows[0].height = Cm(20)
    title = table.rows[0].cells
    title[0].text = 'Name'
    title[1].text = '1'
    title[2].text = '2'
    title[3].text = '3'
    title[4].text = '4'
    title[5].text = '5'
    title[6].text = '6 '
    title[7].text = '7'
    title[8].text = '8'
    title[9].text = '9'
    for i in range(len(datas)):
        cels = table.add_row().cells
        for j in range(len(datas[i])):
            # cels[j].text = str(datas[i][j])
            p = cels[j].paragraphs[0]
            p.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER  # Center the cell text
            p.add_run(str(datas[i][j]))
            ph_format = p.paragraph_format
            # ph_format.space_before = Pt(10)  # Set the spacing before the paragraph
            # ph_format.space_after = Pt(12)  # Set the spacing after the paragraph
            ph_format.line_spacing = Pt(40)  # Set the line spacing
    doc.save('./files/project-summary.docx')
Generated example
Possible error: [Errno 13] Permission denied: './files/project-progress-summary.docx'
This happens because the file you are writing to is still open (for example in Word). Close it and the error goes away.
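If you would rather not lose the run when this happens, one option is to catch the error and save under a different name. The sketch below is only an illustration: save_with_fallback is a hypothetical helper, and fake_save stands in for a real save call such as the save method of a python-docx document, pretending that the first file is locked.

```python
from pathlib import Path


def save_with_fallback(save, path):
    """Call save(path); if the file is locked, save under a new name instead."""
    path = Path(path)
    try:
        save(str(path))
        return path
    except PermissionError:
        fallback = path.with_name(path.stem + "-copy" + path.suffix)
        save(str(fallback))
        return fallback


# Stand-in for a real save call that behaves as if the first file were open in Word.
saved = []


def fake_save(target):
    if target.endswith("summary.docx"):
        raise PermissionError(13, "Permission denied", target)
    saved.append(target)


result = save_with_fallback(fake_save, "files/summary.docx")
print(result)
```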
The above is my personal experience. I hope it serves as a useful reference, and I hope for your continued support. If anything is wrong or not fully thought through, please don't hesitate to point it out.