Processing 10,000 word form resume operations using python

preamble

One day my friend A complained to me, his boss asked him to hundreds of word filled out word form resume information organized into excel, watched him one by one will be the name, age ...... from the word form to copy and paste into the excel, while pasting the heart while secretly cursing their bosses ...... But after all, the novice white, and can not go against the wishes of the boss said I do not do it, love it, so over to me for help. I said, this is a good thing to do ah, learn python can be solved ah, simple and easy to get started. Well, next to enter the main topic.

Idea: first analyze for each word form

How can I use python to get to the information inside the word table, the initial idea is to turn the word inside the table into a web page format, after all, mixing crawler shallow water for many years, using regular expressions to deal with the web page to get the information is relatively easy, so the idea of word into a web page format, so that the whole thing went crazy, hundreds of documents to open and then turn it into a web page, that there is also A lot of labor ah. So in the online search for a long time, found that the docx file itself is a compressed file, open the zip archive found that there is a special storage word inside the text of the file.

Open the file and find that all the information we want is hidden in a file named

So the basic process can be determined

1. Open the docx archive

2. Get the body of the word inside the information

3. Match the information we want using regular expressions

4. Store information in txt (txt can be opened in excel)

5. Batch call the above process to complete the extraction of 10,000 resumes

6. (Checking for errors or missing data)

0x01 Getting docx information

Utilizes python's zipfile library as well as the re library to process the information in the files inside a docx archive.

import zipfile
import re
def get_document(filepath):
  z = (filepath, "r")
  text = ("word/").decode("UTF-8")
  text = (r"<.*?>", "", text)# Remove all markers from xml
  ###If multiple resumes are in the same word file###
  #table_list = ("XX Resumes")[1:]#Split each resume by its title.
  #return table_list
  return text

Print the result of text

Since then, all relevant information from the resume has been exported

0x02 Grab the value of each field

Next grab the value of each field based on this relevant information

import re
def get_field_value(text):
  value_list = []
  m = (r"be surnamed place (e.g. among winners)(.*?)suffix forming noun from adjective, corresponding -ness or -ity  pin", table)
  value_list.append(m)
  m = (r"suffix forming noun from adjective, corresponding -ness or -ity  pin(.*?)study  pass through", table)
  value_list.append(m)
  m = (r"nationality ethnicity(.*?)health status", table)
  value_list.append(m)  
  '''
  Omit other field matches here
  '''
  return value_list

This returns the content matched by each field as a list

0x03 Write contents to file

Next, write the contents of this list to txt

str1 = ""
for value in value_list:
  str1 = str1 + str(value[0]) + "\t"# Each field value is separated by a tab \t
str1 = str1 + "\n"
with open("", "a+") as f:# Write the contents as an append to the
  (str1)

The above is converting a word to txt

Just batch process the files in the folder again and you're ok!

0x04 Batch Processing Complete Code

The full code is attached below

import re
import zipfile
import os
def get_document(filepath):
  z = (filepath, "r")
  text = ("word/").decode("UTF-8")
  text = (r"<.*?>", "", text)# Remove all markers from xml
  ###If multiple resumes are in the same word file###
  table_list = ("XX Resume")[1:]#Split each resume by its headline
  return table_list
def get_field_value(text):
  value_list = []
  m = (r"be surnamed place (e.g. among winners)(.*?)suffix forming noun from adjective, corresponding -ness or -ity  pin", table)
  value_list.append(m)
  m = (r"suffix forming noun from adjective, corresponding -ness or -ity  pin(.*?)science  pass through", table)
  value_list.append(m)
  m = (r"nationality ethnicity(.*?)health status", table)
  value_list.append(m)  
  '''
  Omit other field matches here
  '''
  return value_list
cv_list = []
for i in (()):
  a = (() + "\\" + i)# Get the filenames of all files in the current directory
  if a[1] == '.docx':# If the file suffix
    print(()+"\\"+i)
    cv_list = cv_list + get_document(() + "\\" + i)# Each resume information as a list element
for i in cv_list:
  value_list = get_field_value(i)
  str1 = ""
  for value in value_list:
    str1 = str1 + str(value[0]) + "\t"
  str1 = str1 + "\n"
  with open("", "a+") as f:
    (str1)

10,000 word form resume info converted to txt, then just open the txt in excel.

Supplementary: python word table some operations

Data format (datas): list set list

aa =[ [1,2,3,4,5],[6,7,8,9],[]…]

import os
import requests
import json
import datetime
from docx import Document
from  import Inches, Pt, Cm
from  import qn
from  import WD_PARAGRAPH_ALIGNMENT
def create_insert_word_table(datas, stday, etday, s):
  """Creating word tables and inserting data"""
  doc = Document()
  ['Normal']. = 'Calibri' # is used to set the font when the text is in Spanish.
  ['Normal']._element.(qn('w:eastAsia'), u'Song Style') # is used to set the font when the text is in Chinese
  # ['Normal']. = Pt(14) # Set all text font size to 14
  distance = Inches(0.5)
  sec = [0] # sections corresponds to "sections" in the document.
  sec.left_margin = distance # Set the left, right, top, and bottom page margins in the following order
  sec.right_margin = distance
  sec.top_margin = distance
  sec.bottom_margin = distance
  sec.page_width = Inches(11.7) # Set the page width
  # sec.page_height = Inches(9) # set the page height
  # doc.add_heading() # set the heading, but it doesn't meet my criteria, can only try the following p.add_run('I am text')
  p = doc.add_paragraph() # Add paragraph
   = WD_PARAGRAPH_ALIGNMENT.CENTER # Setting up central alignment
  run = p.add_run('I am words')
   = Pt(22)
  doc.add_paragraph() # Add empty paragraph
  # Add table
  table = doc.add_table(rows=1, cols=10, style='Table Grid')
   = 'Table Grid'
   = Pt(14)
  [0].height = Cm(20)
  title = [0].cells
  title[0].text = 'Name'
  title[1].text = '1'
  title[2].text = '2'
  title[3].text = '3'
  title[4].text = '4'
  title[5].text = '5'
  title[6].text = '6 '
  title[7].text = '7'
  title[8].text = '8'
  title[9].text = '9'
  for i in range(len(datas)):
    cels = table.add_row().cells
    for j in range(len(datas[i])):
      # cels[j].text = str(datas[i][j])
      p = cels[j].paragraphs[0]
       = WD_PARAGRAPH_ALIGNMENT.CENTER # Setting up central alignment
      p.add_run(str(datas[i][j]))
      ph_format = p.paragraph_format
      # ph_format.space_before = Pt(10) # Set the paragraph before spacing
      # ph_format.space_after = Pt(12) # Set the space after the paragraph
      ph_format.line_spacing = Pt(40) # Set line spacing
  ('. /files/project-summary.docx')

Generate Example

Possible error, [Errno 13] Permission denied: '. /files/project-progress-summary.docx'

It's because you can't close the file you're opening, close it and you'll be fine.

The above is a personal experience, I hope it can give you a reference, and I hope you can support me more. If there is any mistake or something that has not been fully considered, please do not hesitate to advise me.