Example of using Python to batch-read Word documents and organize key information into Excel sheets

Goal

Our lab recently formed a computer interest group.

The idea is to encourage everyone to document and share more of their problem-solving experience, much like blogging on CSDN.

The group is only just getting started, but more and more of these experience records will accumulate over time.

That is why it is important to get the template design right from the beginning (as shown below).

[Image: Python batch-reading Word documents and organizing key information into Excel]

A consistent template makes it easier to build an electronic database later, so that others can quickly search for relevant records.

As the saying goes, "Life is short, I use Python."

So I decided to use Python to extract the header information from each docx document and then write it into an xls spreadsheet like the one below (just posting the result).

[Image: Python batch-reading Word documents and organizing key information into Excel]

Clicking a file path in the spreadsheet opens the corresponding file directly (the cells contain hyperlinks).

[Image: Python batch-reading Word documents and organizing key information into Excel]

Code implementation

1. Capture header information from docx files

# -*- coding:utf-8 -*-

# readfile.py
# This program scans a docx file in the log folder and returns its basic information.
# Requires the python-docx package.

from docx import Document

test_d = '../log/sublime building an IDE for python.docx'

def docxInfo(addr):
    document = Document(addr)

    info = {'title':[],
            'keywords':[],
            'author':[],
            'date':[],
            'question':[]}

    # Collect the text of every paragraph in document order
    lines = [paragraph.text for paragraph in document.paragraphs]

    # Record the line number of each header label.
    # The labels must match the ones used in the document template.
    index = [0 for i in range(5)]
    k = 0
    for line in lines:
        if line.startswith('Title'):
            index[0] = k
        if line.startswith('Keywords'):
            index[1] = k
        if line.startswith('The Author'):
            index[2] = k
        if line.startswith('Date'):
            index[3] = k
        if line.startswith('Description of the problem'):
            index[4] = k
        k = k+1

    # The value of each field is the line right after its label
    info['title'] = lines[index[0]+1]

    # Keywords are all lines between 'Keywords' and 'The Author'
    keywords = []
    for line in lines[index[1]+1:index[2]]:
        keywords.append(line)
    info['keywords'] = keywords

    info['author'] = lines[index[2]+1]

    info['date'] = lines[index[3]+1]

    info['question'] = lines[index[4]+1]

    return info

if __name__ == '__main__':
    print(docxInfo(test_d))
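
The lookup in docxInfo depends only on the order of the paragraphs, so the idea can be tried out without a Word file. The following is a minimal sketch rather than the original program: parse_header, HEADERS, and the sample data are hypothetical, and it assumes each label sits in its own paragraph with its value in the paragraph(s) that follow.

```python
# Hypothetical mapping from template labels to result keys; the labels are
# assumed to be the same ones docxInfo searches for.
HEADERS = {
    'Title': 'title',
    'Keywords': 'keywords',
    'The Author': 'author',
    'Date': 'date',
    'Description of the problem': 'question',
}

def parse_header(lines):
    """Collect the non-empty paragraphs that follow each known label."""
    info = {key: [] for key in HEADERS.values()}
    current = None  # key of the section currently being filled, if any
    for line in lines:
        text = line.strip()
        label = next((h for h in HEADERS if text.startswith(h)), None)
        if label is not None:
            current = HEADERS[label]      # a new section starts here
        elif current is not None and text:
            info[current].append(text)    # value line of the current section
    return info

# Made-up paragraphs standing in for document.paragraphs
sample = [
    'Title', 'sublime building an IDE for python',
    'Keywords', 'sublime', 'python',
    'The Author', 'Example Author',
    'Date', '2018-05-01',
    'Description of the problem', 'How to run Python inside Sublime.',
]
print(parse_header(sample)['keywords'])
```

Unlike docxInfo, this sketch keeps every paragraph up to the next label, so with a real document everything after the last label would end up under question. It also treats any value line that happens to begin with a label word as a new label.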

2. Traverse the log folder and update the information

# -*- coding:utf-8 -*-

# This program batch-scans the files in the log folder. For each docx document
# it calls docxInfo() from readfile.py to extract the document information and
# saves it to 'digger log list.xls' for quick retrieval at a later stage.
# Requires the xlrd, xlwt and xlutils packages.

import os
import time
import xlrd
import xlwt
from readfile import docxInfo
from xlutils.copy import copy

# Open the log list and read the update date of the most recent record.
memo_d = '../log/digger log list.xls'
memo = xlrd.open_workbook(memo_d) # Read the excel file
sheet0 = memo.sheet_by_index(0) # Read the 1st sheet
memo_date = sheet0.col_values(5) # Read the update-time column (index 5)
memo_n = len(memo_date) # Number of rows; also the next free row index
latest_date = 0 # With no records yet, every document counts as new
if memo_n>0:
    last_value = sheet0.cell_value(memo_n-1,5) # Timestamp of the last record
    if isinstance(last_value,(int,float)): # Ignore a text title cell
        latest_date = last_value

# Create a writable copy of the workbook
memo_new = copy(memo)
sheet1 = memo_new.get_sheet(0)

# Rebuild hyperlinks
hyperlinks = sheet0.col_values(6) # xlrd reads hyperlinks as plain text, so the links are lost.
n_hyperlink = len(hyperlinks)
# Start at row 1, assuming row 0 is the title row
for k in range(1,n_hyperlink):
    link = 'HYPERLINK("%s";"%s")' %(hyperlinks[k],hyperlinks[k])
    sheet1.write(k,6,xlwt.Formula(link))

# Determine the file suffix
def endWith(s,*endstring):
    # str.endswith accepts a tuple of candidate suffixes
    return s.endswith(endstring)

# Traverse the log folder and query
log_d = '../log'
logFiles = os.listdir(log_d)
for file in logFiles:
    if endWith(file,'.docx'):
        # Last modification time of the file, as a timestamp
        timestamp = os.path.getmtime(log_d+'/'+file)
        if timestamp>latest_date:
            info = docxInfo(log_d+'/'+file)
            sheet1.write(memo_n,0,info['title'])
            keywords_text = ','.join(info['keywords'])
            sheet1.write(memo_n,1,keywords_text)
            sheet1.write(memo_n,2,info['author'])
            sheet1.write(memo_n,3,info['date'])
            sheet1.write(memo_n,4,info['question'])
            # Get the current time
            time_now = time.time() # Floating point value, in seconds
            sheet1.write(memo_n,5,time_now)
            link = 'HYPERLINK("%s";"%s")' %(file,file)
            sheet1.write(memo_n,6,xlwt.Formula(link))
            memo_n = memo_n+1
os.remove(memo_d) # Delete the old list before saving the updated copy
memo_new.save(memo_d)
print('memo was updated!')
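
The core of the update step is the timestamp comparison: only documents that changed after the last recorded update are scanned again. Below is a minimal sketch of that filter in isolation. It assumes the file's modification time is the timestamp of interest, and new_docx_files is a hypothetical helper, not part of the original script.

```python
import os

def new_docx_files(folder, latest_date):
    """Return the names of .docx files in folder modified after latest_date.

    latest_date is a timestamp in seconds since the epoch, the same kind
    of value that time.time() returns.
    """
    found = []
    for name in sorted(os.listdir(folder)):
        # Skip non-docx files and Word's temporary lock files ('~$...').
        if not name.endswith('.docx') or name.startswith('~$'):
            continue
        if os.path.getmtime(os.path.join(folder, name)) > latest_date:
            found.append(name)
    return found
```

Calling new_docx_files('../log', latest_date) would then give the list of documents that still need a row in the log list. Skipping the '~$' files is an addition of this sketch: Word creates them while a document is open, and they cannot be parsed as real documents.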

In fact, there are better modules for working with spreadsheets, such as pandas, xlsxwriter, and openpyxl. However, the code above already does the job, and as a busy grad student I don't have much time to write and debug code, so I'll update it later when I have time.

Acknowledgements

I borrowed heavily from the experience shared by the experts on the CSDN forums along the way!