I have a .docx file with 40 pages of text, and I want to separate the pages and import the content of each one into a list. Is this possible? The only way I have found is to look for the empty entries in my list of paragraphs, but an empty entry does not always mean a page break. My code collects the text that follows a paragraph starting with "Subject" and stops when it reaches an empty entry.
The problem is that a page break also shows up as an empty entry, so I need a way to recognise page breaks in my code. Thanks in advance.
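import os

import docx  # provided by the python-docx package


def read(name):
    """Return the text of every paragraph in the .docx file as a list of strings."""
    doc = docx.Document(name)
    text = []
    for par in doc.paragraphs:
        text.append(par.text)
    return text
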
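for basename in os.listdir('files'):
    path = os.path.join('files', basename)
    jerk = read(path)  # all paragraph texts of this file
    lari = []          # indices of the paragraphs that start with "Subject"
    vaccum = []        # paragraphs collected after each "Subject" paragraph
    indices = []
    for idx, par in enumerate(jerk):
        # same test as `not par.find('Subject')`: the paragraph starts with "Subject"
        if par.startswith('Subject'):
            lari.append(idx)
            indices.append(idx)
    for idx in lari:
        for k in range(20):
            # stop at the first empty paragraph, or at the end of the document;
            # a page break also arrives here as an empty paragraph
            if idx + k + 1 >= len(jerk) or jerk[idx + k] == '':
                break
            vaccum.append(jerk[idx + k + 1])
    final = []
    var = ''
    for k in vaccum:
        var = var + k
        if k == '':
            final.append(var)
            var = ''
    print(vaccum)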
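
One possible way to recognise page breaks, sketched under assumptions rather than offered as a definitive solution: a .docx file is a zip archive whose main text is stored in word/document.xml, and an explicit (manual) page break appears there as a `w:br` element with `w:type="page"`. The sketch uses only the standard library instead of python-docx, and the names `split_pages` and `read_pages` are hypothetical. It does not see page breaks that Word computes from the layout or that come from "page break before" paragraph settings, and it assumes paragraphs are not nested inside other paragraphs (for example in text boxes).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_pages(document_xml):
    """Split document XML into pages at explicit page breaks.

    Returns a list of pages; each page is a list of paragraph strings.
    Only <w:br w:type="page"/> is treated as a page break (assumption).
    """
    root = ET.fromstring(document_xml)
    pages = [[]]
    for par in root.iter(W_NS + "p"):
        parts = []      # text seen since the paragraph start or the last break
        broke = False   # whether this paragraph contained a page break
        for node in par.iter():
            if node.tag == W_NS + "t":
                parts.append(node.text or "")
            elif node.tag == W_NS + "br" and node.get(W_NS + "type") == "page":
                text = "".join(parts)
                if text:
                    pages[-1].append(text)  # text before the break stays on the old page
                parts = []
                pages.append([])            # start a new page
                broke = True
        text = "".join(parts)
        # keep ordinary empty paragraphs, but not the empty remainder of a break
        if text or not broke:
            pages[-1].append(text)
    return pages


def read_pages(path):
    """Read word/document.xml from a .docx file and split it into pages."""
    with zipfile.ZipFile(path) as zf:
        return split_pages(zf.read("word/document.xml"))
```

With a file on disk, `read_pages('files/example.docx')` (a hypothetical path) would return one list of paragraph strings per page, so the search for "Subject" can then run on each page separately instead of relying on empty entries.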
question from:
https://stackoverflow.com/questions/65602630/i-can-not-find-a-way-to-deal-with-new-pages-in-docx-using-python