Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
600 views
in Technique[技术] by (71.8m points)

python - Export Pandas data frame with text column containg utf-8 text and URLs to Excel

My Pandas data frame consists of Tweets and meta data of each tweet (300.000 rows). Some of my colleagues need to work with this data in Excel which is why I need to export it.

I wanted to use either .to_csv or .to_excel which are both provided by Pandas but I can't get it to work properly.

When I use .to_csv my problem is that it keeps failing in the text part of the data frame. I've played around with different separators but the file is never 100% aligned. The text column seems to contain tabs, pipe characters etc. which confuses Excel.

df.to_csv('test.csv', sep='', encoding='utf-8')

When I try to use .to_excel together with the xlsxwriter engine I'm confronted with a different problem, which is that my text column contains to many URLs (I think). xlswriter tries to make special clickable links of these URLs instead of just handling them as strings. I've found some information on how to circumvent this but, again, I can't get it to work.

The following bit of code should be used to disable the function that I think is causing trouble:

workbook = xlsxwriter.Workbook(filename, {'strings_to_urls': False})

However, when using to_excel I can't seem to adjust this setting of the Workbook object before I load the data frame into the Excel file.

In short how do I export a column with wildly varying text from a Pandas data frame to something that Excel understands?

edit: example:

@geertwilderspvv @telegraaf ach Wilders toch, nep-voorzitter van een nep-partij met maar één lid, 
zeur niet over nep-premier of parlement!

So in this case It is obviously a line brake that is my data. I will try to find some more examples.

edit2:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error047600_01.xml</logFileName><summary>Er zijn fouten aangetroffen in bestand C:UsersGuy MahieuDesktopVu ipython notebookpandas_simple.xlsx</summary><removedRecords summary="Hier volgt een lijst van verwijderde records:"><removedRecord>Verwijderde records: Formule van het onderdeel /xl/worksheets/sheet1.xml</removedRecord></removedRecords></recoveryLog>

Translation of Dutch stuff:

Errors were found in "file". Here follows a list of removed records: removed records: formula of the part /xl/worksheets/sheet1.xml

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I don't think it is currently possible to pass XlsxWriter constructor options via the Pandas API but you can workaround the strings_to_url issue as follows:

import pandas as pd

df = pd.DataFrame({'Data': ['http://python.org']})

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')

# Don't convert url-like strings to urls.
writer.book.strings_to_urls = False

# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.save()

Update: In recent version of Pandas you can pass XlsxWriter constructor options to ExcelWriter() directly and you do not need to set writer.book.strings_to_urls indirectly:

writer = pd.ExcelWriter('pandas_simple.xlsx', 
                        engine='xlsxwriter', 
                        options={'strings_to_urls': False})

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...