Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Change data type of columns in Pandas

I want to convert a table, represented as a list of lists, into a Pandas DataFrame.

As an extremely simplified example:

import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats?

Is there a way to specify the types while converting to DataFrame?

Or is it better to create the DataFrame first and then loop through the columns to change the type for each column?

Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type.

All I can guarantee is that each column contains values of the same type.

  ask by translate from so


1 Answer

You have three main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (eg strings) to a suitable numeric type.

    (See also to_datetime() and to_timedelta().)

  2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so).

    Also allows you to convert to categorical types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

Read on for more detailed explanations and usage of each of these methods.
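
Before the details, here is a minimal sketch that runs all three options on the same data. The frame and its column names "a" and "b" are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical sample data: both columns start out with the object dtype.
df = pd.DataFrame({"a": [7, 1, 5], "b": ["3", "2", "1"]}, dtype="object")

# 1. to_numeric(): parse the strings in column "b" into numbers.
b_numeric = pd.to_numeric(df["b"])

# 2. astype(): ask for one explicit target dtype.
b_float = df["b"].astype(float)

# 3. infer_objects(): soft conversion that only looks at the Python objects
#    already stored; it does not parse strings.
inferred = df.infer_objects()

print(b_numeric.dtype)
print(b_float.dtype)
print(inferred.dtypes)
```

Column "a" holds Python integers, so infer_objects() can give it an integer dtype, while column "b" holds strings and stays non-numeric until to_numeric() or astype() is applied.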


1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned.

Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric)

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.
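
Applied to the table from the question, this could look like the sketch below. It assumes the default integer column labels 0, 1 and 2 that pandas assigns when no column names are given.

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

# Column 0 holds labels; columns 1 and 2 hold numbers stored as strings.
df[[1, 2]] = df[[1, 2]].apply(pd.to_numeric)

print(df.dtypes)
```

Column 0 is left out on purpose, because its values cannot be parsed as numbers and to_numeric() would raise on it.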

Error handling

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can't convert a value.

In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value.

We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame but don't know which of your columns can be converted reliably to a numeric type. Note that errors='ignore' was deprecated in pandas 2.2 in favour of catching the exception explicitly, so check how your installed version behaves.

In that case just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame.

Columns that can be converted to a numeric type will be converted, while columns that cannot (eg they contain non-digit strings or dates) will be left alone.
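
On pandas versions where errors='ignore' emits a deprecation warning or is no longer accepted, the same column-by-column behaviour can be sketched with an explicit try/except. The helper name to_numeric_where_possible is hypothetical and not part of pandas.

```python
import pandas as pd

def to_numeric_where_possible(frame):
    """Return a copy in which every fully convertible column is numeric."""
    result = frame.copy()
    for col in result.columns:
        try:
            result[col] = pd.to_numeric(result[col])
        except (ValueError, TypeError):
            # At least one value could not be parsed: keep the column as it is.
            pass
    return result

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = to_numeric_where_possible(pd.DataFrame(a))
print(df.dtypes)
```

This keeps the "convert what you can" behaviour without naming any columns, which suits the case of hundreds of columns of unknown type.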

Downcasting

By default, conversion with to_numeric() will give you either an int64 or a float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32 or int8?

to_numeric() gives you the option to downcast to 'integer', 'signed', 'unsigned' or 'float'.

Here's an example for a simple series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32
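
To see the memory effect of downcasting, a small sketch can compare the bytes used before and after. It starts from an explicit int64 Series so that the starting point does not depend on the platform.

```python
import pandas as pd

s = pd.Series([1, 2, -7], dtype="int64")
small = pd.to_numeric(s, downcast="integer")

# memory_usage(index=False) reports the bytes used by the values only.
print(s.dtype, s.memory_usage(index=False))
print(small.dtype, small.memory_usage(index=False))

# 'unsigned' is only useful when no value is negative.
unsigned = pd.to_numeric(pd.Series([1, 2, 7]), downcast="unsigned")
print(unsigned.dtype)
```

The saving is tiny for three values but scales with the number of rows, which is where downcasting starts to matter.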

2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have.

It's very versatile in that you can try to go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (eg np.int16), some Python types (eg bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type (assumes `import numpy as np`)
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')
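
For the table from the question, a dict passed to astype() could be used as in the sketch below. It again assumes the default integer column labels, and the variable error_message is only there for illustration.

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

# Map column labels to target dtypes; unlisted columns are left unchanged.
df = df.astype({1: float, 2: float})
print(df.dtypes)

# Unlike to_numeric(errors='coerce'), astype() raises when a value
# cannot be converted, here the letters in column 0.
error_message = None
try:
    df[0].astype(float)
except (ValueError, TypeError) as exc:
    error_message = str(exc)
print("conversion failed:", error_message)
```

The drawback compared with to_numeric() is that the columns and their target types have to be named up front.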

Notice I said "try" - if

