Efficiently converting a column of strings into several columns of single characters in pandas

Question

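I have several fairly large DataFrames (> 1 million rows). One column holds strings of varying length. I want to split these strings into single characters, with each character placed in its own column.

I can do this with pd.DataFrame.apply() - see below - but it is too slow to use in practice (and it also has a tendency to crash the kernel).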

import pandas as pd

df = pd.DataFrame(['AAVFD','TYU?W_Z', 'SomeOtherString', 'ETC.'], columns = ['One'])

print(df)
    One
0   AAVFD
1   TYU?W_Z
2   SomeOtherString
3   ETC.

Converting the strings into lists of varying length:

S1 = df.One.apply(list)
print(S1)
0                                  [A, A, V, F, D]
1                            [T, Y, U, ?, W, _, Z]
2    [S, o, m, e, O, t, h, e, r, S, t, r, i, n, g]
3                                     [E, T, C, .]
Name: One, dtype: object

Putting each character into its own column:

df2 = pd.DataFrame(S1.values.tolist())
print(df2)
   0  1  2  3     4     5     6     7     8     9    10    11    12    13  \
0  A  A  V  F     D  None  None  None  None  None  None  None  None  None   
1  T  Y  U  ?     W     _     Z  None  None  None  None  None  None  None   
2  S  o  m  e     O     t     h     e     r     S     t     r     i     n   
3  E  T  C  .  None  None  None  None  None  None  None  None  None  None   

     14  
0  None  
1  None  
2     g  
3  None

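Unfortunately, this is slow. It seems I should be able to vectorize this somehow by working directly on the numpy array underlying the df.One column. When I tried that, though, the varying lengths of the strings got in the way.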

Answer 1

I know next to nothing about pandas, but the numpy side of the operation can be done like this (on Python 3; on Python 2 use 'S1' instead of 'U1'):

npchrs = df.values.astype(str).view('U1')
# array([['A', 'A', 'V', 'F', 'D', '', '', '', '', '', '', '', '', '', ''],
#        ['T', 'Y', 'U', '?', 'W', '_', 'Z', '', '', '', '', '', '', '', ''],
#        ['S', 'o', 'm', 'e', 'O', 't', 'h', 'e', 'r', 'S', 't', 'r', 'i', 'n', 'g'],
#        ['E', 'T', 'C', '.', '', '', '', '', '', '', '', '', '', '', '']],
#       dtype='<U1')
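As a rough illustration of why the view works: astype(str) produces a fixed-width unicode array whose width is the longest string, and viewing that buffer as 'U1' splits every fixed-width slot into single characters, with shorter strings padded by empty strings. The sketch below only inspects the intermediate arrays; the variable names fixed and chars are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(['AAVFD', 'TYU?W_Z', 'SomeOtherString', 'ETC.'], columns=['One'])

# One fixed-width unicode string per row; the width is the longest string (15)
fixed = df.values.astype(str)
print(fixed.dtype.kind, fixed.dtype.itemsize, fixed.shape)

# Reinterpret the same buffer as 1-character strings: each row now has 15 slots
chars = fixed.view('U1')
print(chars.dtype.itemsize, chars.shape)

# Slots past the end of a shorter string come out as empty strings
print(repr(chars[0, 4]), repr(chars[0, 5]))
```

No characters are copied by the view itself; the copy happens in astype(str).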

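If you can live with empty strings instead of Nones, or if replacing them is easy in pandas, you can convert this back to a df and be done.

According to @COLDSPEED's timings, the following step is slow, so it is better to avoid it if you can. If not: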
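One possible way to do the replacement on the pandas side, as a sketch rather than a benchmarked recommendation: build the DataFrame from the character array and use DataFrame.mask to blank out the empty strings. Note that mask fills with NaN by default rather than None, which is an assumption about what counts as an acceptable missing-value marker here.

```python
import pandas as pd

df = pd.DataFrame(['AAVFD', 'TYU?W_Z', 'SomeOtherString', 'ETC.'], columns=['One'])

# Single-character array, padded with '' (same expression as above)
npchrs = df.values.astype(str).view('U1')

chars = pd.DataFrame(npchrs)
# mask() replaces every cell where the condition is True with NaN
padded = chars.mask(chars == '')
print(padded)
```

Whether this is faster than the object-array route shown next would need to be timed on real data.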

npobjs = npchrs.astype(object)
npobjs[npobjs==''] = None
# array([['A', 'A', 'V', 'F', 'D', None, None, None, None, None, None, None,
#         None, None, None],
#        ['T', 'Y', 'U', '?', 'W', '_', 'Z', None, None, None, None, None,
#         None, None, None],
#        ['S', 'o', 'm', 'e', 'O', 't', 'h', 'e', 'r', 'S', 't', 'r', 'i', 'n', 'g'],
#        ['E', 'T', 'C', '.', None, None, None, None, None, None, None, None,
#         None, None, None]], dtype=object)

Answer 2

An alternative using a list comprehension, which I think should be quite fast -

df = pd.DataFrame([list(x) for x in df.One])
df

  0  1  2  3     4     5     6     7     8     9     10    11    12    13  \
0  A  A  V  F     D  None  None  None  None  None  None  None  None  None   
1  T  Y  U  ?     W     _     Z  None  None  None  None  None  None  None   
2  S  o  m  e     O     t     h     e     r     S     t     r     i     n   
3  E  T  C  .  None  None  None  None  None  None  None  None  None  None   

     14  
0  None  
1  None  
2     g  
3  None

Timings

df = pd.concat([df] * 10000, ignore_index=True)

# original answer
%timeit pd.DataFrame(df.One.apply(list).values.tolist())
10 loops, best of 3: 36.1 ms per loop

# Paul Panzer's answer
%%timeit
npchrs = df.values.astype(str).view('U1')
npobjs = npchrs.astype(object)
npobjs[npobjs==''] = None
pd.DataFrame(npobjs)

10 loops, best of 3: 37.5 ms per loop

# My list comp answer 
%timeit pd.DataFrame([list(x) for x in df.One.values])
10 loops, best of 3: 32.8 ms per loop

# improved version of Paul Panzer's answer
%timeit pd.DataFrame(df.values.astype(str).view('U1'))
10 loops, best of 3: 20.1 ms per loop

Disclaimer - timings vary with the data, Python version, environment, and OS.
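Since the numbers depend on the setup, it may be worth rerunning the comparison locally. Here is a minimal sketch that does so with the standard timeit module instead of the IPython %timeit magic; the names base, big, candidates and results, and the repeat counts, are arbitrary choices made for this example.

```python
import timeit

import pandas as pd

base = pd.DataFrame(['AAVFD', 'TYU?W_Z', 'SomeOtherString', 'ETC.'], columns=['One'])
big = pd.concat([base] * 10000, ignore_index=True)

candidates = {
    'apply(list)': lambda: pd.DataFrame(big.One.apply(list).values.tolist()),
    'list comprehension': lambda: pd.DataFrame([list(x) for x in big.One.values]),
    "view('U1')": lambda: pd.DataFrame(big.values.astype(str).view('U1')),
}

results = {}
for name, func in candidates.items():
    # Best of 3 repeats, 3 calls per repeat, reported as seconds per call
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    results[name] = best
    print('{}: {:.1f} ms per call'.format(name, best * 1e3))
```

The view('U1') variant leaves empty strings rather than None in the padding, so it is not a like-for-like comparison with the other two.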

Answer 3

Here's one approach using string-join, np.fromstring and masking (idea borrowed from this post) -

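import numpy as np
import pandas as pd

def join_mask(df):
    # Written for Python 2, where str is a byte string (hence dtype='S1')
    lens = np.array([len(i) for i in df.One])
    n = lens.max()
    out = np.full((len(df),n), None)
    # Fill only the positions that fall within each string's length
    out[lens[:,None] > np.arange(n)] = np.fromstring(''.join(df.One), dtype='S1')
    return pd.DataFrame(out)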

Sample run -

In [160]: df
Out[160]: 
               One
0            AAVFD
1          TYU?W_Z
2  SomeOtherString
3             ETC.

In [161]: join_mask(df)
Out[161]: 
  0  1  2  3     4     5     6     7     8     9     10    11    12    13    14
0  A  A  V  F     D  None  None  None  None  None  None  None  None  None  None
1  T  Y  U  ?     W     _     Z  None  None  None  None  None  None  None  None
2  S  o  m  e     O     t     h     e     r     S     t     r     i     n     g
3  E  T  C  .  None  None  None  None  None  None  None  None  None  None  None
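The join_mask function above passes dtype='S1', which matches Python 2 byte strings. A minimal sketch of the same join-and-mask idea for Python 3 might look like the following. The function name join_mask_py3 and the col parameter are hypothetical, and np.fromstring is swapped for a plain list() over the joined string, so this is a variation on the approach rather than the original code, and its speed has not been measured here.

```python
import numpy as np
import pandas as pd

def join_mask_py3(df, col='One'):
    strings = df[col].tolist()
    lens = np.array([len(s) for s in strings])
    n = lens.max()
    # Object array pre-filled with None, one slot per character position
    out = np.full((len(strings), n), None, dtype=object)
    # True where the column index is smaller than that row's string length
    mask = lens[:, None] > np.arange(n)
    # Boolean assignment fills the True slots in row-major order, which is
    # exactly the order of the characters in the joined string
    out[mask] = list(''.join(strings))
    return pd.DataFrame(out)

df = pd.DataFrame(['AAVFD', 'TYU?W_Z', 'SomeOtherString', 'ETC.'], columns=['One'])
result = join_mask_py3(df)
print(result)
```

The mask has as many True entries as there are characters in total, which is why the flat character list lines up with it without any per-row loop over the output.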

Timings

Timings for the approaches that produce the correct None-filled output df, using @cᴏʟᴅsᴘᴇᴇᴅ's timing setup -

In [173]: df = pd.concat([df] * 10000, ignore_index=True)

# original answer
In [175]: %timeit pd.DataFrame(df.One.apply(list).values.tolist())
10 loops, best of 3: 27.2 ms per loop

# @Paul Panzer's answer
In [176]: %%timeit
     ...: npchrs = df.values.astype(str).view('S1')
     ...: npobjs = npchrs.astype(object)
     ...: npobjs[npobjs==''] = None
     ...: pd.DataFrame(npobjs)
10 loops, best of 3: 20.3 ms per loop

# @cᴏʟᴅsᴘᴇᴇᴅ's answer 
In [177]: %timeit pd.DataFrame([list(x) for x in df.One.values])
10 loops, best of 3: 27.6 ms per loop

# Using solution in this post
In [178]: %timeit join_mask(df)
100 loops, best of 3: 13.8 ms per loop
