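I have a specific file format holding data from a CNC (work center). It is saved like a .txt file. I want to read this table into a pandas DataFrame, but I have never seen this format before.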
_MASCHINENNUMMER : >0-251-11-0950/51< SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL : >BST 500< DATUM : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21 ! 1!62! 0.000! 0.000! 0.000!
[V11]
A12 ! -1!62! 0.000! -160.000! 0.000!
A12 ! 2!62! 0.000! -128.000! 3.000! 70.0
A12 ! -3!62! 0.000! -96.000! 0.000!
A12 ! 4!62! 0.000! -64.000! 0.000!
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[V11]
O11 ! -9!62! 0.000! -96.000! 0.000!
O11 ! 10!62! 0.000! -128.000! 5.000! 70.0
Question: 1. Is it possible to read this and convert it to a pandas DataFrame? How do I do this?
Expected output:
Two pandas DataFrames. First:
---------------------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR ! TYPE
----------+----------+----------+----------+-----------+-------------------------------
A21 ! 1!62! 0.000! 0.000! 0.000! !NoValidForUse
A12 ! -1!62! 0.000! -160.000! 0.000! !V11
A12 ! 2!62! 0.000! -128.000! 3.000! 70.0 !V11
A12 ! -3!62! 0.000! -96.000! 0.000! !V11
A12 ! 4!62! 0.000! -64.000! 0.000! !V11
Second:
---------------------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR ! TYPE
----------+----------+----------+----------+-----------+-------------------------------
O11 ! -9!62! 0.000! -96.000! 0.000! !V11
O11 ! 10!62! 0.000! -128.000! 5.000! 70.0 !V11
The headers of DataFrame 1 and DataFrame 2 can differ:
_MASCHINENNUMMER : >0-251-11-0950/51< SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL : >BST 500< DATUM : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21 ! 1!62! 0.000! 0.000! 0.000!
[V11]
A12 ! -1!62! 0.000! -160.000! 0.000!
A12 ! 2!62! 0.000! -128.000! 3.000! 70.0
A12 ! -3!62! 0.000! -96.000! 0.000!
---------------------------------------------------------------------------
*BOHRKOPF ! !X-POS !Y-POS ! !
----------+----------+----------+----------+-----------+-------------------
[V11]
O11 ! ! 0.000! -96.000! !
O11 ! ! 0.000! -128.000! !
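Yes, this is possible, but the details really depend on the data:
- read the file with read_csv, skipping the first 3 rows (skiprows) and the whitespace after each separator (skipinitialspace)
- strip the trailing whitespace from the column names
- create the TYPE column with extract (the value between [ and ]) and forward-fill it to the following rows
- number each DataFrame block with startswith and cumsum
- remove the rows whose first column starts with [, -- or *, using contains

import pandas as pd

# file is the path to the CNC text file
df = pd.read_csv(file, sep="!", skiprows=3, skipinitialspace=True)
df.columns = df.columns.str.strip()
# label between [ and ], copied down to the rows that follow it
df['TYPE'] = df['*BOHRKOPF'].str.extract(r'\[(.*)\]', expand=False).ffill()
# every row starting with * opens a new block
df['G'] = df['*BOHRKOPF'].str.startswith('*').cumsum()
# drop label rows, separator rows and repeated header rows
df = df[~df['*BOHRKOPF'].str.contains(r'^\[|^--|^\*')]
print (df)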
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR \
2 A21 1 62 0.000 0.000 0.000 NaN
4 A12 -1 62 0.000 -160.000 0.000 NaN
5 A12 2 62 0.000 -128.000 3.000 70.0
6 A12 -3 62 0.000 -96.000 0.000 NaN
7 A12 4 62 0.000 -64.000 0.000 NaN
12 O11 -9 62 0.000 -96.000 0.000 NaN
13 O11 10 62 0.000 -128.000 5.000 70.0
TYPE G
2 NoValidForUse 0
4 V11 0
5 V11 0
6 V11 0
7 V11 0
12 V11 1
13 V11 1
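To see how the TYPE and G helper columns arise, here is a minimal sketch on a hand-made miniature of the first column. The Series values and the names first_col, type_label, block_id and is_marker are illustrative assumptions, not something read from the real file.

```python
import pandas as pd

# Hypothetical miniature of the first column after read_csv.
first_col = pd.Series(
    ["[NoValidForUse]", "A21", "[V11]", "A12", "*BOHRKOPF", "[V11]", "O11"]
)

# Text between [ and ] becomes the label; other rows get NaN,
# and ffill copies the last seen label down to them.
type_label = first_col.str.extract(r"\[(.*)\]", expand=False).ffill()

# Each row starting with * opens a new block, so the running sum
# of that boolean mask works as a block id.
block_id = first_col.str.startswith("*").cumsum()

# Rows starting with [, -- or * are markers, not data.
is_marker = first_col.str.contains(r"^\[|^--|^\*")

result = pd.DataFrame({"head": first_col, "TYPE": type_label, "G": block_id})
result = result[~is_marker]
print(result)
```

Only the rows A21, A12 and O11 survive the filter, each carrying the label and block id of the markers above it.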
Then filter by the G column:
df1 = df[df['G'] == 0].drop('G', axis=1)
print (df1)
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR \
2 A21 1 62 0.000 0.000 0.000 NaN
4 A12 -1 62 0.000 -160.000 0.000 NaN
5 A12 2 62 0.000 -128.000 3.000 70.0
6 A12 -3 62 0.000 -96.000 0.000 NaN
7 A12 4 62 0.000 -64.000 0.000 NaN
TYPE
2 NoValidForUse
4 V11
5 V11
6 V11
7 V11
df2 = df[df['G'] == 1].drop('G', axis=1)
print (df2)
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR TYPE
12 O11 -9 62 0.000 -96.000 0.000 NaN V11
13 O11 10 62 0.000 -128.000 5.000 70.0 V11
If the file contains several DataFrames, a list comprehension can build a list of DataFrames:
dfs = [v.drop('G', axis=1) for k, v in df.groupby('G')]
print (dfs[0])
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR \
2 A21 1 62 0.000 0.000 0.000 NaN
4 A12 -1 62 0.000 -160.000 0.000 NaN
5 A12 2 62 0.000 -128.000 3.000 70.0
6 A12 -3 62 0.000 -96.000 0.000 NaN
7 A12 4 62 0.000 -64.000 0.000 NaN
TYPE
2 NoValidForUse
4 V11
5 V11
6 V11
7 V11
print (dfs[1])
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR TYPE
12 O11 -9 62 0.000 -96.000 0.000 NaN V11
13 O11 10 62 0.000 -128.000 5.000 70.0 V11
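If looking a block up by its G value is more convenient than by list position, the same groupby split can fill a dict instead. This is only a sketch of that variation; the small DataFrame and the name blocks are assumptions made up for the illustration.

```python
import pandas as pd

# Hypothetical stand-in for df after the marker rows were removed.
df = pd.DataFrame({
    "*BOHRKOPF": ["A21", "A12", "O11"],
    "TYPE": ["NoValidForUse", "V11", "V11"],
    "G": [0, 0, 1],
})

# One entry per block id; the helper column G is dropped from each part.
blocks = {key: part.drop(columns="G") for key, part in df.groupby("G")}

print(blocks[0])
print(blocks[1])
```

The dict keeps the block id visible, which can help when some blocks in a file turn out to be empty and the list positions would shift.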
EDIT:
temp=u"""_MASCHINENNUMMER : >0-251-11-0950/51< SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL : >BST 500< DATUM : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21 ! 1!62! 0.000! 0.000! 0.000!
[V11]
A12 ! -1!62! 0.000! -160.000! 0.000!
A12 ! 2!62! 0.000! -128.000! 3.000! 70.0
A12 ! -3!62! 0.000! -96.000! 0.000!
A12 ! 4!62! 0.000! -64.000! 0.000!
---------------------------------------------------------------------------
*BOHRKOPF ! !X-POS !Y-POS ! !
----------+----------+----------+----------+-----------+-------------------
[V11]
O11 ! ! 0.000! -96.000! !
O11 ! ! 0.000! -128.000! ! """
Add the parameter header=None to get default column names:
import io

# after testing, replace io.StringIO(temp) with the file name, e.g. 'filename.csv'
df = pd.read_csv(io.StringIO(temp), sep="!", skiprows=3, skipinitialspace=True, header=None)
df['TYPE'] = df[0].str.extract(r'\[(.*)\]', expand=False).ffill()
df['G'] = df[0].str.startswith('*').cumsum()
# do not remove the rows starting with *; they hold each block's header
df = df[~df[0].str.contains(r'^\[|^--')]
print (df)
0 1 2 3 4 5 \
0 *BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER
3 A21 1 62 0.000 0.000 0.000
5 A12 -1 62 0.000 -160.000 0.000
6 A12 2 62 0.000 -128.000 3.000
7 A12 -3 62 0.000 -96.000 0.000
8 A12 4 62 0.000 -64.000 0.000
10 *BOHRKOPF NaN X-POS Y-POS NaN NaN
13 O11 NaN 0.000 -96.000 NaN NaN
14 O11 NaN 0.000 -128.000 NaN NaN
6 TYPE G
0 KOMMENTAR NaN 1
3 NaN NoValidForUse 1
5 NaN V11 1
6 70.0 V11 1
7 NaN V11 1
8 NaN V11 1
10 NaN V11 2
13 NaN V11 2
14 NaN V11 2
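In the loop, for each group: drop column G, rename the columns from the group's first row (excluding its last 2 entries, TYPE and G), remove that first row with iloc, and if necessary drop the columns that contain only NaN with dropna: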
dfs = [v.drop('G', axis=1).rename(columns=v.iloc[0, :-2]).iloc[1:].dropna(axis=1, how='all') for k, v in df.groupby('G')]
print (dfs[0])
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR \
3 A21 1 62 0.000 0.000 0.000 NaN
5 A12 -1 62 0.000 -160.000 0.000 NaN
6 A12 2 62 0.000 -128.000 3.000 70.0
7 A12 -3 62 0.000 -96.000 0.000 NaN
8 A12 4 62 0.000 -64.000 0.000 NaN
TYPE
3 NoValidForUse
5 V11
6 V11
7 V11
8 V11
print (dfs[1])
*BOHRKOPF X-POS Y-POS TYPE
13 O11 0.000 -96.000 V11
14 O11 0.000 -128.000 V11
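With header=None the header rows are read as ordinary data, so the values in each block are still strings, and skipinitialspace only removes the whitespace in front of a field, not behind it. A possible clean-up step, sketched here under those assumptions, strips the names and values and converts the columns that are fully numeric. The function name tidy and the sample block are hypothetical.

```python
import pandas as pd

# Hypothetical block shaped like dfs[1]: all values are strings,
# some with trailing spaces left over from the fixed-width layout.
block = pd.DataFrame({
    "*BOHRKOPF ": ["O11 ", "O11 "],
    "X-POS ": ["0.000", "0.000"],
    "Y-POS ": ["-96.000", "-128.000"],
    "TYPE": ["V11", "V11"],
})

def tidy(block):
    """Strip whitespace and convert fully numeric columns to numbers."""
    out = block.rename(columns=lambda c: str(c).strip())
    for col in out.columns:
        # Strip strings only; leave NaN and other objects untouched.
        stripped = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
        numeric = pd.to_numeric(stripped, errors="coerce")
        # Adopt the numeric version only if no non-missing value failed to parse.
        if numeric.notna().sum() == stripped.notna().sum():
            out[col] = numeric
        else:
            out[col] = stripped
    return out

clean = tidy(block)
print(clean.dtypes)
```

It could be applied to every block with something like [tidy(d) for d in dfs]; text columns such as *BOHRKOPF and TYPE are left as strings.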