我的数据集如下(摘录):
2.000 Company A 8.876 0,02
248 Enterprise B 26.028 0,07
193
dasdasdasd (asasas) sdasdasd
adsadsd asdasd asasa asassaas asas
asas asas 31. January 2018 (continue)
asdasd – 99,00% (31. March 2017 – 99,98%) (continue)
amasdasd asas
asasas asas
asas asssssssssss
DDD
asdasdads in %
asdasd adasd asddasad
(continue)
415 Company C Ltd. 21.412 0,06
668 Enterprise D AG 17.332 0,05
1.240 Company E GmbH 31.394 0,09
798 Enterprise OHG 52.586 0,14
我只想提取那些我有“数字字符串数字数字”的行,以便最终我的数据如下所示:
Column 1 Column 2 Column 3 Colum 4
2.000 Company A 8.876 0,02
248 Enterprise B 26.028 0,07
415 Company C Ltd. 21.412 0,06
668 Enterprise D AG 17.332 0,05
1.240 Company E GmbH 31.394 0,09
798 Enterprise OHG 52.586 0,14
任何想法怎么做?基本上,我特别需要帮助的地方是创建正则表达式,以过滤这些行并将提取的信息写入数据框,以便我可以对此进行一些分析。
您可以尝试:
data = """2.000 Company A 8.876 0,02
248 Enterprise B 26.028 0,07
193
dasdasdasd (asasas) sdasdasd
adsadsd asdasd asasa asassaas asas
asas asas 31. January 2018 (continue)
asdasd – 99,00% (31. March 2017 – 99,98%) (continue)
amasdasd asas
asasas asas
asas asssssssssss
DDD
asdasdads in %
asdasd adasd asddasad
(continue)
415 Company C Ltd. 21.412 0,06
668 Enterprise D AG 17.332 0,05
1.240 Company E GmbH 31.394 0,09
798 Enterprise OHG 52.586 0,14"""
reader = StringIO(data)
pattern = re.compile(r'([\d\.\,]+)\s+(\D*)([\d\.\,]+)\s([\d\.\,]+)$')
rows = []
for row in reader:
match = pattern.search(row)
if match:
rows.append([match.group(1), match.group(2), match.group(3), match.group(4)])
df = pd.DataFrame(rows, columns=["Column 1", "Column 2", "Column 3", "Column 4"])
我可以为您提供所需查询的正则表达式:
这将满足您的要求,