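I think I have searched everywhere, but if I missed something, please let me know.
I am trying to import a CSV file in which all non-numeric values are wrapped in ". I have run into a problem with:
df = pd.read_csv("file.csv")
Sample CSV: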
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""
Because the last two rows contain extra quotes and commas, pandas sees more than 4 columns there (5 or 6, for example).
I have already tried
df = pd.read_csv("file.csv", quotechar='"', quoting=2)
but got
ParserError: Error tokenizing data (...)
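What does work is skipping the bad lines with
error_bad_lines=False
but I would rather keep all of the data than simply drop part of it.
Any help is much appreciated!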
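This looks like malformed CSV data, because the " characters inside values should be escaped. I have most often seen this done by doubling them or by prefixing them with \. See https://en.wikipedia.org/wiki/Comma-separated_values#cite_ref-13
The first thing I would do is fix whatever exports these files. If you can't do that, you can work around the problem by escaping every " that is part of a value.
Your best bet is probably to assume that a " is followed (or preceded) by a comma or a newline only when it marks the end (or start) of a value. Then you can use a regex (working from memory, so it may not be 100% right, but it should give you the idea; you will have to adapt it to whatever regex library you have handy):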
s/([^,\n])"([^,\n])/$1""$2/g
So if you run it over your sample file, the result is escaped like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company ""MoscowMining"" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company"" Jankowski,A,B"""
Or, using the following instead:
s/([^,\n])"([^,\n])/$1\"$2/g
the file would be escaped like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski,A,B\""
Depending on your CSV parser, one of these should be accepted and work as expected.
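As a rough illustration, here is a minimal Python sketch of both escaping variants applied to the sample from the question. It is only a sketch under the same assumption as above (a " next to a comma or line break is a field delimiter), and it uses lookarounds instead of the capture groups in the s/// patterns, so that two inner quotes in a row are each matched. The names RAW, INNER_QUOTE and escape_inner_quotes are made up for this example, and parsing is done with the standard csv module so the snippet runs without pandas.

```python
import csv
import io
import re

# Sample data from the question, with unescaped quotes inside the last field.
RAW = '''"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""
'''

# Assumption: a quote is part of a value when it is neither preceded nor
# followed by a comma or a line break. Lookarounds do not consume the
# neighbouring characters, so back-to-back inner quotes are each matched.
INNER_QUOTE = re.compile(r'(?<=[^,\r\n])"(?=[^,\r\n])')


def escape_inner_quotes(text, replacement):
    # A function replacement keeps re from interpreting backslashes itself.
    return INNER_QUOTE.sub(lambda match: replacement, text)


# Variant 1: double the inner quotes.
doubled = escape_inner_quotes(RAW, '""')
# Variant 2: prefix the inner quotes with a backslash.
backslashed = escape_inner_quotes(RAW, '\\"')

# Doubled quotes are what csv.reader expects by default.
rows_doubled = list(csv.reader(io.StringIO(doubled)))
# Backslash escapes need escapechar set and quote doubling turned off.
rows_backslashed = list(
    csv.reader(io.StringIO(backslashed), escapechar="\\", doublequote=False)
)

for row in rows_doubled:
    print(len(row), row[-1])
```

With pandas, the repaired text can be passed on in the same way, for example pd.read_csv(io.StringIO(doubled)), or pd.read_csv(io.StringIO(backslashed), escapechar="\\", doublequote=False) for the backslash variant. Keep in mind that the heuristic cannot recover a value in which an inner " sits directly next to a comma, and that the backslash variant will misread any backslashes already present in the data.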
If, as @exe suggests, your CSV parser also requires commas inside values to be escaped, you can apply a similar regex to the commas.
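Whether commas need escaping at all depends on the parser. The short sketch below uses Python's standard csv module and a made-up two-row sample (ESCAPED and the "Lyon" row are hypothetical) to show two things: with an escape character configured, \, and \" inside a quoted field are read back as plain characters, and without any escaping a comma inside a quoted field is already treated as part of the value.

```python
import csv
import io

# Hypothetical sample: quotes and commas inside the value are backslash-escaped.
ESCAPED = r'''"City","Company Name"
"Moscow","Company \"MoscowMining\" Owner1\, Owner2\, Owner3"
'''

# escapechar tells the reader that a backslash makes the next character literal.
rows = list(csv.reader(io.StringIO(ESCAPED), escapechar="\\", doublequote=False))

# For comparison: an unescaped comma inside a quoted field is kept in the value.
plain = list(csv.reader(io.StringIO('"Lyon","A, B"\n')))

print(rows[1])
print(plain[0])
```

So for this parser, escaping the commas is harmless but not required, as long as the quotes themselves are escaped correctly.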
If I understand correctly, what you need is to escape the quotes and commas before pandas reads the CSV.
Like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1\, Owner2\, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski\,A\,B\""