String encoding issue in Spark SQL/DataFrame

Question

I have a CSV file with two columns: id (int) and name (string). I read the file into PySpark with the following code:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)])
df = sqlContext.read.csv("file.csv", header=False, schema=schema)

When I run

df.first()

I get the following output:

Row(artistid=1240105, artistname=u'Andr\xe9 Visior')

This is the original line in the file:

1240105,André Visior

How can I display the name as it is?

Answer 1

Open the CSV file and save it again as CSV (UTF-8).
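
If re-saving by hand is not practical, the same conversion can be scripted. The following is only a minimal sketch in Python 3 using the standard library. It assumes the source file is Latin-1 encoded, which the question does not confirm; the helper name `reencode_to_utf8` and the file names are hypothetical and chosen just for illustration.

```python
import os
import tempfile

def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Copy src_path to dst_path, converting from src_encoding to UTF-8.

    src_encoding="latin-1" is an assumption; use the file's real encoding.
    """
    # newline="" leaves the original line endings untouched
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Self-contained demonstration with a throwaway Latin-1 file
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, "file_latin1.csv")  # hypothetical name
dst_file = os.path.join(tmp_dir, "file_utf8.csv")    # hypothetical name
with open(src_file, "wb") as f:
    f.write(u"1240105,Andr\u00e9 Visior\n".encode("latin-1"))

reencode_to_utf8(src_file, dst_file)

# Read the result back as raw bytes to inspect the new encoding
with open(dst_file, "rb") as f:
    converted_bytes = f.read()
```

After the conversion, the é is stored as the two-byte UTF-8 sequence `\xc3\xa9` instead of the single Latin-1 byte `\xe9`, so a reader that expects UTF-8 can decode it.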

Answer 2

This is not a very clean approach, but here is a quick fix.

# Python 2 only: s is a byte string; the output below requires it to hold Latin-1 bytes
s = "1240105,André Visior"
s.decode('latin-1').encode("utf-8").replace("\xc3\xa9 ","e'")

>>
"1240105,Andre'Visior"

You may also want to look into Latin-1 to Unicode/ASCII conversion.
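
For the Latin-1 to Unicode/ASCII conversion mentioned above, here is a rough Python 3 sketch based on the standard `unicodedata` module. This is a different technique from the byte replacement in the quick fix: it decodes the bytes to text and then drops combining accent marks. The helper name `strip_accents` is hypothetical, and the input is again assumed to be Latin-1.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accent marks removed (for example 'é' becomes 'e')."""
    # NFKD decomposes 'é' into 'e' followed by a combining acute accent (U+0301)
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

raw = b"1240105,Andr\xe9 Visior"  # bytes as they would appear in a Latin-1 file (assumed)
name = raw.decode("latin-1")      # Unicode text, with a real 'é' in it
ascii_name = strip_accents(name)  # ASCII-only approximation
print(ascii_name)
```

Note that `\xe9` in the `Row(...)` output of the question is only the escaped representation of é (U+00E9) in a unicode string. Stripping accents is therefore needed only when a pure ASCII result is actually required.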
