Specifying a PySpark DataFrame schema with string columns longer than 256 characters

Question

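I am reading from a source whose descriptions are longer than 256 characters, and I want to write them to Redshift.

According to https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns this can only be done in Scala.

According to https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691 specifying the schema when creating the DataFrame should work as a workaround, but I can't get it to work.

How can I specify the schema with varchar(max)?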

from pyspark.sql.types import StructType, StructField, StringType

df = ...  # read from the source

schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True)
])

df = sqlContext.createDataFrame(df.rdd, schema)

Answer 1

The Redshift maxlength annotation is passed as column metadata in the following format:

{"maxlength":2048}

So this is the structure you should pass to the StructField constructor:

from pyspark.sql.types import StructField, StringType

StructField("description", StringType(), metadata={"maxlength":2048})

Or use the alias method:

from pyspark.sql.functions import col

col("description").alias("description", metadata={"maxlength":2048})

If you use PySpark 2.2 or earlier, see "How to change column metadata in pyspark?" for a workaround.
