Error when counting hashtags with PySpark

Question

I am working with a Twitter dataset. The data is in JSON format, with the following schema:

root
 |-- _id: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- lang: string (nullable = true)
 |-- place: struct (nullable = true)
 |    |-- bounding_box: struct (nullable = true)
 |    |    |-- coordinates: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |-- type: string (nullable = true)
 |    |-- country_code: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- place_type: string (nullable = true)
 |-- retweeted_status: struct (nullable = true)
 |    |-- _id: string (nullable = true)
 |    |-- user: struct (nullable = true)
 |    |    |-- followers_count: long (nullable = true)
 |    |    |-- friends_count: long (nullable = true)
 |    |    |-- id_str: string (nullable = true)
 |    |    |-- lang: string (nullable = true)
 |    |    |-- screen_name: string (nullable = true)
 |    |    |-- statuses_count: long (nullable = true)
 |-- text: string (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- followers_count: long (nullable = true)
 |    |-- friends_count: long (nullable = true)
 |    |-- id_str: string (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- screen_name: string (nullable = true)
 |    |-- statuses_count: long (nullable = true)

This is the code I am using to count the hashtags:

non_retweets = tweets.where("retweeted_status IS NULL")
hashtag = non_retweets.select('text').flatMap(lambda x: x.split(" ").filter(lambda x: x.startWith("#"))

hashtag = hashtag.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)

hashtag.collect()

This is the error I get:

 File "<ipython-input-112-11fd8cbc056d>",line 4
hashtag = hashtag.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
      ^
SyntaxError: Invalid syntax

I can't figure out what my mistake is. Please help!

Answer 1

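You forgot the closing ) at the end of the flatMap line. Because that parenthesis is never closed, Python reports the SyntaxError on the following line rather than on the line with the actual mistake. Check the code below.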

hashtag = non_retweets.select('text').flatMap(lambda x: x.split(" ").filter(lambda x: x.startWith("#")))
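Beyond the missing parenthesis, that line will probably still fail at runtime: Python strings have startswith rather than startWith, the list returned by split has no filter method, and each element coming out of select('text') is a Row rather than a plain string. Also, as far as I know, from Spark 2.0 onward a DataFrame no longer exposes flatMap directly, so you have to go through .rdd first. Here is a minimal sketch of how the counting could look under those assumptions. The helper names extract_hashtags and count_hashtags are made up for illustration and are not part of the original code.

```python
def extract_hashtags(text):
    """Return the whitespace-separated tokens of `text` that start with '#'."""
    if text is None:  # the schema marks `text` as nullable
        return []
    return [token for token in text.split() if token.startswith("#")]


def count_hashtags(non_retweets):
    """Count hashtags in a DataFrame that has a string column named `text`.

    Assumes `non_retweets` is a PySpark DataFrame, e.g. the result of
    tweets.where("retweeted_status IS NULL") from the question.
    """
    return (
        non_retweets.select("text")
        .rdd  # DataFrame -> RDD of Row objects
        .flatMap(lambda row: extract_hashtags(row["text"]))
        .map(lambda tag: (tag, 1))
        .reduceByKey(lambda x, y: x + y)
        .collect()  # list of (hashtag, count) tuples on the driver
    )
```

With the DataFrame from the question you would call count_hashtags(non_retweets). Keeping the string handling in a plain function means you can check it on an ordinary Python string without starting Spark.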
