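I'm working with Twitter data in Notepad++ (an XML file) and trying to remove the retweets. Each RT starts with '
The problem I'm running into is that sometimes “
Here is my regex for capturing single-line RTs: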
(
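Does anyone have an idea what I could add so that it also captures RTs (and their accompanying metadata) that span multiple lines?
Example RT (I changed some of the names and content, but the format is unchanged):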
<tweet id='827364918734' createdAt='2011-01-16T18:13:02.000Z' language='en' authorId='673829' authorUsername='exampleuser' authorName='example' authorVerified='TRUE' authorDescription='example description' authorLocation='example location' authorCreatedAt='2009-05-10T05:02:51.000Z' authorFollowersCount='830211' authorFollowingCount='1763' authorTweetCount='34209' authorListedCount='7589' referencedTweetId='26690653563912192' referencedTweetCreatedAt='2011-01-16T17:22:02.000Z' referencedTweetText='example reference tweet text' referencedTweetRetweetCount='9' referencedTweetReplyCount='0' referencedTweetLikeCount='2' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example' referencedTweetAuthorVerified='TRUE' referencedTweetAuthorDescription='example description
Check out @example, our new example' referencedTweetAuthorLocation='example' referencedTweetAuthorCreatedAt='2008-08-27T15:24:02.000Z' referencedTweetAuthorFollowersCount='1380523' referencedTweetAuthorFollowingCount='1035' referencedTweetAuthorTweetCount='402492' referencedTweetAuthorListedCount='22425' retweetCount='9' replyCount='0' likeCount='0' quoteCount='0' >RT @example this is an example RT </tweet>
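How about using Python instead of Notepad++?
The xml.etree.ElementTree library can parse the tweet, which is embedded directly in the code below.
It reads the attribute values and the RT text.
ElementTree is part of the Python standard library, so on standard CPython nothing needs to be installed. The pip install pycopy-xml.etree.ElementTree step is unnecessary there (that package appears to target the Pycopy runtime).
Save the following as get-tweet.py.
import xml.etree.ElementTree as ET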
xml = """\
<tweet id='827364918734' createdAt='2011-01-16T18:13:02.000Z' language='en' authorId='673829' authorUsername='exampleuser' authorName='example' authorVerified='TRUE' authorDescription='example description' authorLocation='example location' authorCreatedAt='2009-05-10T05:02:51.000Z' authorFollowersCount='830211' authorFollowingCount='1763' authorTweetCount='34209' authorListedCount='7589' referencedTweetId='26690653563912192' referencedTweetCreatedAt='2011-01-16T17:22:02.000Z' referencedTweetText='example reference tweet text' referencedTweetRetweetCount='9' referencedTweetReplyCount='0' referencedTweetLikeCount='2' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example' referencedTweetAuthorVerified='TRUE' referencedTweetAuthorDescription='example description
Check out @example, our new example' referencedTweetAuthorLocation='example' referencedTweetAuthorCreatedAt='2008-08-27T15:24:02.000Z' referencedTweetAuthorFollowersCount='1380523' referencedTweetAuthorFollowingCount='1035' referencedTweetAuthorTweetCount='402492' referencedTweetAuthorListedCount='22425' retweetCount='9' replyCount='0' likeCount='0' quoteCount='0' >RT @example this is an example RT </tweet>
"""
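# Parse the XML string; the root element is the <tweet> itself.
root = ET.fromstring(xml)
print("root: " + str(root))
print("root.tag: " + str(root.tag))
print("root.attrib: " + str(root.attrib))
print(type(root.attrib))
# Print every attribute (the tweet's metadata) as "key: value".
for key in root.attrib.keys():
    print(key +': '+root.attrib[key])
# The element text is the tweet body, here the RT text.
print("text: " + str(root.text))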
Run the script; it prints the following:
$ python get-tweet.py
root: <Element 'tweet' at 0x000001593FED13F0>
root.tag: tweet
root.attrib: {'id': '827364918734', 'createdAt': '2011-01-16T18:13:02.000Z', 'language': 'en', 'authorId': '673829', 'authorUsername': 'exampleuser', 'authorName': 'example', 'authorVerified': 'TRUE', 'authorDescription': 'example description', 'authorLocation': 'example location', 'authorCreatedAt': '2009-05-10T05:02:51.000Z', 'authorFollowersCount': '830211', 'authorFollowingCount': '1763', 'authorTweetCount': '34209', 'authorListedCount': '7589', 'referencedTweetId': '26690653563912192', 'referencedTweetCreatedAt': '2011-01-16T17:22:02.000Z', 'referencedTweetText': 'example reference tweet text', 'referencedTweetRetweetCount': '9', 'referencedTweetReplyCount': '0', 'referencedTweetLikeCount': '2', 'referencedTweetQuoteCount': '0', 'referencedTweetAuthorUsername': 'example', 'referencedTweetAuthorName': 'example', 'referencedTweetAuthorVerified': 'TRUE', 'referencedTweetAuthorDescription': 'example description Check out @example, our new example', 'referencedTweetAuthorLocation': 'example', 'referencedTweetAuthorCreatedAt': '2008-08-27T15:24:02.000Z', 'referencedTweetAuthorFollowersCount': '1380523', 'referencedTweetAuthorFollowingCount': '1035', 'referencedTweetAuthorTweetCount': '402492', 'referencedTweetAuthorListedCount': '22425', 'retweetCount': '9', 'replyCount': '0', 'likeCount': '0', 'quoteCount': '0'}
<class 'dict'>
id: 827364918734
createdAt: 2011-01-16T18:13:02.000Z
language: en
authorId: 673829
authorUsername: exampleuser
authorName: example
authorVerified: TRUE
authorDescription: example description
authorLocation: example location
authorCreatedAt: 2009-05-10T05:02:51.000Z
authorFollowersCount: 830211
authorFollowingCount: 1763
authorTweetCount: 34209
authorListedCount: 7589
referencedTweetId: 26690653563912192
referencedTweetCreatedAt: 2011-01-16T17:22:02.000Z
referencedTweetText: example reference tweet text
referencedTweetRetweetCount: 9
referencedTweetReplyCount: 0
referencedTweetLikeCount: 2
referencedTweetQuoteCount: 0
referencedTweetAuthorUsername: example
referencedTweetAuthorName: example
referencedTweetAuthorVerified: TRUE
referencedTweetAuthorDescription: example description Check out @example, our new example
referencedTweetAuthorLocation: example
referencedTweetAuthorCreatedAt: 2008-08-27T15:24:02.000Z
referencedTweetAuthorFollowersCount: 1380523
referencedTweetAuthorFollowingCount: 1035
referencedTweetAuthorTweetCount: 402492
referencedTweetAuthorListedCount: 22425
retweetCount: 9
replyCount: 0
likeCount: 0
quoteCount: 0
text: RT @example this is an example RT
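Note that referencedTweetAuthorDescription comes back on a single line even though it spans two lines in the source. Here is a small sketch of why, using only the standard library: the XML parser normalizes a literal line break inside an attribute value to a space, so a tweet that wraps across lines is still one element. The sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-line tweet: the line break sits inside an attribute value.
sample = "<tweet authorDescription='first line\nsecond line'>RT @example hi</tweet>"

elem = ET.fromstring(sample)

# The parser replaces the literal newline in the attribute with a space.
description = elem.attrib["authorDescription"]
print(repr(description))

# The element text is unaffected, so a retweet can be detected from it.
is_retweet = (elem.text or "").startswith("RT ")
print(is_retweet)
```

This is the reason a parser copes with the multi-line case that a line-based regex misses.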
If you would rather read from an XML file, the result is the same.
import xml.etree.ElementTree as ET
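# Parse the file instead of a string; the rest is unchanged.
tree = ET.parse('tweet_data.xml')
root = tree.getroot()
print("root: " + str(root))
print("root.tag: " + str(root.tag))
print("root.attrib: " + str(root.attrib))
print(type(root.attrib))
# Print every attribute (the tweet's metadata) as "key: value".
for key in root.attrib.keys():
    print(key +': '+root.attrib[key])
# The element text is the tweet body, here the RT text.
print("text: " + str(root.text))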
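To actually drop the retweets, one option is to parse the whole file and remove every tweet whose text starts with RT. This is only a sketch, and it rests on assumptions that are not in the question: that the tweets are wrapped in a single root element (called tweets here) and that a retweet is any tweet whose text begins with "RT @". The function name remove_retweets, the sample data and the output file name are hypothetical.

```python
import xml.etree.ElementTree as ET


def remove_retweets(root):
    """Remove <tweet> children whose text starts with 'RT @'; return how many were removed."""
    removed = 0
    # Iterate over a snapshot so that removing children does not disturb the loop.
    for tweet in list(root.findall("tweet")):
        text = (tweet.text or "").lstrip()
        if text.startswith("RT @"):
            root.remove(tweet)
            removed += 1
    return removed


# Hypothetical sample: the wrapping <tweets> element is an assumption.
sample = """<tweets>
<tweet id='1' authorDescription='spans
two lines'>RT @example this is an example RT </tweet>
<tweet id='2'>an original tweet</tweet>
</tweets>"""

root = ET.fromstring(sample)
removed = remove_retweets(root)
remaining_ids = [t.get("id") for t in root.findall("tweet")]
print(removed, remaining_ids)

# For a real file (assumed names), parse it and write the result back out:
# tree = ET.parse("tweet_data.xml")
# remove_retweets(tree.getroot())
# tree.write("tweet_data_no_rt.xml", encoding="utf-8", xml_declaration=True)
```

Because the file is re-serialized on write, attribute quoting and whitespace inside attributes may differ from the input, so it is worth keeping the original file until the output has been checked.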