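I need to extract only part of each row in a table column. The part can be anywhere from 0 to 4 characters long:
"address": "124"
I know this can be done with the extract/findall functions. But it turns out that only a single mask can be set, and only some of the rows match that mask. As I said, the codes vary in length, so this approach does not work. Please tell me how to set the selection mask correctly.
Example row from the table column: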
{'latitude': '37.80505999961946', 'human_address': '{"address":"0","city":"Oakland","state":"Ca","zip":""}', 'needs_recoding': False, 'longitude': '-122.27301999967312'}
df['latitude_1'] = df['Location 1'].str.extract('(\"\d\d\d\d)', expand=True)
I hope this helps.
import pandas as pd
dic = {'latitude': '37.80505999961946', 'human_address': '{"address":"1234","city":"Oakland","state":"Ca","zip":""}', 'needs_recoding': False, 'longitude': '-122.27301999967312'}, {'latitude': '37.80505999961946', 'human_address': '{"address":"0","city":"Oakland","state":"Ca","zip":""}', 'needs_recoding': False, 'longitude': '-122.27301999967312'}
df = pd.DataFrame(list(dic))
df
human_address latitude longitude needs_recoding
0 {"address":"1234","city":"Oakland","state":"Ca... 37.80505999961946 -122.27301999967312 False
1 {"address":"0","city":"Oakland","state":"Ca","... 37.80505999961946 -122.27301999967312 False
import re
df.human_address.apply(lambda s: re.search(r'"address":"\d{0,4}"', s).group())
0 "address":"1234"
1 "address":"0"
Name: human_address, dtype: object
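One caveat with the apply approach above: re.search returns None when a row has no match, so calling .group() on it raises AttributeError. A minimal sketch of a safer helper follows, which also captures only the digits rather than the whole "address":"..." fragment. The names ADDRESS_RE and extract_address are hypothetical and not part of the original answer.

```python
import re

# Compile once; the capture group keeps only the digits (0 to 4 of them).
ADDRESS_RE = re.compile(r'"address":"(\d{0,4})"')

def extract_address(s):
    """Return the address digits from one cell, or None if nothing matches."""
    m = ADDRESS_RE.search(s)
    return m.group(1) if m else None

print(extract_address('{"address":"1234","city":"Oakland"}'))
print(extract_address('{"city":"Oakland"}'))
```

It can be applied to the column in the same way as the lambda, for example df.human_address.apply(extract_address).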
You can indeed use pandas str.extract; you just need to adjust your regular expression pattern.
Here is the DataFrame from @Ananay Mital.
>>> df
human_address latitude longitude needs_recoding
0 {"address":"1234","city":"Oakland","state":"Ca... 37.80505999961946 -122.27301999967312 False
1 {"address":"0","city":"Oakland","state":"Ca","... 37.80505999961946 -122.27301999967312 False
This is how you get the result using str.extract:
>>> df.human_address.str.extract('(\"address\":\"\d{0,4}\")')
0
0 "address":"1234"
1 "address":"0"
Or, as follows:
>>> df.human_address.str.extract(r'("address":"\d{0,4}")')
0
0 "address":"1234"
1 "address":"0"
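If only the number itself is wanted, one option is to move the parentheses so that the capture group wraps just \d{0,4}; str.extract returns whatever the group captures. The sketch below assumes the same two sample rows and writes the result to a new column named address, which is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "human_address": [
        '{"address":"1234","city":"Oakland","state":"Ca","zip":""}',
        '{"address":"0","city":"Oakland","state":"Ca","zip":""}',
    ]
})

# The parentheses wrap only \d{0,4}, so only the digits are returned.
# expand=False yields a Series instead of a one-column DataFrame.
df["address"] = df["human_address"].str.extract(r'"address":"(\d{0,4})"', expand=False)
print(df["address"].tolist())
```

Rows where the pattern does not match come back as NaN, so no exception is raised for them.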