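I have one big zip archive, "Polska_SHP.zip", which contains other zip archives (named "02_SHP.zip", "04_SHP.zip", etc.). Each of these archives contains further zip archives (for example, "02_SHP.zip" has "0201_SHP.zip", "0202_SHP.zip", etc. inside). Finally, these innermost archives contain many shapefiles, and I need to read the ones with "SWRS" in their name into geopandas. So far I have been able to find the names of these shapefiles, and I tried to read them with: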
import zipfile
from io import BytesIO
import geopandas as gpd

with zipfile.ZipFile("Polska_SHP.zip", "r") as main_zfile:
    for name in main_zfile.namelist():  # list of archives in the main folder
        print("name: ", name)
        if ".zip" in name:
            zfiledata = BytesIO(main_zfile.read(name))
            with zipfile.ZipFile(zfiledata) as zfile2:
                for name2 in zfile2.namelist():
                    print("name2: ", name2)
                    if ".zip" in name2:
                        zfiledata2 = BytesIO(zfile2.read(name2))
                        with zipfile.ZipFile(zfiledata2) as zfile3:
                            for name3 in zfile3.namelist():
                                if "SWRS" in name3 and ".shp" in name3:
                                    print("name3: ", name3)
                                    gdf = gpd.read_file(name3)
                                    gdf.head()
It prints the names I need:
name: 32_SHP.zip
name2: 32/3209_SHP.zip
name3: PL.PZGiK.339.3209__OT_SWRS_L.shp
But it fails when reading the shapefile:
CPLE_OpenFailedError                      Traceback (most recent call last)
fiona/_shim.pyx in fiona._shim.gdal_open_vector()
fiona/_err.pyx in fiona._err.exc_wrap_pointer()
CPLE_OpenFailedError: PL.PZGiK.339.3209__OT_SWRS_L.shp: No such file or directory
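The name3 variable you pass to gpd.read_file() is just the name of a file inside the ZIP, not a path that exists on disk. To read it that way, you would first have to extract the ZIP.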
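For the extraction route, something along these lines might work. It is only a sketch: `extract_shapefile` is a helper name I made up, and it assumes the sidecar files (.dbf, .shx, .prj and so on) sit next to the .shp inside the archive and share its base name.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def extract_shapefile(zfile, shp_name, dest_dir):
    """Extract shp_name and its sidecar files from an open ZipFile into
    dest_dir and return the path of the extracted .shp file.

    Hypothetical helper: assumes the sidecars share the .shp's path and
    base name and differ only in the final extension.
    """
    stem = PurePosixPath(shp_name).with_suffix("")
    for member in zfile.namelist():
        # keep every member whose name without the last extension matches
        if PurePosixPath(member).with_suffix("") == stem:
            zfile.extract(member, dest_dir)
    return Path(dest_dir) / shp_name
```

In the question's innermost loop this could replace the failing call, roughly as `with tempfile.TemporaryDirectory() as tmp: gdf = gpd.read_file(extract_shapefile(zfile3, name3, tmp))`. The GeoDataFrame has to be read before the temporary directory is cleaned up.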
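Another option is to pass a file-like object of the archive, although this assumes that the zip contains only a single dataset and that the .shp file and all its companion files are in the top-level directory. Note that my example has only 2 levels of nested archives. The shapefiles have different attributes, hence a list of GeoDataFrames (gdfs) is used to collect all the data. In your case you might want to use pandas.concat() to build a single gdf.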
# python    : 3.8.13
# geopandas : 0.10.2
# fiona     : 1.8.18
import geopandas as gpd
import zipfile
import re

# shp_regex = r"SWRS.*\.shp$"
shp_regex = r"^ne_.*\.shp$"
# list of geodataframes
gdfs = []
with zipfile.ZipFile("nat_earth.zip", "r") as main_zfile:
    main_zfile.printdir()
    print("- " * 40)
    # only cycle through *.zip files
    for name in [fname for fname in main_zfile.namelist() if fname.endswith(".zip")]:
        print(f'>> {name}:')
        with main_zfile.open(name, "r") as zipped_shp:
            zipped_shp_namelist = zipfile.ZipFile(zipped_shp).namelist()
            print(", ".join(zipped_shp_namelist))
            # check if any of the files actually matches the pattern
            if any(re.search(shp_regex, level2_fname) for level2_fname in zipped_shp_namelist):
                # for gpd.read_file() file position must be changed back to 0
                zipped_shp.seek(0)
                gdfs.append(gpd.read_file(zipped_shp))
                rows, cols = gdfs[-1].shape
                print(f'GeoDataFrame: {rows} rows, {cols} columns\n')
# head of first gdf
print(gdfs[0].head())
Output:
File Name Modified Size
ne_50m_admin_0_countries.zip 2021-12-08 03:47:44 792663
ne_50m_lakes.zip 2021-12-08 03:49:54 252615
ne_50m_ocean.zip 2021-09-04 08:56:52 461745
ne_50m_rivers_lake_centerlines.zip 2021-12-08 03:49:54 504454
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> ne_50m_admin_0_countries.zip:
ne_50m_admin_0_countries.README.html, ne_50m_admin_0_countries.VERSION.txt,
ne_50m_admin_0_countries.cpg, ne_50m_admin_0_countries.dbf,
ne_50m_admin_0_countries.prj, ne_50m_admin_0_countries.shp,
ne_50m_admin_0_countries.shx
GeoDataFrame: 242 rows, 162 columns
>> ne_50m_lakes.zip:
ne_50m_lakes.README.html, ne_50m_lakes.VERSION.txt, ne_50m_lakes.cpg,
ne_50m_lakes.dbf, ne_50m_lakes.prj, ne_50m_lakes.shp, ne_50m_lakes.shx
GeoDataFrame: 412 rows, 40 columns
>> ne_50m_ocean.zip:
ne_50m_ocean.README.html, ne_50m_ocean.VERSION.txt, ne_50m_ocean.cpg,
ne_50m_ocean.dbf, ne_50m_ocean.prj, ne_50m_ocean.shp, ne_50m_ocean.shx
GeoDataFrame: 1 rows, 4 columns
>> ne_50m_rivers_lake_centerlines.zip:
ne_50m_rivers_lake_centerlines.README.html,
ne_50m_rivers_lake_centerlines.VERSION.txt,
ne_50m_rivers_lake_centerlines.cpg,
ne_50m_rivers_lake_centerlines.dbf, ne_50m_rivers_lake_centerlines.prj,
ne_50m_rivers_lake_centerlines.shp, ne_50m_rivers_lake_centerlines.shx
GeoDataFrame: 478 rows, 37 columns
featurecla scalerank LABELRANK SOVEREIGNT SOV_A3 ADM0_DIF LEVEL \
0 Admin-0 country 1 3 Zimbabwe ZWE 0 2
1 Admin-0 country 1 3 Zambia ZMB 0 2
2 Admin-0 country 1 3 Yemen YEM 0 2
3 Admin-0 country 3 2 Vietnam VNM 0 2
4 Admin-0 country 5 3 Venezuela VEN 0 2
...
geometry
0 POLYGON ((31.28789 -22.40205, 31.19727 -22.344...
1 POLYGON ((30.39609 -15.64307, 30.25068 -15.643...
2 MULTIPOLYGON (((53.08564 16.64839, 52.58145 16...
3 MULTIPOLYGON (((104.06396 10.39082, 104.08301 ...
4 MULTIPOLYGON (((-60.82119 9.13838, -60.94141 9...
[5 rows x 162 columns]
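The example above handles two levels of nesting, while the question has three. One way to avoid hard-coding the depth is a small recursive generator. This is a rough sketch, not something I have run against the Polska_SHP.zip data: `iter_shapefile_archives` is a made-up name, the " > " in the returned path is only a label for printing, and every inner archive is loaded fully into memory.

```python
import io
import re
import zipfile


def iter_shapefile_archives(zfile, shp_regex, prefix=""):
    """Recursively walk nested zip archives and yield (path, buffer) for every
    inner archive that directly contains a member matching shp_regex.

    Hypothetical helper: buffer is an in-memory copy of the inner archive,
    rewound to position 0 so it can be handed to a reader.
    """
    for name in zfile.namelist():
        if not name.lower().endswith(".zip"):
            continue
        buffer = io.BytesIO(zfile.read(name))
        path = f"{prefix}{name}"
        with zipfile.ZipFile(buffer) as inner:
            has_match = any(re.search(shp_regex, m) for m in inner.namelist())
            # descend into archives nested inside this one, whatever the depth
            yield from iter_shapefile_archives(inner, shp_regex, prefix=path + " > ")
        if has_match:
            buffer.seek(0)  # rewind before passing the buffer on
            yield path, buffer
```

Each yielded buffer could then be passed to gpd.read_file() as in the example above, and the resulting frames combined with pandas.concat(gdfs, ignore_index=True). The same caveat applies: if an inner archive holds several shapefiles, the buffer alone does not say which one to read, so extracting the wanted file may be the safer route.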
As of the time of this answer, geopandas supports paths inside a zipfile. You just separate the archive path and the inner path with "!":
import geopandas

Zipfile_path = "archive.zip"
nested_file_path = "folder/geofile.shp"
gdf = geopandas.read_file(f"{Zipfile_path}!{nested_file_path}")
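If the archive holds several shapefiles, the "!" paths can be built from the archive listing rather than typed by hand. A minimal sketch, assuming the archive is an ordinary zip file on disk (`shapefile_paths` is a name I chose, not a geopandas function):

```python
import zipfile


def shapefile_paths(zip_path):
    """Return 'archive.zip!inner/file.shp' strings for every .shp member
    of a zip archive on disk (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as zf:
        return [
            f"{zip_path}!{member}"
            for member in zf.namelist()
            if member.lower().endswith(".shp")
        ]
```

Each returned string can be passed to geopandas.read_file() the same way as above. I have not checked whether this syntax reaches into archives nested inside other archives, so for the question's layout the inner zips would probably need to be extracted first.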