Reading shapefiles from nested zip archives

Question

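I have a large zip archive, "Polska_SHP.zip", which contains further zip archives (named "02_SHP.zip", "04_SHP.zip", and so on). Each of these contains more zip archives (for example, "02_SHP.zip" holds "0201_SHP.zip", "0202_SHP.zip", and so on). Finally, those innermost archives contain many shapefiles, and I need to read all of them. So far I have been able to find the names of these shapefiles and have tried to read them with the following code: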

import zipfile
from io import BytesIO
import geopandas as gpd

with zipfile.ZipFile("Polska_SHP.zip", "r") as main_zfile:
    for name in main_zfile.namelist():  # list of archives in the main folder
        print("name: ", name)
        if ".zip" in name:
            zfiledata = BytesIO(main_zfile.read(name))
            with zipfile.ZipFile(zfiledata) as zfile2:
                for name2 in zfile2.namelist():
                    print("name2: ", name2)
                    if ".zip" in name2:
                        zfiledata2 = BytesIO(zfile2.read(name2))
                        with zipfile.ZipFile(zfiledata2) as zfile3:
                            for name3 in zfile3.namelist():
                                if "SWRS" in name3 and ".shp" in name3:
                                    print("name3: ", name3)
                                    gdf = gpd.read_file(name3)
                                    gdf.head()

It prints the names I need:

name:  32_SHP.zip
name2:  32/3209_SHP.zip
name3:  PL.PZGiK.339.3209__OT_SWRS_L.shp

But it fails when reading the shapefile:

CPLE_OpenFailedError                      Traceback (most recent call last)
fiona/_shim.pyx in fiona._shim.gdal_open_vector()
fiona/_err.pyx in fiona._err.exc_wrap_pointer()
CPLE_OpenFailedError: PL.PZGiK.339.3209__OT_SWRS_L.shp: No such file or directory

Answer 1

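The name3 variable you pass to gpd.read_file() is only the name of a file inside the ZIP, not a path on disk. To read it that way, you would first have to extract the ZIP.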
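The extraction route could be sketched roughly as follows. The helper name extract_shapefiles, the SHAPEFILE_SUFFIXES set, and the keyword default are illustrative assumptions, not part of geopandas or of the question's code. The sketch also assumes that shapefile names are unique across all inner archives, because it flattens everything into one directory.

```python
import zipfile
from io import BytesIO
from pathlib import Path

# Component files that commonly make up one shapefile dataset (assumed set).
SHAPEFILE_SUFFIXES = {".shp", ".shx", ".dbf", ".prj", ".cpg"}


def extract_shapefiles(archive, dest_dir, keyword="SWRS"):
    """Extract matching shapefile components from a (possibly nested) zip.

    `archive` is a path or a binary file-like object.
    Returns the paths of the extracted .shp files.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    shp_paths = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            suffix = Path(name).suffix.lower()
            if suffix == ".zip":
                # Load the inner archive into memory and recurse into it.
                inner = BytesIO(zf.read(name))
                shp_paths += extract_shapefiles(inner, dest_dir, keyword)
            elif suffix in SHAPEFILE_SUFFIXES and keyword in Path(name).name:
                # Drop any folder prefix so all components sit side by side.
                target = dest_dir / Path(name).name
                target.write_bytes(zf.read(name))
                if suffix == ".shp":
                    shp_paths.append(target)
    return shp_paths


# Possible usage (requires geopandas):
#   import tempfile
#   import geopandas as gpd
#   with tempfile.TemporaryDirectory() as tmp:
#       gdfs = [gpd.read_file(p) for p in extract_shapefiles("Polska_SHP.zip", tmp)]
```

Because the files end up on disk, each returned path can be read by name, which is what the original loop was attempting.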

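Another option is to pass a file-like object of the archive, although this assumes that the zip contains only one dataset and that the .shp file and all its companion files sit in the top-level directory. Note that my example has only 2 levels of nested archives. The shapefiles have different attributes, so a list of GeoDataFrames, gdfs, is used to collect all the data. In your case, you might want to use pandas.concat().
(By the way, your current loop tries to overwrite gdf on every iteration.)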

# python     : 3.8.13
# geopandas  : 0.10.2
# fiona      : 1.8.18

import geopandas as gpd
import zipfile
import re

# raw strings avoid invalid-escape warnings for "\."
# shp_regex = r"SWRS.*\.shp$"
shp_regex = r"^ne_.*\.shp$"

# list of geodataframes
gdfs = []
with zipfile.ZipFile("nat_earth.zip", "r") as main_zfile:
    main_zfile.printdir()
    print("- " * 40)
    # only cycle through *.zip files
    for name in [fname for fname in main_zfile.namelist() if fname.endswith(".zip")]:
        print(f'>> {name}:')
        with main_zfile.open(name, "r") as zipped_shp:
            zipped_shp_namelist = zipfile.ZipFile(zipped_shp).namelist()
            print(", ".join(zipped_shp_namelist))
            # check if any of the files actually matches the pattern
            if any(re.search(shp_regex, level2_fname) for level2_fname in zipped_shp_namelist):
                # for gpd.read_file() file position must be changed back to 0
                zipped_shp.seek(0)
                gdfs.append(gpd.read_file(zipped_shp))
                rows, cols = gdfs[-1].shape
                print(f'GeoDataFrame: {rows} rows, {cols} columns\n')

# head of first gdf
print(gdfs[0].head())

Output:

File Name                                             Modified             Size
ne_50m_admin_0_countries.zip                   2021-12-08 03:47:44       792663
ne_50m_lakes.zip                               2021-12-08 03:49:54       252615
ne_50m_ocean.zip                               2021-09-04 08:56:52       461745
ne_50m_rivers_lake_centerlines.zip             2021-12-08 03:49:54       504454
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
>> ne_50m_admin_0_countries.zip:
ne_50m_admin_0_countries.README.html, ne_50m_admin_0_countries.VERSION.txt, 
ne_50m_admin_0_countries.cpg, ne_50m_admin_0_countries.dbf, 
ne_50m_admin_0_countries.prj, ne_50m_admin_0_countries.shp, 
ne_50m_admin_0_countries.shx
GeoDataFrame: 242 rows, 162 columns

>> ne_50m_lakes.zip:
ne_50m_lakes.README.html, ne_50m_lakes.VERSION.txt, ne_50m_lakes.cpg, 
ne_50m_lakes.dbf, ne_50m_lakes.prj, ne_50m_lakes.shp, ne_50m_lakes.shx
GeoDataFrame: 412 rows, 40 columns

>> ne_50m_ocean.zip:
ne_50m_ocean.README.html, ne_50m_ocean.VERSION.txt, ne_50m_ocean.cpg, 
ne_50m_ocean.dbf, ne_50m_ocean.prj, ne_50m_ocean.shp, ne_50m_ocean.shx
GeoDataFrame: 1 rows, 4 columns

>> ne_50m_rivers_lake_centerlines.zip:
ne_50m_rivers_lake_centerlines.README.html, 
ne_50m_rivers_lake_centerlines.VERSION.txt, 
ne_50m_rivers_lake_centerlines.cpg,  
ne_50m_rivers_lake_centerlines.dbf, ne_50m_rivers_lake_centerlines.prj, 
ne_50m_rivers_lake_centerlines.shp, ne_50m_rivers_lake_centerlines.shx
GeoDataFrame: 478 rows, 37 columns

        featurecla  scalerank  LABELRANK SOVEREIGNT SOV_A3  ADM0_DIF  LEVEL  \
0  Admin-0 country          1          3   Zimbabwe    ZWE         0      2   
1  Admin-0 country          1          3     Zambia    ZMB         0      2   
2  Admin-0 country          1          3      Yemen    YEM         0      2   
3  Admin-0 country          3          2    Vietnam    VNM         0      2   
4  Admin-0 country          5          3  Venezuela    VEN         0      2   
... 
                                            geometry  
0  POLYGON ((31.28789 -22.40205, 31.19727 -22.344...  
1  POLYGON ((30.39609 -15.64307, 30.25068 -15.643...  
2  MULTIPOLYGON (((53.08564 16.64839, 52.58145 16...  
3  MULTIPOLYGON (((104.06396 10.39082, 104.08301 ...  
4  MULTIPOLYGON (((-60.82119 9.13838, -60.94141 9...  

[5 rows x 162 columns]
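The example above stops at two levels of archives. A sketch that walks any nesting depth and then stacks the results with pandas.concat() might look like this. The names iter_shapefile_archives and read_all are hypothetical, and the sketch keeps the same assumption as the file-like approach: each innermost zip holds a single dataset at its top level. Archives that bundle several shapefiles would still need to be extracted first.

```python
import re
import zipfile
from io import BytesIO


def iter_shapefile_archives(archive, shp_regex=r"SWRS.*\.shp$", prefix=""):
    """Yield (label, BytesIO) for every nested zip holding a matching .shp.

    Walks nested zips to any depth; `label` records the chain of archive names.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".zip"):
                continue
            data = zf.read(name)
            label = f"{prefix}{name}"
            with zipfile.ZipFile(BytesIO(data)) as inner:
                has_match = any(re.search(shp_regex, n) for n in inner.namelist())
            if has_match:
                # Fresh buffer positioned at 0, ready for a reader.
                yield label, BytesIO(data)
            # Descend in case this archive contains further zips.
            yield from iter_shapefile_archives(BytesIO(data), shp_regex, f"{label}/")


def read_all(archive, shp_regex=r"SWRS.*\.shp$"):
    """Read every matching archive and stack the results into one GeoDataFrame."""
    # Imported lazily so the walker above works without geopandas installed.
    import geopandas as gpd
    import pandas as pd

    gdfs = [gpd.read_file(buf) for _, buf in iter_shapefile_archives(archive, shp_regex)]
    # pd.concat raises ValueError if nothing matched.
    return pd.concat(gdfs, ignore_index=True)
```

Columns that exist in only some of the shapefiles are filled with missing values after concatenation, so this works best when the layers share a schema.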

Answer 2

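As of this answer, geopandas supports paths inside a zip file. You only need to separate the archive path and the inner path with an exclamation mark (!):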

import geopandas

zipfile_path = "archive.zip"
nested_file_path = "folder/geofile.shp"

gdf = geopandas.read_file(f"{zipfile_path}!{nested_file_path}")
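To avoid typing member names by hand, the paths could be built from the archive listing. This is only a sketch: shapefile_paths is a hypothetical helper, it relies on the "archive!member" form described in this answer, and it covers a zip that sits on disk. Whether that path form can reach into a zip nested inside another zip is not stated here, so nested archives may still need one of the approaches from Answer 1.

```python
import zipfile


def shapefile_paths(zip_path, keyword=""):
    """Return 'archive.zip!member.shp' strings for the .shp members of a zip."""
    with zipfile.ZipFile(zip_path) as zf:
        return [
            f"{zip_path}!{name}"
            for name in zf.namelist()
            if name.lower().endswith(".shp") and keyword in name
        ]


# Possible usage (requires geopandas):
#   import geopandas
#   gdfs = [geopandas.read_file(p) for p in shapefile_paths("archive.zip")]
```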
