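I'm working with a huge Parquet file containing more than 30 million rows. I only need a small portion of it and would like to select a random sample of rows. When I inspect the file's metadata, there is only one row group, which means I can't use pyarrow's read_row_groups to pick random groups.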
from pyarrow.parquet import ParquetFile
pf = ParquetFile('Flight_Delay.parquet')
print(pf.metadata)
created_by: parquet-cpp-arrow version 6.0.1
num_columns: 30
num_rows: 30132672
num_row_groups: 1
format_version: 1.0
serialized_size: 15701
Since I don't have your data, I generated a sample file and saved it as a Parquet file:
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa
num_rows = 100000
data = {
    "flight_id": np.arange(num_rows),
    "delay_minutes": np.random.randint(0, 500, size=num_rows),
    "origin": np.random.choice(["JFK", "LAX", "ORD", "ATL", "DFW"], size=num_rows),
    "destination": np.random.choice(["MIA", "SEA", "SFO", "PHX", "BOS"], size=num_rows),
    "airline": np.random.choice(["AA", "DL", "UA", "SW", "NK"], size=num_rows),
    # "min" is the minute frequency alias; the older "T" alias is deprecated in recent pandas
    "timestamp": pd.date_range("2023-01-01", periods=num_rows, freq="min"),
}
df = pd.DataFrame(data)
file_path = 'Flight_Delay_sample.parquet'
table = pa.Table.from_pandas(df)
pq.write_table(table, file_path)
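Now you can sample the data as follows. Note that dataset.to_table() loads the entire file into memory before the random rows are selected: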
import pyarrow.dataset as ds
dataset = ds.dataset('Flight_Delay_sample.parquet', format="parquet")
num_rows = dataset.count_rows()
sample_size = 1000
random_indices = np.random.choice(num_rows, sample_size, replace=False)
random_sample = dataset.to_table().to_pandas().iloc[random_indices]
print(random_sample)
This returns:
flight_id delay_minutes origin destination airline timestamp
9227 9227 437 ORD SFO UA 2023-01-07 09:47:00
66081 66081 264 LAX MIA SW 2023-02-15 21:21:00
18665 18665 117 ATL MIA DL 2023-01-13 23:05:00
27862 27862 233 LAX MIA DL 2023-01-20 08:22:00
9149 9149 456 ATL SFO DL 2023-01-07 08:29:00
... ... ... ... ... ... ...
2502 2502 204 JFK SEA AA 2023-01-02 17:42:00
13450 13450 234 ATL PHX NK 2023-01-10 08:10:00
75772 75772 371 JFK BOS SW 2023-02-22 14:52:00
19573 19573 194 JFK SEA AA 2023-01-14 14:13:00
74923 74923 286 JFK BOS UA 2023-02-22 00:43:00
[1000 rows x 6 columns]