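I have multi-year time series data at 15-minute resolution from 2007 to 2022 (16 years in total). The data looks like this. I want to extract all possible subsets from this data. Each subset should contain one year's worth of values, so basically 4 (15-minute intervals) x 24 (hours) x 365 days, or 366 in a leap year: 35,040 rows, or 35,136 rows in a leap year.
The subsets should be formed from 12 months taken from different years. For example:
January from 2021 (all 15-minute values of that month should be in the subset), February from 2018, March from 2012, April from 2015, May from 2009, June from 2014, July from 2022, August from 2010, September from 2015, October from 2020, November from 2018, December from 2007. Having two months from the same year is also fine.
Please help me with how to proceed.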
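As a quick sanity check of the row counts quoted above, the arithmetic can be sketched in a few lines. The helper name rows_per_year is made up for illustration, and the sketch assumes a complete year with no missing or duplicated readings.

```python
import calendar

# Four 15-minute readings per hour, 24 hours per day
INTERVALS_PER_DAY = 4 * 24

def rows_per_year(year: int) -> int:
    """Expected number of 15-minute rows in one complete calendar year."""
    days = 366 if calendar.isleap(year) else 365
    return INTERVALS_PER_DAY * days

print(rows_per_year(2021))  # 35040
print(rows_per_year(2020))  # 35136
```

For a subset mixed from several years, the count depends only on which year February comes from: it reaches 35,136 rows when February is taken from a leap year (2008, 2012, 2016 or 2020).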
Here is my code so far for reading the data:
import pandas as pd
import numpy as np
columns_to_read = ['DateTime', 'PLANT ENERGY MWh']
df = pd.read_excel(r'C:/Users/97150/Data - 15 mins multiyear -R2.xlsx', skiprows=0, usecols=columns_to_read)
df['DateTime'] = pd.to_datetime(df['DateTime'])
df.dropna(subset=['DateTime'], inplace=True)
df['Month'] = df['DateTime'].dt.month.astype(int)
df['Year'] = df['DateTime'].dt.year.astype(int)
#df['Month'] = df['DateTime'].dt.month
#df['Year'] = df['DateTime'].dt.year
df.set_index('DateTime', inplace=True)
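Here is my simple solution, without any frills or overly complicated Pythonic tweaks...
I assume you want random years in the subsets, but every "month + year" pair must be used and must appear in one (and only one) sub-dataset.
The result is stored in a list of 16 pandas DataFrames and printed to a file.
Hope this is what you were looking for!
Let me know if anything is unclear. Ciao!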
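A short note on why I read "all possible subsets" as one random partition rather than a full enumeration: if each of the 12 months may come from any of the 16 years, the number of distinct mixed years is 16 to the power of 12. A minimal sketch of that arithmetic (the variable names are only illustrative):

```python
n_years = 16   # 2007 through 2022
n_months = 12

# Each month independently picks one of the 16 years
all_possible = n_years ** n_months
print(f"{all_possible:,}")  # 281,474,976,710,656

# Using every month+year pair exactly once leaves far fewer subsets
disjoint_subsets = (n_years * n_months) // n_months
print(disjoint_subsets)  # 16
```

With roughly 35,000 rows per subset, materializing the full enumeration is not practical, which is why the code builds 16 non-overlapping subsets instead.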
import pandas as pd
import random
from itertools import product
# Define the columns to be read from the Excel file
columns_to_read = ['DateTime', 'PLANT ENERGY MWh']
# Read data from the Excel file into a DataFrame (path changed for me)
df = pd.read_excel(r'./Data_15_mins_multiyear-R2.xlsx', skiprows=0,
                   usecols=columns_to_read)
# Convert 'DateTime' column to datetime type
df['DateTime'] = pd.to_datetime(df['DateTime'])
# Drop rows where 'DateTime' is missing
df.dropna(subset=['DateTime'], inplace=True)
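## Define start and end dates for data filtering
start_date = pd.Timestamp('2007-01-01')
# Note: this Timestamp is midnight at the start of 2022-12-31, so with "<="
# every later reading on that day is dropped; to keep the whole day, compare
# with "< pd.Timestamp('2023-01-01')" instead
end_date = pd.Timestamp('2022-12-31')
# Filter the DataFrame to include only data within the specified date range
# (.copy() avoids a SettingWithCopyWarning when columns are added later)
df = df[(df['DateTime'] >= start_date) & (df['DateTime'] <= end_date)].copy()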
## Extract 'Month' and 'Year' from the 'DateTime' column
df['Month'] = df['DateTime'].dt.month
df['Year'] = df['DateTime'].dt.year
# Group the DataFrame by 'Month' and 'Year'
grouped = df.groupby(['Month', 'Year'])
# Get the unique years present in the DataFrame
unique_years = df['Year'].unique()
# Create a dictionary to hold data subsets for each year
yearly_data = {}
### Populate the yearly_data dictionary with subsets of data for each year
for year in unique_years:
    subset_df = df[df['Year'] == year].reset_index(drop=True)
    yearly_data[year] = subset_df
N = len(yearly_data)
## Get all possible combinations of months and years and shuffle
all_combinations = list(product(range(1, 13), unique_years))
random.shuffle(all_combinations)
# Create an empty list to hold the final datasets
datasets_list = []
#################### Build all datasets:
for _ in range(N):
    # Create an empty DataFrame to store the current dataset
    curr_dataset = pd.DataFrame()
    # Loop through months 1 to 12
    for month in range(1, 13):
        # Find the position of the first remaining (month, year) combination
        # for this month (the year value is ignored while searching)
        index_to_pop = next((i for i, (m, _) in enumerate(all_combinations)
                             if m == month), None)
        if index_to_pop is not None:
            # Remove the combination from the list and get the associated year
            _, year = all_combinations.pop(index_to_pop)
            # Get the df for the selected year
            subset_df = yearly_data[year]
            # Filter the subsets to include only rows with the current month
            subset_month = subset_df[subset_df['Month'] == month]
            ## Concatenate the subset_month df to the current one.
            # Use "concat" rather than "append", since DataFrame.append is deprecated
            curr_dataset = pd.concat([curr_dataset, subset_month], ignore_index=True)
    # Append the subdf to the datasets_list
    datasets_list.append(curr_dataset)
########## Print the result into a file
## Indicate that all columns and rows must be displayed, without any truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
for i, dataset in enumerate(datasets_list, start=1):
    # Write to the file in append mode
    with open('my_output_file.txt', 'a') as file:
        print(f"SubDataset {i}:", file=file)
        print(dataset, file=file)
        print()
==>Output<==
My output file is:
SubDataset 1:
                 DateTime  PLANT ENERGY MWh  Month  Year
0     2013-01-01 00:07:00          0.000000      1  2013
1     2013-01-01 00:22:00          0.000000      1  2013
2     2013-01-01 00:37:00          0.000000      1  2013
3     2013-01-01 00:52:00          0.000000      1  2013
.....           .........          ........  .....  ....
34941 2022-12-30 23:22:00          0.000000     12  2022
34942 2022-12-30 23:37:00          0.000000     12  2022
34943 2022-12-30 23:52:00          0.000000     12  2022
SubDataset 2:
                 DateTime  PLANT ENERGY MWh  Month  Year
0     2019-01-01 00:07:00          0.000000      1  2019
1     2019-01-01 00:22:00          0.000000      1  2019
2     2019-01-01 00:37:00          0.000000      1  2019
3     2019-01-01 00:52:00          0.000000      1  2019
4     2019-01-01 01:07:00          0.000000      1  2019
.....           .........          ........  .....  ....
20446 2014-07-31 23:37:00          0.000000      7  2014
20447 2014-07-31 23:52:00          0.000000      7  2014
20448 2015-08-01 00:07:00          0.000000      8  2015
20449 2015-08-01 00:22:00          0.000000      8  2015
20450 2015-08-01 00:37:00          0.000000      8  2015
.....           .........          ........  .....  ....
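Since the printed file is long, it may be easier to confirm programmatically that each sub-dataset covers months 1 to 12 and that no month+year pair is reused. The following is only a sketch: the helper names month_year_pairs and pairs_used_once are hypothetical, and the tiny demo list stands in for datasets_list.

```python
import pandas as pd

def month_year_pairs(dataset: pd.DataFrame) -> list:
    """Return the distinct (Month, Year) pairs of one sub-dataset."""
    pairs = dataset[['Month', 'Year']].drop_duplicates()
    return list(pairs.itertuples(index=False, name=None))

def pairs_used_once(datasets: list) -> bool:
    """True if every sub-dataset has months 1-12 with one year each and
    no (Month, Year) pair appears in more than one sub-dataset."""
    seen = set()
    for dataset in datasets:
        pairs = month_year_pairs(dataset)
        months = sorted(m for m, _ in pairs)
        if months != list(range(1, 13)):
            return False  # a month is missing or drawn from two years
        if seen.intersection(pairs):
            return False  # pair already used by an earlier sub-dataset
        seen.update(pairs)
    return True

# Tiny synthetic stand-in for datasets_list: one row per month
demo = [
    pd.DataFrame({'Month': range(1, 13), 'Year': [2007] * 12}),
    pd.DataFrame({'Month': range(1, 13), 'Year': [2008] * 12}),
]
print(pairs_used_once(demo))  # True
```

On the real data the call would be pairs_used_once(datasets_list), assuming the 'Month' and 'Year' columns are still present.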
A quick check of how the years are distributed for a given month x across the datasets…
month_x = []
x = 1
# Loop through each dataset
for dataset in datasets_list:
    # Take the first row whose Month equals x (its Year shows which year was drawn)
    row_to_append = dataset.loc[dataset['Month'] == x].iloc[0]
    month_x.append(row_to_append)
month_x is:
[DateTime 2013-01-01 00:07:00
PLANT ENERGY MWh 0.0
Month 1
Year 2013
Name: 0, dtype: object,
DateTime 2019-01-01 00:07:00
PLANT ENERGY MWh 0.0
Month 1
Year 2019
Name: 0, dtype: object,
DateTime 2018-01-01 00:07:00
PLANT ENERGY MWh 0.0
Month 1
Year 2018
Name: 0, dtype: object,
........................................
........................................
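For what it's worth, the same idea can probably be written more compactly: shuffle the years independently for each month, then let sub-dataset k take the k-th year of every month's shuffled order. This is a sketch under assumptions, not a drop-in replacement: build_mixed_years is a made-up name, and the demo uses synthetic daily data for 2007 to 2009 instead of the real 15-minute file.

```python
import numpy as np
import pandas as pd

def build_mixed_years(df, seed=None):
    """Split df into one sub-dataset per year present, each holding months
    1-12, so that every (month, year) pair is used exactly once."""
    rng = np.random.default_rng(seed)
    year = df['DateTime'].dt.year
    month = df['DateTime'].dt.month
    years = np.sort(year.unique())
    # For each month, an independent random ordering of the years;
    # sub-dataset k takes the k-th year of every month's ordering
    order = {m: rng.permutation(years) for m in range(1, 13)}
    datasets = []
    for k in range(len(years)):
        parts = [df[(month == m) & (year == order[m][k])]
                 for m in range(1, 13)]
        datasets.append(pd.concat(parts, ignore_index=True))
    return datasets

# Synthetic stand-in for the real file (assumption): daily values, 3 years
idx = pd.date_range('2007-01-01', '2009-12-31', freq='D')
demo = pd.DataFrame({'DateTime': idx, 'PLANT ENERGY MWh': 0.0})
subsets = build_mixed_years(demo, seed=0)
print(len(subsets))                                # 3
print(sum(len(s) for s in subsets) == len(demo))   # True
```

Passing a seed makes the random assignment reproducible, which helps when the subsets need to be regenerated later.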