Convert long rows to wide, filling all cells

Question (votes: 0, answers: 2)

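I have long-format data on businesses, with one row for each move to a different location, keyed on business ID. Any one business establishment can have several move events.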

I want to reshape this to a wide format, which is typically crosstab territory via the tablefunc module.

+-------------+-----------+---------+---------+
| business_id | year_move |  long   |   lat   |
+-------------+-----------+---------+---------+
|   001013580 |      1991 | 71.0557 | 42.3588 |
|   001015924 |      1993 | 71.0728 | 42.3504 |
|   001015924 |      1996 | -122.28 | 37.654  |
|   001020684 |      1992 | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+

Then I transform it like so:

SELECT longbyyear.*
FROM crosstab($$
    SELECT 
    business_id, 
    year_move, 
    max(longitude::float)
    from business_moves
    where year_move::int between 1991 and 2010 
    group by business_id, year_move
    order by business_id, year_move;
    $$
) 
AS longbyyear(biz_id character varying, "long91" float,"long92" float,"long93" float,"long94" float,"long95" float,"long96" float,"long97" float, "long98" float, "long99" float,"long00" float,"long01" float,
"long02" float,"long03" float,"long04" float,"long05" float, 
"long06" float, "long07" float, "long08" float, "long09" float, "long10" float);

And it (mostly) gets me the desired output.

+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id  |  long91  |  long92  |  long93  | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … |        |        |        |
| 1000678 | 118.224  |          |          |        | … |        |        |        |
| 1002158 | 121.98   |          |          |        | … |        |        |        |
| 1004092 | 71.2384  |          |          |        | … |        |        |        |
| 1007801 | 118.0312 |          |          |        | … |        |        |        |
| 1007855 | 71.1769  |          |          |        | … |        |        |        |
| 1008697 | 71.0394  | 71.0358  |          |        | … |        |        |        |
| 1008986 | 71.1013  |          |          |        | … |        |        |        |
| 1009617 | 119.9965 |          |          |        | … |        |        |        |
+---------+----------+----------+----------+--------+---+--------+--------+--------+

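The only snag is that ideally I would populate a value for every year, not just for the years in which a move happened. So all fields would be filled, with a value for each year and the most recent address carried over to the next year. I could fix this with manual updates (if a cell is blank, use the previous column). I just wonder whether there is a clever way to do it with the crosstab() function or some other way, possibly combined with a custom function.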

sql postgresql postgresql-9.1 crosstab generate-series
2 Answers

3 votes

I assume you have the actual date of each business move, so we can make a meaningful pick per year. I use this more meaningful test case:

CREATE TABLE business_moves (
  business_id int  -- why (inefficient) varchar here?
, move_date date
, longitude float
, latitude float
);

INSERT INTO business_moves VALUES 
  (001013580, '1991-1-1', 71.0557, 42.3588)
, (001015924, '1993-1-1', 71.0728, 42.3504)
, (001015924, '1993-3-3', 73.0728, 43.3504)  -- 2nd move this year
, (001015924, '1996-1-1', -122.28, 37.654)
, (001020684, '1992-1-1', 84.3381, 33.5775)
;

Complete, very fast solution

SELECT *
FROM   crosstab(
   $$
   SELECT business_id, year
        , first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
   FROM  (
      SELECT *
           , count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
      FROM  (SELECT DISTINCT business_id FROM business_moves) b
      CROSS  JOIN generate_series(1991, 2010) year
      LEFT   JOIN (
         SELECT DISTINCT ON (1,2)
                business_id
              , EXTRACT('year' FROM move_date)::int AS year
              , point(longitude, latitude) AS x
         FROM   business_moves
         WHERE  move_date >= '1991-1-1'
         AND    move_date <  '2011-1-1'
         ORDER  BY 1,2, move_date DESC
         ) bm USING (business_id, year)
      ) sub
   $$
, 'VALUES
      (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
    , (2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
   ) AS t(biz_id int
         , x91 point, x92 point, x93 point, x94 point, x95 point
         , x96 point, x97 point, x98 point, x99 point, x00 point
         , x01 point, x02 point, x03 point, x04 point, x05 point
         , x06 point, x07 point, x08 point, x09 point, x10 point);

Result:

 biz_id  |        x91        |        x92        |        x93        |        x94        |        x95        |        x96        |        x97        ...
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
 1013580 | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) ...
 1015924 |                   |                   | (73.0728,43.3504) | (73.0728,43.3504) | (73.0728,43.3504) | (-122.28,37.654)  | (-122.28,37.654)  ...
 1020684 |                   | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) ...

Step by step

Step 1

Fix what you had:

SELECT *
FROM crosstab(
   $$
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date) AS year
        , point(longitude, latitude) AS long_lat
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
   $$
 ,'VALUES
      (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
    , (2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
   ) AS t(biz_id int
        , x91 point, x92 point, x93 point, x94 point, x95 point
        , x96 point, x97 point, x98 point, x99 point, x00 point
        , x01 point, x02 point, x03 point, x04 point, x05 point
        , x06 point, x07 point, x08 point, x09 point, x10 point);

For lat and lon to be meaningful, combine the two into a point. (Alternatively, just concatenate a text representation.)
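As a small illustration of both options, the sketch below builds a point from the first test row's coordinates, reads the two components back by subscript, and shows the text alternative. The column aliases are made up for this example.

```sql
-- Illustrative only: literal coordinates taken from the first test row.
SELECT point(71.0557, 42.3588)           AS long_lat   -- displayed as (71.0557,42.3588)
     , (point(71.0557, 42.3588))[0]      AS longitude  -- subscript 0 is the x component
     , (point(71.0557, 42.3588))[1]      AS latitude   -- subscript 1 is the y component
     , concat_ws(', ', 71.0557, 42.3588) AS as_text;   -- plain text alternative
```

A point keeps the pair together in one crosstab value column, which is why the queries here declare the output columns as point.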

You may want more data per row. Use DISTINCT ON instead of max() to get the latest (complete) row per year.
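To see why this matters once more than one column is involved, here is a minimal sketch on made-up data (the moves CTE and its values are hypothetical, not from the question). Taking max() per column can combine values from different rows, while DISTINCT ON returns one actual row.

```sql
-- Hypothetical data: one business with two moves in the same year.
WITH moves(business_id, move_date, longitude, latitude) AS (
   VALUES (1, date '1993-01-01', 80.0, 40.0)
        , (1, date '1993-03-03', 73.0, 43.0)
   )
-- max() per column: yields (80.0, 43.0), a location that never existed.
SELECT 'max()' AS method, business_id
     , max(longitude) AS longitude, max(latitude) AS latitude
FROM   moves
GROUP  BY business_id
UNION  ALL
(
-- DISTINCT ON: yields (73.0, 43.0), the complete row of the latest move.
SELECT DISTINCT ON (business_id)
       'DISTINCT ON', business_id, longitude, latitude
FROM   moves
ORDER  BY business_id, move_date DESC
);
```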

As long as values may be missing anywhere in the grid, you must use the crosstab() variant with two input parameters.
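The difference between the two variants can be sketched on a tiny made-up input (the ids, years, and letters below are hypothetical). It assumes the tablefunc module is installed, for example with CREATE EXTENSION IF NOT EXISTS tablefunc.

```sql
-- One parameter: values fill the output columns from left to right,
-- regardless of their category (year).
SELECT * FROM crosstab(
   $$VALUES (1, 1991, 'a'), (1, 1993, 'b'), (2, 1993, 'c')$$
   ) AS t(id int, y91 text, y92 text, y93 text);
--  id | y91 | y92 | y93
--   1 | a   | b   |        <- 'b' belongs to 1993 but lands in y92
--   2 | c   |     |        <- 'c' belongs to 1993 but lands in y91

-- Two parameters: the second query lists the categories,
-- so every value lands in the column for its own year.
SELECT * FROM crosstab(
   $$VALUES (1, 1991, 'a'), (1, 1993, 'b'), (2, 1993, 'c')$$
 , $$VALUES (1991), (1992), (1993)$$
   ) AS t(id int, y91 text, y92 text, y93 text);
--  id | y91 | y92 | y93
--   1 | a   |     | b
--   2 |     |     | c
```

This misplacement is likely what skews the question's output toward the left-hand columns.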

I adjusted the query to use move_date date instead of year_move.

Step 2

"ideally I would populate a value for every year"

Build a full grid of values (one cell for every business and year) with a CROSS JOIN of businesses and years:

SELECT *
FROM  (SELECT DISTINCT business_id FROM business_moves) b
CROSS  JOIN generate_series(1991, 2010) year
LEFT   JOIN (
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date)::int AS year
        , point(longitude, latitude) AS x
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
   ) bm USING (business_id, year);

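The set of years comes from generate_series().

The distinct businesses come from a separate SELECT. If you have a table of businesses, use that instead. It would also account for businesses that never moved.

LEFT JOIN to the actual business moves per year to arrive at the full grid of values.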

Step 3

Fill in defaults:

"the most recent address carried over to the next year"

SELECT business_id, year
     , COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
              , '(0,0)') AS x
FROM  (
   SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
   FROM  (SELECT DISTINCT business_id FROM business_moves) b
   CROSS  JOIN generate_series(1991, 2010) year
   LEFT   JOIN (
      SELECT DISTINCT ON (1,2)
             business_id
           , EXTRACT('year' FROM move_date)::int AS year
           , point(longitude, latitude) AS x
      FROM   business_moves
      WHERE  move_date >= '1991-1-1'
      AND    move_date <  '2011-1-1'
      ORDER  BY 1,2, move_date DESC
      ) bm USING (business_id, year)
   ) sub;

In the subquery sub, build on the query from step 2 and form groups (grp) of cells that share the same location.

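For this purpose, use the well-known aggregate function count() as a window aggregate function. NULL values are not counted, so the running count increases with every actual move, which forms groups of cells that share the same location.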

In the outer query, use the window function first_value() to pick the first value of each group for every row in that group. Voilà.
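The grouping trick is easier to follow on a single made-up series (the letters stand in for locations and are not from the question). The running count gives grp, and first_value() then spreads each group's first value over the gaps.

```sql
-- Hypothetical series for one business: moves in 1992 ('A') and 1994 ('B').
SELECT year, x, grp
     , first_value(x) OVER (PARTITION BY grp ORDER BY year) AS filled
FROM  (
   SELECT year, x
        , count(x) OVER (ORDER BY year) AS grp  -- NULLs are not counted
   FROM  (VALUES (1991, NULL), (1992, 'A'), (1993, NULL)
               , (1994, 'B'),  (1995, NULL)) v(year, x)
   ) sub
ORDER  BY year;
--  year | x | grp | filled
--  1991 |   |   0 |
--  1992 | A |   1 | A
--  1993 |   |   1 | A
--  1994 | B |   2 | B
--  1995 |   |   2 | B
```

Rows before the first move end up in group 0 with no value, which is the case the optional COALESCE covers.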

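To top it off, optionally (!) wrap that in COALESCE to fill the remaining cells with unknown location (no move yet) with (0,0). If you do, no NULL values remain and you can use the simpler form of crosstab(). That is a matter of taste.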


Step 4

Use the query from step 3 in an updated crosstab() call.
This should be about as fast as it gets. Indexes may help further.
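As one possible starting point for indexing, the sketch below creates a multicolumn index on the columns the inner DISTINCT ON query filters and sorts by. The index name is made up, and whether the planner actually uses it depends on table size and data distribution, so it is worth checking with EXPLAIN ANALYZE.

```sql
-- Hypothetical index; the name and column choice are assumptions, not tested advice.
CREATE INDEX business_moves_biz_date_idx
    ON business_moves (business_id, move_date DESC);
```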


2 votes

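To get the current location of every business_id in any given year, you need two things:

  1. A parameterized query to select the year, implemented as an SQL-language function.
  2. A dirty trick to aggregate on the year, group by business_id, and leave the coordinates untouched. This is done with a subquery in a CTE.

The function looks like this: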

CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
  WITH last_move AS (
    SELECT business_id, MAX(year_move) AS yr
    FROM business_moves
    WHERE year_move <= $1
    GROUP BY business_id)
  SELECT lm.business_id, $1::int AS yr, longitude, latitude
  FROM business_moves bm, last_move lm
  WHERE bm.business_id = lm.business_id
  AND bm.year_move = lm.yr;
$$ LANGUAGE sql;

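The CTE selects only the most recent move of each business up to the requested year. The main query then adds the longitude and latitude columns and puts the requested year into the returned table, rather than the year in which the most recent move took place. One caveat: the table needs a record giving the establishment and initial location of each business_id. Otherwise a business shows up only after it has moved somewhere else.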

Call this function with the usual SELECT * FROM business_location_in_year_x(1997).

If you really need a crosstab, you can adapt this code to give you the business locations for a range of years and then feed that into the crosstab() function.
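One way that adaptation might look is to call the function once per year through a LATERAL join. This is a sketch under assumptions: it needs PostgreSQL 9.3 or later for LATERAL (the question is tagged 9.1), and it assumes business_moves has the columns business_id, year_move (integer), longitude, and latitude, so that the function's result carries the requested year in year_move.

```sql
-- Assumed column names; each generated year is passed to the function above.
SELECT loc.business_id, loc.year_move, loc.longitude
FROM   generate_series(1991, 2010) AS y(yr)
CROSS  JOIN LATERAL business_location_in_year_x(y.yr) AS loc
ORDER  BY 1, 2;
```

The result has the row name, category, value shape that crosstab() expects as its source query. Businesses with no move on record by a given year have no row for that year, so the two-parameter form of crosstab() would be the safer choice.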
