The following question concerns Microsoft SQL Azure (RTM) - 12.0.2000.8.
I have an invoice dataset (raw_data.invoices) that looks like this:
invoice_id | invoice_date | institution | billed_to | item | qty |
---|---|---|---|---|---|
12345 | 2024-07-12 | 1111 | John Smith | phone | 20 |
12345 | 2024-07-12 | 1111 | John Smith | button | 5 |
12345 | 2024-07-12 | 1111 | Jane Smith | button | 2 |
12346 | 2024-07-05 | 1111 | John Smith | phone | 20 |
12346 | 2024-07-05 | 1111 | Jane Smith | button | 2 |
I have a view (myview.invoices) that tidies up the table above according to some business requirements:
select
D.invoice_date,
D.invoice_id,
D.institution,
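-- aliased (hypothetical name): a view cannot expose two columns both called institution_name
C.institution_name as institution_name_raw,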
lower(trim(substring(C.institution_name, 1, charindex('-', C.institution_name)-1))) as institution_name,
D.billed_to,
D.item,
D.qty
from raw_data.invoices D
left join catalogues.institutions C
on
C.institution_code = D.institution
and a view (myview.last_2_inv_cycles) that identifies the two most recent invoice dates for each institution:
select
A.institution
, A.current_inv_cycle
, B.last_inv_cycle
from (
select
x.institution
, max(x.invoice_date) as current_inv_cycle
from myview.invoices x
group by
x.institution
) A
inner join (
select
z.institution
, max(z.invoice_date) as last_inv_cycle
from (
select
x.institution
, x.invoice_date
from myview.invoices x
where concat(x.institution, x.invoice_date) not in (
select concat(y.institution, max(y.invoice_date))
from myview.invoices y
group by y.institution
)
) z
group by
z.institution
) B
on A.institution=B.institution
Finally, I join the two views to identify any invoice lines with qty > 15 on the two most recent invoices, which are received weekly (myview.qty_over_15):
with over_2_weeks as (
select
x.institution,
x.billed_to,
x.item
from myview.invoices x
inner join myview.last_2_inv_cycles y
on x.institution = y.institution
and (x.invoice_date = y.last_inv_cycle or x.invoice_date = y.current_inv_cycle)
group by
x.institution,
x.billed_to,
x.item
having
sum(case when x.qty > 15 then 1 else 0 end) >= 2
-- exceptions defined by the business
and x.institution <> '2222'
)
select
A.invoice_date,
A.invoice_id,
A.institution,
A.billed_to,
A.item,
A.qty
from myview.invoices A
inner join over_2_weeks D
-- problematic join; takes over an hour
on
A.institution=D.institution
AND A.billed_to=D.billed_to
AND A.item=D.item
inner join myview.last_2_inv_cycles C
on
A.institution=C.institution
and A.invoice_date=C.current_inv_cycle
-- more exception list
where
A.billed_to not in (
'Jake Johnson', 'Bill Gates'
)
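As you can tell, the query is too complex and takes far too long: the final view fails to load even after running for more than 4 hours.
myview.qty_over_15 has the same cosmetic requirements as myview.invoices, which is why I run the query against the view rather than the raw_data.invoices table. I would like to keep it that way unless there is a better way to achieve the same goal.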
As for the indexes on raw_data.invoices:
create index idx_search_invoice_id
on raw_data.invoices(invoice_id)
create index idx_search_invoice_date
on raw_data.invoices(invoice_date)
create index idx_search_institution
on raw_data.invoices(institution)
create index idx_invoice_of_institution
on raw_data.invoices(invoice_id, institution)
create index idx_search_billed_to
on raw_data.invoices(institution, billed_to)
create index idx_search_billed_to_item
on raw_data.invoices(billed_to, item)
create index idx_search_bill
on raw_data.invoices(qty, billed_to)
create index idx_search_item_charge
on raw_data.invoices(
institution
, invoice_id
, billed_to
, item
)
Please help me with these queries. I am not sure where to look for a speed-up.
I believe all of these queries can be greatly simplified by using a single window function, DENSE_RANK.
WITH invoices AS (
SELECT
D.invoice_date,
D.invoice_id,
D.institution,
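-- aliased (hypothetical name): a CTE cannot contain two columns both called institution_name
C.institution_name AS institution_name_raw,
LOWER(TRIM(SUBSTRING(C.institution_name, 1, CHARINDEX('-', C.institution_name)-1))) as institution_name,
-- rank 1 = most recent invoice date for the institution, rank 2 = the one before it
DENSE_RANK() OVER(PARTITION BY D.institution ORDER BY D.invoice_date DESC) AS invoice_rank,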
D.billed_to,
D.item,
D.qty
FROM raw_data.invoices D
LEFT JOIN catalogues.institutions C
ON
C.institution_code = D.institution
)
SELECT * FROM invoices
WHERE invoice_rank IN (1,2)
AND institution <> '2222'
AND billed_to NOT IN (
'Jake Johnson', 'Bill Gates'
)
AND qty > 15
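How fast the query above runs will probably depend on how cheaply the engine can order rows by invoice_date within each institution. None of the indexes listed in the question is keyed on (institution, invoice_date), so an index along the following lines might help. This is an untested sketch: the index name is made up, and it is worth checking the actual execution plan before and after creating it.

```sql
-- Hypothetical index name; the key order matches the window's
-- PARTITION BY institution ORDER BY invoice_date DESC.
-- INCLUDE carries the other selected columns so the ranking step
-- can be answered from the index without lookups into the base table.
CREATE INDEX idx_institution_invoice_date
    ON raw_data.invoices (institution, invoice_date DESC)
    INCLUDE (invoice_id, billed_to, item, qty);
```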
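I am not sure what you are trying to achieve with sum(case when x.qty > 15 then 1 else 0 end) >= 2. If you can clarify why a plain qty > 15 filter is not suitable, I can adjust the query above to meet the requirement.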
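If the intent of that HAVING clause is "the same institution, billed_to and item has at least two lines over 15 across the last two invoice cycles, and I want the current-cycle rows for those", then a windowed SUM on top of DENSE_RANK may be enough. The sketch below is my guess at that requirement. It runs against the sample rows from the question through a VALUES list (with names and items written in English) so that it is self-contained. The CTE names sample_invoices, ranked and flagged and the column lines_over_15 are invented for illustration.

```sql
-- Sample rows from the question; in practice replace this CTE
-- with raw_data.invoices (or with myview.invoices).
WITH sample_invoices AS (
    SELECT v.invoice_id,
           CAST(v.invoice_date AS date) AS invoice_date,
           v.institution, v.billed_to, v.item, v.qty
    FROM (VALUES
        (12345, '2024-07-12', '1111', 'John Smith', 'phone',  20),
        (12345, '2024-07-12', '1111', 'John Smith', 'button',  5),
        (12345, '2024-07-12', '1111', 'Jane Smith', 'button',  2),
        (12346, '2024-07-05', '1111', 'John Smith', 'phone',  20),
        (12346, '2024-07-05', '1111', 'Jane Smith', 'button',  2)
    ) AS v (invoice_id, invoice_date, institution, billed_to, item, qty)
),
ranked AS (
    -- rank 1 = current cycle, rank 2 = previous cycle, per institution
    SELECT s.*,
           DENSE_RANK() OVER (PARTITION BY s.institution
                              ORDER BY s.invoice_date DESC) AS invoice_rank
    FROM sample_invoices AS s
    WHERE s.institution <> '2222'          -- business exception
),
flagged AS (
    -- count lines over 15 per institution / billed_to / item,
    -- looking only at the two most recent cycles
    SELECT r.*,
           SUM(CASE WHEN r.qty > 15 THEN 1 ELSE 0 END)
               OVER (PARTITION BY r.institution, r.billed_to, r.item) AS lines_over_15
    FROM ranked AS r
    WHERE r.invoice_rank <= 2
)
SELECT f.invoice_date, f.invoice_id, f.institution, f.billed_to, f.item, f.qty
FROM flagged AS f
WHERE f.invoice_rank = 1                   -- return current-cycle rows only
  AND f.lines_over_15 >= 2
  AND f.billed_to NOT IN ('Jake Johnson', 'Bill Gates');
```

On the sample data this returns a single row: invoice 12345 dated 2024-07-12, John Smith, phone, qty 20. There are two caveats. An institution with only one invoice date is dropped by the inner join in myview.last_2_inv_cycles but is still considered here. Rows with a NULL item are grouped together here, whereas the original equality join would discard them. The cosmetic institution_name column can be added by joining catalogues.institutions in the final SELECT.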