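I'm new to SQL and am trying to carry over what I already know from Python. I have a script that connects over ODBC to the SQL Server database I normally browse in SSMS, so that I can work with the data in Python: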
import pyodbc
import pandas as pd
# Connect to SQL Server over ODBC
conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=PMZZ315\RION;'  # raw string keeps the backslash literal
                      'Database=Warehouse;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()
df = pd.read_sql_query("SELECT [LetId],[StreetAddressLine1],[CompanyName] FROM Dim.Let", conn)
print(df.head())  # preview the first rows
#print(df.columns)
# Select duplicate rows, except the first occurrence, based on CompanyName and StreetAddressLine1
duplicateRowsDF = df[df.duplicated(['CompanyName','StreetAddressLine1'])]
#print("Duplicate rows, except the first occurrence, based on CompanyName and StreetAddressLine1:")
print(duplicateRowsDF)
duplicateRowsDF.to_csv("duplicateRowsDFodbc.csv")
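Which SQL function can replace df.duplicated? What I'm trying to do is output the duplicated records, excluding the first occurrence, whenever both the company name and the street address are repeated.
A sample of the output dataset: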
LetId  StreetAddressLine1            CompanyName
32     1451 West Brimson View Court  Palmer
405    1808 North Lonion Ave         Ozark
465    4223 Monty Hwy                Alabama
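SQL tables represent unordered sets. Ordering is provided only by columns in the data, so without an ORDER BY there is no "first" row. Let me assume that letid defines the ordering.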
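To illustrate why the ordering matters, here is a small pure-Python sketch. The rows and the helper name `duplicates_except_first` are made up for illustration and are not taken from the question's table. It flags every row whose (CompanyName, StreetAddressLine1) pair has already been seen, which is roughly what `df.duplicated` does with its default `keep='first'`. Which row counts as "first" depends entirely on the sort key.

```python
def duplicates_except_first(rows, order_key):
    """Return rows whose (CompanyName, StreetAddressLine1) pair was already
    seen when the rows are visited in the order given by order_key."""
    seen = set()
    dupes = []
    for row in sorted(rows, key=order_key):
        key = (row["CompanyName"], row["StreetAddressLine1"])
        if key in seen:
            dupes.append(row)   # a repeat of an earlier row
        else:
            seen.add(key)       # first occurrence under this ordering
    return dupes


# Hypothetical sample rows (not from the real Dim.Let table)
rows = [
    {"LetId": 7, "StreetAddressLine1": "1 Example Rd", "CompanyName": "Acme"},
    {"LetId": 3, "StreetAddressLine1": "1 Example Rd", "CompanyName": "Acme"},
    {"LetId": 5, "StreetAddressLine1": "9 Sample St", "CompanyName": "Globex"},
]

by_letid = duplicates_except_first(rows, order_key=lambda r: r["LetId"])
by_letid_desc = duplicates_except_first(rows, order_key=lambda r: -r["LetId"])

print([r["LetId"] for r in by_letid])       # LetId ascending defines "first"
print([r["LetId"] for r in by_letid_desc])  # LetId descending defines "first"
```

With ascending LetId the row with LetId 7 is reported as the duplicate. With descending LetId it is the row with LetId 3. The data is the same, but the choice of ordering changes the answer.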
The canonical method in SQL uses row_number():
select t.*
from (select t.*,
             row_number() over (partition by CompanyName, StreetAddressLine1 order by letid) as seqnum
      from t
     ) t
where seqnum = 1;
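As a way to try the query without a SQL Server connection, the sketch below runs it against an in-memory SQLite database through Python's built-in `sqlite3` module. This assumes the bundled SQLite is version 3.25 or newer, since window functions were added in that release. The table name `let_demo` and its rows are invented for illustration. The sketch also runs a variant with `seqnum > 1`, which appears to be the closer match to `df.duplicated(...)` with its default `keep='first'`: `seqnum = 1` returns the first row of each group, while `seqnum > 1` returns only the repeats.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not the real Dim.Let table)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE let_demo (LetId INTEGER, StreetAddressLine1 TEXT, CompanyName TEXT)"
)
conn.executemany(
    "INSERT INTO let_demo VALUES (?, ?, ?)",
    [
        (1, "1 Example Rd", "Acme"),
        (2, "1 Example Rd", "Acme"),
        (3, "9 Sample St", "Globex"),
        (4, "1 Example Rd", "Acme"),
    ],
)

# Number the rows within each (CompanyName, StreetAddressLine1) group by LetId.
# {op} is filled in with "=" or ">" below; it is a fixed literal, not user input.
QUERY = """
SELECT LetId, StreetAddressLine1, CompanyName
FROM (SELECT d.*,
             ROW_NUMBER() OVER (PARTITION BY CompanyName, StreetAddressLine1
                                ORDER BY LetId) AS seqnum
      FROM let_demo AS d) AS numbered
WHERE seqnum {op} 1
ORDER BY LetId
"""

first_rows = conn.execute(QUERY.format(op="=")).fetchall()   # one row per group
repeat_rows = conn.execute(QUERY.format(op=">")).fetchall()  # repeats only
conn.close()

print(first_rows)
print(repeat_rows)
```

Against SQL Server, the same `SELECT` text (with the real table name in place of `let_demo`) could presumably be passed to `pd.read_sql_query` through the existing `pyodbc` connection, although that is not tested here.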