我试图在数据框中填充两个新空列,其中包含来自同一数据框中其他列的数据,具体取决于它们是否已填充。
我正在尝试填充HIGH_PRCN_LAT和HIGH_PRCN_LON(以前称为F_Lat和F_Lon)的值,这些值表示这些行的最终纬度和行程,这将基于表中其他列的值。
情况1:填充Lat / Lon2(如ID 1和2中),使用大圆算法,应计算它们之间的中点,然后放入F_Lat和F_Lon。
情况2:Lat / Lon2为空,然后Lat / Lon1的值应放入F_Lat和F_Lon(与ID 3和4一样)。
我的代码如下,但不起作用(请参阅以前的版本,在编辑中删除)。
我使用的准备代码如下:
incidents <- structure(list(id = 1:9, StartDate = structure(c(1L, 3L, 2L,
2L, 2L, 3L, 1L, 3L, 1L), .Label = c("02/02/2000 00:34", "02/09/2000 22:13",
"20/01/2000 14:11"), class = "factor"), EndDate = structure(1:9, .Label = c("02/04/2006 20:46",
"02/04/2006 22:38", "02/04/2006 23:21", "02/04/2006 23:59", "03/04/2006 20:12",
"03/04/2006 23:56", "04/04/2006 00:31", "07/04/2006 06:19", "07/04/2006 07:45"
), class = "factor"), Yr.Period = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("2000 / 1", "2000 / 2", "2000 /3"
), class = "factor"), Description = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "ENGLISH TEXT", class = "factor"),
Location = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L
), .Label = c("Location 1", "Location 1 : Location 2"), class = "factor"),
Location.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "Location 1", class = "factor"), Postcode.1 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Postcode 1", class = "factor"),
Location.2 = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
1L), .Label = c("", "Location 2"), class = "factor"), Postcode.2 = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("", "Postcode 2"
), class = "factor"), Section = structure(c(2L, 2L, 3L, 1L,
4L, 4L, 2L, 1L, 4L), .Label = c("East", "North", "South",
"West"), class = "factor"), Weather.Category = structure(c(1L,
2L, 4L, 2L, 2L, 2L, 4L, 1L, 3L), .Label = c("Animals", "Food",
"Humans", "Weather"), class = "factor"), Minutes = c(13L,
55L, 5L, 5L, 5L, 522L, 1L, 11L, 22L), Cost = c(150L, 150L,
150L, 20L, 23L, 32L, 21L, 11L, 23L), Location.1.Lat = c(53.0506727,
53.8721035, 51.0233529, 53.8721035, 53.6988355, 53.4768766,
52.6874562, 51.6638245, 51.4301359), Location.1.Lon = c(-2.9991256,
-2.4004125, -3.0988341, -2.4004125, -1.3031529, -2.2298073,
-1.8023421, -0.3964916, 0.0213837), Location.2.Lat = c(52.7116187,
53.746791, NA, 53.746791, 53.6787167, 53.4527824, 52.5264907,
NA, NA), Location.2.Lon = c(-2.7493169, -2.4777984, NA, -2.4777984,
-1.489026, -2.1247029, -1.4645023, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
#gpsColumns is used as the following line of code is used for several data frames.
gpsColumns <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")
incidents [ , gpsColumns] <- NA
#create separate variable(?) containing a list of which rows are complete
ind <- complete.cases(incidents [,17])
#populate rows with a two Lat/Lons with great circle middle of both values
incidents [ind, c("HIGH_PRCN_LON_2","HIGH_PRCN_LAT_2")] <-
with(incidents [ind,,drop=FALSE],
do.call(rbind, geosphere::midPoint(cbind.data.frame(Location.1.Lon, Location.1.Lat), cbind.data.frame(Location.2.Lon, Location.2.Lat))))
#populate rows with one Lat/Lon with those values
incidents[!ind, c("HIGH_PRCN_LAT","HIGH_PRCN_LON")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]
我将根据推荐使用geosphere :: midPoint函数:http://r.789695.n4.nabble.com/Midpoint-between-coordinates-td2299999.html。
不幸的是,当有几种情况时,这种填充列的方式似乎不起作用。
抛出的当前错误是:
Error in `$<-.data.frame`(`*tmp*`, F_Lat, value = integer(0)) :
replacement has 0 rows, data has 178012
编辑:也发布到reddit:https://www.reddit.com/r/Rlanguage/comments/bdvavx/conditional_updating_column_in_dataframe/
编辑:添加了我不理解的代码部分的清晰度。
#replaces the F_Lat2/F_Lon2 columns in rows with a both sets of input coordinates
dataframe[ind, c("F_Lat2","F_Lon2")] <-
#I am unclear on what this means, specifically what the "with" function does and what "drop=FALSE" does and also why they were used in this case.
with(dataframe[ind,,drop=FALSE],
#I am unclear on what do.call and rbind are doing here, but the second half (geosphere onwards) is binding the Lats and Lons to make coordinates as inputs for the gcIntermediate function.
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
虽然你的代码不适合我,但我无法计算你期望的相同精确值,我怀疑你看到的错误可以通过这些步骤来解决。 (这里的数据在底部。)
complete.cases
步骤,它将节省时间。cbind.data.frame
进行内部gcIntermediate
。我推断自己
gcIntermediate([dataframe...
^
this is an error in R
你将这些列绑定在一起,所以我将使用cbind.data.frame
。 (使用cbind
本身从geosphere
产生了一些可忽略的警告,所以你可以使用它而不是suppressWarnings
,但是这个功能有点强大,因为它也会掩盖其他警告。)
此外,因为看起来你想为每对坐标一个中间值,我添加了gcIntermediate(..., n=1)
参数。
do.call(rbind, ...)
的使用是因为gcIntermediate
返回list
,所以我们需要将它们组合在一起。
dataframe$F_Lon2 <- dataframe$F_Lat2 <- NA_real_
ind <- complete.cases(dataframe[,4])
dataframe[ind, c("F_Lat2","F_Lon2")] <-
with(dataframe[ind,,drop=FALSE],
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
dataframe[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
dataframe
# ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon F_Lat2 F_Lon2
# 1 1 19.05067 -3.999126 92.71332 -6.759169 55.88200 -5.379147 55.78466 -6.709509
# 2 2 58.87210 -1.400413 54.74679 -4.479840 56.80945 -2.940126 56.81230 -2.942029
# 3 3 33.02335 -5.098834 NA NA 33.02335 -5.098834 33.02335 -5.098834
# 4 4 54.87210 -4.400412 NA NA 54.87210 -4.400412 54.87210 -4.400412
更新,使用您的新incidents
数据并切换到geosphere::midPoint
。
试试这个:
incidents$F_Lon2 <- incidents$F_Lat2 <- NA_real_
ind <- complete.cases(incidents[,4])
incidents[ind, c("F_Lat2","F_Lon2")] <-
with(incidents[ind,,drop=FALSE],
geosphere::midPoint(cbind.data.frame(Location.1.Lat,Location.1.Lon),
cbind.data.frame(Location.2.Lat,Location.2.Lon)))
incidents[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
一个(大)差异是geosphere::gcIntermediate(..., n=1)
返回结果列表,而geosphere::midPoint(...)
(没有n=
)只返回一个矩阵,因此不需要rbind
ing。
数据:
dataframe <- read.table(header=T, stringsAsFactors=F, text="
ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon
1 19.0506727 -3.9991256 92.713318 -6.759169 55.88199535 -5.3791473
2 58.8721035 -1.4004125 54.746791 -4.47984 56.80944725 -2.94012625
3 33.0233529 -5.0988341 NA NA 33.0233529 -5.0988341
4 54.8721035 -4.4004125 NA NA 54.8721035 -4.4004125")