在循环(更改日期)中从网站提取数据到工作表

问题描述 投票:1回答:1

首先,我是VBA的新手,想在Excel中建立一个天气统计数据。非常感谢帮助!

为此,我需要来自多个网站的数据,包括测量温度,气象站和时间的信息。

通过我在互联网上找到的一些VBA信息,到目前为止我已经编写了一个代码,它只在即时窗口中为我提供了一个网站所需的信息。

看起来像这样:

7.4°C | Wien-Mariabrunn (225m) | 14:00
7.6°C | Wien-Hohe Warte (198m) | 14:00
7.6°C | Wien-Unterlaa (200m) | 14:00
7.7°C | Wien-Schwechat (183m) | 14:00
7.8°C | Wien-Donaufeld (160m) | 14:00
8.1°C | Grossenzersdorf (154m) | 14:00
8.2°C | Wien-City (177m) | 14:00

Dim xmlReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument

Dim Temps1 As MSHTML.IHTMLElementCollection
Dim temps2 As MSHTML.IHTMLElementCollection
Dim Temp As MSHTML.IHTMLElement

xmlReq.Open "GET", "https://kachelmannwetter.com/at/messwerte/wien/temperatur/20190101-1300z.html", False
xmlReq.send

If xmlReq.Status <> 200 Then
    MsgBox "Problem" & vbNewLine & xmlReq.Status & " - " & xmlReq.statusText
    Exit Sub
End If

HTMLDoc.body.innerHTML = xmlReq.responseText

Set Temps1 = HTMLDoc.getElementsByClassName("ap o o-1 o-tmp-5")
Set temps2 = HTMLDoc.getElementsByClassName("ap o o-1 o-tmp-1")

For Each Temp In Temps1
    Debug.Print Temp.Title
Next Temp

For Each Temp In temps2
    Debug.Print Temp.Title
Next Temp

我遇到的第一个问题是,我不知道如何将这些信息放入一个或多个单元格中。

第二个问题是,我需要来自多个网站的相同数据,从开始日期到结束日期。我在这个示例代码中使用的网站是https://kachelmannwetter.com/at/messwerte/wien/temperatur/20190101-1300z.html。最后,您可以找到日期,在此示例中为“20190101”,时间为“1300”。所以为此,我需要某种循环。

所以最后我需要一个工作表,其中A列带有日期,B列带有时间,C列带有数据(每个日期和时间)。

我希望这在某种程度上是可以理解的,我非常感谢任何帮助。

html excel vba web-scraping
1个回答
0
投票

以下适用于短日期范围。较长的日期范围会导致响应变慢。很可能网站阻塞/限制。为此:

  1. 我包含一个变量pauseIndex,这意味着在添加下一个请求之前,每个x(= pauseIndex)URL的数量,延迟为y秒(由waitSeconds指定)。你可以玩这个。
  2. 您可以考虑调整代码以每日/每月批次运行,并附加到现有数据集的底部或使用当前日/月作为标题写入新工作表。
  3. 也许旋转/更改IP并使用MSXML2.ServerXMLHTTP

修补上述内容以获得最佳设置和批量请求大小(如果进行批处理)。


指定开始日期时间和结束日期时间:

日期范围在名为Date ranges的工作表中指定。它有以下设置:


构建数据集:

我建议您构建一个平面数据集,在其中指定要返回信息的确切站点。并非每个时间段都显示每个电台。

stations = Array("Wien-Schwechat", "Wien-Unterlaa", "Wien-Mariabrunn", "Wien-Hohe Warte", "Grossenzersdorf", _
                     "Wien-Donaufeld", "Wien-City")

你可以扩展它。我包含一个字典变量newStations,它存储了所有未在列表中进行监视的站点。您可以轻松编写这些内容,以帮助确定要在数据集中监视/包含的其他工作站。

缺失站点读数的占位符值用于确保完整的数据集。

您可能希望将“异常值”标准化 - 例如,实际小时值可以在范围内而不是在小时内。在下面的演示中,为一个站准确检索16:20。您可以将其标准化为16:00。


助手功能/子:

代码中使用了许多辅助函数和1个子函数。

  1. GetAllLinks。生成开始日期时间和结束日期时间之间的所有请求URL。请参阅代码中的注释。这些可以循环以发出每个数据请求
  2. EmptyDict - 确保在请求之间清除站数据
  3. UpdateDictForNoReading。处理未在指定日期时间内报告受监控工作站的情况。它使用"No reading"更新温度和长站描述
  4. WriteOutResults。生成"flat",即不嵌套的2D数组结构,并将结果写入指定的输出表

检索站点和站点数据:

我使用css attribute = value选择器和contains运算符来定位站数据。

以示例站的HTML为例

<a class="ap o o-1 o-tmp--1" data-target="#obs-detail-3h" data-toggle="modal" data-left="635" data-top="545" onclick="obs_detail_3h('-1.0°C', 'Wien-Schwechat (183m)', '16:20','110360', '201901031500');" title="-1.0°C | Wien-Schwechat (183m) | 16:20" style="left: 408.533px; top: 337.757px;">-1</a>

如果我们查看class属性,我们会看到如下:

class="ap o o-1 o-tmp--1"

class属性的值是"ap o o-1 o-tmp--1",它实际上是一系列由空格分隔的类。每个站类值具有相同的子字符串,即o-tmp。你可以稍微改变一下。我使用querySelectorAll返回在class属性值中具有此子字符串的所有元素的nodeList。

Set mapStations = html.querySelectorAll("[class*='o-tmp']")

这匹配页面上的所有站点(地图)。

nodeList(title)中每个节点的mapStations属性包含感兴趣的数据:

title="-1.0°C | Wien-Schwechat (183m) | 16:20"

该字符串包含管道(|)分隔符。我可以使用split()生成一个包含每一位信息的数组:

arr = Split(mapStations.item(i).Title, " | ")

这会生成一个数组,其中包含不同指数的-1.0°CWien-Schwechat (183m)16:20。使用这个例子,我将-1.0°C存储在变量temp中,Wien-Schwechat (183m)存储在变量stationFull中,只是将站点名称Wien-Schwechatin station16:20存储在time中。


ALL:

  1. 重构以降低嵌套水平
  2. 处理状态码<> 200等情况时的错误处理
  3. 将变量声明更接近其用法

要求:

  1. VBE>工具>引用>添加对Microsoft HTML对象库的引用
  2. 工作表名为Date ranges
  3. 工作表名为Output

Date ranges中的数据应如上图所示。


VBA:

Option Explicit
Public Sub GetInfo()
    'VBE > Tools > References > Microsoft HTML Object Library
    'Collect hourly temperature readings from list of stations in array stations.
    'Missing readings are populated with "Missing reading". Times specified in request are not necessarily identical _
    'to that correctly returned from page as reading reported within an hour interval may not be on the hour
    Dim html As HTMLDocument, i As Long, arr() As String, mapStations As Object, dict As Object, newStations As Object
    Dim time As String, station As String, temp As String, stations(), results(), j As Long
    Dim urls As Object, url As Variant, startOfDateString As Long, currDate As String, stationFull As String
    Dim outputSht As Worksheet, x As Long
    Const pauseIndex As Long = 20
    Const waitSeconds As Long = 1
    Const PREFIX As String = "https://kachelmannwetter.com/at/messwerte/wien/temperatur/"
    Const SUFFIX As String = "z.html"
    startOfDateString = InStrRev(PREFIX, "/") + 1
    Set outputSht = ThisWorkbook.Worksheets("Output")
    Set urls = GetAllLinks(PREFIX, SUFFIX)
    Set html = New HTMLDocument
    Set dict = CreateObject("Scripting.Dictionary")
    Set newStations = CreateObject("Scripting.Dictionary")
    stations = Array("Wien-Schwechat", "Wien-Unterlaa", "Wien-Mariabrunn", "Wien-Hohe Warte", "Grossenzersdorf", _
                     "Wien-Donaufeld", "Wien-City") 'order of stations here should match that in sheet
    j = 1

    For i = LBound(stations) To UBound(stations)
        dict(stations(i)) = vbNullString
    Next

    ReDim results(1 To 1 * urls.Count)

    With CreateObject("MSXML2.XMLHTTP")

        For Each url In urls
            x = x + 1
            If x Mod pauseIndex = 0 Then Application.Wait Now + TimeSerial(0, 0, waitSeconds)
            DoEvents
            .Open "GET", url, False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            html.body.innerHTML = .responseText

            Set mapStations = html.querySelectorAll("[class*='o-tmp']")

            For i = 0 To mapStations.Length - 1
                arr = Split(mapStations.item(i).Title, " | ")
                currDate = Join(Array(Mid$(url, startOfDateString + 4, 2), Mid$(url, startOfDateString + 6, 2), Mid$(url, startOfDateString, 4)), "-")
                temp = arr(0)
                station = Split(arr(1), " (")(0)
                stationFull = arr(1)
                time = arr(2)
                If Not dict.Exists(station) Then
                    newStations(station) = vbNullString 'in case you are interested in which stations exist you are not monitoring
                Else
                    dict(station) = Array(currDate, time, station, stationFull, temp)
                End If
            Next

            Set dict = UpdateDictForNoReading(dict, currDate, time)
            results(j) = dict.items
            j = j + 1
            Set dict = EmptyDict(dict)
        Next
    End With

    WriteOutResults outputSht, results, UBound(stations) + 1
End Sub

Public Function UpdateDictForNoReading(ByVal dict As Object, ByVal currDate As String, ByVal time As String) As Object
    'Loop dictionary containing station readings. If current value for key is not an array then no readings where found. _
    'then dict is updated with "No reading" for station long text (which includes m e.g. Wien-Schwechat (183m)) and temperature
    Dim key As Variant
    For Each key In dict
        If Not IsArray(dict(key)) Then dict(key) = Array(currDate, time, key, "No reading", "No reading")
    Next
    Set UpdateDictForNoReading = dict
End Function

Public Sub WriteOutResults(ByVal ws As Worksheet, ByRef results As Variant, ByVal stationCount As Long)
    'Loop results array which at each index should have a child array which is comprised of all stations specified _
    'The code unravels the nested structure into "flat" array for writing out to sheet. Aim is to be more efficient _
    'with writing out to sheet
    'The sheet to write results to is passed as argument ws. Headers are stated below.
    Dim headers(), outputArr(), i As Long, arr(), j As Long, r As Long, c As Long
    headers = Array("Date", "Time", "Station", "StationFull", "Temp")
    ReDim outputArr(1 To UBound(results) * stationCount, 1 To UBound(headers) + 1)

    For i = LBound(results) To UBound(results)
        arr = results(i)                         '0-6
        For j = LBound(arr) To UBound(arr)
            r = r + 1
            If IsArray(arr(j)) Then
                For c = LBound(arr(j)) To UBound(arr(j))
                    outputArr(r, c + 1) = arr(j)(c)
                Next
            End If
        Next
    Next
    With ws
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(outputArr, 1), UBound(outputArr, 2)) = outputArr
    End With
End Sub

Public Function EmptyDict(ByVal dict As Object) As Object
    'Ensures no data remains from prior request
    Dim key As Variant
    For Each key In dict
        dict(key) = vbNullString
    Next
    Set EmptyDict = dict
End Function

Public Function GetAllLinks(ByVal PREFIX As String, ByVal SUFFIX As String) As Object
    'Generate all urls between start date time and end date time. Accounts for fact that there is no real midnight. 00:00 uhr retrieves the 23:00 of prior day.
    'Times selected on page e.g. 1:00 uhr are one hour ahead of what is used in url string e.g. 1 > 0.
    Dim ws As Worksheet, hours(), urls As Collection

    Set urls = New Collection
    Set ws = ThisWorkbook.Worksheets("Date ranges")
    'url "0000" = 1am. The selection of 00:00 in the sheet gives 23:00 of prior date
    hours = Array("0000", "0100", "0200", "0300", "0400", "0500", "0600", "0700", "0800", "0900", "1000", "1100", "1200", _
                  "1300", "1400", "1500", "1600", "1700", "1800", "1900", "2000", "2100", "2200", "2300")

    Dim startDate As Date, endDate As Date, startTime As String, endTime As String, currentDate As Date
    Dim endIndex As Long, startIndex As Long

    With ws
        startDate = .Cells(1, 2).Value2          'Required for input yyyy-mm-dd; Required for output yyyymmdd
        endDate = .Cells(1, 5).Value2
        startTime = .Cells(2, 2)
        endTime = .Cells(2, 5)
    End With

    startIndex = Application.Match(startTime, hours) - 2
    endIndex = Application.Match(endTime, hours) - 2
    currentDate = startDate

    Dim i As Long, s As Long, e As Long

    Do While currentDate <= endDate
        If startDate = endDate Then
            s = startIndex
            e = endIndex
        Else
            Select Case currentDate
            Case startDate
                s = startIndex
                e = UBound(hours)
            Case endDate
                s = LBound(hours)
                e = endIndex
            Case Else
                s = LBound(hours)
                e = UBound(hours)
            End Select
        End If
        For i = s To e
            urls.Add PREFIX & Format$(currentDate, "yyyymmdd") & "-" & hours(i) & SUFFIX
        Next
        currentDate = DateAdd("d", 1, currentDate)

    Loop
    Set GetAllLinks = urls
End Function

示例输出:

© www.soinside.com 2019 - 2024. All rights reserved.