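I need to grab the HTML of a website that is launched from a .url file, then find a certain line and grab every line below it up to a certain point. A sample of the HTML code is below: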
</p><ul><li>(None)</li></ul><h2><span style="font-size:18px;">Authorized Administrators and Users</span></h2><pre><b>Authorized Administrators:</b>
jim (you)
    password: (blank/none)
bob
    password: Littl3@birD
batman
    password: 3ndur4N(e&home
dab
    password: captain
<b>Authorized Users:</b>
bag
crab
oliver
james
scott
john
apple
</pre><h2><span style="font-size:18px;">Competition Guidelines</span></h2>
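I need to put all the authorized administrators into one txt file, the authorized users into another txt file, and both into a third txt file. Can this be done using only batch and PowerShell?
Here is my attempt at getting what you want.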
$url = '<THE URL TAKEN FROM THE .URL SHORTCUT FILE>'
$outputPath = '<THE PATH WHERE YOU WANT THE CSV FILES TO BE CREATED>'
# get the content of the web page
$html = (Invoke-WebRequest -Uri $url).Content
# load the assembly to de-entify the HTML content
Add-Type -AssemblyName System.Web
$html = [System.Web.HttpUtility]::HtmlDecode($html)
# get the Authorized Admins block
$admins = @()
# the lazy (.+?) stops at the first <b> that follows, not the last one on the page
if ($html -match '(?s)<b>Authorized Administrators:</b>(.+?)<b>') {
    $adminblock = $matches[1].Trim()
    # inside this text block, get the admin usernames and passwords
    $regex = [regex] '(?m)^(?<name>.+)\s*password:\s+(?<password>.+)'
    $match = $regex.Match($adminblock)
    while ($match.Success) {
        $admins += [PSCustomObject]@{
            'Name'     = $($match.Groups['name'].Value -replace '\(you\)', '').Trim()
            'Type'     = 'Admin'
            # comment out this next property if you don't want passwords in the output
            'Password' = $match.Groups['password'].Value.Trim()
        }
        $match = $match.NextMatch()
    }
} else {
    Write-Warning "Could not find 'Authorized Administrators' text block."
}
# get the Authorized Users block
$users = @()
# the lazy (.+?) stops at the first </pre> that follows, not the last one on the page
if ($html -match '(?s)<b>Authorized Users:</b>(.+?)</pre>') {
    $userblock = $matches[1].Trim()
    # inside this text block, get the authorized usernames
    $regex = [regex] '(?m)(?<name>.+)'
    $match = $regex.Match($userblock)
    while ($match.Success) {
        $users += [PSCustomObject]@{
            'Name' = $match.Groups['name'].Value.Trim()
            'Type' = 'User'
        }
        $match = $match.NextMatch()
    }
} else {
    Write-Warning "Could not find 'Authorized Users' text block."
}
# write the csv files
$admins | Export-Csv -Path $(Join-Path -Path $outputPath -ChildPath 'admins.csv') -NoTypeInformation -Force
$users | Export-Csv -Path $(Join-Path -Path $outputPath -ChildPath 'users.csv') -NoTypeInformation -Force
($admins + $users) | Export-Csv -Path $(Join-Path -Path $outputPath -ChildPath 'adminsandusers.csv') -NoTypeInformation -Force
When done, you will have three CSV files:
admins.csv
Name   Type  Password
----   ----  --------
jim    Admin (blank/none)
bob    Admin Littl3@birD
batman Admin 3ndur4N(e&home
dab    Admin captain
users.csv
Name   Type
----   ----
bag    User
crab   User
oliver User
james  User
scott  User
john   User
apple  User
adminsandusers.csv
Name   Type  Password
----   ----  --------
jim    Admin (blank/none)
bob    Admin Littl3@birD
batman Admin 3ndur4N(e&home
dab    Admin captain
bag    User
crab   User
oliver User
james  User
scott  User
john   User
apple  User
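The $url placeholder at the top of the script is meant to come from the .url shortcut file. A minimal sketch for reading it, assuming the shortcut is a standard Internet Shortcut text file that contains a URL= line; the function name Get-ShortcutUrl and the path placeholder are hypothetical and not part of the answer above:

```powershell
# Hypothetical helper: returns the value of the URL= line of a .url shortcut file.
function Get-ShortcutUrl {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    # Take the first line that starts with "URL=" (matching is case-insensitive).
    $line = Get-Content -LiteralPath $Path |
        Where-Object { $_ -match '^\s*URL=' } |
        Select-Object -First 1

    if (-not $line) {
        throw "No URL= entry found in '$Path'."
    }

    # Strip the key and surrounding whitespace, leaving only the address.
    ($line -replace '^\s*URL=', '').Trim()
}

# Usage:
# $url = Get-ShortcutUrl -Path '<THE PATH TO THE .URL SHORTCUT FILE>'
```

The result can then be passed to Invoke-WebRequest in place of the hard-coded placeholder.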
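In general, as noted, using a dedicated HTML parser is preferable, but given the easily identifiable enclosing tags in the input (assuming no variations), you can get away with a regex-based solution.
Here is a regex-based PSv4+ solution, but note that it relies on the input containing whitespace (newlines, leading spaces) exactly as shown in your question: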
# $html is assumed to contain the input HTML text (can be a full document).
$admins, $users = (
  # Split the HTML text into the sections of interest.
  # (?s) lets . match newlines, so a multi-line document is handled too.
  $html -split
    '(?s)\A.*<b>Authorized Administrators:</b>|<b>Authorized Users:</b>' `
    -ne '' `
    -replace '(?s)<.*'
).ForEach({
  # Extract admin lines and user lines each, as an array.
  , ($_ -split '\r?\n' -ne '')
})
# Split off the "password:" labels so that names and passwords alternate,
# then pair them up into custom objects with .username and .password properties.
# (The result is assigned back to $admins; otherwise the objects would only be printed.)
$pairs = @($admins -split '\s+password:\s+' -ne '')
$admins = @(
  for ($i = 0; $i -lt $pairs.Count; $i += 2) {
    [pscustomobject] @{ username = $pairs[$i]; password = $pairs[$i + 1] }
  }
)
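# Create custom objects with the same structure for the users.
$users = $users.ForEach({
  [pscustomobject] @{ username = $_; password = '' }
})
# Output to CSV files.
# -NoTypeInformation suppresses the "#TYPE" header line that Windows PowerShell adds.
$admins | Export-Csv admins.csv -NoTypeInformation
$users | Export-Csv users.csv -NoTypeInformation
@($admins) + @($users) | Export-Csv all.csv -NoTypeInformation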
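Given that your question doesn't flesh out the requirements, assumptions were made about the desired output format (and HTML entities such as &amp; are not decoded).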
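If decoded values are wanted, each extracted string can be decoded after the split. A minimal sketch, assuming .NET 4 or later, where [System.Net.WebUtility] is available without loading System.Web; the sample string is illustrative (one of the passwords from the question with its ampersand written as an entity):

```powershell
# Illustrative input: a value as it might appear in the raw HTML source.
$raw = '3ndur4N(e&amp;home'

# Decode HTML entities such as &amp; back into their literal characters.
$decoded = [System.Net.WebUtility]::HtmlDecode($raw)

$decoded   # 3ndur4N(e&home
```

Decoding per value, after the sections have been split out, keeps a decoded < or > from interfering with the tag-based regexes.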
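This is rather ugly, and very fragile. A good HTML parser would be a far better way to do this.
However, presuming you don't have the resources for that, here is one way to get the data. If you really want to generate two more files [admin and user], you can do that from this object...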
# fake reading in a text file
# in real life, use Get-Content
$InStuff = @'
</p><ul><li>(None)</li></ul><h2><span style="font-size:18px;">Authorized Administrators and Users</span></h2><pre><b>Authorized Administrators:</b>
jim (you)
    password: (blank/none)
bob
    password: Littl3@birD
batman
    password: 3ndur4N(e&home
dab
    password: captain
<b>Authorized Users:</b>
bag
crab
oliver
james
scott
john
apple
</pre><h2><span style="font-size:18px;">Competition Guidelines</span></h2>
'@ -split '\r?\n'
$CleanedInStuff = $InStuff.
    Where({
        $_ -notmatch '^</' -and
        $_ -notmatch '^ ' -and
        $_
        })
$UserType = 'Administrator'
$UserInfo = foreach ($CIS_Item in $CleanedInStuff)
    {
    if ($CIS_Item.StartsWith('<b>'))
        {
        $UserType = 'User'
        continue
        }
    [PSCustomObject]@{
        Name = $CIS_Item.Trim()
        UserType = $UserType
        }
    }
# on screen
$UserInfo
# to CSV
$UserInfo |
    Export-Csv -LiteralPath "$env:TEMP\LandonBB.csv" -NoTypeInformation
On-screen output...
Name      UserType
----      --------
jim (you) Administrator
bob       Administrator
batman    Administrator
dab       Administrator
bag       User
crab      User
oliver    User
james     User
scott     User
john      User
apple     User
CSV file content...
"Name","UserType"
"jim (you)","Administrator"
"bob","Administrator"
"batman","Administrator"
"dab","Administrator"
"bag","User"
"crab","User"
"oliver","User"
"james","User"
"scott","User"
"john","User"
"apple","User"