Extract URLs from multiple web pages

Question · Votes: 0 · Answers: 1

I want to extract URLs from several domains and save the unique values to a single txt file. The URLs come in different formats; some are prefixed with http, https, or 127.0.0.1. I only want the URL itself, with the prefix removed, in particular the "127.0.0.1" prefix. I tried the PowerShell script below, but it gives me no output. Please help me fix it.

$threatFeedUrls = @(
    "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate%20versions%20Anti-Malware%20List/AntiMalwareHosts.txt",
    "https://osint.digitalside.it/Threat-Intel/lists/latestdomains.txt"
)

# Initialize an array to store all extracted URLs
$allUrls = @()

# Loop through the lists of URLs
foreach ($url in $threatFeedUrls) {

    # Download the threat feed data
    $threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl

    # Define a regular expression pattern to match URLs starting with '127.0.0.1'
    $pattern = '127\.0\.0\.1 ([^\s]+)'

    # Use the regular expression to find matches in the threat feed data
    $matchList = [regex]::Matches($threatFeedData, $pattern)

    # Create and populate the list with matched URLs
    $urlList = foreach ($match in $matchList) {
        $match.Groups[1].Value
    }

    # Specify the output file path
    $outputFilePath = 'output250.txt'

    # Save the URLs to the output file
    $urlList | Out-File -FilePath $outputFilePath

    Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
}

I wrote this PowerShell script to extract all the URLs, but the output is not what I expected. I want to extract every URL from the listed domains, remove the duplicates, and save them to a single txt file.

powershell security
1 Answer

0 votes

You can try this:

# Define the URLs to get
$threatFeedUrls = @(
    "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate%20versions%20Anti-Malware%20List/AntiMalwareHosts.txt",
    "https://osint.digitalside.it/Threat-Intel/lists/latestdomains.txt"
)
# Get all the raw files
$Result = $threatFeedUrls | ForEach-Object { Invoke-RestMethod -Uri $_ -UseBasicParsing }

# Filter out comments and empty lines
$OnlyInterestingLines = $Result -split "`n" | where {$_ -notmatch "^(#|\s|$)" }

# Remove 127.0.0.1 at the beginning of lines followed by any amount of whitespace, sort it and return only unique addresses
$urlList = $OnlyInterestingLines -replace "^127\.0\.0\.1\s*" | Sort-Object -Unique

# Specify the output file path
$outputFilePath = 'output250.txt'

# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath

The result is 21,780 lines of hostnames.
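As a quick sanity check (a sketch assuming the script above has already been run and `output250.txt` exists in the current directory), you can count the deduplicated entries and confirm that no `127.0.0.1` prefixes survived the `-replace`:

```powershell
# Read back the file written by the script above
$lines = Get-Content -Path 'output250.txt'
"Total unique entries: $($lines.Count)"

# Any line still starting with 127.0.0.1 means the -replace pattern missed it
$leftover = $lines | Where-Object { $_ -match '^127\.0\.0\.1' }
"Lines still carrying the 127.0.0.1 prefix: $($leftover.Count)"
```

If the second count is non-zero, the feed likely uses a separator other than plain spaces (e.g. tabs), which the `\s*` in the replace pattern is meant to absorb.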
