Why does my web scraping script run on my laptop but fail on my PC with a "nonnumeric port: 'port'" error?


I have a Python script that scrapes data from CoinGecko. It runs fine on my laptop, but throws an error when I run it on my PC. Here is the script:

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import gzip
import brotli
import io
import time
import traceback

# Function to get the page content with custom headers
def get_page_content(url):
    req = urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    })
    response = urllib.request.urlopen(req)
   
    # Handle different content encodings
    if response.info().get('Content-Encoding') == 'gzip':
        buf = io.BytesIO(response.read())
        data = gzip.GzipFile(fileobj=buf).read()
    elif response.info().get('Content-Encoding') == 'br':
        data = brotli.decompress(response.read())
    else:
        data = response.read()
   
    return data

# Function to extract table data from a given page URL
def extract_table_data(page_url):
    try:
        webpage = get_page_content(page_url)
        soup = BeautifulSoup(webpage, 'html.parser')
        div_element = soup.find('div', class_='tw-mb-6 lg:tw-mb-12')
        if div_element:
            html_table = div_element.find('table')
            if html_table:
                df = pd.read_html(io.StringIO(str(html_table)))[0]  # file-like input; newer pandas deprecates literal HTML strings
                df = df.loc[:, df.columns[1:-1]]  # Adjust the columns as per your requirement
                return df
            else:
                print(f"No table found in the specified div for URL: {page_url}")
        else:
            print(f"Specified div element not found for URL: {page_url}")
    except Exception as e:
        print(f"An error occurred for URL {page_url}: {str(e)}")
        traceback.print_exc()  # Print the full traceback
    return None

# Base URL
base_url = 'https://www.coingecko.com/en/coins/1/markets/spot?page='

# DataFrame to collect all data
all_data = pd.DataFrame()

# Start page
page = 1
max_retries = 3
retry_delay = 5
max_consecutive_errors = 5
consecutive_errors = 0

while True:
    url = base_url + str(page)
    print(f"Processing {url}")
    retries = 0
    while retries < max_retries:
        try:
            df = extract_table_data(url)
            if df is not None:
                all_data = pd.concat([all_data, df], ignore_index=True)
                consecutive_errors = 0  # Reset consecutive errors counter
                break  # Successfully retrieved data, break out of the retry loop
            else:
                print(f"No data found on page {page}, stopping.")
                consecutive_errors += 1
                break
        except urllib.error.HTTPError as e:
            if e.code == 404:
                print(f"HTTP Error 404 on page {page}. Stopping.")
                consecutive_errors += 1
                break
            else:
                print(f"HTTP Error on page {page}: {e.code}. Retrying...")
                retries += 1
                time.sleep(retry_delay)
        except Exception as e:
            print(f"An error occurred on page {page}: {str(e)}. Retrying...")
            traceback.print_exc()  # Print the full traceback
            retries += 1
            time.sleep(retry_delay)
    
    if consecutive_errors >= max_consecutive_errors:
        print(f"Stopping due to {max_consecutive_errors} consecutive errors.")
        break

    page += 1

# Save the complete DataFrame to CSV in the specified path
save_path = r'C:\Users\hamid\Downloads\Crypto_Data_Table.csv'
all_data.to_csv(save_path, index=False)
print(f"All data saved to '{save_path}'")

I am not using a proxy, and the same script runs fine on my laptop. What is causing this problem on my PC, and how can I fix it? Thanks in advance.

Here is the traceback:

runfile('C:/Users/hamid/OneDrive/Documents/untitled1.py', wdir='C:/Users/hamid/OneDrive/Documents')
Processing https://www.coingecko.com/en/coins/1/markets/spot?page=1
An error occurred for URL https://www.coingecko.com/en/coins/1/markets/spot?page=1: nonnumeric port: 'port'
No data found on page 1, stopping.
Processing https://www.coingecko.com/en/coins/1/markets/spot?page=2
An error occurred for URL https://www.coingecko.com/en/coins/1/markets/spot?page=2: nonnumeric port: 'port'
No data found on page 2, stopping.
Processing https://www.coingecko.com/en/coins/1/markets/spot?page=3
An error occurred for URL https://www.coingecko.com/en/coins/1/markets/spot?page=3: nonnumeric port: 'port'
No data found on page 3, stopping.
Processing https://www.coingecko.com/en/coins/1/markets/spot?page=4
An error occurred for URL https://www.coingecko.com/en/coins/1/markets/spot?page=4: nonnumeric port: 'port'
No data found on page 4, stopping.
Processing https://www.coingecko.com/en/coins/1/markets/spot?page=5
An error occurred for URL https://www.coingecko.com/en/coins/1/markets/spot?page=5: nonnumeric port: 'port'
No data found on page 5, stopping.
Stopping due to 5 consecutive errors.
All data saved to 'C:\Users\hamid\Downloads\Crypto_Data_Table.csv'

runfile('C:/Users/hamid/OneDrive/Documents/untitled1.py', wdir='C:/Users/hamid/OneDrive/Documents')
Processing https://www.coingecko.com/en/coins/1/markets/spot?page=1
An error occurred for URL https://www.coingecko.com/en/coins/1/markets/spot?page=1: nonnumeric port: 'port'
No data found on page 1, stopping.
All data saved to 'C:\Users\hamid\Downloads\Crypto_Data_Table.csv'

Traceback (most recent call last):
  File "C:\Users\hamid\anaconda3\Lib\http\client.py", line 890, in _get_hostport
    port = int(host[i+1:])
           ^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'port'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\hamid\onedrive\documents\untitled1.py", line 36, in extract_table_data
    webpage = get_page_content(page_url)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\hamid\onedrive\documents\untitled1.py", line 20, in get_page_content
    response = urllib.request.urlopen(req)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\hamid\anaconda3\Lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\hamid\anaconda3\Lib\urllib\request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\hamid\anaconda3\Lib\urllib\request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\hamid\anaconda3\Lib\urllib\request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "C:\Users\hamid\anaconda3\Lib\urllib\request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\hamid\anaconda3\Lib\urllib\request.py", line 1317, in do_open
    h = http_class(host, timeout=req.timeout, **http_conn_args)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\hamid\anaconda3\Lib\http\client.py", line 1413, in __init__
    super(HTTPSConnection, self).__init__(host, port, timeout,
  File "C:\Users\hamid\anaconda3\Lib\http\client.py", line 852, in __init__
    (self.host, self.port) = self._get_hostport(host, port)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\hamid\anaconda3\Lib\http\client.py", line 895, in _get_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
http.client.InvalidURL: nonnumeric port: 'port'
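
The last frame of the traceback shows `http.client` failing to convert the text after the final colon of the host string into a port number. The sketch below reproduces the same message with a deliberately malformed host; `proxy.example:port` is a made-up placeholder, not a value taken from the post. Because the CoinGecko URL itself contains no port, one possibility worth ruling out (an assumption, not something the post confirms) is a proxy entry picked up automatically by `urllib`, so the sketch also prints what `urllib.request.getproxies()` detects and builds an opener that ignores any detected proxy. The helper names `port_error_for` and `direct_opener` are hypothetical.

```python
import http.client
import urllib.request


def port_error_for(host):
    """Return the InvalidURL message http.client raises for `host`, or None."""
    try:
        # The constructor only parses host and port; it does not connect.
        http.client.HTTPConnection(host)
    except http.client.InvalidURL as exc:
        return str(exc)
    return None


def direct_opener():
    """Build an opener that skips auto-detected proxies (empty ProxyHandler)."""
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


# A host whose "port" is a word triggers the error seen in the traceback.
print(port_error_for("proxy.example:port"))  # nonnumeric port: 'port'

# Proxy settings urllib detects from the environment or the operating system.
print(urllib.request.getproxies())
```

If the printed proxy mapping on the failing machine contains an entry ending in a non-numeric port, calling `direct_opener().open(req)` instead of `urllib.request.urlopen(req)` would be one way to test whether that entry is involved.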
python web-scraping beautifulsoup
1 Answer

I solved this by force-reinstalling these packages:

pip install --force-reinstall urllib3 pandas beautifulsoup4

It looks like something was wrong with the original package installation, and that caused the error.
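
Since the same script behaves differently on two machines, comparing the installed versions on each one can help confirm whether the environments really differ. The following is a minimal sketch under that assumption; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the reinstall command above.

```python
import sys
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print the interpreter version and the versions of the reinstalled packages.
print(sys.version)
for name, version in installed_versions(["urllib3", "pandas", "beautifulsoup4"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this on both the laptop and the PC and comparing the output shows at a glance which packages differ.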
