I am running a Jupyter notebook on a cluster configured as follows:
*12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)
Worker type: i3.xlarge, 30.5 GB memory, 4 cores
Min 2 and max 8 workers*
```python
from pandas import DataFrame

cursor = conn.cursor()
cursor.execute(
    """
    SELECT * FROM (
        SELECT *, random() AS sample
        FROM tbl
        WHERE type = 'type1'
          AND created_at >= '2022-01-01 00:00:00'
          AND created_at <= '2022-06-30 00:00:00') AS samp
    WHERE sample < .05; -- return 5% of rows
    """
)
# To Pandas DataFrame
df = DataFrame(cursor.fetchall())
# Get column names
field_names = [i[0] for i in cursor.description]
df.columns = field_names
```
```
Fatal error: The Python kernel is unresponsive.
---------------------------------------------------------------------------
The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.
The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
---------------------------------------------------------------------------
Last messages on stderr:
ks/python/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1112, in structured_traceback
    return FormattedTB.structured_traceback(
  File "/databricks/python/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1006, in structured_traceback
  File "/databricks/python/lib/python3.9/site-packages/stack_data/core.py", line 649, in included_pieces
    pos = scope_pieces.index(self.executing_piece)
  File "/databricks/python/lib/python3.9/site-packages/stack_data/utils.py", line 145, in cached_property_wrapper
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/databricks/python/lib/python3.9/site-packages/executing/executing.py", line 164, in only
    raise NotOneValueFound('Expected one value, found 0')
executing.executing.NotOneValueFound: Expected one value, found 0
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/redshift_connector/core.py", line 1631, in execute
    ps = cache["ps"][key]
KeyError: ('SELECT * \nFROM \n SELECT * \n FROM tbl\n WHERE event_type = %s\n AND created_at >= %s\n AND created_at <= %s \n \n LIMIT 5', ((<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>), (<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>), (<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>)))
```
The data is fairly large, roughly 300-400M rows. What configuration do I need to change? Or should I optimize the query, use parallel processing, etc.?
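One thing I have been considering: `fetchall()` materializes every row on the driver at once, which may be what triggers the exit code 137 OOM. A minimal sketch of chunked fetching with `fetchmany()` is below; it uses `sqlite3` as a stand-in database so it runs anywhere, but the same DB-API cursor pattern should apply to the `redshift_connector` cursor. `CHUNK_ROWS` is an illustrative name I picked, not a library setting.

```python
import sqlite3
import pandas as pd

# Stand-in connection; in the real notebook this would be the
# redshift_connector connection from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(i, "type1") for i in range(10_000)],
)

cursor = conn.cursor()
cursor.execute("SELECT id, type FROM tbl WHERE type = 'type1'")

CHUNK_ROWS = 1_000  # tune to the driver's available memory
columns = [d[0] for d in cursor.description]

chunks = []
while True:
    rows = cursor.fetchmany(CHUNK_ROWS)
    if not rows:
        break
    # Ideally aggregate/filter each chunk here and keep only the result;
    # appending every raw chunk, as below, eventually uses as much memory
    # as fetchall() would.
    chunks.append(pd.DataFrame(rows, columns=columns))

df = pd.concat(chunks, ignore_index=True)
```

The point of the loop is that each chunk can be reduced (aggregated, filtered, written to disk) before the next one is fetched, so peak memory is bounded by one chunk plus the running result rather than the full 300-400M rows.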