Importing an SPSS dataset into Python

Question

Is there any way to import an SPSS dataset into Python, preferably in NumPy recarray format? I have looked around but could not find an answer.

俊

Answer 1

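SPSS has extensive integration with Python, but that integration is meant to be used alongside SPSS itself (now known as IBM SPSS Statistics). There is also an SPSS ODBC driver, which can be used with Python's ODBC support to read sav files.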

Answer 2

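Option 1: As rkbarney pointed out, the Python savReaderWriter package is available via PyPI. I ran into two issues:

It relies on a lot of extra libraries beyond the seemingly pure-Python implementation. In nearly every case, SPSS files are read and written by the SPSS I/O modules provided by IBM. These modules differ by platform, and in my experience "pip install savReaderWriter" does not get them working out of the box (on OS X).
Development on savReaderWriter, while not dead, is less up to date than one might hope. This compounds the first issue. It relies on some deprecated packages for speed-ups and emits warnings whenever you import savReaderWriter if they are not available. That is not a big problem today, but it could cause trouble in the future as IBM continues to update the SPSS I/O modules to handle new SPSS formats (they are already at version 21 or 22, if memory serves).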

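Option 2: I chose to use R as a middleman. Using rpy2, I set up a simple function that reads the file into an R data frame and writes it back out as a CSV file, which I then import into Python. It's a bit Rube Goldberg, but it works. Of course, this requires R, which may also be a hassle to install in your environment (and has different binaries for different platforms).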
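The round trip through R described in Option 2 could look roughly like the sketch below. It is not the rpy2 function from the answer: it swaps in a plain `Rscript` subprocess call, and it assumes that `Rscript` is on the PATH and that R's `foreign` package is installed. The function names (`rscript_command`, `load_csv`, `sav_to_rows`) are made up for illustration.

```python
import csv
import subprocess

# R code executed by Rscript. read.spss comes from R's "foreign" package;
# the two file paths arrive as trailing command-line arguments.
R_EXPR = (
    "args <- commandArgs(trailingOnly = TRUE); "
    "df <- foreign::read.spss(args[1], to.data.frame = TRUE); "
    "write.csv(df, args[2], row.names = FALSE)"
)


def rscript_command(sav_path, csv_path):
    """Build the Rscript invocation (assumes Rscript is on PATH)."""
    return ["Rscript", "-e", R_EXPR, sav_path, csv_path]


def load_csv(csv_path):
    """Read the exported CSV into a list of dicts (all values are strings)."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def sav_to_rows(sav_path, csv_path):
    """Convert via R, then load the CSV; raises CalledProcessError if R fails."""
    subprocess.run(rscript_command(sav_path, csv_path), check=True)
    return load_csv(csv_path)
```

If pandas is available, `pandas.read_csv(csv_path).to_records(index=False)` would turn the same CSV into a NumPy record array, which is closer to what the question asks for.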

Answer 3

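gretl claims to import SPSS files and export them in many formats, as does the R statistical suite. I have never dealt with SPSS data, so I cannot speak to their relative merits.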

Answer 4

You could have Python make an external call to spssread, a Perl script that outputs the contents of an SPSS file in the way you want.
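An external call of this kind might be wired up as in the sketch below. The exact command line and output format of spssread are not given here, so the sketch assumes the script writes tab-delimited text to stdout and uses a stand-in command in place of the real Perl invocation; `run_and_parse` is a hypothetical helper name.

```python
import csv
import io
import subprocess
import sys


def run_and_parse(command, delimiter="\t"):
    """Run an external command and parse its delimited stdout into rows."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return list(csv.reader(io.StringIO(result.stdout), delimiter=delimiter))


# Stand-in command so the sketch runs anywhere. A real call would name the
# Perl interpreter, the script, and the sav file instead (all hypothetical
# here, since the script's arguments are not documented in this thread).
demo_command = [sys.executable, "-c", "print('id\\tage'); print('1\\t34')"]
rows = run_and_parse(demo_command)
```

With `check=True`, a non-zero exit status from the external script raises `subprocess.CalledProcessError` instead of silently returning partial output.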

Answer 5

Perhaps this will help: Python reader + writer for SPSS sav files (Linux, Mac & Windows) http://code.activestate.com/recipes/577811-python-reader-writer-for-spss-sav-files-linux-mac-/

Answer 6

To be clear, the SPSS ODBC driver does not require SPSS to be installed.

Answer 7

Maybe this will help someone:

http://sourceforge.net/search/?q=python+SPSS

Good luck!

迈克尔

Answer 8

I'm having trouble figuring out how to use SPSS in Google Colab, given that I have this code:

import pandas as pd
import numpy as np
from pandas import DataFrame
from sklearn.feature_extraction.text import TfidfVectorizer

# This import may fail on recent scikit-learn releases; the name is
# reassigned just below anyway.
from sklearn.feature_extraction import stop_words

import nltk
from nltk.corpus import stopwords
import javalang

# from parser import Parser
# from tokenizer import Tokenizer
stop_words = [';', '@', '(', ')', '{', '}', '*', ',', '/']


class Parser:
    __slots__ = ['name']  # faster attribute access

    def __init__(self, name):  # Constructor
        self.name = name

    def pre_processing(self):  # pre-processing function
        AST = None
        with open(self.name, 'r') as fh:
            src = fh.read()

        # Source parsing
        try:
            AST = javalang.parse.parse(src)  # This will return AST
            for path, node in AST:  # Index, Element
                # Compares a string with a node object, so this is always true
                if 'ReferenceType' != node:
                    AST.remove(node)
                print(node, "\n")
                # print(path, "\n")
        except:
            # Any parsing or filtering error is silently ignored
            pass

        vectorizer = TfidfVectorizer(stop_words='english')  # Create the vectorizer/transform

        vectorizer.fit([str(AST)])  # Learns vocab " CompilationUnit, Imports, path, static, true, util, io "

        print('---------------------------check 2----------------------------------')
        print(vectorizer.vocabulary_)
        print("STOPPPPING WORDS", vectorizer.get_stop_words())
        vector = vectorizer.transform([str(AST)])  # transform document to matrix
        print(vector)
        print('---------------------check 3-------------------------------------------------------------')
        a = np.array(vector.toarray())
        print(a)
        print('---------------------check 4-------------------------------------------------------------')
        df = DataFrame(a)
        print(df)
        # print("Features")
        # print(vectorizer.get_feature_names())
        df.to_csv('featuresExtraction.csv', mode='a', header=False, index=False)


if __name__ == '__main__':
    parser = Parser('')  # create an object of class Parser

    filesNames = pd.read_csv('defectprediction.csv')  # Read all files names
    path = filesNames.iloc[:, 1]  # select column 1, all rows
    filesNames = filesNames.iloc[:, 0]  # select column 0, all rows
    filesNames = np.array(filesNames)  # Convert to numpy array
    path = np.array(path)  # Convert to numpy array
    foundsrc = 0
    notfound = 0
    fileNum = 1
    for i in range(path.shape[0]):
        fileNum += 1
        # print("CURRENT\n")
        # print(filesNames[i], " File number: ", fileNum)
        try:
            fh = open(path[i] + ".java", 'r')  # only checks that the file exists
            fh.close()
            foundsrc += 1  # Increment found counter
            parser = Parser(path[i] + ".java")
            parser.pre_processing()

        except FileNotFoundError:
            notfound += 1  # Increment not found counter
            data = {'Name': [filesNames[i]],
                    'Path': [path[i]],
                    'Status': ['NotFound']}
            df = DataFrame(data)  # add them to data frame
            df.to_csv('NotFoundDefectPrediction.csv', mode='a', index=False, header=False)  # Write missing files in csv

    print("\n\nfound\t\t", foundsrc)
    print("notfound\t", notfound)
