Rename HTML files using their <title> tags

Question

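I'm fairly new to programming. I have a folder, with subfolders, containing several thousand HTML files with generic names such as 1006.htm and 1007.htm, and I would like to rename them using the <title> tag from within each file.

For example, if the file 1006.htm contains <title>Page Title</title>, I would like to rename it to Page Title.htm. Ideally, spaces would be replaced with dashes.

I've been working in the shell with a bash script, with no luck. How do I do this with either bash or Python?

This is what I have so far..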

#!/usr/bin/env bash
FILES=/Users/Ben/unzipped/*
for f in $FILES
do
   if [ ${FILES: -4} == ".htm" ]
      then
    awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' $FILES
   fi
done

I have also tried

#!/usr/bin/env bash
for f in *.html;
   do
   title=$( grep -oP '(?<=<title>).*(?=<\/title>)' "$f" )
   mv -i "$f" "${title//[^a-zA-Z0-9\._\- ]}".html   
done

but I get an error from the terminal explaining how to use grep...

Answer 1

Use awk instead of grep in your bash script and it should work:

#!/bin/bash   
for f in *.html;
   do
   title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' "$f" )
   mv -i "$f" "${title//[^a-zA-Z0-9\._\- ]}".html   
done

Don't forget to change the first line to match your bash environment ;)

Edit: the complete answer with all the modifications

#!/bin/bash
for f in $(find . -type f -name '*.html')
   do
   title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' "$f" )
   mv -i "$f" "${title//[ ]/-}".html
done

Answer 2

Here is a Python script I just wrote:

import os
import re

from lxml import etree


class MyClass(object):
    def __init__(self, dirname=''):
        self.dirname   = dirname
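        # Bytes pattern, because the files are read in binary mode.
        self.exp_title = rb"<title>(.*?)</title>"
        self.re_title  = re.compile(self.exp_title, re.IGNORECASE | re.DOTALL)

    def rename(self):
        for afile in os.listdir(self.dirname):
            # Build the full path first; a bare name would be checked
            # against the current working directory.
            originfile = os.path.join(self.dirname, afile)
            if os.path.isfile(originfile):
                with open(originfile, 'rb') as fp:
                    contents = fp.read()
                try:
                    html  = etree.HTML(contents)
                    title = html.xpath("//title")[0].text
                except Exception:
                    # Fall back to a regex if lxml cannot find a title.
                    try:
                        title = self.re_title.findall(contents)[0].decode('utf-8', 'replace')
                    except Exception:
                        title = ''

                title = (title or '').strip()
                if title:
                    # Keep the original extension on the renamed file.
                    ext     = os.path.splitext(afile)[1]
                    newfile = os.path.join(self.dirname, title + ext)
                    os.rename(originfile, newfile)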


>>> test = MyClass('/path/to/your/dir')
>>> test.rename()

Answer 3

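You want to use an HTML parser (such as lxml.html) to parse the HTML files. Once you have that, retrieving the title tag is one line (probably page.find(".//title").text_content()).

Turning that into a file name and renaming the document should be trivial.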
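If lxml is not installed, the same parse-then-convert idea can be sketched with the standard library's html.parser instead. This swaps in a different parser than the one suggested above, and the names TitleParser, extract_title and title_to_filename are made up for illustration. It is a rough sketch that assumes the title should keep its case and only have its spaces replaced with dashes, as the question asks.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> matches too.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the page title with whitespace collapsed, or '' if none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def title_to_filename(title, suffix=".htm"):
    """Replace runs of whitespace with dashes and append the suffix."""
    return "-".join(title.split()) + suffix


sample = "<html><head><TITLE>Page Title</TITLE></head><body>Hi</body></html>"
print(title_to_filename(extract_title(sample)))
```

The actual rename would then be an os.rename (or Path.rename) from the old path to the generated name, ideally after checking that extract_title did not return an empty string.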

Answer 4

A Python 3 recursive-glob version that does some title cleanup before renaming.

import re
from pathlib import Path
import lxml.html


root = Path('.')
for path in root.rglob("*.html"):

    # str() because older lxml versions may not accept Path objects
    soup = lxml.html.parse(str(path))
    title_els = soup.xpath('/html/head/title')

    if len(title_els):
        title = title_els[0].text

        if title:
            print(f'Original title {title}')
            name = re.sub(r'[^\w\s-]', '', title.lower())
            name = re.sub(r'[\s]+', '-', name)
            new_path = (path.parent/name).with_suffix(path.suffix)

            if not Path(new_path).exists():
                print(f'Renaming [{path.absolute()}] to [{new_path}]')
                path.rename(new_path)
            else:
                print(f'{new_path.name} already exists!')
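To see what the cleanup step does on its own, the two re.sub calls from the script above can be pulled into a small function and tried on a few titles. The function name clean_title and the sample titles are only for illustration.

```python
import re


def clean_title(title):
    """Mirror the two substitutions used in the script above."""
    # Drop everything except word characters, whitespace and hyphens.
    name = re.sub(r'[^\w\s-]', '', title.lower())
    # Collapse each run of whitespace into a single dash.
    return re.sub(r'[\s]+', '-', name)


for sample in ["Page Title", "What's New?  (2020)", "  Page Title "]:
    print(repr(sample), '->', repr(clean_title(sample)))
```

Note that leading or trailing whitespace in a title turns into a leading or trailing dash, so calling title.strip() before the cleanup may be worth considering.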
