I wrote a general-purpose news page extractor with close to 100% accuracy

青南是天才 · 2019-09-10 02:55

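So far I have tested it on Toutiao, NetEase News, GamerSky, Guancha, ifeng.com, Tencent News, ReadHub, and Sina News. The extraction results are excellent, with accuracy close to 100% on these sites. In principle, it should be able to extract content automatically from all kinds of news sites.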

https://github.com/kingname/GeneralNewsExtractor

If you would like to try GNE, follow the steps below:

Install GNE


# Pick either one of the two options below

# Install with pip
pip install --upgrade git+https://github.com/kingname/GeneralNewsExtractor.git

# Install with pipenv
pipenv install git+https://github.com/kingname/GeneralNewsExtractor.git#egg=gne

Use GNE

>>> from gne import GeneralNewsExtractor

>>> html = '''HTML source of the page after it has been rendered'''

>>> extractor = GeneralNewsExtractor()
>>> result = extractor.extract(html)
>>> print(result)

{"title": "xxxx", "publish_time": "2019-09-10 11:12:13", "author": "yyy", "content": "zzzz"}
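
Since the extractor expects HTML that has already been rendered, one simple way to try it is to save a page from the browser and feed the file in. The sketch below is only an illustration under a few assumptions: the helper name `extract_from_file` and the file name `news.html` are made up for this example, and the saved file is assumed to be UTF-8 encoded. The only GNE calls used are the ones shown above (`GeneralNewsExtractor()` and `extract(html)`).

```python
from pathlib import Path


def extract_from_file(path, extractor):
    """Read saved, already-rendered HTML from disk and run the extractor on it.

    `extractor` is any object with an `extract(html)` method, such as the
    GeneralNewsExtractor instance created in the example above.
    """
    # Assumption: the file was saved as UTF-8; adjust the encoding if the site differs.
    html = Path(path).read_text(encoding="utf-8")
    return extractor.extract(html)


def main():
    # Imported here so the helper above can be reused without GNE installed.
    from gne import GeneralNewsExtractor

    # "news.html" is a hypothetical file saved from the browser after rendering finished.
    result = extract_from_file("news.html", GeneralNewsExtractor())

    # These are the four fields shown in the sample output above.
    for key in ("title", "publish_time", "author", "content"):
        print(key, "=>", result.get(key))
```

Calling `main()` after saving a page as `news.html` in the working directory should print the four fields one per line.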

xzhao · 2019-09-10 03:25

A genius indeed.

benren · 2019-09-14 03:05

Impressive, really impressive. Gave it a star.

chengguyun · 2019-09-15 00:00

So good! Already starred!