Metadata-Version: 2.2
Name: jusText
Version: 3.0.2
Summary: Heuristic based boilerplate removal tool
Home-page: https://github.com/miso-belica/jusText
Author: Jan Pomikálek
Author-email: jan.pomikalek@gmail.com
Maintainer: Michal Belica
Maintainer-email: miso.belica@gmail.com
License: The BSD 2-Clause License
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Pre-processors
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Text Processing :: Markup :: HTML
License-File: LICENSE.rst
Requires-Dist: lxml[html_clean]>=4.4.2
Requires-Dist: backports.functools-lru-cache; python_version < "3.2"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: license
Dynamic: maintainer
Dynamic: maintainer-email
Dynamic: requires-dist
Dynamic: summary

.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/

jusText
=======
.. image:: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml/badge.svg
   :target: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml

jusText is a tool for removing boilerplate content, such as navigation links,
headers, and footers, from HTML pages. It is `designed `_ to preserve mainly text
containing full sentences, which makes it well suited for creating linguistic
resources such as Web corpora. You can `try it online `_.

This is a fork of the original (currently unmaintained) code of jusText_ hosted
on Google Code.

Adaptations of the algorithm to other languages:

- `C++ `_
- `Go `_
- `Java `_

Some libraries using jusText:

- `chirp `_
- `lazynlp `_
- `off-topic-memento-toolkit `_
- `pears `_
- `readability calculator `_
- `sky `_

Some currently (Jan 2020) maintained alternatives:

- `dragnet `_
- `html2text `_
- `inscriptis `_
- `newspaper `_
- `python-readability `_
- `trafilatura `_


Installation
------------
Make sure you have Python_ 2.7+/3.5+ and `pip `_ (`Windows `_, `Linux `_)
installed. Then run:

.. code-block:: bash

   $ [sudo] pip install justext


Dependencies
------------
::

   lxml (version depends on your Python version)


Usage
-----
.. code-block:: bash

   $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
   $ python -m justext -s English -o plain_text.txt english_page.html
   $ python -m justext --help # for more info


Python API
----------
.. code-block:: python

   import requests
   import justext

   # Download the page and classify its paragraphs using the English stoplist.
   response = requests.get("http://planet.python.org/")
   paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

   # Print only the paragraphs that were not classified as boilerplate.
   for paragraph in paragraphs:
       if not paragraph.is_boilerplate:
           print(paragraph.text)


Testing
-------
Run tests via

.. code-block:: bash

   $ py.test-2.7 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8 && py.test-3.9


Acknowledgements
----------------
.. _`Natural Language Processing Centre`: http://nlp.fi.muni.cz/en/nlpc
.. _`Masaryk University in Brno`: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _`Lexical Computing Ltd.`: http://lexicalcomputing.com/
.. _`PhD research`: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

This software has been developed at the `Natural Language Processing Centre`_ of
`Masaryk University in Brno`_ with financial support from PRESEMT_ and
`Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.


.. :changelog:

Changelog for jusText
=====================

3.0.2 (2025-02-25)
------------------
- *BUG FIX:* Handle urllib imports in Python 2 and 3 correctly `#51 `_.

3.0.1 (2024-05-09)
------------------
- *BUG FIX:* Fix issue with new version of lxml `#48 `_.

3.0.0 (2021-10-21)
------------------
- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.4 and below.
- *BUG FIX:* Don't join words separated only by ``<br>`` tag.
- *BUG FIX:* List available stop-lists alphabetically.

2.2.0 (2016-03-06)
------------------
- *INCOMPATIBLE CHANGE:* Stop words are case insensitive.
- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.2.
- *BUG FIX:* Preserve new lines from original text in paragraphs.

2.1.1 (2014-05-27)
------------------
- *BUG FIX:* Function ``decode_html`` now respects parameter ``errors`` when falling back to ``default_encoding`` `#9 `_.

2.1.0 (2014-01-25)
------------------
- *FEATURE:* Added XPath selector to the paragraphs. XPath selector is also available in detailed output as ``xpath`` attribute of ``<p>`` tag `#5 `_.

2.0.0 (2013-08-26)
------------------
- *FEATURE:* Added pluggable DOM preprocessor.
- *FEATURE:* Added support for Python 3.2+.
- *INCOMPATIBLE CHANGE:* Paragraphs are instances of ``justext.paragraph.Paragraph``.
- *INCOMPATIBLE CHANGE:* Script 'justext' removed in favour of command ``python -m justext``.
- *FEATURE:* It's possible to enter a URI as the input document in the CLI.
- *FEATURE:* It is possible to pass a unicode string directly.

1.2.0 (2011-08-08)
------------------
- *FEATURE:* Character counts used instead of word counts where possible in
  order to make the algorithm work well in the language independent mode
  (without a stoplist) for languages where counting words is not easy
  (Japanese, Chinese, Thai, etc).
- *BUG FIX:* More robust parsing of meta tags containing the information about
  used charset.
- *BUG FIX:* Corrected decoding of HTML entities € to Ÿ

1.1.0 (2011-03-09)
------------------
- First public release.