HTML Tag Reader
If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-readers-file
!pip install llama-index
Download HTML file
Section titled “Download HTML file”%%bashwget -e robots=off --no-clobber --page-requisites \ --html-extension --convert-links --restrict-file-names=windows \ --domains docs.ray.io --no-parent --accept=html \ -P data/ https://docs.ray.io/en/master/ray-overview/installation.html
Both --no-clobber and --convert-links were specified, only --convert-links will be used.--2023-09-07 16:36:36-- https://docs.ray.io/en/master/ray-overview/installation.htmlResolving docs.ray.io (docs.ray.io)... 104.18.1.163, 104.18.0.163Connecting to docs.ray.io (docs.ray.io)|104.18.1.163|:443... connected.HTTP request sent, awaiting response... 200 OKLength: unspecified [text/html]Saving to: ‘data/docs.ray.io/en/master/ray-overview/installation.html’
0K .......... .......... .......... .......... .......... 125M 50K .......... .......... .......... .......... .......... 21.4M 100K .......... .......... .......... ........ 1.01M=0.04s
2023-09-07 16:36:37 (3.37 MB/s) - ‘data/docs.ray.io/en/master/ray-overview/installation.html’ saved [142067]
FINISHED --2023-09-07 16:36:37--Total wall clock time: 0.3sDownloaded: 1 files, 139K in 0.04s (3.37 MB/s)Converting links in data/docs.ray.io/en/master/ray-overview/installation.html... 116.48-68Converted links in 1 files in 0.002 seconds.
from llama_index.readers.file import HTMLTagReader
reader = HTMLTagReader(tag="section", ignore_no_id=True)docs = reader.load_data( "data/docs.ray.io/en/master/ray-overview/installation.html")
for doc in docs: print(doc.metadata)
{'tag': 'section', 'tag_id': 'installing-ray', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'official-releases', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'from-wheels', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'daily-releases-nightlies', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'installing-from-a-specific-commit', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'install-ray-java-with-maven', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'install-ray-c', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'm1-mac-apple-silicon-support', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'windows-support', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'installing-ray-on-arch-linux', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'installing-from-conda-forge', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'building-ray-from-source', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'docker-source-images', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'launch-ray-in-docker', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'test-if-the-installation-succeeded', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}{'tag': 'section', 'tag_id': 'installed-python-dependencies', 'file_path': 'data/docs.ray.io/en/master/ray-overview/installation.html'}