Skip to content
LlamaIndex Framework
Component Guides
Loading
Documents And Nodes

Metadata Extraction Usage Pattern

You can use LLMs to automate metadata extraction with our Metadata Extractor modules.

Our metadata extractor modules include the following “feature extractors”:

  • SummaryExtractor - automatically extracts a summary over a set of Nodes
  • QuestionsAnsweredExtractor - extracts a set of questions that each Node can answer
  • TitleExtractor - extracts a title over the context of each Node
  • EntityExtractor - extracts entities (i.e. names of places, people, things) mentioned in the content of each Node

Then you can chain the Metadata Extractors with our node parser:

from llama_index.core.extractors import (
TitleExtractor,
QuestionsAnsweredExtractor,
)
from llama_index.core.node_parser import TokenTextSplitter
text_splitter = TokenTextSplitter(
separator=" ", chunk_size=512, chunk_overlap=128
)
title_extractor = TitleExtractor(nodes=5)
qa_extractor = QuestionsAnsweredExtractor(questions=3)
# assume documents are defined -> extract nodes
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(
transformations=[text_splitter, title_extractor, qa_extractor]
)
nodes = pipeline.run(
documents=documents,
in_place=True,
show_progress=True,
)

or insert into an index:

from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(
documents, transformations=[text_splitter, title_extractor, qa_extractor]
)