
Parallel Processing with SimpleDirectoryReader

In this notebook, we demonstrate how to use parallel processing when loading data with SimpleDirectoryReader. Parallel processing can be useful for heavier workloads, e.g., loading from a directory that contains many files. (NOTE: if using Windows, you may see smaller gains from parallel processing when loading data. This is due to differences in how multiprocessing works on Linux/macOS versus Windows; e.g., see here or here.)
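One practical consequence of the Windows difference: Windows (and macOS on recent Python versions) starts worker processes with the `spawn` method, which re-imports the main module, so parallel-loading code run as a script should be guarded with `if __name__ == "__main__":`. A minimal stdlib-only sketch of the pattern (`load_one` is a hypothetical stand-in for per-file loading work, not a LlamaIndex API):

```python
import multiprocessing as mp


def load_one(path):
    # stand-in for per-file loading work (hypothetical helper)
    return path.upper()


if __name__ == "__main__":
    paths = ["a.pdf", "b.pdf", "c.pdf"]
    # the guard above ensures spawned workers re-importing this module
    # do not re-execute the pool creation
    with mp.Pool(processes=2) as pool:
        results = pool.map(load_one, paths)
    print(results)  # ['A.PDF', 'B.PDF', 'C.PDF']
```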

import cProfile, pstats
from pstats import SortKey

In this demo, we’ll use the PatronusAIFinanceBenchDataset llama-dataset from LlamaHub. This dataset is based on a set of 32 PDF files, which are included in the download from LlamaHub.

!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data
from llama_index.core import SimpleDirectoryReader
# define our reader with the directory containing the 32 pdf files
reader = SimpleDirectoryReader(input_dir="./data/source_files")

Sequential loading is the default behaviour and can be executed via the load_data() method.

documents = reader.load_data()
len(documents)
4306
cProfile.run("reader.load_data()", "oldstats")
p = pstats.Stats("oldstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)
Wed Jan 10 12:40:50 2024 oldstats
1857432165 function calls (1853977584 primitive calls) in 391.159 seconds
Ordered by: cumulative time
List reduced from 292 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 391.159 391.159 {built-in method builtins.exec}
1 0.003 0.003 391.158 391.158 <string>:1(<module>)
1 0.000 0.000 391.156 391.156 base.py:367(load_data)
32 0.000 0.000 391.153 12.224 base.py:256(load_file)
32 0.127 0.004 391.149 12.223 docs_reader.py:24(load_data)
4306 1.285 0.000 387.685 0.090 _page.py:2195(extract_text)
4444/4306 5.984 0.001 386.399 0.090 _page.py:1861(_extract_text)
4444 0.006 0.000 270.543 0.061 _data_structures.py:1220(operations)
4444 43.270 0.010 270.536 0.061 _data_structures.py:1084(_parse_content_stream)
36489963/33454574 32.688 0.000 167.817 0.000 _data_structures.py:1248(read_object)
23470599 19.764 0.000 100.843 0.000 _page.py:1944(process_operation)
48258569 37.205 0.000 75.145 0.000 _utils.py:200(read_until_regex)
25208954 11.215 0.000 64.272 0.000 _base.py:481(read_from_stream)
18016574 23.488 0.000 49.305 0.000 __init__.py:88(crlf_space_check)
8642699 20.779 0.000 48.224 0.000 _utils.py:14(read_hex_string_from_stream)
<pstats.Stats at 0x16bb3d300>

To load using parallel processes, we set num_workers to a positive integer value.

documents = reader.load_data(num_workers=10)
len(documents)
4306
cProfile.run("reader.load_data(num_workers=10)", "newstats")
p = pstats.Stats("newstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)
Wed Jan 10 13:05:13 2024 newstats
12539 function calls in 31.319 seconds
Ordered by: cumulative time
List reduced from 212 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 31.319 31.319 {built-in method builtins.exec}
1 0.003 0.003 31.319 31.319 <string>:1(<module>)
1 0.000 0.000 31.316 31.316 base.py:367(load_data)
24 0.000 0.000 31.139 1.297 threading.py:589(wait)
23 0.000 0.000 31.139 1.354 threading.py:288(wait)
155 31.138 0.201 31.138 0.201 {method 'acquire' of '_thread.lock' objects}
1 0.000 0.000 31.133 31.133 pool.py:369(starmap)
1 0.000 0.000 31.133 31.133 pool.py:767(get)
1 0.000 0.000 31.133 31.133 pool.py:764(wait)
1 0.000 0.000 0.155 0.155 context.py:115(Pool)
1 0.000 0.000 0.155 0.155 pool.py:183(__init__)
1 0.000 0.000 0.153 0.153 pool.py:305(_repopulate_pool)
1 0.001 0.001 0.153 0.153 pool.py:314(_repopulate_pool_static)
10 0.001 0.000 0.152 0.015 process.py:110(start)
10 0.001 0.000 0.150 0.015 context.py:285(_Popen)
<pstats.Stats at 0x29408ab30>
391 / 30  # sequential seconds / parallel seconds (rounded)
13.033333333333333

As the profiles above show, parallel processing yields roughly a 13x speedup (i.e., a ~1200% speed increase) when loading from a directory containing many files.
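The best value for num_workers depends on the machine; a common starting point (an assumption here, not something SimpleDirectoryReader prescribes) is the number of available CPU cores, queried from the standard library:

```python
import os

# use the CPU count as a starting point for num_workers;
# os.cpu_count() can return None, so fall back to 1
num_workers = os.cpu_count() or 1
print(num_workers)
```

Since PDF parsing is CPU-bound, pushing num_workers far beyond the core count typically adds process-management overhead without further gains.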