
Parallel Processing SimpleDirectoryReader

In this notebook, we demonstrate how to use parallel processing when loading data with SimpleDirectoryReader. Parallel processing can be useful with heavier workloads, i.e., loading from a directory consisting of many files. (NOTE: if you are using Windows, you may see smaller gains when using parallel processing to load data. This has to do with differences in how multiprocessing works on Linux/macOS versus Windows.)
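If you do run the parallel loader on Windows from a standalone script, the usual multiprocessing precaution of guarding the entry point applies. The following is a minimal sketch (not part of the original notebook, and not needed inside a Jupyter session); it assumes the data directory that we set up below:

from llama_index.core import SimpleDirectoryReader

if __name__ == "__main__":
    # guard the entry point so spawned worker processes do not
    # re-execute this module-level code when they import the script
    reader = SimpleDirectoryReader(input_dir="./data/source_files")
    documents = reader.load_data(num_workers=10)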

import cProfile, pstats
from pstats import SortKey

In this demo, we’ll use the PatronusAIFinanceBenchDataset llama-dataset from LlamaHub. This dataset is based on a set of 32 PDF files, which are included in the download from LlamaHub.

!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data
from llama_index.core import SimpleDirectoryReader
# define our reader with the directory containing the 32 pdf files
reader = SimpleDirectoryReader(input_dir="./data/source_files")
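
SimpleDirectoryReader also accepts optional arguments that can help when pointing it at larger or mixed-content directories. The configuration below is illustrative only and not required for this demo, since the defaults above are sufficient:

# illustrative configuration: limit the reader to PDFs and
# walk any nested sub-directories under the input path
reader = SimpleDirectoryReader(
    input_dir="./data/source_files",
    required_exts=[".pdf"],
    recursive=True,
)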

Sequential loading is the default behaviour and can be executed via the load_data() method.

documents = reader.load_data()
len(documents)
4306
cProfile.run("reader.load_data()", "oldstats")
p = pstats.Stats("oldstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)
Wed Jan 10 12:40:50 2024    oldstats

         1857432165 function calls (1853977584 primitive calls) in 391.159 seconds

   Ordered by: cumulative time
   List reduced from 292 to 15 due to restriction <15>

              ncalls   tottime  percall   cumtime  percall  filename:lineno(function)
                   1     0.000    0.000   391.159  391.159  {built-in method builtins.exec}
                   1     0.003    0.003   391.158  391.158  <string>:1(<module>)
                   1     0.000    0.000   391.156  391.156  base.py:367(load_data)
                  32     0.000    0.000   391.153   12.224  base.py:256(load_file)
                  32     0.127    0.004   391.149   12.223  docs_reader.py:24(load_data)
                4306     1.285    0.000   387.685    0.090  _page.py:2195(extract_text)
           4444/4306     5.984    0.001   386.399    0.090  _page.py:1861(_extract_text)
                4444     0.006    0.000   270.543    0.061  _data_structures.py:1220(operations)
                4444    43.270    0.010   270.536    0.061  _data_structures.py:1084(_parse_content_stream)
   36489963/33454574    32.688    0.000   167.817    0.000  _data_structures.py:1248(read_object)
            23470599    19.764    0.000   100.843    0.000  _page.py:1944(process_operation)
            48258569    37.205    0.000    75.145    0.000  _utils.py:200(read_until_regex)
            25208954    11.215    0.000    64.272    0.000  _base.py:481(read_from_stream)
            18016574    23.488    0.000    49.305    0.000  __init__.py:88(crlf_space_check)
             8642699    20.779    0.000    48.224    0.000  _utils.py:14(read_hex_string_from_stream)

<pstats.Stats at 0x16bb3d300>
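
Note that cProfile adds some overhead of its own, so the absolute timings above are somewhat inflated. If you only want a wall-clock number, a simpler measurement sketch (not part of the original notebook, and timings will vary by machine) is:

import time

# time the sequential load without profiler overhead
start = time.perf_counter()
documents = reader.load_data()
print(f"sequential load took {time.perf_counter() - start:.1f}s")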

To load using parallel processes, we set num_workers to a positive integer value.

documents = reader.load_data(num_workers=10)
len(documents)
4306
cProfile.run("reader.load_data(num_workers=10)", "newstats")
p = pstats.Stats("newstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)
Wed Jan 10 13:05:13 2024    newstats

         12539 function calls in 31.319 seconds

   Ordered by: cumulative time
   List reduced from 212 to 15 due to restriction <15>

   ncalls   tottime  percall   cumtime  percall  filename:lineno(function)
        1     0.000    0.000    31.319   31.319  {built-in method builtins.exec}
        1     0.003    0.003    31.319   31.319  <string>:1(<module>)
        1     0.000    0.000    31.316   31.316  base.py:367(load_data)
       24     0.000    0.000    31.139    1.297  threading.py:589(wait)
       23     0.000    0.000    31.139    1.354  threading.py:288(wait)
      155    31.138    0.201    31.138    0.201  {method 'acquire' of '_thread.lock' objects}
        1     0.000    0.000    31.133   31.133  pool.py:369(starmap)
        1     0.000    0.000    31.133   31.133  pool.py:767(get)
        1     0.000    0.000    31.133   31.133  pool.py:764(wait)
        1     0.000    0.000     0.155    0.155  context.py:115(Pool)
        1     0.000    0.000     0.155    0.155  pool.py:183(__init__)
        1     0.000    0.000     0.153    0.153  pool.py:305(_repopulate_pool)
        1     0.001    0.001     0.153    0.153  pool.py:314(_repopulate_pool_static)
       10     0.001    0.000     0.152    0.015  process.py:110(start)
       10     0.001    0.000     0.150    0.015  context.py:285(_Popen)

<pstats.Stats at 0x29408ab30>
391 / 30
13.033333333333333

As the results above show, parallel processing yields a roughly 13x speed-up (about a 1200% increase) when loading from a directory with many files: the sequential load took about 391 seconds, while the parallel load with 10 workers finished in about 31 seconds. (The parallel profile also records far fewer function calls because cProfile only traces the main process, which spends most of its time waiting on the worker pool; the PDF parsing itself happens in the worker processes.)
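
In practice, you may want to tie num_workers to the number of available CPU cores rather than hard-coding it. The sketch below is an assumption on our part, not part of the original benchmark, and simply caps the worker count at the machine's core count:

import os

# use at most one worker per CPU core
num_workers = min(10, os.cpu_count() or 1)
documents = reader.load_data(num_workers=num_workers)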