janus.language.naive.chunk_splitter

Classes

ChunkSplitter

Splits input into fixed-size chunks without parsing.

Module Contents

class janus.language.naive.chunk_splitter.ChunkSplitter(language, model=None, max_tokens=4096, skip_merge=False, protected_node_types=(), prune_node_types=(), prune_unprotected=False)

Bases: janus.language.splitter.Splitter

Splits input into fixed-size chunks without parsing.

Parameters:

language (str) – The name of the language to split.
model (janus.llm.models_info.JanusModel | None) – The model to use for counting tokens. If None, tiktoken’s default tokenizer is used to count tokens.
max_tokens (int) – The maximum number of tokens to use for each functional block.
skip_merge (bool) – Whether to skip merging child nodes up to the max_tokens length. May be useful in situations such as documentation, where function-level documentation is preferred. TODO: Maybe instead support something like a list of node types that shouldn’t be merged (e.g. functions, classes)?
prune_unprotected (bool) – Whether to prune unprotected nodes from the tree.
protected_node_types (tuple[str, Ellipsis]) –
prune_node_types (tuple[str, Ellipsis]) –
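
The idea of splitting into fixed-size chunks without parsing can be illustrated with a small standalone sketch. This is not the Janus implementation: the function name `split_fixed_chunks`, the line-by-line greedy packing, and the whitespace-based token counter are all assumptions made for illustration (the class above counts tokens with the given model or with tiktoken). Only `max_tokens` and its default of 4096 are taken from the signature above.

```python
from typing import Callable, List


def split_fixed_chunks(
    text: str,
    max_tokens: int = 4096,
    # Hypothetical stand-in tokenizer: counts whitespace-separated words.
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole lines into chunks of at most max_tokens tokens.

    No parsing is performed: the text is treated as a flat sequence of
    lines. A single line that exceeds max_tokens on its own is emitted as
    an oversized chunk rather than being cut mid-line.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for line in text.splitlines(keepends=True):
        n = count_tokens(line)
        # Close the current chunk if this line would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n

    # Flush whatever remains as the final chunk.
    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    sample = "a b\nc d\ne f\n"
    for i, chunk in enumerate(split_fixed_chunks(sample, max_tokens=4)):
        print(i, repr(chunk))
```

Because lines are kept intact and concatenated in order, joining the chunks reproduces the original text exactly. A real tokenizer can be swapped in through the `count_tokens` argument.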