janus.language.naive.chunk_splitter
Classes
ChunkSplitter | Splits into fixed chunk sizes without parsing
Module Contents
- class janus.language.naive.chunk_splitter.ChunkSplitter(language, model=None, max_tokens=4096, skip_merge=False, protected_node_types=(), prune_node_types=(), prune_unprotected=False)
Bases: janus.language.splitter.Splitter
Splits into fixed chunk sizes without parsing
- Parameters:
language (str) – The name of the language to split.
model (janus.llm.models_info.JanusModel | None) – The name of the model to use for counting tokens. If None, tiktoken’s default tokenizer is used to count tokens.
max_tokens (int) – The maximum number of tokens to use for each functional block.
skip_merge (bool) – Whether to skip merging child nodes up to the max_tokens length. May be used for situations like documentation, where function-level documentation is preferred. TODO: Maybe instead support something like a list of node types that shouldn't be merged (e.g. functions, classes)?
prune_unprotected (bool) – Whether to prune unprotected nodes from the tree.
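To make "fixed chunk sizes without parsing" concrete, here is a minimal, self-contained sketch of the general idea. It is not the janus implementation: split_fixed_chunks and whitespace_token_count are hypothetical names introduced for illustration, the whitespace word count is an assumed stand-in for a real tokenizer such as tiktoken, and packing whole lines is one possible strategy among several.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_fixed_chunks(
    source: str,
    max_tokens: int = 4096,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedily pack whole lines into chunks of at most max_tokens tokens.

    No syntax tree is built; the text is treated as a flat sequence of lines.
    A single line that exceeds max_tokens on its own becomes its own
    oversized chunk, because this sketch never splits inside a line.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    chunks: List[str] = []
    current: List[str] = []  # lines accumulated for the chunk being built
    current_tokens = 0

    for line in source.splitlines(keepends=True):
        line_tokens = count_tokens(line)
        # Close the current chunk if adding this line would exceed the budget.
        if current and current_tokens + line_tokens > max_tokens:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens

    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    text = "a b c\nd e\nf g h i\nj\n"
    for i, chunk in enumerate(split_fixed_chunks(text, max_tokens=5)):
        print(i, repr(chunk))
```

Because lines are kept intact and concatenated in order, joining the returned chunks reproduces the input exactly. With a real subword tokenizer, summing per-line counts only approximates the token count of the joined chunk, so a production splitter would likely re-count each assembled chunk.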