janus.language.naive.chunk_splitter

Classes

ChunkSplitter

Splits source text into fixed-size chunks without parsing it.

Module Contents

class janus.language.naive.chunk_splitter.ChunkSplitter(language, model=None, max_tokens=4096, skip_merge=False, protected_node_types=(), prune_node_types=(), prune_unprotected=False)

Bases: janus.language.splitter.Splitter

Splits source text into fixed-size chunks without parsing it.

Parameters:
  • language (str) – The name of the language to split.

  • model (janus.llm.models_info.JanusModel | None) – The model to use for counting tokens. If None, tiktoken’s default tokenizer is used to count tokens.

  • max_tokens (int) – The maximum number of tokens to use for each functional block.

  • skip_merge (bool) – Whether to skip merging child nodes up to the max_tokens length. May be used for situations like documentation, where function-level documentation is preferred. TODO: Maybe instead support something like a list of node types that shouldn’t be merged (e.g. functions, classes)?

  • prune_unprotected (bool) – Whether to prune unprotected nodes from the tree.

  • protected_node_types (tuple[str, ...]) –

  • prune_node_types (tuple[str, ...]) –
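
Example (illustrative sketch, not the Janus implementation): the following shows one way that fixed-size splitting without parsing can work in principle. The names split_fixed_chunks and count_tokens are hypothetical and not part of janus. The whitespace-based token counter is a stand-in for a real tokenizer such as tiktoken, and splitting on line boundaries is an assumption; the actual ChunkSplitter may choose chunk boundaries differently.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def split_fixed_chunks(
    source: str,
    max_tokens: int = 4096,
    token_counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily pack whole lines into chunks of at most max_tokens tokens.

    The source is never parsed; only line boundaries and token counts are used.
    A single line that alone exceeds max_tokens is emitted as its own
    oversized chunk rather than being cut mid-line.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for line in source.splitlines(keepends=True):
        line_tokens = token_counter(line)
        # Close the current chunk if adding this line would exceed the budget.
        if current and current_tokens + line_tokens > max_tokens:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens

    # Flush whatever remains as the final chunk.
    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    text = "a b c\nd e\nf g h i\nj\n"
    for i, chunk in enumerate(split_fixed_chunks(text, max_tokens=5)):
        print(i, repr(chunk))
```

Because the chunks are built only from consecutive lines, concatenating them reproduces the original text exactly. That is the main trade-off of this kind of splitting: it is simple and language-agnostic, but chunk boundaries may fall in the middle of a function or class.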