Evaluating#

There are a few different ways you can evaluate the products generated with janus-llm:

  • Evaluate a set of requirements against INCOSE standards using an LLM.

  • Evaluate the reading level of generated text, such as summaries or requirements.

  • Compare two texts with BLEU, chrF, ROUGE, and similarity scores.

  • Evaluate the cyclomatic complexity, maintainability, Halstead difficulty, effort, and volume of source code.

Some of these evaluation metrics require both a target and a reference (e.g., BLEU, chrF, ROUGE, and similarity scores). Others require only a target (e.g., cyclomatic complexity and Flesch readability).
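
For instance, here is a sketch of the two invocation styles, assuming flesch and chrf accept the same -o, -t, -r, and -S options demonstrated for flesch-grade and bleu later on this page (the strings are purely illustrative):

janus evaluate flesch -o flesch-output.json -t "This sentence is the target text." -S

janus evaluate chrf -o chrf-output.json -t "This is the target text" -r "This is the reference text" -S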

The output of janus evaluate -h is shown below:

> janus evaluate -h
                                                                                                         
 Usage: janus evaluate [OPTIONS] COMMAND [ARGS]...                                                       
                                                                                                         
 Evaluation of generated source code or documentation                                                    
                                                                                                         
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help  -h        Show this message and exit.                                                         │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────╮
│ bleu                         BLEU score using sacrebleu                                               │
│ chrf                         chrF score using sacrebleu                                               │
│ cyclomatic-complexity        Cyclomatic complexity score                                              │
│ difficulty                   Halstead difficulty score                                                │
│ effort                       Halstead effort score                                                    │
│ flesch                       The Flesch Readability score                                             │
│ flesch-grade                 The Flesch Grade Level Readability score                                 │
│ gunning-fog                  The Gunning-Fog Readability score                                        │
│ gunning-fog-grade            The Gunning-Fog Grade Level Readability score                            │
│ maintainability              Maintainability score                                                    │
│ rouge                        ROUGE score                                                              │
│ similarity-score             Distance between embeddings of strings.                                  │
│ volume                       Halstead volume score                                                    │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────╯

There is also an LLM self-evaluation command, janus llm-self-eval. Its help output has been trimmed for brevity:

❯ janus llm-self-eval -h
                                                                                                         
 Usage: janus llm-self-eval [OPTIONS]                                                                    
                                                                                                         
 Use an LLM to evaluate its own performance.                                                             
                                                                                                         
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────╮
│    --evaluation-type         -e                 [incose|comments]          Type of output to          │
│                                                                            evaluate.                  │
│                                                                            [default: incose]          │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────╯

Evaluation with an LLM#

Adding a Model#

Before you can evaluate with an LLM, you need to add an LLM to your configuration. You can do this by running the following command:

janus llm add my-gpt --type OpenAI

Then follow the CLI prompts to finish adding the model to your configuration.

Evaluating Requirements#

First, generate requirements with janus:

janus document --doc-mode requirements --input janus/cli/ --output janus-docs --llm my-gpt --language python -r RequirementsFormatRefiner

You can then use an LLM to evaluate those requirements against the INCOSE standard:

janus llm-self-eval --input janus-docs --output janus-evals --llm my-gpt --language python -e incose

Evaluation without a Reference#

Flesch Grade Level#

Adding -S to an evaluate command treats the target (and reference, if given) as literal strings instead of paths to files whose contents should be read.

janus evaluate flesch-grade -o test-output.json -t "This is an example of the reading grade level" -S
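
To score the contents of a file instead, omit -S and pass a file path as the target (the path below is purely illustrative):

janus evaluate flesch-grade -o test-output.json -t path/to/requirements.txt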

Evaluation with a Reference#

BLEU#

janus evaluate bleu -o test-output.json -t "This is an example of the BLEU metric" -r "This is a test of the BLEU metric" -S
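
The other reference-based metrics follow the same pattern. For example, assuming rouge takes the same options as bleu, a ROUGE score on the same pair of strings would look like:

janus evaluate rouge -o test-output.json -t "This is an example of the ROUGE metric" -r "This is a test of the ROUGE metric" -S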