Repository evaluations - ox/CommaQA

Evaluations/llm as a judge

one-shot-results

qa_train.parquet

Type: text → text

Model: OpenAI/GPT-4o

Provider: OpenAI

Target field: is_correct

Prompt

are these lists equivalent? answer with one word "true" or "false" all lowercase

List 1: {answer_spans}
List 2: {prediction}
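
As a rough illustration of how a judge prompt like this might be applied per row, the sketch below fills the two placeholders from a row and maps the model's one-word reply onto the `is_correct` target field. It assumes the dataset columns are named `answer_spans` and `prediction`, matching the placeholders. The helper names `build_prompt` and `parse_verdict` are hypothetical and not part of any platform API, and the call to the model itself is deliberately left out.

```python
from typing import Optional

# Prompt text copied from the evaluation above; the placeholders are assumed
# to correspond to columns of the same name in each row.
PROMPT_TEMPLATE = (
    'are these lists equivalent? answer with one word "true" or "false" all lowercase\n'
    "\n"
    "List 1: {answer_spans}\n"
    "List 2: {prediction}"
)


def build_prompt(row: dict) -> str:
    """Fill the judge prompt with one row's reference and predicted lists."""
    return PROMPT_TEMPLATE.format(
        answer_spans=row["answer_spans"],
        prediction=row["prediction"],
    )


def parse_verdict(reply: str) -> Optional[bool]:
    """Map the judge's reply to a boolean for is_correct.

    Returns None when the reply is neither "true" nor "false", so that
    malformed answers can be inspected rather than silently counted.
    """
    word = reply.strip().lower()
    if word == "true":
        return True
    if word == "false":
        return False
    return None


# Hypothetical row, for illustration only.
example_row = {"answer_spans": '["a", "b"]', "prediction": '["b", "a"]'}
example_prompt = build_prompt(example_row)
```

Returning `None` for unexpected replies, rather than defaulting to `False`, keeps formatting failures by the judge separate from genuine mismatches.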

Queued: Oct 17, 2024, 4:43 PM UTC

Completed: Oct 17, 2024, 4:43 PM UTC

5-row sample: 5 rows processed, 243 tokens used

Sample results completed

7 columns, 1-5 of 300 rows