Output defects for few-shot prompts

The table below lists output defects of several LLMs analyzing daily meeting transcripts.

The table columns are LLMs. Gemini 1.5 Pro, GPT-4, and Claude 3 Opus are abbreviated as GP, G4, and CO, respectively; G4o is GPT-4o. GPT-4 and Claude 3 Opus are used with a system prompt. Gemini receives the prompt as the first user message, because the prompt exceeds the size limit of the AI Studio system prompt.
“Input” means the following set of input data: a) a transcript, b) the user’s comments on it, c) criteria chosen by the user (can be omitted to reuse the previous criteria).
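A cell value is the defect count for one test of the LLM on the given input: +1 for each defect and, per the last column's header, -1 for each super output (shown in square brackets). Rows without an input number give each LLM's total for that prompt version. /Defects in the initial message (if any) are not counted/.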

| Prompt version | Input | GP | G4o | G4 | CO | Defects: +1 for each one [Super outputs: -1 for each one] |
| --- | --- | --- | --- | --- | --- | --- |
| 1. Brief prompt | | **GP 2** | **G4o 1** | **G4 2** | **CO 6** | |
| | #1 | 1 | 0 | 1 | 2 | GP: Advice for not all low ratings.<br>G4o: Advice for not all low ratings. [Very actionable advice]<br>G4: Too high ratings.<br>CO: /Requests for transcript and criteria at once./ Names are only in 1 explanation. Advice for not all low ratings. |
| | #2 | 0 | 0 | 0 | 2 | GP: Advice is not structured by the criteria. [However, advice is very useful, with examples]<br>G4o: -<br>G4: -<br>CO: No list for advice. No actions |
| | #3 | 1 | 0 | 1 | 2 | GP: Advice is not structured by the criteria.<br>G4o: -<br>G4: Extra advice for high ratings.<br>CO: No advice. No actions |
| 3. Structured long prompt | | **GP 3** | **G4o 2** | **G4 4** | **CO 2** | |
| | #1 | 1 | 0 | 2 | 1 | GP: No names. Advice for not all low ratings. [Very actionable advice]<br>G4o: -<br>G4: No names. Too high ratings.<br>CO: Non-zero rating for the 1st criterion. |
| | #2 | 2 | 1 | 1 | 0 | GP: Too low ratings. No names. Advice for not all low ratings. [Very actionable advice]<br>G4o: Extra advice for 1 high rating.<br>G4: Extra advice for 1 high rating.<br>CO: Too low ratings. [Great explanations with many examples] |
| | #3 | 0 | 1 | 1 | 1 | GP: -<br>G4o: Extra advice for high ratings.<br>G4: Summary “who said what”, instead of explanations on most criteria<br>CO: Extra advice for high ratings |
| 5. 1-shot prompt | | **GP 0** | **G4o 0** | **G4 2** | **CO 3** | |
| | #1 | 0 | 0 | 2 | 1 | GP: /Requests for transcript and criteria at once./ Too high ratings [Great advice following the example]<br>G4o: [Great advice following the example]<br>G4: Fabricated duration. Names are only in 2 explanations.<br>CO: /No request for criteria./ Advice for not all low ratings. |
| | #2 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: No proper question at the end. |
| | #3 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: No proper question at the end. |
| 6. 2-shot prompt | | **GP 1** | **G4o 0** | **G4 1** | **CO 4** | |
| | #1 | 1 | 0 | 1 | 1 | GP: Advice for not all low ratings.<br>G4o: -<br>G4: /Requests for transcript and criteria at once./ Fabricated duration.<br>CO: /No request for criteria./ Advice for not all low ratings. |
| | #2 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: Summary “who said what”, instead of explanations on most criteria |
| | #3 | 0 | 0 | 0 | 2 | GP: -<br>G4o: -<br>G4: -<br>CO: Summary “who said what”, instead of explanations on ALL criteria: mentioning names by the user made it forget the original task. |
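
The scoring rule from the last column's header (+1 per defect, -1 per super output, initial-message defects ignored) can be sketched in Python. This is only an illustration, not the procedure actually used to fill the table: the function name `score` is hypothetical, and treating each sentence as exactly one defect is an assumption that the hand-assigned cell values do not always follow.

```python
import re

def score(annotation: str) -> int:
    """Net defect score for one LLM's annotation in one table row (illustrative only)."""
    # Super outputs are written in square brackets: -1 each.
    supers = re.findall(r"\[[^\]]*\]", annotation)
    rest = re.sub(r"\[[^\]]*\]", " ", annotation)
    # Initial-message defects are written between slashes: not counted.
    rest = re.sub(r"/[^/]*/", " ", rest)
    # Assumption: every remaining sentence is one defect (+1); a lone "-" means "none".
    parts = (part.strip() for part in rest.split("."))
    defects = [p for p in parts if p and p != "-"]
    return len(defects) - len(supers)

# GP, structured long prompt, input #2: three defects and one super output.
print(score("Too low ratings. No names. Advice for not all low ratings. "
            "[Very actionable advice]"))  # 2
```

The sketch does not clamp the result at zero, so an annotation containing only a super output yields -1.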