The table below lists output defects for several LLMs analyzing daily meeting transcripts.
| Prompt version | Input | GP | G4o | G4 | CO | Defects (+1 each)<br>[Super outputs (-1 each)] |
| --- | --- | --- | --- | --- | --- | --- |
| 1. Brief prompt | | 2 | 1 | 2 | 6 | |
| | #1 | 1 | 0 | 1 | 2 | GP: Advice not given for all low ratings.<br>G4o: Advice not given for all low ratings. [Very actionable advice]<br>G4: Ratings too high.<br>CO: /Requests for transcript and criteria at once./ Names appear in only 1 explanation. Advice not given for all low ratings. |
| | #2 | 0 | 0 | 0 | 2 | GP: Advice is not structured by the criteria. [However, the advice is very useful, with examples]<br>G4o: -<br>G4: -<br>CO: No list for advice. No actions. |
| | #3 | 1 | 0 | 1 | 2 | GP: Advice is not structured by the criteria.<br>G4o: -<br>G4: Extra advice for high ratings.<br>CO: No advice. No actions. |
| 3. Structured long prompt | | **GP 3** | **G4o 2** | **G4 4** | **CO 2** | |
| | #1 | 1 | 0 | 2 | 1 | GP: No names. Advice not given for all low ratings. [Very actionable advice]<br>G4o: -<br>G4: No names. Ratings too high.<br>CO: Non-zero rating for the 1st criterion. |
| | #2 | 2 | 1 | 1 | 0 | GP: Ratings too low. No names. Advice not given for all low ratings. [Very actionable advice]<br>G4o: Extra advice for 1 high rating.<br>G4: Extra advice for 1 high rating.<br>CO: Ratings too low. [Great explanations with many examples] |
| | #3 | 0 | 1 | 1 | 1 | GP: -<br>G4o: Extra advice for high ratings.<br>G4: A “who said what” summary instead of explanations for most criteria.<br>CO: Extra advice for high ratings. |
| 5. 1-shot prompt | | **GP 0** | **G4o 0** | **G4 2** | **CO 3** | |
| | #1 | 0 | 0 | 2 | 1 | GP: /Requests for transcript and criteria at once./ Ratings too high. [Great advice following the example]<br>G4o: [Great advice following the example]<br>G4: Fabricated duration. Names appear in only 2 explanations.<br>CO: /No request for criteria./ Advice not given for all low ratings. |
| | #2 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: No proper question at the end. |
| | #3 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: No proper question at the end. |
| 6. 2-shot prompt | | **GP 1** | **G4o 0** | **G4 1** | **CO 4** | |
| | #1 | 1 | 0 | 1 | 1 | GP: Advice not given for all low ratings.<br>G4o: -<br>G4: /Requests for transcript and criteria at once./ Fabricated duration.<br>CO: /No request for criteria./ Advice not given for all low ratings. |
| | #2 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: A “who said what” summary instead of explanations for most criteria. |
| | #3 | 0 | 0 | 0 | 2 | GP: -<br>G4o: -<br>G4: -<br>CO: A “who said what” summary instead of explanations for ALL criteria: the user's mention of names made it forget the original task. |
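
The scoring rule in the last column header (+1 per defect, -1 per super output) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `net_score` and `totals` are hypothetical, and the sketch assumes that each bold per-prompt total is the sum of the three per-transcript scores beneath it, which is what the structured-prompt rows suggest. The numbers are copied from the "3. Structured long prompt" rows.

```python
from collections import defaultdict


def net_score(defects, super_outputs):
    """Net score for one output: +1 per defect, -1 per super output."""
    return len(defects) - len(super_outputs)


def totals(per_transcript):
    """Sum per-transcript net scores into one total per model (assumed rule)."""
    out = defaultdict(int)
    for row in per_transcript.values():
        for model, score in row.items():
            out[model] += score
    return dict(out)


# Per-transcript scores copied from the "3. Structured long prompt" rows.
structured_prompt = {
    "#1": {"GP": 1, "G4o": 0, "G4": 2, "CO": 1},
    "#2": {"GP": 2, "G4o": 1, "G4": 1, "CO": 0},
    "#3": {"GP": 0, "G4o": 1, "G4": 1, "CO": 1},
}

# GP on transcript #1: two defects and one super output.
gp_1 = net_score(
    defects=["No names", "Advice not given for all low ratings"],
    super_outputs=["Very actionable advice"],
)

structured_totals = totals(structured_prompt)
print(gp_1, structured_totals)
```

Under this rule, notes written between slashes are treated as uncounted remarks, so they are left out of the `defects` list; that reading is inferred from the counts in the table, not stated in it.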