The table below lists output flaws of several LLMs analyzing daily meeting transcripts, by prompt version and input.
| Prompt version | Input | GP | G4o | G4 | CO | Defects: +1 for each one<br>[Super outputs: -1 for each one] |
| --- | --- | --- | --- | --- | --- | --- |
| 1. Brief prompt |  | 2 | 1 | 2 | 6 |  |
|  | #1 | 1 | 0 | 1 | 2 | GP: Advice missing for some low ratings.<br>G4o: Advice missing for some low ratings. [Very actionable advice]<br>G4: Ratings too high.<br>CO: /Requests for transcript and criteria at once./ Names appear in only 1 explanation. Advice missing for some low ratings. |
|  | #2 | 0 | 0 | 0 | 2 | GP: Advice is not structured by the criteria. [However, advice is very useful, with examples]<br>G4o: -<br>G4: -<br>CO: No list for advice. No actions. |
|  | #3 | 1 | 0 | 1 | 2 | GP: Advice is not structured by the criteria.<br>G4o: -<br>G4: Extra advice for high ratings.<br>CO: No advice. No actions. |
| 3. Structured long prompt |  | **GP: 3** | **G4o: 2** | **G4: 4** | **CO: 2** |  |
|  | #1 | 1 | 0 | 2 | 1 | GP: No names. Advice missing for some low ratings. [Very actionable advice]<br>G4o: -<br>G4: No names. Ratings too high.<br>CO: Non-zero rating for the 1st criterion. |
|  | #2 | 2 | 1 | 1 | 0 | GP: Ratings too low. No names. Advice missing for some low ratings. [Very actionable advice]<br>G4o: Extra advice for 1 high rating.<br>G4: Extra advice for 1 high rating.<br>CO: Ratings too low. [Great explanations with many examples] |
|  | #3 | 0 | 1 | 1 | 1 | GP: -<br>G4o: Extra advice for high ratings.<br>G4: Summary of “who said what” instead of explanations for most criteria.<br>CO: Extra advice for high ratings. |
| 5. 1-shot prompt |  | **GP: 0** | **G4o: 0** | **G4: 2** | **CO: 3** |  |
|  | #1 | 0 | 0 | 2 | 1 | GP: /Requests for transcript and criteria at once./ Ratings too high. [Great advice following the example]<br>G4o: [Great advice following the example]<br>G4: Fabricated duration. Names appear in only 2 explanations.<br>CO: /No request for criteria./ Advice missing for some low ratings. |
|  | #2 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: No proper question at the end. |
|  | #3 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: No proper question at the end. |
| 6. 2-shot prompt |  | **GP: 1** | **G4o: 0** | **G4: 1** | **CO: 4** |  |
|  | #1 | 1 | 0 | 1 | 1 | GP: Advice missing for some low ratings.<br>G4o: -<br>G4: /Requests for transcript and criteria at once./ Fabricated duration.<br>CO: /No request for criteria./ Advice missing for some low ratings. |
|  | #2 | 0 | 0 | 0 | 1 | GP: -<br>G4o: -<br>G4: -<br>CO: Summary of “who said what” instead of explanations for most criteria. |
|  | #3 | 0 | 0 | 0 | 2 | GP: -<br>G4o: -<br>G4: -<br>CO: Summary of “who said what” instead of explanations for ALL criteria: the user mentioning names made it forget the original task. |
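
The prompt-level totals in the table look like column sums of the three per-input scores. The table does not state this explicitly, so treat it as an assumption. A minimal Python sketch that recomputes the sums from the numbers above and lists any total that differs; the names `PER_INPUT`, `REPORTED`, `column_totals`, and `mismatches` are hypothetical and exist only for this illustration.

```python
# Scores copied from the table. Each tuple is (GP, G4o, G4, CO).
MODELS = ("GP", "G4o", "G4", "CO")

# Per-input scores for inputs #1, #2, #3 under each prompt version.
PER_INPUT = {
    "1. Brief prompt": [(1, 0, 1, 2), (0, 0, 0, 2), (1, 0, 1, 2)],
    "3. Structured long prompt": [(1, 0, 2, 1), (2, 1, 1, 0), (0, 1, 1, 1)],
    "5. 1-shot prompt": [(0, 0, 2, 1), (0, 0, 0, 1), (0, 0, 0, 1)],
    "6. 2-shot prompt": [(1, 0, 1, 1), (0, 0, 0, 1), (0, 0, 0, 2)],
}

# Totals as listed in each prompt-version row of the table.
REPORTED = {
    "1. Brief prompt": (2, 1, 2, 6),
    "3. Structured long prompt": (3, 2, 4, 2),
    "5. 1-shot prompt": (0, 0, 2, 3),
    "6. 2-shot prompt": (1, 0, 1, 4),
}


def column_totals(rows):
    """Sum the per-input scores column by column (one column per model)."""
    return tuple(sum(column) for column in zip(*rows))


def mismatches(per_input, reported):
    """Return (prompt, model, computed_sum, listed_total) where the two differ.

    Assumes a listed total is meant to equal the sum of the per-input scores.
    """
    found = []
    for prompt, rows in per_input.items():
        computed = column_totals(rows)
        for model, got, listed in zip(MODELS, computed, reported[prompt]):
            if got != listed:
                found.append((prompt, model, got, listed))
    return found


if __name__ == "__main__":
    for prompt, rows in PER_INPUT.items():
        print(prompt, dict(zip(MODELS, column_totals(rows))))
    print("mismatches:", mismatches(PER_INPUT, REPORTED))
```

Under that assumption, the listed totals match the per-input sums everywhere except G4o under the brief prompt, where the per-input scores sum to 0 and the listed total is 1.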