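Here are the output flaws of several LLMs analyzing daily meeting transcripts. Each defect adds 1 to a model's score for an input and each superior output subtracts 1, so a lower score is better.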

| Prompt version | Input | GP | G4o | G4 | CO | Defects: +1 each [Superior outputs: -1 each] |
| --- | --- | --- | --- | --- | --- | --- |
| 1. Brief prompt | | **2** | **1** | **2** | **6** | |
| | #1 | 1 | 0 | 1 | 2 | GP: Advice missing for some low ratings. G4o: Advice missing for some low ratings. [Very actionable advice] G4: Ratings too high. CO: /Requests the transcript and criteria at once./ Names appear in only 1 explanation. Advice missing for some low ratings. |
| | #2 | 0 | 0 | 0 | 2 | GP: Advice is not structured by the criteria. [However, the advice is very useful, with examples] G4o: - G4: - CO: Advice not in a list. No actions. |
| | #3 | 1 | 0 | 1 | 2 | GP: Advice is not structured by the criteria. G4o: - G4: Extra advice for high ratings. CO: No advice. No actions. |
| 3. Structured long prompt | | **3** | **2** | **4** | **2** | |
| | #1 | 1 | 0 | 2 | 1 | GP: No names. Advice missing for some low ratings. [Very actionable advice] G4o: - G4: No names. Ratings too high. CO: Non-zero rating for the 1st criterion. |
| | #2 | 2 | 1 | 1 | 0 | GP: Ratings too low. No names. Advice missing for some low ratings. [Very actionable advice] G4o: Extra advice for 1 high rating. G4: Extra advice for 1 high rating. CO: Ratings too low. [Great explanations with many examples] |
| | #3 | 0 | 1 | 1 | 1 | GP: - G4o: Extra advice for high ratings. G4: A “who said what” summary instead of explanations for most criteria. CO: Extra advice for high ratings. |
| 5. 1-shot prompt | | **0** | **0** | **2** | **3** | |
| | #1 | 0 | 0 | 2 | 1 | GP: /Requests the transcript and criteria at once./ Ratings too high. [Great advice following the example] G4o: [Great advice following the example] G4: Fabricated duration. Names appear in only 2 explanations. CO: /No request for criteria./ Advice missing for some low ratings. |
| | #2 | 0 | 0 | 0 | 1 | GP: - G4o: - G4: - CO: No proper question at the end. |
| | #3 | 0 | 0 | 0 | 1 | GP: - G4o: - G4: - CO: No proper question at the end. |
| 6. 2-shot prompt | | **1** | **0** | **1** | **4** | |
| | #1 | 1 | 0 | 1 | 1 | GP: Advice missing for some low ratings. G4o: - G4: /Requests the transcript and criteria at once./ Fabricated duration. CO: /No request for criteria./ Advice missing for some low ratings. |
| | #2 | 0 | 0 | 0 | 1 | GP: - G4o: - G4: - CO: A “who said what” summary instead of explanations for most criteria. |
| | #3 | 0 | 0 | 0 | 2 | GP: - G4o: - G4: - CO: A “who said what” summary instead of explanations for ALL criteria; the user mentioning names made it forget the original task. |
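The scoring rule in the table header can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure actually used to fill the table. `Annotation`, `input_score` and `prompt_totals` are hypothetical names. Notes in /slashes/ are not counted, which matches the rows above. Clamping a per-input score at zero is an assumption inferred from the 1-shot G4o #1 row (one superior output, no defects, score 0). The sample data is the defect and superior-output counts transcribed from the "3. Structured long prompt" rows.

```python
from dataclasses import dataclass

MODELS = ("GP", "G4o", "G4", "CO")


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record of one model's output on one input."""
    defects: int = 0        # each defect adds 1
    super_outputs: int = 0  # each superior output subtracts 1


def input_score(a: Annotation) -> int:
    # Assumption: a per-input score never drops below zero.
    return max(0, a.defects - a.super_outputs)


def prompt_totals(rows: list[dict[str, Annotation]]) -> dict[str, int]:
    """Sum the per-input scores of each model over all inputs of one prompt version."""
    return {m: sum(input_score(row[m]) for row in rows) for m in MODELS}


# Counts transcribed from the "3. Structured long prompt" rows (inputs #1 to #3).
structured_long = [
    {"GP": Annotation(2, 1), "G4o": Annotation(), "G4": Annotation(2), "CO": Annotation(1)},
    {"GP": Annotation(3, 1), "G4o": Annotation(1), "G4": Annotation(1), "CO": Annotation(1, 1)},
    {"GP": Annotation(), "G4o": Annotation(1), "G4": Annotation(1), "CO": Annotation(1)},
]

totals = prompt_totals(structured_long)
print(totals)  # {'GP': 3, 'G4o': 2, 'G4': 4, 'CO': 2}

# Lower is better, so sort ascending; ties keep the column order.
ranking = sorted(totals, key=totals.get)
print(ranking)  # ['G4o', 'CO', 'GP', 'G4']
```

The computed totals reproduce the bold row for this prompt version. Keeping raw counts instead of parsing the note text avoids guessing how a single note maps to defects; for example, the 2-shot CO #3 note scores 2.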