Inclusive usability testing methods, such as the GenderMag method, aim to identify gender-related usability problems in digital interfaces. Large Language Models (LLMs) have been used by usability engineers in usability evaluations, but their contribution remains underexplored, especially for inclusive usability testing. Research has shown that GenderMag workshops can produce valuable insights but are resource-intensive, and their results may depend on the evaluators’ ability to embody personas with a cognitive style different from their own. We therefore need to assess whether LLM-agent-based testing can aid human-led evaluations. This study evaluates an LLM-agent system for GenderMag persona-based usability testing and compares its performance to traditional human-led evaluations. The agent system integrates GenderMag persona facets into three LLM agents, which analyze usability issues in four web interfaces: three generic ones and one intentionally flawed interface containing gender-related usability issues. We quantitatively and qualitatively compare the types, severity, and relevance of the usability problems the LLM agents identified to those produced in three GenderMag workshops involving nine participants. Findings show a broad overlap between humans and LLM agents in detecting non-gender-specific usability issues in the generic interfaces, although the agents consistently assign significantly higher severity and relevance ratings. For the intentionally flawed interface, humans and LLM agents assign similar ratings, but the overlap between the gender-related usability issues they found was low, with each missing issues the other caught. The agent system is thus an efficient complement to human evaluation: it can detect gender-related usability issues that humans may overlook, thereby expanding the coverage of the evaluation.
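The abstract does not detail how the persona facets are passed to the agents. As a minimal sketch, assuming a prompt-based design, the following Python shows one way the published GenderMag facets (here for the canonical persona "Abi") might be composed into an LLM agent's system prompt. The `GenderMagPersona` class, the facet wording, and `persona_system_prompt` are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's system): composing a GenderMag
# persona prompt for an LLM agent. The facet set (motivations, information
# processing style, computer self-efficacy, attitude toward risk, learning
# style) follows the published GenderMag method; everything else here is a
# hypothetical placeholder.

from dataclasses import dataclass


@dataclass
class GenderMagPersona:
    name: str
    motivations: str
    information_processing: str
    self_efficacy: str
    risk_attitude: str
    learning_style: str


# "Abi" is one of the canonical GenderMag personas; the facet values below
# paraphrase the public GenderMag materials.
abi = GenderMagPersona(
    name="Abi",
    motivations="uses technology to accomplish tasks, not for its own sake",
    information_processing="comprehensive: gathers information broadly before acting",
    self_efficacy="low computer self-efficacy; blames herself when tools fail",
    risk_attitude="risk-averse; avoids unfamiliar features",
    learning_style="process-oriented; prefers step-by-step instructions over tinkering",
)


def persona_system_prompt(p: GenderMagPersona, task: str) -> str:
    """Compose a system prompt asking the LLM to walk through a task
    while embodying the persona's cognitive facets."""
    return (
        f"You are {p.name}, a user with these cognitive facets:\n"
        f"- Motivations: {p.motivations}\n"
        f"- Information processing: {p.information_processing}\n"
        f"- Computer self-efficacy: {p.self_efficacy}\n"
        f"- Attitude toward risk: {p.risk_attitude}\n"
        f"- Learning style: {p.learning_style}\n\n"
        "Walk through the following task on the given interface. At each "
        "step, state what you would do, whether you would know what to do, "
        "and report any usability problem with a severity rating (1-5).\n"
        f"Task: {task}"
    )


print(persona_system_prompt(abi, "Create an account and update your profile photo."))
```

The step-by-step "would you know what to do" framing mirrors the cognitive-walkthrough style underlying GenderMag; the severity scale shown is an assumed placeholder, since the rating scheme the study actually used is not given in the abstract.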