Authors: Elsweiler, David; Frummet, Alexander; Harvey, Morgan
Date: 2021-05-04; 2020
Title: Comparing Wizard of Oz & Observational Studies for Conversational IR Evaluation
Type: Text/Journal Article
DOI: 10.1007/s13222-020-00333-z (http://dx.doi.org/10.1007/s13222-020-00333-z)
URI: https://dl.gi.de/handle/20.500.12116/36386
ISSN: 1610-1995
Keywords: Conversational search; Evaluation

Abstract: Systematic and repeatable measurement of information systems via test collections, the Cranfield model, has been the mainstay of Information Retrieval since the 1960s. However, this may not be appropriate for newer, more interactive systems, such as Conversational Search agents. Such systems rely on Machine Learning technologies, which are not yet sufficiently advanced to permit true human-like dialogues, and so research can be enabled by simulation via human agents. In this work we compare dialogues obtained from two studies with the same context, assistance in the kitchen, but with different experimental setups, allowing us to learn about and evaluate conversational IR systems. We discover that users adapt their behaviour when they think they are interacting with a system, and that human-like conversations in one of the studies were unpredictable to an extent we did not expect. Our results have implications for the development of new studies in this area and, ultimately, the design of future conversational agents.