Although traditional and conjoint forms of concept testing play an important role in the new product development process, they largely ignore data quality issues, as evidenced by the traditional reliance on the percent Top-2-Box score heuristic. The purpose of this research is to reconsider the design of concept testing from a measurement theory perspective, specifically generalizability theory, and to use that perspective to suggest ways to improve the psychometric quality of concept testing. Generalizability theory is employed because it can account for the multiple facets of variation in concept testing, and because it enables a concept test to be designed to provide a required level of accuracy for decision making in the most effective way, whether the purpose of measurement is to scale concepts or something else, such as to scale respondents. The paper identifies four types of sources (concept-related factors, response task factors, situational factors, and respondent factors) that can contribute to the observed variation in concept testing, and it develops six research propositions that summarize what is known or assumed about their contribution to observed score variance. Four secondary data sets from different concept testing contexts are then used to test the propositions. The results provide new insights into the design of concept tests and the psychometric quality of concept testing data: (1) the concepts facet is not a major contributor to response variation; (2) of the response task factors, concept formulations are a trivial source of variance, but items are not always a trivial source of variance; (3) the situational factors that are investigated are trivial sources of variance; (4) respondents are always a major contributor to the total variation; (5) the concepts-by-respondents interaction is not always a major contributor, and the other interactions are often not trivial; and (6) residual error is always a major source of variance.
Additionally, the analyses of the secondary data sets enable some useful managerial conclusions to be drawn about the design of concept testing. First, the sample size needed to reliably scale concepts depends on the types of concepts being tested. Second, averaging over items provides considerably more reliable information than relying on a single item. Third, which single item performs best is inconsistent and highly context specific, and the popular purchase intention item is never the best single item to use. Fourth, not much is gained by sampling levels of the response task factors. Finally, concept testing should be designed to meet the needs of specific managerial tasks.
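
As a rough illustration of the design reasoning summarized above, the following sketch computes a generalizability coefficient for scaling concepts in a fully crossed concepts x respondents x items random design, and then searches for the smallest respondent sample that reaches a target coefficient. It uses the standard relative error variance for that design. The variance components, the function names, and the 0.80 target are hypothetical values chosen only for illustration; they are not estimates from the four data sets.

```python
# Hypothetical variance components for a crossed c x r x i design.
# Only the components that involve concepts enter the relative error
# variance when concepts are the object of measurement.
HYPOTHETICAL_VAR = {
    "c": 0.05,      # concepts (universe-score variance)
    "cr": 0.40,     # concepts x respondents interaction
    "ci": 0.01,     # concepts x items interaction
    "cri_e": 1.00,  # three-way interaction confounded with residual error
}


def g_coefficient(var, n_r, n_i):
    """Generalizability coefficient for scaling concepts.

    Relative error variance: cr/n_r + ci/n_i + cri_e/(n_r * n_i).
    """
    rel_error = (
        var["cr"] / n_r
        + var["ci"] / n_i
        + var["cri_e"] / (n_r * n_i)
    )
    return var["c"] / (var["c"] + rel_error)


def min_respondents(var, n_i, target, max_n=100_000):
    """Smallest number of respondents reaching the target coefficient.

    Returns None if the target is not reached within max_n respondents.
    """
    for n_r in range(1, max_n + 1):
        if g_coefficient(var, n_r, n_i) >= target:
            return n_r
    return None


if __name__ == "__main__":
    # Compare a single item with an average over five items.
    for n_i in (1, 5):
        coef = g_coefficient(HYPOTHETICAL_VAR, n_r=100, n_i=n_i)
        print(f"n_r=100, n_i={n_i}: coefficient = {coef:.3f}")
    needed = min_respondents(HYPOTHETICAL_VAR, n_i=5, target=0.80)
    print("Respondents needed with 5 items:", needed)
```

With these made-up components, averaging over several items raises the coefficient at a fixed sample size, which mirrors the second managerial conclusion. The required sample size also changes when the concepts variance component changes, which is one way to read the first conclusion.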