Introduction
The National Board of Medical Examiners (NBME) item-writing manual is an excellent starting point for a more detailed analysis of the MCQ writing process.1 It is widely referenced in this chapter, being a mainstay of guidance for question writers aiming to produce high-quality questions. The original ‘red book’ was updated to a 4th edition in 2016,2 continuing to be the gold standard guidance book for improving the quality of multiple choice items.
Do candidates really need to know the finer details of how to write good-quality SBAs and the processes involved in constructing the section 1 paper? The answer is definitely yes if you experience any major difficulties with this type of summative high-stakes exam. Some candidates perform poorly on MCQ-style questions, and for them any guidance is better than none.
For most candidates, some general information about the written paper is always useful, especially if it neatly summarises information from a variety of different sources that may otherwise be difficult or time-consuming to find.
Aims
By the end of this chapter, candidates should have a greater appreciation of the complexity of constructing SBAs to ensure a fair, valid and reliable section 1 exam.
Going through the process of how SBAs are constructed will provide general guidance to a candidate in their overall preparation for section 1.3
Investing extra time working through this chapter may score a candidate the extra couple of marks that may pull them over the line as a borderline pass.4
This chapter will make clear why there are so many poor-quality orthopaedic MCQ books on the market. It is very difficult to construct a good-quality, relevant new SBA, and much easier to bastardise existing questions or to spend an evening producing poor-quality questions without understanding the sophisticated nuances of SBA construction.
Constructing good-quality SBAs needs considerable examiner training, and question writers must first attend workshops for training and advice in their construction before being allowed to start contributing to the question bank.
Looking ahead, this chapter may prove useful reading if you end up writing MCQ type questions for exams in the future.
For aspiring TPDs or future examiners, it is important to know the intricacies of how to write SBAs and the processes involved in constructing the section 1 paper. This will allow you to give more specific and useful advice to candidates who have repeatedly failed this section of the exam.
In any detailed lecture on section 1 of the FRCS (Tr & Orth) exam, reliability, content validity and educational theory (Miller’s pyramid, Bloom’s Taxonomy) are all discussed. It is therefore worth going over these terms, as the concepts can be difficult to grasp if they are unfamiliar.
Lastly, those candidates with an educational slant will find the whole process of constructing the section 1 exam fascinating.
Educational Theory
Miller in 1990 introduced an important framework, presented as the four tiers/levels of a pyramid, to categorise the different levels at which trainees need to be assessed. Although SBAs can be used to test application of knowledge and higher order thinking, their construction is difficult and in general they assess the bottom two levels of ‘knows’ and ‘knows how’ in Miller’s pyramid (Figure 2.1).5
Knows – Knowledge or information that the candidate has learned
Knows how – Application of knowledge to medically relevant situations
Shows how – Simulated demonstration of skills in an examination situation
Does – Behaviour in real-life situations
Workplace-based assessments (WBAs) were introduced into the postgraduate curriculum because of concerns that high-stakes examinations using tests such as single best answers or extended matching items (EMIs) encouraged rote learning. It is also known that performance in a controlled assessment correlates poorly with actual performance in professional practice.
Figure 2.1 Miller’s pyramid. The different layers represent the different components of clinical competency and how they can be assessed. WBA attempt to assess how an individual performs in the workplace, i.e. what they actually do.
In 1956, Bloom et al6 described six levels in the cognitive domain: (1) knowledge recall; (2) comprehension; (3) application; (4) analysis; (5) synthesis; and (6) evaluation. Over the years Bloom’s Taxonomy has been revised and alternative taxonomies created. A substantial revision occurred in 2001, producing a more dynamic classification that uses action verbs to describe the cognitive processes and rearranges the sequence within the taxonomy (Figure 2.2; Table 2.1).
Figure 2.2 Bloom’s Taxonomy
Table 2.1 Action verbs associated with each level of the revised Bloom’s Taxonomy

Remember: Who, What, When, Define, Identify, Describe, Label, List, Name, State, Match, Recognise, Select, Examine, Locate, Memorise, Quote, Recall, Retrieve, Reproduce, Tabulate, Copy

Understand: Demonstrate, Explain, Describe, Interpret, Clarify, Classify, Categorise, Differentiate, Discuss, Distinguish, Infer, Predict, Identify, Report, Select, Outline, Review, Express, Translate

Apply: Solve, Illustrate, Calculate, Execute, Carry out, Discover, Show, Examine, Choose, Schedule, Implement, Use, Make use of, Employ, Organise

Analyse: Differentiate, Distinguish, Analyse, Compare, Classify, Contrast, Separate, Explain, Select, Categorise, Divide, Order, Prioritise, Inspect, Make assumptions, Draw conclusions

Evaluate: Check, Co-ordinate, Reframe, Defend, Rate, Appraise, Critique, Judge, Support, Decide, Recommend, Summarise, Assess, Choose, Estimate, Grade, Find errors, Compare, Measure, Provide opinion

Create: Design, Compose, Create, Plan, Formulate, Produce, Construct, Organise, Generate, Hypothesise, Develop, Assemble, Rearrange, Modify, Improve, Adapt, Elaborate
More recently, Bloom’s Taxonomy has been represented not as a pyramid, with a large base composed of facts and a tiny peak of creativity (which someone might interpret to mean that we should spend the majority of our time focusing purely on knowledge), but as a broad wedge that better highlights the value of creating, evaluating and analysing (Figure 2.3).
Remembering: the candidate can remember previously learned material from long-term memory by recalling facts, terms, basic concepts and answers, e.g.
List the causes of …
What are the steps in … ?
Understanding: the candidate can explain ideas or concepts by organising, translating, interpreting, giving descriptions and stating main ideas, e.g.
Discuss the causes of …
Explain the pathophysiology
Applying: the candidate can solve problems by applying acquired knowledge, facts, techniques and rules in a different way, e.g.
Provide a differential diagnosis
Analysing: the candidate can distinguish between the different parts, how they relate to each other and to the overall structure and purpose. This involves examining and breaking information into parts by identifying motives or causes, making comparisons and finding evidence to support generalisations, e.g.
How will your differential diagnosis be altered in the light of investigation findings?
Evaluating: the candidate makes judgements and justifies decisions, presenting and defending opinions about the validity of ideas or the quality of work based on a set of criteria, e.g.
Justify your management of this patient.
Creating: the candidate puts elements together to form a functional whole, create a new product or point of view, e.g.
What will be your plan of management?
Figure 2.3 Modification of pyramid shape of Bloom’s Taxonomy into broad wedge to better emphasise the value of creating, evaluating and analysing.
Bloom’s Taxonomy is a hierarchical classification, with the lowest cognitive level being ‘remembering’ and the highest being ‘creating’. The lower three levels can be attained with superficial learning using so-called Lower Order Thinking Skills (LOTS), such as memorisation. The upper three levels involve Higher Order Thinking Skills (HOTS) and can only be attained by deep learning.
An ongoing development of the examination is the progressive rewriting of questions in the bank that are currently recorded as level 1 questions (factual knowledge) into higher order questions.
In constructing multiple choice items to test higher order thinking, it is helpful to design problems that require multilogical thinking, along with designing alternatives that require a high level of discrimination.
Higher Order Thinking
This is integration/interpretation (questions which require ‘putting the pieces together’) and problem solving (questions which require ‘clinical judgement’), not simple recall (questions which can be answered with a Google search).
Multilogical Thinking
Multilogical thinking is defined as ‘thinking that requires knowledge of more than one fact to logically and systematically apply concepts to a problem’.7 There has been a conscious move to rewrite the question bank with SBAs that require multilogical thinking to answer.
SBAs
Advantages of SBAs
SBAs can assess a wide sample of curriculum content within a relatively short period of time. This leads to high reliability and improved validity.
They are a highly standardised and fair form of assessment, in that all trainees are assessed with the same questions and sit the same exam.
They are easy to administer and mark.
SBA marking is mostly automated and hence examiner subjectivity is removed from the assessment process.
Main Disadvantages of SBAs
The trainee’s reasons for selecting a particular option/response cannot be assessed.
Although a wide sample of the curriculum can be covered, SBAs do not provide an opportunity for in-depth assessment of the content.
Constructing good SBAs needs considerable examiner training.
Exam boards use a utility model to analyse different assessment tools:
R – Reliability. Can the exam results of a given candidate in a given context be reproduced? To what extent can we trust the results?
V – Validity. Does the assessment assess what it purports to assess?
A – Acceptability. How comfortable are the different stakeholders (candidates, examiners, examination boards, public, National Health Service) with the examination system?
E – Educational impact. Does the exam drive the trainees towards educationally and professionally valuable training?
C – Cost effectiveness. Is the expenditure of money, time and manpower needed to develop, run and sustain the examination process worthwhile in relation to what is learned about the candidate?
P – Practicability. How ‘doable or workable’ is the assessment instrument, given the circumstances? Are there sufficient resources to mount the exam?
Applying the utility model to SBAs gives the following:
Reliability: high
The SBA results are highly reliable, as almost identical scores can be obtained if a similar candidate with similar ability is given the same set of SBAs, regardless of who marks the questions.
Validity: high for knowledge recall
An SBA is good at testing factual recall of knowledge. They can also be used to test application of knowledge and higher order thinking, although the construction of such SBAs is difficult and requires training.
Acceptability: high
SBAs have been used extensively in medical education. Both trainees and examiners have come to accept them. Constructing good SBAs, however, is difficult.
Educational impact: moderate
Properly constructed SBAs will drive the learner towards learning important information.
However, SBAs developed to test trivial knowledge will lead to rote learning. Fragmentation of knowledge is another criticism.
Cost: moderate
The cost of administering an SBA test is low. In contrast, face-to-face peer review meetings of submitted SBAs are expensive to hold, as they involve substantial travel and accommodation costs. However, the quality of scrutiny that can be brought to bear on the question material justifies this outlay and affords considerable confidence in the quality of the product.
Practicability: high
SBAs are easy to administer as a computer-based assessment.
Item Analysis of SBAs
Item analysis output indicates the percentage of candidates in the various subgroups who selected each option of an SBA.
Each SBA is analysed as to the percentage of candidates scoring it correctly in each subgroup. The test group is usually divided into fifths, as this allows more detailed analysis around the pass/fail mark than if quartiles were used.
The spread of total scores should resemble a Gaussian curve. The exam board is not particularly interested in distinguishing the very best from the very worst candidates. Scores are concentrated in the centre, and the board wants to spread this middle area out so that no single question can decide whether a candidate passes or fails the exam.
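As a rough illustration of this subgroup analysis, the sketch below (in Python) ranks candidates by total score, splits them into fifths and reports the percentage answering a given item correctly in each fifth. The function name and data shapes (one 0/1 response per candidate for the item, plus a vector of total exam marks) are illustrative assumptions, not the exam board’s actual software.

```python
import numpy as np

def quintile_percent_correct(item_correct, total_scores):
    """item_correct: 0/1 per candidate for one SBA; total_scores: total exam marks."""
    order = np.argsort(total_scores)                   # weakest -> strongest
    groups = np.array_split(np.asarray(item_correct)[order], 5)
    return [round(100 * g.mean(), 1) for g in groups]  # % correct, bottom fifth first

# Illustrative data: 100 candidates; higher totals make a correct answer more likely
rng = np.random.default_rng(0)
totals = rng.integers(40, 90, size=100)
item = (totals + rng.normal(0, 10, size=100) > 65).astype(int)
print(quintile_percent_correct(item, totals))
```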
Easy Questions
With these questions (Figure 2.4), around 90% of candidates give the correct answer. As such, easy questions do not discriminate between very good and very poor candidates. More important, an easy question does not differentiate between candidates around the level of minimal competence required for a pass. When paper 1 analysis flags up these questions, they are either scrapped or have to be extensively reworked.
Figure 2.4 Easy SBA
Difficult Questions
These questions (Figure 2.5) are just as useless as easy questions. Again, they do not differentiate between good and poor candidates or, more important, make a distinction between borderline candidates: those who can be passed and those who must re-sit. As with easy SBAs, poor-quality difficult SBAs are either discarded or require very extensive rewriting.
Figure 2.5 Difficult SBA
Poorly Performing Questions
With these questions (Figure 2.6(a) and (b)), the bottom 20% of candidates may answer an SBA mainly correctly while the top 20% answer it mainly incorrectly. This is a poor SBA, as overall performance on the item does not follow candidate form. Another example is a random spread of correct answers across the groups.
Usually the question is poorly written, the wrong answer key has been selected by the examiners or there has been a typing error.
Poorly performing questions are removed. Questions answered correctly by more than 90% or fewer than 10% of candidates are also removed.
All questions that score poorly, i.e. where the percentage of candidates choosing the correct option is below 30%, are checked. Questions where the top 20% of candidates score significantly lower than average are also reviewed.
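A hedged sketch of how these flagging rules could be applied in code is shown below, using the thresholds quoted above. The function name, data shapes and the simplified ‘below average’ test for the top 20% are assumptions for illustration only.

```python
import numpy as np

def flag_item(item_correct, total_scores):
    """Return the review/removal flags triggered by one SBA (illustrative only)."""
    item_correct = np.asarray(item_correct)
    p = item_correct.mean()                               # overall proportion correct
    order = np.argsort(total_scores)
    top_fifth = item_correct[order[-max(1, len(order) // 5):]]
    flags = []
    if p > 0.90 or p < 0.10:
        flags.append("remove: answered correctly by >90% or <10% of candidates")
    if p < 0.30:
        flags.append("check: fewer than 30% chose the correct option")
    if top_fifth.mean() < p:                              # simplified 'significantly lower'
        flags.append("review: top 20% score below the cohort average")
    return flags
```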
Good Performing Questions
There is a gradation in candidates obtaining the correct answer, from the top one-fifth mainly scoring the question correctly to the bottom one-fifth mainly scoring it incorrectly (Figure 2.7).
This question discriminates. The point-biserial discrimination index is calculated; if it is >0.3, it is a good question.
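Assuming point-biserial discrimination is what is meant, a minimal sketch of its calculation is shown below: the correlation between the 0/1 score on the item and candidates’ total exam scores. The function name and data shapes are assumptions. In practice the item’s own mark is often excluded from the total before correlating, to avoid slightly inflating the coefficient.

```python
import numpy as np

def point_biserial(item_correct, total_scores):
    """Correlation between a 0/1 item score and candidates' total exam scores."""
    # Equivalent to the Pearson correlation with a dichotomous variable
    return float(np.corrcoef(item_correct, total_scores)[0, 1])

# An item is conventionally regarded as discriminating well if this exceeds ~0.3
```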
Ideal SBA
These questions (Figure 2.8) discriminate between candidates at the pass/fail mark. A good-quality question should be answered correctly by 35–85% of just-passing candidates (defined as those scoring an overall mark within 10% of the pass mark).
There should also be an obvious positive correlation between the performance of the cohort on the individual question and in the examination as a whole (i.e. the question should be answered correctly by appreciably more passing candidates than failing candidates). A reasonable proportion of candidates (especially those who did not pass) should also have chosen each incorrect option.
Item analysis determines the difficulty index (DIF I) (p-value), the discrimination index (DI) and the distractor efficiency (DE).
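The sketch below uses conventional textbook formulas for these three indices (overall proportion correct for DIF I, upper-versus-lower 27% groups for DI, and a 5% cut-off for non-functional distractors in DE). The exam board’s exact definitions may differ, and the function name, option encoding and thresholds are illustrative assumptions.

```python
import numpy as np

def item_indices(chosen_option, correct_option, total_scores, n_options=5):
    """Difficulty index, discrimination index and distractor efficiency for one SBA."""
    chosen_option = np.asarray(chosen_option)         # option index chosen per candidate
    correct = (chosen_option == correct_option).astype(int)
    order = np.argsort(total_scores)
    g = max(1, round(0.27 * len(correct)))            # upper/lower 27% groups
    low, high = correct[order[:g]], correct[order[-g:]]

    dif_i = correct.mean()                            # difficulty index (p-value)
    di = (high.sum() - low.sum()) / g                 # discrimination index
    distractors = [o for o in range(n_options) if o != correct_option]
    nfd = sum((chosen_option == o).mean() < 0.05 for o in distractors)  # non-functional
    de = 100 * (len(distractors) - nfd) / len(distractors)              # efficiency (%)
    return dif_i, di, de
```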