EMERGE staff assess the psychometric rigor, or empirical quality, of gender equality and empowerment measures and report the results on each measure page to inform users' measure selection. However, the scores presented on our site are neither a recommendation nor an endorsement of a given measure by the EMERGE team. The utility of a measure should be determined by the user.
Key takeaways on our scoring methodology:
Formative research, expert input, reliability, and validity are used to assess the psychometric soundness of measures (see How to Create Scientifically Valid Social and Behavioral Measures on Gender Equality and Empowerment Report on our Resources page for more information on psychometrics).
Our psychometric scoring methodology is based on the DeVellis (2017), Essaylamba (2019) and COSMIN (2011) approaches. Following these approaches, we assess reliability in terms of internal consistency, test-retest reliability, and inter-rater reliability, and we assess validity in terms of content and face validity, criterion (gold standard) validity, and construct validity. However, our methodology differs from these sources in a few important ways: we do not require that measures use Item Response Theory or Classical Test Theory methods, we do not directly evaluate aspects of study design (missingness, sample size, etc.) or assess cross-cultural validation efforts, and we are not restrictive in the type of statistical tests used.
Our decision to apply a more inclusive scoring methodology enables us to evaluate measure quality across a wide array of disciplines and subject matters and allows the focus to be on unique gender equality and empowerment-related measures rather than translated iterations of the same measure. Finally, to ensure scores are provided in a consistent, unambiguous manner, our scoring methodology uses stated statistical “adequacy” guidelines. These guidelines are built into our scoring protocol to limit subjectivity in how evaluations are made. This methodology allows us to robustly evaluate the psychometric rigor of numerous measures across multiple disciplines.
Scoring Procedure and Rubric
Trained EMERGE team members review each measure in terms of formative research, reliability testing, and validity testing; measures are then reviewed by a second EMERGE scorer for quality assurance. Scores are assigned as “Adequate”, “Limited”, “Not available”, or “Not applicable” using a priori statistical adequacy guidelines.
The rubric used to score the psychometric soundness of published measures is displayed below. The total possible score varies between 7 and 10 points, depending on whether inter-rater reliability, criterion validity, and internal reliability are applicable to a given measure. Final scores are presented as aggregated groupings based on the percentage of possible points earned: “Low” (≤33.3%), “Med” (Medium; 33.4%-66.6%), “High” (≥66.7%), or “No Data”.
Measures that could not be scored (i.e., had no psychometric information) are classified as “No Data”.
Preliminary Measure Development:
____ Formative Research and theory to develop items
____ Presence of qualitative research (Adequate = 1pt)
____ Mention of existing literature, theoretical framework (Adequate = 1/2pt)
____ Expert Input on developed items
____ Field expert input (Adequate = 1/2pt)
____ Cognitive Interviews/pilot testing (Adequate = 1pt)
Formal Assessment of Psychometric Properties:
____ Internal reliability (Adequate = 1pt; Limited = 1/2pt)
____ Test-retest reliability (Adequate = 1pt; Limited = 1/2pt)
____ Inter-rater reliability (Adequate = 1pt; Limited = 1/2pt)
____ Content validity (Adequate = 1/2pt)
____ Face validity (Adequate = 1/2pt)
____ Criterion (gold standard) validity (Adequate = 1pt; Limited = 1/2pt)
____ Construct validity (Adequate = 2pts; Limited = 1pt)
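The arithmetic behind the rubric can be illustrated with a minimal Python sketch. It is not the EMERGE team's scoring tool: the item keys, the function name score_measure, the rounding of percentages to one decimal place, and the treatment of a measure with no “Adequate” or “Limited” rating as “No Data” are all assumptions made for illustration. The point values and grouping thresholds are taken from the rubric and text above.

```python
# Points per rubric item as (adequate, limited).
# None means the rubric lists no "Limited" value for that item.
RUBRIC = {
    "qualitative_research": (1.0, None),
    "existing_literature": (0.5, None),
    "field_expert_input": (0.5, None),
    "cognitive_interviews": (1.0, None),
    "internal_reliability": (1.0, 0.5),
    "test_retest_reliability": (1.0, 0.5),
    "inter_rater_reliability": (1.0, 0.5),
    "content_validity": (0.5, None),
    "face_validity": (0.5, None),
    "criterion_validity": (1.0, 0.5),
    "construct_validity": (2.0, 1.0),
}

# Items the text says may not apply to a given measure.
MAY_BE_NOT_APPLICABLE = {
    "internal_reliability",
    "inter_rater_reliability",
    "criterion_validity",
}


def score_measure(ratings):
    """Return (earned, possible, grouping) for a dict of item -> rating.

    Ratings are "adequate", "limited", "not available", or "not applicable".
    Items missing from the dict are treated as "not available" (assumption).
    """
    earned = 0.0
    possible = 0.0
    any_evidence = False
    for item, (adequate_pts, limited_pts) in RUBRIC.items():
        rating = ratings.get(item, "not available")
        if rating == "not applicable":
            if item not in MAY_BE_NOT_APPLICABLE:
                raise ValueError(f"{item} cannot be rated 'not applicable'")
            continue  # excluded from the total possible score
        possible += adequate_pts
        if rating == "adequate":
            earned += adequate_pts
            any_evidence = True
        elif rating == "limited":
            if limited_pts is None:
                raise ValueError(f"{item} has no 'Limited' value in the rubric")
            earned += limited_pts
            any_evidence = True
        elif rating != "not available":
            raise ValueError(f"unknown rating {rating!r} for {item}")

    if not any_evidence:
        # Assumption: no scorable psychometric information means "No Data".
        return earned, possible, "No Data"

    # Assumption: round to one decimal so 33.3/33.4 and 66.6/66.7 leave no gap.
    percent = round(100 * earned / possible, 1)
    if percent <= 33.3:
        grouping = "Low"
    elif percent <= 66.6:
        grouping = "Med"
    else:
        grouping = "High"
    return earned, possible, grouping


example = {
    "qualitative_research": "adequate",
    "cognitive_interviews": "adequate",
    "internal_reliability": "limited",
    "construct_validity": "limited",
}
print(score_measure(example))  # (3.5, 10.0, 'Med')
```

In this sketch, marking all three optional items as “not applicable” lowers the total possible score from 10 to 7, which matches the range stated above.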
Viewing Measure Scoring Details
To view a measure’s scoring details, click the Psychometric Score button (on the right-hand side of the measure page). Clicking this button will display additional scoring details for that measure.
STATISTICAL ADEQUACY GUIDELINES
To determine the adequacy of reported reliability or validity statistics, we rely on benchmarks or “cut-points” used in the literature or general consensus. These cut-points include:
In the presence of multiple statistics, a majority rule is applied: an “Adequate” rating is given if more than half of the statistics are of “A” value, and a “Limited” rating is given if more than half of the statistics are of “L” value. Though a delineation between adequate (A) and limited (L) is necessary, it is important to recognize that other guidelines exist in the literature depending upon the discipline and the measure’s purpose. For statistics that are less clearly understood using cut-points, such as the results of a CFA model, adequacy is determined by an experienced scholar. For scoring purposes, “A” values receive full points and “L” values receive partial points (details found under “Scoring Procedure and Rubric”).
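The majority rule described above can be sketched in a few lines of Python. The function name majority_rating is hypothetical, and the text does not say how an exact tie between “A” and “L” values is resolved, so this sketch returns None in that case (and for an empty list) rather than guessing.

```python
def majority_rating(values):
    """Apply the majority rule to a list of "A" / "L" values.

    Returns "Adequate" if more than half are "A", "Limited" if more than
    half are "L", and None when neither holds (tie or empty list); the
    source text does not specify that case.
    """
    n = len(values)
    n_adequate = sum(1 for v in values if v == "A")
    n_limited = sum(1 for v in values if v == "L")
    if n_adequate + n_limited != n:
        raise ValueError("values must contain only 'A' or 'L'")
    if 2 * n_adequate > n:
        return "Adequate"
    if 2 * n_limited > n:
        return "Limited"
    return None


print(majority_rating(["A", "A", "L"]))  # Adequate
```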
Citation frequency is based on information provided by Google Scholar on the measure’s primary citation. Measures are categorized as “Low” (<20 citations), “Med” (Medium; 20-49 citations), “High” (≥50 citations), or “No Data” (no Google Scholar citation record). This information is updated in real time. To access the publications that have cited the measure’s primary citation (if there are any), click the Citation Frequency button (on the right-hand side of each measure page) to be redirected to Google Scholar.
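The citation frequency categories can be expressed as a simple lookup. This is an illustrative sketch only: the function name citation_category is hypothetical, and passing None to represent a missing Google Scholar record is an assumption.

```python
def citation_category(citation_count):
    """Map a citation count to a category.

    None stands for "no Google Scholar citation record" (assumption).
    """
    if citation_count is None:
        return "No Data"
    if citation_count < 0:
        raise ValueError("citation_count cannot be negative")
    if citation_count < 20:
        return "Low"
    if citation_count < 50:
        return "Med"
    return "High"
```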