Scoring Methodology

EMERGE staff assess the psychometric rigor (or empirical quality) and ease of use of gender equality and empowerment measures, and the results are provided to site users on each measure page to inform their measure selection. However, the scores presented on our site serve as neither a recommendation nor an endorsement of a given measure by the EMERGE team. The utility of a measure should be determined by the user.

Key takeaways on our scoring methodology:

  • Trained team members score each measure; scoring decisions are reviewed by a second team member.
  • Measures are scored in terms of formative research, measure psychometrics (i.e. reliability and validity testing) and ease of use (readability, scoring clarity and length).
  • Psychometric scores, ease of use scores, and citation frequencies are provided in aggregated groupings: high, medium, low, and no data.
  • The primary citation for each measure is the original paper or report in which the measure was introduced or the first validation paper or report on the measure.
  • A priori adequacy guidelines are used to evaluate statistic values.

Scoring, statistical adequacy guidelines, and citation frequency are summarized in the following sections.

SCORING

Formative research, expert input, reliability, and validity are used to assess the psychometric soundness of measures (see How to Create Scientifically Valid Social and Behavioral Measures on Gender Equality and Empowerment Report on our Resources page for more information on psychometrics).

Psychometrics
Our psychometric scoring methodology is based on the DeVellis (2017), Essaylamba (2019) and COSMIN (2011) approaches. Following these approaches, we assess reliability in terms of internal consistency, test-retest reliability, and inter-rater reliability, and we assess validity in terms of content and face validity, criterion (gold standard) validity, and construct validity. However, our methodology differs from these sources in a few important ways: we do not require that measures use Item Response Theory or Classical Test Theory methods, we do not directly evaluate aspects of study design (missingness, sample size, etc.) or assess cross-cultural validation efforts, and we are not restrictive in the type of statistical tests used.

Our decision to apply a more inclusive scoring methodology enables us to evaluate measure quality across a wide array of disciplines and subject matters and allows the focus to be on unique gender equality and empowerment-related measures rather than translated iterations of the same measure. Finally, in an effort to ensure scores are provided in a consistent, unambiguous manner, our scoring methodology uses stated statistical “adequacy” guidelines. These guidelines are built into our scoring protocol to limit subjectivity in how evaluations are made. This methodology allows us to robustly evaluate the psychometric rigor of numerous measures across multiple disciplines.

Scoring Procedure and Rubric

Trained EMERGE team members review each measure in terms of formative research, reliability testing, and validity testing; measures are then reviewed by a second EMERGE scorer for quality assurance. Scores are assigned as “Adequate”, “Limited”, “Not available”, or “Not applicable” using a priori statistical adequacy guidelines.

The rubric used to score the psychometric soundness of published measures is displayed below. The total possible score varies between 7 and 10 points, depending on whether inter-rater reliability, criterion validity, and internal reliability are applicable to a given measure. Final scores are presented as aggregated groupings: “Low” (≤33.3%), “Med” (Medium) (33.4%-66.6%), “High” (≥66.7%), or “No Data”.
Measures that could not be scored (i.e., had no psychometric information) are classified as “No Data”.

Rubric:
Preliminary Measure Development:
Formative Research and theory to develop items
____ Presence of qualitative research (Adequate = 1pt)
____ Mention of existing literature, theoretical framework (Adequate = 1/2pt)
Expert Input on developed items 
____ Field expert input (Adequate = 1/2pt)
____ Cognitive Interviews/pilot testing (Adequate = 1pt)

Formal Assessment of Psychometric Properties:
Reliability
____ Internal reliability (Adequate = 1pt; Limited = 1/2pt)
____ Test-retest reliability (Adequate = 1pt; Limited = 1/2pt)
____ Inter-rater reliability (Adequate = 1pt; Limited = 1/2pt)
Validity
____ Content validity (Adequate = 1/2pt)
____ Face validity (Adequate = 1/2pt)
____ Criterion (gold standard) validity (Adequate = 1pt; Limited = 1/2pt)
____ Construct validity (Adequate = 2pts; Limited = 1pt)
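
The rubric and groupings above can be sketched in code. This is a minimal illustration, not EMERGE's actual implementation. The item keys, the function name `psychometric_score`, and the following behaviors are assumptions: "Not applicable" items are dropped from the total possible score, "Not available" items stay in the total and earn 0 points, a "Limited" rating on an item with no listed half-credit option earns 0 points, a measure with no "Adequate" or "Limited" rating is treated as "No Data", and percentages are rounded to one decimal place before the cut-offs are applied.

```python
# Maximum points per rubric item, copied from the rubric above.
# The dictionary keys are hypothetical names chosen for this sketch.
MAX_POINTS = {
    "qualitative_research": 1.0,
    "literature_or_theory": 0.5,
    "expert_input": 0.5,
    "cognitive_interviews_or_pilot": 1.0,
    "internal_reliability": 1.0,
    "test_retest_reliability": 1.0,
    "inter_rater_reliability": 1.0,
    "content_validity": 0.5,
    "face_validity": 0.5,
    "criterion_validity": 1.0,
    "construct_validity": 2.0,
}

# Items whose rubric line lists a half-credit "Limited" option.
LIMITED_ALLOWED = {
    "internal_reliability",
    "test_retest_reliability",
    "inter_rater_reliability",
    "criterion_validity",
    "construct_validity",
}


def psychometric_score(ratings):
    """Return (earned, possible, grouping) for a dict of item -> rating.

    Ratings are "Adequate", "Limited", "Not available", or "Not applicable".
    Items missing from the dict are treated as "Not available" (assumption).
    """
    earned = 0.0
    possible = 0.0
    any_information = False
    for item, max_points in MAX_POINTS.items():
        rating = ratings.get(item, "Not available")
        if rating == "Not applicable":
            continue  # assumption: excluded from the total possible score
        possible += max_points
        if rating == "Adequate":
            earned += max_points
            any_information = True
        elif rating == "Limited":
            any_information = True
            if item in LIMITED_ALLOWED:
                earned += max_points / 2
    if not any_information or possible == 0:
        return earned, possible, "No Data"
    # Assumption: round to one decimal so the published cut-offs leave no gaps.
    percent = round(100 * earned / possible, 1)
    if percent <= 33.3:
        grouping = "Low"
    elif percent <= 66.6:
        grouping = "Med"
    else:
        grouping = "High"
    return earned, possible, grouping
```

For example, a measure rated Adequate on qualitative research and literature, Limited on internal reliability and construct validity, and Not applicable on inter-rater reliability and criterion validity would earn 3 of 8 possible points (37.5%) under these assumptions.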

Viewing Measure Scoring Details
To view a measure’s scoring details, click the Psychometric Score, Ease of Use Score, or Citation Frequency button on the right-hand side of the measure page. Each button displays additional scoring details for that measure.

STATISTICAL ADEQUACY GUIDELINES
 
To determine the adequacy of reported reliability or validity statistics, we rely on benchmarks or “cut-points” used in the literature or general consensus. These cut-points include:
  • Alpha, KR-20 statistic, Omega: 0.6 or higher = Adequate; Less than 0.6 = Limited (Henson, 2001; Nunnally, 1967)
  • ICC, Kappa, weighted Kappa: 0.6 or higher = Adequate; Less than 0.6 = Limited (Cicchetti, 1994)
  • Variance explained by EFA, PCF, PCA factor analytic methods: 50% or higher = Adequate; less than 50% = Limited (Field, 2018)
  • Correlation Coefficient: ± 0.3 or higher = Adequate; ± less than 0.3 = Limited (Cohen, 1988; Hemphill, 2003)
  • p-value: < 0.05 = Adequate; p-value > 0.05 = Limited (Field, 2018)
When multiple statistics are reported, a majority rule is applied: an “Adequate” rating is given if more than half of the statistics meet the adequate (A) cut-point, and a “Limited” rating is given if more than half fall in the limited (L) range. Though a delineation between adequate (A) and limited (L) is necessary, it is important to recognize that other guidelines exist in the literature depending upon the discipline and the measure’s purpose. For statistics that are less clearly understood using cut-points, such as the results of a CFA model, adequacy is determined by an experienced scholar. For scoring purposes, “A” values receive full points and “L” values receive partial points (details found under “Scoring Rubric”).
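
The cut-points and the majority rule can be expressed as a short sketch. The function names, the `kind` labels, and the "Undetermined" result are hypothetical. Two cases are not specified above and are handled here by assumption: a p-value of exactly 0.05 is treated as Limited, and an even split between A and L values returns "Undetermined" rather than a rating.

```python
# Statistics that share the 0.6 cut-point (alpha, KR-20, omega, ICC, kappa).
RELIABILITY_KINDS = {"alpha", "kr20", "omega", "icc", "kappa", "weighted_kappa"}


def rate_statistic(kind, value):
    """Return "A" (adequate) or "L" (limited) for one reported statistic."""
    if kind in RELIABILITY_KINDS:
        return "A" if value >= 0.6 else "L"
    if kind == "variance_explained":  # value given in percent
        return "A" if value >= 50 else "L"
    if kind == "correlation":  # the sign is ignored; only magnitude matters
        return "A" if abs(value) >= 0.3 else "L"
    if kind == "p_value":
        # Assumption: exactly 0.05 counts as Limited; the guideline is silent.
        return "A" if value < 0.05 else "L"
    raise ValueError(f"no cut-point defined for {kind!r}")


def majority_rating(statistics):
    """Apply the majority rule to a list of (kind, value) pairs."""
    labels = [rate_statistic(kind, value) for kind, value in statistics]
    if labels.count("A") > len(labels) / 2:
        return "Adequate"
    if labels.count("L") > len(labels) / 2:
        return "Limited"
    # Assumption: ties (or an empty list) are left for a human reviewer.
    return "Undetermined"
```

Statistics without a simple cut-point, such as CFA model results, fall outside this sketch because they are assessed by an experienced scholar.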

EASE OF USE SCORE

The ease of use of a given measure is a critical aspect of how widely it can practically be used. Our ease of use scoring methodology is based on the Lewis (2021) and Glasgow (2013) approaches, with simplification and adaptation for relevance across settings. We assess ease of use in terms of readability, scoring clarity, and length. Readability assesses how clearly a measure is written based on sentence and word length, as measured by the Flesch-Kincaid Grade Level score. Scoring clarity assesses guidance on measure scoring and interpretation of scores. Measure length assesses the number of items in a measure.
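
For reference, the standard Flesch-Kincaid Grade Level formula combines average sentence length and average syllables per word. The sketch below assumes the word, sentence, and syllable counts have already been obtained; the function name is hypothetical, and the grade level that EMERGE treats as adequate is not stated here, so no cut-off is applied.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Compute the Flesch-Kincaid Grade Level from pre-counted totals."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    # Standard published coefficients for the grade-level formula.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

A text of 100 words in 5 sentences with 150 syllables works out to 0.39 × 20 + 11.8 × 1.5 − 15.59 = 9.91, roughly a tenth-grade reading level.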

Scoring Procedure and Rubric
Trained EMERGE team members review each measure in terms of readability, scoring clarity, and length; measures are then reviewed by a second EMERGE scorer for quality assurance. Scores are assigned as “Adequate” or “Limited” using predetermined guidelines.
The rubric used to score the ease of use of published measures is displayed below. The total possible score varies between 2 and 3 points, depending on whether a given measure is intended to have a summative score or is a single-item measure. Final scores are presented as aggregated groupings: “Low” (≤50%), “Med” (Medium) (51%-70%), or “High” (≥71%).
 
Rubric:
Ease of Use Scoring:
 
___ Readability (adequate = 1pt; limited = 1/2pt)
___ Scoring clarity (adequate = 1pt; limited = 1/2pt; none = 0pts)
___ Length (adequate = 1pt; limited = 1/2pt)
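
The ease of use rubric and groupings can be sketched the same way. The function name and item keys are hypothetical. As assumptions, an item that does not apply to a measure is passed as `None` and dropped from the total possible score, and percentages are rounded to the nearest whole number before the cut-offs are applied. The text above does not specify which item is dropped for which kind of measure, so the sketch leaves that choice to the caller.

```python
# Points per rating, copied from the rubric above.
POINTS = {"adequate": 1.0, "limited": 0.5, "none": 0.0}
ITEMS = ("readability", "scoring_clarity", "length")


def ease_of_use_score(ratings):
    """Return (earned, possible, grouping) for a dict of item -> rating.

    A rating of None (or a missing item) means "not applicable" (assumption).
    """
    earned = 0.0
    possible = 0.0
    for item in ITEMS:
        rating = ratings.get(item)
        if rating is None:
            continue  # assumption: excluded from the total possible score
        if rating == "none" and item != "scoring_clarity":
            # The rubric lists a "none" option only for scoring clarity.
            raise ValueError(f"'none' is not a listed rating for {item}")
        possible += 1.0
        earned += POINTS[rating]
    if possible == 0:
        raise ValueError("at least one item must be applicable")
    # Assumption: round to a whole percent so the cut-offs leave no gaps.
    percent = round(100 * earned / possible)
    if percent <= 50:
        grouping = "Low"
    elif percent <= 70:
        grouping = "Med"
    else:
        grouping = "High"
    return earned, possible, grouping
```

Under these assumptions, a measure rated adequate on readability and limited on both scoring clarity and length earns 2 of 3 points (67%) and falls in the "Med" grouping.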

CITATION FREQUENCY
Citation frequency uses information provided by Google Scholar on the measure’s primary citation. Measures are categorized into “Low” (<20 citations), “Med” (Medium; 20-49 citations), “High” (≥50 citations), or “No Data” (no Google Scholar citation record). This information is updated in real time. To access the publications that have cited the measure’s primary citation (if there are any), click the Citation Frequency button on the right-hand side of the measure’s page to be redirected to Google Scholar.
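
The citation groupings reduce to a simple threshold check. In this sketch the function name is hypothetical, and a missing Google Scholar record is represented as `None` by assumption.

```python
def citation_grouping(citation_count):
    """Map a Google Scholar citation count to its grouping."""
    if citation_count is None:  # assumption: None means no citation record
        return "No Data"
    if citation_count < 20:
        return "Low"
    if citation_count < 50:
        return "Med"
    return "High"
```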