Evidence-based Measures of Empowerment for Research on Gender Equality

Photograph by Prashant Panjiar © Bill & Melinda Gates Foundation


EMERGE staff assess the psychometric rigor, or empirical quality, of gender equality and empowerment measures and report the results on each measure page to inform users' measure selection. However, it is important to note that scores presented on our site serve neither as a recommendation nor as an endorsement of a given measure by the EMERGE team. The utility of a measure should be determined by the user.

Key takeaways on our scoring methodology:

  • Trained team members score each measure; scoring decisions are reviewed by a second team member.
  • Measures are scored in terms of formative research and measure psychometrics (i.e. reliability and validity testing).
  • Psychometric scores and citation frequencies are provided in aggregated groupings: high, medium, low, and no data.
  • The primary citation for each measure is the original paper or report in which the measure was introduced or the first validation paper or report on the measure.
  • A priori adequacy guidelines are used to evaluate statistic values.

Information on scoring, statistical adequacy guidelines, and citation frequency is summarized in the following sections.


Formative research, expert input, reliability, and validity are used to assess the psychometric soundness of measures (see the report How to Create Scientifically Valid Social and Behavioral Measures on Gender Equality and Empowerment on our Resources page for more information on psychometrics).

Our psychometric scoring methodology is based on the DeVellis (2017), Essaylamba (2019) and COSMIN (2011) approaches. Following these approaches, we assess reliability in terms of internal consistency, test-retest reliability, and inter-rater reliability, and we assess validity in terms of content and face validity, criterion (gold standard) validity, and construct validity. However, our methodology differs from these sources in a few important ways: we do not require that measures use Item Response Theory or Classical Test Theory methods, we do not directly evaluate aspects of study design (missingness, sample size, etc.) or assess cross-cultural validation efforts, and we are not restrictive in the type of statistical tests used.

Our decision to apply a more inclusive scoring methodology enables us to evaluate measure quality across a wide array of disciplines and subject matters and allows the focus to be on unique gender equality and empowerment-related measures rather than translated iterations of the same measure. Finally, in an effort to ensure scores are provided in a consistent, unambiguous manner, our scoring methodology uses stated statistical “adequacy” guidelines. These guidelines are built into our scoring protocol to limit subjectivity in how evaluations are made. This methodology allows us to robustly evaluate the psychometric rigor of numerous measures across multiple disciplines.

Scoring Procedure and Rubric
Trained EMERGE team members review each measure in terms of formative research, reliability testing and validity testing; measures are then reviewed by a second EMERGE scorer for quality assurance. Each rubric item is rated as “Adequate”, “Limited”, “Not available” or “Not applicable” using a priori statistical adequacy guidelines.

The rubric used to score the psychometric soundness of published measures is displayed below. The total possible score varies between 7 and 10 points, depending on whether inter-rater reliability, criterion validity, and internal reliability are applicable to a given measure. Final scores are presented as aggregated groupings based on the percentage of possible points earned: “Low” (≤33.3%), “Med” (Medium; 33.4%-66.6%), “High” (≥66.7%), or “No Data”.

Measures that could not be scored (i.e., had no psychometric information) are classified as “No Data”.


Preliminary Measure Development:
____  Formative Research and theory to develop items
____ Presence of qualitative research (Adequate = 1pt)
____ Mention of existing literature, theoretical framework (Adequate = 1/2pt)
____   Expert Input on developed items
____ Field expert input (Adequate = 1/2pt)
____ Cognitive Interviews/pilot testing (Adequate = 1pt)

Formal Assessment of Psychometric Properties:
____  Reliability
____ Internal reliability (Adequate = 1pt; Limited = 1/2pt)
____ Test-retest reliability (Adequate = 1pt; Limited = 1/2pt)
____ Inter-rater reliability (Adequate = 1pt; Limited = 1/2pt)
____  Validity
____ Content validity (Adequate = 1/2pt)
____ Face validity (Adequate = 1/2pt)
____ Criterion (gold standard) validity (Adequate = 1pt; Limited = 1/2pt)
____ Construct validity (Adequate = 2pts; Limited = 1pt)
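
The rubric arithmetic can be illustrated with a short Python sketch. This is not EMERGE's actual scoring tool: the item names, the `score_measure` function, and the handling of edge cases are assumptions made for illustration. In particular, the sketch assumes that a “Limited” rating earns no points on items for which the rubric lists only an “Adequate” value, that “Not applicable” items are dropped from the total possible score, that percentages are rounded to one decimal place before grouping, and that a measure with no “Adequate” or “Limited” rating at all is “No Data”.

```python
# Points awarded per rubric item as (Adequate, Limited).
# A Limited value of 0.0 is an assumption for items where the rubric
# lists only an Adequate value.
RUBRIC = {
    "qualitative_research": (1.0, 0.0),
    "existing_literature": (0.5, 0.0),
    "field_expert_input": (0.5, 0.0),
    "cognitive_interviews": (1.0, 0.0),
    "internal_reliability": (1.0, 0.5),
    "test_retest_reliability": (1.0, 0.5),
    "inter_rater_reliability": (1.0, 0.5),
    "content_validity": (0.5, 0.0),
    "face_validity": (0.5, 0.0),
    "criterion_validity": (1.0, 0.5),
    "construct_validity": (2.0, 1.0),
}


def score_measure(ratings):
    """Return (points earned, points possible, grouping) for one measure.

    `ratings` maps item names to "Adequate", "Limited", "Not available"
    or "Not applicable". Missing items are treated as "Not available".
    """
    earned = 0.0
    possible = 0.0
    has_information = False
    for item, (adequate_pts, limited_pts) in RUBRIC.items():
        rating = ratings.get(item, "Not available")
        if rating == "Not applicable":
            continue  # item does not count toward the total possible score
        possible += adequate_pts
        if rating == "Adequate":
            earned += adequate_pts
            has_information = True
        elif rating == "Limited":
            earned += limited_pts
            has_information = True

    if not has_information or possible == 0:
        return earned, possible, "No Data"

    percent = round(100 * earned / possible, 1)
    if percent <= 33.3:
        group = "Low"
    elif percent <= 66.6:
        group = "Med"
    else:
        group = "High"
    return earned, possible, group


if __name__ == "__main__":
    example = {
        "qualitative_research": "Adequate",
        "internal_reliability": "Limited",
        "construct_validity": "Limited",
        "inter_rater_reliability": "Not applicable",
        "criterion_validity": "Not applicable",
    }
    print(score_measure(example))
```

With all eleven items applicable the total possible score is 10 points; marking inter-rater reliability, criterion validity, and internal reliability as “Not applicable” reduces it to 7, which matches the range stated above.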

Viewing Measure Scoring Details
To view a measure’s scoring details, click the Psychometric Score button on the right-hand side of the measure page. Clicking this button displays additional scoring details for that measure.


To determine the adequacy of reported reliability or validity statistics, we rely on benchmarks or “cut-points” used in the literature or general consensus. These cut-points include:

  • Alpha, KR-20 statistic, Omega: 0.6 or higher = Adequate; Less than 0.6 = Limited (Henson, 2001; Nunnally, 1967)
  • ICC, Kappa, weighted Kappa: 0.6 or higher = Adequate; Less than 0.6 = Limited (Cicchetti, 1994)
  • Variance explained by EFA, PCF, PCA factor analytic methods: 5% or higher = Adequate; less than 5% = Limited
  • Correlation Coefficient: absolute value of 0.3 or higher = Adequate; absolute value less than 0.3 = Limited (Cohen, 1988; Hemphill, 2003)
  • P-value: < 0.05 = Adequate; > 0.05 = Limited
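
As a rough illustration, the cut-points above could be encoded as follows. The statistic labels and the `rate_statistic` function are hypothetical names chosen for this sketch, not part of the EMERGE protocol. The list does not say how a p-value of exactly 0.05 is rated; the sketch treats it as Limited, which is an assumption.

```python
# Minimum value for an "Adequate" rating, keyed by a hypothetical label
# for each statistic named in the cut-point list.
MINIMUM_FOR_ADEQUATE = {
    "alpha": 0.6,
    "kr20": 0.6,
    "omega": 0.6,
    "icc": 0.6,
    "kappa": 0.6,
    "weighted_kappa": 0.6,
    "variance_explained_pct": 5.0,  # EFA, PCF, PCA; expressed in percent
}


def rate_statistic(kind, value):
    """Rate a single reported statistic as "Adequate" or "Limited"."""
    if kind == "correlation":
        # The sign is ignored: only the magnitude of the coefficient matters.
        return "Adequate" if abs(value) >= 0.3 else "Limited"
    if kind == "p_value":
        # Assumption: exactly 0.05 is rated Limited.
        return "Adequate" if value < 0.05 else "Limited"
    if kind not in MINIMUM_FOR_ADEQUATE:
        raise ValueError(f"no cut-point defined for {kind!r}")
    return "Adequate" if value >= MINIMUM_FOR_ADEQUATE[kind] else "Limited"


if __name__ == "__main__":
    print(rate_statistic("alpha", 0.72))
    print(rate_statistic("correlation", -0.41))
    print(rate_statistic("p_value", 0.08))
```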

In the presence of multiple statistics, a majority rule is applied: an “Adequate” (A) rating is given if more than half of the statistics meet the adequate cut-point, and a “Limited” (L) rating is given if more than half of the statistics fall in the limited range. Though a delineation between adequate and limited is necessary, it is important to recognize that other guidelines exist in the literature depending upon the discipline and the measure’s purpose. For statistics that are less clearly understood using cut-points, such as the results of a CFA model, adequacy is determined by an experienced scholar. For scoring purposes, “A” ratings receive full points and “L” ratings receive partial points (details are given in the rubric under “Scoring Procedure and Rubric”).
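
The majority rule can be sketched in a few lines of Python. The `majority_rating` function is a hypothetical helper, not EMERGE's implementation. The text does not say what happens when exactly half of the statistics are adequate and half are limited, so the sketch returns `None` in that case (and for an empty list) to signal that a human judgment would be needed; that behavior is an assumption.

```python
from collections import Counter


def majority_rating(ratings):
    """Combine several per-statistic ratings into one item rating.

    `ratings` is a list of "Adequate" / "Limited" strings. Returns the
    rating held by more than half of the statistics, or None when there
    is no strict majority (an assumption; the rule for ties is not stated).
    """
    counts = Counter(ratings)
    total = len(ratings)
    if counts["Adequate"] * 2 > total:
        return "Adequate"
    if counts["Limited"] * 2 > total:
        return "Limited"
    return None


if __name__ == "__main__":
    print(majority_rating(["Adequate", "Adequate", "Limited"]))
    print(majority_rating(["Adequate", "Limited"]))
```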


Citation frequency uses information provided by Google Scholar on the measure’s primary citation. Measures are categorized as “Low” (<20 citations), “Med” (Medium; 20-49 citations), “High” (≥50 citations), or “No Data” (no Google Scholar citation record). This information is updated in real time. To access the publications that have cited the measure’s primary citation (if there are any), click the Citation Frequency button on the right-hand side of the measure’s page to be redirected to Google Scholar.
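
The citation groupings map directly onto a small function. The name `citation_group` and the use of `None` to represent a missing Google Scholar record are assumptions of this sketch; it does not query Google Scholar.

```python
def citation_group(citation_count):
    """Map a citation count to the grouping shown on a measure page.

    `citation_count` is the number of citations of the primary citation,
    or None when no Google Scholar record exists (an assumed convention).
    """
    if citation_count is None:
        return "No Data"
    if citation_count < 20:
        return "Low"
    if citation_count < 50:
        return "Med"
    return "High"


if __name__ == "__main__":
    for count in (None, 0, 19, 20, 49, 50):
        print(count, citation_group(count))
```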
