Evidence-based Measures of Empowerment for Research on Gender Equality

Photograph by Prashant Panjiar © Bill & Melinda Gates Foundation


The psychometric rigor or empirical quality of gender equality and empowerment measures was assessed by EMERGE personnel. Scoring, adequacy guidelines, and citation frequency information are presented in the sections that follow.

Key Points:

  • Measures were scored in terms of formative research and measure psychometrics (i.e., reliability and validity testing)
  • The primary citation for each measure is either the seminal publication or the main measure validation paper
  • Trained team members scored each measure; the extracted information and scoring decisions were then checked by a second team member
  • Adequacy guidelines determined a priori were used to evaluate statistical values
  • Measure psychometric scores and citation frequencies are represented using score class terms: High, Med (Medium), Low, and No Data

Please note that the scores presented on our site serve neither as recommendation nor endorsement of a given measure by our team. The utility of a measure should be determined by the user.


Three aspects were scored to determine the psychometric soundness of measures: (1) Formative Research and Expert Input, (2) Reliability, and (3) Validity (see the How to Create Scientifically Valid Social and Behavioral Measures on Gender Equality and Empowerment report under Resources). In this section we outline the basis of our scoring system, describe our scoring procedure, provide a scoring rubric, and show step by step how to access scoring details on our site.

Basis of Scoring System

Our psychometric scoring methodology is based on the DeVellis (2017), Essaylamba (2019), and COSMIN (2011) approaches. As these approaches recommend, we assess reliability in terms of internal consistency, test-retest, and inter-rater reliability, and validity in terms of content/face validity, criterion validity, and construct validity (subsumed under COSMIN's structural validity and hypothesis testing). Still, our psychometric scoring methodology differs in important ways. In particular, we do not require that measures use Item Response Theory (IRT) or Classical Test Theory (CTT) methods; we do not directly evaluate aspects of study design (missingness, sample size, etc.) or assess cross-cultural validation efforts; and we are not restrictive in the type of statistical tests used.

Our decision to apply a more inclusive scoring methodology has made it possible for us to evaluate measure quality across a wide array of subject matters; it is also time efficient and allows us to focus on the unique measures used in the field rather than on translated iterations of the same measure. Finally, to ensure ratings are provided in a consistent, unambiguous manner, our scoring methodology uses stated statistical “adequacy” guidelines. These guidelines are built into our scoring protocol to limit subjectivity in how evaluations are made. This methodology has enabled the EMERGE team to evaluate, with expertise, the psychometric rigor of numerous measures across multiple disciplines.

Scoring Procedure

Before each measure was scored, EMERGE team members determined the primary citation for the measure (i.e., the seminal publication for the measure or the main measure validation publication). If the primary citation did not include the complete measure items, a second reference was sought to gather non-psychometric measure details (e.g., items).

In order to competently score measures, EMERGE team members were trained on how to extract psychometric details from publications for evaluation purposes. The training period lasted 3 to 4 weeks and involved the simultaneous scoring of sample measures. During the training period, team members participated in regular meetings to discuss questions or discrepancies. Even after trained members were allowed to extract and score without supervision, weekly meetings were mandated.

The steps and decisions used to score or allocate points to measure publications are described in Figures 1-3. To determine whether specific statistics would receive “full” or “partial” points for scoring purposes (see “Scoring Rubric”), adequacy guidelines were referenced. These guidelines were developed a priori according to the literature (see the Adequacy Guidelines section for additional information).

Following the initial scoring by a trained team member, the extracted information and scoring decisions were checked by a second trained team member. At times, the checking process resulted in changes involving the removal or addition of content. In instances where changes led to different conclusions regarding the adequacy of a statistic or the assessment of a type of reliability, validity, or formative research, mediation between team members was required. If consensus could not be reached, input from a more senior scholar was sought.

Scoring Rubric

The rubric used to score the psychometric soundness of published measures is displayed below and matches the point allocation presented in Figures 1-3. The summation of points earned over the total possible points resulted in the “Score Total.” The total possible points varied from 7 points to 10 points. For example, the possible points were 9 if one of inter-rater reliability, criterion validity, or internal reliability was not applicable, 8 if two of them were not applicable, and 7 if all three were not applicable. Final score totals were converted to percent values to determine the score class: “Low,” “Med” (Medium), “High,” or “No Data.”

Score classes were divided into thirds: “Low” for percent values less than or equal to 33.33%, “Med” for values greater than 33.33% and less than 66.67%, and “High” for values of 66.67% or above. Measures that could not be scored (i.e., those with only an alpha value or no psychometric information) were classified as “No Data.”
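The rule of thirds above can be sketched as a small Python function (the function name and the exact boundary handling are assumptions of this sketch, following the thresholds stated above):

```python
def score_class(points_earned, points_possible):
    """Convert an EMERGE score total to a score class.

    Thresholds follow the rule of thirds described above:
    <= 33.33% -> Low, > 33.33% and < 66.67% -> Med, >= 66.67% -> High.
    A measure with no scorable information has no possible points
    and is classified as No Data.
    """
    if points_possible == 0:
        return "No Data"
    pct = 100.0 * points_earned / points_possible
    if pct >= 66.67:
        return "High"
    if pct > 33.33:
        return "Med"
    return "Low"
```

For example, a measure earning 7 of 10 possible points scores 70% and falls in the “High” class.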

Preliminary Measure Development:
Formative Research and theory to develop items (max: 1.5pt)
____   Presence of qualitative research (full points = 1pt)
____   Mention of existing literature, theoretical framework (full points = 1/2pt)
Expert Input on developed items (max: 1.5pt)
____   Field expert input (full points = 1/2pt)
____   Cognitive Interviews/pilot testing (full points = 1pt)

Formal Assessment of Psychometric Properties:
Reliability (max: 3pts)
____   Internal reliability (full points = 1pt; partial points = 1/2pt)
____   Test-retest/parallel testing (full points = 1pt; partial points = 1/2pt)
____   Inter-rater reliability (full points = 1pt; partial points= 1/2pt)
Validity (max: 4pts)
____   Content (face) validity by the study scholars (full points = 1/2pt)
____   Face validity by GEH team member (full points = 1/2pt)
____   Criterion validity (full points = 1pt; partial points = 1/2pt)
____   Construct validity (full points = 2pts; partial points = 1pt)

Score Total = ____________________________________________________
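The rubric arithmetic can be illustrated with a short Python sketch. The item keys, and the convention of using None to mark a not-applicable item (which removes it from the possible-points denominator, as described under “Scoring Rubric”), are assumptions of this sketch:

```python
# Per-item maximum points, matching the rubric above.
RUBRIC_MAX = {
    "qualitative_research": 1.0,
    "literature_theory": 0.5,
    "field_expert_input": 0.5,
    "cognitive_interviews": 1.0,
    "internal_reliability": 1.0,
    "test_retest": 1.0,
    "inter_rater_reliability": 1.0,
    "content_face_by_authors": 0.5,
    "face_by_geh_team": 0.5,
    "criterion_validity": 1.0,
    "construct_validity": 2.0,
}  # totals 10 points when every item is applicable

def score_total(awarded):
    """Sum points earned over applicable items.

    `awarded` maps each rubric item to the points earned, or to
    None when the item is not applicable; not-applicable items are
    excluded from both the earned and possible totals.
    """
    earned = sum(p for p in awarded.values() if p is not None)
    possible = sum(RUBRIC_MAX[k] for k, p in awarded.items() if p is not None)
    return earned, possible
```

A measure for which inter-rater reliability is not applicable would thus be scored out of 9 possible points rather than 10.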

Accessing Measure Scoring Details

To access a measure’s scoring details on this site, click the Psychometric Score button indicating its score class (i.e., “High”, “Med” (Medium), “Low”, or “No Data”).

This will generate a pop-up with more specific scoring information on each of the formative research, reliability, and validity aspects included in the final score.

The colored bullets indicate whether an aspect received full points (green), partial points (yellow), was not assessed (grey), or was not applicable (black). The color key and the score class, as well as the points earned and possible points, are located on the right-hand side of the pop-up.



Adequacy Guidelines

To determine the adequacy of reported reliability or validity statistics, we relied on benchmarks or cut-points used in the literature or by general consensus. They include:

  • Alpha, KR-20 statistic, Omega: 0.6 or higher = Adequate; Less than 0.6 = Limited (Henson, 2001; Nunnally, 1967)
  • ICC, Kappa, weighted Kappa: 0.6 or higher = Adequate; Less than 0.6 = Limited (Cicchetti, 1994)
  • Variance explained by EFA, PCF, PCA factor analytic methods: 5% or higher = Adequate; less than 5% = Limited
  • Correlation coefficient: ± 0.3 or higher = Adequate; less than ± 0.3 = Limited (Cohen, 1988; Hemphill, 2003)
  • P-value: < 0.05 = Adequate; ≥ 0.05 = Limited

In the presence of multiple statistics, a majority rule was applied: “A” (adequate) was denoted if more than half of the reported statistics were of adequate value, and “L” (limited) was similarly denoted if more than half were of limited value. Though a delineation between adequate (A) and limited (L) values was necessary, it is important to recognize that other guidelines exist in the literature depending on the discipline and the measure’s purpose. For statistics less amenable to cut-points, such as the results of a CFA model, adequacy was determined by an experienced scholar. For scoring purposes, “A” values received “full points” and “L” values received “partial points” (details found under “Scoring Rubric”).
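The cut-point check and majority rule can be sketched in Python. The function name, the tie handling (returning None for escalation to a senior scholar), and the `higher_is_adequate` flag for statistics such as p-values are assumptions of this sketch:

```python
def adequacy(values, cutoff=0.6, higher_is_adequate=True):
    """Label each statistic 'A' or 'L' against a cut-point, then
    apply the majority rule: 'A' if more than half the statistics
    are adequate, 'L' if more than half are limited. Exact ties
    return None, standing in for escalation to a senior scholar.
    """
    labels = []
    for v in values:
        ok = v >= cutoff if higher_is_adequate else v < cutoff
        labels.append("A" if ok else "L")
    n = len(labels)
    if labels.count("A") > n / 2:
        return "A"
    if labels.count("L") > n / 2:
        return "L"
    return None
```

For instance, three reported alphas of 0.70, 0.80, and 0.50 yield labels A, A, L, so the majority rule denotes “A”; p-values would instead use `cutoff=0.05` with `higher_is_adequate=False`.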


Citation Frequency

Citation frequency uses information provided by Google Scholar for the measure’s primary citation. To determine the score class as “Low”, “Med” (Medium), “High”, or “No Data”, the following guidelines were used: fewer than 20 citations (“Low”), 20 to 49 citations (“Med”), 50 or more citations (“High”), and no Google Scholar citation record (“No Data”). This information is updated annually by team members.
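The citation-count thresholds above map directly to a small Python function (the function name, and using None for a missing Google Scholar record, are assumptions of this sketch):

```python
def citation_class(citations):
    """Map a Google Scholar citation count to a score class,
    following the guidelines above. `citations` is None when the
    primary citation has no Google Scholar record.
    """
    if citations is None:
        return "No Data"
    if citations >= 50:
        return "High"
    if citations >= 20:
        return "Med"
    return "Low"
```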

On this site, to access the publications (if any) that have cited the measure’s primary citation, click the Citation Frequency button. This will redirect you to Google Scholar, which lists the works citing the measure’s primary citation.