EAM2018 - Abstracts

Table of Contents

Tuesday, July 24

Wednesday, July 25

Thursday, July 26

Friday, July 27


Tuesday, July 24

09:00

workshop

Hypothesis evaluation using the Bayes factor Part 1

Herbert Hoijtink

09:00 - 13:00

SR1

workshop

Multilevel Structural Equation Modeling with Lavaan

Yves Rosseel

09:00 - 13:00

SR2

workshop

Intensive Longitudinal Data Analysis / DSEM

Noémi Schuurman

09:00 - 13:00

SR3

14:00

workshop

Hypothesis evaluation using the Bayes factor Part 2

Herbert Hoijtink

14:00 - 18:00

SR1

workshop

Understanding SEM: Where do all the numbers come from?

Yves Rosseel

14:00 - 18:00

SR2

workshop

Survival a.k.a. Time-to-Event Analysis

Jan Beyersmann, Morten Moshagen

14:00 - 18:00

SR3

workshop

Theory and analysis of conditional and average causal effects

Rolf Steyer

14:00 - 18:00

HS E028


19:00

event

Welcome Reception

Cafeteria Zur Rosen


19:00 - 22:00

Wednesday, July 25

09:00

opening

Opening Ceremony

09:00 - 10:00

Saal Friedrich Schiller

10:15

keynote

The Statistics of Replication

Heinz Holling

10:15 - 11:00

Saal Friedrich Schiller

11:00

coffee

Coffee Break

11:00 - 11:30

11:30

session

Causal Inference in Dynamic Models

Christian Gische

11:30 - 12:30

Saal Friedrich Schiller

Talk

Issues in Causality in Discrete-Time and Continuous-Time Stochastic Process Models

Keywords:
Causality
Stochastic Process Models
Structural Equation Modeling

It is a classical experimental paradigm to manipulate manifest treatment conditions ($X$) assumed to have an effect on a latent theoretical variable ($\eta_1$), which in turn affects a manifest or latent outcome variable ($\eta_2$). With this paradigm, researchers intend to test the theory that $\eta_1$ has a causal effect on $\eta_2$. To this end, they estimate the direct effect of $X$ on $\eta_1$ (manipulation check) and erroneously conclude that, given a successful manipulation check, a significant total effect of $X$ on $\eta_2$ can be interpreted as evidence for their originally stated hypothesis that $\eta_1$ affects $\eta_2$.
Even if the theoretical constructs considered in a case like this can be represented in a model with only the three random variables $X$, $\eta_1$ and $\eta_2$, a causal interpretation of the putative effects is only possible if the direct effect of $X$ on $\eta_2$ is zero and all relevant pre-treatment variables are taken into account. Controlling for all pre-treatment variables when estimating direct and indirect effects is necessary even in a randomized experiment, as Mayer et al. (2016) have shown.
Looking at substantive psychological theories in more detail, however, it seems that a theoretical construct can often not be defined as a single random variable (such as $\eta_1$ or $\eta_2$), but rather as a stochastic process with latent variables $\eta_{it}, t \in T$. These stochastic processes might not even be time-discrete, as assumed in cross-lagged panel models and multivariate autoregressive processes. Instead, for many psychological constructs they have to be conceptualised as continuous-time processes. This raises new questions concerning causality in continuous-time processes, some of which are addressed in this talk.

Talk

An Interventionist Approach to Causal Inference Based on Panel Data

Keywords:
Causal Inference
Longitudinal Data
Structural Equation Modeling

During the last two decades, a comprehensive theory of causal inference based on directed acyclic graphs (DAGs) has been developed (Pearl, 2000). It is well known that recursive structural equation models can be represented as DAGs and can thus be analyzed using Pearl’s framework of causal inference. Despite the crucial role of time for the study of causal effects, however, surprisingly little attention has been paid to integrating longitudinal models of change into Pearl’s approach to causal inference. In this presentation, we apply Pearl’s general causal effect formula to the bivariate autoregressive cross-lagged panel model that incorporates variation across individuals (i=1,…,N) and over time (t=1,…,T) and allows for unobserved heterogeneity across individuals.
We apply existing sufficient criteria for causal identification (e.g. the Back-Door criterion) to the above model and show that certain causal effects of interest can be identified through adjustment with respect to unobserved heterogeneity. The suggested non-parametric procedures for estimating the identified causal effects rely on the assumption that all adjustment variables are observable. These procedures are therefore not feasible in the presence of unobserved heterogeneity.
In a next step we turn to existing parametric estimation techniques based on panel data (e.g. fixed-effects, random-effects or likelihood based estimators) and analyze their feasibility for the estimation of causal effects. We explicitly state assumptions under which these parametric techniques yield consistent estimates of the parameters of the interventional distribution. Based on the interventional distribution causal effect quantities can be computed and interval forecasts of effects of interventions can be calculated. We conclude by comparing the expected consequences of actively manipulating a putative cause variable with the consequences of passively observing changes in a putative cause variable.
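The adjustment logic behind the back-door criterion can be sketched numerically. The following is a toy discrete example with made-up probabilities, not the panel model above: $P(y \mid do(x)) = \sum_z P(y \mid x, z)\,P(z)$, whereas the observational conditional weights $z$ by $P(z \mid x)$.

```python
# Toy discrete example of back-door adjustment (made-up probabilities):
# Z confounds both X and Y, so P(y | do(x)) averages P(y | x, z) over the
# marginal P(z), while the observational P(y | x) averages over P(z | x).

P_Z1 = 0.5                                  # P(Z = 1)
P_X1_given_Z = {0: 0.2, 1: 0.8}             # P(X = 1 | Z = z)
P_Y1_given_XZ = {(0, 0): 0.1, (0, 1): 0.4,  # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.5, (1, 1): 0.8}

def p_z(z):
    return P_Z1 if z == 1 else 1.0 - P_Z1

def p_x_given_z(x, z):
    return P_X1_given_Z[z] if x == 1 else 1.0 - P_X1_given_Z[z]

def naive_conditional(x):
    """Observational P(Y = 1 | X = x), confounded by Z."""
    p_x = sum(p_x_given_z(x, z) * p_z(z) for z in (0, 1))
    return sum(P_Y1_given_XZ[(x, z)] * p_x_given_z(x, z) * p_z(z)
               for z in (0, 1)) / p_x

def backdoor_adjusted(x):
    """Interventional P(Y = 1 | do(X = x)) via back-door adjustment."""
    return sum(P_Y1_given_XZ[(x, z)] * p_z(z) for z in (0, 1))

print(naive_conditional(1))   # ~0.74: association inflated by confounding
print(backdoor_adjusted(1))   # ~0.65: effect of actively setting X = 1
```

The gap between the two quantities is exactly the contrast the talk draws between passively observing and actively manipulating the putative cause.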

Talk

A General Nonlinear Model for the Identification of Mediators Without the No Confounder Assumption

Keywords:
Causal Inference
Mediator Variables
Nonlinear
Structural Mean Models

The reliable identification of mediator variables that transmit an effect of an intervention to an outcome has been an ongoing problem in the social and behavioral sciences. Most methods developed so far rely on the assumption of no unobserved confounders, meaning that the model under investigation includes all relevant covariates. This assumption has been criticized repeatedly, and it has been shown that mediators identified under this assumption often produce spurious results, because even minimal misspecification may lead to the identification of artificial mediator variables. One class of alternative models are structural mean models (SMM; Ten Have et al., 2007; Zhang & Zou, 2015) that do not rely on the no-confounder assumption. These models provide unbiased results even if relevant covariates are not included in the model, and they have been extended to include interactions between intervention, mediators and covariates. However, these models suffer from two major problems. First, they are very inefficient compared to traditional mediator models and provide very low power in typical applied settings. Second, the estimation of interaction effects is often impossible due to multicollinearity problems in the estimation routine. In this talk, I will present a nonlinear semiparametric extension of the SMM that substantially increases its efficiency and overcomes the multicollinearity problem. Simulation studies show that the extended model yields estimates that are two to three times more accurate (and more efficient) than those of the original SMM for direct and indirect effects as well as for interaction effects included in the model. Further, the results reveal that the model is robust against violations of its model assumptions. Its application is illustrated using an empirical data set, and further extensions and applicability are discussed.

session

Item Response Theory

Frans Kamphuis

11:30 - 12:50

Salon Schlegel

Talk

A State-Space Approach for Student Growth Percentile Estimation

Keywords:
Multidimensional Item Response Theory
State Space
Student Growth Percentile Estimation
Value-added

Betebenner (2009) introduced student growth percentiles (SGP) for norm- and criterion-referenced student growth. Each student's current test score is expressed as a percentile rank in the distribution of current test scores among students who had the same past test scores. In this approach, a vertical scale is not necessary.
Recently, Lockwood and Castellano (2015) suggested two alternative approaches for estimating SGP: one based on modelling the conditional cumulative distribution functions, the other based on multidimensional item response theory. The latter approach can overcome problems with measurement error in both the past and current test scores.
Kamphuis and Moelands (2000) formulated a state-space approach in which the measurement problem is separated from the structural model. This framework can accommodate many measurement models, such as IRT and CTT, and the structural part can include many domains, occasions and background information.
In the talk, the methodological aspects and the validity issues related to monitoring growth are discussed, emphasizing the need for a frame of reference for the interpretation of growth. A student monitoring system for secondary schools in Kazakhstan will be presented as an illustrative example.

Talk

Psychometric Evaluation of the d2 Test of Sustained Attention With the Rasch Poisson Counts Model

Keywords:
Rasch Poisson Counts Model
Sustained Attention
d2 Test

The d2 Test is a cancellation test of attention and concentration in which respondents have to cross out target stimuli among similar nontarget stimuli (Brickenkamp & Zillmer, 1998).
The target stimulus is a “d” with two dots above or below it. The targets are randomly interspersed among nontarget characters. The nontarget characters are d’s with one, three or four dots above or below them and p’s with one, two, three, or four dots above or below them.
The target and nontarget characters are presented in 14 consecutive lines, and a separate time limit of 20 seconds is allotted to each line. The construct validity of the test has been investigated with classical methods of factor analysis and criterion measures, but to the best of our knowledge, no study so far has examined the fit of the d2 Test to IRT models. In this study, the fit of the test to the Rasch Poisson Counts Model (RPCM) is examined. The structure of the test (a combination of 14 lines of stimuli, each with a separate time limit) makes it an ideal candidate for RPCM scaling. We investigate the overall fit of the d2 Test to the RPCM, the fit of the individual items (lines), and the reliability of the test.
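The RPCM structure lends itself to a quick numerical sketch. The following is a hypothetical simulation, not the d2 data: if the count for person $p$ on line $i$ is Poisson with rate $\exp(\theta_p + \beta_i)$, then the centered log item means approximately recover the (centered) item parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Rasch Poisson Counts Model: 500 persons, 14 "lines" (items).
# Count x_pi ~ Poisson(exp(theta_p + beta_i)).
P, I = 500, 14
theta = rng.normal(0.0, 0.5, size=P)      # person ability
beta = np.linspace(-0.5, 0.5, num=I)      # item easiness (centered)
counts = rng.poisson(np.exp(theta[:, None] + beta[None, :]))

# Simple moment estimator: log column means share a constant offset,
# log(mean_p exp(theta_p)), which centering removes.
beta_hat = np.log(counts.mean(axis=0))
beta_hat -= beta_hat.mean()

print(np.corrcoef(beta, beta_hat)[0, 1])  # close to 1
```

In practice the RPCM would be fitted by (conditional or marginal) maximum likelihood rather than this moment shortcut, but the sketch shows how the multiplicative person-by-item structure generates the line counts.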

Talk

Forecasting Clinical Outcomes by Combining Measurement and Prediction Models for Health Evaluations

Keywords:
Diagnostic Accuracy
Measurement
Prediction
Questionnaires

Both in medical research and clinical practice, questionnaire-based assessments are used to obtain information about the physical, mental, and social well-being as experienced by patients. Commonly, measurement models, such as those based on Item Response Theory (IRT), are employed to meaningfully reduce a patient’s item scores to a single test score (typically denoted $\theta$) or a set of test scores. Such scores are mostly used to evaluate and monitor the patient, and to provide feedback on his or her status. For such purposes, measurement precision is key, because the scores should accurately represent the patient’s attributes. However, tests are also used for predictive purposes, such as forecasting a future health state or a diagnosis based on the gold standard. It has been shown that in test design a trade-off exists between measurement and prediction, and thus that accurate measurement is not a prerequisite for good prediction (e.g., Smits et al., 2017). This raises several questions concerning the use of test data for prediction purposes.
In the present study, three questions are answered: (1) Is a single test score sufficient when the purpose of the test is prediction, or should the individual item scores be used? (2) How should multiple test scores be combined for optimal prediction? (3) How should measurement errors of test scores under an IRT model be incorporated into a prediction model to obtain sound estimates of predictive power?
To answer these questions an illustrative data file is used consisting of 735 patients with scores on 42 PROMIS Pain Quality items, and 26 PROMIS Affective Pain items. The patients either did or did not have a clinical condition associated with neuropathic pain; this outcome is considered the gold standard and used as target variable for prediction.

Talk

Exchanging Selection Rules from Cognitive Diagnosis Modeling to Traditional Item Response Theory

Keywords:
Cognitive Diagnosis Modeling
Computerized Adaptive Testing
Global Discrimination Index
Item Response Theory
Item Selection Rules
Kullback–Leibler Divergence

Nowadays, there are two predominant approaches in adaptive testing. One is based on cognitive diagnosis models and is referred to as cognitive diagnosis computerized adaptive testing (CD-CAT), and the other one is the traditional CAT based on item response theory. The present study evaluates the performance of two item selection rules (ISRs) originally developed in the CD-CAT framework, the double Kullback-Leibler (DKL) and the global discrimination index (GDI), in the context of traditional CAT. The accuracy and test security associated with these two ISRs are compared to those of the point Fisher information and likelihood-weighted KL using a simulation study. Five dependent variables were evaluated: bias and root mean square error relative to the measurement accuracy; overlap rate as indicator of test security; mean values of the a and c parameters administered, with the aim of analyzing the kind of item that tends to be selected by each ISR; and the correlation between the item exposure rates for each pair of ISRs, as indicative of the convergence between them. The impact of the trait level estimation method was also investigated. Maximum likelihood (ML) and expected a posteriori (EAP) estimation methods were compared. The results of the study show that the new ISRs can be used to improve the accuracy of CAT with fewer items administered, particularly in the case of DKL. This is of major importance in contexts (e.g., educational, medical) where testing time is always an issue. In addition, both rules selected a different set of items: with DKL, items with the highest $a$ parameter were administered, whereas items with the lowest $c$ parameter were administered with GDI. Regarding the trait level estimation methods, we found that EAP was generally better in the first stages of the CAT, and converged with ML when a medium to a large number of items was administered. Several implications and possible future directions are provided.
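The point Fisher information baseline mentioned above can be illustrated as follows (a minimal sketch with a hypothetical item bank; this is not the study's simulation code). For the 3PL model, $I(\theta) = a^2 \, (q/p) \, ((p-c)/(1-c))^2$ with $q = 1 - p$, and the next item is the one maximizing $I(\hat{\theta})$ at the provisional trait estimate.

```python
import numpy as np

def p3pl(theta, a, b, c):
    """3PL response probability."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Fisher information of a 3PL item at theta (Lord, 1980)."""
    p = p3pl(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c))**2

# Hypothetical 20-item bank with parameters a, b, c.
rng = np.random.default_rng(7)
a = rng.uniform(0.8, 2.0, size=20)
b = rng.uniform(-2.0, 2.0, size=20)
c = rng.uniform(0.0, 0.25, size=20)

theta_hat = 0.3                    # current provisional trait estimate
info = fisher_info(theta_hat, a, b, c)
next_item = int(np.argmax(info))   # point-Fisher-information selection
print(next_item, info[next_item])
```

A full CAT would additionally update $\hat{\theta}$ (e.g., by ML or EAP) after each response and exclude already administered items; the KL-based rules compared in the talk replace the information criterion with divergence-based indices.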

session

Applied Statistics

Patricia Martinkova

11:30 - 12:30

Salon Hölderlin

Talk

Model-Based Reliability to Check for Disparities in Ratings of Internal and External Applicants

Keywords:
Generalizability
Linear Mixed Models
Reliability
Variance Decomposition

In this work, we address disparities in ratings of internal and external applicants. We develop a model-based inter-rater reliability (IRR) estimate to account for various sources of measurement error, their hierarchical structure, and the presence of covariates, such as the assessed status, that have the potential to moderate IRR. Using a dataset of ratings of applicants to teaching positions in the Spokane district in Washington, USA, we first test for bias in ratings of applicants external to the district, which is shown to be significant even after including various measures of teacher quality in the model. Moreover, with model-based IRR, we show that consistency between raters is significantly lower when rating external applicants. We further address how IRR affects the predictive power of measurement in different scenarios and conclude by discussing policy implications and applications of our model-based IRR estimate for teacher hiring practices.

Talk

Questionable Research Practices in Student Theses - Prevalence, Antecedents and Implications

Keywords:
Questionable Research Practices

Although questionable research practices (QRPs; e.g. Fiedler & Schwarz, 2016) have received more attention in recent years, little research has focused on their prevalence and acceptance in student samples. Students are the researchers of the future and will also represent the field outside of academia. Therefore, it is vital to establish that they are not learning to use and accept QRPs, which would negatively impact their ability to produce and evaluate meaningful research. 207 psychology students and recent graduates provided data on the prevalence and predictors of QRPs. We focused on attitudes towards QRPs, beliefs about whether significant results constitute better science or lead to better grades, and motivation and stress levels as predictors. Furthermore, we assessed perceived supervisor attitudes towards QRPs as an index of perceived normativeness and an important factor influencing students' QRP attitudes and use. The results were generally in line with Fiedler and Schwarz's (2016) estimates of QRP prevalence. The best predictor of QRP use was students' QRP attitudes, although perceived supervisor attitudes exerted both a direct effect and an indirect effect via student attitudes. Motivation to write a good thesis was a protective factor, whereas stress had no effect. Students did not subscribe to the belief that significant results were better for science or their grades, which may explain why these beliefs did not impact their QRP attitudes or use. Finally, students engaged more in QRPs pertaining to reporting and analysis than in those pertaining to study design. These results imply that supervisors have an important function in shaping students' attitudes towards QRPs and an opportunity to further improve students' research practices by motivating them well. Furthermore, this research provides some impetus towards identifying predictors of QRP use in academia.

Talk

Impact of Formal Educational Upgrading on the Likelihood of Leaving Unemployment

Keywords:
Discrete Choice
Hartz IV
Marginal Effects
Panel
Structured Education
Unemployment

Using the adult cohort of the National Educational Panel Study (NEPS), we analyze the probability of leaving short- or long-term unemployment, taking into account structural educational background, complete professional history and demographic characteristics. Our discrete choice model is based on human capital theory, including life-long learning and professional experience. The results show that professional history, the age of an individual at the time of unemployment, and educational background are significant predictors of leaving unemployment. The job finding probabilities are driven by general and model-specific explanatory factors in the case of short- or long-term unemployment.

session

Replication Crisis

Jörg Blasius

11:30 - 12:10

Salon Novalis

Talk

Fabrication of Interview Data in PISA and PIAAC

Keywords:
Fabricated Survey Data
Large-Scale Data
Questionable Research Practices

The quality of survey data is a function of the three levels of actors involved in survey research projects: the respondents, the interviewers, and the employees of the survey research organizations. I argue that task simplification dynamics can occur at each of these levels, resulting in reduced data quality. The precise form of task simplification differs at each of the three levels. Respondents might utilize only specific parts of the available response options; interviewers might ask the demographic questions and some basic information only and then fabricate plausible responses for the remainder; employees of research institutes might near-duplicate entire questionnaires. I will use data from the Programme for International Student Assessment (PISA) 2012 and the Programme for the International Assessment of Adult Competencies (PIAAC) to document various task simplification techniques performed at each of these levels, and I propose a new statistical method to discover interviewer falsifications throughout the fieldwork period.

Talk

Transparency and Replicability in Cross-National Survey Research

Keywords:
Crossnational Survey Research
Replication
Secondary Analysis
Transparency

This paper offers insights into the level of transparency and replicability of cross-national survey research. The first contribution is theoretical: we provide an overview of the current measures taken to achieve research transparency in cross-national survey studies by developing a heuristic theoretical model of the actors, factors and processes that influence the level of transparency of an academic article. The second contribution is empirical and concerns our dependent variable, article transparency. Specifically, using a random sample of 305 comparative studies published in one of 29 peer-reviewed Social Sciences journals (1986-2016), we show that most articles do not provide the empirical information independent researchers would need to evaluate the validity and reliability of a study’s findings or to perform a direct replication. Additionally, we develop and propose a set of transparency guidelines tailored for reporting cross-national survey research.

13:00

lunch

Lunch Break

13:00 - 14:00

14:00

poster

Poster Session 1

14:00 - 15:30

15:30

state-of-the-art

Using incidental data for serious social research

N.N.

15:30 - 16:00

Saal Friedrich Schiller

state-of-the-art

Tests and Testing: Current trends and future challenges

N.N.

15:30 - 16:00

Salon Schlegel

16:00

coffee

Coffee Break

16:00 - 16:30

16:30

session

Latent Variable Analysis

Tobias Koch

16:30 - 17:50

Saal Friedrich Schiller

Talk

Analyzing Different Types of Moderated Method Effects in Confirmatory Factor Models for Structurally Different Methods

Keywords:
Latent Variable Analysis
Structural Equation Modeling

In this talk, we present two confirmatory factor models for multimethod designs with structurally different methods for the analysis of latent moderations: the nonlinear latent difference (NL-LD) model and the nonlinear correlated trait–correlated method minus-one (NL-CTC[M-1]) model. We explain how different moderated method effects can be examined in the NL-CTC(M-1) model and why the classical NL-LD model does not permit this fine-grained analysis of method effects. To fully recover the results of the NL-CTC(M-1) model, we propose an extended version of the NL-LD model. The different versions of the nonlinear multimethod models are compared with regard to the psychometric definition and meaning of the latent moderated method effects and are illustrated using real data from a multirater study. Finally, the advantages and challenges of incorporating latent interaction effects in multimethod confirmatory factor models are discussed.

Talk

Repeated Measures ANOVA with Latent Variables Using the Latent Growth Component Approach


In this presentation, we introduce a way of testing hypotheses of interest in repeated measures designs with latent variables using structural equation modeling (SEM). Traditionally, such designs are analyzed using repeated measures analysis of variance (repeated measures ANOVA). A limitation of repeated measures ANOVA is that only manifest variables can be used. To overcome this limitation, we extend the traditional method to incorporate latent variables. We build on the latent growth components approach (Mayer, Steyer & Mueller, 2012) for this purpose. The latent growth approach is a flexible method that can be used to define latent effect variables in a structural equation modeling framework. We propose to use a comprehensive structural equation model that is specified in three steps. In the first step, a measurement model is formulated for the latent dependent variables. In the second step, a contrast matrix is used to decompose the latent dependent variables into several latent effect variables (e.g., latent difference score variables) that represent main effects and interaction terms. The matrix used to decompose the latent variables corresponds to the transformation matrices that appear in multivariate general linear hypotheses. In the third step, the structural coefficients for the SEM are derived by inverting the contrast matrix. All parameters of the resulting structural equation model are estimated simultaneously. This approach allows us to directly examine main effects and interaction terms by testing the means of the latent effect variables using Wald tests.
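The contrast-matrix step can be sketched numerically. The following is a hypothetical 2x2 within-subjects example in Python; the matrix values are illustrative, not taken from the talk:

```python
import numpy as np

# For latent outcomes (eta_11, eta_12, eta_21, eta_22) of a hypothetical
# 2x2 repeated measures design, a contrast matrix C maps them to a grand
# mean, two main effects, and an interaction; inverting C gives the
# loadings of the latent effect variables in the structural model.
C = np.array([
    [0.25, 0.25, 0.25, 0.25],   # grand mean (intercept)
    [0.5,  0.5, -0.5, -0.5],    # main effect of factor A
    [0.5, -0.5,  0.5, -0.5],    # main effect of factor B
    [1.0, -1.0, -1.0,  1.0],    # A x B interaction
])

L = np.linalg.inv(C)            # structural loadings: eta = L @ effects

# Round trip on hypothetical latent means:
mu_eta = np.array([2.0, 1.0, 1.5, 0.5])
effects = C @ mu_eta            # means of the latent effect variables
print(effects)                  # grand mean, A effect, B effect, A x B
```

Wald tests on the means of the effect variables (the entries of `effects` here, estimated as latent means in the SEM) then replace the omnibus F tests of repeated measures ANOVA.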

Talk

DIF of Self-Assessment Items Across Different Levels of a Latent Variable: Positive Affect

Keywords:
Differential Item Functioning
Latent Variables
Multiple Indicators and Multiple Causes Models
Structural Equation Modeling

Differential Item Functioning (DIF) is said to occur when an item performs differently for one group of a population compared to another group, after controlling for the construct being measured and for possible construct differences between the groups (Holland, 1993). In other words, the expected item responses, after controlling for the latent trait, depend on group membership. However, apart from group membership (for example, sex or culture), DIF may be attributed to a “quantitative” variable (for example, age or impulsiveness). Typically, in these cases the variables have been categorized in order to test for DIF (e.g. Fleishman, Spector & Altman, 2002), with a corresponding loss of information. MIMIC models have been used to test for uniform DIF by regressing item responses on the grouping variable, after allowing for mean differences across groups (Woods, 2009). In the present study, we propose that DIF across different levels of a latent construct can also be assessed by means of MIMIC models. Specifically, considering that there is some evidence that positive affect increases the perception of self-evaluative constructs such as self-efficacy (e.g. Medrano, Flores-Kanter, Moretti & Pereno, 2016), we assess whether item responses to a Core-Self Evaluation questionnaire (CSE; Judge, Erez, Bono, & Thoresen, 2005) show DIF across different levels of positive affect, when controlling for the relation between the two latent variables (CSE and positive affect). We fitted a MIMIC model in a sample of 503 participants by means of SEM and regressed CSE item responses on the two latent variables and the interaction between them. Results show that, after controlling for the latent CSE construct, positive affect and the interaction significantly predicted CSE item responses. As expected, positive affect positively biased the CSE assessments. Interestingly, this bias was not constant across all levels of CSE.

Talk

Parameter Associations in Bivariate Dual Change Score Models: Implications for Simulation Studies

Keywords:
Latent Change
Longitudinal Data
Simulation
Structural Equation Modeling

Latent change score models are longitudinal structural equation models that can be used to concurrently investigate growth over time and dynamic relations between two variables (known as bivariate dual change score [BDCS] models). These models are useful instruments for researchers studying development, and as such their use in the social sciences has increased in recent years (Ferrer & McArdle, 2010), particularly in the case of the BDCS model. Methodological researchers have used simulation studies to examine the complexities of these models using a variety of procedures for parameter selection and data generation (Grimm, 2006; Hamagami & McArdle, 2001; O’Rourke, 2016; Prindle & McArdle, 2014; Usami, 2014; Usami, Hayes, & McArdle, 2016; Voelkle & Oud, 2015). For simulation work with the BDCS model, however, there is currently no standardized procedure for selecting parameters and producing data trajectories that appropriately mimic those seen in real-world data. In this study, we first review the current simulation work on BDCS models and describe the parameter selection procedures that have been used. Many studies either use parameters based on published studies with BDCS models or use arbitrary criteria for parameter selection. We then describe the mean and covariance expectations for BDCS models (Grimm & McArdle, 2005) in the context of their usefulness for parameter selection and data generation. Finally, we propose a parameter selection procedure that retains the unique associations among the parameters of the BDCS model and produces trajectories that mimic those found in empirical work.
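For reference, one common parameterization of the dual change score structure (following McArdle & Hamagami, 2001; generic symbols, not necessarily the authors' notation) writes $y_t = y_{t-1} + \Delta y_t$ with coupled latent changes $\Delta y_t = \alpha_y \, s_y + \beta_y \, y_{t-1} + \gamma_{yx} \, x_{t-1}$ and $\Delta x_t = \alpha_x \, s_x + \beta_x \, x_{t-1} + \gamma_{xy} \, y_{t-1}$, where $s_y$ and $s_x$ are constant-slope factors, $\beta_y$ and $\beta_x$ are proportional (self-feedback) parameters, and $\gamma_{yx}$ and $\gamma_{xy}$ are the coupling parameters whose signs and magnitudes encode the dynamic relations between the two variables. The interdependence of these parameters in the implied mean and covariance trajectories is what makes arbitrary, one-at-a-time parameter selection problematic in simulation work.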

session

Multilevel Analysis

Johannes Hartig

16:30 - 17:50

Salon Schlegel

Talk

Estimation of Random Group DIF Using Two- and Three-Level GLMMs

Keywords:
Differential Item Functioning
Invariance
Item Response Theory
Multilevel Analysis

Of critical importance to education policy is the monitoring of trends in education over time. Developing optimal predictive models allows researchers and policy makers to assess cross-country progress and forecasts toward that goal. The purpose of this project is to apply Bayesian model averaging to cross-country growth regressions in education achievement using data from TIMSS (Trends in International Mathematics and Science Study).
Whereas it is common practice to select one particular model from a set of models based on data fit, model averaging utilizes relevant information from the whole set of models and thereby takes model uncertainty into account. Bayesian model averaging has been applied to a wide variety of content domains in economics, bioinformatics, weather forecasting, causal inference within propensity score analysis, and structural equation modeling. However, applications to education data have not been common, because trend data with a sufficient number of data points were not available. This study utilizes data from TIMSS, an international assessment of mathematics and science at the fourth and eighth grades. Twenty years of data collection allow for analyzing changes over time at the country level in order to predict future developments of student achievement as well as gender gaps.
We expect the Bayesian model averaging for growth curve models to produce more precise forecasts of student achievement and gender gaps than conventional approaches to forecasting. The study supports basic theoretical research on the problem of prediction in international large-scale assessment data and contributes to it by adding results from one of the largest international large-scale assessments in education. The unique contribution pertains in particular to the length of the survey and a fine grain of curriculum data crucial for policy analysis. It presents a novel approach to utilize large-scale assessment data for forecasting and policy analysis.
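A BIC-based approximation often used for Bayesian model averaging can be sketched as follows (a generic illustration with made-up BIC values and forecasts, not the TIMSS analysis): posterior model probabilities are approximated by $\exp(-\mathrm{BIC}_m/2)$, normalized over the model set, and forecasts are averaged with these weights.

```python
import math

def bma_weights(bics):
    """Approximate posterior model probabilities from BIC values:
    p(M_m | data) ~ exp(-BIC_m / 2) / sum_k exp(-BIC_k / 2)."""
    b0 = min(bics)                       # shift for numerical stability
    w = [math.exp(-(b - b0) / 2.0) for b in bics]
    s = sum(w)
    return [x / s for x in w]

def bma_forecast(forecasts, bics):
    """Model-averaged point forecast over a set of candidate models."""
    return sum(w * f for w, f in zip(bma_weights(bics), forecasts))

# Three hypothetical growth models with BICs and year-ahead forecasts:
print(bma_weights([1000.0, 1002.0, 1010.0]))
print(bma_forecast([520.0, 515.0, 530.0], [1000.0, 1002.0, 1010.0]))
```

The averaged forecast is pulled toward the best-fitting models but never discards the others entirely, which is the sense in which model uncertainty is retained.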

Talk

Multilevel Models for Evaluating the Effectiveness of Instruction: ANCOVA vs. Change-Score Approach

Keywords:
Conditional Model
Latent Change
Multilevel Modeling

In instructional research, the primary focus lies on identifying teacher behavior that positively influences relevant student outcomes. Analytical models in this field are typically complex, since (a) the involved constructs are measured using multiple indicators, (b) research questions on the classroom level require the application of multilevel models, and (c) the outcome variable should be observed at at least two time points in order to infer that classes develop differently under different forms of instruction. Various multilevel models can be applied to such research questions. Two prominent approaches are (1) covariance analytical approaches, in which the outcome at the second measurement occasion is regressed on the outcome at the first measurement occasion, and (2) latent change-score models, in which the change in the outcome between the two measurement occasions is modeled as an additional latent variable. Both approaches have been widely discussed regarding their differences and respective assumptions in models without a multilevel structure (Allison, 1990; Holland & Rubin, 1982; McArdle, 2009). The aim of this contribution is to apply them to the field of instructional research and to outline under which circumstances which model is more appropriate with respect to the underlying assumptions, and which inferences each allows. We use empirical examples to illustrate these differences. We further simulate various data sets in order to examine the influence of time-specific, stable and random error components of the measured variables on both levels of measurement.

Talk

The Optimal Design of Cluster Randomized Trials With Outcomes at Individual and Cluster Level

Keywords:
Multilevel Modeling
Optimal Design
Power

In cluster randomized trials, complete groups of subjects are randomized to treatment conditions. An example is a study of the effectiveness of neighbourhood-level interventions to improve quality of life in impoverished neighbourhoods. Here, residents are nested within neighbourhoods, and outcomes may be measured at the resident level (e.g. perceived safety) and at the neighbourhood level (e.g. crime index).
The optimal design determines the number of neighbourhoods and the number of residents per neighbourhood in the intervention and control conditions. It is found by taking a cost constraint into account: costs are associated with implementing the intervention or control in a neighbourhood and with taking measurements at the resident and neighbourhood levels. The optimal design is the one for which the effect of the intervention is estimated with the highest efficiency while the total costs do not exceed the available budget.
The design that is optimal for the outcome at the resident level is not necessarily optimal for the outcome at the neighbourhood level. Multiple-objective optimal designs are used to take both outcomes into account; the aim is to find a design that has high efficiency for both outcome measures. A Shiny app that can be used to find the optimal design is demonstrated.
The optimal design ensures that financial resources are used in the most efficient way and that the power for finding an effect of the intervention is maximized.
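For a single outcome, this cost-constrained trade-off has a well-known closed form (Raudenbush, 1997): with cluster-level cost c1, subject-level cost c2, and intraclass correlation ρ, the optimal cluster size is √(c1(1−ρ)/(c2ρ)). A minimal sketch of that single-outcome case (not the multiple-objective method of the talk; the costs, budget, and ICC below are made-up numbers):

```python
import math

def optimal_cluster_size(cost_cluster, cost_subject, icc):
    """Optimal subjects per cluster for a single outcome (Raudenbush, 1997)."""
    return math.sqrt(cost_cluster * (1 - icc) / (cost_subject * icc))

def clusters_within_budget(budget, cost_cluster, cost_subject, n_per_cluster):
    """How many clusters the budget allows at the chosen cluster size."""
    return int(budget // (cost_cluster + cost_subject * n_per_cluster))

# Hypothetical costs: 200 per neighbourhood, 10 per resident, ICC = 0.05.
n = optimal_cluster_size(200, 10, 0.05)            # about 19.5 residents
k = clusters_within_budget(20000, 200, 10, round(n))
```

With a higher ICC the optimal design shifts toward more, smaller clusters, which is the intuition behind the trade-off described above.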

Talk

Comparative Performance of Single Trial Multilevel Analyses of Event-Related Brain Potentials

Keywords:
Event-Related Potentials
Multilevel Analysis

Repeated measures ANOVA or MANOVA are frequently used for analyzing event-related brain potentials. They are typically performed on averaged trials as a way of increasing the reliability of the electroencephalogram signal. Averaging, however, leads to a loss of information concerning the covariance matrix of random individual differences in treatment and time effects, which could be of substantive interest. Lack of adequate specification of the covariance matrix has also been shown to lead to inferential biases. The objective of the present study is to compare the performance of this traditional approach with that of multilevel models, which allow for an explicit modeling of these random effects using single-trial measures. The numbers of stimuli and participants, as well as the magnitudes of the variance and covariance components of between-participant, treatment-by-participant, and time-by-participant effects, will be manipulated in a simulation of a facial perception experiment. Empirical power, Type I error, and effect sizes will be obtained as performance measures. Results will be presented and discussed.

session

Structural Equation Modeling

Keith Widaman

16:30 - 17:50

Salon Hölderlin

Talk

Unreliability Has Important Negative Effects: Correcting May Be Easier Than You Think

Keywords:
Bias-Correction
Regression
Reliability
Structural Equation Modeling

Unreliability can have important negative effects on parameter estimates in multiple regression analysis, a problem that has long been understood, is often largely ignored, but recently has returned to prominence. In this presentation, the population basis of the effect will be demonstrated, and simulations will demonstrate the magnitude of effects on Type I error rates, Type II error rates, and parameter coverage rates. A more nuanced evaluation of the problem will be presented, with comparisons of effects on a target predictor based on perfect or imperfect reliability of the other predictor. Correcting for unreliability may be easier than you think, but is not without caveats. One way to correct for unreliability is to disattenuate correlations for unreliability and perform multiple regression analyses on these corrected estimates. Examples will demonstrate that this can resolve the pernicious effects of unreliability. The presentation then will be extended to path analysis with the analysis of an educational status attainment data set that has been used as a classic example since the early days of structural equation modeling. In the past, the standard approach to path analysis has been taken with this data set, an approach that fails to correct for unreliability. By correcting for unreliability, model fit can be affected in substantively important ways, parameter estimates are altered, and the ultimate form of the acceptable path model is altered in fundamental ways. The major caveat: all of these effects of correcting for unreliability require optimal estimates of reliability. This implies that some commonly used reliability indices, such as coefficient alpha, may not be recommended; model-based estimates, such as coefficient omega, might be preferred. Sensitivity analyses can inform about whether choice of reliability estimate has a notable effect on model fit and parameter estimates.
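The disattenuation step described here is the classical Spearman correction: divide the observed correlation by the square root of the product of the two reliabilities. A minimal sketch (the numbers are illustrative only, not from the talk's data):

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability (Spearman's formula)."""
    return r_xy / math.sqrt(rel_x * rel_y)

# An observed r of .30 with reliabilities .70 and .80 corresponds to
# an estimated true-score correlation of about .40.
r_true = disattenuate(0.30, 0.70, 0.80)
```

Note that corrected correlations can exceed 1 when the reliability estimates themselves are poor, which is one reason the talk stresses that the correction requires optimal reliability estimates.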

Talk

Multilevel SEM for Discrete Data With the Pairwise Likelihood Estimation Method

Keywords:
Discrete Data
Mixed Effects
Multilevel Analysis
Pairwise Likelihood
Structural Equation Modeling

Social and behavioural research frequently involves multilevel data with a large number of latent variables (i.e., random slopes, random intercepts, and hypothetical constructs). Current full-information approaches for discrete data typically involve computationally intensive numerical methods (e.g., adaptive Gauss-Hermite quadrature). Alternatively, in the Pairwise Likelihood (PL) estimation method (Jöreskog & Moustaki, 2001), the full likelihood is replaced by a sum of (bivariate) pairwise likelihoods, which are easier to handle. In this presentation, we will examine the 'wide format' or 'multivariate' approach to multilevel data, in which all the individuals in a cluster are arranged and analysed row-wise (see Bauer, 2003, and Mehta & Neale, 2005, for continuous data). In a simulation study using discrete data, we will compare the 'wide format' PL estimation method with the multilevel weighted least squares (WLS) approach (Asparouhov & Muthén, 2007) and the multilevel marginal maximum likelihood approach (Hedeker & Gibbons, 1994) under different conditions (i.e., sample size, model misspecification, and balanced/unbalanced data). Overall, results show that PL estimation in the 'wide format' approach comes quite close to the multilevel marginal maximum likelihood estimates, which are often considered the gold standard. During this presentation we will illustrate the use of the 'wide format' approach for discrete data and discuss the advantages and disadvantages of the different estimation methods with multilevel data.

Talk

An Alternative Estimation Method for Multilevel SEM Based on Factor Scores

Keywords:
Bias-Correction
Factor Analysis
Multilevel Modeling
Structural Equation Modeling

Multilevel SEM is an increasingly popular technique, used to analyze data that are hierarchical and contain latent variables. When using the within-between framework, the parameters are usually jointly estimated using a maximum likelihood estimator (MLE). This has some drawbacks. First, a large number of clusters is needed to obtain unbiased estimates. Second, misspecifications in one part of the model may influence the whole model. To overcome these issues, we propose a stepwise estimation method, which is an extension of the Croon method for factor score regression (Croon, 2002). A factor analysis is performed for every latent variable, resulting in factor scores. Next, the between- and within-cluster covariance matrices of these factor scores are calculated and corrected using the formulas of Croon. New data are simulated using these corrected covariance matrices, which can be used in subsequent analyses, such as multilevel regression or multilevel path analysis.
A simulation study was set up to compare this new estimation method to the standard MLE. The results of uncorrected multilevel factor score path analysis were also considered. Two software packages were used to perform MLE, namely lavaan and Mplus. Five criteria were considered, namely bias of the regression parameters at the within level and between level, bias of the within and between variance and the proportion of successful replications.
No major differences were found between lavaan and Mplus. The uncorrected path analysis resulted in biased estimates of all regression parameters and variance components. On the within level, the Croon method and MLE resulted in very similar, unbiased estimates for the regression parameters and variance components. The proportion of successful replications was also very similar for both methods. On the between level, the Croon method outperformed MLE when the number of clusters was low. In conclusion, the Croon method seems to be a promising alternative to MLE.
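The core of a Croon-type correction is easiest to see in the single-factor, single-level case with Bartlett factor scores: their variance exceeds the true factor variance by (Λ′Θ⁻¹Λ)⁻¹, so subtracting that term recovers the factor variance. A minimal sketch of this idea (one factor, made-up loadings and error variances; the talk's multilevel extension corrects the within- and between-cluster matrices analogously):

```python
def bartlett_correction(loadings, error_vars):
    """Noise added to Bartlett factor scores: (lambda' theta^-1 lambda)^-1."""
    return 1.0 / sum(l * l / t for l, t in zip(loadings, error_vars))

def croon_corrected_variance(score_variance, loadings, error_vars):
    """Croon-type correction: subtract the Bartlett-score noise term."""
    return score_variance - bartlett_correction(loadings, error_vars)

# Hypothetical standardized loadings and corresponding error variances.
loadings = [0.8, 0.7, 0.6]
error_vars = [0.36, 0.51, 0.64]
# A naive factor-score variance of ~1.303 shrinks back to ~1.0, the true value.
corrected = croon_corrected_variance(1.303, loadings, error_vars)
```

The uncorrected path analysis mentioned above corresponds to skipping this subtraction, which is why its variance components come out biased.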

Talk

Omitted Cross-Loadings in Nonlinear SEM: A Monte Carlo Study

Keywords:
Cross Loadings
Interaction Effects
Nonlinear Structural Equation Modeling
Quadratic Effects

Researchers are often interested in detecting nonlinear effects using nonlinear structural equation modeling (SEM). Generally, it is assumed that the measurement models of the latent predictor variables are unidimensional, i.e. that all indicator variables load only onto a single latent variable, although this may not always be true. There is already strong evidence for linear SEM that omitted cross-loadings may lead to biased covariances between latent predictor variables as well as biased structural parameter estimates (Bandalos, 2014; Hsu, Troncoso-Skidmore, Li & Thompson, 2014; Li, 2016). Until now, however, the effects of omitted cross-loadings have not been investigated in the context of nonlinear SEM. Given that omitted cross-loadings affect the estimates of factor variances and covariances, and that the estimation of nonlinear effects depends on these estimates, it can be expected that omitted cross-loadings will also affect estimates of latent interaction and quadratic effects. In a Monte Carlo study using the LMS method (Klein & Moosbrugger, 2000) of the Mplus program, we investigated the effects of omitting cross-loadings on estimates of nonlinear effects by varying the number, size, and sign of the secondary loadings, the number of nonlinear effects, and the size of the latent covariance. Our results indicate that omitted cross-loadings may lead to severely biased parameter estimates. We will show under which conditions spurious nonlinear effects occur (Type I error rates) or existing nonlinear effects vanish (power rates). In order to detect omitted cross-loadings, an empirical researcher might consider testing the fit of the measurement models using the likelihood ratio test for linear SEM, because a global test for nonlinear SEM does not yet exist. However, we will also demonstrate that, despite a good model fit, undetected cross-loadings may bias the nonlinear effects.

symposium

New Developments in Mokken Scale Analysis

Andries van der Ark

16:30 - 17:50

Salon Novalis

Talk

Introduction to Mokken Scale Analysis

Mokken scale analysis

Keywords:
Mokken Scaling

Over the past decade, Mokken scale analysis (MSA) has grown quickly in popularity among researchers from many different research areas. This introduction discusses a set of techniques and a procedure for their application, such that the construction of scales with superior measurement properties is further optimized, taking full advantage of the properties of MSA. First, I define the conceptual context of MSA, discuss the two item response theory models that constitute the basis of MSA, and discuss how these models differ from other IRT models. Second, I discuss dos and don’ts for MSA; the don’ts include misunderstandings frequently encountered in applications of MSA. Third, I discuss a methodology for MSA on real data consisting of a sample of persons who have provided scores on a set of items that, depending on the composition of the item set, constitute the basis for one or more scales, and I use the methodology to analyze an example real-data set.

Talk

Checking Assumptions in Two-Level Mokken Scale Analysis

New Developments in Mokken Scale Analysis (Symposium)

Keywords:
Item Response Theory
Mokken Scaling
Multilevel Modeling
Nonparametric Statistics

The nonparametric IRT models that underlie Mokken scale analysis rest on four main assumptions: unidimensionality, local independence, monotonicity, and invariant item ordering. These assumptions imply certain observable properties of the data. For example, local independence and monotonicity imply conditional association; for dichotomous item scores, monotonicity implies manifest monotonicity; and invariant item ordering implies manifest invariant item ordering. Mokken scale analysis provides methods to investigate the assumptions of the nonparametric IRT models by investigating the observable properties. When dealing with multi-rater data, some adjustments of the assumptions are necessary. For example, for multi-rater data, the monotonicity assumption concerns the latent trait of the subject combined with the rater effect. In addition, multi-rater data require a different way to estimate the item probabilities. As a result, the methods that are used to investigate observable properties must be adapted for multi-rater data. I will discuss the necessary adaptations to make the methods from Mokken scale analysis useful in a multilevel context, and I will discuss how these adaptations may be implemented.
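For dichotomous items, the manifest monotonicity check mentioned here is simple to state: the proportion of correct responses on an item should be nondecreasing in the rest score (the sum of the remaining items). A minimal single-level sketch with toy data (the multi-rater adaptations discussed in the talk require different conditioning):

```python
def manifest_monotonicity(data, item):
    """Check that P(item correct | rest score) is nondecreasing in the rest score."""
    groups = {}
    for row in data:
        rest = sum(row) - row[item]          # rest score: all other items
        groups.setdefault(rest, []).append(row[item])
    # Proportion correct per rest-score group, ordered by rest score.
    props = [sum(v) / len(v) for _, v in sorted(groups.items())]
    return all(a <= b for a, b in zip(props, props[1:]))

# Toy 0/1 data: rows are persons, columns are items.
data = [[0, 0, 0], [0, 1, 0], [1, 1, 0], [1, 1, 1]]
ok = manifest_monotonicity(data, item=0)     # holds for these data
```

In practice one would also group sparse rest scores together and allow for sampling error, as the standard Mokken procedures do; this sketch shows only the raw observable property.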

Talk

Two-Level Mokken Scale Analysis: The State of the Art

New Developments in Mokken Scale Analysis

Keywords:
Mokken Scaling
Multilevel Analysis
Nonparametric Item Response Theory
Scalability
Standard Errors

Currently, Mokken scale analysis for two-level data is being developed. It is a scaling procedure that allows test constructors to investigate the scalability, reliability, and validity of measurement instruments producing two-level data. This talk provides an overview of what has been achieved so far. I will first discuss the types of data for which two-level Mokken scale analysis is a useful procedure and briefly discuss the underlying nonparametric IRT models. Then, I will discuss the estimation of scalability coefficients and their standard errors, and finally I will demonstrate two-level Mokken scale analysis using a multi-rater dataset.

Talk

Using Mokken Scaling Techniques to Evaluate Educational Assessments

Keywords:
Mokken Scaling
Performance Assessment
Rater-mediated Assessment

The purpose of this study is to illustrate and consider the use of Mokken Scale Analysis (MSA) as a method for evaluating the quality of educational assessments, including multiple-choice (MC) assessments and rating quality in rater-mediated educational performance assessments. I focus on the following questions:
1. What information does MSA provide about the psychometric quality of MC items?
2. What information do traditional applications of MSA provide about rating quality in rater-mediated performance assessments?
3. What information does an adjacent-categories scaling procedure adapted from MSA provide about rating quality in rater-mediated performance assessments?
4. What are the practical implications for researchers and practitioners of using MSA as an evaluative framework for educational assessments?
To address these questions, I used two data sources: (1) middle school students’ responses to an MC-format engineering design process assessment, and (2) ratings of middle school students’ compositions written during an administration of a rater-mediated writing performance assessment. I analyzed the first dataset using a traditional application of dichotomous MSA to the MC items. For the performance assessment, I calculated indicators of rating quality adapted from Molenaar’s original polytomous MSA models and an adjacent-categories approach to MSA (ac-MSA). Specifically, I examined indicators of rater monotonicity, rater scalability, and invariant rater ordering using the original models and the ac-MSA adaptations. I discuss the implications of considering rating quality from the perspective of MSA and ac-MSA.
Together, the results indicated that MSA provides diagnostic information about the psychometric quality of individual MC items and individual raters that can provide valuable insight during assessment development and revision. Implications for research and practice are discussed.

Thursday, July 26
↑ Go to top ↑

08:30

keynote

Towards a deeper understanding of the effectiveness of interventions: New methods based on structural equation models and causal inference

Rolf Steyer

08:30 - 09:15

Saal Friedrich Schiller

09:30

session

Large Scale Data

Steffi Pohl

09:30 - 10:30

Saal Friedrich Schiller

Talk

Item-Person Mismatch and Parameter Recovery Accuracy in Sparse Multi-Matrix Booklet Designs

Keywords:
Item Response Theory
Large-Scale Data
Optimal Design

Multi-matrix booklet designs are important in educational large-scale assessments as they help reduce respondents’ burden and enable test administrators to save time and money. However, accurate or efficient parameter recovery from response data is a central problem when analyzing data from multiple matrix booklet designs in conjunction with IRT. Important factors such as sample size and the match between person and item location parameter distributions could influence parameter recovery accuracy. This study investigates the degree to which person and item parameters are recovered as a function of matrix sparseness and sample size when using balanced incomplete block multi-matrix designs with varying degrees of alignment between the person and item location parameter distributions.
To achieve this, data were simulated in which person and item location parameters are generated from a population assuming a normal distribution with unit variance. The mean of the item location parameters is fixed at 0 in all conditions, while the mean of the person parameters varies to give differing levels of mismatch (μ = 0, 0.2, 0.4, 0.8, 1.2, 1.6, 2.0). The R package irtoys (Partchev, 2016) was used to simulate the data under a Rasch model, as well as to estimate item and person parameters using MML. The sparse multi-matrix designs used were like those of Gonzalez & Rutkowski (2010), but with a test length of 42 items and varying sample sizes of 300, 500, 1000, 3000, 4500 and 6000 examinees. The root mean squared error (RMSE) and bias between true and estimated parameter values were used to assess parameter recovery accuracy. The study design was fully crossed, with 1000 replications used in each condition to ensure stable results.
The results showed that parameter recovery accuracy was affected by the match between the person and item location parameter distributions. However, the size of the effect was negligible: the difference between the RMSEs in the perfectly matched and the most mismatched cases was less than 0.02 in all conditions. As expected, parameter recovery accuracy decreased as the sample size was reduced or as the multi-matrix designs became sparser.
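The two recovery criteria used here are standard: bias is the mean signed deviation of the estimates from the generating values, and RMSE is the root of the mean squared deviation. A minimal sketch (the parameter values are illustrative, not the study's results):

```python
import math

def bias(true_vals, estimates):
    """Mean signed deviation of estimates from generating values."""
    return sum(e - t for t, e in zip(true_vals, estimates)) / len(true_vals)

def rmse(true_vals, estimates):
    """Root mean squared error of estimates against generating values."""
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true_vals, estimates)) / len(true_vals))

# Toy item difficulties and their estimates from one hypothetical replication.
true_b = [-1.0, 0.0, 1.0]
est_b = [-0.9, 0.1, 0.8]
b, r = bias(true_b, est_b), rmse(true_b, est_b)
```

Bias can be near zero while RMSE is not (errors cancel in the mean but not in the squares), which is why the study reports both.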

Talk

Measurement Invariance of the Academic Performance for Fifteen Countries With the Alignment Method

Keywords:
Alignment Method
Complex Survey
Invariance
Standardized Educative Evaluation

In the context of international assessments, the comparability of scores between countries is based on the assumption that the measures are equivalent. UNESCO’s Third Regional Comparative and Explanatory Study (TERCE) program reports results for mathematics, science and reading for 15 Latin American countries and the State of Nuevo León in Mexico. A standard reporting practice is to rank order the countries according to their performance levels in each of these three subjects. An implicit assumption in this ranking is that the measures are sufficiently invariant to allow an unconfounded interpretation. The objectives of our research were to investigate the use of a relatively newly developed psychometric method -- the alignment method (Asparouhov & Muthén, 2014) -- for the analysis of measurement invariance and to determine the comparability of the scores obtained in the assessment. The analysis was carried out with 82 items of the science test administered to 61,921 students. The alignment method was applied to the item pool of the test, under the MLR estimation strategy, to test for approximate measurement invariance. The data analyses were performed with the Mplus 8 program. The preliminary results indicate that the alignment method based on a configural model automates the process of measurement invariance testing. In summary, the research shows the effectiveness of the technique for the detection of invariance in complex samples, providing evidence of non-invariant items that may affect the validity of interpretations in cross-cultural comparisons.

Talk

Treatment of Measurement Error and Missing Data Using Nested and Non-Nested Multiple Imputation

Keywords:
Educational Assessment
Measurement Error
Missing Data
Multiple Imputation

In educational research, plausible values (PVs) are frequently used to correct for measurement error and represent students' latent achievement scores while taking background information, such as students' interests or attitudes, into account (Mislevy, 1991). This method follows the multiple imputation (MI) approach of Rubin (1987) by considering latent variables as missing data, thus predicting achievement scores on the basis of a measurement model and background information. However, this procedure requires that the background data are completely observed, raising the question of how best to treat measurement error and missing data.
In the present talk, we consider different strategies for dealing with measurement error and missing data using MI. This includes the procedures currently employed in educational large-scale assessments such as PISA but also strategies involving nested and non-nested MI (Rubin, 2003) which are commonly used in secondary analyses in educational research.
We present the results from a simulation study that compared these methods under different conditions (e.g., sample size, reliability, amount of missing data). We show that the procedures currently employed in large-scale assessments could lead to biased parameter estimates (e.g., correlations, regression coefficients) when the data are not missing completely at random. By contrast, nested and non-nested MI are shown to provide unbiased estimates even with systematically missing data (missing at random). In addition, we show that simplified procedures can often be used, which are based on nested MI but do not require the use of specialized pooling methods normally required to analyze the data obtained from nested MI. In this context, we provide recommendations for practice, emphasizing the different roles of researchers who are involved in the scaling of student achievement data and those who conduct secondary analyses.

symposium

Differential Item Functioning (DIF) in Educational Settings: Methods, Simulations and Applications

Rudolf Debelak, Martin Tomasik

09:30 - 10:50

Salon Schlegel

Talk

A Regularized Moderated Item Response Model for Assessing Differential Item Functioning

Differential Item Functioning (DIF) in Educational Settings: Methods, Simulations and Applications

Keywords:
Differential Item Functioning
Item Response Theory
Moderated Factor Analysis
Regularization

The evaluation of differential item functioning (DIF) is an important aspect of the evaluation of measurements. Very often, DIF is investigated for several categorical (e.g., gender or region) and continuous (e.g., social status) covariates, and a statistical model including all covariates is warranted. In this paper, a moderated item response model (also labelled moderated nonlinear factor analysis, MNLFA) for polytomous responses is investigated, which allows item parameters to depend on both types of covariates (Bauer, 2017). The model could include all possible DIF effects, which leads to a highly parametrized model. Alternatively, statistical tests can be employed in a modelling phase to select the DIF parameters to be estimated in the MNLFA model, which is essentially a multi-step approach. To circumvent both unfavourable strategies, we propose to include regularization methods in the MNLFA model, using penalized marginal maximum likelihood estimation to assess DIF items (see Tutz & Schauberger, 2015, for a similar approach). Simulation studies and an application demonstrate the usefulness of the proposed method.

Talk

Calibration of a Criterion-Referenced Computerized Adaptive Test in Higher Education

Differential Item Functioning (DIF) in Educational Settings: Methods, Simulations and Applications

Keywords:
Adaptive Testing
Differential Item Functioning

The use of digital technologies opens up new opportunities in the field of higher education, for both teaching and testing. Regarding testing (e.g. written exams), among other advantages, digital technology makes it possible to introduce state-of-the-art methods from psychometrics and educational measurement into daily practice in higher education. In particular, criterion-referenced adaptive testing (CRT-CAT) has the potential to make exams more individualized, more accurate, and fairer. From a practical point of view, however, the calibration of the item pool needed for CRT-CAT poses a critical challenge, since a separate calibration study is typically not feasible and/or the sample sizes of a single written exam are too low to allow for a stable estimation of item parameters. To overcome this problem of small sample sizes, within the construction of an actual CRT-CAT for student competences in statistics, the same test items were presented at different university locations. Thereby, the number of responses per item is increased, making scaling with item response theory models possible. The test takers at all locations attended lectures with comparable content. However, as students within one university share exactly the same learning opportunities, they might be more similar to each other than students between universities. A possible result could be differential item functioning (DIF) due to testing location. To examine for which items DIF due to testing location does not occur (so that they can be used in a joint calibration), several analyses with the multi-facet Rasch model are conducted. The data collection is carried out at the end of the winter term 17/18, and the analyses will be finished before the conference. On the basis of the results, it will be discussed whether the calibration of an item pool across several university locations, for such a specific construct that is only recently generated in students through lectures, can be recommended for future studies.

Talk

Differential Item Functioning in the Context of Multistage Testing

Differential Item Functioning (DIF) in Educational Settings: Methods, Simulations and Applications

Keywords:

Multi-stage tests (MST) based on item response theory (IRT) are becoming more and more common in educational research. MSTs have properties of both linear tests and computer adaptive tests (CAT). In a CAT scenario, it is evaluated after each question whether the test taker should be presented with a more or less difficult item. In an MST scenario, the test taker is presented with whole sets of questions; after one set is completed, an algorithm decides on the difficulty of the next set. The aim of MST is to evaluate the test taker's ability with fewer questions and/or with higher precision compared to linear testing.
A routine procedure in test fairness evaluation is the identification of differential item functioning (DIF). DIF is observed when individuals from different subgroups but with identical abilities have different probabilities of solving an item. Using the 2PL-IRT model, DIF can occur in two different forms, namely as uniform DIF and non-uniform DIF (also called crossing DIF). In uniform DIF, the item characteristic curves do not intersect while in crossing DIF they do.
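Under the 2PL model, P(correct) = 1/(1 + exp(−a(θ − b))). Uniform DIF shifts the difficulty b between groups, so one group's curve lies below the other's everywhere; crossing DIF also changes the discrimination a, so the curves intersect. A minimal sketch with made-up parameters:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Uniform DIF: same discrimination, shifted difficulty -> curves never cross,
# so the reference-minus-focal gap keeps the same sign at low and high ability.
uniform_gap_low = p_2pl(-2, a=1.0, b=0.0) - p_2pl(-2, a=1.0, b=0.5)
uniform_gap_high = p_2pl(2, a=1.0, b=0.0) - p_2pl(2, a=1.0, b=0.5)

# Crossing DIF: different discriminations -> the sign of the gap flips,
# i.e. the item characteristic curves intersect.
crossing_gap_low = p_2pl(-2, a=0.8, b=0.0) - p_2pl(-2, a=1.5, b=0.0)
crossing_gap_high = p_2pl(2, a=0.8, b=0.0) - p_2pl(2, a=1.5, b=0.0)
```

The sign flip in the crossing case is exactly what area-based statistics such as SIBTEST can miss, since the two regions of the curve cancel each other out.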
For linear tests, various statistical tests for DIF detection are readily available, but most of these methods are not suitable for DIF detection in MST. One of these methods, SIBTEST, was adapted to the CAT scenario (CATSIB) and also seems to work well in MST scenarios. However, CATSIB cannot detect crossing DIF, which occurs quite often in 2PL models. Our aim, therefore, was to implement CATSIB in an R package and to develop an R function for the detection and evaluation of crossing DIF in MST. We compared these R functions in a simulation study to check their statistical properties. We also conducted a DIF analysis with real data from an MST of competencies in the entire population of students from grades 8 and 9 (N > 3000). Of special interest was the detection of DIF between students from the two grades, based on the null hypothesis that there would be none.

Talk

A Flexible Method for the Detection of Differential Item Functioning

Differential Item Functioning (DIF) in Educational Settings: Methods, Simulations and Applications

Keywords:
Differential Item Functioning

Educational assessments are required to be fair. This includes that persons of comparable ability obtain comparable results in educational tests and that no bias against specific groups of respondents is present. Tests which show such a bias are said to exhibit differential item functioning (DIF). In practical measurements, the statistical framework of item response theory (IRT) provides tools for DIF detection.
We present a class of score-based tests, which are named M-fluctuation tests, for the detection of DIF in the IRT framework. These tests can be applied with a wide range of IRT models to test the hypothesis that the characteristics of test items, for instance their difficulty or item discrimination, are invariant with regard to the characteristics of test takers, for instance their age and gender. In contrast to alternative tests for DIF detection in IRT, M-fluctuation tests do not require the definition of groups of respondents (e.g., age groups) between which the parameters are compared, but are directly applied to person characteristics like age and gender.
Using simulated data generated from the two- and three-parameter logistic models, we first demonstrate that M-fluctuation tests are able to detect changes in the different model parameters (such as the difficulty or the guessing parameter) between groups of respondents, but are conservative if these parameters are stable. We then present how software packages in the free statistical software environment R can be used to calculate M-fluctuation tests.

session

Factor Analysis

Florian Scharf

09:30 - 10:50

Salon Hölderlin

Talk

Orthogonal Versus Oblique Rotation in Temporal EFA for Event-Related Potentials

Keywords:
Applied Statistics
Factor Analysis
Latent Variable Analysis
Structural Equation Modeling

Temporal exploratory factor analysis (EFA) is widely used to reduce the dimensionality of event-related potential (ERP) data sets and to reduce the ambiguity with respect to the underlying components. Typically, EFA is conducted on a data matrix in which the columns are time points and data from all participants, electrodes and conditions are commingled in the rows.
The central goal of this procedure is to test whether there are differences in the factor scores (i.e., amplitudes) between the conditions. In the past, the risk of incorrect allocation of condition effects between factors has raised concerns. Simulation studies have shown that orthogonal rotation methods are more prone to this variance misallocation than oblique rotation methods. However, orthogonal rotations such as Varimax are still applied.
Here, we outline the reasons for the superior performance of oblique rotation from the perspective of EFA as a statistical model. Specifically, we show that factors in temporal EFA for ERP data are inevitably correlated due to the condition effects and the scalp topography. We also demonstrate these principles in a Monte Carlo simulation comparing orthogonal Varimax rotation with the oblique Promax and Geomin rotations.
In line with previous research and our mathematical derivations, Varimax rotation was prone to spurious cross-loadings between the factors. This pattern occurred even when the factors were uncorrelated across participants. Oblique rotation methods showed much weaker biases that increased as a function of the temporal overlap between the factors.
In order to circumvent correlated factors as a major cause of variance misallocation, oblique rotation methods should generally be the method of choice when applying temporal EFA to ERP data.

Talk

Common Factor Analysis and Principal Component Analysis: Competing Indeterminacies

Keywords:
Exploratory Factorial Analysis
Factor Analysis
Measurement Invariance

To date, comparisons between exploratory versions of common factor analysis (CFA) and principal component analysis (PCA) have tended to focus on differential parameter estimates across methods. The purpose of this presentation is to extend prior work to additional issues. When using any method of analysis, the analyst should understand: (a) how to perform analyses in reasonable, step-like fashion; (b) how to interpret parameter estimates; (c) how to simulate data from the mathematical model or representation; and (d) the relations between parameter estimates and key data from which they were derived. In general, these four aspects of analyses are well understood when using CFA. However, I submit that only the first – how to perform analyses in step-like fashion – is well understood in PCA, and the remaining issues are not well known in relation to PCA. The presentation will stress the nature and range of parameter estimates under CFA and PCA, discuss issues related to simulating data from each procedure, and outline relations between parameter values and the correlation matrices from which they were derived. One key distinction historically stressed involves indeterminacy: that CFA has several crucial indeterminacies (e.g., rotational indeterminacy, factor score estimation), whereas PCA provides a determinate solution. Investigation of points (c) and (d) uncovered heretofore unaddressed, pernicious problems and/or indeterminacies associated with PCA. The nature of these problems with PCA will be illustrated. A major conclusion is that PCA offers a weak basis for scientific generalization in precisely those situations in which it is typically used, situations in which between 3 and 6 variables load highly on each component. If exploratory analyses should be used to hone measurement models for subsequent confirmatory analyses in new samples, CFA is the appropriate exploratory method to use, and the use of PCA should be strongly discouraged.

Talk

The Number of Factors in Exploratory Factor Analysis

Keywords:
Exploratory Factorial Analysis
Monte-Carlo Simulation
Number of Factors

Exploratory factor analysis (EFA) is a widely used statistical method to study the underlying latent structure of a large number of observed variables, especially if there is no strong a priori justification for a particular theoretical model. Many criteria have been suggested to determine the correct number of factors. In this study, we present an extensive Monte Carlo simulation comparing traditional parallel analysis (PA), the Kaiser-Guttman criterion, and sequential $\chi^2$ model tests (SMT) to four recently suggested methods: revised PA, comparison data (CD), the Hull method, and the Empirical Kaiser Criterion (EKC). We manipulated the number of latent factors, the correlation among the factors, the number of items per factor, the magnitude of loadings, the underlying distribution, the presence of cross-loadings, minor factors, and the number of observations. No single extraction criterion performed best for every factor model. In unidimensional and orthogonal models, traditional PA, EKC, and Hull consistently identified the correct number of factors, even in small samples. Models with correlated factors were more challenging, where CD and SMT outperformed other methods especially for scales with fewer items. Given that the correct number of factors was reliably retrieved when SMT and either Hull, EKC, or traditional PA indicate the same number of factors to retain, we suggest that investigators first apply these methods to determine the number of factors. When the results of this combination rule are inconclusive, CD performed best. However, disagreement also suggests that factors will be harder to detect, in which case we recommend a sample size of $N\geq500$.
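As an illustration of the traditional parallel analysis criterion compared above, here is a minimal numpy sketch (the two-factor data-generating setup is hypothetical and not taken from the simulation study):

```python
import numpy as np

def parallel_analysis(data, n_sims=100, quantile=0.95, seed=0):
    """Horn's parallel analysis: retain factors whose observed
    correlation-matrix eigenvalues exceed those of random data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    # eigvalsh returns eigenvalues in ascending order; reverse to descending
    obs_eig = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    sim_eig = np.empty((n_sims, p))
    for s in range(n_sims):
        rand = rng.standard_normal((n, p))
        sim_eig[s] = np.linalg.eigvalsh(np.corrcoef(rand, rowvar=False))[::-1]
    threshold = np.quantile(sim_eig, quantile, axis=0)
    return int(np.sum(obs_eig > threshold))

# hypothetical data: two orthogonal factors, three items each
rng = np.random.default_rng(1)
f = rng.standard_normal((500, 2))
loadings = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
x = f @ loadings.T + 0.5 * rng.standard_normal((500, 6))
print(parallel_analysis(x))
```

With this orthogonal setup and N = 500, parallel analysis recovers the two factors, consistent with the good performance of traditional PA in orthogonal models reported above.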

Talk

On the Influence of Processing Speed on Investigations of Structural Validity: A Simulation Study

Keywords:
Structural Equation Modeling

A time limit in testing prevents slow participants from reaching their highest possible scores, whereas fast participants can reach their highest possible scores within the available time span. The consequences of such time limits for the investigation of structural validity by means of confirmatory factor analysis are examined in a simulation study. The following questions are addressed: Does the time limit in testing impair model fit in investigating structural validity? Does the representation of the effect prevent the impairment of model fit? Is it possible to identify and discriminate this method effect from another method effect? An important characteristic of the study is the assumption that omissions due to the time limit reflect the participants’ processing speed. Four sets of 500 matrices showing a strong effect, a medium effect, a weak effect, and no effect due to a time limit were generated and investigated. Impairment of model fit resulting from the time-limit effect was indicated by only some fit indices. Including a factor representing processing speed in the measurement model of the confirmatory factor analysis improved model fit, but did not fully compensate for the impairment. The precise representation of the effect made it possible to discriminate the time-limit effect from a uniform method effect.

session

Applied Statistics

Thomas Schäfer

09:30 - 10:50

Salon Novalis

Talk

Cohen Revised: Empirical Redefinition of the Conventions for Interpreting Effect Sizes in Psychology

Keywords:
Effect Sizes Measures

Effect sizes are the currency of psychological research. They quantify the results of a study to answer the research question and are used to calculate statistical power. The interpretation of effect sizes—when is an effect small, medium, or large?—has been guided by the conventional definitions Jacob Cohen suggested in his pioneering writings starting in 1962. But do Cohen’s suggestions stand up to comparison with the empirical distributions of effect sizes as they are really found in psychology? For the present analysis, 900 effect sizes were randomly drawn from the whole history of psychological research. The distributions of effect sizes revealed that effects are much larger than suggested by Cohen, warranting a redefinition of the standards of their interpretation. New benchmarks are provided for the effect sizes r, d, and eta-squared. In addition, large differences were found between psychological subdisciplines, calling for the careful use of general guidelines.
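One difficulty with general benchmarks is that Cohen's conventions for different indices are not mutually consistent under the standard conversion formulas (equal group sizes assumed); a small illustrative sketch:

```python
import math

def r_to_d(r):
    # convert point-biserial r to Cohen's d (equal group sizes assumed)
    return 2 * r / math.sqrt(1 - r**2)

def d_to_r(d):
    # inverse conversion under the same assumption
    return d / math.sqrt(d**2 + 4)

# Cohen's conventional "medium" effects: r = .30 and d = .50
print(round(r_to_d(0.30), 3))  # 0.629 -- larger than the "medium" d of .50
print(round(d_to_r(0.50), 3))  # 0.243 -- smaller than the "medium" r of .30
```

Converting the conventional "medium" r of .30 yields a d well above the conventional "medium" d of .50, and vice versa, which underlines why benchmarks need to be anchored in empirical distributions for each index separately.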

Talk

Introducing Indigenous Methodology Into the Practice of the European Social Research

Keywords:
Indigenous Methodology
Qualitative Methods

The talk will revolve around indigenous methodology and its application in social and cultural studies within the practice of European qualitative research. It emphasizes innovative trends in qualitative methodologies that focus on culturally diverse environments, exemplified by the cultural borderland of Bosnia-Herzegovina’s society, while stemming from the emancipatory paradigm and post-colonial critique. The presentation covers key points of indigenous methodology, which the author studied during a research visit this year to selected aboriginal academic and educational centres in Australia. The indigenous approach can be described as “an ethically correct and culturally appropriate, indigenous manner of taking steps towards the acquisition and dissemination of knowledge about indigenous peoples” (Porsanger, 2004). Yet, despite being originally designed to emancipate and empower indigenous peoples (e.g. in Canada, Australia, New Zealand or North America), it bears paramount significance for the many culturally fragile groups in Europe (i.e. groups subject to marginalization or discrimination, such as Bosnian Muslims, Roma communities in Slovakia, Arabic refugees, etc.). It can therefore serve as a powerful tool for empowering the researched individuals, whose cultural identities are manifested in cultural borderlands that are at times full of tensions and exposed to ethnic, religious and national diversity. Introducing indigenous methods and paradigms can significantly enrich the culturally sensitive practices of field researchers who respect the research communities and acknowledge them as active co-producers of knowledge. Taking the above into consideration, the author discusses applied and empirically developed theoretical models of indigenous methodology that can be used, successfully and to mutual benefit, in European social sciences and humanities.

Talk

The Impact of Test-Review Models on Improving Tests and Testing: The Case of Spain

Keywords:
NA
Test
Test Quality Criteria
Test Review Model

Tests are essential tools in educational and psychological assessment. In order to improve test quality, different associations across different countries have proposed using Test-Review Models (see for example Prieto & Muñiz, 2000; Bartram, 2002; Evers, Braak, Frima, & Van Vliet-Mulder, 2009; Lindley, 2009; Nielsen, 2009; Evers et al., 2013; Hernández, Ponsoda, Muñiz, Prieto & Elosua, 2016). When applying these models, qualified professionals make a rigorous assessment of a number of tests, and the results of the assessments are made available to test users and professionals in order to improve testing and help them to make the right assessment decisions. In Spain, Prieto and Muñiz (2000) proposed the Spanish Test-Review model, which was revised in 2016 based on the updated EFPA model (see Hernández et al., 2016). Since the model was applied for the first time, in 2010, the National Tests Commission of the Spanish Psychological Association (COP) has periodically reviewed some of the tests most commonly used by Spanish psychologists. To date, six rounds of test reviews have been completed and a total of 65 tests have been assessed. In this study, we present the results of the sixth test-review edition, in which 10 new tests have been evaluated. In addition, we assess the impact that the model and the review process have had over these years on different stakeholders: a) professional psychologists, b) psychometricians and psychological-assessment scholars, and c) test builders and test publishing houses. We conclude with some actions that could increase the impact of the test-review models.

Talk

Neets in French Labour Market: A Multidimensionnal and Fuzzy Approach

Keywords:
Fuzzy Sets
Multidimensionality
Transition From School to Work

The NEET measure concerns young people who are neither in employment nor in education or training. Since 2010, the European Commission has used this NEET measure to monitor the labour market and social situation of youth. The measure is supposed to capture problematic labour market transitions among early school-leavers and, more specifically, to identify the most vulnerable ones (Furlong, 2006). However, as a number of studies have revealed, NEETs are a very heterogeneous population covering a very wide range of situations, some of which accumulate vulnerability factors: demotivated youths, the unoccupied, those living with parents, the disadvantaged, those looking for a career path, those with family responsibilities, and youths taking a year out (Eurofound, 2011).
Although France has a NEET rate close to the European and OECD average (16.6%; cf. OECD, 2016), this population (1.8 million) remains poorly known and poorly understood. In this research, we propose measuring the NEET situation using a multidimensional and fuzzy approach. This method has been used in the measurement of poverty, health, and overeducation. Its main advantage is that it makes it possible to construct an indicator similar to a membership function by including different deprivation dimensions. More particularly, we propose to calculate such an indicator for the NEET population by taking into account dimensions related to the young person’s lack of human capital accumulation in all its forms. In other words, we define a distance to the labour market. The data come from the National Survey of Youth Resources conducted by DREES and INSEE, in which 5,800 young French people aged 18 to 24 were interviewed. The indicator was constructed using the aggregation weights proposed by Betti and Verma (1999), whose advantage is that they take into account the relative frequency of the dimensions in the population and the correlations between them.
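The membership-function idea can be sketched numerically; in this minimal illustration the deprivation scores and weights are hypothetical placeholders, not the Betti and Verma (1999) weights used in the study:

```python
import numpy as np

# hypothetical deprivation scores on three dimensions (rows = youths),
# each normalized to [0, 1], where 1 = fully deprived on that dimension
deprivation = np.array([
    [1.0, 0.8, 0.6],   # far from the labour market on every dimension
    [0.2, 0.0, 0.4],
    [0.0, 0.0, 0.0],   # not deprived at all
])

# placeholder weights summing to 1; Betti & Verma instead derive weights
# from dimension frequencies and inter-dimension correlations
weights = np.array([0.5, 0.3, 0.2])

# fuzzy membership indicator in [0, 1]: a weighted deprivation average
membership = deprivation @ weights
print(membership)  # e.g. first youth close to 1, third youth at 0
```

The fuzzy indicator thus grades membership in the vulnerable group continuously rather than classifying youths as NEET versus non-NEET.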

11:00

coffee

Coffee Break

11:00 - 11:30

11:30

session

Bayesian Statistics

Herbert Hoijtink

11:30 - 12:50

Saal Friedrich Schiller

Talk

Computing Bayes Factors From Data With Missing Values

Keywords:
Bayesian Statistics

The Bayes factor is increasingly used for the evaluation of hypotheses. These may be traditional hypotheses specified using equality constraints among the parameters of the statistical model of interest, or informative hypotheses specified using equality and inequality constraints. So far, no attention has been given to the computation of Bayes factors in the presence of missing data.
This paper shows how observed data Bayes factors, that is, Bayes factors computed using only the observed data, can be approximated using multiple imputations of the missing values. After introducing the general framework, elaborations are given for Bayes factors based on priors specified using training data and for Bayes factors based on default or subjective prior distributions. It is illustrated that the proposed approach can be applied using the packages MICE, Bain, and BayesFactor.

Talk

Handling Ordinal Predictors in Regression Models via Monotonic Effects

Keywords:
Bayesian Statistics
Multilevel Modeling
Ordinal Data
Regression

Ordinal predictors are commonly used in regression models. Yet, they are often incorrectly treated as either nominal or metric, thus under- or overestimating the information they contain. This is understandable insofar as generally applicable solutions and corresponding statistical software have been missing. We propose a new way of handling ordinal predictors, which we call monotonic effects. Here, the regression coefficients are reparameterized in terms of a scale parameter $b$, taking care of the direction and size of the effect, and a simplex parameter $\zeta$, modeling the normalized differences between categories. This ensures that predictions are monotonically increasing or decreasing, while changes between adjacent categories may vary across categories. This formulation generalizes nicely to both interaction terms and multilevel structures. Fitting monotonic effects in a fully Bayesian framework is straightforward with the R package brms, which also allows incorporating prior information and testing whether the assumption of monotonicity is justified.
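The reparameterization described above can be sketched numerically. In this minimal Python illustration the scale parameter $b$ is taken as the total difference between the lowest and highest category; this is one possible scaling convention and not necessarily the exact brms parameterization:

```python
import numpy as np

def monotonic_term(x, b, zeta):
    """Monotonic-effect contribution of an ordinal predictor.

    x    : integer category codes 0..D (0 = reference category)
    b    : scale parameter (direction and total size of the effect)
    zeta : simplex of length D, the normalized differences between
           adjacent categories (non-negative, summing to 1)
    """
    zeta = np.asarray(zeta, dtype=float)
    assert np.all(zeta >= 0) and np.isclose(zeta.sum(), 1.0)
    # cumulative normalized "heights": 0 for the reference category,
    # 1 for the highest category; monotone by construction
    heights = np.concatenate(([0.0], np.cumsum(zeta)))
    return b * heights[np.asarray(x)]

# four ordered categories; most of the change occurs between 0 and 1
effects = monotonic_term([0, 1, 2, 3], b=2.0, zeta=[0.7, 0.2, 0.1])
print(effects)  # monotonically increasing, total effect = b = 2.0
```

Because $\zeta$ is a simplex, the predicted contribution is monotone in the category code, while the step sizes between adjacent categories remain free.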

Talk

Bayesian Estimation for Cases of Empirical Underidentification

Keywords:
Bayesian Statistics
Multitrait-Multimethod
Structural Equation Modeling

The correlated trait-correlated method model (CT-CM) and the true-score multitrait-multimethod model (TS-MTMM) are two structural equation models that can summarize multitrait-multimethod data. However, these two models often produce inadmissible solutions (e.g., failed convergence, out-of-bounds parameter estimates), which may be due to empirical underidentification (i.e., the data do not contain enough information to produce a unique set of parameter estimates, even though the model may accurately represent the data generation process). This presentation describes how Bayesian estimation can alleviate these estimation problems. A large-scale simulation showed that Bayesian estimation produced admissible solutions for 99.99% of data sets simulated from the CT-CM and TS-MTMM, whereas ML only produced admissible solutions for 49.82% and 36.48% of data sets simulated from CT-CM and TS-MTMM, respectively. Furthermore, Bayesian parameter estimates showed comparable bias and greater efficiency relative to ML parameter estimates. The results of the simulation were echoed by five empirical examples, which led to admissible parameter solutions for Bayesian estimation and inadmissible solutions for ML. The presentation concludes with a discussion of how the Bayesian estimation procedure aids parameter estimation for different sources of empirical underidentification, and why Bayesian estimation should help assuage estimation difficulties for other models used in psychological science.

Talk

Operationalizations of Inaccuracy of Prior Distributions in Simulation Studies

Keywords:
Bayesian Statistics
Inaccurate Priors
Simulation Study

The most controversial aspect of a Bayesian analysis is the selection of prior distributions. There have been several simulation studies evaluating the impact of inaccurate prior distributions on statistical properties of posterior summaries of model parameters (Depaoli, 2013; 2014; Miočević, Levy, & MacKinnon, under review). However, the findings of such simulation studies depend heavily on the operationalization of inaccuracy and informativeness of priors.
This talk presents several possibilities for designing inaccurate priors for a simulation study, and communicates the results of a simulation study used to compare the consequences of two different conceptualizations of inaccurate priors on recommendations made for applied researchers. The talk offers several options for constructing inaccurate priors for simulation studies that mimic real-life scenarios for how applied researchers might inadvertently specify inaccurate priors. This project aims to demonstrate that as methodologists, we should include multiple types of inaccurate priors in simulation studies used to examine statistical properties of Bayesian methods with inaccurate priors.

session

Causal Inference

Rolf Steyer

11:30 - 12:50

Salon Schlegel

Talk

Average Effects Based on Regressions with Log Link: A New Approach with Stochastic Covariates

Symposium of Prof. Dr. Steyer

Keywords:
Average Effects
Count Data
Poisson Regression Model

Researchers oftentimes use a regression with logarithmic link function (e.g., Poisson regression) to evaluate the effects of a treatment or an intervention on a count variable. In order to judge the average effectiveness of the treatment on the original count scale, they compute so-called marginal or average treatment effects, which are defined as the average difference between the expected outcomes under treatment and under control. Current practice is to evaluate the expected differences at every observation and use the sample mean of these differences as point estimate of the average effect. Standard errors for average effects are then obtained using the delta method. This approach makes the implicit assumption that covariate values are manifest and fixed, i.e., do not vary across different samples. We present a new way to analytically compute average effects based on regressions with log link with stochastic and/or latent covariates and develop new formulas to obtain standard errors for the average effect. In a simulation study, we evaluate the statistical performance of our new estimator and compare it to the traditional approach. Our findings suggest that the new approach gives unbiased effect estimates and standard errors and outperforms the traditional approach in some conditions.
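A minimal numerical sketch of the average-effect computation under a log link, using hypothetical coefficients and a standard-normal stochastic covariate. This reproduces the plug-in average of expected differences and, for this special case, its analytic value; it is not the authors' new estimator:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical Poisson model with log link: log E[Y] = b0 + bT*T + bX*X
b0, bT, bX = 0.2, 0.5, 0.3
x = rng.standard_normal(10_000)   # stochastic covariate, X ~ N(0, 1)

# average effect: mean over the covariate values of the difference
# between expected outcomes under treatment (T=1) and control (T=0)
mu1 = np.exp(b0 + bT + bX * x)
mu0 = np.exp(b0 + bX * x)
ate_plugin = np.mean(mu1 - mu0)

# analytic value when X ~ N(0, 1): E[exp(bX*X)] = exp(bX**2 / 2)
ate_analytic = (np.exp(b0 + bT) - np.exp(b0)) * np.exp(bX**2 / 2)
print(ate_plugin, ate_analytic)
```

Treating the covariate as stochastic, as in the last line, means the average effect is an expectation over the covariate distribution rather than over a fixed set of observed values, which is exactly the distinction the new approach exploits for standard error computation.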

Talk

Ignoring Ignorability: Towards a Realistic Public Policy Evaluation

Keywords:
Causal Inference

Evidence-based public policy is a widely accepted requirement (OECD, 2007). Although the search for objectivity is one reason (King, 2010), ensuring public trust constitutes the main motivation (Holt, 2008). This leads to evaluating a public policy by maximizing social welfare (Manski, 1996). Thus, the evaluation itself reduces to comparing the probability of the outcome when all the units are under the treatment, P(Y1| X), and the probability of the outcome when all the units are not exposed to the treatment, P(Y0| X); here X characterizes the contexts in which the public policy is intended to be applied. These conditional probabilities are non-identified due to the fundamental problem of causal inference (Holland, 1986): P(Y1| X, Z=0) and P(Y0| X, Z=1) are non-identified, where Z=1 means that a unit is under the treatment and Z=0 otherwise. A widely used identification restriction is the strong ignorability condition (Rosenbaum & Rubin, 1983), which establishes that (Y0,Y1) is independent of Z given X. This condition is not empirically refutable, but can only be justified by substantive considerations. However, two problems are associated with this condition: at the logical level, the question is whether the recommendations obtained under the strong ignorability condition remain valid under weaker assumptions; at the policy level, it is relevant to realize to what extent the ignorability condition determines the policy maker's behavior with respect to the implementation of the treatment. In this talk, we intend to enhance the scope of public policy evaluations by introducing partial identification restrictions that address the fundamental problem of causal inference. More specifically, we show how a public policy can be evaluated under different scenarios.
We show not only how each of these scenarios induces specific policy maker behaviors, but also to what extent one scenario allows falsifying the policy recommendations reached under a complementary scenario.

Talk

How to Model Production in Psychology? A Bayesian Stochastic Frontier Structural Equation Model

Keywords:
Bayesian Statistics
Causal Inference
Stochastic Frontier Analysis
Structural Equation Modeling

In the realm of Social Psychology or Work and Organizational Psychology it is often the case that groups, for example research teams, are observed which produce an output for a given input. Modeling this relationship as a simple prediction task misses the idea of production. The interest is not in predicting the average outcome of a unit, but in estimating, for a given input, the maximal possible outcome of a unit (the frontier of production) and the deviation of the observed outcome from that maximum (inefficiency). Stochastic Frontier Analysis (SFA), adopted from Econometrics, should therefore be favored over ordinary regression in this case. Considering psychological applications, however, four problems arise: First, the outcomes are often count variables (e.g., number of publications). Second, the outcomes are prone to measurement error (e.g., the number of publications of a unit varies among different bibliographic databases). Third, not one but multiple outputs are often reported (e.g., number of publications, number of PhDs). Fourth, the input (e.g., funding) as a treatment might be confounded (e.g., by scientific field). A Bayesian Stochastic Frontier Structural Equation Modeling approach (BSFSEM) addresses all of these problems: The outcome variables are assumed to be Poisson distributed. A measurement model is used to assess several latent output and input factors. The structural part of the BSFSEM is defined by the stochastic frontier component as a skewed normal distribution model. The main input factor (e.g., funding) allows for a continuous treatment by adopting concepts from Steyer's theory of causal effects and the General Propensity Score Matching. Besides the methodological concepts, a simulation study shows how the model behaves under different sampling conditions. The Austrian Science Fund (FWF) provided research reports of N = 1,046 funded projects to illustrate the proposal.

Talk

Bias in Estimating Treatment Effects of Latent Non-normal Distributed Outcome Variables With Binary Indicators

Symposium of Prof. Dr. Steyer

Keywords:
Average Effects
Categorical Data
Latent Variable Analysis

To evaluate the effects of a treatment, it is useful to estimate the average treatment effect (ATE). Often, the outcome does not only depend on the treatment, but also on one or several (qualitative or quantitative) covariates and the interactions between treatment and covariates. To estimate the ATE, one can use multi-group structural equation modeling (SEM), which also allows analyzing latent outcome variables and latent covariates. We focus on the case of a single latent outcome variable, which is measured by several binary indicators (items), and we use the Rasch model as the measurement model. Previous simulation studies have shown that the ATE can be estimated without bias if all covariates on which the latent outcome variable depends are included in the model. However, if some covariates are ignored, the ATE estimate can be biased. This can even occur in a randomized experiment, where the treatment and the ignored covariates are stochastically independent and an unbiased ATE estimate would therefore be expected. One explanation for this bias is that ignoring relevant covariates can lead to a skewed distribution of the latent outcome in one or more treatment groups. Conventional estimation methods, like the maximum likelihood estimator, then yield a biased estimate of the ATE, because they assume a normal distribution of the latent outcome variable. In order to investigate to what extent a non-normal distribution affects the ATE estimate, we conducted a simulation study in which we systematically varied the conditional distribution of the latent outcome in control and treatment. Results showed that skewed distributions only led to a bias if the distributions of the latent outcome variable differed between control and treatment. We will further discuss these results and their implications for estimating treatment effects.

symposium

Response Time Modeling in Psychometrics

Steffi Pohl

11:30 - 12:50

Salon Hölderlin

Talk

Disentangling Missingness Due to a Lack of Speed From Missingness Due To Quitting

Response time modeling in psychometrics

Keywords:
Item Response Theory
Missing Data
Response Times

Missing values at the end of a test can occur for a variety of reasons: On the one hand, examinees may not reach the end of a test due to time limits and a lack of speed. On the other hand, examinees may not attempt all items and end the test early due to, e.g., fatigue or a lack of motivation. We use response times retrieved from computerized testing to distinguish missing data due to a lack of speed from missingness due to quitting. On the basis of this information, we present an approach that disentangles and simultaneously models and accounts for the different missing data mechanisms underlying not-reached items. The proposed model combines research on missing data and research on response times. In doing so, the model a) supports a more fine-grained understanding of the processes underlying not-reached items and b) yields less biased and more efficient ability estimates. In a simulation study, we evaluate the proposed model and compare its performance to current state-of-the-art models for not-reached items. In an empirical study, we show which insights into test-taking behavior can be gained using this model.

Talk

Response Time Models for Automated Test Assembly

Response time modeling in psychometrics

Keywords:
Item Response Theory
Response Times
Test Assembly

In the recent past, Automated Test Assembly (ATA) methods have been developed to enable the automatic generation of comparable test forms. In high-stakes assessments, comparable test forms are required for the comparison of persons across test forms. As it is common practice in high-stakes tests to use a fixed time limit across test forms and to score not-reached items as incorrect responses, test length is a crucial property of test forms. Research has shown that items differ not only in their average response times but also in how discriminating they are for differences in participants’ speed. Van der Linden (2011) showed that balancing the parameters retrieved via the Hierarchical Response Time Model is useful to achieve the same degree of speededness across test forms, even for different levels of speed. However, the proposed model is rarely used in practice, partly because the consequences of differentially speed-sensitive test forms in high-stakes settings have not yet been investigated.
In a simulation study we show that test forms with equal average difficulty and length but with different speed discrimination can yield different person parameter estimates for certain speed levels. This demonstrates that it is necessary to include speed discrimination parameters in the assembly of test forms to prevent the allocation of test forms to individuals from having an impact on the ability estimation. In a second simulation study we found that the approach of van der Linden (2011) can prevent bias in person parameter estimation. The approach was compared to ATA methods using average response times and no response time information. Both of the latter approaches resulted in biased person parameter estimates. We conclude that using the Hierarchical Response Time Model in Automated Test Assembly is crucial for the assembly of fair test forms.
van der Linden, W. J. (2011). Test design and speededness. Journal of Educational Measurement, 48 (1), 44–60.

Talk

Response Times and Latent Response Style Classes in Noncognitive Measures

Response time modeling in psychometrics

Keywords:
Item Response Theory
Response Styles
Response Times

The purpose of the study is to examine the relationship between response times and response styles in the assessment of noncognitive constructs. Personality constructs, attitudes, and other noncognitive variables are often measured using rating scales. These scales can be biased by respondents giving invalid responses in the form of response styles (RS) due to low motivation, fatigue effects, or problems understanding the questions. RS are defined as respondents’ tendencies to respond in a systematic way independent of the item content (Paulhus, 1991). They can affect the dimensionality of the measurement (Chun, Campbell, & Yoo, 1974), the validity of survey data (cf. Baumgartner & Steenkamp, 2001; Dolnicar & Grun, 2009; Morren, Gelissen, & Vermunt, 2012), and the comparability of test scores.
The current study is based on a multi-process IRTree approach (Böckenholt, 2012) and multidimensional extensions (Khorramdel & von Davier, 2014; von Davier & Khorramdel, 2013) to detect and correct for response styles in rating data. A multidimensional IRTree approach combined with mixture IRT models is applied to rating data from PISA 2015. This approach allows modeling respondents’ behavior closely by decomposing rating data into multiply nested response sub-processes (binary pseudo items) separating different types of RS from trait-related responses. Because the measurement of RS is not straightforward – not all respondents show RS and the ones who do may not show it to the same extent or in the same direction – mixture IRT models are applied using the pseudo items to differentiate between groups of respondents with different response behavior. The resulting latent classes are then used to identify respondents who show RS and are related to response times as covariates in the mixture distribution modeling approach.

Talk

A Finite-State Machine Approach to Extract Item Response Times From Questionnaire Item Batteries

Response time modeling in psychometrics

Keywords:
Computer-based Assessment
ICT Familiarity
Log Data Analysis
Response Time Modeling

A potential benefit of modeling response times is to increase the measurement precision of constructs defined by responses, provided that response times are systematically related to them. Response time models for test items (e.g., van der Linden, 2007) or Likert-type questions (e.g., Ferrando & Lorenzo-Seva, 2007) require time measures at the item level, defined as the time difference between the onset of item presentation and the answer. Response times can easily be recorded for each item under very general conditions (labeled a “one-item-one-screen” design by Reips, 2002) in computer-based assessments. However, to date there is no well-established method for deriving response times when multiple questions are presented on one screen simultaneously, a design often used for item batteries. Based on a general framework for the analysis of log data using finite state machines (Kroehne et al., 2016), a method to extract response times for item batteries is presented. Instead of using the total time on screen, the proposed approach uses log events such as answer-change events to aggregate time differences between subsequent answers as response times at the item level, while accounting for the self-selected order of responses within the battery. The proposed method is illustrated using data from the ICT familiarity questionnaire of the PISA 2015 context assessment. Item response times are used to increase measurement efficiency by using them as predictors of the measured constructs in latent regression models (Ramalingams, 2017). The illustration shows that adding item-level response times increases the reliability of the generalized partial credit models, for instance, for the subscale “Use of ICT outside of school” from 0.781 to 0.794, whereas adding only the total time on screen does not (0.782). The relationship between the measured construct and time implied by this method, as well as alternative psychometric approaches, will be discussed with respect to response times for item batteries.
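The core of the proposed extraction step can be sketched as follows; the event representation and the rule that the first answer is timed from screen onset are illustrative assumptions, not the exact finite-state-machine implementation of Kroehne et al. (2016).

```python
# Sketch of item-level response-time extraction for an item battery shown
# on a single screen: response time is the time difference between
# subsequent answer-change events, respecting the self-selected order in
# which the respondent answered the items.

def item_response_times(screen_onset, answer_events):
    """answer_events: chronologically ordered (timestamp, item_id) pairs
    from the log. Returns {item_id: response_time}, where each time is
    the gap to the preceding answer event (or to screen onset for the
    first answer). A later change to the same item overwrites its time."""
    times = {}
    prev = screen_onset
    for ts, item in answer_events:
        times[item] = ts - prev  # time since previous answer event
        prev = ts
    return times

events = [(3.0, "q2"), (5.5, "q1"), (9.5, "q3")]
print(item_response_times(0.0, events))
# {'q2': 3.0, 'q1': 2.5, 'q3': 4.0}
```

Note that the per-item times sum to the total time on screen up to the last answer, which is why the decomposition carries strictly more information than the screen-level time alone.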

session

Experimental Design

Volker Kraft

11:30 - 12:40

Salon Novalis

Talk

Statistical Power in Pooled Time Series

Keywords:
Behavioral Memory
Pooled Time Series
Smoking Behaviour
Statistical Power
Tobacco

With the aim of finding the total number of weeks for which smoking behaviour is rooted in the organism, a daily register was kept for 12 weeks (84 days) counting the number of cigarettes smoked by a sample of 62 Spanish university students, and the data were subsequently analysed using pooled time series. The results show that smoking behaviour follows an AR(2)(7)8 model, that is, smoking behaviour has a 56-day memory.
We wanted to check whether our study has enough statistical power to confirm that the results obtained here are sensitive to the real values of our estimators. Given the absence of previous studies, 10 subjects from the initial sample were analysed a posteriori; an AR(1)(7)1 model was obtained (R2 = .50), with lag 56 having a statistical power of .249.
Using the G*Power statistical software, the sample size needed to obtain a statistical power of .90 for lag 56 was estimated, resulting in a minimum of 1428 useful data points, that is, 51 subjects, because when the dependent variable is lagged each subject contributes only 28 useful data points (84 - 56).
To sum up, we conclude that time series analysis has poor statistical power, so samples for this type of analysis should be quite large. Furthermore, the number of subjects needed to obtain adequate statistical power and effect size should be checked in a previous study or, if that is not possible, in an a posteriori analysis.
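The sample-size arithmetic in the abstract can be reproduced directly; the 1428-data-point target is taken from the abstract's G*Power estimate, not recomputed here.

```python
import math

# Reproducing the abstract's arithmetic: with 84 daily records and a lag
# of 56, each subject contributes 84 - 56 = 28 useful data points, so the
# 1428 useful data points suggested by G*Power require ceil(1428/28) = 51
# subjects.
days, lag = 84, 56
useful_per_subject = days - lag        # 28
required_data = 1428                   # from the G*Power estimate
subjects_needed = math.ceil(required_data / useful_per_subject)
print(useful_per_subject, subjects_needed)  # 28 51
```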

Talk

Understanding (the) Power in Designed Experiments

Keywords:
Experimental Design
Experimental Design
Mixed Model Analysis
Optimal Design
Power Calculation

Computer-generated or optimal designs provide an accessible approach to efficiently learning from data in many situations. A live demo of a designed experiment will focus on data visualization at several steps of the integrated workflow. Visualization helps not only to better understand interactions and variability in the factor space, but also to communicate findings derived from the analyzed experiment.
Based on a real-world scenario, JMP Pro will be used for designing a repeated measures (split-plot) experiment and for the mixed-model analysis. Integrated simulators support a-priori power calculations during the design phase and empirical statistics based on modeling results during the analysis phase. The demo case and other resources will be shared to get everybody started with state-of-the-art tools for statistically designed experiments.

Talk

Degrees of Freedom Approximations in Multilevel Meta-Analysis of Standardized Single-Case Experiment

Keywords:
Degrees of Freedom
Experimental Design
Multilevel Meta-Analysis

Multilevel modeling can be used to synthesize the results of multiple single-case experimental designs. We are interested in discovering to what degree using different methods of estimating the degrees of freedom, including the default ‘containment’, the Satterthwaite, and the Kenward-Roger methods, in the analysis of standardized raw data and standardized effect sizes can improve statistical inferences on the fixed effects. We also used the ‘sandwich estimator’ to evaluate the improvement in the standard errors of the fixed-effects estimates. The raw data were simulated using a three-level model that allows for a possible effect of the intervention on the level and on a time trend. The standardized raw data were synthesized in a three-level meta-analysis. We also calculated the ordinary least squares estimates of the case-specific intervention effects on the level and on the slope, and combined these standardized effect sizes over cases and studies in two separate univariate three-level meta-analyses. Results indicate, as expected, that the quality of the fixed effect estimates is unaffected by the method of estimating the degrees of freedom, and for all methods the estimates are less biased as the number of measurement occasions increases. The number of measurement occasions and the number of studies have a significant impact on the bias of the standard error estimates, operationalized as the gap between the standard error estimates and the standard deviation of the fixed effect estimates: as the number of occasions and/or studies increases, the negative bias decreases significantly. Surprisingly, the standard error estimates of the mean effect sizes are more underestimated when sandwich estimation is applied, and as a result the sandwich estimation procedure produced too small a coverage proportion of the confidence intervals for effect estimates. The approach for approximating the degrees of freedom did not considerably affect the coverage proportion of the confidence intervals.

13:00

lunch

Lunch Break

13:00 - 14:00

14:00

poster

Poster Session 2

14:00 - 15:30

meeting

Executive Committee Meeting

NA

14:00 - 15:30

Salon Schlegel

15:30

state-of-the-art

The role of time in dynamic models of change

Noémi Schuurman

15:30 - 16:00

Saal Friedrich Schiller

state-of-the-art

Optimal Design

Mirjam Moerbeek

15:30 - 16:00

Salon Schlegel

16:00

coffee

Coffee Break

16:00 - 16:30

16:30

session

Structural Equation Modeling

Jonathan Helm

16:30 - 17:50

Saal Friedrich Schiller

Talk

Dust Yourself Off and Try Anew: Reproducing ANOVA Using SEM

Keywords:
Analysis of Variance
Model Equivalence
Structural Equation Modeling

This presentation demonstrates how to reproduce the results (e.g., F-values, p-values) from different kinds of analysis of variance (between subjects, repeated measures, and multivariate ANOVA) using structural equation modeling (SEM). The presented approach differs from prior approaches, which incorporated indicator variables (e.g., dummy variables, effects codes) into a single SEM (analogous to regression). The approach presented here translates the main effects, interaction effects, and distributional assumptions of ANOVA into a set of SEMs with specific equality constraints, and then reproduces the ANOVA by statistically comparing the SEMs (i.e., difference testing). The results are virtually identical (i.e., sample statistics and p-values are equivalent to the third or fourth decimal) across the two approaches for a range of empirical examples, and the models can be extended to relax distributional assumptions (e.g., homogeneity of variance and sphericity) underlying ANOVA. Therefore, this presentation provides researchers with a series of stepping stones for using SEM in place of ANOVA, which may facilitate analyses that extend beyond mean differences.

Talk

Dealing With Hypotheses That Depend on the Scaling of Latent Variables

Keywords:
Identification Constraints
Structural Equation Modeling

Latent variables in a structural equation model must be scaled in order for the model to become identified. Scaling methods include setting one loading per latent variable to unity (fixed marker method), setting the variances of all latent variables to unity (fixed factor method), or setting the average of all loadings per latent variable to unity (effects coding method). Unfortunately, model parameters estimate different quantities under these scaling methods and thus have different interpretations (Klößner & Klopp, 2017). This talk explores the types of hypotheses about model parameters that are affected by the choice of scaling method, as well as options for testing one's original hypotheses of interest in these cases.

Talk

Metric Measurement Invariance of Latent Variables: Foundations, Testing, and Correct Interpretation

Keywords:
Measurement Invariance
Structural Equation Modeling

In multi-group and longitudinal studies, it is important to test for metric measurement invariance (MI). Recently, it has been pointed out that currently used test procedures for MI are not complete in the sense that additional assumptions about the referent indicator's invariance are needed in order to conclude that actual data satisfy MI (Raykov, Marcoulides & Li, 2012, Educational and Psychological Measurement, 72, 954-974).
Introducing the new concept of proportional factor loadings (PFL), we show that tests for metric MI actually only test for PFL, because PFL is empirically indistinguishable from metric MI. More precisely, if the loadings in the population are only proportional over groups or over time, the manifest variables' implied distribution is identical to one that stems from invariant factor loadings in the population. Thus, it is impossible to differentiate between metric MI and PFL based on empirical data only.
Using Monte Carlo studies, we demonstrate that the power to detect violations of metric MI drastically deteriorates the closer the factor loadings' pattern of non-invariance comes to a PFL pattern, explaining results in the literature that uniform non-invariance is very hard to detect (Yoon & Millsap, 2007, Structural Equation Modeling, 14, 435-463). Furthermore, our results on PFL explain why the choice of a referent indicator does not affect the results of testing for MI (Johnson, Meade & DuVernet, 2009, Structural Equation Modeling, 16, 642-657), while tests about the equality of latent variables' variances may lead to potentially wrong conclusions when the data only satisfy PFL, but not MI.
With respect to partial metric MI, we find that empirically, it is impossible to differentiate invariant indicators from non-invariant ones: one can only detect which indicators form subsets whose loadings are proportional over groups or time. For detecting these subsets of indicators, we develop the partition method, and show that it works well.

Talk

The Interpretation of Parameter Estimates in Structural Equation Models

Keywords:
Identification Constraints
Latent Variable Analysis
Parameter Estimates
Structural Equation Modeling

Estimated parameters in statistical models usually represent population parameters. A classical example is a regression model with observed variables. The estimated slopes and intercept represent the respective population parameters. But this is no longer the case when latent variables are invoked in the model. This results from the necessity to scale the latent variable or, stated otherwise, to impose identification constraints on the latent variable. Thus, the estimated parameters do not represent population values, but an algebraic expression combining the population values of the parameter of interest and the parameter(s) used to impose the identification constraint. In practice, things become even more complicated because there are three well-known methods for achieving identification: the fixed marker, the fixed factor and the effects coding method.
In the literature, the correct interpretation of estimated parameters has not received wide attention. An exception is Raykov, Marcoulides and Li (2012, Educational and Psychological Measurement, 72, 954–974), who demonstrated that, when using the fixed marker method in factor models, an indicator’s estimated loading represents the ratio of the indicator’s population loading to the referent indicator’s population loading.
The aim of this contribution is to derive the correct interpretation of estimated parameters in structural equation models for all three widely used scaling methods. First, we deduce the interpretation of all estimated parameters in confirmatory factor analysis models, extending the results of Raykov et al. to the other scaling methods and model parameters. Second, we deduce the interpretation of parameters in recursive structural equation models. Finally, we show how our findings help to explain some recent findings regarding the effects of ignoring measurement invariance for path coefficients in structural equation models (Guenole & Brown, 2014, Frontiers in Psychology, 5, 980).

session

Latent Variable Analysis

Heidelinde Dehaene

16:30 - 17:50

Salon Schlegel

Talk

Semiparametric Regression Models for Indirectly Observed Outcomes

Keywords:
Latent Variable Analysis
Nonparametric Statistics
Regression

Although it is not always obvious at first glance, research studies across different fields often concern latent variables. These are variables that are not directly observed, but rather theoretically postulated or empirically inferred from observed variables (also known as proxies). The use of proxies can be motivated by theoretical or practical considerations (e.g. the availability of sophisticated devices). Examples include the body mass index as a proxy for body fat percentage and the Beck Depression Inventory-II as a proxy for depression.
In the first part of this presentation we illustrate by examples that the relationship between the outcome of interest and the proxy can be non-linear. The majority of available methods (e.g. standard structural equation models), however, typically assume that this relationship is linear. We illustrate how slight deviations from linearity can have a substantial impact on the validity of these inferential procedures.
In the second part of this presentation we present a new methodology that no longer imposes this linearity restriction, but only relies on monotonicity. Our methodology originates from the combination of three major statistical concepts: measurement error, a semiparametric linear transformation model and binary regression.
The result is a model that enables us to quantify the effect of observed covariates on a summary measure of the unobserved outcome. We propose to quantify the effect of a covariate on the outcome in terms of the probabilistic index, i.e. the probability that the outcome of one subject exceeds the outcome of another subject, conditional on covariates (Thas et al., 2012, De Neve and Thas, 2015). We evaluate the proposed estimators empirically in a simulation study.
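The probabilistic index at the heart of this quantification can be illustrated with a plain two-sample empirical estimate; this marginal version, with ties counted as one half, is an assumption for illustration and not the covariate-conditional model of the talk.

```python
# Illustrative empirical estimate of the probabilistic index P(Y1 < Y2):
# the proportion of cross-group pairs in which the second group's outcome
# exceeds the first group's, with ties counted as 1/2. This is the plain
# marginal version, not the covariate-conditional estimator of the talk.

def probabilistic_index(y1, y2):
    wins = sum(1.0 if b > a else 0.5 if b == a else 0.0
               for a in y1 for b in y2)
    return wins / (len(y1) * len(y2))

print(probabilistic_index([1, 2, 3], [2, 3, 4]))
```

A value of 0.5 indicates no tendency for either group's outcomes to exceed the other's, which is why the index is a natural summary measure for an indirectly observed, monotonically transformed outcome.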

Talk

Impact and Dimensionality: The Performance of Logistic Regression in Differential Item Functioning

Keywords:
Differential Item Functioning
Logistic Regression
Measurement Invariance
Multidimensionality

Conventional differential item functioning (DIF) approaches such as logistic regression (LR) often assume unidimensionality of scales and match participants in the reference and focal groups based on total scores. However, many educational and psychological assessments are multidimensional by design, and a matching variable based on total scores that does not reflect the test structure may not be able to eliminate the impact of multidimensional items on DIF detection. Thus, we propose using all sub-scores of a scale in LR for DIF detection and compare its performance with alternative matching methods in LR, including the total score and individual subdomain scores. The present study manipulated three factors (test structure, group impact, and the number of cross-loading items) to compare false positive (FP) and true positive (TP) rates. We assumed that 500 or 1000 participants per group answered 20 items reflecting two dimensions; the tested item (the 21st item) measured one or two domains; group impact comprised no impact, one group with a higher average ability on the first domain and no impact on the other, and one group with higher ability on the first domain while the other group was more capable on the second; and 0, 20% or 40% of the first 20 items measured two domains. When the tested item measured a single domain, the conventional LR incorporating the total score as a matching variable yielded inflated false positive rates when there was an impact, and the situation worsened when the two groups differed in ability on both domains. The same patterns were found in LR using a single subdomain score, but the proposed LR using two sub-scores was robust and yielded satisfactory FP and TP rates. When the tested item measured two domains, both the proposed and the conventional LR yielded controlled FP rates across the manipulated conditions, compared to the versions using a single sub-score.
We therefore suggest using LR incorporating all sub-domain scores for DIF detection in multidimensional data.

Talk

Defining Quality of Life by a Multi-Group Network Analysis

Keywords:
Cross-national Survey Research
European Values Study
Measurement Invariance
Model Equivalence
Network Analysis

After years of research, the definition of Quality of Life (QoL) remains elusive. Given the diversity of approaches studying the construct from different perspectives, reaching a consensus on its description has become difficult. Regardless of the research goals, however, learning about the elements composing QoL is needed to improve methodologies for assessing QoL as well as for designing policies aimed at increasing it. The aim of this study is twofold: a) to estimate a network model in which we represent, analyze, and interpret the full complexity of the QoL concept; and b) to analyze the invariance of the QoL concept by conducting a multi-group network analysis in which the model is estimated across groups.
To do so, we will integrate the different dimensions of QoL, taking into account QoL scales from international studies such as the European Values Study (EVS), the European Social Survey (ESS), and the European Quality of Life Survey (EQLS). Secondly, we will use responses to QoL scales from participants from Spanish-speaking countries (i.e. Colombia, Mexico or Spain) and from European non-Spanish-speaking countries (i.e. the Netherlands, Italy or Germany) to create the multi-group network.
Because network analysis has traditionally been used to build psychopathological models, the present study proposes an innovative application of the procedure to another substantive area. Concretely, results will be presented in terms of the methodological contributions of network analyses to understanding the complexity of QoL. In addition, we will discuss cross-cultural differences when comparing the definition and composition of QoL across countries.

Talk

Assessing Individual Change Without Knowing the Test Properties: Item Bootstrapping

Keywords:
Bootstrap
Individual Change
Psychometric Properties
Reliable Change
Significant Change

Introduction: A change between an individual’s responses in two administrations of a test reaches statistical significance when the two confidence intervals for the true scores do not overlap. When constructing the confidence interval for a person's true score on a test, classical procedures involve knowing several properties of the test in a fixed sample, such as the population variance and the reliability or internal consistency. Sometimes those procedures cannot be employed because these properties are unknown or untrustworthy. We propose applying the nonparametric bootstrap method (Efron & Tibshirani, 1994) to the responses given by an individual to the items of the test in order to create confidence intervals for the individual's true test score in situations in which classical procedures cannot be used.
Method and Results: Six databases containing the responses of several groups to one or more subscales have been analyzed: In two of them, there was not an expected change, while in the other four, a change in the criterion of interest was expected after an intervention. In each database two procedures have been applied to create the confidence intervals; a classical one, Estimating the True Score (ETS; Gulliksen, 1950), and the Bootstrap of items (BSI; Botella, Blázquez, Suero, & Juola, in press). The rates of significant change obtained with both procedures were very similar, suggesting that BSI is a promising solution when other methods cannot be applied.
Discussion: The BSI procedure has some advantages over ETS given that BSI requires no distributional assumptions, never needs to be adjusted because of inadequate confidence interval limit values, and the amplitude of the interval generated varies from one individual to another depending on the variability of their responses to the items. But BSI is still a very new procedure and, for assessing its performance, we still need evidence from different research contexts.
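The item-bootstrap idea can be sketched in a few lines; the number of replicates, the percentile-interval method, and the use of the mean item score are assumptions for illustration, not necessarily the exact choices of the BSI procedure.

```python
import random

# Sketch of the item-bootstrap (BSI) idea: resample one individual's item
# responses with replacement, and use the percentile interval of the
# resampled mean scores as a confidence interval for that person's score.
# The replicate count, percentile method, and mean-score summary are
# illustrative assumptions.

def bsi_interval(responses, n_boot=2000, alpha=0.05, seed=1):
    rng = random.Random(seed)
    k = len(responses)
    means = sorted(
        sum(rng.choice(responses) for _ in range(k)) / k
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bsi_interval([3, 4, 2, 5, 4, 3, 4, 2, 3, 4])
print(lo, hi)
```

Because the interval is driven by the variability of this individual's own responses, its width differs from person to person, which is the property the abstract highlights as an advantage over classical procedures.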

session

Item Response Theory

Sebastian Born

16:30 - 17:30

Salon Hölderlin

Talk

Comparing Fixed-Precision Multidimensional Computerized Adaptive Tests for Various Assessment Areas

Keywords:
Computerized Adaptive Testing
Item Banks
Multidimensional Item Response Theory

Computerized adaptive testing (CAT) is an efficient approach to assembling tests: by selecting the most informative items for a given person, relatively short but precise tests can be constructed. In recent years, several studies have shown that taking into account the correlation structure when measuring multiple dimensions simultaneously can improve test efficiency even further. It is not yet entirely clear to what extent the research results on multidimensional computerized adaptive testing obtained in the field of educational testing generalize to fields such as health assessment, where design factors differ considerably from those typically used in educational testing.
In a simulation study, we examine the impact of different item bank properties which are thought to typify two assessment scenarios: health assessment (polytomous items, small to medium item bank sizes, high discrimination parameters) and educational testing (dichotomous items, large item banks, small to medium-sized discrimination parameters) on the performance of fixed-precision (variable-length) CATs. The average absolute bias across dimensions and the total test length were used to evaluate the quality of trait recovery and the measurement efficiency, respectively, for between-item multidimensional CATs and multiple unidimensional CATs. The study shows that the benefits associated with fixed-precision multidimensional CAT hold under a wide variety of circumstances.

Talk

Concurrent Adaptive Tests for Formative Assessments in School Classes

Keywords:
Adaptive Testing
Constrained Test Assembly
Optimal Design
Simulation Study

Tests developed for summative assessment are administered to describe the performance of groups such as classes, for instance, relative to a comparable group. A formative use of such tests (e.g., the interpretation of student’s results in feedbacks to parents) necessitates a higher frequency of assessments and thus requires the increased measurement efficiency afforded by Computerized Adaptive Testing (CAT). CAT allows implementing tests with higher reliability which can be used to identify student needs and to make according adjustments to teaching and learning strategies in the course of ongoing learning events. However, the formative use requires that the test results are not only interpreted at the construct level, but rather at the level of individual items. The sizable item pools required for efficient CAT and the personalization of resulting tests mean that a huge number of different items are administered to students of one classroom, turning the item-level review into a daunting task for teachers. The talk shows how effective control of test overlap can be implemented by constraining adaptive item selection appropriately. Different from e.g. content constraints, the focus of overlap constraints extends across a group of students concurrently taking part in a testing session. The talk shows how test assembly with overlap can be formulated as an extension of the Shadow Test Approach, in which the overlap requirement can be understood as a constraint on a “Shadow Item Pool” from which individual students’ tests are assembled. The assumption of concurrent test-taking poses a number of challenges arising e.g. from individual differences in working speed and their correlation with proficiencies. Simulation studies are presented which explore the measurement properties of the proposed concurrent CAT under different assumptions on student proficiencies and working speed as well as group size, test length and target test overlap.
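The idea of bounding overlap through a constrained pool can be sketched with a toy greedy selector; the capped shared pool, the information stub, and all names are illustrative assumptions, and the MIP-based Shadow Test Approach in the talk is considerably more elaborate.

```python
# Toy sketch of controlling test overlap across a class: every student
# draws items from a shared "shadow item pool", and brand-new items may
# enter the pool only while it is below its cap. The cap bounds the number
# of distinct items administered in the group, and hence forces overlap.
# Greedy information-based selection is a simplification of the talk's
# MIP-based Shadow Test Approach.

def select_item(pool, administered, pool_cap, all_items, info):
    """Pick the most informative eligible item for one student."""
    candidates = [i for i in pool if i not in administered]
    if len(pool) < pool_cap:  # pool still open: new items are allowed
        candidates += [i for i in all_items
                       if i not in pool and i not in administered]
    if not candidates:
        return None
    best = max(candidates, key=info)
    pool.add(best)
    return best

items = list(range(10))
pool = set()
test_a, test_b = [], []
for _ in range(3):  # student A's information peaks at high-index items
    test_a.append(select_item(pool, set(test_a), 3, items, lambda i: i))
for _ in range(3):  # student B prefers low-index items; pool is now full
    test_b.append(select_item(pool, set(test_b), 3, items, lambda i: -i))
print(sorted(test_a), sorted(test_b))  # both students get the same items
```

With the cap removed, student B would receive items 0, 1, 2 and the two tests would not overlap at all, which illustrates the trade-off between individual measurement efficiency and reviewability at the item level.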

Talk

The Dual Side of Classical Test Theory: The Geometry of the Axiom of Common Cause

Keywords:
Classical Test Theory

In Classical Test Theory (CTT), the latent variable theta is characterized by the Axiom of Local Independence (Lazarsfeld, 1950) or the Axiom of Common Cause (Laplace, 1774; Reichenbach, 1956). According to Lazarsfeld (1950, p. 367), this axiom means that "all the within relationships between the [observed scores] should be accounted for by the way in which each [observed score] alone is related to the latent variable". In psychometrics, interest focuses on the latent variable, which is "estimated" by the empirical Bayes estimator (EBE) E(theta|Y). Its theoretical status, meaning and existence are of interest (Suppes & Zanotti, 1981; Borsboom et al., 2003). This paper contributes to these aspects. By endowing CTT with a Hilbert space structure, we review the principal axioms and results of CTT as established by Novick (1968) and Zimmerman (1975). Thereafter, we extend those results by exploiting the duality of E(theta|Y) with respect to E(Y|theta). The following results will be discussed:
1) We define a dual reliability and prove that it is equal to the standard reliability. As a consequence, we prove a more general Spearman-Brown result and show that the quality of the EBE is directly measured by the reliability.
2) If theta is the minimal subspace such that the weak version of the Axiom of Local Independence (Bollen, 2002) holds, then the null space of the dual operator E(theta|Y) is reduced to {0} and consequently E(Y|theta)=theta, and reciprocally. This last result opens the door to identifying the distribution generating theta.
3) Under the minimality of theta, the Axiom of Local Independence has a geometrical formulation, namely that the orthogonal complement of Y has no element in common with theta except {0}.
4) Finally, we make explicit linear subspaces of observables between which the subspace generated by the latent variable is contained. The main consequence is that the latent variable is unobservable rather than merely unobserved.
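For reference, the classical result that the dual-reliability argument generalizes is the Spearman-Brown prophecy formula; the statement below is the standard textbook version, not the paper's more general dual form.

```latex
% Classical Spearman-Brown formula: the reliability \rho_k of a test
% lengthened by a factor k, given the reliability \rho of the original
% test.
\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}
```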

session

Applied Statistics

Patricia Flor Arasil

16:30 - 17:30

Salon Novalis

Talk

Measuring Intercultural Competence: Methodological Issues and Challenges

Keywords:
Item Response Theory
Psychometrics

Increased globalization in education demands that students develop global perspective and intercultural competence in order to be competitive contributors in the global economy. Correspondingly, there has been a growing interest among educators, policy makers, and researchers in understanding and measuring students’ intercultural competence, also known as cultural intelligence.
When it comes to assessing intercultural competence, methodological issues and challenges remain. Conceptually, intercultural competence can be considered a noncognitive attribute/trait comprising attitudes/beliefs, knowledge, strategies and behaviors. These conceptual dimensions often relate to motivation, self-efficacy and personality, which are interconnected. This creates complexity and unique challenges for assessing noncognitive attributes/traits such as intercultural competence. Furthermore, with regard to scoring and reporting, most existing assessment instruments were developed and validated based on classical measurement theory. Alternatively, modern measurement theory (i.e., item response theory) offers promising solutions to issues that have been difficult to solve through classical methods. It provides a stronger psychometric footing that brings assessment development to a new level of precision, efficiency and standardization, and offers greater flexibility and unique benefits for value-added measurement and group comparisons.
This presentation seeks to discuss methodological issues, challenges and possible solutions in relation to assessment development, data analysis and score reports of intercultural competence assessments, drawing on preliminary findings of a study on assessing intercultural competence for college students.

Talk

Irritability and Its Memory: A Time Series Model

Keywords:
Behavioral Memory
Longitudinal Data
Mood
Multilevel Analysis
Time-Intensive Longitudinal Studies

Diary registration, used as a tool to collect data for exploring how a person changes over time within natural settings, has increased during the past years (Cranford et al., 2016). Daily data can be very useful for capturing cyclical patterns and offer the possibility of studying such effects (Liu & West, 2016). Furthermore, emotions are subjective phenomena that were initially understood as internal states that could not be observed or measured. Emotions can be activated for different reasons (Izard, 1977) and their duration is shorter than that of mood states. In fact, mood is said to last hours or days (Ekman, 1992; Frijda, 2009), but it is not always easy to know why you are in a certain mood, because the event or object that put you in that mood may no longer be present, as Lazarus (1991) noted. Negative mood reflects irritability. We also wanted to test the influence that day of the week, holidays, gender and age could have.
A total of 74 college students participated voluntarily. They registered their mean mood irritability at the end of each day for a minimum of 60 consecutive days. We used a longitudinal design to collect the data and an intensive longitudinal method with time series to obtain the results.
Results show that mood irritability has a memory of up to seven weeks. We did not find significant differences for days of the week, holidays, age, gender, or the interaction between gender and sexual desire.

Talk

Use of Low Cost Tools in Ergonomic Research of Mobile Restaurants in Western India

Keywords:
Anthropometrics
Direct Observation
Ergonomics
Photography

Mobile restaurants are very popular food joints in Gandhinagar, a city in the Western Indian state of Gujarat. It is imperative that such mobile restaurants be designed with human factors in mind; otherwise, difficulties in drawing customers or inefficiency in the kitchen area could lead to a decline in productivity and business for the people involved. One such restaurant, selling special snacks called "vada pau" and "dabelie" (essentially modified burger-like foods), was analysed at the request of the restaurant association. The methodology applied was direct observation and activity analysis of both the customers and the cooks in the kitchen, with an eye to locating the nature, extent, and exact position of ergonomic problems in the space. This methodology also gave insights into customer behaviour in terms of group formation, queuing, posture, and reachability. It was supplemented with still photography and videography of the entire site to record postural issues and the time taken to perform different activities. To gain further insights, both open- and closed-ended questionnaires were used. The field data were then analysed in the laboratory to develop concepts for ergonomically designed spaces, which were in turn tested on a grid board with mannequins and on real users. The output of this research showed that it is possible to conduct reliable ergonomic research even with low-cost tools, which is pertinent in a country like India, where the use of expensive data-collection tools is not always feasible.


19:00

event

Conference Dinner

Paradiescafé Jena

NA

19:00 - 00:00

Friday, July 27
↑ Go to top ↑

09:00

keynote

How to make missingness ignorable in longitudinal modeling

N.N.

09:00 - 09:45

Saal Friedrich Schiller

10:00

session

Item Response Theory

Christoph König

10:00 - 11:00

Saal Friedrich Schiller

Talk

Reducing Sample Size Requirements of the 2PL With a Bayesian Hierarchical Approach

Keywords:
Bayesian Statistics
Item Response Theory
Psychometrics
Sample Size

Given their complexity, item response theory (IRT) models require large samples for accurate item calibration and are therefore considered primarily large-scale methods. Approaches are thus needed that allow complex IRT models to be estimated resource-efficiently and yield accurate parameter estimates even in very small samples. Bayesian hierarchical IRT modeling with a non-centered parameterization of the item parameters, combined with a Cholesky decomposition of their covariance matrix, is a promising approach to increasing the accuracy of parameter estimation in small-sample situations when appropriate prior information is not available. Careful consideration is required especially regarding the choice of hyperprior parameters for the variance components of the item parameters, $\tau_\alpha$ and $\tau_\beta$. In the context of the hierarchical 2PL model and small sample sizes (N < 500), this Monte Carlo simulation study investigates differences in sensitivity between the Inverse Gamma, Cauchy, and Exponential distributions regarding their impact on parameter accuracy across different specifications. Results show that the estimation accuracy of the discrimination parameter is sensitive to different specifications of the Inverse Gamma, but robust against different specifications of the Cauchy and Exponential distributions. Differences in sensitivity between the three hyperprior distributions are most distinct for short test lengths (k = 25) and very small sample sizes (N < 100). Thus, using either the Cauchy or the Exponential distribution as a hyperprior for the variance components considerably reduces the sample size requirements of the 2PL model. This is advantageous when there are few items and respondents (e.g., in university exams), where estimating the item parameter variance is often problematic because of sparse data, and it opens resource-efficient possibilities for accurate item calibration in small-sample situations.
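The core reparameterization can be sketched numerically. This is a minimal illustration, not the authors' implementation: standardized "raw" item parameters are drawn from N(0, I) and mapped through the Cholesky factor of an assumed covariance matrix (all numeric values below are invented for illustration).

```python
import numpy as np

# Non-centered parameterization sketch: instead of sampling
# (log-discrimination, difficulty) pairs directly from MVN(mu, Sigma),
# draw z_i ~ N(0, I) and set params_i = mu + L @ z_i with L = chol(Sigma).
# MCMC samplers then explore well-behaved standard-normal coordinates.

rng = np.random.default_rng(0)

mu = np.array([0.0, 0.0])      # means of (log-discrimination, difficulty)
tau = np.array([0.3, 1.0])     # assumed variance-component scales tau_alpha, tau_beta
rho = 0.2                      # assumed item-parameter correlation
Sigma = np.diag(tau) @ np.array([[1.0, rho], [rho, 1.0]]) @ np.diag(tau)
L = np.linalg.cholesky(Sigma)

n_items = 100_000
z = rng.standard_normal((n_items, 2))   # "raw", non-centered parameters
item_params = mu + z @ L.T              # implied (log a_i, b_i) per item

# The implied covariance matches Sigma up to Monte Carlo error.
emp_cov = np.cov(item_params.T)
print(np.round(emp_cov, 2))
```

The non-centered draws carry the same joint distribution as direct (centered) sampling, which is the property the hierarchical 2PL exploits.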

Talk

Instructional Sensitivity of Polytomous Test and Questionnaire Items

Keywords:
Educational Assessment
Item Response Theory
Multilevel Modeling

Student achievement is widely used in educational research for drawing inferences on teaching. Valid inferences on teaching require that tests are instructionally sensitive, that is, capable of capturing effects of classroom instruction (Polikoff, 2010). Yet, despite numerous instructional sensitivity (InSe) measures to evaluate dichotomous items, none are available for polytomous items. Polytomous items may not only be sensitive as a whole, but also response categories may vary in their degree of InSe. Thus, the aim of our study is to provide measures for the InSe of polytomous items.
We advance a longitudinal multilevel IRT model (LMLIRT; Naumann, Hochweber, & Hartig, 2014) to accommodate polytomous items. The LMLIRT model provides measures for two facets of InSe: a) global sensitivity (average change in classroom-specific item difficulty), and b) differential sensitivity (variation of change across classes). We combine the LMLIRT model with the partial credit model (Masters, 1982), using a) the standard parametrization for achievement items and b) the expanded parametrization for Likert items. So far, Likert items have been neglected when evaluating InSe. As an illustration, we apply our model to achievement test and questionnaire data from the DESI (3613 students, 135 classes) and the InSe (815 students, 46 classes) studies.
Results indicate that the model works well in empirical application. Polytomous items can be considered a) insensitive, if average change in location and thresholds is zero and there is no variation across classes, b) globally sensitive, if there is nonzero change in location or at least one threshold, c) differentially sensitive, if there is variation of change across classes in location or at least one threshold, and d) globally and differentially sensitive, if both b and c apply. We are confident that such information fosters valid use and interpretation when individual level student responses are used for drawing inferences on teaching.

Talk

Estimation of a Multidimensional Item Response Model Using Bayesian Nonparametrics

Keywords:
Bayesian Statistics
Multidimensional Item Response Theory
Nonparametric Statistics

Parametric item response models do not always show acceptable fit to data obtained from psychological tests. In these cases, one option is to resort to more flexible nonparametric models. Peress (2012) provides an identification proof for a very general item response model, which can be viewed as a multidimensional compensatory model with a nonparametric Item Characteristic Curve (ICC).
The subject of this talk is the application of Bayesian nonparametrics to the estimation of the ICC and the item and person parameters of this model. A reparameterisation is proposed which allows for a Bayesian formulation of the model. As the parameter space of the ICC is a function space, a Dirichlet Process Mixture of normal cumulative distribution functions can be chosen as a prior for the ICC. This allows for the derivation of all full conditionals of the joint posterior distribution of the parameters, and thus for the implementation of a Gibbs sampler.

session

Missing Data

Mario Lawes

10:00 - 11:00

Salon Schlegel

Talk

Planned Missing Data Designs: Investigating the Efficiency of a Three-Method Measurement Model

Keywords:
Bias-Correction
Planned Missing Data
Sample Size
Structural Equation Modeling
Two-Method Measurement

Planned missing data designs are an elegant way to incorporate expensive gold standard methods (e.g., biomarkers, ambulatory assessment) and cheaper but systematically biased methods (e.g., self- and informant ratings) into research designs to ensure high statistical power while keeping costs low. This talk outlines a planned missing data design with one expensive and two cheap methods (three-method measurement [3-MM] design). The statistical efficiency of this 3-MM design is investigated in a simulation study and compared to the efficiency of corresponding two-method measurement (2-MM) designs. In most conditions, planned missing data designs yielded higher statistical efficiency than a complete-case design. Within the planned missing data designs, 3-MM designs can increase statistical efficiency compared to 2-MM designs when the additional cheap measure of the 3-MM design is inexpensive and shares only a small amount of method variance with the initial cheap measure, and when the gold standard measure is highly expensive compared to the cheap measure. Implications for required sample sizes are discussed.

Talk

Handling Missing Data in Single-Case Experiments

Keywords:
Experimental Design
Missing Data
Randomization Test
Simulation Study

Single-case experiments have become increasingly popular in educational and behavioral research. However, the analysis and meta-analysis of single-case data are often complicated by the frequent occurrence of missing or incomplete observations in the data series. One reason for this may be that practitioners or participants are often required to record data themselves at regular intervals, sometimes even retrospectively, and the human element in this process leads to missingness and incompleteness. If missingness cannot be avoided, it becomes important to know which handling strategies are optimal, because missing or incomplete data, or inadequate data handling strategies, may lead to experiments no longer "meeting standards" set by, for example, the What Works Clearinghouse.
For the examination and comparison of several strategies to handle missing data, we simulated complete data sets for phase designs, alternating treatments designs, and multiple baseline designs. We then introduced missingness in the simulated datasets using multiple probabilities of data being “Missing Completely At Random”. We evaluated the Type I error rate and the statistical power of a randomization test for the null hypothesis that there is no treatment effect, using the different strategies of handling missing data. We compared the operating characteristics for the original dataset (before the introduction of missing data points) with the operating characteristics for three strategies: (1) randomizing a missing data marker and calculating all reference statistics only for the available data points; (2) estimating the missing data points by using minimum mean square error linear interpolation; and (3) multiple imputation methods based on resampling the available data points.
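Strategy (1) above can be sketched in a toy AB phase design. This is a hedged illustration, not the authors' simulation code: the data series, the missing points, and the set of admissible intervention start points are all invented, and the test statistic (phase mean difference) is computed only over available observations.

```python
import numpy as np

# Randomization test for an AB phase design with missing observations.
# The intervention start point is the randomized design element; the
# statistic mean(B) - mean(A) ignores missing (NaN) data points.

y = np.array([2., 3., np.nan, 2., 3., 6., np.nan, 7., 6., 8., 7., 6.])
actual_start = 5                  # observed intervention start (index)
admissible_starts = range(3, 10)  # assumed randomization set for the design

def mean_diff(series, start):
    a, b = series[:start], series[start:]
    return np.nanmean(b) - np.nanmean(a)   # available data points only

observed = mean_diff(y, actual_start)
ref = np.array([mean_diff(y, s) for s in admissible_starts])
p_value = np.mean(ref >= observed)   # one-sided p over the randomization set
print(round(float(observed), 2), round(float(p_value), 3))
```

Because the reference distribution is built from the same design randomization that generated the experiment, the test keeps its validity even with the NaN entries left in place.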

Talk

Testing Missingness for Continuous and Categorical Data

Keywords:
Missing Data
Regression
Simulation
Test

The probability of having missing data in a survey is very close to 1. Missing data must be handled correctly to obtain unbiased and consistent results, and to do so it is essential to know their type. Missing data are generally divided into three single types (Rubin, 1976): missing completely at random, missing at random, and missing not at random; they can also be a mixture of these mechanisms. The first step toward understanding the type of non-observed information generally consists in testing whether or not the missing data are missing completely at random. Several tests have been developed for this purpose, but they run into complications when dealing with non-continuous variables.
Our approach tests whether or not the missing data are missing completely at random using a regression model and a distribution test specific to the type of the incomplete variable. Formally, for a variable with missing data, we compare the predictions of the regression model for observed data with those for unobserved data. Consequently, our test can be applied both to continuous and to categorical variables, which is not the case for the usual procedures. Simulations with six types of single and mixed missing data mechanisms were performed, linear and multinomial regression models with full and restricted information were used, and the quantity of missing data was varied between 1% and 50%. The behaviour of our method across various sample sizes was explored as well. Our simulation results show that, compared to the Little (1988), Jamshidian-Yuan (Jamshidian & Yuan, 2014), and Dixon (1990) tests, our method is at least as powerful for numerical and categorical data and provides additional information, especially in the case of mixed missing data mechanisms. More specifically, our test makes MCAR data easier to detect in the mixed cases and is generally more sensitive to the relative percentage of missing data.
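The core idea of the comparison can be sketched as follows. This is a hedged toy version, not the authors' test: the MAR mechanism, the linear model, and the two-sample z comparison of predictions are all illustrative assumptions.

```python
import numpy as np

# Sketch: regress the incomplete variable Y on a fully observed covariate X
# (complete cases), then compare model predictions for rows where Y is
# observed against rows where Y is missing. Under MCAR the two prediction
# distributions should not differ.

rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# MAR (not MCAR) mechanism: Y is more likely missing when X is large.
missing = rng.random(n) < 1 / (1 + np.exp(-2 * x))

beta = np.polyfit(x[~missing], y[~missing], deg=1)  # fit on complete cases
pred = np.polyval(beta, x)                          # predictions for all rows

# Two-sample z statistic comparing predictions of observed vs. missing rows.
a, b = pred[~missing], pred[missing]
z = (b.mean() - a.mean()) / np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))
print(abs(z) > 1.96)   # the non-MCAR mechanism is flagged
```

For a categorical incomplete variable, the same comparison would be made with predictions from a multinomial model and a distribution test suited to that variable type.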

symposium

Applying Subjective Bayes to Real Life Data

Fayette Klaassen

10:00 - 11:00

Salon Hölderlin

Talk

Including and Assessing Expertise via Prior Probabilities.

Applying subjective Bayes to real life data

Keywords:
Bayesian Statistics
Elicitation
Prior Knowledge

The relative evidence for a set of hypotheses can be updated continuously with new data into posterior probabilities. In order to do so, (subjective) prior probabilities need to be specified for the hypotheses considered. Prior probabilities are often set equal in applied research, although this might not be an accurate representation of researchers' beliefs. This talk discusses the need for subjective prior probabilities in Bayesian hypothesis testing and proposes an application that helps applied researchers (say, psychologists) better understand and formulate prior probabilities based on subjective ideas.
In practice, companies have many employees who are at the same level and are each considered experts in their field, yet evidence for such assumed expertise is hardly ever obtained. We proposed to let experts predict new data in the form of a probability distribution, thereby explicitly incorporating both tacit knowledge and uncertainty. These predictions are then compared to actual new data, and we adapted a prior-data conflict measure to assess the appropriateness of the experts' beliefs. We used this measure to rank regional directors of a large financial institution based on their ability to predict future turnover.

Talk

Bayesian Approximate Measurement Invariance: When You Have Too Little or Too Much Data.

Applying subjective Bayes to real life data

Keywords:
Bayesian Statistics
Measurement Invariance

In Bayesian structural equation modeling (BSEM), informative, small-variance priors with mean 0 are used to replace exact zero constraints. One application of BSEM is the testing of measurement invariance (MI). Traditionally, MI is accepted when exact zero constraints on the differences in intercepts and factor loadings across groups (e.g., countries) or time points hold. This exact approach works well with a reasonable number of countries or time points, but in large-scale applications it is cumbersome and frequently leads to rejection. We show how BSEM can overcome the difficulties of exact MI in large-scale applications.
In addition to being beneficial in large-scale applications, BSEM estimation of MI can also be advantageous when samples are small. In such scenarios, traditional methods of testing MI can run into computational problems and fail to converge to a reliable solution. BSEM overcomes these computational issues and also allows for the introduction of priors on other parameters in the model. In this talk, we discuss an empirical application of Approximate MI using longitudinal survey data.

Talk

Expert-Weighted Prior Information: Applications in Psychology and Veterinary Medicine

Applying subjective Bayes to real life data

Keywords:

We systematically acquired prior knowledge for a structural equation model on the development of working memory in young adolescents with severe behavioural problems, a subgroup of whom frequently used cannabis. To collect prior information, a systematic search for meta-analyses, reviews, and empirical papers was conducted. A clinical and a scientific expert weighted the information. Combined with general knowledge, the final prior distributions were constructed. Based on our experience, we present a set of general guidelines for collecting prior knowledge and formalizing it in prior distributions.
Furthermore, we present two examples from veterinary medicine. First, the treatment effect of oral Glucosamine/Chondroitin on gait characteristics of aged horses was examined. Historical studies from across species resulting from a systematic review were weighted by the expert for their clinical relevance and incorporated into power priors. In another example, a random effects logistic regression diagnostic model was developed for detecting subclinical ketosis in dairy cows. Cluster level information about feed content and milk production was used by the expert to estimate the relative position of each herd regarding other herds in the population. The elicited judgement was incorporated into the priors for the random effects.

11:00

coffee

Coffee Break

11:00 - 11:30

11:30

session

Longitudinal Data

Manuel Arnold

11:30 - 12:50

Saal Friedrich Schiller

Talk

Individual Parameter Contribution Regression for Longitudinal Data

Keywords:
Heterogeneity
Longitudinal Data
Measurement Invariance
Structural Equation Modeling

Structural equation models (SEM) are widely applied in the behavioural and social sciences to analyse the relationship between observed and latent variables. A standard assumption underlying many SEMs is that parameter values are equal for all observations in the sample. Researchers therefore need to determine whether relevant heterogeneity exists in their samples; otherwise, they run the risk of reporting meaningless parameter estimates and inaccurate standard errors.
During the last decades, several SEM extensions have been developed to identify and account for heterogeneity. One of those approaches is “individual parameter contribution” (IPC) regression, proposed by Oberski (2013). IPC regression is conducted in three steps. First, a theory-guided SEM is fitted. Second, the contributions of every individual to the model parameters are calculated based on the first-step model. Third, heterogeneity in the model parameters is explained by regressing the contributions on grouping variables or individual characteristics.
This talk aims to illustrate how IPC regression can be used as a data-driven procedure to detect and provide estimates of individual or group differences in contemporary longitudinal structural equation models, focusing on autoregressive panel models in discrete and continuous time. It will be shown that equality and nonlinear parameter constraints, often encountered in longitudinal models, may bias the IPC regression estimates. A novel and robust estimation procedure for IPC regression and its software implementation will be presented.
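The three IPC-regression steps can be sketched in the simplest possible setting. This is a hedged toy illustration, not Oberski's or the authors' implementation: the "model" is a single mean parameter, for which each person's contribution is just their own score, and the simulated grouping variable is invented.

```python
import numpy as np

# Toy IPC regression (Oberski, 2013), three steps on a one-parameter model.

rng = np.random.default_rng(3)
n = 1000
group = rng.integers(0, 2, size=n)            # 0/1 grouping variable
y = 10.0 + 1.5 * group + rng.normal(size=n)   # true group difference: 1.5

# Step 1: fit the homogeneous (theory-guided) model; here, one overall mean.
mu_hat = y.mean()

# Step 2: individual parameter contributions. For a sample mean, person i's
# contribution is simply y_i (each person "pulls" the estimate towards their
# own value); for SEM parameters this generalizes via individual scores.
ipc = y

# Step 3: regress the contributions on the grouping variable. The slope
# estimates the between-group difference in the model parameter (about 1.5).
slope, intercept = np.polyfit(group, ipc, deg=1)
print(round(float(slope), 2))
```

In a real SEM the step-2 contributions come from the model's score functions, which is where the equality and nonlinear constraints discussed in the talk can bias the naive estimates.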

Talk

Generalized Continuous Time Models and the Continuous Time Rasch Model as an Example

Keywords:
Autoregressive Models
Classical Test Theory
Continuous Time Models
Generalized Linear Mixed Models
Generalized Linear Models
Item Response Theory
Longitudinal Data
Measurement

Autoregressive models can be used to analyze longitudinal data. However, depending on the spacing of the discrete measurement occasions, autoregressive models will come to different results. In consequence, results from studies with different time intervals will differ. To overcome this and other shortcomings of discrete time autoregressive models, continuous time autoregressive models have been formulated. The basic idea of continuous time modeling is the assumption of an underlying continuous time process that can describe all associations of variables between any discrete points in time. This allows for using unequally spaced assessment designs to study psychological processes, and simplifies comparisons of results from studies with differing time intervals. Besides the repeated assessment over time, latent variable measurement models can be used at each measurement occasion to quantify (and increase) the reliability and precision of the measurement. Previous research has already exploited the idea of combining autoregressive models and latent variable models to a certain extent. However, continuous time models have rarely been combined with a general modeling framework that is suitable for incorporating a relatively large number of measurement models.
The main goal of this work is to combine continuous time and Generalized Linear (Mixed) Models to a new class of models termed "Generalized Continuous Time Models". We provide and describe the general equations needed and give some concrete examples. We chose one popular measurement model, the Rasch model, as an illustration of Generalized Continuous Time Models. For this model we conducted a "proof-of-concept" simulation study and give an illustrative real data example. A publicly accessible R package is available (ctsem) that can be used to model some Generalized Continuous Time Models.
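The interval-reconciliation idea at the heart of continuous time modeling can be shown in two lines. This is a minimal sketch under an assumed scalar drift value, not the ctsem implementation: a single continuous-time drift parameter implies a discrete-time autoregressive coefficient for any interval.

```python
import numpy as np

# A continuous-time drift A implies the discrete-time AR coefficient
# exp(A * dt) for ANY interval dt, so studies with different measurement
# spacings estimate the same underlying parameter.

A = -0.5                     # assumed continuous-time drift (per unit time)

phi_1 = np.exp(A * 1.0)      # implied AR coefficient at interval dt = 1
phi_2 = np.exp(A * 2.0)      # implied AR coefficient at interval dt = 2

# Consistency check: the 2-interval coefficient equals the 1-interval
# coefficient squared, which mismatched discrete-time models would miss.
print(np.isclose(phi_2, phi_1 ** 2))
```

In the multivariate case A is a drift matrix and `exp` becomes the matrix exponential, but the logic of comparing studies with differing intervals is the same.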

Talk

The Association Between Depression and Education in UK Adolescents: A Cross-Lagged Panel Analysis

Keywords:
Applied Statistics
Causal Inference
Longitudinal Data
Mood
Observational Studies
Structural Equation Modeling

AIM: Mental health and educational achievement are both important in child development. There is previous evidence for links between them; however, the nature and direction of this relationship are unclear. In this paper, we explore the strength and directionality of the associations between the two domains throughout adolescence.
DATA: We used data from the Avon Longitudinal Study of Parents and Children, a birth cohort study including 15,390 children from the Bristol area who were born in the early 1990s. This resource includes four repeated measures of depressive symptoms (Short Mood and Feelings Questionnaire) and educational records (key stages 2-5), at mean ages 11, 14, 16, and 18 years.
ANALYSIS: We fitted cross-lagged panel models within a structural equation modelling framework, using Mplus v8. We used full information maximum likelihood to handle missing data in our main analyses.
FINDINGS: Results from the preliminary main analyses suggest possible age-specific directionality of associations. Between ages 11 and 16, early measures of depression are negatively associated with later measures of education (standardised coefficient -0.034, 95% CI -0.054 to -0.014), whereas early educational measures show little evidence of an association with depression (0.021, -0.028 to 0.070). Conversely, education at age 16 (end of high school) showed a negative association with later depression (-0.122, -0.159 to -0.085), whereas we found no strong evidence for an association between depression at age 16 and educational records at age 18 (-0.020, -0.040 to 0.000). We will examine the robustness of these findings to different model specifications and missingness scenarios. We will also explore the consequences of incorporating additional variables such as sex, IQ, parental depression and socio-economic status, and polygenic risk scores for mood and education.

Talk

Latent Markov Factor Analysis for Exploring MM Changes in Time-Intensive Longitudinal Studies

Keywords:
Experience Sampling
Factor Analysis
Latent Markov Modeling
Measurement Invariance
Time-Intensive Longitudinal Studies

New technology facilitates the collection of time-intensive longitudinal data to study daily-life dynamics of psychological constructs (such as well-being) within persons over time (e.g., by means of Experience Sampling Methodology; ESM). However, the measurement quality can be affected by time- or situation-specific artefacts such as response styles or substantive changes in item interpretation. These distortions can be traced as changes in the measurement model (MM), which evaluates the constructs that are underlying a participant’s answers. If not captured, these changes might lead to invalid inferences about the targeted psychological constructs. Existing methodology can only test for a priori hypothesized changes in the MM. However, typically we have no prior information on (changes in) the MMs. Thus, an exploratory method that detects and models MM changes is needed before we can benefit from the full potential of ESM data. To this end, we present a method called latent Markov factor analysis (LMFA). In LMFA, a latent Markov chain captures the changes in MMs over time by clustering observations per subject into a few states and the data are factor-analysed per state. The states indicate for which time points the construct measurements may be validly compared and within-subject MM differences can be explored by comparing the state-specific MMs. A simulation study shows a good performance of LMFA in recovering parameters under a wide range of conditions. The practical value of LMFA is illustrated with an empirical example.

session

Latent Variable Analysis

Jose-Luis Padilla

11:30 - 12:50

Salon Schlegel

Talk

On the Effect of Observations and Parameters on the fit of SEM Models With Large Sample-Sizes

Keywords:
Applied Statistics
Large-Scale Data
Structural Equation Modeling

According to Tanaka (1987), there appears to be general agreement that sample-size appropriateness should be tied to the ratio of the number of observations to the number of parameters estimated. In the works of Bentler (1980, 1983), Jöreskog (1978), Everitt (1984), James et al. (1982), and McDonald (1985), the issue of sample size was invariably raised but typically not treated in sufficient detail. In fact, the question of how many observations and parameters are needed before estimating and testing a SEM has occupied researchers for years, and some have learnt lessons about the necessity of large samples. Marsh et al. (1998) explained that more observations (N) always mean better results for SEM models. Likewise, Boomsma (1982, 1985) found that the percentage of proper solutions for SEM, the accuracy of parameter estimates, the sampling variability of parameter estimates, and the appropriateness of the chi-square test statistic were all favorably influenced by larger N's, recommending N > 100 but also noting the desirability of N > 200. Given the widely acknowledged consensus that "more means better", in this study we ask: what precisely is a desirable sample size (in particular, exceeding 200 observations and reaching 300, 900, 1500, 2000, 3000, 5000, 10000, 19000, 28000, or 60000 observations) if we assume more or fewer parameters in a SEM model? Moreover, can we argue that larger or extreme samples perform better than small and medium samples as far as the quality of SEM models is concerned? We therefore focus on the effects of the numbers of observations and parameters on the fit of structural equation models constructed on large samples. With this in mind, a series of calculations was conducted on one example of a specified SEM, with data derived from a national survey conducted in Poland. The design crossed three levels of the number of parameters with ten levels of sample size.

Talk

Evaluating Model Quality in Exploratory Bi-factor Modelling

Keywords:
Bi Factor
Exploratory Factorial Analysis
Explained Common Variance
Omega
Reliability
Target Rotation

Bi-factor models are usually applied to separate general and specific sources of variance. Bi-factor model quality measures include omega hierarchical, omega hierarchical subscale, and the Explained Common Variance (ECV) index. The former two represent superior choices to classical reliability statistics, while the latter evaluates the relative strength of a latent factor. They can be used conjointly to assess the extent to which a general factor accounts for the common variance, or whether unidimensionality is supported. However, under realistic settings (e.g., when cross-loadings are present) these indexes can be biased if obtained from either CFA or Schmid-Leiman solutions, as both methods lead to incorrect estimates.
Two promising algorithms for exploratory bi-factor modelling are: a) iterative target rotation based on an initial Schmid-Leiman solution (SLi); and b) bi-factor target rotation based on an obliquely rotated solution (biFAD; Waller, 2017). As they differ in how the target matrix is defined (empirically vs. fixed cut-off point; totally vs. partially specified) and applied (iteratively vs. non-iteratively), they are expected to have a dissimilar impact on the quality indices.
A Monte Carlo simulation was conducted by manipulating sample size, number of specific factors, number of indicators, factor loading range, factor loading average, and cross-loading size. BiFAD fixed cut-off points ranged from .05 to .20. Both methods showed good recovery, but antagonistic behaviors regarding the quality indexes: SLi overestimated omega hierarchical and ECV for the general factor (due to overestimation of that factor), while underestimating omega hierarchical subscale and ECV for the specific factors (due to underestimation of those factors). BiFAD results went in the opposite direction. SLi was prone to produce factor collapse, while biFAD was affected by the cut-off point selection and the number of factors involved. Guidelines for interpreting bi-factor quality indexes are provided.

Talk

Analyzing Approximate Invariance From a Mixed-Method Ecological Approach to Validation

Keywords:
Alignment Method
Cross-national Survey Research
Measurement Invariance
Mixed Methods
Structural Equation Modeling
Validity

Uncovering the causes of a lack of measurement invariance is critical for improving the validity of cross-cultural comparisons from an ecological approach to validation (Zumbo, 2017). New developments in quantitative methods, such as the alignment method, can help researchers understand the absence of invariance, and alignment results make it easier to link quantitative results with qualitative findings within a mixed methods research framework. The Indigenous Social Desirability Scale (ISDS; Domingez & van de Vijver, 2014) was developed from an emic perspective for Mexican culture. The aim of this study is to explore how to integrate quantitative evidence of approximate measurement invariance obtained by alignment with qualitative evidence from cognitive probes, for a multi-national research project with country samples from Mexico, Colombia, and Spain. A total of 967 participants responded to the ISDS: Mexican (257), Spanish (513), and Colombian (197). The scale consists of 14 polytomous items capturing the positive and negative dimensions of social desirability. A robust maximum likelihood fixed alignment analysis was conducted to test approximate measurement invariance, and a literature review was conducted to propose a mixed research design integrating alignment results with cognitive probe findings. Out of the 14 items, 4 intercepts and 1 loading for the Spanish sample and 1 intercept for the Mexican sample are not invariant across the country groups. This percentage of non-invariant parameters supports the quality of the alignment results. In addition, a mixed research design including cognitive probe formats was developed for offline and online cognitive interviewing. The alignment results support approximate metric invariance across the ISDS country samples. Finally, we discuss the benefits of mixed methods research for investigating possible causes of the absence of invariance for culture-bound constructs, such as social desirability, from an ecological view of validity.

Talk

A Correlated Covariate Amplifies the Bias of a Fallible Covariate in Causal Effect Estimates

Keywords:
Bias Amplification
Causal Inference
Latent Covariates

Covariate-adjusted treatment effects are commonly estimated in non-randomized studies, either with ANCOVA or with propensity score methods. It has been shown that measurement error in a covariate can bias treatment effect estimates if it is not appropriately accounted for. So far, derivations of the bias due to a fallible covariate have primarily assumed a true data-generating model that includes just the respective latent covariate. It is, however, more plausible that the true model consists of more than one covariate. Intuitively, an additional covariate that is correlated with the latent covariate can be helpful if only a fallible measure of the latent covariate is available for adjustment - the correlation might partly compensate for the bias due to measurement error. We disentangle when it is advisable to include a correlated covariate for adjustment. To this end, we analytically investigate a true model with two correlated covariates and derive the bias when only a fallible measure of one of the covariates is available for adjustment. With a fallible covariate, it is not always advisable to include the additional covariate in the adjustment model, as doing so can substantially increase the bias, even if the additional covariate is highly correlated with the latent covariate. We point out the distorting effects of fallible covariates and discuss adjustment for latent covariates as a possible solution.

session

Multilevel Modeling

Wouter Talloen

11:30 - 12:50

Salon Hölderlin

Talk

Measurement Error and Unmeasured Confounding in Multilevel Mediation Models

Keywords:
Confounding
Measurement Error
Multilevel Modeling

Multilevel 2-1-1 mediation models are frequently used in educational research when the intervention is measured at the cluster (e.g. class) level and the mediator and outcome at the unit (e.g. student) level. In such settings, different indirect effects may be of interest. The within indirect effect measures the indirect effect through the mediator at the unit level, while the between indirect effect measures this effect at the cluster level. The latter can be estimated by adding an aggregated unit-level mediator as a predictor for the outcome. Manifest approaches assume that observed group means are perfectly reliable measurements of this aggregated unit-level mediator, while latent approaches such as multilevel structural equation modeling take measurement error into account. In a first step, we study the impact of measurement error on estimators for between and within indirect effects obtained by manifest and latent approaches. In a second step, we focus on the impact of unit- and cluster-level unmeasured mediator-outcome confounding on causal effects. In a third step, we study the combined effect of measurement error and unmeasured confounding. A simulation study is conducted to compare bias, precision, coverage rate, and bias-variance trade-off for each causal effect. Estimators for the within indirect effect obtained by manifest and latent approaches perform similarly; only unmeasured confounding at the unit level induces bias for the within indirect effect. The estimator for the between indirect effect is affected in both approaches by unmeasured confounding at the cluster level. For manifest approaches, the bias of the between-effect estimator also depends on the strength of the intraclass correlation of the mediator and on unit-level unmeasured confounding.

Talk

Same Same but Different?! Measuring Local Sex Ratios

Keywords:
Measurement
Multilevel Modeling
Partner Market
Sex Ratio

Imbalanced numbers of men and women in societies or social groups (i.e. sex ratios) have been linked to a variety of social consequences. Studies report associations with relationship formation patterns and timing, divorce rates, fertility timing and rates, sexual norms, female labour market participation, as well as violence and aggression. Theoretical arguments commonly start from a social exchange perspective, considering imbalanced sex ratios as a factor that shapes individuals’ dyadic power on the partner market and within relationships. However, theoretical reasoning remains unclear on whether behavioural consequences result from individuals’ deliberate adjustment of partner market strategies or from unconscious endocrinal or normative variations. Previous studies use population register data to compute local sex ratios at the county or municipality level for a wide age range, mostly covering adults from 16 up to 40 years. These studies rest on the implicit assumption that individuals deliberately reflect on these imbalances, resulting in specific daily-life actions; however, no study has tested this assumption empirically. Based on combined data from a representative German longitudinal survey (pairfam) and population registers, we analyse how local sex ratios, measured by register data, are consciously perceived by individuals. Our empirical approach contains two parts: First, we test correlations between local sex ratios and subjectively perceived sex ratios. Second, we analyse the transition from singlehood to partnership as an example of the consequences of imbalanced sex ratios and of the (competing) influence of local and perceived sex ratios. Our results suggest that correlations between local sex ratios from register data and sex ratios as perceived by individuals are very low. Our longitudinal example also indicates that local and perceived sex ratios are independent, while subjectively perceived sex ratios seem to be the better indicator for explaining partner market outcomes.

Talk

Assessing Structures of Prejudice in Europe with Multilevel Latent Class Analysis

Keywords:
European Values Study
Finite Mixture Modeling
Latent Class Models
Multilevel Analysis
Multilevel Latent Class Analysis
Prejudice

Analyses of prejudice against out-groups often focus on a specific group, e.g. Muslims, immigrants, or homosexuals. However, Gordon Allport already argued in the 1950s that “people who reject one out-group will tend to reject other out-groups” (1954, p. 68), and many studies have since shown that different types of prejudice are interrelated (e.g., Zick et al. 2008; Reeskens 2013). In this presentation, we propose to assess structures of prejudice in Europe by means of multilevel latent class analysis (ML-LCA). For this purpose, we use a variant of the Bogardus Social Distance scale that was part of the European Values Study (EVS) in 2008: Survey respondents were asked to sort out any groups (e.g., drug addicts, right-wing extremists, Jews, immigrants…) they would not like to have as neighbours. Applying ML-LCA to this question, we identify both an individual-level typology of perceived out-groups and segments of countries with different patterns of prejudice. On the individual level, five latent classes are distinguished that differ with respect to the characteristics of out-groups (cultural minorities, people with deviant behaviour, and political extremists) and the amount of resentment towards groups-as-neighbours in general. The latter is also a main differentiating factor between countries. In a second step, predictors of class membership on the individual and country level are investigated. Substantive and methodological implications of the ML-LCA approach to prejudice structures will be discussed.

Talk

Detecting Selection Bias in Meta-Analyses with Dependent Effect Sizes: A Simulation Study

Keywords:
Meta-Analysis
Multilevel Meta-Analysis
Publication Bias

In meta-analysis, it is common to find primary studies that include multiple effect sizes, generating dependence among them. Although several techniques are available for dealing with dependent effect sizes, the assessment of selection bias in this context (i.e., publication bias and selective outcome reporting bias) has not yet been thoroughly scrutinized. Therefore, the aim of this study is to explore, by means of a simulation study, the performance of commonly used methods for detecting publication bias in situations where primary studies include multiple effect sizes. To that end, meta-analytic datasets were generated under a variety of realistic conditions. Next, three different types of bias were induced: publication bias, selective outcome reporting bias, and the combination of both. Datasets unaffected by any type of selection bias were also considered. Afterwards, six different methods for detecting publication bias were applied: Begg’s Rank Correlation test (using variance and sample size), the Trim and Fill method (R0 and L0 estimators), Egger’s Regression, and the Funnel Plot test. The last two methods were adapted by using three-level models to account for within-study dependency. These methods were evaluated in terms of Type I error and power. Results indicated that Begg’s Rank Correlation test (using both variance and sample size), the Trim and Fill (L0) method, and Egger’s Regression test lead to Type I error rates that are too high in most conditions, whereas the Funnel Plot test and the Trim and Fill (R0) method lack power. Results also showed that concluding that there is selection bias when four out of the six methods indicate its presence leads to controlled Type I error rates across conditions. However, this approach is still unsatisfactory in terms of power. We conclude that the studied approaches have serious flaws and that other approaches should be explored.
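As a sketch of one of the methods under study, the standard single-level form of Egger's regression test can be implemented in a few lines. The simulated data below are illustrative, and this version does not include the three-level adaptation examined in the talk.

```python
import numpy as np
from scipy import stats

def eggers_test(effects, ses):
    """Egger's regression test for funnel plot asymmetry.

    Regresses the standardised effect (effect / SE) on precision (1 / SE);
    an intercept that differs from zero suggests small-study effects such
    as publication bias.  Returns the intercept and its two-sided p-value.
    """
    effects, ses = np.asarray(effects, float), np.asarray(ses, float)
    snd = effects / ses                        # standardised effects
    prec = 1.0 / ses                           # precision
    X = np.column_stack([np.ones_like(prec), prec])
    beta, _, _, _ = np.linalg.lstsq(X, snd, rcond=None)
    n, k = X.shape
    resid = snd - X @ beta
    sigma2 = resid @ resid / (n - k)           # residual variance
    se_int = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
    p = 2 * stats.t.sf(abs(beta[0] / se_int), df=n - k)
    return beta[0], p

# Simulated unbiased data set: 40 studies with true effect 0.3.
rng = np.random.default_rng(1)
ses = rng.uniform(0.05, 0.5, size=40)
effects = 0.3 + rng.normal(0.0, ses)
b0, p = eggers_test(effects, ses)
```

Under selection bias, small studies with large standard errors would report systematically larger effects, pushing the intercept away from zero.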

session

Bayesian Statistics

David Kaplan

11:30 - 11:30

Salon Novalis

Talk

An Approach to Addressing Multiple Imputation Model Uncertainty Using Bayesian Model Averaging

Keywords:

This paper considers the problem of imputation model uncertainty in the context of missing data problems. We argue that so-called “Bayesianly proper” approaches to multiple imputation, although correctly accounting for uncertainty in imputation model parameters, ignore the uncertainty in the imputation model itself. We address imputation model uncertainty by implementing Bayesian model averaging as part of the imputation process. Bayesian model averaging accounts for both model and parameter uncertainty, and thus, we argue, is fully Bayesianly proper in the sense of Schafer (1997). We apply Bayesian model averaging to multiple imputation under the fully conditional specification approach. An extensive simulation study motivated by real data considerations is conducted comparing our Bayesian model averaging approach against choosing the imputation model with the highest posterior model probability, and against normal theory-based Bayesian imputation not accounting for model uncertainty. The results reveal a small but consistent advantage to our Bayesian model averaging approach under MCAR and MAR in terms of Kullback-Leibler divergence. No procedure works well under NMAR. A small case study is also presented. Directions for future research are discussed.

Talk

Bayesian Meta-Analysis of Studies Using Cohen's d in R

Keywords:
Bayesian Statistics
Effect Sizes Measures
Meta-Analysis
R

Bayesian meta-analysis has several key advantages over frequentist meta-analysis. First, a Bayesian framework theoretically utilises the correct conditional probability and practically allows evidence for the null hypothesis. Second, the posterior distribution and credible intervals are intuitively interpretable. Third, data can be added as new participants or studies appear, which is particularly important in living meta-analyses (Elliott et al., 2017; Simmonds, Salanti, McKenzie & Elliott, 2017). Examples already exist of how to apply a Bayesian framework to meta-analysis (e.g., Scheibehenne, Jamil & Wagenmakers, 2016; Smith, Spiegelhalter & Thomas, 1995; Sutton & Abrams, 2001). However, these examples utilise only Bayes factors and odds ratios or risk differences. No studies demonstrate how to apply Bayesian meta-analysis to effect sizes commonly used in psychology, such as Cohen’s d. This paper proposes a Bayesian fixed-effects meta-analysis of studies that use Cohen’s d. The meta-analysis yields an overall effect size (an estimate of the population effect size) and its credible interval. The analysis is conducted using Stan in R. We also provide practical guidelines on how to interpret the results of the Bayesian meta-analysis.
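Although the talk uses Stan in R, the fixed-effects normal model it describes has a closed-form (conjugate) posterior, which can be sketched without MCMC. The study values and prior below are hypothetical, chosen only for illustration.

```python
import numpy as np

def bayes_fixed_effect_meta(d, v, prior_mean=0.0, prior_var=1.0):
    """Conjugate posterior for a fixed-effects meta-analysis of Cohen's d.

    Model: d_i ~ Normal(delta, v_i), with a Normal(prior_mean, prior_var)
    prior on the common effect delta.  Because everything is normal, the
    posterior of delta is normal with precision-weighted mean.
    Returns the posterior mean, posterior sd, and a 95% credible interval.
    """
    d, v = np.asarray(d, float), np.asarray(v, float)
    post_prec = 1.0 / prior_var + np.sum(1.0 / v)   # precisions add up
    post_mean = (prior_mean / prior_var + np.sum(d / v)) / post_prec
    post_sd = np.sqrt(1.0 / post_prec)
    ci = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
    return post_mean, post_sd, ci

# Three hypothetical studies reporting d and its sampling variance.
d = [0.42, 0.31, 0.55]
v = [0.04, 0.02, 0.06]
mean, sd, (lo, hi) = bayes_fixed_effect_meta(d, v)
```

A random-effects version, or a non-normal prior, would no longer be conjugate and would require sampling, which is where Stan comes in.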

Talk

Hypothesis-Testing Demands Trustworthy Data – A Simulation Approach to Inferential Statistics Based on the Research Program Strategy

Keywords:
Bayes’ Theorem
Inferential Statistics
Likelihood
Research Program Strategy
Wald-criterion
t-Test

In psychology as elsewhere, the main strategy for establishing empirical effects remains null-hypothesis significance testing (NHST). However, recent attempts have failed to replicate “established” effects that were allegedly well supported. Hence, NHST must let through too many errors, for otherwise far more such effects would have been successfully replicated. This makes it difficult to trust even results that top journals publish.
We advocate the research program strategy (RPS) as superior to NHST. Employing both frequentist and Bayesian tools, we show by means of data simulation that RPS's six steps, leading from making a discovery against a random model to statistically verifying a hypothesis, retain far fewer errors than a standard usage of NHST. Therefore, RPS results deserve far greater trust than NHST results. Our simulations moreover estimate the expectable proportion of errors among published results.
Where test-power is unknown, NHST constitutes the RPS's first step, where probabilities serve to discover an effect preliminarily. By contrast, if we know the test-power, then a substantial discovery may arise (step 2). Moving beyond discoveries, steps 3 to 6 concern the justification of hypotheses (falsification and verification). These steps presuppose the use of likelihoods and demand data of high induction quality (test-power) for such data to test hypotheses. We employ Wald's criterion (the ratio of test-power to significance level) to preliminarily or substantially falsify H0 (steps 3, 4), and to preliminarily verify H1 (step 5). Finally, if the ratio of the likelihoods for H1 and H0 exceeds Wald's criterion, while the maximum-likelihood estimate of the data lies close to H1, then this substantially verifies H1 (step 6).
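The quantitative core of the likelihood-based steps can be sketched as follows. All numerical values (means, sigma, power) are purely illustrative assumptions, and `wald_decision` is a hypothetical helper, not part of the authors' software.

```python
from scipy import stats

def wald_decision(x_bar, n, mu0, mu1, sigma, alpha=0.05, power=0.95):
    """Sketch of the Wald-criterion comparison used in the RPS.

    Compares the likelihood ratio L(H1)/L(H0) for an observed sample mean
    against Wald's criterion, the ratio of test-power to significance level.
    Returns the likelihood ratio, the criterion, and whether the ratio
    exceeds the criterion.
    """
    criterion = power / alpha                 # e.g. 0.95 / 0.05 = 19
    se = sigma / n ** 0.5                     # standard error of the mean
    lr = stats.norm.pdf(x_bar, mu1, se) / stats.norm.pdf(x_bar, mu0, se)
    return lr, criterion, lr > criterion

# Observed mean 0.45 with n = 100 lies close to H1 (mu1 = 0.5):
lr, crit, exceeds = wald_decision(x_bar=0.45, n=100, mu0=0.0, mu1=0.5, sigma=1.0)
```

With these illustrative numbers the likelihood ratio far exceeds the criterion of 19, so under the RPS this outcome would count towards verifying H1.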

13:00

lunch

Lunch Break

13:00 - 14:00

14:00

session

Item Response Theory

Timo Bechger

14:00 - 15:20

Saal Friedrich Schiller

Talk

DIF Methods in Dexter

Keywords:

Arguably, DIF is a shrinking problem as item writers strive towards, and get better at, producing DIF-neutral items. Nevertheless, methods for detecting DIF are proliferating. We discuss two methods included in the R package dexter (Partchev, Bechger, Maris & Koops, 2017): one exploratory, based on the relative difficulties of pairs of items (Bechger & Maris, 2015), and one confirmatory, related to latent profile analysis (Verhelst, 2012).

Talk

The Great Dexperiment: Psychometrics With Observed Variables

Keywords:
Item Response Theory

The dexter package is based on the principle that psychometric models are vehicles for answering questions about observed variables: users are not supposed to see parameters or (estimates of) latent abilities. My intention is to discuss how this principle is implemented, illustrated with a number of real-life examples.

Talk

Bayesian Estimation of Item Response Models to Account for Learning During the Test

Keywords:
Bayesian Statistics
Componential Models
Item Response Theory
Learning
Stan

In the present work, several explanatory item response models are proposed to account for the learning that takes place during the execution of a test due to the repeated use of the operations involved in the items. The models include a difficulty component derived from the cognitive operations involved in solving the item, as well as a learning component derived from the use of said operations in previously answered items. Six different models are proposed, taking into account the type of response in which the model establishes that learning occurs (i.e., in correct responses only, in correct responses and errors indistinctly, or in correct responses and errors to different degrees) and whether or not the model considers individual differences in learning. Based on the above, a simulation study was conducted to test whether Bayesian goodness-of-fit procedures allow identifying the model used to simulate the data. The data were generated from the six proposed models plus the LLTM and the Rasch model. Additionally, three different sample sizes were used (N = 250, N = 500, and N = 1000). The combination of models and sample sizes thus resulted in an 8 × 3 factor design. One hundred data sets were simulated for each of the 24 design points. For each data set, 20 dichotomous responses were simulated based on a weight matrix of five components. The eight models were estimated from each simulated data set using Bayesian inference. Specifically, parameters were estimated via Markov chain Monte Carlo (MCMC) using the Stan language. The fit of the models to the data was assessed with three deviance measures based on information theory: the deviance information criterion (DIC), the widely applicable information criterion (WAIC), and leave-one-out cross-validation (LOO). As expected, the results indicated that the model used to generate the data for each design point minimized the discrepancy statistics. The results therefore support the ability of the proposed models to detect learning effects during the test.
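The data-generating side of this model class can be sketched as follows, here for the variant in which learning accrues from correct responses and errors alike. The parameterisation (a single learning rate `gamma`, a random weight matrix) is illustrative and not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_learning_irt(theta, b, Q, gamma):
    """Simulate responses from a Rasch-type model with within-test learning.

    Q[i, k] = 1 if item i involves cognitive operation k.  Each prior
    exposure to an operation lowers the item's effective difficulty by
    gamma, regardless of whether the earlier response was correct.
    Returns an (n_persons x n_items) 0/1 response matrix.
    """
    n_persons, n_ops = len(theta), Q.shape[1]
    n_items = Q.shape[0]
    X = np.zeros((n_persons, n_items), dtype=int)
    for p in range(n_persons):
        exposures = np.zeros(n_ops)           # operation-use counts so far
        for i in range(n_items):
            logit = theta[p] - b[i] + gamma * (Q[i] @ exposures)
            prob = 1.0 / (1.0 + np.exp(-logit))
            X[p, i] = rng.random() < prob
            exposures += Q[i]                 # count uses, outcome-independent
    return X

theta = rng.normal(0, 1, size=500)               # person abilities
b = rng.normal(0, 1, size=20)                    # item difficulties
Q = (rng.random((20, 5)) < 0.4).astype(int)      # 20 items x 5 operations
X = simulate_learning_irt(theta, b, Q, gamma=0.1)
```

The other variants in the talk differ only in how `exposures` is updated (e.g. incremented only after correct responses, or with person-specific learning rates).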

Talk

A Probabilistic IRT Model for the Joint Assessment of Objects and Persons in Fully Crossed Designs

Keywords:
Aesthetics
Bayesian Statistics
Categorical Data
Fully Crossed Designs
Item Response Theory

In psychological aesthetics, it is of interest to assess an artwork's (emotional) impact on the perceiver. From the perspective of differential psychology, it is of interest to assess interindividual differences among perceivers in displaying or reporting a certain response. The methodological question that arises in this context is: how can perceivers and artworks be assessed simultaneously with regard to a psychological construct? The answer is relatively simple: perceivers are presented with artworks in a fully crossed design, and the task is to judge the objects on, for instance, the items of the AESTHEMOS questionnaire. In a next step, the manifest responses are projected onto latent dimensions using a Bayesian probabilistic IRT model, which allows for disentangling the effects of the artworks as well as the effects of the perceivers. While models of this type are certainly not uncommon, the proposed model differs from known approaches in that it is an extension of Masters' partial credit model migrated to a Bayesian MCMC setting. The Bayesian approach readily allows for the identification of posterior distributions of the perceivers' individual characteristics. In addition, it is possible to evaluate the artworks' overall tendencies to evoke certain responses. In this talk, the theoretical and methodological underpinnings of the proposed model are discussed. Results from a real-world application of the model to the task of judging paintings are presented. In addition, it is highlighted that the model's application is not limited to artworks: persons, too, could be subjects of the judging process.

symposium

Challenges in Interdisciplinary Research Methodology: The Study of Complex Systems

Hilde Tobi

14:00 - 15:05

Salon Schlegel

Talk

Using Simulation Models to Measure Resilience

Challenges in Interdisciplinary Research Methodology: The study of Complex Systems

Keywords:
Agent-based
Dynamic Modeling
Resilience
Simulation

Many social-ecological systems (SES), such as fisheries, land-use systems, and agricultural systems are under pressure from human activities and environmental changes. Thus, it is important to study their resilience against such pressures. Resilience may be generated by various mechanisms, such as stabilising feedbacks, spatial interactions, diversity of underlying units (like agents), and mechanisms for adaptation. Simulation models are an important tool for assessment of resilience. Simulation models help us to test the effects of various assumptions on interactions and feedbacks within the system on its resilience. But, not all types of simulation models may be equally suitable for this purpose. We compare two commonly used model types for describing SES, namely ordinary differential equation (ODE) models and agent-based models (ABMs). As test-case, we consider a system in which consumers compete for a renewable common-pool resource. The system is modelled both as an ABM and as an ODE model. The ABM is spatially explicit and dynamic with respect to time. Agents can move in search of resource, and can harvest from their present location. The ODE model is dynamic, but non-spatial. We examine to what extent the ODE model can be fitted to the ABM. We investigate how both models respond to external shocks, and apply resilience measures such as return time to quantify resilience. The results show that the ODE model can reproduce the behaviour of the ABM, if some mechanisms relevant for resilience are excluded. Specifically, the ODE model does not capture effects of agent adaptation, or local differences in space. We conclude that the most suitable modelling approach depends on the system. If resilience is caused by system-level feedbacks, then ODE models may be suitable for assessing this resilience. If, in contrast, agent adaptation or localised actions of agents contribute to resilience, then ABMs may be more suitable.
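The return-time measure mentioned above can be sketched with a simple harvested logistic resource; the ODE and parameter values below are illustrative stand-ins, not the authors' model.

```python
def return_time(r=1.0, K=1.0, h=0.2, shock=0.5, tol=0.05, dt=0.01, t_max=200.0):
    """Return time of a harvested logistic resource after a shock.

    dR/dt = r*R*(1 - R/K) - h*R, an illustrative ODE with proportional
    harvesting.  The resource is perturbed below its equilibrium and
    integrated (simple Euler) until it is back within a fraction `tol`
    of the equilibrium; the elapsed time quantifies resilience.
    """
    R_eq = K * (1.0 - h / r)          # positive equilibrium (requires h < r)
    R = R_eq * (1.0 - shock)          # apply the shock
    t = 0.0
    while abs(R - R_eq) > tol * R_eq and t < t_max:
        R += dt * (r * R * (1.0 - R / K) - h * R)
        t += dt
    return t

t_small = return_time(shock=0.3)      # mild shock: short return time
t_large = return_time(shock=0.6)      # strong shock: longer return time
```

An ABM version of the same question would replace the single state variable `R` with spatially distributed agents, which is exactly where the two approaches can diverge.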

Talk

A Modest Step Toward Bringing Unity in Interdisciplinary Research

Challenges in Interdisciplinary Research Methodology: The study of Complex Systems

Keywords:
Interdisciplinary Research Methodology

Research communication in interdisciplinary research projects requires a way of demarcating theory and knowledge that is easy to communicate, is inconsequential for the framework of concepts, results, and procedures within existing scientific disciplines, and abstains from trying to resolve the dispute between (neo)positivists and constructivists. In this essay, a simple way of demarcation is proposed that only secures the vocabulary needed to comprehensively discuss research methodology and findings in interdisciplinary research contexts. It starts from the notion of language-independent and language-dependent reality. Language, when it is not sheer fantasy, is at best after the fact. All language is instruction tied to specific senses or acts whose definitions have to be known for the instruction to be possible. A possible instruction can be carried out; an impossible instruction cannot be carried out (logically or empirically, temporarily or permanently). Knowledge is to know which instructions are predictive of a demonstrable result, state, or situation in language-independent reality. Knowledge decreases outcome space by pointing out possibilities and impossibilities. Any theory that contains one or more impossible instructions is not knowledge. Any theory that does not reduce outcome space (typically, by not pointing out impossibilities) is not knowledge. Any theory with wrong predictions is not knowledge. Any theory falling short of a demonstration is not knowledge.

Talk

Innovation Modelling in Engineering and Scholastic Philosophy

Challenges in Interdisciplinary Research Methodology: The Study of Complex Systems

Keywords:
Complex Adaptive System
Engineering
Innovation Modeling
Interdisciplinary Research Methodology
Scholastic Philosophy

A phenomenon that affects all domains of human affairs is innovation. In the course of innovation, someone makes a new contribution to a subject domain, causing more or less perturbation in the field. Today, when a novel car model appears on the market, customers obtain a new choice (additive effect), but the overall transportation infrastructure remains mostly unchanged. By contrast, when cars first appeared on the market, transportation by horse was gradually replaced, and the respective industries, offering horse food, equipment, medicine, etc., had to give way (subtractive effect). Innovation domains are complex adaptive systems. Stakeholders in the domain can embrace innovation or mobilize resistance against it, depending on the effects they anticipate.
The modelling of innovation patterns is an interdisciplinary challenge par excellence. A basic model is sought that should be applicable to all domains of human affairs, while parameters in the model might change depending on the field of application. To develop and test a basic model, as well as to identify domain-specific parameters, inter- and cross-disciplinary studies are required. We apply an innovation model recently developed in the field of engineering to Scholastic Philosophy. This application has three important advantages: (i) Scholastic Philosophy is clearly different from engineering. (ii) High-quality data are available for long time spans; we analyse changes in philosophical theory from Augustinus to Aureoli, thus covering almost 1000 years of theorising. (iii) Scholastic authors are highly explicit as to which claims of their predecessors they endorse or reject, and which novel claims they propose, so that additive versus subtractive effects can be easily identified. In sum, the basic model that is tested fits the data well. However, modelling challenges are also identified, especially pertaining to the operationalization of community resistance against innovation.

Talk

Mapping Validity in Modelling for Interdisciplinary Research

Challenges in Interdisciplinary Research Methodology: The study of Complex Systems

Keywords:
Dynamic Modeling
Interdisciplinary Research Methodology

Computer simulations are a promising methodology in the interdisciplinary study of complex systems, such as socio-ecological systems and socio-technical systems. One quality criterion of all empirical research, regardless of its interdisciplinarity, is validity. In fact, validity is not one criterion, as different kinds of validity are usually distinguished (e.g. content validity, external validity). Validity is also used as a quality criterion in the context of simulation modelling. Here, validity pertains to different aspects of the model built.
To understand and assess the quality of interdisciplinary research in which models are designed and used, a thorough understanding of the different meanings of ‘validity’ is needed. In this paper, we first review concepts of validity and validation of models of complex systems. Then we review validities and validation procedures in interdisciplinary research with an emphasis on research involving both the social sciences and the natural sciences. Looking at both the purpose of the model and the input of empirical sciences, these two strands (models of complex systems, and interdisciplinary empirical research) are synthesized into one map of different validities and validations in modelling for interdisciplinary research. With this map, we propose unambiguous terminology for validity assessment in modelling for interdisciplinary research.

session

Latent Class Models

Ana Gomes

14:00 - 14:40

Salon Hölderlin

Talk

Internet Use in the European Union: A Multilevel Latent Class Analysis

Keywords:
European Union
Internet
Latent Class Models
Multilevel Analysis

Multilevel data structures are quite common in the social and behavioural sciences, and new analytical techniques have to be applied to these specific data sets. In this particular case, the Multilevel Latent Class Model (MLCM) becomes a viable alternative to the conventional Latent Class Model (LCM).
The MLCM considers not only the individual level (Level 1), but also an upper level (Level 2) that defines a nesting or hierarchical structure (Henry & Muthén, 2010). The MLCM decomposes the heterogeneity existing between countries and within countries (individuals), resulting in homogeneous segments of countries and individuals.
The data set comes from the Eurobarometer (TNS Opinion & Social, 2013) and contains information on the 28 countries of the European Union (n = 26680 citizens). The average age of the respondents is 46.82 years (s.d. = 1.9) and varies between 15 and 98 years.
At the individual level (Level 1), three variables were used to identify individual segments in Europe, taking their Internet usage pattern into account: frequency of access to the Internet, means of access, and online activities. Six sociodemographic variables were introduced to characterize the latent classes, namely: gender, age, literacy, marital status, occupation, and type of community. At the second level of analysis (Level 2), countries were introduced as contextual predictors, allowing the grouping of individuals into segments based on the similarities found.
References
Henry, K. L., & Muthén, B. (2010). Multilevel Latent Class Analysis: An Application of Adolescent Smoking Typologies with Individual and Contextual Predictors. Structural Equation Modeling, 17(2), 193-215.
TNS Opinion & Social. (2013). Cyber security report. European Commission, November, 156.

Talk

Using Latent Variable Models to Evaluate Test Quality Criteria of Tests Measuring Nominal Constructs

Keywords:
Applied Statistics
Latent Class Models
Latent Transition Analysis
Latent Variable Analysis
Logistic Regression
Test Quality Criteria

In psychological testing, measured constructs are often assumed to be continuous latent variables and are modeled accordingly. For continuous latent variables, there is a variety of methods for evaluating test quality criteria. However, many of these methods are not applicable if the latent variable is conceived as nominal. Using the example of the nominal construct sophistication of conditional reasoning (reasoning with if-then propositions), analysis procedures are presented that allow an evaluation of important test quality criteria such as reliability and validity. These analysis procedures (e.g. latent class analysis, latent transition analysis, multinomial logistic regression analysis with latent variables) provide adequate information and well-interpretable results when it comes to developing and evaluating psychological tests measuring nominal constructs.

session

Applied Statistics

Alrik Thiem

14:00 - 15:00

Salon Novalis

Talk

Interpretation of Main Effects for Moderated Regression Models

Keywords:
Regression
Simulation
Statistical Power

Moderated regression models include an interaction, or product, term and can be used to assess whether the relationship between a given independent variable (IV) and dependent variable (DV) depends on a third moderator variable (MV). Literature exists regarding the interpretation of a significant moderator effect, as well as guidance for interpreting the main effects in the presence of a significant moderator effect. Typically, researchers recommend either ignoring the main effects completely or carefully interpreting them as conditional effects. However, when the interaction effect is not significant, recommendations indicate that the typical interpretation of main effects as average effects is appropriate. The present study challenges this claim, since lack of significance may be due to lack of power rather than the absence of a true population effect. To explore this idea, a simulation study is conducted. A moderated regression model with one predictor, Y = a + b1*X + b2*M + b3*X*M + e, is estimated based on simulated data, varying the sample size, the effect size of the interaction term, and the centering of X. Preliminary results indicate that for typical data sets this model may be underpowered to detect moderation effects, which is consistent with the literature. Examining the distribution of slope coefficient estimates for the main effects (b1 and b2) indicates that even for models with no significant interaction effect, interpreting these main effects as average effects could be very misleading. Recommendations for applied researchers include using model selection procedures that can provide evidence for either the more or the less complex model, such as the Bayesian Information Criterion (BIC); routinely mean-centering predictors to guard against particularly misleading main-effect interpretations; and conducting post hoc power analyses for non-significant interaction effects before proceeding with the interpretation of main effects.
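The simulation design can be sketched as follows; the coefficient values (a = 0, b1 = b2 = 0.3) are illustrative assumptions, not the authors' exact conditions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def interaction_power(n, b3, n_sims=2000, alpha=0.05):
    """Monte Carlo power to detect b3 in Y = a + b1*X + b2*M + b3*X*M + e.

    X and M are independent standard normal, e ~ N(0, 1); a = 0 and
    b1 = b2 = 0.3 are fixed at illustrative values.  Returns the
    proportion of simulated samples in which b3 is significant.
    """
    hits = 0
    for _ in range(n_sims):
        X = rng.normal(size=n)
        M = rng.normal(size=n)
        Y = 0.3 * X + 0.3 * M + b3 * X * M + rng.normal(size=n)
        D = np.column_stack([np.ones(n), X, M, X * M])   # design matrix
        beta, _, _, _ = np.linalg.lstsq(D, Y, rcond=None)
        resid = Y - D @ beta
        s2 = resid @ resid / (n - 4)                     # residual variance
        se3 = np.sqrt(s2 * np.linalg.inv(D.T @ D)[3, 3]) # SE of b3-hat
        p = 2 * stats.t.sf(abs(beta[3] / se3), df=n - 4)
        hits += p < alpha
    return hits / n_sims

power_small_n = interaction_power(n=100, b3=0.1)
power_large_n = interaction_power(n=1000, b3=0.1)
```

With a small interaction effect, the power gap between typical and large samples illustrates why a non-significant interaction alone is weak evidence that the main effects are interpretable as average effects.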

Talk

Topic Modeling as a Type-Forming Process of Social-Ecological Education Research

Keywords:
Program Analysis

Full title: Topic modeling as a type-forming process of social-ecological education research: Program analysis on education for sustainable development in and by companies
The latent semantic analysis of qualitative data is presented on the basis of a program analysis of education for sustainable development in and by companies. The talk addresses how topic modeling can be applied to a qualitative data set in the context of a classical program analysis, and how the resulting model can be validated. To this end, the concrete evaluation and analysis steps of the computer-aided analysis are traced in detail using the "MAchine Learning for LanguagE Toolkit" (MALLET), which is based on the latent Dirichlet allocation (LDA) algorithm. As a result, a topic model with 10 topics is presented and interpreted as a central framework for a sustainability-oriented learning culture, with regard to ecological education as a problem-solving and life-world-oriented cognitive process in the context of a required legitimization management.
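MALLET itself is a Java toolkit; as a language-neutral illustration of the LDA model it implements, the toy collapsed Gibbs sampler below (plain NumPy, not the authors' pipeline) recovers topic-word distributions from a small synthetic corpus with two word themes:

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (toy sketch).
    docs: list of documents, each a list of integer word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    z = []                                  # topic assignment per token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove token, resample its topic, re-add
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # smoothed topic-word distributions (rows sum to 1)
    return (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + vocab_size * beta)

# toy corpus with two themes: word ids 0-2 vs. word ids 3-5
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 2], [3, 4, 5, 3, 4], [4, 5, 3, 5]] * 5
phi = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(phi.round(2))
```

In a real program analysis the word ids would come from tokenized program documents, and the number of topics (here 2; 10 in the study) is a modeling choice validated against the material.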

Talk

Small Act, Huge Effect: Algorithmic Sources of Publication Bias in Political Science Research

Keywords:
Causal Inference
Meta-Analysis
Qualitative Methods

Meta-analyses in political science continue to demonstrate the pervasiveness of publication bias, the reasons for which are said to lie with authors, reviewers, journal editors and project sponsors. In this article, we reveal an as yet undiscovered source of publication bias. More specifically, we demonstrate why the uncritical import of the Quine-McCluskey algorithm (QMC) from electrical engineering into social-scientific data analysis with Qualitative Comparative Analysis (QCA) in the late 1980s was bound to lead to considerable publication bias. Drawing on complete replication material for 160 studies from political science that have employed QCA, we also measure the extent of this problem in empirical research. Last but not least, we present a solution that is guaranteed to eliminate this source of bias: a redefinition of the objective function under which optimization algorithms such as QMC operate in QCA. Besides contributing to the scientific study of publication bias, our article thus also underlines the importance of evaluating the adequacy of foreign methods before putting them to uses for which they were not originally designed.
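For readers unfamiliar with QMC, the sketch below illustrates its core merging step: implicants differing in exactly one bit are repeatedly combined until only prime implicants remain. It is a toy implementation (it omits the prime-implicant-chart selection step) and is unrelated to the redefined objective function the authors propose:

```python
from itertools import combinations

def quine_mccluskey(minterms, n_bits):
    """Toy Quine-McCluskey reduction: merge implicants that differ in
    exactly one bit position ('-' marks an eliminated variable)."""
    terms = {tuple(format(m, f"0{n_bits}b")) for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a, b in combinations(sorted(terms), 2):
            diff = [i for i in range(n_bits) if a[i] != b[i]]
            if len(diff) == 1:  # adjacent implicants: combine them
                c = list(a)
                c[diff[0]] = "-"
                merged.add(tuple(c))
                used.update({a, b})
        primes |= terms - used  # unmerged implicants are prime
        terms = merged
    return {"".join(t) for t in primes}

# f(A, B, C) with minterms {0, 1, 2, 3, 7}
# prime implicants: '0--' (A') and '-11' (BC)
print(quine_mccluskey([0, 1, 2, 3, 7], 3))
```

In QCA the bits correspond to the presence or absence of case conditions, which is why the algorithm's built-in minimization objective matters for which solution formulas reach publication.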

15:30

state-of-the-art

Systematic observation of human behavior from a methodological perspective

Daniel Oberski

15:30 - 16:00

Saal Friedrich Schiller

state-of-the-art

Addressing Treatment Non-adherence in Randomized Experiments

Steffi Pohl

15:30 - 16:00

Salon Schlegel

16:00

coffee

Coffee Break

16:00 - 16:30

16:30

meeting

EAM Members Meeting

16:30 - 18:00

Saal Friedrich Schiller