library(idealstan)
# First we can simulate data for an IRT 2-PL model that is inflated for missing data
library(ggplot2)
library(dplyr)
# This code will take at least a few minutes to run
bin_irt_2pl_abs_sim <- id_sim_gen(model_type='binary',inflate=TRUE)
# Now we can put that directly into the id_estimate function
# to get full Bayesian posterior estimates
# We will constrain the persons with the highest and lowest true ideal points
# for identification purposes based on the true simulated values
bin_irt_2pl_abs_est <- id_estimate(bin_irt_2pl_abs_sim,
                                   model_type=2,
                                   restrict_ind_high =
                                     sort(bin_irt_2pl_abs_sim@simul_data$true_person,
                                          decreasing=TRUE,
                                          index.return=TRUE)$ix[1],
                                   restrict_ind_low =
                                     sort(bin_irt_2pl_abs_sim@simul_data$true_person,
                                          decreasing=FALSE,
                                          index.return=TRUE)$ix[1],
                                   fixtype='prefix',
                                   ncores=2,
                                   nchains=2)
# We can now see how well the model recovered the true parameters
id_sim_coverage(bin_irt_2pl_abs_est) %>%
bind_rows(.id='Parameter') %>%
ggplot(aes(y=avg,x=Parameter)) +
stat_summary(fun.args=list(mult=1.96)) +
theme_minimal()
# In most cases, we will use pre-existing data
# and we will need to use the id_make function first
# We will use the full rollcall voting data
# from the 114th Senate as a rollcall object
data('senate114')
# Running this model will take at least a few minutes, even with
# variational inference (use_vb=T) turned on
to_idealstan <- id_make(score_data = senate114,
                        outcome = 'cast_code',
                        person_id = 'bioname',
                        item_id = 'rollnumber',
                        group_id = 'party_code',
                        time_id = 'date',
                        high_val = 'Yes',
                        low_val = 'No',
                        miss_val = 'Absent')
sen_est <- id_estimate(to_idealstan,
                       model_type = 2,
                       use_vb = TRUE,
                       fixtype = 'prefix',
                       restrict_ind_high = "BARRASSO, John A.",
                       restrict_ind_low = "WARREN, Elizabeth")
# After running the model, we can plot
# the results of the person/legislator ideal points
id_plot_legis(sen_est)
Estimate an idealstan model
Description
This function will take a pre-processed idealdata vote/score dataframe and run one of the available IRT/latent space ideal point models on the data using Stan’s MCMC engine.
Usage
id_estimate(
idealdata = NULL,
model_type = 2,
inflate_zero = FALSE,
vary_ideal_pts = "none",
keep_param = NULL,
grainsize = 1,
mpi_export = NULL,
use_subset = FALSE,
sample_it = FALSE,
subset_group = NULL,
subset_person = NULL,
sample_size = 20,
nchains = 4,
niters = 1000,
use_vb = FALSE,
ignore_db = NULL,
restrict_ind_high = NULL,
fix_high = 1,
fix_low = (-1),
restrict_ind_low = NULL,
num_restrict_high = 1,
num_restrict_low = 1,
fixtype = "prefix",
const_type = "persons",
id_refresh = 0,
prior_only = FALSE,
warmup = 1000,
ncores = 4,
use_groups = FALSE,
discrim_reg_upb = 1,
discrim_reg_lb = -1,
discrim_miss_upb = 1,
discrim_miss_lb = -1,
discrim_reg_scale = 2,
discrim_reg_shape = 2,
discrim_miss_scale = 2,
discrim_miss_shape = 2,
person_sd = 3,
time_fix_sd = 0.1,
time_var = 10,
spline_knots = NULL,
spline_degree = 2,
ar1_up = 1,
ar1_down = 0,
boundary_prior = NULL,
time_center_cutoff = 50,
restrict_var = FALSE,
sample_stationary = FALSE,
ar_sd = 1,
diff_reg_sd = 3,
diff_miss_sd = 3,
restrict_sd_high = 10,
restrict_sd_low = 10,
restrict_N_high = 1000,
restrict_N_low = 1000,
gp_sd_par = 0.025,
gp_num_diff = 3,
gp_m_sd_par = 0.3,
gp_min_length = 0,
cmdstan_path_user = NULL,
map_over_id = "persons",
save_files = NULL,
het_var = TRUE,
compile_optim = FALSE,
debug = FALSE,
init_pathfinder = TRUE,
debug_mode = FALSE,
...
)
Arguments
idealdata
|
An object produced by the id_make function containing a score/vote matrix for use in estimation and plotting
|
model_type
|
An integer reflecting the kind of model to be estimated. See below. |
inflate_zero
|
If the outcome is distributed as Poisson (count/unbounded integer), setting this to TRUE will fit a traditional zero-inflated model. To use correctly, the value for zero must be passed as the miss_val option to id_make before running a model so that zeroes are coded as missing data.
|
vary_ideal_pts
|
Default ‘none’. If ‘random_walk’, ‘AR1’, ‘GP’, or ‘splines’, a time-varying ideal point model will be fit with either a random-walk process, an AR1 process, a Gaussian process or a spline. Note that the spline is the easiest time-varying model to fit so long as the number of knots (option spline_knots) is significantly less than the number of time points in the data. See documentation for more info.
|
keep_param
|
A list with logical values for different categories of parameters which should/should not be kept following estimation. Can be any/all of person_int for the person-level intercepts (static ideal points), person_vary for person-varying ideal points, item for observed item parameters (discriminations/intercepts), item_miss for missing item parameters (discriminations/intercepts), and extra for other parameters (hierarchical covariates, ordinal intercepts, etc.). Takes the form list(person_int=TRUE,person_vary=TRUE,item=TRUE,item_miss=TRUE,extra=TRUE). If any are missing in the list, it is assumed that those parameters will be excluded. If NULL (default), will save all parameters in output.
|
grainsize
|
The grainsize parameter for the reduce_sum function used for within-chain parallelization. The default is 1, which means 1 chunk (item or person) per core. Set to -1. to use
|
mpi_export
|
If within_chains=“mpi” , this parameter should refer to the directory where the necessary data and Stan code will be exported to. If missing, an interactive dialogue will prompt the user for a directory.
|
use_subset
|
Whether a subset of the legislators/persons should be used instead of the full response matrix |
sample_it
|
Whether or not to use a random subsample of the response matrix. Useful for testing. |
subset_group
|
If person/legislative data was included in the id_make function, then you can subset by any value in the $group column of that data if use_subset is TRUE .
|
subset_person
|
A list of character values of names of persons/legislators to use to subset if use_subset is TRUE and person/legislative data was included in the id_make function with the required $person.names column
|
sample_size
|
If sample_it is TRUE , this value reflects how many legislators/persons will be sampled from the response matrix
|
nchains
|
The number of chains to use in Stan’s sampler. Minimum is one. See stan for more info. If use_vb=TRUE , this parameter will determine the number of Pathfinder paths to estimate.
|
niters
|
The number of iterations to run Stan’s sampler. Shouldn’t be set much lower than 500. See stan for more info.
|
use_vb
|
Whether or not to use Stan’s Pathfinder algorithm instead of full Bayesian inference. Pathfinder is much faster but can be much less accurate. Note that Pathfinder is also used by default to find initial starting values for full HMC sampling. |
ignore_db
|
If there are multiple time periods (particularly when there are very many time periods), you can pass in a data frame (or tibble) with one row per person per time period and an indicator column ignore that is equal to 1 for periods that should be considered in sample and 0 for periods that should be considered out of sample. This is useful for excluding time periods from estimation for persons when they could not be present, such as before entrance into an organization or following death. If ignore equals 0, the person’s ideal point is estimated as a standard Normal draw rather than an auto-correlated parameter, reducing computational burden substantially. Note that there can only be one pre-sample period of 0s, one in-sample period of 1s, and one post-sample period of 0s. Multiple in-sample periods cannot be interspersed with out-of-sample periods. The columns must be labeled person_id, time_id and ignore, and must match the formatting of the columns fed to the id_make function.
|
restrict_ind_high
|
If fixtype is not "vb_full", a vector of character values or integer indices of a legislator/person or bill/item to pin to a high value (default +1).
|
fix_high
|
The value that the high fixed ideal point(s) should be fixed to. Default is +1. Does not apply when const_type=“items” ; in that case, use restrict_sd /restrict_N parameters (see below).
|
fix_low
|
The value that the low fixed ideal point(s) should be fixed to. Default is -1. Does not apply when const_type=“items” ; in that case, use restrict_sd /restrict_N parameters (see below).
|
restrict_ind_low
|
If fixtype is not "vb_full", a vector of character values or integer indices of a legislator/person or bill/item to pin to a low value (default -1).
|
num_restrict_high
|
If using variational inference for identification (fixtype=“vb_full”), how many parameters to constrain to positive values? Default is 1.
|
num_restrict_low
|
If using variational inference for identification (fixtype=“vb_full”), how many parameters to constrain to negative values? Default is 1.
|
fixtype
|
Sets the particular kind of identification used on the model, could be either ‘vb_full’ (identification provided exclusively by running a variational identification model with no prior info), or ‘prefix’ (two indices of ideal points or items to fix are provided to options restrict_ind_high and restrict_ind_low ). See details for more information.
|
const_type
|
Whether “persons” are the parameters to be fixed for identification (the default) or “items” . Each of these pinned parameters should be specified to fix_high and fix_low if fixtype equals “prefix” , otherwise the model will select the parameters to pin to fixed values.
|
id_refresh
|
The number of times to report iterations from the variational run used to identify models. Default is 0 (nothing output to console). |
prior_only
|
Whether to only sample from priors as opposed to the full model with likelihood (the default). Useful for doing prior predictive checks. |
warmup
|
The number of iterations to use to calibrate Stan’s sampler on a given model. Shouldn’t be less than 100. See stan for more info.
|
ncores
|
The number of cores in your computer to use for parallel processing in the Stan engine. See stan for more info. If within_chain is set to “threads” , this parameter will determine the number of threads (independent processes) used for within-chain parallelization.
|
use_groups
|
If TRUE , group parameters from the person/legis data given in id_make will be estimated instead of individual parameters.
|
discrim_reg_upb
|
Upper bound of the rescaled Beta distribution for observed discrimination parameters (default is +1) |
discrim_reg_lb
|
Lower bound of the rescaled Beta distribution for observed discrimination parameters (default is -1). Set to 0 for conventional IRT. |
discrim_miss_upb
|
Upper bound of the rescaled Beta distribution for missing discrimination parameters (default is +1) |
discrim_miss_lb
|
Lower bound of the rescaled Beta distribution for missing discrimination parameters (default is -1). Set to 0 for conventional IRT. |
discrim_reg_scale
|
Set the scale parameter for the rescaled Beta distribution of the discrimination parameters. |
discrim_reg_shape
|
Set the shape parameter for the rescaled Beta distribution of the discrimination parameters. |
discrim_miss_scale
|
Set the scale parameter for the rescaled Beta distribution of the missingness discrimination parameters. |
discrim_miss_shape
|
Set the shape parameter for the rescaled Beta distribution of the missingness discrimination parameters. |
time_fix_sd
|
The variance of the over-time component of the first person/legislator is fixed to this value as a reference. Default is 0.1. |
spline_knots
|
Number of knots (essentially, number of points at which to calculate time-varying ideal points given T time points). Default is NULL, which means that the spline is equivalent to a polynomial time trend of degree spline_degree. Note that the number of knots (if not NULL) must be equal to or less than the number of time points, and there is no reason to have it equal to the number of time points as that will likely over-fit the data.
|
spline_degree
|
The degree of the spline polynomial. The default is 2 which is a quadratic polynomial. A value of 1 will result in independent knots (essentially pooled across time points T). A higher value will result in wigglier time series. There is no "correct" value but lower values are likely more stable and easier to identify. |
boundary_prior
|
If your time series has very low variance (change over time) and your model has a lot of divergent transitions, you may want to use this option to put a boundary-avoiding inverse-gamma prior on the time series variance parameters. To do so, pass a list with an element called beta that signifies the rate parameter of the inverse-gamma distribution. For example, try boundary_prior=list(beta=1). Increasing the value of beta will increase the "push" away from zero. Setting it too high will result in time series that exhibit a lot of "wiggle" without much need.
|
time_center_cutoff
|
The number of time points above which the model will employ a centered time series approach for AR(1) and random walk models. Below this number the model will employ a non-centered approach. The default is 50 time points, which is relatively arbitrary and higher values may be better if sampling quality is poor above the threshold. |
sample_stationary
|
If TRUE , the AR(1) coefficients in a time-varying model will be sampled from an unconstrained space and then mapped back to a stationary space. Leaving this TRUE is slower but will work better when there is limited information to identify a model. If used, the ar_sd parameter should be increased to 5 to allow for wider sampling in the unconstrained space.
|
ar_sd
|
If an AR(1) model is used, this defines the prior scale of the Normal distribution. A lower number can help identify the model when there are few time points. |
diff_reg_sd
|
Set the prior standard deviation for the bill (item) intercepts for the non-inflated model. |
diff_miss_sd
|
Set the prior standard deviation for the bill (item) intercepts for the inflated model. |
restrict_sd_high
|
Set the prior shape for high pinned parameters. This has a default of 0.01 (equivalent to +0.99), but could be set lower if the data is really large. |
restrict_sd_low
|
Set the prior scale for low pinned parameters. This has a default of 0.01 (equivalent to -0.99), but could be set lower if the data is really large. To make the prior uninformative, set this value and restrict_N_low to +1 (or +2, +2 for weakly informative).
|
restrict_N_high
|
Set the prior scale for high pinned parameters. Default is 1000 (equivalent to 1,000 observations of the pinned value). Higher values make the pin stronger (for example if there is a lot of data). |
restrict_N_low
|
Set the prior shape for low pinned parameters. Default is 1000 (equivalent to 1,000 observations of the pinned value). Higher values make the pin stronger (for example if there is a lot of data). |
gp_sd_par
|
The upper limit on allowed residual variation of the Gaussian process prior. Increasing the limit will permit the GP to more closely follow the time points, resulting in much sharper bends in the function and potentially oscillation. |
gp_num_diff
|
The number of time points to use to calculate the length-scale prior that determines the level of smoothness of the GP time process. Increasing this value will result in greater smoothness/autocorrelation over time by selecting a greater number of time points over which to calculate the length-scale prior. |
gp_m_sd_par
|
The upper limit of the marginal standard deviation of the GP time process. Decreasing this value will result in smoother fits. |
gp_min_length
|
The minimum value of the GP length-scale parameter. This is a hard lower limit. Increasing this value will force a smoother GP fit. It should always be less than gp_num_diff .
|
cmdstan_path_user
|
Default is NULL, which uses whatever path is set in the cmdstanr package. Specify a file path here to use a different cmdstan installation.
|
map_over_id
|
This parameter identifies which ID variable to use to construct the shards for within-chain parallelization. It defaults to “persons” but can also take a value of “items” . It is recommended to select whichever variable has more distinct values to improve parallelization.
|
save_files
|
The location to save CSV files with MCMC draws from cmdstanr . The default is NULL , which will use a folder in the package directory.
|
het_var
|
Whether to use a separate variance parameter for each item if using Normal or Log-Normal distributions that have variance parameters. Defaults to TRUE and should be set to FALSE only if all items have a similar variance. |
compile_optim
|
Whether to use Stan compile optimization flags (off by default) |
debug
|
For debugging purposes, turns off threading to enable more informative error messages from Stan. Also recompiles model objects. |
init_pathfinder
|
Whether to generate initial values from the Pathfinder algorithm (see Stan documentation). If FALSE, will generate random start values. |
debug_mode
|
Whether to print values of all parameters for debugging purposes. If this is used, only one iteration should be used as it generates a lot of console output. |
…
|
Additional parameters passed on to Stan’s sampling engine. See stan for more information.
|
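The ignore_db argument described above expects one row per person per time period. A minimal base R sketch of building such a data frame follows. The person labels, dates, and the membership table are hypothetical and only illustrate the required layout: columns person_id, time_id and ignore, with a single contiguous in-sample block per person.

```r
# Hypothetical persons and time periods, for illustration only
persons <- c("A", "B", "C")
periods <- as.Date(c("2015-01-01", "2015-07-01", "2016-01-01", "2016-07-01"))

# One row per person per time period
ignore_db <- expand.grid(person_id = persons,
                         time_id = periods,
                         stringsAsFactors = FALSE)

# Hypothetical first and last period in which each person was present
membership <- data.frame(person_id = persons,
                         entry = periods[c(1, 2, 1)],
                         exit = periods[c(4, 4, 2)])

# ignore = 1 for in-sample periods, 0 for out-of-sample periods
idx <- match(ignore_db$person_id, membership$person_id)
ignore_db$ignore <- as.integer(ignore_db$time_id >= membership$entry[idx] &
                                 ignore_db$time_id <= membership$exit[idx])

# Count of in-sample periods per person
tapply(ignore_db$ignore, ignore_db$person_id, sum)
```

A data frame like this could then be supplied through the ignore_db argument of id_estimate, provided the person_id and time_id values are formatted the same way as the columns given to id_make.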
Details
To run an IRT ideal point model, you must first pre-process your data using the id_make function. Be sure to specify the correct options for the kind of model you are going to run: if you want to run an unbounded outcome (i.e. Poisson or continuous), the data needs to be processed differently. Any hierarchical covariates at the person or item level also need to be specified in id_make. If they are specified in id_make, then all subsequent models fit by this function will have these covariates.
Note that for static ideal point models, the covariates are only defined for those persons who are not being used as constraints.
As of this version of idealstan, the following model types are available. Simply pass the number of the model in the list to the model_type option to fit the model.
1. IRT 2-PL (binary response) ideal point model, no missing-data inflation
2. IRT 2-PL (binary response) ideal point model with missing-data inflation
3. Ordinal IRT (rating scale) ideal point model, no missing-data inflation
4. Ordinal IRT (rating scale) ideal point model with missing-data inflation
5. Ordinal IRT (graded response) ideal point model, no missing-data inflation
6. Ordinal IRT (graded response) ideal point model with missing-data inflation
7. Poisson IRT (Wordfish) ideal point model with no missing-data inflation
8. Poisson IRT (Wordfish) ideal point model with missing-data inflation
9. Unbounded (Gaussian) IRT ideal point model with no missing data
10. Unbounded (Gaussian) IRT ideal point model with missing-data inflation
11. Positive-unbounded (Log-normal) IRT ideal point model with no missing data
12. Positive-unbounded (Log-normal) IRT ideal point model with missing-data inflation
13. Latent Space (binary response) ideal point model with no missing data
14. Latent Space (binary response) ideal point model with missing-data inflation
15. Ordered Beta (proportion/percentage) with no missing data
16. Ordered Beta (proportion/percentage) with missing-data inflation
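Because the sixteen model types alternate between non-inflated and inflated variants of eight outcome families, the number to pass to model_type can be looked up mechanically. The base R sketch below builds such a lookup table from the list above. The family labels and the helper lookup_model_type are illustrative names and are not part of idealstan.

```r
# Illustrative labels for the eight outcome families, in list order
families <- c("binary 2-PL", "ordinal rating scale", "ordinal graded response",
              "Poisson", "Gaussian", "log-normal", "latent space", "ordered beta")

# Each family appears twice: first without, then with missing-data inflation
model_types <- data.frame(
  model_type = 1:16,
  family = rep(families, each = 2),
  inflated = rep(c(FALSE, TRUE), times = 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the model_type number for a family/inflation pair
lookup_model_type <- function(family, inflated) {
  model_types$model_type[model_types$family == family &
                           model_types$inflated == inflated]
}

lookup_model_type("binary 2-PL", inflated = TRUE)   # 2
lookup_model_type("Poisson", inflated = FALSE)      # 7
```

The returned integer is what would be passed as model_type, for example model_type = 2 for the inflated binary model used in the examples.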
Value
A fitted idealstan object that contains posterior samples of all parameters either via full Bayesian inference or a variational approximation if use_vb is set to TRUE. This object can then be passed to the plotting functions for further analysis.
Identification
Identifying IRT models is challenging, and ideal point models are still more challenging because the discrimination parameters are not constrained. As a result, more care must be taken to obtain estimates that are the same regardless of starting values. The parameter fixtype enables you to change the type of identification used. The ‘vb_full’ option does not require any further information from you in order for the model to be fit. In this version of identification, an unidentified model is run using variational Bayesian inference (see vb). The function will then select two persons/legislators or items/bills that end up on either end of the ideal point spectrum, and pin their ideal points to those specific values. To control whether persons/legislators or items/bills are constrained, the const_type option can be set to either “persons” or “items” respectively. In many situations, it is prudent to select ahead of time the persons or items to pin to specific values. This allows the analyst to be more specific about what type of latent dimension is to be estimated. To do so, the fixtype option should be set to “prefix”. The values of the persons/items to be pinned can be passed as character values to restrict_ind_high and restrict_ind_low to pin the high/low ends of the latent scale respectively. Note that these should be the actual data values passed to the id_make function. If you don’t pass any values, you will see a prompt asking you to select certain values of persons/items.
The pinned values for persons/items are set by default to +1/-1, though this can be changed using the fix_high and fix_low options. This pinned range is sufficient to identify all of the models implemented in idealstan, though adjusting some parameters may be necessary in difficult cases. For time-series models, one of the person ideal point over-time variances is also fixed to 0.1, a value that can be changed using the option time_fix_sd.
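When plausible positions are known in advance, the persons to pin can be chosen programmatically rather than by hand. The sketch below uses a made-up vector of latent positions (true_person stands in for values such as the simulated ones used in the examples) to pick the indices for restrict_ind_high and restrict_ind_low. The commented id_estimate call shows where they would go and assumes a hypothetical idealdata object named my_data.

```r
# Made-up latent positions for five persons, for illustration only
true_person <- c(-0.3, 1.8, 0.4, -2.1, 0.9)

# Index of the person at each end of the latent scale
high_ind <- which.max(true_person)
low_ind <- which.min(true_person)

# Equivalent to the sort-based selection used in the examples
high_sorted <- sort(true_person, decreasing = TRUE, index.return = TRUE)$ix[1]
low_sorted <- sort(true_person, decreasing = FALSE, index.return = TRUE)$ix[1]

# With a real idealdata object (here a hypothetical my_data), the indices
# would be passed along with fixtype = "prefix":
# fit <- id_estimate(my_data,
#                    model_type = 2,
#                    fixtype = "prefix",
#                    restrict_ind_high = high_ind,
#                    restrict_ind_low = low_ind)
```

Using which.max and which.min avoids sorting the full vector, but both approaches select the same persons when the extreme values are unique.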
References
- Clinton, J., Jackman, S., & Rivers, D. (2004). The Statistical Analysis of Roll Call Data. The American Political Science Review, 98(2), 355-370. doi:10.1017/S0003055404001194
- Bafumi, J., Gelman, A., Park, D., & Kaplan, N. (2005). Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation. Political Analysis, 13(2), 171-187. doi:10.1093/pan/mpi010
- Kubinec, R. "Generalized Ideal Point Models for Time-Varying and Missing-Data Inference". Working Paper.
- Betancourt, M. (October 2017). "Robust Gaussian Processes in Stan". Case Study.
See Also
id_make for pre-processing data, id_plot_legis for plotting results, summary for obtaining posterior quantiles, posterior_predict for producing predictive replications.