The guiding principle behind
duawranglr is to make it easier for organizations to share data that
contain protected elements and/or personally idenfiable information
(PII) with researchers. There are two key problems this package attempts
to solve:
- Data owners and reseachers may wish to collaborate on multiple
projects, each with a different level of data security required;
executing a unique data usage agreement (DUA) for each project can be
time consuming and inefficient.
- Administrators tasked with approving data requests do not always
have the time or technical proficiency to closely review the code that
reads, subsets, filters, and deidentifies data files according to a
DUA.
Data usage agreements
The duawranglr package is designed with the idea that rather than
setting a new DUA for each project in an ongoing collaboration between
researchers and data partners, two things will happen instead:
- An overarching DUA will be signed that establishes a general
framework for collaboration with multiple pre-established levels of data
restriction; for each new project, these levels (e.g., I, II,
& III) are invoked and used to determine which variables may be
shared, with whom, and under what conditions according to the DUA.
- An associated crosswalk file—which can be an easy-to-modify and
share spreadsheet—will list the names of data elements that are
restricted at each level. This crosswalk is then used to clearly
transform raw restricted data files into those that can be shared under
the conditions of the DUA.
An example DUA crosswalk
An example crosswalk file (e.g. a CSV file or Excel
spreadsheet) might look like this:
sid |
sid |
sid |
sname |
sname |
sname |
dob |
dob |
|
gender |
|
|
raceeth |
|
|
tid |
|
|
tname |
tname |
tname |
zip |
zip |
|
Each column represents a restriction level—level_i
,
level_ii
, or level_iii
—along with the
corresponding data element names that are restricted at that level. In
this crosswalk, like variable names have been aligned so that they are
easier to compare, but the elements can be included in whichever way
makes most sense to the data administrator.
The restriction level names are arbitrary as far as the package goes,
but in conjunction with a DUA, they have meaning:
- Level I: The first level produces data sets that
can be shared more widely, but at the cost of losing access to many data
elements in the final data set.
- Level II: The second level has slightly fewer data
element restrictions, making it better for more research projects. Data
produced at this level likely come with more sharing and storage
restrictions than those produced at the first level.
- Level III: The third level has the fewest
restrictions: only names and the student’s ID cannot be contained in the
final data set. Data produced at this level will have the strongest
restrictions on who can use it an how it is stored by the research
team.
The benefit of this level-plus-crosswalk system is two-fold:
- Data element restrictions are clearly defined for each level, which
in turn has its own clearly defined scope for data storage and sharing.
When starting a new project under the scope of the DUA, researchers and
data partners need only to assign a proper level based on the needs of
the analyses.
- Because the crosswalk is a simple tabular file, data element names
can easily be added or deleted by data partners who do not typically use
data analysis software. This helps keep the process transparent for all
team members.
What duawranglr does not do
Functions in the package do not
- Replace existing data wrangling functions
- Guarantee data security
There are many packages, such as those in the tidyverse suite, that are already
well suited to data wrangling tasks. There is no need to replicate those
functions in this package.
It also should go without saying, but users can simply not
use functions in this package when attempting to secure restricted
data. What this package does is offer a framework and a set of useful
functions that, when followed, help users secure data in a clear and
replicable manner that allows data administrators to more easily
participate in the process.