Senators Mike Crapo (R-ID) and Chuck Grassley (R-IA) recently asked for an investigation into the Internal Revenue Service’s research activities, including its use of contractors to conduct studies and its security protocols. The senators are right to be concerned about taxpayer privacy, especially after ProPublica’s recent articles about the tax situations of very high-income Americans based on a “massive trove of tax information.”
But their concern that IRS public use files (PUFs) might have been ProPublica’s source is misguided. And further limiting access to tax return data would conflict with the explicit mandate of the Foundations for Evidence-Based Policymaking Act of 2018, which passed with overwhelming bipartisan support in Congress.
That’s why Urban Institute colleagues and I are working with the staff of the IRS’s Statistics of Income (SOI) division to strengthen the agency’s long-standing privacy protections even further and safely expand research access. Together, we are creating a “synthetic” PUF designed to produce data that look like tax returns but are safe to release to researchers.
Because synthetic data may yield unreliable estimates for complex statistical models, we also are developing a safe way for researchers to access the underlying tax data without ever seeing any actual tax returns and with a strong guarantee that published statistics will not inadvertently reveal confidential information about any taxpayer.
Synthetic data protects privacy
We’re creating a synthetic PUF by building a statistical model of the universe of income tax returns and drawing values from it at random. (See this blog post for details.) There are no actual tax returns in the synthetic file. However, researchers will be able to use it to understand information about tax variables and the relationships among them. Among its uses: Running a tax model such as the Tax Policy Center’s to estimate the distributional and revenue effects of legislation.
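For readers who want to see the general idea in code, here is a toy sketch in Python. It is an illustration only and makes no claim to match the model behind the actual synthetic PUF. Everything in it is an assumption made for the example: the two variables (wages and deductions), the lognormal shape, the made-up "confidential" records, and the names `fit_model` and `draw_synthetic`.

```python
import math
import random
import statistics

rng = random.Random(2021)

# Made-up stand-in for confidential records: (wages, deductions) pairs.
# These numbers are invented for illustration; they are not tax data.
confidential = []
for _ in range(500):
    log_wages = rng.gauss(10.5, 0.8)
    log_deductions = 4.0 + 0.5 * log_wages + rng.gauss(0.0, 0.3)
    confidential.append((math.exp(log_wages), math.exp(log_deductions)))


def fit_model(records):
    """Fit a toy model: log(wages) is normal, and log(deductions) is a
    linear function of log(wages) plus normal noise (assumed shapes)."""
    xs = [math.log(w) for w, _ in records]
    ys = [math.log(d) for _, d in records]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return {
        "mean_x": mean_x,
        "sd_x": statistics.stdev(xs),
        "intercept": intercept,
        "slope": slope,
        "sd_resid": statistics.pstdev(residuals),
    }


def draw_synthetic(model, n, rng):
    """Draw n synthetic records from the fitted model, not from the data."""
    records = []
    for _ in range(n):
        lx = rng.gauss(model["mean_x"], model["sd_x"])
        ly = rng.gauss(model["intercept"] + model["slope"] * lx,
                       model["sd_resid"])
        records.append((math.exp(lx), math.exp(ly)))
    return records


model = fit_model(confidential)
synthetic = draw_synthetic(model, 500, rng)
```

Each synthetic record is a fresh random draw from the fitted model, so none is copied from a real record, yet the file keeps the broad relationship between the two variables. A model fit this simply would not be enough on its own to rule out disclosure; it only shows the mechanics of fitting and drawing.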
Crapo and Grassley cite an earlier TPC research paper as evidence that carelessly designed synthetic data could allow for disclosure. The senators are right that our paper points out these risks, but it further shows how to synthesize the data to avoid those risks. The ultimate file will be the safest public use file ever produced.
A validation server safely expands research access to tax data
The senators also question the security of the SOI’s Joint Statistical Research Program that allows researchers to work with IRS employees. The current program is extremely valuable, but it faces two challenges.
First, IRS employees must review all results before releasing them for publication to protect against inadvertent disclosure of confidential information. Because this puts such a strain on IRS staff, the agency has accepted only a fraction of worthwhile research proposals.
The second challenge could come from the outside–if, for example, an intruder could combine published statistics with outside information to infer something about an individual tax return. This is a much different concern from thousands of complete tax returns apparently being leaked to ProPublica, but the IRS wants to protect against all disclosure risks.
A validation server could address both challenges. It would expand access to tax data for research while providing even more robust protections and using fewer IRS resources in the process.
It would work like this: A researcher would develop a statistical analysis using synthetic data. The validation server would run the program on the confidential tax return data and alter the resulting statistics–by small amounts in most cases–to protect against disclosure. The released statistics would be valid, just somewhat less precise.
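The step of altering a statistic before release can be sketched in a few lines of Python. This is a hedged illustration, not a description of the server under development: it assumes one common approach, adding Laplace noise scaled to how much a single record could move a clipped mean. The function names, the clipping bounds, the `epsilon` privacy parameter, and the made-up incomes are all assumptions for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def validated_mean(confidential_values, lower, upper, epsilon, rng):
    """Return a noisy mean of the confidential values.

    Values are clipped to [lower, upper] so that changing one record can
    move the mean by at most (upper - lower) / n. The noise is scaled to
    that amount divided by epsilon (smaller epsilon means more noise).
    """
    clipped = [min(max(v, lower), upper) for v in confidential_values]
    exact = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return exact + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(7)

# Made-up incomes standing in for confidential data (not real tax data).
incomes = [rng.lognormvariate(10.8, 0.9) for _ in range(10_000)]

# What the researcher would receive.
released = validated_mean(incomes, 0.0, 500_000.0, epsilon=1.0, rng=rng)

# Computed here only to show the size of the change; in this sketch a
# real server would never release the exact value.
exact = sum(min(max(v, 0.0), 500_000.0) for v in incomes) / len(incomes)
```

With 10,000 records and these assumed bounds, the noise has a typical size of about 50 dollars, which is small relative to an average income in the tens of thousands. The same analysis on a few dozen records would receive far more noise, which is the point: the fewer people behind a statistic, the more it is blurred.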
Think of it like the manipulated video a journalist uses to protect the identity of a source. The journalist modulates the source’s voice and pixelates their face so the viewer can hear what the source has to say without learning who they are. With a validation server, an algorithm modifies any statistics to protect against even a remote risk of disclosure.
SOI staff still would make decisions about who gets access to IRS data and what information is revealed. But the actual process of releasing statistics could be automated, reducing the demands on the IRS staff and providing a systematic privacy guarantee. The bottom line: far more access to data for critical policy analysis and better security.
More privacy and more research
The senators’ concerns strengthen the case for enhancing the IRS’s long-standing effort to protect data. And they strengthen the case for building high quality synthetic datasets and a validation server.
It is essential that policymakers understand the effects and effectiveness of tax laws. And outside research can enhance that knowledge.
Full disclosure: As part of the project described above, the Urban Institute is an unpaid contractor with the IRS, which grants several of my colleagues access to tax data, subject to very tight controls. Our research is supported by grants from Arnold Ventures, the Sloan Foundation, and the National Science Foundation.