Friday, January 24, 2014

5 Datasets in Celebration of Data Innovation Day

(Cross-posted from CREW)

Today is Data Innovation Day. To celebrate, here are five government datasets that, while technically available to the public now, we would love to see published in a structured, downloadable database.

5. Lobbying activities by foreign registered agents

Most people who lobby the federal government are required to report their activities. Those who lobby on behalf of domestic entities have their information published in bulk as a downloadable dataset here. While persons who lobby on behalf of foreign governments are required to file disclosure forms, unfortunately you cannot download all the information as a data file. This information is incredibly valuable, but without the efforts of third parties like ProPublica and the Sunlight Foundation, who make much of it available as a dataset, it would be just about impossible to figure out larger patterns of lobbying activity.

The Department of Justice, which is responsible for tracking foreign agents, should make sure lobbying information about foreign agents is made available to the public in a machine-processable digital format.

4. Political spending on television ads

For over 40 years, advertisers have been required to file information on political ad purchases with broadcasters. That data has been available at each local station, but it is not available in a central location. The Federal Communications Commission recently required that the “political files” of CBS, ABC, NBC, and Fox affiliates located in the top 50 markets be made available online, with all television stations required to upload their information after July 1, 2014. Unfortunately, while some information is now available at this central FCC website, without standardization it’s virtually impossible to search across the data. Most information has been uploaded as unsearchable, virtually unusable PDFs.

The FCC should require that all information be uploaded in a standardized format that allows for easy flow into a database, which the FCC should release to the public.

3. Expenditure reports for members of Congress

By law, the House of Representatives and the United States Senate must publish every dime they spend. For over two centuries, they’ve published regular statements of disbursements, composed of thousands of pages of tables and figures. In the last few years, the House began publishing its quarterly statements online as PDFs, a move belatedly followed by the Senate for its semiannual reports. Of course, several thousand pages of tables are a lot less useful than the same information in an electronic tabular format, i.e., a spreadsheet.

The House and Senate should save themselves a lot of time, and much of their printing costs, and publish their regular expenditure reports in machine-readable formats.

2. Tax forms for non-profit organizations

In recent years, non-profit organizations have become significant players in our political space. While the tax returns that non-profits must file, known as 990s, are required to be available to the public, you can only get them from the IRS one at a time by filling out this form, and then waiting weeks or months for the response. It is possible to get them in bulk from the IRS, but it costs thousands of dollars. This is so unbelievably frustrating that a fair number of organizations have taken up the mantle of (1) publishing all the forms and (2) digitizing their contents. While the administration has said it is moving in this direction, it’s not there yet.

The Obama administration, specifically the Department of the Treasury, should publish all 990 tax forms online in bulk, in both human-readable and machine-readable formats.

1. Legislative information

It's unfair to talk about federal legislative information without distinguishing between the House and the Senate. In recent years, the House has made significant strides in releasing legislative information in electronic formats, from bills to committee hearing notices to votes. Generally speaking, the Senate has lagged significantly. The two chambers jointly are responsible for THOMAS, the legislative information website, and its not-yet-fully-implemented successor, Congress.gov. While the new site is a lot more flexible than THOMAS, the underlying information it contains (bills, amendments, identities of legislative sponsors, bill summaries, the status of legislation) is not available in a machine-readable format. This is a real problem. All the nifty third-party websites that make this information digestible for the rest of us are consequently powered by arcane and fragile techniques that harvest the information from THOMAS.

Congress should publish all of this legislative information online in structured formats that machines can easily process. We have seen other countries, notably the United Kingdom, undertake an effort to make all legislative information available in useful forms. People shouldn’t have to rely on third parties to access basic information about Congress.

Honorable mentions

There are several obvious datasets that have not been mentioned. Let me briefly add that all federal spending information should be made available to the public, which is why we support the DATA Act that would require just that. All reports to Congress should be made available online, which is why we support the Access to Congressionally Mandated Reports Act. And all court opinions and orders should be available to the public without charge. There are many other things the government should do, big and small, to make our lives and the lives of public servants much easier, and probably save the taxpayers a few bucks, too.