Sunday, June 2, 2013

3 Federal #HackForDC Ideas From a Non-Coder

Today, the National Day of Civic Hacking, I am at home, taking care of my 7-month-old, having given up my ticket to #hackfordc so that someone else can meet the cool folks I’ve had the pleasure of working with over the years. Even as I struggle through Learn Python the Hard Way, I am reminded that you don’t have to be a coder to contribute.

In that spirit, here are 3 project ideas that make use of congressional information.

What's Happening on the Hill?

It’s just about impossible to follow when Congress has scheduled a committee/subcommittee hearing/meeting without a paid subscription to a news service that gathers this info. But over the last few years, the Senate and House have begun releasing meeting notices online in parsable formats. Unfortunately, there’s no publicly available central place to see all the notices from the different committees, and it’s not possible to sign up for official alerts for a particular subcommittee. All the data is there, but it isn’t being corralled.

For most people, it may be useful to follow a few particular subcommittees, but information about actions by the others is distracting. For example, I pay attention to the Legislative Branch Appropriations Subcommittee, but don’t really care much about the other Appropriations Subcommittees. There should be a way to filter out the noise.

What would be great is if one could subscribe to subcommittee notices as RSS feeds, or even better, as something that could be pushed by email as information is updated. A user could subscribe to the subcommittees (including the full committee itself) of his or her choosing, and ignore the rest.

Here’s where you can find the data:

The Senate meeting calendar is available in XML here. As you look at the XML, you can see that the calendar identifies both the name of the full committee and the subcommittee.

The House publishes notices of meetings and markups weekly here, and if you go to a particular committee (say Appropriations), there’s an RSS feed for upcoming committee meetings. (It’s also possible to filter the calendar by subcommittee, but I’m not sure how you get at the underlying data.) The subcommittee is identified in the description tag, along with other details.
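For the coders in the room, here is a minimal sketch of the filtering step in Python. It assumes the feed follows the ordinary RSS 2.0 layout (item elements with title and description children) and that the subcommittee name appears somewhere in the description text. The sample feed and the function name filter_notices are made up for illustration; the real House feed may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in standard RSS 2.0 shape; not copied from a real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Upcoming Committee Meetings</title>
    <item>
      <title>Budget Hearing</title>
      <description>Subcommittee on Legislative Branch. Details to follow.</description>
    </item>
    <item>
      <title>Markup</title>
      <description>Subcommittee on Defense. Closed session.</description>
    </item>
  </channel>
</rss>"""


def filter_notices(feed_xml, wanted):
    """Return (title, description) pairs whose description names a wanted subcommittee."""
    root = ET.fromstring(feed_xml)
    wanted_lower = [name.lower() for name in wanted]
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Case-insensitive substring match against the description text.
        if any(name in description.lower() for name in wanted_lower):
            matches.append((title, description))
    return matches


for title, description in filter_notices(SAMPLE_FEED, ["Legislative Branch"]):
    print(title, "|", description)
```

The same list of matches could then feed a per-subcommittee RSS feed or an email alert. A plain substring match is crude, so a real tool would probably need a cleaner way to pull the subcommittee name out of each notice.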

Open Up Draft Legislation

It’s important to be able to have plain text versions of bills, especially draft legislation. Why? Clean (non-PDF) versions can be compared against other iterations to see what has changed and marked up so that you can easily make suggestions for improvements. Unfortunately, pre-introduction legislation is only made available to staff as a PDF, which is hardly useful to anyone. And sometimes even introduced legislation is available first as a PDF and only later as XML.

What would be helpful is a tool that ingests PDFs of draft legislation and returns plain text. But converting the PDF to text isn’t enough. It also would need to remove the line numbers, the headers (e.g. “F:\M13\ROYCE\ROYCE_005.XML”, as well as the page numbers), and the footers (e.g. “F:\M13\ROYCE\ROYCE_005.XML f:\VHLC\022613\022613.176.xml (542138|23)”). By clearing out this additional stuff, you’re left with the text of the legislation only, which can then be used in many ways.
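As a rough sketch of the cleanup step, the Python below takes text that some other PDF-to-text tool has already extracted (that extraction step is assumed, not shown) and drops the three kinds of clutter described above. The patterns are guesses based on the header and footer examples in this post, the function name clean_bill_text is hypothetical, and the bill wording in the sample is invented.

```python
import re

# Header/footer lines start with a drafting-file path such as F:\M13\... or f:\VHLC\...
FILE_PATH_LINE = re.compile(r"^\s*[A-Za-z]:\\\S+")
# A line containing only digits is treated as a page number.
PAGE_NUMBER_LINE = re.compile(r"^\s*\d+\s*$")
# A leading one- or two-digit number followed by whitespace is treated as a line number.
LEADING_LINE_NUMBER = re.compile(r"^\s*\d{1,2}\s+")


def clean_bill_text(raw_text):
    """Strip file-path headers/footers, page numbers, and line numbers."""
    cleaned = []
    for line in raw_text.splitlines():
        if FILE_PATH_LINE.match(line) or PAGE_NUMBER_LINE.match(line):
            continue
        line = LEADING_LINE_NUMBER.sub("", line, count=1).strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


# Invented sample that mimics the layout of extracted draft-bill text.
SAMPLE = "\n".join([
    r"F:\M13\ROYCE\ROYCE_005.XML",
    "1 SECTION 1. SHORT TITLE.",
    "2 This Act may be cited as the",
    "3 'Example Act'.",
    "2",
    r"F:\M13\ROYCE\ROYCE_005.XML f:\VHLC\022613\022613.176.xml (542138|23)",
])

print(clean_bill_text(SAMPLE))
```

One known weakness: a line of bill text that happens to begin with a small number and is not itself numbered would lose that number, so the output would still need a human look.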

Here are some examples of Senate pre-introduced legislation. Example 1, Example 2, Example 3, Example 4. You can use this Google search to find more: ‘S.  ll  "In the Senate of the United States" filetype:pdf’. Note that the S.L.C. in the top right corner means it was drafted by Senate Legislative Counsel, indicating it likely will follow standard formatting.

Here are some examples of House pre-introduced legislation. Example 1, Example 2, Example 3, Example 4. You can use this Google search to find more: ‘H. R. ll IN THE HOUSE OF REPRESENTATIVES "(Original Signature of Member) " filetype:pdf’. In the House, nearly all legislation is drafted by House Legislative Counsel, so they all follow a pretty standard format.

CRS Report Freshness Ratings

The Congressional Research Service is a congressional think tank, and it issues reports on important issues of the day. Over time, CRS will update a report to reflect new facts or changing circumstances. Sometimes these changes are significant, but other times the update could be as minor as the addition of punctuation or removal of a citation. However, there’s no way for the reader to know whether the new report needs to be read closely or if there’s just been a cosmetic change.

CRS reports should have freshness ratings based on a comparison of the current text to the previous iteration. So, if the language is virtually identical except for the addition of a sentence, it would receive a low rating (e.g. 1% fresh), but if the report has been largely rewritten, it would receive a high rating (e.g. 80% fresh).

Every CRS report has a unique identifier on its front page, as well as the date it was issued. For example, a report could have unique ID RL1234 and an issue date of May 1, 2012. If it is reissued, the unique ID stays the same, but it gets a new issue date of September 1, 2013. Alas, the reports are in PDF format, so it’s probably a non-trivial problem to show what text has changed. But using PDF-to-text, you can at least compare the output files to see whether there’s a trivial or significant difference.
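One possible way to turn that comparison into a number is sketched below in Python. It assumes both versions have already been converted to plain text, and it uses the standard library's difflib to measure how much of the word sequence is shared. Defining freshness as 100 minus the percent similarity is my own assumption, not an established measure, and the function name freshness is hypothetical.

```python
import difflib


def freshness(old_text, new_text):
    """Return a 0-100 score: 0 means identical wording, 100 means nothing in common."""
    # Compare word by word so that reflowed line breaks from PDF extraction don't count.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    return round((1.0 - matcher.ratio()) * 100, 1)


# Invented stand-ins for two versions of the same report.
old = "The program was funded in 2012 and serves several states."
minor = "The program was funded in 2012 and serves several states today."
major = "Congress ended the program and redirected its budget elsewhere."

print(freshness(old, old))    # identical text
print(freshness(old, minor))  # one word added
print(freshness(old, major))  # mostly rewritten
```

Run over two real reports, a score near zero would suggest a cosmetic update and a high score a substantial rewrite. Where to draw the line between the two is a judgment call that would need testing against actual CRS reports.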

So where can you find CRS reports? That’s another problem, but a large corpus is available at opencrs, which just happens to have an API. If you want to gather more, there are other aggregators, or you could use this Google search: ‘7-5700 "Congressional Research Service" filetype:pdf’.