
Digitizing the US Code

We’re excited to share a recent conversation with Dr. Sylvia Kwakye, a developer at Cornell University’s Legal Information Institute. Sylvia uses data mining and natural language processing to transform static legal documents into searchable, interconnected, user-friendly compilations.

December 22, 2021

An Interview with Dr. Sylvia Kwakye of the Cornell University Legal Information Institute

Hudson Hollister

‍Sylvia Kwakye of the Cornell University Legal Information Institute (LII), thank you so much for being part of this conversation today.

‍Sylvia Kwakye

Thank you. I'm happy to be here.

‍Hudson Hollister

Sylvia, I would love to hear a bit of your story. What brought you into the field of digital structure for legislative and regulatory materials?

‍Sylvia Kwakye

I actually started out as a biomedical engineer. I came to Cornell as a premed with a background in electrical engineering, to get a PhD in biomedical engineering.

Along the way, I found out that I needed a lot of software development to get my work done. What that involved, for instance, was designing a system to pick primers and probes for various DNA and RNA analyses, so a lot of computer science, pattern recognition, all that good stuff. One of the software engineering classes I took allowed other departments on campus to post projects and solicit student workers to go through a formal development process. LII just happened to be one of the organizations that posted a project, and it was to convert the US Code from a plain text format into structured XML.

‍Hudson Hollister

‍This is what got you into structured data for the law?

‍Sylvia Kwakye

‍Yes, this is what hooked me! I looked at the problem and it was very similar to the kinds of problems I already dealt with. Because when you're picking primers and probes, for instance, you're looking for patterns. You're doing a lot of pattern matching, you're chopping things up and rearranging them in different ways. So the tools to do the job were similar. How do you recognize, say, a citation? How do you say something is a heading? It was all the same things.
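As an illustration of the kind of pattern matching described here, the following is a minimal sketch, not LII's actual code. The two regular expressions and the function names are assumptions made for the example; real US Code text has many more citation and heading variants than these patterns cover.

```python
import re

# Hypothetical patterns for illustration only.
# Matches citations such as "17 U.S.C. 107" or "42 U.S.C. § 1983".
CITATION = re.compile(
    r"\b(?P<title>\d+)\s+U\.S\.C\.\s+(?:§\s*)?(?P<section>\d+[A-Za-z]*(?:-\d+)?)"
)
# Matches section headings such as "§ 107. Limitations on exclusive rights".
HEADING = re.compile(r"^§\s*(?P<num>\d+[A-Za-z]*)\.\s+(?P<text>.+)$")


def find_citations(text):
    """Return (title, section) pairs for every citation found in the text."""
    return [(m.group("title"), m.group("section")) for m in CITATION.finditer(text)]


def classify_line(line):
    """Label a line as a heading or as ordinary text."""
    stripped = line.strip()
    m = HEADING.match(stripped)
    if m:
        return ("heading", m.group("num"), m.group("text"))
    return ("text", None, stripped)
```

Once each line is labeled and each citation located, the plain text can be wrapped in whatever structured markup the project targets.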

‍Hudson Hollister

‍Did your digitization project become the online version of the code that is published by LII?

‍Sylvia Kwakye

‍Yes. It became the very first one. When we finished the project for the class, it wasn't quite done. But we were far along enough that I could see what the end product would look like. And so I talked to Tom Bruce, the director of the LII at the time, and I asked if I could continue working on the project, because I was very interested in seeing the final end result. And he's like, "Yep!"

‍Hudson Hollister

‍I think if I had been in his place I would have said, "Thank God, somebody wants to work on this!"

What was it about the digitization of the code that had you so interested from the very beginning?

‍Sylvia Kwakye

‍It was exciting just to see the end result, the difference between where we started, where you had this blob of text, which was mostly undifferentiated, and then the end result, which was this beautiful HTML where you could follow the links. Just putting up the slide in our presentation of, you know, this is what we're starting with, this is where we're going, and this is where we ended up. The sense of accomplishment was good.

At the time, I didn't have any feeling about the legislation itself, although some of the content was quite amusing. You know, you're reading the US Code, and you're recognizing, for instance, "Oh, it looks like this piece of the code was written with a particular company in mind." Those pieces were interesting, but for the most part, it was the transformation. Going from something that was a blob of text to something that was structured, was searchable, and you could link it to other things on the LII website and to other things in the world. I found that very interesting.

‍Hudson Hollister

That was the beginning of years of work. I want to get to your current role, but I have one question before that. It does seem that, as a result of your project and the transformation that you did, LII's compilations became perhaps the primary unofficial source, outside of the US government itself, for the US Code and for other parts of US law. How did that compilation become so vital for so many users in public policy across the country?

‍Sylvia Kwakye

It was easy to navigate. At the time, just having the ability to navigate very rapidly from one section to another section of interest in a different title, just by clicking on a link, i.e., following a cross-reference, was something to be appreciated. It made it very easy for legal researchers to get to where they needed to go.

It was also readable. There was formatting and you could apply styles. In 2003, it was exciting to apply styles to an HTML page so that something looked presentable. The usability component was very, very big for people. If I had to determine what was most important about the work that we did, I would say the links. Being able to link to other things in the world. Also, with markup, the ability to transform the data into other formats, embed provenance etc. But for most users, it seemed to be the formatting — whitespace, indentation, headings — the presentation was really important.

So it wasn't any of the features that "structured data people" geek out over, it was usability. Having structured data allowed you to present something that was more usable. That was it.
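To make the cross-reference idea concrete, here is a small sketch of turning recognized citations into clickable links. It is an assumption-laden illustration rather than LII's implementation: the citation pattern, the `BASE_URL` layout, and the `link_citations` name are all chosen for the example.

```python
import html
import re

# Hypothetical citation pattern, e.g. "17 U.S.C. 107" or "42 U.S.C. § 1983".
CITATION = re.compile(r"\b(\d+)\s+U\.S\.C\.\s+(?:§\s*)?(\d+[A-Za-z]*)")

# Assumed URL layout, used here only for illustration.
BASE_URL = "https://www.law.cornell.edu/uscode/text"


def link_citations(text):
    """Escape plain text for HTML, then wrap each citation in a link."""
    escaped = html.escape(text)

    def to_link(m):
        title, section = m.group(1), m.group(2)
        return f'<a href="{BASE_URL}/{title}/{section}">{m.group(0)}</a>'

    return CITATION.sub(to_link, escaped)
```

Escaping first and linking second keeps any stray `<` or `&` in the source text from breaking the generated page.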

‍Hudson Hollister

Tell us about your path since that consequential time you decided to join LII. How has your job evolved, and what are you working on right now?

‍Sylvia Kwakye

‍From there I finished school, went off to do other things, and my family just happened to come back to Ithaca so my husband could go to school. I contacted LII and asked, "Do you have something for me to do?"

At the time, LII was not using USLM XML output yet. So they asked if I could work on that, conform the project I'd worked on to USLM formatted XML and redo the US Code collection. So that was my first project when I started back at LII.

‍Hudson Hollister

‍Would it be right to say that you have reformatted the entire US Code twice?

‍Sylvia Kwakye

‍Yes. So we started with the plain ASCII, and oh, I forgot, we also did the bell codes one, and then the XML.

‍Hudson Hollister

Is that three times, then? Three times you've gone through the code: ASCII, HTML, and USLM?

‍Sylvia Kwakye

‍Yes. But not HTML, bell codes.

‍Hudson Hollister

‍What are bell codes?

‍Sylvia Kwakye

‍That's another form of markup that the government had, and I think they were designed for ancient typesetting software.

‍Hudson Hollister

‍So the most recent iteration of your project has been taking the official USLM version and ...

‍Sylvia Kwakye

... Enriching it. Adding back all the features that we had previously, and adding more, like the extraction of definitions, marking up named entities, and things like that.

‍Hudson Hollister

‍Is the LII compilation now fully USLM-based?

‍Sylvia Kwakye

Yes, we're using USLM.

‍Hudson Hollister

Congratulations.

‍Sylvia Kwakye

Thank you. Right around when I finished that, we got CFR XML, which was easier to work with than what they had previously put out. So we pretty much dropped everything and consumed that new data source, and we did the same thing to it that we did for the US Code. And actually CFR, the Code of Federal Regulations, is one of the most highly trafficked collections at this point.

‍Hudson Hollister

Let me ask you about maintenance for both the US Code and for the Code of Federal Regulations. What is the procedure for keeping the LII compilation up to date, now that we have access to more modern sources?

‍Sylvia Kwakye

We've put in automated code that checks for new data at least once a week per collection. The checks include monitoring RSS feeds, FTP sites, web scrapers/hooks, etc. It’s a little different for each collection. We also have sharp users who let us know if we happen to miss anything. We then run anything new through our automated enrichment pipelines.

We have extensive logging on various pieces of the pipelines. If something is different from what we expect, the logger will note it, fire off a message to Slack or email so that we can go and look at it. So every day, via Slack, we get a message about what is new in various collections, how many of those are in processing or have successfully been processed and served on the site.

And, as I said, if anything goes wrong, the pipeline is designed to send a notice to the appropriate Slack channel as well. We see that, go in, and find that, say, maybe they've changed how they're presenting an element or how they're marking up something or other, especially in CFR. CFR seems to be shifting towards USLM, so periodically you have changes in how the data is structured that the existing software might not be able to handle correctly. When we get messages like that, we go in and fix it ASAP. So everything is updated at least once a week.
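The check, log, and notify loop described above could be sketched roughly as follows. This is a hypothetical illustration, not LII's pipeline: the `process_update` function, the `seen` dictionary of fingerprints, and the `notify` callback (standing in for a Slack or email message) are all assumptions made for the example.

```python
import hashlib
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("pipeline")


def fingerprint(data: bytes) -> str:
    """Hash the downloaded bytes so unchanged data can be skipped."""
    return hashlib.sha256(data).hexdigest()


def process_update(collection, data, seen, notify):
    """Return 'unchanged', 'processed', or 'failed' for one download.

    seen   -- dict mapping collection name to the last processed fingerprint
    notify -- hypothetical callback(collection, message), e.g. a chat hook
    """
    digest = fingerprint(data)
    if seen.get(collection) == digest:
        return "unchanged"
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        # Malformed XML: log it and alert a human instead of publishing.
        logger.error("%s: cannot parse XML: %s", collection, exc)
        notify(collection, f"ingest failed: {exc}")
        return "failed"
    # A real pipeline would run its enrichment steps on `root` here.
    seen[collection] = digest
    notify(collection, f"new data processed (root element <{root.tag}>)")
    return "processed"
```

Recording the fingerprint only after a successful parse means a broken file is retried on the next check rather than silently marked as done.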

‍Hudson Hollister

‍What's one of the most difficult challenges in that scraping? What's one of the changes that might have been made that was most difficult to deal with on the ingestion?

‍Sylvia Kwakye

‍Occasionally, we get what looks like hand-edited XML. Like somebody might have gone in and done something by hand and saved it. So you will get weird encoding errors. Maybe they opened it in Word, edited it, and saved — who knows! But yes, occasionally something will get coded in a way where we cannot ingest the XML at all. In which case, as I said, the error is logged. It pops up in the correct channel in Slack for the collection. And we have to go in and manually fix it.

Usually, we send a polite message to the folks at the source saying that this document came through with this bunch of errors, and they're always very, very gracious. It's usually fixed very quickly. We have a good relationship with most of the bulk data distributors. There are some instances where we do have to go scrape web pages to get the information that we need. That is the case with the annotated constitution, for instance. For the longest time, the only way we could get it was via PDF. We actually did have a project where we extracted the text from the PDF and converted everything to XML. This way, we could have it in a format that we could transform in various ways, one of which was to transform it into HTML and serve it to the public. Now, they do publish it in HTML, which is great, but there's no bulk data download distribution.

‍Hudson Hollister

‍Let me ask you about the medium term at LII. What are some of the improvements that you would like to be able to make for the future?

‍Sylvia Kwakye

‍Right now, most of our collections are at the federal level, but we recently started working with legal documents from the States. Our latest project is state regulations. We want to do the same thing that we've done for the Code of Federal Regulations for state regulations.

Take election law, for instance. Somebody who is researching election law could very readily review it across states, because topic references are marked up in a way that makes this kind of comparison easy to do. So, all these different kinds of enrichment. Take definitions, as another example: do states define the same thing in the same way? What are the patterns of differences? That is the direction we're going in.

Of course we have more capable tools now for natural language processing and machine learning to bring to bear on some of these analyses.

‍Hudson Hollister

And for the long term, what do you hope that the bulk data sources in the federal government and in state governments will do to bring about truly data-first, modern law and regulation?

‍Sylvia Kwakye

We have all sorts of tools at our disposal: Akoma Ntoso, USLM, and all these standards. Make use of them! If the structure of the document is taken care of at the source, it allows us to do so much more. Can you do topic modeling? Could you extract all the named entities? Can you connect them to different knowledge graphs? My goodness, you could make so many more connections, you could tell better stories.

One fun project we did with students was to build topic models out of various titles in the Code of Federal Regulations. You see these patterns emerging. When did we start regulating pork, for instance, did something happen at that time? Was there an outbreak of illness at that time? Another looked at fracking laws in relation to geography. It was interesting to see laws about fracking in areas that had no shale to frack. I mean, you could make these sorts of law-of-something connections and tell much better stories to connect people to the laws that affect them.

Because I'm at Cornell and I'm surrounded by a lot of academics, I'm looking at it in terms of what academics want to do with that data. But making bulk data readily available would also make life easier for legislators. If they are able to compare what has been done before, what worked and what didn't work, for instance, and if we’re able to visualize legislation in different ways that make sense to them and their constituencies, we all benefit. When you have structured data, you could present it in so many different ways. You could make it more usable for people who are challenged in some way. I mean, the whole work around accessibility: we've had to do a massive restructuring of all our collections to make them accessible. That work would be easier if some of the basic structures from the distributors were in place.

‍Hudson Hollister

‍I am certain that because of the work of advocates and the work of practitioners like yourself, this will happen.

‍Sylvia Kwakye

‍I hope so.

‍Hudson Hollister

‍We’ll have to leave it there. Thank you so much for spending this time. I hope we have a chance to revisit this conversation soon.

‍Sylvia Kwakye

‍Thank you.
