We had the privilege of interviewing Vlad Eidelman, chief scientist at FiscalNote, a leading AI-driven enterprise SaaS company that delivers global policy data and insights, on how digital structure transforms legislative and regulatory work. The interview was conducted by Hudson Hollister, founder and CEO of Hdata and former president of the Data Coalition. This transcript has been edited for clarity.
Dr. Vlad Eidelman, chief scientist at FiscalNote, thank you so much for agreeing to be interviewed.
Vlad, tell us a bit about those heady early days, when you first began to design scrapers and technology for the platform that became FiscalNote.
Yeah, so that’s going back seven, eight years now, a little over eight years. The problem that we saw when we got together was that there was so much unstructured textual data coming out at all levels of government, from local to federal in the U.S. and internationally, and we’ll focus on the U.S. at this point.
There were thousands of new documents coming out a day. And that’s kind of on the lower side of the estimate. If you’re including all the regulatory filings, the enforcement actions, the other kinds of activities, it’s easily in the tens of thousands or hundreds of thousands.
But even just focusing on legislation and regulation, across the state and federal level, which is where we started, at the state level at first, at the peak cycles of the [legislative] seasons, there were thousands of new bills coming out every day and thousands of changes to existing bills.
We saw that people were tracking those introductions and updates and movements manually, largely because there were at least fifty different sources, for each state.
Underneath that, we talked about URLs. Each state government has multiple URLs, actual web servers, websites, that host different aspects of the legislative or regulatory process. And each of those has different websites for their calendars, their hearings, their agendas.
And so you’re quickly exploding into potentially dozens or hundreds of sources of URLs, even for what would, from a human perspective, be considered one data set: legislation in Maryland, for example, or legislation in New York.
On top of that, you have hundreds of regulatory agencies, which might be putting out announcements through a bulletin or some sort of register, or it might be spread throughout different agencies of a state.
The first problem was this: how do we collect all the URLs, the actual source URLs we need to start with to get at the [legislative and regulatory] data?
We built out several systems for quickly identifying URLs and validating them, involving some amount of human in the loop. From the beginning we have tried to automate as much as possible but we recognized that we couldn’t automate everything.
The second task, once we had a URL, was basic scraping, which was the first implementation [of the FiscalNote platform]. We look at a [legislative or regulatory] site and we try to crawl through its structure and identify paths within it that appear important or interesting to us – for example, the text of a bill, the time a committee hearing is happening, the identity of a bill’s sponsor.
Third, once we had that data, we tried to pull everything together into a monolithic pipeline, which tied together scraping, which is collection from the source sites; ingestion, which is pulling that data down into our own database; extracting what we think is [the key information]; and then storing [that information] at rest. The pipeline connected all the steps together. It was pretty rigid and we built it up quickly.
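[Editor's sketch, not FiscalNote's code: the four stages named above (scraping, ingestion, extraction, storage) could be chained roughly as follows. The function names, the regex-based title extraction, and the in-memory stores are all assumptions made for illustration.]

```python
import re
from dataclasses import dataclass


@dataclass
class RawPage:
    url: str
    html: str


def scrape(url, fetch):
    """Collect a page from the source site (fetch is injected so this runs offline)."""
    return RawPage(url=url, html=fetch(url))


def ingest(page, raw_store):
    """Pull the raw page down into our own storage."""
    raw_store[page.url] = page.html
    return page


def extract(page):
    """Pull out the key information; here, only a title from the first <h1>."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", page.html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else None
    return {"url": page.url, "title": title}


def store(record, database):
    """Keep the extracted record at rest."""
    database.append(record)


def run_pipeline(urls, fetch, raw_store, database):
    # Every stage is hard-wired to the next, which is what makes the setup rigid:
    # a change in any one source's markup can break the whole chain for that source.
    for url in urls:
        store(extract(ingest(scrape(url, fetch), raw_store)), database)
```

[Passing `fetch` in as a parameter is a choice made only so the sketch can be run against canned pages; a real collector would make network requests.]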
Once we had done this we could serve the state legislation [on the FiscalNote platform]. But there was obviously brittleness in this setup, as well. As soon as any state website changes, and really when any of the hundreds of URLs change, the scrapers might break.
[Whenever the scrapers broke] we had to go update them. So we had to introduce all these monitoring and logging metrics and mechanisms so that our engineers and our product managers could immediately tell when a [government] website had some changes.
Usually what happened was this: on the first day of a new [legislative] session, a state government would release a new site. And then, suddenly, everything would break and we would have to rebuild it.
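[Editor's sketch: one simple way to notice that a source site has been redesigned is to fingerprint the page's tag structure and compare it with the last known fingerprint. This is an assumed technique for illustration, not a description of FiscalNote's monitoring; the function names are hypothetical.]

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records the sequence of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structure_fingerprint(html):
    """Hash the ordered list of tag names so that text edits do not change the result."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return hashlib.sha256("/".join(collector.tags).encode("utf-8")).hexdigest()


def layout_changed(old_html, new_html):
    """True when the tag structure differs, which suggests a scraper may need attention."""
    return structure_fingerprint(old_html) != structure_fingerprint(new_html)
```

[A new bill title leaves the fingerprint unchanged, while a redesigned page layout alters it, so an alert fires only in the second case.]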
It was quite painful to maintain this monolithic, rigid [pipeline setup], even with all the monitoring and logging. We hadn’t yet gotten to the point of learning from that and building out something that was more robust, more modular.
Before we talk about that next generation of the platform, I’ve got a couple questions about this early part.
Tell us a little bit about the digital structure that you must have built for that monolithic pipeline to fit everything into. How do you explain that nobody else had tried this up until you did it successfully?
By digital structure, do you mean the schema into which we are placing this data?
Right.
There’s all this [legislative and regulatory] data, mostly unstructured. It’s semi-structured, really, because there are titles or headings or columns in some of the material that we’re extracting. There are semantically-relevant things like bill sponsors and introduction dates. We are trying to parse that out because it’s not given to us in a CSV or other tabular format.
Looking across even just the 50 state governments, those have hundreds of source URLs underneath, plus the federal government. All of them have somewhat related but somewhat unique … parliamentary processes. For example, consider their introductory action–how a bill goes through first reading, second reading, third reading. Or, for regulation, consider whether there’s a notice and comment period and what that process looks like.
For all these different state and federal processes, we had to figure out a way to accommodate them within one or more schemas, in a way that made it accessible to our users. We wanted to make things as easy as possible [and make legislative and regulatory information] accessible to non-technical consumers, whether that’s at a nonprofit, or a corporate office, or inside of the government itself, tracking and understanding what other governments are doing.
All these users need easy accessibility, in a more or less singular kind of framework: like you said, a schema allowing them to search for an update [simultaneously] across all 50 states to see what’s going on with [some specific kind of] policy.
From the beginning we tried to create a mandatory schema–a small set of fields that we expect are going to be present across any given data type, like legislation, and there’ll be a different set of core data types or fields within a different data type like regulation.
And then there are optional fields, which we don’t always expect to be present, so they’re not going to be a blocker for ingesting that data.
If one of the core [mandatory] fields is missing, we might decide not to even save that data, because we think it’s malformed in some way. If the core [mandatory fields are] there, we save it. Like, let’s say if it’s a bill, we need a title.
If that core is there, there might be optional fields, like status updates to a bill. When the bill was first introduced, there were no status updates yet. And status updates don’t even make sense in some locations in some different kinds of parliaments, globally. So, status updates might be an optional field.
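[Editor's sketch of the mandatory and optional field idea. Only "title" as a mandatory field for a bill and "status_updates" as an optional field come from the interview; the other field names and the `normalize` function are hypothetical.]

```python
# Field lists per data type. Apart from "title" and "status_updates",
# these names are illustrative assumptions.
MANDATORY_FIELDS = {"legislation": ("title", "jurisdiction")}
OPTIONAL_FIELDS = {"legislation": ("status_updates", "sponsors")}


def normalize(record, data_type):
    """Return a record shaped to the schema, or None if a mandatory field is missing."""
    mandatory = MANDATORY_FIELDS[data_type]
    optional = OPTIONAL_FIELDS[data_type]

    # A missing or empty mandatory field means the record is treated as malformed.
    if any(record.get(field) in (None, "") for field in mandatory):
        return None

    normalized = {field: record[field] for field in mandatory}
    # Optional fields never block ingestion; absent ones are stored as None.
    for field in optional:
        normalized[field] = record.get(field)
    return normalized
```

[A newly introduced bill with a title but no status updates is saved with `status_updates` set to `None`; a record with no title is rejected.]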
We started with this concept of mandatory and optional fields [for legislative and regulatory information], and then we built off workflow functionality on top of that: filters, alerts, other kinds of user-related actions, based off of that schema. We really thought hard and iterated on the question of what is mandatory and what is optional. How were we going to make our structure generic enough that when we ingest a new data set, we can actually minimize our time to platform for new data, because we don’t have to introduce new parts of the schema that we wouldn’t have encountered before?
I want to bring that forward to the present. I’m assuming that as FiscalNote developed, the pipeline got less monolithic, the schema got more flexible, and AI was introduced? But before that, it sounds like you’re about to answer my second question: how come nobody else did this?
There were already some legacy vendors that were working with legislative and regulatory information, but they did not figure out how to do what you did figure out how to do.
We are not the first to introduce scraping, or structuring legislative and regulatory data, by any means.
But maybe our innovation was trying to make it easy to connect the different [legislative and regulatory] data sources together in a way that was not done before. Oftentimes, people would scrape data and then put it up into a fairly non-intuitive workflow or interface, focused on the data source.
For example–here’s all the legislation from Montana, or from New York or Maryland.
But you couldn’t really do things across different data sets. Government affairs, public affairs, and our legal and regulatory clients think about issues – deforestation policies, COVID policies, retail policies [across all the jurisdictions].
We took the mindset of a common schema, like you and I just talked about, and that allowed us to connect these different sources together.
And like you just mentioned, we started to introduce AI and more core capabilities throughout the pipeline.
The first scrapers had a little bit of robustness – for example, looking for different combinations of regulatory actions.
But now, we have different kinds of classifiers that have been built to automatically identify whether we think this is a news page, a bill page, a regulation page. Is this related to something we scraped before? Has this changed? Do we need to rescrape it?
We have tools that, if there’s a broken link, go out and try several mechanisms automatically to find what the new link might be for that same source, without having a human intervene.
We’ve moved away, where we can, from crawling the structure itself to more smartly looking at a page and trying to identify, without having to specify xpaths, where we think critical information is. This means, if the xpath changes, its name changes, maybe its location in a tree changes–it doesn’t actually break our ingestion. It keeps going.
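[Editor's sketch: a much simpler stand-in for the approach described above. Instead of a fixed xpath, it scans the page's visible text for a label such as "Sponsor" wherever it appears. The interview does not say how FiscalNote locates fields; this label-matching heuristic and the function name are assumptions.]

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the non-empty text chunks of a page in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def find_labeled_value(html, label):
    """Find the value that follows a label, wherever the label sits in the tree."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()

    pattern = re.compile(rf"^{re.escape(label)}\b\s*:?\s*(.*)$", re.IGNORECASE)
    for index, chunk in enumerate(collector.chunks):
        match = pattern.match(chunk)
        if not match:
            continue
        if match.group(1):
            # Label and value share one text node, e.g. "Sponsor: Jane Doe".
            return match.group(1)
        if index + 1 < len(collector.chunks):
            # Label stands alone, so take the next text node as the value.
            return collector.chunks[index + 1]
    return None
```

[The same call works whether the sponsor sits in a table cell or a paragraph, so moving the element within the page does not break extraction in this sketch.]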
We’ve also introduced AI upstream from there: how do we refine the data that we get? How do we add to it? How do we enrich? How do we add metadata that wasn’t originally part of the source data?
This includes topics. We automatically identify if [legislative or regulatory content] is related to banking or finance, or indigenous issues, or healthcare, or agriculture, or any topic. This includes identifying individuals and organizations associated with topics: Who are the core organizations? Who are the people mentioned in regulatory comments or legislation? These entities will automatically now start to connect and support the ability for our users to see across different data.
For all these things that we derived using our machine learning and natural language processing, you can think of them as additional optional fields added to the data schema [discussed earlier]. These are derived fields, so they have some sort of score probability that we can threshold.
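[Editor's sketch of derived fields with a score threshold. The scores would come from a trained classifier; here they are passed in as a plain dictionary. The field name `derived_topics`, the function name, and the default threshold of 0.5 are assumptions.]

```python
def attach_derived_topics(record, topic_scores, threshold=0.5):
    """Return a copy of the record with topics whose score meets the threshold."""
    enriched = dict(record)  # leave the source record untouched
    kept = [(topic, score) for topic, score in topic_scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # highest-scoring topic first
    # Stored as an extra optional field alongside the fields from the source data.
    enriched["derived_topics"] = [{"topic": topic, "score": score} for topic, score in kept]
    return enriched
```

[Raising the threshold trades coverage for precision: fewer topics are attached, but each is more likely to be relevant.]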
As a result, we can analyze legislation and regulation, automatically, topically, into our two-thousand-topic node categorization. Initially, the topic node categorization was two hundred or so; our number of topics has grown over time. And so our ability to really discretely identify things that might be related to a niche topic has gone up, as we’ve collected more data.
All these combinations have allowed us to accelerate our time to product, which is something that we really care about. How long does it take to get a new data set into the product and time to value? What are the ways in which a new data set can be used immediately?
We want to reuse types of workflows–so how do we consistently combine ways [of interacting with data sets]? I think that’s our innovation.
So many companies have scraping technology and many companies look at policy, and many companies have workflow applications. But where we’ve tried to separate ourselves is having a combination of all of those three. Combined, that’s more value for the user, ultimately.
Those three categories being scraping, workflow, and what was the third one?
The refinement! The analysis of how we connect these together in ways that weren’t present at the source sites.
We’ve increased our modules [and connections between] the data over time. So it used to be–we only alerted our users to the existence of new legislation or regulation. Now we have additional capabilities around advocacy, more capabilities around the constituent interaction side, powering the offices of the actual legislators and regulators. So we’ve expanded it to cover more of the lifecycle of legislative and regulatory rulemaking.
And that brings us forward to the present! The platform combines capabilities that no other platform for public policy professionals does, and has allowed for global scaling, including across all jurisdictions and types of governmental bodies.
That’s very kind of you to say! A couple of different ways in which we’ve tried to expand include, like you said, the foundations of our data.
We started with just state legislation. Then we included state regulation, we included federal legislation, federal regulation, and now really, it’s expanded over the last couple years to thousands of local municipalities in the U.S. It’s expanded to global coverage, where we cover, in some ways, most European countries, South America, and Asia.
And we’ve also expanded our ability to deliver that data in different ways with different product lines that are meant for related but complementary roles within an organization.
We started with the core government affairs or government relations user who needs to know what’s going on in a state capitol or agency. And now we also have some of the legal and general counsel users who understand corporate risk and compliance. We have some of the financial side users who need to understand the impacts of policy through, for instance, ESG ratings. And we have our public sector and our government users, who span the gamut from the similar government affairs capabilities, to more involved intelligence for policy making capabilities.
You’ve been able to do all this while legislative and regulatory jurisdictions continue to use paper-based legacy methods of drafting, managing, and publishing legislative and regulatory materials.
What will it do for your platform and what will it do for the world if legislative and regulatory jurisdictions move to a Data First model of drafting, managing, and publishing such materials?
There has been a lot of talk, and there have been a lot of successful initiatives, about putting out machine-readable data as a first competency, especially in the federal government.
We’ve seen with congress.gov and some of the other data.gov initiatives that it’s become much easier to consistently access data we need.
It’s no longer a competitive advantage to be able to take microfiche and convert it to XML, and then be able to publish that – which was a competitive advantage some years ago. And I think that’s great. I think that’s better for everyone.
We all believe, internally, that there’s a rising tide of availability of data and that there’s some commoditization going on, even with transcripts. So for instance, we have a product that takes audio and video signals and transcribes them into text so that it can be treated like similar documents, for any congressional hearing, or White House press release, press conference, things like that. We don’t believe that that’s a competitive advantage for more than five years; we think that should be freely available [eventually]. The more access the public, our customers, and we ourselves have to that data, the better. That’s awesome.
It’s actually a little bit surprising that, where we are now in 2022, it’s still as difficult as it is, and still as cumbersome to actually get at the data that we want.
When a lot more data is available, not just machine readable, but actually processable and consistent in its format and schema, that will allow us to stop spending so much time monitoring and changing the ingestion side, to keep up with the different sources, and spend a lot more time on what we can actually deliver back to the client in terms of value, of understanding the impact, of understanding the way that different datasets relate to each other, and modeling the data in ways that we haven’t been able to do before.
[Right now] we just don’t have the time or the resources or even the access to some of the different kinds of fields that might be available.
If, for instance, the impact clauses [of regulation] were identified in numerical values, where they’re in terms of what we think the impact is going to be, already published out from agencies, that would immediately allow us to build more interesting economic models and forecast models that we could connect back to other kinds of statistics, from demographics to economics.
[Statistics and regulation represent] almost two different worlds. First, there’s the indicator datasets like those the World Bank and governments publish: their quantitative data. And then, there’s the policy, which is purely qualitative.
We’re trying to bridge that gap right now, which is one of the innovations of FiscalNote, by making the qualitative data a bit more quantitative, by structuring it. But if governments release more of that data both qualitatively and quantitatively, it creates a world of opportunity: for automating some of the regulatory processes, for making enforcement interactions much smoother, and for companies to understand the quantitative definition of a rule or a legislative action and then implement that internally in their policies.
Last question: what message would you give to those in legislative and regulatory drafting offices in government, maybe under-resourced, who are considering moving away from Microsoft Word, perhaps, and adopting Data First drafting? What do we say to them?
I would say: we understand and empathize with how difficult that process is, and how much change management is going to be involved in updating both the internal systems and the mindsets of people, and in becoming comfortable with the way that digital policymaking and Data First policymaking will happen.
I understand the frustrations as well as the anxiety about opening up that process more and potentially changing the way things are happening.
But once it happens, in retrospect it will seem like some of those frustrations and anxieties were probably overestimated.
Let’s do it. Let’s go. Dr. Vlad Eidelman, chief scientist at FiscalNote, thank you so much for spending some time with us.
No, thank you!