5 Frustrations Every Developer Has With SEC EDGAR

Thesma ·
edgar sec-filings developer-experience

SEC EDGAR is one of the best open datasets in the world. Every 10-K, 10-Q, 8-K, Form 4, 13F, and DEF 14A from every public US company, free, going back to the mid-90s. No licence, no paywall, not even a registration form. For a developer, that’s the good news.

The bad news is that EDGAR is also a thirty-year-old system built for compliance, not for developers, and it shows. If you’ve ever tried to build a product on top of it, you’ve probably hit at least three of the following five walls. Most of us have hit all five.

1. Rate limits are quieter than you think

The SEC’s published rule is simple: 10 requests per second, and you must send a User-Agent header that identifies you and your company. Miss either and you get a 403.

What the docs don’t emphasise is how little feedback you get when something goes wrong. There’s no X-RateLimit-Remaining header. There’s no backoff hint. No dashboard. The 403 response page is an HTML error — which is particularly fun when your parser was expecting iXBRL. One runaway parallel worker, one retry loop gone wrong, and your whole IP is blocked, with no clear signal as to which of the three possible mistakes caused it.

Most developers build a local throttle once, forget about it for six months, and then get bitten when they add a new crawler that shares the same IP. The limit is the limit across your whole infrastructure, not per script.
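A minimal throttle is only a few lines; the hard part is keeping it shared. A sketch of the single-process version, assuming Python (the User-Agent string is a placeholder you must replace with your real identity, and `edgar_get` is an illustrative helper, not a real library call):

```python
import time
import urllib.request


class Throttle:
    """Paces calls to at most `rate` per second. Single-process only:
    if several workers share an IP, they must share one budget too."""

    def __init__(self, rate: float = 10.0):
        self.min_interval = 1.0 / rate
        self._last = 0.0

    def wait(self) -> float:
        """Sleep just long enough to honour the rate; return seconds slept."""
        delay = max(0.0, self.min_interval - (time.monotonic() - self._last))
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay


throttle = Throttle(rate=10.0)


def edgar_get(url: str) -> bytes:
    """Fetch an EDGAR URL with the mandatory identifying User-Agent."""
    throttle.wait()
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "ExampleCo research contact@example.com"},  # placeholder
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The trap described above is exactly the `Throttle` instance living in one script while a second crawler on the same machine runs without it.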

None of this is hard. It’s just tedious, and it’s exactly the kind of undifferentiated work that you’d rather not be doing on day one of a new project.

2. XBRL is two problems pretending to be one

On paper, XBRL is beautiful: every number in a 10-K is tagged with a standard taxonomy concept, so cross-company comparison should be trivial.

In practice, XBRL is two problems stacked on top of each other:

Parsing it. iXBRL is embedded in HTML. Arelle — the open-source XBRL parser most vendors rely on — works, but it is slow, memory-hungry, and does not scale to millions of filings without real engineering. Everyone who has processed XBRL at scale has a war story about Arelle, garbage collection, and OOM kills at three in the morning.

Normalising it. Two companies can both report “revenue”, but one might use us-gaap:Revenues, another uses us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax, and a third invents a company-specific extension element that only exists in their own filings. Bank holding companies report interest income under a different tag altogether. The taxonomy version changes every year. “Total revenue” is not a solved problem — it is a question you answer company by company, year by year, by reading filings and building a mapping table.

If you just want the income statement for Apple last quarter, you can get there. If you want the same set of fields, normalised, across every public US company, every quarter, for the last decade — that is a multi-month engineering project, not an afternoon.

3. The identifier situation is a mess

A public company has:

  • A CIK — EDGAR’s internal ID
  • One or more tickers — because class A and class B shares trade separately
  • One or more CUSIPs — the settlement system ID, which, incidentally, is not free to redistribute
  • An LEI — the global legal entity identifier, if the company bothered to register
  • A SIC code — the SEC’s industry classification, which the Census Bureau replaced with NAICS back in 1997

EDGAR filings are keyed by CIK. Your users search by ticker. Your data vendor probably sends you CUSIPs. The financial press uses company names, which change over time, get acquired, and are often ambiguous (there are at least five “Alphabet”s on EDGAR). Mapping between these is table-stakes for any SEC-based product, and there is no canonical, free, up-to-date mapping maintained by anyone.

The closest thing is the SEC’s own company_tickers.json, which is incomplete, occasionally stale, and cheerfully ignores share classes. You end up building your own mapping. Then you end up maintaining it. Forever.
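The first version of that mapping is short enough. A sketch over company_tickers.json, assuming the object-keyed-by-row-index shape the file has today, with the CIK zero-padded to the ten digits EDGAR URLs expect:

```python
import json


def build_ticker_map(raw: str) -> dict[str, str]:
    """Build a ticker -> zero-padded CIK map from company_tickers.json.

    The file is an object keyed by row index:
      {"0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}, ...}
    """
    rows = json.loads(raw)
    return {
        row["ticker"].upper(): str(row["cik_str"]).zfill(10)  # EDGAR wants 10 digits
        for row in rows.values()
    }
```

Share classes, delistings, and renames are exactly what this first cut does not handle, which is where the forever-maintenance begins.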

4. Filing structure is aspirational, not enforced

A 10-K follows a fixed item structure. In theory, Item 1 is Business, Item 7 is MD&A, Item 8 is Financial Statements. In practice, those items show up in filings looking like any of the following:

  • Item 1. Business
  • PART I — ITEM 1 — BUSINESS
  • Nested inside a larger table of contents with no Item label at all
  • Just the word Business in bold, with no number
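Matching the labelled variants means writing a deliberately tolerant pattern. A sketch for Item 1 only, as one hypothetical regex (the unnumbered bold-“Business” case needs formatting cues a regex alone cannot see, and real parsers also have to skip the table of contents, where the same headings appear again):

```python
import re

# Tolerant pattern for Item 1 headings: optional "PART I" prefix,
# flexible punctuation and dashes between the pieces.
ITEM1_RE = re.compile(
    r"(?:part\s+i\b[\s.:\u2014\u2013-]*)?"  # optional "PART I" prefix
    r"item\s*1\s*[.:\u2014\u2013-]*\s*"     # "Item 1." / "ITEM 1 —" / "ITEM 1:"
    r"business\b",
    re.IGNORECASE,
)
```

Every new filer with a creative formatting department adds another clause to patterns like this.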

And that is before you get to incorporation by reference, a legal shortcut where a 10-K says “see our DEF 14A for executive compensation” and simply does not include the data in the 10-K. You have to know to go hunt for it in a different filing, matching on CIK and fiscal year.

Then there are exhibits. Sometimes they are in the main document. Sometimes they are in separate files linked from the filing index. Sometimes they are external URLs that 404 six months after the filing is submitted. Some filings are plain text, some are HTML, some are iXBRL, and some are bizarre hybrids of all three where the same number appears three times in three different formats.

Writing a parser that works on “most” 10-Ks is a weekend. Writing a parser that works on all 10-Ks, including shell companies, foreign filers, amendments, and the cheerfully non-compliant edge cases, is a career.

5. 8-K is unstructured on purpose

8-Ks announce material corporate events — earnings, acquisitions, executive departures, bankruptcies, shareholder votes. Each 8-K has one or more “items” that tell you the type of event: Item 2.02 for earnings, Item 5.02 for officer changes, Item 8.01 for “other events”. That is where the structure ends.

The actual content of the item is free-form prose. There is no structured field for “name of departing executive”. There is no structured field for “amount of the acquisition”. If you want to know that the CEO of a company just resigned, you are doing natural language processing on press-release English, written by a lawyer, in a format that was last updated when Internet Explorer 6 was new.
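Classifying the event type from the item codes is the easy half. A sketch of that half (the code-to-label table is a small illustrative subset; the full item list lives in the SEC’s Form 8-K instructions):

```python
import re

# Map a few common 8-K item codes to coarse event labels.
ITEM_LABELS = {
    "2.02": "earnings release",
    "5.02": "officer/director change",
    "8.01": "other events",
}

ITEM_RE = re.compile(r"item\s+(\d+\.\d{2})", re.IGNORECASE)


def classify_8k(text: str) -> list[str]:
    """Return labels for every recognised item code in an 8-K, in order."""
    codes = ITEM_RE.findall(text)
    return [ITEM_LABELS.get(c, f"unknown ({c})") for c in codes]
```

Everything past the item code (who resigned, what the deal was worth) still lives in the prose, which is the hard half.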

This is why real-time corporate event monitoring is hard even though the underlying data is public and available within minutes of the event. The raw signal is there. Turning it into “Company X’s CFO resigned” is the work.

So what do you actually do?

If you only need data for a handful of companies and you don’t mind the engineering, raw EDGAR is a fine choice. It’s free, it’s the source of truth, and the SEC is never going to deprecate it. Respect the rate limit, cache aggressively, and don’t expect consistency.

If you are building a product on SEC filings — especially one that needs to work across the whole market, across time, without a dedicated data engineering team — the economics stop making sense pretty quickly. Every one of the five problems above is solvable, but solving them is a significant amount of work that has nothing to do with your actual product. If you want to see what “doing the work” actually looks like vs raw EDGAR, see Get 10-K Financial Data in 3 Lines of Code.

This is the gap Thesma exists to fill. We normalise XBRL into a clean set of financial fields across every public US company. We maintain the identifier mapping so you can look up by ticker, CIK, or name. We parse 8-K items into structured events. We handle the rate limits and the edge cases and the exhibits, so you can get Apple’s latest annual income statement in a single HTTP call:

curl https://api.thesma.dev/v1/us/sec/companies/0000320193/financials \
  -H "X-API-Key: $THESMA_API_KEY" \
  -G --data-urlencode "statement=income" \
     --data-urlencode "period=annual"

The free tier includes every dataset we offer, with no credit card. If any of the frustrations above sound familiar, grab a key and try it or browse the API documentation — the first call takes about thirty seconds from sign-up.

And if you’d rather stay on raw EDGAR, we respect that too. Just be kind to the rate limiter.