Get 10-K Financial Data in 3 Lines of Code

Thesma

Let’s try something that ought to be easy.

Pull the most recent annual income statement for Apple — revenue, cost of revenue, operating income, net income, EPS. The data is public. It’s on sec.gov. There is no login. How hard should this be?

If you’re using Thesma, here it is in three lines of Python (KEY holds your API key):

import requests
r = requests.get("https://api.thesma.dev/v1/us/sec/companies/0000320193/financials",
    headers={"X-API-Key": KEY},
    params={"statement": "income", "period": "annual"})
print(r.json())

(0000320193 is Apple’s CIK, the SEC’s internal company identifier. If you don’t know a company’s CIK off the top of your head, and almost nobody does, one call to the companies endpoint turns a ticker into a CIK: GET /v1/us/sec/companies?ticker=AAPL. Hold that thought: doing the same lookup against raw EDGAR is the first stop on the tour below.)

The response is a clean JSON object with every line item on the income statement, normalised into a consistent shape that works for any company, any quarter, any year. Need the balance sheet instead? Change statement to balance-sheet. Cash flow? cash-flow. Quarterly instead of annual? period=quarterly. Same three-line pattern, same clean response.

If you were going directly against SEC EDGAR, you would be looking at something rather different. Let’s take the scenic route and see what is actually involved — then come back to the three-line version.

The raw EDGAR tour

Step 1: turn “AAPL” into a CIK

EDGAR does not know what a ticker is. Internally, every filer has a Central Index Key — a zero-padded 10-digit number. Apple’s is 0000320193. To find it, you fetch SEC’s own ticker-to-CIK JSON map at https://www.sec.gov/files/company_tickers.json.

This file lists around fifteen thousand companies in a slightly strange inverted format (the keys are row numbers, the values are the records). You parse it, match on ticker == "AAPL", pull cik_str, and left-pad it to ten digits. Don’t forget the User-Agent header identifying your project: miss it and you get a 403 with no explanation.
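
The lookup itself is small once the file is in hand. Here is a minimal sketch that works on a two-row sample in the same inverted shape; the helper name ticker_to_cik is ours, and fetching the real file (with a User-Agent header) is left out so the example runs offline.

```python
import json

# Tiny sample in the same inverted shape as company_tickers.json:
# the keys are row numbers, the values are the records.
SAMPLE = json.loads("""
{
  "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
  "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"}
}
""")


def ticker_to_cik(ticker_map: dict, ticker: str) -> str:
    """Return the zero-padded 10-digit CIK for a ticker, or raise KeyError."""
    wanted = ticker.upper()
    for record in ticker_map.values():
        if record["ticker"].upper() == wanted:
            # cik_str may arrive as an int or a string, so normalise first.
            return str(record["cik_str"]).zfill(10)
    raise KeyError(f"ticker not found: {ticker}")


print(ticker_to_cik(SAMPLE, "aapl"))  # 0000320193
```

A linear scan is fine for a one-off lookup. If you resolve many tickers, build a dict keyed by ticker once and reuse it.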

Step 2: find Apple’s most recent 10-K

Now you hit the submissions endpoint at https://data.sec.gov/submissions/CIK0000320193.json. This returns Apple’s full filing history. The filings.recent object stores it as parallel arrays (form, accessionNumber, filingDate and so on), so you scan the form array for the most recent 10-K, then read the accession number and filing date at the same index.
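
The scan can be sketched like this. The submissions JSON keeps recent filings as parallel arrays under filings.recent, so the code zips them back into rows. The sample data is made up for illustration (only the layout matters), and latest_10k is a hypothetical helper name.

```python
# Illustrative stand-in for the submissions JSON: the accession numbers and
# dates below are made up, only the parallel-array layout matters.
SUBMISSIONS = {
    "filings": {
        "recent": {
            "form": ["8-K", "10-K", "10-Q", "10-K"],
            "accessionNumber": [
                "0000320193-25-000001",
                "0000320193-24-000123",
                "0000320193-24-000081",
                "0000320193-23-000106",
            ],
            "filingDate": ["2025-01-30", "2024-11-01", "2024-08-02", "2023-11-03"],
        }
    }
}


def latest_10k(submissions: dict) -> dict:
    """Return the most recently filed 10-K from the parallel arrays."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["form"], recent["accessionNumber"], recent["filingDate"])
    matches = [
        {"form": form, "accession": accession, "filed": filed}
        for form, accession, filed in rows
        if form == "10-K"  # exact match, so 10-K/A amendments are excluded
    ]
    if not matches:
        raise LookupError("no 10-K among the recent filings")
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return max(matches, key=lambda row: row["filed"])


print(latest_10k(SUBMISSIONS))
```

Picking by filing date rather than trusting array order is a deliberate choice here: it costs nothing and removes one assumption about how the arrays are sorted.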

The accession number looks like 0000320193-24-000123. You will need to strip the dashes to build the URL path. You will also need the un-stripped version to build certain filenames. Both formats are required, in different places. This is fine.
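
To make the two formats concrete, here is a small sketch that builds both from one accession number. The URL layout is our understanding of how the EDGAR archive is organised rather than something this post guarantees, and filing_paths is a hypothetical helper, so check the output against a real filing before relying on it.

```python
def filing_paths(cik: str, accession: str) -> dict:
    """Build the archive directory and index URLs for one filing.

    Assumed layout: the directory uses the CIK without zero padding and the
    accession number without dashes, while the index filename keeps the dashes.
    """
    cik_unpadded = str(int(cik))            # "0000320193" -> "320193"
    stripped = accession.replace("-", "")   # dashes removed for the path
    base = f"https://www.sec.gov/Archives/edgar/data/{cik_unpadded}/{stripped}"
    return {
        "directory": f"{base}/",
        "index": f"{base}/{accession}-index.htm",  # dashed form in the filename
    }


paths = filing_paths("0000320193", "0000320193-24-000123")
print(paths["directory"])
print(paths["index"])
```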

Step 3: download the filing documents

A 10-K is not a single file. It is a directory of documents. The financial data lives in an iXBRL document buried among dozens of exhibits, and its filename is not standardised across companies. You parse the filing index to find it, or you hit another endpoint that gives you the “Financial Report” URL, and then you download the document.

Step 4: parse the iXBRL

You now have a large HTML file with embedded XBRL facts. You need an XBRL parser. The canonical open-source choice is Arelle, which works but is slow and memory-hungry — processing a single 10-K can take several seconds and requires the full US-GAAP taxonomy to be loaded into memory first. You get back a list of facts, each one tagged with a concept from the taxonomy.
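
To see what “embedded XBRL facts” means in practice, here is a deliberately naive sketch that uses Python’s standard-library HTML parser instead of Arelle. It only collects numeric facts (ix:nonFraction elements) and applies their scale and sign attributes. It ignores contexts, units, non-numeric facts and taxonomy validation, so treat it as an illustration of the file format, not a replacement for a real XBRL processor. The sample markup and its values are illustrative.

```python
from html.parser import HTMLParser


class FactCollector(HTMLParser):
    """Collect numeric iXBRL facts (ix:nonFraction elements) from HTML."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self._attrs = None   # attributes of the fact currently being read
        self._chunks = []    # text seen inside that fact so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "ix:nonfraction":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "ix:nonfraction" or self._attrs is None:
            return
        attrs, self._attrs = self._attrs, None
        text = "".join(self._chunks).replace(",", "").strip()
        try:
            # scale="6" means the displayed number is in millions.
            value = float(text) * 10 ** int(attrs.get("scale") or 0)
        except ValueError:
            return  # dashes, blanks and other non-numeric displays are skipped
        if attrs.get("sign") == "-":
            value = -value
        self.facts.append({
            "concept": attrs.get("name"),
            "context": attrs.get("contextref"),
            "value": value,
        })


SAMPLE_HTML = """
<table>
  <tr><td>Net income</td>
      <td><ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="c-1"
          unitRef="usd" scale="6" decimals="-6">112,010</ix:nonFraction></td></tr>
  <tr><td>Other expense</td>
      <td>(<ix:nonFraction name="us-gaap:OtherNonoperatingIncomeExpense"
          contextRef="c-1" unitRef="usd" scale="6" decimals="-6"
          sign="-">321</ix:nonFraction>)</td></tr>
</table>
"""

collector = FactCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for fact in collector.facts:
    print(fact["concept"], fact["value"])
```

Even this toy version has to know about thousands separators, scaling and sign conventions, which hints at why a full parser is heavy.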

Step 5: normalise the taxonomy

This is where it really starts to hurt. You want “total revenue”. Apple reports it under us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax. Microsoft uses us-gaap:Revenues. A bank holding company reports no “revenue” element at all — its top line is us-gaap:InterestAndDividendIncomeOperating plus non-interest income. A SaaS startup might invent a company-specific extension element that only appears in their own filings. The taxonomy version changes every year, so the tag that worked in 2023 may have been superseded in 2024.

To get a consistent “total revenue” field across every public company, you maintain a mapping table: preferred tag, ordered fallbacks, company-specific overrides, special cases for banks and insurers. You do this for every field you care about. Net income has five common tags. EPS has three. Cash flow from operations behaves differently depending on whether the filer uses the direct or indirect method.
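
A mapping table of that kind might look roughly like the sketch below. The two revenue tags come from the examples above. The net income tags, the override entry and the CIK it hangs off are assumptions chosen for illustration (the "acme:" element and CIK 0009999999 are invented), and a production table would be far longer.

```python
# Ordered from preferred tag to fallbacks, one list per normalised field.
FIELD_TAGS = {
    "total_revenue": [
        "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
        "us-gaap:Revenues",
    ],
    "net_income": ["us-gaap:NetIncomeLoss", "us-gaap:ProfitLoss"],
}

# Company-specific overrides are tried before the defaults.
# Hypothetical entry: the CIK and the extension element are invented.
COMPANY_OVERRIDES = {
    "0009999999": {"total_revenue": ["acme:SubscriptionAndServicesRevenue"]},
}


def resolve_field(field: str, facts: dict, cik: str):
    """Return (tag, value) for the first candidate tag present, else None."""
    candidates = COMPANY_OVERRIDES.get(cik, {}).get(field, []) + FIELD_TAGS[field]
    for tag in candidates:
        if tag in facts:
            return tag, facts[tag]
    return None


# A filer that only reports the generic revenue tag falls through to it.
facts = {"us-gaap:Revenues": 1_000, "us-gaap:NetIncomeLoss": 120}
print(resolve_field("total_revenue", facts, "0000000001"))
print(resolve_field("net_income", facts, "0000000001"))
```

Returning the matched tag alongside the value is worth the extra tuple: when a number looks wrong, the first question is always which tag it came from.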

Step 6: handle the edge cases

  • A small percentage of 10-Ks incorporate financial statements by reference to an exhibit. You have to fetch the exhibit.
  • Shell companies report zeros for most line items. Distinguishing “shell” from “bug in my parser” is its own problem.
  • Amendments (10-K/A) can supersede the original filing. Do you want the original or the amended version?
  • Foreign private issuers file 20-F instead of 10-K, often in IFRS rather than US-GAAP — same problem, different taxonomy altogether.
  • Banks, insurance companies, and REITs use specialised industry taxonomies that do not overlap cleanly with the base US-GAAP concepts.

By the time you have handled all of this for just Apple, you have written a few hundred lines of code. Extending it to every public US company, every quarter, every year, without regressions — that is the job we have done so you do not have to.

Back to the three lines

Here it is again, in full:

import requests
r = requests.get("https://api.thesma.dev/v1/us/sec/companies/0000320193/financials",
    headers={"X-API-Key": KEY},
    params={"statement": "income", "period": "annual"})
print(r.json())

And a slice of what comes back (real FY2025 values, shape simplified for illustration):

{
  "cik": "0000320193",
  "ticker": "AAPL",
  "statement": "income",
  "period": "annual",
  "fiscal_year": 2025,
  "period_end": "2025-09-27",
  "filing_type": "10-K",
  "currency": "USD",
  "data": {
    "total_revenue": 416200000000,
    "cost_of_revenue": 221000000000,
    "gross_profit": 195200000000,
    "operating_income": 133100000000,
    "pre_tax_income": 132700000000,
    "net_income": 112000000000,
    "eps_basic": 7.49,
    "eps_diluted": 7.46
  }
}

Same shape for every filing, every company. The XBRL mapping has been done. The exhibits have been followed. The bank and insurance taxonomies are handled separately. The edge cases are caught.
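
Because the shape is consistent, downstream code can stay short. As a sketch, here is how you might derive margins from the response slice shown above; the dict below copies only the fields it needs from that simplified sample, and margins is a hypothetical helper of ours.

```python
# Fields copied from the simplified sample response above.
response = {
    "ticker": "AAPL",
    "fiscal_year": 2025,
    "data": {
        "total_revenue": 416200000000,
        "cost_of_revenue": 221000000000,
        "gross_profit": 195200000000,
        "operating_income": 133100000000,
        "net_income": 112000000000,
    },
}


def margins(payload: dict) -> dict:
    """Express gross, operating and net income as a share of revenue."""
    data = payload["data"]
    revenue = data["total_revenue"]
    return {
        "gross": data["gross_profit"] / revenue,
        "operating": data["operating_income"] / revenue,
        "net": data["net_income"] / revenue,
    }


for name, ratio in margins(response).items():
    print(f"{name}: {ratio:.1%}")
# gross: 46.9%
# operating: 32.0%
# net: 26.9%
```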

Want the balance sheet or cash flow instead? Change statement to balance-sheet or cash-flow. Want quarterly data? Set period=quarterly and pass a quarter parameter. Want it for the entire S&P 500? Loop the request over the tier=sp500 company list; the free-tier rate limit is enough to cover the whole index in a few minutes.
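
The loop might be sketched as follows. It takes a list of CIKs as input because the shape of the company-list response is not shown in this post, and the half-second pause is a placeholder rather than the actual rate limit. The HTTP call uses the standard library so the block has no dependencies; requests works just as well. The names make_getter and fetch_income_statements are ours.

```python
import json
import time
import urllib.parse
import urllib.request

BASE = "https://api.thesma.dev/v1/us/sec/companies"


def make_getter(api_key: str):
    """Return a get_json(url, params) function that sends the API key header."""
    def get_json(url: str, params: dict) -> dict:
        request = urllib.request.Request(
            f"{url}?{urllib.parse.urlencode(params)}",
            headers={"X-API-Key": api_key},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    return get_json


def fetch_income_statements(ciks, get_json, pause_seconds=0.5):
    """Fetch the annual income statement for each CIK, pausing between calls."""
    params = {"statement": "income", "period": "annual"}
    results = {}
    for position, cik in enumerate(ciks):
        if position:
            time.sleep(pause_seconds)  # placeholder pacing, not a documented limit
        results[cik] = get_json(f"{BASE}/{cik}/financials", params)
    return results
```

With a real key you would call fetch_income_statements(ciks, make_getter(KEY)). Passing the getter in as an argument also makes the loop easy to test with a stub.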

When raw EDGAR still makes sense

If you only need data for a handful of companies and you are happy to do the taxonomy work once, raw EDGAR is free and it is the source of truth. The SEC is never going to deprecate it. For a weekend project, it is a completely reasonable choice — you will learn a lot about XBRL along the way, and that knowledge does not go to waste.

If you are building something that needs to work across the market, across time, without a taxonomy-mapping side project of its own, that is why Thesma exists. The free tier includes every dataset we offer, with no credit card, and you can grab a key and make your first call in under a minute. The full endpoint reference lives in the API documentation.

See you in the next post.