Contextual Content Discovery: You've forgotten about the API endpoints
Contents
- Overview
- What’s wrong with content discovery?
- Content discovery tools over the years
- The lightbulb moment
- Data collection
- Finding APIs worth bruteforcing
- Preliminary results
- How do I use the tool?
- Conclusion
- Assetnote
Overview
Presented at BSides Canberra 2021, slides available: PDF & Keynote (with videos).
As a team, we’re passionate about content discovery, as it has historically led to the discovery of vulnerabilities. These days, content discovery typically means running a tool like ffuf with a large wordlist. Over the last ten years, we have seen content discovery tools iterate on many important features (such as filtering and recursion), but the greater focus has been on making these tools faster rather than on innovating in the field of content discovery.
Over time, we have also seen a great shift in application development where APIs have become the backbone of server-side functionality. With single page applications taking off and technologies such as Express, Rails, Flask and other API-centric frameworks becoming a centerpiece of web applications, we believed that content discovery tooling also needed to evolve to account for this.
Modern API frameworks may require the correct HTTP method, headers, parameters and values in order to produce a valid server-side response (a non-404). Our tooling sends requests with that context (HTTP methods, headers, parameters and values) by leveraging a large dataset composed of OpenAPI/Swagger specifications collected from across the internet.
When targeting hosts running APIs, this has proven to be an extremely effective method in finding endpoints that typical content discovery tools are not capable of.
Swagger files were collected from a number of data sources, including an internet-wide scan for the 20+ most common Swagger paths. Other data sources included GitHub via BigQuery, APIs.guru and SwaggerHub.
In order to effectively discover content on API-based application frameworks, we developed a tool called Kiterunner and accompanying datasets: routes-large.kite.tar.gz and routes-small.kite.tar.gz.
If you’re interested in the raw OpenAPI/Swagger files we collected by scanning the internet and from the other data sources, you can download them from here. Additionally, if you’re interested in using traditional content discovery tools with our dataset, download swagger-wordlist.txt.
If you would like to learn how we tackled this problem, read on as we explain its nuances.
What’s wrong with content discovery?
Content discovery tooling currently relies on static txt files as wordlists, and it is up to the user to bruteforce with different HTTP methods or to pre-fill wordlists with parameters and values.
All current tooling operates this way; however, long gone are the days when we were only bruteforcing for static files and folders (legacy web applications). We live in a world where API endpoints account for a large attack surface, and this attack surface can go unnoticed.
We’ve seen a huge shift in how web applications are being built over the last 10 years. Let’s take a look at some examples of API implementations in different frameworks:
Flask
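Consider a minimal Flask application along these lines (an illustrative sketch, not taken from any particular codebase):

```python
from flask import Flask, abort, request

app = Flask(__name__)

# A tiny in-memory "database" of notes, keyed by an integer ID.
notes = {1: "buy milk", 2: "rotate API keys"}

@app.route("/api/v1/notes/<int:key>", methods=["GET", "PUT", "DELETE"])
def note(key):
    # If the integer key is not present in the notes dictionary,
    # a not-found exception is raised and the client receives a 404.
    if key not in notes:
        abort(404)
    if request.method == "GET":
        return {"id": key, "note": notes[key]}
    if request.method == "PUT":
        notes[key] = request.get_json(force=True).get("note", "")
        return {"id": key, "note": notes[key]}
    del notes[key]  # DELETE
    return {"deleted": key}
```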
In this example, we can see that there is an API endpoint which requires a path such as `/api/v1/notes/1` in order to reach the server-side functionality.
We can also note that this endpoint accepts GET, PUT and DELETE requests, and that if the integer `key` is not present in the notes dictionary, a not-found exception is raised.
Unless your wordlist contains `/api/v1/notes/<integer>`, you will miss this API endpoint. Furthermore, if this endpoint only accepted PUT and DELETE requests, you would miss it unless you bruteforced content with those specific HTTP methods.
With Kiterunner, we leveraged our OpenAPI/Swagger datasets to automatically fill out the correct values for API endpoints, whether they are UUIDs, integers or strings. This gives us a higher chance of reaching API endpoints such as these.
Rails
Similarly, consider a Rails application with nested todo and item resources: a number of API endpoints are only reachable by providing the correct `todo_id` and, in some cases, the item `id`.
These endpoints accept POST, PUT and DELETE requests, but only when the correct `todo_id` and item `id` are provided.
Unless your content discovery tool was configured to send POST/PUT/DELETE requests and your wordlist contained `/todos/1/items` or `/todos/1/items/1`, it’s very likely that you would miss these API endpoints.
Express
Consider an Express route that requires valid values for the parameters `id`, `date` and `size`: most content discovery tools will miss such an endpoint unless a path in the wordlist happens to contain valid values for all three.
As mentioned earlier, we replace placeholder values in Kiterunner to the best of our ability, based on our OpenAPI/Swagger dataset.
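As an illustration of that idea (a simplified sketch, not Kiterunner’s actual implementation), filling path placeholders from the parameter types declared in a specification might look like this:

```python
import uuid

# Canned values per declared parameter type; real specifications often also
# carry example or enum values, which are preferable when present.
CANNED = {
    "integer": "1",
    "number": "1",
    "boolean": "true",
    "string": "x",
    "uuid": str(uuid.uuid4()),
}

def fill_path(template, params):
    """Replace {name} placeholders in an OpenAPI path template with
    concrete values derived from each path parameter's declared type."""
    path = template
    for p in params:
        if p.get("in") != "path":
            continue
        key = "uuid" if p.get("format") == "uuid" else p.get("type", "string")
        path = path.replace("{%s}" % p["name"], CANNED.get(key, "x"))
    return path

print(fill_path("/api/v1/notes/{key}",
                [{"name": "key", "in": "path", "type": "integer"}]))
# -> /api/v1/notes/1
```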
Content discovery tools over the years
As hackers, we’ve invested a lot of time and effort into improving content discovery tooling over the years, and there have been many notable projects along the way.
All of these tools have done an excellent job of iterating on the concept of content discovery, whether that be speed improvements or usability features.
We pay our respects to all of the tooling built over the years and hope that our work in this area pushes the field of content discovery further.
The lightbulb moment
The primary motivation for us at Assetnote in building Kiterunner and furthering the field of content discovery was our initial idea: leveraging OpenAPI/Swagger specifications for content discovery.
API specifications give us all the information we need about an API: the correct HTTP methods, headers, parameters and values. For many APIs built on modern frameworks, this context is essential to receiving a non-404 response.
Our hypothesis was that with enough API specifications collected from the internet, we would be able to compile a dataset that is effective in contextual content discovery. Using this large dataset of OpenAPI/Swagger specifications, we could build a tool that is capable of making requests with the correct context.
We made it our mission to collect as many API specifications as we could from the internet. We created a tool named kitebuilder which converted OpenAPI/Swagger specifications into our own intermediary JSON specification.
The intermediary data can be found below if you would like to use it for other projects that require this data:
- routes-large.json.tar.gz (118MB compressed, 2.6GB decompressed)
- routes-small.json.tar.gz (14MB compressed, 228MB decompressed)
From this intermediary format, we were then able to generate a dataset that could be used by our tool Kiterunner.
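As a rough illustration of the kind of flattening kitebuilder performs (the real intermediary JSON schema carries more detail), a Swagger 2.0 document can be reduced to per-route records like this:

```python
import json

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}

def flatten_spec(spec_path):
    """Walk a Swagger 2.0 document and emit one record per (method, path)
    pair, keeping the declared parameters for later substitution."""
    with open(spec_path) as fh:
        spec = json.load(fh)

    base = spec.get("basePath", "").rstrip("/")
    routes = []
    for path, operations in spec.get("paths", {}).items():
        for method, op in operations.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            routes.append({
                "method": method.upper(),
                "path": base + path,
                "parameters": op.get("parameters", []),
            })
    return routes
```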
Data collection
BigQuery
BigQuery is magical.
In order to build our dataset, we leveraged BigQuery’s public datasets. On BigQuery, you can find GitHub’s public dataset, which is updated on a weekly basis. This dataset has terabytes of data, and was a perfect gold mine for us to extract Swagger files from.
This dataset can be accessed through the following URL: https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=github_repos&page=dataset.
First, we selected all of the files whose path ends with `swagger.json`, `openapi.json` or `api-docs.json` and saved the results to a new BigQuery table.
We then obtained the contents of all of these files with a second query against the contents table.
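Both queries were along these lines (table and column names as per the public `github_repos` dataset; the destination table is a placeholder):

```python
import os

from google.cloud import bigquery

client = bigquery.Client()

# Step 1: select every file whose path looks like an OpenAPI/Swagger document
# and materialise the matches into a scratch table for the next step.
files_sql = """
    SELECT repo_name, path, id
    FROM `bigquery-public-data.github_repos.files`
    WHERE path LIKE '%swagger.json'
       OR path LIKE '%openapi.json'
       OR path LIKE '%api-docs.json'
"""
job_config = bigquery.QueryJobConfig(
    destination="your-project.swagger.candidate_files",  # placeholder table
    write_disposition="WRITE_TRUNCATE",
)
client.query(files_sql, job_config=job_config).result()

# Step 2: join the candidates against the contents table to pull the raw
# file bodies, then write each one to disk.
contents_sql = """
    SELECT f.repo_name, f.path, c.content
    FROM `your-project.swagger.candidate_files` AS f
    JOIN `bigquery-public-data.github_repos.contents` AS c
      ON f.id = c.id
    WHERE c.binary = FALSE
"""
os.makedirs("specs/github", exist_ok=True)
for i, row in enumerate(client.query(contents_sql).result()):
    with open(f"specs/github/{i}.json", "w") as fh:
        fh.write(row.content or "")
```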
🎉 We now have ~11,000 Swagger files 🎉
APIs.Guru
APIs.Guru is an OpenAPI directory which is essentially a machine-readable Wikipedia for REST APIs. The project collects Swagger specifications and makes them available in a normalized format. It’s an open source project, and all of the data they have collected can be accessed through a REST API.
We were able to obtain the full list of APIs from the APIs.Guru listing endpoint and then download every referenced Swagger file to disk.
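In Python, that workflow looks roughly like this (it assumes the APIs.Guru v2 `list.json` format, where each API version entry exposes a `swaggerUrl`):

```python
import os
import re

import requests

# APIs.Guru publishes its whole directory as a single JSON listing.
LISTING_URL = "https://api.apis.guru/v2/list.json"

os.makedirs("specs/apisguru", exist_ok=True)
listing = requests.get(LISTING_URL, timeout=30).json()

for name, api in listing.items():
    # Each API has one or more versions; take the preferred one.
    version = api["versions"][api["preferred"]]
    swagger_url = version.get("swaggerUrl")
    if not swagger_url:
        continue
    resp = requests.get(swagger_url, timeout=30)
    if not resp.ok:
        continue
    out = os.path.join("specs/apisguru", re.sub(r"[^\w.-]", "_", name) + ".json")
    with open(out, "w") as fh:
        fh.write(resp.text)
```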
This was an effective way of downloading all of the Swagger files indexed by APIs.Guru.
🎉 We now have ~14,000 Swagger files 🎉
SwaggerHub
SwaggerHub is an API design and documentation platform built for teams to drive consistency and discipline across their API development workflow.
By browsing SwaggerHub, you can see that they have a total of 434,495 Swagger specifications.
Unfortunately, their API limited us, and we were only able to download around 10,000 specifications.
You can find the code we wrote to scrape SwaggerHub here.
🎉 We now have ~23,000 Swagger files 🎉
Scanning the Internet
When scanning the internet, we highly suggest setting up an RDNS record pointing to a web server you control that explains the scanning traffic. This greatly reduces the number of abuse requests.
We scanned the internet for 22 paths that could contain Swagger specifications. The following is the list of paths and the total number of hits we received for each (a quick way to check a single host against these paths is sketched after the list):
- /swagger/v1/swagger.json - 11173 hits
- /swagger/v1/swagger.yaml - 5048 hits
- /swagger/v2/swagger.json - 5107 hits
- /swagger/v2/swagger.yaml - 4919 hits
- /swagger/v1/api-docs - 5115 hits
- /swagger/v2/api-docs - 5255 hits
- /swagger/api-docs - 6049 hits
- /v2/api-docs - 27189 hits
- /v1/api-docs - 314 hits
- /api-docs - 2454 hits
- /swagger.json - 23983 hits
- /swagger.yaml - 4816 hits
- /api/swagger.json - 5313 hits
- /api/swagger.yaml - 1689 hits
- /api/v1/swagger.json - 1732 hits
- /api/v1/swagger.yaml - 680 hits
- /api/v2/swagger.json - 91101 hits
- /api/v2/swagger.yaml - 404 hits
- /api/docs - 1125 hits
- /api/api-docs - 461 hits
- /static/api/swagger.json - 187 hits
- /static/api/swagger.yaml - 152 hits
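A single host can be sanity-checked against these paths with a few lines of Python; our internet-scale scans used masscan and Zgrab2 instead, as described below. The path list here is trimmed for brevity:

```python
import requests

# A trimmed selection of the well-known documentation paths listed above;
# extend this with the full set when probing a host.
PATHS = [
    "/swagger/v1/swagger.json",
    "/v2/api-docs",
    "/swagger.json",
    "/api/swagger.json",
    "/api-docs",
]

def find_specs(base_url):
    """Return the paths on base_url that respond with something that looks
    like an OpenAPI/Swagger document."""
    hits = []
    for path in PATHS:
        try:
            resp = requests.get(base_url + path, timeout=5, verify=False)
            body = resp.json()
        except (requests.RequestException, ValueError):
            continue
        if isinstance(body, dict) and ("swagger" in body or "openapi" in body):
            hits.append(path)
    return hits

print(find_specs("https://example.com"))
```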
We leveraged passive data sources such as Rapid7’s HTTP and HTTPS datasets, combined with an internet-wide masscan to fill in the deltas.
Additionally, we only scanned ports 80 and 443, focusing on known API documentation paths, scoped to Swagger 2.0 compliant files.
In order to facilitate the scanning, we used Zgrab2, with a custom HTTP module made for Swagger specification files.
Custom tooling written in golang was used to take the 20GB of zgrab2 output and deserialise it into individual, validated and deduplicated API specifications.
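The actual tooling is golang, but the core of that step is easy to sketch (assuming, for illustration, one candidate JSON body per line of scan output and deduplication by content hash):

```python
import hashlib
import json
import os

def extract_specs(scan_output_path, out_dir="specs/scan"):
    """Keep only valid, previously unseen Swagger 2.0 documents from raw
    scan output, writing each one to its own file."""
    os.makedirs(out_dir, exist_ok=True)
    seen = set()
    kept = 0
    with open(scan_output_path) as fh:
        for line in fh:
            try:
                body = json.loads(line)
            except json.JSONDecodeError:
                continue
            # Scope to Swagger 2.0 compliant documents with at least one path.
            if not isinstance(body, dict) or body.get("swagger") != "2.0" or not body.get("paths"):
                continue
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if digest in seen:
                continue  # duplicate of a specification we already kept
            seen.add(digest)
            with open(os.path.join(out_dir, digest + ".json"), "w") as out:
                json.dump(body, out)
            kept += 1
    return kept
```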
🎉 We now have ~67,500 Swagger files 🎉
Finding APIs worth bruteforcing
In order to find APIs on the internet, we deployed a large number of API-centric frameworks and fingerprinted them. This gave us a good understanding of what to look for before running Kiterunner over a target.
We’ve created a list of signatures that you can use to discover API endpoints on the internet. We cover the following frameworks and servers:
- Adonis
- Aspnet
- Beego
- Cakephp
- Codeigniter
- Django
- Dropwizard
- Echo
- Express
- Fastapi
- Fastify
- Flask
- Generic
- Golang-HTTP
- Hapi
- Jetty
- Kestrel
- Koa
- Kong
- Laravel
- Loopback
- Nest
- Nginx
- Phalcon
- Playframework
- Rails
- Sinatra
- Spark
- Spring-boot
- Symfony
- Tomcat
- Tornado
- Totaljs
- Yii
You can find these signatures here. These signatures contain expected HTTP responses and example Censys queries to discover hosts running these APIs on the internet.
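For example, many frameworks return a distinctive body for a path that does not exist, which can be checked with a few lines of Python (the two fingerprints below are simplified illustrations; the published signatures are more thorough):

```python
import requests

# Simplified fingerprints: substrings expected in the response to a request
# for a path that should not exist on the target.
FINGERPRINTS = {
    "Express": "Cannot GET /kr-fingerprint-check",
    "Flask": "The requested URL was not found on the server",
}

def fingerprint(base_url):
    """Return the first framework whose 404-page fingerprint matches."""
    resp = requests.get(base_url + "/kr-fingerprint-check", timeout=5)
    for framework, needle in FINGERPRINTS.items():
        if needle in resp.text:
            return framework
    return None

print(fingerprint("http://localhost:3000"))  # e.g. "Express"
```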
Preliminary Results
After developing the dataset and tooling, we spent some time proving the concept of contextual bruteforcing leading to the discovery of API endpoints that other content discovery tools would struggle to find.
Download API Endpoint
Take, for example, the following request:
`GET /download = 404 Not Found`
If your wordlist only has `/download`, you may think that there is no content. However, with Kiterunner, we saw the following result returned:
`GET /download/17506499 = 200 OK`
When visiting the fully formed API path `/download/17506499`, we saw the following response:
Unless your wordlist had a path like `/download/<int>`, you would have missed this discovery.
User Create API Endpoint
In this example, it is highly unlikely that any other content discovery tool would have picked up this API endpoint.
The following request returned a 404:
`GET /user/create = 404 Not Found`
However, with Kiterunner, we saw the following output:
`POST /user/create = 500 Internal Server Error`
A GET returns a 404, while a POST returns a 500, indicating that this endpoint does exist.
We were able to replay this request through to Burp Suite using Kiterunner’s replay functionality, which resulted in the following request:
The response indicates that this API endpoint exists; however, we are likely missing the correct parameters. We were able to use Param Miner to guess the JSON parameters for this endpoint.
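Param Miner automates this well; the underlying idea is simply to try candidate JSON parameter names and watch for a change in the response (the host and candidate list below are hypothetical):

```python
import requests

# Hypothetical target matching the example above; the candidate list is a
# tiny illustration, not a real parameter wordlist.
URL = "https://target.example/user/create"
CANDIDATES = ["user", "username", "email", "name", "password", "role"]

baseline = requests.post(URL, json={}, timeout=10)

for name in CANDIDATES:
    resp = requests.post(URL, json={name: "test"}, timeout=10)
    # A change in status code or response length suggests the parameter is
    # parsed server-side and worth investigating further.
    if (resp.status_code, len(resp.content)) != (baseline.status_code, len(baseline.content)):
        print(f"{name}: {resp.status_code} ({len(resp.content)} bytes)")
```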
Images Endpoint (Local File Read)
Traditional content discovery tools would return the following:
`GET /images = 404 Not Found`
`GET /images/ = 404 Not Found`
However, when using Kiterunner, we noticed that we were given a 301 response status code when making a request to `/images/` with any suffix:
Upon visiting this endpoint, the response indicated that it could not locate the image referenced by our suffix path:
We experimented with this and found that it was actually a local file read vulnerability. We could request the following path to leak the contents of the `/etc/passwd` file:
`GET /secured/28572478/images//etc/passwd = 200 OK (with file contents)`
How do I use the tool?
Download a release for your operating system from the releases page: github.com/assetnote/kiterunner/releases
Contextual Bruteforcing
Download the dataset routes-large.kite.tar.gz (40MB compressed, 577MB decompressed) or the smaller dataset routes-small.kite.tar.gz (2MB compressed, 35MB decompressed).
After downloading the tool and dataset, Kiterunner can be run directly against a list of hosts.
Put the hosts you wish to scan with Kiterunner inside a single text file, one host per line.
A number of options are available for the `kr scan` command (contextual bruteforcing).
Vanilla Bruteforcing
It’s also possible to use Kiterunner as a traditional content discovery tool (with txt file wordlists). Kiterunner has built-in support for Assetnote Wordlists.
Run the following command to list Assetnote Wordlists:
This will return a table of contents like below:
Use your favourite txt wordlist with Kiterunner, combined with the `apiroutes-210228` Assetnote Wordlist, via the `kr brute` command.
A number of additional options are available for the `kr brute` command (vanilla bruteforcing).
Replaying requests
When you receive a bunch of output from Kiterunner, it may be difficult to immediately understand why a request is causing a specific response code/length. Kiterunner offers a method of rebuilding the request from the wordlists used, including all of the header and body parameters.
- You can replay a request by copy-pasting the full response output into the `kb replay` command.
- You can specify a `--proxy` to forward your requests through, so you can modify/repeat/intercept the request using 3rd party tools if you wish.
- The golang net/http client will perform a few additional changes to your request due to how the default golang HTTP client implements the spec (unfortunately).
Conclusion
Content discovery on API hosts requires a different approach in order to achieve maximum coverage. We were able to operationalize OpenAPI/Swagger specifications from a number of data sources to assist in the discovery of API endpoints from a black-box perspective.
From the preliminary results gathered using Kiterunner, we found numerous examples where current content discovery tooling would have failed to pick up endpoints which represented legitimate attack surface.
By taking a contextual approach to discovering endpoints (correct HTTP method, headers, parameters and values), we were able to find content that would have been tricky to pick up otherwise.
Most of our challenges were related to the engineering efforts required to deal with OpenAPI/Swagger specifications from arbitrary sources and making Kiterunner a performant bruteforcer. We plan on writing a separate blog post to go over the engineering challenges faced during this security research.
Next time you’re bruteforcing an application with an API, consider using Kiterunner alongside your other content discovery workflows.
Assetnote
We love content discovery at Assetnote. Working on the research and engineering problems in this area has always been exciting to us.
Assetnote Continuous Security automatically maps your external assets and monitors them for changes and security issues to help prevent serious breaches. If you want to protect your attack surface and would like a demonstration of our product, please reach out to us by submitting our contact form.
On a final note, Assetnote is hiring across a number of engineering, security and operations roles. One of the best parts of working at Assetnote is the ability to combine interesting engineering challenges with security research as part of the everyday work on our products. If you are interested in joining our team and working on cutting-edge security products, check out our careers page.
Credits
The following people were involved in this research: