DATA SOURCES

Open Web

The Open Web category includes data collected from various sources on the publicly accessible Internet, also known as the "clear web".

For most Open Web sources, any Identifiers you create are used to run periodic searches. Identifier data is disclosed to third parties, but only to reliable platforms that support our collection, such as Shodan or GitHub.

Source Types

Pastes

The Pastes subcategory includes websites that allow anonymous sharing of text content, such as Pastebin (pastebin.com). Malicious actors often use these websites to share content with one another or to move it between different computers.

There are two main paste collection processes:

  1. Certain paste websites have APIs or historical paste lists that our systems monitor, download and store.
  2. When a paste link is found in any document in the system, it is added to a queue and crawled as well.

All pastes found through both processes are collected, stored and searchable in Flare.
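The second process (links discovered in other documents being queued for crawling) can be pictured with a short sketch. This is only an illustration under stated assumptions, not Flare's implementation: the regular expression covers pastebin.com only, and the function and variable names are hypothetical.

```python
import re
from collections import deque

# Assumption: only pastebin.com links are matched here; other paste sites
# would need their own patterns. The 8-character ID format is illustrative.
PASTE_LINK = re.compile(r"https?://pastebin\.com/(?:raw/)?([A-Za-z0-9]{8})")


def enqueue_paste_links(document_text, queue, seen):
    """Add each newly discovered paste ID to the crawl queue.

    Returns the number of IDs that were added on this call.
    """
    added = 0
    for match in PASTE_LINK.finditer(document_text):
        paste_id = match.group(1)
        if paste_id not in seen:  # skip pastes that are already queued
            seen.add(paste_id)
            queue.append(paste_id)
            added += 1
    return added


queue, seen = deque(), set()
doc = "dump at https://pastebin.com/AbCd1234 and mirror https://pastebin.com/raw/AbCd1234"
enqueue_paste_links(doc, queue, seen)
print(list(queue))
```

Tracking already-seen IDs keeps the same paste from being crawled twice when it is linked from several documents.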

Web Accounts

The Web Accounts subcategory monitors a list of websites that mirrors the coverage of the Sherlock Project's OSINT tool. The list of sites currently covered by the tool can be reviewed here: https://github.com/sherlock-project/sherlock/blob/master/sites.md

Source Code

The Source Code subcategory includes data collected from platforms that mainly host source code, such as GitHub.

GitHub

Flare collects data from GitHub in the following manner:

Document image


Stack Overflow

Flare also monitors Stack Overflow and sends an alert if any Keyword or Domain identifier is disclosed in a question on the forum.

Google

The Google subcategory relies on specialized search queries (dorks). Our experts maintain a large list of dorks that are adjusted based on your Domain and Keyword identifiers. All results are stored in the database.
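To make the idea of adjusting dorks to identifiers concrete, here is a minimal sketch. The templates below are hypothetical examples built from common Google search operators (site:, ext:, intitle:); they are not the list maintained by Flare, and the function name is an assumption.

```python
# Hypothetical dork templates, for illustration only.
DORK_TEMPLATES = {
    "exposed_env_file": "site:{domain} ext:env",
    "directory_listing": 'site:{domain} intitle:"index of"',
    "keyword_on_paste_site": 'site:pastebin.com "{keyword}"',
}


def build_dorks(domain=None, keyword=None):
    """Fill in each template whose placeholders can be satisfied."""
    queries = {}
    for name, template in DORK_TEMPLATES.items():
        needs_domain = "{domain}" in template
        needs_keyword = "{keyword}" in template
        # Skip templates that need an identifier we were not given.
        if (needs_domain and not domain) or (needs_keyword and not keyword):
            continue
        queries[name] = template.format(domain=domain, keyword=keyword)
    return queries


for name, query in build_dorks(domain="example.com", keyword="acme-internal").items():
    print(f"{name}: {query}")
```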

Hosts

The Hosts subcategory includes data related to an organization's SSL certificates and open ports. It relies mainly on third-party services such as Shodan. Searches are run on these services based on your Domain and IP identifiers.

Shodan

Shodan is a tool that discovers and indexes devices connected to the Internet, including where they are located and who is using them. It allows you to keep track of all the computers on your network that are directly accessible from the Internet. The tool is also known to be used by malicious actors to investigate potential victims and their vulnerabilities. More specifically, for every IP address, Shodan looks at most open ports, the service running on each port, system versions, associated vulnerabilities, and more depending on the use case. Here is a full list of the ports Shodan scans, along with the number of records in its database for each: https://www.shodan.io/search/facet?query=%2A&facet=port

In Flare, Shodan data is integrated into your identifier monitoring as long as you check the Hosts box, in any of the following ways:

A) Domain Identifiers: Whenever you create a Domain identifier and check the Hosts category, we automatically query Shodan on a daily basis to see whether the domain appears in any of its indexed hosts.

B) Subdomain Identifiers: Whenever you create a Domain identifier and enable Subdomains, we query Shodan using the domain and subdomain names. We also resolve their IP addresses ourselves and query Shodan for those addresses.

C) IP Address Identifiers: Whenever you create an IP identifier, we query Shodan for the data it has on that specific IP address.

D) IP Address Range: Whenever you create an IP range as an identifier, we query Shodan for the data it has on each IP address in the range.
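As a rough sketch of how the identifier types above could translate into Shodan searches, the following builds query strings using Shodan's hostname: and net: search filters. The mapping, the identifier type names and the function name are assumptions made for illustration; the queries Flare actually sends are not documented here.

```python
import ipaddress


def shodan_queries(identifier_type, value):
    """Return illustrative Shodan query strings for one identifier."""
    if identifier_type == "domain":
        # Matches hosts whose hostnames contain the domain.
        return [f"hostname:{value}"]
    if identifier_type == "ip":
        # ip_address() validates the input before it is used in a query.
        return [f"net:{ipaddress.ip_address(value)}"]
    if identifier_type == "ip_range":
        # One query per address in the range, as described in D).
        network = ipaddress.ip_network(value, strict=False)
        return [f"net:{address}" for address in network]
    raise ValueError(f"unsupported identifier type: {identifier_type}")


print(shodan_queries("domain", "example.com"))
print(shodan_queries("ip_range", "192.0.2.0/30"))
```

For large ranges, a single net: query with the CIDR block would be a cheaper alternative to one query per address.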

Shodan Banners

Shodan collects what it calls Banners. In practice, the information contained in a Banner differs depending on the port and the service running on it. For example, for an HTTP/HTTPS service, the data in the Banner is the content of the response headers. For an SSH service on port 22, the Banner contains at a minimum the SSH service version, but it may also provide the host key algorithms and compression algorithms used.

For more on Banners, you can use the search query port:<port number> in the Flare search bar to explore different ports, or look at this Shodan article on Banners.
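The difference between an HTTP banner and an SSH banner can be illustrated with a small parser. This is a simplified sketch: the banners are hand-written dictionaries with only a port and a data field, real Shodan banners carry many more fields, and the function name is hypothetical.

```python
def summarize_banner(banner):
    """Extract the most useful part of a banner based on its port."""
    port = banner["port"]
    lines = banner["data"].splitlines()
    if port in (80, 443):
        # HTTP/HTTPS: the banner data is the response status line and headers.
        headers = {}
        for line in lines[1:]:
            name, separator, value = line.partition(":")
            if separator:
                headers[name.strip().lower()] = value.strip()
        return {"port": port, "status_line": lines[0] if lines else "", "headers": headers}
    if port == 22:
        # SSH: the first line is the service version string.
        return {"port": port, "ssh_version": lines[0] if lines else ""}
    return {"port": port, "raw": banner["data"]}


http_banner = {"port": 443, "data": "HTTP/1.1 200 OK\r\nServer: nginx\r\nContent-Type: text/html\r\n\r\n"}
ssh_banner = {"port": 22, "data": "SSH-2.0-OpenSSH_8.9p1\nKey type: ssh-rsa"}
print(summarize_banner(http_banner))
print(summarize_banner(ssh_banner))
```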

Buckets

The Buckets subcategory queries a database that indexes publicly available cloud buckets from Azure and AWS. We use this tool with Domain and Keyword identifiers, and we search specifically for sensitive items, given the large amount of noise present in cloud buckets.

We have published an article in Flare's Research Center that explains why monitoring cloud buckets is important (see here).

Data Enrichment

Grouping by similarity

If items are identical, they are merged into a single card labeled "And X similar activities". The criteria for treating activities as identical vary by source but generally include the content.
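One way to picture this merging is to group activities by a hash of their normalized content. The sketch below is an assumption about how such grouping could work, not a description of Flare's logic; the field names (source, content) and the normalization rule are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_identical(activities):
    """Merge activities with the same source and normalized content into cards."""
    groups = defaultdict(list)
    for activity in activities:
        # Assumption: whitespace and letter case are ignored when comparing content.
        normalized = " ".join(activity["content"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[(activity["source"], digest)].append(activity)
    # Each card shows the first activity plus a count of the others.
    return [
        {"activity": members[0], "similar_count": len(members) - 1}
        for members in groups.values()
    ]


activities = [
    {"source": "paste", "content": "password=hunter2"},
    {"source": "paste", "content": "  PASSWORD=hunter2 "},
    {"source": "paste", "content": "unrelated text"},
]
for card in group_identical(activities):
    print(card["activity"]["content"].strip(), "| And", card["similar_count"], "similar activities")
```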

Grouping by project

For sources that have their own grouping mechanisms, such as GitHub repositories, results in the same group or repository are merged into a single card.

Document image


Dork Types

If a certain activity was found with a specialized query ("dork"), the name of the dork will be indicated in a tag as well as on the right-hand side panel, as shown in the image above. Tags include themes such as Private PEM certificate or Django secrets.

Scoring

Scoring on open web activities varies by source.

For anything related to network infrastructure, each IP and port combination is assigned a score. An array of factors is considered when assigning the score, including the type of service running, the port, associated vulnerabilities, the associated SSL certificate, and so on. For example, unless the operating system version on the host contains known vulnerabilities (CVEs), port 443 with an up-to-date certificate will likely receive a risk score of 1, which is the minimum.

At the other end of the scale, an Elasticsearch service with its port open to the web, or a Microsoft SMB service, will likely receive a risk score of 3 or 4.
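The two examples above can be expressed as a toy scoring function. Only the outcomes for the HTTPS case and the Elasticsearch/SMB case come from the text; every other rule, the parameter names and the default of 2 are assumptions added to make the sketch complete, and the real scoring weighs more factors.

```python
# Services named in the text as likely high risk when exposed.
RISKY_SERVICES = {"elasticsearch", "smb"}


def score_host(port, service, has_known_cves=False, certificate_valid=True):
    """Return an illustrative risk score from 1 (minimum) to 4."""
    if service.lower() in RISKY_SERVICES:
        # The text says 3 or 4; splitting on CVEs is an assumption.
        return 4 if has_known_cves else 3
    if has_known_cves:
        return 3  # assumption: known CVEs raise any other service to 3
    if port == 443 and certificate_valid:
        return 1  # matches the HTTPS example in the text
    return 2  # assumption: default for everything else


print(score_host(443, "https"))
print(score_host(9200, "Elasticsearch"))
print(score_host(445, "SMB", has_known_cves=True))
```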

For any document or code, the system tries to extract sensitive information such as API keys, encryption keys, etc. If successful, the score is adjusted to 3 or 4 depending on the severity of the finding.
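A minimal sketch of this kind of extraction uses regular expressions for two well-known secret formats. The patterns, the severity assigned to each, the base score of 1 and the function name are assumptions for illustration; they do not describe the detectors Flare actually runs.

```python
import re

# Each entry maps a finding name to (pattern, assumed score).
PATTERNS = {
    "aws_access_key_id": (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), 3),
    "private_key_block": (
        re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
        4,
    ),
}


def score_document(text, base_score=1):
    """Return (score, findings) for a document or code snippet."""
    findings = [name for name, (pattern, _) in PATTERNS.items() if pattern.search(text)]
    # The most severe finding sets the score; no finding keeps the base score.
    score = max((PATTERNS[name][1] for name in findings), default=base_score)
    return score, findings


print(score_document("aws_key = AKIAIOSFODNN7EXAMPLE"))
print(score_document("-----BEGIN RSA PRIVATE KEY-----"))
print(score_document("nothing sensitive here"))
```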
