We've prepared a two-part translation of Ryan Sears's article on handling Google's Certificate Transparency logs . The first part gives an overview of the structure of the logs and provides a sample Python code for parsing records from these logs. The second part is devoted to obtaining all certificates from the available logs and configuring the Google BigQuery system for storing and organizing searches for the received data.
Three years have passed since the original was written, and since then the number of available logs and, accordingly, entries in them has increased many times. It is all the more important to correctly approach the processing of logs if the goal is to maximize the amount of data received.
Part 1. Parsing Certificate Transparency Logs Like a Boss
During the development of our first project, phisfinder , I spent a lot of time thinking about the anatomy of phishing attacks and the data sources that would allow us to identify traces of upcoming phishing campaigns before they can cause any real damage.
One of the sources we've integrated (and certainly one of the best) is Certificate Transparency Log (CTL), a project started by Ben Laurie and Adam Langley at Google. Essentially, a CTL is a log containing an immutable list of certificates issued by a CA that is stored in a Merkle tree, allowing each certificate to be cryptographically verified if necessary.
, , , CTL:
import requests
import json
import locale
locale.setlocale(locale.LC_ALL, 'en_US')
ctl_log = requests.get('https://www.gstatic.com/ct/log_list/log_list.json').json()
total_certs = 0
human_format = lambda x: locale.format('%d', x, grouping=True)
for log in ctl_log['logs']:
log_url = log['url']
try:
log_info = requests.get('https://{}/ct/v1/get-sth'.format(log_url), timeout=3).json()
total_certs += int(log_info['tree_size'])
except:
continue
print("{} has {} certificates".format(log_url, human_format(log_info['tree_size'])))
print("Total certs -> {}".format(human_format(total_certs)))
:
ct.googleapis.com/pilot has 92,224,404 certificates
ct.googleapis.com/aviator has 46,466,472 certificates
ct1.digicert-ct.com/log has 1,577,183 certificates
ct.googleapis.com/rocketeer has 89,391,361 certificates
ct.ws.symantec.com has 3,562,198 certificates
ctlog.api.venafi.com has 94,797 certificates
vega.ws.symantec.com has 200,401 certificates
ctserver.cnnic.cn has 5,081 certificates
ctlog.wosign.com has 1,387,492 certificates
ct.startssl.com has 293,374 certificates
ct.googleapis.com/skydiver has 1,249,079 certificates
ct.googleapis.com/icarus has 48,585,765 certificates
Total certs -> 285,037,607
285,037,607 . , , . .
CTL
CTL HTTP, . , , . :
json
// curl -s 'https://ct1.digicert-ct.com/log/ct/v1/get-entries?start=0&end=0' | jq .
{
"entries": [
{
"leaf_input": "AAAAAAFIyfaldAAAAAcDMIIG/zCCBeegAwIBAgI...",
"extra_data": "AAiJAAS6MIIEtjCCA56gAwIBAgIQDHmpRLCMEZU..."
}
]
}
`leaf_input` `extra_data` base64. RFC6962 , `leaf_input` - MerkleTreeLeaf, `extra_data` - PrecertChainEntry.
PreCerts
, , PreCert ( , RFC, , , . PreCerts :
PreCerts , CA , โโ . , , x509 v3, `poison` . , , , PreCert, , .
, , , x509/ASN.1 , PreCert. , , , PreCerts CTL , CA, .
, - CTF, . `struct`, , , Construct, . , , :
from construct import Struct, Byte, Int16ub, Int64ub, Enum, Bytes, Int24ub, this, GreedyBytes, GreedyRange, Terminated, Embedded
MerkleTreeHeader = Struct(
"Version" / Byte,
"MerkleLeafType" / Byte,
"Timestamp" / Int64ub,
"LogEntryType" / Enum(Int16ub, X509LogEntryType=0, PrecertLogEntryType=1),
"Entry" / GreedyBytes
)
Certificate = Struct(
"Length" / Int24ub,
"CertData" / Bytes(this.Length)
)
CertificateChain = Struct(
"ChainLength" / Int24ub,
"Chain" / GreedyRange(Certificate),
)
PreCertEntry = Struct(
"LeafCert" / Certificate,
Embedded(CertificateChain),
Terminated
)
import json
import base64
import ctl_parser_structures
from OpenSSL import crypto
entry = json.loads("""
{
"entries": [
{
"leaf_input": "AAAAAAFIyfaldAAAAAcDMIIG/zCCBeegAwIBAgIQ...",
"extra_data": "AAiJAAS6MIIEtjCCA56gAwIBAgIQDHmpRLCMEZUg..."
}
]
}
""")['entries'][0]
leaf_cert = ctl_parser_structures.MerkleTreeHeader.parse(base64.b64decode(entry['leaf_input']))
print("Leaf Timestamp: {}".format(leaf_cert.Timestamp))
print("Entry Type: {}".format(leaf_cert.LogEntryType))
if leaf_cert.LogEntryType == "X509LogEntryType":
# , - X509
cert_data_string = ctl_parser_structures.Certificate.parse(leaf_cert.Entry).CertData
chain = [crypto.load_certificate(crypto.FILETYPE_ASN1, cert_data_string)]
# `extra_data`
extra_data = ctl_parser_structures.CertificateChain.parse(base64.b64decode(entry['extra_data']))
for cert in extra_data.Chain:
chain.append(crypto.load_certificate(crypto.FILETYPE_ASN1, cert.CertData))
else:
# , - PreCert
extra_data = ctl_parser_structures.PreCertEntry.parse(base64.b64decode(entry['extra_data']))
chain = [crypto.load_certificate(crypto.FILETYPE_ASN1, extra_data.LeafCert.CertData)]
for cert in extra_data.Chain:
chain.append(
crypto.load_certificate(crypto.FILETYPE_ASN1, cert.CertData)
)
X509 leaf_input
, Construct Python.
, , CTL , - .
2. Retrieving, Storing and Querying 250M+ Certificates Like a Boss
RFC, `get-entries`. , , ( `start` `end`), 64 . CTL Google, , 1024 .
Google (Argon, Xenon, Aviator, Icarus, Pilot, Rocketeer, Skydiver) 32 , , , .
1024 , CTL, Google, 256 .
IO-bound ( http) CPU-bound ( ), , .
, CTL ( Google, , . Axeman, asyncio aioprocessing , CSV , -.
(_. ._ Google Cloud VM) c 16 , 32 SSD 750 ( Google 300$ !), Axeman, `/tmp/certificates/$CTL_DOMAIN/`
?
Postgres, , , Postgres 250 ( , 20 !), , :
, , (AWS RDS, Heroku Postgres, Google Cloud SQL) . , , .
, , map/reduce , , Spark Hadoop Pig. โbig dataโ ( ), Google BigQuery, .
BigQuery
BigQuery , Google gsutil. :
, `gsutil` Google ( BigQuery). `gsutil config`, :
gsutil -o GSUtil:parallel_composite_upload_threshold=150M \
-m cp \
/tmp/certificates/* \
gs://all-certificates
:
BigQuery:
. , BigQuery โ, โ, CTL , . ( ):
, โEdit as Textโ. :
[
{
"name": "url",
"type": "STRING",
"mode": "REQUIRED"
},
{
"mode": "REQUIRED",
"name": "cert_index",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "chain_hash",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "cert_der",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "all_dns_names",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "not_before",
"type": "FLOAT"
},
{
"mode": "REQUIRED",
"name": "not_after",
"type": "FLOAT"
}
]
. , ( , , ). :
.
, punycode . :
SQL
SELECT
all_dns_names
FROM
[ctl-lists:certificate_data.scan_data]
WHERE
(REGEXP_MATCH(all_dns_names,r'\b?xn\-\-'))
AND NOT all_dns_names CONTAINS 'cloudflare'
15 punycode CTL!
. Coinbase, Certificate Transparency:
SQL
SELECT
all_dns_names
FROM
[ctl-lists:certificate_data.scan_data]
WHERE
(REGEXP_MATCH(all_dns_names,r'.*\.coinbase.com[\s$]?'))
:
- , - .
, . `flowers-to-the-world.com` . , :
SQL
SELECT
url,
COUNT(*) AS total_certs
FROM
[ctl-lists:certificate_data.scan_data]
WHERE
(REGEXP_MATCH(all_dns_names,r'.*flowers-to-the-world.*'))
GROUP BY
url
ORDER BY
total_certs DESC
Whois , Google, , - . Google, - , Certificate Transparency, .
, . Certificate Transparency.
`flowers-to-the-world.com` Google. , CTL RFC6962. , .
, , , , , .
`flower-to-the-world.com`, , : โC=GB, ST=London, O=Google UK Ltd., OU=Certificate Transparency, CN=Merge Delay Monitor Rootโ
, .
โ NetLas.io. , , , .
, , . , . , โ , . Netlas.io " ". โ .