Google's Certificate Transparency as a Data Source for Attack Prevention

We've prepared a two-part translation of Ryan Sears's article on handling Google's Certificate Transparency logs . The first part gives an overview of the structure of the logs and provides a sample Python code for parsing records from these logs. The second part is devoted to obtaining all certificates from the available logs and configuring the Google BigQuery system for storing and organizing searches for the received data.





Three years have passed since the original was written, and since then the number of available logs and, accordingly, entries in them has increased many times. It is all the more important to correctly approach the processing of logs if the goal is to maximize the amount of data received.





Part 1. Parsing Certificate Transparency Logs Like a Boss

During the development of our first project, phisfinder , I spent a lot of time thinking about the anatomy of phishing attacks and the data sources that would allow us to identify traces of upcoming phishing campaigns before they can cause any real damage.





One of the sources we've integrated (and certainly one of the best) is Certificate Transparency Log (CTL), a project started by Ben Laurie and Adam Langley at Google. Essentially, a CTL is a log containing an immutable list of certificates issued by a CA that is stored in a Merkle tree, allowing each certificate to be cryptographically verified if necessary.





, , , CTL:





import requests
import json
import locale
locale.setlocale(locale.LC_ALL, 'en_US')

ctl_log = requests.get('https://www.gstatic.com/ct/log_list/log_list.json').json()

total_certs = 0

human_format = lambda x: locale.format('%d', x, grouping=True)

for log in ctl_log['logs']:
	log_url = log['url']
	try:
		log_info = requests.get('https://{}/ct/v1/get-sth'.format(log_url), timeout=3).json()
		total_certs += int(log_info['tree_size'])
	except:
		continue

	print("{} has {} certificates".format(log_url, human_format(log_info['tree_size'])))

print("Total certs -> {}".format(human_format(total_certs)))
      
      







:





ct.googleapis.com/pilot has 92,224,404 certificates
ct.googleapis.com/aviator has 46,466,472 certificates
ct1.digicert-ct.com/log has 1,577,183 certificates
ct.googleapis.com/rocketeer has 89,391,361 certificates
ct.ws.symantec.com has 3,562,198 certificates
ctlog.api.venafi.com has 94,797 certificates
vega.ws.symantec.com has 200,401 certificates
ctserver.cnnic.cn has 5,081 certificates
ctlog.wosign.com has 1,387,492 certificates
ct.startssl.com has 293,374 certificates
ct.googleapis.com/skydiver has 1,249,079 certificates
ct.googleapis.com/icarus has 48,585,765 certificates
Total certs -> 285,037,607
      
      







285,037,607 . , , . .





, API , PreCerts ( ) . , , , 6 , Chrome. , , .





, , , Google Chrome, 46 , 6,861,473,804 , .





CTL

CTL HTTP, . , , . :





   json
// curl -s 'https://ct1.digicert-ct.com/log/ct/v1/get-entries?start=0&end=0' | jq .
{
  "entries": [
    {
      "leaf_input": "AAAAAAFIyfaldAAAAAcDMIIG/zCCBeegAwIBAgI...",
      "extra_data": "AAiJAAS6MIIEtjCCA56gAwIBAgIQDHmpRLCMEZU..."
    }
  ]
}
      
      







`leaf_input` `extra_data` base64. RFC6962 , `leaf_input` - MerkleTreeLeaf, `extra_data` - PrecertChainEntry.





PreCerts

, , PreCert ( , RFC, , , . PreCerts :





PreCerts , CA , โ€œโ€ . , , x509 v3, `poison` . , , , PreCert, , .





, , , x509/ASN.1 , PreCert. , , , PreCerts CTL , CA, .





, - CTF, . `struct`, , , Construct, . , , :





from construct import Struct, Byte, Int16ub, Int64ub, Enum, Bytes, Int24ub, this, GreedyBytes, GreedyRange, Terminated, Embedded

MerkleTreeHeader = Struct(
    "Version"         / Byte,
    "MerkleLeafType"  / Byte,
    "Timestamp"       / Int64ub,
    "LogEntryType"    / Enum(Int16ub, X509LogEntryType=0, PrecertLogEntryType=1),
    "Entry"           / GreedyBytes
)

Certificate = Struct(
    "Length" / Int24ub,
    "CertData" / Bytes(this.Length)
)

CertificateChain = Struct(
    "ChainLength" / Int24ub,
    "Chain" / GreedyRange(Certificate),
)

PreCertEntry = Struct(
    "LeafCert" / Certificate,
    Embedded(CertificateChain),
    Terminated
)
      
      







import json
import base64

import ctl_parser_structures

from OpenSSL import crypto

entry = json.loads("""
{
  "entries": [
    {
      "leaf_input": "AAAAAAFIyfaldAAAAAcDMIIG/zCCBeegAwIBAgIQ...",
      "extra_data": "AAiJAAS6MIIEtjCCA56gAwIBAgIQDHmpRLCMEZUg..."
    }
  ]
}
""")['entries'][0]

leaf_cert = ctl_parser_structures.MerkleTreeHeader.parse(base64.b64decode(entry['leaf_input']))

print("Leaf Timestamp: {}".format(leaf_cert.Timestamp))
print("Entry Type: {}".format(leaf_cert.LogEntryType))

if leaf_cert.LogEntryType == "X509LogEntryType":
    #  ,   -  X509 
    cert_data_string = ctl_parser_structures.Certificate.parse(leaf_cert.Entry).CertData
    chain = [crypto.load_certificate(crypto.FILETYPE_ASN1, cert_data_string)]

    #   `extra_data`     
    extra_data = ctl_parser_structures.CertificateChain.parse(base64.b64decode(entry['extra_data']))
    for cert in extra_data.Chain:
        chain.append(crypto.load_certificate(crypto.FILETYPE_ASN1, cert.CertData))
else:
    #   ,   - PreCert
    extra_data = ctl_parser_structures.PreCertEntry.parse(base64.b64decode(entry['extra_data']))
    chain = [crypto.load_certificate(crypto.FILETYPE_ASN1, extra_data.LeafCert.CertData)]

    for cert in extra_data.Chain:
        chain.append(
            crypto.load_certificate(crypto.FILETYPE_ASN1, cert.CertData)
        )
      
      



X509 leaf_input





, Construct Python.





, , CTL , - .





2. Retrieving, Storing and Querying 250M+ Certificates Like a Boss





RFC, `get-entries`. , , ( `start` `end`), 64 . CTL Google, , 1024 .





Google (Argon, Xenon, Aviator, Icarus, Pilot, Rocketeer, Skydiver) 32 , , , .





1024 , CTL, Google, 256 . 





IO-bound ( http) CPU-bound ( ), , .





, CTL ( Google, , . Axeman, asyncio aioprocessing , CSV , -.





(_. ._ Google Cloud VM) c 16 , 32 SSD 750 ( Google 300$ !), Axeman, `/tmp/certificates/$CTL_DOMAIN/`





?

Postgres, , , Postgres 250 ( , 20 !), , :

















, , (AWS RDS, Heroku Postgres, Google Cloud SQL) . , , .





, , map/reduce , , Spark Hadoop Pig. โ€œbig dataโ€ ( ), Google BigQuery, .





BigQuery

BigQuery , Google gsutil. :





, `gsutil` Google ( BigQuery). `gsutil config`, :





gsutil -o GSUtil:parallel_composite_upload_threshold=150M \
       -m cp \
       /tmp/certificates/* \
       gs://all-certificates
      
      



:





BigQuery:





. , BigQuery โ€œ, โ€, CTL , . ( ):





, โ€œEdit as Textโ€. :





[
    {
        "name": "url",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "mode": "REQUIRED",
        "name": "cert_index",
        "type": "INTEGER"
    },
    {
        "mode": "REQUIRED",
        "name": "chain_hash",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "cert_der",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "all_dns_names",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "not_before",
        "type": "FLOAT"
    },
    {
        "mode": "REQUIRED",
        "name": "not_after",
        "type": "FLOAT"
    }
]
      
      







. , ( , , ). :









.





, punycode . :





   SQL
SELECT
  all_dns_names
FROM
  [ctl-lists:certificate_data.scan_data]
WHERE
  (REGEXP_MATCH(all_dns_names,r'\b?xn\-\-'))
  AND NOT all_dns_names CONTAINS 'cloudflare'
      
      







15 punycode CTL!





. Coinbase, Certificate Transparency:





   SQL
SELECT
  all_dns_names
FROM
  [ctl-lists:certificate_data.scan_data]
WHERE
  (REGEXP_MATCH(all_dns_names,r'.*\.coinbase.com[\s$]?'))
      
      







:





- , - .





, . `flowers-to-the-world.com` . , :





   SQL
SELECT
  url,
  COUNT(*) AS total_certs
FROM
  [ctl-lists:certificate_data.scan_data]
WHERE
  (REGEXP_MATCH(all_dns_names,r'.*flowers-to-the-world.*'))
GROUP BY
  url
ORDER BY
  total_certs DESC
      
      



Whois , Google, , - . Google, - , Certificate Transparency, .





Google

Google





, . Certificate Transparency.





`flowers-to-the-world.com` Google. , CTL RFC6962. , . 





, , , , , .





`flower-to-the-world.com`, , : โ€œC=GB, ST=London, O=Google UK Ltd., OU=Certificate Transparency, CN=Merge Delay Monitor Rootโ€





, .










โ€” NetLas.io. , , , .





, , . , . , โ€” , . Netlas.io " ". โ€” .








All Articles