How do I use Amazon Athena to analyze my Application Load Balancer access logs?


I want to use Amazon Athena to analyze my Application Load Balancer access logs.

Short description

Elastic Load Balancing doesn't activate access logs by default. When you activate access logs, you must specify an Amazon Simple Storage Service (Amazon S3) bucket where the load balancer stores the logs. You can then use Athena to analyze Application Load Balancer and Classic Load Balancer access logs in the Amazon S3 bucket.

Note: This resolution applies only to Application Load Balancers.

Resolution

Create a database and table for Application Load Balancer logs

Complete the following steps:

  1. Open the Athena console.
  2. To create a database, run the following command in the Query Editor:
    CREATE DATABASE alb_db
    Note: It's a best practice to create the database in the same AWS Region as the Amazon S3 bucket.
  3. In the database, create an alb_logs table for the Application Load Balancer logs.
    Example query:
    CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs(
        type string,
        time string,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code int,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string,
        ssl_cipher string,
        ssl_protocol string,
        target_group_arn string,
        trace_id string,
        domain_name string,
        chosen_cert_arn string,
        matched_rule_priority string,
        request_creation_time string,
        actions_executed string,
        redirect_url string,
        lambda_error_reason string,
        target_port_list string,
        target_status_code_list string,
        classification string,
        classification_reason string,
        conn_trace_id string
    )
    PARTITIONED BY(day STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES(
        'serialization.format' = '1',
        'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\" ?([^ ]*)?( .*)?')
    LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'
    TBLPROPERTIES(
        "projection.enabled" = "true",
        "projection.day.type" = "date",
        "projection.day.range" = "2022/01/01,NOW",
        "projection.day.format" = "yyyy/MM/dd",
        "projection.day.interval" = "1",
        "projection.day.interval.unit" = "DAYS",
        "storage.location.template" = "s3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/${day}"
    )
    Note: In the preceding query, replace the table name and S3 locations with your table name and S3 locations. For projection.day.range, replace 2022/01/01 with your start date. This table uses partition projection, which improves query performance. For example queries that you can run against this table, see the Example queries section that follows these steps.
  4. Alternatively, to create a table with partitions that are stored in the AWS Glue Data Catalog, run the following query:
    CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs_partitioned(
        type string,
        time string,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code int,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string,
        ssl_cipher string,
        ssl_protocol string,
        target_group_arn string,
        trace_id string,
        domain_name string,
        chosen_cert_arn string,
        matched_rule_priority string,
        request_creation_time string,
        actions_executed string,
        redirect_url string,
        lambda_error_reason string,
        target_port_list string,
        target_status_code_list string,
        classification string,
        classification_reason string,
        conn_trace_id string
    )
    PARTITIONED BY(day string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES(
        'serialization.format' = '1',
        'input.regex' ='([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\" ?([^ ]*)?( .*)?')
    LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'
    Note: In the preceding query, replace the table name and S3 locations with your table name and S3 locations.
  5. (Optional) To load the partitions, run the following ALTER TABLE ADD PARTITION DDL statement:
    ALTER TABLE alb_logs_partitioned ADD PARTITION (day = '2022/05/21') LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/2022/05/21/'
    Note: If your table uses partition projection, then skip the load partitions step. Don't use an AWS Glue crawler on the Application Load Balancer logs.
  6. In the navigation pane, under Tables, choose Preview table.
  7. In the Results window, view the data from the Application Load Balancer access logs.
    Note: To run SQL statements on the table, use the Query Editor.

Related information

Access log entries

Querying Application Load Balancer logs

AWS OFFICIAL
Updated a month ago
13 Comments

This is great but requires some corrections:

For example, SELECT * from alb_log where elb_status_code = '503'; will fail with a "TYPE_MISMATCH: line 1:46: Cannot apply operator: integer = varchar(3)" error.

We should use the following instead: SELECT * from alb_log where elb_status_code = 503;

Rez
replied a year ago

Also this query:

SELECT COUNT(request_verb) AS count, request_verb, client_ip FROM alb_log_partition_projection WHERE day = '2022/05/21' GROUP BY request_verb, client_ip;

Needs to be changed to:

SELECT COUNT(request_verb) AS count, request_verb, client_ip FROM alb_log WHERE day = '2022/05/21' GROUP BY request_verb, client_ip;

Not a major issue, it's just the table name according to your tutorial.

Rez
replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR
replied a year ago

Create table queries missing REGION from the S3 path

replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR
replied a year ago

The region is still missing from the S3 paths in the first CREATE TABLE query. Without this, the table doesn't work and returns no results.

replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR
replied a year ago

The first create table in step 3 has an extra single quote in the LOCATION line:

LOCATION ''s3://<your-al...

replied 8 months ago

This example in the article:

SELECT target_ip, (Count(target_ip)* 100.0 / (Select Count(*) From alb_logs)) as backend_traffic_percentage FROM alb_logs GROUP by target_ip< ORDER By count() DESC;

Gives this error when running:

error - line 5:1: mismatched input '<'. Expecting: <predicate>

I was able to resolve it by removing the < after target_ip on line 4.

SELECT target_ip, (Count(target_ip)* 100.0 / (Select Count(*) From alb_logs)) as backend_traffic_percentage FROM alb_logs GROUP by target_ip ORDER By count() DESC;

AWS
replied 6 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR
replied 6 months ago

Is there something that needs to be done to load the log files into Athena? I followed this article step by step. I have the table created, but I don't see the logs. I have logs in S3. What is the secret?

replied 5 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR
replied 5 months ago