Building a simple Apache log file parser with Python/lifter in 30 lines of code

published on July 12, 2016, 12:59 p.m. by eliotberriot

As you may know, I released lifter publicly a few months ago.

Despite a lot of positive feedback, I've been quite busy over the last few months and the project has not evolved much.

One of the reasons for this is that when I initially released lifter, it was only an ORM for Python iterables. But, thanks to some feedback, I realized that the project has a lot of potential as a generic ORM that would enable querying any data source with the same API. I won't go into further detail here; there is a dedicated issue on GitHub.

Because of that, I took a step back to think of the right architecture that would enable such a use case.

From Python iterables to a log file parser

With lifter, you can currently run queries like this:

from lifter import models
from lifter.backends.python import IterableStore

class User(models.Model):
    pass

data = [
    {
        'age': 27,
        'is_active': False,
        'email': 'kurt@cobain.music',
    },
    {
        'age': 687,
        'is_active': True,
        'email': 'legolas@deepforest.org',
    },
    {
        'age': 34,
        'is_active': False,
        'email': 'golgoth@lahorde.org',
    }
]

store = IterableStore(data)
manager = store.query(User)
young_users = manager.filter(User.age < 30)

This is quite easy to implement, because everything is Python: the data source and the API.
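
Under the hood, such a query boils down to plain Python filtering; conceptually, it is equivalent to something like this (a sketch, not lifter's actual implementation):

young_users = [user for user in data if user['age'] < 30]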

Now, what if I want to query my users from a REST API instead, or even from a file on my machine?

I've updated lifter's API to make this possible in the latest release (0.3). As a test of this new implementation, I'll write a simple lifter backend in this guide to run queries against an Apache log file.

The backend will have the following capabilities:

  • Convert each log entry to a plain Python model
  • Filter using log fields (IP, user agent, requested path...)
  • Run more complex queries to extract general data (a list of all visitor IPs, all referrers, the total bytes sent over a period...)

The whole thing should fit in 30 lines of code.

Setup

If you want to run the code from this guide, you'll have to download and setup lifter 0.3:

git clone https://github.com/EliotBerriot/lifter.git
cd lifter
git checkout 0.3
python setup.py install

I suggest you run this in a virtual environment for proper isolation (although lifter has very few dependencies).

The data source

First, we'll have to generate log entries such as:

172.183.134.216 - - [12/Jul/2016:12:22:14 -0700] "GET /wp-content HTTP/1.0" 200 4980 "http://farmer-harris.com/category/index/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; rv:1.9.3.20) Gecko/2013-07-10 02:46:11 Firefox/9.0"

To do so, I'll use a fake log generator. I've edited the project so you can use it immediately with lifter; see the example/fake-logs.py file.

We'll generate 5,000 log entries, which will make a sufficient data source to test our backend:

pip install fake-factory numpy  # here we install the faker requirements
python2 example/fake-logs.py -n 5000 > /tmp/apache.log

You can open the /tmp/apache.log file to check that everything is okay.

The model

Now that we have some real data to play with, let's express it as a lifter model:

from lifter import models

class LogEntry(models.Model):
    ip = models.CharField()
    date = models.DateTimeField()
    method = models.CharField()
    request_path = models.CharField()
    http_version = models.CharField()
    status_code = models.IntegerField()
    response_size = models.IntegerField()
    referrer = models.CharField()
    user_agent = models.CharField()

Nothing fancy here: we declare a LogEntry model with a few field declarations, corresponding to the fields available in the log file.

Some fields are more specifically typed (response_size with IntegerField and date with DateTimeField), meaning the value for each log entry will be cast to the corresponding Python type (integer and datetime).
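
Conceptually, these casts are the same as what you would do by hand in plain Python; a quick sketch of the equivalent conversions (not lifter's actual implementation):

import datetime

response_size = int('4980')  # IntegerField: cast the captured string to an integer
date = datetime.datetime.strptime('12/Jul/2016:12:22:14', '%d/%b/%Y:%H:%M:%S')  # DateTimeField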

The adapter

Now that we've got our model, we have to write the adapter, which will be responsible for turning each line of our log file into a LogEntry model. First, let's get the regular expression out of the way:

import re

LOG_REGEX = '(?P<ip>[(\d\.)]+) - - \[(?P<date>.*?) -(.*?)\] "(?P<method>\w+) (?P<request_path>.*?) HTTP/(?P<http_version>.*?)" (?P<status_code>\d+) (?P<response_size>\d+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

line = '172.183.134.216 - - [12/Jul/2016:12:22:14 -0700] "GET /wp-content HTTP/1.0" 200 4980 "http://farmer-harris.com/category/index/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; rv:1.9.3.20) Gecko/2013-07-10 02:46:11 Firefox/9.0"'

compiled = re.compile(LOG_REGEX)

match = compiled.match(line)
data = match.groupdict()
print(data)

This will output:

{
    'user_agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; rv:1.9.3.20) Gecko/2013-07-10 02:46:11 Firefox/9.0',
    'date': '12/Jul/2016:12:22:14 -0700',
    'http_version': '1.0',
    'referrer': 'http://farmer-harris.com/category/index/',
    'ip': '172.183.134.216',
    'status_code': '200',
    'method': 'GET',
    'request_path': '/wp-content',
    'response_size': '4980'
}

The previous output proves that our regex is indeed capturing the data we need.
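
One thing worth noting: match() returns None when a line doesn't fit the pattern, so if you were parsing a file line by line outside of lifter, you would want to guard against malformed entries. A minimal sketch:

def parse_line(line):
    # Return the captured fields as a dict, or None for malformed lines
    match = compiled.match(line)
    return match.groupdict() if match else None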

Now, let's write our adapter:

from lifter.adapters import RegexAdapter

class LogEntryFileAdapter(RegexAdapter):

    regex = LOG_REGEX

    def clean_date(self, data, value, model, field):
        date_format = '%d/%b/%Y:%H:%M:%S'
        return field.to_python(self, value, date_format=date_format)

As you can see, the code is pretty straightforward: we set the regex the adapter will use to parse data, and we only override the clean_date method to provide our custom date format.

Here again, we can check that our adapter works properly:

import datetime

adapter = LogEntryFileAdapter()
instance = adapter.parse(line, LogEntry)

assert instance.ip == '172.183.134.216'
assert instance.date == datetime.datetime(2016, 7, 12, 12, 22, 14)
assert instance.method == 'GET'
assert instance.request_path == '/wp-content'
assert instance.http_version == '1.0'
assert instance.status_code == 200
assert instance.response_size == 4980
assert instance.referrer == 'http://farmer-harris.com/category/index/'
assert instance.user_agent == 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; rv:1.9.3.20) Gecko/2013-07-10 02:46:11 Firefox/9.0'

Wonderful, our log line was successfully converted to a regular lifter model!
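
At this point, parsing the whole file would mean calling the adapter ourselves on every line, something like this sketch:

with open('/tmp/apache.log') as f:
    entries = [adapter.parse(line, LogEntry) for line in f]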

Running queries

So far, everything works properly, but we don't want to create each LogEntry instance by hand. That's where the store comes in:

from lifter.backends import filesystem

store = filesystem.FileStore(path='/tmp/apache.log')
manager = store.query(LogEntry, adapter=LogEntryFileAdapter())

And here we go! With only two lines, we wrapped all our previous code and made our log file queryable.

Maybe you don't believe me, so let's dive into the data:

manager.all().count()
>>> 5000

manager.filter(LogEntry.status_code == 200).count()
>>> 4463

manager.order_by(~LogEntry.response_size).values_list('response_size', 'ip', 'request_path')[:10]
>>> [(5175, '64.219.210.186', '/wp-content'),
     (5163, '157.102.97.30', '/search/tag/list'),
     (5160, '240.199.75.110', '/posts/posts/explore'),
     (5158, '250.236.19.198', '/list'),
     (5156, '154.52.202.165', '/apps/cart.jsp?appID=5056'),
     (5155, '2.132.244.147', '/list'),
     (5149, '20.32.252.68', '/search/tag/list'),
     (5146, '77.223.24.206', '/apps/cart.jsp?appID=9970'),
     (5145, '83.4.117.89', '/explore'),
     (5144, '183.8.119.188', '/list')]

from lifter.aggregates import Avg, Sum, Max, Min
manager.aggregate(Max('response_size'), Min('response_size'), Sum('response_size'), Avg('response_size'))
>>> {'response_size__avg': 4999.165,
     'response_size__max': 5175,
     'response_size__min': 4830,
     'response_size__sum': 24995825}

qs = manager.filter(LogEntry.status_code == 200).exclude(LogEntry.request_path == '/list')

for entry in qs.order_by(LogEntry.date)[:5]:
    print('Hello there, I was generated on {0}'.format(entry.date))

>>> Hello there, I was generated on 2016-07-12 12:19:13
>>> Hello there, I was generated on 2016-07-12 12:22:36
>>> Hello there, I was generated on 2016-07-12 12:28:49
>>> Hello there, I was generated on 2016-07-12 12:29:42
>>> Hello there, I was generated on 2016-07-12 12:32:37
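
One capability from our initial list, the total bytes sent over a period, follows the same pattern. Assuming date comparisons and chained filters work like the examples above (an assumption on my side; the exact lookup syntax may differ), a sketch would be:

import datetime
from lifter.aggregates import Sum

start = datetime.datetime(2016, 7, 12, 12, 20, 0)
end = datetime.datetime(2016, 7, 12, 12, 30, 0)

# Total response size for entries within the time window
manager.filter(LogEntry.date >= start).filter(LogEntry.date <= end).aggregate(Sum('response_size'))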

If we remove all the checks and prints from the code, we end up with this:

import datetime

from lifter import models
from lifter.adapters import RegexAdapter
from lifter.backends import filesystem


class LogEntry(models.Model):
    ip = models.CharField()
    date = models.DateTimeField()
    method = models.CharField()
    request_path = models.CharField()
    http_version = models.CharField()
    status_code = models.IntegerField()
    response_size = models.IntegerField()
    referrer = models.CharField()
    user_agent = models.CharField()


class LogEntryFileAdapter(RegexAdapter):

    regex = '(?P<ip>[(\d\.)]+) - - \[(?P<date>.*?) -(.*?)\] "(?P<method>\w+) (?P<request_path>.*?) HTTP/(?P<http_version>.*?)" (?P<status_code>\d+) (?P<response_size>\d+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

    def clean_date(self, data, value, model, field):
        date_format = '%d/%b/%Y:%H:%M:%S'
        return field.to_python(self, value, date_format=date_format)


store = filesystem.FileStore(path='/tmp/apache.log')
manager = store.query(LogEntry, adapter=LogEntryFileAdapter())

Which is exactly 30 lines of code (actually fewer if we remove blank lines, but let's not be petty).

Going further

Lifter's documentation is here to help, especially regarding the query language and API.

More backends will be implemented in future releases of lifter (I'd really like an HTTP backend to run queries against a REST API ;)

If you enjoyed this guide, have questions, or spotted some mistakes, please leave a comment. If you want to contribute to lifter's development, you're more than welcome, and I invite you to visit the code repository.

Thank you for reading!
