Fabrice Triboix

Fabrice is a cloud architect and software developer with more than 20 years of experience with Cisco, Samsung, Philips, Alcatel, and Sagem.

Elasticsearch is a powerful software solution designed to quickly search through huge amounts of data. Combined with Logstash and Kibana, it forms the informally named “ELK stack,” and is often used to collect, temporarily store, analyze, and visualize log data. A few other pieces of software are usually needed, such as Filebeat to send the logs from the server to Logstash, and Elastalert to generate alerts based on the results of some analysis run on the data stored in Elasticsearch.

The ELK Stack is Powerful, But…

My experience with using ELK for managing logs is quite mixed. On the one hand, it’s very powerful and the range of its capabilities is quite impressive. On the other hand, it’s tricky to set up and can be a headache to maintain.

The fact is that Elasticsearch is very good in general and can be used in a wide variety of scenarios; it can even be used as a search engine! But since it is not specialized for managing log data, it requires more configuration work to customize its behavior for the specific needs of managing such data.

Setting up an ELK cluster is quite tricky and required me to fiddle with a number of parameters in order to finally get it up and running. Then came the work of configuring it. In my case, I had five different pieces of software to configure (Filebeat, Logstash, Elasticsearch, Kibana, and Elastalert). This can be quite a tedious job, because I had to read through the documentation and debug whichever element of the chain wasn’t talking to the next one. Even after you finally get your cluster up and running, you still need to perform routine maintenance operations on it: patching, upgrading the OS packages, checking CPU, RAM, and disk usage, making minor adjustments as required, and so on.

My whole ELK stack stopped working after a Logstash update. Upon closer examination, it turned out that, for some reason, the ELK developers had decided to change a keyword in their config file and pluralize it. That was the last straw, and I decided to look for a better solution (at least, a better solution for my particular needs).

I wanted to store logs generated by Apache and various PHP and node apps, and to parse them to find patterns indicative of bugs in the software. The solution I found was the following:

  • Install CloudWatch Agent on the target.
  • Configure CloudWatch Agent to ship the logs to CloudWatch Logs (see the sample agent configuration after this list).
  • Trigger invocation of Lambda functions to process the logs.
  • Have the Lambda function post a message to a Slack channel if a pattern is found.
  • Where possible, apply filters to the CloudWatch log groups to avoid invoking the Lambda function for every single log entry (which could quickly drive up the cost).
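
For the second step, here is a rough sketch of what the CloudWatch Agent configuration could look like. The log file paths and log group names are assumptions on my part; they must match your Apache setup and the log groups created by the CloudFormation template further down (the template below doesn’t set explicit log group names, so you would either add a LogGroupName property there or adjust the names here):

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/apache2/access.log",
            "log_group_name": "apache-access",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/apache2/error.log",
            "log_group_name": "apache-error",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}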

And, at a high level, that’s it! A 100% serverless solution that works fine without any maintenance and scales well without any extra work. The advantages of such serverless solutions over a cluster of servers are numerous:

  • In essence, all the routine maintenance operations you would regularly perform on a cluster of servers are now taken care of by the cloud provider. Any underlying server is patched, upgraded, and maintained for you without you even knowing about it.
  • You no longer need to monitor your cluster, and you can delegate all scaling concerns to the cloud provider. Indeed, a serverless setup like the one described above scales automatically without you having to do anything!
  • The solution described above requires less configuration, and the cloud provider is unlikely to make breaking changes to the configuration format.
  • Finally, it is quite easy to write a few CloudFormation templates to express all of this as infrastructure-as-code. Doing the same to set up a whole ELK cluster would require a lot of work.

Configuring Slack Alerts

So now let’s get into the details! Let’s explore what a CloudFormation template would look like for such a setup, complete with Slack webhooks for alerting engineers. We need to do all the Slack setup first, so let’s dive into it.

AWSTemplateFormatVersion: 2010-09-09

Description: Setup log processing

Parameters:
  SlackWebhookHost:
    Type: String
    Description: Host name for Slack web hooks
    Default: hooks.slack.com

  SlackWebhookPath:
    Type: String
    Description: Path part of the Slack webhook URL
    Default: /services/YOUR/SLACK/WEBHOOK

You will need to set up your Slack workspace for this; check out this WebHooks for Slack guide for additional info.

Once you’ve created your Slack app and configured an incoming hook, the hook URL becomes a parameter of your CloudFormation stack.
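
Before deploying anything, you can sanity-check the hook itself with a few lines of Python mirroring what the Lambda functions below do (the webhook path is, of course, a placeholder):

from http.client import HTTPSConnection
import json

# Send a test message to the Slack incoming webhook
cnx = HTTPSConnection("hooks.slack.com", timeout=5)
cnx.request("POST", "/services/YOUR/SLACK/WEBHOOK", json.dumps({'text': "Webhook test"}))
resp = cnx.getresponse()
print(resp.status, resp.read())  # A 200 response means Slack accepted the message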

Resources:
  ApacheAccessLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 100  # Or whatever is good for you

  ApacheErrorLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 100  # Or whatever is good for you

Here we created two log groups: one for the Apache access logs, the other for the Apache error logs.

I didn’t configure any lifecycle mechanism for the log data because that’s outside the scope of this article. In practice, you would probably want to shorten the retention window and design S3 lifecycle policies to move the logs to Glacier after a certain amount of time.
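
As a sketch of what that could look like: assuming you export the logs to a dedicated S3 bucket (the export mechanism itself is also out of scope here), a lifecycle rule along these lines would do it. The bucket resource and the durations are purely illustrative:

  LogArchiveBucket:
    Type: AWS::S3::Bucket
    Properties:
      LifecycleConfiguration:
        Rules:
          - Id: ArchiveOldLogs
            Status: Enabled
            Transitions:
              - StorageClass: GLACIER
                TransitionInDays: 30  # Hypothetical: archive after a month
            ExpirationInDays: 365     # Hypothetical: delete after a year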

Lambda Function to Process Access Logs

Now let’s implement the Lambda function that will process the Apache access logs.

  BasicLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Here we created an IAM role that will be attached to the Lambda functions to allow them to perform their duties. In effect, the AWSLambdaBasicExecutionRole is (despite its name) an IAM policy provided by AWS. It just allows the Lambda function to create its log group and log streams within that group, and then to send its own logs to CloudWatch Logs.
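
For reference, the managed policy behind that ARN boils down to something like the following IAM policy document (a paraphrase from memory, not the authoritative text):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}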

  ProcessApacheAccessLogFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt BasicLambdaExecutionRole.Arn
      Runtime: python3.7
      Timeout: 10
      Environment:
        Variables:
          SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
          SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
      Code:
        ZipFile: |
          import base64
          import gzip
          import json
          import os
          from http.client import HTTPSConnection

          def handler(event, context):
              # `awslogs.data` is base64-encoded gzip'ed JSON
              tmp = event['awslogs']['data']
              tmp = base64.b64decode(tmp)
              tmp = gzip.decompress(tmp)
              tmp = json.loads(tmp)
              events = tmp['logEvents']
              for event in events:
                  raw_log = event['message']
                  log = json.loads(raw_log)
                  if log['status'][0] == "5":
                      # This is a 5XX status code
                      print(f"Received an Apache access log with a 5XX status code: {raw_log}")
                      slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                      slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                      print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                      cnx = HTTPSConnection(slack_host, timeout=5)
                      cnx.request("POST", slack_path, json.dumps({'text': raw_log}))
                      # It's important to read the response; if the cnx is closed
                      # too quickly, Slack might not post the msg
                      resp = cnx.getresponse()
                      resp_content = resp.read()
                      resp_code = resp.status
                      assert resp_code == 200

So here we are defining a Lambda function to process Apache access logs. Please note that I am not using the common log format, which is the default on Apache. I configured the access log format like this (you will notice that it actually generates JSON-formatted logs, which makes processing further down the line a lot easier):

LogFormat "{\"vhost\": \"%v:%p\", \"client\": \"%a\", \"user\": \"%u\", \"timestamp\": \"%{%Y-%m-%dT%H:%M:%S}t\", \"request\": \"%r\", \"status\": \"%>s\", \"size\": \"%O\", \"referer\": \"%{Referer}i\", \"useragent\": \"%{User-Agent}i\"}" json
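
With this format, a single access log entry comes out as a JSON object like the following (all values are made up for illustration):

{"vhost": "www.example.com:443", "client": "203.0.113.42", "user": "-", "timestamp": "2019-06-05T14:23:02", "request": "GET /index.php HTTP/1.1", "status": "503", "size": "1234", "referer": "-", "useragent": "Mozilla/5.0"}

This is the string the Lambda function above receives in each event['message'], which is why the log['status'][0] check works.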

This Lambda function is written in Python 3. It takes the log line sent from CloudWatch and can search for patterns. In the example above, it simply detects HTTP requests that resulted in a 5XX status code and posts a message to a Slack channel.

You can do anything you like in terms of pattern detection, and the fact that it’s a true programming language (Python), as opposed to just regex patterns in a Logstash or Elastalert config file, gives you a lot of opportunities to implement complex pattern recognition.
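
As a hypothetical example of going beyond a per-line check, the handler loop could collect the parsed entries and flag clients that hammer the server with repeated 5XX errors within one batch (the field names match the JSON log format above; the threshold is arbitrary):

from collections import Counter

def find_suspicious_clients(logs, threshold=3):
    """Return clients causing at least `threshold` 5XX errors in this
    batch of parsed JSON access log entries."""
    errors = Counter(log['client'] for log in logs if log['status'].startswith('5'))
    return [client for client, count in errors.items() if count >= threshold]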

Revision Control

A quick word about revision control: I found that inlining the code in CloudFormation templates is quite acceptable and convenient for small utility Lambda functions such as this one. Of course, for a large project involving many Lambda functions and layers, this would most probably be inconvenient, and you would need to use SAM.

  ApacheAccessLogFunctionPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref ProcessApacheAccessLogFunction
      Action: lambda:InvokeFunction
      Principal: logs.amazonaws.com
      SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*

The above gives permission to CloudWatch Logs to call your Lambda function. One word of caution: I found that using the SourceAccount property can lead to conflicts with the SourceArn.

Generally speaking, I would recommend not including it when the service calling the Lambda function is in the same AWS account. The SourceArn will forbid other accounts from calling the Lambda function anyway.

  ApacheAccessLogSubscriptionFilter:
    Type: AWS::Logs::SubscriptionFilter
    DependsOn: ApacheAccessLogFunctionPermission
    Properties:
      LogGroupName: !Ref ApacheAccessLogGroup
      DestinationArn: !GetAtt ProcessApacheAccessLogFunction.Arn
      FilterPattern: "{$.status = 5*}"

The subscription filter resource is the link between CloudWatch Logs and Lambda. Here, logs sent to the ApacheAccessLogGroup will be forwarded to the Lambda function we defined above, but only those logs that pass the filter pattern. Here, the filter pattern expects some JSON as input (filter patterns start with '{' and end with '}'), and will match the log entry only if it has a field status that starts with “5”.

This means we invoke the Lambda function only when the HTTP status code returned by Apache is a 5XX code, which usually means something quite bad is going on. This ensures we don’t call the Lambda function too often, keeping unnecessary costs at bay.

More information on filter patterns can be found in Amazon CloudWatch documentation. The CloudWatch filter patterns are quite good, although obviously not as powerful as Grok.
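
To give a flavor of the syntax, here are two more patterns consistent with the JSON log format used in this article (illustrative only):

{ $.status = 5* && $.vhost = "www.example.com*" }   matches 5XX errors on one specific vhost only
{ $.status = 4* || $.status = 5* }                  matches both 4XX and 5XX errors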

Note the DependsOn field, which ensures that CloudWatch Logs is allowed to call the Lambda function before the subscription is created. This is just a cherry on the cake; it’s most probably unnecessary, as in a real-case scenario Apache probably wouldn’t receive requests before a few seconds anyway (e.g., the time needed to link the EC2 instance to a load balancer and have the load balancer recognize the EC2 instance as healthy).

Lambda Function to Process Error Logs

Now let’s have a look at the Lambda function that will process the Apache error logs.

  ProcessApacheErrorLogFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt BasicLambdaExecutionRole.Arn
      Runtime: python3.7
      Timeout: 10
      Environment:
        Variables:
          SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
          SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
      Code:
        ZipFile: |
          import base64
          import gzip
          import json
          import os
          from http.client import HTTPSConnection

          def handler(event, context):
              # `awslogs.data` is base64-encoded gzip'ed JSON
              tmp = event['awslogs']['data']
              tmp = base64.b64decode(tmp)
              tmp = gzip.decompress(tmp)
              tmp = json.loads(tmp)
              events = tmp['logEvents']
              for event in events:
                  raw_log = event['message']
                  log = json.loads(raw_log)
                  if log['level'] in ["error", "crit", "alert", "emerg"]:
                      # This is a serious error message
                      msg = log['msg']
                      if msg.startswith("PHP Notice") or msg.startswith("PHP Warning"):
                          print(f"Ignoring PHP notices and warnings: {raw_log}")
                      else:
                          print(f"Received a serious Apache error log: {raw_log}")
                          slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                          slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                          print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                          cnx = HTTPSConnection(slack_host, timeout=5)
                          cnx.request("POST", slack_path, json.dumps({'text': raw_log}))
                          # It's important to read the response; if the cnx is closed
                          # too quickly, Slack might not post the msg
                          resp = cnx.getresponse()
                          resp_content = resp.read()
                          resp_code = resp.status
                          assert resp_code == 200

This second Lambda function processes Apache error logs and will post a message to Slack only when a serious error is encountered. In this case, PHP notice and warning messages are not considered serious enough to trigger an alert.

Again, this function expects the Apache error log to be JSON-formatted. So here is the error log format string I have been using:

ErrorLogFormat "{\"vhost\": \"%v\", \"timestamp\": \"%{cu}t\", \"module\": \"%-m\", \"level\": \"%l\", \"pid\": \"%-P\", \"tid\": \"%-T\", \"oserror\": \"%-E\", \"client\": \"%-a\", \"msg\": \"%M\"}"
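
An entry produced by this format looks roughly like this (values invented for illustration):

{"vhost": "www.example.com", "timestamp": "2019-06-05 14:23:02.123456", "module": "php7", "level": "error", "pid": "12345", "tid": "140123456789", "oserror": "-", "client": "203.0.113.42", "msg": "PHP Fatal error: Uncaught Error: Call to undefined function foo()"}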

  ApacheErrorLogFunctionPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref ProcessApacheErrorLogFunction
      Action: lambda:InvokeFunction
      Principal: logs.amazonaws.com
      SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*

This resource grants permissions to CloudWatch Logs to call your Lambda function.

  ApacheErrorLogSubscriptionFilter:
    Type: AWS::Logs::SubscriptionFilter
    DependsOn: ApacheErrorLogFunctionPermission
    Properties:
      LogGroupName: !Ref ApacheErrorLogGroup
      DestinationArn: !GetAtt ProcessApacheErrorLogFunction.Arn
      FilterPattern: '{$.msg != "PHP Warning*" && $.msg != "PHP Notice*"}'

Finally, we link CloudWatch Logs to the Lambda function using a subscription filter on the Apache error log group. Note the filter pattern, which ensures that log entries whose message starts with “PHP Warning” or “PHP Notice” don’t trigger a call to the Lambda function.

Final Thoughts, Pricing, and Availability

One last word about costs: This solution is much cheaper than operating an ELK cluster. The logs stored in CloudWatch are priced at the same level as S3, and Lambda allows one million calls per month as part of its free tier. This will probably be enough for a website with moderate traffic (provided you used the CloudWatch log filters described above), especially if you coded it well and it doesn’t have too many errors!

Also, please note that Lambda functions support up to 1,000 concurrent executions. At the time of writing, this is a hard limit in AWS that can’t be changed. However, you can expect each call to the above functions to last about 30-40 ms. This should be fast enough to handle rather heavy traffic. If your workload is so intense that you hit this limit, you probably need a more complex solution based on Kinesis, which I might cover in a future article.
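
As a side note (this is an extra guardrail I would suggest, not part of the original setup), you can cap how much of that concurrency quota a single function may consume by adding reserved concurrency to its definition:

  ProcessApacheAccessLogFunction:
    Type: AWS::Lambda::Function
    Properties:
      # ...same properties as shown earlier...
      ReservedConcurrentExecutions: 10  # Hypothetical cap: at most 10 concurrent runs

Invocations beyond the cap get throttled rather than billed, at the price of potentially delayed alerts.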


Understanding the basics

  • What is the ELK stack?

    ELK is an acronym for Elasticsearch-Logstash-Kibana. Additional software items are often needed, such as Beats (a collection of tools that send logs and metrics to Logstash) and Elastalert (which generates alerts based on time-series data stored in Elasticsearch).

  • Is ELK stack free?

    The short answer is: yes. The various software items that make up the ELK stack come under various software licenses, but they generally all have a license that allows free use without any support. It would be up to you, however, to set up and maintain the ELK cluster.

  • How does the ELK stack work?

    The ELK stack is highly configurable so there isn’t a single way to make it work. For example, here is the path of an Apache log entry: Filebeat reads the entry and sends it to Logstash, which parses it, and sends it to Elasticsearch, which saves and indexes it. Kibana can then retrieve the data and display it.
