I first joined Sentia as a co-op student from January 2023 to August 2023, an 8-month term that was the final co-op placement of my undergraduate career. Directly afterwards, during the final 8 months of my degree, I continued to work for the company part-time. After graduation, I was offered a permanent full-time position, which I held from June 2024 to December 2024. The following is a major project I worked on: parts of it began during my co-op term, and the more complex elements were completed during my time as a full-time employee.
At Sentia, I led the design and implementation of a centralized log management system that aggregated and processed logs from a wide variety of customer devices. These devices generated logs in different formats and at different frequencies, which presented unique challenges for data ingestion. The project was a culmination of my studies in data engineering and software architecture, bringing together the medallion architecture, Azure Databricks, PySpark, and Azure Data Factory, and it transformed how we monitored customer infrastructure.
Our customers operated in distributed cloud environments, generating gigabytes of server logs daily from several device types such as switches, servers, storage devices, virtual machines (VMs), and other network appliances. These logs were critical for monitoring system health and detecting security breaches. However, the logs were stored in siloed systems across multiple cloud providers (e.g., AWS, Azure, GCP) and on-premise servers. Retrieving and analyzing this data was a manual, time-consuming process, often taking days. This delay hindered incident response and posed significant risks to our service.
The challenge was clear: How could we consolidate, process, and analyze logs from 50+ diverse environments in near-real time, while ensuring scalability, reliability, and cost efficiency?
The first step was to build a robust data ingestion pipeline that could collect logs from 50+ customer environments, each with its own format, protocol, and storage system. I chose Azure Data Factory (ADF) for this task due to its flexibility and scalability.
Design Pattern: I implemented a publisher-subscriber model, where each customer environment acted as a publisher, sending logs to a centralized ingestion endpoint. ADF served as the subscriber, orchestrating the ingestion process.
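To make the two roles concrete, here is a minimal sketch, assuming a simple HTTPS push, of how a customer environment (the publisher) might hand a batch of logs to the centralized ingestion endpoint; the endpoint URL, payload shape, and function name are illustrative rather than taken from the real deployment:
import requests

def publish_logs(batch, customer_id):
    # The customer environment acts as the publisher, pushing a batch of log records
    # to the centralized ingestion endpoint that ADF then processes as the subscriber.
    payload = {"customer_id": customer_id, "records": batch}
    response = requests.post("https://ingest.example.com/logs", json=payload, timeout=30)
    response.raise_for_status()
    return response.status_code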
Technical Implementation: I created custom connectors in ADF to handle diverse log formats (e.g., JSON, CSV, syslog) and protocols (e.g., HTTP, FTP, SFTP). For example, I used ADF’s REST API connector to pull logs from customer APIs and the Blob Storage connector to fetch logs stored in Azure Blob Storage.
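As a rough illustration of the kind of format normalization these connectors fed into, the snippet below dispatches on the log format before records enter the pipeline; the field names and the simplified syslog pattern are assumptions, not the production parsers:
import csv
import io
import json
import re

# Simplified syslog line: "<priority>Mon dd hh:mm:ss host message"
SYSLOG_PATTERN = re.compile(r"<(?P<pri>\d+)>(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\s(?P<message>.*)")

def parse_log(raw_text, log_format):
    # Normalize one raw log document into a list of dictionaries
    if log_format == "json":
        parsed = json.loads(raw_text)
        return parsed if isinstance(parsed, list) else [parsed]
    if log_format == "csv":
        return list(csv.DictReader(io.StringIO(raw_text)))
    if log_format == "syslog":
        matches = (SYSLOG_PATTERN.match(line) for line in raw_text.splitlines())
        return [m.groupdict() for m in matches if m]
    raise ValueError(f"Unsupported log format: {log_format}")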
Automation: The pipeline was scheduled to run at 5-minute intervals, ensuring near-real-time data availability. I also implemented retry mechanisms and dead-letter queues to handle transient failures and ensure data integrity.
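The retry and dead-letter behaviour is easiest to see in code; the sketch below is a minimal Python rendering of that policy (ingest_batch is a hypothetical stand-in for the ADF copy activity, and the retry limits are illustrative, since the real values lived in ADF's pipeline settings):
import time

MAX_RETRIES = 3
BASE_DELAY_SECONDS = 30

def ingest_with_retries(batch):
    # Retry transient failures with a growing delay, then give up into the dead-letter queue
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            ingest_batch(batch)  # hypothetical helper standing in for the ADF copy activity
            return True
        except (ConnectionError, TimeoutError):
            time.sleep(BASE_DELAY_SECONDS * attempt)
    send_to_dead_letter_queue(batch)
    return False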
Once the logs were ingested, the next challenge was to process and transform them into a structured format suitable for analysis. This is where Azure Databricks and PySpark came into play.
Design Pattern: I used a lambda architecture, combining batch and stream processing to handle both historical and real-time data. Azure Databricks served as the processing engine, while PySpark provided the computational power.
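A minimal sketch of how the batch and speed paths coexisted on Databricks is shown below; the mount paths, checkpoint location, and schema fields are assumptions for illustration, and spark refers to the SparkSession Databricks provides:
from pyspark.sql.types import IntegerType, StringType, StructField, StructType, TimestampType

# Illustrative schema for the normalized log records
log_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("status_code", IntegerType()),
    StructField("error_count", IntegerType()),
])

# Batch path: reprocess historical logs that ADF has already landed
historical_df = spark.read.schema(log_schema).json("/mnt/logs/landing/history/")

# Speed path: continuously pick up newly landed files and append them to the bronze layer
streaming_query = (spark.readStream.schema(log_schema)
    .json("/mnt/logs/landing/incoming/")
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/logs/_checkpoints/incoming")
    .start("/mnt/logs/bronze/"))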
Technical Implementation: I wrote PySpark scripts to clean, normalize, and enrich the raw logs. For JSON and XML logs, I used PySpark's inferSchema functionality to automatically detect the structure of the logs, and I wrote custom parsers in Python to extract relevant fields. I also used statistical methods such as Z-score analysis to identify anomalies in log patterns and filter out invalid logs. For instance, I calculated per-customer Z-scores for error rates:
from pyspark.sql.functions import col, mean, stddev
# Keep only logs with recognized status codes, then compute per-customer error statistics
df = df.filter(col("status_code").isin(200, 404, 500))
stats = df.groupBy("customer_id").agg(mean("error_count").alias("mean_error"), stddev("error_count").alias("stddev_error"))
# Join the statistics back so each log can be scored against its customer's baseline
df = df.join(stats, "customer_id").withColumn("z_score", (col("error_count") - col("mean_error")) / col("stddev_error"))
Logs with a Z-score greater than 3 were flagged as anomalies and routed for further investigation. If a log ingestion job failed (e.g., due to network issues or device unavailability), the system would automatically retry the job after a configurable delay. Logs that could not be processed after multiple retries were sent to a dead-letter queue for manual inspection and reprocessing, for example:
# After exhausting the configured retries, park the log for manual inspection
if retry_count > MAX_RETRIES:
    send_to_dead_letter_queue(log)
Scalability: Azure Databricks’ cluster auto-scaling allowed the system to handle varying data volumes, adding workers during ingestion spikes and releasing them during quiet periods.
To store the processed logs, I designed a medallion architecture, a multi-layered data storage pattern that balances cost, performance, and accessibility.
Bronze Layer: This layer stored raw, unprocessed logs in their original format. It served as a "source of truth" for debugging and auditing.
Silver Layer: Here, logs were cleaned, normalized, and enriched. This layer was optimized for query performance, with indexed columns and partitioned data.
Gold Layer: This layer contained aggregated and summarized data, such as daily error rates and security compliance metrics. It was designed for fast, analytical queries.
Technical Implementation: I used Azure Data Lake Storage (ADLS) for the bronze layer, Delta Lake for the silver layer, and Azure SQL Database for the gold layer. Delta Lake’s ACID transactions and schema enforcement ensured data consistency and reliability.
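In code, the hop between layers looked roughly like the sketch below; the storage paths, table and column names, and JDBC connection details are placeholders rather than the real configuration:
# Silver: cleaned, normalized logs written as a partitioned Delta table
clean_df.write.format("delta").mode("append").partitionBy("customer_id", "log_date").save("abfss://logs@examplestore.dfs.core.windows.net/silver/")

# Gold: daily aggregates pushed to Azure SQL Database for fast analytical queries
sql_jdbc_url = "jdbc:sqlserver://example.database.windows.net:1433;database=logmetrics"
daily_metrics_df.write.jdbc(url=sql_jdbc_url, table="gold.daily_error_rates", mode="append", properties={"user": "loader", "password": "<secret>"})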
To provide customers with real-time insights, I built a monitoring dashboard using Power BI. The dashboard visualized key metrics like server uptime, error rates, and security compliance status.
Automation: I configured Azure Logic Apps to send automated alerts for critical events, such as server downtime or security breaches. For example:
{
  "trigger": "When a new log is added",
  "conditions": [
    {
      "field": "status_code",
      "operator": "equals",
      "value": "500"
    }
  ],
  "actions": [
    {
      "type": "SendEmail",
      "to": "admin@customer.com",
      "subject": "Critical Error Detected"
    }
  ]
}
One of the most challenging aspects of the project was diagnosing and resolving performance bottlenecks in the data processing pipeline. For example, during load testing, we noticed that certain PySpark jobs were taking significantly longer to complete than others.
Diagnosis: Using the Spark UI, I identified that the bottleneck was caused by data skew, with a few partitions processing significantly more data than others due to the uneven distribution of log timestamps.
Solution: I implemented salting, a technique that distributes data more evenly across partitions. For example:
from pyspark.sql.functions import col
# Derive a salt bucket from the epoch-seconds timestamp so skewed time ranges spread across partitions
df = df.withColumn("salt", (col("timestamp").cast("long") % 10))
df = df.repartition("salt")
This change, together with related tuning, reduced processing times by 40%.
The medallion architecture (bronze, silver, gold layers) was implemented to optimize storage costs and query performance. However, this approach introduced several challenges:
Challenge: Ensuring consistency across the bronze (raw), silver (cleaned), and gold (aggregated) layers required careful handling of schema changes and data updates.
Solution: I used Delta Lake to enforce schema validation and enable ACID transactions.
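A sketch of what that looked like in practice: upserting a batch of cleaned records into the silver table as a single transaction, so downstream readers never see a half-applied batch (the table path and the log_id join key are illustrative):
from delta.tables import DeltaTable

# Upsert the latest cleaned batch into the silver Delta table in one ACID transaction;
# schema enforcement rejects records whose columns do not match the table definition.
silver_table = DeltaTable.forPath(spark, "/mnt/logs/silver/")
(silver_table.alias("s")
    .merge(cleaned_batch_df.alias("u"), "s.log_id = u.log_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())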
Bug: During network interruptions, partial logs were sometimes ingested, leading to data corruption.
Solution: I implemented a checksum validation step to detect and discard partial logs:
def validate_log(log):
    # Recompute the checksum and compare it with the one shipped with the log
    expected_checksum = calculate_checksum(log)
    actual_checksum = log["checksum"]
    return expected_checksum == actual_checksum
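For completeness, a plausible implementation of the calculate_checksum helper is sketched below; the actual scheme depended on what each device shipped alongside its logs, so the SHA-256 digest and the field handling here are assumptions:
import hashlib
import json

def calculate_checksum(log):
    # Recompute a SHA-256 digest over the log contents, excluding the transmitted checksum field
    payload = {key: value for key, value in log.items() if key != "checksum"}
    serialized = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()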
The centralized log management system delivered transformative results:
95% Reduction in Data Retrieval Time: Customers could access and analyze logs in near-real time, improving incident response times.
60% Reduction in Storage Costs: The medallion architecture and Delta Lake optimizations significantly reduced storage costs.
Enhanced Security and Compliance: Real-time monitoring and automated alerts helped customers maintain compliance and respond to security threats proactively.
This project was a testament to the power of data-driven decision-making and continuous optimization. It also reinforced the importance of root cause analysis in building reliable, high-performance systems.