Building Automation for Fraud Detection Using OpenSearch and Terraform

Organizations that handle online payments continuously monitor and guard against fraudulent activity. Transactional fraud usually presents itself as discrete data points, making it challenging to identify multiple actors involved in the same group of transactions. Even a single actor operating over a period of time can be hard to detect. Visibility is key to preventing fraud incidents and to giving data, security, and operations engineers meaningful knowledge of the activity within your environment.

Understanding the connections between individual data points can reduce the time customers need to detect and prevent fraud. You can use a graph database to store transaction information along with the relationships between individual data points. Analyzing those relationships through a graph database can uncover patterns that are difficult to identify with relational tables. Fraud graphs enable customers to find attributes shared between transactions, such as phone numbers, locations, and origin and destination accounts. Combining fraud graphs with full-text search provides further benefits, as it can simplify analysis and integration with existing applications.

In our solution, financial analysts can upload graph data, which is automatically ingested into the Amazon Neptune graph database and replicated into Amazon OpenSearch Service for analysis. Data ingestion is automated through Amazon Simple Storage Service (Amazon S3) and Amazon Simple Queue Service (Amazon SQS) integration. Data replication is handled by AWS Lambda functions, with AWS Step Functions providing orchestration. The design uses open source tools and AWS managed services to build resources and is available in this GitHub repository (https://github.com/aws-samples/neptune-fraud-detection-with-opensearch) under the MIT-0 license. You will use Terraform and Docker to deploy the architecture, and you will be able to send search requests to the system to explore the dataset.

Solution overview

This solution takes advantage of native integration between AWS services for scalability and performance, as well as the Neptune-to-OpenSearch Service replication pattern described in Neptune’s official documentation.

Figure 1: An architectural diagram that illustrates the infrastructure state and workflow as defined in the Terraform templates.

The process for this solution consists of the following steps, shown in the architecture diagram (Figure 1):

A financial analyst uploads graph data files to an Amazon S3 bucket.

Note: The data files are in a Gremlin load data format (CSV) and can include vertex files and edge files.

The upload triggers a PUT object event notification whose destination is an Amazon SQS queue.
The SQS queue is configured as an AWS Lambda event source, which invokes a Lambda function.
This Lambda function sends an HTTP request to an Amazon Neptune database to load the data stored in the S3 bucket (a sketch of this request appears after this list of steps).
The Neptune database reads data from the S3 endpoint defined in the Lambda request and loads the data into the graph database.
An Amazon EventBridge rule is scheduled to run every 5 minutes. This rule targets an AWS Step Functions state machine to create a new execution.
The Neptune Poller step function (state machine) replicates the data in the Neptune database to an OpenSearch Service cluster.
Note: The Neptune Poller step function is responsible for continually syncing new data after the initial data upload using Neptune Streams.

Users can access the data replicated from the Neptune database through Amazon OpenSearch Service.
Note: A Lambda function is invoked to send a search request or query to an OpenSearch Service endpoint to get results.
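
To make steps 3 and 4 more concrete, here is a minimal, hypothetical sketch of the kind of HTTP request a Lambda function can send to Neptune’s bulk loader API. The cluster endpoint, bucket name, and IAM role ARN are placeholders, and the Lambda shipped in the GitHub repository is the authoritative implementation (it may, for example, also sign requests if IAM authentication is enabled on the cluster).

import json
import urllib3

# Placeholder values -- substitute your own cluster endpoint, bucket, and role ARN.
NEPTUNE_LOADER_ENDPOINT = "https://my-neptune-cluster.cluster-xxxx.us-west-2.neptune.amazonaws.com:8182/loader"

payload = {
    "source": "s3://neptunestream-loader-us-west-2-123456789012/",
    "format": "csv",                 # Gremlin load data format
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-west-2",
    "failOnError": "FALSE",
    "queueRequest": "TRUE",
}

http = urllib3.PoolManager()
response = http.request(
    "POST",
    NEPTUNE_LOADER_ENDPOINT,
    body=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
# The loader API responds with a loadId that can be polled for load status.
print(response.status, response.data.decode())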

Prerequisites

To implement this solution, you must have the following prerequisites:

An AWS account with local credentials configured. For more information, check the documentation on configuration and credential file settings.
The latest version of the AWS Command Line Interface (AWS CLI).
An IAM user with Git credentials.
A Git client to clone the source code provided.
A Bash shell.
Docker installed on your local machine.
Terraform installed on your local machine.

Deploying the Terraform templates

The solution is available in this GitHub repository with the following structure:

data: Contains a sample dataset to be used with the solution for demonstration purposes. Information on fictional transactions, identities and devices is represented in files within the nodes/ folder, and relationships between them are represented in files in the edges/ folder.
terraform: This folder contains the Terraform modules to deploy the solution.
documents: This folder contains the architecture diagram image file of the solution.

Create a local directory called NeptuneOpenSearchDemo and clone the source code repository:

mkdir -p $HOME/NeptuneOpenSearchDemo

cd $HOME/NeptuneOpenSearchDemo

git clone https://github.com/aws-samples/neptune-fraud-detection-with-opensearch.git

Change directory into the Terraform directory:

cd $HOME/NeptuneOpenSearchDemo/neptune-fraud-detection-with-opensearch/terraform

Make sure that the Docker daemon is running:

docker info

If the previous command returns an error saying it cannot connect to the Docker daemon, start Docker and run the command again.

Initialize the Terraform folder to install required providers:

terraform init

The solution is deployed in us-west-2 by default. You can change this behavior by modifying the region variable in the variables.tf file.

Deploy the AWS services:

terraform apply -auto-approve

Note: Deployment will take around 30 minutes due to the time necessary to provision the Neptune and OpenSearch Service clusters.

To retrieve the name of the S3 bucket to upload data to:

aws s3 ls | grep "neptunestream-loader.*d$"

Upload the node and edge data to the S3 bucket obtained in the previous step:

aws s3 cp $HOME/NeptuneOpenSearchDemo/neptune-fraud-detection-with-opensearch/data s3://neptunestream-loader-us-west-2-123456789012 --recursive

Note: This is a sample dataset for demonstration purposes only created from the IEEE-CIS Fraud Detection dataset.

Test the solution

After the solution is deployed and the dataset is uploaded to S3, the dataset can be retrieved and explored through a Lambda function that sends a search request to the OpenSearch Service cluster.

Confirm the Lambda function that sends a request to OpenSearch was deployed correctly:

aws lambda get-function --function-name NeptuneStreamOpenSearchRequestLambda --query 'Configuration.[FunctionName, State]'

Invoke the Lambda function to see all records present in OpenSearch that are added from Neptune:

aws lambda invoke --function-name NeptuneStreamOpenSearchRequestLambda response.json

The results of the Lambda invocation are stored in the response.json file. This file contains the total number of records in the cluster and all records ingested up to that point. The solution stores records in the index amazon_neptune. An example of a node with device information looks like this:

{
    "_index": "amazon_neptune",
    "_type": "_doc",
    "_id": "1fb6d4d2936d6f590dc615142a61059e",
    "_score": 1.0,
    "_source": {
        "entity_id": "d3",
        "document_type": "vertex",
        "entity_type": [
            "vertex"
        ],
        "predicates": {
            "deviceType": [
                {
                    "value": "desktop"
                }
            ],
            "deviceInfo": [
                {
                    "value": "Windows"
                }
            ]
        }
    }
}
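
If you would like to query the amazon_neptune index directly rather than through the provided Lambda, the following is a minimal sketch using the opensearch-py client with SigV4 request signing via requests-aws4auth; neither library is prescribed by the solution, and the domain endpoint below is a placeholder you would replace with the value from the OpenSearch Service console or the Terraform outputs.

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

region = "us-west-2"
host = "search-neptunestream-xxxxxxxxxxxx.us-west-2.es.amazonaws.com"  # placeholder endpoint

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, "es", session_token=credentials.token)

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Return the vertex and edge documents replicated from Neptune.
results = client.search(index="amazon_neptune", body={"query": {"match_all": {}}})
print(results["hits"]["total"]["value"])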

Cleaning up

To avoid incurring future charges, clean up the resources deployed in the solution:

terraform destroy -auto-approve

The command will output information on resources being destroyed.

Destroy complete! Resources: 101 destroyed.

Conclusion

Fraud graphs are complementary to other techniques organizations can use to detect and prevent fraud. The solution presented in this blog post reduces the time financial analysts would take to access transactional data by automating data ingestion and replication. It also improves performance for systems with growing volumes of data when compared to executing a large number of insert statements or other API calls.


Root Cause Analysis with DoWhy, an Open Source Python Library for Causal Machine Learning

Identifying the root causes of observed changes in complex systems can be a difficult task that requires both deep domain knowledge and potentially hours of manual work. For instance, we may want to analyze an unexpected drop in profit of a product sold in an online store, where various intertwined factors can impact the overall profit of the product in subtle ways.

Wouldn’t it be nice if we had automated tools to simplify and accelerate this task? A library that can automatically identify the root causes of an observed effect with a few lines of code?

This is the idea behind the root cause analysis (RCA) features of the DoWhy open source Python library, to which AWS contributed a large set of novel causal machine learning (ML) algorithms last year. These algorithms are the result of years of Amazon research on graphical causal models and were released in DoWhy v0.8 in July last year. Furthermore, AWS joined forces with Microsoft to form a new organization called PyWhy, now home of DoWhy. PyWhy’s mission, according to the charter, is to “build an open source ecosystem for causal machine learning that moves forward the state of the art and makes it available to practitioners and researchers. We build and host interoperable libraries, tools, and other resources spanning a variety of causal tasks and applications, connected through a common API on foundational causal operations and a focus on the end-to-end-analysis process.”

In this article, we will have a closer look at these algorithms. Specifically, we want to demonstrate their applicability in the context of root cause analysis in complex systems.

Applying DoWhy’s causal ML algorithms to this kind of problem can reduce the time to find a root cause significantly. To demonstrate this, we will dive deep into an example scenario based on randomly generated synthetic data where we know the ground truth.

The scenario

Suppose we are selling a smartphone in an online shop with a retail price of $999. The overall profit from the product depends on several factors, such as the number of sold units, operational costs or ad spending. On the other hand, the number of sold units, for instance, depends on the number of visitors on the product page, the price itself and potential ongoing promotions. Suppose we observe a steady profit of our product over the year 2021, but suddenly, there is a significant drop in profit at the beginning of 2022. Why?

In the following scenario, we will use DoWhy to get a better understanding of the causal impacts of factors influencing the profit and to identify the causes for the profit drop. To analyze our problem at hand, we first need to define our belief about the causal relationships. For this, we collect daily records of the different factors affecting profit. These factors are:

Shopping Event?: A binary value indicating whether a special shopping event took place, such as Black Friday or Cyber Monday sales.

Ad Spend: Spending on ad campaigns.

Page Views: Number of visits on the product detail page.

Unit Price: Price of the device, which could vary due to temporary discounts.

Sold Units: Number of sold phones.

Revenue: Daily revenue.

Operational Cost: Daily operational expenses, which include production costs, spending on ads, administrative expenses, etc.

Profit: Daily profit.

Looking at these attributes, we can use our domain knowledge to describe the cause-effect relationships in the form of a directed acyclic graph, which represents our causal graph. The graph is shown here:

An arrow from X to Y (X → Y) in this diagram describes a direct causal relationship, where X is the cause of Y. In this scenario we know the following:

Shopping Event? impacts:
→ Ad Spend: To promote the product on special shopping events, we require additional ad spending.
→ Page Views: Shopping events typically attract a large number of visitors to an online retailer due to discounts and various offers.
→ Unit Price: Typically, retailers offer some discount on the usual retail price on days with a shopping event.
→ Sold Units: Shopping events often take place during annual celebrations like Christmas, Father’s day, etc, when people often buy more than usual.

Ad Spend impacts:
→ Page Views: The more we spend on ads, the more likely people will visit the product page.
→ Operational Cost: Ad spending is part of the operational cost.

Page Views impacts:
→ Sold Units: The more people visit the product page, the more likely the product is to be bought. This is quite obvious: if no one visited the page, there would be no sales.

Unit Price impacts:
→ Sold Units: The higher/lower the price, the fewer/more units are sold.
→ Revenue: The daily revenue typically consists of the product of the number of sold units and the unit price.

Sold Units impacts:
→ Revenue: Same argument as before; the number of sold units heavily influences the revenue.
→ Operational Cost: There is a manufacturing cost for each unit we produce and sell. The more units we sell, the higher the revenue, but also the higher the manufacturing costs.

Operational Cost impacts:
→ Profit: The profit is based on the generated revenue minus the operational cost.

Revenue impacts:
→ Profit: Same reason as for the operational cost.

Step 1: Define causal models

Now, let us model these causal relationships with DoWhy’s graphical causal model (GCM) module. In the first step, we need to define a so-called structural causal model (SCM), which is a combination of the causal graph and the underlying generative models describing the data generation process.

To model the graph structure, we use NetworkX, a popular open source Python graph library. In NetworkX, we can represent our causal graph as follows:

import networkx as nx

causal_graph = nx.DiGraph([('Page Views', 'Sold Units'),
                           ('Revenue', 'Profit'),
                           ('Unit Price', 'Sold Units'),
                           ('Unit Price', 'Revenue'),
                           ('Shopping Event?', 'Page Views'),
                           ('Shopping Event?', 'Sold Units'),
                           ('Shopping Event?', 'Unit Price'),
                           ('Shopping Event?', 'Ad Spend'),
                           ('Ad Spend', 'Page Views'),
                           ('Ad Spend', 'Operational Cost'),
                           ('Sold Units', 'Revenue'),
                           ('Sold Units', 'Operational Cost'),
                           ('Operational Cost', 'Profit')])

Next, we look at the data from 2021:

import pandas as pd

pd.options.display.float_format = '${:,.2f}'.format  # Format dollar columns
data_2021 = pd.read_csv('2021 Data.csv', index_col='Date')
data_2021.head()

As we see, we have one sample for each day in 2021 with all the variables in the causal graph. Note that in the synthetic data we consider in this blog post, shopping events were also generated randomly.

We defined the causal graph, but we still need to assign generative models to the nodes. With DoWhy, we can either manually specify those models, and configure them if needed, or automatically infer “appropriate” models using heuristics from data. We will leverage the latter here:

from dowhy import gcm

# Create the structural causal model object
scm = gcm.StructuralCausalModel(causal_graph)

# Automatically assign generative models to each node based on the given data
gcm.auto.assign_causal_mechanisms(scm, data_2021)

Whenever available, we recommend assigning models based on prior knowledge as then models would closely mimic the physics of the domain, and not rely on nuances of the data. However, here we asked DoWhy to do this for us instead.

Step 2: Fit causal models to data

After assigning a model to each node, we need to learn the parameters of the model:

gcm.fit(scm, data_2021)

The fit method learns the parameters of the generative models in each node. The fitted SCM can now be used to answer different kinds of causal questions.

Step 3: Answer causal questions

What are the key factors influencing the variance in profit?

At this point, we want to understand which factors drive changes in the Profit. Let us first have a closer look at the Profit over time. For this, we are using pandas to plot the Profit over time for 2021, where the produced plot shows the Profit in dollars on the Y-axis and the time on the X-axis.

data_2021['Profit'].plot(ylabel='Profit in $', figsize=(15,5), rot=45)

We see some significant spikes in the Profit across the year. We can further quantify this by looking at the standard deviation, which we can estimate using the std() function from pandas:

data_2021['Profit'].std()
259247.66010978

The estimated standard deviation of ~259247 dollars is quite significant. Looking at the causal graph, we see that Revenue and Operational Cost have a direct impact on the Profit, but which of them contribute the most to the variance? To find this out, we can make use of the direct arrow strength algorithm that quantifies the causal influence of a specific arrow in the graph:

import numpy as np

def convert_to_percentage(value_dictionary):
    total_absolute_sum = np.sum([abs(v) for v in value_dictionary.values()])
    return {k: abs(v) / total_absolute_sum * 100 for k, v in value_dictionary.items()}

arrow_strengths = gcm.arrow_strength(scm, target_node='Profit')

gcm.util.plot(causal_graph,
              causal_strengths=convert_to_percentage(arrow_strengths),
              figure_size=[15, 10])

In this causal graph, we see how much each node contributes to the variance in Profit. For simplicity, the contributions are converted to percentages. Since Profit itself is only the difference between Revenue and Operational Cost, we do not expect further factors influencing the variance. As we see, the Revenue contributes 74.45 percent and has more impact than the Operational Cost which contributes 25.54 percent. This makes sense seeing that the Revenue typically varies more than the Operational Cost due to the stronger dependency on the number of sold units. Note that DoWhy also supports other kinds of measures, for instance, KL divergence.

While the direct influences are helpful in understanding which direct parents contribute the most to the variance in Profit, this mostly confirms our prior belief. The question of which factor is ultimately responsible for this high variance is, however, still unclear. Revenue itself is simply based on Sold Units and the Unit Price. Although we could recursively apply the direct arrow strength to all nodes, we would not get a correctly weighted insight into the influence of upstream nodes on the variance.

What are the important causal factors contributing to the variance in Profit? To find this out, we can use DoWhy’s intrinsic causal contribution method that attributes the variance in Profit to the upstream nodes in the causal graph. For this, we first define a function to plot the values in a bar plot and then use this to display the estimated contributions to the variance as percentages:

import matplotlib.pyplot as plt

def bar_plot(value_dictionary, ylabel, uncertainty_attribs=None, figsize=(8, 5)):
    value_dictionary = {k: value_dictionary[k] for k in sorted(value_dictionary)}
    if uncertainty_attribs is None:
        uncertainty_attribs = {node: [value_dictionary[node], value_dictionary[node]] for node in value_dictionary}

    _, ax = plt.subplots(figsize=figsize)
    ci_plus = [uncertainty_attribs[node][1] - value_dictionary[node] for node in value_dictionary.keys()]
    ci_minus = [value_dictionary[node] - uncertainty_attribs[node][0] for node in value_dictionary.keys()]
    yerr = np.array([ci_minus, ci_plus])
    yerr[abs(yerr) < 10**-7] = 0
    plt.bar(value_dictionary.keys(), value_dictionary.values(), yerr=yerr, ecolor='#1E88E5', color='#ff0d57', width=0.8)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

    plt.show()

iccs = gcm.intrinsic_causal_influence(scm, target_node='Profit', num_samples_randomization=500)

bar_plot(convert_to_percentage(iccs), ylabel='Variance attribution in %')

The scores shown in this bar chart are percentages indicating how much variance each node contributes to Profit — without inheriting the variance from its parents in the causal graph. As we see quite clearly, the Shopping Event has by far the biggest influence on the variance in our Profit. This makes sense, seeing that sales are heavily impacted during promotion periods like Black Friday or Prime Day and, thus, impact the overall profit. Surprisingly, we also see that factors such as the number of sold units or number of page views have a rather small influence, i.e., the large variance in profit can be almost completely explained by the shopping events. Let’s check this visually by marking the days where we had a shopping event. To do so, we use the pandas plot function again, but additionally mark all points in the plot with a vertical red bar where a shopping event occurred:

data_2021['Profit'].plot(ylabel='Profit in $', figsize=(15,5), rot=45)
plt.vlines(np.arange(0, data_2021.shape[0])[data_2021['Shopping Event?']], data_2021['Profit'].min(), data_2021['Profit'].max(), linewidth=10, alpha=0.3, color='r')

We clearly see that the shopping events coincide with the high peaks in profit. While we could have investigated this manually by looking at all kinds of different relationships or using domain knowledge, the task gets much more difficult as the complexity of the system increases. With a few lines of code, we obtained these insights from DoWhy.

What are the key factors explaining the Profit drop on a particular day?

After a successful year in terms of profit, newer technologies come to the market and, thus, we want to keep the profit up and get rid of excess inventory by selling more devices. In order to increase demand, we therefore lower the retail price by 10% at the beginning of 2022. Based on a prior analysis, we know that a decrease of 10% in the price would roughly increase the demand by 13.75%, a slight surplus. Following the price elasticity of demand model, we expect an increase of around 37.5% in the number of Sold Units. Let us check whether this is true by loading the data for the first day in 2022 and taking the ratio of the numbers of Sold Units from both years for that day:

first_day_2022 = pd.read_csv('2022 First Day.csv', index_col='Date')
(first_day_2022['Sold Units'][0] / data_2021['Sold Units'][0] - 1) * 100
18.946914113077252

Surprisingly, we only increased the number of sold units by ~19%. This will certainly impact the profit given that the revenue is much smaller than expected. Let us compare it with the previous year at the same time:

(1 - first_day_2022['Profit'][0] / data_2021['Profit'][0]) * 100
8.57891513840979

Indeed, the profit dropped by ~8.5%. Why is this the case seeing that we would expect a much higher demand due to the decreased price? Let us investigate what is going on here.

In order to figure out what contributed to the Profit drop, we can make use of DoWhy’s anomaly attribution feature. Here, we only need to specify the target node we are interested in (the Profit) and the anomaly sample we want to analyze (the first day of 2022). These results are then plotted in a bar chart indicating the attribution scores of each node for the given anomaly sample:

attributions = gcm.attribute_anomalies(scm, target_node='Profit', anomaly_samples=first_day_2022)

bar_plot({k: v[0] for k, v in attributions.items()}, ylabel='Anomaly attribution score')

A positive attribution score means that the corresponding node contributed to the observed anomaly, which is in our case the drop in Profit. A negative score of a node indicates that the observed value for the node is actually reducing the likelihood of the anomaly (e.g., a higher demand due to the decreased price should increase the profit). More details about the interpretation of the score can be found in our research paper. Interestingly, the Page Views stand out as a factor explaining the Profit drop that day as indicated in the bar chart shown here.

While this method gives us a point estimate of the attributions for the particular models and parameters we learned, we can also use DoWhy’s confidence interval feature, which incorporates uncertainties about the fitted model parameters and algorithmic approximations:

median_attributions, confidence_intervals = gcm.confidence_intervals(
    gcm.fit_and_compute(gcm.attribute_anomalies,
                        scm,
                        bootstrap_training_data=data_2021,
                        target_node='Profit',
                        anomaly_samples=first_day_2022),
    num_bootstrap_resamples=10)

bar_plot(median_attributions, 'Anomaly attribution score', confidence_intervals)

Note that in this bar chart we see the median attributions over multiple runs on smaller data sets, where each run re-fits the models and re-evaluates the attributions. We get a similar picture as before, but the confidence interval of the attribution to Sold Units also contains zero, meaning its contribution is insignificant. But some important questions still remain: Was this only a coincidence and, if not, which part of our system has changed? To find this out, we need to collect some more data.

What caused the profit drop in Q1 2022?

While the previous analysis is based on a single observation, let us see if this was just a coincidence or if it is a persistent issue. When preparing the quarterly business report, we have some more data available from the first three months. We first check whether the profit dropped on average in the first quarter of 2022 compared to 2021. As before, we can do this by taking the ratio between the average Profit of 2022 and 2021 for the first quarter:

data_first_quarter_2021 = data_2021[data_2021.index <= '2021-03-31']
data_first_quarter_2022 = pd.read_csv('2022 First Quarter.csv', index_col='Date')

(1 - data_first_quarter_2022['Profit'].mean() / data_first_quarter_2021['Profit'].mean()) * 100
13.0494881794224

Indeed, the profit drop is persistent in the first quarter of 2022. Now, what is the root cause of this? Let us apply DoWhy’s distribution change method to identify the part in the system that has changed:

median_attributions, confidence_intervals = gcm.confidence_intervals(
    lambda: gcm.distribution_change(scm,
                                    data_first_quarter_2021,
                                    data_first_quarter_2022,
                                    target_node='Profit',
                                    # Here, we are interested in explaining the differences in the mean.
                                    difference_estimation_func=lambda x, y: np.mean(y) - np.mean(x))
)

bar_plot(median_attributions, 'Profit change attribution in $', confidence_intervals)

In our case, the distribution change method explains the change in the mean of Profit, i.e., a negative value indicates that a node contributes to a decrease and a positive value to an increase of the mean. Using the bar chart, we now get a very clear picture: the change in Unit Price actually makes a slightly positive contribution to the expected Profit due to the increase in Sold Units, but the issue seems to come from Page Views, which has a negative value. While we already identified this as a main driver of the drop at the beginning of 2022, we have now isolated and confirmed that something changed for the Page Views as well. Let’s compare the average Page Views with the previous year.

(1 - data_first_quarter_2022['Page Views'].mean() / data_first_quarter_2021['Page Views'].mean()) * 100
14.347627108364

Indeed, the number of Page Views dropped by ~14%. Since we eliminated all other potential factors, we can now dive deeper into the Page Views and see what is going on there. This is a hypothetical scenario, but we could imagine the drop being due to a change in the search algorithm that ranks this product lower in the results and therefore drives fewer customers to the product page. Knowing this, we could now start mitigating the issue.

With the help of DoWhy’s new features for graphical causal models, we only needed a few lines of code to automatically pinpoint the main drivers of a particular outlier and, especially, were able to identify the main factors that caused a shift in the distribution.

Conclusion

In this article, we have shown how DoWhy can help in root cause analysis of a drop in profits for an example online shop. For this, we looked at DoWhy features, such as arrow strengths, intrinsic causal influences, anomaly attribution and distribution change attribution. But did you know that DoWhy can also be used for estimating average treatment effects, causal structure learning, diagnosis of causal structures, interventions and counterfactuals? If this is interesting to you, we invite you to visit our PyWhy homepage or the DoWhy documentation to learn more. There is also an active community on the DoWhy Discord where scientists and ML practitioners can meet, ask questions and get help. We also host weekly meetings on Discord where we discuss current developments. Come join us!


Top Tools every Software Developer should know in 2022

With the increase in popularity of software development in the market, the adoption of its tools has also increased. Programmers now prefer to use the right software development tool while creating a solution for the client, as it makes their lives easier. Besides, the right set of tools can help in getting the maximum output each day. But this choice might be difficult because of the huge number of software development tools available in the market. So, to make this choice easier for you, in this blog we’ll go through a list of top software development tools in 2022 that can be used to boost the professional performance of a software development team.

What is Software Development?

Software development is the process that software programmers use to create computer programs. The entire process of developing a system for a business organization is known as the Software Development Life Cycle (SDLC). This process includes various phases that offer a proven method for creating products that meet both user requirements and technical specifications. Throughout it, software developers use different types of development tools, and using the right tool can help streamline the entire software development process.

Why Use Software Development Tools?

Developers use software tools to investigate, optimize, and document business processes as well as the software development process itself. With such tools, software developers can deliver projects more productively and manage their workflow more easily.

15 Best Software Development Tools

Some of the top software programming tools every developer can use are:

UltraEdit

UltraEdit is one of the best tools when it comes to creating software with proper security, flexibility, and performance. It comes with an all-access package that gives developers access to various tools like an integrated FTP client, a file finder, and a Git integration solution. It is a very powerful text editor that can handle large files with ease.

Key Features:

It loads and handles very large files while maintaining solid performance, file-load times, and startup speed.
Supports complete OS integration, such as shell extensions and command lines.
You can configure, customize, and reskin the entire application with beautiful themes.
Accesses servers and opens files with its SFTP browser or native FTP.
Helps in finding, comparing, and replacing inside files at blazing speed.
Spots visual differences between code files easily.
The all-access package of UltraEdit costs $99.95 per year.

Atom

Atom is a top integrated development environment (IDE), and its open-source nature lets it run on most popular operating systems. It is a software development tool known for its rich customization and vast list of third-party integrations. Atom’s autocomplete feature enables developers to write code easily and quickly. Besides this, the browser function of this tool simplifies project file management thanks to an interface with numerous panes to view, edit, and compare files, all at once. Basically, Atom is a strong option for developers because it supports every popular framework and programming language.

Key Features:

Atom supports cross-platform editing, which means it works on different operating systems like OS X, Windows, and Linux.
It uses the Electron framework for offering amazing web technologies.
It is a customizable tool that comes with effective features for a better look and feel.
Important features of Atom include smart autocomplete, a built-in package manager, multiple panes, find & replace, and a file system browser.

Quixy

Quixy is used by enterprises for its cloud-based no-code platform approach. This tool helps businesses automate their workflows and create all types of enterprise-grade applications. Besides, it helps in eliminating the manual processes and turning different ideas into apps to make businesses transparent, productive, and innovative. 

Key Features:

Quixy helps in creating an app interface as per the client’s requirement by easily dragging and dropping 40+ form fields.
It seamlessly integrates the third-party app with the help of ready-to-use Webhooks, connectors, and API integrations.
It can model any process and create both simple and complex workflows.
It helps in deploying applications with a single click and making changes anytime.
Quixy also enables the developers to use it on any browser and device even in offline mode.
Offers live actionable dashboards and reports with the idea of exporting data in various formats.

Linx

Linx helps in creating and automating backend apps with a low-coding approach. This tool has the capability to accelerate the design, automation, and development of custom business processes. It offers services for easily integrating systems, apps, and databases. 

Key Features:

Drag and drop, easy-to-use IDE and Server.
It offers live debugging with the use of step-through logic.
Offers 100 pre-built plugins for rapid development.
It automates processes with the help of directory events and timers.

GitHub

GitHub is one of the most popular software development and collaboration tools for code management and review. It enables its users to create software and apps, host the code, manage projects, and review code.

Key Features:

With the help of GitHub, web app developers can easily document their source code.
Some of the features of GitHub like access control and code security make it a more useful tool for all the team members.
GitHub’s project management tools enable app developers to coordinate tasks easily.
This tool can be hosted on servers & cloud platforms and can run on operating systems like Mac and Windows. 

Embold

Embold is one of the most popular tools when it comes to fixing bugs before deployment. It helps in saving a lot of energy and time in the long run. It is a software analytics platform that helps the developers to analyze the source code and uncovers problems that might impact robustness, stability, security, and maintainability.

Key Features:

Embold offers plugins that can help in picking up code vulnerabilities.
It helps in integrating the system seamlessly with Bitbucket, GitHub, and Git.
Embold comes with unique anti-pattern detection that helps in preventing the compounding of unmaintainable code.
With Embold, it is possible to get faster and deeper checks for more than 10 languages.

Zoho Creator

Zoho Creator, a low-code software development tool enables rapid development and deployment of web applications and assists to create powerful enterprise software apps. Besides, it doesn’t require endless lines of code for creating an app. It comes with different features like JavaScript, Artificial Intelligence, Cloud functions, and more. There are more than 4 million users of this tool all over the world and they use it to enhance the productivity of their business.

Key Features:

Zoho Creator enables the creation of more apps with less effort.
It offers excellent security measures.
Creates insightful reports.
Helps in connecting the business data to different teams. 

GeneXus

GeneXus is a software development tool that offers an intelligent platform for creating applications, enabling the automatic development and maintenance of systems. The applications created using GeneXus can be easily adapted to changes. Besides, it is used when the developer has to work with the newest programming languages.

Key Features:

GeneXus offers an AI-based automatic software approach.
It comes with a high degree of flexibility.
Multi-experience apps can be created using this tool.
It has the best app security module.
It offers business process management support.
With GeneXus, developers get the highest level of deployment flexibility.

NetBeans

NetBeans is a very popular open-source and free software development tool. It is written in Java. Developers use NetBeans to create mobile, web, and desktop applications. This tool uses languages like C / C++, JavaScript, PHP, Java, and more.

Key Features:

With the help of NetBeans, a cross-platform system, developers can create apps that can be used on all different platforms like Mac, Linux, Solaris, Windows, etc.
Java apps can be easily created and updated using the NetBeans 8 IDE, a newer edition with improved code analyzers.
NetBeans is a tool that offers the best features like writing bug-free code, Smart Code Editing, quick user interface development, and an easy management process.
NetBeans allows for the creation of well-organized code that eventually helps the app development team to understand the code structure easily. 

Eclipse

Eclipse is another popular IDE that is majorly used by Java developers. This tool is used to create apps that are not only written in Java but also in programming languages like PHP, ABAP, C, C++, C#, etc.

Key Features:

Eclipse, an open-source tool, plays an important role in the development of new and innovative solutions.
It is used by developers for creating desktop, web, and cloud IDEs.
Eclipse Software Development Kit (SDK) is open-source which means that developers can use it freely for creating any type of application with the help of any programming language.
Eclipse helps in code completion, refactoring, syntax checking, error debugging, rich client platform, industrial level of development, and more.
Integrating Eclipse with other frameworks like JUnit and TestNG is very easy.

Bootstrap

Bootstrap is another open-source framework that is used by software development companies for creating responsive websites and mobile-first projects. For this tool, the developers can use technologies like HTML, CSS, and JS. It is widely used and is designed to make websites simpler. 

Key Features:

Bootstrap is a tool that offers built-in components that can be used to assemble responsive websites with a smart drag-and-drop facility.
This open-source toolkit comes with various customization options.
It comes with some amazing features like a responsive grid system, pre-built components, plug-ins, sass variables & mixins, and more.
With Bootstrap, developers get a guarantee of consistency.
Bootstrap, a front-end web framework is used by developers for quick modeling of the ideas.

Cloud 9

Cloud 9 was introduced in 2010. At that time, it was an open-source, cloud-based IDE that supported different programming languages like Perl, C, Python, PHP, JavaScript, and more. In 2016, AWS (Amazon Web Services) acquired this tool, and it turned into a paid service.

Key Features:

Cloud 9 IDE, a web-based platform is used by software development companies for scripting and debugging the app code in the cloud.
It comes with various features like code completion suggestions, file dragging, debugging, and more.
With the use of Cloud 9, the developers can work with serverless applications.
Cloud 9 IDE is used by both web and mobile developers.
It enables one to create a replica of the entire software development environment.
Developers who use AWS Cloud 9 can share the environment with team members. 

Dreamweaver

Adobe Dreamweaver is a proprietary software programming editor used to develop both simple and complex websites. It supports languages like CSS, HTML, XML, and JavaScript.

Key Features:

Dreamweaver is used on different operating systems such as Windows and macOS.
The latest version of this tool can be used by developers for creating responsive websites.
Dreamweaver CS6 offers a preview option that enables one to have a look at the designed website.
Dreamweaver CC, another version of this tool is a combination of a code editor and a design surface. It comes with features like code collapsing, auto-completion of code, real-time syntax checking, code inspection, and syntax highlighting.

Bitbucket

Bitbucket, a web-based version control tool is used by the developers for collaboration between teams. It is utilized as a repository for the source code of projects.

Key Features:

Bitbucket is a powerful tool that comes with features like flexible deployment models, code collaboration on steroids, and unlimited private repositories.
With the use of Bitbucket, developers can organize the repositories into different projects.
Bitbucket supports a few services like issue tracking, code search, Git large file storage, integrations, bitbucket pipelines, smart mirroring, and more.

CodeLobster

CodeLobster is another popular software development tool that is free to use and is a very convenient PHP IDE. Developers use it to create fully-featured web applications. This tool supports technologies like HTML, Smarty, JavaScript, Twig, and CSS.

Key Features:

Its PHP Debugger enables developers to debug systems easily while coding.
CodeLobster PHP Edition makes the development process easy by supporting CMSs like Magento, Joomla, Drupal, and WordPress.
Some of its best features are the PHP Debugger, CSS code inspector, PHP advanced autocomplete, and auto-completion of keywords and DOM elements.
This tool offers file explorer features and browser previews.

Conclusion

As seen in this blog, there are many different types of software development tools available in the market. And all of them are robust, fully-featured, and widely used. Here, we have listed some of the most popularly used development tools that are used by developers for creating unique solutions for their clients. The choice between these tools might be difficult at first, but if the developer properly studies the project and its requirements, choosing the right software developer tool can be really easy. And it can help in creating the finest project.


Some clues to understanding Benford’s law

Benford’s law is a really
fascinating observation that in many real-life sets of numerical data, the
first digit is most likely to be 1, and every digit d is more common than
d+1. Here’s a table of the probability distribution, from Wikipedia:

Now, the caveat “real-life data sets” is really important. Specifically, this
only applies when the data spans several orders of magnitude. Clearly, if we’re
measuring the height in inches of some large group of adults, the
overwhelming majority of data will lie between 50 and 85 inches, and won’t
follow Benford’s law. Another aspect of real-life data is that it’s non random;
if we take a bunch of truly random numbers spanning several orders of magnitude,
their leading digit won’t follow Benford’s law either.
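
As a quick numerical check of this caveat (a sketch added here, not part of the original post's code): uniformly random numbers spanning several orders of magnitude do not follow Benford's law, while multiplicatively spread (log-uniform) numbers, a stand-in for data where smaller values are more common, match it closely.

import numpy as np

def leading_digit_freq(values):
    # Frequency of leading digits 1..9 among the given positive values.
    digits = [int(str(int(v))[0]) for v in values if v >= 1]
    counts = np.bincount(digits, minlength=10)[1:10]
    return counts / counts.sum()

rng = np.random.default_rng(0)
uniform = rng.uniform(1, 1e6, size=100_000)          # spans orders of magnitude, but uniform
log_uniform = 10 ** rng.uniform(0, 6, size=100_000)  # smaller numbers are far more common

benford = np.log10(1 + 1 / np.arange(1, 10))
print("Benford:    ", np.round(benford, 3))
print("Uniform:    ", np.round(leading_digit_freq(uniform), 3))
print("Log-uniform:", np.round(leading_digit_freq(log_uniform), 3))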

In this short post I’ll try to explain how I understand Benford’s law, and why
it intuitively makes sense. During the post I’ll collect a set of clues,
which will help get the intuition in place eventually. By the way, we’ve already
encountered our first clues:

Clue 1: Benford’s law only works on real-life data.

Clue 2: Benford’s law isn’t just about the digit 1; 2 is more common than
3, 3 is more common than 4 etc.

Real-world example

First, let’s start with a real-world demonstration of the law in action. I
found a data table of the
populations of California’s ~480 largest cities, and ran an analysis of the
population number’s leading digit [1]. Clearly, this is real-life data, and it
also spans many orders of magnitude (from LA at 3.9 mln to Amador with 153
inhabitants). Indeed, Benford’s law applies beautifully on this data:

Eyeballing the city population data, we’ll notice something important but also
totally intuitive: most cities are small. There are many more small cities than
large ones. Out of the 480 cities in our data set, only 74 have population over
100k, for example.

The same is true of other real-world data sets; for example, if we take a
snapshot of stock prices of S&P 500 companies at some historic point, the prices
range from $1806 to $2, though 90% are under $182 and 65% are under $100.

Clue 3: in real-world data distributed along many orders of magnitude,
smaller data points are more common than larger data points.

Statistically, this is akin to saying that the data follows the Pareto
distribution
, of which the
“80-20 rule” – known as the Pareto principle – is a special case.
Another similar mathematical description (applied to discrete probability
distributions) is Zipf’s law.

Logarithmic scale

To reiterate, a lot of real-world data isn’t really uniformly distributed.
Rather, it follows a Pareto distribution where smaller numbers are more common.
Here’s a useful logarithmic scale borrowed from Wikipedia – this could be the
X axis of any logarithmic plot:

In this image, smaller values get more “real estate” on the X axis, which is
fair for our distribution if smaller numbers are more common than larger
numbers. It should not be hard to convince yourself that every time we “drop a
pin” on this scale, the chance of the leading digit being 1 is the highest.
Another (related) way to look at it is: when smaller numbers are more common, it
takes a 100% increase to go from the leading digit being 1 to it being 2,
but only a 50% increase to go from 2 to 3, etc.

Clue 4: on a logarithmic scale, the distance between numbers starting
with 1s and numbers starting with 2s is bigger than the distance between
numbers starting with 2s and numbers starting with 3s, and so on.

We can visualize this in another way; let’s plot the ratio of numbers starting
with 1 among all numbers up to some point. On the X axis we’ll place N which
means “in all numbers up to N”, and on the Y axis we’ll place the ratio of
numbers i between 0 and N that start with 1:

Note that whenever some new order of magnitude is reached, the ratio starts to
climb steadily until it reaches ~0.5 (because there are just as many numbers
with D digits as numbers starting with 1 and followed by another D digits);
it then starts falling until it reaches ~0.1 just before we flip to the next
order of magnitude (because in all D-digit numbers, numbers starting with each
digit are one tenth of the population). If we calculate the smoothed average of
this graph over time, it ends up at about 0.3, which corresponds to Benford’s
law.
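
Here's a small sketch that reproduces the ratio described above (the plots in the original post come from the Go code linked in the footnote; this Python version is only for illustration). For each N, it computes the fraction of the numbers 1..N whose leading digit is 1:

import numpy as np
import matplotlib.pyplot as plt

def ratio_leading_one(limit):
    # ratios[n-1] = fraction of the numbers 1..n whose decimal representation starts with '1'
    leading_one = 0
    ratios = np.empty(limit)
    for n in range(1, limit + 1):
        if str(n)[0] == '1':
            leading_one += 1
        ratios[n - 1] = leading_one / n
    return ratios

ratios = ratio_leading_one(100_000)
plt.plot(np.arange(1, len(ratios) + 1), ratios)
plt.xlabel('N')
plt.ylabel('share of 1..N with leading digit 1')
plt.show()

The curve climbs toward roughly 0.5 after each new power of ten and sinks back toward roughly 0.1 just before the next one, exactly as described above.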

Summary

When I’m thinking of Benford’s law, the observation that really brings it home
for me is that “smaller numbers are more common than larger numbers” (this is
clue 3). This property of many realistic data sets, along with an understanding
of the logarithmic scale (the penultimate image above) is really all you need
to intuitively grok Benford’s law.

Benford’s law is also famous for being scale-invariant (by typically applying
regardless of the unit of measurement) and base-invariant (works in bases other
than 10). Hopefully, this post makes it clear why these properties are expected
to be true.

[1]
All the (hacky Go) code and data required to generate the plots in this
post is available on GitHub.


Understanding the Seq Storage view

Seq 2021 introduced a fantastic new visualization of how Seq uses disk and memory resources, under Data > Storage:

We created Storage because we don’t believe in the “store everything and sort it out later” philosophy of log centralization. This initially-intriguing idea leads down a path of waste, confusion, and expense. Seq’s goal is to help teams store and use log data efficiently: when log volumes balloon out unexpectedly, we want to make this obvious and easy to trace, instead of sweeping it under the rug in the hopes of collecting ever-greater fees for ingestion, processing, and storage later on.

There are a few related pieces of information in this view; this post will help you interpret them to improve the performance and efficiency of your Seq installation.

Total stored data

The chart that takes up most of the Storage screen’s real estate shows stored bytes on disk, sliced up by time intervals from the oldest events on the left, through to the newest over on the right.

The height of each bar indicates how much data is stored for that time period. The measurement used here is actual bytes-on-disk, which are generally compressed. The actual amount of log data represented by each bar will be greater.

Does this add up to total disk consumption? Not precisely: Seq uses some additional disk space for temporary files and the snapshots that support multiversion concurrency control. During normal operation, the contribution these make to the storage footprint is negligible.

The chart breaks down stored data into four categories, that give you some insight into the workings of Seq’s underlying event storage engine.

Buffered

If your server is ingesting data right now, you’ll very likely see blue boxes over towards the right of the chart.

Buffered data is events in ingest buffers, which are five-minute chunks that events are written to as they arrive.

Ingest buffers are designed to be quick to write. Within a five minute window, the events are stored unsorted: events being generated “now” come from servers with slightly varying clocks, in batches of varying ages, over network links with varying latency. They’ll generally land in the same five-minute bucket, but rarely in timestamp order.

Sorting events on disk, on arrival, consumes write bandwidth for little gain. Instead, Seq sorts ingest buffers on read, and uses various levels of caching to avoid having to read ingest buffers from disk at all.

Your most recent hour of data will almost always be stored in ingest buffers. Seq leaves those five minute buffers alone, if they’ve been written to in the past hour, to maximize the chance of collecting up stragglers before sorting, compacting, and indexing ingest buffers into long-term storage spans – a process called “coalescing”.

Seq will also try to collect a decent amount of coalesceable data before processing ingest buffers (around 160 MB). If your server isn’t ingesting much, it may take a long time to collect up enough to coalesce. Once an ingest buffer has been around a while (a couple of days), it and its neighbours will be coalesced into a span regardless of how much data they contain.

✅ If your Storage screen looks something like the screenshot, with blue only at the rightmost end of the chart, all is well!

🔔 If, however, blocks of blue appear farther back in time, interspersed with other types of data, you may have a clock synchronization issue that’s timestamping events in the past or future.

🔔 If your storage screen is mostly blue, or has regions of blue covering more than a few days, something may be preventing coalescing from proceeding: insufficient disk space, a filesystem corruption, or some other problem could be at the root of this one.

Unindexed

When data is freshly coalesced from ingest buffers into spans, those spans won’t have any associated indexes. These peachy-colored blocks of data are quite rare to see, because as long as all of the available coalescing work is done, Seq will move straight onto indexing.

✅ Occasionally spotting unindexed data is a normal part of Seq’s operation. Nothing to worry about!

🔔 Seq prioritizes coalescing over indexing, so if limited time is available, data will be compacted and sorted, but signal indexing may be delayed. If you see peachy unindexed data hanging around for hours at a time, check out hardware utilization, and consider whether your Seq machine may need more IO bandwidth or compute.

Indexed

Far and away the most common color to see on your Storage chart should be green. Green regions indicate data in span files, which have accompanying indexes for all current signals.

Indexed data is (often vastly) more efficient to query, when signals are used, because the indexes track – down to disk page resolution – which regions of the event stream contain events matching a signal’s filters.

✅ You want to see a lot of green indexed data in this chart! If it’s there, all you should need to pay attention to is how much of it is there.

🔔 Are there any outlying peaks, where a runaway process has filled your log server with junk? Clicking on a part of the chart will switch to the Events screen and show you what events are found in that time slice.

It’s worth calling out that when you switch over to view the corresponding events, you won’t get a breakdown by storage category: you’ll see all of the events in that time slice, from buffers and spans, indexed and unindexed. Seq Support can help you to dig in deeper if you need to debug a storage problem.

Queued for reindexing

Imagine your server has 100 signals configured, including shared signals and the private ones created by each user. When fully indexed, each span in the event store will have 100 up-to-date entries.

Now a user changes one of the signals: all of a sudden, the spans have 99 up-to-date indexes, and no indexing information for the latest version of the changed signal. Spans in this state are shaded bright pink.

This change is instantaneous, so it’s not unusual to see the whole chart flip from green to pink when a signal is modified. As indexing proceeds, spans will gradually flip back from unindexed/pink to indexed/green, as the indexer works its way through the event store.

✅ It’s normal to see a lot of data queued for reindexing, and because the indexer only kicks in periodically, data can wait for 10 minutes or so before the indexer kicks off. As long as it’s back to indexed/green in a short time, Seq’s doing what it should.

🔔 Okay, so you’ve been watching the indexer work, and in fits and starts, it’s only crawling through the data. Is the data being queued faster than it can be indexed, or the indexer taking so long that the stream is pink for hours at a time? You’re very likely short on IO bandwidth or compute. Indexing will stick to a single core, so you’ll need to check out CPU stats to know whether more CPUs (to handle other concurrent tasks) or faster CPUs are needed.

The x-axis

One last thing to take a quick look at is the x (time) axis of the storage chart.

🔔 Are the limits of the x axis what you’d expect? If you have stored data stretching far into the past or future, you may have data quality issues. Things can get quite strange – and queries slow – if your log sources have clock problems or application bugs that put timestamps back around the Unix epoch (1970) or worse, the .NET DateTime epoch (0001)!

Seq won’t drop valid log events that arrive with odd timestamps; if you’ve inadvertently ingested them, then setting a retention policy or issuing a manual delete to clean them up will be a good idea.

Tip: you can set per-API-key filters that compare incoming event timestamps with now() and drop any dodgy ones. For example, @Timestamp >= now() - 7d or @Timestamp < now() + 1d will only let events timestamped in the last week, or the next 24 hours, through to the event store.

Retention policies

The little round, numbered markers on the chart show at which points Seq will apply retention policies to thin out data in the event stream. You can find the policy associated with each marker in the Retention Policies box, just below the chart.

Data that passes a retention policy marker isn’t processed immediately: because the underlying span files are immutable, Seq will generally wait until an entire span becomes eligible for retention processing before it kicks off and rewrites it in one single pass.

✅ You should see the volume of stored data for each interval drop significantly, shortly after it passes each retention policy marker. Policies that trim down the stream, like (1) in the screenshot, help Seq run more efficiently and produce query results faster.

🔔 No difference in stored volume at a marker? Retention processing has a cost, so unless it serves some worthy purpose, such as complying with data retention requirements, a policy that doesn’t actually delete a meaningful amount of data is just draining resources unnecessarily. Try making the policy more aggressive, or remove it entirely.

RAM cache coverage

By this point, you know just about everything there is to know about how Seq manages disk storage! Seq goes to great lengths to make queries work well from disk, but there’s no comparison between the performance of reading data from disk vs processing it in RAM.

That’s the narrow purple line that runs under the x-axis on the right-hand side of the chart. Data in this region is cached in RAM, yielding faster queries and better use of disk bandwidth.

✅ Purple bar underlining all of the chart, or at least the last week? Great! Nothing much to do. You can still try moving retention policies to optimize caching (read on below), but you’re probably experiencing good performance day-to-day.

🔔 No purple bar? Uh-oh. Whether your machine is vastly under-resourced, another process is using too much memory, you’re hitting a bug in Seq, or something else entirely is going on, you should contact the Seq Support team for help. Working on a Seq instance without a decent amount of cache coverage is no fun at all.

🔔 Purple bar only covers a couple of days? It’s pretty likely that your machine needs more memory, or some tuning of retention policies. Read on below!

Here’s why we love the design of the Storage screen: it lets you directly visualize the impact of shortening or widening retention policies, on both disk space and the RAM cache.

If you’re not getting enough of the event stream into RAM, you need to move one or more retention policies further into that purple cache window. Ideally, you’ll have a retention policy close to the right-hand side of the chart, early in the purple underlined area, that can drop events that would otherwise take up RAM cache space.

Retention policies in the purple/cached area have the effect of making the purple area longer, sometimes so much longer that additional retention policies are brought into the cached window, for a compounding benefit.

Another option, if your machine is bursting at the seams, is to filter more events out at the ingestion endpoint. Health checks, web logs for static file requests, and many other kinds of mundane, uninteresting events can be dropped during ingestion to save storage space. Seq’s Ingestion screen, right next to Storage under the Data menu, can help you with this.

Getting help

Think you’ve found a problem with storage on your Seq server, or need more eyes on it to spot some opportunities for improvement? The Seq developers would love to hear from you, here or via [email protected].

Happy logging!

Cloud Data Warehouse Explained

Possibly the single most wildly anticipated and over-hyped IPO of 2020 was Snowflake (NYSE: SNOW), a hot provider of cloud data warehouse software. (Snowflake, in keeping with all hot young startups, likes to re-invent the standard terminology in their space, so they have decided that “data warehouse” is out and “data cloud” is in.)

If you’re late to the party or didn’t get the memo, you might be asking yourself what all the fuss is about. As someone who got to the party late (all the peeled shrimp had already been eaten), let me try to help you out.

Data warehouses, data lakes, data clouds: where does it end?

There’s a straight line from the wadded-up grocery list in my pocket to a data cloud, and it looks something like this: from a list of things to a table (things with properties) to a database (multiple tables) to a data warehouse (multiple databases) to a data lake (data warehouses plus the kitchen sink) to a data cloud (who needs hardware?).

Application software developers are usually intimately familiar with databases, whether Access, SQL Server, Oracle, MySQL, or whatever. It’s hard to write a useful application that doesn’t rely on some kind of data, and probably 90+ percent of business applications are just front ends for databases, allowing CRUD (create, read, update, and delete) operations with associated business rules and logic. Our Salmon King Seafood demo app is exactly that.

But as businesses grow, they tend to have multiple databases: one for finance, one for HR, one for sales, one for manufacturing, and on and on. While reporting from a single database is relatively easy (that’s what SQL is for), reporting across multiple databases can be tricky, especially if the schemas aren’t properly normalized across different tables in different databases. Suppose, for example, the sales database tracks contract dates as DDMMYYYY and the finance database tracks receipt dates as MMDDYYYY: any report that wants to examine the lag between a signed contract and receipt of funds will first have to normalize the dates from the two sources. Multiply that trivial example by multiple databases, often on different platforms with different data type definitions, SQL extensions, and software versions, and it can get messy fast.
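
To make that concrete, here’s a minimal Python sketch of the normalization step. The raw values, formats, and variable names are hypothetical, simply mirroring the example above:

```python
from datetime import datetime

# Hypothetical raw values, mirroring the example above:
# the sales system stores contract dates as DDMMYYYY,
# the finance system stores receipt dates as MMDDYYYY.
contract_date_raw = "05032024"   # 5 March 2024 (DDMMYYYY)
receipt_date_raw = "04152024"    # 15 April 2024 (MMDDYYYY)

contract_date = datetime.strptime(contract_date_raw, "%d%m%Y")
receipt_date = datetime.strptime(receipt_date_raw, "%m%d%Y")

# Once both values share a common representation, the comparison is trivial.
lag_days = (receipt_date - contract_date).days
print(f"Lag between signed contract and receipt of funds: {lag_days} days")
```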

But that’s not all. Many production databases are heavily transaction focused–think of a sales database for a large e-commerce site. If your website is taking orders 24×7 because you sell worldwide, when can you run big queries without hurting responsiveness for customers trying to make a purchase? 

A data warehouse is not a database

The answer is a data warehouse, which pulls together data from multiple databases, normalizes it, cleanses it, and allows for queries across different data sources using a common data model. This not only reduces the hit on production databases, but, because the ETL (extract, transform, and load) processes create normalized data across all the tables in the warehouse, information consumers (management, business analysts, and even partners) can produce rich reports with minimal impact on the IT organization. 
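
To illustrate the “common data model” idea, here’s a toy sketch of the load-and-query side, with SQLite standing in for the warehouse engine purely for convenience; the table and column names are invented for this example:

```python
import sqlite3

# SQLite stands in for the warehouse here; a real EDW would be Teradata,
# Redshift, Snowflake, and so on. The schema is a made-up "common model"
# that the ETL process has already normalized both sources into.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE contract_facts (
        contract_id   TEXT PRIMARY KEY,
        contract_date TEXT,  -- ISO 8601, normalized from the sales system
        receipt_date  TEXT   -- ISO 8601, normalized from the finance system
    )
""")

# Load step: rows produced by the extract/transform phases.
warehouse.executemany(
    "INSERT INTO contract_facts VALUES (?, ?, ?)",
    [("C-1001", "2024-03-05", "2024-04-15"),
     ("C-1002", "2024-03-20", "2024-03-28")],
)

# With a common model, the cross-source question becomes a single query.
for row in warehouse.execute(
    "SELECT contract_id, "
    "       julianday(receipt_date) - julianday(contract_date) AS lag_days "
    "FROM contract_facts ORDER BY lag_days DESC"
):
    print(row)
```

The tooling isn’t the point; the point is that once the data shares one model, questions that used to require cross-database gymnastics become ordinary SQL against the warehouse.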

Since their widespread adoption in the 80s and 90s, enterprise data warehouses (EDWs) have become de rigueur in larger organizations. Companies like IBM, Oracle, and Teradata gobbled up market share with high-performance appliances for ingesting, processing, and running queries on enterprise data.

Enter the cloud

Like a tornado that suddenly develops from a dark cloud and proceeds to rip apart everything in its path, the gathering storm of IoT and public cloud upended a lot of EDW strategies.

Seemingly overnight, organizations have petabytes of data flooding in from connected devices all over the world. Drawing insights from that data requires both a high-performance computing platform and vast storage capacity for structured and unstructured data. Data lakes allow extremely large volumes of unstructured data to be stored for consumption by AI models that create business intelligence. Cloud storage and elastic cloud processing slash the investment and direct costs of storing and processing these vast rivers of raw data.

In the “old” model you might have to build up capacity by adding more appliances and disk storage to handle your peak loads; with a public cloud EDW, compute capacity can expand or contract with usage while maintaining acceptable performance for running big queries or scripts. Plus, BLOB (binary large object) storage on AWS or Azure is quite inexpensive compared to arrays of on-prem disk drives.

The only real downside of public cloud-based EDWs was stickiness: if your warehouse was in AWS, it was difficult to move it to Azure, and vice versa.

Along comes Snowflake

Last year Snowflake was the hottest thing no one had ever heard of. Overnight the CEO became a newly minted billionaire (who hasn’t gone into space yet, but it’s early days) and the stock price went through the roof on day one.

Why? 

Well, hype mainly. Yet the hype had some basis in reality. Let’s look at the things that make the Snowflake cloud data warehouse different:

Cloud agnostic: You can run your Snowflake instance in Azure, AWS, or wherever you want. You can easily move it from one cloud provider to a different one, something that isn’t possible if you’re using Azure or Amazon Redshift. In general, customers are wary of vendor lock-in.
Only pay for what you use: Snowflake’s pricing model is focused on your compute usage, rather than storage. And unlike many other EDW solutions, you only pay for what you use. If you’ve got a ginormous query you only run once a month, you don’t have to pay the rest of the month for that surge capacity.
Shared data: Snowflake dramatically reduced the complexity and time it takes to share data among disparate sources. They’re building a public data marketplace with some government databases already in place. It doesn’t really matter if your data is in AWS and the partner data you want to consume is in their private cloud: Snowflake makes it really easy to get at it.
Super fast performance: They claim queries run much, much faster on Snowflake than on other systems. YMMV.
Mix and match data: whether it’s a data lake of unstructured data from IoT devices or highly structured data from finance systems, Snowflake makes it cheap and easy to store, access, and understand that data. Queries across disparate data systems, housed in different architectures and sources, are simple. 

Enterprise data warehouses are the domain of very large companies, which are often the acquirers in merger and acquisition (M&A) deals. If BigCo, running an EDW in AWS, acquires SmallerCo, which runs its EDW in Azure, it will be hard to combine those EDWs without Snowflake. In this situation, a Snowflake data cloud will work seamlessly across both AWS and Azure, eliminating the need for BigCo to convert the SmallerCo data lake and data warehouse to its own system. Further, the Snowflake architecture makes it easy to query very large data sets without paying for that capacity while it sits idle most of the time.

If Snowflake’s so great, why isn’t everyone using it?

Recently I was watching a show featuring an architect in Dublin, Ireland and in one episode he’s redoing a house for a family that moved from California; they get a shipping container delivered with all their furniture and appliances from the US. My immediate thought was that those appliances weren’t going to be much use in a country with 220v and different plug standards. 

Changing platforms or systems can be hard. And changing data warehouses can be really hard.

We’ll get into this in more detail in a subsequent post, but basically consider the things you have to do for a migration from, say, Teradata to Snowflake:

Create the new Snowflake data warehouse and train the existing user base on the new platform
Duplicate all the data in the existing Teradata instance and move it into Snowflake
Extract the code (stored procedures, BTEQ scripts, and so forth) from the Teradata instance
Convert the code: Teradata SQL to Snowflake SQL, BTEQ to Python, procedures to JavaScript. This can be the most challenging part of the migration.
Identify and transfer all the associated processes like ETL and BI reporting/analysis
Run the new system in parallel with the existing Teradata system and compare all the results for a thorough QA review (a minimal sketch of one such check follows this list).
When everything checks out, turn off Teradata and enjoy the new cloud data warehouse.
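
For step 6 in particular, here’s a minimal sketch of the kind of parallel-run check you might script. It assumes you’ve already opened DB-API connections to both systems (for example with the teradatasql and snowflake-connector-python drivers); the connection setup and the example table name are placeholders:

```python
def compare_counts(td_conn, sf_conn, query):
    """Run the same aggregate query against Teradata and Snowflake and
    report whether the results line up. Both arguments are assumed to be
    open DB-API connections."""
    results = {}
    for name, conn in (("teradata", td_conn), ("snowflake", sf_conn)):
        cur = conn.cursor()
        cur.execute(query)
        results[name] = cur.fetchone()[0]
        cur.close()

    match = results["teradata"] == results["snowflake"]
    print(f"{query!r}: teradata={results['teradata']} "
          f"snowflake={results['snowflake']} match={match}")
    return match

# Hypothetical usage during the parallel-run period:
# compare_counts(td_conn, sf_conn, "SELECT COUNT(*) FROM sales.contracts")
```

In practice you’d run a whole battery of these checks (row counts, sums, distinct counts) on a schedule and only retire Teradata once they’ve matched consistently.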

For a large data warehouse system, with lots of ETL and BI processes depending on it, this can be a very challenging project, to say the least. This is why Snowflake turned to Mobilize.Net for help with the code conversion part, relying on our ability to create automated tools that can migrate over 90% of the code in the Teradata system. Our Snowconvert tool solves a big piece of the problem of getting from an existing data warehouse to the Snowflake data cloud. If you’d like to learn more, let us know.