As the saying goes, data is the new oil. Data is growing exponentially, and the methods used for data analysis and customer prediction keep changing: some technologies have already become obsolete, and others soon will. Most organizations are moving towards microservices and big data handling and processing mechanisms.

Architecture is evolving towards fast and reliable technologies and tools. Before getting into optimization techniques and Spark architecture, let's understand what big data is and how Apache Spark relates to it.

Big Data

The collection of a huge amount of data that cannot be stored and handled by traditional…

Running R Code on AWS Batch in a Production Environment

In this blog, we will see how to run a job on AWS Batch with the help of a container, S3, EC2, and environment variables to parameterize the job.

I am using R (the language) as the base container, but the possibilities are limitless.
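To make the idea of parameterizing a job with environment variables concrete, here is a minimal sketch in Python. It only builds the arguments for the boto3 Batch client's `submit_job` call and does not contact AWS. The job name, queue name, job definition name, and variable names are all hypothetical placeholders, not values from this post.

```python
def build_submit_job_request(job_name, job_queue, job_definition, params):
    """Build keyword arguments for the boto3 Batch client's submit_job call.

    Each entry in `params` becomes an environment variable inside the container.
    """
    environment = [
        {"name": name, "value": str(value)}
        for name, value in sorted(params.items())
    ]
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {"environment": environment},
    }


# Hypothetical names, used only for illustration.
request = build_submit_job_request(
    "r-report-job",
    "example-queue",
    "example-r-job-def",
    {"INPUT_S3_PATH": "s3://example-bucket/input.csv", "RUN_DATE": "2021-01-01"},
)

# To actually submit the job (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("batch").submit_job(**request)
```

Inside an R container, a script could then read such a value with `Sys.getenv("INPUT_S3_PATH")`, so the same image can be reused for different inputs.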

First, let us understand the benefits and limitations of AWS Batch.

AWS Batch: This is a service provided by AWS whose essential task is running code on an EC2 machine with elastic memory and storage, without worrying about the configuration of the machine.

Limitation: It’s easy to run a job, but when…

Hello there, I hope you are doing great and staying safe during this COVID situation. In this blog, I will share my fitness story: how I went from fit to fat and from fat to fit, losing many kilograms in the short span of 3 months. I will share the daily routine and diet I followed during the transformation.

My Background

Presently, I am a Senior Data Engineer, creating data pipelines.

I completed my engineering degree in May 2017, and I was one of the few engineers who got an opportunity to work at Accenture just after…


When it comes to big data and modern warehousing technology, you must have heard about Apache Hive.

Official Definition- The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. The structure can be projected onto data already in storage. A command-line tool and JDBC driver are provided to connect users to Hive.

Hive was created at Facebook, but Facebook later donated Hive to the Apache community.

Hive provides a SQL-like language called HiveQL with schema on read, and transparently converts queries to MapReduce, Apache Tez, and Spark jobs.

One thing which…

In this blog, I will try to explain one piece of NiFi functionality (the REST API), which is used for purposes like starting or stopping a processor and changing the state of a processor, service, processor group, input port, etc.

Definition according to the documentation

The REST API provides programmatic access to command and control a NiFi instance in real time. Start and stop processors, monitor queues, query provenance data, and more.

Prerequisite: NiFi is installed. In my case, I have installed NiFi on port 8081, but by default, it is installed on port 8080.

Scenario 1: We need to start a processor with the API

In the first basic flow, we have a GenerateFlowFile processor…
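As a rough illustration of starting a processor through the API, the sketch below builds (but does not send) a `PUT` request to the processor's `run-status` endpoint using only the Python standard library. It assumes a NiFi version that exposes `/processors/{id}/run-status` and an unsecured instance on port 8081 as in the setup above. The processor id is a hypothetical placeholder, and the revision version must match the processor's current revision, which can be read with a `GET` on `/processors/{id}`.

```python
import json
import urllib.request

# Assumption: unsecured NiFi on localhost, port 8081 as in this setup.
NIFI_API = "http://localhost:8081/nifi-api"


def build_run_status_request(processor_id, state, revision_version, base_url=NIFI_API):
    """Build a PUT request that asks NiFi to change a processor's run state.

    `state` is "RUNNING" to start the processor or "STOPPED" to stop it.
    """
    body = json.dumps(
        {"revision": {"version": revision_version}, "state": state}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/processors/{processor_id}/run-status",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical processor id, used only for illustration.
req = build_run_status_request("hypothetical-processor-id", "RUNNING", 0)

# Sending it requires a running NiFi instance:
#   with urllib.request.urlopen(req) as response:
#       print(response.status)
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and body before pointing it at a real instance.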

In this article, we will go through the boto3 documentation and list files from AWS S3. Personally, when I was going through the documentation, I didn’t find a direct solution for this functionality. In this tutorial, we will get to know how to install boto3 and AWS, set up AWS, create buckets, and then list all the files in a bucket.


As per the documentation, Boto is the Amazon Web Services (AWS) SDK for Python. It enables Python developers to create, configure, and manage AWS services, such as EC2 and S3. …
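One possible way to list every file in a bucket is sketched below. It is not necessarily the approach the full article takes: it uses the S3 client's `list_objects_v2` paginator, so buckets with more than one page of results are handled too. The function takes the client as an argument, and the bucket name in the usage comment is a hypothetical placeholder.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Return all object keys in `bucket` that start with `prefix`.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Usage (requires boto3 and AWS credentials; bucket name is hypothetical):
#   import boto3
#   for key in list_keys(boto3.client("s3"), "example-bucket"):
#       print(key)
```

Passing the client in also makes the function easy to try out with a stand-in object before running it against a real account.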

Shubham Kanungo

Senior data Engineer at
