How to use Apache NiFi to migrate Object Storage Data

Apache NiFi - Overview

Apache NiFi is a system for processing and distributing data between machines. It transfers data and manages the transfers between different source and destination systems. The tool provides a web interface to facilitate the design, management, and control of data transfers.

We are going to use Terraform to provision the instance and then configure Apache NiFi to build the data flows we want.

Requirements:

  • A Scaleway account with a valid API token (access key and secret key)
  • Two Object Storage buckets: a source and a destination

Configuring Terraform

Install Terraform and export the environment variables it needs to authenticate against the Scaleway API.

1 . Download the terraform binary from the official page terraform.io. Once downloaded, copy it into a directory included in your PATH.
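For example, on a Linux amd64 machine the installation may look as follows, where <VERSION> stands for the release you downloaded:

unzip terraform_<VERSION>_linux_amd64.zip
sudo mv terraform /usr/local/bin/
terraform version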

2 . Set the following variables to be able to start an instance from Terraform.

export SCW_ACCESS_KEY="<ACCESS_KEY>"
export SCW_SECRET_KEY="<SECRET_KEY>"
export SCW_DEFAULT_ORGANIZATION_ID="<ORG_ID>"
export SCW_DEFAULT_REGION="fr-par"

Important: Replace <ORG_ID>, <ACCESS_KEY> and <SECRET_KEY> with the credentials of your API token.
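You can check that the variables are set in your current shell before continuing:

env | grep SCW_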

Running Terraform

1 . Open a text editor and adapt the following plan to your needs. A plan is a file with a .tf extension, for example scaleway.tf:

provider "scaleway" {
  zone = "fr-par-1"
}

resource scaleway_instance_ip "nifi" {
  server_id = scaleway_instance_server.nifi.id
}

resource "scaleway_instance_security_group" "nifi" {
  name = "sg-nifi"

  inbound_default_policy  = "drop"
  outbound_default_policy = "accept"

  inbound_rule {
    action = "accept"
    port   = "22"
    ip_range = "<MY_PUBLIC_IP>/32"
  }

  inbound_rule {
    action = "accept"
    port   = "8080"
    ip_range = "<MY_PUBLIC_IP>/32"
  }

}

resource "scaleway_instance_server" "nifi" {
  name  = "nifi"
  type  = "DEV1-L"
  image = "ubuntu-bionic"
  tags  = [ "nifi" ]

  security_group_id = scaleway_instance_security_group.nifi.id

  connection {
    host         = self.public_ip
    user         = "root"
  }

  provisioner "remote-exec" {
    inline = [
      "apt-get update",
      "DEBIAN_FRONTEND=noninteractive apt-get install -yq python openjdk-8-jre-headless",
      "mkdir -p /opt/",
      "wget -O /opt/nifi.tar.gz <APACHE_MIRROR>",
      "tar xzf /opt/nifi.tar.gz -C /opt/",
      "/opt/nifi-1.9.2/bin/nifi.sh start",
    ]
  }

}

Important: Replace <MY_PUBLIC_IP> with your public IP address, and <APACHE_MIRROR> with the URL of a NiFi 1.9.2 binary archive from an Apache mirror (for example, https://archive.apache.org/dist/nifi/1.9.2/nifi-1.9.2-bin.tar.gz from the Apache archive, matching the /opt/nifi-1.9.2 path used in the plan).

2 . Then run the following commands (you can optionally run terraform plan first to preview the changes):

terraform init
terraform apply

These commands generate a large output. Check whether any errors are displayed. If everything went well, proceed to the next step.
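If the apply succeeds, the output ends with a summary along the lines of the following (the plan above creates three resources):

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.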

Configuring Apache NiFi

1 . Retrieve the public IP of your instance either from the web console or by running terraform show | grep address. Then open your browser and type the following URL: http://<INSTANCE_PUBLIC_IP>:8080/nifi/ (Important: NiFi may need some time to start. If the page does not load, wait a few minutes before you retry). You’ll be greeted by NiFi’s interface.

Empty interface
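NiFi can also be polled from the command line to see when it is ready, for example with a loop like this (replace <INSTANCE_PUBLIC_IP> with your instance's IP):

until curl -sf -o /dev/null http://<INSTANCE_PUBLIC_IP>:8080/nifi/; do sleep 10; done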

2 . Once the interface is displayed, proceed to the next step: creating data flows using processors.

Synchronizing One Bucket to Another

To synchronize one bucket with another, we have to create a workflow.

1 . Start by dragging the Processor icon onto the workspace; it will prompt the following window.

2 . We first want to list the contents of our bucket, so start by creating a ListS3 processor. Double-click the processor, or open its settings by clicking the gear icon in the Operate window, and navigate to the Properties tab. We will start by creating an AWS Credentials Provider service, which lets you configure your credentials once and reuse them in every processor. Click on Create new service...

Select AWS Creds Controller

3 . It will prompt you with a new window in which you can define the name of the service.

New AWS Creds Controller

4 . Then click on the small arrow at the right of the line. This sends you to the NiFi Flow Configuration window, where you can set your S3 <ACCESS_KEY> and S3 <SECRET_KEY> by clicking the gear icon at the right and entering them in the separate credential fields.

Important: Replace <ACCESS_KEY> and <SECRET_KEY> with the credentials of your API token.

Set Credentials

5 . Once the credentials are entered, return to the ListS3 processor settings. You have to configure three properties, listed below (see the example values after this list).

  • The endpoint URL (has to be set in each Processor)

Set Endpoint URL

  • The source or destination bucket name (has to be set in each Processor)

Set Bucket name

  • The listing method (only in the ListS3 Processor)

Set V2 listing method
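As an illustration, assuming a source bucket named my-source-bucket in the fr-par region (both are placeholders, and exact property labels can vary between NiFi versions), the ListS3 properties would be filled in roughly as follows:

Endpoint Override URL : https://s3.fr-par.scw.cloud
Bucket                : my-source-bucket
List Type             : Version 2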

6 . Once these three options are set, create a FetchS3Object processor and configure it the same way. Then create a PutS3Object processor and set the destination bucket name.

7 . Then connect all processors using arrows, and loop the failure relationships back to their processors so that failed transfers are retried.

It should look like the following:

Global Views

Then start each processor. If there are no errors, the flow starts synchronizing data from the source bucket to the destination bucket.
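To check that the synchronization works, you can compare the contents of both buckets, for example with the AWS CLI pointed at the Scaleway endpoint (bucket names are placeholders, and this assumes buckets in fr-par):

aws s3 ls s3://<SOURCE_BUCKET> --endpoint-url https://s3.fr-par.scw.cloud
aws s3 ls s3://<DESTINATION_BUCKET> --endpoint-url https://s3.fr-par.scw.cloud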

Combining Several Data Sources

You can combine different fetch sources to import or migrate your data to Scaleway Elements Object Storage.

Sources

Here’s an example of multiple fetches combined with a single upload:

multi fetch
