Store And Access Graph Data Using AWS Neptune. Part 1

We live in a connected world. Social, business, and computer networks are all around us. This knowledge brings us to the next natural step — analysis. Do A and B have mutual friends? Does an outage of device X also affect device Y? Are Company 1 and Company 2 somehow connected?

Graph Example

At AgileVision.io we often need to implement this kind of analysis for our customers. At some point, even the most efficient algorithm reaches the limits of the underlying execution environment. But the folks at AWS have made sure those limits won’t be easy to hit.

As a participant in the AWS Neptune preview program, we had a chance to evaluate the new service. This is an introductory blog post about AWS Neptune.

We’ll create an AWS Neptune instance first, with everything needed to access it from an EC2 instance. Then we’ll see how data can be imported from AWS S3 into the AWS Neptune instance. And finally, we’ll execute some SPARQL queries on our data.

Prerequisites

Because AWS Neptune is still in preview (as of February 18th, 2018), you need to request access to the service to be able to evaluate it. This can be done by filling out a form on the AWS Neptune preview page.

Processing takes some time, so most likely you won’t get access immediately.

Creating an AWS Neptune instance

Once the preview access request is approved, it’s possible to access the AWS Neptune console and create an instance there.

First, go to the “Clusters” section of the AWS Neptune Console and click the “Launch DB Instance” button on the top right:

AWS Neptune Console: Launch a DB instance

On the basic settings page there are only three items that can be modified:

  • Instance class — one of db.r4.4xlarge (16 vCPUs, 122 GiB RAM) or db.r4.8xlarge (32 vCPUs, 244 GiB RAM)
  • Multi-AZ deployment
  • Instance Identifier

We’ll go with the smaller instance, which is still packed with 122 GiB of RAM, and keep Multi-AZ deployment disabled. Once the DB details are in place, click the Next button:

AWS Neptune Console: DB Instance Details

The next step is to configure the advanced settings of the DB instance. The flow is very similar to the AWS RDS setup, so if you have already worked with RDS, there should not be any difficulties.

A VPC is required for the AWS Neptune instance. You can either select an existing VPC or create a new one. We’ll be using the default VPC for demo purposes.

AWS Neptune Console: Advanced Settings - Network

Other advanced settings include:

  • Cluster Identifier
  • Database Port
  • Parameter Group
  • Encryption settings
  • Failover settings
  • Backup
  • Maintenance

We’ll be using default values for all of the settings above since they are suitable for our needs. Once all advanced settings have the values we need, we can launch the AWS Neptune cluster. The cluster starts immediately, but some time is required to provision the DB instance itself.

Until then, it will have the “Creating” status:

AWS Neptune Console: Instance Creating

Usually, the provisioning takes 3-5 minutes. Once it’s finished, the DB instance will have the “Available” status. Please note the DB instance endpoint; we’ll need it to access the DB instance:

AWS Neptune Console: Instance Endpoint

Accessing AWS Neptune Using SPARQL

While we don’t have any data in AWS Neptune yet, we’ll learn how to access it using the RDF4J Console. This way we’ll be able to insert several nodes into the graph DB and later check whether our import process is doing what we want.

Create an EC2 Instance

To access the database, we’ll need to create an EC2 instance in the same VPC and the same availability zone as the AWS Neptune DB instance. We’ll be using a t2.micro instance with the Amazon Linux AMI:

AWS Neptune Console: EC2 Instance Creation
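
If you prefer the command line over the console wizard, an equivalent launch looks roughly like this. This is a minimal sketch; the AMI ID, key pair, subnet and security group are placeholders you’d substitute with your own values:

# Launch a t2.micro with the Amazon Linux AMI in the same VPC/AZ as the Neptune instance
aws ec2 run-instances \
    --image-id <amazon-linux-ami-id> \
    --instance-type t2.micro \
    --key-name <your-key-pair> \
    --subnet-id <subnet-in-the-same-az> \
    --security-group-ids <ec2-security-group-id>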

After the instance is created, we need to configure the security group of the AWS Neptune instance correctly, so our EC2 instance can reach it.

We can name the Neptune instance security group “Neptune SG” and the EC2 instance security group “Neptune Client SG”:

AWS Neptune Console:Security Groups

Then the Neptune Client SG must be allowed to access resources in the Neptune SG (port 8182 of the Neptune instance in particular). This is achieved by adding the appropriate rule to the inbound rules of the Neptune SG:

AWS Neptune Console: Neptune SG Inbound Rules
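
The same rule can also be added from the command line; a minimal sketch, assuming you know the IDs of both security groups:

# Allow members of the Neptune Client SG to reach port 8182 in the Neptune SG
aws ec2 authorize-security-group-ingress \
    --group-id <neptune-sg-id> \
    --protocol tcp \
    --port 8182 \
    --source-group <neptune-client-sg-id>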

Check the connectivity

To check the connectivity between the EC2 instance and the AWS Neptune instance, the following query can be executed using the curl command-line tool:

curl -X POST --data-binary 'query=select ?s ?p ?o where {?s ?p ?o}' http://<your-neptune-endpoint>:8182/sparql

Because we are querying an empty DB, the result is very predictable:

{
  "head" : {
    "vars" : [ "s", "p", "o" ]
  },
  "results" : {
    "bindings" : [ ]
  }
}
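
The same endpoint also accepts SPARQL UPDATE requests, so a test triple can be inserted right away. A minimal sketch (the example.org URIs are just placeholders):

curl -X POST --data-urlencode 'update=INSERT DATA { <http://example.org/deviceX> <http://example.org/connectedTo> <http://example.org/deviceY> }' http://<your-neptune-endpoint>:8182/sparql

Re-running the query above should then return one row; the test triple can be removed again with a matching DELETE DATA request.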

Since the first request to the AWS Neptune instance used the SPARQL (RDF) endpoint, the instance is now in SPARQL mode. According to the AWS Neptune team, there is no way to change the database mode during the preview. All subsequent Gremlin requests won't work for an AWS Neptune instance in SPARQL mode and vice versa, including data import. For example, if the AWS Neptune instance is in Gremlin mode, an import of RDF/XML will fail with the following message:
{
    "status" : "403 Forbidden",
    "message" : "Incompatible load format. Engine is set to GREMLIN mode, format: rdfxml is not allowed for loading data."
}

Configure the RDF4J Console

So now we can access the instance and perform HTTP requests via curl to execute SPARQL queries. But the low-level syntax of curl is not something we want to deal with. Let’s use the RDF4J Console instead.

On the EC2 instance, execute the following bash script:

# Install Java 8 (OpenJDK)
sudo yum install java-1.8.0-openjdk-devel
sudo /usr/sbin/alternatives --config java # Select Java 8 on this step
# Download and unpack the RDF4J SDK, then start the console
curl -O http://ftp.heanet.ie/pub/eclipse/rdf4j/eclipse-rdf4j-2.2.4-sdk.zip
unzip eclipse-rdf4j-2.2.4-sdk.zip
cd eclipse-rdf4j-2.2.4/bin
./console.sh

The next step is to configure a SPARQL repository for the AWS Neptune instance:

Execute the following command:

create sparql

The configuration will look like this:

SPARQL query endpoint: http://<your-neptune-endpoint>:8182/sparql
SPARQL update endpoint: http://<your-neptune-endpoint>:8182/sparql
Local repository ID [endpoint@localhost]: neptune
Repository title [SPARQL endpoint repository @localhost]: AWS Neptune instance

To use the newly created repository, type “open neptune” in the RDF4J console prompt. Then try executing the following command:

sparql select ?s ?p ?o where {?s ?p ?o}

The result didn’t change much, it just got more readable:

Evaluating SPARQL query...
+------------------------+------------------------+------------------------+
| s                      | p                      | o                      |
+------------------------+------------------------+------------------------+
+------------------------+------------------------+------------------------+
0 result(s) (147 ms)

Still, it’s very exciting, since now we can start exploring the brave new world of graph databases. You’ll need to exit the RDF4J console to start importing data into the AWS Neptune instance.

Importing data into AWS Neptune instance

It’s time to import some data into our instance. We could create data manually, but that approach is suitable only for small graphs, while a 16 vCPU / 122 GiB RAM instance definitely deserves an interesting dataset. We’ll be using the GeoSpecies RDF dataset from http://rdf.geospecies.org.

It’s possible to download it using curl:

curl -O http://rdf.geospecies.org/geospecies.rdf.gz

Now we can copy the RDF file to S3 for further import using AWS Neptune Loader API:

aws s3 cp geospecies.rdf.gz s3://<your-bucket>

Time to start importing the data itself. We have it in RDF/XML format, which can be easily consumed by the AWS Neptune Loader. To start the import process, use the following snippet:

curl -X POST \
    -H 'Content-Type: application/json' \
    http://<your-neptune-endpoint>:8182/loader -d '
    { 
      "source" : "s3:/<your-bucket>/", 
      "format" : "rdfxml",  
      "iamRoleArn" : "<iam-role-to-allow-neptune-to-access-s3>", 
      "region" : "us-east-1", 
      "failOnError" : "FALSE"
    }'

The interesting part is that Neptune supports loading compressed data. This is really convenient for big, text-based files, where compression brings real benefits.

The response should be as follows:

{
    "status" : "200 OK",
    "payload" : {
        "loadId" : "<uuid>"
    }
}

If you get an error like this:

Failed to start new load from the source s3://<bucket-name>/. Couldn't find the aws credential for iam_role_arn: <your-iam-role>

it means you haven’t set up the IAM role correctly.
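
For reference, the role setup roughly boils down to the following sketch. The role name is a placeholder, and the rds.amazonaws.com service principal is our assumption (Neptune is managed through the RDS infrastructure), so please double-check it against the preview documentation:

# Create a role Neptune can assume and give it read access to S3
aws iam create-role \
    --role-name NeptuneLoadFromS3 \
    --assume-role-policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "Service": "rds.amazonaws.com" },
        "Action": "sts:AssumeRole"
      }]
    }'

aws iam attach-role-policy \
    --role-name NeptuneLoadFromS3 \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

Depending on the setup, the role may also need to be associated with the Neptune cluster itself before the loader can use it.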

Once you have the loadId, it’s possible to check the current status of the import job:

curl -X GET http://<your-endpoint>:8182/loader/<load-id>
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_COMPLETED" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "<source-uri>",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_COMPLETED",
            "totalTimeSpent" : 31,
            "totalRecords" : 2201532,
            "totalDuplicates" : 0,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}

According to the response above, AWS Neptune imported more than 2.2 million records in 31 seconds.
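
For a dataset of this size the load finishes almost immediately, but for bigger files it’s convenient to poll the loader until it’s done. A minimal sketch, assuming the in-progress state is reported as LOAD_IN_PROGRESS:

# Poll the loader status every 10 seconds until the load leaves the in-progress state
while curl -s http://<your-endpoint>:8182/loader/<load-id> | grep -q 'LOAD_IN_PROGRESS'; do
    sleep 10
done
curl -s http://<your-endpoint>:8182/loader/<load-id>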

Examples

To start executing some SPARQL queries, start the RDF4J console first:

./console.sh 

And open the “neptune” repository we created earlier:

> open neptune
neptune> 

Now the RDF4J console is ready to evaluate queries on the AWS Neptune instance.

Query the first 10 triples:

neptune>  select ?s ?p ?o where {?s ?p ?o} limit 10
Evaluating SPARQL query...

Result

s p o
http://lod.geospecies.org/ses/lexzO.rdf http://purl.org/dc/terms/title “About: Species Scylaceus pallidus, Family Linyphiidae”
http://lod.geospecies.org/ses/lexzO.rdf http://purl.org/dc/terms/publisher http://rdf.geospecies.org/ont/geospecies#GeoSpecies_Knowledge_Base_Project
http://lod.geospecies.org/ses/lexzO.rdf http://purl.org/dc/terms/creator http://rdf.geospecies.org/ont/people.owl#Peter_J_DeVries
http://lod.geospecies.org/ses/lexzO.rdf http://purl.org/dc/terms/description “GeoSpecies Knowledge Base RDF: Species Scylaceus pallidus”
http://lod.geospecies.org/ses/lexzO.rdf http://purl.org/dc/terms/identifier “http://lod.geospecies.org/ses/lexzO.rdf”
http://lod.geospecies.org/ses/lexzO.rdf http://purl.org/dc/terms/language “en”
http://lod.geospecies.org/ses/lexzO.rdf http://purl.org/dc/terms/isPartOf http://lod.geospecies.org/ontology/void#this
http://lod.geospecies.org/ses/lexzO.rdf http://xmlns.com/foaf/0.1/primaryTopic http://lod.geospecies.org/ses/lexzO
http://lod.geospecies.org/ses/lexzO.rdf http://purl.org/dc/terms/modified “2010-07-15T12:42:09-0500”
http://lod.geospecies.org/ses/lexzO.rdf http://creativecommons.org/ns#license http://creativecommons.org/licenses/by-sa/3.0/us/
10 result(s) (141 ms)
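
As a quick sanity check, you can also count how many triples actually ended up in the database by running an aggregate query straight against the HTTP endpoint; a minimal sketch:

curl -X POST --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }' http://<your-neptune-endpoint>:8182/sparql

The number should roughly match the totalRecords value reported by the loader.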

Query all GeoSpecies kingdoms and their common names:

neptune> sparql

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geospecies: <http://rdf.geospecies.org/ont/geospecies#>
SELECT DISTINCT ?kingdom ?commonName WHERE{
	
	?node geospecies:hasKingdomName ?kingdom;
	a geospecies:KingdomConcept;
	geospecies:hasCommonName ?commonName . 
	
}
.

Result

kingdom commonName
“Animalia” “Animals”
“Plantae” “Plants”
“Fungi” “Fungus”
“Archaea” “Archaea”
“Protozoa” “Protozoa”
“Bacteria” “Bacteria”
“Chromista” “Chromista”
“Viruses” “Viruses”
8 result(s) (35 ms)
Note the "sparql" in the beginning and "." in the end of the query. "sparql" command tells the console the query will be multiline and "." denotes the end of the query.

Query the common names of all true owl species (family Strigidae):

neptune> sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geospecies: <http://rdf.geospecies.org/ont/geospecies#>
SELECT DISTINCT ?commonName WHERE{
	
	?node geospecies:hasFamilyName ?familyName;
	geospecies:hasCommonName ?commonName;
	FILTER regex(?familyName, "Strigidae") .	
} LIMIT 10
.

Result

commonName
“Northern Saw-whet Owl”
“Boreal Owl”
“Buff-fronted Owl”
“African Long-Eared Owl”
“Short-eared Owl”
“Madagascar Owl”
“Long-eared Owl”
“Stygian Owl”
“Spotted Owlet”
“Burrowing Owl”
10 result(s) (106 ms)

Conclusion

Amazon is rolling out their own service for graph databases, which means another set of self-hosted solutions can be moved into the cloud to benefit from the managed infrastructure, autoscaling and easy backups.

There are some limitations in the preview (like the inability to change the instance mode after the first request), but the service looks great in general.

In this introductory post we haven’t utilized the full power of AWS Neptune and have only executed very basic queries. Part 2 of the article will be even more exciting, since we’ll describe how we managed to import the Bitcoin graph into AWS Neptune. As a teaser, here is the loader status from a much bigger import:

{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_COMPLETED" : 2358
            }
        ],
        "overallStatus" : {
            "fullUri" : "...",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_COMPLETED",
            "totalTimeSpent" : 99481,
            "totalRecords" : 3849219295,
            "totalDuplicates" : 6,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}

We hope you enjoyed the read! Questions and comments are highly appreciated.
