Channel: python Archives - Graph Database & Analytics

py2neo 3.1: The World’s Most Amazing Python Driver for Neo4j

Learn about py2neo 3.1, a community Python driver for Neo4j, including the new Object-Graph Mapper.

Even though we’ve now released officially supported drivers for Java, Python, JavaScript and .NET, many of the community drivers are still going strong. Indeed, version 3.1 of my own community driver py2neo was released this week, and with it came a brand-new OGM for Python users.

An OGM (Object-Graph Mapper) is to a graph database what an Object-Relational Mapper (ORM) is to a traditional RDBMS: a framework on which database-aware domain objects can be built.

The py2neo OGM centres its operation around the new GraphObject class. This acts as both a base class upon which domain objects can be defined and a manager for the underlying node and relationships that persist it.

Take for example the Movie Graph that comes pre-packaged with Neo4j. We could model a Person from this dataset as below:

class Person(GraphObject):
    __primarykey__ = "name"

    name = Property()
    born = Property()

Here, we define a Person class with two properties. Properties in Neo4j have no fixed type so there’s less to define than there would be for a SQL field in a typical ORM.

We’re also using the same names for the class attributes as we do for the underlying properties: name and born. If necessary, these could be redirected to a differently-named property with an expression such as Property(name="actual_name").

Lastly, we define a __primarykey__. This tells py2neo which property should be treated as a unique identifier for push and pull operations. We could also define a __primarylabel__, although by default the class name, Person, will be used instead.

All of this means that behind the scenes, the node for a specific Person object could be selected using a Cypher statement such as:

MATCH (a:Person) WHERE a.name = {n} RETURN a
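As a rough sketch of how such a statement might be assembled from __primarylabel__ and __primarykey__ (a hypothetical helper for illustration, not py2neo's actual internals):

```python
# Hypothetical sketch, not py2neo source: combine the primary label and
# primary key into the selection statement shown above.
# {n} is the parameter placeholder syntax used by Neo4j 3.x.
def selection_cypher(primary_label, primary_key):
    return "MATCH (a:%s) WHERE a.%s = {n} RETURN a" % (primary_label, primary_key)

print(selection_cypher("Person", "name"))
# MATCH (a:Person) WHERE a.name = {n} RETURN a
```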

Broadening out a little, if we wanted to model both Person and Movie from that same dataset, as well as the relationships that connect them, we could use the following:

class Movie(GraphObject):
    __primarykey__ = "title"

    title = Property()
    tagline = Property()
    released = Property()

    actors = RelatedFrom("Person", "ACTED_IN")
    directors = RelatedFrom("Person", "DIRECTED")
    producers = RelatedFrom("Person", "PRODUCED")

class Person(GraphObject):
    __primarykey__ = "name"

    name = Property()
    born = Property()

    acted_in = RelatedTo(Movie)
    directed = RelatedTo(Movie)
    produced = RelatedTo(Movie)

This introduces two new attribute types: RelatedTo and RelatedFrom. These define sets of related objects that are all connected in a similar way. That is, they share a common start or end node plus a common relationship type.

Take for example acted_in = RelatedTo(Movie). This describes a set of related Movie nodes that are all connected by an outgoing ACTED_IN relationship. Note that like the property name above, the relationship type defaults to match the attribute name itself, albeit upper-cased. Conversely, the corresponding reverse definition, actors = RelatedFrom("Person", "ACTED_IN"), specifies the relationship name explicitly as this differs from the attribute name.
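The defaulting rule can be sketched as a tiny helper (my reading of the behaviour described above, not py2neo source):

```python
# Sketch of the defaulting rule: the relationship type falls back to the
# upper-cased attribute name unless one is given explicitly.
def rel_type(attr_name, explicit_type=None):
    return explicit_type if explicit_type is not None else attr_name.upper()

print(rel_type("acted_in"))            # ACTED_IN (defaulted)
print(rel_type("actors", "ACTED_IN"))  # ACTED_IN (explicit override)
```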

So how do we work with these objects? Let’s say that we want to pluck Keanu Reeves from the database and link him to the timeless epic Bill & Ted’s Excellent Adventure (sadly omitted from the original graph). First we need to select the actor using the GraphObject class method select via the Person subclass. Then, we can build a new Movie object, add this to the set of movies acted_in by the talented Mr Reeves and finally push everything back into the graph. The code looks something like this:

keanu = Person.select(graph, "Keanu Reeves").first()
bill_and_ted = Movie()
bill_and_ted.title = "Bill & Ted's Excellent Adventure"
keanu.acted_in.add(bill_and_ted)
graph.push(keanu)

All related objects become available to instances of their parent class through a set-like interface, which offers methods such as add and remove. When these details are pushed back into the graph, the OGM framework automatically builds and runs all the necessary Cypher to make this happen.
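As a minimal, hypothetical stand-in for that set-like interface (py2neo's real class also tracks relationship details and pending database changes):

```python
# A toy sketch of the set-like related-objects container described above;
# not py2neo's implementation, just the add/remove/membership behaviour.
class RelatedObjects:
    def __init__(self):
        self._members = set()

    def add(self, obj):
        self._members.add(obj)

    def remove(self, obj):
        self._members.discard(obj)

    def __contains__(self, obj):
        return obj in self._members

    def __len__(self):
        return len(self._members)

acted_in = RelatedObjects()
acted_in.add("Bill & Ted's Excellent Adventure")
acted_in.add("The Matrix")
acted_in.remove("The Matrix")
print(len(acted_in))  # 1
```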

More complex selections are possible through the select method as well. The where method can make use of any expression that can be used in a Cypher WHERE clause. For example, to output the names of every actor whose name starts with ‘K’, you could use:

for person in Person.select(graph).where("_.name =~ 'K.*'"):
    print(person.name)

Note that the underscore character is used here to refer to the node or nodes being matched.

There’s a lot more information available in the py2neo documentation and there’s also a demo application in the GitHub repository that shows how this all comes together in a mini movie browser (screenshot below).

The sample movie web app that comes with py2neo 3.1


As always, if you have any questions about py2neo or the official drivers, I’ll try my best to help. My contact details can be found somewhere on this page probably.


Want to learn more about graph databases and Neo4j? Click below to register for one of our online training classes, Introduction to Graph Databases or Neo4j in Production, and get up to speed with graph database technology.


My Neo4j Summer Road Trip to the World of Healthcare [Part 2]


Part 2: Visualize XML Files in Neo4j with APOC


Welcome back to my Neo4j summer adventure. In my previous post, I gathered all the available data and explored how to model the data into a healthcare graph. Starting with this post, I will be focusing on loading the data into the healthcare graph.

As a Neo4j newbie, before starting the ETL, I researched the methods people have been using to transform XML data into Neo4j graph data. Most of them converted the XML files to CSV first and then loaded the data into Neo4j. While I was teaching myself Cypher, I discovered that APOC allows me to extract information from XML and load it directly into a graph. However, few blog posts document this procedure, so why not try the new way – it wouldn’t be a real adventure without some fun exploration, would it?

In this week’s blog post, I want to show you how I load XML files into a graph using APOC. I will be working with lobbying disclosures and contributions data, and by the end of this post you will see some fun queries I created to gain interesting insights into how the healthcare system is influenced by lobbying.

Now let’s begin our adventure for this week!

1. Getting Ready


  1. Download the data into a directory. In this project, I am working with XML data from 2013. The contributions contain 87.5MB of data and the disclosures contain 894.9MB. You can download the same data here:
  2. Download the latest APOC:
  3. Install the Python driver py2neo:

     $ pip install py2neo

2. Data Integration


    Now we are ready to go. Though Neo4j is schema-less, having a clear structure for the graph helps you determine where to go. It’s like a map or compass, and this is especially true when I need to traverse an XML tree structure to access the child elements.

    Now let’s take a look at the map of where we will be going for this week:

    Part 2 of using Neo4j to graph the healthcare industry. This week: XML and lobbying disclosures


    Nodes :Issue, :Disclosure and :Client will be extracted from the disclosure XML files, and nodes :Legislator, :Committee, :Contribution and :Contributor will be extracted from the contribution XML files. Both the disclosure and contribution XML data contain information about the :LobbyFirm and :Lobbyist nodes, so I will use MERGE statements to create :LobbyFirm and :Lobbyist and prevent duplicates.

    Now let me show you how I processed disclosure XML using APOC. (You can find the whole ETL python code here.)

    A. Accessing Child Elements of XML in APOC


    Let me start off by showing you the structure of the disclosure XML files.

    The XML file structure of lobbying disclosures


    APOC allows me to access the child elements of <LOBBYINGDISCLOSURE2>. Here is the Cypher statement to extract the properties of :LobbyFirm (in orange):

    CALL apoc.load.xml('file:///2013_1stQuarter_XML/300529228.xml') 
    YIELD value
    WITH [attr in value._children 
    WHERE attr._type in ['organizationName', 'address1', 'city', 'state', 'zip', 'country', 'houseID'] | [attr._type, attr._text]] as pairs 
    CALL apoc.map.fromPairs(pairs) 
    YIELD value as properties
    RETURN properties
    

    The query returns this:

    An APOC Cypher query on lobbying disclosure data


    The way of calling APOC to extract properties for other nodes is very similar; you can find every single detail of my Python code here. In this project, when creating nodes :Issue and :Lobbyist, I have to deal with more complicated parent-child structures (as you can see from the XML map above, <Lobbyists> and <issueAreaCode> are siblings, and <Lobbyists> has children <Lobbyist>; I maintained this structure in the healthcare graph).

    If you are facing a similar problem, the collect() function will be helpful. I used it to aggregate properties (labeled in yellow and blue) into a list, then access the desired properties by indexing.
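To see the kind of [type, text] pairs the Cypher above works with, here is a stdlib sketch that walks a made-up miniature of the disclosure XML with xml.etree instead of APOC (the XML snippet is invented, not real data):

```python
import xml.etree.ElementTree as ET

# A stdlib analogue of what apoc.load.xml exposes: each child element's
# tag and text, collected into the same [type, text] pairs used above.
xml_doc = """
<LOBBYINGDISCLOSURE2>
  <organizationName>Example Firm LLP</organizationName>
  <city>Austin</city>
  <state>TX</state>
</LOBBYINGDISCLOSURE2>
"""

root = ET.fromstring(xml_doc)
wanted = {"organizationName", "city", "state"}
pairs = [[child.tag, child.text] for child in root if child.tag in wanted]
properties = dict(pairs)
print(properties)
# {'organizationName': 'Example Firm LLP', 'city': 'Austin', 'state': 'TX'}
```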

    Now let’s run the query from the Python driver; I used py2neo in my project:

    query = '''
        CALL apoc.load.xml({file})
        YIELD value
        WITH [attr in value._children
        WHERE attr._type in ['organizationName', 'firstName', 'lastName', 'address1', 'city', 'state', 'zip', 'country', 'houseID'] | [attr._type, attr._text]] as pairs
        CALL apoc.map.fromPairs(pairs)
        YIELD value as properties
        RETURN properties
        '''
    properties = g.run(query, file='file:///2013_1stQuarter_XML/300529228.xml').evaluate()
    print(properties)
    print('type of properties:', type(properties))
    

    Result:

    {'city': 'Austin', 'organizationName': 'Tuggey Fernandez LLP', 'country': 'USA', 'firstName': None, 'houseID': '416750001', 'state': 'TX', 'address1': '611 South Congress Avenue, Suite 340', 'zip': '78704', 'lastName': None, 'address2': None}
    type of properties: <class 'dict'>
    

    Running the Cypher query returns a cursor object. In this case, I know there is only one value, properties, being returned, so I can call the evaluate() method, which returns the value from the cursor. As we can see, evaluate() turns the result into a dictionary, which is very easy to work with in Python.

    Knowing how to extract information using APOC and understanding the return value, I next define a Python function that cleans the data and returns a dictionary of properties of :LobbyFirm. Cypher supports some powerful string processing functions which can also be used to clean the data.

    One more thing to notice here: I only extract properties when the data is valid, since NULL-valued properties should not be stored in Neo4j.
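That NULL-dropping rule boils down to a dictionary comprehension, sketched here on an invented record:

```python
# Drop None-valued entries so no NULL properties reach Neo4j
# (illustrative record, not real disclosure data).
raw = {"city": "Austin", "state": "TX", "firstName": None, "lastName": None}
clean = {key: value for key, value in raw.items() if value is not None}
print(clean)  # {'city': 'Austin', 'state': 'TX'}
```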

    def get_LobbyFirm_property(file):
        '''
        :param file: the xml file path to be parsed
        :return: a dict of properties of LobbyFirm
        '''
        query = '''
            CALL apoc.load.xml({file})
            YIELD value
            WITH [attr in value._children
            WHERE attr._type in ['organizationName', 'firstName', 'lastName', 'address1',
            'address2', 'city', 'state', 'zip', 'country',
            'houseID'] | [attr._type, attr._text]] as pairs
            CALL apoc.map.fromPairs(pairs)
            YIELD value as properties
            RETURN properties
            '''
        pre_property = g.run(query, file=file).evaluate()
        props = {}
        # name: prefer the organisation name, else join first and last names
        if pre_property['organizationName'] is None and pre_property['firstName'] is not None and pre_property['lastName'] is not None:
            props['name'] = pre_property['firstName'] + ' ' + pre_property['lastName']
        elif pre_property['organizationName'] is not None:
            props['name'] = pre_property['organizationName']
        # address: join address1 and address2 when both are present
        if pre_property['address1'] is not None and pre_property['address2'] is not None:
            props['address'] = pre_property['address1'] + ' ' + pre_property['address2']
        elif pre_property['address1'] is not None and pre_property['address2'] is None:
            props['address'] = pre_property['address1']
        # city
        if pre_property['city'] is not None:
            props['city'] = pre_property['city']
        # state
        if pre_property['state'] is not None:
            props['state'] = pre_property['state']
        # country defaults to USA when missing
        if pre_property['country'] is None:
            props['country'] = 'USA'
        else:
            props['country'] = pre_property['country']
        # zip
        if pre_property['zip'] is not None:
            props['zip'] = pre_property['zip']
        # houseOrgId: first five digits of houseID
        if pre_property['houseID'] is not None:
            props['houseOrgId'] = pre_property['houseID'][:5]
        return props
    

    B. Use MERGE and CREATE Statements to Load Data into Neo4j


    def create_LobbyFirm_node(properties):
        '''
        :param properties: a dict of properties of the node
        :return: node internal id
        '''
        query = '''
            MERGE (lbf:LobbyFirm {houseOrgId: {houseOrgId}})
            ON CREATE SET lbf = {properties}
            RETURN id(lbf)
            '''
        index = '''
            CREATE INDEX ON :LobbyFirm(houseOrgId)
            '''
        id = g.run(query, houseOrgId=properties['houseOrgId'], properties=properties).evaluate()
        g.run(index)
        return id
    

    I decided to create the :LobbyFirm node by merging on houseOrgId, which is a unique five-digit number for each lobbying firm.

    The MERGE statement prevents duplicates in the graph. It’s good practice to merge on only one property of a node: when merging on more than one property, an existing node is reused only if ALL the values match; otherwise, a duplicate will be created.

    For example, MERGE (lbf:LobbyFirm {houseOrgId: "12345", firmName: "ABCD"}) is like saying “Find me a node labeled :LobbyFirm whose houseOrgId is 12345 AND whose firmName is ABCD. If no such node is found, create a new node with houseOrgId 12345 and firmName ABCD”.

    In this case, more than one node with the same houseOrgId may end up being created. Here is a great blog post that cleared up my confusion about when to use MERGE vs. CREATE.
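The pitfall can be illustrated with an in-memory analogy (plain Python, not Cypher): a node is reused only when all given properties match, otherwise a duplicate appears.

```python
# Toy analogy of Cypher's MERGE semantics on a list of dicts.
def merge_node(nodes, props):
    for node in nodes:
        if all(node.get(k) == v for k, v in props.items()):
            return node          # full match: reuse the node
    nodes.append(dict(props))    # partial or no match: create a new one
    return nodes[-1]

graph = []
merge_node(graph, {"houseOrgId": "12345", "firmName": "ABCD"})
merge_node(graph, {"houseOrgId": "12345", "firmName": "WXYZ"})  # duplicate!
print(len(graph))  # 2 nodes now share houseOrgId 12345

graph2 = []
merge_node(graph2, {"houseOrgId": "12345"})
merge_node(graph2, {"houseOrgId": "12345"})
print(len(graph2))  # 1: merging on a single key prevents the duplicate
```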

    C. Create Relations Using Internal Node ID


    I have 72,002 disclosure files to be processed. As my Python code loops through each disclosure file, it needs to create relations among these nodes. A relationship is generated only when the two nodes are created within the same iteration. The graph created at each iteration looks like this:

    A graph data model for lobbying disclosure data


    Notice that in the previous code, where I created the :LobbyFirm node, I also returned its ID. This internal ID allows me to identify the new nodes created in that iteration, and thus I am able to generate relations for these nodes.

     
    lf_dc_rel = g.run(
       '''MATCH (dc:Disclosure) WHERE id(dc) = {dc_id}
       MATCH (lf:LobbyFirm) WHERE id(lf) = {lf_id}
       CREATE (lf)-[r:FILED]->(dc)
       ''', dc_id = dc_id, lf_id = lf_id
    )
    

    Here dc_id and lf_id are passed as parameters, representing the internal IDs of the :Disclosure and :LobbyFirm nodes respectively.

    There are some limitations when using internal node IDs to identify nodes. You need to be careful, especially when deleting an existing node: the ID of a deleted node may be reused when a new node is created.

    In this case, you can use a plugin called UUID which “assigns UUIDs to newly created nodes and relationships in the graph and makes sure nobody can (accidentally or intentionally) change or delete them.”
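If you only need stable keys on the client side, Python’s stdlib uuid module captures the same idea in one line (a sketch of the concept, not the plugin itself):

```python
import uuid

# Give each node a random identifier that survives deletion, instead of
# relying on Neo4j's reusable internal ids.
node_key = str(uuid.uuid4())
print(node_key)       # a random 36-character id, different each run
print(len(node_key))  # 36
```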

    3. Visualize the Healthcare Graph in Neo4j


    Each year, corporations spend billions of dollars to gain access to government decision-makers, and healthcare organizations are no exception. One of the purposes of my project is to connect these organizations with the legislators by modeling the lobbying system.

    Now that I have all of the lobbying data loaded into Neo4j, I would love to find out how the healthcare industry (or any other group) is influenced by the lobbying system.

    First, let’s figure out the general lobbying issues in 2013:

    MATCH (n:Issue) RETURN DISTINCT n.issueAreaCode ORDER BY n.issueAreaCode
    

    The query returns 79 unique issue area codes in the disclosures. You can refer to the general lobbying issue code to find out what these issues are. Here are the top 10 general lobbying issues in 2013:

    MATCH (n:Issue) RETURN n.issueAreaCode, count(n) AS num ORDER BY num DESC LIMIT 10
    

    The top ten issues in healthcare lobbying


    HCR (Health Issues) and MMM (Medicare/Medicaid) are the two areas that I am most interested in, and we can see there were 9988 HCR issues and 5016 MMM issues being lobbied in 2013.

    Just for personal curiosity, I also wanted to know how many issues being lobbied are related to gun control in 2013, and here is a screenshot for my discovery:

    Results of a Cypher query on gun control and healthcare


    Second, find me the lobbying firms and lobbyists who lobby for Medicare and Medicaid issues:

    MATCH (lf:LobbyFirm)<-[:WORKS_AT]-(lob: Lobbyist)-[:LOBBIES]->(iss: Issue {issueAreaCode:'MMM'})
    RETURN lf.houseOrgId as Firm_ID, lob.firstName as First_Name, lob.lastName as Last_Name, iss.issueAreaCode as Issue, iss.description as Description LIMIT 8
    

    Healthcare lobbying issues related to Medicare and Medicaid


    Next, tell me who are the clients that signed disclosures with lobby firms for Medicare and Medicaid issues?

    MATCH (cl:Client)-[:SIGNED]->(dc:Disclosure)-[:HAS]->(iss:Issue{issueAreaCode: "MMM"})
    WITH cl, dc, iss
    MATCH (lf:LobbyFirm)-[:FILED]->(dc), (lob:Lobbyist)-[:LOBBIES]->(iss)
    RETURN distinct(cl.clientName) as Client, lf.houseOrgId as Firm_ID, lob.firstName as First_Name, lob.lastName as Last_Name LIMIT 25
    

    Healthcare lobbying disclosures filed for Medicare and Medicaid


    To visualize the result in a graph:

    A graph visualization of healthcare lobbying disclosures filed for Medicare and Medicaid


    We can see there are five clients who signed a disclosure with lobby firm No. 31603 for Medicare-related issues. All of the relevant issues are lobbied by Marshall.

    Now, let’s find out – for the lobbyists and lobby firms involved in lobbying Medicare and Medicaid issues – how much they contributed to government leaders and who received these contributions.

    MATCH (lf:LobbyFirm)<-[:WORKS_AT]-(lob: Lobbyist)-[:LOBBIES]->(iss: Issue {issueAreaCode:'MMM'})
    WITH lob, lf
    MATCH (lob)-[:FILED]->(cb:Contribution)-[:MADE_TO]->(com:Committee)-[:FUNDS]->(leg:Legislator)
    OPTIONAL MATCH (lf)-[:FILED]->(cb)-[:MADE_TO]->(com)-[:FUNDS]->(leg)
    RETURN lf.city as City, lf.houseOrgId as Firm_ID, lf.name as Firm_Name, 
    lob.firstName as FirstName, lob.lastName as LastName, cb.amount as Amount, cb.date as Date, leg.name as Legislator LIMIT 50
    

    Lobbyist contributions related to Medicare and Medicaid: who contributed, and how much


    What does the result look like in our healthcare graph?

    A graph visualization of healthcare lobbying contributions related to Medicare and Medicaid


    It is much easier to read the results as a graph in Neo4j!

    Finally, how are healthcare organizations connected to legislators?

    MATCH (cl:Client{clientName:'Pharmaceutical Research and Manufacturers of America (PhRMA)'})-[:SIGNED]->(dc:Disclosure)-[:HAS]->(iss:Issue{issueAreaCode:'MMM'})<-[:LOBBIES]-(lob:Lobbyist)-[:WORKS_AT]->(lf:LobbyFirm)
    WITH cl,dc,iss,lob,lf
    MATCH (lob)-[:FILED]->(cb:Contribution)-[:MADE_TO]->(com:Committee)-[:FUNDS]->(leg:Legislator)
    OPTIONAL MATCH (lf)-[:FILED]->(cb)-[:MADE_TO]->(com)-[:FUNDS]->(leg)
    RETURN cl,dc,iss,lob,lf,cb,com,leg LIMIT 300
    

    A graph of connections between healthcare lobbyists


    This looks amazingly interesting. Let’s take a closer look at the graph:

    A closer look at the graph of connections between healthcare lobbyists


    From the graph I can tell that in 2013, the lobbyist Drew Goesl lobbied a Medicare issue for Pharmaceutical Research and Manufacturers of America (PhRMA) which specifically focuses on “Legislative issues related to access to pharmaceuticals, including Medicare Part D, and Children’s Health Insurance Program (CHIP), rebates in Medicaid and for dual-eligibles; comparative effectiveness; 340B Drug Program; Medicare Part B prescription drug reimbursement, and related provisions.”

    During the same year, the lobbyist Drew Goesl made contributions to several committees that fund legislators including James Lee Witt, Patrick Murphy, William Lewis Owens, Mike McIntyre, Mark Pryor, Cory Booker, John Larson, Linda Forrester, Edward Perlmutter, James Matheson, Joseph Crowley, Harry Reid, Suzan DelBene, Scott Peters and Edward J. Markey.

    Due to the data limitation, I cannot draw a conclusion that PhRMA and the legislators mentioned above have direct connections. However, the healthcare graph is helpful for the public to trace and integrate information just like this.

    You may also have noticed there is a bug in my model: I have tons of duplicate nodes for the same legislator. This is because the data is not consistent. Real-world data is not as friendly and tidy as it might be in an academic scenario.

    Conclusion


    In the next few blog posts, I will demonstrate how to process strings and how to match nodes when you have messy and limited data sources. Next week, I will start working on provider prescription data and show you some tricks I used to load the large CSV files I downloaded from the FDA and CMS websites. I hope you enjoyed the second post in this series – stay tuned for more excitement to come!


    Ready to dig in and get started with graph databases? Click below to download this free ebook, Learning Neo4j, and get up to speed with the world’s leading graph database.



    Catch up with the rest of the Neo4j in healthcare series:

    My Neo4j Summer Road Trip to the World of Healthcare [Part 3]


    Part 3: Cleaning CSV Files in Bash


    Hi friends and welcome back to my summer road trip through the world of healthcare. For those who are new to my adventure, this is the third part of the blog series. Catch up right here with Part 1 and Part 2.

    I am using Neo4j to connect the multiple stakeholders of healthcare and hope to gain some interesting insights into the healthcare industry by the end of my exploration. This blog series demonstrates the entire process from data modeling and ETL to exploratory analysis and more. In the previous two posts, I discussed data modeling and how to integrate XML data into Neo4j using APOC; you can find every detail about the project on GitHub.

    This week, I will be working with CSV files. If you are using Neo4j for the first time (like me), I can tell you honestly that loading CSV files into Neo4j is a lot easier than loading XML files. But don’t get too optimistic unless your data is perfectly clean. Now, let me show you the steps I used to successfully load the CSV files.

    1. Get the Data


    This week, our data covers information on drugs, drug manufacturers, providers and prescriptions. You can download the same data from these sources: As the healthcare provider data gave me the most problems, I will use it as the demonstration in this blog post.

    2. Display the Data


    A. What does the data look like?

    head npidata_20050523-20160612.csv
    

    Wow, the data looks a little bit crazy, and because of that, I will not overwhelm you by copying the result here. However, I learned three characteristics about the data by displaying the first 10 rows:
      • The data has a header, and the header contains white space
      • The data has many columns (we will find out how many soon)
      • The data has a lot of empty values
    B. How many rows are in the data?

    wc -l npidata_20050523-20160612.csv
    

    Results:
     4923673 npidata_20050523-20160612.csv
    

    Each row of the data represents a registered provider in the United States from 2005 to 2016.

    C. How many columns are in the data?

    head -n 1 npidata_20050523-20160612.csv|awk -F',' '{print NF}'
    

    Results:
    329
    

    Now you see why I said the data is a little bit crazy. But don’t panic – most of these columns contain no values, and we only need to extract a few of them to load into my healthcare graph.

    D. Remove the header from the data.

    sed 1d npidata_20050523-20160612.csv > provider.csv
    

    This will delete the first line and save the content to a new file named provider.csv. The original file will not be changed.

    Removing the header from your file is optional, because Cypher can load a CSV file with a header and refer to columns by their header names. Here is a great walkthrough tutorial on loading CSV files into Neo4j.
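The same header-keyed access can be sketched in plain Python with csv.DictReader; the two-row sample and its column names below are invented stand-ins, since the real file has 329 columns:

```python
import csv
import io

# Header-based access, analogous to LOAD CSV WITH HEADERS in Cypher.
sample = io.StringIO(
    "NPI,Entity Type Code,Provider City\n"
    "1234567890,1,AUSTIN\n"
)
rows = list(csv.DictReader(sample))
print(rows[0]["NPI"], rows[0]["Provider City"])  # 1234567890 AUSTIN
```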

    3. Load CSV into Neo4j


    A. Display the CSV in Neo4j

    LOAD CSV FROM 'file:///provider.csv' AS col
    RETURN col[0] as npi, col[1] as entityType, col[20]+col[21] as address, col[22] as city, col[23] as state, col[24] as zip, col[25] as country, col[5] as lastName, col[6] as firstName, col[10] as credential, col[41] as gender, col[4] as orgName
    limit 10
    

    *npi: National Provider Identifier
    Healthcare CSV data in Neo4j


    In the figure above, I only displayed the columns that I will load into the healthcare graph.

    B. Load CSV into Neo4j

    Part 3 of using Neo4j to graph the healthcare industry. This week: Cleaning up CSV data of providers


    I want to create :Provider nodes with these properties: npi, entityType, address, city, state, zip and country.

    When entityType is 1, I add the properties lastName, firstName, credential and gender to the node. When entityType is 2, I add the property orgName.

    Here is the Cypher query that executes the above data model and rules:

    LOAD CSV FROM 'file:///provider.csv' AS col
    CREATE (pd:Provider {npi: col[0], entityType: col[1], address: col[20]+col[21], city: col[22], state: col[23], zip: col[24], country: col[25]})
    FOREACH (row in CASE WHEN col[1]='1' THEN [1] else [] END | SET pd.firstName=col[6], pd.lastName = col[5], pd.credential= col[10], pd.gender = col[41])
    FOREACH (row in CASE WHEN col[1]='2' THEN [1] else [] END | SET pd.orgName=col[4])
    

    The FOREACH clause applies an update to each element of a collection. Here I use CASE WHEN to turn each row into either a one-element collection or an empty one: for rows with col[1] = '1', the first FOREACH sets the firstName, lastName, credential and gender properties; for rows with col[1] = '2', the second FOREACH sets orgName.
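The same branching can be expressed in plain Python over a couple of invented rows, which may make the CASE WHEN trick easier to follow:

```python
# Entity type 1 is an individual, entity type 2 is an organization
# (invented sample rows, not real provider data).
rows = [
    {"entityType": "1", "firstName": "JANE", "lastName": "DOE", "orgName": ""},
    {"entityType": "2", "firstName": "", "lastName": "", "orgName": "ACME CLINIC"},
]

providers = []
for col in rows:
    pd = {"entityType": col["entityType"]}
    if col["entityType"] == "1":
        pd["firstName"], pd["lastName"] = col["firstName"], col["lastName"]
    elif col["entityType"] == "2":
        pd["orgName"] = col["orgName"]
    providers.append(pd)

print(providers[0])  # {'entityType': '1', 'firstName': 'JANE', 'lastName': 'DOE'}
print(providers[1])  # {'entityType': '2', 'orgName': 'ACME CLINIC'}
```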

    C. Fix the fields containing delimiters

    Running the Cypher query above returns an error:

    At /Users/yaqi/Documents/Neo4j/test_0802/import/provider.csv:113696 -  there's a field starting with a quote 
    and whereas it ends that quote there seems to be characters in that field after that ending quote. 
    That isn't supported. This is what I read: 'PRESIDENT","9'
    

    Let’s take a look at the problematic line from the terminal:

    sed -n "113697 p" provider.csv
    

    Results:
    "1790778355","2","","","BERNARD J DENNISON JR DDS PA","","","","","","","","","","","","","","","",
    "908 N SANDHILLS BLVD","","ABERDEEN","NC","283152547","US","9109442383","9109449334","908 N SANDHILLS BLVD","
    ","ABERDEEN","NC","283152547","US","9109442383","9109449334","08/29/2005","07/08/2007",
    "","","","","DENNISON","BERNARD","J","PRESIDENT\","9109442383","1223G0001X","4629","NC","Y","",
    "","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","DR.","JR.","DDS","193400000X SINGLE SPECIALTY  GROUP","","","","","","","","","","","","","",""
    

    The problem is the \ character at the end of the field PRESIDENT\: it escapes the closing double quote, so when loading the file Cypher cannot tell where the field ends and gets confused about how to map the fields.

    Now let’s see if other rows contain the same problem:

    grep '\\' provider.csv | wc -l
    

    The command searches for the \ character in the file and counts the lines which contain the pattern that we are looking for. The result is 70.
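For comparison, the same count and fix can be done in pure Python, shown here on a two-line stand-in for provider.csv (invented rows):

```python
# Count lines containing a backslash, then replace \ with / --
# the Python equivalents of the grep and tr commands used here.
lines = ['"PRESIDENT\\","9109442383"', '"TREASURER","5551234567"']

bad = sum(1 for line in lines if "\\" in line)
print(bad)  # 1

cleaned = [line.replace("\\", "/") for line in lines]
print(cleaned[0])  # "PRESIDENT/","9109442383"
```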

    There are many ways to fix this problem. Below, I replace the \ with / and load it into a new file.

    tr "\\" "/" < provider.csv > provider_clean.csv
    

    Now let’s load the CSV file again. This time I am loading it from the Python client.

    import os
    from py2neo import Graph

    def create_provider_node(file, g):
        query = '''
            USING PERIODIC COMMIT 1000
            LOAD CSV FROM {file} AS col
            CREATE (pd:Provider {npi: col[0], entityType: col[1], address: col[20]+col[21], city: col[22], state: col[23], zip: col[24], country: col[25]})
            FOREACH (row in CASE WHEN col[1]='1' THEN [1] else [] END | SET pd.firstName=col[6], pd.lastName = col[5], pd.credential= col[10], pd.gender = col[41])
            FOREACH (row in CASE WHEN col[1]='2' THEN [1] else [] END | SET pd.orgName=col[4])
            '''
        index1 = '''
            CREATE INDEX ON :Provider(npi)
            '''
        g.run(index1)
        return g.run(query, file=file)

    pw = os.environ.get('NEO4J_PASS')
    g = Graph("http://localhost:7474/", password=pw)
    file = 'file:///provider_clean.csv'
    # note: a USING PERIODIC COMMIT query must run in its own auto-commit
    # transaction, so no explicit transaction is opened here
    create_provider_node(file, g)
    

    By using PERIODIC COMMIT, you can set the number of rows committed per transaction. This helps prevent using a large amount of memory when loading large CSV files.
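PERIODIC COMMIT batches rows on the server; the same chunking idea on the client side can be sketched as a small generator (batch size 1000, to mirror the query above):

```python
# Yield rows in fixed-size batches, committing one batch at a time.
def batches(rows, size=1000):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # the final, possibly smaller batch

chunks = list(batches(range(2500), size=1000))
print([len(chunk) for chunk in chunks])  # [1000, 1000, 500]
```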

    4. Conclusion


    Now, I have successfully loaded the healthcare provider data into Neo4j. The process of loading the drug, drug manufacturer and prescription data is very similar. I also created the WRITES relationship between the :Provider and :Prescription nodes based on the NPI information contained in both files. By now, all the data is stored in the graph database.

    Let’s take a look at the healthcare graph data model again:

    A graph data model of the healthcare industry


    I hope you find this blog post helpful. Sometimes cleaning large CSV files can be tricky, but using the command line to manipulate the files can make the work go faster. In the next blog post, I will show you how to link data when you have limited resources. Specifically, I will demonstrate how I created the relationships (:Prescription)-[:PRESCRIBE]->(:Drug) and (:DrugFirm)-[:BRANDS]->(:Drug). Stay tuned, and I’ll see you soon!


    Ready to dig in and get started with graph technology? Click below to download this free ebook, Learning Neo4j, and get up to speed with the world’s leading graph database.



    Catch up with the rest of the Neo4j in healthcare series:

    My Neo4j Summer Road Trip to the World of Healthcare [Part 4]


    Part 4: Create Relationships with FuzzyWuzzy


    Welcome back to my adventure to the world of healthcare! In the past three blog posts, I have discussed the data model of the healthcare graph, loading XML data into Neo4j and cleaning CSV data in the command-line interface. In this post, I will demonstrate how to link data from multiple data sources, especially when there is a lack of foreign IDs to identify records.

    The healthcare graph consists of four groups of data, with each group of nodes generated from different data sources. Let’s start off by looking at the four groups of nodes and how I created relationships within the group.

    1. Lobby Disclosure Nodes Group


    This group is extracted from public lobbying disclosures, which includes these nodes from our healthcare data model: (:Disclosure), (:Client), (:Issue), (:Lobbyist), (:LobbyFirm), (:Contribution), (:Contributor), (:Committee) and (:Legislator). While extracting the nodes from the original XML files, node IDs are generated internally by Neo4j. As a result, I could use the internal node IDs to create relationships connecting these nodes. I have documented this process in my second blog post.

    2. Legislator Nodes Group


    This group includes nodes (:LegislatorInfo), (:State), (:Body) and (:Party), which are extracted from a single CSV file that can be downloaded here. Relationships are created through a single Cypher statement during the ETL process. All the ETL code can be found at GitHub here.

    3. Provider Prescription Nodes Group


    Nodes (:Prescription) and (:Provider) are generated from CMS data sources. I stored National Provider Identifiers as a property {NPI} for both nodes and used it to connect (:Provider) with (:Prescription).

    4. Drug Nodes Group


    Nodes (:GenericDrug) and (:Drug) are extracted from FDA data sources. Both nodes, along with (:Prescription), have the property {GenericName}. (:GenericDrug) is created as an intermediate node to represent each unique {GenericName} value in (:Drug).

    RxNorm, from the U.S. National Library of Medicine, provides a RESTful API that allows me to link clinical drug vocabularies to normalized names such as Rxcui, a unique drug identifier. I used the batch mode to send the {GenericName} of (:Prescription) and (:GenericDrug), retrieve the Rxcui, and connect these two nodes on the drug ID Rxcui.
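    To make the lookup concrete, here is a minimal sketch of building a single-name query against the RxNorm REST API. The base URL and endpoint path follow the public RxNav documentation rather than the author's original code, and the actual HTTP request is left commented out because it needs network access:

```python
from urllib.parse import urlencode

# Hypothetical sketch: form the lookup URL for one generic name.
# The endpoint path is an assumption from the public RxNav REST docs.
BASE = "https://rxnav.nlm.nih.gov/REST/rxcui.json"

def rxcui_lookup_url(generic_name):
    """Build the request URL for a single generic-name lookup."""
    return BASE + "?" + urlencode({"name": generic_name})

url = rxcui_lookup_url("morphine sulfate")
# A real call would look something like:
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url))
print(url)
```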

    Part 4 of using Neo4j to graph the healthcare industry: String matching for data relationships


    Nodes can be connected easily if they are extracted from a single file or from the same source, and standardized unique IDs make joining nodes even more convenient. However, pulling data from a variety of sources with only limited access to shared identifiers is a challenge that data journalists and data engineers often need to tackle.

    The issue I am trying to demonstrate in this post is how to create relationships to connect different groups of data. Specifically, how to link (:Client) and (:DrugFirm), (:DrugFirm) and (:Drug), as well as (:Legislator) and (:LegislatorInfo).

    Graph databases emphasize the representation of relationships among data points. Without the relationships, I wouldn't be able to create the whole path of the healthcare graph and would lose the ability to trace information along these paths. The idea behind connecting nodes such as (:Drug) and (:DrugFirm) is that if a drug is branded by a drug firm, a relationship should exist to connect the two nodes.

    Now let’s take a look at the properties of these two nodes; we may find some useful information.

    MATCH (d:Drug), (df:DrugFirm) RETURN d as Drug, df as DrugFirm LIMIT 25
    

    Data relationships between drugs and drug firms


    (:Drug) has a property {labelerName} and (:DrugFirm) has {firmName}. The logic for connecting a drug firm with a drug is that the {firmName} can be recognized as identical or similar to the {labelerName}. This may sound like an easy task for a human, but automating the process can be a little tricky.

    Luckily, I had heard of FuzzyWuzzy, a practical Python package that does string matching and returns a matching score. I decided to give it a try to match {firmName} with {labelerName}.

    1. Array Structure


    The first step before comparing two arrays of strings is to structure the array to make it easy to work with:

    #======= RETURN Drug objects: list of dicts, keys: labelerName, id ======#
    q1 = '''
    MATCH (d:Drug)
    RETURN id(d), d.labelerName
    '''
    drug_obj = g.run(q1)
    drugs_lst = []
    for record in drug_obj:
        drug_dic = {}
        drug_dic['id'] = record['id(d)']
        drug_dic['labelerName'] = record['d.labelerName']
        drugs_lst.append(drug_dic)

    #======= RETURN DrugFirm objects: list of dicts, keys: firmName, id ======#
    q2 = '''
    MATCH (df:DrugFirm)
    RETURN id(df), df.firmName'''
    df_obj = g.run(q2)
    df_lst = []
    for record in df_obj:
        df_dic = {}
        df_dic['id'] = record['id(df)']
        df_dic['firmName'] = record['df.firmName']
        df_lst.append(df_dic)
    

    Here I returned an array of dictionaries for both (:Drug) and (:DrugFirm). Each dictionary represents an object with two keys: the node internal ID and {labelerName} / {firmName}. The node internal ID is used later to fetch the nodes we want, which we will talk about shortly. The {labelerName} and {firmName} are the strings that we will compare. Now let’s print out the data structure:

    DrugFirm:
    [{'id': 23049075, 'firmName': 'Teva Branded Pharmaceutical Products R&D, Inc.'},
    {'id': 23049076, 'firmName': "George's Family Farms, LLC"},
    {'id': 23049077, 'firmName': 'Baxter Healthcare S.A.'},
    {'id': 23049078, 'firmName': 'Tokuyama Corporation'},
    {'id': 23049079, 'firmName': 'Alps Pharmaceutical Ind. Co., Ltd.'}]
    

    Drug:
    [{'id': 22941414, 'labelerName': 'Eli Lilly and Company'},
    {'id': 22941415, 'labelerName': 'Eli Lilly and Company'},
    {'id': 22941416, 'labelerName': 'Eli Lilly and Company'},
    {'id': 22941417, 'labelerName': 'Eli Lilly and Company'},
    {'id': 22941418, 'labelerName': 'Eli Lilly and Company'}]
    

    2. String Preprocessing


    As we can see, some company names are in uppercase and some are in lowercase, and some of the names contain non-alphanumeric characters. There are also many duplicates in the array, which will slow down the string matching process.

    To improve the matching rate, I passed the arrays to a series of string processing functions to clean up the strings.

    #lower case: convert all to lower case
    lc_ln = lower_case(drugs_lst, 'labelerName')
    lc_fn = lower_case(df_lst, 'firmName')
    
    #remove_marks: remove non-alphanumeric characters
    rm_ln = rm_mark(lc_ln, 'labelerName')
    rm_fn = rm_mark(lc_fn, 'firmName')
    
    #Chop_end: remove ‘s’ at the end of a string
    ce_ln = chop_end(rm_ln, 'labelerName', 's')
    ce_fn = chop_end(rm_fn, 'firmName', 's')
    
    #sort_strings: sort words in a string
    sort_ln = sort_strings(ce_ln,'labelerName')
    sort_fn = sort_strings(ce_fn, 'firmName')
    
    #uniq strings: de-duplicate: collect node IDs into a list for each unique string
    uq_ln = uniq_elem(sort_ln, 'labelerName')
    uq_fn = uniq_elem(sort_fn, 'firmName')
    

    All the functions in the script can be found here. After processing the strings, I de-duplicated the objects by {labelerName} and {firmName}; the number of objects decreased from 106,683 to 5,928 for {labelerName} and from 10,205 to 7,040 for {firmName}.

    Now let’s print out the uq_ln and uq_fn to understand the structure before we move on.

    uq_ln:
    defaultdict(<class 'list'>,
    {'akorn llc stride': [22965573, 22965574, 22965575, 22965576],
    'brand inc silver star': [23040080, 23040084, 23040091, 23040097, 23040151, 23040169, 23040171, 23040172, 23040174],
    'co ltd osung': [23007551, 23007552, 23007553],
    'beauticontrol': [23013024],
    'biological glaxosmithkline sa': [23013557, 23013558, 23013559, 23013560, 23013561, 23013562, 23013563, 23013564, 23013565, 23013566, 23013567, 23013568, 23013569, 23013570, 23013571],…
    

    uq_fn:
    defaultdict(<class 'list'>,
    {'biological glaxosmithkline sa': [23053065],
    'bioservice capua spa': [23052723],
    'american homepatient': [23057681, 23057683, 23057686, 23057687, 23057688, 23057689, 23057690, 23057691, 23057692, 23057693, 23057694, 23057695, 23057696, 23057697, 23057698],
    'healthcare limited novarti private': [23056487],…
    

    I now have two defaultdicts to work with. The keys are the cleaned strings to be compared; the values are lists of node IDs for the nodes that contain the same string (i.e., the key). Now I can call FuzzyWuzzy to get string matching scores.
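    To make the two scores concrete before using them, here is a rough stdlib approximation of what `fuzz.ratio` and `fuzz.partial_ratio` measure. FuzzyWuzzy's real implementation is Levenshtein-based and differs in details, so treat this purely as an illustration:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Whole-string similarity, scaled to 0-100 (like fuzz.ratio, roughly).
    return round(100 * SequenceMatcher(None, a, b).ratio())

def partial_ratio(a, b):
    # Best score of the shorter string against same-length windows of the
    # longer one (the idea behind fuzz.partial_ratio, roughly).
    shorter, longer = sorted((a, b), key=len)
    best = 0
    for i in range(len(longer) - len(shorter) + 1):
        best = max(best, ratio(shorter, longer[i:i + len(shorter)]))
    return best

# 'llc pharmac' is an exact substring, so the partial score is 100
# even though the whole-string score is much lower.
print(partial_ratio('gavi llc pharmaceutical', 'llc pharmac'))  # -> 100
print(ratio('gavi llc pharmaceutical', 'llc pharmac'))
```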

    3. FuzzyWuzzy String Matching


    for k1 in uq_ln:
        labeler_name = k1
        nodeId_drug = uq_ln[k1]

        for k2 in uq_fn:
            company_name = k2
            nodeId_df = uq_fn[k2]
            r1 = fuzz.partial_ratio(labeler_name, company_name)
            r2 = fuzz.ratio(labeler_name, company_name)
            if r1 > 85:
                print('r1:', r1, 'r2:', r2, 'ln:', k1, 'fn:', k2)
            if r2 > 85:
                print('r1:', r1, 'r2:', r2, 'ln:', k1, 'fn:', k2)
    

    I started off by choosing 85 as the cutoff for both partial_ratio and ratio, just to see the results of the string matching. I generalized the results into three cases:

    Case 1: Both partial_ratio(r1) and ratio(r2) are equal to 100

    The two strings are identical to each other, for example:

    r1: 100 r2: 100 ln: abilene inc oxygen fn: abilene inc oxygen
    r1: 100 r2: 100 ln: abilene inc oxygen fn: abilene inc oxygen
    

    Case 2: Only r1 is 100

    One of the strings is a substring of the other, but we need to exclude false positives. Here is an example:

    r1: 100 r2: 65 ln: gavi llc pharmaceutical fn: llc pharmac
    

    In this example, fn is a substring of ln and r1 is equal to 100, supporting that observation. However, r2 is relatively low. Judging from the two strings, I cannot infer that they represent the same company. Thus we need to make further modifications, either to the cutoff scores or to the strings themselves, to exclude false positives.

    Case 3: Both r1 and r2 are >85

    The two strings are similar, but we need to exclude false positives:

    r1: 98 r2: 96 ln: barr inc laboratorie fn: arg inc laboratorie
    r1: 89 r2: 94 ln: company perrigo fn: company l perrigo
    

    Both r1 and r2 are greater than 85 in the two examples above. However, I identify the first line as a false positive, which needs to be excluded, whereas in the second line the two strings may represent the same company even though both scores are lower than the first line’s. Again, we need further modifications to improve the accuracy of the string matching.

    4. Modification on String Matching


    Let’s take a look at the strings in case 2 and case 3 again:

    “gavi llc pharmaceutical” vs “llc pharmac” -> false positive
    “barr inc laboratorie” vs “arg inc laboratorie” -> false positive
    “company perrigo” vs “company l perrigo” -> may represent the same company


    I want to make some changes to the strings so that my program returns only the third line and ignores the first two. I realized most company names are composed of two components: a unique component (in bold) that differentiates the company from other companies, and a common component (in regular text) that indicates the type of organization.

    If I remove the common component from the string, whatever is left should be the unique component which is much more precise and easier for a computer to decide whether the two strings are similar or not.

    For example, if we remove all the common components from the three lines, the strings will look like this:

    “gavi” vs “ ” -> false positive
    “barr” vs “arg” -> false positive
    “perrigo” vs “l perrigo” -> may represent the same company


    Now the computer can pick up the third line (r1 improved from 89 to 100), where “perrigo” is a substring of “l perrigo”, and it can ignore the first two false positive cases.

    I also noticed that in most of the cases where r1 = 100 and r1 - r2 > 30, it’s hard to say whether the two strings represent the same company, as in these examples:

    r1: 100 r2: 24 ln: brand inc tween fn: br
    r1: 100 r2: 43 ln: bio co general ltd fn: bio c
    r1: 100 r2: 47 ln: dava inc pharmaceutical fn: ava inc
    r1: 100 r2: 59 ln: bio cosmetic fn: bio c
    r1: 100 r2: 27 ln: chartwell governmental llc rx specialty fn: al llc
    r1: 100 r2: 69 ln: canton laboratorie fn: canton laboratorie limited private
    r1: 100 r2: 48 ln: beach inc productsd fn: ch inc
    r1: 100 r2: 63 ln: genentech inc fn: ch inc
    r1: 100 r2: 41 ln: dental llc scott supply fn: al llc
    

    As a result, I decided to exclude these cases by requiring the difference between the two scores to stay within 30.

    5. The Final Solution


    #======= Create relation :BRANDS (fuzzy string matching) ======#
    q3 = '''
    MATCH (d:Drug) where id(d) in {drug_id} and d.tradeName is not NULL
    MATCH (df:DrugFirm) where id(df) in {drug_firm_id}
    MERGE (df)-[r:BRANDS]->(d)
    ON CREATE SET r.ratio = {r2}, r.partial_ratio = {r1}'''

    num = 0  # Number of relationships created
    for k1 in uq_ln:
        labeler_name = k1
        nodeId_drug = uq_ln[k1]

        for k2 in uq_fn:
            company_name = k2
            nodeId_df = uq_fn[k2]
            r1 = fuzz.partial_ratio(labeler_name, company_name)
            r2 = fuzz.ratio(labeler_name, company_name)

            if r1 == 100 and (r1 - r2) <= 30:
                g.run(q3, drug_id=nodeId_drug, drug_firm_id=nodeId_df, r1=r1, r2=r2)
                num += 1
                print("CREATE relation :BRANDS number:", num)

            # Possible misspelling or missing word: re-score after filtering common words
            elif (100 > r1 >= 95 and r2 >= 85) or (95 > r1 >= 85 and r2 >= 90):
                md_r1 = fuzz.partial_ratio(string_filter(labeler_name, nostring),
                                           string_filter(company_name, nostring))
                md_r2 = fuzz.ratio(string_filter(labeler_name, nostring),
                                   string_filter(company_name, nostring))

                if md_r1 >= 95 and md_r2 >= 95:
                    g.run(q3, drug_id=nodeId_drug, drug_firm_id=nodeId_df, r1=md_r1, r2=md_r2)
                    num += 1
                    print("CREATE relation :BRANDS rel number:", num)
    

    I decided to create the relationship [:BRANDS] between (:Drug) and (:DrugFirm) nodes if r1 is 100 and r1 - r2 is no more than 30. When r1 and r2 are both above 85, I filter out some common words in the strings, such as inc, co, ltd, llc, corporation, pharmaceutical, laboratory, company, product, pharma, etc., and then recalculate r1 and r2 on the modified strings.
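    The `string_filter` helper and the `nostring` word list used in the code above are not shown in the post; here is a minimal sketch of how they might look. The word list is illustrative, drawn from the examples the post mentions:

```python
# Illustrative list of common "company type" words to strip out; the
# author's actual `nostring` list may differ.
nostring = ['inc', 'co', 'ltd', 'llc', 'corporation', 'pharmaceutical',
            'laboratory', 'laboratorie', 'company', 'product', 'pharma']

def string_filter(name, stopwords):
    """Remove the common component, keeping only the unique component."""
    return ' '.join(w for w in name.split() if w not in stopwords)

print(string_filter('company perrigo', nostring))          # -> 'perrigo'
print(string_filter('company l perrigo', nostring))        # -> 'l perrigo'
print(string_filter('gavi llc pharmaceutical', nostring))  # -> 'gavi'
```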

    It’s also helpful to store the values of r1 and r2 as properties on [:BRANDS], so that when I query information between drug firms and drugs, I can also trace the confidence level of the answer. The Cypher below returns the {firmName} and {labelerName} in ascending order of r1:

    MATCH (df:DrugFirm)-[r:BRANDS]->(d:Drug)
    RETURN df.firmName, d.labelerName, r.partial_ratio as r1, r.ratio as r2 order by r1 ASC limit 10
    

    The Cypher results for connections between drug firm names and drug labeler names in healthcare


    Now we have successfully created the relationship [:BRANDS] for (:DrugFirm) and (:Drug), and the matching results seem pretty trustworthy. I also used the same method to create relationships for (:Client) and (:DrugFirm), as well as (:Legislator) and (:LegislatorInfo).

    Lastly, let’s find out which drug firms brand drugs whose generic component is morphine sulfate:

    MATCH (df:DrugFirm)-[r:BRANDS]->(d:Drug{genericName:'Morphine Sulfate'}) RETURN df,d
    

    The relationships between brands and drugs that contain morphine sulfate


    (There are 193 matched results; I randomly chose three companies to display.)

    Conclusion


    I hope you enjoyed reading this blog post. If you are facing similar issues, such as having difficulty connecting the data in your graph, I hope this post has been helpful to you.

    At the same time, I am eager to hear your thoughts or ideas on how to solve the problem. Don’t hesitate to send me a message on Twitter, LinkedIn or email if you have any questions about my project.

    With the summer passing by, it is getting close to the end of my fun summer road trip through the world of healthcare. I will be excited to show you some of the most interesting discoveries I have made in my next blog post. If you want to know more about graph technology in the healthcare industry, don’t miss my last blog post. See you soon!


    Ready to dig in and get started with graph technology? Click below to download this free ebook, Learning Neo4j, and get up to speed with the world’s leading graph database.



    Catch up with the rest of the Neo4j in healthcare series:

    From the Neo4j Community: August 2016

    Explore all of the great articles and projects created by the Neo4j community in August 2016

    The Neo4j community has been busy this summer with a number of great projects, code libraries and a host of great articles on how graphs are everywhere – including everything from the Rio Summer Olympics to examining ISIS support and opposition networks.

    We can’t wait to see what other great uses of Neo4j the community comes up with in September!

    If you would like to see your post featured in September’s “From the Community” blog post, follow us on Twitter and use the #Neo4j hashtag for your chance to get picked.

    Articles and Blog Posts


    Podcasts and Audio


    Videos


    Slides and Presentations


    Libraries, GraphGists and Code Repos


    Other Content



    Curious how relational databases compare to graph technology?
    Download this ebook, The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with your relational database.


    Get the Ebook

    Neo4j + AWS Lambda & API Gateway to Create a Recommendation Engine


    The Challenge of Findability and Why Graphs?


    Here at the SAVO Group, we are in the business of sales enablement via Software as a Service (SaaS). One of the tools that we provide our customers is the ability to prescribe content proactively for their sales people or sellers. We are moving past that concept and into ways that we can have content “find” sellers much quicker.

    Ever since listening to Emil Eifrem on the O’Reilly Data Show talk about Neo4j, I have been super intrigued by graph technology and working to create an opportunity to use it at SAVO. As a Microsoft SQL Server DBA, the idea of connected data and the Cypher query language feel very natural to me.

    At SAVO, we track all activity on the content we manage for our customers. The light bulb for me was to leverage this activity and its connected “DNA” to recommend new content to sellers based on what other sellers have used. Knowing the real-time recommendation engine use case, Neo4j felt like a natural fit for this.

    Data Model and Cypher Query


    Below is the data model.

    A graph data model for sales enablement


    We serve hundreds of customers (i.e., tenants), so tenant_id is important in a single Neo4j database. The basic pattern is:
    (user)-[:ACTION]->(action)-[:DOWNLOADED|EMAILED|SEARCH_VIEW]->(document) 
    

      • User being the person acting on content, with metadata such as id and tenant
      • Action being the metadata about the action record in our database such as id and date
      • Document being the content acted on, with metadata such as id and tenant

    The key here is the additional (:action) node, which gives us the flexibility for date filtering and the ability to leverage indexing. With this pattern, we can extend to other users who have consumed this document and return what else they have consumed.

    Cypher query for recommendations:

    MATCH (u1:user {user_id: toInt({user})}),
          (d:document{document_id:toInt({document})})<--(a2:action)
          <--(u2:user)-->(a3:action)-[r]->(d2:document)
    WHERE u1 <> u2 AND d <> d2
    AND NOT  (u1)-[*2]->(d2)
    AND a3.action_date >= a2.action_date
    RETURN d2.document_id AS document_id, 
           sum(case when type(r) = 'DOWNLOADED' then 1 else 0 end) as downloads,
           sum(case when type(r) = 'EMAILED' then 1 else 0 end) as emailed,
           sum(case when type(r) = 'SEARCH_VIEW' then 1 else 0 end) as search_views,
           count(r) as score
    ORDER BY score desc
    LIMIT 25
    

    We also filter out documents the user has already seen, since the idea is to bring unseen content to the user.

    Enter AWS Managed Services and EC2


    SAVO is currently in the process of moving new and existing applications and infrastructure to Amazon Web Services (AWS). This presented the optimal opportunity to display the awesomeness of graphs and how quickly a recommendation engine could be created with AWS, specifically using managed services like Lambda and API Gateway instead of spinning up new VMs or adding to existing applications. (I also get to progress with learning AWS and Neo4j all at once – win-win!)

    The architecture for the proof of concept looks like this:

    Learn how to create a recommendation engine using Neo4j alongside Lambda and API Gateway from AWS


    The EC2 instance is a very simple m3.medium built from the Ubuntu 14.04 AMI inside a default VPC we have set up in SAVO’s scratch account, which we use for POC work and various types of exploration. I set up the EC2 instance with a security group that limits Neo4j Bolt driver port access at 7687 to another security group that I will use later on. I also added a 30GB EBS volume and then installed Neo4j per the Ubuntu installation documentation.

    Here is how Amazon describes Lambda and API Gateway:
    AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume – there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service – all with zero administration. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.

    Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor and secure APIs at any scale. With a few clicks in the AWS Management Console, you can create an API that acts as a “front door” for applications to access data, business logic or functionality from your backend services, such as workloads running on Amazon Elastic Compute Cloud (Amazon EC2), code running on AWS Lambda or any web application. Amazon API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring and API version management. Amazon API Gateway has no minimum fees or startup costs. You pay only for the API calls you receive and the amount of data transferred out.
    Using Lambda, which supports Python 2.7, along with the Neo4j Bolt driver and API Gateway, I am able to turn my Cypher query into a fully functional microservice. Leveraging Amazon’s awesome documentation makes it quite easy to set this up.

    I set up the Lambda function to work within the same default VPC stated above and added the security group that is allowed access to the Bolt driver port in order to keep the communication between our database and the Lambda function private and isolated. I upped the default timeout to 30 seconds, just to be safe.

    With Lambda you can upload any dependent Python packages along with your code as a ZIP file. For this project, the .py file and the Neo4j Python Bolt driver are manually packaged into a ZIP file and uploaded to Lambda via the AWS Console.
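    If you prefer to script the packaging step rather than zip the files by hand, something like the following works. The file names here are placeholders; a real package would bundle the actual handler file together with the neo4j driver package directory:

```python
import os
import tempfile
import zipfile

# Sketch: build a Lambda deployment ZIP from Python. Paths are placeholders.
def build_package(zip_path, paths):
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as z:
        for path in paths:
            z.write(path, arcname=os.path.basename(path))

workdir = tempfile.mkdtemp()
handler = os.path.join(workdir, 'recommendation.py')  # stand-in handler file
with open(handler, 'w') as f:
    f.write('def get_recommendation(event, context):\n    return []\n')

package = os.path.join(workdir, 'lambda_package.zip')
build_package(package, [handler])
print(zipfile.ZipFile(package).namelist())
```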

    Example files and ZIP package:

    A ZIP of the AWS Lamdba Python package


    Lambda console for uploading ZIP package:

    The AWS Lambda console with Neo4j


    The Python code in the package:

    from __future__ import print_function
    from neo4j.v1 import GraphDatabase, basic_auth
    
    
    def get_recommendation(event,context):
        results = []
        user = event['user_id']
        document = event['document_id']
        driver = GraphDatabase.driver("bolt://", auth=basic_auth("neo4j", "neo4j"), encrypted=False)
        session = driver.session()
        cypher_query = '''
        MATCH (u1:user {user_id: toInt({user})}),
              (d:document{document_id:toInt({document})})<--(a2:action)<--(u2:user)
              -->(a3:action)-[r]->(d2:document)
        WHERE u1 <> u2 AND d <> d2
        AND NOT  (u1)-[*2]->(d2)
        AND a3.action_date >= a2.action_date
        RETURN d2.document_id AS document_id, 
              sum(case when type(r) = 'DOWNLOADED' then 1 else 0 end) as downloads,
              sum(case when type(r) = 'EMAILED' then 1 else 0 end) as emailed,
              sum(case when type(r) = 'SEARCH_VIEW' then 1 else 0 end) as search_views,
              count(r) as score
        ORDER BY score desc
        LIMIT 25
        '''
        result = session.run(cypher_query,{'user':user,'document':document})
        session.close()
        for record in result:
            item = {'document_id':record['document_id'],  
                    'downloads':record['downloads'], 'emailed':record['emailed'],  
                    'search_views':record['search_views'], 'score':record['score']}
            results.append(item)
        return results
    

    The actual EC2 instance DNS name is omitted in the above example. Both that and the Neo4j username and password are hard-coded in. A better way to do this in the future would be to pass these values in at runtime. This could be accomplished by calling an S3 bucket or RDS instance, both encrypted at rest, and pulling the appropriate values into the Lambda function when triggered.
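    A simpler interim option (not what the post uses, just an illustration) is to read the connection details from Lambda environment variables at runtime. The variable names below are assumptions:

```python
import os

# Sketch: avoid hard-coding the Bolt URL and credentials by reading them
# from environment variables, which Lambda lets you configure per function.
def neo4j_config():
    return {
        'uri': os.environ.get('NEO4J_BOLT_URI', 'bolt://localhost:7687'),
        'user': os.environ.get('NEO4J_USER', 'neo4j'),
        'password': os.environ.get('NEO4J_PASS', ''),
    }

# Simulate the values Lambda would inject (hypothetical host name):
os.environ['NEO4J_BOLT_URI'] = 'bolt://example.internal:7687'
print(neo4j_config()['uri'])
```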

    You will notice that I had to use encrypted=False in order to authenticate to Neo4j. Using encrypted=True would not work because we do not have the ability to leverage a known_hosts file or a signed certificate with Lambda. At this point we rely on the protection of the AWS VPC. The signed-certificate option may work with the Node.js driver, but that is out of scope for this example.

    Once this is working, we set up the API Gateway to integrate with the Lambda function via a GET request that will send user_id and document_id and return recommendations.

    In API Gateway, we add a document-recommendations resource and a {user} and {document} GET method. The brackets allow us to send the IDs as part of the request.

    The AWS API Gateway with Cypher


    On the {document} method, we integrate with our Lambda function to send user_id and document_id and receive recommendations.

    Integration between AWS Lambda and API Gateway


    Once we have this set up and tested, we deploy to a stage and get our very own API endpoint.

    To test things, I opened up Postman, executed my HTTPS API endpoint with user and document values, and through the magic of AWS and Neo4j, we have recommendations in less than 500 milliseconds!

    Real-time recommendation engine results with Neo4j and AWS


    Next Steps


    I am really passionate about Neo4j and love working with graphs. My hope is that this POC can grow into a fully-fledged content recommendation engine with many endpoints for various flavors of recommendations.

    Currently, there is a lot of manual setup involved. I envision that the production version would be an HA cluster with automated deployment of AWS components and code via some mix of Cloudformation and Jenkins.

    We are also using the default Neo4j password in this example, as well as the default settings for API Gateway. In the future, I would potentially leverage password encryption in AWS to fetch the password at runtime. API Gateway also gives us the ability to use an “Authorizer” to control access, which is something we will explore.

    In conclusion, this was such a cool example to get up and running and really only took a few hours, since I already had some sample Cypher and a zipped-up copy of my local database. I think this is a credit to how accessible Neo4j and AWS are. I look forward to building more graph-based solutions in the future.

    The world of graph databases truly is wonderful and they are indeed everywhere…even on AWS 🙂


    Already know Neo4j? Prove it.
    Take the Neo4j Certification exam and validate your graph database skills to current and future employers and customers. Click below and get certified in less than an hour.


    Get Certified

    Adding Users to the Node.js / React.js Neo4j Movie App


    Introduction


    The Neo4j Movie App Template provides an easy-to-use foundation for your next Neo4j project or experiment using either Node.js or React.js. This article will walk through the creation of users that can log in and interact with the web app’s data.

    In the Neo4j Movie App Template example, these users will be able to log in and out, rate movies, and receive movie recommendations.

    The User Model


    Aside from creating themselves and authenticating with the app, Users (blue) can rate Movies (yellow) with the :RATED relationship, illustrated in the graph data model below.

    Learn how to add users to the Node.js / React.js example Neo4j Movie App


    User Properties

      • password: The hashed version of the user’s chosen password
      • api_key: The user’s API key, which the user uses to authenticate requests
      • id: The user’s unique ID
      • username: The user’s chosen username
    :RATED Properties

    rating: an integer rating between 1 and 5, with 5 being love it and 1 being hate it.

    My Rated Movie in the Neo4j Movie App


    Users Can Create Accounts


    Before a User can rate a Movie, the user has to exist – someone has to sign up for an account. Signup creates a node in the database with a User label, along with the properties necessary for logging in and maintaining a session.

    Create a new user account in the Neo4j Movie App

    Figure 1. web/src/pages/Signup.jsx

    The registration endpoint is located at /api/v0/register. The app submits a request to the register endpoint when a user fills out the “Create an Account” form and taps “Create Account”.

    Assuming you have the API running, you can test requests either by using the interactive docs at 3000/docs/, or by using cURL.
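    As an alternative to cURL, you could exercise the same endpoint from Python. The sketch below only builds the request payload; the actual POST is left commented out because it needs the API running on localhost:3000 (and the third-party `requests` package):

```python
import json

# The endpoint and payload mirror the cURL register example.
url = 'http://localhost:3000/api/v0/register'
payload = {'username': 'Mary Jane', 'password': 'SuperPassword'}

# With the API up, the call would be:
# import requests  # pip install requests
# resp = requests.post(url, json=payload)
# print(resp.json())

print(json.dumps(payload))
```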

    Use Case: Create New User


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{ "username": "Mary Jane", "password": "SuperPassword"}' 'http://localhost:3000/api/v0/register'
    

    Response

    {
       "id":"e1e157a2-1fb5-416a-b819-eb75c480dfc6",
       "username":"Mary333 Jane",
       "avatar":{
          "full_size":"https://www.gravatar.com/avatar/b2a02b21db2222c472fc23ff78804687?d=retro"
       }
    }
    

    Use Case: Try to Create New User but Username Is Already Taken


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{ "username": "Mary Jane", "password": "SuperPassword"}' 'http://localhost:3000/api/v0/register'
    

    Response

    {
       "username":"username already in use"
    }
    

    User registration logic is implemented in /api/models/users.js. Here’s the JavaScript:

    var register = function(session, username, password) {
        return session.run('MATCH (user:User {username: {username}}) RETURN user', {
                username: username
            })
            .then(results => {
                if (!_.isEmpty(results.records)) {
                    throw {
                        username: 'username already in use',
                        status: 400
                    }
                }
                else {
                    return session.run('CREATE (user:User {id: {id}, username: {username}, ' +
                           'password: {password}, api_key: {api_key}}) RETURN user', {
                        id: uuid.v4(),
                        username: username,
                        password: hashPassword(username, password),
                        api_key: randomstring.generate({
                            length: 20,
                            charset: 'hex'
                        })
                    }).then(results => {
                        return new User(results.records[0].get('user'));
                    })
                }
            });
    };
    

    Users Can Log in


    Now that users are able to register for an account, we can define the view that allows them to log in to the site and start a session.



    User login on the Neo4j Movies App

    Figure 2. /web/src/pages/Login.jsx

    The login endpoint is located at /api/v0/login. The app submits a request to the login endpoint when a user fills in a username and password and submits the login form.

    Assuming you have the API running, you can test requests either by using the interactive docs at http://localhost:3000/docs/, or by using cURL.

    Use Case: Login


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{"username": "Mary Jane", "password": "SuperPassword"}' 'http://localhost:3000/api/v0/login'
    

    Response

    {
    	"token":"5a85862fb28a316ea6a1"
    }
    

    Use Case: Wrong Password


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{ "username": "Mary Jane", "password": "WrongPassword"}' 'http://localhost:3000/api/v0/login'
    

    Response

    {
       "password":"wrong password"
    }
    

    Use Case: See Myself


    Request

    curl -X GET \
    --header 'Accept: application/json' \
    --header 'Authorization: Token 5a85862fb28a316ea6a1' 'http://localhost:3000/api/v0/users/me'
    

    Response

    {
      "id": "94a604f7-3eab-4f28-88ab-12704c228936",
      "username": "Mary Jane",
      "avatar": {
        "full_size": "https://www.gravatar.com/avatar/c2eab5611cabda1c87463d7d24d98026?d=retro"
      }
    }
    
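    The avatar URL in the response above is a Gravatar "retro" image keyed by an MD5 hash. The exact input the app hashes isn't shown here, so the key below is an assumption; a minimal sketch of building such a URL:

```python
import hashlib

def gravatar_url(key):
    # MD5-hex the lowercased, stripped key, as Gravatar expects;
    # what the app actually hashes (username? email?) is an assumption
    digest = hashlib.md5(key.strip().lower().encode("utf-8")).hexdigest()
    return "https://www.gravatar.com/avatar/%s?d=retro" % digest
```

    For example, `gravatar_url("Mary Jane")` yields a URL of the same shape as the `full_size` field above.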

    You can take a look at the implementation in /api/models/users.js:

    var me = function(session, apiKey) {
        return session.run('MATCH (user:User {api_key: {api_key}}) RETURN user', {
                api_key: apiKey
            })
            .then(results => {
                if (_.isEmpty(results.records)) {
                    throw {
                        message: 'invalid authorization key',
                        status: 401
                    };
                }
                return new User(results.records[0].get('user'));
            });
    };
    var login = function(session, username, password) {
        return session.run('MATCH (user:User {username: {username}}) RETURN user', {
                username: username
            })
            .then(results => {
                if (_.isEmpty(results.records)) {
                    throw {
                        username: 'username does not exist',
                        status: 400
                    }
                }
                else {
                    var dbUser = _.get(results.records[0].get('user'), 'properties');
                    if (dbUser.password != hashPassword(username, password)) {
                        throw {
                            password: 'wrong password',
                            status: 400
                        }
                    }
                    return {
                        token: _.get(dbUser, 'api_key')
                    };
                }
            });
    };
    

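    The me function above looks the user up by the api_key taken from the Authorization: Token header shown in the /users/me request. Extracting that key from the raw header value could look like this (a sketch, not the app's actual middleware):

```python
def parse_api_key(authorization_header):
    # Expect the form "Token <api_key>"; return None for anything else
    if not authorization_header:
        return None
    parts = authorization_header.split()
    if len(parts) != 2 or parts[0] != "Token":
        return None
    return parts[1]
```

    The returned key would then be passed to a lookup like the me function's `MATCH (user:User {api_key: ...})` query.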
    The code here should look similar to /register. There is a similar form to fill out, where a user types in their username and password.

    With the given username, a User is initialized. The password they filled out in the form is verified against the hashed password that was retrieved from the corresponding :User node in the database.
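    The hashing scheme itself isn't shown in this snippet; a username-salted hash along the following lines would behave the way the code expects, though the template's real hashPassword may differ:

```python
import hashlib

def hash_password(username, password):
    # Salt the password with the username so identical passwords
    # for different users hash differently (scheme is an assumption)
    return hashlib.sha256((username + ':' + password).encode('utf-8')).hexdigest()
```

    The important property is that the same (username, password) pair always produces the same digest, so login can compare against the stored value.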

    If the verification is successful, it returns a token. The user is then directed to an authentication page, from which they can navigate through the app, view their user profile and rate movies. Below is a rather empty user profile for a freshly created user:

    An empty user profile in the Neo4j Movies App

    Figure 3. /web/src/pages/Profile.jsx

    Users Can Rate Movies


    Once a user has logged in and navigated to a page that displays movies, the user can select a star rating for the movie or remove the rating of a movie he or she has already rated.

    My Rated Movie in the Neo4j Movie App


    The user should be able to access their previous ratings (and the movies that were rated) both on their user profile and the movie detail page in question.

    Use Case: Rate a Movie


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    --header 'Authorization: Token 5a85862fb28a316ea6a1' \
    -d '{"rating":4}' 'http://localhost:3000/api/v0/movies/683/rate'
    

    Response

    {}
    

    Use Case: See All of My Ratings


    Request

    curl -X GET \
    --header 'Accept: application/json' \
    --header 'Authorization: Token 5a85862fb28a316ea6a1' 'http://localhost:3000/api/v0/movies/rated'
    

    Response

    [
      {
        "summary": "Six months after the events depicted in The Matrix, ...",
        "duration": 138,
        "rated": "R",
        "tagline": "Free your mind.",
        "id": 28,
        "title": "The Matrix Reloaded",
        "poster_image": "http://image.tmdb.org/t/p/w185/ezIurBz2fdUc68d98Fp9dRf5ihv.jpg",
        "my_rating": 4
      },
      {
        "summary": "Thomas A. Anderson is a man living two lives....",
        "duration": 136,
        "rated": "R",
        "tagline": "Welcome to the Real World.",
        "id": 1,
        "title": "The Matrix",
        "poster_image": "http://image.tmdb.org/t/p/w185/gynBNzwyaHKtXqlEKKLioNkjKgN.jpg",
        "my_rating": 4
      }
    ]
    

    Use Case: See My Rating on a Particular Movie


    Request

    curl -X GET \
    --header 'Accept: application/json' \
    --header 'Authorization: Token 5a85862fb28a316ea6a1' 'http://localhost:3000/api/v0/movies/1'
    

    Response

    {
       "summary":"Thomas A. Anderson is a man living two lives....",
       "duration":136,
       "rated":"R",
       "tagline":"Welcome to the Real World.",
       "id":1,
       "title":"The Matrix",
       "poster_image":"http://image.tmdb.org/t/p/w185/gynBNzwyaHKtXqlEKKLioNkjKgN.jpg",
       "my_rating":4,
       "directors":[...],
       "genres":[...],
       "producers":[...],
       "writers":[...],
       "actors":[...],
       "related":[...],
       "keywords":[...]
    }
    

    Users Can Be Recommended Movies Based on Their Ratings


    When a user visits their own profile, the user will see movie recommendations. There are many ways to build a recommendation engine, and you might want to use one or a combination of the methods below to build the appropriate recommendation system for your particular use case.

    In the movie template, you can find the recommendation endpoint at movies/recommended.

    User-Centric, User-Based Recommendations

    Here’s an example Cypher query for a user-centric recommendation:

    MATCH (me:User {username:'Sherman'})-[my:RATED]->(m:Movie)
    MATCH (other:User)-[their:RATED]->(m)
    WHERE me <> other
    AND abs(my.rating - their.rating) < 2
    WITH other,m
    MATCH (other)-[otherRating:RATED]->(movie:Movie)
    WHERE movie <> m
    WITH avg(otherRating.rating) AS avgRating, movie
    RETURN movie
    ORDER BY avgRating DESC
    LIMIT 25
    
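    The intent of that query can be sketched in plain Python over toy data: find users whose ratings on shared movies are within two stars of mine, then rank the movies they rated that I haven't seen by their average rating. This is illustrative only, not the app's code; the names and data are made up:

```python
# ratings: user -> {movie title: star rating}
ratings = {
    "Sherman": {"The Matrix": 5},
    "Alice":   {"The Matrix": 4, "Elysium": 5, "Gattaca": 2},
    "Bob":     {"The Matrix": 1, "Blade Runner": 5},
}

def recommend(me, ratings, tolerance=2):
    mine = ratings[me]
    scores = {}
    for other, theirs in ratings.items():
        if other == me:
            continue
        # "Similar" user: rated a movie I rated, within the tolerance
        if any(m in mine and abs(mine[m] - r) < tolerance
               for m, r in theirs.items()):
            for movie, r in theirs.items():
                if movie not in mine:
                    scores.setdefault(movie, []).append(r)
    # Rank unseen movies by their average rating among similar users
    avg = {m: sum(v) / len(v) for m, v in scores.items()}
    return sorted(avg, key=avg.get, reverse=True)
```

    With this data, Alice counts as similar to Sherman (|5 - 4| < 2) while Bob does not (|5 - 1| >= 2), so only Alice's other movies are recommended, best-rated first.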

    Movie-Centric, Keyword-Based Recommendations

    Newer movies will have few or no ratings, so they will never be recommended if the application relies solely on users' ratings.

    Since movies have keywords, the application can recommend movies with similar keywords for a particular movie. This case is useful when the user has made few or no ratings.

    For example, site visitors interested in movies like Elysium will likely be interested in movies with similar keywords as Elysium.

    MATCH (m:Movie {title:'Elysium'})
    MATCH (m)-[:HAS_KEYWORD]->(k:Keyword)
    MATCH (movie:Movie)-[r:HAS_KEYWORD]->(k)
    WHERE m <> movie
    WITH movie, count(DISTINCT r) AS commonKeywords
    RETURN movie
    ORDER BY commonKeywords DESC
    LIMIT 25
    
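    The keyword-overlap logic above can also be sketched in plain Python with toy data: score every other movie by how many keywords it shares with the target, then rank by that count. Again, the data and names are made up for illustration:

```python
movie_keywords = {
    "Elysium":      {"dystopia", "space", "future"},
    "District 9":   {"dystopia", "aliens", "future"},
    "The Notebook": {"romance"},
}

def similar_movies(title, movie_keywords):
    # Rank other movies by the number of keywords shared with `title`,
    # dropping movies with no overlap at all
    base = movie_keywords[title]
    scored = [(len(base & kws), other)
              for other, kws in movie_keywords.items() if other != title]
    return [other for count, other in sorted(scored, reverse=True) if count > 0]
```

    Here "District 9" shares two keywords with "Elysium" and is recommended, while "The Notebook" shares none and is dropped.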

    User-Centric, Keyword-Based Recommendations

    Users with established tastes may be interested in finding movies with characteristics similar to their highly-rated movies, while not necessarily caring whether other users have already rated them. For example, Sherman has seen many movies and is looking for new movies similar to the ones he has already watched.

    MATCH (u:User {username:'Sherman'})-[:RATED]->(m:Movie)
    MATCH (m)-[:HAS_KEYWORD]->(k:Keyword)
    MATCH (movie:Movie)-[r:HAS_KEYWORD]->(k)
    WHERE m <> movie
    WITH movie, count(DISTINCT r) AS commonKeywords
    RETURN movie
    ORDER BY commonKeywords DESC
    LIMIT 25
    

    Next Steps




    Want to learn more about what you can do with graph databases like Neo4j?
    Click below to get your free copy of the O’Reilly Graph Databases book and discover how to harness the power of graph technology.


    Download My Free Copy

    The 5-Minute Interview: Daniel Himmelstein, Postdoctoral Fellow at University of Pennsylvania

    “This is a really advanced graph algorithm and Cypher nailed it,” said Daniel Himmelstein, a Postdoctoral Fellow at the University of Pennsylvania.

    Before using Neo4j, it took as many as 1,000 lines of code to write the main query for Himmelstein’s graph algorithm used in a bioinformatics application. But with Neo4j’s Cypher graph query language, the query took only 20 lines.

    In this week’s 5-Minute Interview (conducted at GraphConnect San Francisco), we discuss how Neo4j is being used for biological and medical research at UPenn. Himmelstein also shares where he believes the field of bioinformatics research is headed in 2017.



    Tell us about how you use Neo4j at UPenn.


    Daniel Himmelstein: I use Neo4j to encode biological and medical knowledge into a network. Neo4j was the best way to encode this type of knowledge – which is produced by millions of studies over the past 50 years – where we are able to represent the rich types of nodes and relationships from real-world biological data.

    What made you choose to work with Neo4j?


    Daniel: The Neo4j community is the reason I chose it. First, the features are fantastic and were exactly what we needed, mainly because Neo4j dealt with different types of networks extremely well. But the community — with so many things on GitHub where I could report any issues with code and then have it fixed quickly, or ask a question on Stack Overflow, was really great.

    The developers have been extremely helpful, and I went to some meetups in San Francisco where I met some of the team. The company provides great support, even though we were never a paying customer as open source users of the product. The community has been great to be a part of.

    What are some of the most interesting or surprising results you’ve seen while using Neo4j?


    Catch this week's 5-Minute Interview with Daniel Himmelstein, University of Pennsylvania

    Daniel: Before using Neo4j, I had written a Python package called Hetio, which dealt with a number of different types of networks. It took as many as 1,000 lines of code to do the main query for our algorithm. But when I switched to Neo4j and was able to pour the algorithm into Cypher, the code was only 20 lines. I thought, “Wow. This is a really advanced graph algorithm and Cypher nailed it.”

    Cypher had exactly the right constructs to be able to express exactly what we wanted. And it was cool to have people finally think about how to query a graph; previously people hadn’t put much effort into developing a good query language for networks.

    If you could start over with Neo4j, taking everything you know now, what would you do differently?


    Daniel: If I could go back in time, maybe I would have used Neo4j a little bit earlier. When I first considered Neo4j, I don’t think Cypher was out yet. And because I program primarily in Python, and a little bit in R, there originally wasn’t an intuitive way to interact with Neo4j. But with the new Bolt drivers and the Cypher query language, it has become quite easy to work with Python in Neo4j.

    Anything else you want to add or say?


    Daniel: I’m really excited. There have been several talks here at GraphConnect San Francisco from people in the bioinformatics field. I know when Emil did the keynote he didn’t include biology or medicine as one of his six fields, but this will likely be one in 2017 because it’s really blowing up. We have a lot of data, it has types, and we need to understand those connections, so I expect big growth in the biology field in the next year.

    Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


    Use your RDBMS expertise to learn about graph databases: Download this ebook, The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with a relational database.

    Get the Ebook

    2016: The Year in Neo4j Drivers


    Spring is in the Air


    2016 was the year when, in April, with the availability of Neo4j 3.0, we introduced our own binary protocol named Bolt. We also provided the first set of officially supported drivers for Bolt, including Java, .NET, JavaScript and Python, developed in-house and documented in the Neo4j developer manual.

    Learn about Neo4j drivers for JavaScript, Java, .NET, Python and other community language drivers.


    Since the first days of Neo4j we’ve been supported by our active community of contributors who did a great job of providing drivers for our HTTP and REST endpoints for more than 20 popular programming languages.

    Thank you all so much for this impressive work!


    With Neo4j 3.0 and the Bolt binary protocol, we saw this amazing work continue. Originally we were a bit concerned because of the higher effort required to develop a Neo4j driver for a custom binary protocol but our contributors surprised us here.

    Even during the development of Neo4j 3.0, the first three drivers had their first releases: Nigel’s py2neo (Python), neo4j-php-client (PHP) by Chris Willemsen from our partner GraphAware (UK) and libneo4j-client (C) by Chris Leishman.

    To make it easier for contributors to develop Neo4j drivers using Bolt, Nigel started the boltkit project: executable documentation (in Python) that details how to structure and implement a driver for the Bolt protocol and the PackStream serialization. This also includes some tools for driver authors and is used in-house here at Neo Technology.

    boltkit for Neo4j drivers includes an API and PackStream details


    Summertime and the Livin’ Is Easy


    In the Go community, several people actively worked on Neo4j drivers using Bolt. John Nadratowski developed the golang-neo4j-bolt-driver, of which Eric Lagergren from SermoDigital made and maintains a more idiomatic fork. And Hugo Bost wrote neoql, a variant that (similar to cq) is based on the widely used database/sql API for Go.

    Florin Patrascu enjoyed working in Elixir and couldn’t live without a current Neo4j driver. So he provided neo4j-sips, a Bolt driver for Elixir.

    Forever Autumn


    In October at GraphConnect San Francisco, the third beta of Neo4j 3.1 was launched, with some notable improvements in the official Neo4j drivers, especially in concurrent operations and session reuse.

    The bigger changes were in the new APIs to support Causal Clustering in Neo4j 3.1. Smart client routing (bolt+routing://host:port) that uses information on cluster topology together with demarcation of read and write sessions alleviates the need for a load balancer. And the ability to use a transaction-state token (bookmark) allows for causal consistency to read your own writes even on an eventually consistent cluster underneath. These features were first to launch with the 1.1 version of the Java driver.

    Pavel Yakovlev found time besides his job as the research director of a biotech company to develop a Bolt Neo4j driver for Haskell named hasbolt.

    Hazy Shade of Winter


    The neo4j.rb (Ruby) team (Brian Underwood, Chris Grigg, et al) worked over the year – besides other improvements – on implementing the Bolt protocol in neo4j-core gem so that it is supported both on the low-level APIs as well as in the ActiveRecord module of neo4j.rb, both of which were released at the end of the year.

    These core Neo4j drivers were not the only libraries developed in 2016; we also saw them being used by other projects. The Java Neo4j driver was used for the neo4j-spark-connector and for the Neo4j-JDBC driver by the team of our partner Larus BA in Italy. The JavaScript driver powers the Neo4j Browser and the Tableau 10 (WDC2) Neo4j connector, maintained by our partner Ralf Becher from TIQ in Germany.

    The PHP driver is used in the Drupal module developed by Pronovix and also in the brand-new neo4j-symfony. Neomodel 3, the Django-friendly OGM by Robin Edwards, uses py2neo under the hood.

    The Jan 19th release of the new official 1.1 driver series for .NET, JavaScript and Python adds smart routing and bookmarking capabilities to work seamlessly with causal clusters.

    You can find all these mentioned drivers on our language guide pages for developers and for many of them also an implementation of our example movie application in our github.com/neo4j-examples repositories.

    The Neo4j example movie application


    As with any open source project, feedback from users is crucial to our success, so if you use any of the abovementioned Neo4j drivers make sure to raise issues if you encounter problems or have ideas and/or suggestions for improvements.

    We’re sure any driver author would appreciate a “thank you” for their efforts as well. And if you are using the Neo4j drivers in a commercial project, perhaps you can find an opportunity to either contribute back code that you’ve developed or consider contracting the author or the author’s company to help improve the driver for real-world usage.

    Are there languages missing that you would like to see supported by the Neo4j community or officially by Neo4j? Please let us know! Drop us an email to feedback@neo4j.com.

    If you would love to work on Neo4j drivers and related topics full-time, we’re hiring for positions in the drivers team.



    Think you have what it takes to be Neo4j certified?
    Show off your graph database skills to the community and employers with the official Neo4j Certification. Click below to get started and you could be done in less than an hour.


    Start My Certification

    From the Neo4j Community: January 2017

    Discover all of the great articles created by the Neo4j community in January 2017

    The year is off to a great start when it comes to the Neo4j community. If this month is any indication of what’s to come, then we know that 2017 will be a big year for Neo4j projects, drivers and integrations across the board. Here are some of our favorite picks from last month’s Neo4j community contributions.

    If you would like to see your post featured in February’s “From the Neo4j Community” blog post, follow us on Twitter and use the #Neo4j hashtag for your chance to get picked.

    Articles and Blog Posts


    Graph Visualisation

     

    Videos


    Language Drivers


    Libraries, GraphGists, and Code Repos



    Want to take your Neo4j skills up a notch? Take our online training class, Neo4j in Production, and learn how to scale the world’s leading graph database to unprecedented levels.

    Take the Class

    Just for Flask & React.js Developers: A New Neo4j Movies Template


    Introduction


    Let’s jump right into it. You’re a Python developer interested in Neo4j and want to build a web app, microservice or mobile app. You’ve already read up on Neo4j, played around with some datasets, and know enough Cypher to get going. Now you’re looking for a demo app or template to get the ball rolling.

    Enter the Neo4j Movies Template.

    This blog post will walk you through rating a movie on a sample movie rating application, from initial setup to viewing the list of movies you’ve rated.

    What comes with the Neo4j Movies Template:

    Overview of the Data Model and the Implementation


    The Classic Movie Database

    This project uses a classic Neo4j dataset: the movie database. It includes Movie, Person, Genre and Keyword nodes, connected by relationships as described in the following image:

    Graph data model of the classic movie database


      • (:Movie)-[:HAS_GENRE]→(:Genre)
      • (:Movie)-[:HAS_KEYWORD]→(:Keyword)
      • (:Person)-[:ACTED_IN]→(:Movie)
      • (:Person)-[:WROTE]→(:Movie)
      • (:Person)-[:DIRECTED]→(:Movie)
      • (:Person)-[:PRODUCED]→(:Movie)

    Additionally, users can add ratings to movies:

    Learn how to use Flask and React.js with Neo4j with this all-new Movies template


      • (:User)-[:RATED]→(:Movie)

    Or, in table form:

    from | props_from | via | to | props_to
    [User] | [api_key, username, password, id] | RATED | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Person] | [id, name, born, poster_image] | ACTED_IN | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Movie] | [id, title, tagline, summary, poster_image, duration, rated] | HAS_KEYWORD | [Keyword] | [id, name]
    [Person] | [id, name, born, poster_image] | DIRECTED | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Person] | [id, name, born, poster_image] | PRODUCED | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Person] | [id, name, born, poster_image] | WRITER_OF | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Movie] | [id, title, tagline, summary, poster_image, duration, rated] | HAS_GENRE | [Genre] | [id, name]


    The API

    The Flask portion of the application interfaces with the database and presents data to the React.js front-end via a RESTful API.

    The Front-End

    The front-end, built in React.js, consumes the data presented by the Flask API and presents some views to the end user, including:

      • Home page
      • Movie detail page
      • Person detail page
      • User detail page
      • Login

    Setting Up


    To get the project running, clone the repo then check the project’s README for environment-specific setup instructions.

    The README covers how to:

      • Download and install Neo4j
      • Prepare the database
      • Import the nodes and relationships using neo4j-import
    Start the Database!

      • Start Neo4j if you haven’t already!
      • Set your username and password (You’ll run into less trouble if you don’t use the defaults)
      • Set environment variables (Note: the following is for Unix; for Windows you will be using set=…​)
      • Export your Neo4j database username export MOVIE_DATABASE_USERNAME=myusername
      • Export your Neo4j database password export MOVIE_DATABASE_PASSWORD=mypassword
      • You should see a database populated with Movie, Genre, Keyword and Person nodes.
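    The template's config module presumably picks these variables up via os.environ; a minimal sketch (the variable names come from the README above, the fallback behavior is an assumption):

```python
import os

def database_credentials():
    # Read the credentials exported above; the 'neo4j' default
    # username is an assumption, not the template's documented behavior
    return (os.environ.get('MOVIE_DATABASE_USERNAME', 'neo4j'),
            os.environ.get('MOVIE_DATABASE_PASSWORD', ''))
```

    The driver setup later in this post would then consume these two values when authenticating.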
    Start the Flask Backend

    The Neo4j-powered Flask API lives in the flask-api directory.

      • cd flask-api
      • pip install -r requirements.txt (you should be using a virtualenv)
      • export FLASK_APP=app.py
      • flask run starts the API
      • Take a look at the docs at http://localhost:5000/docs
    The Python Flask backend of the Neo4j Movies template app, looking at movie genres


    Start the React.js Front-End


    With the database and Flask backend running, open a new terminal tab or window and move to the project’s /web subdirectory. Install the bower and npm dependencies, then start the app by running gulp (read the “getting started” on gulpjs.com). Edit config/settings.js by changing the apiBaseURL to http://localhost:5000/api/v0

    Over on http://localhost:4000/, you should see the homepage of the movie app, displaying three featured movies and other movies below.

    Home page of the Neo4j Flask Movies template app


    Click on a movie to see the movie detail page:

    Movie detail page for the Neo4j Flask movies template


    Click on a person to see that person’s related people and movies the person has acted in, directed, written or produced:

    Person detail page in the Neo4j Flask movies template app


    A Closer Look: Using the Python Neo4j Bolt Driver


    Let’s take a closer look at what sort of responses we get from the driver.

    Import dependencies, including the Neo4j driver, and connect the driver to the database:

    Getting Ready
    # Imports assumed from the template: Flask, CORS, the swagger-enabled Api,
    # the Bolt driver, and the app's own config module
    from flask import Flask
    from flask_cors import CORS
    from flask_restful_swagger_2 import Api
    from neo4j.v1 import GraphDatabase, basic_auth

    import config  # the template's settings module (database credentials)

    app = Flask(__name__)
    app.config['SECRET_KEY'] = 'super secret guy'
    api = Api(app, title='Neo4j Movie Demo API', api_version='0.0.10')
    CORS(app)

    driver = GraphDatabase.driver('bolt://localhost',
                                  auth=basic_auth(config.DATABASE_USERNAME,
                                                  str(config.DATABASE_PASSWORD)))
    

    Let’s look at how we would ask the database to return all the genres in the database. The GenreList class queries the database for all Genre nodes, serializes the results, and returns them via /api/v0/genres.

    class GenreList(Resource):
        @swagger.doc({
            'tags': ['genres'],
            'summary': 'Find all genres',
            'description': 'Returns all genres',
            'responses': {
                '200': {
                    'description': 'A list of genres',
                    'schema': GenreModel,
                }
            }
        })
    
        def get(self):
            db = get_db()
            result = db.run('MATCH (genre:Genre) RETURN genre')
            return [serialize_genre(record['genre']) for record in result]
    
    ...
    
    def serialize_genre(genre):
        return {
            'id': genre['id'],
            'name': genre['name'],
        }
    
    ...
    
    api.add_resource(GenreList, '/api/v0/genres')
    

    What’s Going on with the Serializer?

    The Bolt driver responses are different than what you might be used to if you’ve used a non-Bolt Neo4j driver.

    In the “get all Genres” example described above, result = db.run('MATCH (genre:Genre) RETURN genre') returns a series of records:

    An Example Record
    {
       "keys":[
          "genre"
       ],
       "length":1,
       "_fields":[
          {
             "identity":{
                "low":719,
                "high":0
             },
             "labels":[
                "Genre"
             ],
             "properties":{
                "name":"Action",
                "id":{
                   "low":16,
                   "high":0
                }
             },
             "id":"719"
          }
       ],
       "_fieldLookup":{
          "genre":0
       }
    }
    
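    The low/high pairs in that record are how the driver represents 64-bit integers as two 32-bit halves. Recombining them is straightforward for the non-negative case (a sketch):

```python
def to_int(field):
    # Combine the halves: high word shifted left 32 bits,
    # low word treated as unsigned
    return (field["high"] << 32) | (field["low"] & 0xFFFFFFFF)
```

    For example, the genre id `{"low": 16, "high": 0}` above recombines to plain 16.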

    The serializer parses these messy results into the data we need to build a useful API:

    def serialize_genre(genre):
        return {
            'id': genre['id'],
            'name': genre['name'],
        }
    

    Voila! An array of genres appears at /genres.

    Beyond the /Genres Endpoint


    Of course, an app that just shows movie genres isn’t very interesting. Take a look at the routes and models used to build the home page, movie detail page and person detail page.

    The User Model


    Aside from creating themselves and authenticating with the app, Users (blue) can rate Movies (yellow) with the :RATED relationship, illustrated below.

    User data model for the Neo4j Flask movies template app


    User Properties

      • password: The hashed version of the user’s chosen password
      • api_key: The user’s API key, which the user uses to authenticate requests
      • id: The user’s unique ID
      • username: The user’s chosen username
    :RATED Properties

    rating: an integer rating between 1 and 5, with 5 being love it and 1 being hate it.

    My rated movies in the Neo4j Flask movies template app


    Users Can Create Accounts


    Before a User can rate a Movie, the user has to exist, i.e., someone has to sign up for an account. Signup will create a node in the database with a User label along with the properties necessary for logging in and maintaining a session.

    Create user account page in the Neo4j Flask movies template app

    Figure 1. web/src/pages/Signup.jsx

    The registration endpoint is located at /api/v0/register. The app submits a request to the register endpoint when a user fills out the “Create an Account” form and taps “Create Account”.

    Assuming you have the API running, you can test requests either by using the interactive docs at http://localhost:5000/docs or by using cURL.

    Use Case: Create a New User

    Request
    curl -X POST --header 'Content-Type: application/json' \
                 --header 'Accept: application/json' \
                 -d '{ "username": "Mary Jane", "password": "SuperPassword"}' \
                 'http://localhost:5000/api/v0/register'
    

    Response
    {
       "id":"e1e157a2-1fb5-416a-b819-eb75c480dfc6",
       "username":"Mary Jane",
       "avatar":{
          "full_size":"https://www.gravatar.com/avatar/b2a02..."
       }
    }
    

    Use Case: Try to Create a New User but Username Is Already Taken

    Request
    curl -X POST --header 'Content-Type: application/json' \
                 --header 'Accept: application/json' \
                 -d '{ "username": "Mary Jane", "password": "SuperPassword"}' \
                 'http://localhost:5000/api/v0/register'
    

    Response
    {
       "username":"username already in use"
    }
    

    User registration logic is implemented in /flask-api/app.py as described below:

    class Register(Resource):
        @swagger.doc({
            'tags': ['users'],
            'summary': 'Register a new user',
            'description': 'Register a new user',
            'parameters': [
                {
                    'name': 'body',
                    'in': 'body',
                    'schema': {
                        'type': 'object',
                        'properties': {
                            'username': {
                                'type': 'string',
                            },
                            'password': {
                                'type': 'string',
                            }
                        }
                    }
                },
            ],
            'responses': {
                '201': {
                    'description': 'Your new user',
                    'schema': UserModel,
                },
                '400': {
                    'description': 'Error message(s)',
                },
            }
        })
        def post(self):
            data = request.get_json()
            username = data.get('username')
            password = data.get('password')
            if not username:
                return {'username': 'This field is required.'}, 400
            if not password:
                return {'password': 'This field is required.'}, 400
    
            db = get_db()
    
            results = db.run(
                '''
                MATCH (user:User {username: {username}}) RETURN user
                ''', {'username': username}
            )
            try:
                results.single()
            except ResultError:
                pass
            else:
                return {'username': 'username already in use'}, 400
    
            results = db.run(
                '''
                CREATE (user:User {id: {id}, username: {username}, 
                                   password: {password}, 
                                   api_key: {api_key}}) RETURN user
                ''',
                {
                    'id': str(uuid.uuid4()),
                    'username': username,
                    'password': hash_password(username, password),
                    'api_key': binascii.hexlify(os.urandom(20)).decode()
                }
            )
            user = results.single()['user']
            return serialize_user(user), 201
    
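    Two helpers referenced above, hash_password and serialize_user, are not shown in this excerpt. Here is a minimal sketch of what they could look like (an assumption based on the response shape above, not the template's exact code): the password is salted with the username before hashing, and the avatar URL embeds a Gravatar-style MD5 digest of the username.

```python
import hashlib

def hash_password(username, password):
    # Hypothetical implementation: salt the password with the username
    # so identical passwords hash differently for different users.
    return hashlib.sha256((username + password).encode('utf-8')).hexdigest()

def serialize_user(user):
    # Gravatar URLs embed an MD5 hex digest of a lowercased identifier.
    digest = hashlib.md5(
        user['username'].strip().lower().encode('utf-8')).hexdigest()
    return {
        'id': user['id'],
        'username': user['username'],
        'avatar': {'full_size': 'https://www.gravatar.com/avatar/' + digest},
    }
```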

    Users Can Log In


    Now that users are able to register for an account, we can define the view that allows them to login to the site and start a session.

    User login page on the Neo4j Flask movies template app

    Figure 2. /web/src/pages/Login.jsx

    The login endpoint is located at /api/v0/login. The app submits a request to this endpoint when a user fills in a username and password and submits the login form.

    Assuming you have the API running, you can test requests either by using the interactive docs at http://localhost:5000/docs/ or by using cURL.

    Use Case: Login

    Request
    curl -X POST --header 'Content-Type: application/json' 
                 --header 'Accept: application/json' -d 
                          '{"username": "Mary Jane", "password": "SuperPassword"}' 
                          'http://localhost:5000/api/v0/login'
    

    Response
    {
      "token":"5a85862fb28a316ea6a1"
    }
    

    Use Case: Wrong Password

    Request
    curl -X POST --header 'Content-Type: application/json' 
                 --header 'Accept: application/json' -d 
                          '{ "username": "Mary Jane", "password": "WrongPassword"}' 
                          'http://localhost:5000/api/v0/login'
    

    Response
    {
       "password":"wrong password"
    }
    

    See Myself

    Request
    curl -X GET --header 'Accept: application/json' 
                --header 'Authorization: Token 5a85862fb28a316ea6a1' 
                         'http://localhost:5000/api/v0/users/me'
    

    Response
    {
      "id": "94a604f7-3eab-4f28-88ab-12704c228936",
      "username": "Mary Jane",
      "avatar": {
        "full_size": "https://www.gravatar.com/avatar/c2eab..."
      }
    }
    

    The code here is similar to that of /register. There is a similar form to fill out, where a user types in their username and password.

    With the given username, a User is initialized. The password they filled out in the form is verified against the hashed password that was retrieved from the corresponding :User node in the database.

    If the verification is successful, it will return a token. The user is then directed to an authentication page, from which they can navigate through the app, view their user profile and rate movies. Below is a rather empty user profile for a freshly created user:
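    The verification step can be sketched as follows (hypothetical helper and field names, not the template's exact code): the submitted password is re-hashed and compared against the hash stored on the matched :User node, and on success the stored api_key is returned as the token.

```python
import hashlib

def hash_password(username, password):
    # Hypothetical username-salted hash, matching what registration stored.
    return hashlib.sha256((username + password).encode('utf-8')).hexdigest()

def login(user_record, username, password):
    # user_record: dict-like view of the matched :User node's properties.
    if user_record['password'] != hash_password(username, password):
        return {'password': 'wrong password'}, 400
    # The api_key generated at registration serves as the session token.
    return {'token': user_record['api_key']}, 200
```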

    User profile page on the Neo4j Flask movies template

    Figure 3. /web/src/pages/Profile.jsx

    Users Can Rate Movies


    Once a user has logged in and navigated to a page that displays movies, the user can select a star rating for the movie or remove the rating of a movie he or she has already rated.

    My rated movies in the Neo4j Flask movies template app


    The user should be able to access their previous ratings (and the movies that were rated) both on their user profile and the movie detail page in question.

    Use Case: Rate a Movie

    Request
    curl -X POST --header 'Content-Type: application/json' 
                 --header 'Accept: application/json' 
                 --header 'Authorization: Token 5a85862fb28a316ea6a1' -d 
                          '{"rating":4}' 
                          'http://localhost:5000/api/v0/movies/683/rate'
    

    Response
    {}
    

    Python Implementation

    class RateMovie(Resource):
        @login_required
        def post(self, id):
            parser = reqparse.RequestParser()
            parser.add_argument('rating', choices=list(range(0, 6)), 
                                type=int, required=True, 
                                help='A rating from 0 - 5 inclusive (integers)')
            args = parser.parse_args()
            rating = args['rating']
    
            db = get_db()
            results = db.run(
                '''
                MATCH (u:User {id: {user_id}}),(m:Movie {id: {movie_id}})
                MERGE (u)-[r:RATED]->(m)
                SET r.rating = {rating}
                RETURN m
                ''', {'user_id': g.user['id'], 'movie_id': id, 'rating': rating}
            )
            return {}
    
        @login_required
        def delete(self, id):
            db = get_db()
            db.run(
                '''
                MATCH (u:User {id: {user_id}})
                              -[r:RATED]->(m:Movie {id: {movie_id}}) DELETE r
                ''', {'movie_id': id, 'user_id': g.user['id']}
            )
            return {}, 204
    

    Use Case: See All of My Ratings

    Request
    curl -X GET --header 'Accept: application/json' 
                --header 'Authorization: Token 5a85862fb28a316ea6a1'
                         'http://localhost:5000/api/v0/movies/rated'
    

    Response
    [
      {
        "summary": "Six months after the events depicted in The Matrix, ...",
        "duration": 138,
        "rated": "R",
        "tagline": "Free your mind.",
        "id": 28,
        "title": "The Matrix Reloaded",
        "poster_image": "http://image.tmdb.org/t/p/w185/ezIur....jpg",
        "my_rating": 4
      },
      {
        "summary": "Thomas A. Anderson is a man living two lives....",
        "duration": 136,
        "rated": "R",
        "tagline": "Welcome to the Real World.",
        "id": 1,
        "title": "The Matrix",
        "poster_image": "http://image.tmdb.org/t/p/w185/gyn....jpg",
        "my_rating": 4
      }
    ]
    

    Python Implementation

    class MovieListRatedByMe(Resource):
        @login_required
        def get(self):
            db = get_db()
            result = db.run(
                '''
                MATCH (:User {id: {user_id}})-[rated:RATED]->(movie:Movie)
                RETURN DISTINCT movie, rated.rating as my_rating
                ''', {'user_id': g.user['id']}
            )
            return [serialize_movie(record['movie'], 
            record['my_rating']) for record in result]
    
    ...
    
    def serialize_movie(movie, my_rating=None):
        return {
            'id': movie['id'],
            'title': movie['title'],
            'summary': movie['summary'],
            'released': movie['released'],
            'duration': movie['duration'],
            'rated': movie['rated'],
            'tagline': movie['tagline'],
            'poster_image': movie['poster_image'],
            'my_rating': my_rating,
        }
    

    Next Steps


      • Fork the repo and hack away! Find directors that work with multiple genres, or find people who happen to work with each other often as writer-director pairs.
      • Find a way to improve the template or the Python driver? Create a GitHub Issue and/or submit a pull request.

    Resources


    Found a Bug? Got Stuck?

      • The neo4j-users #help channel will be happy to assist you.
      • Make a GitHub issue on the driver or app repos.
    Neo4j


    Want to learn more about what you can do with graph databases? Click below to get your free copy the O’Reilly Graph Databases book and learn to harness the power of graph technology.

    Get My Free Copy

    This Week in Neo4j – 11 March 2017

    Welcome to this week in Neo4j.

    This week we’ve got articles showing how to integrate Neo4j with Kibana, using jQAssistant from Pandas, and lots of releases of Neo4j and related projects.

    But first:

    International Women’s Day


    Explore everything that's happening in the Neo4j community for the week of 11 March 2017

    Praveena and Eve answering questions at the Neo4j booth

    On Wednesday 8th March Neo4j sponsored Tech (K)now Day – a mini conference hosted by Skillsmatter for International Women’s Day.

    There were a variety of different talks and workshops including a Neo4j one run by Eve Bright, Praveena Fernandes, and me. Attendees had the chance to explore Buzzfeed’s TrumpWorld dataset and learn Neo4j in the process.

    The next day we ran a similar workshop for people interested in journalism at journocoders in London. If you’d like to get your hands on the dataset, you can get up and running in a few minutes with your own TrumpWorld Neo4j sandbox.

    There were a number of updates pushed to the TrumpWorld-Graph repository and Will Lyon released updated data and a browser guide for campaign financing in 2016 for his NICAR workshop.

    New releases of Neo4j and Neo4j Drivers


    It’s been a busy week for releases!

    The drivers team have released the first versions of the 1.2 series for the Java and .NET drivers. The Python one is planned for next week, and the JavaScript driver will follow two weeks later.

    This release removed some boilerplate code and introduced retry logic based on encapsulated “unit of work” operations. We also released Neo4j 3.2.0 ALPHA06 as part of the early release program. This version contained some Windows fixes and support for whitelisting procedures. For all changes, see the release notes.

    New release of APOC – lots of goodies to play with



    APOC Activity in March 2017

    Activity on the APOC project

    The APOC community have been busy as well. This week has seen the most commits since the surge in May/June 2016 when a lot of procedures were added.

    There have been releases of APOC that are compatible with Neo4j 3.2.0-alpha06, Neo4j 3.1.2, and Neo4j 3.0.8. The documentation was also updated and is now available for each version.

    Included in these releases are new date functions, a couple of cool new procedures for working with paths, and new functions for working with collections. Notable improvements to apoc.periodic.iterate now allow much faster operations and retries. Manual free-text indexes can now be kept up to date, and the expire (TTL) functionality is more robust.

    You can read full release notes for 3.2.0.1 (for Neo4j 3.2.0-alpha06), 3.1.2.5 (Neo4j 3.1.2), and 3.1.0.4 (Neo4j 3.1.1).

    If you try any of these releases, let us know how you get on by dropping us an email devrel@neo4j.com.

    The Neo4j Grails plugin saw its 6.0.9 and 6.1.0-RC1 releases and our partner GraphAware published the 1.0.0-RC1 version of the new Neo4j-PHP-OGM.

    Analysing Web Traffic with Neo4j


    In September 2016 Dmitriy Nesteryuk wrote an article explaining how web browsers could pre-render the next page a user might visit if they could predict what that page might be. He’s now created Sirko Engine which does this prediction in Neo4j.

    I think searching for user journeys through web sites is a fascinating use of Neo4j and Dmitriy’s project reminded me about a blog post written by Nick Dingwell of Snowplow Analytics and how they’d used Neo4j to run path analysis on their own website.

    Connecting Neo4j to Kibana, analysing source code with jQAssistant/Pandas, and more


    In other news:

    So what’s there to look forward to in the world of graphs next week?


    This Week in Neo4j – 18 March 2017

    Welcome to This Week in Neo4j.

    If you’ve got any ideas for things we should cover in future editions, I’m @markhneedham on Twitter or send an email to devrel@neo4j.com.

    WordPress Recommendation Engine


    Adam Cowley has been busy over the last couple of weeks building a Neo4j-based recommendation engine for WordPress.

    This week in Neo4j - 18 March 2017

    The WordPress graph

    You can follow his work in a three-part blog series:

    Social Network Analysis, Software Analytics and RDBMS-to-Graph


    What’s happening on GitHub?


    This week I decided to do some exploration of Neo4j projects on GitHub that haven’t necessarily surfaced on Twitter. I queried the Neo4j community graph to find the most recent Neo4j-based projects.

    These were the most interesting ones I found:

    Next Week


    So what’s there to look forward to in the world of graphs next week?

    Tweet of the Week


    We’ll finish with my favourite tweet of the week by Tobias Zander. If you’re having fun playing with Neo4j, tweet with the #Neo4j hashtag and maybe you’ll feature in next week’s post.

    Have a good weekend!

    This Week in Neo4j – 25 March 2017


    Welcome to this week in Neo4j where we collect the most interesting things that have happened in the world of graph databases over the last 7 days.

    If you’ve got something that you’d like to see featured in a future version let me know. I’m @markhneedham on Twitter or send an email to devrel@neo4j.com.


    In last week’s online meetup Mesosphere’s Johannes Unterstein showed us how to get a Neo4j causal cluster up and running on DC/OS.



    This was the culmination of several weeks’ effort where Johannes started with the Neo4j Docker image, figured out how to get it to play nicely with the Mesos ecosystem and created a Mesosphere Universe package so that users can easily create Neo4j clusters via the Marathon scheduler.

    On top of this Johannes has been a part of the Neo4j community since 2013 and has organized several meetups as well as writing a Play Framework integration for Spring Data Neo4j.

    On behalf of the Neo4j community I’d like to thank Johannes for all his efforts and I’m looking forward to your talk at GraphConnect Europe on 11th May 2017!

    Using Graph Visualization to Explore Corruption in Egypt and FIFA


    There were a couple of interesting posts showing how to use graph visualizations to explore two different types of corruption.

    Lana Chan wrote What Do Big Data Paris and the Panama Papers Have In Common? In this post Lana shows how you can use the Tom Sawyer graph data visualization tool to explore the 2015 FIFA corruption scandal.

    Explore everything that's happening in the Neo4j community for the week of 25 March 2017

    Visualizing the Egypt corruption network

    Noonpost, an interactive Arabic media website, explain how they used Linkurious for large-scale investigations in a project on Egypt’s corruption networks.

    In the post, they explain how they were able to explore connections between the army and its affiliates across various influence networks including the health, food, and tourism sectors using a combination of Cypher queries and graph visualizations.

    There’s lots of good stuff in both of these posts if you’re interested in data journalism.

    If you’d like to do data journalism work using Neo4j but don’t know how, sign up for the Neo4j Data Journalism Accelerator Program and you’ll get the opportunity to work with engineers from Neo4j’s Developer Relations team to get your analysis up and running.

    Visual Graph Modeling and Importing


    Michael Hunger created a video showing how to sketch graph models and load them into Neo4j using Alistair Jones’ arrows tool.



    Will Lyon presented a webinar late last week where he showed how to model and import real-world datasets using Neo4j.

    Will shows how to import data from Yelp using several different approaches:

      • apoc.load.json – a procedure from the APOC library that can import JSON data directly.
      • LOAD CSV – a Cypher command for importing CSV files. Works well up to ~10 million rows.
      • neo4j-import – a tool for importing large initial datasets.

    Will also talks about Neo4j’s user-defined procedures and functions, and if you’re interested in creating your own ones we’ve created a couple of new pages on the Neo4j developer site to help you get started:

    Emil in Forbes, Hiking Recommendations, Malware Clustering, and DC/OS


    On the Podcast


    This week Rik interviewed Alistair Jones about the Causal Clustering feature released in Neo4j 3.1 back in December.

    They go through the history of clustering in Neo4j, from the use of Zookeeper in the 1.8 series up to the current day, where we’ve implemented a version of Diego Ongaro’s Raft consensus protocol.

    If you want to learn more, there’s also a video of Alistair presenting on this topic.

    Next Week


    So what’s there to look forward to in the world of graphs next week?

    Tweet of the Week


    My favorite tweet this week was by Jose Ramón Cajide who’s been analyzing Twitter networks using Neo4j in RStudio:

    If you want to graph your own Twitter network you can try out the Neo4j Twitter Sandbox. Don’t forget to tweet your graph using the #Neo4j hashtag if you give it a try.

    Enjoy your weekend, it’s finally spring – hoorah!

    Cheers, Mark

    Public Service Announcement: Neo4j Drivers 1.2 Release


    We are happy to announce that all our officially supported Bolt drivers are now available as versions 1.2. With this release, we massively improved the way you write code to work with a cluster, introducing reusable “transaction functions” and built-in retry functionality.

    For some new capabilities we added new APIs. Here you can find detailed documentation and the driver repositories.

    New Capabilities in all Neo4j Drivers


    Drivers now handle cluster server failures and role changes automatically, allowing the application to treat the cluster as a single black box providing read and write services. This massively simplifies the programming model: you no longer have to track cluster state or retry operations yourself when that state changes.

      • A Bolt+routing URI represents a network address
      • Automatic DNS “Round Robin” resolution can yield multiple hosts → addresses
      • A load balancer (e.g., AWS ELB) can route to multiple hosts → addresses
      • These are the routing bootstrap addresses: they should be configured to be probable core servers
      • Read Replicas cannot provide routing tables
      • When the driver is initialized, it goes to one of the bootstrap addresses to get a routing table

    Neo4j drivers requests routing table
    Neo4j drivers cluster returns routing table
    Neo4j drivers routes client request
    Neo4j drivers refreshes routing table


    The Neo4j driver will switch traffic to an appropriate read or write connection depending on the transaction access mode. The read/write transaction access mode is a familiar SQL/ODBC/JDBC pattern of use.
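    Conceptually, the driver consults its routing table and picks a server matching the transaction's access mode. A toy illustration of that selection (not the driver's actual implementation):

```python
import itertools

class RoutingTable:
    """Toy model: round-robin reads over readers, writes over writers,
    as a routing driver conceptually does."""
    def __init__(self, readers, writers):
        self._readers = itertools.cycle(readers)
        self._writers = itertools.cycle(writers)

    def acquire(self, access_mode):
        # Read transactions can go to any server offering reads;
        # write transactions must go to a server that accepts writes.
        if access_mode == 'READ':
            return next(self._readers)
        return next(self._writers)
```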

    We added new methods, Session.read_transaction and Session.write_transaction, to allow the execution of reusable units of work: you simply pass a transaction function to the method. To allow re-execution of failed operations, the total duration for retries is configurable via max_retry_time in the driver configuration (the default is 30 seconds).

    Here is an example of how you would use this capability:

    Python Example


    from neo4j.v1 import GraphDatabase
    
    
    driver = GraphDatabase.driver("bolt+routing://server:7687",
                                  auth=("neo4j", "password"))
    
    
    def add_friends(tx, name, friend_name):
        tx.run("MERGE (p:Person {name: $name}) "
               "MERGE (f:Person {name: $friend_name}) "
               "MERGE (p)-[:KNOWS]-(f)",
               name=name, friend_name=friend_name)
    
    
    def print_friends(tx, name):
        for record in tx.run(
              "MATCH (a:Person)-[:KNOWS]->(friend) WHERE a.name = $name "
              "RETURN friend.name ORDER BY friend.name", name=name):
            print(record["friend.name"])
    
    
    with driver.session() as session:
        session.write_transaction(
          lambda tx:
            tx.run("create constraint on (p:Person) assert p.name is unique"))
        session.write_transaction(add_friends, "Arthur", "Guinevere")
        session.write_transaction(add_friends, "Arthur", "Lancelot")
        session.write_transaction(add_friends, "Arthur", "Merlin")
        session.read_transaction(print_friends, "Arthur")
    
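    Behind read_transaction and write_transaction, the driver re-runs the transaction function on transient failures until max_retry_time is exhausted. A simplified, driver-independent sketch of that retry loop (not the driver's actual code):

```python
import time

class TransientError(Exception):
    """Stand-in for the driver's transient (retryable) failure class."""

def run_with_retry(work, max_retry_time=30.0, initial_delay=0.1):
    # Re-execute `work` on transient errors with exponential backoff
    # until it succeeds or the retry budget is spent.
    deadline = time.time() + max_retry_time
    delay = initial_delay
    while True:
        try:
            return work()
        except TransientError:
            if time.time() + delay > deadline:
                raise  # budget exhausted: surface the last error
            time.sleep(delay)
            delay *= 2
```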

    Java Example


    You can find the full code in this example project.

    public class Person{
        private final static String COUNT_PEOPLE =
             ("MATCH (a:Person) RETURN count(a)");
    
        // callback method
        public static long count(Transaction tx){
            StatementResult result = tx.run(COUNT_PEOPLE);
            return result.single().get(0).asLong();
        }
        ...
    }
    
    
    public class SocialNetwork{
        public long countUsers() {
            try (Session session = driver.session()){
                return session.readTransaction(Person::count);
            }
        }
    
        public long addUser(Person user) {
            System.out.println(format("Adding user %s", user));
        try (Session session = driver.session()) {
                return session.writeTransaction(user::save);
            }
        }
    }

    We decoupled the Session from a single underlying connection; a Session can now be defined as a causally linked sequence of transactional units of work.

    You don’t need to manage bookmarks for causal consistency manually any longer. Bookmarks are now automatically passed between transactions within a routing session. This makes causal consistency the default interaction mode with the database cluster.
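    Conceptually, every committed transaction yields a bookmark, and the session feeds the latest bookmark into the next transaction so that later reads observe earlier writes. A toy model of that chaining (not the driver's implementation):

```python
class CausalSession:
    """Toy model of automatic bookmark passing within a session."""
    def __init__(self):
        self._bookmark = None  # last bookmark seen by this session

    def run_transaction(self, work):
        # `work` receives the previous bookmark and returns a result
        # plus the bookmark of its own commit; the session keeps the
        # new bookmark for the next unit of work.
        result, new_bookmark = work(self._bookmark)
        self._bookmark = new_bookmark
        return result
```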

    Auto-commit transactions (Session.run) will now run partially synchronously with the network: RUN and PULL_ALL are sent to the server, and the RUN response is received immediately. This allows exceptions to be raised at a more logical point in the application.

    Updates in Some of the Neo4j Drivers


    The Python language driver now includes a compiled C module for improved performance on supported platforms. Please let us know if this works for you.

    Most of the drivers (all except .NET) can now handle the case where the provided hostname resolves to multiple IP addresses.

    As always, we’d love your feedback, so please try out the new Neo4j driver releases and raise feature or bug requests on the driver repositories. Please let us also know what you think about the new APIs and if there are ways to improve them.

    If you need quick help, please join neo4j.com/slack and ask in the #drivers or the appropriate #neo4j-<language> channel. Otherwise, you can also ask on Stack Overflow; please tag your questions with [neo4j-<language>-driver].

    Enjoy the new Neo4j drivers,

    Nigel Small, for the Neo4j Drivers Team


    This Week in Neo4j – 22 April 2017


    Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.


    This Week in Neo4j - 22 April - Dmitry Vrublevsky from Neueda Labs

    Dmitry Vrublevsky from Neueda Labs

    This week’s featured community member is Dmitry Vrublevsky who works for Neueda Labs and has been very active in Neo4j’s community for quite some time.

    He started helping people on StackOverflow and Slack and then started the development of the Neo4j plugin for all the Jetbrains IDEs. That work has evolved into a full featured database tool, which was recently featured on this blog.

    Dmitry also spoke at the openCypher implementers meeting in February and will be at GraphConnect in London. He and his team are currently helping us add some cool features to the Neo4j Browser.

    Neo4j at the Galway-Mayo Institute of Technology


    Multiple students from GMIT have been using Neo4j as part of their graph theory course and have been building a graph of the university timetable.

    I wish I’d got to use Neo4j at university so I’m very jealous – it was Oracle all the way where I studied!

    APOC, Call Data Records, GORM, Twitter Clone


    Online Meetup: Building the Wikipedia Knowledge Graph


    In this week’s Neo4j online meetup, Dr Jesús Barrasa and I showed how to load the Wikipedia Knowledge Graph into Neo4j and write queries against it.

    We’ve been hosting meetups almost every week for the last couple of months so if you want to catch up on earlier episodes you can find all of them on the Neo4j Online Meetup playlist.

    From The Knowledge Base


    We also have a really cool discussion of ways to limit MATCHes in subqueries by Andrew Bowman, our featured community member in the 25 February 2017 edition of TWIN4j.

    On GitHub: Mahout, Holocaust Research, Kafka Connector


    There’s been an incredible amount of activity on GitHub this week. These were the most interesting projects that I came across.

      • UserLine automates the process of creating logon relations from MS Windows Security Events, showing graphical relations among user domains, source and destination logons, and session duration.
      • Nigel Small created Memgraph – a Python library that provides a Neo4j-compatible in-memory graph store.
      • There were some updates to the European Holocaust Research Infrastructure project, which provides a business layer and JAX-RS resource classes for managing holocaust data.
      • Erick Peirson created cidoc-crm-neo4j, a meta-implementation of the CIDOC Conceptual Reference Model (CRM). The CIDOC CRM provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation. The project uses Python’s neomodel to interact with a Neo4j database.
      • gbrodar created pcap4j – a repository of scripts for analysing the output of the Unix pcap tool.
      • Mark Wood created neo4j-mahout which wraps calls to Mahout functions in Neo4j user defined functions. I played around with Mahout a couple of years ago so I’m quite excited to try combine it with Neo4j using this tool.
      • JunfengDuan created kafka-neo4j-connector, which transfers data from Kafka to Neo4j.

    Neo4j Jobs


    I’ve not listed jobs in TWIN4j before but I came across an interesting one posted by Musimap, a B2B cognitive music intelligence company in Brussels. They’re hiring a Full-Stack Web Developer with Neo4j and Python experience so if that sounds like your type of thing it might be worth applying.

    If you have any jobs that you’d like me to feature in future versions, drop me a tweet @markhneedham.

    Next Week


    What’s happening next week in the world of graph databases?

    Tweet of the Week


    My favorite tweet this week was by Felix Victor Münch:

    Don’t forget to retweet Felix’s post if you liked it as well!

    That’s all for this week. Have a great weekend.

    Cheers, Mark

    This Week in Neo4j – 29 April 2017


    Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.

    But before we begin, a quick announcement from us, the Neo4j Developer Relations team.

    Developer Zone at GraphConnect Europe 2017


    To provide the best developer experience at our GraphConnect conference in London, on May 11th 2017, we will open a dedicated Developer Zone.

    We will all be joined by Neo4j engineers, eager to answer your questions and talk about cool stuff you can do with Neo4j.

    So if you can make it to London for GraphConnect, don’t miss the best experience of the show – the Developer Zone. You can register with the DEVZONE30 code for 30% off, or send an email to devrel@neo4j.com to get one of the few free or 50%-off tickets.


    This Week in Neo4j - 29 April 2017

    This week’s featured community member – Michael Moussa

    Michael has been active for quite a while in the Neo4j community, presenting introductions to Neo4j at multiple PHP conferences. Last week he presented at the Lone Star PHP Conference in Dallas, TX.

    He’s also contributed to PHP related projects in the Neo4j community and answered questions in our open channels.

    Last few days of APOC Awareness Month


    We’re in the last days of APOC Awareness Month, so if you haven’t published your article yet, you have until Monday evening (May 1st might be a good day off to work on this).

    Tomaz Bratanic continued his APOC algorithm series and wrote this time about similarities, cluster finding and visualizing them with virtual nodes and relationships. A very interesting read!

    Python, PyData, Flask, NeoModel, and Neo4j


    Nigel Small, author of py2neo and tech lead of the drivers team visited Amsterdam a couple of weeks ago to present “A Pythonic Tour of Neo4j and the Cypher Query Language” at the PyData conference.

    Mostafa Moradian published gRest, a quickstart repository to build applications with Python, Flask, and NeoModel – a Django-like OGM for Neo4j.

    The GraphConnect schedule is a graph


    GraphConnect Schedule Graph

    The GraphConnect Europe 2017 Schedule

    Besides interviewing our community for the Graphistania Podcast and creating Graph-Karaokes, Rik van Bruggen also loves to recreate event schedules in Neo4j, for easy querying and recommendations.

    GraphConnect is no exception and you can now view the schedule as a graph.

    Wikipedia Knowledge Graph, GraphQL, Causal Clustering


      • As a follow up to last week’s online meetup my colleague Jesús Barrasa published a blog post explaining how to create the Wikipedia Knowledge Graph in Neo4j. He loads pages and categories and enriches them by querying dbpedia. You can follow along by running the Neo4j-Browser Guide Jesús created in the blank Neo4j Sandbox.
      • Rik also published parts 2, 3, and 4 of his series answering common questions about Neo4j. You get very detailed answers on questions of scale, the usage of Lucene, Solr, and transactions, and Neo4j’s Gremlin support.
      • If you love to extend Neo4j, you will like this article by Igor Borojevic, who shows, as part of the Neo4j security series, how to build a custom security plugin so you can choose your own way of doing authentication and authorization.
      • Chris Skardon explains step by step how to manually set up a causal cluster with Neo4j 3.1.3 on Microsoft Azure. Enjoy his funny observations and comments in his blog post: So you want to go Causal Neo4j in Azure? Sure we can do that.
      • Magnus Wallberg wrote up the PhUSE conference where he attended a workshop led by Tim Williams comparing RDF and graphs.
      • If you’re looking for a job where you can work with Neo4j full time, Matt Andrews at the Financial Times is hiring:

    The Mattermark GraphQL API Graph


    GraphQL has been on our minds, lately. So, when the Mattermark GraphQL API became available, Will Lyon looked into it and created this insightful blog post on analysing local startup ecosystems based on their data.

    He uses ApolloClient to access the API and turn the data of startups based in his home state of Montana into a graph in Neo4j.

    Will then goes on to use Cypher queries to answer questions such as:

      • What are the companies in Montana that are raising venture capital?
      • Who are the founders?
      • Who is funding them and what industries are they in?
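Questions like these map naturally onto Cypher. As a rough sketch of what such queries could look like, assuming a hypothetical model with :Company, :Person, :Investor and :Industry nodes (the labels, relationship types and property names below are illustrative assumptions, not necessarily the model Will used):

```cypher
// Hypothetical model: (:Person)-[:FOUNDED]->(:Company),
// (:Investor)-[:FUNDED]->(:Company), (:Company)-[:IN_INDUSTRY]->(:Industry)
MATCH (c:Company {state: "Montana"})
OPTIONAL MATCH (founder:Person)-[:FOUNDED]->(c)
OPTIONAL MATCH (inv:Investor)-[:FUNDED]->(c)
OPTIONAL MATCH (c)-[:IN_INDUSTRY]->(industry:Industry)
RETURN c.name AS company,
       collect(DISTINCT founder.name)  AS founders,
       collect(DISTINCT inv.name)      AS investors,
       collect(DISTINCT industry.name) AS industries
```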

    Online Meetup: Learning Chinese with Neo4j


    In this week’s online meetup Fernando Izquierdo showed us how to learn Chinese using Neo4j.

    Even if you’ve got no interest in learning Chinese this is still worth watching because it’s such an innovative use of graphs.

    From The Knowledge Base


    This week from the Neo4j Knowledge Base:

    On GitHub: Rust, Spring Data Neo4j, The Bible


    Here are some of the most interesting projects I found on my GitHub travels:

      • If you like to work in Rust, this crate can help you access Neo4j natively. It uses Cypher via the HTTP protocol and is well documented in the readme. It even offers a macro-based approach for less clutter in your code.
      • Marco Falcier created a quick Spring Data Neo4j example project for managing forests of trees that gives you a good starting point. It runs on a temporary in-memory database, comes with an Angular frontend, and provides Mockito-based tests.
      • The MetaV viz.bible is an online and mobile site publishing detailed connections between Bible verses, with a lot of insights and charts. Olin Blodgett took the CSV data, which is available under a CC license, and transformed it into a graph in Neo4j. You can also see the underlying data model and some example queries. It would be interesting to build an app on top of that graph data which could augment viz.bible with deeper insights based on graph queries and analytics.
      • If you are into life-sciences research and want to work with SNOMED data in Neo4j, Pradeep created a Docker-based workflow using the official containers for Neo4j and SNOMED, plus a Groovy script to load the data into a graph.

    Tweet of the Week


    My favorite tweet this week was by Christos Delivorias:

    That’s all for this week. Have a great weekend.

    Cheers, Michael & Mark

    An Introduction & Tutorial for Structr 2.1

    In one of our previous blog posts, we promised to write more about new features of our upcoming release of Structr, version 2.1, so here we are.

    New Tutorial


    But before we dive into the details, we’d like to announce the first tutorial that our friends over at The SilverLogic created, which will be part of a series of example projects we’ll publish over the next few months. This detailed tutorial on how to create a Structr app shows many of the new features listed in this post. If you follow it, you will be able to create a simple blogging app within a couple of hours.

    You can find the full tutorial on the Structr blog at https://structr.org/blog/blog-app-tutorial.

    Learn more about Structr 2.1 in this introduction and tutorial walking you through the new features


    And now back to the features.

    New Features


    One of the most requested features, among many other improvements and bugfixes, is finally here, and it aims at developer productivity: we added a new deployment tool that allows you to export a complete Structr application in the form of a collection of HTML and JSON files so that you can store it in any version control system (VCS).

    We found a way to serialize all the information that makes up a Structr app, which is stored in Neo4j at runtime, and export it to a filesystem structure. This allows you to use your favorite Integrated Development Environment (IDE) and diff and merge tools to make and track changes. In addition, the deployment tool (export/import) can even be used remotely over HTTP(S), so you don’t need a console login on the server to update your Structr instance.

    Another new feature which makes operating Structr easier is the new web-based configuration tool: No need to manually edit the structr.conf file anymore!

    The config tool UI in Structr 2.1


    The most anticipated feature of the new configuration interface is that you can now start and stop services individually while Structr is running. That means you can disconnect Structr from one Neo4j database and connect it to another, all without stopping the JVM instance, or you can enable and disable debugging and logging flags at runtime, which will greatly improve productivity.

    Apart from that, the upcoming 2.1 release contains lots of new features to boost productivity: There’s a new administration console (press Ctrl-Shift-C to activate) for quick and easy scripting tasks, maintenance operations or monitoring log files, etc. We also improved the internal JavaScript scripting bridge and built a foundation which allows us to add support for more scripting languages like Ruby, PHP, Python or R.

    Some More Improvements


    A few other things we improved:
      • The test coverage has been improved and the tests are running much faster now due to better reuse of Neo4j instances.
      • A couple of new widgets to massively speed up app development
      • Improved schema layout and schema editor enhancements
      • Favourites: Define editable texts like script files or content elements as favourites and access them quickly via a keyboard shortcut (Ctrl-Alt-F)

    Developer Support Program


    Due to the rapidly growing demand for documentation, training materials and project support, we created a new program called the Developer Support Program which covers the most requested support services in an attractive package. We’ll announce more details soon.

    GraphConnect Europe


    Last but not least, Structr is once again happy to be a Gold Sponsor of the upcoming GraphConnect Europe happening in London on 11 May 2017. Save 30% on all tickets with the promo code STRUCTR30.

    See you in London!


    Join us at Europe’s premier graph technology event: Get your ticket to GraphConnect Europe and we’ll see you on 11th May 2017 at the QEII Centre in downtown London!

    Get My Ticket

    This Week in Neo4j – 6 May 2017


    Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.


    This week’s featured community member is Alessio De Angelis, an IT consultant at Whitehall Reply for projects held by SOGEI, the Information and Communication Technology company linked to the Economics and Finance Ministry in Italy.

    This week’s featured community member: Alessio De Angelis

    Alessio first came onto the Neo4j scene a few years ago when he took part in a GraphGist competition, creating an entry showing Santa’s shortest weighted path around the world.

    Querying the Neo4j TrumpWorld Graph with Amazon Alexa


    The coolest Neo4j project of the week award goes to Christophe Willemsen, our featured community member on 2 April 2017.

    Christophe has created a tool that executes Cypher queries in response to commands issued to his Amazon Alexa.

    Rare diseases research, APOC spatial, Twitter Clone


    Rare diseases research

    Rare diseases research using graphs and Linkurious

    Online Meetup: Planning your next hike with Neo4j


    In this week’s online meetup Amanda Schaffer showed us how to plan hikes using Neo4j.

    There’s lots of Cypher queries and a hiking recommendation engine, so if that’s your thing give it a watch.

    From The Knowledge Base


    On the podcast: Andrew Bowman


    In his latest podcast interview Rik van Bruggen interviews our newest Neo4j employee, Andrew Bowman. You’ll remember that Andrew was our very first featured community member on 25 February 2017.

    Rik and Andrew talk about Andrew’s contributions to the community and Andrew’s introduction to Neo4j while building social graphs for Athena Health.

    On GitHub: Graph isomorphisms, visualization, natural language processing


    There’s a variety of different projects on my GitHub travels this week.

    Next Week


    It’s GraphConnect Europe 2017 week so the European graph community will be at the QE2 in London on Thursday 11th May 2017.

    The venue for GraphConnect Europe 2017

    The QE2 in London, the venue for GraphConnect Europe 2017

    If you would like to be in with a chance of winning a last minute ticket don’t forget to register for our online preview meetup on Monday 8th May 2017 at 11am UK time.

    We’ll be joined by a few of the speakers who’ll give a sneak peek of their talks as well as talk about what they love about GraphConnect.

    Hope to see you there!

    Tweet of the Week


    I’m going to cheat again and have two favourite tweets of the week.

    First up is Chris Leishman sharing his favourite font for writing Cypher queries:

    And there was also a great tweet by Caitlin McDonald:

    That’s all for this week. Have a great weekend and I’ll hopefully see some of you next week at GraphConnect.

    Cheers, Mark

    This Week in Neo4j – Moving Adobe Behance from Cassandra to Neo4j, New Go Driver, Emil on The New Stack Makers Podcast


    Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.

    This week David Fox explains how his team at Adobe moved from a 48 instance Cassandra cluster to a 3 instance Neo4j one, Emil is interviewed on The New Stack Makers Podcast, Neo4j Launches Commercial Kubernetes Application on GCP Marketplace, and we have the first alpha release of our new Go driver!


    This week’s featured community member is David Fox, Software Engineer at Adobe.

    David Fox – This Week’s Featured Community Member

    David has been a member of the Neo4j community for many years and presented Connections Through Friends: The Second Degree and Beyond at GraphConnect 2013.

    I first came across David in my role in Neo4j’s customer success team while David was working at Snap Interactive (now PeerStream). David has since presented his experiences there in a talk at the Neo4j New York meetup titled Running Neo4j in Production: Tips, Tricks and Optimizations.

    David now works for Adobe, and is responsible for the backend infrastructure and performance on Behance – a social network for creatives, serving over 10 million members. We’ll cover more about his experience there below.

    David also built devRant – a community especially crafted with the wants and needs of developers in mind – and wrote about his experience using Neo4j as part of that application.

    On behalf of the Neo4j community, thanks for all your work David!

    Moving Adobe Behance’s activity feed from Cassandra → Neo4j


    As mentioned above, David was interviewed by Prof. Roberto V. Zicari, about his experience building a new implementation of Behance’s activity feed feature.

    In the first part of the interview David explains how the activity feed feature works and describes some of the limitations of their original implementation, which used Cassandra as the underlying storage engine.

    He goes on to observe that the full dataset size has been reduced from 50 TB when it was stored in Cassandra down to around 40 GB in Neo4j. They’re also able to power this system using a cluster of 3 Neo4j instances, down from 48 Cassandra instances of equal specs.

    As a result, they’ve been able to dramatically decrease the number of developer-operations staff hours required each month to keep the activity feed running.

    Neo4j Launches Commercial Kubernetes Application on GCP Marketplace


    On Wednesday David Allen announced the release of the Neo4j Graph Platform within a commercial Kubernetes application to all users of the newly renamed Google Cloud Platform Marketplace.

    This means that users can now easily deploy Neo4j’s native graph database capabilities for Kubernetes directly into their GKE-hosted Kubernetes cluster.

    On The New Stack Makers Podcast: Emil Eifrem


    They talk about the history of Neo4j from its origins solving a problem in enterprise content management, through to the release of Neo4j Bloom last month, and Emil’s vision of the future of machine learning and graphs.

    You can listen to the interview below.



    RDFS/OWL ontologies → Neo4j, Part 4 of Dating Site, Merging data from optional keys


    First alpha of Go Neo4j driver


    Based on popular demand our drivers team have been working on a Go driver for Neo4j, and this week released its first alpha version.

    You can find instructions for using the driver in the neo4j-go-driver GitHub repository, and if you’ve used any of the other language drivers you will find the same familiar API that you’re used to.

    The GA for the Go Driver is planned along with the Neo4j 3.5 release later this year. If you want to learn more you can join the #neo4j-golang channel of the Neo4j users slack.

    Creating Nodes and Relationships Dynamically with APOC


    Creating nodes and relationships with Cypher is really straightforward. It only gets tricky when your labels, relationship types or property keys are dynamic and driven by data.



    The Cypher planner only works with static tokens, and in this video Michael shows how APOC procedures come to the rescue for creating, merging and updating nodes and relationships with dynamic data coming from user-provided strings or lists.
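As a minimal sketch of this pattern, using the apoc.create.node and apoc.create.relationship procedures with label and relationship-type values supplied as data (the actual values here are invented for illustration):

```cypher
// Label and relationship type arrive as data, not as static Cypher tokens
WITH "Person" AS label, "KNOWS" AS relType
CALL apoc.create.node([label], {name: "Alice"}) YIELD node AS alice
CALL apoc.create.node([label], {name: "Bob"}) YIELD node AS bob
CALL apoc.create.relationship(alice, relType, {since: 2017}, bob) YIELD rel
RETURN alice, bob, rel
```

The same family of procedures includes apoc.merge.node for MERGE-style semantics with dynamic labels.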

    Python Dependency Graph, Fraud Detection with Neo4j, Neo4j OGM Release


      • I wrote a blog post showing how to analyse a graph of your Python dependencies using centrality algorithms from the Neo4j Graph Algorithms library.
      • Joe Depeau presented a webinar showing How to Build a Fraud Detection Solution with Neo4j. Joe shows the value that graphs can add beyond traditional fraud detection methods, shows how Neo4j can fit in a typical architecture, and demonstrates how Neo4j Bloom can be used to explore a fraud dataset.
      • Michael Simons released version 3.0.4 of Neo4j OGM. This version has support for version 1.5 of Bolt drivers, compatibility for 3.4 point types, and several bug fixes.
      • Jennifer Reif has written a blog post in which she covers the history of data storage, contrasts relational and graph data modeling, and shares some common use cases for graphs.

    Next Week


    What’s happening next week in the world of graph databases?

      • July 25th 2018: Neo4j Quick Graphs: Extracting Taxonomies, Strava, Wikipedia, Python Dependencies at the Neo4j – London User Group, presented by Mark Needham and Jesús Barrasa
      • July 25th 2018: Querying Open Civic Data Using Cypher & Neo4j at Philly GraphDB

    Tweet of the Week


    My favourite tweet this week was by Iian Neill:

    Don’t forget to RT if you liked it too.

    That’s all for this week. Have a great weekend!

    Cheers, Mark

    Viewing all 195 articles
    Browse latest View live