Channel: python Archives - Graph Database & Analytics

py2neo 3.1: The World’s Most Amazing Python Driver for Neo4j

Learn about py2neo 3.1, a community Python driver for Neo4j, including the new Object-Graph Mapper.

Even though we’ve now released officially supported drivers for Java, Python, JavaScript and .NET, many of the community drivers are still going strong. Indeed, version 3.1 of my own community driver py2neo was released this week, and with it came a brand-new OGM for Python users.

An OGM (Object-Graph Mapper) is to a graph database what an Object-Relational Mapper (ORM) is to a traditional RDBMS: a framework on which database-aware domain objects can be built.

The py2neo OGM centres its operation around the new GraphObject class. This acts as both a base class upon which domain objects can be defined and a manager for the underlying node and relationships that persist it.

Take for example the Movie Graph that comes pre-packaged with Neo4j. We could model a Person from this dataset as below:

class Person(GraphObject):
    __primarykey__ = "name"

    name = Property()
    born = Property()

Here, we define a Person class with two properties. Properties in Neo4j have no fixed type so there’s less to define than there would be for a SQL field in a typical ORM.

We’re also using the same names for the class attributes as we do for the underlying properties: name and born. If necessary, these could be redirected to a differently-named property with an expression such as Property(name="actual_name").

Lastly, we define a __primarykey__. This tells py2neo which property should be treated as a unique identifier for push and pull operations. We could also define a __primarylabel__, although by default the class name, Person, will be used instead.

All of this means that behind the scenes, the node for a specific Person object could be selected using a Cypher statement such as:

MATCH (a:Person) WHERE a.name = {n} RETURN a
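As a rough sketch of how such a statement might be assembled from __primarylabel__ and __primarykey__ (a hypothetical helper for illustration, not py2neo's actual internals):

```python
# Hypothetical sketch, not py2neo source: combine the primary label and
# primary key into the selection statement shown above.
# {n} is the parameter placeholder syntax used by Neo4j 3.x.
def selection_cypher(primary_label, primary_key):
    return "MATCH (a:%s) WHERE a.%s = {n} RETURN a" % (primary_label, primary_key)

print(selection_cypher("Person", "name"))
# MATCH (a:Person) WHERE a.name = {n} RETURN a
```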

Broadening out a little, if we wanted to model both Person and Movie from that same dataset, as well as the relationships that connect them, we could use the following:

class Movie(GraphObject):
    __primarykey__ = "title"

    title = Property()
    tagline = Property()
    released = Property()

    actors = RelatedFrom("Person", "ACTED_IN")
    directors = RelatedFrom("Person", "DIRECTED")
    producers = RelatedFrom("Person", "PRODUCED")

class Person(GraphObject):
    __primarykey__ = "name"

    name = Property()
    born = Property()

    acted_in = RelatedTo(Movie)
    directed = RelatedTo(Movie)
    produced = RelatedTo(Movie)

This introduces two new attribute types: RelatedTo and RelatedFrom. These define sets of related objects that are all connected in a similar way. That is, they share a common start or end node plus a common relationship type.

Take for example acted_in = RelatedTo(Movie). This describes a set of related Movie nodes that are all connected by an outgoing ACTED_IN relationship. Note that like the property name above, the relationship type defaults to match the attribute name itself, albeit upper-cased. Conversely, the corresponding reverse definition, actors = RelatedFrom("Person", "ACTED_IN"), specifies the relationship name explicitly as this differs from the attribute name.
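The defaulting rule can be sketched as a tiny helper (my reading of the behaviour described above, not py2neo source):

```python
# Sketch of the defaulting rule: the relationship type falls back to the
# upper-cased attribute name unless one is given explicitly.
def rel_type(attr_name, explicit_type=None):
    return explicit_type if explicit_type is not None else attr_name.upper()

print(rel_type("acted_in"))            # ACTED_IN (defaulted)
print(rel_type("actors", "ACTED_IN"))  # ACTED_IN (explicit override)
```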

So how do we work with these objects? Let’s say that we want to pluck Keanu Reeves from the database and link him to the timeless epic Bill & Ted’s Excellent Adventure (sadly omitted from the original graph). First we need to select the actor using the GraphObject class method select via the Person subclass. Then, we can build a new Movie object, add this to the set of movies acted_in by the talented Mr Reeves and finally push everything back into the graph. The code looks something like this:

keanu = Person.select(graph, "Keanu Reeves").first()
bill_and_ted = Movie()
bill_and_ted.title = "Bill & Ted's Excellent Adventure"
keanu.acted_in.add(bill_and_ted)
graph.push(keanu)

All related objects become available to instances of their parent class through a set-like interface, which offers methods such as add and remove. When these details are pushed back into the graph, the OGM framework automatically builds and runs all the necessary Cypher to make this happen.
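As a minimal, hypothetical stand-in for that set-like interface (py2neo's real class also tracks relationship details and pending database changes):

```python
# A toy sketch of the set-like related-objects container described above;
# not py2neo's implementation, just the add/remove/membership behaviour.
class RelatedObjects:
    def __init__(self):
        self._members = set()

    def add(self, obj):
        self._members.add(obj)

    def remove(self, obj):
        self._members.discard(obj)

    def __contains__(self, obj):
        return obj in self._members

    def __len__(self):
        return len(self._members)

acted_in = RelatedObjects()
acted_in.add("Bill & Ted's Excellent Adventure")
acted_in.add("The Matrix")
acted_in.remove("The Matrix")
print(len(acted_in))  # 1
```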

More complex selections are possible through the select method as well. The where method can make use of any expression that can be used in a Cypher WHERE clause. For example, to output the names of every actor whose name starts with ‘K’, you could use:

for person in Person.select(graph).where("_.name =~ 'K.*'"):
    print(person.name)

Note that the underscore character is used here to refer to the node or nodes being matched.

There’s a lot more information available in the py2neo documentation and there’s also a demo application in the GitHub repository that shows how this all comes together in a mini movie browser (screenshot below).

The sample movie web app that comes with py2neo 3.1


As always, if you have any questions about py2neo or the official drivers, I’ll try my best to help. My contact details can be found somewhere on this page probably.


Want to learn more about graph databases and Neo4j? Click below to register for one of our online training classes, Introduction to Graph Databases or Neo4j in Production, and get up to speed with graph database technology.


My Neo4j Summer Road Trip to the World of Healthcare [Part 2]


Part 2: Visualize XML Files in Neo4j with APOC


Welcome back to my Neo4j summer adventure. In my previous post, I gathered all the available data and explored how to model the data into a healthcare graph. Starting with this post, I will be focusing on loading the data into the healthcare graph.

As a Neo4j newbie, before starting the ETL, I researched the methods people have been using to transform XML data into Neo4j graph data. Most of them converted the XML files to CSV first and then loaded the data into Neo4j. While I was teaching myself Cypher, I discovered that APOC allows me to extract information from XML and load it directly into a graph. However, few blog posts document this procedure, so why not try the new way – it wouldn’t be a real adventure without some fun exploration, would it?

In this week’s blog post, I want to show you how I load XML files into a graph using APOC. I will be working with lobbying disclosures and contributions data, and by the end of this post you will see some fun queries I created to gain interesting insights into how the healthcare system is influenced by lobbying.

Now let’s begin our adventure for this week!

1. Getting Ready


  1. Download the data into a directory. In this project, I am working with XML data from 2013. The contributions contain 87.5MB of data and the disclosures contain 894.9MB. You can download the same data here:
  2. Download the latest APOC:
  3. Install the Python driver py2neo:

     $ pip install py2neo

2. Data Integration


    Now we are ready to go. Though Neo4j is schema-less, having a clear structure for the graph helps you determine where to go. It’s like a map or compass, and this is especially true when I need to traverse an XML tree structure to access the child elements.

    Now let’s take a look at the map of where we will be going for this week:

    Part 2 of using Neo4j to graph the healthcare industry. This week: XML and lobbying disclosures


    Nodes :Issue, :Disclosure and :Client will be extracted from the disclosure XML files, and nodes :Legislator, :Committee, :Contribution and :Contributor will be extracted from the contribution XML files. Both the disclosure and contribution XML data contain information about the :LobbyFirm and :Lobbyist nodes, so I will use MERGE statements to create :LobbyFirm and :Lobbyist and prevent duplicates.

    Now let me show you how I processed disclosure XML using APOC. (You can find the whole ETL python code here.)

    A. Accessing Child Elements of XML in APOC


    Let me start off by showing you the structure of the disclosure XML files.

    The XML file structure of lobbying disclosures


    APOC allows me to access the child elements of <LOBBYINGDISCLOSURE2>. Here is the Cypher statement to extract the properties of :LobbyFirm (in orange):

    CALL apoc.load.xml('file:///2013_1stQuarter_XML/300529228.xml') 
    YIELD value
    WITH [attr in value._children 
    WHERE attr._type in ['organizationName', 'address1', 'city', 'state', 'zip', 'country', 'houseID'] | [attr._type, attr._text]] as pairs 
    CALL apoc.map.fromPairs(pairs) 
    YIELD value as properties
    RETURN properties
    

    The query returns this:

    An APOC Cypher query on lobbying disclosure data


    The way of calling APOC to extract properties for other nodes is very similar; you can find every single detail of my Python code here. In this project, when creating nodes :Issue and :Lobbyist, I have to deal with more complicated parent-child structures (as you can see from the XML map above, <Lobbyists> and <issueAreaCode> are siblings, and <Lobbyists> has children <Lobbyist>; I maintained this structure in the healthcare graph).

    If you are facing a similar problem, the collect() function will be helpful. I used it to aggregate properties (labeled in yellow and blue) into a list, then access the desired properties by indexing.
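To see the kind of [type, text] pairs the Cypher above works with, here is a stdlib sketch that walks a made-up miniature of the disclosure XML with xml.etree instead of APOC (the XML snippet is invented, not real data):

```python
import xml.etree.ElementTree as ET

# A stdlib analogue of what apoc.load.xml exposes: each child element's
# tag and text, collected into the same [type, text] pairs used above.
xml_doc = """
<LOBBYINGDISCLOSURE2>
  <organizationName>Example Firm LLP</organizationName>
  <city>Austin</city>
  <state>TX</state>
</LOBBYINGDISCLOSURE2>
"""

root = ET.fromstring(xml_doc)
wanted = {"organizationName", "city", "state"}
pairs = [[child.tag, child.text] for child in root if child.tag in wanted]
properties = dict(pairs)
print(properties)
# {'organizationName': 'Example Firm LLP', 'city': 'Austin', 'state': 'TX'}
```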

    Now let’s run the query from the Python driver; I used py2neo in my project:

    query = '''
        CALL apoc.load.xml({file})
        YIELD value
        WITH [attr in value._children
        WHERE attr._type in ['organizationName', 'firstName', 'lastName', 'address1', 'city', 'state', 'zip', 'country', 'houseID'] | [attr._type, attr._text]] as pairs
        CALL apoc.map.fromPairs(pairs)
        YIELD value as properties
        RETURN properties
        '''
    properties = g.run(query, file='file:///2013_1stQuarter_XML/300529228.xml').evaluate()
    print(properties)
    print('type of properties:', type(properties))
    

    Result:

    {'city': 'Austin', 'organizationName': 'Tuggey Fernandez LLP', 'country': 'USA', 'firstName': None, 'houseID': '416750001', 'state': 'TX', 'address1': '611 South Congress Avenue, Suite 340', 'zip': '78704', 'lastName': None, 'address2': None}
    type of properties: <class 'dict'>
    

    Running the Cypher query returns a cursor object. In this case, I know there is only one value, properties, being returned, so I can call the evaluate() method, which returns the value from the cursor. As we can see, evaluate() turns the result into a dictionary, which is very easy to work with in Python.

    Knowing how to extract information using APOC and understanding the return value, I next define a Python function that cleans the data and returns a dictionary of properties of :LobbyFirm. Cypher supports some powerful string processing functions which can also be used to clean the data.

    One more thing to notice here: I only extract properties when the data is valid, since NULL-valued properties should not be stored in Neo4j.
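That NULL-dropping rule boils down to a dictionary comprehension, sketched here on an invented record:

```python
# Drop None-valued entries so no NULL properties reach Neo4j
# (illustrative record, not real disclosure data).
raw = {"city": "Austin", "state": "TX", "firstName": None, "lastName": None}
clean = {key: value for key, value in raw.items() if value is not None}
print(clean)  # {'city': 'Austin', 'state': 'TX'}
```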

    def get_LobbyFirm_property(file):
        '''
        :param file: the xml file path to be parsed
        :return: a dict of properties of LobbyFirm
        '''
        query = '''
            CALL apoc.load.xml({file})
            YIELD value
            WITH [attr in value._children
            WHERE attr._type in ['organizationName', 'firstName', 'lastName', 'address1',
            'address2', 'city', 'state', 'zip', 'country',
            'houseID'] | [attr._type, attr._text]] as pairs
            CALL apoc.map.fromPairs(pairs)
            YIELD value as properties
            RETURN properties
            '''
        pre_property = g.run(query, file=file).evaluate()
        props = {}
        # name: prefer the organisation name, else join first and last names
        if pre_property['organizationName'] is None and pre_property['firstName'] is not None and pre_property['lastName'] is not None:
            props['name'] = pre_property['firstName'] + ' ' + pre_property['lastName']
        elif pre_property['organizationName'] is not None:
            props['name'] = pre_property['organizationName']
        # address: join address1 and address2 when both are present
        if pre_property['address1'] is not None and pre_property['address2'] is not None:
            props['address'] = pre_property['address1'] + ' ' + pre_property['address2']
        elif pre_property['address1'] is not None and pre_property['address2'] is None:
            props['address'] = pre_property['address1']
        # city
        if pre_property['city'] is not None:
            props['city'] = pre_property['city']
        # state
        if pre_property['state'] is not None:
            props['state'] = pre_property['state']
        # country defaults to USA when missing
        if pre_property['country'] is None:
            props['country'] = 'USA'
        else:
            props['country'] = pre_property['country']
        # zip
        if pre_property['zip'] is not None:
            props['zip'] = pre_property['zip']
        # houseOrgId: first five digits of houseID
        if pre_property['houseID'] is not None:
            props['houseOrgId'] = pre_property['houseID'][:5]
        return props
    

    B. Use MERGE and CREATE Statements to Load Data into Neo4j


    def create_LobbyFirm_node(properties):
        '''
        :param properties: a dict of properties of the node
        :return: node internal id
        '''
        query = '''
            MERGE (lbf:LobbyFirm {houseOrgId: {houseOrgId}})
            ON CREATE SET lbf = {properties}
            RETURN id(lbf)
            '''
        index = '''
            CREATE INDEX ON :LobbyFirm(houseOrgId)
            '''
        id = g.run(query, houseOrgId=properties['houseOrgId'], properties=properties).evaluate()
        g.run(index)
        return id
    

    I decided to create the :LobbyFirm node by merging on houseOrgId, which is a unique five-digit number for each lobbying firm.

    The MERGE statement prevents duplicates in the graph. It’s good practice to merge on only one property of a node: when merging on more than one property, an existing node is reused only if ALL the values match; otherwise, a duplicate will be created.

    For example, MERGE (lbf:LobbyFirm {houseOrgId: "12345", firmName: "ABCD"}) is like saying “Find me a node labeled :LobbyFirm whose houseOrgId is 12345 AND whose firmName is ABCD. If no such node is found, create a new node with houseOrgId 12345 and firmName ABCD”.

    In this case, more than one node with the same houseOrgId may end up being created. Here is a great blog post that cleared up my confusion about when to use MERGE vs. CREATE.
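The pitfall can be illustrated with an in-memory analogy (plain Python, not Cypher): a node is reused only when all given properties match, otherwise a duplicate appears.

```python
# Toy analogy of Cypher's MERGE semantics on a list of dicts.
def merge_node(nodes, props):
    for node in nodes:
        if all(node.get(k) == v for k, v in props.items()):
            return node          # full match: reuse the node
    nodes.append(dict(props))    # partial or no match: create a new one
    return nodes[-1]

graph = []
merge_node(graph, {"houseOrgId": "12345", "firmName": "ABCD"})
merge_node(graph, {"houseOrgId": "12345", "firmName": "WXYZ"})  # duplicate!
print(len(graph))  # 2 nodes now share houseOrgId 12345

graph2 = []
merge_node(graph2, {"houseOrgId": "12345"})
merge_node(graph2, {"houseOrgId": "12345"})
print(len(graph2))  # 1: merging on a single key prevents the duplicate
```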

    C. Create Relations Using Internal Node ID


    I have 72,002 disclosure files to be processed. As my Python code loops through each disclosure file, it needs to create relations among these nodes. A relationship is generated only when the two nodes are created within the same iteration. The graph created at each iteration looks like this:

    A graph data model for lobbying disclosure data


    Notice that in the previous code, where I created the :LobbyFirm node, I also returned its ID. This internal ID allows me to identify the new nodes created in that iteration, and thus I am able to generate relations for these nodes.

     
    lf_dc_rel = g.run(
       '''MATCH (dc:Disclosure) WHERE id(dc) = {dc_id}
       MATCH (lf:LobbyFirm) WHERE id(lf) = {lf_id}
       CREATE (lf)-[r:FILED]->(dc)
       ''', dc_id = dc_id, lf_id = lf_id
    )
    

    Here dc_id and lf_id are passed as parameters, representing the internal IDs of the :Disclosure and :LobbyFirm nodes respectively.

    There are some limitations when using internal node IDs to identify nodes. You need to be careful, especially when deleting an existing node: the ID of a deleted node may be reused when a new node is created.

    In this case, you can use a plugin called UUID which “assigns UUIDs to newly created nodes and relationships in the graph and makes sure nobody can (accidentally or intentionally) change or delete them.”
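If you only need stable keys on the client side, Python’s stdlib uuid module captures the same idea in one line (a sketch of the concept, not the plugin itself):

```python
import uuid

# Give each node a random identifier that survives deletion, instead of
# relying on Neo4j's reusable internal ids.
node_key = str(uuid.uuid4())
print(node_key)       # a random 36-character id, different each run
print(len(node_key))  # 36
```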

    3. Visualize the Healthcare Graph in Neo4j


    Each year, corporations spend billions of dollars to gain access to government decision-makers, and healthcare organizations are no exception. One of the purposes of my project is to connect these organizations with the legislators by modeling the lobbying system.

    Now that I have all of the lobbying data loaded into Neo4j, I would love to find out how the healthcare industry (or any other group) is influenced by the lobbying system.

    First, let’s figure out the general lobbying issues in 2013:

    MATCH (n:Issue) RETURN DISTINCT n.issueAreaCode ORDER BY n.issueAreaCode
    

    The query returns 79 unique issue area codes in the disclosures. You can refer to the general lobbying issue code to find out what these issues are. Here are the top 10 general lobbying issues in 2013:

    MATCH (n:Issue) RETURN n.issueAreaCode, count(n) AS num ORDER BY num DESC LIMIT 10
    

    The top ten issues in healthcare lobbying


    HCR (Health Issues) and MMM (Medicare/Medicaid) are the two areas that I am most interested in, and we can see there were 9988 HCR issues and 5016 MMM issues being lobbied in 2013.

    Just for personal curiosity, I also wanted to know how many issues being lobbied are related to gun control in 2013, and here is a screenshot for my discovery:

    Results of a Cypher query on gun control and healthcare


    Second, find me the lobbying firms and lobbyists who lobby for Medicare and Medicaid issues:

    MATCH (lf:LobbyFirm)<-[:WORKS_AT]-(lob: Lobbyist)-[:LOBBIES]->(iss: Issue {issueAreaCode:'MMM'})
    RETURN lf.houseOrgId as Firm_ID, lob.firstName as First_Name, lob.lastName as Last_Name, iss.issueAreaCode as Issue, iss.description as Description LIMIT 8
    

    Healthcare lobbying issues related to Medicare and Medicaid


    Next, tell me who are the clients that signed disclosures with lobby firms for Medicare and Medicaid issues?

    MATCH (cl:Client)-[:SIGNED]->(dc:Disclosure)-[:HAS]->(iss:Issue{issueAreaCode: "MMM"})
    WITH cl, dc, iss
    MATCH (lf:LobbyFirm)-[:FILED]->(dc), (lob:Lobbyist)-[:LOBBIES]->(iss)
    RETURN distinct(cl.clientName) as Client, lf.houseOrgId as Firm_ID, lob.firstName as First_Name, lob.lastName as Last_Name LIMIT 25
    

    Healthcare lobbying disclosures filed for Medicare and Medicaid


    To visualize the result in a graph:

    A graph visualization of healthcare lobbying disclosures filed for Medicare and Medicaid


    We can see there are five clients who signed a disclosure with lobby firm No. 31603 for Medicare-related issues. All of the relevant issues are lobbied by Marshall.

    Now, let’s find out – for the lobbyists and lobby firms involved in lobbying Medicare and Medicaid issues – how much they contributed to government leaders and who received these contributions.

    MATCH (lf:LobbyFirm)<-[:WORKS_AT]-(lob: Lobbyist)-[:LOBBIES]->(iss: Issue {issueAreaCode:'MMM'})
    WITH lob, lf
    MATCH (lob)-[:FILED]->(cb:Contribution)-[:MADE_TO]->(com:Committee)-[:FUNDS]->(leg:Legislator)
    OPTIONAL MATCH (lf)-[:FILED]->(cb)-[:MADE_TO]->(com)-[:FUNDS]->(leg)
    RETURN lf.city as City, lf.houseOrgId as Firm_ID, lf.name as Firm_Name, 
    lob.firstName as FirstName, lob.lastName as LastName, cb.amount as Amount, cb.date as Date, leg.name as Legislator LIMIT 50
    

    Lobbyist contributions related to Medicare and Medicaid: who contributed, and how much


    What does the result look like in our healthcare graph?

    A graph visualization of healthcare lobbying contributions related to Medicare and Medicaid


    It is much easier to read the results as a graph in Neo4j!

    Finally, how are healthcare organizations connected to legislators?

    MATCH (cl:Client{clientName:'Pharmaceutical Research and Manufacturers of America (PhRMA)'})-[:SIGNED]->(dc:Disclosure)-[:HAS]->(iss:Issue{issueAreaCode:'MMM'})<-[:LOBBIES]-(lob:Lobbyist)-[:WORKS_AT]->(lf:LobbyFirm)
    WITH cl,dc,iss,lob,lf
    MATCH (lob)-[:FILED]->(cb:Contribution)-[:MADE_TO]->(com:Committee)-[:FUNDS]->(leg:Legislator)
    OPTIONAL MATCH (lf)-[:FILED]->(cb)-[:MADE_TO]->(com)-[:FUNDS]->(leg)
    RETURN cl,dc,iss,lob,lf,cb,com,leg LIMIT 300
    

    A graph of connections between healthcare lobbyists


    This looks amazingly interesting. Let’s take a closer look at the graph:

    A closer look at the graph of connections between healthcare lobbyists


    From the graph I can tell that in 2013, the lobbyist Drew Goesl lobbied a Medicare issue for Pharmaceutical Research and Manufacturers of America (PhRMA) which specifically focuses on “Legislative issues related to access to pharmaceuticals, including Medicare Part D, and Children’s Health Insurance Program (CHIP), rebates in Medicaid and for dual-eligibles; comparative effectiveness; 340B Drug Program; Medicare Part B prescription drug reimbursement, and related provisions.”

    During the same year, the lobbyist Drew Goesl made contributions to several committees that fund legislators including James Lee Witt, Patrick Murphy, William Lewis Owens, Mike McIntyre, Mark Pryor, Cory Booker, John Larson, Linda Forrester, Edward Perlmutter, James Matheson, Joseph Crowley, Harry Reid, Suzan DelBene, Scott Peters and Edward J. Markey.

    Due to the data limitation, I cannot draw a conclusion that PhRMA and the legislators mentioned above have direct connections. However, the healthcare graph is helpful for the public to trace and integrate information just like this.

    You may also have noticed there is a bug in my model: I have tons of duplicate nodes for the same legislator. This is because the data is not consistent. Real-world data is not as friendly and tidy as it might be in an academic scenario.

    Conclusion


    In the next few blog posts, I will demonstrate how to process strings and how to match nodes when you have messy and limited data sources. Next week, I will start working on provider prescription data and show you some tricks I used to load the large CSV files I downloaded from the FDA and CMS websites. I hope you enjoyed the second post in this series – stay tuned for more excitement to come!


    Ready to dig in and get started with graph databases? Click below to download this free ebook, Learning Neo4j, and get up to speed with the world’s leading graph database.



    Catch up with the rest of the Neo4j in healthcare series:

    My Neo4j Summer Road Trip to the World of Healthcare [Part 3]


    Part 3: Cleaning CSV Files in Bash


    Hi friends and welcome back to my summer road trip through the world of healthcare. For those who are new to my adventure, this is the third part of the blog series. Catch up right here with Part 1 and Part 2.

    I am using Neo4j to connect the multiple stakeholders of healthcare and hope to gain some interesting insights into the healthcare industry by the end of my exploration. This blog series demonstrates the entire process from data modeling and ETL to exploratory analysis and more. In the previous two posts, I discussed data modeling and how to integrate XML data into Neo4j using APOC; you can find every detail about the project on GitHub.

    This week, I will be working with CSV files. If you are using Neo4j for the first time (like me), I can tell you honestly that loading CSV files into Neo4j is a lot easier than loading XML files. But don’t get too optimistic unless your data is perfectly clean. Now, let me show you the steps I used to successfully load the CSV files.

    1. Get the Data


    This week, our data covers information on drugs, drug manufacturers, providers and prescriptions. You can download the same data from these sources: As the healthcare provider data gave me the most problems, I will use it as the demonstration in this blog post.

    2. Display the Data


    A. What does the data look like?

    head npidata_20050523-20160612.csv
    

    Wow, the data looks a little bit crazy, and because of that, I will not overwhelm you by copying the result here. However, I learned three characteristics about the data by displaying the first 10 rows:
      • The data has a header, and the header contains white space
      • The data has many columns (we will find out how many soon)
      • The data has a lot of empty values
    B. How many rows are in the data?

    wc -l npidata_20050523-20160612.csv
    

    Results:
     4923673 npidata_20050523-20160612.csv
    

    Each row of the data represents a registered provider in the United States from 2005 to 2016.

    C. How many columns are in the data?

    head -n 1 npidata_20050523-20160612.csv|awk -F',' '{print NF}'
    

    Results:
    329
    

    Now you see why I said the data is a little bit crazy. But don’t panic – most of these columns contain no values, and we only need to extract a few of them to load into my healthcare graph.

    D. Remove the header from the data.

    sed 1d npidata_20050523-20160612.csv > provider.csv
    

    This will delete the first line and save the content to a new file named provider.csv. The original file will not be changed.

    Removing the header from your file is optional, because Cypher can load a CSV file with a header and refer to columns by their header names. Here is a great walkthrough tutorial on loading CSV files into Neo4j.
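The same header-keyed access can be sketched in plain Python with csv.DictReader; the two-row sample and its column names below are invented stand-ins, since the real file has 329 columns:

```python
import csv
import io

# Header-based access, analogous to LOAD CSV WITH HEADERS in Cypher.
sample = io.StringIO(
    "NPI,Entity Type Code,Provider City\n"
    "1234567890,1,AUSTIN\n"
)
rows = list(csv.DictReader(sample))
print(rows[0]["NPI"], rows[0]["Provider City"])  # 1234567890 AUSTIN
```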

    3. Load CSV into Neo4j


    A. Display the CSV in Neo4j

    LOAD CSV FROM 'file:///provider.csv' AS col
    RETURN col[0] as npi, col[1] as entityType, col[20]+col[21] as address, col[22] as city, col[23] as state, col[24] as zip, col[25] as country, col[5] as lastName, col[6] as firstName, col[10] as credential, col[41] as gender, col[4] as orgName
    limit 10
    

    *npi: National Provider Identifier
    Healthcare CSV data in Neo4j


    In the figure above, I only displayed the columns that I will load into the healthcare graph.

    B. Load CSV into Neo4j

    Part 3 of using Neo4j to graph the healthcare industry. This week: Cleaning up CSV data of providers


    I want to create :Provider nodes with these properties: npi, entityType, address, city, state, zip and country.

    When entityType is 1, I add the properties lastName, firstName, credential and gender to the node. When entityType is 2, I add the property orgName.

    Here is the Cypher query that executes the above data model and rules:

    LOAD CSV FROM 'file:///provider.csv' AS col
    CREATE (pd:Provider {npi: col[0], entityType: col[1], address: col[20]+col[21], city: col[22], state: col[23], zip: col[24], country: col[25]})
    FOREACH (row in CASE WHEN col[1]='1' THEN [1] else [] END | SET pd.firstName=col[6], pd.lastName = col[5], pd.credential= col[10], pd.gender = col[41])
    FOREACH (row in CASE WHEN col[1]='2' THEN [1] else [] END | SET pd.orgName=col[4])
    

    The FOREACH clause applies an update to each element of a collection. Here I use CASE WHEN to turn each row into either a one-element collection or an empty one: for rows with col[1] = '1', the first FOREACH sets the firstName, lastName, credential and gender properties; for rows with col[1] = '2', the second FOREACH sets orgName.
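The same branching can be expressed in plain Python over a couple of invented rows, which may make the CASE WHEN trick easier to follow:

```python
# Entity type 1 is an individual, entity type 2 is an organization
# (invented sample rows, not real provider data).
rows = [
    {"entityType": "1", "firstName": "JANE", "lastName": "DOE", "orgName": ""},
    {"entityType": "2", "firstName": "", "lastName": "", "orgName": "ACME CLINIC"},
]

providers = []
for col in rows:
    pd = {"entityType": col["entityType"]}
    if col["entityType"] == "1":
        pd["firstName"], pd["lastName"] = col["firstName"], col["lastName"]
    elif col["entityType"] == "2":
        pd["orgName"] = col["orgName"]
    providers.append(pd)

print(providers[0])  # {'entityType': '1', 'firstName': 'JANE', 'lastName': 'DOE'}
print(providers[1])  # {'entityType': '2', 'orgName': 'ACME CLINIC'}
```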

    C. Fix the fields containing delimiters

    Running the Cypher query above returns an error:

    At /Users/yaqi/Documents/Neo4j/test_0802/import/provider.csv:113696 -  there's a field starting with a quote 
    and whereas it ends that quote there seems to be characters in that field after that ending quote. 
    That isn't supported. This is what I read: 'PRESIDENT","9'
    

    Let’s take a look at the problematic line from the terminal:

    sed -n "113697 p" provider.csv
    

    Results:
    "1790778355","2","","","BERNARD J DENNISON JR DDS PA","","","","","","","","","","","","","","","",
    "908 N SANDHILLS BLVD","","ABERDEEN","NC","283152547","US","9109442383","9109449334","908 N SANDHILLS BLVD","
    ","ABERDEEN","NC","283152547","US","9109442383","9109449334","08/29/2005","07/08/2007",
    "","","","","DENNISON","BERNARD","J","PRESIDENT\","9109442383","1223G0001X","4629","NC","Y","",
    "","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","
    ","","","","","","","","","","","DR.","JR.","DDS","193400000X SINGLE SPECIALTY  GROUP","","","","","","","","","","","","","",""
    

    The problem is the \ character at the end of the field PRESIDENT\: it escapes the closing double quote, so when loading the file Cypher cannot tell where the field ends and gets confused about how to map the fields.

    Now let’s see if other rows contain the same problem:

    grep '\\' provider.csv | wc -l
    

    The command searches for the \ character in the file and counts the lines which contain the pattern that we are looking for. The result is 70.
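For comparison, the same count and fix can be done in pure Python, shown here on a two-line stand-in for provider.csv (invented rows):

```python
# Count lines containing a backslash, then replace \ with / --
# the Python equivalents of the grep and tr commands used here.
lines = ['"PRESIDENT\\","9109442383"', '"TREASURER","5551234567"']

bad = sum(1 for line in lines if "\\" in line)
print(bad)  # 1

cleaned = [line.replace("\\", "/") for line in lines]
print(cleaned[0])  # "PRESIDENT/","9109442383"
```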

    There are many ways to fix this problem. Below, I replace the \ with / and load it into a new file.

    tr "\\" "/" < provider.csv > provider_clean.csv
    

    Now let’s load the CSV file again. This time I am loading it from the Python client.

    import os
    from py2neo import Graph

    def create_provider_node(file, g):
        query = '''
            USING PERIODIC COMMIT 1000
            LOAD CSV FROM {file} AS col
            CREATE (pd:Provider {npi: col[0], entityType: col[1], address: col[20]+col[21], city: col[22], state: col[23], zip: col[24], country: col[25]})
            FOREACH (row in CASE WHEN col[1]='1' THEN [1] else [] END | SET pd.firstName=col[6], pd.lastName = col[5], pd.credential= col[10], pd.gender = col[41])
            FOREACH (row in CASE WHEN col[1]='2' THEN [1] else [] END | SET pd.orgName=col[4])
            '''
        index1 = '''
            CREATE INDEX ON :Provider(npi)
            '''
        g.run(index1)
        return g.run(query, file=file)

    pw = os.environ.get('NEO4J_PASS')
    g = Graph("http://localhost:7474/", password=pw)
    file = 'file:///provider_clean.csv'
    # note: a USING PERIODIC COMMIT query must run in its own auto-commit
    # transaction, so no explicit transaction is opened here
    create_provider_node(file, g)
    

    By using PERIODIC COMMIT, you can set the number of rows committed per transaction. This helps prevent using a large amount of memory when loading large CSV files.
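PERIODIC COMMIT batches rows on the server; the same chunking idea on the client side can be sketched as a small generator (batch size 1000, to mirror the query above):

```python
# Yield rows in fixed-size batches, committing one batch at a time.
def batches(rows, size=1000):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # the final, possibly smaller batch

chunks = list(batches(range(2500), size=1000))
print([len(chunk) for chunk in chunks])  # [1000, 1000, 500]
```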

    4. Conclusion


    Now, I have successfully loaded the healthcare provider data into Neo4j. The process of loading the drug, drug manufacturer and prescription data is very similar. I also created the WRITES relationship between the :Provider and :Prescription nodes based on the NPI information contained in both files. By now, all the data is stored in the graph database.

    Let’s take a look at the healthcare graph data model again:

    A graph data model of the healthcare industry


    I hope you find this blog post helpful. Sometimes cleaning large CSV files can be tricky, but using the command line to manipulate the files can make the work go faster. In the next blog post, I will show you how to link data when you have limited resources. Specifically, I will demonstrate how I created the relationships (:Prescription)-[:PRESCRIBE]->(:Drug) and (:DrugFirm)-[:BRANDS]->(:Drug). Stay tuned, and I’ll see you soon!


    Ready to dig in and get started with graph technology? Click below to download this free ebook, Learning Neo4j, and get up to speed with the world’s leading graph database.



    Catch up with the rest of the Neo4j in healthcare series:

    My Neo4j Summer Road Trip to the World of Healthcare [Part 4]


    Part 4: Create Relationships with FuzzyWuzzy


    Welcome back to my adventure to the world of healthcare! In the past three blog posts, I have discussed the data model of the healthcare graph, loading XML data into Neo4j and cleaning CSV data in the command-line interface. In this post, I will demonstrate how to link data from multiple data sources, especially when there is a lack of foreign IDs to identify records.

    The healthcare graph consists of four groups of data, with each group of nodes generated from different data sources. Let’s start off by looking at the four groups of nodes and how I created relationships within the group.

    1. Lobby Disclosure Nodes Group


    This group is extracted from public lobbying disclosures, which includes these nodes from our healthcare data model: (:Disclosure), (:Client), (:Issue), (:Lobbyist), (:LobbyFirm), (:Contribution), (:Contributor), (:Committee) and (:Legislator). While extracting the nodes from the original XML files, node IDs are generated internally by Neo4j. As a result, I could use the internal node IDs to create relationships connecting these nodes. I have documented this process in my second blog post.

    2. Legislator Nodes Group


    This group includes nodes (:LegislatorInfo), (:State), (:Body) and (:Party), which are extracted from a single CSV file that can be downloaded here. Relationships are created through a single Cypher statement during the ETL process. All the ETL code can be found at GitHub here.

    3. Provider Prescription Nodes Group


    Nodes (:Prescription) and (:Provider) are generated from CMS data sources. I stored National Provider Identifiers as a property {NPI} for both nodes and used it to connect (:Provider) with (:Prescription).

    4. Drug Nodes Group


    Nodes (:GenericDrug) and (:Drug) are extracted from FDA data sources. Both nodes, along with (:Prescription), have the property {GenericName}. (:GenericDrug) is created as an intermediate node to represent each unique {GenericName} value in (:Drug).

    RxNorm, from the U.S. National Library of Medicine, provides a RESTful API that allows me to link clinical drug vocabularies to normalized names such as Rxcui, a unique drug identifier. I used the batch mode to send the {GenericName} of (:Prescription) and (:GenericDrug), retrieve the Rxcui, and connect these two nodes on the drug ID Rxcui.
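    To make the lookup concrete, here is a minimal sketch of building a single-name query against the RxNorm REST API. The base URL and endpoint path follow the public RxNav documentation rather than the author's original code, and the actual HTTP request is left commented out because it needs network access:

```python
from urllib.parse import urlencode

# Hypothetical sketch: form the lookup URL for one generic name.
# The endpoint path is an assumption from the public RxNav REST docs.
BASE = "https://rxnav.nlm.nih.gov/REST/rxcui.json"

def rxcui_lookup_url(generic_name):
    """Build the request URL for a single generic-name lookup."""
    return BASE + "?" + urlencode({"name": generic_name})

url = rxcui_lookup_url("morphine sulfate")
# A real call would look something like:
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url))
print(url)
```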

    Part 4 of using Neo4j to graph the healthcare industry: String matching for data relationships


    Nodes can be connected easily if they are extracted from a single file or from the same source, and standardized unique IDs make joining nodes even more convenient. However, pulling data from a variety of sources with only limited access to shared identifiers is a challenge that data journalists and data engineers often need to tackle.

    The issue I am trying to demonstrate in this post is how to create relationships to connect different groups of data. Specifically, how to link (:Client) and (:DrugFirm), (:DrugFirm) and (:Drug), as well as (:Legislator) and (:LegislatorInfo).

    Graph databases emphasize the representation of relationships among data points. Without the relationships, I wouldn't be able to create the whole path of the healthcare graph and would lose the ability to trace information along these paths. The idea behind connecting nodes such as (:Drug) and (:DrugFirm) is that if a drug is branded by a drug firm, a relationship should exist to connect the two nodes.

    Now let’s take a look at the properties of these two nodes; we may find some useful information.

    MATCH (d:Drug), (df:DrugFirm) RETURN d as Drug, df as DrugFirm LIMIT 25
    

    Data relationships between drugs and drug firms


    (:Drug) has a property {labelerName} and (:DrugFirm) has {firmName}. The logic for connecting a drug firm with a drug is that the {firmName} can be recognized as identical or similar to the {labelerName}. This may sound like an easy task for a human, but automating the process can be a little tricky.

    Luckily, I had heard of FuzzyWuzzy, a practical Python package that does string matching and returns a matching score. I decided to give it a try to match {firmName} with {labelerName}.

    1. Array Structure


    The first step before comparing two arrays of strings is to structure the array to make it easy to work with:

    #======= RETURN Drug objects: list of dicts, keys: labelerName, id ======#
    q1 = '''
    MATCH (d:Drug)
    RETURN id(d), d.labelerName
    '''
    drug_obj = g.run(q1)
    drugs_lst = []
    for record in drug_obj:
        drug_dic = {}
        drug_dic['id'] = record['id(d)']
        drug_dic['labelerName'] = record['d.labelerName']
        drugs_lst.append(drug_dic)

    #======= RETURN DrugFirm objects: list of dicts, keys: firmName, id ======#
    q2 = '''
    MATCH (df:DrugFirm)
    RETURN id(df), df.firmName'''
    df_obj = g.run(q2)
    df_lst = []
    for record in df_obj:
        df_dic = {}
        df_dic['id'] = record['id(df)']
        df_dic['firmName'] = record['df.firmName']
        df_lst.append(df_dic)
    

    Here I returned an array of dictionaries for both (:Drug) and (:DrugFirm). Each dictionary represents an object with two keys: the node internal ID and {labelerName} / {firmName}. The node internal ID is used later to fetch the nodes we want, which we will talk about shortly. The {labelerName} and {firmName} are the strings that we will compare. Now let’s print out the data structure:

    DrugFirm:
    [{'id': 23049075, 'firmName': 'Teva Branded Pharmaceutical Products R&D, Inc.'},
    {'id': 23049076, 'firmName': "George's Family Farms, LLC"},
    {'id': 23049077, 'firmName': 'Baxter Healthcare S.A.'},
    {'id': 23049078, 'firmName': 'Tokuyama Corporation'},
    {'id': 23049079, 'firmName': 'Alps Pharmaceutical Ind. Co., Ltd.'}]
    

    Drug:
    [{'id': 22941414, 'labelerName': 'Eli Lilly and Company'},
    {'id': 22941415, 'labelerName': 'Eli Lilly and Company'},
    {'id': 22941416, 'labelerName': 'Eli Lilly and Company'},
    {'id': 22941417, 'labelerName': 'Eli Lilly and Company'},
    {'id': 22941418, 'labelerName': 'Eli Lilly and Company'}]
    

    2. String Preprocessing


    As we can see, some company names are in uppercase and some are in lowercase, and some of the names contain non-alphanumeric characters. There are also many duplicates in the array, which will slow down the string matching process.

    To improve the matching rate, I passed the arrays to a series of string processing functions to clean up the strings.

    #lower case: convert all to lower case
    lc_ln = lower_case(drugs_lst, 'labelerName')
    lc_fn = lower_case(df_lst, 'firmName')
    
    #remove_marks: remove non-alphanumeric characters
    rm_ln = rm_mark(lc_ln, 'labelerName')
    rm_fn = rm_mark(lc_fn, 'firmName')
    
    #Chop_end: remove ‘s’ at the end of a string
    ce_ln = chop_end(rm_ln, 'labelerName', 's')
    ce_fn = chop_end(rm_fn, 'firmName', 's')
    
    #sort_strings: sort words in a string
    sort_ln = sort_strings(ce_ln,'labelerName')
    sort_fn = sort_strings(ce_fn, 'firmName')
    
    #uniq strings: de-duplicate: collect node IDs into a list for each unique string
    uq_ln = uniq_elem(sort_ln, 'labelerName')
    uq_fn = uniq_elem(sort_fn, 'firmName')
    

    All the functions in the script can be found here. After processing the strings, I de-duplicated the objects by {labelerName} and {firmName}; the number of objects decreased from 106,683 to 5,928 for {labelerName} and from 10,205 to 7,040 for {firmName}.

    Now let’s print out the uq_ln and uq_fn to understand the structure before we move on.

    uq_ln:
    defaultdict(<class 'list'>,
    {'akorn llc stride': [22965573, 22965574, 22965575, 22965576],
    'brand inc silver star': [23040080, 23040084, 23040091, 23040097, 23040151, 23040169, 23040171, 23040172, 23040174],
    'co ltd osung': [23007551, 23007552, 23007553],
    'beauticontrol': [23013024],
    'biological glaxosmithkline sa': [23013557, 23013558, 23013559, 23013560, 23013561, 23013562, 23013563, 23013564, 23013565, 23013566, 23013567, 23013568, 23013569, 23013570, 23013571],…
    

    uq_fn:
    defaultdict(<class 'list'>,
    {'biological glaxosmithkline sa': [23053065],
    'bioservice capua spa': [23052723],
    'american homepatient': [23057681, 23057683, 23057686, 23057687, 23057688, 23057689, 23057690, 23057691, 23057692, 23057693, 23057694, 23057695, 23057696, 23057697, 23057698],
    'healthcare limited novarti private': [23056487],…
    

    I now have two defaultdicts to work with. The keys are the cleaned strings to be compared; the values are lists of node IDs for the nodes that contain the same string (i.e., the key). Now I can call FuzzyWuzzy to get string matching scores.
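    To make the two scores concrete before using them, here is a rough stdlib approximation of what `fuzz.ratio` and `fuzz.partial_ratio` measure. FuzzyWuzzy's real implementation is Levenshtein-based and differs in details, so treat this purely as an illustration:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Whole-string similarity, scaled to 0-100 (like fuzz.ratio, roughly).
    return round(100 * SequenceMatcher(None, a, b).ratio())

def partial_ratio(a, b):
    # Best score of the shorter string against same-length windows of the
    # longer one (the idea behind fuzz.partial_ratio, roughly).
    shorter, longer = sorted((a, b), key=len)
    best = 0
    for i in range(len(longer) - len(shorter) + 1):
        best = max(best, ratio(shorter, longer[i:i + len(shorter)]))
    return best

# 'llc pharmac' is an exact substring, so the partial score is 100
# even though the whole-string score is much lower.
print(partial_ratio('gavi llc pharmaceutical', 'llc pharmac'))  # -> 100
print(ratio('gavi llc pharmaceutical', 'llc pharmac'))
```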

    3. FuzzyWuzzy String Matching


    for k1 in uq_ln:
        labeler_name = k1
        nodeId_drug = uq_ln[k1]

        for k2 in uq_fn:
            company_name = k2
            nodeId_df = uq_fn[k2]
            r1 = fuzz.partial_ratio(labeler_name, company_name)
            r2 = fuzz.ratio(labeler_name, company_name)
            if r1 > 85:
                print('r1:', r1, 'r2:', r2, 'ln:', k1, 'fn:', k2)
            if r2 > 85:
                print('r1:', r1, 'r2:', r2, 'ln:', k1, 'fn:', k2)
    

    I started off by choosing 85 as the cutoff for both partial_ratio and ratio, just to see the results of the string matching. I generalized the results into three cases:

    Case 1: Both partial_ratio(r1) and ratio(r2) are equal to 100

    The two strings are identical to each other, for example:

    r1: 100 r2: 100 ln: abilene inc oxygen fn: abilene inc oxygen
    r1: 100 r2: 100 ln: abilene inc oxygen fn: abilene inc oxygen
    

    Case 2: Only r1 is 100

    One of the strings is a substring of the other, but we need to exclude false positives. Here is an example:

    r1: 100 r2: 65 ln: gavi llc pharmaceutical fn: llc pharmac
    

    In this example, fn is a substring of ln and r1 is equal to 100, supporting that observation. However, r2 is relatively low. Judging from the two strings, I cannot infer that they represent the same company. Thus we need to make further modifications, either to the cutoff scores or to the strings themselves, to exclude false positives.

    Case 3: Both r1 and r2 are >85

    The two strings are similar, but we need to exclude false positives:

    r1: 98 r2: 96 ln: barr inc laboratorie fn: arg inc laboratorie
    r1: 89 r2: 94 ln: company perrigo fn: company l perrigo
    

    Both r1 and r2 are greater than 85 in the two examples above. However, I identify the first line as a false positive, which needs to be excluded, whereas in the second line the two strings may represent the same company even though both scores are lower than the first line’s. Again, we need further modifications to improve the accuracy of the string matching.

    4. Modification on String Matching


    Let’s take a look at the strings in case 2 and case 3 again:

    “gavi llc pharmaceutical” vs “llc pharmac” -> false positive
    “barr inc laboratorie” vs “arg inc laboratorie” -> false positive
    “company perrigo” vs “company l perrigo” -> may represent the same company


    I want to make some changes to the strings so that my program returns only the third line and ignores the first two. I realized most company names are composed of two components: a unique component (in bold) that differentiates the company from other companies, and a common component (in regular text) that indicates the type of organization.

    If I remove the common component from the string, whatever is left should be the unique component which is much more precise and easier for a computer to decide whether the two strings are similar or not.

    For example, if we remove all the common components from the three lines, the strings will look like this:

    “gavi” vs “ ” -> false positive
    “barr” vs “arg” -> false positive
    “perrigo” vs “l perrigo” -> may represent the same company


    Now the computer can pick up the third line (r1 improved from 89 to 100), where “perrigo” is a substring of “l perrigo”, and it can ignore the first two false positive cases.

    I also noticed that in most of the cases where r1 = 100 and r1 - r2 > 30, it’s hard to say whether the two strings represent the same company, as in these examples:

    r1: 100 r2: 24 ln: brand inc tween fn: br
    r1: 100 r2: 43 ln: bio co general ltd fn: bio c
    r1: 100 r2: 47 ln: dava inc pharmaceutical fn: ava inc
    r1: 100 r2: 59 ln: bio cosmetic fn: bio c
    r1: 100 r2: 27 ln: chartwell governmental llc rx specialty fn: al llc
    r1: 100 r2: 69 ln: canton laboratorie fn: canton laboratorie limited private
    r1: 100 r2: 48 ln: beach inc productsd fn: ch inc
    r1: 100 r2: 63 ln: genentech inc fn: ch inc
    r1: 100 r2: 41 ln: dental llc scott supply fn: al llc
    

    As a result, I decided to exclude these cases by requiring the difference between the two scores to stay within 30.

    5. The Final Solution


    #======= Create relation :BRANDS (fuzzy string matching) ======#
    q3 = '''
    MATCH (d:Drug) where id(d) in {drug_id} and d.tradeName is not NULL
    MATCH (df:DrugFirm) where id(df) in {drug_firm_id}
    MERGE (df)-[r:BRANDS]->(d)
    ON CREATE SET r.ratio = {r2}, r.partial_ratio = {r1}'''

    num = 0  # Number of relationships created
    for k1 in uq_ln:
        labeler_name = k1
        nodeId_drug = uq_ln[k1]

        for k2 in uq_fn:
            company_name = k2
            nodeId_df = uq_fn[k2]
            r1 = fuzz.partial_ratio(labeler_name, company_name)
            r2 = fuzz.ratio(labeler_name, company_name)

            if r1 == 100 and (r1 - r2) <= 30:
                g.run(q3, drug_id=nodeId_drug, drug_firm_id=nodeId_df, r1=r1, r2=r2)
                num += 1
                print("CREATE relation :BRANDS number:", num)

            # Possible misspelling or missing word: re-score after filtering common words
            elif (100 > r1 >= 95 and r2 >= 85) or (95 > r1 >= 85 and r2 >= 90):
                md_r1 = fuzz.partial_ratio(string_filter(labeler_name, nostring),
                                           string_filter(company_name, nostring))
                md_r2 = fuzz.ratio(string_filter(labeler_name, nostring),
                                   string_filter(company_name, nostring))

                if md_r1 >= 95 and md_r2 >= 95:
                    g.run(q3, drug_id=nodeId_drug, drug_firm_id=nodeId_df, r1=md_r1, r2=md_r2)
                    num += 1
                    print("CREATE relation :BRANDS rel number:", num)
    

    I decided to create the relationship [:BRANDS] between (:Drug) and (:DrugFirm) nodes if r1 is 100 and r1 - r2 is no more than 30. When r1 and r2 are both above 85, I filter out some common words in the strings, such as inc, co, ltd, llc, corporation, pharmaceutical, laboratory, company, product, pharma, etc., and then recalculate r1 and r2 on the modified strings.
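    The `string_filter` helper and the `nostring` word list used in the code above are not shown in the post; here is a minimal sketch of how they might look. The word list is illustrative, drawn from the examples the post mentions:

```python
# Illustrative list of common "company type" words to strip out; the
# author's actual `nostring` list may differ.
nostring = ['inc', 'co', 'ltd', 'llc', 'corporation', 'pharmaceutical',
            'laboratory', 'laboratorie', 'company', 'product', 'pharma']

def string_filter(name, stopwords):
    """Remove the common component, keeping only the unique component."""
    return ' '.join(w for w in name.split() if w not in stopwords)

print(string_filter('company perrigo', nostring))          # -> 'perrigo'
print(string_filter('company l perrigo', nostring))        # -> 'l perrigo'
print(string_filter('gavi llc pharmaceutical', nostring))  # -> 'gavi'
```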

    It’s also helpful to store the values of r1 and r2 as properties on [:BRANDS], so that when I query information between drug firms and drugs, I can also trace the confidence level of the answer. The Cypher below returns the {firmName} and {labelerName} in ascending order of r1:

    MATCH (df:DrugFirm)-[r:BRANDS]->(d:Drug)
    RETURN df.firmName, d.labelerName, r.partial_ratio as r1, r.ratio as r2 order by r1 ASC limit 10
    

    The Cypher results for connections between drug firm names and drug labeler names in healthcare


    Now we have successfully created the relationship [:BRANDS] for (:DrugFirm) and (:Drug), and the matching results seem pretty trustworthy. I also used the same method to create relationships for (:Client) and (:DrugFirm), as well as (:Legislator) and (:LegislatorInfo).

    Lastly, let’s find out which drug firms brand drugs whose generic component is morphine sulfate:

    MATCH (df:DrugFirm)-[r:BRANDS]->(d:Drug{genericName:'Morphine Sulfate'}) RETURN df,d
    

    The relationships between brands and drugs that contain morphine sulfate


    (There are 193 matched results; I randomly chose three companies to display.)

    Conclusion


    I hope you enjoyed reading this blog post. If you are facing similar issues, such as having difficulty connecting the data in your graph, I hope this post has been helpful to you.

    At the same time, I am eager to hear your thoughts or ideas on how to solve the problem. Don’t hesitate to send me a message on Twitter, LinkedIn or email if you have any questions about my project.

    With the summer passing by, it is getting close to the end of my fun summer road trip through the world of healthcare. I will be excited to show you some of the most interesting discoveries I have made in my next blog post. If you want to know more about graph technology in the healthcare industry, don’t miss my last blog post. See you soon!


    Ready to dig in and get started with graph technology? Click below to download this free ebook, Learning Neo4j, and get up to speed with the world’s leading graph database.



    Catch up with the rest of the Neo4j in healthcare series:

    From the Neo4j Community: August 2016

    Explore all of the great articles and projects created by the Neo4j community in August 2016

    The Neo4j community has been busy this summer with a number of great projects, code libraries and a host of great articles on how graphs are everywhere – including everything from the Rio Summer Olympics to examining ISIS support and opposition networks.

    We can’t wait to see what other great uses of Neo4j the community comes up with in September!

    If you would like to see your post featured in September’s “From the Community” blog post, follow us on Twitter and use the #Neo4j hashtag for your chance to get picked.

    Articles and Blog Posts


    Podcasts and Audio


    Videos


    Slides and Presentations


    Libraries, GraphGists and Code Repos


    Other Content



    Curious how relational databases compare to graph technology?
    Download this ebook, The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with your relational database.


    Get the Ebook

    Neo4j + AWS Lambda & API Gateway to Create a Recommendation Engine


    The Challenge of Findability and Why Graphs?


    Here at the SAVO Group, we are in the business of sales enablement via Software as a Service (SaaS). One of the tools that we provide our customers is the ability to prescribe content proactively for their sales people or sellers. We are moving past that concept and into ways that we can have content “find” sellers much quicker.

    Ever since listening to Emil Eifrem on the O’Reilly Data Show talk about Neo4j, I have been super intrigued by graph technology and working to create an opportunity to use it at SAVO. As a Microsoft SQL Server DBA, the idea of connected data and the Cypher query language feel very natural to me.

    At SAVO, we track all activity on the content we manage for our customers. The light bulb for me was to leverage this activity and its connected “DNA” to recommend new content to sellers based on what other sellers have used. Knowing the real-time recommendation engine use case, Neo4j felt like a natural fit for this.

    Data Model and Cypher Query


    Below is the data model.

    A graph data model for sales enablement


    We serve hundreds of customers (i.e., tenants), so tenant_id is important in a single Neo4j database. The basic pattern is:
    (user)-[:ACTION]->(action)-[:DOWNLOADED|EMAILED|SEARCH_VIEW]->(document) 
    

      • User being the person acting on content, with metadata such as id and tenant
      • Action being the metadata about the action record in our database such as id and date
      • Document being the content acted on, with metadata such as id and tenant

    The key here is the additional (:action) node, which gives us the flexibility for date filtering and the ability to leverage indexing. With this pattern, we can extend to other users who have consumed this document and return what else they have consumed.

    Cypher query for recommendations:

    MATCH (u1:user {user_id: toInt({user})}),
          (d:document{document_id:toInt({document})})<--(a2:action)
          <--(u2:user)-->(a3:action)-[r]->(d2:document)
    WHERE u1 <> u2 AND d <> d2
    AND NOT  (u1)-[*2]->(d2)
    AND a3.action_date >= a2.action_date
    RETURN d2.document_id AS document_id, 
           sum(case when type(r) = 'DOWNLOADED' then 1 else 0 end) as downloads,
           sum(case when type(r) = 'EMAILED' then 1 else 0 end) as emailed,
           sum(case when type(r) = 'SEARCH_VIEW' then 1 else 0 end) as search_views,
           count(r) as score
    ORDER BY score desc
    LIMIT 25
    

    We also filter out documents the user has already seen, since the idea is to bring unseen content to the user.

    Enter AWS Managed Services and EC2


    SAVO is currently in the process of moving new and existing applications and infrastructure to Amazon Web Services (AWS). This presented the optimal opportunity to display the awesomeness of graphs and how quickly a recommendation engine could be created with AWS, specifically using managed services like Lambda and API Gateway instead of spinning up new VMs or adding to existing applications. (I also get to progress with learning AWS and Neo4j all at once – win-win!)

    The architecture for the proof of concept looks like this:

    Learn how to create a recommendation engine using Neo4j alongside Lambda and API Gateway from AWS


    The EC2 instance is a very simple m3.medium built from the Ubuntu 14.04 AMI inside a default VPC we have set up in SAVO’s scratch account, which we use for POC work and various types of exploration. I set up the EC2 instance with a security group that limits Neo4j Bolt driver port access at 7687 to another security group that I will use later on. I also added a 30GB EBS volume and then installed Neo4j per the Ubuntu installation documentation.

    Here is how Amazon describes Lambda and API Gateway:
    AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume – there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service – all with zero administration. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.

    Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor and secure APIs at any scale. With a few clicks in the AWS Management Console, you can create an API that acts as a “front door” for applications to access data, business logic or functionality from your backend services, such as workloads running on Amazon Elastic Compute Cloud (Amazon EC2), code running on AWS Lambda or any web application. Amazon API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring and API version management. Amazon API Gateway has no minimum fees or startup costs. You pay only for the API calls you receive and the amount of data transferred out.
    Using Lambda, which supports Python 2.7, along with the Neo4j Bolt driver and API Gateway, I am able to turn my Cypher query into a fully functional microservice. Leveraging Amazon’s awesome documentation makes it quite easy to set this up.

    I set up the Lambda function to work within the same default VPC stated above and added the security group that is allowed access to the Bolt driver port in order to keep the communication between our database and the Lambda function private and isolated. I upped the default timeout to 30 seconds, just to be safe.

    With Lambda you can upload any dependent Python packages along with your code as a ZIP file. For this project, the .py file and the Neo4j Python Bolt driver are manually packaged into a ZIP file and uploaded to Lambda via the AWS Console.
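    If you prefer to script the packaging step rather than zip the files by hand, something like the following works. The file names here are placeholders; a real package would bundle the actual handler file together with the neo4j driver package directory:

```python
import os
import tempfile
import zipfile

# Sketch: build a Lambda deployment ZIP from Python. Paths are placeholders.
def build_package(zip_path, paths):
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as z:
        for path in paths:
            z.write(path, arcname=os.path.basename(path))

workdir = tempfile.mkdtemp()
handler = os.path.join(workdir, 'recommendation.py')  # stand-in handler file
with open(handler, 'w') as f:
    f.write('def get_recommendation(event, context):\n    return []\n')

package = os.path.join(workdir, 'lambda_package.zip')
build_package(package, [handler])
print(zipfile.ZipFile(package).namelist())
```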

    Example files and ZIP package:

    A ZIP of the AWS Lamdba Python package


    Lambda console for uploading ZIP package:

    The AWS Lambda console with Neo4j


    The Python code in the package:

    from __future__ import print_function
    from neo4j.v1 import GraphDatabase, basic_auth
    
    
    def get_recommendation(event,context):
        results = []
        user = event['user_id']
        document = event['document_id']
        driver = GraphDatabase.driver("bolt://", auth=basic_auth("neo4j", "neo4j"), encrypted=False)
        session = driver.session()
        cypher_query = '''
        MATCH (u1:user {user_id: toInt({user})}),
              (d:document{document_id:toInt({document})})<--(a2:action)<--(u2:user)
              -->(a3:action)-[r]->(d2:document)
        WHERE u1 <> u2 AND d <> d2
        AND NOT  (u1)-[*2]->(d2)
        AND a3.action_date >= a2.action_date
        RETURN d2.document_id AS document_id, 
              sum(case when type(r) = 'DOWNLOADED' then 1 else 0 end) as downloads,
              sum(case when type(r) = 'EMAILED' then 1 else 0 end) as emailed,
              sum(case when type(r) = 'SEARCH_VIEW' then 1 else 0 end) as search_views,
              count(r) as score
        ORDER BY score desc
        LIMIT 25
        '''
        result = session.run(cypher_query,{'user':user,'document':document})
        session.close()
        for record in result:
            item = {'document_id':record['document_id'],  
                    'downloads':record['downloads'], 'emailed':record['emailed'],  
                    'search_views':record['search_views'], 'score':record['score']}
            results.append(item)
        return results
    

    The actual EC2 instance DNS name is omitted in the above example. Both that and the Neo4j username and password are hard-coded in. A better way to do this in the future would be to pass these values in at runtime. This could be accomplished by calling an S3 bucket or RDS instance, both encrypted at rest, and pulling the appropriate values into the Lambda function when triggered.
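    A simpler interim option (not what the post uses, just an illustration) is to read the connection details from Lambda environment variables at runtime. The variable names below are assumptions:

```python
import os

# Sketch: avoid hard-coding the Bolt URL and credentials by reading them
# from environment variables, which Lambda lets you configure per function.
def neo4j_config():
    return {
        'uri': os.environ.get('NEO4J_BOLT_URI', 'bolt://localhost:7687'),
        'user': os.environ.get('NEO4J_USER', 'neo4j'),
        'password': os.environ.get('NEO4J_PASS', ''),
    }

# Simulate the values Lambda would inject (hypothetical host name):
os.environ['NEO4J_BOLT_URI'] = 'bolt://example.internal:7687'
print(neo4j_config()['uri'])
```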

    You will notice that I had to use encrypted=False in order to authenticate to Neo4j. Using encrypted=True would not work because we do not have the ability to leverage a known_hosts file or a signed certificate with Lambda. At this point we rely on the protection of the AWS VPC. The signed-certificate option may work with the Node.js driver, but that is out of scope for this example.

    Once this is working, we set up the API Gateway to integrate with the Lambda function via a GET request that will send user_id and document_id and return recommendations.

    In API Gateway, we add a document-recommendations resource and a {user} and {document} GET method. The brackets allow us to send the IDs as part of the request.

    The AWS API Gateway with Cypher


    On the {document} method, we integrate with our Lambda function to send user_id and document_id and receive recommendations.

    Integration between AWS Lambda and API Gateway


    Once we have this set up and tested, we deploy to a stage and get our very own API endpoint.

    To test things, I opened up Postman, executed my HTTPS API endpoint with user and document values, and through the magic of AWS and Neo4j, we have recommendations in less than 500 milliseconds!

    Real-time recommendation engine results with Neo4j and AWS


    Next Steps


    I am really passionate about Neo4j and love working with graphs. My hope is that this POC can grow into a fully-fledged content recommendation engine with many endpoints for various flavors of recommendations.

    Currently, there is a lot of manual setup involved. I envision that the production version would be an HA cluster with automated deployment of AWS components and code via some mix of Cloudformation and Jenkins.

    We are also using the default Neo4j password in this example, as well as the default settings for API Gateway. In the future, I would potentially leverage password encryption in AWS to fetch the password at runtime. API Gateway also gives us the ability to use an “Authorizer” to control access, which is something we will explore.

    In conclusion, this was such a cool example to get up and running and really only took a few hours, since I already had some sample Cypher and a zipped-up copy of my local database. I think this is a credit to how accessible Neo4j and AWS are. I look forward to building more graph-based solutions in the future.

    The world of graph databases truly is wonderful and they are indeed everywhere…even on AWS 🙂


    Already know Neo4j? Prove it.
    Take the Neo4j Certification exam and validate your graph database skills to current and future employers and customers. Click below and get certified in less than an hour.


    Get Certified

    Adding Users to the Node.js / React.js Neo4j Movie App


    Introduction


    The Neo4j Movie App Template provides an easy-to-use foundation for your next Neo4j project or experiment using either Node.js or React.js. This article will walk through the creation of users that can log in and interact with the web app’s data.

    In the Neo4j Movie App Template example, these users will be able to log in and out, rate movies, and receive movie recommendations.

    The User Model


    Aside from creating themselves and authenticating with the app, Users (blue) can rate Movies (yellow) with the :RATED relationship, illustrated in the graph data model below.

    Learn how to add users to the Node.js / React.js example Neo4j Movie App


    User Properties

      • password: The hashed version of the user’s chosen password
      • api_key: The user’s API key, which the user uses to authenticate requests
      • id: The user’s unique ID
      • username: The user’s chosen username
    :RATED Properties

    rating: an integer rating between 1 and 5, with 5 being love it and 1 being hate it.

    My Rated Movie in the Neo4j Movie App


    Users Can Create Accounts


    Before a User can rate a Movie, the user has to exist – someone has to sign up for an account. Signup creates a node in the database with a User label, along with the properties necessary for logging in and maintaining a session.

    Create a new user account in the Neo4j Movie App

    Figure 1. web/src/pages/Signup.jsx

    The registration endpoint is located at /api/v0/register. The app submits a request to the register endpoint when a user fills out the “Create an Account” form and taps “Create Account”.

    Assuming you have the API running, you can test requests either by using the interactive docs at 3000/docs/, or by using cURL.
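    As an alternative to cURL, you could exercise the same endpoint from Python. The sketch below only builds the request payload; the actual POST is left commented out because it needs the API running on localhost:3000 (and the third-party `requests` package):

```python
import json

# The endpoint and payload mirror the cURL register example.
url = 'http://localhost:3000/api/v0/register'
payload = {'username': 'Mary Jane', 'password': 'SuperPassword'}

# With the API up, the call would be:
# import requests  # pip install requests
# resp = requests.post(url, json=payload)
# print(resp.json())

print(json.dumps(payload))
```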

    Use Case: Create New User


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{ "username": "Mary Jane", "password": "SuperPassword"}' 'http://localhost:3000/api/v0/register'
    

    Response

    {
       "id":"e1e157a2-1fb5-416a-b819-eb75c480dfc6",
       "username":"Mary333 Jane",
       "avatar":{
          "full_size":"https://www.gravatar.com/avatar/b2a02b21db2222c472fc23ff78804687?d=retro"
       }
    }
    

    Use Case: Try to Create New User but Username Is Already Taken


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{ "username": "Mary Jane", "password": "SuperPassword"}' 'http://localhost:3000/api/v0/register'
    

    Response

    {
       "username":"username already in use"
    }
    

    User registration logic is implemented in /api/models/users.js. Here’s the JavaScript:

    var register = function(session, username, password) {
        return session.run('MATCH (user:User {username: {username}}) RETURN user', {
                username: username
            })
            .then(results => {
                if (!_.isEmpty(results.records)) {
                    throw {
                        username: 'username already in use',
                        status: 400
                    }
                }
                else {
                    return session.run('CREATE (user:User {id: {id}, username: {username}, ' +
                           'password: {password}, api_key: {api_key}}) RETURN user', {
                        id: uuid.v4(),
                        username: username,
                        password: hashPassword(username, password),
                        api_key: randomstring.generate({
                            length: 20,
                            charset: 'hex'
                        })
                    }).then(results => {
                        return new User(results.records[0].get('user'));
                    })
                }
            });
    };
    

    Users Can Log in


    Now that users are able to register for an account, we can define the view that allows them to log in to the site and start a session.



    User login on the Neo4j Movies App

    Figure 2. /web/src/pages/Login.jsx

    The login endpoint is located at /api/v0/login. The app submits a request to the login endpoint when a user fills in a username and password and submits the login form.

    Assuming you have the API running, you can test requests either by using the interactive docs at http://localhost:3000/docs/, or by using cURL.

    Use Case: Login


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{"username": "Mary Jane", "password": "SuperPassword"}' 'http://localhost:3000/api/v0/login'
    

    Response

    {
    	"token":"5a85862fb28a316ea6a1"
    }
    

    Use Case: Wrong Password


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{ "username": "Mary Jane", "password": "WrongPassword"}' 'http://localhost:3000/api/v0/login'
    

    Response

    {
       "password":"wrong password"
    }
    

    Use Case: See Myself


    Request

    curl -X GET \
    --header 'Accept: application/json' \
    --header 'Authorization: Token 5a85862fb28a316ea6a1' 'http://localhost:3000/api/v0/users/me'
    

    Response

    {
      "id": "94a604f7-3eab-4f28-88ab-12704c228936",
      "username": "Mary Jane",
      "avatar": {
        "full_size": "https://www.gravatar.com/avatar/c2eab5611cabda1c87463d7d24d98026?d=retro"
      }
    }
    
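    The avatar URL in the response above is a Gravatar "retro" image keyed by an MD5 hash. The exact input the app hashes isn't shown here, so the key below is an assumption; a minimal sketch of building such a URL:

```python
import hashlib

def gravatar_url(key):
    # MD5-hex the lowercased, stripped key, as Gravatar expects;
    # what the app actually hashes (username? email?) is an assumption
    digest = hashlib.md5(key.strip().lower().encode("utf-8")).hexdigest()
    return "https://www.gravatar.com/avatar/%s?d=retro" % digest
```

    For example, `gravatar_url("Mary Jane")` yields a URL of the same shape as the `full_size` field above.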

    You can take a look at the implementation in /api/models/users.js:

    var me = function(session, apiKey) {
        return session.run('MATCH (user:User {api_key: {api_key}}) RETURN user', {
                api_key: apiKey
            })
            .then(results => {
                if (_.isEmpty(results.records)) {
                    throw {
                        message: 'invalid authorization key',
                        status: 401
                    };
                }
                return new User(results.records[0].get('user'));
            });
    };
    var login = function(session, username, password) {
        return session.run('MATCH (user:User {username: {username}}) RETURN user', {
                username: username
            })
            .then(results => {
                if (_.isEmpty(results.records)) {
                    throw {
                        username: 'username does not exist',
                        status: 400
                    }
                }
                else {
                    var dbUser = _.get(results.records[0].get('user'), 'properties');
                    if (dbUser.password != hashPassword(username, password)) {
                        throw {
                            password: 'wrong password',
                            status: 400
                        }
                    }
                    return {
                        token: _.get(dbUser, 'api_key')
                    };
                }
            });
    };
    

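    The me function above looks the user up by the api_key taken from the Authorization: Token header shown in the /users/me request. Extracting that key from the raw header value could look like this (a sketch, not the app's actual middleware):

```python
def parse_api_key(authorization_header):
    # Expect the form "Token <api_key>"; return None for anything else
    if not authorization_header:
        return None
    parts = authorization_header.split()
    if len(parts) != 2 or parts[0] != "Token":
        return None
    return parts[1]
```

    The returned key would then be passed to a lookup like the me function's `MATCH (user:User {api_key: ...})` query.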
    The code here should look similar to /register. There is a similar form to fill out, where a user types in their username and password.

    With the given username, a User is initialized. The password they filled out in the form is verified against the hashed password that was retrieved from the corresponding :User node in the database.
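    The hashing scheme itself isn't shown in this snippet; a username-salted hash along the following lines would behave the way the code expects, though the template's real hashPassword may differ:

```python
import hashlib

def hash_password(username, password):
    # Salt the password with the username so identical passwords
    # for different users hash differently (scheme is an assumption)
    return hashlib.sha256((username + ':' + password).encode('utf-8')).hexdigest()
```

    The important property is that the same (username, password) pair always produces the same digest, so login can compare against the stored value.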

    If the verification is successful, it returns a token. The user is then directed to an authentication page, from which they can navigate through the app, view their user profile and rate movies. Below is a rather empty user profile for a freshly created user:

    An empty user profile in the Neo4j Movies App

    Figure 3. /web/src/pages/Profile.jsx

    Users Can Rate Movies


    Once a user has logged in and navigated to a page that displays movies, the user can select a star rating for the movie or remove the rating of a movie he or she has already rated.

    My Rated Movie in the Neo4j Movie App


    The user should be able to access their previous ratings (and the movies that were rated) both on their user profile and the movie detail page in question.

    Use Case: Rate a Movie


    Request

    curl -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    --header 'Authorization: Token 5a85862fb28a316ea6a1' \
    -d '{"rating":4}' 'http://localhost:3000/api/v0/movies/683/rate'
    

    Response

    {}
    

    Use Case: See All of My Ratings


    Request

    curl -X GET \
    --header 'Accept: application/json' \
    --header 'Authorization: Token 5a85862fb28a316ea6a1' 'http://localhost:3000/api/v0/movies/rated'
    

    Response

    [
      {
        "summary": "Six months after the events depicted in The Matrix, ...",
        "duration": 138,
        "rated": "R",
        "tagline": "Free your mind.",
        "id": 28,
        "title": "The Matrix Reloaded",
        "poster_image": "http://image.tmdb.org/t/p/w185/ezIurBz2fdUc68d98Fp9dRf5ihv.jpg",
        "my_rating": 4
      },
      {
        "summary": "Thomas A. Anderson is a man living two lives....",
        "duration": 136,
        "rated": "R",
        "tagline": "Welcome to the Real World.",
        "id": 1,
        "title": "The Matrix",
        "poster_image": "http://image.tmdb.org/t/p/w185/gynBNzwyaHKtXqlEKKLioNkjKgN.jpg",
        "my_rating": 4
      }
    ]
    

    Use Case: See My Rating on a Particular Movie


    Request

    curl -X GET \
    --header 'Accept: application/json' \
    --header 'Authorization: Token 5a85862fb28a316ea6a1' 'http://localhost:3000/api/v0/movies/1'
    

    Response

    {
       "summary":"Thomas A. Anderson is a man living two lives....",
       "duration":136,
       "rated":"R",
       "tagline":"Welcome to the Real World.",
       "id":1,
       "title":"The Matrix",
       "poster_image":"http://image.tmdb.org/t/p/w185/gynBNzwyaHKtXqlEKKLioNkjKgN.jpg",
       "my_rating":4,
       "directors":[...],
       "genres":[...],
       "producers":[...],
       "writers":[...],
       "actors":[...],
       "related":[...],
       "keywords":[...]
    }
    

    Users Can Be Recommended Movies Based on Their Ratings


    When a user visits their own profile, the user will see movie recommendations. There are many ways to build a recommendation engine, and you might want to use one or a combination of the methods below to build the appropriate recommendation system for your particular use case.

    In the movie template, you can find the recommendation endpoint at movies/recommended.

    User-Centric, User-Based Recommendations

    Here’s an example Cypher query for a user-centric recommendation:

    MATCH (me:User {username:'Sherman'})-[my:RATED]->(m:Movie)
    MATCH (other:User)-[their:RATED]->(m)
    WHERE me <> other
    AND abs(my.rating - their.rating) < 2
    WITH other,m
    MATCH (other)-[otherRating:RATED]->(movie:Movie)
    WHERE movie <> m
    WITH avg(otherRating.rating) AS avgRating, movie
    RETURN movie
    ORDER BY avgRating DESC
    LIMIT 25
    
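    The intent of that query can be sketched in plain Python over toy data: find users whose ratings on shared movies are within two stars of mine, then rank the movies they rated that I haven't seen by their average rating. This is illustrative only, not the app's code; the names and data are made up:

```python
# ratings: user -> {movie title: star rating}
ratings = {
    "Sherman": {"The Matrix": 5},
    "Alice":   {"The Matrix": 4, "Elysium": 5, "Gattaca": 2},
    "Bob":     {"The Matrix": 1, "Blade Runner": 5},
}

def recommend(me, ratings, tolerance=2):
    mine = ratings[me]
    scores = {}
    for other, theirs in ratings.items():
        if other == me:
            continue
        # "Similar" user: rated a movie I rated, within the tolerance
        if any(m in mine and abs(mine[m] - r) < tolerance
               for m, r in theirs.items()):
            for movie, r in theirs.items():
                if movie not in mine:
                    scores.setdefault(movie, []).append(r)
    # Rank unseen movies by their average rating among similar users
    avg = {m: sum(v) / len(v) for m, v in scores.items()}
    return sorted(avg, key=avg.get, reverse=True)
```

    With this data, Alice counts as similar to Sherman (|5 - 4| < 2) while Bob does not (|5 - 1| >= 2), so only Alice's other movies are recommended, best-rated first.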

    Movie-Centric, Keyword-Based Recommendations

    Newer movies will have few or no ratings, so they will never be recommended if the application relies solely on users' ratings.

    Since movies have keywords, the application can recommend movies with similar keywords for a particular movie. This case is useful when the user has made few or no ratings.

    For example, site visitors interested in movies like Elysium will likely be interested in movies with similar keywords as Elysium.

    MATCH (m:Movie {title:'Elysium'})
    MATCH (m)-[:HAS_KEYWORD]->(k:Keyword)
    MATCH (movie:Movie)-[r:HAS_KEYWORD]->(k)
    WHERE m <> movie
    WITH movie, count(DISTINCT r) AS commonKeywords
    RETURN movie
    ORDER BY commonKeywords DESC
    LIMIT 25
    
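    The keyword-overlap logic above can also be sketched in plain Python with toy data: score every other movie by how many keywords it shares with the target, then rank by that count. Again, the data and names are made up for illustration:

```python
movie_keywords = {
    "Elysium":      {"dystopia", "space", "future"},
    "District 9":   {"dystopia", "aliens", "future"},
    "The Notebook": {"romance"},
}

def similar_movies(title, movie_keywords):
    # Rank other movies by the number of keywords shared with `title`,
    # dropping movies with no overlap at all
    base = movie_keywords[title]
    scored = [(len(base & kws), other)
              for other, kws in movie_keywords.items() if other != title]
    return [other for count, other in sorted(scored, reverse=True) if count > 0]
```

    Here "District 9" shares two keywords with "Elysium" and is recommended, while "The Notebook" shares none and is dropped.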

    User-Centric, Keyword-Based Recommendations

    Users with established tastes may be interested in finding movies with characteristics similar to their highly-rated movies, while not necessarily caring whether other users have already rated them. For example, Sherman has seen many movies and is looking for new movies similar to the ones he has already watched.

    MATCH (u:User {username:'Sherman'})-[:RATED]->(m:Movie)
    MATCH (m)-[:HAS_KEYWORD]->(k:Keyword)
    MATCH (movie:Movie)-[r:HAS_KEYWORD]->(k)
    WHERE m <> movie
    WITH movie, count(DISTINCT r) AS commonKeywords
    RETURN movie
    ORDER BY commonKeywords DESC
    LIMIT 25
    

    Next Steps




    Want to learn more about what you can do with graph databases like Neo4j?
    Click below to get your free copy of the O’Reilly Graph Databases book and discover how to harness the power of graph technology.


    Download My Free Copy

    The 5-Minute Interview: Daniel Himmelstein, Postdoctoral Fellow at University of Pennsylvania

    “This is a really advanced graph algorithm and Cypher nailed it,” said Daniel Himmelstein, a Postdoctoral Fellow at the University of Pennsylvania.

    Before using Neo4j, it took as many as 1,000 lines of code to write the main query for Himmelstein’s graph algorithm used in a bioinformatics application. But with Neo4j’s Cypher graph query language, the query took only 20 lines.

    In this week’s 5-Minute Interview (conducted at GraphConnect San Francisco), we discuss how Neo4j is being used for biological and medical research at UPenn. Himmelstein also shares where he believes the field of bioinformatics research is headed in 2017.



    Tell us about how you use Neo4j at UPenn.


    Daniel Himmelstein: I use Neo4j to encode biological and medical knowledge into a network. Neo4j was the best way to encode this type of knowledge – which is produced by millions of studies over the past 50 years – where we are able to represent the rich types of nodes and relationships from real-world biological data.

    What made you choose to work with Neo4j?


    Daniel: The Neo4j community is the reason I chose it. First, the features are fantastic and were exactly what we needed, mainly because Neo4j dealt with different types of networks extremely well. But the community — with so many things on GitHub where I could report any issues with code and then have it fixed quickly, or ask a question on Stack Overflow, was really great.

    The developers have been extremely helpful, and I went to some meetups in San Francisco where I met some of the team. The company provides great support, even though we were never a paying customer as open source users of the product. The community has been great to be a part of.

    What are some of the most interesting or surprising results you’ve seen while using Neo4j?


    Catch this week's 5-Minute Interview with Daniel Himmelstein, University of Pennsylvania

    Daniel: Before using Neo4j, I had written a Python package called Hetio, which dealt with a number of different types of networks. It took as many as 1,000 lines of code to do the main query for our algorithm. But when I switched to Neo4j and was able to pour the algorithm into Cypher, the code was only 20 lines. I thought, “Wow. This is a really advanced graph algorithm and Cypher nailed it.”

    Cypher had exactly the right constructs to be able to express exactly what we wanted. And it was cool to have people finally think about how to query a graph; previously people hadn’t put much effort into developing a good query language for networks.

    If you could start over with Neo4j, taking everything you know now, what would you do differently?


    Daniel: If I could go back in time, maybe I would have used Neo4j a little bit earlier. When I first considered Neo4j, I don’t think Cypher was out yet. And because I program primarily in Python, and a little bit in R, there originally wasn’t an intuitive way to interact with Neo4j. But with the new Bolt drivers and the Cypher query language, it has become quite easy to work with Python in Neo4j.

    Anything else you want to add or say?


    Daniel: I’m really excited. There have been several talks here at GraphConnect San Francisco from people in the bioinformatics field. I know when Emil did the keynote he didn’t include biology or medicine as one of his six fields, but this will likely be one in 2017 because it’s really blowing up. We have a lot of data, it has types, and we need to understand those connections, so I expect big growth in the biology field in the next year.

    Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


    Use your RDBMS expertise to learn about graph databases: Download this ebook, The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with a relational database.

    Get the Ebook

    2016: The Year in Neo4j Drivers


    Spring is in the Air


    2016 was the year when, in April, with the availability of Neo4j 3.0, we introduced our own binary protocol named Bolt. We also provided the first set of officially supported drivers for Bolt, including Java, .NET, JavaScript and Python, developed in-house and documented in the Neo4j developer manual.

    Learn about Neo4j drivers for JavaScript, Java, .NET, Python and other community language drivers.


    Since the first days of Neo4j we’ve been supported by our active community of contributors who did a great job of providing drivers for our HTTP and REST endpoints for more than 20 popular programming languages.

    Thank you all so much for this impressive work!


    With Neo4j 3.0 and the Bolt binary protocol, we saw this amazing work continue. Originally we were a bit concerned because of the higher effort required to develop a Neo4j driver for a custom binary protocol but our contributors surprised us here.

    Even during the development of Neo4j 3.0, the first three drivers had their first releases: Nigel’s py2neo (Python), neo4j-php-client (PHP) by Chris Willemsen from our partner GraphAware (UK) and libneo4j-client (C) by Chris Leishman.

    To make it easier for contributors to develop Neo4j drivers using Bolt, Nigel started the boltkit project: executable documentation (in Python) that details how to structure and implement a driver for the Bolt protocol and the PackStream serialization. This also includes some tools for driver authors and is used in-house here at Neo Technology.

    boltkit for Neo4j drivers includes an API and PackStream details


    Summertime and the Livin’ Is Easy


    In the Go community, several people actively worked on Neo4j drivers using Bolt. John Nadratowski developed the golang-neo4j-bolt-driver, of which Eric Lagergren from SermoDigital made and maintains a more idiomatic fork. And Hugo Bost wrote neoql, a variant that (similar to cq) is based on the widely used database/sql API for Go.

    Florin Patrascu enjoyed working in Elixir and couldn’t live without a current Neo4j driver. So he provided neo4j-sips, a Bolt driver for Elixir.

    Forever Autumn


    In October at GraphConnect San Francisco, the third beta of Neo4j 3.1 was launched, with some notable improvements in the official Neo4j drivers, especially in concurrent operations and session reuse.

    The bigger changes were in the new APIs to support Causal Clustering in Neo4j 3.1. Smart client routing (bolt+routing://host:port) that uses information on cluster topology together with demarcation of read and write sessions alleviates the need for a load balancer. And the ability to use a transaction-state token (bookmark) allows for causal consistency to read your own writes even on an eventually consistent cluster underneath. These features were first to launch with the 1.1 version of the Java driver.

    Pavel Yakovlev found time besides his job as the research director of a biotech company to develop a Bolt Neo4j driver for Haskell named hasbolt.

    Hazy Shade of Winter


    The neo4j.rb (Ruby) team (Brian Underwood, Chris Grigg, et al) worked over the year – besides other improvements – on implementing the Bolt protocol in neo4j-core gem so that it is supported both on the low-level APIs as well as in the ActiveRecord module of neo4j.rb, both of which were released at the end of the year.

    These core Neo4j drivers were not the only libraries developed in 2016; we also saw them being used by other projects. The Java Neo4j driver was used for the neo4j-spark-connector and for the Neo4j-JDBC driver by the team of our partner Larus BA in Italy. The JavaScript driver powers the Neo4j Browser and the Tableau 10 (WDC2) Neo4j connector, maintained by our partner Ralf Becher from TIQ in Germany.

    The PHP driver is used in the Drupal module developed by Pronovix and also in the brand-new neo4j-symfony. Neomodel 3, the Django-friendly OGM by Robin Edwards, uses py2neo under the hood.

    The Jan 19th release of the new official 1.1 driver series for .NET, JavaScript and Python adds smart routing and bookmarking capabilities to work seamlessly with causal clusters.

    You can find all these mentioned drivers on our language guide pages for developers and for many of them also an implementation of our example movie application in our github.com/neo4j-examples repositories.

    The Neo4j example movie application


    As with any open source project, feedback from users is crucial to our success, so if you use any of the abovementioned Neo4j drivers make sure to raise issues if you encounter problems or have ideas and/or suggestions for improvements.

    We’re sure any driver author would appreciate a “thank you” for their efforts as well. And if you are using the Neo4j drivers in a commercial project, perhaps you can find an opportunity to either contribute back code that you’ve developed or consider contracting the author or the author’s company to help improve the driver for real-world usage.

    Are there languages missing that you would like to see supported by the Neo4j community or officially by Neo4j? Please let us know! Drop us an email to feedback@neo4j.com.

    If you would love to work on Neo4j drivers and related topics full-time, we’re hiring for positions in the drivers team.



    Think you have what it takes to be Neo4j certified?
    Show off your graph database skills to the community and employers with the official Neo4j Certification. Click below to get started and you could be done in less than an hour.


    Start My Certification

    From the Neo4j Community: January 2017

    Discover all of the great articles created by the Neo4j community in January 2017

    The year is off to a great start when it comes to the Neo4j community. If this month is any indication of what’s to come, then we know that 2017 will be a big year for Neo4j projects, drivers and integrations across the board. Here are some of our favorite picks from last month’s Neo4j community contributions.

    If you would like to see your post featured in February’s “From the Neo4j Community” blog post, follow us on Twitter and use the #Neo4j hashtag for your chance to get picked.

    Articles and Blog Posts


    Graph Visualisation

     

    Videos


    Language Drivers


    Libraries, GraphGists, and Code Repos



    Want to take your Neo4j skills up a notch? Take our online training class, Neo4j in Production, and learn how to scale the world’s leading graph database to unprecedented levels.

    Take the Class

    Just for Flask & React.js Developers: A New Neo4j Movies Template


    Introduction


    Let’s jump right into it. You’re a Python developer interested in Neo4j and want to build a web app, microservice or mobile app. You’ve already read up on Neo4j, played around with some datasets, and know enough Cypher to get going. Now you’re looking for a demo app or template to get the ball rolling.

    Enter the Neo4j Movies Template.

    This blog post will walk you through rating a movie on a sample movie rating application, from initial setup to viewing the list of movies you’ve rated.

    What comes with the Neo4j Movies Template:

    Overview of the Data Model and the Implementation


    The Classic Movie Database

    This project uses a classic Neo4j dataset: the movie database. It includes Movie, Person, Genre and Keyword nodes, connected by relationships as described in the following image:

    Graph data model of the classic movie database


      • (:Movie)-[:HAS_GENRE]→(:Genre)
      • (:Movie)-[:HAS_KEYWORD]→(:Keyword)
      • (:Person)-[:ACTED_IN]→(:Movie)
      • (:Person)-[:WROTE]→(:Movie)
      • (:Person)-[:DIRECTED]→(:Movie)
      • (:Person)-[:PRODUCED]→(:Movie)

    Additionally, users can add ratings to movies:

    Learn how to use Flask and React.js with Neo4j with this all-new Movies template


      • (:User)-[:RATED]→(:Movie)

    Or, in table form:

    from | props_from | via | to | props_to
    [User] | [api_key, username, password, id] | RATED | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Person] | [id, name, born, poster_image] | ACTED_IN | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Movie] | [id, title, tagline, summary, poster_image, duration, rated] | HAS_KEYWORD | [Keyword] | [id, name]
    [Person] | [id, name, born, poster_image] | DIRECTED | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Person] | [id, name, born, poster_image] | PRODUCED | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Person] | [id, name, born, poster_image] | WRITER_OF | [Movie] | [id, title, tagline, summary, poster_image, duration, rated]
    [Movie] | [id, title, tagline, summary, poster_image, duration, rated] | HAS_GENRE | [Genre] | [id, name]


    The API

    The Flask portion of the application interfaces with the database and presents data to the React.js front-end via a RESTful API.

    The Front-End

    The front-end, built in React.js, consumes the data presented by the Flask API and presents some views to the end user, including:

      • Home page
      • Movie detail page
      • Person detail page
      • User detail page
      • Login

    Setting Up


    To get the project running, clone the repo then check the project’s README for environment-specific setup instructions.

    The README covers how to:

      • Download and install Neo4j
      • Prepare the database
      • Import the nodes and relationships using neo4j-import
    Start the Database!

      • Start Neo4j if you haven’t already!
      • Set your username and password (You’ll run into less trouble if you don’t use the defaults)
      • Set environment variables (Note: the following is for Unix; for Windows you will be using set=…​)
      • Export your Neo4j database username export MOVIE_DATABASE_USERNAME=myusername
      • Export your Neo4j database password export MOVIE_DATABASE_PASSWORD=mypassword
      • You should see a database populated with Movie, Genre, Keyword and Person nodes.
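    The template's config module presumably picks these variables up via os.environ; a minimal sketch (the variable names come from the README above, the fallback behavior is an assumption):

```python
import os

def database_credentials():
    # Read the credentials exported above; the 'neo4j' default
    # username is an assumption, not the template's documented behavior
    return (os.environ.get('MOVIE_DATABASE_USERNAME', 'neo4j'),
            os.environ.get('MOVIE_DATABASE_PASSWORD', ''))
```

    The driver setup later in this post would then consume these two values when authenticating.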
    Start the Flask Backend

    The Neo4j-powered Flask API lives in the flask-api directory.

      • cd flask-api
      • pip install -r requirements.txt (you should be using a virtualenv)
      • export FLASK_APP=app.py
      • flask run starts the API
      • Take a look at the docs at http://localhost:5000/docs
    The Python Flask backend of the Neo4j Movies template app, looking at movie genres


    Start the React.js Front-End


    With the database and Flask backend running, open a new terminal tab or window and move to the project’s /web subdirectory. Install the bower and npm dependencies, then start the app by running gulp (read the “getting started” on gulpjs.com). Edit config/settings.js by changing the apiBaseURL to http://localhost:5000/api/v0

    Over on http://localhost:4000/, you should see the homepage of the movie app, displaying three featured movies and other movies below.

    Home page of the Neo4j Flask Movies template app


    Click on a movie to see the movie detail page:

    Movie detail page for the Neo4j Flask movies template


    Click on a person to see that person’s related people and movies the person has acted in, directed, written or produced:

    Person detail page in the Neo4j Flask movies template app


    A Closer Look: Using the Python Neo4j Bolt Driver


    Let’s take a closer look at what sort of responses we get from the driver.

    Import dependencies, including the Neo4j driver, and connect the driver to the database:

    Getting Ready
    # Imports assumed from the template: Flask, CORS, the swagger-enabled Api,
    # the Bolt driver, and the app's own config module
    from flask import Flask
    from flask_cors import CORS
    from flask_restful_swagger_2 import Api
    from neo4j.v1 import GraphDatabase, basic_auth

    import config  # the template's settings module (database credentials)

    app = Flask(__name__)
    app.config['SECRET_KEY'] = 'super secret guy'
    api = Api(app, title='Neo4j Movie Demo API', api_version='0.0.10')
    CORS(app)

    driver = GraphDatabase.driver('bolt://localhost',
                                  auth=basic_auth(config.DATABASE_USERNAME,
                                                  str(config.DATABASE_PASSWORD)))
    

    Let’s look at how we would ask the database to return all the genres in the database. The GenreList class queries the database for all Genre nodes, serializes the results, and returns them via /api/v0/genres.

    class GenreList(Resource):
        @swagger.doc({
            'tags': ['genres'],
            'summary': 'Find all genres',
            'description': 'Returns all genres',
            'responses': {
                '200': {
                    'description': 'A list of genres',
                    'schema': GenreModel,
                }
            }
        })
    
        def get(self):
            db = get_db()
            result = db.run('MATCH (genre:Genre) RETURN genre')
            return [serialize_genre(record['genre']) for record in result]
    
    ...
    
    def serialize_genre(genre):
        return {
            'id': genre['id'],
            'name': genre['name'],
        }
    
    ...
    
    api.add_resource(GenreList, '/api/v0/genres')
    

    What’s Going on with the Serializer?

    The Bolt driver responses are different than what you might be used to if you’ve used a non-Bolt Neo4j driver.

    In the “get all Genres” example described above, result = db.run('MATCH (genre:Genre) RETURN genre') returns a series of records:

    An Example Record
    {
       "keys":[
          "genre"
       ],
       "length":1,
       "_fields":[
          {
             "identity":{
                "low":719,
                "high":0
             },
             "labels":[
                "Genre"
             ],
             "properties":{
                "name":"Action",
                "id":{
                   "low":16,
                   "high":0
                }
             },
             "id":"719"
          }
       ],
       "_fieldLookup":{
          "genre":0
       }
    }
    
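    The low/high pairs in that record are how the driver represents 64-bit integers as two 32-bit halves. Recombining them is straightforward for the non-negative case (a sketch):

```python
def to_int(field):
    # Combine the halves: high word shifted left 32 bits,
    # low word treated as unsigned
    return (field["high"] << 32) | (field["low"] & 0xFFFFFFFF)
```

    For example, the genre id `{"low": 16, "high": 0}` above recombines to plain 16.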

    The serializer parses these messy results into the data we need to build a useful API:

    def serialize_genre(genre):
        return {
            'id': genre['id'],
            'name': genre['name'],
        }
    

    Voila! An array of genres appears at /genres.

    Beyond the /Genres Endpoint


    Of course, an app that just shows movie genres isn’t very interesting. Take a look at the routes and models used to build the home page, movie detail page and person detail page.

    The User Model


    Aside from creating themselves and authenticating with the app, Users (blue) can rate Movies (yellow) with the :RATED relationship, illustrated below.

    User data model for the Neo4j Flask movies template app


    User Properties

      • password: The hashed version of the user’s chosen password
      • api_key: The user’s API key, which the user uses to authenticate requests
      • id: The user’s unique ID
      • username: The user’s chosen username
    :RATED Properties

    rating: an integer rating between 1 and 5, with 5 being love it and 1 being hate it.

    My rated movies in the Neo4j Flask movies template app


    Users Can Create Accounts


    Before a User can rate a Movie, the user has to exist, i.e., someone has to sign up for an account. Signup will create a node in the database with a User label along with the properties necessary for logging in and maintaining a session.

    Create user account page in the Neo4j Flask movies template app

    Figure 1. web/src/pages/Signup.jsx

    The registration endpoint is located at /api/v0/register. The app submits a request to the register endpoint when a user fills out the “Create an Account” form and taps “Create Account”.

    Assuming you have the API running, you can test requests either by using the interactive docs at http://localhost:5000/docs or by using cURL.

    Use Case: Create a New User

    Request
    curl -X POST --header 'Content-Type: application/json' \
                 --header 'Accept: application/json' \
                 -d '{ "username": "Mary Jane", "password": "SuperPassword"}' \
                 'http://localhost:5000/api/v0/register'
    

    Response
    {
       "id":"e1e157a2-1fb5-416a-b819-eb75c480dfc6",
       "username":"Mary Jane",
       "avatar":{
          "full_size":"https://www.gravatar.com/avatar/b2a02..."
       }
    }
    

    Use Case: Try to Create a New User but Username Is Already Taken

    Request
    curl -X POST --header 'Content-Type: application/json' \
                 --header 'Accept: application/json' \
                 -d '{ "username": "Mary Jane", "password": "SuperPassword"}' \
                 'http://localhost:5000/api/v0/register'
    

    Response
    {
       "username":"username already in use"
    }
    

    User registration logic is implemented in /flask-api/app.py as described below:

    class Register(Resource):
        @swagger.doc({
            'tags': ['users'],
            'summary': 'Register a new user',
            'description': 'Register a new user',
            'parameters': [
                {
                    'name': 'body',
                    'in': 'body',
                    'schema': {
                        'type': 'object',
                        'properties': {
                            'username': {
                                'type': 'string',
                            },
                            'password': {
                                'type': 'string',
                            }
                        }
                    }
                },
            ],
            'responses': {
                '201': {
                    'description': 'Your new user',
                    'schema': UserModel,
                },
                '400': {
                    'description': 'Error message(s)',
                },
            }
        })
        def post(self):
            data = request.get_json()
            username = data.get('username')
            password = data.get('password')
            if not username:
                return {'username': 'This field is required.'}, 400
            if not password:
                return {'password': 'This field is required.'}, 400
    
            db = get_db()
    
            results = db.run(
                '''
                MATCH (user:User {username: {username}}) RETURN user
                ''', {'username': username}
            )
            try:
                results.single()
            except ResultError:
                pass
            else:
                return {'username': 'username already in use'}, 400
    
            results = db.run(
                '''
                CREATE (user:User {id: {id}, username: {username}, 
                                   password: {password}, 
                                   api_key: {api_key}}) RETURN user
                ''',
                {
                    'id': str(uuid.uuid4()),
                    'username': username,
                    'password': hash_password(username, password),
                    'api_key': binascii.hexlify(os.urandom(20)).decode()
                }
            )
            user = results.single()['user']
            return serialize_user(user), 201
    
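    Two helpers referenced above, hash_password and serialize_user, are not shown in this excerpt. Here is a minimal sketch of what they could look like (an assumption based on the response shape above, not the template's exact code): the password is salted with the username before hashing, and the avatar URL embeds a Gravatar-style MD5 digest of the username.

```python
import hashlib

def hash_password(username, password):
    # Hypothetical implementation: salt the password with the username
    # so identical passwords hash differently for different users.
    return hashlib.sha256((username + password).encode('utf-8')).hexdigest()

def serialize_user(user):
    # Gravatar URLs embed an MD5 hex digest of a lowercased identifier.
    digest = hashlib.md5(
        user['username'].strip().lower().encode('utf-8')).hexdigest()
    return {
        'id': user['id'],
        'username': user['username'],
        'avatar': {'full_size': 'https://www.gravatar.com/avatar/' + digest},
    }
```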

    Users Can Log In


    Now that users are able to register for an account, we can define the view that allows them to login to the site and start a session.

    User login page on the Neo4j Flask movies template app

    Figure 2. /web/src/pages/Login.jsx

    The login endpoint is located at /api/v0/login. The app submits a request to this endpoint when a user fills in a username and password and submits the login form.

    Assuming you have the API running, you can test requests either by using the interactive docs at http://localhost:5000/docs/ or by using cURL.

    Use Case: Login

    Request
    curl -X POST --header 'Content-Type: application/json' 
                 --header 'Accept: application/json' -d 
                          '{"username": "Mary Jane", "password": "SuperPassword"}' 
                          'http://localhost:5000/api/v0/login'
    

    Response
    {
      "token":"5a85862fb28a316ea6a1"
    }
    

    Use Case: Wrong Password

    Request
    curl -X POST --header 'Content-Type: application/json' 
                 --header 'Accept: application/json' -d 
                          '{ "username": "Mary Jane", "password": "WrongPassword"}' 
                          'http://localhost:5000/api/v0/login'
    

    Response
    {
       "password":"wrong password"
    }
    

    See Myself

    Request
    curl -X GET --header 'Accept: application/json' 
                --header 'Authorization: Token 5a85862fb28a316ea6a1' 
                         'http://localhost:5000/api/v0/users/me'
    

    Response
    {
      "id": "94a604f7-3eab-4f28-88ab-12704c228936",
      "username": "Mary Jane",
      "avatar": {
        "full_size": "https://www.gravatar.com/avatar/c2eab..."
      }
    }
    

    The code here is similar to that of /register. There is a similar form to fill out, where a user types in their username and password.

    With the given username, a User is initialized. The password they filled out in the form is verified against the hashed password that was retrieved from the corresponding :User node in the database.

    If the verification is successful, it will return a token. The user is then directed to an authentication page, from which they can navigate through the app, view their user profile and rate movies. Below is a rather empty user profile for a freshly created user:
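    The verification step can be sketched as follows (hypothetical helper and field names, not the template's exact code): the submitted password is re-hashed and compared against the hash stored on the matched :User node, and on success the stored api_key is returned as the token.

```python
import hashlib

def hash_password(username, password):
    # Hypothetical username-salted hash, matching what registration stored.
    return hashlib.sha256((username + password).encode('utf-8')).hexdigest()

def login(user_record, username, password):
    # user_record: dict-like view of the matched :User node's properties.
    if user_record['password'] != hash_password(username, password):
        return {'password': 'wrong password'}, 400
    # The api_key generated at registration serves as the session token.
    return {'token': user_record['api_key']}, 200
```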

    User profile page on the Neo4j Flask movies template

    Figure 3. /web/src/pages/Profile.jsx

    Users Can Rate Movies


    Once a user has logged in and navigated to a page that displays movies, the user can select a star rating for the movie or remove the rating of a movie he or she has already rated.

    My rated movies in the Neo4j Flask movies template app


    The user should be able to access their previous ratings (and the movies that were rated) both on their user profile and the movie detail page in question.

    Use Case: Rate a Movie

    Request
    curl -X POST --header 'Content-Type: application/json' 
                 --header 'Accept: application/json' 
                 --header 'Authorization: Token 5a85862fb28a316ea6a1' -d 
                          '{"rating":4}' 
                          'http://localhost:5000/api/v0/movies/683/rate'
    

    Response
    {}
    

    Python Implementation

    class RateMovie(Resource):
        @login_required
        def post(self, id):
            parser = reqparse.RequestParser()
            parser.add_argument('rating', choices=list(range(0, 6)), 
                                type=int, required=True, 
                                help='A rating from 0 - 5 inclusive (integers)')
            args = parser.parse_args()
            rating = args['rating']
    
            db = get_db()
            results = db.run(
                '''
                MATCH (u:User {id: {user_id}}),(m:Movie {id: {movie_id}})
                MERGE (u)-[r:RATED]->(m)
                SET r.rating = {rating}
                RETURN m
                ''', {'user_id': g.user['id'], 'movie_id': id, 'rating': rating}
            )
            return {}
    
        @login_required
        def delete(self, id):
            db = get_db()
            db.run(
                '''
                MATCH (u:User {id: {user_id}})
                              -[r:RATED]->(m:Movie {id: {movie_id}}) DELETE r
                ''', {'movie_id': id, 'user_id': g.user['id']}
            )
            return {}, 204
    

    Use Case: See All of My Ratings

    Request
    curl -X GET --header 'Accept: application/json' 
                --header 'Authorization: Token 5a85862fb28a316ea6a1'
                         'http://localhost:5000/api/v0/movies/rated'
    

    Response
    [
      {
        "summary": "Six months after the events depicted in The Matrix, ...",
        "duration": 138,
        "rated": "R",
        "tagline": "Free your mind.",
        "id": 28,
        "title": "The Matrix Reloaded",
        "poster_image": "http://image.tmdb.org/t/p/w185/ezIur....jpg",
        "my_rating": 4
      },
      {
        "summary": "Thomas A. Anderson is a man living two lives....",
        "duration": 136,
        "rated": "R",
        "tagline": "Welcome to the Real World.",
        "id": 1,
        "title": "The Matrix",
        "poster_image": "http://image.tmdb.org/t/p/w185/gyn....jpg",
        "my_rating": 4
      }
    ]
    

    Python Implementation

    class MovieListRatedByMe(Resource):
        @login_required
        def get(self):
            db = get_db()
            result = db.run(
                '''
                MATCH (:User {id: {user_id}})-[rated:RATED]->(movie:Movie)
                RETURN DISTINCT movie, rated.rating as my_rating
                ''', {'user_id': g.user['id']}
            )
            return [serialize_movie(record['movie'], 
            record['my_rating']) for record in result]
    
    ...
    
    def serialize_movie(movie, my_rating=None):
        return {
            'id': movie['id'],
            'title': movie['title'],
            'summary': movie['summary'],
            'released': movie['released'],
            'duration': movie['duration'],
            'rated': movie['rated'],
            'tagline': movie['tagline'],
            'poster_image': movie['poster_image'],
            'my_rating': my_rating,
        }
    

    Next Steps


      • Fork the repo and hack away! Find directors that work with multiple genres, or find people who happen to work with each other often as writer-director pairs.
      • Find a way to improve the template or the Python driver? Create a GitHub Issue and/or submit a pull request.

    Resources


    Found a Bug? Got Stuck?

      • The neo4j-users #help channel will be happy to assist you.
      • Make a GitHub issue on the driver or app repos.
    Neo4j


    Want to learn more about what you can do with graph databases? Click below to get your free copy the O’Reilly Graph Databases book and learn to harness the power of graph technology.

    Get My Free Copy

    This Week in Neo4j – 11 March 2017

    Welcome to this week in Neo4j.

    This week we’ve got articles showing how to integrate Neo4j with Kibana, using jQAssistant from Pandas, and lots of releases of Neo4j and related projects.

    But first:

    International Women’s Day


    Explore everything that's happening in the Neo4j community for the week of 11 March 2017

    Praveena and Eve answering questions at the Neo4j booth

    On Wednesday 8th March Neo4j sponsored Tech (K)now Day – a mini conference hosted by Skillsmatter for International Women’s Day.

    There were a variety of different talks and workshops including a Neo4j one run by Eve Bright, Praveena Fernandes, and me. Attendees had the chance to explore Buzzfeed’s TrumpWorld dataset and learn Neo4j in the process.

    The next day we ran a similar workshop for people interested in journalism at journocoders in London. If you’d like to get your hands on the dataset, you can get up and running in a few minutes with your own TrumpWorld Neo4j sandbox.

    There were a number of updates pushed to the TrumpWorld-Graph repository and Will Lyon released updated data and a browser guide for campaign financing in 2016 for his NICAR workshop.

    New releases of Neo4j and Neo4j Drivers


    It’s been a busy week for releases!

    The drivers team have released the first versions of the 1.2 series for the Java and .NET drivers. The Python one is planned for next week, and the JavaScript driver will follow two weeks later.

    This release removed some boilerplate code and introduced retry logic based on encapsulated “unit of work” operations. We also released Neo4j 3.2.0 ALPHA06 as part of the early release program. This version contained some Windows fixes and support for whitelisting procedures. For all changes, see the release notes.

    New release of APOC – lots of goodies to play with



    APOC Activity in March 2017

    Activity on the APOC project

    The APOC community have been busy as well. This week has seen the most commits since the surge in May/June 2016 when a lot of procedures were added.

    There have been releases of APOC that are compatible with Neo4j 3.2.0-alpha06, Neo4j 3.1.2, and Neo4j 3.0.8. The documentation was also updated and is now available for each version.

    Included in these releases are new date functions, a couple of cool new procedures for working with paths, and new functions for working with collections. Notable improvements to apoc.periodic.iterate now allow much faster operations and retries. Manual free-text indexes can now be kept up to date, and the expire (TTL) functionality is more robust.

    You can read full release notes for 3.2.0.1 (for Neo4j 3.2.0-alpha06), 3.1.2.5 (Neo4j 3.1.2), and 3.1.0.4 (Neo4j 3.1.1).

    If you try any of these releases, let us know how you get on by dropping us an email devrel@neo4j.com.

    The Neo4j Grails plugin saw its 6.0.9 and 6.1.0-RC1 releases and our partner GraphAware published the 1.0.0-RC1 version of the new Neo4j-PHP-OGM.

    Analysing Web Traffic with Neo4j


    In September 2016 Dmitriy Nesteryuk wrote an article explaining how web browsers could pre-render the next page a user might visit if they could predict what that page might be. He’s now created Sirko Engine which does this prediction in Neo4j.

    I think searching for user journeys through web sites is a fascinating use of Neo4j and Dmitriy’s project reminded me about a blog post written by Nick Dingwell of Snowplow Analytics and how they’d used Neo4j to run path analysis on their own website.

    Connecting Neo4j to Kibana, analysing source code with jQAssistant/Pandas, and more


    In other news:

    So what’s there to look forward to in the world of graphs next week?


    This Week in Neo4j – 18 March 2017

    Welcome to This Week in Neo4j.

    If you’ve got any ideas for things we should cover in future editions, I’m @markhneedham on Twitter or send an email to devrel@neo4j.com.

    WordPress Recommendation Engine


    Adam Cowley has been busy over the last couple of weeks building a Neo4j-based recommendation engine for WordPress.

    This week in Neo4j - 18 March 2017

    The WordPress graph

    You can follow his work in a three-part blog series:

    Social Network Analysis, Software Analytics and RDBMS-to-Graph


    What’s happening on GitHub?


    This week I decided to do some exploration of Neo4j projects on GitHub that haven’t necessarily surfaced on Twitter. I queried the Neo4j community graph to find the most recent Neo4j-based projects.

    These were the most interesting ones I found:

    Next Week


    So what’s there to look forward to in the world of graphs next week?

    Tweet of the Week


    We’ll finish with my favourite tweet of the week by Tobias Zander. If you’re having fun playing with Neo4j, tweet with the #Neo4j hashtag and maybe you’ll feature in next week’s post.

    Have a good weekend!

    This Week in Neo4j – 25 March 2017


    Welcome to this week in Neo4j where we collect the most interesting things that have happened in the world of graph databases over the last 7 days.

    If you’ve got something that you’d like to see featured in a future version let me know. I’m @markhneedham on Twitter or send an email to devrel@neo4j.com.


    In last week’s online meetup Mesosphere’s Johannes Unterstein showed us how to get a Neo4j causal cluster up and running on DC/OS.



    This was the culmination of several weeks’ effort where Johannes started with the Neo4j Docker image, figured out how to get it to play nicely with the Mesos ecosystem and created a Mesosphere Universe package so that users can easily create Neo4j clusters via the Marathon scheduler.

    On top of this Johannes has been a part of the Neo4j community since 2013 and has organized several meetups as well as writing a Play Framework integration for Spring Data Neo4j.

    On behalf of the Neo4j community I’d like to thank Johannes for all his efforts and I’m looking forward to your talk at GraphConnect Europe on 11th May 2017!

    Using Graph Visualization to Explore Corruption in Egypt and FIFA


    There were a couple of interesting posts showing how to use graph visualizations to explore two different types of corruption.

    Lana Chan wrote What Do Big Data Paris and the Panama Papers Have In Common? In this post Lana shows how you can use the Tom Sawyer graph data visualization tool to explore the 2015 FIFA corruption scandal.

    Explore everything that's happening in the Neo4j community for the week of 25 March 2017

    Visualizing the Egypt corruption network

    Noonpost, an interactive Arabic media website, explain how they used Linkurious for large-scale investigations in a project on Egypt’s corruption networks.

    In the post, they explain how they were able to explore connections between the army and its affiliates across various influence networks including the health, food, and tourism sectors using a combination of Cypher queries and graph visualizations.

    There’s lots of good stuff in both of these posts if you’re interested in data journalism.

    If you’d like to do data journalism work using Neo4j but don’t know how, sign up for the Neo4j Data Journalism Accelerator Program and you’ll get the opportunity to work with engineers from Neo4j’s Developer Relations team to get your analysis up and running.

    Visual Graph Modeling and Importing


    Michael Hunger created a video showing how to sketch graph models and load them into Neo4j using Alistair Jones’ arrows tool.



    Will Lyon presented a webinar late last week where he showed how to model and import real-world datasets using Neo4j.

    Will shows how to import data from Yelp using several different approaches:

      • apoc.load.json – a procedure from the APOC library that can import JSON data directly.
      • LOAD CSV – a Cypher command for importing CSV files. Works well up to ~10 million rows.
      • neo4j-import – a tool for importing large initial datasets.

    Will also talks about Neo4j’s user-defined procedures and functions, and if you’re interested in creating your own ones we’ve created a couple of new pages on the Neo4j developer site to help you get started:

    Emil in Forbes, Hiking Recommendations, Malware Clustering, and DC/OS


    On the Podcast


    This week Rik interviewed Alistair Jones about the Causal Clustering feature released in Neo4j 3.1 back in December.

    They go through the history of clustering in Neo4j, from the use of Zookeeper in the 1.8 series up to the current day, where we’ve implemented a version of Diego Ongaro’s Raft consensus protocol.

    If you want to learn more, there’s also a video of Alistair presenting on this topic.

    Next Week


    So what’s there to look forward to in the world of graphs next week?

    Tweet of the Week


    My favorite tweet this week was by Jose Ramón Cajide who’s been analyzing Twitter networks using Neo4j in RStudio:

    If you want to graph your own Twitter network you can try out the Neo4j Twitter Sandbox. Don’t forget to tweet your graph using the #Neo4j hashtag if you give it a try.

    Enjoy your weekend, it’s finally spring – hoorah!

    Cheers, Mark

    Public Service Announcement: Neo4j Drivers 1.2 Release


    We are happy to announce that all our officially supported Bolt drivers are now available as versions 1.2. With this release, we massively improved the way you write code to work with a cluster, introducing reusable “transaction functions” and built-in retry functionality.

    For some new capabilities we added new APIs. Here you can find detailed documentation and the driver repositories.

    New Capabilities in all Neo4j Drivers


    Drivers now handle cluster server failures and role changes automatically, allowing the application to treat the cluster as a single black box providing read and write services. This massively simplifies the programming model: you no longer have to track cluster state or retry operations yourself when that state changes.

      • A Bolt+routing URI represents a network address
      • Automatic DNS “Round Robin” resolution can yield multiple hosts → addresses
      • A load balancer (e.g., AWS ELB) can route to multiple hosts → addresses
      • These are the routing bootstrap addresses: they should be configured to be probable core servers
      • Read Replicas cannot provide routing tables
      • When the driver is initialized, it goes to one of the bootstrap addresses to get a routing table

    Neo4j drivers requests routing table
    Neo4j drivers cluster returns routing table
    Neo4j drivers routes client request
    Neo4j drivers refreshes routing table


    The Neo4j driver will switch traffic to an appropriate read or write connection depending on the transaction access mode. The read/write transaction access mode is a familiar SQL/ODBC/JDBC pattern of use.
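    Conceptually, the driver consults its routing table and picks a server matching the transaction's access mode. A toy illustration of that selection (not the driver's actual implementation):

```python
import itertools

class RoutingTable:
    """Toy model: round-robin reads over readers, writes over writers,
    as a routing driver conceptually does."""
    def __init__(self, readers, writers):
        self._readers = itertools.cycle(readers)
        self._writers = itertools.cycle(writers)

    def acquire(self, access_mode):
        # Read transactions can go to any server offering reads;
        # write transactions must go to a server that accepts writes.
        if access_mode == 'READ':
            return next(self._readers)
        return next(self._writers)
```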

    We added new methods, Session.read_transaction and Session.write_transaction, to allow the execution of reusable units of work: you simply pass a transaction function to the method. To allow re-execution of failed operations, the total duration for retries is configurable via max_retry_time in the driver configuration (the default is 30 seconds).

    Here is an example of how you would use this capability:

    Python Example


    from neo4j.v1 import GraphDatabase
    
    
    driver = GraphDatabase.driver("bolt+routing://server:7687",
                                  auth=("neo4j", "password"))
    
    
    def add_friends(tx, name, friend_name):
        tx.run("MERGE (p:Person {name: $name}) "
               "MERGE (f:Person {name: $friend_name}) "
               "MERGE (p)-[:KNOWS]-(f)",
               name=name, friend_name=friend_name)
    
    
    def print_friends(tx, name):
        for record in tx.run(
              "MATCH (a:Person)-[:KNOWS]->(friend) WHERE a.name = $name "
              "RETURN friend.name ORDER BY friend.name", name=name):
            print(record["friend.name"])
    
    
    with driver.session() as session:
        session.write_transaction(
          lambda tx:
            tx.run("create constraint on (p:Person) assert p.name is unique"))
        session.write_transaction(add_friends, "Arthur", "Guinevere")
        session.write_transaction(add_friends, "Arthur", "Lancelot")
        session.write_transaction(add_friends, "Arthur", "Merlin")
        session.read_transaction(print_friends, "Arthur")
    
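    Behind read_transaction and write_transaction, the driver re-runs the transaction function on transient failures until max_retry_time is exhausted. A simplified, driver-independent sketch of that retry loop (not the driver's actual code):

```python
import time

class TransientError(Exception):
    """Stand-in for the driver's transient (retryable) failure class."""

def run_with_retry(work, max_retry_time=30.0, initial_delay=0.1):
    # Re-execute `work` on transient errors with exponential backoff
    # until it succeeds or the retry budget is spent.
    deadline = time.time() + max_retry_time
    delay = initial_delay
    while True:
        try:
            return work()
        except TransientError:
            if time.time() + delay > deadline:
                raise  # budget exhausted: surface the last error
            time.sleep(delay)
            delay *= 2
```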

    Java Example


    You can find the full code in this example project.

    public class Person{
        private final static String COUNT_PEOPLE =
             ("MATCH (a:Person) RETURN count(a)");
    
        // callback method
        public static long count(Transaction tx){
            StatementResult result = tx.run(COUNT_PEOPLE);
            return result.single().get(0).asLong();
        }
        ...
    }
    
    
    public class SocialNetwork{
        public long countUsers() {
            try (Session session = driver.session()){
                return session.readTransaction(Person::count);
            }
        }
    
        public long addUser(Person user) {
            System.out.println(format("Adding user %s", user));
        try (Session session = driver.session()) {
                return session.writeTransaction(user::save);
            }
        }
    }

    We decoupled the Session from a single underlying connection; a Session can now be defined as a causally linked sequence of transactional units of work.

    You don’t need to manage bookmarks for causal consistency manually any longer. Bookmarks are now automatically passed between transactions within a routing session. This makes causal consistency the default interaction mode with the database cluster.
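    Conceptually, every committed transaction yields a bookmark, and the session feeds the latest bookmark into the next transaction so that later reads observe earlier writes. A toy model of that chaining (not the driver's implementation):

```python
class CausalSession:
    """Toy model of automatic bookmark passing within a session."""
    def __init__(self):
        self._bookmark = None  # last bookmark seen by this session

    def run_transaction(self, work):
        # `work` receives the previous bookmark and returns a result
        # plus the bookmark of its own commit; the session keeps the
        # new bookmark for the next unit of work.
        result, new_bookmark = work(self._bookmark)
        self._bookmark = new_bookmark
        return result
```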

    Auto-commit transactions (Session.run) will now run partially synchronously with the network: RUN and PULL_ALL are sent to the server, and the RUN response is received immediately. This allows exceptions to be raised at a more logical point in the application.

    Updates in Some of the Neo4j Drivers


    The Python language driver now includes a compiled C module for improved performance on supported platforms. Please let us know if this works for you.

    Most of the drivers (all except .NET) can now handle the case where the provided hostname resolves to multiple IP addresses.

    As always, we’d love your feedback, so please try out the new Neo4j driver releases and raise feature or bug requests on the driver repositories. Please let us also know what you think about the new APIs and if there are ways to improve them.

    If you need quick help, please join neo4j.com/slack and ask in the #drivers or the appropriate #neo4j-<language> channel. Otherwise, you can also ask on Stack Overflow; please tag your questions with [neo4j-<language>-driver].

    Enjoy the new Neo4j drivers,

    Nigel Small, for the Neo4j Drivers Team


    This Week in Neo4j – 22 April 2017


    Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.


    This Week in Neo4j - 22 April - Dmitry Vrublevsky from Neueda Labs

    Dmitry Vrublevsky from Neueda Labs

    This week’s featured community member is Dmitry Vrublevsky who works for Neueda Labs and has been very active in Neo4j’s community for quite some time.

    He started helping people on StackOverflow and Slack and then started the development of the Neo4j plugin for all the Jetbrains IDEs. That work has evolved into a full featured database tool, which was recently featured on this blog.

    Dmitry also spoke at the openCypher implementers meeting in February and will be at GraphConnect in London. He and his team are currently helping us add some cool features to the Neo4j Browser.

    Neo4j at the Galway-Mayo Institute of Technology


    Multiple students from GMIT have been using Neo4j as part of their graph theory course and have been building a graph of the university timetable.

    I wish I’d got to use Neo4j at university so I’m very jealous – it was Oracle all the way where I studied!

    APOC, Call Data Records, GORM, Twitter Clone


    Online Meetup: Building the Wikipedia Knowledge Graph


    In this week’s Neo4j online meetup, Dr Jesús Barrasa and I showed how to load the Wikipedia Knowledge Graph into Neo4j and write queries against it.

    We’ve been hosting meetups almost every week for the last couple of months so if you want to catch up on earlier episodes you can find all of them on the Neo4j Online Meetup playlist.

    From The Knowledge Base


    We also have a really cool discussion of ways to limit MATCHes in subqueries by Andrew Bowman, our featured community member in the 25 February 2017 edition of TWIN4j.

    On GitHub: Mahout, Holocaust Research, Kafka Connector


    There’s been an incredible amount of activity on GitHub this week. These were the most interesting projects that I came across.

      • UserLine automates the process of creating logon relations from MS Windows Security Events, showing graphical relations among user domains, source and destination logons, and session duration.
      • Nigel Small created Memgraph – a Python library that provides a Neo4j-compatible in-memory graph store.
      • There were some updates to the European Holocaust Research Infrastructure project, which provides a business layer and JAX-RS resource classes for managing holocaust data.
      • Erick Peirson created cidoc-crm-neo4j, a meta-implementation of the CIDOC Conceptual Reference Model (CRM). The CIDOC CRM provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation. The project uses Python’s neomodel to interact with a Neo4j database.
      • gbrodar created pcap4j – a repository of scripts for analysing the output of the Unix pcap tool.
      • Mark Wood created neo4j-mahout which wraps calls to Mahout functions in Neo4j user defined functions. I played around with Mahout a couple of years ago so I’m quite excited to try combine it with Neo4j using this tool.
      • JunfengDuan created kafka-neo4j-connector, which transfers data from Kafka to Neo4j.

    Neo4j Jobs


    I’ve not listed jobs in TWIN4j before but I came across an interesting one posted by Musimap, a B2B cognitive music intelligence company in Brussels. They’re hiring a Full-Stack Web Developer with Neo4j and Python experience so if that sounds like your type of thing it might be worth applying.

    If you have any jobs that you’d like me to feature in future versions, drop me a tweet @markhneedham.

    Next Week


    What’s happening next week in the world of graph databases?

    Tweet of the Week


    My favorite tweet this week was by Felix Victor Münch:

    Don’t forget to retweet Felix’s post if you liked it as well!

    That’s all for this week. Have a great weekend.

    Cheers, Mark

    This Week in Neo4j – 29 April 2017


    Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.

    But before we begin, a quick announcement from us, the Neo4j Developer Relations team.

    Developer Zone at GraphConnect Europe 2017


    To provide the best developer experience at our GraphConnect conference in London, on May 11th 2017, we will open a dedicated Developer Zone.

    We will all be joined by Neo4j engineers, eager to answer your questions and talk about cool stuff you can do with Neo4j.

    So if you can make it to London for GraphConnect, don’t miss the best experience of the show – the Developer Zone. You can register with the DEVZONE30 code for 30% off, or send an email to devrel@neo4j.com to get one of the few free or 50%-off tickets.


    This Week in Neo4j - 29 April 2017

    This week’s featured community member – Michael Moussa

    Michael has been active for quite a while in the Neo4j community, presenting introductions to Neo4j at multiple PHP conferences. Last week he presented at the Lone Star PHP Conference in Dallas, TX.

    He’s also contributed to PHP related projects in the Neo4j community and answered questions in our open channels.

    Last few days of APOC Awareness Month


    We’re in the last days of APOC Awareness Month, so if you haven’t published your article yet, you have until Monday evening (May 1st might be a good day off to work on this).

    Tomaz Bratanic continued his APOC algorithm series and wrote this time about similarities, cluster finding and visualizing them with virtual nodes and relationships. A very interesting read!

    Python, PyData, Flask, NeoModel, and Neo4j


    Nigel Small, author of py2neo and tech lead of the drivers team visited Amsterdam a couple of weeks ago to present “A Pythonic Tour of Neo4j and the Cypher Query Language” at the PyData conference.

    Mostafa Moradian published gRest, a quickstart repository to build applications with Python, Flask, and NeoModel – a Django-like OGM for Neo4j.

    The GraphConnect schedule is a graph


    GraphConnect Schedule Graph

    The GraphConnect Europe 2017 Schedule

    Besides interviewing our community for the Graphistania Podcast and creating Graph-Karaokes, Rik van Bruggen also loves to recreate event schedules in Neo4j, for easy querying and recommendations.

    GraphConnect is no exception and you can now view the schedule as a graph.

    Wikipedia Knowledge Graph, GraphQL, Causal Clustering


      • As a follow up to last week’s online meetup my colleague Jesús Barrasa published a blog post explaining how to create the Wikipedia Knowledge Graph in Neo4j. He loads pages and categories and enriches them by querying dbpedia. You can follow along by running the Neo4j-Browser Guide Jesús created in the blank Neo4j Sandbox.
      • Rik also published parts 2, 3, and 4 of his series answering common questions about Neo4j. You get very detailed answers on questions of scale, the usage of Lucene, Solr, and transactions, and Neo4j’s Gremlin support.
      • If you love to extend Neo4j, you will like this article by Igor Borojevic, who shows, as part of the Neo4j security series, how to build a custom security plugin so you can choose your own way of doing authentication and authorization.
      • Chris Skardon explains step by step how to manually set up a causal cluster with Neo4j 3.1.3 on Microsoft Azure. Enjoy his funny observations and comments in his blog post: So you want to go Causal Neo4j in Azure? Sure we can do that.
      • Magnus Wallberg wrote up the PhUSE conference where he attended a workshop led by Tim Williams comparing RDF and graphs.
      • If you’re looking for a job where you can work with Neo4j full time, Matt Andrews at the Financial Times is hiring:

    The Mattermark GraphQL API Graph


    GraphQL has been on our minds, lately. So, when the Mattermark GraphQL API became available, Will Lyon looked into it and created this insightful blog post on analysing local startup ecosystems based on their data.

    He uses ApolloClient to access the API and turn the data of startups based in his home state of Montana into a graph in Neo4j.

    Will then goes on to use Cypher queries to answer questions such as:

      • What are the companies in Montana that are raising venture capital?
      • Who are the founders?
      • Who is funding them and what industries are they in?
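Questions like these map naturally onto Cypher. As a rough sketch of what such queries could look like, assuming a hypothetical model with :Company, :Person, :Investor and :Industry nodes (the labels, relationship types and property names below are illustrative assumptions, not necessarily the model Will used):

```cypher
// Hypothetical model: (:Person)-[:FOUNDED]->(:Company),
// (:Investor)-[:FUNDED]->(:Company), (:Company)-[:IN_INDUSTRY]->(:Industry)
MATCH (c:Company {state: "Montana"})
OPTIONAL MATCH (founder:Person)-[:FOUNDED]->(c)
OPTIONAL MATCH (inv:Investor)-[:FUNDED]->(c)
OPTIONAL MATCH (c)-[:IN_INDUSTRY]->(industry:Industry)
RETURN c.name AS company,
       collect(DISTINCT founder.name)  AS founders,
       collect(DISTINCT inv.name)      AS investors,
       collect(DISTINCT industry.name) AS industries
```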

    Online Meetup: Learning Chinese with Neo4j


    In this week’s online meetup Fernando Izquierdo showed us how to learn Chinese using Neo4j.

    Even if you’ve got no interest in learning Chinese this is still worth watching because it’s such an innovative use of graphs.

    From The Knowledge Base


    This week from the Neo4j Knowledge Base:

    On GitHub: Rust, Spring Data Neo4j, The Bible


    Here are some of the most interesting projects I found on my GitHub travels:

      • If you like to work in Rust, this crate can help you access Neo4j natively. It uses Cypher via the HTTP protocol and is well documented in the readme. It even offers a macro-based approach for less clutter in your code.
      • Marco Falcier created a quick Spring Data Neo4j example project for managing forests of trees that gives you a good starting point. It runs on a temporary in-memory database, comes with an Angular frontend, and provides Mockito-based tests.
      • The MetaV viz.bible is an online and mobile site publishing detailed connections between Bible verses, with a lot of insights and charts. Olin Blodgett took the CSV data, which is available under a CC license, and transformed it into a graph in Neo4j. You can also see the underlying data model and some example queries. It would be interesting to build an app on top of that graph data which could augment viz.bible with deeper insights based on graph queries and analytics.
      • If you are into life-sciences research and want to work with SNOMED data in Neo4j, Pradeep created a Docker-based workflow using the official containers for Neo4j and SNOMED, plus a Groovy script to load the data into a graph.

    Tweet of the Week


    My favorite tweet this week was by Christos Delivorias:

    That’s all for this week. Have a great weekend.

    Cheers, Michael & Mark

    An Introduction & Tutorial for Structr 2.1

    In one of our previous blog posts, we promised to write more about new features of our upcoming release of Structr, version 2.1, so here we are.

    New Tutorial


    But before we dive into the details, we’d like to announce the first tutorial that our friends over at The SilverLogic created, which will be part of a series of example projects we’ll publish over the next few months. This detailed tutorial on how to create a Structr app shows many of the new features listed in this post. If you follow it, you will be able to create a simple blogging app within a couple of hours.

    You can find the full tutorial on the Structr blog at https://structr.org/blog/blog-app-tutorial.

    Learn more about Structr 2.1 in this introduction and tutorial walking you through the new features


    And now back to the features.

    New Features


    One of the most requested features, among many other improvements and bugfixes, is finally here, and it aims at developer productivity: we added a new deployment tool that allows you to export a complete Structr application in the form of a collection of HTML and JSON files so that you can store it in any version control system (VCS).

    We found a way to serialize all the information that makes up a Structr app, which is stored in Neo4j at runtime, and export it to a filesystem structure. This allows you to use your favorite Integrated Development Environment (IDE) and diff and merge tools to make and track changes. In addition, the deployment tool (export/import) can even be used remotely over HTTP(S), so you don’t need a console login on the server to update your Structr instance.

    Another new feature which makes operating Structr easier is the new web-based configuration tool: No need to manually edit the structr.conf file anymore!

    The config tool UI in Structr 2.1


    The most anticipated feature of the new configuration interface is that you can now start and stop services individually while Structr is running. That means you can disconnect Structr from one Neo4j database and connect it to another, all without stopping the JVM instance, or you can enable and disable debugging and logging flags at runtime, which will greatly improve productivity.

    Apart from that, the upcoming 2.1 release contains lots of new features to boost productivity: There’s a new administration console (press Ctrl-Shift-C to activate) for quick and easy scripting tasks, maintenance operations or monitoring log files, etc. We also improved the internal JavaScript scripting bridge and built a foundation which allows us to add support for more scripting languages like Ruby, PHP, Python or R.

    Some More Improvements


    A few other things we improved:
      • The test coverage has been improved and the tests are running much faster now due to better reuse of Neo4j instances.
      • A couple of new widgets to massively speed up app development
      • Improved schema layout and schema editor enhancements
      • Favourites: Define editable texts like script files or content elements as favourites and access them quickly via a keyboard shortcut (Ctrl-Alt-F)

    Developer Support Program


    Due to the rapidly growing demand for documentation, training materials and project support, we created a new program called the Developer Support Program which covers the most requested support services in an attractive package. We’ll announce more details soon.

    GraphConnect Europe


    Last but not least, Structr is once again happy to be a Gold Sponsor of the upcoming GraphConnect Europe happening in London on 11 May 2017. Save 30% on all tickets with the promo code STRUCTR30.

    See you in London!


    Join us at Europe’s premier graph technology event: Get your ticket to GraphConnect Europe and we’ll see you on 11th May 2017 at the QEII Centre in downtown London!

    Get My Ticket

    This Week in Neo4j – 6 May 2017


    Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.


    This week’s featured community member is Alessio De Angelis, an IT consultant at Whitehall Reply for projects held by SOGEI, the Information and Communication Technology company linked to the Economics and Finance Ministry in Italy.

    This week’s featured community member: Alessio De Angelis

    Alessio first came onto the Neo4j scene a few years ago when he took part in a GraphGist competition, creating an entry showing Santa’s shortest weighted path around the world.

    Querying the Neo4j TrumpWorld Graph with Amazon Alexa


    The coolest Neo4j project of the week award goes to Christophe Willemsen, our featured community member on 2 April 2017.

    Christophe has created a tool that executes Cypher queries in response to commands issued to his Amazon Alexa.

    Rare diseases research, APOC spatial, Twitter Clone


    Rare diseases research

    Rare diseases research using graphs and Linkurious

    Online Meetup: Planning your next hike with Neo4j


    In this week’s online meetup Amanda Schaffer showed us how to plan hikes using Neo4j.

    There’s lots of Cypher queries and a hiking recommendation engine, so if that’s your thing give it a watch.

    From The Knowledge Base


    On the podcast: Andrew Bowman


    In his latest podcast interview Rik van Bruggen interviews our newest Neo4j employee, Andrew Bowman. You’ll remember that Andrew was our very first featured community member on 25 February 2017.

    Rik and Andrew talk about Andrew’s contributions to the community and Andrew’s introduction to Neo4j while building social graphs for Athena Health.

    On GitHub: Graph isomorphisms, visualization, natural language processing


    There’s a variety of different projects on my GitHub travels this week.

    Next Week


    It’s GraphConnect Europe 2017 week so the European graph community will be at the QE2 in London on Thursday 11th May 2017.

    The venue for GraphConnect Europe 2017

    The QE2 in London, the venue for GraphConnect Europe 2017

    If you would like to be in with a chance of winning a last minute ticket don’t forget to register for our online preview meetup on Monday 8th May 2017 at 11am UK time.

    We’ll be joined by a few of the speakers who’ll give a sneak peek of their talks as well as talk about what they love about GraphConnect.

    Hope to see you there!

    Tweet of the Week


    I’m going to cheat again and have two favourite tweets of the week.

    First up is Chris Leishman sharing his favourite font for writing Cypher queries:

    And there was also a great tweet by Caitlin McDonald:

    That’s all for this week. Have a great weekend and I’ll hopefully see some of you next week at GraphConnect.

    Cheers, Mark

    This Week in Neo4j – Moving Adobe Behance from Cassandra to Neo4j, New Go Driver, Emil on The New Stack Makers Podcast


    Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.

    This week David Fox explains how his team at Adobe moved from a 48 instance Cassandra cluster to a 3 instance Neo4j one, Emil is interviewed on The New Stack Makers Podcast, Neo4j Launches Commercial Kubernetes Application on GCP Marketplace, and we have the first alpha release of our new Go driver!


    This week’s featured community member is David Fox, Software Engineer at Adobe.

    David Fox – This Week’s Featured Community Member

    David has been a member of the Neo4j community for many years and presented Connections Through Friends: The Second Degree and Beyond at GraphConnect 2013.

    I first came across David in my role in Neo4j’s customer success team while David was working at Snap Interactive (now PeerStream). David has since presented his experiences there in a talk at the Neo4j New York meetup titled Running Neo4j in Production: Tips, Tricks and Optimizations.

    David now works for Adobe, and is responsible for the backend infrastructure and performance on Behance – a social network for creatives, serving over 10 million members. We’ll cover more about his experience there below.

    David also built devRant – a community especially crafted with the wants and needs of developers in mind – and wrote about his experience using Neo4j as part of that application.

    On behalf of the Neo4j community, thanks for all your work David!

    Moving Adobe Behance’s activity feed from Cassandra → Neo4j


    As mentioned above, David was interviewed by Prof. Roberto V. Zicari, about his experience building a new implementation of Behance’s activity feed feature.

    In the first part of the interview David explains how the activity feed feature works and describes some of the limitations of their original implementation, which used Cassandra as the underlying storage engine.

    He goes on to observe that the full dataset size has been reduced from 50 TB when it was stored in Cassandra down to around 40 GB in Neo4j. They’re also able to power this system using a cluster of 3 Neo4j instances, down from 48 Cassandra instances of equal specs.

    As a result, they’ve been able to dramatically decrease the number of developer-operations staff hours required each month to keep the activity feed running.

    Neo4j Launches Commercial Kubernetes Application on GCP Marketplace


    On Wednesday David Allen announced the release of the Neo4j Graph Platform within a commercial Kubernetes application to all users of the newly renamed Google Cloud Platform Marketplace.

    This means that users can now easily deploy Neo4j’s native graph database capabilities for Kubernetes directly into their GKE-hosted Kubernetes cluster.

    On The New Stack Makers Podcast: Emil Eifrem


    They talk about the history of Neo4j from its origins solving a problem in enterprise content management, through to the release of Neo4j Bloom last month, and Emil’s vision of the future of machine learning and graphs.

    You can listen to the interview below.



    RDFS/OWL ontologies → Neo4j, Part 4 of Dating Site, Merging data from optional keys


    First alpha of Go Neo4j driver


    Based on popular demand our drivers team have been working on a Go driver for Neo4j, and this week released its first alpha version.

    You can find instructions for using the driver in the neo4j-go-driver GitHub repository, and if you’ve used any of the other language drivers you will find the same familiar API that you’re used to.

    The GA for the Go Driver is planned along with the Neo4j 3.5 release later this year. If you want to learn more you can join the #neo4j-golang channel of the Neo4j users slack.

    Creating Nodes and Relationships Dynamically with APOC


    Creating nodes and relationships with Cypher is really straightforward. It only gets tricky when your labels, relationship types or property keys are dynamic and driven by data.



    The Cypher planner only works with static tokens, and in this video Michael shows how APOC procedures come to the rescue for creating, merging and updating nodes and relationships with dynamic data coming from user-provided strings or lists.
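As a minimal sketch of this pattern, using the apoc.create.node and apoc.create.relationship procedures with label and relationship-type values supplied as data (the actual values here are invented for illustration):

```cypher
// Label and relationship type arrive as data, not as static Cypher tokens
WITH "Person" AS label, "KNOWS" AS relType
CALL apoc.create.node([label], {name: "Alice"}) YIELD node AS alice
CALL apoc.create.node([label], {name: "Bob"}) YIELD node AS bob
CALL apoc.create.relationship(alice, relType, {since: 2017}, bob) YIELD rel
RETURN alice, bob, rel
```

The same family of procedures includes apoc.merge.node for MERGE-style semantics with dynamic labels.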

    Python Dependency Graph, Fraud Detection with Neo4j, Neo4j OGM Release


      • I wrote a blog post showing how to analyse a graph of your Python dependencies using centrality algorithms from the Neo4j Graph Algorithms library.
      • Joe Depeau presented a webinar showing How to Build a Fraud Detection Solution with Neo4j. Joe shows the value that graphs can add beyond traditional fraud detection methods, shows how Neo4j can fit in a typical architecture, and demonstrates how Neo4j Bloom can be used to explore a fraud dataset.
      • Michael Simons released version 3.0.4 of Neo4j OGM. This version has support for version 1.5 of Bolt drivers, compatibility for 3.4 point types, and several bug fixes.
      • Jennifer Reif has written a blog post in which she covers the history of data storage, contrasts relational and graph data modeling, and shares some common use cases for graphs.

    Next Week


    What’s happening next week in the world of graph databases?

      • July 25th 2018: Neo4j Quick Graphs: Extracting Taxonomies, Strava, Wikipedia, Python Dependencies at the Neo4j – London User Group, presented by Mark Needham and Jesús Barrasa
      • July 25th 2018: Querying Open Civic Data Using Cypher & Neo4j at Philly GraphDB

    Tweet of the Week


    My favourite tweet this week was by Iian Neill:

    Don’t forget to RT if you liked it too.

    That’s all for this week. Have a great weekend!

    Cheers, Mark

    Viewing all 195 articles
    Browse latest View live