Channel: python Archives - Graph Database & Analytics

Democratizing Data Discovery at Airbnb

Learn how Airbnb democratized their data discovery with a graph database.
Editor’s Note: This presentation was given by John Bodley and Chris Williams at GraphConnect Europe in May 2017.

Presentation Summary


Airbnb, the online marketplace and hospitality service for people to lease or rent short-term lodging, generates many data points, which leads to logjams when users attempt to find the right data. Challenges managing all the data points have led the data team to search for solutions to “democratize the data,” helping employees with data exploration and discovery.

To address this challenge, Airbnb has developed the Dataportal, an internal data tool that helps with data discovery and decision-making and that runs on Neo4j. It’s designed to capture the company’s collective tribal knowledge.

As data accumulates, so do the challenges around the volume and complexity of the data. One example of where this data accumulates is in Airbnb’s Hive data warehouse. Airbnb has more than 200,000 tables in Hive spread across multiple clusters.

Each day the data starts off in Hive. Airbnb’s data engineers use Airflow to push it to Python. The data is eventually pushed to Neo4j by the Neo4j driver. The graph database is live, and every day they push updates from Hive into the graph database.

Why did Airbnb choose Neo4j? There are multiple reasons. Neo4j captures the relevancy of relationships between people and data resources, helping guide people to the data they need and want. On a technical level, it integrates well with Python and Elasticsearch.

Airbnb’s Dataportal UI is designed to help users, the ultimate holders of tribal knowledge, find the resources they need quickly.

Full Presentation: Democratizing Data at Airbnb


What we will be talking about today is how Airbnb uses Neo4j’s graph database to manage the many data points that accumulate in our Hive data warehouse.



What Is the Dataportal?


John Bodley: Airbnb is an online marketplace that connects people to unique travel experiences. We both work in an internal data tools team where our job is to help ensure that Airbnb makes data-informed business decisions.

The Dataportal is an internal data tool that we’re developing to help with data discovery and decision-making at Airbnb. We are going to describe how we modelled and engineered this solution, centered around Neo4j.

Addressing the Problem of Tribal Knowledge


The problem that the Dataportal project attempts to address is the proliferation of tribal knowledge. Relying on tribal knowledge often stifles productivity. As Airbnb grows, so do the challenges around the volume, the complexity and the obscurity of data. In a large and complex organization with a sea of data resources, users often struggle to find the right data.

We run an employee survey and consistently score really poorly on the question, “The information I need to do my job is easy to find.”

Data is often siloed, inaccessible and lacks context. I’m a recovering data scientist who wants to democratize data and provide context wherever possible.

Taming the Firehose of Hive


We have over 200,000 tables in our Hive data warehouse. It is spread across multiple clusters. When I joined Airbnb last year, it wasn’t evident how you could find the right table. We built a prototype, leveraging previous insights, giving users the ability to search for metadata. We quickly realized that we were somewhat myopic in our thinking and decided to include resources beyond just data tables.

Data Resources Beyond the Data Warehouse


We have over 10,000 Superset charts and dashboards. Superset is an open source data analytics platform. We have in excess of 6,000 experiments and metrics. We have over 6,000 Tableau workbooks and charts, and over 1,500 knowledge posts from Knowledge Repo, our open source knowledge-sharing platform that data scientists use to share their results – as well as a litany of other data types.

But most importantly, there’s over 3,500 employees at Airbnb. I can’t stress enough how valuable people are as a data resource. Surfacing who may be the point of contact for a resource is just as pertinent as the resource itself. To further complicate matters, we’re dispersed geographically, with over 20 offices worldwide.

The mandate of the Dataportal is quite simply to democratize data and to empower Airbnb employees to be data informed by aiding with data exploration, discovery and trust.

At a very high level, we want everyone to be able to search for data. The question is how to frame our data in a meaningful way for searching, and we have to be cognizant of ranking relevance as well. What we feed into our search indices is fairly evident: all of these data resources and their associated metadata.

The Relevancy of Relationships: Bringing People and Data Together


Thinking about our data in this way, we were missing something extremely important: relationships.

Our ecosystem is a graph, the data resources are nodes and the connectivity is all relationships. The relationships provide the necessary linkages between our siloed data components and the ability to understand the entire data ecosystem, all the way from logging to consumption.

Relationships are extremely pertinent for us. Knowing who created or consumed a resource (as shown below) is just as valuable as the resource itself. Where should we gather information from a plethora of disjointed tools? It would be really great if we could provide additional context.

Check out this graphic of how Airbnb defines their relevancy of data relationships with their employees.

Let’s walk through a high-level example, shown below. Using event logs, we discover a user consumes a Tableau chart, which lacks context. Piecing things together, we discover that the chart is from a Tableau workbook. The directionless edge is somewhat ambiguous, but we prefer the many-to-one direction from both a flow and a relevancy perspective. Digging a little further, both these resources were created by another user. Now we find an indirect relationship between these users.

We then discover that the workbook was derived from some aggregated table that wasn’t in Hive, thus exposing the underlying data to the user. Then we parse the Hive logs and determine that this table is actually derived from another table, which provides us with the underlying data. And finally, both these tables are associated with the same Hive schema, which may provide additional context with regard to the nature of the data.

How Airbnb's Dataportal graph search platform first took shape.

We leverage all these data sources and build a graph comprising the nodes and relationships, and this resides in Hive. We pull from a number of different sources. Hive is our persistent data store, where the table schema mimics Neo4j: we have a notion of labels, properties and an ID.

We pull from over six databases that come through scrapes that land in Hive. We also pull from a number of APIs – Google, Slack – and some logging frameworks. That all goes into an Airflow Directed Acyclic Graph (DAG). (Airflow is an open source workflow tool that was also developed at Airbnb.) This workflow is run every day, and the graph is left to soak to prevent what we call “graph flickering.”

See the data resources Airbnb leverages to build a graph in Hive.
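The DAG itself isn’t shown in the talk; as a rough sketch of its shape – hypothetical task names, classic Airflow 1.x API – the daily pipeline might look like:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical callables -- stand-ins for the real scrape/build/push jobs.
def scrape_sources(**context):
    """Land DB scrapes, API pulls and logs as Hive tables."""

def build_graph(**context):
    """Assemble nodes and relationships into the Hive graph tables."""

def push_to_neo4j(**context):
    """Merge the day's graph into the live Neo4j instance."""

dag = DAG(
    'dataportal_graph',
    start_date=datetime(2017, 1, 1),
    schedule_interval=timedelta(days=1),  # runs once a day
)

scrape = PythonOperator(task_id='scrape_sources', python_callable=scrape_sources, dag=dag)
build = PythonOperator(task_id='build_graph', python_callable=build_graph, dag=dag)
push = PythonOperator(task_id='push_to_neo4j', python_callable=push_to_neo4j, dag=dag)

scrape >> build >> push  # daily lineage: sources -> Hive graph -> Neo4j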

Dealing with “Graph Flickering”


Let me explain what I mean by graph flickering. Our graph is somewhat time-agnostic. It represents the most recent snapshot of the ecosystem. The issue is certain types of relationships are sporadic in nature, and that’s causing the graph to flicker. We resolve this by introducing the notion of relational state.

We have two sorts of relationships: persistent and transient.

Persistent relationships (see below) represent a snapshot in time of the system; they are the result of a DB scrape. In this example, the creator relationship will persist forever.

Check out how persistent relationships represent a snapshot in time.

Transient relationships, on the other hand, represent events that are somewhat sporadic in nature. In this example, the consumed relationship would only exist on certain days, which would cause the graph to flicker.

To solve this, we simply expand the time period from one to a trailing 28-day window, which acts as a smoothing function. This ensures the graph doesn’t flicker, but also enables us to capture only recent, and thus relevant, consumption information into our graph.

See how transient relationships are sporadic in nature.
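As a sketch of that smoothing function – the tuple layout and counts here are assumptions, not Airbnb’s actual schema – collapsing raw consumption events into windowed edges might look like:

from datetime import date, timedelta

TRAILING_WINDOW_DAYS = 28  # the talk's trailing 28-day window

def consumed_edges(events, as_of):
    """Collapse raw consumption events into stable (user, resource) edges.

    `events` is an iterable of (user, resource, event_date) tuples, e.g. from
    the daily scrape. An edge exists if the user consumed the resource at
    least once inside the trailing window, so a single quiet day no longer
    makes the relationship blink in and out of the graph.
    """
    cutoff = as_of - timedelta(days=TRAILING_WINDOW_DAYS)
    edges = {}
    for user, resource, event_date in events:
        if event_date > cutoff:
            key = (user, resource)
            edges[key] = edges.get(key, 0) + 1  # count doubles as an edge weight
    return edges

# Example: the edge survives even though the last event was days ago.
events = [('alice', 'tableau_chart_42', date(2017, 5, 1)),
          ('alice', 'tableau_chart_42', date(2017, 5, 10))]
print(consumed_edges(events, as_of=date(2017, 5, 20)))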

How Airbnb Uses Neo4j with Python and Elasticsearch


Let’s touch upon how our data ends up in Neo4j and downstream resources.

Shown below is a very simplified view of our data path which, in itself, is a graph. Given that relationships have parity with nodes, it’s pertinent that we also discuss the conduit that connects these systems.

Every day, the data starts off in Hive. We use Airflow to push it to Python. In Python, the graph is represented as a NetworkX object, and from this we compute a weighted PageRank on the graph, which helps improve search ranking. The data is then pushed to Neo4j by the Neo4j driver.
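The talk doesn’t show the ranking step itself; as a rough sketch of the idea – hypothetical IDs and weights, NetworkX’s built-in weighted PageRank – it might look like this:

import networkx as nx

# Hypothetical edge rows scraped from Hive: (source id, target id, weight).
rows = [
    ('user:alice', 'chart:revenue', 3.0),      # alice consumed the chart 3 times
    ('chart:revenue', 'table:core.bookings', 1.0),
]

graph = nx.DiGraph()
for src, dst, weight in rows:
    graph.add_edge(src, dst, weight=weight)

# Weighted PageRank over the whole ecosystem; the scores become one of the
# search-ranking signals pushed to Neo4j alongside the nodes themselves.
scores = nx.pagerank(graph, weight='weight')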

We have to be cognizant of how we do a merge here. The graph database is live, and every day we push updates from Hive into the graph database. That’s a merge, and it is something we have to be quite cautious of.
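Neither the exact load query nor the schema is public, but a cautious daily merge with the official Python driver – assuming nodes keyed by a uuid property and 1.x-era imports – might be sketched as:

from neo4j.v1 import GraphDatabase  # 1.x-era Bolt driver import path

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))

# MERGE rather than CREATE: the graph is live, so the daily load has to
# update existing nodes in place instead of duplicating them.
MERGE_NODE = """
MERGE (n:Entity {uuid: $uuid})
SET n += $props
"""

def push_nodes(rows):
    with driver.session() as session:
        for row in rows:
            session.run(MERGE_NODE, uuid=row['uuid'], props=row['props'])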

From here, the flow forks into two directions. The nodes get pushed into Elasticsearch via a GraphAware plugin, which is based on transaction hooks; from there, Elasticsearch serves as our search engine. Finally, we use Flask – the lightweight Python web framework we use for our other data tools – for the web app. Results from Elasticsearch queries are fetched by the web server.

Additionally, results from Neo4j queries pertaining to connectivity are fetched by the web server via Neo4j, using that same driver.

Learn how Airbnb democratized their data discovery with a graph database.

Why did we choose Neo4j as our graph database?

There are four main reasons. First, our data represents a graph, so it felt logical to use a graph database to store the data. Second, it’s nimble. We wanted a really fast, performant system. Third, it’s popular; it’s the world’s number one graph database. The community edition is free, which is really super helpful for exploring and prototyping. And finally, it integrates well with Python and Elasticsearch, existing technologies we wanted to leverage.

Learn why Airbnb chose Neo4j's graph database.

There’s a lovely symbiotic relationship between Elasticsearch and Neo4j, courtesy of some GraphAware plugins. The first plugin lives in Neo4j and asynchronously replicates data from Neo4j to Elasticsearch. That means we don’t need to actively manage our Elasticsearch cluster: all our data persists, and we use Neo4j as the source of truth.

The second plugin actually lives in Elasticsearch and allows Elasticsearch to consult with the Neo4j database during a search. And this allows us to enrich search rankings by leveraging the graph topology. For example, we could sort by recently created, which is a property on the relationship, or most consumed, where we have to explore topology of the graph.

This is how we represent our data model. We defined a node label hierarchy as follows.

Check out Airbnb's node label hierarchy.

This hierarchy enables us to organize data in both Neo4j and Hive. The top-level :Entity label represents some base abstract node type, which I’ll explain later.

Let’s walk through a few examples here. Our schema was created in such a way that the nodes are globally unique in our database, by combining the set of labels and the locally scoped ID property.

First, we have a user who’s keyed by their LDAP username, then a table that’s keyed by the table name and finally a Tableau chart that’s keyed by the corresponding DB instance inside the Tableau database.

User name nodes examples from Airbnb.

These graph queries are heavily leveraged in the user interface (UI), and they need to be incredibly fast. We can efficiently match queries by defining per-label indices on the ID property, and we leverage them for fast access. Here, we’re explicitly forcing the use of the index because we’re using multiple labels.

Match queries using multiple labels.
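The query slide isn’t reproduced here, but a hedged reconstruction of a multi-label match with an explicit index hint, run through the official Python driver (the table ID is made up), would be:

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))

# The node carries multiple labels (:Entity:Table), so the hint pins the
# planner to the per-label index on the locally scoped id property.
QUERY = """
MATCH (n:Entity:Table)
USING INDEX n:Table(id)
WHERE n.id = $table_id
RETURN n
"""

with driver.session() as session:
    record = session.run(QUERY, table_id='core.bookings').single()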

Ideally, we’d love to have a more abstract representation of the graph, moving from local to global uniqueness. To achieve that, we leverage another GraphAware plugin: UUID. This plugin assigns each newly created entity a global UUID that cannot be mutated in any way. This gives us global uniqueness, and we can refer to any entity in the graph using just this one unique UUID property in addition to the entity label.

This helps us use PrimeBase queries, which leads to faster query and execution times. This is especially relevant when we do bulk loads. Every day we do a bulk load of data and we need that to be really performant.

Here’s this same sort of example as before. Now we’ve simplified this, so we can just purely match any entity using this UUID property, and it’s global.

See match queries.
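Again as a hedged reconstruction rather than the actual slide: with the UUID plugin in place, one query shape covers every entity.

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))

# Globally unique, immutable uuid property: no per-label query shapes needed.
QUERY = "MATCH (n:Entity) WHERE n.uuid = $uuid RETURN n"

with driver.session() as session:
    record = session.run(QUERY, uuid='0b9f-placeholder').single()  # placeholder uuid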

We have a RESTful API. In the first example, you can match a node based on its labels and IDs. And this is useful if you have like a slug type of URL. The second one, you can match a node based purely on the UUID. The third one is how we’d get a created relationship, based on leveraging these two UUIDs. The front-end uses these APIs, as covered in the next section.

Check out match node labels and IDs.
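The Dataportal’s API itself isn’t public; a minimal Flask sketch of those three lookup styles, with hypothetical routes and labels, might look like:

from flask import Flask, abort, jsonify
from neo4j.v1 import GraphDatabase

app = Flask(__name__)
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))

def run_single(query, **params):
    # Run a read query and return the single matching record, if any.
    with driver.session() as session:
        return session.run(query, **params).single()

@app.route('/api/node/<label>/<node_id>')
def node_by_label_and_id(label, node_id):
    # Slug-style lookup on label plus locally scoped ID. The label is checked
    # against a whitelist so it can be spliced into the query text safely.
    if label not in {'Table', 'Chart', 'Dashboard', 'User'}:
        abort(400)
    record = run_single(
        'MATCH (n:Entity:%s) WHERE n.id = $id RETURN n' % label, id=node_id)
    return jsonify(record['n'].properties) if record else abort(404)

@app.route('/api/entity/<uuid>')
def node_by_uuid(uuid):
    # Lookup purely on the globally unique UUID property.
    record = run_single('MATCH (n:Entity) WHERE n.uuid = $uuid RETURN n', uuid=uuid)
    return jsonify(record['n'].properties) if record else abort(404)

@app.route('/api/created/<creator_uuid>/<resource_uuid>')
def created_relationship(creator_uuid, resource_uuid):
    # Fetch the CREATED relationship between two entities via their UUIDs.
    record = run_single(
        'MATCH (a:Entity)-[r:CREATED]->(b:Entity) '
        'WHERE a.uuid = $a AND b.uuid = $b RETURN r',
        a=creator_uuid, b=resource_uuid)
    return jsonify(record['r'].properties) if record else abort(404)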

Designing the Front-end of the Dataportal


Chris Williams: I’m going to describe how we enable Airbnb employees to harness the power of our data resource graph through the web application.

The backends of data tools are often so complex that the design of the front-end is an afterthought. This should never be the case, and in fact, the complexity and data density of these tools makes intentional design even more critical.

One of our project goals is to help build trust in data. As users encounter painful or buggy interactions, these can chip away at their trust in your tool. On the other hand, a delightful data product can build trust and confidence. Therefore, with the Dataportal, we decided to embrace a product mindset from the start and ensure a thoughtful user interface and experience.

As a first step, we interviewed users across the company to assess needs and pain points around data resources and tribal knowledge. From these interviews, three overall user personas emerged. I want to point out that they span data literacy levels and many different use cases.

The first of these personas is Daphne Data. She is a technical data power user, the epitome of a tribal knowledge holder. She’s in the trenches tracing data lineage, but she also spends a lot of time explaining and pointing others to these resources.

Second, we have Manager Mel. Perhaps she’s less data literate, but she still needs to keep tabs on her team’s resources, share them with others, and stay up to date with other teams that she interacts with. Finally, we have Nathan New. He may be new to Airbnb, working with a new team, or new to data. In any case, he has no clue what’s going on and quickly needs to get ramped up.

Airbnb's Dataportal user personalities.

With these personas in mind, we built up the front end of the Dataportal to support data exploration, discovery and trust through a variety of product features. At a high level, these broadly include search, more in-depth resource detail and metadata exploration, and user-centric, team-centric and company-centric data.

We do not really allow free-form exploration of our graph as the Neo4j UI does. The Dataportal offers a highly curated view of the graph, which attempts to provide utility while maintaining guardrails, where necessary, for less data-literate employees.

Designing the Dataportal for exploration, discovery and trust.

The Dataportal is primarily a data resource search engine. Clearly, it has to have killer search functionality. We tried to embrace a clean and minimalistic design. This aesthetic allows us to maintain clarity despite all the data content, which adds a lot of complexity on its own.

We also tried to make the app feel really fast and snappy. Slow interactions generally disincentivize exploration.

At the top of the screen (see below) are search filters that are somewhat analogous to Google. Rather than images, news and videos, we have things like data resources, charts, groups, teams and people.

Data discovery and contextual search.

The search cards have a hierarchy of information. The overall goal is to help provide a lot of context to allow users to quickly gauge the relevancy of results. We have things like the name, the type. We highlight search terms, the owner of the resource, when it was last updated, the number of views, and so on. And we also try to show the top consumers of any given result set. This is just another way to surface relationships and provide a lot more context.

Continuing with this flow, from a search result, users typically want to explore a resource in greater detail. For this, we have content pages. Here is an example of a Hive table content page.

A hive table content page.

At the top of the page, we have a description linked to the external resource and social features, such as favoriting and pinning, so users can pin a resource to their team page. Below that, we have metadata about the data resource, including who created it, when it was last updated, who consumes it, and so on.

The relationships between nodes provide context. This context isn’t available in any of our other siloed data tools. It’s something that makes the Dataportal unique, tying the entire ecosystem together.

Another way to surface graph relationships is through related content, so we show direct connections to this resource. For a data table, this could be something like the charts or dashboards that directly pull from the data table.

We also have a lot of links to promote exploration. You can see who created this resource and find out what other resources that they work on.

The screen below highlights some of the features we built out specifically for exploring data tables. You can explore column details and value distributions for any table. Additionally, tracing data lineage is important, so we allow users to explore both the parent tables and the child tables of any given table.

We’re also really excited about being able to enrich and edit metadata on the fly: we add table descriptions and column contents, and these are pushed directly to our Hive metastore.

Hive metastore for Airbnb's Dataportal.

The screen below highlights our Knowledge Repo, which is where data scientists share analyses, dashboards and visualizations. We typically iframe these data tools. Each view generates a log, which our graph picks up; it trickles back into the graph and affects PageRank and view counts.

Airbnb Knowledge Repo analyses.

Helping Users, the Ultimate Holders of Tribal Knowledge


Users are the ultimate holders of tribal knowledge, so we created a dedicated user page, shown below, to reflect that.

On the left is basic contact information. On the right are resources the user frequently accesses, created or favorited, and groups to which they belong. To help build trust in data, we wanted to be transparent about it: you can look at the resources any person views, including what your manager views, and so on.

Along the lines of data transparency, we also made a conscious choice to keep former employees in the graph.

If we take George, the handsome intern that all the ladies talk about, he created a lot of data resources and he favorited things. If I wanted to find a cool dashboard that he made last summer, that I forgot the name of, this can be really relevant.

An example of data transparency with former employees tribal knowledge.

Another source of tribal knowledge is found within an organization’s teams. Teams have tables they query regularly, dashboards they look at and go-to metric definitions. We found that team members spend a lot of time telling people about the same resources, and they wanted a way to quickly point people to these items.

For that, we created group pages. The group overview below shows who’s in a particular team.

Group pages of tribal knowledge in Airbnb's Dataportal.

To enable curating content, we decided to borrow some ideas from Pinterest, so you can pin any content to a page. If a team doesn’t have any content that’s been curated, there’s a Popular tab. Rather than displaying an empty page, we can leverage our graph to inspect what resources the people on a given team use on a regular basis and provide context that way.

We leverage thumbnails for maximum context. We gathered about 15,000 thumbnails from Tableau, Knowledge Repo and Superset, our internal data tool. They’re a combination of API exports and headless-browser screenshots.

The screen below highlights the pinning and editing flows. On the left, similar to Pinterest, you can pin an item to a team page. On the right, you can customize and rearrange the resources on the team page.

Team page pinning of editing flows.

Finally, we have company metric data.

We found that people on a team typically keep a tight pulse on relevant information for their team. A lot of times, as the company grows larger, they’ll feel more and more disconnected from company-level, high-level metrics. For that, we created a high-level Airbnb dashboard where they can explore up-to-date company-level data.

Airbnb dashboard company level data.

Front-End Technology Stack


Our front-end technology stack is similar to what many teams use at Airbnb.

We leverage modern JavaScript, ES6. We use node package manager (NPM) to manage package dependencies and build the application. We use an open source package called React from Facebook for generating the Document Object Model (DOM) and the UI. We use Redux, which is an application state tool. We use a cool open source package from Khan Academy called Aphrodite, which essentially allows you to write Cascading Style Sheets (CSS) in JavaScript. We use ESLint to enforce JavaScript’s style guide, which is also open source from Airbnb, and Enzyme, Mocha and Chai for testing.

Airbnb's Dataportal technology stack.

Challenges in Building the Dataportal


We faced a number of challenges in building the Dataportal.

It is an umbrella data tool that brings together all of our siloed data tools and generates a picture of the overall ecosystem. The problem with this is that any umbrella data tool is vulnerable to changes in the upstream dependencies. This can include things on the backend like schema changes, which could break our graph generation, or URL changes, which would break the front-end.

Additionally, data-dense design, creating a UI that’s simple and still functional for people across a large number of data literacy levels, is challenging. To complicate this, most internal design patterns aren’t built for data-rich applications. We had to do a lot of improvising and creation of our own components.

We have a non-trivial Git-like merging of the graph that happens when we scrape everything from Hive and then push that to production in Neo4j.

The data ecosystem is quite complex, and for less data literate people, this can be confusing. We’ve used the idea of proxy nodes, in some cases, to abstract some of those complexities. For example, we have lots of data tables, which are often replicated across different clusters. Non-technical users could be confused by this, so we actually accurately model it on the backend, and then expose a simplified proxy node on the front end.

Airbnb's Dataportal challenges.

Future Directions for Airbnb and the Graph Database


We’re considering a number of future directions.

The first is a network analysis that finds obsolete nodes. In our case, this could be things like data tables that haven’t been queried for a long time and are costing us thousands of dollars each month. It could also be critical paths between resources.

One idea that we’re exploring is a more active curation of data resources. If you search for something and you get five dashboards with the same name, it’s often hard, if you lack context, to tell which one is relevant to you. We have passive mechanisms like PageRank and surfacing metadata that would, hopefully, surface more relevant results. We are thinking about more active forms of certification that we could use to boost results in search ranking.

We’re also excited about moving from active exploration to delivering more relevant updates and content suggestions through alerts and recommendations. For example, “Your dashboard is broken,” “This table you created hasn’t been queried for several months and is costing us X amount,” or “This group that you follow just added a lot of new content.”

And then, finally, what feature set would be complete without gamification?

We’re thinking about providing fun ways to give content producers a sense of value by telling them, for example, “You have the most viewed dashboard this month.”

The future of Airbnb's Dataportal.


Inspired by this talk? Click below to register for GraphConnect 2018 on September 20-21 in Times Square, New York City – and connect with leading graph experts from around the globe.

Get My Ticket

Which GraphConnect Training Should You Take? [Quiz]

Take this six-question quiz to determine which Neo4j training you should take at GraphConnect 2018
It’s that time of year again.

In less than five weeks, Emil will be getting on stage at GraphConnect 2018 in the heart of Times Square, NYC and announcing the new release of….wow, I think we’re getting a little ahead of ourselves, aren’t we?

Let’s refocus: You know you’re going to attend GraphConnect, but you haven’t bought your tickets yet. No time like the present, right?

So you head over to GraphConnect.com to get yours…but wait. As your mouse (or finger) hovers over that training option you suddenly realize:



…but which Neo4j training should you sign up for? There’s like thirteen to choose from (more than ever before).

It’s time to make that decision a little easier.

Find the Neo4j Training That’s Right for You


Luckily for you, we’ve put together an awesome, six-question quiz to make your choice clear. Click below to get started!



No quiz is perfect, but we hope you’ve found the droids…err, the Neo4j training…that you’re looking for. Click here to take the quiz again or tweet me with your angry comments!

Neo4j Training Classes Offered at GraphConnect 2018


This year, we’ve changed up our entire training selection with mostly half-day courses, allowing you to take two training classes in one day (in most cases).

Here’s this year’s line-up, organized by role:

    • For beginners
    • For data scientists and BI analysts
    • For architects, DBAs and data modelers
    • For developers

No matter what Neo4j training you choose, we wish you the best of luck during your training session and we hope you enjoy all of the great speakers (among other reasons) at GraphConnect 2018!

See You Soon!


Of course, if you have more questions about Neo4j training at GraphConnect 2018, you can always reach out to the friendly team at graphconnect@neo4j.com. They’ll help you sort out any questions, concerns or last-minute details that require our attention.


What are you waiting for?
Click below to register for GraphConnect 2018 on September 20-21, 2018 in Times Square, New York City – and connect with leading graph experts from around the globe.


Sign Me Up

Interchangeable Parts: 5-Minute Interview with Preston Hendrickson, Principal Systems Analyst at CALIBRE

Check out this quick interview with Preston Hendrickson of CALIBRE.
“It’s easy to teach people to use Neo4j. It’s hands-on training, not death by PowerPoint,” said Preston Hendrickson, Principal Systems Analyst at CALIBRE.

CALIBRE works with large government customers, including the U.S. Army, where maintenance, operation and support costs of equipment (depending on the program and program longevity) represent as much as 80 percent of total lifecycle costs and a single tank has about 10 million parts to track.

Check out this quick interview with Preston Hendrickson of CALIBRE.

In this week’s five-minute interview (conducted at GraphTour DC) we discuss how CALIBRE has replaced recursive SQL queries with Neo4j, and is now able to train analysts right alongside developers.

Talk to us about how you use Neo4j at CALIBRE.


Preston Hendrickson: One of our customers asked us to do a deep dive into parts and ordering. For example, say you have a chair. That chair has legs, a back, a seat and armrests. And some of those parts are interchangeable. We need to know which parts can also fit on other chairs. With that information, we can build chairs, swap parts out and so forth.

Why did you choose Neo4j for the project?


Hendrickson: We chose Neo4j because it allows us to take those parts and actually go down as many levels as required. Chairs are a minor example; it could be a car. In a car, you have hundreds or thousands of parts, and we need to know what interchangeable parts there are across models and across different vendors, like an auto parts store.

In cases like this, Neo4j becomes the better candidate so that we don’t have to write recursive SQL or dynamic SQL. We just write queries for Neo4j and traverse as many levels as we want and trace relationships, which is a lot faster than writing code or querying a database.
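The interview doesn’t include the actual queries; as a sketch of the kind of arbitrary-depth traversal that replaces recursive SQL – labels, relationship types and part numbers here are all invented – it might look like:

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))

# HAS_PART*1.. follows the bill of materials down to any depth -- the query
# that would need recursive (or dynamically generated) SQL in an RDBMS.
QUERY = """
MATCH (assembly:Part {part_no: $part_no})-[:HAS_PART*1..]->(component:Part)
OPTIONAL MATCH (component)-[:INTERCHANGEABLE_WITH]-(alternative:Part)
RETURN component.part_no AS part, collect(alternative.part_no) AS substitutes
"""

with driver.session() as session:
    for record in session.run(QUERY, part_no='CHAIR-001'):
        print(record['part'], record['substitutes'])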

Can you talk to me about some of your most interesting or surprising results you had while using Neo4j?


Hendrickson: The number one thing we’ve found is how easy it is to teach people to use Neo4j. Anytime you change technology, the first thing you do is sit people in a room and train them to death – death by PowerPoint. Neo4j was mainly hands-on training.

We actually had not only developers and others with strong, high-tech skills, we had people across the gamut. We included anyone who was new to the company. We had analysts in the room who were able to pick up Neo4j in the exact same way that developers did.

Now we’re all in one accord, and we can all share the graph database.

If you could start over with Neo4j, taking everything you know now, what would you do differently?


Hendrickson: The first thing I would do is, before touching it, I would try to train my brain and not think about it like a traditional RDBMS. It took me a couple of weeks to figure out that I should not model like a traditional database, with third normal form and all that stuff.

I did not get that message until well into the 34th time rebuilding a graph. I tried to do it that way. If I had to start over, I’d work on understanding what NoSQL, no-schema means versus building hands-on like I’m used to.

What do you see as the future of graphs in your projects?


Hendrickson: In our area, instead of just data retrieval, we want to move more into data science. We are looking into using Python a lot more to connect to the database directly versus taking data, exporting it into something else, having Python read it, getting answers, and pushing data back in. We’re trying to integrate those processes.

Anything else you want to add or say?


Hendrickson: This is exciting for us. It is a new area, and we’re trying to get more of our analytical teams more involved. At the same time, other people are watching those using Neo4j, and they’ve been getting more questions, and it spreads. We like it. We’re having a ball here.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


Want to learn more on how relational databases compare to their graph counterparts? Get The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with your relational database.

Get the Ebook

This Week in Neo4j – $80 Million Series E, New Neo4j Monitoring Tool, Cyber Attack Graphs, Spring Data Neo4j Tutorial


Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.

This week we have Neo4j’s $80 Million Series E, Cyber Attack Graphs, an interview with the author of the upcoming book Graph-Powered Machine Learning, a new monitoring tool designed with Neo4j Administrators in mind, and a detailed tutorial on the Neo4j Java Driver, Neo4j-OGM, and Spring Data Neo4j.


This week’s featured community member is Roland Guijt, freelance software developer and trainer at R.M.G. Solutions.

Roland Guijt – This Week’s Featured Community Member

Roland has been part of the Neo4j community for several years, and is the author of the popular Pluralsight Graph databases and Neo4j course.

This week Roland updated the course to 3.x and redid the look and feel. The course now includes new sections on the Bolt protocol that was introduced in Neo4j 3.0 and the Neo4j Desktop introduced in Neo4j 3.3.

On behalf of the Neo4j community, thanks for all your work Roland!

Neo4j Closes $80 Million in Series E Funding


This week Neo4j closed an $80 million Series E funding round led by One Peak Partners and Morgan Stanley Expansion Capital.

Neo4j has now raised a total of $160 million in growth funding – the largest cumulative investment into a graph database company.

This week’s funding will help Neo4j continue to deliver customer success with graph-powered business applications, while increasing investment on popular use cases, including graph-enabled artificial intelligence (AI) and machine learning (ML) systems.

Monitoring Neo4j with Halin


David Allen has written a blog post about Halin, a tool he’s been working on to make it easier to monitor Neo4j databases and clusters. Halin goes beyond what standard monitoring tools can offer as it’s been created with Neo4j administrators in mind.

One really cool feature is the Diagnostic Advisor which gathers metadata about your Neo4j instance on the fly, and then runs it through a series of rules which make suggestions about what’s good, what could use improvement, and where there are errors.

Spring Data Neo4j Tutorial, gREST and GraphQL releases, Visualising Projected Graphs


    • In Michael Simons’ latest blog post he explores the different options for accessing, manipulating, adding and deleting data from Neo4j using the Neo4j Java Driver, Neo4j-OGM, and Spring Data Neo4j.
    • The first stable version of gREST has been released after being heavily refactored to make it optimized and fast enough for production environments. gREST is a RESTful API development framework on top of Python, Flask, Neo4j and Neomodel. Its primary purpose is to ease development of RESTful APIs with little effort and a minimal amount of code.
    • Will Lyon released version 1.0.5 of the neo4j-graphql.js library, which now includes the ability to specify which type definitions to include in the auto-generated Query and Mutation types. You can read more about these new features in the docs.
    • I wrote a blog post showing how to visualise projected graphs using the APOC library.

5-Minute Interview with Dr. Alessandro Negro, Chief Scientist at GraphAware


This week Bryce interviewed Dr. Alessandro Negro as part of the 5-minute interview series.

They talk about Alessandro’s experiences finding the structure in text documents using Natural Language Processing, and the Hume product that has been built from this work.

Alessandro also discusses his upcoming book, Graph-Powered Machine Learning, and his thoughts on the future of graphs.

Big-Data Architecture for Cyber Attack Graphs


Steven Noel, Eric Harley, Kam Him Tam, and Greg Gyor have published their paper Big-Data Architecture for Cyber Attack Graphs.

They propose a new Neo4j based modeling framework for mapping vulnerability paths through networks and associating them with observed attacker activities.

They import network relationships and events, such as topology, firewall policies, host configurations, vulnerabilities, attack patterns, intrusion alerts, and logs, and then execute graph analytics over the data via the Cypher query language.

Next Week


What’s happening next week in the world of graph databases?

    • November 7th 2018: How Graphs impact AI and Data Science (NoSQL – São Paulo)
    • November 8th 2018: GraphTalk Madrid (Neo4j España)

Tweet of the Week


My favourite tweet this week was by Malmö Startups:

Don’t forget to RT if you liked it too.

That’s all for this week. Have a great weekend!

Cheers, Mark

Agile Open Source Intelligence: 5-Minute Interview with Frederick Kagan

Check out this 5-minute interview with Frederick Kagan, Director of the Critical Threats Project.
“One thing that a relational database isn’t, is relational,” said Frederick Kagan, Director of the Critical Threats Project at the American Enterprise Institute.

All intelligence organizations are interested in seeing data connections and traversing networks. The Critical Threats Project does open source intelligence, mining the internet and social media to produce intelligence analysis based on a rigorously cultivated dataset. From that analysis, they generate insights into what’s going on in conflict areas around the globe to share with policymakers, the media and the public.

In this week’s five-minute interview (conducted at GraphConnect 2018 in NYC), we discuss why Frederick Kagan chose Neo4j as the backbone for its intelligence analytics.



How do you use Neo4j?


Frederick Kagan: At the Critical Threats Project, we have a highly cultivated dataset of events and entities, people, places and things and their relationships to one another based on data that we pull from the Internet and from social media.

That database, its integrity and our ability to access it, visualize it and traverse it, are central to our ability to perform our analysis and forecasting and derive insight from our information.

We use Neo4j as the backbone for the information system in which that data resides. We chose Neo4j for a number of reasons, but principally because we think that a graph database is the right model for the intelligence community and for any kind of intelligence organization moving forward, and we think that Neo4j has the best graph database structure out there.

What is a major problem Neo4j has solved for you?


Kagan: One of our problems is that we operate extensively on unstructured data.

The team is going out and reading articles and so forth in local media, and we need to bring that into a database. But we also interact with a lot of structured databases provided by various non-governmental and governmental organizations, as well as from others who do the same type of thing we do. All of the data comes in different structures. And we had legacy data from another system that was in yet another structure.

We’ve learned that it’s very important to give analysts a single graphical user interface by which to interact with their data. The more clicks you put between a user and insight, the less insight you actually get and the more users run away from a tool.

Being able to bring all of those disparate data sources together in a single place and have users interact with them seamlessly is vital. The graph technology that Neo4j has makes that very, very easy in a way that traditional SQL databases make it very, very hard.

And the key to this is that Neo4j allows you to have multiple, overlapping ontologies coexisting simultaneously in the same dataset without degrading performance. Whereas in a traditional SQL database, if you have a new data structure, or even if you just add a new type of property, you have to add it and then you have to reindex all of the tables and all the JOIN tables and so forth.

With Neo4j, you don’t have to do any of that.

Basically, you add a new label or a new set of labels and bring it straight in. I’ve done that repeatedly and it’s very easy. Getting the data into the database is the easy part. And the only part that I then had to deal with a little bit is how to present it on the front end to the user, given that it’s in a slightly different structure. But the ability to munge data like that is incredibly important for the kind of work we do.
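As an illustration of that point (the label and property names here are invented, not from the interview), overlaying a new ontology can be a single additive statement:

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))

with driver.session() as session:
    # Stamp an extra label onto existing nodes; they join the new ontology
    # immediately, with no schema migration or reindexing step.
    session.run("MATCH (n:Report) WHERE n.source = 'new-feed' SET n:Sighting")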

Check out this 5-minute interview with Frederick Kagan, Director of the Critical Threats Project.

Why is flexibility so important to you?


We are operating in a very dynamic environment. We are exploring how to do what it is that we do. We’re thinking about it all the time. And we’re coming up with new ideas for how to organize our data, how to categorize it, how to arrange it, how to make links and so forth.

Having the ability to change the ontology that we’re using on the fly – and in a way that’s transparent to the user and doesn’t require taking databases offline or indexing or reindexing or anything like that – is incredibly important to sustaining the dynamism of our own development of our tech stack and of our analytical workflow and processes.

Why is graph technology valuable to the intelligence community?


Kagan: I think graph technology is really the ideal technology for the backbone for intelligence analytical systems. There are a couple of reasons for that. One is the ease with which it supports data munging and bringing together lots of different data sources, which is a huge problem in the intelligence community and generally.

The other, of course, is that all intelligence organizations, whether business or government, are interested in seeing network connections and traversing networks. And that is, of course, the thing that graph usually sells and is the most obvious thing, and it’s quite valid.

I have interacted with other systems that laid what looked like a graph GUI on top of a SQL database, and then I have used Neo4j to do similar things. It is unquestionable that you get much better performance when you’re doing graph traversals and when you’re reaching out to different degrees of separation with a true graph database versus a SQL database that is presenting you with a graph-like GUI.

And since understanding network diagramming and understanding network relationships and traversing graphs rapidly is going to continue to be a huge problem for the intelligence community and all of us, I think that graph is the natural place where the community should migrate to.

What made you choose Neo4j?


Kagan: I came to the task of writing the software that we use in a strange way. I’m a hobbyist programmer. I’ve never taken a computer science course. My college appreciation job years ago was writing Fortran 77 code for a geophysicist.

I picked up Python a few years ago because I thought it would be helpful and then I found myself for various reasons having to write code that will allow us to interact with a dataset and bring that dataset into something. So I decided to bring it into Neo4j.

To my amazement, learning how to use Neo4j, how to get the data into Neo4j, and how to write Cypher queries, was the easy part. All the other stuff that I had to do was the hard part. But I found Neo4j to be an incredibly user-friendly interface and Cypher to be an incredibly user-friendly and intuitive way of interacting with the data that made it super easy for me, as a novice programmer, to bring our data into this data structure and start interacting with it.

I’ve also found Neo4j to be incredibly reliable. Which is good, because I’m the backend engineer as well. I’ve had to do virtually nothing to maintain a healthy dataset for more than a year. It’s used at any given time by 40 or 50 analysts and has eight or nine million nodes in it of different types. It’s been incredibly stable and reliable with very little maintenance.

So, from all of those perspectives, as a tool for someone who is relatively inexperienced, working with Neo4j has been a dream.

What has been your most unexpected use of Neo4j?


Kagan: We were so excited about Neo4j and how we were using it for our backend, that when we had to redesign our website, we talked with the vendor who was working on that for us and we persuaded them to use Neo4j as the backend for the website. Instead of using the traditional WordPress SQL backend data store on the website, we’re running a Neo4j database behind the website that is also going to facilitate the integration of our research data directly into visualizations on the site. And it’s a really good backbone for a website because of the relational aspect of it.

Funny thing is, the one thing that a relational database isn’t, is relational in the sense we mean when we talk about graph relationships.

I think the company that did our website was very excited about learning how to use Neo4j and bringing it into the website. I think that that’s another application that also has a lot of interest.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


Curious about using graphs in your business?
Download this white paper, The Top 5 Use Cases of Graph Databases, and discover how to tap into the power of graphs for the connected enterprise.


Read the White Paper

Holiday fun with Neo4j


Looking for something fun to do during the holidays? Here are a few suggestions for some new cool Neo4j things that you can play around with.

A very recent addition to the Neo4j space is the JRuby library Neo4jr-social by Matthew Deiters:

Neo4jr-Social is a self-contained HTTP REST + JSON interface to the graph database Neo4j. Neo4jr-Social supports simple dynamic node creation, building relationships between nodes and also includes a few common social networking queries out of the box (e.g., LinkedIn degrees of separation and Facebook friend suggestions), with more to come. Think of Neo4jr-Social to Neo4j as Solr is to Lucene.

Neo4jr-social is built on top of Neo4jr-simple:

A simple, ready to go JRuby wrapper for the Neo4j graph database engine.

There’s also the Neo4j.rb JRuby bindings by Andreas Ronge which have been developed for quite a while by multiple contributors.

Staying in Ruby land, there’s also some visualization and other social network analysis stuff going on.

Looking for something in Java? Then you definitely want to take a look at jo4neo by Taylor Cowan:

Simple object mapping for neo. No byte code interweaving, just plain old reflection and plain old objects.

There’s apparently a lot of work going on right now in the Django camp to enable support for SQL and NOSQL databases alike. Tobias Ivarsson (who’s the author and maintainer of the Neo4j Python bindings) recently implemented initial support for Neo4j in Django. Read his post Seamless Neo4j integration in Django for a look at what’s new.

One more recent project is the Neo4j plugin for Grails. There are already some projects out there using it. We want to make sure Neo4j is a first-class Grails backend so expect more noise in this area in the future.

You can find (some of the) projects using Neo4j on the Neo4j In The Wild page. From the front page of the Neo4j wiki you’ll find even more language bindings, tutorials and other things that will support you when playing around with Neo4j!

Happy Holidays and Happy Hacking wishes from the Neo4j team!

Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today. Download My Ebook

Modeling Categories in a Graph Database

Storing hierarchical data can be a pain when using the wrong tools.

However, Neo4j is a good fit for these kind of problems, and this post will show you an example of how it can be used.

To top it off, today it’s time to have a look at the Neo4j Python language bindings as well.

Introduction


A little background info for newcomers: Neo4j stores data as nodes and relationships, with key-value style properties on both. Relationships connect two different nodes to each other, and are both typed and directed.

Relationships can be traversed in both directions (the direction can also be ignored when traversing if you like). You can create any relationship types; they are identified by their name.

For a quick introduction to the Neo4j Python bindings, have a look at the Neo4j.py component site. There are also slides and video from a PyCon 2010 presentation by Tobias Ivarsson of the Neo4j team, who also contributed the Python code for this blog post.

If you take a look at a site like stackoverflow.com you will find many questions on how to store categories or, generally speaking, hierarchies in a database.

In this blog post, we’re going to look at how to implement something like what’s asked for here using Neo4j. However, using a graph database will allow us to bring the concept a bit further.

Data Model


It may come as a surprise to some readers, but even though we’re using a graph database here, we’ll use a common Entity-Relationship Diagram.

The entities we want to handle in this case are categories and products. Products hold attribute values, and we want to be able to define types and constraints on these attributes. The attributes that products can hold are defined on categories and inherited by all descendants. Products, categories and attribute types are modeled as entities, while the attributes have been modeled as relationships in this case. Categories may contain subcategories and products.

So this is the data model we end up with:



What can’t be expressed nicely in the ER-Diagram are the attribute values, as the actual names of those attributes are defined as data elsewhere in the model.

This mix of metadata and data may be a problem when using other underlying data models, but for a graph database, this is actually how it’s supposed to be used. When using an RDBMS with its underlying tabular model, the Entity-Attribute-Value model is a commonly suggested way of dealing with the data/metadata split. However, this solution comes with some downsides and hurts performance a lot.

That was it for the theoretical part, let’s get on to the practical stuff!

Node Space


What we want to do is to transfer the data model to the node space – that’s Neo4j lingo for a graph database instance, as it consists of nodes and relationship between nodes.

What we’ll do now is to simply convert some of the terminology from the Entity-Relationship model to the Neo4j API:
ER-model Neo4j
Entity Node
Relationship Relationship
Attribute Property
That wasn’t too hard, was it?! Let’s put some example data in the model and have a look at it (click for big image):



The image above gives an overview; the rest of the post will get into implementation details and good practices that can be useful.

Getting to the details


When a new Neo4j database is created, it already contains one single node, known as the reference node. This node can be used as a main entry point to the graph. Next, we’ll show a useful pattern for this.

In most real applications you’ll want multiple entry points to the graph, and this can be done by creating subreference nodes. A subreference node is a node that is connected to the reference node with a special relationship type, indicating its role. In this case, we’re interested in having a relationship to the category root and one to the attribute types. So this is how the subreference structure looks in the node space:



Now someone may ask: Hey, shouldn’t the products have a subreference node as well?! But, for two reasons, I don’t think so:
    1. It’s redundant as we can find them by traversing from the category root.
    2. If we want to find a single product, it’s more useful to index them on a property, like their name. We’ll save that one for another blog post, though.

Note that when using a graph database, the graph structure lends itself well to indexing.

As the subreference node pattern is such a nice thing, we added it to the utilities. The node is lazily created the first time it’s requested. Here’s what’s needed to create an ATTRIBUTE_ROOT typed subreference node:

import neo4j
from neo4j.util import Subreference
attribute_subref_node = Subreference.Node.ATTRIBUTE_ROOT(graphdb)

… where graphdb is the current Neo4j instance. Note that the subreference node itself doesn’t have a “node type”, but is implicitly given a type by the ATTRIBUTE_ROOT typed relationship leading to the node.

The next thing we need to take care of is connecting all attribute type nodes properly with the subreference node.

This is simply done like this:

attribute_subref_node.ATTRIBUTE_TYPE(new_attribute_type_node)

Always doing this when adding a new attribute type makes the nodes easily discoverable from the ATTRIBUTE_ROOT subreference node:



Similarly, we want to have a subreference node for categories, and in this case we also want to add a property to the subreference node. Here’s how this looks in Python code:

category_subref_node = Subreference.Node.CATEGORY_ROOT(graphdb, Name="Products")

This is how it will look after we added the first actual category, namely the “Electronics” one:



Now let’s see how to add subcategories.

Basically, this is what’s needed to create a subcategory in the node space, using the SUBCATEGORY relationship type:
computers_node = graphdb.node(Name="Computers")
electronics_node.SUBCATEGORY(computers_node)




To fetch all the direct subcategories under a category and print their names, all we have to do is to fetch the relationships of the corresponding type and use the node at the end of the relationship, just like this:

for rel in category_node.SUBCATEGORY.outgoing:
  print rel.end['Name']

There’s not much to say regarding products, the product nodes are simply connected to one category node using a PRODUCT relationship:



But how to get all products in a category, including all its subcategories? Here it’s time to use a traverser, defined by the following code:

class SubCategoryProducts(neo4j.Traversal):
  types = [neo4j.Outgoing.SUBCATEGORY, neo4j.Outgoing.PRODUCT]
  def isReturnable(self, pos):
      if pos.is_start: return False
      return pos.last_relationship.type == 'PRODUCT'

This traverser will follow outgoing relationships for both SUBCATEGORY and PRODUCT type relationships. It will filter out the starting node and only return nodes reached over a PRODUCT relationship.

This is then how to use it:

for prod in SubCategoryProducts(category_node):
  print prod['Name']

At the core of our example is the way attribute definitions are added to the categories. Attributes are modeled as relationships between a category and an attribute type node. The attribute type node holds information on the type – in our case only a name and a unit – while the relationship holds the name, a “required” flag and, in some cases, a default value as well.
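As a sketch in the same Python bindings – assuming relationship properties can be passed as keyword arguments, in the style of the node-creation calls above – defining a hypothetical “Weight” attribute on the Computers category might look like:

# The attribute type node carries the shared name and unit ...
weight_type_node = graphdb.node(Name='Weight', Unit='kg')
attribute_subref_node.ATTRIBUTE_TYPE(weight_type_node)

# ... while the ATTRIBUTE relationship from the category holds the
# attribute name, the "required" flag and an optional default value.
computers_node.ATTRIBUTE(weight_type_node, Name='Weight', Required=True)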

From the viewpoint of a single category, this is how it is connected to attribute types, thus defining the attributes that can be used by products down that path in the category tree:



Our last code sample will show how to fetch all attribute definitions which apply to a product. Here we’ll define a traverser named categories which will find all categories for a product. The traverser is used by the attributes function, which will yield all the ATTRIBUTE relationships.

A simple example of usage is also included in the code:

def attributes(product_node):
  """Usage:
  for attr in attributes(product):
      print attr['Name'], " of type ", attr.end['Name']
  """
  for category in categories(product_node):
      for attr in category.ATTRIBUTE:
          yield attr

class categories(neo4j.Traversal):
  types = [neo4j.Incoming.PRODUCT, neo4j.Incoming.SUBCATEGORY]
  def isReturnable(self, pos):
      return not pos.is_start

Let’s have a final look at the attribute types. Seen from the viewpoint of an attribute type node things look this way:



As the image above shows, it’s really simple to find out which attributes (or categories) are using a specific attribute type. This is typical when working with a graph database: connect the nodes according to your data model, and you’ll be fine.

Wrap-up


Hopefully you had some fun diving into a bit of graph database thinking! These should probably be your next steps forward:



Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today. Download the Ebook

Nigel Small Discusses Py2neo

After some considerable mocking from our good man Jim Webber, developer and architect Nigel Small started playing around with Neo4j.

His conclusion:

“In the end, I came to the conclusion that designing a graph database has far more in common with OO design than it does with relational database design.”

Be sure to visit Py2neo, Nigel’s project that provides bindings between Python and Neo4j via its RESTful web service interface. Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today. Download My Ebook

From neo4j import GraphDatabase


First of all, we’re really sorry. We have been saying that Python support for the embedded database is coming in “a few weeks” or “next month” for over half a year now, and so far, you have waited patiently, and you have waited in vain.
We promise not to give promises we can’t keep again, and we hope y’all know that we love Python just as much as the next guy.
Now, finally, the absolutely latest and greatest version of the embedded Neo4j database works in Python, and we’ve put a bunch of effort into ensuring it stays that way. The new bindings are constantly tested against each new build of the database, and are set up to deploy to PyPi as often as we all like them to.
The API is very similar to the original neo4j.py API. We also borrowed some of the API methods introduced in neo4j-rest-client, to make switching between the two as easy as possible.
This is a first release, so there may still be bugs lurking – please make sure to report any that you encounter and ideas for improvements to the project issue tracker!
Quick look
Here is a quick look at how you use neo4j-embedded.
from neo4j import GraphDatabase

db = GraphDatabase('/my/db/location')

with db.transaction:
    oscar = db.node(name='Oscar Wilde')
    jacob = db.node(name='Jacob')

    # Create a relationship
    oscar.impressed_by_blogging_skills_of(jacob)
db.shutdown()
Requirements
The new bindings are tested on CPython 2.7.2 on Windows and Linux, but should work on Python 2.6 branches as well.
You’ll need JPype installed to bridge the gap to Java land, details about how to set that up can be found in the installation instructions.
Jython support is on the todo list, but because Neo4j uses Java’s ServiceLoader API (which does not currently work in Jython) it will have to wait until we find a good workaround.
Getting started
Full instructions for how to install and get started can be found in the Neo4j Manual. For feedback, hints and contributions, don’t hesitate to ask on the Neo4j Forums.
Happy Hacking!
Heart symbol from http://dryicons.com. Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today. Download My Ebook

Py2neo 1.6


Hi all,

It’s a weird thought that although Neo4j has been part of my life for well over two years, I’ve only met in person a few of the people that I know from its community. Thanks to the wonderful generosity of Emil and co though, that will soon change as I’ll be jetting over to San Francisco for GraphConnect, giving me a chance to meet both the Neo guys and my fellow driver authors.

The timing is also pretty good as I’ve just released Py2neo 1.6, which introduces one of the most requested features of recent months: node labels. As most Neophiles will know, labels allow nodes to be tagged with keywords that can be used for categorisation and indexing. Adding labels to a node in Py2neo is straightforward with the add_labels method:
>>> from py2neo import neo4j, node
>>> graph_db = neo4j.GraphDatabaseService()
>>> alice, = graph_db.create(node(name="Alice"))
>>> alice.add_labels("Female", "Human")
The set_labels and remove_labels methods similarly allow labels to be replaced or deleted, and get_labels returns the set of labels currently defined. The GraphDatabaseService.find method can then be used to gather up all the nodes with a particular label and iterate through them:

>>> for node in graph_db.find("Human"):
...     print node["name"]

Aside from labels, the biggest change in the 1.6 release is a complete rewrite of the underlying HTTP/REST mechanism. In order to achieve better support for streaming responses, it was necessary to rip out the simple mechanism that had been with Py2neo since the early days and build a more comprehensive layer from the ground up. Incremental JSON decoding is a key feature that allows server responses to be handled step by step instead of only after the response has been completely received. This new layer has grown into a separate project, HTTPStream, but is embedded into Py2neo to avoid dependencies.

But what advantages does HTTPStream give to Py2neo-based applications? Well, it’s now possible to incrementally handle the results of Cypher queries and batch requests, as well as those from a few other functions, such as match. These functions now provide result iterators instead of full result objects. Here’s an example of a Cypher query streamed against the data inserted above:

>>> query = neo4j.CypherQuery(graph_db, "MATCH (being:Human) RETURN being")
>>> for result in query.stream():
...     print result.being["name"]

Neotool has received some love too. The command line variant of Py2neo now fully supports Unicode, provides facilities for Cypher execution, Geoff insertion and XML conversion, as well as options for HTTP authentication. The diagram below shows the conversion paths now available:
 
For a quick demonstration of the XML conversion feature in action, check out this web service. Another good place for a neotool overview is my recent lightning talk from the London Graph Café.

So what isn’t included? Cypher transactions are the main omission from the Neo4j 2.0 feature set and have been deliberately left out until a few major technical challenges have been overcome. Other than that, Py2neo 1.6 is the perfect companion to Neo4j 2.0 and well worth a try!

Py2neo 1.6 is available from PyPI, the source is hosted on GitHub and the documentation at ReadTheDocs. For a full list of changes, have a peek at the release notes.

/Nigel Small (@technige)
Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today. Download My Ebook

Building a Python Web Application Using Flask and Neo4j

Flask, a popular Python web framework, has many tutorials available online which use an SQL database to store information about the website’s users and their activities.

While SQL is a great tool for storing information such as usernames and passwords, it is not so great at allowing you to find connections among your users for the purposes of enhancing your website’s social experience.

The quickstart Flask tutorial builds a microblog application using SQLite. 

In my tutorial, I walk through an expanded, Neo4j-powered version of this microblog application that uses py2neo, one of Neo4j’s Python drivers, to build social aspects into the application. This includes recommending similar users to the logged-in user, along with displaying similarities between two users when one user visits another user’s profile.

My microblog application consists of Users, Posts, and Tags modeled in Neo4j:

http://i.imgur.com/9Nuvbpz.png
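To make the model concrete, here is a minimal hedged sketch (the username, post text and tag are invented; the py2neo calls follow the same 1.x API that appears elsewhere in this archive) that creates one user, one post and one tag wired up in this shape:

from py2neo import neo4j

graph = neo4j.GraphDatabaseService()  # defaults to http://localhost:7474/db/data/

query = """
MERGE (u:User {username:'nicole'})
MERGE (t:Tag {name:'neo4j'})
CREATE (p:Post {text:'Hello graph world'})
CREATE (u)-[:PUBLISHED]->(p)
CREATE (t)-[:TAGGED]->(p)
"""
neo4j.CypherQuery(graph, query).run()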


With this graph model, it is easy to ask questions such as:

“What are the top tags of posts that I’ve liked?”

MATCH (me:User)-[:LIKED]->(post:Post)<-[:TAGGED]-(tag:Tag)
WHERE me.username = 'nicole'
RETURN tag.name, COUNT(*) AS count
ORDER BY count DESC

“Which user is most similar to me based on tags we’ve both posted about?”

MATCH (me:User)-[:PUBLISHED]->(:Post)<-[:TAGGED]-(tag:Tag),
      (other:User)-[:PUBLISHED]->(:Post)<-[:TAGGED]-(tag)
WHERE me.username = 'nicole' AND me <> other
WITH other,
     COLLECT(DISTINCT tag.name) AS tags,
     COUNT(DISTINCT tag) AS len
ORDER BY len DESC LIMIT 3
RETURN other.username AS similar_user, tags
Links to the full walkthrough of the application and the complete code are below.

Watch the Webinar:





Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today.

Polyglot Persistence Case Study: Wanderu + Neo4j + MongoDB

Solution Architectural Diagram of Polyglot Persistence for Wanderu between Neo4j and MongoDB

Every language and data storage solution has its strengths. After all, no single solution is the most performant and cost-effective for every possible task in your application. In order to tap into the varying strengths of different data storage solutions, your application needs to take advantage of polyglot persistence. That’s exactly what Wanderu did when building their meta-search travel website – and here’s how they did it:

The Technical Challenge

Wanderu provides the ability for consumers to search for bus and train tickets, with journeys combining legs from multiple different transportation companies. The route data is stored in JSON, making a document storage engine like MongoDB a great solution for their route leg data storage. However, they also needed to be able to find the optimal path from origin to destination. This is perfect for a graph database like Neo4j, because Neo4j can understand the data relationships between different transit route legs.

Polyglot Persistence: Using MongoDB and Neo4j

Wanderu didn’t want to force MongoDB (a document-based data store) to handle graph-style relationships because the implementation would have been costly and inefficient. Instead, they used a polyglot persistence approach to capitalize on the strengths of each, deciding to use both MongoDB and Neo4j together.

Solution Architectural Diagram

Wanderu’s polyglot persistence architecture between MongoDB and Neo4j

The Wanderu ticket search engine uses both MongoDB (for easy JSON document storage) and Neo4j (for efficient route calculations).

The Challenge of Sync

With the bus route legs stored in MongoDB, Wanderu had to decide whether to write application code to synchronize this information into Neo4j as a graph model or use a syncing technology to handle this automatically. Eddy Wong, CTO and Co-Founder of Wanderu, discovered the GitHub project called “mongo-connector,” which enabled Mongo’s built-in replication service to replicate data to another database. Eddy only had to write a Doc Manager for Neo4j which handled callbacks on each MongoDB insert or update operation.

As new entries are added to the MongoDB OpLog, the Mongo Connector calls the Neo4j DocMgr. The Neo4j DocMgr code written by Wanderu then uses the py2neo open source Python library to create the corresponding nodes, properties and relationships in Neo4j. The API server then uses Node-Neo4j to send queries to the graph database.

The resulting solution takes advantage of Neo4j, MongoDB, JSON, Node.js, Express.js, Mongo Connector, Python and py2neo. Polyglot persistence ensures that each of these technologies is used according to its greatest strengths. And for Wanderu, it means a better search and routing experience for their users.

Read more about Wanderu’s use of Neo4j in their online case study. O’Reilly’s Graph Databases compares NoSQL database solutions and shows you how to apply graph technologies to real-world problems. Click below to get your free copy of the definitive book on graph databases and your introduction to Neo4j.

Cypher: LOAD JSON from URL AS Data

Discover How to LOAD JSON Files from URLs AS Graph-Ready Data

Update: Much of this got much easier today with user defined procedures, like apoc.load.json, which add this kind of capability to Cypher directly.

Neo4j’s query language Cypher supports loading data from CSV directly, but not from JSON files or URLs. Almost every site offers some kind of API or endpoint that returns JSON, and we can also query many NoSQL databases via HTTP and get JSON responses back. It’s quite useful to be able to ingest document-structured information from all those different sources into a more usable graph model.

I want to show here that retrieving that data and ingesting it into Neo4j using Cypher is really straightforward and takes only a little effort. As Cypher is already pretty good at deconstructing nested documents, it’s actually not that hard to achieve from a tiny program. I want to show you today how you can achieve this from Python, JavaScript, Ruby, Java and Bash.

The Domain: Stack Overflow

Being a developer I love Stack Overflow; just crossed 20k reputation by only answering 1100 Neo4j-related questions :). You can do that too. That’s why I want to use Stack Overflow users with their questions, answers, comments and tags as our domain today.
Loading JSON Data from a Stack Overflow URL
Pulling Stack Overflow information into a graph model allows me to find interesting insights, like:
    • What are the people asking or answering about Neo4j also interested in
    • How is their activity distributed across tags and between questions, answers and comments
    • Which kinds of questions attract answers and which don’t
    • Looking at my own data, which answers to what kinds of questions got the highest approval rates
We need some data and a model suited to answer those questions.

Stack Overflow API

Stack Overflow offers an API to retrieve that information. It’s credential-protected as usual, but there is the cool option to pre-generate an API URL that encodes your secrets and allows you to retrieve data without sharing them. You can still control some parameters like tags, page size and page number though. With the API URL below, we load the most recent questions with the Neo4j tag:

https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf

The response should look something like this (or scroll to the far bottom).
Overall Response Structure
{ "items": [{
	"question_id": 24620768,
	"link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements",
	"title": "Neo4j cypher query: get last N elements",
 	"answer_count": 1,
 	"score": 1,
 	.....
 	"creation_date": 1404771217,
 	"body_markdown": "I have a graph....How can I do that?",
 	"tags": ["neo4j", "cypher"],
 	"owner": {
 		"reputation": 815,
 		"user_id": 1212067,
        ....
 		"link": "http://stackoverflow.com/users/1212067/"
 	},
 	"answers": [{
 		"owner": {
 			"reputation": 488,
 			"user_id": 737080,
 			"display_name": "Chris Leishman",
            ....
 		},
 		"answer_id": 24620959,
 		"share_link": "http://stackoverflow.com/a/24620959",
        ....
 		"body_markdown": "The simplest would be to use an ... some discussion on this here:...",
 		"title": "Neo4j cypher query: get last N elements"
 	}]
 }]
}

Graph Model

So what does the graph-model look like? We can develop it by looking at the questions we want to answer and the entities and relationships they refer to. We need this model upfront to know where to put our data when we insert it into the graph. After all we don’t want to have loose ends.
Discover How to LOAD JSON Files from URLs AS Graph-Ready Data

Cypher Import Statement

The Cypher query to create that domain is also straightforward. You can deconstruct maps with dot notation map.key and arrays with slices array[0..4]. You’d use UNWIND to convert collections into rows and FOREACH to iterate over a collection with update statements. To create nodes and relationships we use the MERGE and CREATE commands. My friend Mark just published a blog post explaining in detail how you apply these operations to your data.

The JSON response that we retrieved from the API call is passed in as a parameter {json} to the Cypher statement, which we alias with the more handy data identifier. Then we use the aforementioned means to extract the relevant information out of the data collection of questions, treating each as q. For each question we access the direct attributes but also related information like the owner, or contained collections like tags or answers, which we deconstruct in turn.
WITH {json} as data
UNWIND data.items as q
MERGE (question:Question {id:q.question_id}) ON CREATE
  SET question.title = q.title, question.share_link = q.share_link, question.favorite_count = q.favorite_count

MERGE (owner:User {id:q.owner.user_id}) ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)

FOREACH (tagName IN q.tags | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag))
FOREACH (a IN q.answers |
   MERGE (question)<-[:ANSWERS]-(answer:Answer {id:a.answer_id})
   MERGE (answerer:User {id:a.owner.user_id}) ON CREATE SET answerer.display_name = a.owner.display_name
   MERGE (answer)<-[:PROVIDED]-(answerer)
)

Calling Cypher with the JSON parameters

To pass in the JSON to Cypher we have to programmatically call the Cypher endpoint of the Neo4j server, which can be done via one of the many drivers for Neo4j or manually by POSTing the necessary payload to Neo4j. We can also call the Java API. So without further ado here are our examples for a selection of different languages, drivers and APIs:

Python

We use the py2neo driver by Nigel Small to execute the statement:
import os
import requests
from py2neo import neo4j

# Connect to graph and add constraints.
neo4jUrl = os.environ.get('NEO4J_URL',"http://localhost:7474/db/data/")
graph = neo4j.GraphDatabaseService(neo4jUrl)

# Add uniqueness constraints.
neo4j.CypherQuery(graph, "CREATE CONSTRAINT ON (q:Question) ASSERT q.id IS UNIQUE;").run()

# Build URL.
apiUrl = "https://api.stackexchange.com/2.2/questions...." % (tag,page,page_size)
# Send GET request.
json = requests.get(apiUrl, headers = {"accept":"application/json"}).json()

# Build query.
query = """
UNWIND {json} AS data ....
"""

# Send Cypher query.
neo4j.CypherQuery(graph, query).run(json=json)
We also did something similar with getting tweets from the Twitter search API into Neo4j for the OSCON conference.

Javascript

For JavaScript I want to show how to call the transactional Cypher endpoint directly, by just using the request node module.
var r=require("request");
var neo4jUrl = (process.env["NEO4J_URL"] || "http://localhost:7474") + "/db/data/transaction/commit";

function cypher(query,params,cb) {
  r.post({uri:neo4jUrl,
          json:{statements:[{statement:query,parameters:params}]}},
         function(err,res) { cb(err,res.body)})
}

var query="UNWIND {json} AS data ....";
var apiUrl = "https://api.stackexchange.com/2.2/questions....";

r.get({url:apiUrl,json:true,gzip:true}, function(err,res,json) {
  cypher(query,{json:json},function(err, result) { console.log(err, JSON.stringify(result))});
});

Java

With Java I want to show how to use the Neo4j embedded API to execute Cypher.
import java.util.Map;
import static java.util.Collections.singletonMap;

import org.apache.http.*;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClients;
import org.codehaus.jackson.map.ObjectMapper;
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// somewhere in your application-scoped setup code
ObjectMapper mapper = new ObjectMapper();
HttpClient http = HttpClients.createMinimal();
GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedGraphDatabase(PATH);

// execute API request and parse response as JSON
HttpResponse response = http.execute(new HttpGet( apiUrl ));
Map json = mapper.readValue(response.getEntity().getContent(), Map.class);

// execute Cypher
String query = "UNWIND {json} AS data ....";
db.execute(query, singletonMap("json",json));

// application scoped shutdown, or JVM-shutdown-hook
db.shutdown();

Ruby

Using the neo4j-core gem, we can talk to Neo4j Server or embedded Neo4j (using JRuby) by just changing a single line of configuration.
require 'rubygems'
require 'neo4j-core'
require 'rest-client'
require 'json'

QUERY="UNWIND {json} AS data ...."
API = "https://api.stackexchange.com/2.2/questions...."

res = RestClient.get(API)
json = JSON.parse(res.to_str)

session = Neo4j::Session.open
session.query(QUERY, json: json)

Bash

Bash is of course most fun, as we have to do fancy text substitutions to make this work.
load_json.sh
#!/bin/bash
echo "Usage: load_json.sh 'http://json.api.com?params=values' import_json.cypher"
echo "Use {data} as parameter in your query for the JSON data"
JSON_API="$1"
QUERY=`cat "$2"` # cypher file
JSON_DATA=`curl --compress -s -H accept:application/json "$JSON_API"`
POST_DATA="{\"statements\":[{\"statement\": \"$QUERY\", \"parameters\": {\"data\":$JSON_DATA}}]}"
DB_URL=${NEO4J_URL-http://localhost:7474}
curl -i -H accept:application/json -H content-type:application/json -d "$POST_DATA" -XPOST \
  "$DB_URL/db/data/transaction/commit"

Example Use-Cases

Here are some simple example queries that I now can run on top of this imported dataset. To not overload this blog post with too much information, we’ll answer our original questions in Part 2.

Find the User who was most active

MATCH (u:User)
OPTIONAL MATCH (u)-[:PROVIDED|ASKED|COMMENTED]->()
RETURN u,count(*)
ORDER BY count(*) DESC
LIMIT 5

Find co-used Tags

MATCH (t:Tag)
OPTIONAL MATCH (t)<-[:TAGGED]-(question)-[:TAGGED]->(t2)
RETURN t.name, t2.name, count(distinct question) as questions
ORDER BY questions DESC

Or, to render the co-occurring tags as a graph instead of a table:

MATCH (t:Tag)-[r:TAGGED]-(question)
RETURN t,r,question

Conclusion

So as you can see, even with LOAD JSON not being part of the language, it’s easy enough to retrieve JSON data from an API endpoint and deconstruct and insert it into Neo4j by just using plain Cypher. Accessing web APIs is a simple task in all stacks and languages, and JSON as a transport format is ubiquitous. Fortunately, the (unfortunately) lesser-known capabilities of Cypher to deconstruct complex JSON documents allow us to quickly turn them into a really nice graph structure without duplication of information and with rich relationships.

I encourage you to try it with your favorite web APIs and send us your example – with graph model, Cypher import query and 2-3 use-case queries that reveal some interesting insights into the data you ingested – to content@neotechnology.com. Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and start building better apps powered by graph technologies.

Appendix: Stack Overflow Response

{
	"items": [{
		"answers": [{
			"owner": {
				"reputation": 488,
				"user_id": 737080,
				"user_type": "registered",
				"accept_rate": 45,
				"profile_image": "https://www.gravatar.com/avatar/
ffa6eed1e8a9c1b2adb37ca88c07dede?s=128&d=identicon&r=PG",
				"display_name": "Chris Leishman",
				"link": "http://stackoverflow.com/users/737080/chris-leishman"
			},
			"tags": [],
			"comment_count": 0,
			"down_vote_count": 0,
			"up_vote_count": 2,
			"is_accepted": false,
			"score": 2,
			"last_activity_date": 1404772223,
			"creation_date": 1404772223,
			"answer_id": 24620959,
			"question_id": 24620768,
			"share_link": "http://stackoverflow.com/a/24620959",
			"body_markdown": "The simplest would be to use an ... some discussion on this here:
 http://docs.neo4j.org/chunked/stable/cypherdoc-linked-lists.html)",
			"link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements/24620959#24620959",
			"title": "Neo4j cypher query: get last N elements"
		}],
		"tags": ["neo4j", "cypher"],
		"owner": {
			"reputation": 815,
			"user_id": 1212067,
			"user_type": "registered",
			"accept_rate": 73,
			"profile_image": "http://i.stack.imgur.com/nnyS1.png?s=128&g=1",
			"display_name": "C&#233;sar Garc&#237;a Tapia",
			"link": "http://stackoverflow.com/users/1212067/c%c3%a9sar-garc%c3%ada-tapia"
		},
		"comment_count": 0,
		"delete_vote_count": 0,
		"close_vote_count": 0,
		"is_answered": true,
		"view_count": 14,
		"favorite_count": 0,
		"down_vote_count": 0,
		"up_vote_count": 1,
		"answer_count": 1,
		"score": 1,
		"last_activity_date": 1404772230,
		"creation_date": 1404771217,
		"question_id": 24620768,
		"share_link": "http://stackoverflow.com/q/24620768",
		"body_markdown": "I have a graph that...How can I do that?",
		"link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements",
		"title": "Neo4j cypher query: get last N elements"
	}, {
		"tags": ["neo4j", "cypher"],
		"owner": {
			"reputation": 63,
			"user_id": 845435,
			"user_type": "registered",
			"accept_rate": 67,
			"profile_image": "https://www.gravatar.com/avatar/
610458a30958c9d336ee691fa1a87369?s=128&d=identicon&r=PG",
			"display_name": "user845435",
			"link": "http://stackoverflow.com/users/845435/user845435"
		},
		"comment_count": 0,
		"delete_vote_count": 0,
		"close_vote_count": 0,
		"is_answered": false,
		"view_count": 16,
		"favorite_count": 0,
		"down_vote_count": 0,
		"up_vote_count": 0,
		"answer_count": 0,
		"score": 0,
		"last_activity_date": 1404768987,
		"creation_date": 1404768987,
		"question_id": 24620297,
		"share_link": "http://stackoverflow.com/q/24620297",
		"body_markdown": 
"I&#39;m trying to implement a simple graph db for NYC subway................Thanks!\r\n",
		"link": "http://stackoverflow.com/questions/24620297/cypher-query-with-infinite-relationship-takes-forever",
		"title": "Cypher query with infinite relationship takes forever"
	}],
	"has_more": true,
	"quota_max": 300,
	"quota_remaining": 205
}

Making a Difference: The Public Neo4j-Users Slack Group

Learn More about the Public (and Free) Neo4j-Users Slack Group and Sign Up Today

Update

We are moving our Neo4j Community Support Forum to a new place as we have outgrown Slack. Thank you all for your help and support there.

Now join us on community.neo4j.com for a better experience


We’ve been asked several times in the past to open a neo4j-users Slack group for the many enthusiastic people in the Neo4j user community. Now, that Slack group is a reality. This group is meant to be a hip alternative to IRC for quick questions and feedback. StackOverflow is still the place to go for canonical, persistent answers, and also for providing them. We’ll also be posting interesting StackOverflow questions and answers to the Slack group.

So, yesterday we gave it a go and were totally overwhelmed by the users pouring in. During the first hours we had about 100 sign-ups per hour, making it 600 members in the first 6 hours. Wow, that was impressive. And everyone was very thankful and enthusiastic. New channels were created and people instantly began helping each other, discussing questions and giving feedback.

If you want to join now, please sign up here, and then come back to finish reading this article. Driver authors and community contributors are all there to help you with specific questions. In addition, many Neo Technology employees participate as well to help you out with questions or ideas.
Screenshot of the new Neo4j-users Slack group

Slack Channels

For the best experience, just join the channels that you are interested in, and for whose topics you can provide help. No need to be everywhere. Here is a quick overview to give you an idea:
Purpose | Channels
General Discussions | #general
General Help & Questions | #help
Help in different areas of interest | #help-newbies, #help-install, #help-cypher, #help-import, #help-modeling
Language / Driver specific questions | #neo4j-(java, python, ruby, dotnet, php, golang, sdn, rstats), #neo4j-spatial, #neo4j-unix
Share your project or story | #share-your-story
Organizing Neo4j Meetups | #organize-meetups
Neo4j Trainers (private) | #trainers
Feedback and Ideas for Neo4j | #product-feedback
News from @Neo4j Twitter | #twitter
Latest Neo4j Events | #events
Banter and Fun | #rants

Of course we added our Neo4j-Slack integration so that you can explore channels and users and get recommendations on channels that might be interesting for you with the /graph cypher [query] slash command.
/graph cypher MATCH (c:Channel)<-[:MEMBER_OF]-()
              RETURN c.name, count(*) AS users
              ORDER BY users DESC LIMIT 10;

    | c.name             | users
----+--------------------+-------
  1 | general            |   645
  2 | help-cypher        |    64
  3 | neo4j-java         |    57
  4 | help-modeling      |    53
  5 | share-your-project |    42
  6 | neo4j-python       |    36
  7 | help               |    34
  8 | product-feedback   |    33
  9 | events             |    30
 10 | organize-meetups   |    30
You can use that too, to try tiny Cypher snippets, like this:
/graph cypher unwind range(1,10) as n return n % 2, count(*), collect(n)

  | n % 2 | count(*) | collect(n)
--+-------+----------+------------------
1 |   1.0 |        5 | [1, 3, 5, 7, 9]
2 |   0.0 |        5 | [2, 4, 6, 8, 10]

Slack Experience

The Slack experience has been pretty good so far. It was no issue processing 5,000 total invites (which were a pain to export from Google Groups), and so far interaction has had no hiccups.

It is unfortunate that people can’t sign up for public Slack groups on their own, and that signing into another team on the desktop client is a bit of a hassle. It would be cool if there was another ordering mechanism for channel names other than just alphabetic, and it was a bit tricky to get a good channel structure that doesn’t overlap. Also, the missing history for non-paid Slack teams is sad. I hope that Slack will provide a good solution for open source projects that just want to engage with their users in a better way.

Despite these few issues, however, Slack is a great tool for working with our community and we hope you’ll join the discussion. Sign up here to join the Neo4j-users Slack group today. Want to catch up with the rest of the Neo4j community? Click below to get your free copy of the Learning Neo4j ebook and catch up to speed with the world’s leading graph database.

From the Neo4j Community: July 2015

Learn What Videos, Slides and Articles the Neo4j Community Has Been Publishing in the Month of July
It’s apparent that Neo4j graphistas have been hard at work this summer, with lots of awesome videos, slides and articles being published. Here are some of our favorites that show off how amazing the community really is.

If you would like to see more posts from the graphista community, follow us on Twitter and use the #Neo4j hashtag to be featured in September’s “From the Community” blog post.

Videos:

Slides:

Articles:

Want to see more of the best from the graph databases community? Click below to register for GraphConnect San Francisco on October 21, 2015 at Pier 27 and meet more hackers, developers, architects and data professionals from the Neo4j ecosystem.

Import 10M Stack Overflow Questions into Neo4j In Just 3 Minutes

Learn How We Imported 10 Million Stack Overflow Questions into Neo4j in Just 3 Minutes
I want to demonstrate how you can take the Stack Overflow dump and quickly import it into Neo4j. After that, you’re ready to start querying the graph for more insights and then possibly build an application on top of that dataset. If you want to follow along, we have a running (read-only) Neo4j server with the data available here.

But first things first: Congratulations to Stack Overflow for being so awesome and helpful. They’ve just recently announced that over ten million programming questions (and counting) have been answered on their site. (They’re also doing a giveaway around the #SOreadytohelp hashtag. More on that below.)

Without the Stack Overflow platform, many questions around Neo4j could’ve never been asked nor answered. We’re still happy that we started to move away from Google Groups for our public user support. The Neo4j community on Stack Overflow has grown a lot, as has the volume of questions there.
Stack Overflow Has Answered 10 million questions

(and it is a graph)

Importing the Stack Overflow Data into Neo4j

Importing the millions of Stack Overflow questions, users, answers and comments into Neo4j has been a long-time goal of mine. One of the distractions that kept me from doing it was answering many of the 8,200 Neo4j questions out there. Two weeks ago, Damien at Linkurious pinged me in our public Slack channel. He asked about Neo4j’s import performance for ingesting the full Stack Exchange data dump into Neo4j. After a quick discussion, I pointed him to Neo4j’s CSV import tool, which is perfect for the task as the dump consists of only relational tables wrapped in XML.
stack-overflow-graph
So Damien wrote a small Python script to extract CSV from the XML, and with the necessary headers in place, the neo4j-import tool did the grunt work of creating a graph out of the huge tables. You can find the script and instructions on GitHub here.

Importing the smaller Stack Exchange community data takes only a few seconds. Amazingly, the full Stack Overflow dump with users, questions and answers takes 80 minutes to convert to CSV and then only 3 minutes to import into Neo4j on a regular laptop with an SSD. Here is how we did it:

Download Stack Exchange Dump Files

First, we downloaded the dump files from the Internet archive for the Stack Overflow community (total 11 GB) into a directory:
    • 7.3G stackoverflow.com-Posts.7z
    • 576K stackoverflow.com-Tags.7z
    • 154M stackoverflow.com-Users.7z
The other data could be imported separately if we wanted to:
    • 91M stackoverflow.com-Badges.7z
    • 2.0G stackoverflow.com-Comments.7z
    • 36M stackoverflow.com-PostLinks.7z
    • 501M stackoverflow.com-Votes.7z

Unzip the .7z Files

for i in *.7z; do 7za -y -oextracted x $i; done
This extracts the files into an extracted directory and takes 20 minutes and uses 66GB on disk.

Clone Damien’s GitHub repository

The next step was to clone Damien’s GitHub repo:
git clone https://github.com/mdamien/stackoverflow-neo4j
Note: The conversion script uses Python 3, so you have to install xmltodict for Python 3:
sudo apt-get install python3-setuptools
easy_install3 xmltodict

Run the XML-to-CSV Conversion

After that, we ran the conversion of XML to CSV.
python3 to_csv.py extracted
The conversion ran for 80 minutes on my system and resulted in 9.5GB of CSV files, which compressed to 3.4GB. This is the data structure imported into Neo4j; the header lines of the CSV files provide the mapping.

Nodes:

posts.csv
postId:ID(Post),title,postType:INT,createdAt,score:INT,views:INT,
answers:INT,comments:INT,favorites:INT,updatedAt,body

users.csv
userId:ID(User),name,reputation:INT,createdAt,accessedAt,url,location,
views:INT,upvotes:INT,downvotes:INT,age:INT,accountId:INT

tags.csv
tagId:ID(Tag),count:INT,wikiPostId:INT
Relationships:
posts_answers.csv:ANSWER   -> :START_ID(Post),:END_ID(Post)
posts_rel.csv:PARENT_OF    -> :START_ID(Post),:END_ID(Post)
tags_posts_rel.csv:HAS_TAG -> :START_ID(Post),:END_ID(Tag)
users_posts_rel.csv:POSTED -> :START_ID(User),:END_ID(Post)
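To give a feel for how such a streaming conversion can work, here is a rough Python sketch (not Damien’s actual script – see his repository linked above; the CSV fields are a small illustrative subset):

import csv
import xmltodict

# Stream Posts.xml and emit one CSV line per <row> element,
# without ever loading the whole document into memory.
out = open('csvs/posts.csv', 'w')
writer = csv.writer(out)
writer.writerow(['postId:ID(Post)', 'title', 'score:INT'])

def handle_row(path, row):
    writer.writerow([row.get('@Id'), row.get('@Title'), row.get('@Score')])
    return True  # tell xmltodict to keep streaming

with open('extracted/Posts.xml', 'rb') as f:
    xmltodict.parse(f, item_depth=2, item_callback=handle_row)
out.close()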

Import into Neo4j

We then used the Neo4j import tool neo/bin/neo4j-import to ingest Posts, Users, Tags and the relationships between them.
../neo/bin/neo4j-import \
--into ../neo/data/graph.db \
--id-type string \
--nodes:Post csvs/posts.csv \
--nodes:User csvs/users.csv \
--nodes:Tag csvs/tags.csv \
--relationships:PARENT_OF csvs/posts_rel.csv \
--relationships:ANSWER csvs/posts_answers.csv \
--relationships:HAS_TAG csvs/tags_posts_rel.csv \
--relationships:POSTED csvs/users_posts_rel.csv
The actual import only takes 3 minutes, creating a graph store of 18 GB.
IMPORT DONE in 3m 48s 579ms. Imported:
  31138559 nodes
  77930024 relationships
  260665346 properties

Neo4j Configuration

We then adapted Neo4j’s config in conf/neo4j.properties to increase the dbms.pagecache.memory option to 10G, and edited conf/neo4j-wrapper.conf to provide some more heap, like 4G or 8G. Then we started the Neo4j server with ../neo/bin/neo4j start

Adding Indexes

We then had the option of running the next queries either directly in Neo4j’s server UI or on the command-line with ../neo/bin/neo4j-shell which connects to the running server. Here’s how much data we had in there:
neo4j-sh (?)$ match (n) return head(labels(n)) as label, count(*);
+-------------------+
| label  | count(*) |
+-------------------+
| "Tag"  | 41719    |
| "User" | 4551115  |
| "Post" | 26545725 |
+-------------------+
3 rows
Next, we created some indexes and constraints for later use:
create index on :Post(title);
create index on :Post(createdAt);
create index on :Post(score);
create index on :Post(views);
create index on :Post(favorites);
create index on :Post(answers);

create index on :User(name);
create index on :User(createdAt);
create index on :User(reputation);
create index on :User(age);

create index on :Tag(count);

create constraint on (t:Tag) assert t.tagId is unique;
create constraint on (u:User) assert u.userId is unique;
create constraint on (p:Post) assert p.postId is unique;
We then waited for the indexes to be finished.
schema await
Please note: Neo4j as a graph database wasn’t originally built for these global-aggregating queries. That’s why the responses are not instant.

Getting Insights with Cypher Queries

Below are just some of the insights we gleaned from the Stack Overflow data using Cypher queries:

The Top 10 Stack Overflow Users

match (u:User) 
with u,size( (u)-[:POSTED]->()) as posts order by posts desc limit 10 
return u.name, posts;
+---------------------------+
| u.name            | posts |
+---------------------------+
| "Jon Skeet"       | 32174 |
| "Gordon Linoff"   | 20989 |
| "Darin Dimitrov"  | 20871 |
| "BalusC"          | 16579 |
| "CommonsWare"     | 15493 |
| "anubhava"        | 15207 |
| "Hans Passant"    | 15156 |
| "Martijn Pieters" | 14167 |
| "SLaks"           | 14118 |
| "Marc Gravell"    | 13400 |
+---------------------------+
10 rows
7342 ms

The Top 5 Tags That Jon Skeet Used in Asking Questions

It seems he never really asked questions, but only answered. 🙂
match (u:User)-[:POSTED]->()-[:HAS_TAG]->(t:Tag) 
where u.name = "Jon Skeet" 
return t,count(*) as posts order by posts desc limit 5;
+------------------------------------------------+
| t                                      | posts |
+------------------------------------------------+
| Node[31096861]{tagId:"c#"}             | 14    |
| Node[31096855]{tagId:".net"}           | 7     |
| Node[31101268]{tagId:".net-4.0"}       | 4     |
| Node[31118174]{tagId:"c#-4.0"}         | 4     |
| Node[31096911]{tagId:"asp.net"}        | 3     |
+------------------------------------------------+
5 rows
36 ms

The Top 5 Tags that BalusC Answered

match (u:User)-[:POSTED]->()-[:HAS_TAG]->(t:Tag) 
where u.name = "BalusC" 
return t.tagId,count(*) as posts order by posts desc limit 5;

+------------------------+
| t.tagId        | posts |
+------------------------+
| "java"         | 5     |
| "jsf"          | 3     |
| "managed-bean" | 2     |
| "eclipse"      | 2     |
| "cdi"          | 2     |
+------------------------+
5 rows
23 ms

How am I Connected to Darin Dimitrov

MATCH path = allShortestPaths(
     (u:User {name:"Darin Dimitrov"})-[*]-(me:User {name:"Michael Hunger"}))
RETURN path;
Result Visualization in the Neo4j Browser

Which Mark Answered the Most Questions about neo4j?

MATCH (u:User)-[:POSTED]->(answer)<-[:PARENT_OF]-()-[:HAS_TAG]-(:Tag {tagId:"neo4j"}) 
WHERE u.name like "Mark %" 
RETURN u.name, u.reputation,u.location,count(distinct answer) AS answers
ORDER BY answers DESC;

+--------------------------------------------------------------------------+
| u.name                 | u.reputation | u.location             | answers |
+--------------------------------------------------------------------------+
| "Mark Needham"         | 1352         | "United Kingdom"       | 36      |
| "Mark Leighton Fisher" | 4065         | "Indianapolis, IN"     | 3       |
| "Mark Byers"           | 377313       | "Denmark"              | 2       |
| "Mark Whitfield"       | 899          | <null>                 | 1       |
| "Mark Wojciechowicz"   | 1473         | <null>                 | 1       |
| "Mark Hughes"          | 586          | "London, UK"           | 1       |
| "Mark Mandel"          | 859          | "Melbourne, Australia" | 1       |
| "Mark Jackson"         | 56           | "Atlanta, GA"          | 1       |
+--------------------------------------------------------------------------+
8 rows
38 ms
Top 20 paths rendered as graph

The Top 5 Tags of All Time

match (t:Tag) 
with t order by t.count desc limit 5 
return t.tagId, t.count;
+------------------------+
| t.tagId      | t.count |
+------------------------+
| "javascript" | 917772  |
| "java"       | 907289  |
| "c#"         | 833458  |
| "php"        | 791534  |
| "android"    | 710585  |
+------------------------+
5 rows
30 ms

Co-occurrence of the javascript Tag

match (t:Tag {tagId:"javascript"})<-[:HAS_TAG]-()-[:HAS_TAG]->(other:Tag) 
WITH other, count(*) as freq order by freq desc limit 5
RETURN other.tagId,freq;
+----------------------+
| other.tagId | freq   |
+----------------------+
| "jquery"    | 318868 |
| "html"      | 165725 |
| "css"       | 76259  |
| "php"       | 65615  |
| "ajax"      | 52080  |
+----------------------+
5 rows

The Most Active Answerers for the neo4j Tag

Quick aside: Thank you to everyone who answered Neo4j questions!
match (t:Tag {tagId:"neo4j"})<-[:HAS_TAG]-()
       -[:PARENT_OF]->()<-[:POSTED]-(u:User) 
WITH u, count(*) as freq order by freq desc limit 10
RETURN u.name,freq;

+-------------------------------+
| u.name                 | freq |
+-------------------------------+
| "Michael Hunger"       | 1352 |
| "Stefan Armbruster"    | 760  |
| "Peter Neubauer"       | 308  |
| "Wes Freeman"          | 277  |
| "FrobberOfBits"        | 277  |
| "cybersam"             | 277  |
| "Luanne"               | 235  |
| "Christophe Willemsen" | 190  |
| "Brian Underwood"      | 169  |
| "jjaderberg"           | 161  |
+-------------------------------+
10 rows
45 ms

Where Else Were the Top Answerers Also Active?

MATCH (neo:Tag {tagId:"neo4j"})<-[:HAS_TAG]-()
      -[:PARENT_OF]->()<-[:POSTED]-(u:User) 
WITH neo,u, count(*) as freq order by freq desc limit 10
MATCH (u)-[:POSTED]->()<-[:PARENT_OF]-(p)-[:HAS_TAG]->(other:Tag)
WHERE NOT (p)-[:HAS_TAG]->(neo)
WITH u,other,count(*) as freq2 order by freq2 desc 
RETURN u.name,collect(distinct other.tagId)[1..5] as tags;


+----------------------------------------------------------------------------------------+
| u.name                 | tags                                                          |
+----------------------------------------------------------------------------------------+
| "cybersam"             | ["java","javascript","node.js","arrays"]                      |
| "Luanne"               | ["spring-data-neo4j","java","cypher","spring"]                |
| "Wes Freeman"          | ["go","node.js","java","php"]                                 |
| "Peter Neubauer"       | ["graph","nosql","data-structures","java"]                    |
| "Brian Underwood"      | ["ruby-on-rails","neo4j.rb","ruby-on-rails-3","activerecord"] |
| "Michael Hunger"       | ["spring-data-neo4j","nosql","cypher","graph-databases"]      |
| "Christophe Willemsen" | ["php","forms","doctrine2","sonata"]                          |
| "Stefan Armbruster"    | ["groovy","intellij-idea","tomcat","grails-plugin"]           |
| "FrobberOfBits"        | ["python","xsd","xml","django"]                               |
| "jjaderberg"           | ["vim","logging","python","maven"]                            |
+----------------------------------------------------------------------------------------+
10 rows
84 ms
Note that this Cypher query above contains the equivalent of 14 SQL joins.
Stack Overflow Data Rendered in Linkurious Visualizer

People Who Posted the Most Questions about Neo4j

MATCH (t:Tag {tagId:'neo4j'})<-[:HAS_TAG]-(:Post)<-[:POSTED]-(u:User)
RETURN u.name,count(*) as count
ORDER BY count DESC LIMIT 10;

+------------------------+
| u.name         | count |
+------------------------+
| "LDB"          | 39    |
| "deemeetree"   | 39    |
| "alexanoid"    | 38    |
| "MonkeyBonkey" | 35    |
| "Badmiral"     | 35    |
| "Mik378"       | 27    |
| "Kiran"        | 25    |
| "red-devil"    | 24    |
| "raHul"        | 23    |
| "Sovos"        | 23    |
+------------------------+
10 rows
42 ms

The Top Answerers for the py2neo Tag

MATCH (:Tag {tagId:'py2neo'})<-[:HAS_TAG]-()-[:PARENT_OF]->()
      <-[:POSTED]-(u:User)
RETURN u.name,count(*) as count
ORDER BY count DESC LIMIT 10;

+--------------------------------+
| u.name                 | count |
+--------------------------------+
| "Nigel Small"          | 88    |
| "Martin Preusse"       | 24    |
| "Michael Hunger"       | 22    |
| "Nicole White"         | 9     |
| "Stefan Armbruster"    | 8     |
| "FrobberOfBits"        | 6     |
| "Peter Neubauer"       | 5     |
| "Christophe Willemsen" | 5     |
| "cybersam"             | 4     |
| "Wes Freeman"          | 4     |
+--------------------------------+
10 rows
2 ms

Which Users Answered Their Own Question

This global graph query takes a bit of time as it touches 200 million paths in the database; run unfiltered, it returns after about 60 seconds.
If you want to execute it only on a subset of the 4.5M users, you can add a filtering condition, e.g. on reputation, as we do below.
MATCH (u:User) WHERE u.reputation > 20000
MATCH (u)-[:POSTED]->(question)-[:ANSWER]->(answer)<-[:POSTED]-(u)
WITH u,count(distinct question) AS questions
ORDER BY questions DESC LIMIT 5
RETURN u.name, u.reputation, questions;

+---------------------------------------------+
| u.name           | u.reputation | questions |
+---------------------------------------------+
| "Stefan Kendall" | 31622        | 133       |
| "prosseek"       | 31411        | 114       |
| "Cheeso"         | 100779       | 107       |
| "Chase Florell"  | 21207        | 99        |
| "Shimmy"         | 29175        | 96        |
+---------------------------------------------+
5 rows
10 seconds

More Information

We’re happy to provide you with the graph database of the Stack Overflow dump here:

If you want to learn about other ways to import or visualize Stack Overflow questions in Neo4j, please have a look at these blog posts:

Thanks again to everyone who posts and answers Neo4j questions. You’re the ones who make the Neo4j community really tick, and without you this level of analysis would only be half as much fun. Circling back to Stack Overflow’s 10 million question milestone, thank YOU for being #SOreadytohelp with any Stack Overflow questions related to Neo4j and Cypher.

Please let us know if you find other interesting questions and answers on this dataset. Just drop us an email to content@neo4j.com. Want to catch up with the rest of the Neo4j community? Click below to get your free copy of the Learning Neo4j ebook and catch up to speed with the world’s leading graph database.

Free Neo4j Books (+ Discounts), from Beginner to Advanced

Discover the Ever-Expanding Library of Neo4j Books Available from Packt Publishing
Editor’s Note: All promotions in this blog post have now expired. Please use this post for informational purposes only and to browse Neo4j books available from Packt Publishing.

Whether you’re a brand-new user of Neo4j or a seasoned vet, you can always stand to polish and refine your skills with graph databases.

No doubt you’ve read the classic Graph Databases book by Ian Robinson, Jim Webber and Emil Eifrem, but now it’s time to move beyond the basics, especially when it comes to Neo4j- and Cypher-specific skills.

Good news: Packt Publishing has an ever-expanding host of Neo4j books, and they’re offering an exclusive discount to the Neo4j community. Use the discount code NEO4J25 to receive 25% off your order of any of the ebooks listed below. Better yet: When you purchase all of their Neo4j books, an automatic discount gets you all seven books for just $100.

But you know what’s even better than a discounted ebook? A free one. If you’d like a free copy of one of the Neo4j-specific titles below, tweet this article using the hashtag #PacktNeo4j. After a week, we’ll select seven winners to receive a 100% discount code for the book of their choice!

Now, what can you hope to win? We’ve had each of the authors write a quick summary of their Neo4j book to give you an idea:

Beginner Level Books:

Learning Neo4j

By Rik Van Bruggen, @rvanbruggen

“Learning Neo4j will give you a step-by-step way of adopting Neo4j, the world’s leading graph database. The book includes a lot of background information, helps you grasp the fundamental concepts behind this radical new way of dealing with connected data and will give you lots of examples of use cases and environments where a graph database would be a great fit. Contrary to many other books on Neo4j, this book is not only targeted at the hardcore developer: I have tried to make the book as accessible as possible for less technical audiences. Technically interested project/program managers should be able to get a great feel for the power of Neo4j by going through this book.”

Learning Cypher

By Onofrio Panzarino, @onof80

“Learning Cypher is a practical, hands-on guide to learning how to use Neo4j quickly with Cypher, from scratch. The first chapters show you how to manage a Neo4j database in all phases of its lifecycle: creation, querying, updating and maintenance, with a particular focus on Cypher, the powerful Neo4j query language. An entire chapter is dedicated to profiling and improving the performance of queries. The last chapter shows a simple approach to migrating from SQL. It would be helpful to have a bit of familiarity with Java and/or SQL, but no prior experience is required.”

Neo4j Essentials

By Sumit Gupta

“Neo4j Essentials is a comprehensive and fast-paced guide for developers or expert programmers, especially those experienced with graph-based or NoSQL databases, who want to quickly develop and produce real-world, complex use cases on Neo4j. It begins with the basic steps of installation and explores various notable features of Neo4j like data structuring, querying (Cypher), pattern matching, integrations with BI tools, Spring Data, utilities and performance tuning. This book also talks about the strategies for efficiently and effectively handling production nuances for enterprise-grade deployments and uncovers the methodologies for extending and securing Neo4j deployments.”

Intermediate Level Books:

Neo4j Cookbook

By Ankur Goel

“Neo4j Cookbook provides easy-to-follow yet powerful ready-made recipes, which on the one hand cover the recipes you will need most of the time while working with Neo4j, and on the other take you through new real-world use cases in the travel, healthcare and e-commerce domains. Starting with a practical and vital introduction to Neo4j and various aspects of Neo4j installation, you will learn how to connect to and access Neo4j servers from programming languages such as Java, Python, Ruby and Scala. You will also learn about Neo4j administration and maintenance before expanding and advancing your knowledge by dealing with large Neo4j installations and optimizing them for both storage and querying.”

Neo4j Graph Data Modeling

By Mahesh Lal, @Mahesh_Lal

“Neo4j Graph Data Modeling will introduce design concepts used in modeling data as a graph in Neo4j. Written for developers with some familiarity with Neo4j and for data architects, the book takes a step-by-step approach to explaining how we can design various data models in Neo4j. The examples have a wide range, starting from graph problems (e.g., routing) to problems that are not an intuitive fit for graph databases. We have tried to craft the examples so that the reader is taken on a journey of discovery of the rationale behind design decisions.”

Advanced Level Books:

Neo4j High Performance

By Sonal Raj, @_sonalraj

“Neo4j High Performance presents an insight into how Neo4j can be applied to practical industry scenarios and also includes tweaks and optimizations for developers and administrators to make their systems more efficient and high-performing. By the end of this book you will have learnt about the following three aspects of Neo4j:
    • Understand concepts of graphs and Neo4j as a graph database, transactions, indexing and its querying with Cypher.
    • Create, build, deploy and test applications running Neo4j at the backend. Also get an introduction to an embedded application of Neo4j.
    • Use and setup of the Neo4j APIs including core API, REST API and an overview of its High Availability version of the framework.”

Building Web Applications with Python and Neo4j

By Sumit Gupta

“Building Web Applications with Python and Neo4j is a step-by-step guide aimed at competent developers who have exposure to and programming experience in Python and who now want to explore the world of relationships with Neo4j. This book discusses data modeling, programming and data analysis for application development with Python and Neo4j. It also provides all the necessary practical skills and exposure to Python developers, which not only helps them in leveraging the power of Neo4j, but at the same time provides insight into various Python-based frameworks like py2neo, OGM, Django, Flask, etc. for rapidly developing enterprise-grade applications with Python and Neo4j.”

Don’t forget – get 25% off your purchase of any Packt Neo4j title with the discount code NEO4J25, or share this article on Twitter with the hashtag #PacktNeo4j to win your free copy of any of the ebooks listed above! Having trouble with any of the discount codes? Email customercare@packtpub.com for help.

From the Neo4j Community: September 2015

Explore All of the Great Articles & Videos Created by the Neo4j Community in the Month of September
Autumn is in full swing, and as we continue to gather all the slides and videos from GraphConnect San Francisco (coming soon!), it’s time to look back at all of the amazing contributions from the Neo4j community this past September.

Below are some of our top picks from around the world.

If you would like to see more posts from the graphista community, follow us on Twitter and use the #Neo4j hashtag to be featured in November’s “From the Community” blog post.

Articles

Videos

Websites

    • neo4Art by the Larus Business Automation Team


Join the largest graph database ecosystem and catch up with the rest of the Neo4j community – click below to download your free copy of Learning Neo4j and sharpen your skills with the world’s leading graph database.

Building the Graph Your Network App with the Neo4j Docker Image

Learn How We Built the Graph Your Network App Using the Official Neo4j Docker Image
Neo4j has long been distributed with a dataset of movies – showing how movies, actors, actresses, directors and more relate to each other.

We’ve also recently added Northwind organizational data for those developers who are more business minded. Nonetheless, these datasets don’t capture the interest of all developers who work with a variety of types of data.

We wanted a dataset that everyone could feel a personal attachment to, so we decided to enable you to analyze your personal Twitter data in Neo4j!

In order to let the masses explore their Twitter data in an isolated environment, we decided to take advantage of the new Neo4j Docker image.

We set up a Neo4j Docker container for each user, running on Amazon’s Elastic Container Service (ECS).

Architecture


The Graph Your Network App Architecture using the Neo4j Docker Image


When a new user visits network.graphdemos.com, one of the neo4j-twitter-head instances directs them to log in with their Twitter account.

This completes an OAuth 1.0a dance, enabling the Graph Your Network application to access the user's Twitter data on their behalf. While most of the data being accessed is already public, acting on behalf of the user grants additional Twitter API quota and the ability to authenticate the user.
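
For the curious, the server side of an OAuth 1.0a flow looks roughly like the sketch below. The requests_oauthlib library and the callback URL are our assumptions for illustration, not necessarily what the app itself uses, and the consumer keys are placeholders.

from requests_oauthlib import OAuth1Session

# Hypothetical consumer credentials registered with Twitter.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"

# Step 1: obtain a request token and build the URL that sends the
# user to Twitter to authorize the app.
oauth = OAuth1Session(CONSUMER_KEY, client_secret=CONSUMER_SECRET,
                      callback_uri="https://network.graphdemos.com/callback")
oauth.fetch_request_token("https://api.twitter.com/oauth/request_token")
redirect_url = oauth.authorization_url("https://api.twitter.com/oauth/authenticate")

# Step 2: after Twitter redirects back with a verifier, exchange it for
# access tokens that let us call the API on the user's behalf.
# oauth.parse_authorization_response(callback_url_from_twitter)
# tokens = oauth.fetch_access_token("https://api.twitter.com/oauth/access_token")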

We then spin up a new instance, which runs the neo4j-twitter Docker image, using the official Neo4j Docker image as the base.

This instance starts up Neo4j and then runs a Python script to import the user’s Twitter data into Neo4j. The credentials needed to do the import are passed into the Docker container using environment variables.
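
In sketch form, the start of that import script could look like the following. The environment variable names, the py2neo 2.x-style connection URI and the node labels are all assumptions on our part:

import os
from py2neo import Graph

# Credentials arrive via the container's environment (variable names are hypothetical).
graph = Graph("http://{user}:{pw}@localhost:7474/db/data/".format(
    user=os.environ["NEO4J_USER"], pw=os.environ["NEO4J_PASSWORD"]))

# One import step: upsert a tweet and link it to its author.
graph.cypher.execute(
    "MERGE (u:User {screen_name: {sn}}) "
    "MERGE (t:Tweet {id: {id}}) SET t.text = {text} "
    "MERGE (u)-[:POSTS]->(t)",
    sn="neo4j", id=1, text="Hello, graph!")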

After Neo4j starts, the credentials are reset, and the URL, username and password are presented to the user on a webpage. We also run some canned queries, executed from the neo4j-twitter-head instances via py2neo against your personal Neo4j instance.

Resource Allotment


Each instance is allocated a quarter of a CPU core and 768 MB of memory. While that isn't much, it is adequate for the Twitter graphs of most users.

Since the number of EC2 servers needed to host these containers depends on load, a cron job runs regularly on the head instances, growing the auto-scaling group or terminating surplus instances as appropriate.
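
That scaling logic can be sketched with the boto3 library (an assumption on our part, as is the group name):

import boto3

autoscaling = boto3.client("autoscaling")

def scale_to(desired_instances):
    # Grow or shrink the auto-scaling group that hosts the per-user containers.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="neo4j-twitter-workers",  # hypothetical group name
        DesiredCapacity=desired_instances,
        HonorCooldown=True)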

Imported Data & Data Model




We import the following data:
    • Your followers
    • People you follow
    • Your tweets
    • Your mentions
    • Recent tweets using your top 5 hashtags
    • Recent tweets with #GraphConnect
    • Recent tweets mentioning Neo4j
There are three separate threads calling the Twitter API. When a thread hits a Twitter API quota, it sleeps for 15 minutes. Your new tweets are imported every 30 minutes, and other data is refreshed every 90 minutes.
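
Because Twitter signals an exhausted quota with an HTTP 429 response, each thread's backoff can be as simple as the sketch below (the helper itself is hypothetical):

import time

RATE_LIMIT_SLEEP = 15 * 60  # Twitter quota windows reset every 15 minutes

def get_json_with_backoff(session, url, params=None):
    # `session` is an authenticated OAuth1Session, as in the OAuth sketch above.
    while True:
        response = session.get(url, params=params)
        if response.status_code == 429:  # quota exhausted: sleep out the window
            time.sleep(RATE_LIMIT_SLEEP)
            continue
        response.raise_for_status()
        return response.json()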

Example Queries


We provide a set of example queries on the web app and in the tutorial built into the Neo4j browser, including the following (a sketch of the first query appears after the list):
    1. Who’s mentioning you on Twitter?
    2. Who are your most influential followers?
    3. What tags do you use frequently?
    4. How many of the people you follow also follow you back?
    5. Who are the people tweeting about you, but who you don’t follow?
    6. What are the links from interesting retweets?
    7. Who are other people tweeting with some of your top hashtags?
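
Assuming a data model along the lines of (:User)-[:POSTS]->(:Tweet)-[:MENTIONS]->(:User) (our guess at the schema, not the app's confirmed model), the first query might be expressed with py2neo like so, reusing the graph handle from the import sketch above:

results = graph.cypher.execute(
    "MATCH (who:User)-[:POSTS]->(t:Tweet)-[:MENTIONS]->(me:User {screen_name: {me}}) "
    "RETURN who.screen_name AS mentioner, count(t) AS mentions "
    "ORDER BY mentions DESC",
    me="your_screen_name")

for record in results:
    print(record.mentioner, record.mentions)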

Browser Guide


Some folks have wondered how we accomplished the built-in browser guide (shown below), invoked by :play twitter. We added the custom guide to a build of the Neo4j browser, then swapped that build into the neo4j-twitter Docker image in place of the stock browser.

cd $NEO4J/community/browser
vim app/content/guides/twitter.jade   # author the custom :play guide content
mvn package                           # rebuild the browser module with the guide
cp target/neo4j-browser-2.x.x-SNAPSHOT.jar $DOCKER_REPO/neo4j-browser-2.x.x.jar

vim $DOCKER_REPO/Dockerfile
# add the following line so the rebuilt browser jar replaces the stock one:
#   ADD neo4j-browser-2.x.x.jar /var/lib/neo4j/system/lib/neo4j-browser-2.x.x.jar

The Graph Your Network App Browser Guide using the Neo4j Docker Image


What Are Your Favorite Queries?


Let us know if you discover some great queries! Share them with me on Twitter, on Slack or on the issue tracker.

Start Exploring Now!


Visit http://network.graphdemos.com/


Want to build projects like the Graph Your Network app? Click below to get your free copy of the Learning Neo4j ebook and learn to master the world’s leading graph database.

From the Neo4j Community: October 2015

Explore All of the Great Articles & Videos Created by the Neo4j Community in October 2015
The Neo4j community has had a very busy October! Besides the three major announcements at GraphConnect San Francisco, community members have been abuzz about everything from real-time databases to competitive benchmarks.

Below are some of our top picks from our stellar (and growing!) community members.

If you would like to see your post featured in December’s “From the Community” blog post, follow us on Twitter and use the #Neo4j hashtag for your chance to get picked.


Join the fastest growing graph database community – click below to download your free copy of Learning Neo4j and master the world’s leading graph database.
