
PyTorch BigGraph is a tool to create and handle large graph embeddings for machine learning. Currently there are two approaches in graph-based neural networks:

  • Directly use the graph structure and feed it to a neural network. The graph structure is then preserved at every layer. Graph CNNs use this approach; see for instance my post or this paper on the topic.
  • But most graphs are too large for that, so it’s also reasonable to first create a large embedding of the graph and then use it as features in a traditional neural network.

PyTorch BigGraph handles the second approach, and we will do so as well below. Just for reference, let’s talk about the size aspect for a second. Graphs are usually encoded by their adjacency matrix. If you have a graph with 3,000 nodes and an edge between every pair of nodes, you end up with roughly nine million (3,000²) entries in your matrix. Even stored sparsely, this apparently exceeds the memory of most GPUs, according to the paper linked above.
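
To put a rough number on the quadratic growth, here is the back-of-the-envelope calculation (the 3,000-node figure is from above; the 10-million-node case and the 4-byte entries are my own illustrative assumptions):

# Entries in a dense adjacency matrix grow quadratically with the node count
for num_nodes in (3_000, 10_000_000):      # 10M nodes: a made-up recommendation-scale graph
    entries = num_nodes ** 2
    gigabytes = entries * 4 / 1e9          # assuming 4-byte entries
    print(f"{num_nodes:,} nodes -> {entries:,} dense entries (~{gigabytes:,.2f} GB)")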

If you think about the usual graphs used in recommendation systems, you’ll realise they are typically much larger than that. Now there are already some excellent posts about the how and why of BigGraph, so I won’t spend more time on that. I’m interested in applying BigGraph to my machine learning problem, and for that I like to take the simplest possible examples and get things to work. I constructed two examples which we will walk through step by step.

The whole code is refactored and available on GitHub. It’s adapted from the example found in the BigGraph repository.

The first example is part of the LiveJournal graph and the data looks like this:

# FromNodeId ToNodeId
0 1
0 2
0 3
...
0 10
0 11
0 12
...
0 46
1 0
...

The second example is simply 8 nodes with edges between them:

# FromNodeId ToNodeId
0 1
0 2
0 3
0 4
1 0
1 2
1 3
1 4
2 1
2 3
2 4
3 1
3 2
3 4
3 7
4 1
5 1
6 2
7 3

Embedding a Part of the LiveJournal Graph

BigGraph is made to work around the memory limits of machines, so it’s completely file based. You’ll have to trigger processes to create the appropriate file structure, and if you want to run an example again, you’ll have to delete the checkpoints first. We also have to split the data into train and test sets beforehand, again on a file basis. The file format is TSV, tab-separated values.

Let’s dive right into it. The first code snippet declares two helper functions taken from the BigGraph source, sets some constants, and runs the file split.

https://gist.github.com/sbalnojan/2c0effd09e74e8d9f8c6273e82a5dd98
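
The gist has the full helpers; for orientation, a minimal file-based split could look roughly like this (the output file names follow the post, while the raw input file name, the 75/25 ratio and the fixed seed are my own choices):

import random

# Read the raw edge list, skip comment lines, normalise to tab-separated pairs
random.seed(0)
edges = []
with open("data/example_1/example.txt") as f:      # raw edge file; the name is an assumption
    for line in f:
        if line.strip() and not line.startswith("#"):
            lhs, rhs = line.split()
            edges.append(f"{lhs}\t{rhs}\n")

# Shuffle and write the two TSV files BigGraph will consume
random.shuffle(edges)
split = int(len(edges) * 0.75)                     # 75% train, 25% test
with open("data/example_1/train.txt", "w") as f:
    f.writelines(edges[:split])
with open("data/example_1/test.txt", "w") as f:
    f.writelines(edges[split:])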

This splits the edges into a test and train set by creating the two files data/example_1/test.txt and train.txt. Next we use BigGraph’s converters to create the file-based structure for our dataset. We will “partition” it into 1 partition. For that we already need parts of the config file. Here’s the relevant part of the config file: the I/O data section and the graph structure.

entities_base = 'data/example_1'

def get_torchbiggraph_config():
    config = dict(
        # I/O data
        entity_path=entities_base,
        edge_paths=[],
        checkpoint_path='model/example_1',

        # Graph structure
        entities={
            'user_id': {'num_partitions': 1},
        },
        relations=[{
            'name': 'follow',
            'lhs': 'user_id',
            'rhs': 'user_id',
            'operator': 'none',
        }],
        ...

This tells BigGraph where to find our data and how to interpret our tab-separated values. With this config we can run the next Python snippet.

https://gist.github.com/sbalnojan/86c9abfd905687ed4f6d1606aecf5983
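
For comparison, the conversion can also be triggered through the command-line importer that the torchbiggraph package installs; this is only a sketch, and the exact script name and flags may differ between versions, so double-check against the BigGraph documentation:

import subprocess

# Convert train.txt / test.txt into the partitioned HDF5 structure.
# Our edge files have no relation column, so only lhs/rhs columns are passed.
subprocess.run(
    [
        "torchbiggraph_import_from_tsv",
        "--lhs-col=0", "--rhs-col=1",
        "config_1.py",
        "data/example_1/train.txt",
        "data/example_1/test.txt",
    ],
    check=True,
)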

The result should be a bunch of new files in the data directory, namely:

  • two folders test_partitioned, train_partitioned
  • one file per folder for the edges in h5 format for quick partial reads
  • the dictionary.json file containing the mapping between the original “user_ids” and the newly assigned ids.
  • entity_count_user_id_0.txt contains the entity count, in this case 47.

The dictionary.json is important for later mapping the results of the BigGraph model back to the actual embeddings we want to have. Enough preparation, let’s train the embedding. Take a look at config_1.py; it contains three more relevant sections:

        # Scoring model - the embedding size
        dimension=1024,
        global_emb=False,

        # Training - the epochs to train and the learning rate
        num_epochs=10,
        lr=0.001,

        # Misc - not important
        hogwild_delay=2,
    )
    return config

To train we run the following Python code.

https://gist.github.com/sbalnojan/3eb5d8ca4a32a32f8422efe7d016a3dc
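
At its core, such a training run comes down to loading the config and pointing the trainer at the partitioned training edges. The sketch below mirrors what BigGraph’s packaged example scripts do; treat the exact entry points (parse_config, train) and the attr.evolve override as version-dependent assumptions:

import attr
from torchbiggraph.config import parse_config
from torchbiggraph.train import train

# Load config_1.py and swap in the partitioned training edges created above
config = parse_config("config_1.py")
train_config = attr.evolve(config, edge_paths=["data/example_1/train_partitioned"])
train(train_config)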

We can evaluate the model on our test set, using some built-in metrics, via this piece of code.

https://gist.github.com/sbalnojan/c91267c0956b21da595f91d4ebde1c49
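
Evaluation follows the same pattern, just with the test partition and the do_eval entry point; again a sketch under the same version-dependent assumptions:

import attr
from torchbiggraph.config import parse_config
from torchbiggraph.eval import do_eval

# Rank the held-out test edges against the trained embeddings and print metrics (e.g. MRR)
config = parse_config("config_1.py")
eval_config = attr.evolve(config, edge_paths=["data/example_1/test_partitioned"])
do_eval(eval_config)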

Now let’s try to retrieve the actual embedding. Again, as everything is file based, it should now be located as an h5 file in the model/ folder. We can load the embedding of user 0 by looking up its mapping in the dictionary like so:

https://gist.github.com/sbalnojan/f1048196942ddc76306155b6f88c7136
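
In essence this is one json lookup plus one h5py read. A sketch, assuming dictionary.json stores the ordered entity names under entities/user_id and that the checkpoint file is named along the lines of embeddings_user_id_0.v10.h5 (check your model/example_1 folder for the exact name):

import json
import h5py

# Find the internal offset that BigGraph assigned to the original node "0"
with open("data/example_1/dictionary.json") as f:
    names = json.load(f)["entities"]["user_id"]
offset = names.index("0")

# Read that row from the checkpointed embedding matrix
with h5py.File("model/example_1/embeddings_user_id_0.v10.h5", "r") as hf:
    user_0_embedding = hf["embeddings"][offset, :]
print(user_0_embedding.shape)    # (1024,), the dimension set in config_1.py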


Now let’s switch to our second example, a constructed one on which we can hopefully do something at least partially useful. The LiveJournal data is simply too huge to run through in a reasonable amount of time.

Link Prediction and Ranking on a Constructed Example

Alright, we will repeat the steps for the second example, except we will produce an embedding of dimension 10 so we can view it and work with it. Besides, dimension 10 seems to me more than enough for 8 vertices. We set up those things in config_2.py.

entities_base = 'data/example_2'

def get_torchbiggraph_config():
    config = dict(
        # I/O data
        entity_path=entities_base,
        edge_paths=[],
        checkpoint_path='model/example_2',

        # Graph structure
        entities={
            'user_id': {'num_partitions': 1},
        },
        relations=[{
            'name': 'follow',
            'lhs': 'user_id',
            'rhs': 'user_id',
            'operator': 'none',
        }],

        # Scoring model
        dimension=10,
        global_emb=False,

        # Training
        num_epochs=10,
        lr=0.001,

        # Misc
        hogwild_delay=2,
    )
    return config

Then we run the same code as before, but in one go, taking care of the different file paths and format. In this case we only have three lines of comments at the top of the data file:

https://gist.github.com/sbalnojan/658c99cfe2863ba323ba614d726373a2

As the final output you should get a bunch of files, and in particular all the embeddings. Let’s do some basic tasks with them. Of course we could now load the embedding into any framework we like, Keras, TensorFlow, but BigGraph already ships implementations for common tasks like link prediction and ranking, so let’s try those out. The first task is link prediction. We predict the scores for the entity pair 0-7 and for the pair 0-1, as we know from our data that the edge 0-1 should be much more likely.

https://gist.github.com/sbalnojan/bd98a77740141211f46135751e200322

As the comparator we load the “DotComparator”, which computes the dot product (scalar product) of the two 10-dimensional vectors. It turns out the output numbers are tiny, but at least score_2 is much higher than score_1, as we expected.
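
Since the DotComparator boils down to a dot product, the two scores can be sanity-checked by hand with numpy; the checkpoint file name and the “embeddings” dataset key are the same assumptions as before:

import json
import h5py
import numpy as np

with open("data/example_2/dictionary.json") as f:
    names = json.load(f)["entities"]["user_id"]        # original ids in internal order
with h5py.File("model/example_2/embeddings_user_id_0.v10.h5", "r") as hf:
    emb = hf["embeddings"][...]                        # shape (8, 10): one row per entity

i0, i1, i7 = names.index("0"), names.index("1"), names.index("7")
score_1 = np.dot(emb[i0], emb[i7])    # pair 0-7: not among the training edges
score_2 = np.dot(emb[i0], emb[i1])    # pair 0-1: a training edge, should score higher
print(score_1, score_2)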

Finally, as the last piece of code, we can produce a ranking of similar entities, which uses the same mechanism as before: we use the dot product to compute the similarity of one entity’s embedding to all other entities’ embeddings and then rank them.

https://gist.github.com/sbalnojan/994c467a6e514f51a729c6753bc36430

The top entities in this case are, in order, 0, 1, 3, 7 … and if you look at the data that seems pretty much right.
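
The ranking can be reproduced by hand in the same way: dot the query embedding against all rows of the embedding matrix and sort, under the same file-name assumptions as above:

import json
import h5py
import numpy as np

with open("data/example_2/dictionary.json") as f:
    names = json.load(f)["entities"]["user_id"]
with h5py.File("model/example_2/embeddings_user_id_0.v10.h5", "r") as hf:
    emb = hf["embeddings"][...]

# Score every entity against entity "0" and sort from most to least similar
query = emb[names.index("0")]
ranked = [names[i] for i in np.argsort(-(emb @ query))]
print(ranked)    # should start with 0, 1, 3, 7, ... as reported above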

More Fun

These are the most basic examples I could come up with. I did not run the original examples on the Freebase data or on the full LiveJournal data, simply because they take quite some time to train. You can find the code and further references in the GitHub repository mentioned above.

Problems You Might Encounter

I ran the code on my Mac and encountered three issues:

  • An error stating “lib*… Reason: image not found”. The solution is to install what’s missing, e.g. with “brew install libomp”.
  • I then ran into “AttributeError: module ‘torch’ has no attribute ‘_six’”, which might simply be due to incompatible Python and torch versions. Anyway, I moved from Python 3.6 & torch 1.1 to Python 3.7 & torch 1.X and that solved the problem.
  • Inspect train.txt and test.txt before you move on; I saw some missing newlines there while testing.

Hope this helps and is fun to play with!
