Building a coauthorship network from a bibtex file
Have you ever wanted to extract the coauthorship network from your bibtex file? I admit I have always been fascinated by the information we can extract just by analyzing the coauthorship relations in an academic community.
Here, our primary interest is to analyze the relations within the optimization community in France. Indeed, as a member of this community, the patterns inside its coauthorship network are of particular interest to me. Furthermore, almost all academics in France publish their articles via HAL (Hyper Articles en Ligne), which comes with a great API to extract data about published articles. Thus, building a suitable database is almost straightforward in our case, as we will show later.
This article explains how to build a coauthorship network from a large bibtex file, load it into networkx and export it to the graphml format. Note that all the code is freely available on Github.
This article is the first step of a broader work. In future articles, we will show how to extract useful metrics to analyze the topology of the coauthorship network.
Importing a bibtex file from HAL
Querying HAL's API is not that difficult. Looking more closely at the specifications, it appears we have several choices for the output format (XML, JSON, bibtex). So luckily, we can export the database directly in bibtex, exactly as we want! Using the bibtex format will help the further analysis, as we can read the coauthors of each article directly from the author field of each entry.
It now remains to build our query. The subdomain we are interested in is mathematical optimization, which corresponds to math-oc in HAL's specification. To select articles published in this subfield, we add the field domain_s:1.math.math-oc to our request. We set the parameter wt=bibtex to specify the output format. We should also specify the number of articles we want to import: by default, HAL returns only 30 articles. To retrieve all the articles, we increase the limit to 10000 (rows=10000).
The final HTTP request writes out:
wget -O raw_export.bib "https://api.archives-ouvertes.fr/search/?q=domain_s:1.math.math-oc&wt=bibtex&rows=10000"
Note that we could easily modify this query to import articles from a different subfield.
Once the request has finished, we get a valid bibtex database. It now remains to check whether the database is clean enough ... and as you can guess, we will have to do some preprocessing before being able to parse the database correctly.
Preprocessing the bibtex file
Looking more closely at the bibtex file, it appears that we have two major problems.
First, some entries are not well specified. To force LaTeX to render an entry without additional reformatting, authors can enter a field in double braces: {{Please parse this TitlE as specified Here!!}}. This is a perfectly valid bibtex specification. But unfortunately, some authors entered three closing braces }}} instead of the two required }}, leading to errors when parsing the bibtex file. We replaced the faulty braces manually.
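Such a fix could also be scripted. Here is a minimal sketch in Python, assuming no entry legitimately contains three consecutive closing braces (worth checking with a diff afterwards):
with open("raw_export.bib") as f:
    content = f.read()

# Collapse the faulty triple braces into the two required ones
with open("raw_export.bib", "w") as f:
    f.write(content.replace("}}}", "}}"))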
The other problem is accentuation. Indeed, LaTeX (and by extension, bibtex) was designed back in a time when ASCII reigned. People with accented names can enter special characters inside braces, e.g. {\"i} is equivalent to the symbol ï. In bibtex, almost all accented names use this convention, which at first glance is perfectly fine. The problem arises because some names appear accented in some articles (Valérie) and without any accents in others (Valerie). To avoid further issues, we choose to convert all accented characters to ASCII, using the following sed commands (the complete command is available in this script):
BIBFILE=raw_export.bib
sed -i -e "s/{\\\'a}/a/g" \
       -e "s/{\\\'e}/e/g" \
       -e "s/{\\\'i}/i/g" \
       -e "s/{\\\'n}/n/g" \
       -e "s/{\\\'o}/o/g" \
       -e "s/{\\\'u}/u/g" \
       ...
       -e 's/{\\ae}/ae/g' \
       -e 's/{\\ss}/ss/g' \
       $BIBFILE
I address my humble apologies to all the people whose names I have ASCIIed without further notice... But hopefully, that will ease the next steps.
Loading the database into Python
Once the bibtex file is processed, we can load it into Python. To do so, we use the great package bibtexparser. Using this package, importing the database is straightforward:
import bibtexparser

# bibname is the path to the preprocessed bibtex file
parser = bibtexparser.bparser.BibTexParser(common_strings=True)
with open(bibname) as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file, parser)
We can get the total number of articles:
>>> print("Number of articles: ", len(bib_database.entries))
Number of articles: 9197
Let's do some basic data analysis on our database. For instance, when were the articles in our database published? We write a small Python function to count the number of articles per year:
def hist_years(db):
    counts = dict()
    for entry in db.entries:
        year = entry["year"]
        counts[year] = 1 + counts.get(year, 0)
    years = list(counts.keys())
    years.sort()
    for year in years:
        print("%s: %s" % (year, counts[year]))
    return counts
We get
>>> hist_years(bib_database)
0008: 1
1978: 1
1979: 1
1981: 1
1983: 3
1984: 3
1986: 1
1987: 2
1988: 3
1989: 3
1990: 8
1991: 4
1992: 7
1993: 14
1994: 31
1995: 17
1996: 31
1997: 32
1998: 37
1999: 39
2000: 57
2001: 48
2002: 60
2003: 76
2004: 101
2005: 116
2006: 260
2007: 286
2008: 300
2009: 371
2010: 444
2011: 508
2012: 552
2013: 685
2014: 715
2015: 669
2016: 649
2017: 850
2018: 830
2019: 1025
2020: 356
Here we see the first bias in our analysis: the vast majority (79%) of the articles stored in our database have been written since 2010. This is consistent, as HAL was launched back in 2005 by the Centre pour la Communication Scientifique Directe (CCSD).
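We can check this figure with the dictionary returned by hist_years (years are stored as strings, so a lexicographic comparison works for four-digit years):
counts = hist_years(bib_database)
recent = sum(v for year, v in counts.items() if year >= "2010")
print("%.0f%%" % (100 * recent / sum(counts.values())))  # prints: 79%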
Building the graph of co-authors
It now remains to build the graph of co-authors. To do so, we scan the co-authors of each article and add the corresponding new edges to the graph.
Parsing the authors in each article
For each article in the database, the authors are specified inside a single string, with the authors' names separated by an "and" substring:
>>> authors = db.entries[0]["author"]
>>> authors
'Bonnans, J. Frederic and Zidani, Hasnaa'
To get each name individually, we define a separator AUTHOR_SEPARATOR = " and " and split the string into as many substrings as there are authors. That gives the two expected names for our first entry:
>>> import re
>>> list_authors = re.split(AUTHOR_SEPARATOR, authors)
['Bonnans, J. Frederic', 'Zidani, Hasnaa']
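In the rest of the article, this logic is wrapped in two small helpers, parse_authors and parse_name. Here is a minimal sketch of what they could look like (the actual implementations on Github may differ; note that the separator keeps its surrounding spaces, so names containing "and", such as Ferdinand, are not split):
import re

AUTHOR_SEPARATOR = " and "

def parse_authors(names):
    # Split the raw bibtex "author" field into individual names
    return re.split(AUTHOR_SEPARATOR, names)

def parse_name(author):
    # Normalize whitespace (bibtex fields may span several lines)
    return " ".join(author.split())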
By iterating over the database, we can load all names individually and start building the nodes of our network. But to avoid duplicated names, we have to take one last detail into account.
Assigning a key to each author
Indeed, bibtex conventions lack consistency in how the authors of a paper are specified. Imagine your co-author is named Jean-Pierre Dupont. You could enter it in bibtex as Dupont, Jean-Pierre, but Dupont, JP or Dupont, J.P. are also perfectly valid entries. To avoid duplicates in our dataset, we choose to assign a unique key to each author. We will use a dedicated library, named nameparser, to parse the names. We can then parse each name individually with the commands:
>>> from nameparser import HumanName
>>> name = HumanName("Dupont, Jean-Pierre")
>>> print(name)
Jean-Pierre Dupont
>>> name.first
'Jean-Pierre'
>>> name.last
'Dupont'
or equivalently
>>> name = HumanName("Dupont, J.P.")
>>> print(name)
J.P. Dupont
Parsing a name is almost straightforward with nameparser. Hence, we can assign a single key to each author, following this procedure:
- We parse a name (e.g. "Dupont, J.P.") with nameparser
- We convert the name to lowercase, to avoid issues with capitalization
- We return as key the last name concatenated with the first letter of the first name (so "J.P. Dupont", "J Dupont" and "Jean-Pierre Dupont" all return the same key, "dupont j")
This procedure has a single drawback: two authors with the same last name whose first names share the same first letter get the same key (e.g. Jeanne Dupont and Jean-Pierre Dupont). That is a trade-off we made: parse compound first names correctly (as Jean-Pierre Dupont), at the price of merging some homonyms. Fortunately, we have only a few cases of people sharing the same last name in our dataset.
The implementation is given via the function key_name:
def key_name(name):
    # parse name
    parsed = HumanName(name)
    if len(parsed.first) > 0:
        first_name = parsed.first[0]
        # Key is lowercased
        key = f"{parsed.last.lower()} {first_name.lower()}"
        return key
    else:
        return name.lower()
We add an if statement to handle the special case occurring when authors do not have any first name.
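As a quick sanity check (output assuming nameparser's default behavior on these strings), the variants above collapse to the same key:
>>> key_name("Dupont, Jean-Pierre")
'dupont j'
>>> key_name("Dupont, J.P.")
'dupont j'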
That eventually leads to the function load_authors, which scans the database to look for unique authors.
def load_authors(db):
    authors = dict()
    for entry in db.entries:
        names = entry["author"]
        for author in parse_authors(names):
            name = parse_name(author)
            key = key_name(name)
            val = authors.get(key)
            if isinstance(val, list):
                # Store every spelling variant seen for this key
                if name not in val:
                    val.append(name)
            else:
                authors[key] = [name]
    return authors
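We get one dictionary entry per unique key; these keys will become the nodes of our graph:
>>> authors = load_authors(bib_database)
>>> len(authors)
7487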
With this function load_authors, we finally have all the ingredients we need to build the coauthorship network with networkx.
Building the graph with networkx
It now remains to build the graph with networkx, a graph library written in pure Python. We start by importing the library:
import networkx as nx
Using the function load_authors, we can build a function that adds a new node to the graph gx for each author:
def _add_nodes(gx, database):
    id_node = 0
    authors = load_authors(database)
    correspondance = dict()
    for auth in authors:
        id_node += 1
        gx.add_node(id_node)
        correspondance[auth] = id_node
    return correspondance
The dictionary correspondance stores the correspondence between each author's key and its id in the graph. It will be necessary to build a metadata file associated with the graph.
We can now process all the articles in the database, and add a new edge each time a coauthorship is detected:
def _add_edges(gx, database, correspondance):
    for entry in database.entries:
        names = entry["author"]
        authors = []
        # Parse names
        for author in parse_authors(names):
            name = parse_name(author)
            authors.append(name)
        # Add all corresponding edges
        for name in authors:
            k1 = key_name(name)
            for coname in authors:
                k2 = key_name(coname)
                if k1 != k2:
                    o = correspondance[k1]
                    d = correspondance[k2]
                    gx.add_edge(o, d)
It remains to combine the two functions _add_nodes and _add_edges to build a new graph from scratch. That writes out:
def build_graph(database):
    gx = nx.Graph()
    correspondance = _add_nodes(gx, database)
    _add_edges(gx, database, correspondance)
    return gx, correspondance
We build a new graph with (note that build_graph returns a tuple):
>>> g, correspondance = build_graph(bib_database)
>>> g.number_of_nodes()
7487
>>> g.number_of_edges()
19046
So, in the end we get a graph with as many nodes as unique keys, and 19046 coauthorship relations :)
It remains to dump the graph in graphml format for future use:
>>> nx.write_graphml(g, "coauthors.graphml")
That allows us to load the graph in any graph library we want, as graphml is a standard format for graphs.
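For instance, the dump can be loaded back directly with networkx (note that graphml stores node ids as strings):
>>> g2 = nx.read_graphml("coauthors.graphml")
>>> g2.number_of_edges()
19046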
Recap
So, we finally managed to build a coauthorship network from a bibtex file. The network we output corresponds to the coauthorship network of the optimization community in France. Let's emphasize the biases our process introduces:
- Almost 80% of the articles we processed were written after 2010. So we are missing the early history of the community and focus mostly on the most recent activity. Most of the seminal papers in optimization are not taken into account (think about the works of Jean-Jacques Moreau or Claude Lemaréchal).
- No matter how frequently two authors write together, we assign a unit weight to each connection. Indeed, each time we add a coauthorship relation associated with an already existing edge, the function add_edge overwrites the previous edge (a weighted variant is sketched after this list).
- We rely blindly on HAL's API, notably on its classification into subfields. However, whereas some authors publish their articles exclusively in the math-oc subfield, others disseminate their articles across different subfields (operations research, automatic control, signal processing). We miss authors who publish in different communities, and they are more and more common these days.
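If we wanted edge weights counting the number of joint papers instead, a possible variant of the innermost block of _add_edges would be (a minimal sketch; the o < d guard avoids counting each unordered pair twice per article):
if k1 != k2:
    o = correspondance[k1]
    d = correspondance[k2]
    if o < d:
        if gx.has_edge(o, d):
            # One more joint paper between these two authors
            gx[o][d]["weight"] += 1
        else:
            gx.add_edge(o, d, weight=1)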
Note also that despite all my care, there might remain some mistakes in the metadata extracted from the bibtex file. Notably, if you have a better idea to improve the key we assign to each name, feel free to update the code on Github.
In a future blog post, we will analyze the structure of the graph of co-authors more closely, using LightGraphs, a Julia library dedicated to network analysis.