Building a coauthorship network from a bibtex file
Have you ever wanted to extract the coauthorship network from your bibtex file? I admit I have always been fascinated by the information we can extract just by analyzing the coauthorship relations in an academic community.
Here, our primary interest is to analyze the relations within the optimization community in France. Indeed, as a member of this community, the patterns inside its coauthorship network are of particular interest to me. Furthermore, almost all academics in France publish their articles via HAL (Hyper Articles en Ligne), which comes with a great API to extract data about published articles. Thus, building a suitable database is almost straightforward in our case, as we will show later.
This article explains how to build a coauthorship network from a large bibtex file, load it into networkx and export it to the graphml format. Note that all the code is freely available on Github.
This article is the first step of a broader work. In future articles, we will show how to extract useful metrics to analyze the topology of the coauthorship network.
Importing a bibtex file from HAL
Querying HAL's API is not that difficult. Looking more closely at the specifications, it appears we have several choices for the output format (XML, JSON, bibtex). So luckily, we can export the database directly in bibtex, exactly as we want! Using the bibtex format will help the further analysis, as we can read the coauthors of each article directly from the author field of each entry.
It now remains to build our query. The subdomain we are interested in is mathematical optimization, which corresponds to math-oc in HAL's specification. To select articles published in this subfield, we add the field domain_s:1.math.math-oc to our request. We set the parameter wt=bibtex to specify the output format. We should also specify the number of articles we want to import: by default, HAL returns only 30 articles. To retrieve all the articles, we increase the limit to 10000 (rows=10000).
The final HTTP request writes out:
wget -O raw_export.bib "https://api.archives-ouvertes.fr/search/?q=domain_s:1.math.math-oc&wt=bibtex&rows=10000"
Note that we could easily modify this query to import articles from a different subfield.
Once the request has finished, we get a valid bibtex database. It now remains to check whether the database is clean enough ... and as you can guess, we will have to do some preprocessing before being able to parse the database correctly.
Preprocessing the bibtex file
Looking more closely at the bibtex file, it appears that we have two major problems.
First, some entries are not well specified. To force LaTeX to render an entry without additional reformatting, authors can enter a field in double braces: {{Please parse this TitlE as specified Here!!}}. This is a perfectly valid bibtex specification. But unfortunately, some authors entered three closing braces }}} instead of the two required }}, leading to errors when parsing the bibtex file. We replaced the faulty braces manually.
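Such a fix could also be scripted. Here is a minimal sketch in Python, assuming no entry legitimately contains three consecutive closing braces (worth checking with a diff afterwards):
with open("raw_export.bib") as f:
    content = f.read()

# Collapse the faulty triple braces into the two required ones
with open("raw_export.bib", "w") as f:
    f.write(content.replace("}}}", "}}"))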
The other problem is accentuation. Indeed, LaTeX (and by extension, bibtex) was designed back in a time when ASCII reigned. People with accented names can enter special characters inside braces, e.g. {\"i} is equivalent to the symbol ï. In bibtex, almost all accented names use this convention, which at first glance is perfectly fine. The problem arises because some names appear accented in some articles (Valérie) and without any accents in others (Valerie). To avoid further issues, we choose to convert all accented characters to ASCII, using the following sed commands (the complete command is available in this script):
BIBFILE=raw_export.bib
sed -i -e "s/{\\\'a}/a/g" \
       -e "s/{\\\'e}/e/g" \
       -e "s/{\\\'i}/i/g" \
       -e "s/{\\\'n}/n/g" \
       -e "s/{\\\'o}/o/g" \
       -e "s/{\\\'u}/u/g" \
       ...
       -e 's/{\\ae}/ae/g' \
       -e 's/{\\ss}/ss/g' \
       $BIBFILE
I address my humble apologies to all the people whose names I have ASCIIed without further notice... But hopefully, that will ease the next steps.
Loading the database into Python
Once the bibtex file is processed, we can load it into Python. To do so, we use the great package bibtexparser. Using this package, importing the database is straightforward:
import bibtexparser

# bibname is the path to the preprocessed bibtex file
parser = bibtexparser.bparser.BibTexParser(common_strings=True)
with open(bibname) as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file, parser)
We can get the total number of articles:
>>> print("Number of articles: ", len(bib_database.entries))
Number of articles: 9197
Let's do some basic data analysis on our database. For instance, when were the articles in our database published? We write a small Python function to count the number of articles per year:
def hist_years(db):
    counts = dict()
    for entry in db.entries:
        year = entry["year"]
        counts[year] = 1 + counts.get(year, 0)
    years = list(counts.keys())
    years.sort()
    for year in years:
        print("%s: %s" % (year, counts[year]))
    return counts
We get
>>> hist_years(bib_database)
0008: 1
1978: 1
1979: 1
1981: 1
1983: 3
1984: 3
1986: 1
1987: 2
1988: 3
1989: 3
1990: 8
1991: 4
1992: 7
1993: 14
1994: 31
1995: 17
1996: 31
1997: 32
1998: 37
1999: 39
2000: 57
2001: 48
2002: 60
2003: 76
2004: 101
2005: 116
2006: 260
2007: 286
2008: 300
2009: 371
2010: 444
2011: 508
2012: 552
2013: 685
2014: 715
2015: 669
2016: 649
2017: 850
2018: 830
2019: 1025
2020: 356
Here we see the first bias in our analysis: the vast majority (79%) of the articles stored in our database have been written since 2010. This is consistent, as HAL was launched back in 2005 by the Centre pour la Communication Scientifique Directe (CCSD).
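We can check this figure with the dictionary returned by hist_years (years are stored as strings, so a lexicographic comparison works for four-digit years):
counts = hist_years(bib_database)
recent = sum(v for year, v in counts.items() if year >= "2010")
print("%.0f%%" % (100 * recent / sum(counts.values())))  # prints: 79%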
Building the graph of co-authors
It now remains to build the graph of co-authors. To do so, we scan the co-authors of each article and add the corresponding new edges to the graph.
Parsing the authors in each article
For each article in the database, the authors are specified inside a single string, with the authors' names separated by an "and" substring:
>>> authors = db.entries[0]["author"]
>>> authors
'Bonnans, J. Frederic and Zidani, Hasnaa'
To get each name individually, we define a separator AUTHOR_SEPARATOR = " and " and split the string into as many substrings as there are authors. That gives the two expected names for our first entry:
>>> import re
>>> list_authors = re.split(AUTHOR_SEPARATOR, authors)
['Bonnans, J. Frederic', 'Zidani, Hasnaa']
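In the rest of the article, this logic is wrapped in two small helpers, parse_authors and parse_name. Here is a minimal sketch of what they could look like (the actual implementations on Github may differ; note that the separator keeps its surrounding spaces, so names containing "and", such as Ferdinand, are not split):
import re

AUTHOR_SEPARATOR = " and "

def parse_authors(names):
    # Split the raw bibtex "author" field into individual names
    return re.split(AUTHOR_SEPARATOR, names)

def parse_name(author):
    # Normalize whitespace (bibtex fields may span several lines)
    return " ".join(author.split())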
By iterating over the database, we can load all names individually and start building the nodes of our network. But to avoid duplicated names, we have to take one last detail into account.
Assigning a key to each author
Indeed, bibtex conventions lack consistency in how the authors of a paper are specified. Imagine your co-author is named Jean-Pierre Dupont. You could enter it in bibtex as Dupont, Jean-Pierre, but Dupont, JP or Dupont, J.P. are also perfectly valid entries. To avoid duplicates in our dataset, we choose to assign a unique key to each author. We will use a dedicated library, named nameparser, to parse the names. We can then parse each name individually with the commands:
>>> from nameparser import HumanName
>>> name = HumanName("Dupont, Jean-Pierre")
>>> print(name)
Jean-Pierre Dupont
>>> name.first
'Jean-Pierre'
>>> name.last
'Dupont'
or equivalently
>>> name = HumanName("Dupont, J.P.")
>>> print(name)
J.P. Dupont
Parsing a name is almost straightforward with nameparser. Hence, we can assign a single key to each author, following this procedure:
- We parse a name (e.g. "Dupont, J.P.") with nameparser
- We convert the name to lowercase, to avoid issues with capitalization
- We return as key the last name concatenated with the first letter of the first name (so "J.P. Dupont", "J Dupont" and "Jean-Pierre Dupont" all return the same key, "dupont j")
This procedure has a single drawback: two authors with the same last name whose first names share the same first letter get the same key (e.g. Jeanne Dupont and Jean-Pierre Dupont). That is a trade-off we made: parse compound first names correctly (as Jean-Pierre Dupont), at the price of merging some homonyms. Fortunately, we have only a few cases of people sharing the same last name in our dataset.
The implementation is given via the function key_name:
def key_name(name):
    # parse name
    parsed = HumanName(name)
    if len(parsed.first) > 0:
        first_name = parsed.first[0]
        # Key is lowercased
        key = f"{parsed.last.lower()} {first_name.lower()}"
        return key
    else:
        return name.lower()
We add an if statement to handle the special case occurring when authors do not have any first name.
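As a quick sanity check (output assuming nameparser's default behavior on these strings), the variants above collapse to the same key:
>>> key_name("Dupont, Jean-Pierre")
'dupont j'
>>> key_name("Dupont, J.P.")
'dupont j'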
That eventually leads to the function load_authors, which scans the database to look for unique authors.
def load_authors(db):
    authors = dict()
    for entry in db.entries:
        names = entry["author"]
        for author in parse_authors(names):
            name = parse_name(author)
            key = key_name(name)
            val = authors.get(key)
            if isinstance(val, list):
                # Store every spelling variant seen for this key
                if name not in val:
                    val.append(name)
            else:
                authors[key] = [name]
    return authors
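We get one dictionary entry per unique key; these keys will become the nodes of our graph:
>>> authors = load_authors(bib_database)
>>> len(authors)
7487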
With this function load_authors, we finally have all the ingredients we need to build the coauthorship network with networkx.
Building the graph with networkx
It now remains to build the graph with networkx, a graph library written in pure Python. We start by importing the library:
import networkx as nx
Using the function load_authors, we can build a function that adds a new node to the graph gx for each author:
def _add_nodes(gx, database):
    id_node = 0
    authors = load_authors(database)
    correspondance = dict()
    for auth in authors:
        id_node += 1
        gx.add_node(id_node)
        correspondance[auth] = id_node
    return correspondance
The dictionary correspondance stores the correspondence between each author's key and its id in the graph. It will be necessary to build a metadata file associated with the graph.
We can now process all the articles in the database, and add a new edge each time a coauthorship is detected:
def _add_edges(gx, database, correspondance):
    for entry in database.entries:
        names = entry["author"]
        authors = []
        # Parse names
        for author in parse_authors(names):
            name = parse_name(author)
            authors.append(name)
        # Add all corresponding edges
        for name in authors:
            k1 = key_name(name)
            for coname in authors:
                k2 = key_name(coname)
                if k1 != k2:
                    o = correspondance[k1]
                    d = correspondance[k2]
                    gx.add_edge(o, d)
It remains to combine the two functions _add_nodes and _add_edges to build a new graph from scratch. That writes out:
def build_graph(database):
    gx = nx.Graph()
    correspondance = _add_nodes(gx, database)
    _add_edges(gx, database, correspondance)
    return gx, correspondance
We build a new graph with (note that build_graph returns a tuple):
>>> g, correspondance = build_graph(bib_database)
>>> g.number_of_nodes()
7487
>>> g.number_of_edges()
19046
So, in the end we get a graph with as many nodes as unique keys, and 19046 coauthorship relations :)
It remains to dump the graph in graphml format for future use:
>>> nx.write_graphml(g, "coauthors.graphml")
That allows us to load the graph in any graph library we want, as graphml is a standard format for graphs.
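For instance, the dump can be loaded back directly with networkx (note that graphml stores node ids as strings):
>>> g2 = nx.read_graphml("coauthors.graphml")
>>> g2.number_of_edges()
19046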
Recap
So, we finally managed to build a coauthorship network from a bibtex file. The network we output corresponds to the coauthorship network of the optimization community in France. Let's emphasize the biases our process introduces:
- Almost 80% of the articles we processed were written after 2010. So we are missing the early history of the community and focus mostly on the most recent activity. Most of the seminal papers in optimization are not taken into account (think about the works of Jean-Jacques Moreau or Claude Lemaréchal).
- No matter how frequently two authors write together, we assign a unit weight to each connection. Indeed, each time we add a coauthorship relation associated with an already existing edge, the function add_edge overwrites the previous edge (a weighted variant is sketched after this list).
- We rely blindly on HAL's API, notably on its classification into subfields. However, whereas some authors publish their articles exclusively in the math-oc subfield, others disseminate their articles across different subfields (operations research, automatic control, signal processing). We miss authors who publish in different communities, and they are more and more common these days.
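If we wanted edge weights counting the number of joint papers instead, a possible variant of the innermost block of _add_edges would be (a minimal sketch; the o < d guard avoids counting each unordered pair twice per article):
if k1 != k2:
    o = correspondance[k1]
    d = correspondance[k2]
    if o < d:
        if gx.has_edge(o, d):
            # One more joint paper between these two authors
            gx[o][d]["weight"] += 1
        else:
            gx.add_edge(o, d, weight=1)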
Note also that despite all my care, there might remain some mistakes in the metadata extracted from the bibtex file. Notably, if you have a better idea to improve the key we assign to each name, feel free to update the code on Github.
In a future blog post, we will analyze the structure of the graph of co-authors more closely, using LightGraphs, a Julia library dedicated to network analysis.