Georesolution

The georesolution step takes the tagged text file as input and processes the location entities to give them spatial co-ordinates. The chosen gazetteer is queried to produce a list of candidate locations for each toponym and these are ranked, with the highest ranking one chosen to be shown as a green marker on the map display, or as the only marker if the -top option is used.

The tagged text file produced by the geotagging step contains further markup - for other entity categories besides location (person, organisation, time expressions) and for temporal events, which are expressed as binary relations between pairs of entities. Although obviously the geoparser’s main business is with spatial entities, the temporal relations are processed at the end of the georesolution step, to produce a timeline display of events detected in the text.

The input file for this step is in a temporary file, labelled “tmp-temprel” in the flowcharts of the Overview chapter; see Georesolution flowchart. The actual file will be in the /tmp directory, with a name that includes the username of the process in which the script was run and a unique string generated from the name of the script that’s running and its process number, suffixed in this case with “temprel” to identify the content, eg “$USER-run-5648-temprel”. These temporary files are removed when the pipeline exits unless the $LXDEBUG environment variable is set, in which case they are kept for examination.
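The scripts generate these names themselves, but the convention can be sketched in Python; the function below is purely illustrative (the helper name and the exact field order are assumptions, not part of the pipeline):

```python
import os

def temp_file_name(script_name, pid, suffix, tmp_dir="/tmp"):
    """Illustrative sketch of the temporary-file naming convention
    described above: username, script name, process number and a
    content suffix. The real names are built by the pipeline's shell
    scripts, so treat this as an approximation only."""
    user = os.environ.get("USER", "unknown")
    return os.path.join(tmp_dir, f"{user}-{script_name}-{pid}-{suffix}")

# e.g. temp_file_name("run", 5648, "temprel") -> "/tmp/<user>-run-5648-temprel"
```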

The final output file - written to $outdir.out.xml if -o outdir is specified, and to stdout otherwise - is described at output file in the Practical Examples chapter, and there is an example file here (html documentation only). It differs from the “tmp-temprel” file only in respect of the location entities. In the unprocessed temprel file these look like this:

<ent type="location" id="rb6">
 <parts>
  <part sw="w148" ew="w148">Toronto</part>
 </parts>
</ent>

The georesolution step adds extra attributes to this element, from the Geonames gazetteer in this example:

<ent id="rb6" type="location" lat="43.7001138" long="-79.4163042"
     in-country="CA" gazref="geonames:6167865" feat-type="ppl"
     pop-size="4612191">
 <parts>
  <part ew="w148" sw="w148">Toronto</part>
 </parts>
</ent>
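A resolved entity like this can be read with standard XML tooling; for example, a minimal Python sketch:

```python
import xml.etree.ElementTree as ET

# The resolved <ent> element from the example above.
resolved = """<ent id="rb6" type="location" lat="43.7001138" long="-79.4163042"
     in-country="CA" gazref="geonames:6167865" feat-type="ppl"
     pop-size="4612191">
 <parts>
  <part ew="w148" sw="w148">Toronto</part>
 </parts>
</ent>"""

ent = ET.fromstring(resolved)
name = ent.find("parts/part").text          # the toponym string
lat = float(ent.get("lat"))                 # latitude added by georesolution
lon = float(ent.get("long"))                # longitude added by georesolution
print(name, lat, lon)                       # Toronto 43.7001138 -79.4163042
```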

This is the top-ranked candidate, http://www.geonames.org/6167865/toronto.html. The other candidates are listed in $outdir/gaz.xml - see example file here (html documentation only). In this example there were 20 candidates for Toronto, which is the maximum number the geoparser considers. The first five are shown below:

<placenames>
 <placename id="rb6" name="Toronto">
  <place rank="1" score="1.762934636" scaled_type="0.8" scaled_pop=
    "0.9327814568" scaled_contained_by="0" scaled_contains="0" scaled_near="0"
    in-cc="CA" long="-79.4163" lat="43.70011" type="ppla" gazref=
    "geonames:6167865" name="Toronto" pop="4612191" clusteriness="870.3494166"
    scaled_clusteriness="0.03015317872" clusteriness_rank="9" locality="0"
    distance-to-known="99999" scaled_known="0"/>
  <place rank="2" score="1.363160631" scaled_type="0.4" scaled_pop=
    "0.9327814568" scaled_contained_by="0" scaled_contains="0" scaled_near="0"
    in-cc="CA" long="-79.66632" lat="43.60012" type="rgn" gazref=
    "geonames:6167864" name="Toronto" pop="4612191" clusteriness="869.4440736"
    scaled_clusteriness="0.03037917422" clusteriness_rank="8" locality="0"
    distance-to-known="99999" scaled_known="0"/>
  <place rank="3" score="1.162435057" scaled_type="0.2" scaled_pop=
    "0.9327814568" scaled_contained_by="0" scaled_contains="0" scaled_near="0"
    in-cc="CA" long="-79.61286" lat="43.68066" type="fac" gazref=
    "geonames:6296338" name="Toronto Pearson International Airport"
    pop="4612191" clusteriness="872.3540873" scaled_clusteriness=
    "0.02965359988" clusteriness_rank="10" locality="0" distance-to-known=
    "99999" scaled_known="0"/>
  <place rank="4" score="0.6922152501" scaled_type="0.6" scaled_pop="0"
    scaled_contained_by="0" scaled_contains="0" scaled_near="0" in-cc="US"
    long="-92.52546" lat="38.00365" type="ppl" gazref="geonames:4411872"
    name="Toronto" clusteriness="653.9875787" scaled_clusteriness=
    "0.09221525012" clusteriness_rank="1" locality="0" distance-to-known=
    "99999" scaled_known="0"/>
  <place rank="5" score="0.6883702413" scaled_type="0.6" scaled_pop="0"
    scaled_contained_by="0" scaled_contains="0" scaled_near="0" in-cc="US"
    long="-89.62982" lat="39.71394" type="ppl" gazref="geonames:4251360"
    name="Toronto" clusteriness="665.6708161" scaled_clusteriness=
    "0.08837024133" clusteriness_rank="2" locality="0" distance-to-known=
    "99999" scaled_known="0"/>
 ...
 </placename>
 ...
</placenames>

There is one <placename> element for each distinct placename found in the input document - note, not for each individual mention. If a place is mentioned multiple times in a document, the geoparser assumes that the same place is being referred to each time. Clearly there are examples where this would be an erroneous assumption, eg in the text snippet:

“Are we talking about London, England or London, Ontario?”

There is in fact a special rule to catch containment expressed in this co-ordinated way, but nevertheless the current version of the geoparser will only pick a single location for London (the first one, in England).
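For downstream processing it can be useful to pull out just the rank-1 candidate for each placename from $outdir/gaz.xml; a minimal sketch using Python's standard library (the sample below trims the attributes for brevity):

```python
import xml.etree.ElementTree as ET

def top_candidates(gaz_xml):
    """Map each <placename> id to the attributes of its rank-1 <place>."""
    root = ET.fromstring(gaz_xml)
    best = {}
    for pn in root.findall("placename"):
        for place in pn.findall("place"):
            if place.get("rank") == "1":
                best[pn.get("id")] = place.attrib
    return best

# Abbreviated gaz.xml fragment; real entries carry many more attributes.
sample = """<placenames>
 <placename id="rb6" name="Toronto">
  <place rank="2" gazref="geonames:6167864" lat="43.60012" long="-79.66632"/>
  <place rank="1" gazref="geonames:6167865" lat="43.70011" long="-79.4163"/>
 </placename>
</placenames>"""

print(top_candidates(sample)["rb6"]["gazref"])  # geonames:6167865
```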

The rest of the output files produced if -o is specified are for visualisation in a browser.

The rest of this chapter looks at each step of the georesolution process in a little more detail: firstly the collection of candidate places from the gazetteer, then the ranking process and finally the production of display files.

Gazetteer Lookup

The run script calls another, named geoground, which carries out two tasks by calling further scripts. The first is gazetteer lookup, done by the geogaz script which calls a version of gazlookup tailored for the gazetteer and including the gazetteer name. So for example, if -g geonames were specified to the run script then gazlookup-geonames would be used at this point, whereas if Pleiades+ were required then gazlookup-plplus would be invoked.

If you look in the scripts directory you will find a collection of these gazlookup scripts, most of them completely separate routines, because the connection methods and queries to be used differ greatly between gazetteers. The “Unlock” option is an exception: it has three variants - “Unlock”, “OS” and “Natural Earth” (see the -t and -g parameters) - which can be dealt with by parameterisation within a single script, gazlookup-unlock. Soft links to this script cover the other two variants because, to make it straightforward to add new gazetteer options, the geogaz script simply looks for a script named gazlookup-$gaz, where “$gaz” is the value of the -g command line parameter.

This means that to add a new gazetteer to the pipeline, all you need to do is create a script named gazlookup-newgaz that handles the connection and querying appropriately and returns a set of candidates formatted as required for the next stage, and then alter the run script to accept “newgaz” as a valid -g option. Of course, if the domain covered by the new gazetteer is completely new, then alterations to the geotagging stage would also be needed - as was the case, for example, when the Pleiades gazetteer of ancient places was added to cater for classical texts.
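The naming convention can be sketched as follows (the helper function and the scripts directory path are illustrative, not part of the geoparser's code):

```python
from pathlib import Path

def gazlookup_script(gaz, scripts_dir="scripts"):
    """Sketch of the dispatch convention described above: the -g value
    is mapped directly to a script named gazlookup-<gaz> in the scripts
    directory. The real dispatch happens inside the geogaz shell script."""
    return Path(scripts_dir) / f"gazlookup-{gaz}"

# gazlookup_script("geonames") -> scripts/gazlookup-geonames
# gazlookup_script("plplus")   -> scripts/gazlookup-plplus
```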

The input to the gazlookup-$gaz step is a list of the locations found in the input, extracted by an XSL stylesheet named extractlocs.xsl. The list is formatted as shown in this example:

<?xml version="1.0" encoding="UTF-8"?>
<placenames>
  <placename id="rb6" name="Toronto"/>
  <placename id="rb11" name="Germany"/>
  <placename id="rb14" name="Washington"/>
  <placename id="rb22" name="Montreal"/>
  <placename id="rb28" name="Wimbledon"/>
  <placename id="rb32" name="France"/>
</placenames>

The output of the gazetteer lookup is a collection of up to 20 candidate <place> nodes for each <placename>. The final step of the geogaz script is to sort and deduplicate - as explained above, the assumption is made that multiple references to the same toponym string within a single document are referring to the same place.
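The sort-and-deduplicate assumption can be illustrated with a small sketch (the helper name is invented; the real geogaz script works on the XML directly):

```python
def dedupe_placenames(mentions):
    """Collapse repeated mentions of the same toponym string to a single
    entry, keeping the id of the first mention, then sort by name - a
    crude stand-in for the sort-and-deduplicate step described above."""
    seen = {}
    for pid, name in mentions:
        seen.setdefault(name, pid)   # first mention wins
    return sorted((name, pid) for name, pid in seen.items())

mentions = [("rb6", "Toronto"), ("rb14", "Washington"), ("rb40", "Toronto")]
print(dedupe_placenames(mentions))
# [('Toronto', 'rb6'), ('Washington', 'rb14')]
```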

The output of this stage is in a temporary file suffixed “gazunres.xml”, following the naming conventions described above. An example is here (html documentation only). It contains feature information extracted from the gazetteer for each candidate location, to be used by the ranking algorithm. The first few lines for our example are as follows:

<placenames>
  <placename name="Toronto" id="rb6">
    <place name="Toronto" gazref="geonames:149454" type="ppl"
      lat="-4.9000000" long="38.1000000" in-cc="TZ" pop="0"/>
    <place name="Toronto" gazref="geonames:2146222" type="ppl"
      lat="-33.0000000" long="151.6000000" in-cc="AU" pop="0"/>
    <place name="Toronto" gazref="geonames:3535110" type="ppl"
      lat="22.7833300" long="-82.5000000" in-cc="CU" pop="0"/>
    <place name="Toronto" gazref="geonames:3666869" type="ppl"
      lat="8.4039600" long="-75.2790700" in-cc="CO" pop="0"/>
    ...

This example makes clear the need for ranking over a reasonable number of candidates, at least for a gazetteer like Geonames which returns so many candidates for most placenames. For Toronto, the first four places returned were in Tanzania, Australia, Cuba and Colombia. We do not reach a Canadian place until numbers 13 and 14 in the list. For many places Geonames will return an extremely long list; the geoparser truncates the results at 20, which will almost always include the right one and keeps the ranking process manageable in terms of processing time.

Ranking

The ranking of the <place> candidates is done by the georesolve script. If the gazetteer supplies feature information the ranking makes use of it, for example preferring populated places (Geonames code “PPL”) over natural features, and preferring larger to smaller places (based on population size).

Apart from the attributes of the candidate places, the ranking algorithm considers their locations compared pair-wise with each of the other places in the document. It will prefer places that cluster with other locations in the same document. For example, if most of the places mentioned in a text seem to be in Canada, a mention of “London” will probably be placed in Ontario rather than England.
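The actual clustering computation is internal to the georesolve script, but the pairwise-distance idea can be sketched as follows. The coordinates are approximate and the "smallest mean distance" rule is a crude stand-in for the real clustering score, not the geoparser's algorithm:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

def closest_to_context(candidates, context):
    """Pick the candidate with the smallest mean distance to the other
    places mentioned in the document."""
    return min(candidates,
               key=lambda c: sum(haversine_km(c[1], p) for p in context) / len(context))

# Two candidates for "London", with other document places in Canada.
candidates = [("London, England", (51.507, -0.128)),
              ("London, Ontario", (42.984, -81.246))]
context = [(43.700, -79.416),   # Toronto
           (45.502, -73.567)]   # Montreal
print(closest_to_context(candidates, context)[0])  # London, Ontario
```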

If you know the geographical area that your input document deals with, you can specify either a locality circle or a locality box using the -l or -lb command line options. These are explained in the Quick Start chapter, Limiting geographical area: -l -lb. This is another factor considered by the ranker: it will prefer locations in the specified area while still allowing the selection of places elsewhere that may be mentioned in the text. The “score” parameter can be used to weight the degree of preference; if you use this option it is probably best to experiment with different weights.

The output of the georesolve ranking step is the $outdir/gaz.xml file that was described above. It is a ranked list of <place> candidates for each <placename>. The candidates carry the features from the gazetteer plus the extra attributes added by the ranking algorithm, such as “clusteriness”, which measures how well the place clusters spatially with the other locations mentioned in the document. The raw scores are scaled and combined to produce an overall “score” attribute, which in turn determines the “rank” of each candidate <place>. See the sample output here (html documentation only).
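In the sample gaz.xml shown earlier, the overall score works out as the sum of the scaled_* attributes (0.8 + 0.9327814568 + 0.0301531787 ≈ 1.762934636 for the rank-1 Toronto). A sketch of that combination, assuming the unweighted sum that the sample output exhibits; the real script may apply configurable weights:

```python
def overall_score(place):
    """Sum the scaled_* attributes of a <place> element's attribute
    dict. This unweighted sum reproduces the 'score' attribute in the
    sample output above; treat it as an observation, not a guarantee
    of the ranking algorithm's exact formula."""
    return sum(float(v) for k, v in place.items() if k.startswith("scaled_"))

# Attributes of the rank-1 Toronto candidate from the sample gaz.xml.
rank1 = {"scaled_type": "0.8", "scaled_pop": "0.9327814568",
         "scaled_contained_by": "0", "scaled_contains": "0",
         "scaled_near": "0", "scaled_clusteriness": "0.03015317872",
         "scaled_known": "0"}
print(round(overall_score(rank1), 9))  # 1.762934636
```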

It is worth noting here that, for various reasons including the clustering factor, the geoparser works better with short texts than with very long ones. It was originally designed to handle large numbers of short text documents (roughly one page at a time) processed in a loop. If an attempt is made to process an entire book in one go, the ranking algorithm may be overloaded - the pairwise comparison of locations throughout the document may break it - and in any case the assumption of a single locality will probably be invalid. We advise splitting long texts into small parts, preferably into coherent chunks of narrative.
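A minimal sketch of such splitting, assuming plain-text input with blank-line paragraph breaks (the chunk size is an arbitrary illustrative choice, not a geoparser parameter):

```python
def chunk_text(text, max_chars=3000):
    """Split a long text into roughly page-sized chunks at paragraph
    boundaries, so that each chunk can be geoparsed separately."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk before this paragraph would overflow it.
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Splitting blindly by size risks cutting a narrative mid-episode; where possible, chunk at chapter or section boundaries so that each chunk's locations really do cluster together.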

Formatting Output

If the -o outdir option is not specified then the output of the pipeline is written to standard out (and can of course be redirected to a file), and consists of a single xml <document> as described at output file in the Practical Examples chapter, with an example file here (html documentation only). The output is a tagged version of the input file, in standoff xml format, with the <document> node having <text> and <standoff> children (plus a metadata node).

The placenames are tagged entities within the text, appearing as <ent> nodes in the standoff section with pointers back to their position in the tokenised text. Only the top candidate for each place is included in this output, as a tagged entity, such as:

<ent id="rb6" type="location" lat="43.70011" long="-79.4163"
     gazref="geonames:6167865" in-country="CA" feat-type="ppla"
     pop-size="4612191">
  <parts>
    <part ew="w150" sw="w150">Toronto</part>
  </parts>
</ent>

The ranking detail is removed and only the most important gazetteer features are retained: the latitude and longitude co-ordinates, and (for Geonames which supplies them) the country and feature type codes and population.

If the -o outdir option is specified then the georesolution component has several extra steps, which are simply reformatting of all the output generated so far, using XSL stylesheets to produce a collection of files for visualising the output. These steps are illustrated on the Georesolution flowchart.

The “plainvis.xsl” stylesheet is used to format the input text as an html page with the toponyms highlighted; DEEP has a special version which adds links back to the source gazetteer. The gazmap script pulls this html page together with the xml list of candidate placename locations (in the $outdir/gaz.xml file described earlier) and adds a map display created by plotting the locations using Google Maps. The three components are combined in a single file named $outdir.display.html. Various examples are shown in the Practical Examples chapter, including Geoparser display file for news text input, which has the maps panel at the top (green markers for top candidates, red for others), the tagged text on the left and the $outdir/gaz.xml list on the right.

If the -top option is specified then an additional set of files is created, with only the top candidate locations (green markers) retained. Herodotus display file shows an example.

Finally, the timeline script takes the tagged file and produces a display highlighting all the entities found: person names, organisations and time expressions as well as locations. It also extracts the events detected and, where these can be given a specific date, uses JavaScript to create a timeline visualisation using a Simile widget. Timeline file shows an example of the $outdir.timeline.html file. The events found are listed in $outdir.events.xml, which is in the format required by the Timeline widget, as illustrated below:

<?xml version="1.0" encoding="UTF-8"?>
<data date-time-format="iso8601">
  <event start="2010-08-15T00:00:00Z" title="will face each other for a place in Sunday">
   Nadal and Murray set up semi showdown (CNN) -- Rafael Nadal and Andy
   Murray are both through to the semifinals of the Rogers Cup in Toronto,
   where they will face each other for a place in Sunday's final.
 </event>
 ...
</data>

The complete file for this example is here (html documentation only).

In summary, with the -o out option, the following files are created:

File Description
$out.out.xml Main output: tagged and geogrounded text
$out.gaz.xml Locations list
$out.gazlist.html Locations list in html format
$out.gazmap.html Locations plotted on Google maps
$out.geotagged.html Geotagged text as html file
$out.display.html 3-panel display: map + text + locations list
$out.gazlist-top.html Top-ranked candidate list in html format
$out.gazmap-top.html Top-ranked locations plotted on Google maps
$out.display-top.html 3-panel display: map + text + top-locations list
$out.nertagged.xml Output from NER stage
$out.events.xml Events extracted in Timeline format
$out.timeline.html Display page with all NEs and timeline

The three “*-top*” files are only produced if the -top option is used.