Crawler for Contracts and Tenders

This document explains how Base provides its data and how the crawler works.

Important

Please take precautions when using the crawler, as it can cause a Denial of Service (DoS) on the Base database. We provide remote access to our database precisely to avoid that.

Important

Crawling Base from scratch takes more than two days (as of January 2014).

Base database

Base exposes its data through the following URLs:

  1. Entity: http://www.base.gov.pt/base2/rest/entidades/[base_id]
  2. Contract: http://www.base.gov.pt/base2/rest/contratos/[base_id]
  3. Tender: http://www.base.gov.pt/base2/rest/anuncios/[base_id]
  4. List of Country: http://www.base.gov.pt/base2/rest/lista/paises
  5. List of District: http://www.base.gov.pt/base2/rest/lista/distritos?pais=[country_base_id] (Portugal's base_id is 187)
  6. List of Council: http://www.base.gov.pt/base2/rest/lista/concelhos?distrito=[district_base_id]
  7. List of ContractType: http://www.base.gov.pt/base2/rest/lista/tipocontratos
  8. List of ProcedureType: http://www.base.gov.pt/base2/rest/lista/tipoprocedimentos

Each URL returns JSON with information about the particular object. For this reason, we have an abstract crawler that retrieves this information and maps it to this API.
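
For illustration, here is a minimal sketch of retrieving one of these URLs with Python's standard library; the actual crawler wraps this logic in the JSONCrawler abstraction:

    import json
    from urllib.request import urlopen

    ENTITY_URL = 'http://www.base.gov.pt/base2/rest/entidades/%d'

    def get_json(url):
        """Retrieves a Base URL and decodes its JSON payload."""
        with urlopen(url) as response:
            return json.loads(response.read().decode('utf-8'))

    # e.g. the entity with base_id 1, and the list of countries (link 4.)
    entity = get_json(ENTITY_URL % 1)
    countries = get_json('http://www.base.gov.pt/base2/rest/lista/paises')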

What the crawler does

The crawler accesses Base URLs using the same procedure for entities, contracts and tenders. It does the following:

  1. retrieves c1_ids, the list of ids in rows [i*10k, (i+1)*10k], from links 4., 5. or 6.;
  2. retrieves all ids in the range [c1_ids[0], c1_ids[-1]] from our db, c2_ids;
  3. adds, using links 1., 2. or 3., all ids in c1_ids that are not in c2_ids;
  4. removes, using links 1., 2. or 3., all ids in c2_ids that are not in c1_ids;
  5. goes back to 1. with i += 1, until it covers all items.

The initial value of i is 0 when the database is populated from scratch; when searching only for new items, it is chosen such that a single cycle of steps 1.-5. is performed.
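
Schematically, one cycle of steps 1.-4. amounts to two set differences. A self-contained sketch, with plain Python sets standing in for the BASE endpoints and our database:

    def synchronize_batch(remote_ids, local_ids, add, remove):
        """One cycle of steps 1.-4.: reconcile one batch of ids."""
        for base_id in set(remote_ids) - set(local_ids):
            add(base_id)     # step 3: fetch via links 1., 2. or 3. and save
        for base_id in set(local_ids) - set(remote_ids):
            remove(base_id)  # step 4: no longer in BASE, so drop it

    # toy run: BASE reports ids 1-5, we hold 4-7 -> add 1-3, remove 6-7
    db = {4, 5, 6, 7}
    synchronize_batch({1, 2, 3, 4, 5}, db, db.add, db.remove)
    assert db == {1, 2, 3, 4, 5}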

API references

This section introduces the different crawlers we use to crawl Base.

class contracts.crawler.ContractsStaticDataCrawler

A subclass of JSONCrawler for static data. This crawler only needs to be run once; it is used to populate the database the first time.

retrieve_and_save_all()

Retrieves and saves all static data of contracts.

Given the size of the Base database, and since it is constantly being updated, contracts, entities and tenders use the following approach:

class contracts.crawler.DynamicCrawler

An abstract subclass of JSONCrawler that implements the crawling procedure described in the previous section.

object_name = None

A string with the name of the object used to name the .json files; to be overwritten.

object_url = None

The URL used to retrieve data from BASE; to be overwritten.

object_model = None

The model to be constructed from the retrieved data; to be overwritten.

static clean_data(data)

Cleans data, returning a cleaned_data dictionary with keys being fields of the object_model and values being extracted from data.

To be overwritten by subclasses.

save_instance(cleaned_data)

Saves or updates an instance of type object_model using the dictionary cleaned_data.

This method can be overwritten to change how the instance is saved.

Returns a tuple (instance, created) where created is True if the instance was created (and not just updated).

update_instance(base_id)

Uses get_json(), clean_data() and save_instance() to create or update an instance identified by base_id.

Returns the output of save_instance().
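
In other words, update_instance chains the three previous methods. A hedged sketch of that flow (assuming get_json takes the full URL, which is an assumption about its signature):

    def update_instance(self, base_id):
        data = self.get_json(self.object_url % base_id)  # raw JSON from BASE
        cleaned_data = self.clean_data(data)             # dict of model fields
        return self.save_instance(cleaned_data)          # (instance, created)

The (instance, created) convention mirrors Django's get_or_create().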

get_instances_count()

Returns the total number of existing instances in BASE db.

get_base_ids(row1, row2)

Returns the list of base ids of the BASE instances between rows row1 and row2 (a list of length row2 - row1).

update_batch(row1, row2)

Updates a batch of rows, steps 2.-4. of the previous section.

update(start=0, end=None, items_per_batch=1000)

Retrieves the count of all items in BASE (1 hit) and synchronizes items from start until min(end, count) in batches of items_per_batch.

If end=None (default), it retrieves until the last item.

If start < 0, the start is counted from the end.

Use e.g. start=-2000 for a quick retrieval of new items; use start=0 (default) to synchronize all items in the database (it takes time!).
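
For example, with the EntitiesCrawler described below:

    from contracts.crawler import EntitiesCrawler

    crawler = EntitiesCrawler()
    crawler.update(start=-2000)  # quick pass over the newest items
    crawler.update()             # full synchronization (it takes time!)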

class contracts.crawler.EntitiesCrawler

A subclass of DynamicCrawler to populate the Entity table.

Overwrites clean_data() to clean data into an Entity (see the sketch after the list below).

Uses:

  • object_directory: '../../data/entities';
  • object_name: 'entity';
  • object_url: 'http://www.base.gov.pt/base2/rest/entidades/%d';
  • object_model: Entity.
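
Schematically, such a subclass looks roughly like the following. The attribute values are the ones listed above; the JSON keys and Entity fields inside clean_data are illustrative assumptions, not BASE's actual field names:

    from contracts.crawler import DynamicCrawler
    from contracts.models import Entity  # assumed location of the model

    class EntitiesCrawler(DynamicCrawler):
        object_directory = '../../data/entities'
        object_name = 'entity'
        object_url = 'http://www.base.gov.pt/base2/rest/entidades/%d'
        object_model = Entity

        @staticmethod
        def clean_data(data):
            # hypothetical mapping from BASE's JSON to Entity fields
            return {'base_id': int(data['id']),
                    'name': data['description'],
                    'nif': data['nif']}
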
class contracts.crawler.ContractsCrawler

A subclass of DynamicCrawler to populate the Contract table.

Overwrites clean_data() to clean data into a Contract, and save_instance() to also save the ManyToMany relationships of the Contract.

Uses:

  • object_directory: '../../data/contracts';
  • object_name: 'contract';
  • object_url: 'http://www.base.gov.pt/base2/rest/contratos/%d';
  • object_model: Contract.

class contracts.crawler.TenderCrawler

A subclass of DynamicCrawler to populate the Tender table.

Overwrites clean_data() to clean data into a Tender, and save_instance() to also save the ManyToMany relationships of the Tender.

Uses:

  • object_directory: '../../data/tenders';
  • object_name: 'tender';
  • object_url: 'http://www.base.gov.pt/base2/rest/anuncios/%d';
  • object_model: Tender.
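
Putting the pieces together, a first population of the database could look like the following. Running EntitiesCrawler before ContractsCrawler and TenderCrawler is an assumption here, based on the ManyToMany relationships mentioned above:

    from contracts.crawler import (ContractsStaticDataCrawler, EntitiesCrawler,
                                   ContractsCrawler, TenderCrawler)

    ContractsStaticDataCrawler().retrieve_and_save_all()  # static data, run once
    EntitiesCrawler().update()   # entities first: contracts refer to them
    ContractsCrawler().update()
    TenderCrawler().update()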

Crawler for Categories

The European Union has a categorisation system for public contracts, CPVS, which translates a string of 8 digits into a category to be used in public contracts.

More than a flat set of categories, this system is a tree, with broader categories like “agriculture” and more specific ones like “potatoes”.

They provide the fixture as an XML file, which we import:

contracts.categories_crawler.build_categories()

Constructs the tree of categories.

Gets the most general categories and saves them, repeating recursively for more specific categories until it reaches the leaves of the tree.

The categories are retrieved from the internet.
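
A toy sketch of this recursive construction; the codes and parent/child pairs below are illustrative stand-ins for what the XML fixture provides:

    # parent code -> children; None is the (virtual) root of the tree
    CHILDREN = {
        None: ['03000000'],        # a most general category
        '03000000': ['03200000'],  # a more specific one
        '03200000': [],            # a leaf
    }

    def build_categories(parent=None):
        for code in CHILDREN.get(parent, ()):
            print('saving category %s (parent: %s)' % (code, parent))
            build_categories(code)  # recurse into more specific categories

    build_categories()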