1 Introduction
2 Related work
2.1 Extract-Transform-Load frameworks
2.2 RDF data management
2.2.1 RDB-to-RDF technologies
2.2.2 RDF storage
3 Design and implementation
3.1 RDFAdaptor framework
Figure 1. RDFAdaptor framework.
3.2 RDFAdaptor implementation
Figure 2. Front-end interface screenshot.
Figure 3. Workflow of RDF data generation with RDFizer.
4 RDFAdaptor application
4.1 RDF data generation
Table 1 Parameters defined in RDFizer.
Parameter group | Parameter | Description
---|---|---
Namespace | Prefix | collections of names identified by URI references
Namespace | Namespace | different prefixes depending on the required namespaces
Mapping Setting | Subject URI | HTTP URI template for the Subject/Resource; the placeholder {sid} is replaced by the UniqueKey
Mapping Setting | Class Types | the classes to which the resource belongs; multiple class types are supported (separated by semicolons), such as skos:Concept; foaf:Person
Mapping Setting | UniqueKey | the unique and stable primary key of the resource, used as part of the Subject URI
Mapping Setting | Fields Mapping Parameters | a list of field mappings from the selected data source to the target RDF schema, including the input Stream Field, Predicates, Object URIs, Multi-Values Separator, Data Type, Lang Tag
Dataset Metadata | Meta Subject URI | URI pattern of the generated dataset
Dataset Metadata | Meta Class Types | the classes to which the resource belongs
Dataset Metadata | Parameters | a list of descriptions of the generated dataset, including PropertyType, Predicates, Object Values, DataType, Lang Tag
Output Setting | File system setting | options for file system storage, including Filename and RDF format
Output Setting | RDF store setting | options for the RDF store, including triple store name, server URL, Repository ID, Username (if any), Password, Graph URI
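The mapping parameters in Table 1 can be illustrated with a minimal sketch. It is not the RDFizer implementation: the function name `row_to_triples`, the dictionary shape of `field_map`, and the example record are all hypothetical, and only the `{sid}` placeholder, the semicolon-separated class types, and the multi-value separator come from the table.

```python
def row_to_triples(row, subject_template, unique_key, class_types, field_map, sep=";"):
    """Turn one input record into (subject, predicate, object) tuples.

    Hypothetical helper: it mirrors the Table 1 parameters, not RDFizer's code.
    """
    # The {sid} placeholder in the Subject URI template is replaced by the UniqueKey value.
    subject = subject_template.replace("{sid}", str(row[unique_key]))
    triples = []

    # One rdf:type triple per class; class types are separated by semicolons.
    for cls in (c.strip() for c in class_types.split(";")):
        if cls:
            triples.append((subject, "rdf:type", cls))

    # field_map maps an input stream field to a predicate (assumed structure).
    for field, predicate in field_map.items():
        value = row.get(field)
        if value is None or value == "":
            continue
        # The multi-values separator splits one field into several objects.
        for part in str(value).split(sep):
            part = part.strip()
            if part:
                triples.append((subject, predicate, part))
    return triples


# Made-up record, used only to exercise the sketch.
example_row = {"id": 42, "name": "Ada Lovelace", "keywords": "math;computing"}
example_triples = row_to_triples(
    example_row,
    subject_template="http://example.org/person/{sid}",
    unique_key="id",
    class_types="skos:Concept; foaf:Person",
    field_map={"name": "foaf:name", "keywords": "dc:subject"},
)
```

The sketch leaves out the Data Type and Lang Tag parameters and any serialization to a concrete RDF format, which a real pipeline would still need.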
4.2 RDF translation and loading
Figure 4. Configuration template of RDFTranslatorAndLoader.
Table 2 Parameters defined in RDFTranslatorAndLoader.
Parameter group | Parameter | Description
---|---|---
Input | Source | RDF triples to be converted or loaded
Input | Source Type | type of data source, such as local file system, remote URL or string stream
Input | Source RDF Format | format of the input RDF data; the common RDF formats are fully supported
Input | Large Input Triples | a selector indicating whether the input is large; if it is, the output step cannot count, merge or split the triples
Advance | BaseIRI | base IRI against which relative IRIs in the RDF data are resolved
Advance | BNode | a selector for preserving BNode IDs
Advance | Verify URI syntax / Verify relative URIs / Verify language tags / Verify datatypes | selectors for checking URI syntax, relative URIs, language tags and datatypes respectively; a failure log is returned when the corresponding error occurs
Advance | Language tags / Datatype | selectors for language tag and datatype handling, including failing the parse if a language or datatype is not recognised and normalizing recognised language tag or datatype values
Output | Target RDF Format | RDF format of the converted output
Output | Commit or Split Size | number of RDF triples written to each RDF file or submitted to the store per batch; the default value is 0, which means all the input data is processed at once
Output | Local File Setting | options for file system storage, including three selectors, “Save to File System”, “Keep Source FileName” and “Merge to Single File” (which takes precedence over “Commit or Split Size”), plus file name and location
Output | TripleStore Setting | options for the RDF store, including a selector for “Save to Store”, Triple Store, Server URL, Database/RepositoryID/NameSpace (the database identifier, which differs between triple stores), UserName, Password, and Graph URI
Output | Stream setting | options for the string stream used for further data transfer, including a selector for “Save to Stream” and the Result Field
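The interaction between “Commit or Split Size” and “Merge to Single File” described in Table 2 can be sketched as follows. This is an illustrative assumption about the batching logic, not code taken from RDFTranslatorAndLoader; the function and parameter names are hypothetical.

```python
def split_into_batches(triples, size=0, merge_to_single_file=False):
    """Group triples into output batches (hypothetical helper).

    size <= 0 follows the table's default: everything is processed at once.
    merge_to_single_file takes precedence over the split size.
    """
    triples = list(triples)
    if not triples:
        return []
    if merge_to_single_file or size <= 0:
        return [triples]
    # Otherwise emit consecutive chunks of at most `size` triples.
    return [triples[i:i + size] for i in range(0, len(triples), size)]


# Five placeholder items stand in for parsed triples.
batches = split_into_batches(range(5), size=2)
single = split_into_batches(range(5), size=2, merge_to_single_file=True)
```

Each batch would then correspond to one output file or one commit to the triple store. The “Large Input Triples” restriction is not modelled here.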
4.3 RDF data migration
Table 3 Parameters defined in SparqlIn.
Parameter group | Parameter | Description
---|---|---
SPARQL Setting | Accept URL from field | checkbox; if checked, the URL of the SPARQL endpoint comes from Kettle's previous steps and its value is taken from the “URL field name”
SPARQL Setting | URL field name | a drop-down list of input fields, used only when the option “Accept URL from field” is selected
SPARQL Setting | SPARQL Endpoint URL | endpoint URL queried when “Accept URL from field” is disabled
SPARQL Setting | Query Type | the query type, with two options: Graph query or Tuple query
SPARQL Setting | SPARQL Query | the SPARQL query form: SELECT or CONSTRUCT
SPARQL Setting | Limit | limit on the amount of data to be processed, if necessary
SPARQL Setting | Offset | the starting position of data processing
Output Setting | Result Field Name | field specified for file saving
Output Setting | RDF Format | target local data format: JSON, XML, CSV or TSV for a SELECT query; RDF formats only for a CONSTRUCT query
Output Setting | Max Rows | maximum size of the output file; empty or 0 means all the triples are retrieved
Http Auth | HTTP UserID | user ID for the SPARQL endpoint, if any
Http Auth | HTTP Password | password for the SPARQL endpoint, if a UserID exists
Figure 5. Configuration template of SPARQLIn and SPARQLUpdate.
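As an illustration of the Limit and Offset parameters in Table 3, the sketch below appends SPARQL `LIMIT` and `OFFSET` clauses to a query and enumerates page offsets for a paged dump. The helper names and the example query are hypothetical; how the plugin itself applies these parameters is not specified here.

```python
def paged_query(base_query, limit=0, offset=0):
    """Append LIMIT/OFFSET clauses to a SPARQL query string (hypothetical helper)."""
    query = base_query.strip()
    if limit and limit > 0:
        query += f"\nLIMIT {limit}"
    if offset and offset > 0:
        query += f"\nOFFSET {offset}"
    return query


def page_offsets(total, page_size):
    """Offsets needed to cover `total` results in pages of `page_size`."""
    return list(range(0, total, page_size))


# Example CONSTRUCT query; the ORDER BY keeps paging stable across requests.
base = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o"
pages = [paged_query(base, limit=1000, offset=o) for o in page_offsets(2500, 1000)]
```

Without an `ORDER BY`, SPARQL does not guarantee that successive `LIMIT`/`OFFSET` pages are consistent, so an ordered query is the safer choice when dumping a whole endpoint page by page.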
4.4 RDF graph update
Table 4 Parameters defined in SparqlUpdate.
Parameter group | Parameter | Description
---|---|---
SPARQL Setting | Query Endpoint Url From Field? | checkbox; if checked, the URL of the SPARQL Query Endpoint comes from Kettle's previous steps and its value is taken from the “Query Endpoint Url Field”
SPARQL Setting | Query Endpoint Url Field | a drop-down list of input fields, used only when the option “Query Endpoint Url From Field” is selected
SPARQL Setting | Query Endpoint Url | the Query Endpoint URL used when “Query Endpoint Url From Field” is unchecked
SPARQL Setting | Update Endpoint Url From Field? | checkbox; if checked, the URL of the SPARQL Update Endpoint comes from Kettle's previous steps and its value is taken from the “Update Endpoint Url Field”
SPARQL Setting | Update Endpoint Url Field | a drop-down list of input fields, used only when the option “Update Endpoint Url From Field” is selected
SPARQL Setting | Update Endpoint Url | the Update Endpoint URL used when “Update Endpoint Url From Field” is unchecked
SPARQL Setting | Query From Field? | checkbox; if checked, the SPARQL Update Query comes from Kettle's previous steps and its value is taken from the “Query Field Name”
SPARQL Setting | Query Field Name | used only when the option “Query From Field” is selected
SPARQL Setting | Base URI | base IRI against which relative IRIs in the RDF data are resolved
SPARQL Setting | SPARQL Update Query | JavaScript programming for graph update, used only when the option “Query From Field” is disabled
Output Setting | Result Field Name | field specified for file saving
Http Auth | HTTP UserID | user ID for the SPARQL endpoint, if any
Http Auth | HTTP Password | password for the SPARQL endpoint, if a UserID exists
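Table 4 repeats one pattern several times: a checkbox decides whether a value comes from a field of the incoming row or from a statically configured setting. A minimal sketch of that selection, together with base URI resolution, might look like this. The function names and example values are hypothetical. `urllib.parse.urljoin` performs generic URI reference resolution and is used here only as a stand-in for whatever the plugin does internally.

```python
from urllib.parse import urljoin


def resolve_setting(row, from_field, field_name, static_value):
    """Pick a setting from the incoming row or from the static configuration."""
    if from_field:
        # The "... From Field?" checkbox is ticked: read the value from the row.
        return row[field_name]
    return static_value


def resolve_iri(base_iri, iri):
    """Resolve a possibly relative IRI against the Base URI."""
    return urljoin(base_iri, iri)


# Made-up row, as it might arrive from a previous step.
row = {"endpoint": "http://example.org/repo/statements"}
update_url = resolve_setting(row, True, "endpoint", "http://localhost:8080/update")
fallback_url = resolve_setting(row, False, "endpoint", "http://localhost:8080/update")
absolute = resolve_iri("http://example.org/data/", "item/1")
```

The same selection logic would apply to the query endpoint URL, the update endpoint URL, and the update query text.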
5 Cases and evaluation
Table 5 RDF data generation/translation and loading.
Data Source | Data Format | Number of Records | Number of Mapped Fields | Number of RDF Triples Generated | Total Time
---|---|---|---|---|---
MongoDB | JSON | 1,948,268 | 17 | 37,038,563 | 32 min 18 s
SQL Server | RDB | 336,831 | 5 | 1,159,687 | 38.6 s
SQL Server | RDB | 798,389 | 9 | 7,521,876 | 5 min 4 s
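To compare the runs in Table 5 on a common scale, throughput can be derived as generated triples divided by elapsed seconds. The sketch below does this arithmetic, assuming that the count of generated RDF refers to triples. The run labels and the `parse_duration` helper are hypothetical, and the figures are copied from the table rather than measured independently.

```python
import re


def parse_duration(text):
    """Convert strings such as '32 min 18 s' or '38.6 s' to seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*min\s*)?(\d+(?:\.\d+)?)\s*s\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2))
    return minutes * 60 + seconds


# (label, triples generated, total time) copied from Table 5.
runs = [
    ("MongoDB, 17 fields", 37_038_563, "32 min 18 s"),
    ("SQL Server, 5 fields", 1_159_687, "38.6 s"),
    ("SQL Server, 9 fields", 7_521_876, "5 min 4 s"),
]

# Triples per second for each run.
throughput = {label: triples / parse_duration(t) for label, triples, t in runs}

for label, rate in throughput.items():
    print(f"{label}: {rate:,.0f} triples/s")
```

These rates are only indicative, since the runs differ in data source, record size, and number of mapped fields.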
Figure 6. Dump All AGROVOC RDF Triples from SPARQL Endpoint to Local Files.