Package org.openrefine.runners.local
Class LocalRunner
- java.lang.Object
-
- org.openrefine.runners.local.LocalRunner
-
- All Implemented Interfaces:
Runner
public class LocalRunner extends Object implements Runner
The default implementation of theRunnerinterface. It is optimized for local (single machine) use, with Grids and ChangeData being read off disk lazily. Those objects can be partitioned, allowing for concurrent processing via threads.
-
-
Field Summary
Fields Modifier and Type Field Description protected intdefaultParallelismprotected static intdefaultReconciledCellCostprotected static intdefaultRowCostprotected static intdefaultUnreconciledCellCostprotected static StringGRID_PATHprotected longmaxSplitRowCountprotected longmaxSplitSizeprotected static StringMETADATA_PATHprotected longminSplitRowCountprotected longminSplitSizeprotected PLLContextpllContextprotected intreconciledCellCostprotected introwCostprotected intunreconciledCellCost-
Fields inherited from interface org.openrefine.model.Runner
COMPLETION_MARKER_FILE_NAME, GRID_ENCODING
-
-
Constructor Summary
Constructors Constructor Description LocalRunner()LocalRunner(RunnerConfiguration configuration)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description <T> ChangeData<T>changeDataFromIterable(CloseableIterable<IndexedData<T>> iterable, long itemCount)Creates aChangeDatafrom an iterable.protected <T> ChangeData<T>changeDataFromIterable(CloseableIterable<IndexedData<T>> iterable, long itemCount, boolean isComplete)<T> ChangeData<T>changeDataFromList(List<IndexedData<T>> changeData)Creates aChangeDatafrom an in-memory list of indexed data.<T> ChangeData<T>emptyChangeData()Creates an empty change data object of a given type, marked as incomplete.protected static <T> CloseableIterator<Tuple2<Long,IndexedData<T>>>fillWithIncompleteIndexedData(CloseableIterator<Tuple2<Long,IndexedData<T>>> originalIterator, long initialIndex, long upperBound)Utility function used to pad a stream of elements read from an incomplete change data with "virtual", pending IndexedData elements to represent those which may be computed later.PLLContextgetPLLContext()intgetReconciledCellCost()Returns the predicted memory cost of a reconciled cell, when caching gridsintgetRowCost()Returns the predicted memory cost of a row (on top of the cost of its cells), when caching gridsintgetUnreconciledCellCost()Returns the predicted memory cost of an unreconciled cell, when caching gridsGridgridFromIterable(ColumnModel columnModel, CloseableIterable<Row> rows, Map<String,OverlayModel> overlayModels, long rowCount, long recordCount)Creates aGridfrom an iterable collection of rows.GridgridFromList(ColumnModel columnModel, List<Row> rows, Map<String,OverlayModel> overlayModels)Creates aGridfrom an in-memory list of rows, which will be numbered from 0 to length-1.protected static List<Long>incompleteUpperBounds(List<Optional<Long>> firstKeys)Given a list of first keys found in the last (n-1) partitions, where n is the total number of partitions of a ChangeData object, compute a list of size n which contains for each partition up to which upper bound it should be filled up with pending change data objects.<T> ChangeData<T>loadChangeData(File path, ChangeDataSerializer<T> serializer)Loads aChangeDataserialized at a given location.GridloadGrid(File path)Loads aGridserialized at a given location.GridloadTextFile(String path, MultiFileReadingProgress progress, Charset encoding)Loads a text file as aGridwith a single column named "Column" and whose contents are the lines in the file, parsed as strings.GridloadTextFile(String path, MultiFileReadingProgress progress, Charset encoding, long limit)Loads a text file as aGridwith a single column named "Column" and whose contents are the lines in the file, parsed as strings.protected static Tuple2<Long,Row>parseIndexedRow(String source)booleansupportsProgressReporting()Indicates whether this implementation supports progress reporting.
-
-
-
Field Detail
-
METADATA_PATH
protected static final String METADATA_PATH
- See Also:
- Constant Field Values
-
GRID_PATH
protected static final String GRID_PATH
- See Also:
- Constant Field Values
-
defaultReconciledCellCost
protected static final int defaultReconciledCellCost
- See Also:
- Constant Field Values
-
defaultUnreconciledCellCost
protected static final int defaultUnreconciledCellCost
- See Also:
- Constant Field Values
-
defaultRowCost
protected static final int defaultRowCost
- See Also:
- Constant Field Values
-
pllContext
protected final PLLContext pllContext
-
defaultParallelism
protected int defaultParallelism
-
minSplitSize
protected long minSplitSize
-
maxSplitSize
protected long maxSplitSize
-
minSplitRowCount
protected long minSplitRowCount
-
maxSplitRowCount
protected long maxSplitRowCount
-
reconciledCellCost
protected int reconciledCellCost
-
unreconciledCellCost
protected int unreconciledCellCost
-
rowCost
protected int rowCost
-
-
Constructor Detail
-
LocalRunner
public LocalRunner(RunnerConfiguration configuration)
-
LocalRunner
public LocalRunner()
-
-
Method Detail
-
getPLLContext
public PLLContext getPLLContext()
-
loadGrid
public Grid loadGrid(File path) throws IOException
Description copied from interface:RunnerLoads aGridserialized at a given location.- Specified by:
loadGridin interfaceRunner- Parameters:
path- the directory where the Grid is stored- Returns:
- the grid
- Throws:
IOException- when loading the grid failed, or when the grid's serialization was incomplete (lacking a _SUCCESS marker)
-
gridFromList
public Grid gridFromList(ColumnModel columnModel, List<Row> rows, Map<String,OverlayModel> overlayModels)
Description copied from interface:RunnerCreates aGridfrom an in-memory list of rows, which will be numbered from 0 to length-1.- Specified by:
gridFromListin interfaceRunner
-
gridFromIterable
public Grid gridFromIterable(ColumnModel columnModel, CloseableIterable<Row> rows, Map<String,OverlayModel> overlayModels, long rowCount, long recordCount)
Description copied from interface:RunnerCreates aGridfrom an iterable collection of rows. By default, this just gathers the iterable in a list and delegates toRunner.gridFromList(ColumnModel, List, Map), but implementations may implement a different approach which delays the loading of the collection in memory.- Specified by:
gridFromIterablein interfaceRunnerrowCount- if the number of rows is known, supply it in this parameter as it might improve efficiency. Otherwise, set to -1.recordCount- if the number of records is known, supply it in this parameter as it might improve efficiency. Otherwise, set to -1.
-
loadChangeData
public <T> ChangeData<T> loadChangeData(File path, ChangeDataSerializer<T> serializer) throws IOException
Description copied from interface:RunnerLoads aChangeDataserialized at a given location.- Specified by:
loadChangeDatain interfaceRunner- Parameters:
path- the directory where the ChangeData is stored- Throws:
IOException- when loading the grid failed
-
incompleteUpperBounds
protected static List<Long> incompleteUpperBounds(List<Optional<Long>> firstKeys)
Given a list of first keys found in the last (n-1) partitions, where n is the total number of partitions of a ChangeData object, compute a list of size n which contains for each partition up to which upper bound it should be filled up with pending change data objects.
-
fillWithIncompleteIndexedData
protected static <T> CloseableIterator<Tuple2<Long,IndexedData<T>>> fillWithIncompleteIndexedData(CloseableIterator<Tuple2<Long,IndexedData<T>>> originalIterator, long initialIndex, long upperBound)
Utility function used to pad a stream of elements read from an incomplete change data with "virtual", pending IndexedData elements to represent those which may be computed later.- Parameters:
upperBound- a strict upper bound on the indices to enumerate, or Long.MAX_VALUE if unbounded
-
loadTextFile
public Grid loadTextFile(String path, MultiFileReadingProgress progress, Charset encoding) throws IOException
Description copied from interface:RunnerLoads a text file as aGridwith a single column named "Column" and whose contents are the lines in the file, parsed as strings.- Specified by:
loadTextFilein interfaceRunnerencoding- TODO- Throws:
IOException
-
loadTextFile
public Grid loadTextFile(String path, MultiFileReadingProgress progress, Charset encoding, long limit) throws IOException
Description copied from interface:RunnerLoads a text file as aGridwith a single column named "Column" and whose contents are the lines in the file, parsed as strings.- Specified by:
loadTextFilein interfaceRunner- Parameters:
path- the path to the text file to loadencoding- TODOlimit- the maximum number of lines to read- Throws:
IOException
-
changeDataFromList
public <T> ChangeData<T> changeDataFromList(List<IndexedData<T>> changeData)
Description copied from interface:RunnerCreates aChangeDatafrom an in-memory list of indexed data. The list is required to be sorted.- Specified by:
changeDataFromListin interfaceRunner
-
changeDataFromIterable
public <T> ChangeData<T> changeDataFromIterable(CloseableIterable<IndexedData<T>> iterable, long itemCount)
Description copied from interface:RunnerCreates aChangeDatafrom an iterable. By default, this just gathers the iterable in a list and delegates toRunner.changeDataFromList(List), but implementations may implement a different approach which delays the loading of the collection in memory.- Specified by:
changeDataFromIterablein interfaceRunneritemCount- if the number of items is known, supply it here, otherwise set this parameter to -1.
-
changeDataFromIterable
protected <T> ChangeData<T> changeDataFromIterable(CloseableIterable<IndexedData<T>> iterable, long itemCount, boolean isComplete)
-
emptyChangeData
public <T> ChangeData<T> emptyChangeData()
Description copied from interface:RunnerCreates an empty change data object of a given type, marked as incomplete.- Specified by:
emptyChangeDatain interfaceRunner
-
supportsProgressReporting
public boolean supportsProgressReporting()
Description copied from interface:RunnerIndicates whether this implementation supports progress reporting. If not, progress objects will be left untouched when passed to methods in this interface.- Specified by:
supportsProgressReportingin interfaceRunner
-
getReconciledCellCost
public int getReconciledCellCost()
Returns the predicted memory cost of a reconciled cell, when caching grids
-
getUnreconciledCellCost
public int getUnreconciledCellCost()
Returns the predicted memory cost of an unreconciled cell, when caching grids
-
getRowCost
public int getRowCost()
Returns the predicted memory cost of a row (on top of the cost of its cells), when caching grids
-
-