Package org.openrefine.runners.local
Class LocalRunner
java.lang.Object
    org.openrefine.runners.local.LocalRunner

- All Implemented Interfaces:
Runner
public class LocalRunner extends Object implements Runner
The default implementation of the Runner interface. It is optimized for local (single-machine) use, with Grids and ChangeData being read off disk lazily. Those objects can be partitioned, allowing for concurrent processing via threads.
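As a rough illustration of the partitioning idea described above, the following sketch splits an in-memory list into partitions and maps over them concurrently. It does not use LocalRunner or PLLContext: the class PartitionSketch and its methods are hypothetical, and a parallel stream stands in for whatever thread management the runner actually performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PartitionSketch {

    /** Splits a list into consecutive partitions of at most partitionSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int partitionSize) {
        if (partitionSize <= 0) {
            throw new IllegalArgumentException("partitionSize must be positive");
        }
        List<List<T>> partitions = new ArrayList<>();
        for (int start = 0; start < items.size(); start += partitionSize) {
            int end = Math.min(start + partitionSize, items.size());
            partitions.add(new ArrayList<>(items.subList(start, end)));
        }
        return partitions;
    }

    /** Applies f to every element, one partition per task, keeping the original order. */
    public static <T, R> List<R> mapPartitions(List<List<T>> partitions, Function<T, R> f) {
        return partitions.parallelStream()
                .map(p -> p.stream().map(f).collect(Collectors.toList()))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}
```

Because the stream over the partitions is ordered, the results come back in the same order as the input even though partitions may be processed on different threads.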
-
-
Field Summary
Fields
Modifier and Type          Field                          Description
protected int              defaultParallelism
protected static int       defaultReconciledCellCost
protected static int       defaultRowCost
protected static int       defaultUnreconciledCellCost
protected static String    GRID_PATH
protected long             maxSplitRowCount
protected long             maxSplitSize
protected static String    METADATA_PATH
protected long             minSplitRowCount
protected long             minSplitSize
protected PLLContext       pllContext
protected int              reconciledCellCost
protected int              rowCost
protected int              unreconciledCellCost
-
Fields inherited from interface org.openrefine.model.Runner
COMPLETION_MARKER_FILE_NAME, GRID_ENCODING
-
-
Constructor Summary
Constructors
Constructor                                       Description
LocalRunner()
LocalRunner(RunnerConfiguration configuration)
-
Method Summary
Modifier and Type, Method, Description

<T> ChangeData<T> changeDataFromIterable(CloseableIterable<IndexedData<T>> iterable, long itemCount)
    Creates a ChangeData from an iterable.

protected <T> ChangeData<T> changeDataFromIterable(CloseableIterable<IndexedData<T>> iterable, long itemCount, boolean isComplete)

<T> ChangeData<T> changeDataFromList(List<IndexedData<T>> changeData)
    Creates a ChangeData from an in-memory list of indexed data.

<T> ChangeData<T> emptyChangeData()
    Creates an empty change data object of a given type, marked as incomplete.

protected static <T> CloseableIterator<Tuple2<Long,IndexedData<T>>> fillWithIncompleteIndexedData(CloseableIterator<Tuple2<Long,IndexedData<T>>> originalIterator, long initialIndex, long upperBound)
    Utility function used to pad a stream of elements read from an incomplete change data with "virtual", pending IndexedData elements to represent those which may be computed later.

PLLContext getPLLContext()

int getReconciledCellCost()
    Returns the predicted memory cost of a reconciled cell, when caching grids.

int getRowCost()
    Returns the predicted memory cost of a row (on top of the cost of its cells), when caching grids.

int getUnreconciledCellCost()
    Returns the predicted memory cost of an unreconciled cell, when caching grids.

Grid gridFromIterable(ColumnModel columnModel, CloseableIterable<Row> rows, Map<String,OverlayModel> overlayModels, long rowCount, long recordCount)
    Creates a Grid from an iterable collection of rows.

Grid gridFromList(ColumnModel columnModel, List<Row> rows, Map<String,OverlayModel> overlayModels)
    Creates a Grid from an in-memory list of rows, which will be numbered from 0 to length-1.

protected static List<Long> incompleteUpperBounds(List<Optional<Long>> firstKeys)
    Given a list of first keys found in the last (n-1) partitions, where n is the total number of partitions of a ChangeData object, compute a list of size n which contains for each partition up to which upper bound it should be filled up with pending change data objects.

<T> ChangeData<T> loadChangeData(File path, ChangeDataSerializer<T> serializer)
    Loads a ChangeData serialized at a given location.

Grid loadGrid(File path)
    Loads a Grid serialized at a given location.

Grid loadTextFile(String path, MultiFileReadingProgress progress, Charset encoding)
    Loads a text file as a Grid with a single column named "Column" and whose contents are the lines in the file, parsed as strings.

Grid loadTextFile(String path, MultiFileReadingProgress progress, Charset encoding, long limit)
    Loads a text file as a Grid with a single column named "Column" and whose contents are the lines in the file, parsed as strings.

protected static Tuple2<Long,Row> parseIndexedRow(String source)

boolean supportsProgressReporting()
    Indicates whether this implementation supports progress reporting.
-
-
-
Field Detail
-
METADATA_PATH
protected static final String METADATA_PATH
- See Also:
- Constant Field Values
-
GRID_PATH
protected static final String GRID_PATH
- See Also:
- Constant Field Values
-
defaultReconciledCellCost
protected static final int defaultReconciledCellCost
- See Also:
- Constant Field Values
-
defaultUnreconciledCellCost
protected static final int defaultUnreconciledCellCost
- See Also:
- Constant Field Values
-
defaultRowCost
protected static final int defaultRowCost
- See Also:
- Constant Field Values
-
pllContext
protected final PLLContext pllContext
-
defaultParallelism
protected int defaultParallelism
-
minSplitSize
protected long minSplitSize
-
maxSplitSize
protected long maxSplitSize
-
minSplitRowCount
protected long minSplitRowCount
-
maxSplitRowCount
protected long maxSplitRowCount
-
reconciledCellCost
protected int reconciledCellCost
-
unreconciledCellCost
protected int unreconciledCellCost
-
rowCost
protected int rowCost
-
-
Constructor Detail
-
LocalRunner
public LocalRunner(RunnerConfiguration configuration)
-
LocalRunner
public LocalRunner()
-
-
Method Detail
-
getPLLContext
public PLLContext getPLLContext()
-
loadGrid
public Grid loadGrid(File path) throws IOException
Description copied from interface: Runner
Loads a Grid serialized at a given location.
- Specified by:
loadGrid in interface Runner
- Parameters:
path - the directory where the Grid is stored
- Returns:
the grid
- Throws:
IOException - when loading the grid failed, or when the grid's serialization was incomplete (lacking a _SUCCESS marker)
-
gridFromList
public Grid gridFromList(ColumnModel columnModel, List<Row> rows, Map<String,OverlayModel> overlayModels)
Description copied from interface: Runner
Creates a Grid from an in-memory list of rows, which will be numbered from 0 to length-1.
- Specified by:
gridFromList in interface Runner
-
gridFromIterable
public Grid gridFromIterable(ColumnModel columnModel, CloseableIterable<Row> rows, Map<String,OverlayModel> overlayModels, long rowCount, long recordCount)
Description copied from interface: Runner
Creates a Grid from an iterable collection of rows. By default, this just gathers the iterable in a list and delegates to Runner.gridFromList(ColumnModel, List, Map), but implementations may implement a different approach which delays the loading of the collection in memory.
- Specified by:
gridFromIterable in interface Runner
- Parameters:
rowCount - if the number of rows is known, supply it in this parameter as it might improve efficiency. Otherwise, set to -1.
recordCount - if the number of records is known, supply it in this parameter as it might improve efficiency. Otherwise, set to -1.
-
loadChangeData
public <T> ChangeData<T> loadChangeData(File path, ChangeDataSerializer<T> serializer) throws IOException
Description copied from interface: Runner
Loads a ChangeData serialized at a given location.
- Specified by:
loadChangeData in interface Runner
- Parameters:
path - the directory where the ChangeData is stored
- Throws:
IOException - when loading the change data failed
-
incompleteUpperBounds
protected static List<Long> incompleteUpperBounds(List<Optional<Long>> firstKeys)
Given a list of the first keys found in the last (n-1) partitions, where n is the total number of partitions of a ChangeData object, computes a list of size n giving, for each partition, the upper bound up to which it should be filled with pending change data objects.
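One plausible reading of this description is sketched below; it is not taken from the LocalRunner source. The assumption is that each partition is filled up to the first key of the next non-empty partition, and that the last partition is unbounded (Long.MAX_VALUE). The class UpperBoundsSketch and the method name upperBounds are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class UpperBoundsSketch {

    /**
     * firstKeys holds the first key of partitions 1..n-1 (empty if a partition has no elements).
     * Returns n upper bounds, one per partition.
     */
    public static List<Long> upperBounds(List<Optional<Long>> firstKeys) {
        List<Long> bounds = new ArrayList<>();
        long current = Long.MAX_VALUE;
        // Assumption: nothing follows the last partition, so it is unbounded.
        bounds.add(current);
        // Walk backwards so that empty partitions inherit the bound of the next non-empty one.
        for (int i = firstKeys.size() - 1; i >= 0; i--) {
            if (firstKeys.get(i).isPresent()) {
                current = firstKeys.get(i).get();
            }
            bounds.add(current);
        }
        Collections.reverse(bounds);
        return bounds;
    }
}
```

For example, with first keys [10, empty, 30] for partitions 1 to 3 of a four-partition object, this sketch yields the bounds [10, 30, 30, Long.MAX_VALUE].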
-
fillWithIncompleteIndexedData
protected static <T> CloseableIterator<Tuple2<Long,IndexedData<T>>> fillWithIncompleteIndexedData(CloseableIterator<Tuple2<Long,IndexedData<T>>> originalIterator, long initialIndex, long upperBound)
Utility function used to pad a stream of elements read from an incomplete change data with "virtual", pending IndexedData elements to represent those which may be computed later.
- Parameters:
upperBound - a strict upper bound on the indices to enumerate, or Long.MAX_VALUE if unbounded
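The padding behaviour might look roughly like the sketch below. It is written against plain java.util.Iterator with a hypothetical Item class standing in for IndexedData, and it assumes that pending elements are appended after the last element actually read (or from initialIndex if nothing was read), stopping before upperBound. The real method may differ in these details.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class PendingPaddingSketch {

    /** Hypothetical stand-in for IndexedData<T>: a pending item carries no data yet. */
    public static final class Item {
        public final long index;
        public final String data;
        public final boolean pending;

        public Item(long index, String data, boolean pending) {
            this.index = index;
            this.data = data;
            this.pending = pending;
        }
    }

    /** Yields the original elements, then pending items up to (but excluding) upperBound. */
    public static Iterator<Item> pad(Iterator<Item> original, long initialIndex, long upperBound) {
        return new Iterator<Item>() {
            // Index of the next pending item to emit once the original is exhausted.
            private long nextPending = initialIndex;

            @Override
            public boolean hasNext() {
                return original.hasNext() || nextPending < upperBound;
            }

            @Override
            public Item next() {
                if (original.hasNext()) {
                    Item item = original.next();
                    nextPending = item.index + 1;
                    return item;
                }
                if (nextPending >= upperBound) {
                    throw new NoSuchElementException();
                }
                return new Item(nextPending++, null, true);
            }
        };
    }
}
```

The pending items are generated lazily, so passing Long.MAX_VALUE as the upper bound gives an effectively unbounded iterator without allocating anything up front.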
-
loadTextFile
public Grid loadTextFile(String path, MultiFileReadingProgress progress, Charset encoding) throws IOException
Description copied from interface: Runner
Loads a text file as a Grid with a single column named "Column" and whose contents are the lines in the file, parsed as strings.
- Specified by:
loadTextFile in interface Runner
- Parameters:
encoding - TODO
- Throws:
IOException
-
loadTextFile
public Grid loadTextFile(String path, MultiFileReadingProgress progress, Charset encoding, long limit) throws IOException
Description copied from interface: Runner
Loads a text file as a Grid with a single column named "Column" and whose contents are the lines in the file, parsed as strings.
- Specified by:
loadTextFile in interface Runner
- Parameters:
path - the path to the text file to load
encoding - TODO
limit - the maximum number of lines to read
- Throws:
IOException
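To make the shape of the result concrete, here is a minimal sketch of reading lines into single-column rows with an optional limit. It does not build a real Grid: each row is represented as a one-entry Map keyed by "Column", the class TextFileSketch is hypothetical, and treating a negative limit as "no limit" is an assumption rather than documented behaviour.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TextFileSketch {

    public static final String COLUMN_NAME = "Column";

    /**
     * Reads lines from source into single-column rows.
     * The caller remains responsible for closing the reader.
     */
    public static List<Map<String, String>> readRows(Reader source, long limit) {
        BufferedReader reader = new BufferedReader(source);
        Stream<String> lines = reader.lines();
        // Assumption: a negative limit means that all lines are read.
        if (limit >= 0) {
            lines = lines.limit(limit);
        }
        return lines
                .map(line -> Collections.singletonMap(COLUMN_NAME, line))
                .collect(Collectors.toList());
    }
}
```

A file on disk could be passed in through java.nio.file.Files.newBufferedReader(path, encoding), which is where the encoding parameter would come into play.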
-
changeDataFromList
public <T> ChangeData<T> changeDataFromList(List<IndexedData<T>> changeData)
Description copied from interface: Runner
Creates a ChangeData from an in-memory list of indexed data. The list is required to be sorted.
- Specified by:
changeDataFromList in interface Runner
-
changeDataFromIterable
public <T> ChangeData<T> changeDataFromIterable(CloseableIterable<IndexedData<T>> iterable, long itemCount)
Description copied from interface: Runner
Creates a ChangeData from an iterable. By default, this just gathers the iterable in a list and delegates to Runner.changeDataFromList(List), but implementations may implement a different approach which delays the loading of the collection in memory.
- Specified by:
changeDataFromIterable in interface Runner
- Parameters:
itemCount - if the number of items is known, supply it here, otherwise set this parameter to -1.
-
changeDataFromIterable
protected <T> ChangeData<T> changeDataFromIterable(CloseableIterable<IndexedData<T>> iterable, long itemCount, boolean isComplete)
-
emptyChangeData
public <T> ChangeData<T> emptyChangeData()
Description copied from interface: Runner
Creates an empty change data object of a given type, marked as incomplete.
- Specified by:
emptyChangeData in interface Runner
-
supportsProgressReporting
public boolean supportsProgressReporting()
Description copied from interface: Runner
Indicates whether this implementation supports progress reporting. If not, progress objects will be left untouched when passed to methods in this interface.
- Specified by:
supportsProgressReporting in interface Runner
-
getReconciledCellCost
public int getReconciledCellCost()
Returns the predicted memory cost of a reconciled cell, when caching grids
-
getUnreconciledCellCost
public int getUnreconciledCellCost()
Returns the predicted memory cost of an unreconciled cell, when caching grids
-
getRowCost
public int getRowCost()
Returns the predicted memory cost of a row (on top of the cost of its cells), when caching grids
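The three cost getters suggest a simple additive estimate for the memory footprint of a cached grid. The sketch below shows that arithmetic under the assumption that the costs combine linearly. The class CacheCostSketch is hypothetical, and this page does not state the unit of the cost values.

```java
public class CacheCostSketch {

    /**
     * Estimates the cost of caching a grid: a fixed cost per row,
     * plus a cost per cell that depends on whether the cell is reconciled.
     */
    public static long predictedCost(long rows, long reconciledCells, long unreconciledCells,
            int rowCost, int reconciledCellCost, int unreconciledCellCost) {
        return rows * rowCost
                + reconciledCells * reconciledCellCost
                + unreconciledCells * unreconciledCellCost;
    }
}
```

With made-up costs of 100 per row, 200 per reconciled cell and 50 per unreconciled cell, a grid of 1000 rows holding 500 reconciled and 2500 unreconciled cells would be estimated at 1000 * 100 + 500 * 200 + 2500 * 50 = 325000.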
-
-