Class LocalGrid

  • All Implemented Interfaces:
    Grid

    public class LocalGrid
    extends Object
    implements Grid
    A PLL-based implementation of a Grid.
    Author:
    Antonin Delpeuch
    • Constructor Detail

      • LocalGrid

        public LocalGrid​(PairPLL<Long,​Record> records,
                         LocalRunner runner,
                         ColumnModel columnModel,
                         Map<String,​OverlayModel> overlayModels,
                         long rowCount)
        Constructs a grid from a grid of records.
        Parameters:
        records - the PLL of records for the current grid, which is assumed to have consistent indexing
        runner -
        columnModel -
        overlayModels -
    • Method Detail

      • getRunner

        public LocalRunner getRunner()
        Specified by:
        getRunner in interface Grid
        Returns:
        the runner which created this grid
      • getColumnModel

        public ColumnModel getColumnModel()
        Specified by:
        getColumnModel in interface Grid
        Returns:
        the column metadata at this stage of the workflow
      • withColumnModel

        public Grid withColumnModel​(ColumnModel newColumnModel)
        Specified by:
        withColumnModel in interface Grid
        Parameters:
        newColumnModel - the column model to apply to the grid
        Returns:
        a copy of this grid with a modified column model.
      • getRow

        public Row getRow​(long id)
        Description copied from interface: Grid
        Returns a row by index. Repeatedly calling this method to obtain multiple rows might be inefficient compared to fetching them by batch, depending on the implementation.
        Specified by:
        getRow in interface Grid
        Parameters:
        id - the row index. This refers to the current position of the row in the grid, which corresponds to IndexedRow.getIndex().
        Returns:
        the row at the given index
      • getRowsAfter

        public List<IndexedRow> getRowsAfter​(long start,
                                             int limit)
        Description copied from interface: Grid
        Returns a list of rows, starting from a given index and defined by a maximum size.
        Specified by:
        getRowsAfter in interface Grid
        Parameters:
        start - the first row id to fetch (inclusive)
        limit - the maximum number of rows to fetch
        Returns:
        the list of rows with their ids (if any)
      • getRowsAfter

        public List<IndexedRow> getRowsAfter​(RowFilter filter,
                                             long start,
                                             int limit)
        Description copied from interface: Grid
        Among the subset of filtered rows, return a list of rows, starting from a given index and defined by a maximum size.
        Specified by:
        getRowsAfter in interface Grid
        Parameters:
        filter - the subset of rows to paginate through. This object and its dependencies are required to be serializable.
        start - the first row id to fetch (inclusive)
        limit - the maximum number of rows to fetch
        Returns:
        the list of rows with their ids (if any)
        See Also:
        Grid.getRowsBefore(long, int)
      • getRowsBefore

        public List<IndexedRow> getRowsBefore​(long end,
                                              int limit)
        Description copied from interface: Grid
        Returns a list of consecutive rows, just before the given row index (not included) and up to a maximum size.
        Specified by:
        getRowsBefore in interface Grid
        Parameters:
        end - the last row id to fetch (exclusive)
        limit - the maximum number of rows to fetch
        Returns:
        the list of rows with their ids (if any)
        See Also:
        Grid.getRowsAfter(long, int)
      • getRowsBefore

        public List<IndexedRow> getRowsBefore​(RowFilter filter,
                                              long end,
                                              int limit)
        Description copied from interface: Grid
        Among the subset of filtered rows, return a list of rows, just before the row with a given index (excluded) and defined by a maximum size.
        Specified by:
        getRowsBefore in interface Grid
        Parameters:
        filter - the subset of rows to paginate through. This object and its dependencies are required to be serializable.
        end - the last row id to fetch (exclusive)
        limit - the maximum number of rows to fetch
        Returns:
        the list of rows with their ids (if any)
      • getRows

        public List<IndexedRow> getRows​(List<Long> rowIndices)
        Description copied from interface: Grid
        Returns a list of rows corresponding to the row indices supplied. By default, this calls Grid.getRow(long) on all values, but implementations can override this to more efficient strategies if available.
        Specified by:
        getRows in interface Grid
        Parameters:
        rowIndices - the indices of the rows to lookup
        Returns:
        the list contains null values for the row indices which could not be found.
      • iterateRows

        public CloseableIterator<IndexedRow> iterateRows​(RowFilter filter)
        Description copied from interface: Grid
        Iterate over rows matched by a filter, in the order determined by a sorting configuration. This might not require loading all rows in memory at once, but might be less efficient than Grid.collectRows() if all rows are to be stored in memory downstream.
        Specified by:
        iterateRows in interface Grid
      • countMatchingRows

        public long countMatchingRows​(RowFilter filter)
        Description copied from interface: Grid
        Count the number of rows which match a given filter.
        Specified by:
        countMatchingRows in interface Grid
        Parameters:
        filter - the row filter
        Returns:
        the number of rows for which this filter returns true
      • countMatchingRowsApprox

        public Grid.ApproxCount countMatchingRowsApprox​(RowFilter filter,
                                                        long limit)
        Description copied from interface: Grid
        Return the number of rows matching the given row filter, but by processing about at most a fixed number of row.
        Specified by:
        countMatchingRowsApprox in interface Grid
        Parameters:
        filter - counts the number of records on which it returns true
        limit - maximum number of records to process
      • collectRows

        public List<IndexedRow> collectRows()
        Description copied from interface: Grid
        Returns all rows in a list. This is inefficient for large datasets as it forces the entire grid to be loaded in memory.
        Specified by:
        collectRows in interface Grid
      • getRecord

        public Record getRecord​(long id)
        Description copied from interface: Grid
        Returns a record obtained by its id. Repeatedly calling this method to obtain multiple records might be inefficient depending on the implementation.
        Specified by:
        getRecord in interface Grid
        Parameters:
        id - the row id of the first row in the record. This refers to the current position of the record in the grid, which corresponds to Record.getStartRowId().
        Returns:
        the corresponding record
      • getRecordsAfter

        public List<Record> getRecordsAfter​(long start,
                                            int limit)
        Description copied from interface: Grid
        Returns a list of records, starting from a given index and defined by a maximum size.
        Specified by:
        getRecordsAfter in interface Grid
        Parameters:
        start - the first record id to fetch (inclusive)
        limit - the maximum number of records to fetch
        Returns:
        the list of records (if any)
        See Also:
        Grid.getRecordsBefore(long, int)
      • getRecordsAfter

        public List<Record> getRecordsAfter​(RecordFilter filter,
                                            long start,
                                            int limit)
        Description copied from interface: Grid
        Among the filtered subset of records, returns a list of records, starting from a given index and defined by a maximum size.
        Specified by:
        getRecordsAfter in interface Grid
        Parameters:
        filter - the filter which defines the subset of records to paginate through This object and its dependencies are required to be serializable.
        start - the first record id to fetch (inclusive)
        limit - the maximum number of records to fetch
        Returns:
        the list of records (if any)
      • getRecordsBefore

        public List<Record> getRecordsBefore​(long end,
                                             int limit)
        Description copied from interface: Grid
        Returns a list of consecutive records, ending at a given index (exclusive) and defined by a maximum size.
        Specified by:
        getRecordsBefore in interface Grid
        Parameters:
        end - the last record id to fetch (exclusive)
        limit - the maximum number of records to fetch
        Returns:
        the list of records (if any)
        See Also:
        Grid.getRecordsAfter(long, int)
      • getRecordsBefore

        public List<Record> getRecordsBefore​(RecordFilter filter,
                                             long end,
                                             int limit)
        Description copied from interface: Grid
        Among the filtered subset of records, returns a list of records, ending at a given index (exclusive) and defined by a maximum size.
        Specified by:
        getRecordsBefore in interface Grid
        Parameters:
        filter - the filter which defines the subset of records to paginate through This object and its dependencies are required to be serializable.
        end - the last record id to fetch (exclusive)
        limit - the maximum number of records to fetch
        Returns:
        the list of records (if any)
      • iterateRecords

        public CloseableIterator<Record> iterateRecords​(RecordFilter filter)
        Description copied from interface: Grid
        Iterate over records matched by a filter. This might not require loading all records in memory at once, but might be less efficient than Grid.collectRecords() if all records are to be stored in memory downstream.
        Specified by:
        iterateRecords in interface Grid
      • countMatchingRecords

        public long countMatchingRecords​(RecordFilter filter)
        Description copied from interface: Grid
        Return the number of records which are filtered by this filter.
        Specified by:
        countMatchingRecords in interface Grid
        Parameters:
        filter - the filter to evaluate
        Returns:
        the number of records for which this filter evaluates to true
      • countMatchingRecordsApprox

        public Grid.ApproxCount countMatchingRecordsApprox​(RecordFilter filter,
                                                           long limit)
        Description copied from interface: Grid
        Return the number of records matching the given record filter, but by processing about at most a fixed number of records.
        Specified by:
        countMatchingRecordsApprox in interface Grid
        Parameters:
        filter - counts the number of records on which it returns true
        limit - maximum number of records to process
      • collectRecords

        public List<Record> collectRecords()
        Description copied from interface: Grid
        Returns all records in a list. This is inefficient for large datasets as it forces all records to be loaded in memory.
        Specified by:
        collectRecords in interface Grid
      • rowCount

        public long rowCount()
        Specified by:
        rowCount in interface Grid
        Returns:
        the number of rows in the table
      • recordCount

        public long recordCount()
        Specified by:
        recordCount in interface Grid
        Returns:
        the number of records in the table
      • saveToFileAsync

        public ProgressingFuture<Void> saveToFileAsync​(File file)
        Description copied from interface: Grid
        Saves the grid to a specified directory, in an asynchronous fashion.
        Specified by:
        saveToFileAsync in interface Grid
        Parameters:
        file - the directory where to save the grid
        Returns:
        a future which completes once the save is complete
      • saveToFile

        public void saveToFile​(File file)
                        throws IOException
        Description copied from interface: Grid
        Saves the grid to a specified directory, following OpenRefine's format for grid storage.
        Specified by:
        saveToFile in interface Grid
        Parameters:
        file - the directory where to save the grid
        Throws:
        IOException
      • serializeIndexedRow

        protected static String serializeIndexedRow​(IndexedRow indexedRow)
      • aggregateRows

        public <T extends Serializable> T aggregateRows​(RowAggregator<T> aggregator,
                                                        T initialState)
        Description copied from interface: Grid
        Computes the result of a row aggregator on the grid.
        Specified by:
        aggregateRows in interface Grid
      • aggregateRecords

        public <T extends Serializable> T aggregateRecords​(RecordAggregator<T> aggregator,
                                                           T initialState)
        Description copied from interface: Grid
        Computes the result of a row aggregator on the grid.
        Specified by:
        aggregateRecords in interface Grid
      • aggregateRowsApprox

        public <T extends SerializableGrid.PartialAggregation<T> aggregateRowsApprox​(RowAggregator<T> aggregator,
                                                                                       T initialState,
                                                                                       long maxRows)
        Description copied from interface: Grid
        Computes the result of a row aggregator on the grid, reading about at most a fixed number of rows. The rows read should be deterministic for a given implementation.
        Specified by:
        aggregateRowsApprox in interface Grid
      • aggregateRecordsApprox

        public <T extends SerializableGrid.PartialAggregation<T> aggregateRecordsApprox​(RecordAggregator<T> aggregator,
                                                                                          T initialState,
                                                                                          long maxRecords)
        Description copied from interface: Grid
        Computes the result of a row aggregator on the grid, reading about at most a fixed number of records. The records read should be deterministic for a given implementation.
        Specified by:
        aggregateRecordsApprox in interface Grid
      • withOverlayModels

        public Grid withOverlayModels​(Map<String,​OverlayModel> newOverlayModels)
        Description copied from interface: Grid
        Returns a new grid where the overlay models have changed.
        Specified by:
        withOverlayModels in interface Grid
        Parameters:
        newOverlayModels - the new overlay models to apply to the grid
        Returns:
        the changed grid
      • mapRows

        public Grid mapRows​(RowMapper mapper,
                            ColumnModel newColumnModel)
        Description copied from interface: Grid
        Returns a new grid, where the rows have been mapped by the mapper.
        Specified by:
        mapRows in interface Grid
        Parameters:
        mapper - the function used to transform rows This object and its dependencies are required to be serializable.
        newColumnModel - the column model of the resulting grid
        Returns:
        the resulting grid
      • flatMapRows

        public Grid flatMapRows​(RowFlatMapper mapper,
                                ColumnModel newColumnModel)
        Description copied from interface: Grid
        Returns a new grid, where the rows have been mapped by the flat mapper.
        Specified by:
        flatMapRows in interface Grid
        Parameters:
        mapper - the function used to transform rows This object and its dependencies are required to be serializable.
        newColumnModel - the column model of the resulting grid
        Returns:
        the resulting grid
      • mapRows

        public <S extends SerializableGrid mapRows​(RowScanMapper<S> mapper,
                                                     ColumnModel newColumnModel)
        Description copied from interface: Grid
        Returns a new grid where the rows have been mapped by the stateful mapper. This can be significantly less efficient than a stateless mapper, so only use this if you really need to rely on state.
        Specified by:
        mapRows in interface Grid
        Type Parameters:
        S - the type of state kept by the mapper
        Parameters:
        mapper - the mapper to apply to the grid
        newColumnModel - the column model to apply to the new grid
      • mapRecords

        public Grid mapRecords​(RecordMapper mapper,
                               ColumnModel newColumnModel)
        Description copied from interface: Grid
        Returns a new grid, where the records have been mapped by the mapper
        Specified by:
        mapRecords in interface Grid
        Parameters:
        mapper - the function used to transform records This object and its dependencies are required to be serializable.
        newColumnModel - the column model of the resulting grid
        Returns:
        the resulting grid
      • reorderRows

        public Grid reorderRows​(SortingConfig sortingConfig,
                                boolean permanent)
        Description copied from interface: Grid
        Returns a new grid where rows have been reordered according to the configuration supplied.
        Specified by:
        reorderRows in interface Grid
        Parameters:
        sortingConfig - the criteria to sort rows
        permanent - if true, forget the original row ids. If false, store them in the corresponding IndexedRow.getOriginalIndex().
        Returns:
        the resulting grid
      • reorderRecords

        public Grid reorderRecords​(SortingConfig sortingConfig,
                                   boolean permanent)
        Description copied from interface: Grid
        Returns a new grid where records have been reordered according to the configuration supplied.
        Specified by:
        reorderRecords in interface Grid
        Parameters:
        sortingConfig - the criteria to sort records
        permanent - if true, forget the original record ids. If false, store them in the corresponding Record.getOriginalStartRowId().
        Returns:
        the resulting grid
      • removeRows

        public Grid removeRows​(RowFilter filter)
        Description copied from interface: Grid
        Removes all rows selected by a filter
        Specified by:
        removeRows in interface Grid
        Parameters:
        filter - which returns true when we should delete the row
        Returns:
        the grid where the matching rows have been removed
      • removeRecords

        public Grid removeRecords​(RecordFilter filter)
        Description copied from interface: Grid
        Removes all records selected by a filter
        Specified by:
        removeRecords in interface Grid
        Parameters:
        filter - which returns true when we should delete the record
        Returns:
        the grid where the matching record have been removed
      • mapRows

        public <T> ChangeData<T> mapRows​(RowFilter filter,
                                         RowChangeDataProducer<T> rowMapper,
                                         Optional<ChangeData<T>> incompleteChangeData)
        Description copied from interface: Grid
        Extract change data by applying a function to each filtered row. The calls to the change data producer are batched if requested by the producer.

        Specified by:
        mapRows in interface Grid
        Type Parameters:
        T - the type of change data that is serialized to disk for each row
        Parameters:
        filter - a filter to select which rows to map
        rowMapper - produces the change data for each row
        incompleteChangeData - a previously, incompletely fetched version of the same change data, from which the computation should be resumed, to avoid recomputing the items already in the incomplete change data
      • mapRecords

        public <T> ChangeData<T> mapRecords​(RecordFilter filter,
                                            RecordChangeDataProducer<T> recordMapper,
                                            Optional<ChangeData<T>> incompleteChangeData)
        Description copied from interface: Grid
        Extract change data by applying a function to each filtered record. The calls to the change data producer are batched if requested by the producer.
        Specified by:
        mapRecords in interface Grid
        Type Parameters:
        T - the type of change data that is serialized to disk for each row
        Parameters:
        filter - a filter to select which rows to map
        recordMapper - produces the change data for each record
        incompleteChangeData - a previously, incompletely fetched version of the same change data, from which the computation should be resumed, to avoid recomputing the items already in the incomplete change data
      • limitRows

        public Grid limitRows​(long rowLimit)
        Only keep the first rows. Overridden for efficiency when a partitioner is known.
        Specified by:
        limitRows in interface Grid
        Parameters:
        rowLimit - the number of rows to keep
        Returns:
        the limited grid
      • dropRows

        public Grid dropRows​(long rowsToDrop)
        Drop the first rows. Overridden for efficiency when partition sizes are known.
        Specified by:
        dropRows in interface Grid
        Parameters:
        rowsToDrop - the number of rows to drop
        Returns:
        the grid consisting of the last rows
      • join

        public <T> Grid join​(ChangeData<T> changeData,
                             RowChangeDataJoiner<T> rowJoiner,
                             ColumnModel newColumnModel)
        Description copied from interface: Grid
        Joins pre-computed change data with the current grid data, row by row.
        Specified by:
        join in interface Grid
        Type Parameters:
        T - the type of change data that was serialized to disk for each row
        Parameters:
        changeData - the serialized change data
        rowJoiner - produces the new row by joining the old row with change data
        newColumnModel - the column model to apply to the new grid
      • join

        public <T> Grid join​(ChangeData<T> changeData,
                             RowChangeDataFlatJoiner<T> rowJoiner,
                             ColumnModel newColumnModel)
        Description copied from interface: Grid
        Joins pre-computed change data with the current grid data, with a joiner function that can return multiple rows for a given original row.
        Specified by:
        join in interface Grid
        Type Parameters:
        T - the type of change data that was serialized to disk for each row
        Parameters:
        changeData - the serialized change data
        rowJoiner - produces the new row by joining the old row with change data
        newColumnModel - the column model to apply to the new grid
      • join

        public <T> Grid join​(ChangeData<T> changeData,
                             RecordChangeDataJoiner<T> recordJoiner,
                             ColumnModel newColumnModel)
        Description copied from interface: Grid
        Joins pre-computed change data with the current grid data, record by record.
        Specified by:
        join in interface Grid
        Type Parameters:
        T - the type of change data that was serialized to disk for each record
        Parameters:
        changeData - the serialized change data
        recordJoiner - produces the new list of rows by joining the old record with change data
        newColumnModel - the column model to apply to the new grid
      • concatenate

        public Grid concatenate​(Grid other)
        Description copied from interface: Grid
        Creates a new grid containing all rows in this grid, followed by all rows in the other grid supplied. The overlay models of this grid have priority over the others.

        The two grids are required to have the same number of columns.

        Specified by:
        concatenate in interface Grid
        Parameters:
        other - the grid to concatenate to this one
        Returns:
        a new grid, union of the two
      • concatenate

        public Grid concatenate​(List<Grid> otherGrids)
        Description copied from interface: Grid
        Concatenates this with other grids, in the given order. This is a variant of Grid.concatenate(Grid) which implementations can override to make more efficient than making repeated calls to Grid.concatenate(Grid) (which is the default implementation).
        Specified by:
        concatenate in interface Grid
        Parameters:
        otherGrids - the list of other grids to concatenate with this one.
        Returns:
        a new grid, union of all those grids
      • concatenateGridList

        protected static Grid concatenateGridList​(List<Grid> grids)
      • isCached

        public boolean isCached()
        Description copied from interface: Grid
        Is this grid cached in memory? If not, its contents are stored on disk.
        Specified by:
        isCached in interface Grid
      • uncache

        public void uncache()
        Description copied from interface: Grid
        Free up any memory used to cache this grid in memory.
        Specified by:
        uncache in interface Grid
      • cacheAsync

        public ProgressingFuture<Boolean> cacheAsync()
        Description copied from interface: Grid
        Attempt to cache this grid in memory, in an async way.
        Specified by:
        cacheAsync in interface Grid
        Returns:
        a future to keep track of the status of the caching process. The future returns whether the caching succeeded.
      • cache

        public boolean cache()
        Description copied from interface: Grid
        Attempt to cache this grid in memory. If the grid is too big, this can fail.
        Specified by:
        cache in interface Grid
        Returns:
        whether the grid was actually cached in memory.
      • smallEnoughToCacheInMemory

        protected boolean smallEnoughToCacheInMemory()
        Heuristic which predicts how much RAM a grid will occupy in memory after caching.
        Returns:
        whether there is enough free memory to cache this one.
      • sample

        protected <T,​U extends SerializableGrid.PartialAggregation<U> sample​(PLL<T> source,
                                                                                     long sampleSize,
                                                                                     U initialValue,
                                                                                     BiFunction<U,​T,​U> fold,
                                                                                     BiFunction<U,​U,​U> combine)
        Compute how many partitions and how many elements in those partitions we should process, if we want to get a sample of the given size. This will generally result in using all partitions and dividing the sample size by that number of partitions, but if there are many partitions we might not want to use them all (for instance, if there are more partitions than the desired sample size).
        Parameters:
        sampleSize - the number of elements we want to process in total
        Returns:
        a tuple, where the first component is the number of partitions and the second one is the number of elements in each of those partitions.
      • getRowsQueryTree

        public QueryTree getRowsQueryTree()
      • getRecordsQueryTree

        public QueryTree getRecordsQueryTree()