zune-jpeg-0.5.11/.cargo_vcs_info.json0000644000000001560000000000100130260ustar { "git": { "sha1": "3cdedbe15b78a0aeb1da003d533eeeb767b6d381" }, "path_in_vcs": "crates/zune-jpeg" }
zune-jpeg-0.5.11/.gitignore000064400000000000000000000000071046102023000136010ustar 00000000000000/target
zune-jpeg-0.5.11/Benches.md000064400000000000000000000051271046102023000135120ustar 00000000000000# Benchmarks of popular jpeg libraries

Here I compare how long popular JPEG decoders take to decode the 7680*4320 image below, the default wallpaper of the (now seemingly defunct) [Cutefish OS](https://en.cutefishos.com/).

![img](benches/images/speed_bench.jpg)

## About benchmarks

Benchmarks are tricky, especially for I/O-bound and multi-threaded programs. This library uses both of the above, hence performance may vary.

For best results, shut down your machine, go take coffee, think about life and how it came to be and why people should save the environment. Then power up your machine; if it's a laptop, connect it to a power supply, and if there is a setting for performance mode, enable it. Then run.

## Benchmarks vs real world usage

Real-world usage may vary. Note that I'm using a large image here, but most decoding in practice will probably involve small to medium images.

To make the library thread safe, we do about 1.5-1.7x more allocations than libjpeg-turbo. Do note, however, that the allocations do not all occur at once; we allocate when needed and deallocate when no longer needed.

If memory bandwidth is a limitation for you, this library is not for you.

## Reproducibility

The benchmarks are carried out on my local machine with an AMD Ryzen 5 4500U.

The benchmarks are reproducible. To reproduce them:

1. Clone this repository
2. Install Rust (if you don't have it yet)
3. `cd` into the directory.
4.
Run `cargo bench`

## Performance features of the three libraries

| feature                      | image-rs/jpeg-decoder | libjpeg-turbo | zune-jpeg |
|------------------------------|-----------------------|---------------|-----------|
| multithreaded                | ✅                     | ❌             | ❌         |
| platform specific intrinsics | ✅                     | ✅             | ✅         |

- Image-rs/jpeg-decoder uses [rayon] under the hood, but it's behind a feature flag.
- libjpeg-turbo uses hand-written assembly for platform-specific intrinsics, ported to the most common architectures out there, but falls back to scalar code on platforms it does not support.

# Finally, benchmarks [here]

## Notes

Benchmarks are run at least once a week to catch regressions early and are uploaded to GitHub Pages.

Machine specs can be found on the other [landing page].

Benchmarks may not reflect real-world usage (threads, I/O and other machine bottlenecks).

[landing page]:https://etemesi254.github.io/posts/Zune-Benchmarks/
[here]:https://etemesi254.github.io/assets/criterion/report/index.html
[libjpeg-turbo]:https://github.com/libjpeg-turbo/libjpeg-turbo
[jpeg-decoder]:https://github.com/image-rs/jpeg-decoder
[rayon]:https://github.com/rayon-rs/rayon
zune-jpeg-0.5.11/Cargo.lock0000644000000011230000000000100107740ustar # This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4

[[package]]
name = "log"
version = "0.4.29"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897"

[[package]]
name = "zune-core"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cb8a0807f7c01457d0379ba880ba6322660448ddebc890ce29bb64da71fb40f9"
dependencies = [
 "log",
]

[[package]]
name = "zune-jpeg"
version = "0.5.11"
dependencies = [
 "zune-core",
]
zune-jpeg-0.5.11/Cargo.toml0000644000000026740000000000100110330ustar # THIS FILE IS AUTOMATICALLY GENERATED BY CARGO
#
# When uploading crates to the registry Cargo will automatically
# "normalize" Cargo.toml files for maximal compatibility
# with all versions of Cargo and also rewrite `path` dependencies
# to registry (e.g., crates.io) dependencies.
#
# If you are reading this file be aware that the original Cargo.toml
# will likely look very different (and much more reasonable).
# See Cargo.toml.orig for the original contents.
[package]
edition = "2021"
rust-version = "1.75.0"
name = "zune-jpeg"
version = "0.5.11"
authors = ["caleb "]
build = false
exclude = [
    "/benches/images/*",
    "/tests/*",
    "/.idea/*",
    "/.gradle/*",
    "/test-images/*",
    "fuzz/*",
]
autolib = false
autobins = false
autoexamples = false
autotests = false
autobenches = false
description = "A fast, correct and safe jpeg decoder"
readme = "README.md"
keywords = [
    "jpeg",
    "jpeg-decoder",
    "decoder",
]
categories = ["multimedia::images"]
license = "MIT OR Apache-2.0 OR Zlib"
repository = "https://github.com/etemesi254/zune-image/tree/dev/crates/zune-jpeg"

[features]
default = [
    "x86",
    "neon",
    "std",
]
log = ["zune-core/log"]
neon = []
portable_simd = []
std = ["zune-core/std"]
x86 = []

[lib]
name = "zune_jpeg"
path = "src/lib.rs"

[dependencies.zune-core]
version = "0.5.1"

[dev-dependencies]

[lints.rust.unexpected_cfgs]
level = "warn"
priority = 0
check-cfg = ["cfg(fuzzing)"]
zune-jpeg-0.5.11/Cargo.toml.orig000064400000000000000000000020571046102023000145070ustar 00000000000000[package]
name = "zune-jpeg"
version = "0.5.11"
rust-version = "1.75.0"
authors = ["caleb "]
edition = "2021"
repository = "https://github.com/etemesi254/zune-image/tree/dev/crates/zune-jpeg"
license = "MIT OR Apache-2.0 OR Zlib"
keywords = ["jpeg", "jpeg-decoder", "decoder"]
categories = ["multimedia::images"]
exclude = ["/benches/images/*", "/tests/*", "/.idea/*", "/.gradle/*", "/test-images/*", "fuzz/*"]
description = "A fast, correct and safe jpeg decoder"

[lints.rust]
# Disable feature checker for fuzzing since it's used and cargo doesn't
# seem to recognise fuzzing
unexpected_cfgs = { level = "warn", check-cfg = ['cfg(fuzzing)'] }

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[features]
x86 = []
neon = []
std = ["zune-core/std"]
# NOTE: portable_simd requires Rust 1.87+
portable_simd = []
log = ["zune-core/log"]
default = ["x86", "neon", "std"]

[dependencies]
zune-core = { path = "../zune-core",
version = "0.5.1" }

[dev-dependencies]
zune-ppm = { path = "../zune-ppm" }
zune-jpeg-0.5.11/Changelog.md000064400000000000000000000052111046102023000140240ustar 00000000000000## Version 0.5.7

- Move scalar idct to wrapping maths.
- SIMD upsampling (mhils)
- Faster zero idct check (mhils)

## Version 0.5.6

- Better support for truncated images (by https://github.com/mhils)
- Fix 4:1:0 chroma subsampling (by https://github.com/mhils)
- Fix some crashes
- Fix a bug in last-pixel sampling

## Version 0.5.5

- Support direct conversion of Luma to RGBA

## Version 0.5.4

- Fix overriding the color space when decoding the Luma colorspace

## Version 0.5.3

- Fix decoding of some images with markers in progressive segments, see https://github.com/etemesi254/zune-image/issues/295

## Version 0.5.1

- Fix decoding of particular images with non-standard subsampling (see https://github.com/etemesi254/zune-image/issues/291)
- Add better RGB color detection of images to match libjpeg and stb_image behavior

-----

## Version 0.3.17

- Fix no-std compilation

## Version 0.3.16

- Add support for decoding to BGR and BGRA

## Version 0.3.14

- Add the ability to parse Exif and ICC chunks.
- Fix images with one component that were down-sampled.

### Version 0.3.13

- Allow decoding into a pre-allocated buffer
- Clarify documentation

### Version 0.3.11

- Add guards for SSE and AVX code paths (allows compiling for platforms that do not support them)

### Version 0.3.0

- Overhaul of the whole decoder.
- Single-threaded version
- Lightweight.

### Version 0.2.0

- New `ZuneJpegOptions` struct; this is now the recommended way to set up decoding options
- Deprecated previous option-setting functions.
- More code cleanups
- Fixed new bugs discovered by fuzzing
- Removed dependency on `num_cpu`

### Version 0.1.5

- Allow the user to explicitly set memory limits during decoding via `set_limits`
- Fixed some bugs discovered by fuzzing
- Correctly handle small images of less than 16 pixels
- Gracefully handle incorrectly sampled images.

### Version 0.1.4

- Remove all `unsafe` instances except platform-dependent intrinsics.
- Numerous bug fixes identified by fuzzing.
- Expose `ImageInfo` to the crate root.

### Version 0.1.3

- Fix numerous panics found by fuzzing (thanks to @[Shnatsel] for the corpus)
- Add new method `set_num_threads` that allows one to explicitly set the number of threads used to decode the image.

### Version 0.1.2

- Add more sub checks, contributed by @[5225225]
- Privatize some modules.

### Version 0.1.1

- Fix rgba/rgbx decoding when AVX-optimized functions were used
- Initial support for fuzzing
- Remove `align_alloc` method which was unsound (thanks to @[HeroicKatora] for pointing that out)

[Shnatsel]:https://github.com/Shnatsel
[HeroicKatora]:https://github.com/HeroicKatora
[5225225]:https://github.com/5225225
zune-jpeg-0.5.11/LICENSE-APACHE000064400000000000000000000261351046102023000135470ustar 00000000000000 Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity.
For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. 
For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. 
If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. 
You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. 
Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. zune-jpeg-0.5.11/LICENSE-MIT000064400000000000000000000020611046102023000132470ustar 00000000000000MIT License Copyright (c) zune-image developers Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. zune-jpeg-0.5.11/LICENSE-ZLIB000064400000000000000000000015331046102023000133610ustar 00000000000000zlib License (C) zune-image developers This software is provided 'as-is', without any express or implied warranty. 
In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution.
zune-jpeg-0.5.11/README.md000064400000000000000000000073231046102023000131000ustar 00000000000000# Zune-JPEG

A fast, correct and safe jpeg decoder in pure Rust.

## Usage

The library provides a simple-to-use API for jpeg decoding and the ability to set options that influence decoding.

### Example

```Rust
// Import the library
use zune_jpeg::JpegDecoder;
use zune_jpeg::errors::DecodeErrors;
use std::fs::read;

fn main() -> Result<(), DecodeErrors> {
    // load some jpeg data
    let data = read("cat.jpg").unwrap();

    // create a decoder
    let mut decoder = JpegDecoder::new(&data);
    // decode the file
    let pixels = decoder.decode()?;
    Ok(())
}
```

The decoder supports more manipulations via `DecoderOptions`; see the additional documentation in the library.

## Goals

The implementation aims to achieve the following goals, in order of importance:

1. Safety - Do not segfault on errors or invalid input. Panics are okay, but should be fixed when reported. `unsafe` is only used for SIMD intrinsics, and can be turned off entirely both at compile time and at runtime.
2. Speed - Get the data as quickly as possible, which means:
   1. Platform intrinsics code where justifiable
   2. Carefully written platform-independent code that allows the compiler to vectorize it.
   3. Regression tests.
   4. Watch the memory usage of the program
3.
Usability - Provide utility functions like different color conversion functions.

## Non-Goals

- Bit-identical results with libjpeg/libjpeg-turbo will never be an aim of this library. JPEG is a lossy format with very few parts specified by the standard (i.e. it doesn't specify a reference upsampling or color conversion algorithm).

## Features

- [x] A pretty fast 8*8 integer IDCT.
- [x] Fast Huffman decoding
- [x] Fast color convert functions.
- [x] Support for extended colorspaces like Grayscale and RGBA
- [X] Single-threaded decoding.
- [X] Support for four-component JPEGs, and esoteric color schemes like CMYK
- [X] Support for `no_std`
- [X] BGR/BGRA decoding support.

## Crate Features

| feature | on  | Capabilities                                                                                 |
|---------|-----|---------------------------------------------------------------------------------------------|
| `x86`   | yes | Enables `x86` specific instructions, specifically `avx` and `sse`, for accelerated decoding. |
| `std`   | yes | Enable linking to the `std` crate                                                            |

Note that the `x86` features are automatically disabled at compile time on platforms that aren't x86, hence there is no need to disable them explicitly if you are targeting such a platform.

## Using in a `no_std` environment

The crate can be used in a `no_std` environment with the `alloc` feature, but one is required to link a working allocator for whatever environment the decoder will be running on.

## Debug vs release

The decoder relies heavily on platform-specific intrinsics, namely AVX2 and SSE, to gain speed-ups in decoding, but they [perform poorly](https://godbolt.org/z/vPq57z13b) in debug builds. To get reasonable performance even when compiling your program in debug mode, add this to your `Cargo.toml`:

```toml
# `zune-jpeg` package will be always built with optimizations
[profile.dev.package.zune-jpeg]
opt-level = 3
```

## Benchmarks

The library tries to be as fast as [libjpeg-turbo] while being as safe as possible.
Platform-specific intrinsics help speed up intensive operations, ensuring we can almost match [libjpeg-turbo] speeds; measured times are generally within ±10 ms of it.

For more up-to-date benchmarks, see the online repo with benchmarks [here](https://etemesi254.github.io/assets/criterion/report/index.html)

[libjpeg-turbo]:https://github.com/libjpeg-turbo/libjpeg-turbo/
[image-rs/jpeg-decoder]:https://github.com/image-rs/jpeg-decoder/tree/master/src
zune-jpeg-0.5.11/src/bitstream.rs000064400000000000000000000720571046102023000147520ustar 00000000000000/*
 * Copyright (c) 2023.
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */

#![allow(
    clippy::if_not_else,
    clippy::similar_names,
    clippy::inline_always,
    clippy::doc_markdown,
    clippy::cast_sign_loss,
    clippy::cast_possible_truncation
)]

//! This file exposes a single struct that can decode a Huffman-encoded
//! bitstream in a JPEG file
//!
//! This code is optimized for speed.
//! It's meant to be super duper fast, because everyone else depends on this being fast.
//! It's (annoyingly) serial, hence we can't use parallel bitstreams (it's variable-length coding).
//!
//! Furthermore, for refills we have to do bytewise processing because the standard decided
//! that we want to support markers in the middle of streams (seriously, few people use RST markers).
//!
//! So we pull in all optimization steps:
//! - use `#[inline(always)]`? ✅,
//! - pre-execute most common cases ✅,
//! - add random comments ✅
//! - fast paths ✅.
//!
//! Speed-wise: It is probably the fastest JPEG BitStream decoder to ever sail the seven seas because of
//! a couple of optimization tricks.
//! 1. Fast refills from libjpeg-turbo
//! 2. As few branches as possible in decoder fast paths.
//! 3. Accelerated AC table decoding borrowed from stb_image.h written by Fabian Giesen (@rygorous),
//!    improved by me to handle more cases.
//! 4.
Safe and extensible routines (e.g. cool ways to eliminate bounds checks)
//! 5. No unsafe here
//!
//! Readability comes as a second priority (I tried with variable names this time, and we are way better than libjpeg).
//!
//! Anyway, if you are reading this it means you're cool, and I hope you find whatever part of the code you are looking for
//! (or learn something cool)
//!
//! Knock yourself out.
use alloc::format;
use alloc::string::ToString;
use core::cmp::min;

use zune_core::bytestream::{ZByteReaderTrait, ZReader};

use crate::errors::DecodeErrors;
use crate::huffman::{HuffmanTable, HUFF_LOOKAHEAD};
use crate::marker::Marker;
use crate::mcu::DCT_BLOCK;
use crate::misc::UN_ZIGZAG;

macro_rules! decode_huff {
    ($stream:tt,$symbol:tt,$table:tt) => {
        let mut code_length = $symbol >> HUFF_LOOKAHEAD;

        ($symbol) &= (1 << HUFF_LOOKAHEAD) - 1;

        if code_length > i32::from(HUFF_LOOKAHEAD) {
            // If the symbol cannot be resolved in the first HUFF_LOOKAHEAD bits,
            // we know it lies somewhere between HUFF_LOOKAHEAD and 16 bits, since JPEG
            // imposes a 16-bit limit. We can therefore look 16 bits ahead and try to
            // resolve the symbol starting from 1+HUFF_LOOKAHEAD bits.
            $symbol = ($stream).peek_bits::<16>() as i32;
            // (Credits to Sean T. Barrett's stb library for this optimization)
            // maxcode is pre-shifted to be 16 bits long so that it has (16-code_length)
            // zeroes at the end, hence we do not need to shift in the inner loop.
            while code_length < 17 {
                if $symbol < $table.maxcode[code_length as usize] {
                    break;
                }
                code_length += 1;
            }

            if code_length == 17 {
                // The symbol could not be decoded.
                //
                // We may be tempted to fake zeroes here, but no:
                // bail out, because Huffman codes are sensitive; probably everything
                // after this will be corrupt, so no need to continue.
                // panic!("Bad Huffman code length");
                return Err(DecodeErrors::Format(format!("Bad Huffman Code 0x{:X}, corrupt JPEG", $symbol)))
            }

            $symbol >>= (16 - code_length);
            ($symbol) = i32::from(
                ($table).values
                    [(($symbol + ($table).offset[code_length as usize]) & 0xFF) as usize],
            );
        }

        if code_length > i32::from(($stream).bits_left) {
            return Err(DecodeErrors::Format(format!("Code length {code_length} more than bits left {}", ($stream).bits_left)))
        }

        // drop bits read
        ($stream).drop_bits(code_length as u8);
    };
}

/// A `BitStream` struct, a bit-by-bit reader with super powers
///
#[rustfmt::skip]
pub(crate) struct BitStream {
    /// An MSB-type buffer that is used for some certain operations
    pub buffer: u64,
    /// A top-aligned MSB-type buffer that is used to accelerate some operations like
    /// peek_bits and get_bits.
    ///
    /// By top aligned, I mean the top bit (63) represents the top bit in the buffer.
    aligned_buffer: u64,
    /// Tells us the number of bits left in the two buffers
    pub(crate) bits_left: u8,
    /// Did we find a marker (RST/EOI) during decoding?
    pub marker: Option<Marker>,

    /// An i16 with the bit corresponding to successive_low set to 1, others 0.
    pub successive_low_mask: i16,
    spec_start: u8,
    spec_end: u8,
    pub eob_run: i32,
    pub overread_by: usize,
    /// True if we have seen the end-of-image marker.
    /// Don't read anything after that.
    pub seen_eoi: bool,
}

impl BitStream {
    /// Create a new BitStream
    #[rustfmt::skip]
    pub(crate) const fn new() -> BitStream {
        BitStream {
            buffer: 0,
            aligned_buffer: 0,
            bits_left: 0,
            marker: None,
            successive_low_mask: 1,
            spec_start: 0,
            spec_end: 0,
            eob_run: 0,
            overread_by: 0,
            seen_eoi: false,
        }
    }

    /// Create a new Bitstream for progressive decoding
    #[allow(clippy::redundant_field_names)]
    #[rustfmt::skip]
    pub(crate) fn new_progressive(al: u8, spec_start: u8, spec_end: u8) -> BitStream {
        BitStream {
            buffer: 0,
            aligned_buffer: 0,
            bits_left: 0,
            marker: None,
            successive_low_mask: 1i16 << al,
            spec_start: spec_start,
            spec_end: spec_end,
            eob_run: 0,
            overread_by: 0,
            seen_eoi: false,
        }
    }

    /// Refill the bit buffer by (a maximum of) 32 bits
    ///
    /// # Arguments
    /// - `reader`: `&mut ZReader<T>`: A mutable reference to an underlying
    ///   File/Memory buffer containing a valid JPEG stream
    ///
    /// This function will only refill if `self.bits_left` is less than 32
    #[inline(always)] // too many call sites? (perf improvement by 4%)
    pub fn refill<T>(&mut self, reader: &mut ZReader<T>) -> Result<bool, DecodeErrors>
        where T: ZByteReaderTrait
    {
        /// Macro version of a single-byte refill.
        ///
        /// Arguments:
        /// - buffer -> our IO buffer, because rust macros cannot get values from
        ///   the surrounding environment
        /// - bits_left -> number of bits left to a full refill
        macro_rules! refill {
            ($buffer:expr,$byte:expr,$bits_left:expr) => {
                // read a byte from the stream
                $byte = u64::from(reader.read_u8());
                self.overread_by += usize::from(reader.eof()?);

                // Append to the buffer.
                // JPEG is an MSB-type buffer, so that means we append this
                // to the lower end (0..8) of the buffer and push the rest of the bits above.
                $buffer = ($buffer << 8) | $byte;

                // Increment bits left
                $bits_left += 8;

                // Check for the special case of 0xFF, to see if it's stream data or a marker
                if $byte == 0xff {
                    // read the next byte
                    let mut next_byte = u64::from(reader.read_u8());

                    // Byte stuffing: 0xFF 0x00 means a literal 0xFF data byte.
                    // If the next byte is not 0x00, we may have hit a marker.
                    if next_byte != 0x00 {
                        // skip any 0xFF fill bytes
                        while next_byte == 0xFF {
                            next_byte = u64::from(reader.read_u8());
                        }

                        if next_byte != 0x00 {
                            // Undo the byte append and return
                            $buffer >>= 8;
                            $bits_left -= 8;

                            if $bits_left != 0 {
                                self.aligned_buffer = $buffer << (64 - $bits_left);
                            }

                            let marker = Marker::from_u8(next_byte as u8);
                            self.marker = marker;

                            if let Some(Marker::UNKNOWN(_)) = marker {
                                return Err(DecodeErrors::Format("Unknown marker in bit stream".to_string()));
                            }
                            if next_byte == 0xD9 {
                                // Special handling for EOI: fill some bytes, even if zero;
                                // removes some panics.
                                self.buffer <<= 8;
                                self.bits_left += 8;
                                self.aligned_buffer = self.buffer << (64 - self.bits_left);
                            }
                            return Ok(false);
                        }
                    }
                }
            };
        }

        // 32 bits is enough for a decode (16 bits) and receive_extend (max 16 bits)
        if self.bits_left < 32 {
            if self.marker.is_some() || self.overread_by > 0 || self.seen_eoi {
                // We found a marker or have seen EOI, or we are in over-reading mode;
                // fill the buffer with zeroes.
                self.buffer <<= 32;
                self.bits_left += 32;
                self.aligned_buffer = self.buffer << (64 - self.bits_left);
                return Ok(true);
            }

            // We optimize for the case where we don't have 0xFF in the stream and have
            // 4 bytes left, as it is the common case.
            //
            // So we always read 4 bytes; if read_fixed_bytes errors out, the cursor is
            // guaranteed not to advance in case of failure (is this true?), so
            // we revert the read later on (if we have 0xFF). If this fails, we use the
            // normal byte-at-a-time read.
            if let Ok(bytes) = reader.read_fixed_bytes_or_error::<4>() {
                // we have 4 bytes to spare; read the 4 bytes into a temporary buffer
                // create buffer
                let msb_buf = u32::from_be_bytes(bytes);
                // check if we have 0xff
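The `has_byte` call below checks whether any byte of the word equals 0xFF. This kind of check is typically implemented with the classic SWAR "word has a zero byte" trick; a standalone sketch of the idea (the name `has_byte_sketch` is hypothetical, shown for illustration only, and the crate's real helper lives elsewhere in the codebase):

```rust
/// Returns true if any byte of `x` equals `b`.
///
/// XORing with a broadcast of `b` turns matching bytes into zero bytes;
/// the subtract/mask trick then detects whether any zero byte exists.
fn has_byte_sketch(x: u32, b: u8) -> bool {
    // broadcast `b` into all four byte lanes and XOR: matches become 0x00
    let v = x ^ (u32::from(b) * 0x0101_0101);
    // classic "haszero" trick from the Bit Twiddling Hacks collection
    (v.wrapping_sub(0x0101_0101) & !v & 0x8080_8080) != 0
}
```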
if !has_byte(msb_buf, 255) { self.bits_left += 32; self.buffer <<= 32; self.buffer |= u64::from(msb_buf); self.aligned_buffer = self.buffer << (64 - self.bits_left); return Ok(true); } reader.rewind(4)?; } // This serves two reasons, // 1: Make clippy shut up // 2: Favour register reuse let mut byte; // 4 refills, if all succeed the stream should contain enough bits to decode a // value refill!(self.buffer, byte, self.bits_left); refill!(self.buffer, byte, self.bits_left); refill!(self.buffer, byte, self.bits_left); refill!(self.buffer, byte, self.bits_left); // Construct an MSB buffer whose top bits are the bitstream we are currently holding. self.aligned_buffer = self.buffer << (64 - self.bits_left); } return Ok(true); } /// Decode the DC coefficient in a MCU block. /// /// The decoded coefficient is written to `dc_prediction` /// #[allow( clippy::cast_possible_truncation, clippy::cast_sign_loss, clippy::unwrap_used )] #[inline(always)] fn decode_dc( &mut self, reader: &mut ZReader, dc_table: &HuffmanTable, dc_prediction: &mut i32 ) -> Result where T: ZByteReaderTrait { let (mut symbol, r); if self.bits_left < 32 { self.refill(reader)?; }; // look a head HUFF_LOOKAHEAD bits into the bitstream symbol = self.peek_bits::(); symbol = dc_table.lookup[symbol as usize]; decode_huff!(self, symbol, dc_table); if symbol != 0 { r = self.get_bits(symbol as u8); symbol = huff_extend(r, symbol); } // Update DC prediction *dc_prediction = dc_prediction.wrapping_add(symbol); return Ok(true); } /// Like `decode_dc` but we do not need the result of the component, we only want to remove it /// from the bitstream of the MCU. 
fn discard_dc( &mut self, reader: &mut ZReader, dc_table: &HuffmanTable ) -> Result where T: ZByteReaderTrait { let mut symbol; if self.bits_left < 32 { self.refill(reader)?; }; // look a head HUFF_LOOKAHEAD bits into the bitstream symbol = self.peek_bits::(); symbol = dc_table.lookup[symbol as usize]; decode_huff!(self, symbol, dc_table); if symbol != 0 { let _ = self.get_bits(symbol as u8); } return Ok(true); } /// Decode a Minimum Code Unit(MCU) as quickly as possible /// /// # Arguments /// - reader: The bitstream from where we read more bits. /// - dc_table: The Huffman table used to decode the DC coefficient /// - ac_table: The Huffman table used to decode AC values /// - block: A memory region where we will write out the decoded values /// - DC prediction: Last DC value for this component /// #[allow( clippy::many_single_char_names, clippy::cast_possible_truncation, clippy::cast_sign_loss )] #[inline(never)] pub fn decode_mcu_block( &mut self, reader: &mut ZReader, dc_table: &HuffmanTable, ac_table: &HuffmanTable, qt_table: &[i32; DCT_BLOCK], block: &mut [i32; 64], dc_prediction: &mut i32 ) -> Result where T: ZByteReaderTrait { // Get fast AC table as a reference before we enter the hot path let ac_lookup = ac_table.ac_lookup.as_ref().unwrap(); let (mut symbol, mut r, mut fast_ac); // Decode AC coefficients let mut pos: usize = 1; if self.bits_left < 1 && self.marker.is_some() { return Err(DecodeErrors::Format( "No more bytes left in stream before marker".to_string() )); } // decode DC, dc prediction will contain the value self.decode_dc(reader, dc_table, dc_prediction)?; // set dc to be the dc prediction. 
block[0] = *dc_prediction * qt_table[0]; while pos < 64 { self.refill(reader)?; symbol = self.peek_bits::(); fast_ac = ac_lookup[symbol as usize]; symbol = ac_table.lookup[symbol as usize]; if fast_ac != 0 { // FAST AC path pos += ((fast_ac >> 4) & 15) as usize; // run let t_pos = UN_ZIGZAG[min(pos, 63)] & 63; block[t_pos] = i32::from(fast_ac >> 8) * (qt_table[t_pos]); // Value self.drop_bits((fast_ac & 15) as u8); pos += 1; } else { decode_huff!(self, symbol, ac_table); r = symbol >> 4; symbol &= 15; if symbol != 0 { pos += r as usize; r = self.get_bits(symbol as u8); symbol = huff_extend(r, symbol); let t_pos = UN_ZIGZAG[pos & 63] & 63; block[t_pos] = symbol * qt_table[t_pos]; pos += 1; } else if r != 15 { return Ok(pos as u16); } else { pos += 16; } } } return Ok(64); } /// Advance the bitstream over a block but ignore the data contained. /// /// This updates DC prediction but we never dequantize and we never do any Zig-Zag translation /// either. Still returns the index of the last component read. pub fn discard_mcu_block( &mut self, reader: &mut ZReader, dc_table: &HuffmanTable, ac_table: &HuffmanTable ) -> Result where T: ZByteReaderTrait { // Get fast AC table as a reference before we enter the hot path let ac_lookup = ac_table.ac_lookup.as_ref().unwrap(); let (mut symbol, mut r, mut fast_ac); // Decode AC coefficients let mut pos: usize = 1; // decode DC, dc prediction will contain the value self.discard_dc(reader, dc_table)?; while pos < 64 { self.refill(reader)?; symbol = self.peek_bits::(); fast_ac = ac_lookup[symbol as usize]; symbol = ac_table.lookup[symbol as usize]; if fast_ac != 0 { // FAST AC path pos += ((fast_ac >> 4) & 15) as usize; // run self.drop_bits((fast_ac & 15) as u8); pos += 1; } else { decode_huff!(self, symbol, ac_table); r = symbol >> 4; symbol &= 15; if symbol != 0 { pos += r as usize; // Advance over bits but ignore. 
let _ = self.get_bits(symbol as u8); pos += 1; } else if r != 15 { return Ok(pos as u16); } else { pos += 16; } } } return Ok(64); } /// Peek `look_ahead` bits ahead without discarding them from the buffer #[inline(always)] #[allow(clippy::cast_possible_truncation)] const fn peek_bits(&self) -> i32 { (self.aligned_buffer >> (64 - LOOKAHEAD)) as i32 } /// Discard the next `N` bits without checking #[inline] fn drop_bits(&mut self, n: u8) { // PS: Its a good check, but triggers fuzzer and a lot of false positives //debug_assert!(self.bits_left >= n); //self.bits_left -= n; self.bits_left = self.bits_left.saturating_sub(n); self.aligned_buffer <<= n; } /// Read `n_bits` from the buffer and discard them #[inline(always)] #[allow(clippy::cast_possible_truncation)] fn get_bits(&mut self, n_bits: u8) -> i32 { let mask = (1_u64 << n_bits) - 1; self.aligned_buffer = self.aligned_buffer.rotate_left(u32::from(n_bits)); let bits = (self.aligned_buffer & mask) as i32; self.bits_left = self.bits_left.wrapping_sub(n_bits); bits } /// Decode a DC block #[allow(clippy::cast_possible_truncation)] #[inline] pub(crate) fn decode_prog_dc_first( &mut self, reader: &mut ZReader, dc_table: &HuffmanTable, block: &mut i16, dc_prediction: &mut i32 ) -> Result<(), DecodeErrors> where T: ZByteReaderTrait { self.decode_dc(reader, dc_table, dc_prediction)?; *block = (*dc_prediction as i16).wrapping_mul(self.successive_low_mask); return Ok(()); } #[inline] pub(crate) fn decode_prog_dc_refine( &mut self, reader: &mut ZReader, block: &mut i16 ) -> Result<(), DecodeErrors> where T: ZByteReaderTrait { // refinement scan if self.bits_left < 1 { self.refill(reader)?; // if we find a marker, it may happens we don't refill. 
// So let's confirm again that refill worked if self.bits_left < 1 { return Err(DecodeErrors::Format( "Marker found where not expected in refine bit".to_string() )); } } if self.get_bit() == 1 { *block = block.wrapping_add(self.successive_low_mask); } Ok(()) } /// Get a single bit from the bitstream fn get_bit(&mut self) -> u8 { let k = (self.aligned_buffer >> 63) as u8; // discard a bit self.drop_bits(1); return k; } pub(crate) fn decode_mcu_ac_first( &mut self, reader: &mut ZReader, ac_table: &HuffmanTable, block: &mut [i16; 64] ) -> Result where T: ZByteReaderTrait { let fast_ac = ac_table.ac_lookup.as_ref().unwrap(); let bit = self.successive_low_mask; let mut k = self.spec_start as usize; let (mut symbol, mut r, mut fac); // EOB runs are handled in mcu_prog.rs 'block: loop { self.refill(reader)?; // Check for marker in the stream symbol = self.peek_bits::(); fac = fast_ac[symbol as usize]; symbol = ac_table.lookup[symbol as usize]; if fac != 0 { // fast ac path k += ((fac >> 4) & 15) as usize; // run block[UN_ZIGZAG[min(k, 63)] & 63] = (fac >> 8).wrapping_mul(bit); // value self.drop_bits((fac & 15) as u8); k += 1; } else { decode_huff!(self, symbol, ac_table); r = symbol >> 4; symbol &= 15; if symbol != 0 { k += r as usize; r = self.get_bits(symbol as u8); symbol = huff_extend(r, symbol); block[UN_ZIGZAG[k & 63] & 63] = (symbol as i16).wrapping_mul(bit); k += 1; } else { if r != 15 { self.eob_run = 1 << r; self.eob_run += self.get_bits(r as u8); self.eob_run -= 1; break; } k += 16; } } if k > self.spec_end as usize { break 'block; } } return Ok(true); } #[allow(clippy::too_many_lines, clippy::op_ref)] pub(crate) fn decode_mcu_ac_refine( &mut self, reader: &mut ZReader, table: &HuffmanTable, block: &mut [i16; 64] ) -> Result where T: ZByteReaderTrait { let bit = self.successive_low_mask; let mut k = self.spec_start; let (mut symbol, mut r); if self.eob_run == 0 { 'no_eob: loop { // Decode a coefficient from the bit stream self.refill(reader)?; symbol = 
self.peek_bits::(); symbol = table.lookup[symbol as usize]; decode_huff!(self, symbol, table); r = symbol >> 4; symbol &= 15; if symbol == 0 { if r != 15 { // EOB run is 2^r + bits self.eob_run = 1 << r; self.eob_run += self.get_bits(r as u8); // EOB runs are handled by the eob logic break 'no_eob; } } else { if symbol != 1 { return Err(DecodeErrors::HuffmanDecode( "Bad Huffman code, corrupt JPEG?".to_string() )); } // get sign bit // We assume we have enough bits, which should be correct for sane images // since we refill by 32 above if self.get_bit() == 1 { symbol = i32::from(bit); } else { symbol = i32::from(-bit); } } // Advance over already nonzero coefficients appending // correction bits to the non-zeroes. // A correction bit is 1 if the absolute value of the coefficient must be increased if k <= self.spec_end { 'advance_nonzero: loop { let coefficient = &mut block[UN_ZIGZAG[k as usize & 63] & 63]; if *coefficient != 0 { if self.bits_left < 1 { self.refill(reader)?; if self.bits_left < 1 && self.marker.is_some() { return Err(DecodeErrors::Format( "Marker found where not expected in refine bit".to_string() )); } } if self.get_bit() == 1 && (*coefficient & bit) == 0 { if *coefficient >= 0 { *coefficient += bit; } else { *coefficient -= bit; } } } else { r -= 1; if r < 0 { // reached target zero coefficient. break 'advance_nonzero; } }; if k == self.spec_end { break 'advance_nonzero; } k += 1; } } if symbol != 0 { let pos = UN_ZIGZAG[k as usize & 63]; // output new non-zero coefficient. block[pos & 63] = symbol as i16; } k += 1; if k > self.spec_end { break 'no_eob; } } } if self.eob_run > 0 { // only run if block does not consists of purely zeroes if &block[1..] != &[0; 63] { self.refill(reader)?; while k <= self.spec_end { let coefficient = &mut block[UN_ZIGZAG[k as usize & 63] & 63]; if *coefficient != 0 && self.get_bit() == 1 { // check if we already modified it, if so do nothing, otherwise // append the correction bit. 
                        if (*coefficient & bit) == 0 {
                            if *coefficient >= 0 {
                                *coefficient = coefficient.wrapping_add(bit);
                            } else {
                                *coefficient = coefficient.wrapping_sub(bit);
                            }
                        }
                    }
                    if self.bits_left < 1 {
                        // refill at the last possible moment
                        self.refill(reader)?;
                    }
                    k += 1;
                }
            }
            // count a block completed in EOB run
            self.eob_run -= 1;
        }
        return Ok(true);
    }

    pub fn update_progressive_params(&mut self, _ah: u8, al: u8, spec_start: u8, spec_end: u8) {
        self.successive_low_mask = 1i16 << al;
        self.spec_start = spec_start;
        self.spec_end = spec_end;
    }

    /// Reset the stream when we hit a restart marker
    ///
    /// Restart markers indicate we should drop the bits held in the stream
    /// and zero out everything
    #[cold]
    pub fn reset(&mut self) {
        self.bits_left = 0;
        self.marker = None;
        self.buffer = 0;
        self.aligned_buffer = 0;
        self.eob_run = 0;
    }
}

/// Do the equivalent of JPEG HUFF_EXTEND
#[inline(always)]
fn huff_extend(x: i32, s: i32) -> i32 {
    // if x < (1 << (s - 1)), return x + ((-1 << s) + 1), else return x;
    // the shift by 31 yields an all-ones mask exactly when x is below the threshold
    (x) + ((((x) - (1 << ((s) - 1))) >> 31) & (((-1) << (s)) + 1))
}

const fn has_zero(v: u32) -> bool {
    // Retrieved from Stanford bithacks
    // @ https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
    return !((((v & 0x7F7F_7F7F) + 0x7F7F_7F7F) | v) | 0x7F7F_7F7F) != 0;
}

const fn has_byte(b: u32, val: u8) -> bool {
    // Retrieved from Stanford bithacks
    // @ https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
    has_zero(b ^ ((!0_u32 / 255) * (val as u32)))
}

// mod tests {
//     use zune_core::bytestream::ZCursor;
//     use zune_core::colorspace::ColorSpace;
//     use zune_core::options::DecoderOptions;
//
//     use crate::JpegDecoder;
//
//     #[test]
//     fn test_image() {
//         let img = "/Users/etemesi/Downloads/test_IDX_45_RAND_168601280367171438891916_minimized_837.jpg";
//         let data = std::fs::read(img).unwrap();
//         let options = DecoderOptions::new_cmd().jpeg_set_out_colorspace(ColorSpace::RGB);
//         let mut decoder = JpegDecoder::new_with_options(ZCursor::new(&data[..]), options);
//
//         decoder.decode().unwrap();
//         println!("{:?}", decoder.options.jpeg_get_out_colorspace())
//     }
// }
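The `has_zero`/`has_byte` helpers above are the Stanford bithack the refill fast path uses to test all four freshly read bytes for `0xFF` in one 32-bit operation. A small standalone sketch of the same trick (helper bodies copied from the module; the `main` driver is added here only for illustration):

```rust
// Standalone demo of the zero-byte bithack used by the refill fast path.
// `has_zero` is nonzero iff some byte of `v` is zero; `has_byte` reduces
// "does any byte equal `val`" to the zero-byte test by XOR-ing `val`
// into every lane (!0u32 / 255 == 0x0101_0101 is the lane broadcaster).
const fn has_zero(v: u32) -> bool {
    !((((v & 0x7F7F_7F7F) + 0x7F7F_7F7F) | v) | 0x7F7F_7F7F) != 0
}

const fn has_byte(b: u32, val: u8) -> bool {
    has_zero(b ^ ((!0_u32 / 255) * (val as u32)))
}

fn main() {
    // plain entropy-coded bytes: no 0xFF, so refill may ingest all 32 bits at once
    assert!(!has_byte(u32::from_be_bytes([0x12, 0x34, 0x56, 0x78]), 0xFF));
    // a potential marker/stuffing sequence anywhere forces the slow byte-wise path
    assert!(has_byte(u32::from_be_bytes([0x12, 0xFF, 0x00, 0x78]), 0xFF));
    println!("bithack ok");
}
```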
zune-jpeg-0.5.11/src/color_convert/avx.rs000064400000000000000000000240441046102023000164310ustar 00000000000000/* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! AVX color conversion routines //! //! Okay these codes are cool //! //! Herein lies super optimized codes to do color conversions. //! //! //! 1. The YCbCr to RGB use integer approximations and not the floating point equivalent. //! That means we may be +- 2 of pixels generated by libjpeg-turbo jpeg decoding //! (also libjpeg uses routines like `Y = 0.29900 * R + 0.33700 * G + 0.11400 * B + 0.25000 * G`) //! //! Firstly, we use integers (fun fact:there is no part of this code base where were dealing with //! floating points.., fun fact: the first fun fact wasn't even fun.) //! //! Secondly ,we have cool clamping code, especially for rgba , where we don't need clamping and we //! spend our time cursing that Intel decided permute instructions to work like 2 128 bit vectors(the compiler opitmizes //! it out to something cool). //! //! There isn't a lot here (not as fun as bitstream ) but I hope you find what you're looking for. //! //! 
O and ~~subscribe to my youtube channel~~ #![cfg(any(target_arch = "x86", target_arch = "x86_64"))] #![cfg(feature = "x86")] #![allow( clippy::wildcard_imports, clippy::cast_possible_truncation, clippy::too_many_arguments, clippy::inline_always, clippy::doc_markdown, dead_code )] #[cfg(target_arch = "x86")] use core::arch::x86::*; #[cfg(target_arch = "x86_64")] use core::arch::x86_64::*; use crate::color_convert::scalar::{CB_CF, CR_CF, C_G_CB_COEF_2, C_G_CR_COEF_1, YUV_RND, Y_CF}; pub union YmmRegister { // both are 32 when using std::mem::size_of mm256: __m256i, // for avx color conversion array: [i16; 16] } const R_AVX_COEF: i32 = i32::from_ne_bytes([CR_CF.to_ne_bytes()[0], CR_CF.to_ne_bytes()[1], 0, 0]); const B_AVX_COEF: i32 = i32::from_ne_bytes([0, 0, CB_CF.to_ne_bytes()[0], CB_CF.to_ne_bytes()[1]]); const G_COEF_AVX_COEF: i32 = i32::from_ne_bytes([ C_G_CR_COEF_1.to_ne_bytes()[0], C_G_CR_COEF_1.to_ne_bytes()[1], C_G_CB_COEF_2.to_ne_bytes()[0], C_G_CB_COEF_2.to_ne_bytes()[1] ]); //-------------------------------------------------------------------------------------------------- // AVX conversion routines //-------------------------------------------------------------------------------------------------- /// /// Convert YCBCR to RGB using AVX instructions /// /// # Note ///**IT IS THE RESPONSIBILITY OF THE CALLER TO CALL THIS IN CPUS SUPPORTING /// AVX2 OTHERWISE THIS IS UB** /// /// *Peace* /// /// This library itself will ensure that it's never called in CPU's not /// supporting AVX2 /// /// # Arguments /// - `y`,`cb`,`cr`: A reference of 8 i32's /// - `out`: The output array where we store our converted items /// - `offset`: The position from 0 where we write these RGB values #[inline(always)] pub fn ycbcr_to_rgb_avx2( y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16], out: &mut [u8], offset: &mut usize ) { // call this in another function to tell RUST to vectorize this // storing unsafe { ycbcr_to_rgb_avx2_1(y, cb, cr, out, offset); } } #[inline] 
#[target_feature(enable = "avx2")] unsafe fn ycbcr_to_rgb_avx2_1( y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16], out: &mut [u8], offset: &mut usize ) { let (mut r, mut g, mut b) = ycbcr_to_rgb_baseline_no_clamp(y, cb, cr); r = _mm256_packus_epi16(r, _mm256_setzero_si256()); g = _mm256_packus_epi16(g, _mm256_setzero_si256()); b = _mm256_packus_epi16(b, _mm256_setzero_si256()); r = _mm256_permute4x64_epi64::<{ shuffle(3, 1, 2, 0) }>(r); g = _mm256_permute4x64_epi64::<{ shuffle(3, 1, 2, 0) }>(g); b = _mm256_permute4x64_epi64::<{ shuffle(3, 1, 2, 0) }>(b); let sh_r = _mm256_setr_epi8( 0, 11, 6, 1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15, 10, 5, 0, 11, 6, 1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15, 10, 5 ); let sh_g = _mm256_setr_epi8( 5, 0, 11, 6, 1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15, 10, 5, 0, 11, 6, 1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15, 10 ); let sh_b = _mm256_setr_epi8( 10, 5, 0, 11, 6, 1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15, 10, 5, 0, 11, 6, 1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15 ); let r0 = _mm256_shuffle_epi8(r, sh_r); let g0 = _mm256_shuffle_epi8(g, sh_g); let b0 = _mm256_shuffle_epi8(b, sh_b); let m0 = _mm256_setr_epi8( 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0 ); let m1 = _mm256_setr_epi8( 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0 ); let p0 = _mm256_blendv_epi8(_mm256_blendv_epi8(r0, g0, m0), b0, m1); let p1 = _mm256_blendv_epi8(_mm256_blendv_epi8(g0, b0, m0), r0, m1); let p2 = _mm256_blendv_epi8(_mm256_blendv_epi8(b0, r0, m0), g0, m1); let rgb0 = _mm256_permute2x128_si256::<32>(p0, p1); let rgb1 = _mm256_permute2x128_si256::<48>(p2, p0); _mm256_storeu_si256(out.as_mut_ptr().cast(), rgb0); _mm_storeu_si128(out[32..].as_mut_ptr().cast(), _mm256_castsi256_si128(rgb1)); *offset += 48; } // Enabled avx2 automatically enables avx. 
#[inline] #[target_feature(enable = "avx2")] /// A baseline implementation of YCbCr to RGB conversion which does not carry /// out clamping /// /// This is used by the `ycbcr_to_rgba_avx` and `ycbcr_to_rgbx` conversion /// routines unsafe fn ycbcr_to_rgb_baseline_no_clamp( y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16] ) -> (__m256i, __m256i, __m256i) { // Load values into a register // let y_c = _mm256_loadu_si256(y.as_ptr().cast()); let cb_c = _mm256_loadu_si256(cb.as_ptr().cast()); let cr_c = _mm256_loadu_si256(cr.as_ptr().cast()); // Here we want to use _mm256_madd_epi16 to perform 2 multiplications // and one addition per instruction. // At first, we have to pack i16 U and V that stores u8 into one u8 [U,V] // then zero extend, and keep in mind that lanes is already been permuted. let y_coeff = _mm256_set1_epi32(i32::from(Y_CF)); let cr_coeff = _mm256_set1_epi32(R_AVX_COEF); let cb_coeff = _mm256_set1_epi32(B_AVX_COEF); let cg_coeff = _mm256_set1_epi32(G_COEF_AVX_COEF); let v_rnd = _mm256_set1_epi32(i32::from(YUV_RND)); let uv_bias = _mm256_set1_epi16(128); // UV in memory because x86/x86_64 is always little endian let v_0 = _mm256_slli_epi16::<8>(cb_c); let u_v_8 = _mm256_or_si256(v_0, cr_c); let mut u_v_lo = _mm256_unpacklo_epi8(u_v_8, _mm256_setzero_si256()); let mut u_v_hi = _mm256_unpackhi_epi8(u_v_8, _mm256_setzero_si256()); let mut y_lo = _mm256_unpacklo_epi16(y_c, _mm256_setzero_si256()); let mut y_hi = _mm256_unpackhi_epi16(y_c, _mm256_setzero_si256()); u_v_lo = _mm256_sub_epi16(u_v_lo, uv_bias); u_v_hi = _mm256_sub_epi16(u_v_hi, uv_bias); y_lo = _mm256_madd_epi16(y_lo, y_coeff); y_hi = _mm256_madd_epi16(y_hi, y_coeff); let mut r_lo = _mm256_madd_epi16(u_v_lo, cr_coeff); let mut r_hi = _mm256_madd_epi16(u_v_hi, cr_coeff); let mut g_lo = _mm256_madd_epi16(u_v_lo, cg_coeff); let mut g_hi = _mm256_madd_epi16(u_v_hi, cg_coeff); // This ordering is preferred to reduce register file pressure. 
y_lo = _mm256_add_epi32(y_lo, v_rnd); y_hi = _mm256_add_epi32(y_hi, v_rnd); let mut b_lo = _mm256_madd_epi16(u_v_lo, cb_coeff); let mut b_hi = _mm256_madd_epi16(u_v_hi, cb_coeff); r_lo = _mm256_add_epi32(r_lo, y_lo); r_hi = _mm256_add_epi32(r_hi, y_hi); g_lo = _mm256_add_epi32(g_lo, y_lo); g_hi = _mm256_add_epi32(g_hi, y_hi); b_lo = _mm256_add_epi32(b_lo, y_lo); b_hi = _mm256_add_epi32(b_hi, y_hi); r_lo = _mm256_srai_epi32::<14>(r_lo); r_hi = _mm256_srai_epi32::<14>(r_hi); g_lo = _mm256_srai_epi32::<14>(g_lo); g_hi = _mm256_srai_epi32::<14>(g_hi); b_lo = _mm256_srai_epi32::<14>(b_lo); b_hi = _mm256_srai_epi32::<14>(b_hi); let r = _mm256_packus_epi32(r_lo, r_hi); let g = _mm256_packus_epi32(g_lo, g_hi); let b = _mm256_packus_epi32(b_lo, b_hi); return (r, g, b); } #[inline(always)] pub fn ycbcr_to_rgba_avx2( y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16], out: &mut [u8], offset: &mut usize ) { unsafe { ycbcr_to_rgba_unsafe(y, cb, cr, out, offset); } } #[inline] #[target_feature(enable = "avx2")] #[rustfmt::skip] unsafe fn ycbcr_to_rgba_unsafe( y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16], out: &mut [u8], offset: &mut usize, ) { // check if we have enough space to write. let tmp:& mut [u8; 64] = out.get_mut(*offset..*offset + 64).expect("Slice to small cannot write").try_into().unwrap(); let (r, g, b) = ycbcr_to_rgb_baseline_no_clamp(y, cb, cr); // set alpha channel to 255 for opaque // And no these comments were not from me pressing the keyboard // Pack the integers into u8's using unsigned saturation. let c = _mm256_packus_epi16(r, g); //aaaaa_bbbbb_aaaaa_bbbbbb let d = _mm256_packus_epi16(b, _mm256_set1_epi16(255)); // cccccc_dddddd_ccccccc_ddddd // transpose_u16 and interleave channels let e = _mm256_unpacklo_epi8(c, d); //ab_ab_ab_ab_ab_ab_ab_ab let f = _mm256_unpackhi_epi8(c, d); //cd_cd_cd_cd_cd_cd_cd_cd // final transpose_u16 let g = _mm256_unpacklo_epi8(e, f); //abcd_abcd_abcd_abcd_abcd let h = _mm256_unpackhi_epi8(e, f); // undo packus shuffling... 
let i = _mm256_permute2x128_si256::<{ shuffle(3, 2, 1, 0) }>(g, h); let j = _mm256_permute2x128_si256::<{ shuffle(1, 2, 3, 0) }>(g, h); let k = _mm256_permute2x128_si256::<{ shuffle(3, 2, 0, 1) }>(g, h); let l = _mm256_permute2x128_si256::<{ shuffle(0, 3, 2, 1) }>(g, h); let m = _mm256_blend_epi32::<0b1111_0000>(i, j); let n = _mm256_blend_epi32::<0b1111_0000>(k, l); // Store // Use streaming instructions to prevent polluting the cache? _mm256_storeu_si256(tmp.as_mut_ptr().cast(), m); _mm256_storeu_si256(tmp[32..].as_mut_ptr().cast(), n); *offset += 64; } #[inline] const fn shuffle(z: i32, y: i32, x: i32, w: i32) -> i32 { (z << 6) | (y << 4) | (x << 2) | w } zune-jpeg-0.5.11/src/color_convert/neon64.rs000064400000000000000000000113001046102023000167330ustar 00000000000000/* * Copyright (c) 2025. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! Aarch64 color conversion routines //! NEON is mandatory on aarch64. 
#![cfg(all(feature = "neon", target_arch = "aarch64"))] use core::arch::aarch64::*; use crate::color_convert::scalar::{CB_CF, CR_CF, C_G_CB_COEF_2, C_G_CR_COEF_1, YUV_RND, Y_CF}; const C_1: u64 = u64::from_ne_bytes([ Y_CF.to_ne_bytes()[0], Y_CF.to_ne_bytes()[1], CR_CF.to_ne_bytes()[0], CR_CF.to_ne_bytes()[1], CB_CF.to_ne_bytes()[0], CB_CF.to_ne_bytes()[1], C_G_CR_COEF_1.to_ne_bytes()[0], C_G_CR_COEF_1.to_ne_bytes()[1] ]); const C_2: u64 = u64::from_ne_bytes([ C_G_CB_COEF_2.to_ne_bytes()[0], C_G_CB_COEF_2.to_ne_bytes()[1], 0, 0, 0, 0, 0, 0 ]); #[inline(always)] unsafe fn ycbcr_to_rgb_baseline_no_clamp( y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16] ) -> (uint8x16_t, uint8x16_t, uint8x16_t) { // NEON has 32 registers, so it is good idea to utilize a lot of variables at once let cb_cr_bias = vdupq_n_s16(128); // 0 - Y coeff, 1 - Cr, 2 - Cb, 3 - G1, 4 - G2 let coefficients = vcombine_s16(vcreate_s16(C_1), vcreate_s16(C_2)); let y0 = vld1q_s16(y.as_ptr().cast()); let y1 = vld1q_s16(y[8..].as_ptr().cast()); let mut cb0 = vld1q_s16(cb.as_ptr().cast()); let mut cb1 = vld1q_s16(cb[8..].as_ptr().cast()); let mut cr0 = vld1q_s16(cr.as_ptr().cast()); let mut cr1 = vld1q_s16(cr[8..].as_ptr().cast()); cb0 = vsubq_s16(cb0, cb_cr_bias); cb1 = vsubq_s16(cb1, cb_cr_bias); cr0 = vsubq_s16(cr0, cb_cr_bias); cr1 = vsubq_s16(cr1, cb_cr_bias); let bias = vdupq_n_s32(i32::from(YUV_RND)); let acc0 = vmlal_laneq_s16::<0>(bias, vget_low_s16(y0), coefficients); let acc1 = vmlal_high_laneq_s16::<0>(bias, y0, coefficients); let acc2 = vmlal_laneq_s16::<0>(bias, vget_low_s16(y1), coefficients); let acc3 = vmlal_high_laneq_s16::<0>(bias, y1, coefficients); let r0 = vmlal_laneq_s16::<1>(acc0, vget_low_s16(cr0), coefficients); let r1 = vmlal_high_laneq_s16::<1>(acc1, cr0, coefficients); let r2 = vmlal_laneq_s16::<1>(acc2, vget_low_s16(cr1), coefficients); let r3 = vmlal_high_laneq_s16::<1>(acc3, cr1, coefficients); let b0 = vmlal_laneq_s16::<2>(acc0, vget_low_s16(cb0), coefficients); let b1 = 
vmlal_high_laneq_s16::<2>(acc1, cb0, coefficients); let b2 = vmlal_laneq_s16::<2>(acc2, vget_low_s16(cb1), coefficients); let b3 = vmlal_high_laneq_s16::<2>(acc3, cb1, coefficients); // Saturating shift right with signed -> unsigned saturation let qr0 = vqshrun_n_s32::<14>(r0); let qr1 = vqshrun_n_s32::<14>(r1); let qr2 = vqshrun_n_s32::<14>(r2); let qr3 = vqshrun_n_s32::<14>(r3); let mut g0 = vmlal_laneq_s16::<4>(acc0, vget_low_s16(cb0), coefficients); let mut g1 = vmlal_high_laneq_s16::<4>(acc1, cb0, coefficients); let mut g2 = vmlal_laneq_s16::<4>(acc2, vget_low_s16(cb1), coefficients); let mut g3 = vmlal_high_laneq_s16::<4>(acc3, cb1, coefficients); let qb0 = vqshrun_n_s32::<14>(b0); let qb1 = vqshrun_n_s32::<14>(b1); let qb2 = vqshrun_n_s32::<14>(b2); let qb3 = vqshrun_n_s32::<14>(b3); let r0 = vqmovn_u16(vcombine_u16(qr0, qr1)); let r1 = vqmovn_u16(vcombine_u16(qr2, qr3)); let b0 = vqmovn_u16(vcombine_u16(qb0, qb1)); let b1 = vqmovn_u16(vcombine_u16(qb2, qb3)); g0 = vmlal_laneq_s16::<3>(g0, vget_low_s16(cr0), coefficients); g1 = vmlal_high_laneq_s16::<3>(g1, cr0, coefficients); g2 = vmlal_laneq_s16::<3>(g2, vget_low_s16(cr1), coefficients); g3 = vmlal_high_laneq_s16::<3>(g3, cr1, coefficients); let qg0 = vqshrun_n_s32::<14>(g0); let qg1 = vqshrun_n_s32::<14>(g1); let qg2 = vqshrun_n_s32::<14>(g2); let qg3 = vqshrun_n_s32::<14>(g3); let g0 = vqmovn_u16(vcombine_u16(qg0, qg1)); let g1 = vqmovn_u16(vcombine_u16(qg2, qg3)); ( vcombine_u8(r0, r1), vcombine_u8(g0, g1), vcombine_u8(b0, b1) ) } #[inline(always)] pub fn ycbcr_to_rgb_neon( y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16], out: &mut [u8], offset: &mut usize ) { // call this in another function to tell RUST to vectorize this // storing unsafe { let (r, g, b) = ycbcr_to_rgb_baseline_no_clamp(y, cb, cr); vst3q_u8(out.as_mut_ptr(), uint8x16x3_t(r, g, b)); *offset += 48; } } #[inline(always)] pub fn ycbcr_to_rgba_neon( y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16], out: &mut [u8], offset: &mut usize ) { unsafe 
{
        let (r, g, b) = ycbcr_to_rgb_baseline_no_clamp(y, cb, cr);

        vst4q_u8(out.as_mut_ptr(), uint8x16x4_t(r, g, b, vdupq_n_u8(255)));
        *offset += 64;
    }
}

zune-jpeg-0.5.11/src/color_convert/scalar.rs

/*
 * Copyright (c) 2023.
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */

use core::convert::TryInto;

// Bt.601 Full Range inverse coefficients computed with 14 bits of precision with MPFR.
// This is important to keep them in i16.
// In most cases LLVM will detect that we're doing i16-widened-to-i32 math and will use
// appropriate optimizations.
pub(crate) const Y_CF: i16 = 16384;
pub(crate) const CR_CF: i16 = 22970;
pub(crate) const CB_CF: i16 = 29032;
pub(crate) const C_G_CR_COEF_1: i16 = -11700;
pub(crate) const C_G_CB_COEF_2: i16 = -5638;
pub(crate) const YUV_PREC: i16 = 14;
// Rounding const for YUV -> RGB conversion: floating equivalent 0.499(9).
pub(crate) const YUV_RND: i16 = (1 << (YUV_PREC - 1)) - 1;

/// Limit values to 0 and 255
#[inline]
#[allow(clippy::cast_possible_truncation, clippy::cast_sign_loss, dead_code)]
fn clamp(a: i32) -> u8 {
    a.clamp(0, 255) as u8
}

/// Convert YCbCr to RGBA/BGRA
///
/// Converts to RGBA if const BGRA is false
///
/// Converts to BGRA if const BGRA is true
pub fn ycbcr_to_rgba_inner_16_scalar<const BGRA: bool>(
    y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16], output: &mut [u8], pos: &mut usize
) {
    let (_, output_position) = output.split_at_mut(*pos);

    // Convert into a slice with 64 elements for Rust to see we won't go out of bounds.
    let opt: &mut [u8; 64] = output_position
        .get_mut(0..64)
        .expect("Slice too small, cannot write")
        .try_into()
        .unwrap();

    for ((&y, (cb, cr)), out) in y
        .iter()
        .zip(cb.iter().zip(cr.iter()))
        .zip(opt.chunks_exact_mut(4))
    {
        let cr = cr - 128;
        let cb = cb - 128;

        let y0 = i32::from(y) * i32::from(Y_CF) + i32::from(YUV_RND);
        let r = (y0 + i32::from(cr) * i32::from(CR_CF)) >> YUV_PREC;
        let g = (y0
            + i32::from(cr) * i32::from(C_G_CR_COEF_1)
            + i32::from(cb) * i32::from(C_G_CB_COEF_2))
            >> YUV_PREC;
        let b = (y0 + i32::from(cb) * i32::from(CB_CF)) >> YUV_PREC;

        if BGRA {
            out[0] = clamp(b);
            out[1] = clamp(g);
            out[2] = clamp(r);
            out[3] = 255;
        } else {
            out[0] = clamp(r);
            out[1] = clamp(g);
            out[2] = clamp(b);
            out[3] = 255;
        }
    }
    *pos += 64;
}

/// Convert YCbCr to RGB/BGR
///
/// Converts to RGB if const BGRA is false
///
/// Converts to BGR if const BGRA is true
pub fn ycbcr_to_rgb_inner_16_scalar<const BGRA: bool>(
    y: &[i16; 16], cb: &[i16; 16], cr: &[i16; 16], output: &mut [u8], pos: &mut usize
) {
    let (_, output_position) = output.split_at_mut(*pos);

    // Convert into a slice with 48 elements
    let opt: &mut [u8; 48] = output_position
        .get_mut(0..48)
        .expect("Slice too small, cannot write")
        .try_into()
        .unwrap();

    for ((&y, (cb, cr)), out) in y
        .iter()
        .zip(cb.iter().zip(cr.iter()))
        .zip(opt.chunks_exact_mut(3))
    {
        let cr = cr - 128;
        let cb = cb - 128;

        let y0 = i32::from(y) * i32::from(Y_CF) + i32::from(YUV_RND);
        let r = (y0 + i32::from(cr) * i32::from(CR_CF)) >> YUV_PREC;
        let g = (y0
            + i32::from(cr) * i32::from(C_G_CR_COEF_1)
            + i32::from(cb) * i32::from(C_G_CB_COEF_2))
            >> YUV_PREC;
        let b = (y0 + i32::from(cb) * i32::from(CB_CF)) >> YUV_PREC;

        if BGRA {
            out[0] = clamp(b);
            out[1] = clamp(g);
            out[2] = clamp(r);
        } else {
            out[0] = clamp(r);
            out[1] = clamp(g);
            out[2] = clamp(b);
        }
    }
    // Increment pos
    *pos += 48;
}

pub fn ycbcr_to_grayscale(y: &[i16], width: usize, padded_width: usize, output: &mut [u8]) {
    for (y_in, out) in y
        .chunks_exact(padded_width)
        .zip(output.chunks_exact_mut(width))
    {
        for (y, out) in
y_in.iter().zip(out.iter_mut()) { *out = *y as u8; } } } zune-jpeg-0.5.11/src/color_convert.rs000064400000000000000000000066541046102023000156420ustar 00000000000000/* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ #![allow( clippy::many_single_char_names, clippy::similar_names, clippy::cast_possible_truncation, clippy::cast_sign_loss, clippy::cast_possible_wrap, clippy::too_many_arguments, clippy::doc_markdown )] //! Color space conversion routines //! //! This files exposes functions to convert one colorspace to another in a jpeg //! image //! //! Currently supported conversions are //! //! - `YCbCr` to `RGB,RGBA,GRAYSCALE,RGBX`. //! //! //! Hey there, if your reading this it means you probably need something, so let me help you. //! //! There are 3 supported cpu extensions here. //! 1. Scalar //! 2. SSE //! 3. AVX //! //! There are two types of the color convert functions //! //! 1. Acts on 16 pixels. //! 2. Acts on 8 pixels. //! //! The reason for this is because when implementing the AVX part it occurred to me that we can actually //! do better and process 2 MCU's if we change IDCT return type to be `i16's`, since a lot of //! CPU's these days support AVX extensions, it becomes nice if we optimize for that path , //! therefore AVX routines can process 16 pixels directly and SSE and Scalar just compensate. //! //! By compensating, I mean I wrote the 16 pixels version operating on the 8 pixel version twice. //! //! Therefore if your looking to optimize some routines, probably start there. 
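The module docs above describe the integer-only YCbCr conversion the scalar path performs with 14-bit fixed-point coefficients. A minimal single-pixel sketch of that math (coefficient values copied from `scalar.rs`; `ycbcr_to_rgb_pixel` is an illustrative helper, not part of the crate):

```rust
// Single-pixel sketch of the 14-bit fixed-point YCbCr -> RGB step.
// Coefficients mirror scalar.rs: Bt.601 full range, 14 bits of precision.
const Y_CF: i32 = 16384;
const CR_CF: i32 = 22970;
const CB_CF: i32 = 29032;
const C_G_CR: i32 = -11700;
const C_G_CB: i32 = -5638;
const PREC: i32 = 14;
const RND: i32 = (1 << (PREC - 1)) - 1; // rounding bias, ~0.4999 in fixed point

fn clamp(v: i32) -> u8 {
    v.clamp(0, 255) as u8
}

fn ycbcr_to_rgb_pixel(y: i16, cb: i16, cr: i16) -> (u8, u8, u8) {
    // chroma is stored biased by 128
    let (cb, cr) = (i32::from(cb) - 128, i32::from(cr) - 128);
    let y0 = i32::from(y) * Y_CF + RND;
    let r = (y0 + cr * CR_CF) >> PREC;
    let g = (y0 + cr * C_G_CR + cb * C_G_CB) >> PREC;
    let b = (y0 + cb * CB_CF) >> PREC;
    (clamp(r), clamp(g), clamp(b))
}

fn main() {
    // neutral chroma (128) leaves luma untouched
    assert_eq!(ycbcr_to_rgb_pixel(0, 128, 128), (0, 0, 0));
    assert_eq!(ycbcr_to_rgb_pixel(255, 128, 128), (255, 255, 255));
    // strong Cr pushes red up (clamped) and pulls green down
    let (r, g, b) = ycbcr_to_rgb_pixel(128, 128, 255);
    assert!(r == 255 && g < 64 && b == 128);
    println!("{r} {g} {b}");
}
```

The `>> PREC` at the end undoes the 14-bit scaling in one shift, which is why the coefficients are stored pre-multiplied by 2^14.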
pub use scalar::ycbcr_to_grayscale;
use zune_core::colorspace::ColorSpace;
use zune_core::options::DecoderOptions;

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[cfg(feature = "x86")]
pub use crate::color_convert::avx::{ycbcr_to_rgb_avx2, ycbcr_to_rgba_avx2};
use crate::decoder::ColorConvert16Ptr;

mod avx;
mod neon64;
mod scalar;

#[allow(unused_variables)]
pub fn choose_ycbcr_to_rgb_convert_func(
    type_need: ColorSpace, options: &DecoderOptions
) -> Option<ColorConvert16Ptr> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    #[cfg(feature = "x86")]
    {
        use zune_core::log::debug;

        if options.use_avx2() {
            debug!("Using AVX optimised color conversion functions");
            // I believe avx2 means sse4 is also available
            // match colorspace
            match type_need {
                ColorSpace::RGB => return Some(ycbcr_to_rgb_avx2),
                ColorSpace::RGBA => return Some(ycbcr_to_rgba_avx2),
                _ => () // fall through to scalar, which has more types
            };
        }
    }
    #[cfg(all(feature = "neon", target_arch = "aarch64"))]
    {
        if options.use_neon() {
            use crate::color_convert::neon64::{ycbcr_to_rgb_neon, ycbcr_to_rgba_neon};
            match type_need {
                ColorSpace::RGB => return Some(ycbcr_to_rgb_neon),
                ColorSpace::RGBA => return Some(ycbcr_to_rgba_neon),
                _ => () // fall through to scalar, which has more types
            };
        }
    }
    // when there is no x86/neon path, or we haven't returned by here, resort to scalar
    return match type_need {
        ColorSpace::RGB => Some(scalar::ycbcr_to_rgb_inner_16_scalar::<false>),
        ColorSpace::RGBA => Some(scalar::ycbcr_to_rgba_inner_16_scalar::<false>),
        ColorSpace::BGRA => Some(scalar::ycbcr_to_rgba_inner_16_scalar::<true>),
        ColorSpace::BGR => Some(scalar::ycbcr_to_rgb_inner_16_scalar::<true>),
        _ => None
    };
}

zune-jpeg-0.5.11/src/components.rs

/*
 * Copyright (c) 2023.
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */

//! This module exports a single struct to store information about
//! JPEG image components
//!
//! The data is extracted from a SOF header.

use alloc::vec::Vec;
use alloc::{format, vec};

use zune_core::log::trace;

use crate::alloc::string::ToString;
use crate::decoder::MAX_COMPONENTS;
use crate::errors::DecodeErrors;
use crate::upsampler::upsample_no_op;

const MAX_SAMP_FACTOR: usize = 4;

/// Represents an up-sampler function; this function will be called to upsample
/// a down-sampled image
pub type UpSampler = fn(
    input: &[i16],
    in_near: &[i16],
    in_far: &[i16],
    scratch_space: &mut [i16],
    output: &mut [i16]
);

/// Component data from the start of frame
#[derive(Clone)]
pub(crate) struct Components {
    /// The type of component that has the metadata below, can be Y, Cb or Cr
    pub component_id: ComponentID,
    /// Sub-sampling ratio of this component in the y-plane
    pub vertical_sample: usize,
    /// Sub-sampling ratio of this component in the x-plane
    pub horizontal_sample: usize,
    /// DC huffman table position
    pub dc_huff_table: usize,
    /// AC huffman table position for this element.
    pub ac_huff_table: usize,
    /// Quantization table number
    pub quantization_table_number: u8,
    /// Specifies quantization table to use with this component
    pub quantization_table: [i32; 64],
    /// DC prediction for the component
    pub dc_pred: i32,
    /// An up-sampling function, can be basic or SSE, depending
    /// on the platform
    pub up_sampler: UpSampler,
    /// How many pixels do we need to skip to get to the next line?
    pub width_stride: usize,
    /// Component ID for progressive
    pub id: u8,
    /// Whether we need to decode this image component.
pub needed: bool, /// Upsample scanline pub raw_coeff: Vec, /// Upsample destination, stores a scanline worth of sub sampled data pub upsample_dest: Vec, /// previous row, used to handle MCU boundaries pub row_up: Vec, /// current row, used to handle MCU boundaries again pub row: Vec, pub first_row_upsample_dest: Vec, pub idct_pos: usize, pub x: usize, pub w2: usize, pub y: usize, pub sample_ratio: SampleRatios, // a very annoying bug pub fix_an_annoying_bug: usize } impl Components { /// Create a new instance from three bytes from the start of frame #[inline] pub fn from(a: [u8; 3], pos: u8) -> Result { // it's a unique identifier. // doesn't have to be ascending // see tests/inputs/huge_sof_number // // For such cases, use the position of the component // to determine width let id = match pos { 0 => ComponentID::Y, 1 => ComponentID::Cb, 2 => ComponentID::Cr, 3 => ComponentID::Q, _ => { return Err(DecodeErrors::Format(format!( "Unknown component id found,{pos}, expected value between 1 and 4" ))) } }; let horizontal_sample = (a[1] >> 4) as usize; let vertical_sample = (a[1] & 0x0f) as usize; // Match libjpeg turbo on checking for sampling factors // Reject anything above 4 if horizontal_sample > MAX_SAMP_FACTOR { return Err(DecodeErrors::Format(format!( "Bogus Horizontal Sampling Factor {horizontal_sample}" ))); } if vertical_sample > MAX_SAMP_FACTOR { return Err(DecodeErrors::Format(format!( "Bogus Vertical Sampling Factor {vertical_sample}" ))); } let quantization_table_number = a[2]; // confirm quantization number is between 0 and MAX_COMPONENTS if usize::from(quantization_table_number) >= MAX_COMPONENTS { return Err(DecodeErrors::Format(format!( "Too large quantization number :{quantization_table_number}, expected value between 0 and {MAX_COMPONENTS}" ))); } // check that upsampling ratios are powers of two // if these fail, it's probably a corrupt image. 
if !horizontal_sample.is_power_of_two() { return Err(DecodeErrors::Format(format!( "Horizontal sample is not a power of two({horizontal_sample}) cannot decode" ))); } // if !vertical_sample.is_power_of_two() { // return Err(DecodeErrors::Format(format!( // "Vertical sub-sample is not power of two({vertical_sample}) cannot decode" // ))); // } if vertical_sample == 0 { // Check for invalid vertical sample return Err(DecodeErrors::Format("Vertical sample is zero".to_string())); } trace!( "Component ID:{:?} \tHS:{} VS:{} QT:{}", id, horizontal_sample, vertical_sample, quantization_table_number ); Ok(Components { component_id: id, vertical_sample, horizontal_sample, quantization_table_number, first_row_upsample_dest: vec![], // These two will be set with sof marker dc_huff_table: 0, ac_huff_table: 0, quantization_table: [0; 64], dc_pred: 0, up_sampler: upsample_no_op, // set later width_stride: horizontal_sample, id: a[0], needed: true, raw_coeff: vec![], upsample_dest: vec![], row_up: vec![], row: vec![], idct_pos: 0, x: 0, y: 0, w2: 0, sample_ratio: SampleRatios::None, fix_an_annoying_bug: 1 }) } /// Setup space for upsampling /// /// During upsample, we need a reference of the last row so that upsampling can /// proceed correctly, /// so we store the last line of every scanline and use it for the next upsampling procedure /// to store this, but since we don't need it for 1v1 upsampling, /// we only call this for routines that need upsampling /// /// # Requirements /// - width stride of this element is set for the component. 
pub fn setup_upsample_scanline(&mut self) { self.row = vec![0; self.width_stride * self.vertical_sample]; self.row_up = vec![0; self.width_stride * self.vertical_sample]; self.first_row_upsample_dest = vec![128; self.vertical_sample * self.width_stride * self.sample_ratio.sample()]; self.upsample_dest = vec![0; self.width_stride * self.sample_ratio.sample() * self.fix_an_annoying_bug * 8]; } } /// Component ID's #[derive(Copy, Debug, Clone, PartialEq, Eq)] pub enum ComponentID { /// Luminance channel Y, /// Blue chrominance Cb, /// Red chrominance Cr, /// Q or fourth component Q } #[derive(Copy, Debug, Clone, PartialEq, Eq)] pub enum SampleRatios { HV, V, H, Generic(usize, usize), None } impl SampleRatios { pub fn sample(self) -> usize { match self { SampleRatios::HV => 4, SampleRatios::V | SampleRatios::H => 2, SampleRatios::Generic(a, b) => a * b, SampleRatios::None => 1 } } } zune-jpeg-0.5.11/src/decoder.rs000064400000000000000000001047531046102023000143700ustar 00000000000000/* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! Main image logic. 
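The `SampleRatios::sample` mapping defined in `components.rs` above is the factor by which upsampling grows a component's scanline buffer, and it is exactly what sizes `upsample_dest` in `setup_upsample_scanline` (`width_stride * sample_ratio.sample() * ... * 8`). A stand-alone mirror of that mapping, for illustration only (the crate's type is private, so this re-declares it rather than importing it):

```rust
// Mirror of the SampleRatios::sample() mapping shown above: how many
// output samples each sub-sampled input sample expands into.
#[derive(Copy, Clone)]
enum SampleRatios {
    HV,                    // 2x2 chroma sub-sampling (4:2:0)
    V,                     // vertical-only sub-sampling
    H,                     // horizontal-only sub-sampling (4:2:2)
    Generic(usize, usize), // arbitrary h x v factors
    None                   // no sub-sampling (4:4:4)
}

impl SampleRatios {
    fn sample(self) -> usize {
        match self {
            SampleRatios::HV => 4,
            SampleRatios::V | SampleRatios::H => 2,
            SampleRatios::Generic(a, b) => a * b,
            SampleRatios::None => 1
        }
    }
}
```

For a typical 4:2:0 image each chroma scanline therefore expands fourfold, which is why the HV path needs the largest `upsample_dest` allocation.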
#![allow(clippy::doc_markdown)] use alloc::string::ToString; use alloc::vec::Vec; use alloc::{format, vec}; use zune_core::bytestream::{ZByteReaderTrait, ZReader}; use zune_core::colorspace::ColorSpace; use zune_core::log::{error, trace, warn}; use zune_core::options::DecoderOptions; use crate::color_convert::choose_ycbcr_to_rgb_convert_func; use crate::components::{Components, SampleRatios}; use crate::errors::{DecodeErrors, UnsupportedSchemes}; use crate::headers::{ parse_app1, parse_app13, parse_app14, parse_app2, parse_dqt, parse_huffman, parse_sos, parse_start_of_frame }; use crate::huffman::HuffmanTable; use crate::idct::{choose_idct_func, choose_idct_1x1_func, choose_idct_4x4_func}; use crate::marker::Marker; use crate::misc::SOFMarkers; use crate::upsampler::{ choose_horizontal_samp_function, choose_hv_samp_function, choose_v_samp_function, generic_sampler, upsample_no_op }; /// Maximum components pub(crate) const MAX_COMPONENTS: usize = 4; /// Maximum image dimensions supported. pub(crate) const MAX_DIMENSIONS: usize = 1 << 27; /// Color conversion function that can convert YCbCr colorspace to RGB(A/X) for /// 16 values /// /// The following are guarantees to the following functions /// /// 1. The `&[i16]` slices passed contain 16 items /// /// 2. The slices passed are in the following order /// `y,cb,cr` /// /// 3. `&mut [u8]` is zero initialized /// /// 4. `&mut usize` points to the position in the array where new values should /// be used /// /// The pointer should /// 1. Carry out color conversion /// 2. 
Update `&mut usize` with the new position
pub type ColorConvert16Ptr = fn(&[i16; 16], &[i16; 16], &[i16; 16], &mut [u8], &mut usize);

/// IDCT function prototype
///
/// This encapsulates a dequantize and IDCT function which will carry out the
/// following functions
///
/// Multiply each 64 element block of `&mut [i16]` with `&Aligned32<[i32;64]>`
/// Carry out IDCT (type 3 DCT) on each block of 64 i16's
pub type IDCTPtr = fn(&mut [i32; 64], &mut [i16], usize);

/// An encapsulation of an ICC chunk
pub(crate) struct ICCChunk {
    pub(crate) seq_no: u8,
    pub(crate) num_markers: u8,
    pub(crate) data: Vec<u8>
}

/// A JPEG Decoder Instance.
#[allow(clippy::upper_case_acronyms, clippy::struct_excessive_bools)]
pub struct JpegDecoder<T: ZByteReaderTrait> {
    /// Struct to hold image information from SOI
    pub(crate) info: ImageInfo,
    /// Quantization tables, will be set to none and the tables will
    /// be moved to `components` field
    pub(crate) qt_tables: [Option<[i32; 64]>; MAX_COMPONENTS],
    /// DC Huffman Tables with a maximum of 4 tables for each component
    pub(crate) dc_huffman_tables: [Option<HuffmanTable>; MAX_COMPONENTS],
    /// AC Huffman Tables with a maximum of 4 tables for each component
    pub(crate) ac_huffman_tables: [Option<HuffmanTable>; MAX_COMPONENTS],
    /// Image components, holds information like DC prediction and quantization
    /// tables of a component
    pub(crate) components: Vec<Components>,
    /// maximum horizontal component of all channels in the image
    pub(crate) h_max: usize,
    /// maximum vertical component of all channels in the image
    pub(crate) v_max: usize,
    /// MCU width (interleaved scans)
    pub(crate) mcu_width: usize,
    /// MCU height (interleaved scans)
    pub(crate) mcu_height: usize,
    /// Number of MCUs in the x plane
    pub(crate) mcu_x: usize,
    /// Number of MCUs in the y plane
    pub(crate) mcu_y: usize,
    /// Is the image interleaved?
pub(crate) is_interleaved: bool, pub(crate) sub_sample_ratio: SampleRatios, /// Image input colorspace, should be YCbCr for a sane image, might be /// grayscale too pub(crate) input_colorspace: ColorSpace, // Progressive image details /// Is the image progressive? pub(crate) is_progressive: bool, /// Start of spectral scan pub(crate) spec_start: u8, /// End of spectral scan pub(crate) spec_end: u8, /// Successive approximation bit position high pub(crate) succ_high: u8, /// Successive approximation bit position low pub(crate) succ_low: u8, /// Number of components. pub(crate) num_scans: u8, /// For a scan, check if any component has vertical/horizontal sampling. pub(crate) scan_subsampled: bool, // Function pointers, for pointy stuff. /// Dequantize and idct function // This is determined at runtime which function to run, statically it's // initialized to a platform independent one and during initialization // of this struct, we check if we can switch to a faster one which // depend on certain CPU extensions. pub(crate) idct_func: IDCTPtr, /// Specialized IDCT when we can guarantee only few coefficients are non-zero. /// /// **The callee must uphold a contract**. See [`choose_idct_4x4_func`]. 
pub(crate) idct_4x4_func: IDCTPtr, pub(crate) idct_1x1_func: IDCTPtr, // Color convert function which acts on 16 YCbCr values pub(crate) color_convert_16: ColorConvert16Ptr, pub(crate) z_order: [usize; MAX_COMPONENTS], /// restart markers pub(crate) restart_interval: usize, pub(crate) todo: usize, // decoder options pub(crate) options: DecoderOptions, // byte-stream pub(crate) stream: ZReader, // Indicate whether headers have been decoded pub(crate) headers_decoded: bool, pub(crate) seen_sof: bool, // exif data, lifted from app2 pub(crate) icc_data: Vec, pub(crate) is_mjpeg: bool, pub(crate) coeff: usize // Solves some weird bug :) } impl JpegDecoder where T: ZByteReaderTrait { #[allow(clippy::redundant_field_names)] fn default(options: DecoderOptions, buffer: T) -> Self { let color_convert = choose_ycbcr_to_rgb_convert_func(ColorSpace::RGB, &options).unwrap(); JpegDecoder { info: ImageInfo::default(), qt_tables: [None, None, None, None], dc_huffman_tables: [None, None, None, None], ac_huffman_tables: [None, None, None, None], components: vec![], // Interleaved information h_max: 1, v_max: 1, mcu_height: 0, mcu_width: 0, mcu_x: 0, mcu_y: 0, is_interleaved: false, sub_sample_ratio: SampleRatios::None, is_progressive: false, spec_start: 0, spec_end: 0, succ_high: 0, succ_low: 0, num_scans: 0, scan_subsampled: false, idct_func: choose_idct_func(&options), idct_4x4_func: choose_idct_4x4_func(&options), idct_1x1_func: choose_idct_1x1_func(&options), color_convert_16: color_convert, input_colorspace: ColorSpace::YCbCr, z_order: [0; MAX_COMPONENTS], restart_interval: 0, todo: 0x7fff_ffff, options: options, stream: ZReader::new(buffer), headers_decoded: false, seen_sof: false, icc_data: vec![], is_mjpeg: false, coeff: 1 } } /// Decode a buffer already in memory /// /// The buffer should be a valid jpeg file, perhaps created by the command /// `std:::fs::read()` or a JPEG file downloaded from the internet. 
/// /// # Errors /// See DecodeErrors for an explanation pub fn decode(&mut self) -> Result, DecodeErrors> { self.decode_headers()?; let size = self.output_buffer_size().unwrap(); let mut out = vec![0; size]; self.decode_into(&mut out)?; Ok(out) } /// Create a new Decoder instance /// /// # Arguments /// - `stream`: The raw bytes of a jpeg file. #[must_use] #[allow(clippy::new_without_default)] pub fn new(stream: T) -> JpegDecoder { JpegDecoder::default(DecoderOptions::default(), stream) } /// Returns the image information /// /// This **must** be called after a subsequent call to [`decode`] or [`decode_headers`] /// it will return `None` /// /// # Returns /// - `Some(info)`: Image information,width, height, number of components /// - None: Indicates image headers haven't been decoded /// /// [`decode`]: JpegDecoder::decode /// [`decode_headers`]: JpegDecoder::decode_headers #[must_use] pub fn info(&self) -> Option { // we check for fails to that call by comparing what we have to the default, if // it's default we assume that the caller failed to uphold the // guarantees. We can be sure that an image cannot be the default since // its a hard panic in-case width or height are set to zero. if !self.headers_decoded { return None; } return Some(self.info.clone()); } /// Return the number of bytes required to hold a decoded image frame /// decoded using the given input transformations /// /// # Returns /// - `Some(usize)`: Minimum size for a buffer needed to decode the image /// - `None`: Indicates the image was not decoded, or image dimensions would overflow a usize /// #[must_use] pub fn output_buffer_size(&self) -> Option { return if self.headers_decoded { Some( usize::from(self.width()) .checked_mul(usize::from(self.height()))? .checked_mul(self.options.jpeg_get_out_colorspace().num_components())? 
) } else { None }; } /// Get an immutable reference to the decoder options /// for the decoder instance /// /// This can be used to modify options before actual decoding /// but after initial creation /// /// # Example /// ```no_run /// use zune_core::bytestream::ZCursor; /// use zune_jpeg::JpegDecoder; /// /// let mut decoder = JpegDecoder::new(ZCursor::new(&[])); /// // get current options /// let mut options = decoder.options(); /// // modify it /// let new_options = options.set_max_width(10); /// // set it back /// decoder.set_options(new_options); /// /// ``` #[must_use] pub const fn options(&self) -> &DecoderOptions { &self.options } /// Return the input colorspace of the image /// /// This indicates the colorspace that is present in /// the image, but this may be different to the colorspace that /// the output will be transformed to /// /// # Returns /// -`Some(Colorspace)`: Input colorspace /// - None : Indicates the headers weren't decoded #[must_use] pub fn input_colorspace(&self) -> Option { return if self.headers_decoded { Some(self.input_colorspace) } else { None }; } /// Set decoder options /// /// This can be used to set new options even after initialization /// but before decoding. 
/// /// This does not bear any significance after decoding an image /// /// # Arguments /// - `options`: New decoder options /// /// # Example /// Set maximum jpeg progressive passes to be 4 /// /// ```no_run /// use zune_core::bytestream::ZCursor; /// use zune_jpeg::JpegDecoder; /// let mut decoder =JpegDecoder::new(ZCursor::new(&[])); /// // this works also because DecoderOptions implements `Copy` /// let options = decoder.options().jpeg_set_max_scans(4); /// // set the new options /// decoder.set_options(options); /// // now decode /// decoder.decode().unwrap(); /// ``` pub fn set_options(&mut self, options: DecoderOptions) { self.options = options; } /// Decode Decoder headers /// /// This routine takes care of parsing supported headers from a Decoder /// image /// /// # Supported Headers /// - APP(0) /// - SOF(O) /// - DQT -> Quantization tables /// - DHT -> Huffman tables /// - SOS -> Start of Scan /// # Unsupported Headers /// - SOF(n) -> Decoder images which are not baseline/progressive /// - DAC -> Images using Arithmetic tables /// - JPG(n) fn decode_headers_internal(&mut self) -> Result<(), DecodeErrors> { if self.headers_decoded { trace!("Headers decoded!"); return Ok(()); } // match output colorspace here // we know this will only be called once per image // so makes sense // We only care for ycbcr to rgb/rgba here // in case one is using another colorspace. 
// May god help you let out_colorspace = self.options.jpeg_get_out_colorspace(); if matches!( out_colorspace, ColorSpace::BGR | ColorSpace::BGRA | ColorSpace::RGB | ColorSpace::RGBA ) { self.color_convert_16 = choose_ycbcr_to_rgb_convert_func( self.options.jpeg_get_out_colorspace(), &self.options ) .unwrap(); } // First two bytes should be jpeg soi marker let magic_bytes = self.stream.get_u16_be_err()?; let mut last_byte = 0; let mut bytes_before_marker = 0; if magic_bytes != 0xffd8 { return Err(DecodeErrors::IllegalMagicBytes(magic_bytes)); } loop { // read a byte let mut m = self.stream.read_u8_err()?; // AND OF COURSE some images will have fill bytes in their marker // bitstreams because why not. // // I am disappointed as a man. if (m == 0xFF || m == 0) && last_byte == 0xFF { // This handles the edge case where // images have markers with fill bytes(0xFF) // or byte stuffing (0) // I.e 0xFF 0xFF 0xDA // and // 0xFF 0 0xDA // It should ignore those fill bytes and take 0xDA // I don't know why such images exist // but they do. // so this is for you (with love) while m == 0xFF || m == 0x0 { last_byte = m; m = self.stream.read_u8_err()?; } } // Last byte should be 0xFF to confirm existence of a marker since markers look // like OxFF(some marker data) if last_byte == 0xFF { let marker = Marker::from_u8(m); if let Some(n) = marker { if bytes_before_marker > 3 { if self.options.strict_mode() /*No reason to use this*/ { return Err(DecodeErrors::FormatStatic( "[strict-mode]: Extra bytes between headers" )); } error!( "Extra bytes {} before marker 0xFF{:X}", bytes_before_marker - 3, m ); } bytes_before_marker = 0; self.parse_marker_inner(n)?; // break after reading the start of scan. 
// what follows is the image data if n == Marker::SOS { self.headers_decoded = true; trace!("Input colorspace {:?}", self.input_colorspace); // Check if image is RGB // The check is weird, we need to check if ID // represents R, G and B in ascii, // // I am not sure if this is even specified in any standard, // but jpegli https://github.com/google/jpegli does encode // its images that way, so this will check for that. and handle it appropriately // It is spefified here so that on a successful header decode,we can at least // try to attribute image colorspace correctly. // // It was first the issue in https://github.com/etemesi254/zune-image/issues/291 // that brought it to light // let mut is_rgb = self.components.len() == 3; let chars = ['R', 'G', 'B']; for (comp, single_char) in self.components.iter().zip(chars.iter()) { is_rgb &= comp.id == (*single_char) as u8 } // Image is RGB, change colorspace if is_rgb { self.input_colorspace = ColorSpace::RGB; } return Ok(()); } } else { bytes_before_marker = 0; warn!("Marker 0xFF{:X} not known", m); let length = self.stream.get_u16_be_err()?; if length < 2 { return Err(DecodeErrors::Format(format!( "Found a marker with invalid length : {length}" ))); } warn!("Skipping {} bytes", length - 2); self.stream.skip((length - 2) as usize)?; } } last_byte = m; bytes_before_marker += 1; } // Check if image is RGB } #[allow(clippy::too_many_lines)] pub(crate) fn parse_marker_inner(&mut self, m: Marker) -> Result<(), DecodeErrors> { match m { Marker::SOF(0..=2) => { let marker = { // choose marker if m == Marker::SOF(0) || m == Marker::SOF(1) { SOFMarkers::BaselineDct } else { self.is_progressive = true; SOFMarkers::ProgressiveDctHuffman } }; trace!("Image encoding scheme =`{:?}`", marker); // get components parse_start_of_frame(marker, self)?; } // Start of Frame Segments not supported Marker::SOF(v) => { let feature = UnsupportedSchemes::from_int(v); if let Some(feature) = feature { return Err(DecodeErrors::Unsupported(feature)); } 
return Err(DecodeErrors::Format("Unsupported image format".to_string())); } //APP(0) segment Marker::APP(0) => { let mut length = self.stream.get_u16_be_err()?; if length < 2 { return Err(DecodeErrors::Format(format!( "Found a marker with invalid length:{length}\n" ))); } // skip for now if length > 5 { let mut buffer = [0u8; 5]; self.stream.read_exact_bytes(&mut buffer)?; if &buffer == b"AVI1\0" { self.is_mjpeg = true; } length -= 5; } self.stream.skip(length.saturating_sub(2) as usize)?; //parse_app(buf, m, &mut self.info)?; } Marker::APP(1) => { parse_app1(self)?; } Marker::APP(2) => { parse_app2(self)?; } // Quantization tables Marker::DQT => { parse_dqt(self)?; } // Huffman tables Marker::DHT => { parse_huffman(self)?; } // Start of Scan Data Marker::SOS => { parse_sos(self)?; } Marker::EOI => return Err(DecodeErrors::FormatStatic("Premature End of image")), Marker::DAC | Marker::DNL => { return Err(DecodeErrors::Format(format!( "Parsing of the following header `{m:?}` is not supported,\ cannot continue" ))); } Marker::DRI => { if self.stream.get_u16_be_err()? 
!= 4 { return Err(DecodeErrors::Format( "Bad DRI length, Corrupt JPEG".to_string() )); } self.restart_interval = usize::from(self.stream.get_u16_be_err()?); trace!("DRI marker present ({})", self.restart_interval); self.todo = self.restart_interval; } Marker::APP(14) => { parse_app14(self)?; } Marker::APP(13) => { parse_app13(self)?; } _ => { warn!( "Capabilities for processing marker \"{:?}\" not implemented", m ); let length = self.stream.get_u16_be_err()?; if length < 2 { return Err(DecodeErrors::Format(format!( "Found a marker with invalid length:{length}\n" ))); } warn!("Skipping {} bytes", length - 2); self.stream.skip((length - 2) as usize)?; } } Ok(()) } /// Get the embedded ICC profile if it exists /// and is correct /// /// One needs not to decode the whole image to extract this, /// calling [`decode_headers`] for an image with an ICC profile /// allows you to decode this /// /// # Returns /// - `Some(Vec)`: The raw ICC profile of the image /// - `None`: May indicate an error in the ICC profile , non-existence of /// an ICC profile, or that the headers weren't decoded. 
/// /// [`decode_headers`]:Self::decode_headers #[must_use] pub fn icc_profile(&self) -> Option> { let mut marker_present: [Option<&ICCChunk>; 256] = [None; 256]; if !self.headers_decoded { return None; } let num_markers = self.icc_data.len(); if num_markers == 0 || num_markers >= 255 { return None; } // check validity for chunk in &self.icc_data { if usize::from(chunk.num_markers) != num_markers { // all the lengths must match return None; } if chunk.seq_no == 0 { warn!("Zero sequence number in ICC, corrupt ICC chunk"); return None; } if marker_present[usize::from(chunk.seq_no)].is_some() { // duplicate seq_no warn!("Duplicate sequence number in ICC, corrupt chunk"); return None; } marker_present[usize::from(chunk.seq_no)] = Some(chunk); } let mut data = Vec::with_capacity(1000); // assemble the data now for chunk in marker_present.get(1..=num_markers).unwrap() { if let Some(ch) = chunk { data.extend_from_slice(&ch.data); } else { warn!("Missing icc sequence number, corrupt ICC chunk "); return None; } } Some(data) } /// Return the exif data for the file /// /// This returns the raw exif data starting at the /// TIFF header /// /// # Returns /// -`Some(data)`: The raw exif data, if present in the image /// - None: May indicate the following /// /// 1. The image doesn't have exif data /// 2. The image headers haven't been decoded #[must_use] pub fn exif(&self) -> Option<&Vec> { return self.info.exif_data.as_ref(); } /// Return the XMP data for the file /// /// This returns raw XMP data starting at the XML header /// One needs an XML/XMP decoder to extract valuable metadata /// /// /// # Returns /// - `Some(data)`: Raw xmp data /// - `None`: May indicate the following /// 1. The image does not have xmp data /// 2. 
The image headers have not been decoded /// /// # Example /// /// ```no_run /// use zune_core::bytestream::ZCursor; /// use zune_jpeg::JpegDecoder; /// let mut decoder = JpegDecoder::new(ZCursor::new(&[])); /// // decode headers to extract xmp metadata if present /// decoder.decode_headers().unwrap(); /// if let Some(data) = decoder.xmp(){ /// let stringified = String::from_utf8_lossy(data); /// println!("XMP") /// } else{ /// println!("No XMP Found") /// } /// /// ``` pub fn xmp(&self) -> Option<&Vec> { return self.info.xmp_data.as_ref(); } /// Return the IPTC data for the file /// /// This returns the raw IPTC data. /// /// # Returns /// -`Some(data)`: The raw IPTC data, if present in the image /// - None: May indicate the following /// /// 1. The image doesn't have IPTC data /// 2. The image headers haven't been decoded #[must_use] pub fn iptc(&self) -> Option<&Vec> { return self.info.iptc_data.as_ref(); } /// Get the output colorspace the image pixels will be decoded into /// /// /// # Note. /// This field can only be regarded after decoding headers, /// as markers such as Adobe APP14 may dictate different colorspaces /// than requested. 
/// /// Calling `decode_headers` is sufficient to know what colorspace the /// output is, if this is called after `decode` it indicates the colorspace /// the output is currently in /// /// Additionally not all input->output colorspace mappings are supported /// but all input colorspaces can map to RGB colorspace, so that's a safe bet /// if one is handling image formats /// ///# Returns /// - `Some(Colorspace)`: If headers have been decoded, the colorspace the ///output array will be in ///- `None #[must_use] pub fn output_colorspace(&self) -> Option { return if self.headers_decoded { Some(self.options.jpeg_get_out_colorspace()) } else { None }; } /// Decode into a pre-allocated buffer /// /// It is an error if the buffer size is smaller than /// [`output_buffer_size()`](Self::output_buffer_size) /// /// If the buffer is bigger than expected, we ignore the end padding bytes /// /// # Example /// /// - Read headers and then alloc a buffer big enough to hold the image /// /// ```no_run /// use zune_core::bytestream::ZCursor; /// use zune_jpeg::JpegDecoder; /// let mut decoder = JpegDecoder::new(ZCursor::new(&[])); /// // before we get output, we must decode the headers to get width /// // height, and input colorspace /// decoder.decode_headers().unwrap(); /// /// let mut out = vec![0;decoder.output_buffer_size().unwrap()]; /// // write into out /// decoder.decode_into(&mut out).unwrap(); /// ``` /// /// pub fn decode_into(&mut self, out: &mut [u8]) -> Result<(), DecodeErrors> { self.decode_headers_internal()?; let expected_size = self.output_buffer_size().unwrap(); if out.len() < expected_size { // too small of a size return Err(DecodeErrors::TooSmallOutput(expected_size, out.len())); } // ensure we don't touch anyone else's scratch space let out_len = core::cmp::min(out.len(), expected_size); let out = &mut out[0..out_len]; if self.is_progressive { self.decode_mcu_ycbcr_progressive(out) } else { self.decode_mcu_ycbcr_baseline(out) } } /// Read only headers from a 
jpeg image buffer /// /// This allows you to extract important information like /// image width and height without decoding the full image /// /// # Examples /// ```no_run /// use zune_core::bytestream::ZCursor; /// use zune_jpeg::{JpegDecoder}; /// /// let img_data = std::fs::read("a_valid.jpeg").unwrap(); /// let mut decoder = JpegDecoder::new(ZCursor::new(&img_data)); /// decoder.decode_headers().unwrap(); /// /// println!("Total decoder dimensions are : {:?} pixels",decoder.dimensions()); /// println!("Number of components in the image are {}", decoder.info().unwrap().components); /// ``` /// # Errors /// See DecodeErrors enum for list of possible errors during decoding pub fn decode_headers(&mut self) -> Result<(), DecodeErrors> { self.decode_headers_internal()?; Ok(()) } /// Create a new decoder with the specified options to be used for decoding /// an image /// /// # Arguments /// - `buf`: The input buffer from where we will pull in compressed jpeg bytes from /// - `options`: Options specific to this decoder instance #[must_use] pub fn new_with_options(buf: T, options: DecoderOptions) -> JpegDecoder { JpegDecoder::default(options, buf) } /// Set up-sampling routines in case an image is down sampled pub(crate) fn set_upsampling(&mut self) -> Result<(), DecodeErrors> { // no sampling, return early // check if horizontal max ==1 if self.h_max == self.v_max && self.h_max == 1 { return Ok(()); } match (self.h_max, self.v_max) { (1, 1) => { self.sub_sample_ratio = SampleRatios::None; } (1, 2) => { self.sub_sample_ratio = SampleRatios::V; } (2, 1) => { self.sub_sample_ratio = SampleRatios::H; } (2, 2) => { self.sub_sample_ratio = SampleRatios::HV; } (hs, vs) => { self.sub_sample_ratio = SampleRatios::Generic(hs, vs) // return Err(DecodeErrors::Format(format!( // "Unknown down-sampling method ({hs},{vs}), cannot continue") // )) } } for comp in &mut self.components { let hs = self.h_max / comp.horizontal_sample; let vs = self.v_max / comp.vertical_sample; let 
samp_factor = match (hs, vs) { (1, 1) => { comp.sample_ratio = SampleRatios::None; upsample_no_op } (2, 1) => { comp.sample_ratio = SampleRatios::H; choose_horizontal_samp_function(&self.options) } (1, 2) => { comp.sample_ratio = SampleRatios::V; choose_v_samp_function(&self.options) } (2, 2) => { comp.sample_ratio = SampleRatios::HV; choose_hv_samp_function(&self.options) } (hs, vs) => { comp.sample_ratio = SampleRatios::Generic(hs, vs); generic_sampler() } }; comp.setup_upsample_scanline(); comp.up_sampler = samp_factor; } return Ok(()); } #[must_use] /// Get the width of the image as a u16 /// /// The width lies between 1 and 65535 pub(crate) fn width(&self) -> u16 { self.info.width } /// Get the height of the image as a u16 /// /// The height lies between 1 and 65535 #[must_use] pub(crate) fn height(&self) -> u16 { self.info.height } /// Get image dimensions as a tuple of width and height /// or `None` if the image hasn't been decoded. /// /// # Returns /// - `Some(width,height)`: Image dimensions /// - None : The image headers haven't been decoded #[must_use] pub const fn dimensions(&self) -> Option<(usize, usize)> { return if self.headers_decoded { Some((self.info.width as usize, self.info.height as usize)) } else { None }; } } #[derive(Default, Clone, Eq, PartialEq, Debug)] pub struct GainMapInfo { pub data: Vec } /// A struct representing Image Information #[derive(Default, Clone, Eq, PartialEq)] #[allow(clippy::module_name_repetitions)] pub struct ImageInfo { /// Width of the image pub width: u16, /// Height of image pub height: u16, /// PixelDensity pub pixel_density: u8, /// Start of frame markers pub sof: SOFMarkers, /// Horizontal sample pub x_density: u16, /// Vertical sample pub y_density: u16, /// Number of components pub components: u8, /// Gain Map information, useful for /// UHDR images pub gain_map_info: Vec, /// Multi picture information, useful for /// UHDR images pub multi_picture_information: Option>, /// Exif Data pub exif_data: Option>, 
/// XMP Data pub xmp_data: Option>, /// IPTC Data pub iptc_data: Option> } impl ImageInfo { /// Set width of the image /// /// Found in the start of frame pub(crate) fn set_width(&mut self, width: u16) { self.width = width; } /// Set height of the image /// /// Found in the start of frame pub(crate) fn set_height(&mut self, height: u16) { self.height = height; } /// Set the image density /// /// Found in the start of frame pub(crate) fn set_density(&mut self, density: u8) { self.pixel_density = density; } /// Set image Start of frame marker /// /// found in the Start of frame header pub(crate) fn set_sof_marker(&mut self, marker: SOFMarkers) { self.sof = marker; } /// Set image x-density(dots per pixel) /// /// Found in the APP(0) marker #[allow(dead_code)] pub(crate) fn set_x(&mut self, sample: u16) { self.x_density = sample; } /// Set image y-density /// /// Found in the APP(0) marker #[allow(dead_code)] pub(crate) fn set_y(&mut self, sample: u16) { self.y_density = sample; } } zune-jpeg-0.5.11/src/errors.rs000064400000000000000000000135421046102023000142720ustar 00000000000000/* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! Contains most common errors that may be encountered in decoding a Decoder //! 
image use alloc::string::String; use core::fmt::{Debug, Display, Formatter}; use zune_core::bytestream::ZByteIoError; use crate::misc::{ START_OF_FRAME_EXT_AR, START_OF_FRAME_EXT_SEQ, START_OF_FRAME_LOS_SEQ, START_OF_FRAME_LOS_SEQ_AR, START_OF_FRAME_PROG_DCT_AR }; /// Common Decode errors #[allow(clippy::module_name_repetitions)] pub enum DecodeErrors { /// Any other error we do not know Format(String), /// Any other error we do not know, but one that /// doesn't need to allocate space on the heap FormatStatic(&'static str), /// Illegal Magic Bytes IllegalMagicBytes(u16), /// Problems with the Huffman tables in a JPEG file HuffmanDecode(String), /// Image has zero width ZeroError, /// Define Quantization Table (DQT) errors DqtError(String), /// Start of scan errors SosError(String), /// Start of frame errors SofError(String), /// Unsupported images Unsupported(UnsupportedSchemes), /// MCU errors MCUError(String), /// Exhausted data ExhaustedData, /// Large image dimensions (corrupted data?) LargeDimensions(usize), /// Too small output for size TooSmallOutput(usize, usize), IoErrors(ZByteIoError) } #[cfg(feature = "std")] impl std::error::Error for DecodeErrors {} impl From<&'static str> for DecodeErrors { fn from(data: &'static str) -> Self { return Self::FormatStatic(data); } } impl From<ZByteIoError> for DecodeErrors { fn from(data: ZByteIoError) -> Self { return Self::IoErrors(data); } } impl Debug for DecodeErrors { fn fmt(&self, f: &mut Formatter<'_>) -> core::fmt::Result { match &self { Self::Format(ref a) => write!(f, "{a:?}"), Self::FormatStatic(a) => write!(f, "{:?}", &a), Self::HuffmanDecode(ref reason) => { write!(f, "Error decoding huffman values: {reason}") } Self::ZeroError => write!(f, "Image width or height is set to zero, cannot continue"), Self::DqtError(ref reason) => write!(f, "Error parsing DQT segment. Reason:{reason}"), Self::SosError(ref reason) => write!(f, "Error parsing SOS Segment.
Reason:{reason}"), Self::SofError(ref reason) => write!(f, "Error parsing SOF segment. Reason:{reason}"), Self::IllegalMagicBytes(bytes) => { write!(f, "Error parsing image. Illegal start bytes:{bytes:X}") } Self::MCUError(ref reason) => write!(f, "Error in decoding MCU. Reason {reason}"), Self::Unsupported(ref image_type) => { write!(f, "{image_type:?}") } Self::ExhaustedData => write!(f, "Exhausted data in the image"), Self::LargeDimensions(ref dimensions) => write!( f, "Too large dimensions {dimensions},library supports up to {}", crate::decoder::MAX_DIMENSIONS ), Self::TooSmallOutput(expected, found) => write!(f, "Too small output, expected buffer with at least {expected} bytes but got one with {found} bytes"), Self::IoErrors(error)=>write!(f,"I/O errors {error:?}"), } } } impl Display for DecodeErrors { fn fmt(&self, f: &mut Formatter<'_>) -> core::fmt::Result { write!(f, "{self:?}") } } /// Contains Unsupported/Yet-to-be supported Decoder image encoding types. #[derive(Eq, PartialEq, Copy, Clone)] pub enum UnsupportedSchemes { /// SOF_1 Extended sequential DCT,Huffman coding ExtendedSequentialHuffman, /// Lossless (sequential), huffman coding, LosslessHuffman, /// Extended sequential DEC, arithmetic coding ExtendedSequentialDctArithmetic, /// Progressive DCT, arithmetic coding, ProgressiveDctArithmetic, /// Lossless ( sequential), arithmetic coding LosslessArithmetic } impl Debug for UnsupportedSchemes { fn fmt(&self, f: &mut Formatter<'_>) -> core::fmt::Result { match &self { Self::ExtendedSequentialHuffman => { write!(f, "The library cannot yet decode images encoded using Extended Sequential Huffman encoding scheme yet.") } Self::LosslessHuffman => { write!(f, "The library cannot yet decode images encoded with Lossless Huffman encoding scheme") } Self::ExtendedSequentialDctArithmetic => { write!(f,"The library cannot yet decode Images Encoded with Extended Sequential DCT Arithmetic scheme") } Self::ProgressiveDctArithmetic => { write!(f,"The library cannot 
yet decode images encoded with Progressive DCT Arithmetic scheme") } Self::LosslessArithmetic => { write!(f,"The library cannot yet decode images encoded with Lossless Arithmetic encoding scheme") } } } } impl UnsupportedSchemes { #[must_use] /// Create an unsupported scheme from an integer /// /// # Returns /// `Some(UnsupportedSchemes)` if the int refers to a specific scheme, /// otherwise returns `None` pub fn from_int(int: u8) -> Option<Self> { let int = u16::from_be_bytes([0xff, int]); match int { START_OF_FRAME_PROG_DCT_AR => Some(Self::ProgressiveDctArithmetic), START_OF_FRAME_LOS_SEQ => Some(Self::LosslessHuffman), START_OF_FRAME_LOS_SEQ_AR => Some(Self::LosslessArithmetic), START_OF_FRAME_EXT_SEQ => Some(Self::ExtendedSequentialHuffman), START_OF_FRAME_EXT_AR => Some(Self::ExtendedSequentialDctArithmetic), _ => None } } } zune-jpeg-0.5.11/src/headers.rs /* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! Decode JPEG markers/segments //! //! This file deals with decoding header information in a jpeg file //!
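The `from_int` helper above rebuilds the full two-byte JPEG marker by prefixing `0xFF` to the raw id byte before matching against the SOF constants. A minimal, standalone sketch (not the crate's API — `marker_from_id` is a hypothetical name for illustration) of that byte-to-marker mapping:

```rust
// JPEG markers are two bytes on the wire: 0xFF followed by the marker id.
// `UnsupportedSchemes::from_int` reconstructs the marker the same way,
// via u16::from_be_bytes, so the id byte lands in the low byte.
fn marker_from_id(id: u8) -> u16 {
    u16::from_be_bytes([0xFF, id])
}

fn main() {
    // SOF1: extended sequential DCT, Huffman coding
    assert_eq!(marker_from_id(0xC1), 0xFFC1);
    // SOF10: progressive DCT, arithmetic coding
    assert_eq!(marker_from_id(0xCA), 0xFFCA);
    println!("ok");
}
```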
use alloc::format; use alloc::string::ToString; use alloc::vec::Vec; use zune_core::bytestream::ZByteReaderTrait; use zune_core::colorspace::ColorSpace; use zune_core::log::{debug, trace, warn}; use crate::components::Components; use crate::decoder::{GainMapInfo, ICCChunk, JpegDecoder, MAX_COMPONENTS}; use crate::errors::DecodeErrors; use crate::huffman::HuffmanTable; use crate::misc::{SOFMarkers, UN_ZIGZAG}; ///**B.2.4.2 Huffman table-specification syntax** #[allow(clippy::similar_names, clippy::cast_sign_loss)] pub(crate) fn parse_huffman( decoder: &mut JpegDecoder ) -> Result<(), DecodeErrors> where { // Read the length of the Huffman table let mut dht_length = i32::from(decoder.stream.get_u16_be_err()?.checked_sub(2).ok_or( DecodeErrors::FormatStatic("Invalid Huffman length in image") )?); while dht_length > 16 { // HT information let ht_info = decoder.stream.read_u8_err()?; // third bit indicates whether the huffman encoding is DC or AC type let dc_or_ac = (ht_info >> 4) & 0xF; // Indicate the position of this table, should be less than 4; let index = (ht_info & 0xF) as usize; // read the number of symbols let mut num_symbols: [u8; 17] = [0; 17]; if index >= MAX_COMPONENTS { return Err(DecodeErrors::HuffmanDecode(format!( "Invalid DHT index {index}, expected between 0 and 3" ))); } if dc_or_ac > 1 { return Err(DecodeErrors::HuffmanDecode(format!( "Invalid DHT position {dc_or_ac}, should be 0 or 1" ))); } decoder.stream.read_exact_bytes(&mut num_symbols[1..17])?; dht_length -= 1 + 16; let symbols_sum: i32 = num_symbols.iter().map(|f| i32::from(*f)).sum(); // The sum of the number of symbols cannot be greater than 256; if symbols_sum > 256 { return Err(DecodeErrors::FormatStatic( "Encountered Huffman table with excessive length in DHT" )); } if symbols_sum > dht_length { return Err(DecodeErrors::HuffmanDecode(format!( "Excessive Huffman table of length {symbols_sum} found when header length is {dht_length}" ))); } dht_length -= symbols_sum; // A table containing 
symbols in increasing code length let mut symbols = [0; 256]; decoder .stream .read_exact_bytes(&mut symbols[0..(symbols_sum as usize)])?; // store match dc_or_ac { 0 => { decoder.dc_huffman_tables[index] = Some(HuffmanTable::new( &num_symbols, symbols, true, decoder.is_progressive )?); } _ => { decoder.ac_huffman_tables[index] = Some(HuffmanTable::new( &num_symbols, symbols, false, decoder.is_progressive )?); } } } if dht_length > 0 { return Err(DecodeErrors::FormatStatic("Bogus Huffman table definition")); } Ok(()) } ///**B.2.4.1 Quantization table-specification syntax** #[allow(clippy::cast_possible_truncation, clippy::needless_range_loop)] pub(crate) fn parse_dqt(img: &mut JpegDecoder) -> Result<(), DecodeErrors> { // read length let mut qt_length = img.stream .get_u16_be_err()? .checked_sub(2) .ok_or(DecodeErrors::FormatStatic( "Invalid DQT length. Length should be greater than 2" ))?; // A single DQT header may have multiple QT's while qt_length > 0 { let qt_info = img.stream.read_u8_err()?; // 0 = 8 bit otherwise 16 bit dqt let precision = (qt_info >> 4) as usize; // last 4 bits give us position let table_position = (qt_info & 0x0f) as usize; let precision_value = 64 * (precision + 1); if (precision_value + 1) as u16 > qt_length { return Err(DecodeErrors::DqtError(format!("Invalid QT table bytes left :{}. 
Too small to construct a valid qt table which should be {} long", qt_length, precision_value + 1))); } let dct_table = match precision { 0 => { let mut qt_values = [0; 64]; img.stream.read_exact_bytes(&mut qt_values)?; qt_length -= (precision_value as u16) + 1 /*QT BIT*/; // carry out un zig-zag here un_zig_zag(&qt_values) } 1 => { // 16 bit quantization tables let mut qt_values = [0_u16; 64]; for i in 0..64 { qt_values[i] = img.stream.get_u16_be_err()?; } qt_length -= (precision_value as u16) + 1; un_zig_zag(&qt_values) } _ => { return Err(DecodeErrors::DqtError(format!( "Expected QT precision value of either 0 or 1, found {precision:?}" ))); } }; if table_position >= MAX_COMPONENTS { return Err(DecodeErrors::DqtError(format!( "Too large table position for QT :{table_position}, expected between 0 and 3" ))); } trace!("Assigning qt table {table_position} with precision {precision}"); img.qt_tables[table_position] = Some(dct_table); } return Ok(()); } /// Section:`B.2.2 Frame header syntax` pub(crate) fn parse_start_of_frame( sof: SOFMarkers, img: &mut JpegDecoder ) -> Result<(), DecodeErrors> { if img.seen_sof { return Err(DecodeErrors::SofError( "Two Start of Frame Markers".to_string() )); } // Get length of the frame header let length = img.stream.get_u16_be_err()?; // usually 8, but can be 12 and 16, we currently support only 8 // so sorry about that 12 bit images let dt_precision = img.stream.read_u8_err()?; if dt_precision != 8 { return Err(DecodeErrors::SofError(format!( "The library can only parse 8-bit images, the image has {dt_precision} bits of precision" ))); } img.info.set_density(dt_precision); // read and set the image height. 
let img_height = img.stream.get_u16_be_err()?; img.info.set_height(img_height); // read and set the image width let img_width = img.stream.get_u16_be_err()?; img.info.set_width(img_width); trace!("Image width :{}", img_width); trace!("Image height :{}", img_height); if usize::from(img_width) > img.options.max_width() { return Err(DecodeErrors::Format(format!("Image width {} greater than width limit {}. Use `set_limits` if you want to support huge images", img_width, img.options.max_width()))); } if usize::from(img_height) > img.options.max_height() { return Err(DecodeErrors::Format(format!("Image height {} greater than height limit {}. Use `set_limits` if you want to support huge images", img_height, img.options.max_height()))); } // Check image width or height is zero if img_width == 0 || img_height == 0 { return Err(DecodeErrors::ZeroError); } // Number of components for the image. let num_components = img.stream.read_u8_err()?; if num_components == 0 { return Err(DecodeErrors::SofError( "Number of components cannot be zero.".to_string() )); } let expected = 8 + 3 * u16::from(num_components); // length should match the number of components if length != expected { return Err(DecodeErrors::SofError(format!( "Length of start of frame differs from expected {expected}, value is {length}" ))); } trace!("Image components : {}", num_components); if num_components == 1 { // SOF sets the number of image components // and that to us translates to setting input and output // colorspaces to zero img.input_colorspace = ColorSpace::Luma; //img.options = img.options.jpeg_set_out_colorspace(ColorSpace::Luma); debug!("Overriding default colorspace set to Luma"); } if num_components == 4 && img.input_colorspace == ColorSpace::YCbCr { trace!("Input image has 4 components, defaulting to CMYK colorspace"); // https://entropymine.wordpress.com/2018/10/22/how-is-a-jpeg-images-color-type-determined/ img.input_colorspace = ColorSpace::CMYK; } // set number of components
img.info.components = num_components; let mut components = Vec::with_capacity(num_components as usize); let mut temp = [0; 3]; for pos in 0..num_components { // read 3 bytes for each component img.stream.read_exact_bytes(&mut temp)?; // create a component. let component = Components::from(temp, pos)?; components.push(component); } img.seen_sof = true; img.info.set_sof_marker(sof); img.components = components; Ok(()) } /// Parse a start of scan data pub(crate) fn parse_sos( image: &mut JpegDecoder ) -> Result<(), DecodeErrors> { // Scan header length let ls = usize::from(image.stream.get_u16_be_err()?); // Number of image components in scan let ns = image.stream.read_u8_err()?; let mut seen: [_; 5] = [-1; { MAX_COMPONENTS + 1 }]; image.num_scans = ns; let smallest_size = 6 + 2 * usize::from(ns); if ls != smallest_size { return Err(DecodeErrors::SosError(format!( "Bad SOS length {ls},corrupt jpeg" ))); } // Check number of components. if !(1..5).contains(&ns) { return Err(DecodeErrors::SosError(format!( "Invalid number of components in start of scan {ns}, expected in range 1..5" ))); } if image.info.components == 0 { return Err(DecodeErrors::FormatStatic( "Error decoding SOF Marker, Number of components cannot be zero." 
)); } // consume spec parameters image.scan_subsampled = false; for i in 0..ns { let id = image.stream.read_u8_err()?; if seen.contains(&i32::from(id)) { return Err(DecodeErrors::SofError(format!( "Duplicate ID {id} seen twice in the same scan" ))); } seen[usize::from(i)] = i32::from(id); // DC and AC huffman table position // top 4 bits contain dc huffman destination table // lower four bits contain ac huffman destination table let y = image.stream.read_u8_err()?; let mut j = 0; while j < image.info.components { if image.components[j as usize].id == id { break; } j += 1; } if j == image.info.components { return Err(DecodeErrors::SofError(format!( "Invalid component id {}, expected one of {:?}", id, image.components.iter().map(|c| c.id).collect::<Vec<_>>() ))); } let component = &mut image.components[usize::from(j)]; component.dc_huff_table = usize::from((y >> 4) & 0xF); component.ac_huff_table = usize::from(y & 0xF); image.z_order[i as usize] = j as usize; if component.vertical_sample != 1 || component.horizontal_sample != 1 { image.scan_subsampled = true; } trace!( "Assigned huffman tables {}/{} to component {j}, id={}", image.components[usize::from(j)].dc_huff_table, image.components[usize::from(j)].ac_huff_table, image.components[usize::from(j)].id, ); } // Collect the component spec parameters // This is only needed for progressive images but I'll read // them in order to ensure they are correct according to the spec // Extract progressive information // https://www.w3.org/Graphics/JPEG/itu-t81.pdf // Page 42 // Start of spectral / predictor selection. (between 0 and 63) image.spec_start = image.stream.read_u8_err()?; // End of spectral selection image.spec_end = image.stream.read_u8_err()?; let bit_approx = image.stream.read_u8_err()?; // successive approximation bit position high image.succ_high = bit_approx >> 4; if image.spec_end > 63 { return Err(DecodeErrors::SosError(format!( "Invalid Se parameter {}, range should be 0-63", image.spec_end ))); } if image.spec_start > 63 { return Err(DecodeErrors::SosError(format!( "Invalid Ss parameter {}, range should be 0-63", image.spec_start ))); } if image.succ_high > 13 { return Err(DecodeErrors::SosError(format!( "Invalid Ah parameter {}, range should be 0-13", image.succ_high ))); } // successive approximation bit position low image.succ_low = bit_approx & 0xF; if image.succ_low > 13 { return Err(DecodeErrors::SosError(format!( "Invalid Al parameter {}, range should be 0-13", image.succ_low ))); } // skip any bytes not read image.stream.skip(smallest_size.saturating_sub(ls))?; trace!( "Ss={}, Se={} Ah={} Al={}", image.spec_start, image.spec_end, image.succ_high, image.succ_low ); Ok(()) } /// Parse the APP13 (IPTC) segment. pub(crate) fn parse_app13( decoder: &mut JpegDecoder ) -> Result<(), DecodeErrors> { const IPTC_PREFIX: &[u8] = b"Photoshop 3.0\0"; // skip length. let mut length = usize::from(decoder.stream.get_u16_be()); if length < 2 { return Err(DecodeErrors::FormatStatic("Too small APP13 length")); } // length bytes. length -= 2; if length > IPTC_PREFIX.len() && decoder.stream.peek_at(0, IPTC_PREFIX.len())? == IPTC_PREFIX { // skip bytes we read above.
decoder.stream.skip(IPTC_PREFIX.len())?; length -= IPTC_PREFIX.len(); let iptc_bytes = decoder.stream.peek_at(0, length)?.to_vec(); decoder.info.iptc_data = Some(iptc_bytes); } decoder.stream.skip(length)?; Ok(()) } /// Parse Adobe App14 segment pub(crate) fn parse_app14( decoder: &mut JpegDecoder ) -> Result<(), DecodeErrors> { // skip length let mut length = usize::from(decoder.stream.get_u16_be()); if length < 2 { return Err(DecodeErrors::FormatStatic("Too small APP14 length")); } if decoder.stream.peek_at(0, 5)? == b"Adobe" { if length < 14 { return Err(DecodeErrors::FormatStatic( "Too short of a length for App14 segment" )); } // move stream 6 bytes to remove adobe id decoder.stream.skip(6)?; // skip version, flags0 and flags1 decoder.stream.skip(5)?; // get color transform let transform = decoder.stream.read_u8(); // https://exiftool.org/TagNames/JPEG.html#Adobe match transform { 0 => decoder.input_colorspace = ColorSpace::CMYK, 1 => decoder.input_colorspace = ColorSpace::YCbCr, 2 => decoder.input_colorspace = ColorSpace::YCCK, _ => { return Err(DecodeErrors::Format(format!( "Unknown Adobe colorspace {transform}" ))) } } // length = 2 // adobe id = 6 // version = 5 // transform = 1 length = length.saturating_sub(14); } else { warn!("Not a valid Adobe APP14 Segment, skipping {} bytes", length); length = length.saturating_sub(2); } // skip any proceeding lengths. // we do not need them decoder.stream.skip(length)?; Ok(()) } /// Parse the APP1 segment /// /// This contains the exif tag pub(crate) fn parse_app1( decoder: &mut JpegDecoder ) -> Result<(), DecodeErrors> { const XMP_NAMESPACE_PREFIX: &[u8] = b"http://ns.adobe.com/xap/1.0/\0"; // contains exif data let mut length = usize::from(decoder.stream.get_u16_be()); if length < 2 { return Err(DecodeErrors::FormatStatic("Too small app1 length")); } // length bytes length -= 2; if length > 6 && decoder.stream.peek_at(0, 6)? 
== b"Exif\x00\x00" { trace!("Exif segment present"); // skip bytes we read above decoder.stream.skip(6)?; length -= 6; let exif_bytes = decoder.stream.peek_at(0, length)?.to_vec(); decoder.info.exif_data = Some(exif_bytes); } else if length > XMP_NAMESPACE_PREFIX.len() && decoder.stream.peek_at(0, XMP_NAMESPACE_PREFIX.len())? == XMP_NAMESPACE_PREFIX { trace!("XMP Data Present"); decoder.stream.skip(XMP_NAMESPACE_PREFIX.len())?; length -= XMP_NAMESPACE_PREFIX.len(); let xmp_data = decoder.stream.peek_at(0, length)?.to_vec(); decoder.info.xmp_data = Some(xmp_data); } else { warn!("Unknown format for APP1 tag, skipping"); } decoder.stream.skip(length)?; Ok(()) } pub(crate) fn parse_app2( decoder: &mut JpegDecoder ) -> Result<(), DecodeErrors> { static HDR_META: &[u8] = b"urn:iso:std:iso:ts:21496:-1\0"; static MPF_DATA: &[u8] = b"MPF\0"; let mut length = usize::from(decoder.stream.get_u16_be()); if length < 2 { return Err(DecodeErrors::FormatStatic("Too small app2 segment")); } // length bytes length -= 2; if length > 14 && decoder.stream.peek_at(0, 12)? == *b"ICC_PROFILE\0" { trace!("ICC Profile present"); // skip 12 bytes which indicate ICC profile length -= 12; decoder.stream.skip(12)?; let seq_no = decoder.stream.read_u8(); let num_markers = decoder.stream.read_u8(); // deduct the two bytes we read above length -= 2; let data = decoder.stream.peek_at(0, length)?.to_vec(); let icc_chunk = ICCChunk { seq_no, num_markers, data }; decoder.icc_data.push(icc_chunk); } else if length > HDR_META.len() && decoder.stream.peek_at(0, HDR_META.len())? 
== HDR_META { length = length.saturating_sub(HDR_META.len()); decoder.stream.skip(HDR_META.len())?; trace!("Gain Map metadata found"); match length { 4 => { // If gain map metadata length == 4 then here it variables // https://github.com/google/libultrahdr/blob/bf2aa439eea9ad5da483003fa44182f990f74091/lib/src/jpegr.cpp#L1076C1-L1077C35 // 2 bytes minimum_version: (00 00) // 2 bytes writer_version: (00 00) // Perhaps nothing to do with it ? let _ = decoder.stream.get_u16_be(); let _ = decoder.stream.get_u16_be(); length -= 4; decoder .info .gain_map_info .push(GainMapInfo { data: Vec::new() }); } n if n > 4 => { // If there is perhaps useful gain map info // we'll read this until end // https://github.com/google/libultrahdr/blob/bf2aa439eea9ad5da483003fa44182f990f74091/lib/src/jpegr.cpp#L1323 let data = decoder.stream.peek_at(0, length)?.to_vec(); length -= data.len(); decoder.stream.skip(data.len())?; decoder.info.gain_map_info.push(GainMapInfo { data }); } _ => {} } } else if length > MPF_DATA.len() && decoder.stream.peek_at(0, MPF_DATA.len())? 
== MPF_DATA { trace!("MPF Signature present"); length = length.saturating_sub(MPF_DATA.len()); decoder.stream.skip(MPF_DATA.len())?; // MPF signature taken from here // https://github.com/google/libultrahdr/blob/bf2aa439eea9ad5da483003fa44182f990f74091/lib/include/ultrahdr/multipictureformat.h#L50 // https://github.com/google/libultrahdr/blob/bf2aa439eea9ad5da483003fa44182f990f74091/lib/src/multipictureformat.cpp#L36 // More info https://www.cipa.jp/std/documents/e/DC-X007-KEY_E.pdf let data = decoder.stream.peek_at(0, length)?.to_vec(); length -= data.len(); decoder.stream.skip(data.len())?; decoder.info.multi_picture_information = Some(data); } decoder.stream.skip(length)?; Ok(()) } /// Small utility function to print Un-zig-zagged quantization tables fn un_zig_zag(a: &[T]) -> [i32; 64] where T: Default + Copy, i32: core::convert::From { let mut output = [i32::default(); 64]; for i in 0..64 { output[UN_ZIGZAG[i]] = i32::from(a[i]); } output } zune-jpeg-0.5.11/src/huffman.rs000064400000000000000000000215111046102023000143750ustar 00000000000000/* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! This file contains a single struct `HuffmanTable` that //! stores Huffman tables needed during `BitStream` decoding. #![allow(clippy::similar_names, clippy::module_name_repetitions)] use alloc::string::ToString; use crate::errors::DecodeErrors; /// Determines how many bits of lookahead we have for our bitstream decoder. 
pub const HUFF_LOOKAHEAD: u8 = 9; /// A struct which contains necessary tables for decoding a JPEG /// huffman encoded bitstream pub struct HuffmanTable { // element `[0]` of each array is unused /// largest code of length k pub(crate) maxcode: [i32; 18], /// offset for codes of length k /// Answers the question, where do code-lengths of length k end /// Element 0 is unused pub(crate) offset: [i32; 18], /// lookup table for fast decoding /// /// top bits above HUFF_LOOKAHEAD contain the code length. /// /// Lower (8) bits contain the symbol in order of increasing code length. pub(crate) lookup: [i32; 1 << HUFF_LOOKAHEAD], /// A table which can be used to decode small AC coefficients and /// do an equivalent of receive_extend pub(crate) ac_lookup: Option<[i16; 1 << HUFF_LOOKAHEAD]>, /// Directly represent contents of a JPEG DHT marker /// /// \# number of symbols with codes of length `k` bits // bits[0] is unused /// Symbols in order of increasing code length pub(crate) values: [u8; 256] } impl HuffmanTable { pub fn new( codes: &[u8; 17], values: [u8; 256], is_dc: bool, is_progressive: bool ) -> Result { let too_long_code = (i32::from(HUFF_LOOKAHEAD) + 1) << HUFF_LOOKAHEAD; let mut p = HuffmanTable { maxcode: [0; 18], offset: [0; 18], lookup: [too_long_code; 1 << HUFF_LOOKAHEAD], values, ac_lookup: None }; p.make_derived_table(is_dc, is_progressive, codes)?; Ok(p) } /// Create a new huffman tables with values that aren't fixed /// used by fill_mjpeg_tables pub fn new_unfilled( codes: &[u8; 17], values: &[u8], is_dc: bool, is_progressive: bool ) -> Result { let mut buf = [0; 256]; buf[..values.len()].copy_from_slice(values); HuffmanTable::new(codes, buf, is_dc, is_progressive) } /// Compute derived values for a Huffman table /// /// This routine performs some validation checks on the table #[allow( clippy::cast_possible_truncation, clippy::cast_possible_wrap, clippy::cast_sign_loss, clippy::too_many_lines, clippy::needless_range_loop )] fn make_derived_table( &mut 
self, is_dc: bool, _is_progressive: bool, bits: &[u8; 17] ) -> Result<(), DecodeErrors> { // build a list of code size let mut huff_size = [0; 257]; // Huffman code lengths let mut huff_code: [u32; 257] = [0; 257]; // figure C.1 make table of Huffman code length for each symbol let mut p = 0; for l in 1..=16 { let mut i = i32::from(bits[l]); // table overrun is checked before ,so we dont need to check while i != 0 { huff_size[p] = l as u8; p += 1; i -= 1; } } huff_size[p] = 0; let num_symbols = p; // Generate the codes themselves // We also validate that the counts represent a legal Huffman code tree let mut code = 0; let mut si = i32::from(huff_size[0]); p = 0; while huff_size[p] != 0 { while i32::from(huff_size[p]) == si { huff_code[p] = code; code += 1; p += 1; } // maximum code of length si, pre-shifted by 16-k bits self.maxcode[si as usize] = (code << (16 - si)) as i32; // code is now 1 more than the last code used for code-length si; but // it must still fit in si bits, since no code is allowed to be all ones. if (code as i32) >= (1 << si) { return Err(DecodeErrors::HuffmanDecode("Bad Huffman Table".to_string())); } code <<= 1; si += 1; } // Figure F.15 generate decoding tables for bit-sequential decoding p = 0; for l in 0..=16 { if bits[l] == 0 { // -1 if no codes of this length self.maxcode[l] = -1; } else { // offset[l]=codes[index of 1st symbol of code length l // minus minimum code of length l] self.offset[l] = (p as i32) - (huff_code[p]) as i32; p += usize::from(bits[l]); } } self.offset[17] = 0; // we ensure that decode terminates self.maxcode[17] = 0x000F_FFFF; /* * Compute lookahead tables to speed up decoding. * First we set all the table entries to 0(left justified), indicating "too long"; * (Note too long was set during initialization) * then we iterate through the Huffman codes that are short enough and * fill in all the entries that correspond to bit sequences starting * with that code. 
*/ p = 0; for l in 1..=HUFF_LOOKAHEAD { for _ in 1..=i32::from(bits[usize::from(l)]) { // l -> Current code length, // p => Its index in self.code and self.values // Generate left justified code followed by all possible bit sequences let mut look_bits = (huff_code[p] as usize) << (HUFF_LOOKAHEAD - l); for _ in 0..1 << (HUFF_LOOKAHEAD - l) { self.lookup[look_bits] = (i32::from(l) << HUFF_LOOKAHEAD) | i32::from(self.values[p]); look_bits += 1; } p += 1; } } // build an ac table that does an equivalent of decode and receive_extend if !is_dc { let mut fast = [255; 1 << HUFF_LOOKAHEAD]; // Iterate over number of symbols for i in 0..num_symbols { // get code size for an item let s = huff_size[i]; if s <= HUFF_LOOKAHEAD { // if it's lower than what we need for our lookup table create the table let c = (huff_code[i] << (HUFF_LOOKAHEAD - s)) as usize; let m = (1 << (HUFF_LOOKAHEAD - s)) as usize; for j in 0..m { fast[c + j] = i as i16; } } } // build a table that decodes both magnitude and value of small ACs in // one go. 
let mut fast_ac = [0; 1 << HUFF_LOOKAHEAD]; for i in 0..(1 << HUFF_LOOKAHEAD) { let fast_v = fast[i]; if fast_v < 255 { // get symbol value from AC table let rs = self.values[fast_v as usize]; // shift by 4 to get run length let run = i16::from((rs >> 4) & 15); // get magnitude bits stored at the lower 3 bits let mag_bits = i16::from(rs & 15); // length of the bit we've read let len = i16::from(huff_size[fast_v as usize]); if mag_bits != 0 && (len + mag_bits) <= i16::from(HUFF_LOOKAHEAD) { // magnitude code followed by receive_extend code let mut k = (((i as i16) << len) & ((1 << HUFF_LOOKAHEAD) - 1)) >> (i16::from(HUFF_LOOKAHEAD) - mag_bits); let m = 1 << (mag_bits - 1); if k < m { k += (!0_i16 << mag_bits) + 1; }; // if result is small enough fit into fast ac table if (-128..=127).contains(&k) { fast_ac[i] = (k << 8) + (run << 4) + (len + mag_bits); } } } } self.ac_lookup = Some(fast_ac); } // Validate symbols as being reasonable // For AC tables, we make no check, but accept all byte values 0..255 // For DC tables, we require symbols to be in range 0..15 if is_dc { for i in 0..num_symbols { let sym = self.values[i]; if sym > 15 { return Err(DecodeErrors::HuffmanDecode("Bad Huffman Table".to_string())); } } } Ok(()) } } zune-jpeg-0.5.11/src/idct/avx2.rs000064400000000000000000000321001046102023000145500ustar 00000000000000/* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ #![cfg(any(target_arch = "x86", target_arch = "x86_64"))] //! AVX optimised IDCT. //! //! Okay not thaat optimised. //! //! //! # The implementation //! The implementation is neatly broken down into two operations. //! //! 1. Test for zeroes //! > There is a shortcut method for idct where when all AC values are zero, we can get the answer really quickly. //! by scaling the 1/8th of the DCT coefficient of the block to the whole block and level shifting. //! //! 2. 
If above fails, we proceed to carry out IDCT as a two pass one dimensional algorithm. //! IT does two whole scans where it carries out IDCT on all items //! After each successive scan, data is transposed in register(thank you x86 SIMD powers). and the second //! pass is carried out. //! //! The code is not super optimized, it produces bit identical results with scalar code hence it's //! `mm256_add_epi16` //! and it also has the advantage of making this implementation easy to maintain. #![cfg(feature = "x86")] #![allow(dead_code)] #[cfg(target_arch = "x86")] use core::arch::x86::*; #[cfg(target_arch = "x86_64")] use core::arch::x86_64::*; use crate::unsafe_utils::{transpose, YmmRegister}; const SCALE_BITS: i32 = 512 + 65536 + (128 << 17); // Pack i32 to i16's, // clamp them to be between 0-255 // Undo shuffling // Store back to array macro_rules! permute_store { ($x:tt,$y:tt,$index:tt,$out:tt,$stride:tt) => { let a = _mm256_packs_epi32($x, $y); // Clamp the values after packing, we can clamp more values at once let b = clamp_avx(a); // /Undo shuffling let c = _mm256_permute4x64_epi64(b, shuffle(3, 1, 2, 0)); // store first vector _mm_storeu_si128( ($out) .get_mut($index..$index + 8) .unwrap() .as_mut_ptr() .cast(), _mm256_extractf128_si256::<0>(c), ); $index += $stride; // second vector _mm_storeu_si128( ($out) .get_mut($index..$index + 8) .unwrap() .as_mut_ptr() .cast(), _mm256_extractf128_si256::<1>(c), ); $index += $stride; }; } #[target_feature(enable = "avx2")] #[allow( clippy::too_many_lines, clippy::cast_possible_truncation, clippy::similar_names, clippy::op_ref, unused_assignments, clippy::zero_prefixed_literal )] pub unsafe fn idct_avx2( in_vector: &mut [i32; 64], out_vector: &mut [i16], stride: usize, ) { let mut pos = 0; // load into registers // // We sign extend i16's to i32's and calculate them with extended precision and // later reduce them to i16's when we are done carrying out IDCT let rw0 = _mm256_loadu_si256(in_vector[00..].as_ptr().cast()); let 
rw1 = _mm256_loadu_si256(in_vector[08..].as_ptr().cast()); let rw2 = _mm256_loadu_si256(in_vector[16..].as_ptr().cast()); let rw3 = _mm256_loadu_si256(in_vector[24..].as_ptr().cast()); let rw4 = _mm256_loadu_si256(in_vector[32..].as_ptr().cast()); let rw5 = _mm256_loadu_si256(in_vector[40..].as_ptr().cast()); let rw6 = _mm256_loadu_si256(in_vector[48..].as_ptr().cast()); let rw7 = _mm256_loadu_si256(in_vector[56..].as_ptr().cast()); // Forward DCT and quantization may cause all the AC terms to be zero, for such // cases we can try to accelerate it // Basically the poop is that whenever the array has 63 zeroes, its idct is // (arr[0]>>3)or (arr[0]/8) propagated to all the elements. // We first test to see if the array contains zero elements and if it does, we go the // short way. // // This reduces IDCT overhead from about 39% to 18 %, almost half // Do another load for the first row, we don't want to check DC value, because // we only care about AC terms let rw8 = _mm256_loadu_si256(in_vector[1..].as_ptr().cast()); let mut bitmap = _mm256_or_si256(rw1, rw2); bitmap = _mm256_or_si256(bitmap, rw3); bitmap = _mm256_or_si256(bitmap, rw4); bitmap = _mm256_or_si256(bitmap, rw5); bitmap = _mm256_or_si256(bitmap, rw6); bitmap = _mm256_or_si256(bitmap, rw7); bitmap = _mm256_or_si256(bitmap, rw8); if _mm256_testz_si256(bitmap, bitmap) == 1 { // AC terms all zero, idct of the block is ( coeff[0] * qt[0] )/8 + 128 (bias) // (and clamped to 255) // Round by adding 0.5 * (1 << 3) and offset by adding (128 << 3) before scaling let coeff = ((in_vector[0] + 4 + 1024) >> 3).clamp(0, 255) as i16; let idct_value = _mm_set1_epi16(coeff); macro_rules! 
store { ($pos:tt,$value:tt) => { // store _mm_storeu_si128( out_vector .get_mut($pos..$pos + 8) .unwrap() .as_mut_ptr() .cast(), $value, ); $pos += stride; }; } store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); return; } let mut row0 = YmmRegister { mm256: rw0 }; let mut row1 = YmmRegister { mm256: rw1 }; let mut row2 = YmmRegister { mm256: rw2 }; let mut row3 = YmmRegister { mm256: rw3 }; let mut row4 = YmmRegister { mm256: rw4 }; let mut row5 = YmmRegister { mm256: rw5 }; let mut row6 = YmmRegister { mm256: rw6 }; let mut row7 = YmmRegister { mm256: rw7 }; macro_rules! dct_pass { ($SCALE_BITS:tt,$scale:tt) => { // There are a lot of ways to do this // but to keep it simple(and beautiful), ill make a direct translation of the // scalar code to also make this code fully transparent(this version and the non // avx one should produce identical code.) 
// even part let p1 = (row2 + row6) * 2217; let mut t2 = p1 + row6 * -7567; let mut t3 = p1 + row2 * 3135; let mut t0 = YmmRegister { mm256: _mm256_slli_epi32((row0 + row4).mm256, 12), }; let mut t1 = YmmRegister { mm256: _mm256_slli_epi32((row0 - row4).mm256, 12), }; let x0 = t0 + t3 + $SCALE_BITS; let x3 = t0 - t3 + $SCALE_BITS; let x1 = t1 + t2 + $SCALE_BITS; let x2 = t1 - t2 + $SCALE_BITS; let p3 = row7 + row3; let p4 = row5 + row1; let p1 = row7 + row1; let p2 = row5 + row3; let p5 = (p3 + p4) * 4816; t0 = row7 * 1223; t1 = row5 * 8410; t2 = row3 * 12586; t3 = row1 * 6149; let p1 = p5 + p1 * -3685; let p2 = p5 + (p2 * -10497); let p3 = p3 * -8034; let p4 = p4 * -1597; t3 += p1 + p4; t2 += p2 + p3; t1 += p2 + p4; t0 += p1 + p3; row0.mm256 = _mm256_srai_epi32((x0 + t3).mm256, $scale); row1.mm256 = _mm256_srai_epi32((x1 + t2).mm256, $scale); row2.mm256 = _mm256_srai_epi32((x2 + t1).mm256, $scale); row3.mm256 = _mm256_srai_epi32((x3 + t0).mm256, $scale); row4.mm256 = _mm256_srai_epi32((x3 - t0).mm256, $scale); row5.mm256 = _mm256_srai_epi32((x2 - t1).mm256, $scale); row6.mm256 = _mm256_srai_epi32((x1 - t2).mm256, $scale); row7.mm256 = _mm256_srai_epi32((x0 - t3).mm256, $scale); }; } // Process rows dct_pass!(512, 10); transpose( &mut row0, &mut row1, &mut row2, &mut row3, &mut row4, &mut row5, &mut row6, &mut row7, ); // process columns dct_pass!(SCALE_BITS, 17); transpose( &mut row0, &mut row1, &mut row2, &mut row3, &mut row4, &mut row5, &mut row6, &mut row7, ); // Pack and write the values back to the array permute_store!((row0.mm256), (row1.mm256), pos, out_vector, stride); permute_store!((row2.mm256), (row3.mm256), pos, out_vector, stride); permute_store!((row4.mm256), (row5.mm256), pos, out_vector, stride); permute_store!((row6.mm256), (row7.mm256), pos, out_vector, stride); } #[target_feature(enable = "avx2")] #[allow( clippy::too_many_lines, clippy::cast_possible_truncation, clippy::similar_names, clippy::op_ref, unused_assignments, 
clippy::zero_prefixed_literal )] pub unsafe fn idct_avx2_4x4( in_vector: &mut [i32; 64], out_vector: &mut [i16], stride: usize, ) { let rw0 = _mm256_loadu_si256(in_vector[00..].as_ptr().cast()); let rw1 = _mm256_loadu_si256(in_vector[08..].as_ptr().cast()); let rw2 = _mm256_loadu_si256(in_vector[16..].as_ptr().cast()); let rw3 = _mm256_loadu_si256(in_vector[24..].as_ptr().cast()); let mut row0 = YmmRegister { mm256: rw0 }; let mut row1 = YmmRegister { mm256: rw1 }; let mut row2 = YmmRegister { mm256: rw2 }; let mut row3 = YmmRegister { mm256: rw3 }; let mut row4 = YmmRegister { mm256: rw0 }; let mut row5 = YmmRegister { mm256: rw0 }; let mut row6 = YmmRegister { mm256: rw0 }; let mut row7 = YmmRegister { mm256: rw0 }; { row0.mm256 = _mm256_slli_epi32(row0.mm256, 12); row0 += 512; let i2 = row2; let p1 = i2 * 2217; let p3 = i2 * 5352; let x0 = row0 + p3; let x1 = row0 + p1; let x2 = row0 - p1; let x3 = row0 - p3; // odd part let i4 = row3; let i3 = row1; let p5 = (i4 + i3) * 4816; let p1 = p5 + i3 * -3685; let p2 = p5 + i4 * -10497; let t3 = p5 + i3 * 867; let t2 = p5 + i4 * -5945; let t1 = p2 + i3 * -1597; let t0 = p1 + i4 * -8034; row0.mm256 = _mm256_srai_epi32((x0 + t3).mm256, 10); row1.mm256 = _mm256_srai_epi32((x1 + t2).mm256, 10); row2.mm256 = _mm256_srai_epi32((x2 + t1).mm256, 10); row3.mm256 = _mm256_srai_epi32((x3 + t0).mm256, 10); row4.mm256 = _mm256_srai_epi32((x3 - t0).mm256, 10); row5.mm256 = _mm256_srai_epi32((x2 - t1).mm256, 10); row6.mm256 = _mm256_srai_epi32((x1 - t2).mm256, 10); row7.mm256 = _mm256_srai_epi32((x0 - t3).mm256, 10); } transpose( &mut row0, &mut row1, &mut row2, &mut row3, &mut row4, &mut row5, &mut row6, &mut row7, ); { let i2 = row2; let i0 = row0; row0.mm256 = _mm256_slli_epi32(i0.mm256, 12); let t0 = row0 + SCALE_BITS; let t2 = i2 * 2217; let t3 = i2 * 5352; // constants scaled things up by 1<<12, plus we had 1<<2 from first // loop, plus horizontal and vertical each scale by sqrt(8) so together // we've got an extra 1<<3, so 
1<<17 total we need to remove. // so we want to round that, which means adding 0.5 * 1<<17, // aka 65536. Also, we'll end up with -128 to 127 that we want // to encode as 0..255 by adding 128, so we'll add that before the shift // Rounding constant is already added into `t0` let x0 = t0 + t3; let x3 = t0 - t3; let x1 = t0 + t2; let x2 = t0 - t2; // odd part let i3 = row3; let i1 = row1; let p5 = (i3 + i1) * 4816; let p1 = p5 + i1 * -3685; let p2 = p5 + i3 * -10497; let t3 = p5 + i1 * 867; let t2 = p5 + i3 * -5945; let t1 = p2 + i1 * -1597; let t0 = p1 + i3 * -8034; row0.mm256 = _mm256_srai_epi32((x0 + t3).mm256, 17); row1.mm256 = _mm256_srai_epi32((x1 + t2).mm256, 17); row2.mm256 = _mm256_srai_epi32((x2 + t1).mm256, 17); row3.mm256 = _mm256_srai_epi32((x3 + t0).mm256, 17); row4.mm256 = _mm256_srai_epi32((x3 - t0).mm256, 17); row5.mm256 = _mm256_srai_epi32((x2 - t1).mm256, 17); row6.mm256 = _mm256_srai_epi32((x1 - t2).mm256, 17); row7.mm256 = _mm256_srai_epi32((x0 - t3).mm256, 17); } transpose( &mut row0, &mut row1, &mut row2, &mut row3, &mut row4, &mut row5, &mut row6, &mut row7, ); let mut pos = 0; // Pack and write the values back to the array permute_store!((row0.mm256), (row1.mm256), pos, out_vector, stride); permute_store!((row2.mm256), (row3.mm256), pos, out_vector, stride); permute_store!((row4.mm256), (row5.mm256), pos, out_vector, stride); permute_store!((row6.mm256), (row7.mm256), pos, out_vector, stride); } #[inline] #[target_feature(enable = "avx2")] unsafe fn clamp_avx(reg: __m256i) -> __m256i { let min_s = _mm256_set1_epi16(0); let max_s = _mm256_set1_epi16(255); let max_v = _mm256_max_epi16(reg, min_s); //max(a,0) let min_v = _mm256_min_epi16(max_v, max_s); //min(max(a,0),255) return min_v; } /// A copy of `_MM_SHUFFLE()` that doesn't require /// a nightly compiler #[inline] const fn shuffle(z: i32, y: i32, x: i32, w: i32) -> i32 { ((z << 6) | (y << 4) | (x << 2) | w) } 
zune-jpeg-0.5.11/src/idct/neon.rs /* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ #![cfg(target_arch = "aarch64")] //! NEON optimised IDCT. //! //! Okay, not that optimised. //! //! //! # The implementation //! The implementation is neatly broken down into two operations. //! //! 1. Test for zeroes //! > There is a shortcut method for IDCT where, when all AC values are zero, we can get the answer really quickly //! by scaling 1/8th of the DC coefficient of the block to the whole block and level shifting. //! //! 2. If the above fails, we proceed to carry out IDCT as a two pass one dimensional algorithm. //! It does two whole scans where it carries out IDCT on all items. //! After each successive scan, data is transposed in registers and the second //! pass is carried out. //! //! The code is not super optimized; it produces bit identical results with the scalar code, //! and it also has the advantage of making this implementation easy to maintain.
#![cfg(feature = "neon")] use core::arch::aarch64::*; use crate::unsafe_utils::{transpose, YmmRegister}; const SCALE_BITS: i32 = 512 + 65536 + (128 << 17); #[inline] #[target_feature(enable = "neon")] unsafe fn pack_16(a: int32x4x2_t) -> int16x8_t { vcombine_s16(vqmovn_s32(a.0), vqmovn_s32(a.1)) } #[inline] #[target_feature(enable = "neon")] unsafe fn condense_bottom_16(a: int32x4x2_t, b: int32x4x2_t) -> int16x8x2_t { int16x8x2_t(pack_16(a), pack_16(b)) } #[target_feature(enable = "neon")] #[allow( clippy::too_many_lines, clippy::cast_possible_truncation, clippy::similar_names, clippy::op_ref, unused_assignments, clippy::zero_prefixed_literal )] pub unsafe fn idct_neon( in_vector: &mut [i32; 64], out_vector: &mut [i16], stride: usize ) { let mut pos = 0; // load into registers // // We sign extend i16's to i32's and calculate them with extended precision and // later reduce them to i16's when we are done carrying out IDCT let mut row0 = YmmRegister::load(in_vector[00..].as_ptr().cast()); let mut row1 = YmmRegister::load(in_vector[08..].as_ptr().cast()); let mut row2 = YmmRegister::load(in_vector[16..].as_ptr().cast()); let mut row3 = YmmRegister::load(in_vector[24..].as_ptr().cast()); let mut row4 = YmmRegister::load(in_vector[32..].as_ptr().cast()); let mut row5 = YmmRegister::load(in_vector[40..].as_ptr().cast()); let mut row6 = YmmRegister::load(in_vector[48..].as_ptr().cast()); let mut row7 = YmmRegister::load(in_vector[56..].as_ptr().cast()); // Forward DCT and quantization may cause all the AC terms to be zero, for such // cases we can try to accelerate it // Basically the poop is that whenever the array has 63 zeroes, its idct is // (arr[0]>>3)or (arr[0]/8) propagated to all the elements. // We first test to see if the array contains zero elements and if it does, we go the // short way. 
// // This reduces IDCT overhead from about 39% to 18 %, almost half // Do another load for the first row, we don't want to check DC value, because // we only care about AC terms // TODO this should be a shift/shuffle, not a likely unaligned load let row8 = YmmRegister::load(in_vector[1..].as_ptr().cast()); let or_tree = (((row1 | row8) | (row2 | row3)) | ((row4 | row5) | (row6 | row7))); if or_tree.all_zero() { // AC terms all zero, idct of the block is ( coeff[0] * qt[0] )/8 + 128 (bias) // (and clamped to 255) // Round by adding 0.5 * (1 << 3) and offset by adding (128 << 3) before scaling let coeff = ((in_vector[0] + 4 + 1024) >> 3).clamp(0, 255) as i16; let idct_value = vdupq_n_s16(coeff); macro_rules! store { ($pos:tt,$value:tt) => { // store vst1q_s16( out_vector .get_mut($pos..$pos + 8) .unwrap() .as_mut_ptr() .cast(), $value ); $pos += stride; }; } store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); store!(pos, idct_value); return; } macro_rules! dct_pass { ($SCALE_BITS:tt,$scale:tt) => { // There are a lot of ways to do this // but to keep it simple(and beautiful), ill make a direct translation of the // scalar code to also make this code fully transparent(this version and the non // avx one should produce identical code.) 
// Compiler does a pretty good job of optimizing add + mul pairs // into multiply-acumulate pairs // even part let p1 = (row2 + row6) * 2217; let mut t2 = p1 + row6 * -7567; let mut t3 = p1 + row2 * 3135; let mut t0 = (row0 + row4).const_shl::<12>(); let mut t1 = (row0 - row4).const_shl::<12>(); let x0 = t0 + t3 + $SCALE_BITS; let x3 = t0 - t3 + $SCALE_BITS; let x1 = t1 + t2 + $SCALE_BITS; let x2 = t1 - t2 + $SCALE_BITS; let p3 = row7 + row3; let p4 = row5 + row1; let p1 = row7 + row1; let p2 = row5 + row3; let p5 = (p3 + p4) * 4816; t0 = row7 * 1223; t1 = row5 * 8410; t2 = row3 * 12586; t3 = row1 * 6149; let p1 = p5 + p1 * -3685; let p2 = p5 + (p2 * -10497); let p3 = p3 * -8034; let p4 = p4 * -1597; t3 += p1 + p4; t2 += p2 + p3; t1 += p2 + p4; t0 += p1 + p3; row0 = (x0 + t3).const_shra::<$scale>(); row1 = (x1 + t2).const_shra::<$scale>(); row2 = (x2 + t1).const_shra::<$scale>(); row3 = (x3 + t0).const_shra::<$scale>(); row4 = (x3 - t0).const_shra::<$scale>(); row5 = (x2 - t1).const_shra::<$scale>(); row6 = (x1 - t2).const_shra::<$scale>(); row7 = (x0 - t3).const_shra::<$scale>(); }; } // Process rows dct_pass!(512, 10); transpose( &mut row0, &mut row1, &mut row2, &mut row3, &mut row4, &mut row5, &mut row6, &mut row7 ); // process columns dct_pass!(SCALE_BITS, 17); transpose( &mut row0, &mut row1, &mut row2, &mut row3, &mut row4, &mut row5, &mut row6, &mut row7 ); // Pack i32 to i16's, // clamp them to be between 0-255 // Undo shuffling // Store back to array // This could potentially be reorganized to take advantage of the multi-register stores macro_rules! 
permute_store { ($x:tt,$y:tt,$index:tt,$out:tt) => { let a = condense_bottom_16($x, $y); // Clamp the values after packing, we can clamp more values at once let b = clamp256_neon(a); // store first vector vst1q_s16( ($out) .get_mut($index..$index + 8) .unwrap() .as_mut_ptr() .cast(), b.0 ); $index += stride; // second vector vst1q_s16( ($out) .get_mut($index..$index + 8) .unwrap() .as_mut_ptr() .cast(), b.1 ); $index += stride; }; } // Pack and write the values back to the array permute_store!((row0.mm256), (row1.mm256), pos, out_vector); permute_store!((row2.mm256), (row3.mm256), pos, out_vector); permute_store!((row4.mm256), (row5.mm256), pos, out_vector); permute_store!((row6.mm256), (row7.mm256), pos, out_vector); } #[inline] #[target_feature(enable = "neon")] unsafe fn clamp_neon(reg: int16x8_t) -> int16x8_t { let min_s = vdupq_n_s16(0); let max_s = vdupq_n_s16(255); let max_v = vmaxq_s16(reg, min_s); //max(a,0) let min_v = vminq_s16(max_v, max_s); //min(max(a,0),255) min_v } #[inline] #[target_feature(enable = "neon")] unsafe fn clamp256_neon(reg: int16x8x2_t) -> int16x8x2_t { int16x8x2_t(clamp_neon(reg.0), clamp_neon(reg.1)) } #[cfg(test)] mod test { use super::*; #[test] fn test_neon_clamp_256() { unsafe { let vals: [i16; 16] = [-1, -2, -3, 4, 256, 257, 258, 240, -1, 290, 2, 3, 4, 5, 6, 7]; let loaded = vld1q_s16_x2(vals.as_ptr().cast()); let shuffled = clamp256_neon(loaded); let mut result: [i16; 16] = [0; 16]; vst1q_s16_x2(result.as_mut_ptr().cast(), shuffled); assert_eq!( result, [0, 0, 0, 4, 255, 255, 255, 240, 0, 255, 2, 3, 4, 5, 6, 7] ) } } } zune-jpeg-0.5.11/src/idct/scalar.rs000064400000000000000000000170041046102023000151430ustar 00000000000000/* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! Platform independent IDCT algorithm //! //! Not as fast as AVX one. 
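The `SCALE_BITS` constant that follows packs the three terms described in the comments of the AVX2 4x4 routine above: the second-pass rounding term (0.5 scaled by 1 << 17, i.e. 65536), the +128 level shift pre-scaled by 1 << 17, and a 512 bias that appears to mirror the first-pass rounding term. A quick sanity sketch of that arithmetic (the decomposition into named terms is my reading of the comments, not the crate's API):

```rust
/// Recompute SCALE_BITS = 512 + 65536 + (128 << 17) from its parts.
fn scale_bits() -> i32 {
    // Rounding for the second pass: add 0.5 before the arithmetic shift by 17
    let round = 1 << 16; // 65536
    // Level shift: +128 applied before the >> 17, so pre-scaled by 1 << 17
    let level_shift = 128 << 17;
    // Bias matching the 512 rounding term of the first (row) pass
    let pass1_bias = 512;
    round + level_shift + pass1_bias
}

fn main() {
    assert_eq!(scale_bits(), 512 + 65536 + (128 << 17));
    assert_eq!(scale_bits(), 16_843_264);
    println!("SCALE_BITS = {}", scale_bits());
}
```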
const SCALE_BITS: i32 = 512 + 65536 + (128 << 17); #[inline(always)] fn wa(a: i32, b: i32) -> i32 { a.wrapping_add(b) } #[inline(always)] fn ws(a: i32, b: i32) -> i32 { a.wrapping_sub(b) } #[inline(always)] fn wm(a: i32, b: i32) -> i32 { a.wrapping_mul(b) } #[inline] pub fn idct_int_1x1(in_vector: &mut [i32; 64], mut out_vector: &mut [i16], stride: usize) { let coeff = ((wa(wa(in_vector[0], 4), 1024) >> 3).clamp(0, 255)) as i16; out_vector[..8].fill(coeff); for _ in 0..7 { out_vector = &mut out_vector[stride..]; out_vector[..8].fill(coeff); } } #[allow(unused_assignments)] #[allow( clippy::too_many_lines, clippy::op_ref, clippy::cast_possible_truncation )] pub fn idct_int(in_vector: &mut [i32; 64], out_vector: &mut [i16], stride: usize) { let mut pos = 0; let mut i = 0; if &in_vector[1..] == &[0_i32; 63] { return idct_int_1x1(in_vector, out_vector, stride); } // vertical pass for ptr in 0..8 { let p2 = in_vector[ptr + 16]; let p3 = in_vector[ptr + 48]; let p1 = wm(wa(p2, p3), 2217); let t2 = wa(p1, wm(p3, -7567)); let t3 = wa(p1, wm(p2, 3135)); let p2 = in_vector[ptr]; let p3 = in_vector[32 + ptr]; let t0 = fsh(wa(p2, p3)); let t1 = fsh(ws(p2, p3)); let x0 = wa(wa(t0, t3), 512); let x3 = wa(ws(t0, t3), 512); let x1 = wa(wa(t1, t2), 512); let x2 = wa(ws(t1, t2), 512); let mut t0 = in_vector[ptr + 56]; let mut t1 = in_vector[ptr + 40]; let mut t2 = in_vector[ptr + 24]; let mut t3 = in_vector[ptr + 8]; let p3 = wa(t0, t2); let p4 = wa(t1, t3); let p1 = wa(t0, t3); let p2 = wa(t1, t2); let p5 = wm(wa(p3, p4), 4816); t0 = wm(t0, 1223); t1 = wm(t1, 8410); t2 = wm(t2, 12586); t3 = wm(t3, 6149); let p1 = wa(p5, wm(p1, -3685)); let p2 = wa(p5, wm(p2, -10497)); let p3 = wm(p3, -8034); let p4 = wm(p4, -1597); t3 = wa(t3, wa(p1, p4)); t2 = wa(t2, wa(p2, p3)); t1 = wa(t1, wa(p2, p4)); t0 = wa(t0, wa(p1, p3)); in_vector[ptr] = ws(wa(x0, t3), 0) >> 10; in_vector[ptr + 8] = ws(wa(x1, t2), 0) >> 10; in_vector[ptr + 16] = ws(wa(x2, t1), 0) >> 10; in_vector[ptr + 24] = ws(wa(x3, t0), 
0) >> 10; in_vector[ptr + 32] = ws(ws(x3, t0), 0) >> 10; in_vector[ptr + 40] = ws(ws(x2, t1), 0) >> 10; in_vector[ptr + 48] = ws(ws(x1, t2), 0) >> 10; in_vector[ptr + 56] = ws(ws(x0, t3), 0) >> 10; } // horizontal pass while i < 64 { let p2 = in_vector[i + 2]; let p3 = in_vector[i + 6]; let p1 = wm(wa(p2, p3), 2217); let t2 = wa(p1, wm(p3, -7567)); let t3 = wa(p1, wm(p2, 3135)); let p2 = in_vector[i]; let p3 = in_vector[i + 4]; let t0 = fsh(wa(p2, p3)); let t1 = fsh(ws(p2, p3)); let x0 = wa(wa(t0, t3), SCALE_BITS); let x3 = wa(ws(t0, t3), SCALE_BITS); let x1 = wa(wa(t1, t2), SCALE_BITS); let x2 = wa(ws(t1, t2), SCALE_BITS); let mut t0 = in_vector[i + 7]; let mut t1 = in_vector[i + 5]; let mut t2 = in_vector[i + 3]; let mut t3 = in_vector[i + 1]; let p3 = wa(t0, t2); let p4 = wa(t1, t3); let p1 = wa(t0, t3); let p2 = wa(t1, t2); let p5 = wm(wa(p3, p4), f2f(1.175875602)); t0 = wm(t0, 1223); t1 = wm(t1, 8410); t2 = wm(t2, 12586); t3 = wm(t3, 6149); let p1 = wa(p5, wm(p1, -3685)); let p2 = wa(p5, wm(p2, -10497)); let p3 = wm(p3, -8034); let p4 = wm(p4, -1597); t3 = wa(t3, wa(p1, p4)); t2 = wa(t2, wa(p2, p3)); t1 = wa(t1, wa(p2, p4)); t0 = wa(t0, wa(p1, p3)); let out: &mut [i16; 8] = out_vector .get_mut(pos..pos + 8) .unwrap() .try_into() .unwrap(); out[0] = clamp(wa(x0, t3) >> 17); out[1] = clamp(wa(x1, t2) >> 17); out[2] = clamp(wa(x2, t1) >> 17); out[3] = clamp(wa(x3, t0) >> 17); out[4] = clamp(ws(x3, t0) >> 17); out[5] = clamp(ws(x2, t1) >> 17); out[6] = clamp(ws(x1, t2) >> 17); out[7] = clamp(ws(x0, t3) >> 17); i += 8; pos += stride; } } #[inline] #[allow(clippy::cast_possible_truncation)] /// Multiply a number by 4096 fn f2f(x: f32) -> i32 { (x * 4096.0 + 0.5) as i32 } #[inline] /// Multiply a number by 4096 fn fsh(x: i32) -> i32 { x << 12 } /// Clamp values between 0 and 255 #[inline] #[allow(clippy::cast_possible_truncation)] fn clamp(a: i32) -> i16 { a.clamp(0, 255) as i16 } /// IDCT assuming only the upper 4x4 is filled. 
pub fn idct4x4(in_vector: &mut [i32; 64], out_vector: &mut [i16], stride: usize) { let mut pos = 0; // vertical pass for ptr in 0..4 { let i0 = wa(fsh(in_vector[ptr]), 512); let i2 = in_vector[ptr + 16]; let p1 = wm(i2, 2217); let p3 = wm(i2, 5352); let x0 = wa(i0, p3); let x1 = wa(i0, p1); let x2 = ws(i0, p1); let x3 = ws(i0, p3); // odd part let i4 = in_vector[ptr + 24]; let i3 = in_vector[ptr + 8]; let p5 = wm(wa(i4, i3), 4816); let p1 = wa(p5, wm(i3, -3685)); let p2 = wa(p5, wm(i4, -10497)); let t3 = wa(p5, wm(i3, 867)); let t2 = wa(p5, wm(i4, -5945)); let t1 = wa(p2, wm(i3, -1597)); let t0 = wa(p1, wm(i4, -8034)); in_vector[ptr] = wa(x0, t3) >> 10; in_vector[ptr + 8] = wa(x1, t2) >> 10; in_vector[ptr + 16] = wa(x2, t1) >> 10; in_vector[ptr + 24] = wa(x3, t0) >> 10; in_vector[ptr + 32] = ws(x3, t0) >> 10; in_vector[ptr + 40] = ws(x2, t1) >> 10; in_vector[ptr + 48] = ws(x1, t2) >> 10; in_vector[ptr + 56] = ws(x0, t3) >> 10; } // horizontal pass for i in (0..8).map(|i| 8 * i) { let i2 = in_vector[i + 2]; let i0 = in_vector[i]; let t0 = wa(fsh(i0), SCALE_BITS); let t2 = wm(i2, 2217); let t3 = wm(i2, 5352); let x0 = wa(t0, t3); let x3 = ws(t0, t3); let x1 = wa(t0, t2); let x2 = ws(t0, t2); // odd part let i3 = in_vector[i + 3]; let i1 = in_vector[i + 1]; let p5 = wm(wa(i3, i1), f2f(1.175875602)); let p1 = wa(p5, wm(i1, -3685)); let p2 = wa(p5, wm(i3, -10497)); let t3 = wa(p5, wm(i1, 867)); let t2 = wa(p5, wm(i3, -5945)); let t1 = wa(p2, wm(i1, -1597)); let t0 = wa(p1, wm(i3, -8034)); let out: &mut [i16; 8] = out_vector .get_mut(pos..pos + 8) .unwrap() .try_into() .unwrap(); out.copy_from_slice(&[ clamp(wa(x0, t3) >> 17), clamp(wa(x1, t2) >> 17), clamp(wa(x2, t1) >> 17), clamp(wa(x3, t0) >> 17), clamp(ws(x3, t0) >> 17), clamp(ws(x2, t1) >> 17), clamp(ws(x1, t2) >> 17), clamp(ws(x0, t3) >> 17), ]); pos += stride; } in_vector[32..36].fill(0); in_vector[40..44].fill(0); in_vector[48..52].fill(0); in_vector[56..60].fill(0); } 
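The scalar module above derives its integer multipliers from 12-bit fixed point via `f2f` (multiply by 4096, round). A sketch verifying that the magic numbers used throughout (2217, 3135, 4816, -7567, ...) are exactly what `f2f` yields for the underlying floating point constants — note only 1.175875602 appears literally in the source; the other float values here are the classic AAN/libjpeg ones and are an assumption:

```rust
/// Same helper as `f2f` in scalar.rs: encode a float in 12-bit fixed point.
fn f2f(x: f32) -> i32 {
    (x * 4096.0 + 0.5) as i32
}

fn main() {
    // This one appears literally in the source as f2f(1.175875602) == wm(.., 4816)
    assert_eq!(f2f(1.175875602), 4816);
    // The remaining constants match classic AAN/libjpeg values (assumed)
    assert_eq!(f2f(0.541196100), 2217);
    assert_eq!(f2f(0.765366865), 3135);
    assert_eq!(f2f(-1.847759065), -7567);
    println!("fixed point constants check out");
}
```

Because the intermediate products stay in `i32` with wrapping ops, both passes can shift the 12-bit scale back out (`>> 10` then `>> 17`) without losing the rounding behaviour the constants were designed for.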
zune-jpeg-0.5.11/src/idct.rs /* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! Routines for IDCT //! //! Essentially we provide 2 routines for IDCT, a scalar implementation and a not super optimized //! AVX2 one; I'll talk about them here. //! //! There are 2 reasons why we have the AVX one: //! 1. No one compiles with -C target-features=avx2, hence binaries probably won't take advantage of it (even //! if the CPU supports it). //! 2. AVX employs a zero short circuit in a way the scalar code cannot. //! - AVX does this by checking for MCUs whose 63 AC coefficients are all zero; if true, it writes //! values directly, and if false, it goes the long way of calculating. //! - Although this can be trivially implemented in the scalar version, it generates code //! I'm not happy with (a scalar version that basically loops, and that is too many branches for me). //! The AVX one does a better job by using bitwise ors (`_mm256_or_si256`), which is orders of magnitude faster //! than anything I could come up with. //! //! The AVX code also has some cool transpose_u16 instructions which look complicated enough to be cool //! (spoiler alert, I barely understand how they work; that's why I credited the owner). //! 
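`choose_idct_func` below resolves the kernel once, at decoder setup, and hands back a plain function pointer (`IDCTPtr`), so the per-block hot loop pays no feature-detection cost. A minimal sketch of the same dispatch pattern with stub kernels (the stubs and the `use_simd` flag are illustrative, not the crate's API):

```rust
// Mirrors IDCTPtr's shape: coefficients in, samples out, row stride.
type IdctFn = fn(&mut [i32; 64], &mut [i16], usize);

fn idct_scalar(_coeffs: &mut [i32; 64], out: &mut [i16], _stride: usize) {
    out[0] = 1; // stand-in for the real scalar kernel
}

fn idct_simd(_coeffs: &mut [i32; 64], out: &mut [i16], _stride: usize) {
    out[0] = 2; // stand-in for the vectorised kernel
}

/// Pick the kernel once; callers then invoke it with zero branching.
fn choose(use_simd: bool) -> IdctFn {
    if use_simd { idct_simd } else { idct_scalar }
}

/// Run the chosen kernel on a zeroed block and report the marker it wrote.
fn run(use_simd: bool) -> i16 {
    let mut coeffs = [0i32; 64];
    let mut out = [0i16; 64];
    choose(use_simd)(&mut coeffs, &mut out, 8);
    out[0]
}

fn main() {
    assert_eq!(run(false), 1);
    assert_eq!(run(true), 2);
    println!("dispatch ok");
}
```

The same shape is why the `unsafe` AVX2/NEON entry points can be wrapped in a safe closure: the safety argument (feature detected at selection time) is made once, where the pointer is chosen.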
#![allow( clippy::excessive_precision, clippy::unreadable_literal, clippy::module_name_repetitions, unused_parens, clippy::wildcard_imports )] use zune_core::log::debug; use zune_core::options::DecoderOptions; use crate::decoder::IDCTPtr; use crate::idct::scalar::{idct_int, idct_int_1x1}; #[cfg(feature = "x86")] pub mod avx2; #[cfg(feature = "neon")] pub mod neon; pub mod scalar; /// Choose an appropriate IDCT function #[allow(unused_variables)] pub fn choose_idct_func(options: &DecoderOptions) -> IDCTPtr { #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[cfg(feature = "x86")] { if options.use_avx2() { debug!("Using vector integer IDCT"); return |a: &mut [i32; 64], b: &mut [i16], c: usize| { // SAFETY: `options.use_avx2()` only returns true if avx2 is supported. unsafe { avx2::idct_avx2(a,b,c) } }; } } #[cfg(target_arch = "aarch64")] #[cfg(feature = "neon")] { if options.use_neon() { debug!("Using vector integer IDCT"); return |a: &mut [i32; 64], b: &mut [i16], c: usize| { // SAFETY: `options.use_neon()` only returns true if neon is supported. unsafe { neon::idct_neon(a,b,c) } }; } } debug!("Using scalar integer IDCT"); // use generic one return idct_int; } /// Choose a function to implement 4x4 IDCT. /// /// These functions get the same input but have an extra contract: Only the first 4x4 block of /// coefficients are non-zero. All other entries are zeroed. /// /// **The callee must uphold that contract on return** pub fn choose_idct_4x4_func(_options: &DecoderOptions) -> IDCTPtr { #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[cfg(feature = "x86")] { if _options.use_avx2() { debug!("Using vector integer IDCT"); return |a: &mut [i32; 64], b: &mut [i16], c: usize| { // SAFETY: `options.use_avx2()` only returns true if avx2 is supported. 
unsafe { avx2::idct_avx2_4x4(a,b,c) } }; } } scalar::idct4x4 } pub fn choose_idct_1x1_func(_: &DecoderOptions) -> IDCTPtr { // These are simple stores, no alternative implementation for now idct_int_1x1 } #[cfg(test)] #[allow(unreachable_code)] #[allow(dead_code)] mod tests { use super::*; #[test] fn idct_test0() { let stride = 8; let mut coeff = [10; 64]; let mut coeff2 = [10; 64]; let mut output_scalar = [0; 64]; let mut output_vector = [0; 64]; let idct_func = choose_idct_func(&DecoderOptions::new_fast()); idct_func(&mut coeff, &mut output_vector, stride); idct_int(&mut coeff2, &mut output_scalar, stride); assert_eq!(output_scalar, output_vector, "IDCT and scalar do not match"); } #[test] fn do_idct_test1() { let stride = 8; let mut coeff = [14; 64]; let mut coeff2 = [14; 64]; let mut output_scalar = [0; 64]; let mut output_vector = [0; 64]; let idct_func = choose_idct_func(&DecoderOptions::new_fast()); idct_func(&mut coeff, &mut output_vector, stride); idct_int(&mut coeff2, &mut output_scalar, stride); assert_eq!(output_scalar, output_vector, "IDCT and scalar do not match"); } #[test] fn do_idct_test2() { let stride = 8; let mut coeff = [0; 64]; coeff[0] = 255; coeff[63] = -256; let mut coeff2 = coeff; let mut output_scalar = [0; 64]; let mut output_vector = [0; 64]; let idct_func = choose_idct_func(&DecoderOptions::new_fast()); idct_func(&mut coeff, &mut output_vector, stride); idct_int(&mut coeff2, &mut output_scalar, stride); assert_eq!(output_scalar, output_vector, "IDCT and scalar do not match"); } #[test] fn do_idct_zeros() { let stride = 8; let mut coeff = [0; 64]; let mut coeff2 = [0; 64]; let mut output_scalar = [0; 64]; let mut output_vector = [0; 64]; let idct_func = choose_idct_func(&DecoderOptions::new_fast()); idct_func(&mut coeff, &mut output_vector, stride); idct_int(&mut coeff2, &mut output_scalar, stride); assert_eq!(output_scalar, output_vector, "IDCT and scalar do not match"); } #[test] fn idct_4x4() { #[rustfmt::skip] const A: [i32; 32] = [ 
-254, -7, 0, 0, 0, 0, 0, 0, 7, 0, -30, 32, 0, 0, 0, 0, 7, 0, -30, 32, 0, 0, 0, 0, 7, 0, -30, 32, 0, 0, 0, 0, ]; let v: Vec<IDCTPtr> = vec![ choose_idct_func(&DecoderOptions::new_safe()), choose_idct_4x4_func(&DecoderOptions::new_safe()), choose_idct_func(&DecoderOptions::new_fast()), choose_idct_4x4_func(&DecoderOptions::new_fast()), ]; let dct_names = vec![ "safe idct", "safe idct 4x4", "fast idct", "fast idct 4x4", ]; let mut color = vec![]; for idct in v { let mut a = [0i32; 64]; a[..32].copy_from_slice(&A); let mut b = [0i16; 64]; idct(&mut a, &mut b, 8); color.push(b); } for (wnd, name) in color.windows(2).zip(&dct_names) { let [a, b] = wnd else { unreachable!() }; assert_eq!(a, b, "{name}"); } } } zune-jpeg-0.5.11/src/lib.rs /* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! This crate provides a library for decoding valid //! ITU-T Rec. T.851 (09/2005) and ITU-T T.81 (JPEG-1) JPEG images. //! //! //! //! # Features //! - SSE and AVX accelerated functions to speed up certain decoding operations //! - FAST and accurate 32 bit IDCT algorithm //! - Fast color convert functions //! - RGBA and RGBX (4-Channel) color conversion functions //! - YCbCr to Luma (Grayscale) conversion. //! //! # Usage //! Add zune-jpeg to the dependencies in the project Cargo.toml //! //! ```toml //! [dependencies] //! zune_jpeg = "0.5" //! ``` //! # Examples //! //! ## Decode a JPEG file with default arguments. //!```no_run //! use std::fs::read; //! use std::io::BufReader; //! use zune_jpeg::JpegDecoder; //! let file_contents = BufReader::new(std::fs::File::open("a_jpeg.file").unwrap()); //! let mut decoder = JpegDecoder::new(file_contents); //! let mut pixels = decoder.decode().unwrap(); //! ``` //! //! ## Migrating from version 0.4 //! //! ### Motivation //! 
zune v0.5 mainly reworks the internal architecture of how we perform I/O. //! Before, the decoder accepted byte slices that represented the whole data as one contiguous buffer, //! but that was not ideal for all use cases, increasing memory use e.g. on massive files that had //! to be read to memory. //! //! v0.5 introduces a new I/O system, with mechanisms to process //! `std::io::Read + std::io::Seek` style data feeds (but which works in no-std), which means... //! //! ### What changes //! //! I/O code that looked like this //! //!```ignore //! use zune_core::colorspace::ColorSpace; //! use zune_jpeg::JpegDecoder; //! // Read file into memory //! let image = std::fs::read("image.jpg").unwrap(); //! // Make a decoder from the slice //! let mut decoder = JpegDecoder::new(&image); //! // decode //! decoder.decode().unwrap(); //! ``` //! //! can now be rewritten in two ways. //! //! 1. File I/O (using BufReader) //! //!```no_run //! use std::io::BufReader; //! use zune_core::colorspace::ColorSpace; //! use zune_jpeg::JpegDecoder; //! //! let image = BufReader::new(std::fs::File::open("image.jpg").unwrap()); //! let mut decoder = JpegDecoder::new(image); //! // decode //! decoder.decode().unwrap(); //! ``` //! //! 2. Reading to memory (but wrapping it in a Cursor-like object) //!```no_run //! use zune_core::bytestream::ZCursor; //! use zune_jpeg::JpegDecoder; //! //! let image_data = std::fs::read("image.jpg").unwrap(); //! // Alternatively, you can use std::io::Cursor, //! // but it is better speed wise to use ZCursor, and it also works in //! // no-std environments //! let mut cursor = ZCursor::new(image_data); //! // use the wrapped item //! let mut decoder = JpegDecoder::new(cursor); //! // decode //! decoder.decode().unwrap(); //! ``` //! //! 3. Anything that implements [ZByteReaderTrait](zune_core::bytestream::traits::ZByteReaderTrait) //! //! ## Decode a JPEG file to RGBA format //! //! - Other (limited) supported formats are BGR and BGRA //! 
//!```no_run //! use zune_core::bytestream::ZCursor; //! use zune_core::colorspace::ColorSpace; //! use zune_core::options::DecoderOptions; //! use zune_jpeg::JpegDecoder; //! //! let mut options = DecoderOptions::default().jpeg_set_out_colorspace(ColorSpace::RGBA); //! //! let mut decoder = JpegDecoder::new_with_options(ZCursor::new(&[]),options); //! let pixels = decoder.decode().unwrap(); //! ``` //! //! ## Decode an image and get its width and height. //!```no_run //! use zune_core::bytestream::ZCursor; //! use zune_jpeg::JpegDecoder; //! //! let mut decoder = JpegDecoder::new(ZCursor::new(&[])); //! decoder.decode_headers().unwrap(); //! let image_info = decoder.info().unwrap(); //! println!("{},{}",image_info.width,image_info.height) //! ``` //! # Crate features. //! This crate tries to be as minimal as possible while being extensible //! enough to handle the complexities arising from parsing different types //! of JPEG images. //! //! Safety is a top concern; that is why we provide both a static way to disable unsafe code //! (disabling the x86 feature) and a dynamic one (using [`DecoderOptions::set_use_unsafe(false)`]). //! Both of these disable platform specific optimizations, which reduces the speed of decompression. //! //! Please do note that careful consideration has been taken to ensure that the unsafe paths //! are only unsafe because they depend on platform specific intrinsics, hence there is no need to disable them. //! //! The crate tries to decode as many images as possible as a best effort, even those violating the standard; //! this means a lot of images may get silent warnings and wrong output, but if you are sure you will be handling //! images that follow the spec, set `ZuneJpegOptions::set_strict` to true. //! 
//![`DecoderOptions::set_use_unsafe(false)`]: https://docs.rs/zune-core/latest/zune_core/options/struct.DecoderOptions.html#method.set_use_unsafe
#![warn(
    clippy::correctness,
    clippy::perf,
    clippy::pedantic,
    clippy::inline_always,
    clippy::missing_errors_doc,
    clippy::panic
)]
#![allow(
    clippy::needless_return,
    clippy::similar_names,
    clippy::inline_always,
    clippy::doc_markdown,
    clippy::module_name_repetitions,
    clippy::missing_panics_doc,
    clippy::missing_errors_doc
)]
// no_std compatibility
#![deny(clippy::std_instead_of_alloc, clippy::alloc_instead_of_core)]
#![cfg_attr(not(any(feature = "x86", feature = "neon")), forbid(unsafe_code))]
#![cfg_attr(not(feature = "std"), no_std)]
#![cfg_attr(feature = "portable_simd", feature(portable_simd))]
#![macro_use]
extern crate alloc;
extern crate core;

pub use zune_core;

pub use crate::decoder::{ImageInfo, JpegDecoder};
pub use crate::marker::Marker;

mod bitstream;
mod color_convert;
mod components;
mod decoder;
pub mod errors;
mod headers;
mod huffman;
#[cfg(not(fuzzing))]
mod idct;
#[cfg(fuzzing)]
pub mod idct;
mod marker;
mod mcu;
mod mcu_prog;
mod misc;
mod unsafe_utils;
mod unsafe_utils_avx2;
mod unsafe_utils_neon;
mod upsampler;
mod worker;

// zune-jpeg-0.5.11/src/marker.rs
/*
 * Copyright (c) 2023.
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */

#![allow(clippy::upper_case_acronyms)]

/// JPEG Markers
///
/// **NOTE** This doesn't cover all markers, just the ones zune-jpeg supports.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Marker {
    /// Start Of Frame markers
    ///
    /// - SOF(0): Baseline DCT (Huffman coding)
    /// - SOF(1): Extended sequential DCT (Huffman coding)
    /// - SOF(2): Progressive DCT (Huffman coding)
    /// - SOF(3): Lossless (sequential) (Huffman coding)
    /// - SOF(5): Differential sequential DCT (Huffman coding)
    /// - SOF(6): Differential progressive DCT (Huffman coding)
    /// - SOF(7): Differential lossless (sequential) (Huffman coding)
    /// - SOF(9): Extended sequential DCT (arithmetic coding)
    /// - SOF(10): Progressive DCT (arithmetic coding)
    /// - SOF(11): Lossless (sequential) (arithmetic coding)
    /// - SOF(13): Differential sequential DCT (arithmetic coding)
    /// - SOF(14): Differential progressive DCT (arithmetic coding)
    /// - SOF(15): Differential lossless (sequential) (arithmetic coding)
    SOF(u8),
    /// Define Huffman table(s)
    DHT,
    /// Define arithmetic coding conditioning(s)
    DAC,
    /// Restart with modulo 8 count `m`
    RST(u8),
    /// Start of image
    SOI,
    /// End of image
    EOI,
    /// Start of scan
    SOS,
    /// Define quantization table(s)
    DQT,
    /// Define number of lines
    DNL,
    /// Define restart interval
    DRI,
    /// Reserved for application segments
    APP(u8),
    /// Comment
    COM,
    /// Unknown markers
    UNKNOWN(u8)
}

impl Marker {
    pub fn from_u8(n: u8) -> Option<Marker> {
        use self::Marker::{APP, COM, DAC, DHT, DNL, DQT, DRI, EOI, RST, SOF, SOI, SOS, UNKNOWN};
        match n {
            0xFE => Some(COM),
            0xC0 => Some(SOF(0)),
            0xC1 => Some(SOF(1)),
            0xC2 => Some(SOF(2)),
            0xC4 => Some(DHT),
            0xCC => Some(DAC),
            0xD0 => Some(RST(0)),
            0xD1 => Some(RST(1)),
            0xD2 => Some(RST(2)),
            0xD3 => Some(RST(3)),
            0xD4 => Some(RST(4)),
            0xD5 => Some(RST(5)),
            0xD6 => Some(RST(6)),
            0xD7 => Some(RST(7)),
            0xD8 => Some(SOI),
            0xD9 => Some(EOI),
            0xDA => Some(SOS),
            0xDB => Some(DQT),
            0xDC => Some(DNL),
            0xDD => Some(DRI),
            0xE0 => Some(APP(0)),
            0xE1 => Some(APP(1)),
            0xE2 => Some(APP(2)),
            0xED => Some(APP(13)),
            0xEE => Some(APP(14)),
            _ => Some(UNKNOWN(n))
        }
    }
}
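// A minimal sanity check of the byte-to-marker mapping above; a sketch assuming the
// crate's usual `cargo test` harness (the module and test names are illustrative,
// not from upstream).
#[cfg(test)]
mod tests {
    use super::Marker;

    #[test]
    fn from_u8_maps_known_and_unknown_bytes() {
        assert_eq!(Marker::from_u8(0xD8), Some(Marker::SOI));
        assert_eq!(Marker::from_u8(0xDA), Some(Marker::SOS));
        assert_eq!(Marker::from_u8(0xC2), Some(Marker::SOF(2)));
        // Bytes without a dedicated match arm map to UNKNOWN rather than None.
        assert_eq!(Marker::from_u8(0x01), Some(Marker::UNKNOWN(0x01)));
    }
}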
// zune-jpeg-0.5.11/src/mcu.rs
/*
 * Copyright (c) 2023.
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */

use alloc::vec::Vec;
use alloc::{format, vec};
use core::cmp::min;

use zune_core::bytestream::ZByteReaderTrait;
use zune_core::colorspace::ColorSpace;
use zune_core::colorspace::ColorSpace::Luma;
use zune_core::log::{error, trace, warn};

use crate::bitstream::BitStream;
use crate::components::SampleRatios;
use crate::decoder::MAX_COMPONENTS;
use crate::errors::DecodeErrors;
use crate::marker::Marker;
use crate::mcu_prog::get_marker;
use crate::misc::{calculate_padded_width, setup_component_params};
use crate::worker::{color_convert, upsample};
use crate::JpegDecoder;

/// The size of a DC block for a MCU.
pub const DCT_BLOCK: usize = 64;

impl<T: ZByteReaderTrait> JpegDecoder<T> {
    /// Check for existence of DC and AC Huffman Tables
    pub(crate) fn check_tables(&self) -> Result<(), DecodeErrors> {
        // check that dc and AC tables exist outside the hot path
        for component in &self.components {
            let _ = &self
                .dc_huffman_tables
                .get(component.dc_huff_table)
                .as_ref()
                .ok_or_else(|| {
                    DecodeErrors::HuffmanDecode(format!(
                        "No Huffman DC table for component {:?} ",
                        component.component_id
                    ))
                })?
                .as_ref()
                .ok_or_else(|| {
                    DecodeErrors::HuffmanDecode(format!(
                        "No DC table for component {:?}",
                        component.component_id
                    ))
                })?;

            let _ = &self
                .ac_huffman_tables
                .get(component.ac_huff_table)
                .as_ref()
                .ok_or_else(|| {
                    DecodeErrors::HuffmanDecode(format!(
                        "No Huffman AC table for component {:?} ",
                        component.component_id
                    ))
                })?
                .as_ref()
                .ok_or_else(|| {
                    DecodeErrors::HuffmanDecode(format!(
                        "No AC table for component {:?}",
                        component.component_id
                    ))
                })?;
        }
        Ok(())
    }

    /// Decode MCUs and carry out post processing.
    ///
    /// This is the main decoder loop for the library, the hot path.
    ///
    /// Because of this, we pull in some very crazy optimization tricks, hence readability
    /// takes a hit here.
    #[allow(
        clippy::similar_names,
        clippy::too_many_lines,
        clippy::cast_possible_truncation
    )]
    #[inline(never)]
    pub(crate) fn decode_mcu_ycbcr_baseline(
        &mut self, pixels: &mut [u8]
    ) -> Result<(), DecodeErrors> {
        setup_component_params(self)?;

        // check dc and AC tables
        self.check_tables()?;

        let (mut mcu_width, mut mcu_height);

        if self.is_interleaved {
            // set upsampling functions
            self.set_upsampling()?;

            mcu_width = self.mcu_x;
            mcu_height = self.mcu_y;
        } else {
            // For non-interleaved images (1x1 subsampling),
            // the number of MCUs is the width (+7 to account for padding) divided by 8.
            mcu_width = ((self.info.width + 7) / 8) as usize;
            mcu_height = ((self.info.height + 7) / 8) as usize;
        }
        if self.is_interleaved
            && self.input_colorspace.num_components() > 1
            && self.options.jpeg_get_out_colorspace().num_components() == 1
            && (self.sub_sample_ratio == SampleRatios::V
                || self.sub_sample_ratio == SampleRatios::HV)
        {
            // For a specific set of images, e.g. interleaved,
            // when converting from YCbCr to grayscale, we need to
            // take into account mcu height since the MCU decoding needs to take
            // it into account for padding purposes and the post processor
            // parses two rows per mcu width.
// // set coeff to be 2 to ensure that we increment two rows // for every mcu processed also mcu_height *= self.v_max; mcu_height /= self.h_max; self.coeff = 2; } if self.input_colorspace == ColorSpace::Luma && self.is_interleaved { warn!("Grayscale image with down-sampled component, resetting component details"); self.reset_params(); mcu_width = ((self.info.width + 7) / 8) as usize; mcu_height = ((self.info.height + 7) / 8) as usize; } let width = usize::from(self.info.width); let padded_width = calculate_padded_width(width, self.sub_sample_ratio); let mut stream = BitStream::new(); let mut tmp = [0_i32; DCT_BLOCK]; let comp_len = self.components.len(); for (pos, comp) in self.components.iter_mut().enumerate() { // Allocate only needed components. // // For special colorspaces i.e YCCK and CMYK, just allocate all of the needed // components. if min( self.options.jpeg_get_out_colorspace().num_components() - 1, pos ) == pos || comp_len == 4 // Special colorspace { // allocate enough space to hold a whole MCU width // this means we should take into account sampling ratios // `*8` is because each MCU spans 8 widths. let len = comp.width_stride * comp.vertical_sample * 8; comp.needed = true; comp.raw_coeff = vec![0; len]; } else { comp.needed = false; } } // If all components are contained in the first scan of MCUs, then we can process into // (upsampled) pixels immediately after each MCU, for convenience we use each row of MCUS. // Otherwise, we must first wait until following SOS provide the remaining components. 
        let all_components_in_first_scan = usize::from(self.num_scans) == self.components.len();
        let mut progressive_mcus: [Vec<i16>; 4] = core::array::from_fn(|_| vec![]);

        if !all_components_in_first_scan {
            for (component, mcu) in self.components.iter().zip(&mut progressive_mcus) {
                let len = mcu_width
                    * component.vertical_sample
                    * component.horizontal_sample
                    * mcu_height
                    * 64;
                *mcu = vec![0; len];
            }
        }

        let mut pixels_written = 0;

        let is_hv = usize::from(self.is_interleaved);
        let upsampler_scratch_size = is_hv * self.components[0].width_stride;
        let mut upsampler_scratch_space = vec![0; upsampler_scratch_size];

        'sos: loop {
            trace!(
                "Baseline decoding of components: {:?}",
                &self.z_order[..usize::from(self.num_scans)]
            );
            trace!("Decoding MCU width: {mcu_width}, height: {mcu_height}");

            for i in 0..mcu_height {
                if stream.overread_by > 0 {
                    pixels.get_mut(pixels_written..).map(|v| v.fill(128));

                    if self.options.strict_mode() {
                        return Err(DecodeErrors::FormatStatic("Premature end of buffer"));
                    };

                    error!("Premature end of buffer");
                    break;
                }

                // decode a whole MCU width,
                // this takes into account interleaved components.
                let terminate = if all_components_in_first_scan {
                    self.decode_mcu_width::<false>(
                        mcu_width,
                        i,
                        &mut tmp,
                        &mut stream,
                        &mut progressive_mcus
                    )?
                } else {
                    /* NB: (cae). This code was added due to the issue at https://github.com/etemesi254/zune-image/issues/277
                     *
                     * There is a particular set of images that interleave the start of scan (SOS) with the MCU,
                     * E.g if it's a three component image, we have SOS->MCU ->SOS->MCU ->SOS->MCU
                     * which presents a problem on decoding, we need to buffer the whole image before continuing since
                     * we won't have a row containing all the component data which will be needed e.g for color conversion.
                     *
                     * The mechanism is that we decode the whole image upfront, which goes against the normal
                     * routine of decoding MCU width, so this requires more memory upfront than the initial routines,
                     * but it is a single image out of the many corpuses that exist, so it's fine.
                     * (image in test-images/jpeg/sos_news.jpeg)
                     * Code contributed by Aurelia Molzer (https://github.com/197g)
                     * */
                    self.decode_mcu_width::<true>(
                        mcu_width,
                        i,
                        &mut tmp,
                        &mut stream,
                        &mut progressive_mcus
                    )?
                };

                // process that width up until it's impossible. This is faster than allocating the
                // full components, which we skipped earlier.
                if all_components_in_first_scan {
                    self.post_process(
                        pixels,
                        i,
                        mcu_height,
                        width,
                        padded_width,
                        &mut pixels_written,
                        &mut upsampler_scratch_space
                    )?;
                }

                match terminate {
                    McuContinuation::Ok => {}
                    McuContinuation::AnotherSos if all_components_in_first_scan => {
                        warn!("More than one SOS despite already having all components");
                        return Ok(());
                    }
                    McuContinuation::AnotherSos => continue 'sos,
                    McuContinuation::InterScanMarker(marker) => {
                        // Handle inter-scan markers (DHT/DQT/etc) uniformly here.
                        // This keeps all marker handling in the outer loop.
                        if self.advance_to_next_sos(marker, &mut stream)? {
                            continue 'sos;
                        } else {
                            // Hit EOI
                            break;
                        }
                    }
                    McuContinuation::Terminate => {
                        warn!("Got terminate signal, will not process further");
                        pixels.get_mut(pixels_written..).map(|v| v.fill(128));
                        return Ok(());
                    }
                }
            }

            // Breaks if we get here, looping only if we have restarted, i.e. found another SOS and
            // continued at `'sos`.
            break;
        }

        if !all_components_in_first_scan {
            self.finish_baseline_decoding(&progressive_mcus, mcu_width, pixels)?;
        }

        // it may happen that some images don't have the whole buffer
        // so we can't panic in case of that
        // assert_eq!(pixels_written, pixels.len());

        // For UHD use cases that tie two images together, separating them with EOI and
        // SOI markers, it may happen that we do not reach this image's end-of-image
        // marker, so this ensures we read the EOI.
        if !stream.seen_eoi {
            let marker = get_marker(&mut self.stream, &mut stream);
            match marker {
                Ok(_m) => {
                    trace!("Found marker {:?}", _m);
                }
                Err(_) => {
                    // ignore error
                }
            }
        }

        trace!("Finished decoding image");

        Ok(())
    }

    /// Process all MCUs when baseline decoding has been processing them component-after-component.
    /// For simplicity this assembles the dequantized blocks in the order that the post-processing
    /// of an interleaved baseline decoding would use.
    #[allow(clippy::too_many_lines)]
    #[allow(clippy::cast_sign_loss)]
    pub(crate) fn finish_baseline_decoding(
        &mut self, block: &[Vec<i16>; MAX_COMPONENTS], _mcu_width: usize, pixels: &mut [u8]
    ) -> Result<(), DecodeErrors> {
        let mcu_height = self.mcu_y;
        // Size of our output image (width * height)
        let is_hv = usize::from(self.is_interleaved);
        let upsampler_scratch_size = is_hv * self.components[0].width_stride;
        let width = usize::from(self.info.width);
        let padded_width = calculate_padded_width(width, self.sub_sample_ratio);
        let mut upsampler_scratch_space = vec![0; upsampler_scratch_size];

        for (pos, comp) in self.components.iter_mut().enumerate() {
            // Mark only needed components for computing output colors.
            if min(
                self.options.jpeg_get_out_colorspace().num_components() - 1,
                pos
            ) == pos
                || self.input_colorspace == ColorSpace::YCCK
                || self.input_colorspace == ColorSpace::CMYK
            {
                comp.needed = true;
            } else {
                comp.needed = false;
            }
        }

        let mut pixels_written = 0;

        // dequantize and idct have been performed, only color convert.
        for i in 0..mcu_height {
            // All the data is already in the right order, we just need to be able to pass it to
            // the post_process & upsample method. That expects all the data to be stored as one
            // row of MCUs in each component's `raw_coeff`.
            'component: for (position, component) in &mut self.components.iter_mut().enumerate() {
                if !component.needed {
                    continue 'component;
                }

                // step is the number of pixels this iteration will be handling,
                // given by the number of MCU rows and the length of the component block.
                // Since the component block contains the whole channel as raw pixels,
                // this evenly divides the pixels into MCU blocks.
                //
                // For interleaved images, this gives us the exact pixels comprising a whole MCU
                // block
                let step = block[position].len() / mcu_height;
                // where we will be reading our pixels from.
                let slice = &block[position][i * step..][..step];

                let temp_channel = &mut component.raw_coeff;

                temp_channel[..step].copy_from_slice(slice);
            }

            // process that whole stripe of MCUs
            self.post_process(
                pixels,
                i,
                mcu_height,
                width,
                padded_width,
                &mut pixels_written,
                &mut upsampler_scratch_space
            )?;
        }

        return Ok(());
    }

    fn decode_mcu_width<const PROGRESSIVE: bool>(
        &mut self, mcu_width: usize, mcu_height: usize, tmp: &mut [i32; 64],
        stream: &mut BitStream, progressive: &mut [Vec<i16>; 4]
    ) -> Result<McuContinuation, DecodeErrors> {
        let is_one_by_one = !self.scan_subsampled;

        // The definition of MCU depends on the sampling factor of the involved scans. When
        // components have different factors then each Minimal-Coding-Unit is the least common
        // multiple such that we have an integer number of blocks from each component. In that
        // case we need an inner loop with a dynamic amount of coefficients per component,
        // whereas otherwise we have exactly one block of coefficients encoded for each
        // component in the bitstream order.
        //
        // We statically specialize on this to improve code generation of the common case a little
        // bit.
        // We could also special-case common sub-sampling cases, but be mindful of code bloat.
        if is_one_by_one {
            self.inner_decode_mcu_width::<false, PROGRESSIVE>(
                mcu_width, mcu_height, tmp, stream, progressive
            )
        } else {
            self.inner_decode_mcu_width::<true, PROGRESSIVE>(
                mcu_width, mcu_height, tmp, stream, progressive
            )
        }
    }

    // Inline-never ensures we do get this function optimized on its own, into two different
    // versions, without the optimizer tripping up over the complexity that comes with the
    // constant folding. And constant folding is quite important for performance here as
    // when `not SAMPLED` then the inner loop has exactly one iteration per component in
    // the scan. The difference was ~1% or a bit more.
    fn inner_decode_mcu_width<const SAMPLED: bool, const PROGRESSIVE: bool>(
        &mut self, mcu_width: usize, mcu_height: usize, tmp: &mut [i32; 64],
        stream: &mut BitStream, progressive: &mut [Vec<i16>; 4]
    ) -> Result<McuContinuation, DecodeErrors> {
        let z_order = self.z_order;
        let z_scans = &z_order[..usize::from(self.num_scans)];

        // How much of the head of `tmp` was written by the last MCU decoding? We only check for
        // two different cases and not all possible outcomes as this is only used to optimize the
        // bytes written in `fill`. Since the clobber happens in UNZIGZAG order we'd be straddling
        // most cache lines anyway even if we did a partial write with the exact length of the
        // coefficient data which was written into `tmp`.
        let mut clobber_more_than_4x4 = true;

        // For non-interleaved scans (PROGRESSIVE=true), each scan contains a single component
        // and we iterate over that component's actual data unit count, not the interleaved MCU
        // width multiplied by sampling factor.
        let scan_du_width = if PROGRESSIVE {
            let k = z_scans[0];
            let comp = &self.components[k];
            // Calculate actual data units for this component: ceil(width / (8 * subsampling_ratio))
            (self.info.width as usize * comp.horizontal_sample + self.h_max * 8 - 1)
                / (self.h_max * 8)
        } else {
            mcu_width
        };

        for j in 0..scan_du_width {
            // iterate over components
            for &k in z_scans {
                // we made this loop body massive due to several different paths that depend on
                // static conditions. Note we (potentially) call into other functions so the
                // compiler will not unroll anything here anyway. The gains from separating
                // differently optimized loop bodies are much greater than a single additional
                // jump here.
                let component = &mut self.components[k];

                let dc_table = self.dc_huffman_tables[component.dc_huff_table % MAX_COMPONENTS]
                    .as_ref()
                    .ok_or(DecodeErrors::FormatStatic("DC table not found"))?;
                let ac_table = self.ac_huffman_tables[component.ac_huff_table % MAX_COMPONENTS]
                    .as_ref()
                    .ok_or(DecodeErrors::FormatStatic("AC table not found"))?;
                let qt_table = &component.quantization_table;

                let channel = if PROGRESSIVE {
                    let offset =
                        mcu_height * component.width_stride * 8 * component.vertical_sample;
                    &mut progressive[k][offset..]
                } else {
                    &mut component.raw_coeff
                };

                let component_samples_needed = component.needed;

                // If the image is interleaved, iterate over scan components;
                // otherwise, if it's non-interleaved, these routines iterate in
                // trivial scanline order (Y, Cb, Cr).
                //
                // Turn the bounds into a compile-time constant for a common special case. This
                // allows the compiler to unroll the loop and then do a bunch of interleaving.
                //
                // For PROGRESSIVE (non-interleaved), we iterate data units directly so
                // h_samp/v_samp loops run exactly once.
let v_step = if SAMPLED && !PROGRESSIVE { 0..component.vertical_sample } else { 0..1 }; for v_samp in v_step { let h_step = if SAMPLED && !PROGRESSIVE { 0..component.horizontal_sample } else { 0..1 }; for h_samp in h_step { let result = if component_samples_needed { // Fill the array with zeroes, decode_mcu_block expects // a zero based array. Clobber is in zig-zag order though. // Writing consecutive entries is basically free in terms // of memory throughput so we opt for a larger power of // two which lets the compiler turn this into a repeated // write of a zeroed vector register, which does not have // any branches, instead of a more difficult pattern where // we attempt to overwrite exactly one coefficient. let clobber_len = if !clobber_more_than_4x4 { 32 } else { 64 }; tmp[..clobber_len].fill(0); stream.decode_mcu_block( &mut self.stream, dc_table, ac_table, qt_table, tmp, &mut component.dc_pred ) } else { // We do not touch tmp so there is no need to reset it. stream.discard_mcu_block(&mut self.stream, dc_table, ac_table) }; // If an error occurs we can either propagate it // as an error or print it and call terminate. // // This allows even corrupt images to render something, // even if its bad, matching browsers. // // See example in https://github.com/etemesi254/zune-image/issues/293 let len = if let Ok(len) = result { len } else { // result.is_err() return if self.options.strict_mode() { Err(result.err().unwrap()) } else { error!("{}", result.err().unwrap()); Ok(McuContinuation::Terminate) }; }; if component_samples_needed { // tmp was only written partially, note that len is in ZigZag order. 
clobber_more_than_4x4 = len > 10; let idct_position = if PROGRESSIVE { // For non-interleaved, j indexes data units directly j * 8 } else { // derived from stb and rewritten for my tastes let c2 = v_samp * 8; let c3 = ((j * component.horizontal_sample) + h_samp) * 8; component.width_stride * c2 + c3 }; let idct_pos = channel.get_mut(idct_position..).unwrap(); if len <= 1 { (self.idct_1x1_func)(tmp, idct_pos, component.width_stride); } else if len <= 10 { (self.idct_4x4_func)(tmp, idct_pos, component.width_stride); } else { // call idct. (self.idct_func)(tmp, idct_pos, component.width_stride); } } } } } self.todo = self.todo.wrapping_sub(1); if self.todo == 0 { self.handle_rst_main(stream)?; continue; } if stream.marker.is_some() && stream.bits_left == 0 { break; } } self.check_stream_marker_after_mcu_width(stream) } fn check_stream_marker_after_mcu_width( &mut self, stream: &mut BitStream ) -> Result { // After all interleaved components, that's an MCU // handle stream markers // // In some corrupt images, it may occur that header markers occur in the stream. // The spec EXPLICITLY FORBIDS this, specifically, in // routine F.2.2.5 it says // `The only valid marker which may occur within the Huffman coded data is the RSTm marker.` // // But libjpeg-turbo allows it because of some weird reason. so I'll also // allow it because of some weird reason. if let Some(m) = stream.marker { if m == Marker::EOI { // acknowledge and ignore EOI marker. stream.marker.take(); trace!("Found EOI marker"); // Google Introduced the Ultra-HD image format which is basically // stitching two images into one container. // They basically separate two images via a EOI and SOI marker // so let's just ensure if we ever see EOI, we never read past that // ever. 
// https://github.com/google/libultrahdr stream.seen_eoi = true; } else if let Marker::RST(_) = m { //debug_assert_eq!(self.todo, 0); if self.todo == 0 { self.handle_rst(stream)?; } } else if let Marker::SOS = m { self.parse_marker_inner(m)?; stream.marker.take(); stream.reset(); trace!("Found SOS marker"); return Ok(McuContinuation::AnotherSos); } else if matches!(m, Marker::DHT | Marker::DQT | Marker::DRI | Marker::COM) || matches!(m, Marker::APP(_)) { // For non-interleaved images, setup markers can appear between scans. // Signal the caller to handle this marker and find the next SOS. // This keeps all marker parsing in the caller's loop. stream.marker.take(); trace!("Found inter-scan marker {:?}", m); return Ok(McuContinuation::InterScanMarker(m)); } else { if self.options.strict_mode() { return Err(DecodeErrors::Format(format!( "Marker {m:?} found where not expected" ))); } error!( "Marker `{:?}` Found within Huffman Stream, possibly corrupt jpeg", m ); self.parse_marker_inner(m)?; stream.marker.take(); stream.reset(); return Ok(McuContinuation::Terminate); } } Ok(McuContinuation::Ok) } /// Scan for the next SOS marker, parsing setup markers along the way. /// /// This is the unified marker scanning function used after encountering an /// inter-scan marker. It handles DHT, DQT, DRI, COM, and APP markers that /// can appear between scans in non-interleaved images. /// /// # Arguments /// * `first_marker` - The first marker that was already detected (not yet parsed) /// * `stream` - The bitstream state /// /// # Returns /// * `Ok(true)` - Found SOS, ready to continue decoding /// * `Ok(false)` - Found EOI, decoding complete /// * `Err(_)` - Error (too many markers, unexpected marker in strict mode, etc.) fn advance_to_next_sos( &mut self, first_marker: Marker, stream: &mut BitStream ) -> Result { // Limit iterations to prevent DoS from malicious files. 
const MAX_INTER_SCAN_MARKERS: usize = 64; // Parse the first marker that triggered this call self.parse_marker_inner(first_marker)?; stream.reset(); for _ in 0..MAX_INTER_SCAN_MARKERS { let marker = get_marker(&mut self.stream, stream)?; match marker { Marker::SOS => { self.parse_marker_inner(Marker::SOS)?; stream.reset(); trace!("Found SOS marker, continuing decode"); return Ok(true); } Marker::EOI => { stream.seen_eoi = true; trace!("Found EOI marker"); return Ok(false); } Marker::DHT | Marker::DQT | Marker::DRI | Marker::COM => { trace!("Parsing inter-scan marker {:?}", marker); self.parse_marker_inner(marker)?; } Marker::APP(_) => { trace!("Parsing inter-scan APP marker {:?}", marker); self.parse_marker_inner(marker)?; } other => { if self.options.strict_mode() { return Err(DecodeErrors::Format(format!( "Unexpected marker {:?} while scanning for SOS between scans", other ))); } // Non-strict: skip unknown marker warn!("Skipping unexpected marker {:?} between scans", other); let length = self.stream.get_u16_be_err()?; if length >= 2 { self.stream.skip((length - 2) as usize)?; } } } } Err(DecodeErrors::FormatStatic( "Too many markers between scans (exceeded limit of 64)" )) } // handle RST markers. // No-op if not using restarts // this routine is shared with mcu_prog #[cold] pub(crate) fn handle_rst(&mut self, stream: &mut BitStream) -> Result<(), DecodeErrors> { self.todo = self.restart_interval; if let Some(marker) = stream.marker { // Found a marker // Read stream and see what marker is stored there match marker { Marker::RST(_) => { // reset stream stream.reset(); // Initialize dc predictions to zero for all components self.components.iter_mut().for_each(|x| x.dc_pred = 0); // Start iterating again. from position. 
} Marker::EOI => { // silent pass } _ => { return Err(DecodeErrors::MCUError(format!( "Marker {marker:?} found in bitstream, possibly corrupt jpeg" ))); } } } Ok(()) } #[allow(clippy::too_many_lines, clippy::too_many_arguments)] pub(crate) fn post_process( &mut self, pixels: &mut [u8], i: usize, mcu_height: usize, width: usize, padded_width: usize, pixels_written: &mut usize, upsampler_scratch_space: &mut [i16] ) -> Result<(), DecodeErrors> { let out_colorspace_components = self.options.jpeg_get_out_colorspace().num_components(); let mut px = *pixels_written; // indicates whether image is vertically up-sampled let is_vertically_sampled = self .components .iter() .any(|c| c.sample_ratio == SampleRatios::HV || c.sample_ratio == SampleRatios::V); let mut comp_len = self.components.len(); // If we are moving from YCbCr -> Luma, we do not allocate storage for other components, so we // will panic when we are trying to read samples, so for that case, // hardcode it so that we don't panic when doing // *samp = &samples[j][pos * padded_width..(pos + 1) * padded_width] if out_colorspace_components < comp_len && self.options.jpeg_get_out_colorspace() == Luma { comp_len = out_colorspace_components; } let mut color_conv_function = |num_iters: usize, samples: [&[i16]; 4]| -> Result<(), DecodeErrors> { for (pos, output) in pixels[px..] 
.chunks_exact_mut(width * out_colorspace_components) .take(num_iters) .enumerate() { let mut raw_samples: [&[i16]; 4] = [&[], &[], &[], &[]]; // iterate over each line, since color-convert needs only // one line for (j, samp) in raw_samples.iter_mut().enumerate().take(comp_len) { let temp = &samples[j].get(pos * padded_width..(pos + 1) * padded_width); if temp.is_none() { return Err(DecodeErrors::FormatStatic("Missing samples")); } *samp = temp.unwrap(); } color_convert( &raw_samples, self.color_convert_16, self.input_colorspace, self.options.jpeg_get_out_colorspace(), output, width, padded_width )?; px += width * out_colorspace_components; } Ok(()) }; let comps = &mut self.components[..]; if self.is_interleaved && self.options.jpeg_get_out_colorspace() != ColorSpace::Luma { for comp in comps.iter_mut() { upsample( comp, mcu_height, i, upsampler_scratch_space, is_vertically_sampled )?; } if is_vertically_sampled { if i > 0 { // write the last line, it wasn't up-sampled as we didn't have row_down // yet let mut samples: [&[i16]; 4] = [&[], &[], &[], &[]]; for (samp, component) in samples.iter_mut().zip(comps.iter()) { *samp = &component.first_row_upsample_dest; } // ensure length matches for all samples let _first_len = samples[0].len(); // This was a good check, but can be caused to panic, esp on invalid/corrupt images. // See one in issue https://github.com/etemesi254/zune-image/issues/262, so for now // we just ignore and generate invalid images at the end. 
// // // for samp in samples.iter().take(comp_len) { // assert_eq!(first_len, samp.len()); // } let num_iters = self.coeff * self.v_max; color_conv_function(num_iters, samples)?; } // After up-sampling the last row, save any row that can be used for // a later up-sampling, // // E.g the Y sample is not sampled but we haven't finished upsampling the last row of // the previous mcu, since we don't have the down row, so save it for component in comps.iter_mut() { if component.sample_ratio != SampleRatios::H { // We don't care about H sampling factors, since it's copied in the workers function // copy last row to be used for the next color conversion let size = component.vertical_sample * component.width_stride * component.sample_ratio.sample(); let last_bytes = component.raw_coeff.rchunks_exact_mut(size).next().unwrap(); component .first_row_upsample_dest .copy_from_slice(last_bytes); } } } let mut samples: [&[i16]; 4] = [&[], &[], &[], &[]]; for (samp, component) in samples.iter_mut().zip(comps.iter()) { *samp = if component.sample_ratio == SampleRatios::None { &component.raw_coeff } else { &component.upsample_dest }; } // we either do 7 or 8 MCU's depending on the state, this only applies to // vertically sampled images // // for rows up until the last MCU, we do not upsample the last stride of the MCU // which means that the number of iterations should take that into account is one less the // up-sampled size // // For the last MCU, we upsample the last stride, meaning that if we hit the last MCU, we // should sample full raw coeffs let is_last_considered = is_vertically_sampled && (i != mcu_height.saturating_sub(1)); let num_iters = (8 - usize::from(is_last_considered)) * self.coeff * self.v_max; color_conv_function(num_iters, samples)?; } else { let mut channels_ref: [&[i16]; MAX_COMPONENTS] = [&[]; MAX_COMPONENTS]; self.components .iter() .enumerate() .for_each(|(pos, x)| channels_ref[pos] = &x.raw_coeff); if let SampleRatios::Generic(_, v) = 
            self.sub_sample_ratio
        {
            color_conv_function(8 * v * self.coeff, channels_ref)?;
        } else {
            color_conv_function(8 * self.coeff, channels_ref)?;
        }
        }

        *pixels_written = px;
        Ok(())
    }
}

enum McuContinuation {
    Ok,
    AnotherSos,
    /// Found an inter-scan marker (DHT/DQT/DRI/COM/APP) that needs handling.
    /// The caller should parse it and scan for the next SOS.
    InterScanMarker(Marker),
    Terminate
}

// zune-jpeg-0.5.11/src/mcu_prog.rs
/*
 * Copyright (c) 2023.
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */

//! Routines for progressive decoding
/*
This file is needlessly complicated.

It is that way to ensure we don't burn memory anyhow.

Memory is a scarce resource in some environments; I would like this to be
viable in such environments.

Half of the complexity comes from the jpeg spec, because progressive decoding
is one hell of a ride.
*/

use alloc::string::ToString;
use alloc::vec::Vec;
use alloc::{format, vec};
use core::cmp::min;

use zune_core::bytestream::{ZByteReaderTrait, ZReader};
use zune_core::colorspace::ColorSpace;
use zune_core::log::{debug, error, warn};

use crate::bitstream::BitStream;
use crate::components::SampleRatios;
use crate::decoder::{JpegDecoder, MAX_COMPONENTS};
use crate::errors::DecodeErrors;
use crate::headers::parse_sos;
use crate::marker::Marker;
use crate::mcu::DCT_BLOCK;
use crate::misc::{calculate_padded_width, setup_component_params};

impl<T: ZByteReaderTrait> JpegDecoder<T> {
    /// Decode a progressive image
    ///
    /// This routine decodes a progressive image, stopping if it finds any error.
#[allow( clippy::needless_range_loop, clippy::cast_sign_loss, clippy::redundant_else, clippy::too_many_lines )] #[inline(never)] pub(crate) fn decode_mcu_ycbcr_progressive( &mut self, pixels: &mut [u8] ) -> Result<(), DecodeErrors> { setup_component_params(self)?; let mut mcu_height; // memory location for decoded pixels for components let mut block: [Vec; MAX_COMPONENTS] = [vec![], vec![], vec![], vec![]]; let mut mcu_width; let mut seen_scans = 1; if self.input_colorspace == ColorSpace::Luma && self.is_interleaved { warn!("Grayscale image with down-sampled component, resetting component details"); self.reset_params(); } if self.is_interleaved { // this helps us catch component errors. self.set_upsampling()?; } if self.is_interleaved { mcu_width = self.mcu_x; mcu_height = self.mcu_y; } else { mcu_width = (self.info.width as usize + 7) / 8; mcu_height = (self.info.height as usize + 7) / 8; } if self.is_interleaved && self.input_colorspace.num_components() > 1 && self.options.jpeg_get_out_colorspace().num_components() == 1 && (self.sub_sample_ratio == SampleRatios::V || self.sub_sample_ratio == SampleRatios::HV) { // For a specific set of images, e.g interleaved, // when converting from YcbCr to grayscale, we need to // take into account mcu height since the MCU decoding needs to take // it into account for padding purposes and the post processor // parses two rows per mcu width. 
// // set coeff to be 2 to ensure that we increment two rows // for every mcu processed also mcu_height *= self.v_max; mcu_height /= self.h_max; self.coeff = 2; } mcu_width *= 64; for i in 0..self.input_colorspace.num_components() { let comp = &self.components[i]; let len = mcu_width * comp.vertical_sample * comp.horizontal_sample * mcu_height; block[i] = vec![0; len]; } let mut stream = BitStream::new_progressive(self.succ_low, self.spec_start, self.spec_end); // there are multiple scans in the stream, this should resolve the first scan let result = self.parse_entropy_coded_data(&mut stream, &mut block); if result.is_err() { return if self.options.strict_mode() { Err(result.err().unwrap()) } else { error!("{}", result.err().unwrap()); // Go process it and return as much as we can, exiting here return self.finish_progressive_decoding(&block, pixels); }; } // extract marker let mut marker = stream .marker .take() .ok_or(DecodeErrors::FormatStatic("Marker missing where expected"))?; // if marker is EOI, we are done, otherwise continue scanning. // // In case we have a premature image, we print a warning or return // an error, depending on the strictness of the decoder, so there // is that logic to handle too 'eoi: while marker != Marker::EOI { match marker { Marker::SOS => { parse_sos(self)?; stream.update_progressive_params( self.succ_high, self.succ_low, self.spec_start, self.spec_end ); // after every SOS, marker, parse data for that scan. let result = self.parse_entropy_coded_data(&mut stream, &mut block); // Do not error out too fast, allows the decoder to continue as much as possible // even after errors if result.is_err() { return if self.options.strict_mode() { Err(result.err().unwrap()) } else { error!("{}", result.err().unwrap()); break 'eoi; }; } // extract marker, might either indicate end of image or we continue // scanning(hence the continue statement to determine). 
match get_marker(&mut self.stream, &mut stream) { Ok(marker_n) => { marker = marker_n; seen_scans += 1; if seen_scans > self.options.jpeg_get_max_scans() { return Err(DecodeErrors::Format(format!( "Too many scans, exceeded limit of {}", self.options.jpeg_get_max_scans() ))); } stream.reset(); continue 'eoi; } Err(msg) => { if self.options.strict_mode() { return Err(msg); } error!("{:?}", msg); break 'eoi; } } } Marker::RST(_n) => { self.handle_rst(&mut stream)?; } _ => { self.parse_marker_inner(marker)?; } } match get_marker(&mut self.stream, &mut stream) { Ok(marker_n) => { marker = marker_n; } Err(e) => { if self.options.strict_mode() { return Err(e); } error!("{}", e); // If we can't get the marker, just break away // allows us to decode some corrupt images // e.g https://github.com/etemesi254/zune-image/issues/294 break 'eoi; } } } self.finish_progressive_decoding(&block, pixels) } /// Reset progressive parameters fn reset_prog_params(&mut self, stream: &mut BitStream) { stream.reset(); self.components.iter_mut().for_each(|x| x.dc_pred = 0); // Also reset JPEG restart intervals self.todo = if self.restart_interval != 0 { self.restart_interval } else { usize::MAX }; } #[allow(clippy::too_many_lines, clippy::cast_sign_loss)] fn parse_entropy_coded_data( &mut self, stream: &mut BitStream, buffer: &mut [Vec; MAX_COMPONENTS] ) -> Result<(), DecodeErrors> { self.reset_prog_params(stream); if usize::from(self.num_scans) > self.input_colorspace.num_components() { return Err(DecodeErrors::Format(format!( "Number of scans {} cannot be greater than number of components, {}", self.num_scans, self.input_colorspace.num_components() ))); } if self.num_scans == 1 { // Safety checks if self.spec_end != 0 && self.spec_start == 0 { return Err(DecodeErrors::FormatStatic( "Can't merge DC and AC corrupt jpeg" )); } // non interleaved data, process one block at a time in trivial scanline order let k = self.z_order[0]; if k >= self.components.len() { return 
Err(DecodeErrors::Format(format!( "Cannot find component {k}, corrupt image" ))); } let (mcu_width, mcu_height); if self.components[k].vertical_sample != 1 || self.components[k].horizontal_sample != 1 || !self.is_interleaved { // For non interleaved scans // mcu's is the image dimensions divided by 8 mcu_width = self.info.width.div_ceil(8) as usize; mcu_height = self.info.height.div_ceil(8) as usize; } else { // For other channels, in an interleaved mcu, number of MCU's // are determined by some weird maths done in headers.rs->parse_sos() mcu_width = self.mcu_x; mcu_height = self.mcu_y; } for i in 0..mcu_height { for j in 0..mcu_width { if self.spec_start != 0 && self.succ_high == 0 && stream.eob_run > 0 { // handle EOB runs here. stream.eob_run -= 1; } else { let start = 64 * (j + i * (self.components[k].width_stride / 8)); let data: &mut [i16; 64] = buffer .get_mut(k) .unwrap() .get_mut(start..start + 64) .ok_or(DecodeErrors::FormatStatic("Slice to Small"))? .try_into() .unwrap(); if self.spec_start == 0 { let pos = self.components[k].dc_huff_table & (MAX_COMPONENTS - 1); let dc_table = self .dc_huffman_tables .get(pos) .ok_or(DecodeErrors::FormatStatic( "No huffman table for DC component" ))? .as_ref() .ok_or(DecodeErrors::FormatStatic( "Huffman table at index {} not initialized" ))?; let dc_pred = &mut self.components[k].dc_pred; if self.succ_high == 0 { // first scan for this mcu stream.decode_prog_dc_first( &mut self.stream, dc_table, &mut data[0], dc_pred )?; } else { // refining scans for this MCU stream.decode_prog_dc_refine(&mut self.stream, &mut data[0])?; } } else { let pos = self.components[k].ac_huff_table; let ac_table = self .ac_huffman_tables .get(pos) .ok_or_else(|| { DecodeErrors::Format(format!( "No huffman table for component:{pos}" )) })? 
.as_ref() .ok_or_else(|| { DecodeErrors::Format(format!( "Huffman table at index {pos} not initialized" )) })?; if self.succ_high == 0 { debug_assert!(stream.eob_run == 0, "EOB run is not zero"); stream.decode_mcu_ac_first(&mut self.stream, ac_table, data)?; } else { // refinement scan stream.decode_mcu_ac_refine(&mut self.stream, ac_table, data)?; } // Check for a marker. // It can appear in stream CC https://github.com/etemesi254/zune-image/issues/300 // if let Some(marker) = stream.marker.take() { // self.parse_marker_inner(marker)?; // } } } // + EOB and investigate effect. self.todo -= 1; self.handle_rst_main(stream)?; } } } else { if self.spec_end != 0 { return Err(DecodeErrors::HuffmanDecode( "Can't merge dc and AC corrupt jpeg".to_string() )); } // process scan n elements in order // Do the error checking with allocs here. // Make the one in the inner loop free of allocations. for k in 0..self.num_scans { let n = self.z_order[k as usize]; if n >= self.components.len() { return Err(DecodeErrors::Format(format!( "Cannot find component {n}, corrupt image" ))); } let component = &mut self.components[n]; let _ = self .dc_huffman_tables .get(component.dc_huff_table) .ok_or_else(|| { DecodeErrors::Format(format!( "No huffman table for component:{}", component.dc_huff_table )) })? .as_ref() .ok_or_else(|| { DecodeErrors::Format(format!( "Huffman table at index {} not initialized", component.dc_huff_table )) })?; } // Interleaved scan // Components shall not be interleaved in progressive mode, except for // the DC coefficients in the first scan for each component of a progressive frame. for i in 0..self.mcu_y { for j in 0..self.mcu_x { // process scan n elements in order for k in 0..self.num_scans { let n = self.z_order[k as usize]; let component = &mut self.components[n]; let huff_table = self .dc_huffman_tables .get(component.dc_huff_table) .ok_or(DecodeErrors::FormatStatic("No huffman table for component"))? 
.as_ref() .ok_or(DecodeErrors::FormatStatic( "Huffman table at index not initialized" ))?; for v_samp in 0..component.vertical_sample { for h_samp in 0..component.horizontal_sample { let x2 = j * component.horizontal_sample + h_samp; let y2 = i * component.vertical_sample + v_samp; let position = 64 * (x2 + y2 * component.width_stride / 8); let buf_n = &mut buffer[n]; let Some(data) = &mut buf_n.get_mut(position) else { // TODO: (CAE), this is another weird sub-sampling bug, so on fix // remove this return Err(DecodeErrors::FormatStatic("Invalid image")); }; if self.succ_high == 0 { stream.decode_prog_dc_first( &mut self.stream, huff_table, data, &mut component.dc_pred )?; } else { stream.decode_prog_dc_refine(&mut self.stream, data)?; } } } } // We want wrapping subtraction here because it means // we get a higher number in the case this underflows self.todo -= 1; // after every scan that's a mcu, count down restart markers. self.handle_rst_main(stream)?; } } } return Ok(()); } pub(crate) fn handle_rst_main(&mut self, stream: &mut BitStream) -> Result<(), DecodeErrors> { if self.todo == 0 { stream.refill(&mut self.stream)?; } if self.todo == 0 && self.restart_interval != 0 && stream.marker.is_none() && !stream.seen_eoi { // if no marker and we are to reset RST, look for the marker, this matches // libjpeg-turbo behaviour and allows us to decode images in // https://github.com/etemesi254/zune-image/issues/261 let _start = self.stream.position()?; // skip bytes until we find marker let marker = get_marker(&mut self.stream, stream); // In some images, the RST marker on the last section may not be available // as it is maybe stopped by an EOI marker, see in the case of https://github.com/etemesi254/zune-image/issues/292 // what happened was that we would go looking for the RST marker exhausting all the data // in the image and this would return an error, so for now // translate it to a warning, but return the image decoded up // until that point if let Ok(marker) = 
marker { let _end = self.stream.position()?; stream.marker = Some(marker); // NB some warnings may be false positives. warn!( "{} Extraneous bytes before marker {:?}", _end - _start, marker ); } else { warn!("RST marker was not found, where expected, image may be garbled") } } if self.todo == 0 { self.handle_rst(stream)? } Ok(()) } #[allow(clippy::too_many_lines)] #[allow(clippy::needless_range_loop, clippy::cast_sign_loss)] fn finish_progressive_decoding( &mut self, block: &[Vec; MAX_COMPONENTS], pixels: &mut [u8] ) -> Result<(), DecodeErrors> { // This function is complicated because we need to replicate // the function in mcu.rs // // The advantage is that we do very little allocation and very lot // channel reusing. // The trick is to notice that we repeat the same procedure per MCU // width. // // So we can set it up that we only allocate temporary storage large enough // to store a single mcu width, then reuse it per invocation. // // This is advantageous to us. // // Remember we need to have the whole MCU buffer so we store 3 unprocessed // channels in memory, and then we allocate the whole output buffer in memory, both of // which are huge. // // let mcu_height = if self.is_interleaved { self.mcu_y } else { // For non-interleaved images( (1*1) subsampling) // number of MCU's are the widths (+7 to account for paddings) divided by 8. self.info.height.div_ceil(8) as usize }; // Size of our output image(width*height) let is_hv = usize::from(self.is_interleaved); let upsampler_scratch_size = is_hv * self.components[0].width_stride; let width = usize::from(self.info.width); let padded_width = calculate_padded_width(width, self.sub_sample_ratio); let mut upsampler_scratch_space = vec![0; upsampler_scratch_size]; let mut tmp = [0_i32; DCT_BLOCK]; for (pos, comp) in self.components.iter_mut().enumerate() { // Allocate only needed components. // // For special colorspaces i.e YCCK and CMYK, just allocate all of the needed // components. 
if min( self.options.jpeg_get_out_colorspace().num_components() - 1, pos ) == pos || self.input_colorspace == ColorSpace::YCCK || self.input_colorspace == ColorSpace::CMYK { // allocate enough space to hold a whole MCU width // this means we should take into account sampling ratios // `*8` is because each MCU spans 8 widths. let len = comp.width_stride * comp.vertical_sample * 8; comp.needed = true; comp.raw_coeff = vec![0; len]; } else { comp.needed = false; } } let mut pixels_written = 0; // dequantize, idct and color convert. for i in 0..mcu_height { 'component: for (position, component) in &mut self.components.iter_mut().enumerate() { if !component.needed { continue 'component; } let qt_table = &component.quantization_table; // step is the number of pixels this iteration wil be handling // Given by the number of mcu's height and the length of the component block // Since the component block contains the whole channel as raw pixels // we this evenly divides the pixels into MCU blocks // // For interleaved images, this gives us the exact pixels comprising a whole MCU // block let step = block[position].len() / mcu_height; // where we will be reading our pixels from. let start = i * step; let slice = &block[position][start..start + step]; let temp_channel = &mut component.raw_coeff; // The next logical step is to iterate width wise. // To figure out how many pixels we iterate by we use effective pixels // Given to us by component.x // iterate per effective pixels. let mcu_x = component.width_stride / 8; // iterate per every vertical sample. for k in 0..component.vertical_sample { for j in 0..mcu_x { // after writing a single stride, we need to skip 8 rows. // This does the row calculation let width_stride = k * 8 * component.width_stride; let start = j * 64 + width_stride; // See https://github.com/etemesi254/zune-image/issues/262 sample 3. 
let Some(qt_slice) = slice.get(start..start + 64) else { return Err(DecodeErrors::FormatStatic( "Invalid slice , would panic, invalid image" )); }; // dequantize for ((x, out), qt_val) in qt_slice.iter().zip(tmp.iter_mut()).zip(qt_table.iter()) { *out = i32::from(*x) * qt_val; } // determine where to write. let sl = &mut temp_channel[component.idct_pos..]; component.idct_pos += 8; // tmp now contains a dequantized block so idct it (self.idct_func)(&mut tmp, sl, component.width_stride); } // after every write of 8, skip 7 since idct write stride wise 8 times. // // Remember each MCU is 8x8 block, so each idct will write 8 strides into // sl // // and component.idct_pos is one stride long component.idct_pos += 7 * component.width_stride; } component.idct_pos = 0; } // process that width up until it's impossible self.post_process( pixels, i, mcu_height, width, padded_width, &mut pixels_written, &mut upsampler_scratch_space )?; } debug!("Finished decoding image"); return Ok(()); } pub(crate) fn reset_params(&mut self) { /* Apparently, grayscale images which can be down sampled exists, which is weird in the sense that it has one component Y, which is not usually down sampled. This means some calculations will be wrong, so for that we explicitly reset params for such occurrences, warn and reset the image info to appear as if it were a non-sampled image to ensure decoding works */ self.h_max = 1; self.v_max = 1; self.sub_sample_ratio = SampleRatios::None; self.is_interleaved = false; self.components[0].vertical_sample = 1; self.components[0].width_stride = (((self.info.width as usize) + 7) / 8) * 8; self.components[0].horizontal_sample = 1; } } ///Get a marker from the bit-stream. 
///
/// This reads until it gets a marker or end of file is encountered
pub fn get_marker<T>(reader: &mut ZReader<T>, stream: &mut BitStream) -> Result<Marker, DecodeErrors>
where
    T: ZByteReaderTrait
{
    if let Some(marker) = stream.marker {
        stream.marker = None;
        return Ok(marker);
    }

    // read until we get a marker
    while !reader.eof()? {
        let marker = reader.read_u8_err()?;

        if marker == 255 {
            let mut r = reader.read_u8_err()?;
            // 0xFF 0xFF fill bytes (some images may be like that)
            while r == 0xFF {
                r = reader.read_u8_err()?;
            }

            if r != 0 {
                return Marker::from_u8(r)
                    .ok_or_else(|| DecodeErrors::Format(format!("Unknown marker 0xFF{r:X}")));
            }
        }
    }
    Err(DecodeErrors::ExhaustedData)
}

// #[cfg(test)]
// mod tests {
//     use zune_core::bytestream::ZCursor;
//     use crate::JpegDecoder;
//
//     #[test]
//     fn make_test() {
//         let img = "/Users/etemesi/Downloads/wrong_sampling.jpeg";
//         let data = ZCursor::new(std::fs::read(img).unwrap());
//         let mut decoder = JpegDecoder::new(data);
//         decoder.decode().unwrap();
//     }
// }

zune-jpeg-0.5.11/src/misc.rs
/* * Copyright (c) 2023.
* * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //!Miscellaneous stuff #![allow(dead_code)] use alloc::format; use core::cmp::max; use core::fmt; use core::num::NonZeroU32; use zune_core::bytestream::ZByteReaderTrait; use zune_core::colorspace::ColorSpace; use zune_core::log::{trace, warn}; use crate::components::{ComponentID, SampleRatios}; use crate::errors::DecodeErrors; use crate::huffman::HuffmanTable; use crate::JpegDecoder; /// Start of baseline DCT Huffman coding pub const START_OF_FRAME_BASE: u16 = 0xffc0; /// Start of another frame pub const START_OF_FRAME_EXT_SEQ: u16 = 0xffc1; /// Start of progressive DCT encoding pub const START_OF_FRAME_PROG_DCT: u16 = 0xffc2; /// Start of Lossless sequential Huffman coding pub const START_OF_FRAME_LOS_SEQ: u16 = 0xffc3; /// Start of extended sequential DCT arithmetic coding pub const START_OF_FRAME_EXT_AR: u16 = 0xffc9; /// Start of Progressive DCT arithmetic coding pub const START_OF_FRAME_PROG_DCT_AR: u16 = 0xffca; /// Start of Lossless sequential Arithmetic coding pub const START_OF_FRAME_LOS_SEQ_AR: u16 = 0xffcb; /// Undo run length encoding of coefficients by placing them in natural order /// /// This is an index from position-in-bitstream to position-in-row-major-order. 
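To illustrate the mapping: coefficients arrive from the bitstream in zig-zag order, and `UN_ZIGZAG[pos]` gives the row-major position where each one lands in the 8x8 block. A standalone sketch using just the first eight entries of the table (`un_zigzag_head` is an illustrative helper, not part of the crate):

```rust
// First eight entries of the zig-zag -> natural-order index table.
const UN_ZIGZAG_HEAD: [usize; 8] = [0, 1, 8, 16, 9, 2, 3, 10];

// Scatter coefficients read in zig-zag order into a row-major 8x8 block.
fn un_zigzag_head(scan_order: &[i16; 8]) -> [i16; 64] {
    let mut block = [0i16; 64];
    for (pos, &coeff) in scan_order.iter().enumerate() {
        block[UN_ZIGZAG_HEAD[pos]] = coeff;
    }
    block
}
```

The 16 trailing `63` entries in the real table exist so that a corrupt run length can index past position 63 without going out of bounds — it merely overwrites the last coefficient, per the "Prevent overflowing" comment.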
#[rustfmt::skip] pub const UN_ZIGZAG: [usize; 64 + 16] = [ 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, 18, 11, 4, 5, 12, 19, 26, 33, 40, 48, 41, 34, 27, 20, 13, 6, 7, 14, 21, 28, 35, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, 30, 37, 44, 51, 58, 59, 52, 45, 38, 31, 39, 46, 53, 60, 61, 54, 47, 55, 62, 63, // Prevent overflowing 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63 ]; /// Align data to a 16 byte boundary #[repr(align(16))] #[derive(Clone)] pub struct Aligned16(pub T); impl Default for Aligned16 where T: Default { fn default() -> Self { Aligned16(T::default()) } } /// Align data to a 32 byte boundary #[repr(align(32))] #[derive(Clone)] pub struct Aligned32(pub T); impl Default for Aligned32 where T: Default { fn default() -> Self { Aligned32(T::default()) } } /// Markers that identify different Start of Image markers /// They identify the type of encoding and whether the file use lossy(DCT) or /// lossless compression and whether we use Huffman or arithmetic coding schemes #[derive(Eq, PartialEq, Copy, Clone)] #[allow(clippy::upper_case_acronyms)] pub enum SOFMarkers { /// Baseline DCT markers BaselineDct, /// SOF_1 Extended sequential DCT,Huffman coding ExtendedSequentialHuffman, /// Progressive DCT, Huffman coding ProgressiveDctHuffman, /// Lossless (sequential), huffman coding, LosslessHuffman, /// Extended sequential DEC, arithmetic coding ExtendedSequentialDctArithmetic, /// Progressive DCT, arithmetic coding, ProgressiveDctArithmetic, /// Lossless ( sequential), arithmetic coding LosslessArithmetic } impl Default for SOFMarkers { fn default() -> Self { Self::BaselineDct } } impl SOFMarkers { /// Check if a certain marker is sequential DCT or not pub fn is_sequential_dct(self) -> bool { matches!( self, Self::BaselineDct | Self::ExtendedSequentialHuffman | Self::ExtendedSequentialDctArithmetic ) } /// Check if a marker is a Lossles type or not pub fn is_lossless(self) -> bool { matches!(self, Self::LosslessHuffman | 
Self::LosslessArithmetic) }

    /// Check whether a marker is a progressive marker or not
    pub fn is_progressive(self) -> bool {
        matches!(
            self,
            Self::ProgressiveDctHuffman | Self::ProgressiveDctArithmetic
        )
    }

    /// Create a marker from an integer
    pub fn from_int(int: u16) -> Option<Self> {
        match int {
            START_OF_FRAME_BASE => Some(Self::BaselineDct),
            START_OF_FRAME_PROG_DCT => Some(Self::ProgressiveDctHuffman),
            START_OF_FRAME_PROG_DCT_AR => Some(Self::ProgressiveDctArithmetic),
            START_OF_FRAME_LOS_SEQ => Some(Self::LosslessHuffman),
            START_OF_FRAME_LOS_SEQ_AR => Some(Self::LosslessArithmetic),
            START_OF_FRAME_EXT_SEQ => Some(Self::ExtendedSequentialHuffman),
            START_OF_FRAME_EXT_AR => Some(Self::ExtendedSequentialDctArithmetic),
            _ => None
        }
    }
}

impl fmt::Debug for SOFMarkers {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match &self {
            Self::BaselineDct => write!(f, "Baseline DCT"),
            Self::ExtendedSequentialHuffman => {
                write!(f, "Extended sequential DCT, Huffman Coding")
            }
            Self::ProgressiveDctHuffman => write!(f, "Progressive DCT, Huffman Encoding"),
            Self::LosslessHuffman => write!(f, "Lossless (sequential) Huffman encoding"),
            Self::ExtendedSequentialDctArithmetic => {
                write!(f, "Extended sequential DCT, arithmetic coding")
            }
            Self::ProgressiveDctArithmetic => write!(f, "Progressive DCT, arithmetic coding"),
            Self::LosslessArithmetic => write!(f, "Lossless (sequential) arithmetic coding")
        }
    }
}

/// Set up component parameters.
///
/// This modifies the components in place, setting up details needed by other
/// parts of the decoder.
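The MCU geometry computed in `setup_component_params` reduces to: one MCU spans `h_max * 8` by `v_max * 8` pixels, and the image is covered by a ceiling-divided grid of such MCUs. A standalone sketch of that arithmetic — the helper name `mcu_grid` is illustrative, and the sampling factors are assumed inputs:

```rust
// Number of MCUs needed to cover a width x height image when the
// maximum sampling factors are h_max and v_max (so a single MCU covers
// (h_max * 8) x (v_max * 8) pixels).
fn mcu_grid(width: usize, height: usize, h_max: usize, v_max: usize) -> (usize, usize) {
    let mcu_width = h_max * 8;
    let mcu_height = v_max * 8;
    // ceiling division: partial MCUs at the right/bottom edges still
    // need a full MCU of decoded data.
    (width.div_ceil(mcu_width), height.div_ceil(mcu_height))
}
```

For a 1920x1080 4:2:0 image (`h_max = v_max = 2`) this yields a 120x68 grid; the 68th row exists only to cover the 8 padding scanlines below row 1072.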
pub(crate) fn setup_component_params( img: &mut JpegDecoder ) -> Result<(), DecodeErrors> { let img_width = img.width(); let img_height = img.height(); // in case of adobe app14 being present, zero may indicate // either CMYK if components are 4 or RGB if components are 3, // see https://docs.oracle.com/javase/6/docs/api/javax/imageio/metadata/doc-files/jpeg_metadata.html // so since we may not know how many number of components // we have when decoding app14, we have to defer that check // until now. // // We know adobe app14 was present since it's the only one that can modify // input colorspace to be CMYK if img.components.len() == 3 && img.input_colorspace == ColorSpace::CMYK { img.input_colorspace = ColorSpace::RGB; } for component in &mut img.components { // compute interleaved image info // h_max contains the maximum horizontal component img.h_max = max(img.h_max, component.horizontal_sample); // v_max contains the maximum vertical component img.v_max = max(img.v_max, component.vertical_sample); img.mcu_width = img.h_max * 8; img.mcu_height = img.v_max * 8; // Number of MCU's per width img.mcu_x = usize::from(img.info.width).div_ceil(img.mcu_width); // Number of MCU's per height img.mcu_y = usize::from(img.info.height).div_ceil(img.mcu_height); if img.h_max != 1 || img.v_max != 1 { // interleaved images have horizontal and vertical sampling factors // not equal to 1. img.is_interleaved = true; } // Extract quantization tables from the arrays into components let qt_table = *img.qt_tables[component.quantization_table_number as usize] .as_ref() .ok_or_else(|| { DecodeErrors::DqtError(format!( "No quantization table for component {:?}", component.component_id )) })?; let x = (usize::from(img_width) * component.horizontal_sample + img.h_max - 1) / img.h_max; let y = (usize::from(img_height) * component.horizontal_sample + img.h_max - 1) / img.v_max; component.x = x; component.w2 = img.mcu_x * component.horizontal_sample * 8; // probably not needed. 
:) component.y = y; component.quantization_table = qt_table; // initially stride contains its horizontal sub-sampling component.width_stride *= img.mcu_x * 8; } { // Sampling factors are one thing that suck // this fixes a specific problem with images like // // (2 2) None // (2 1) H // (2 1) H // // The images exist in the wild, the images are not meant to exist // but they do, it's just an annoying horizontal sub-sampling that // I don't know why it exists. // But it does // So we try to cope with that. // I am not sure of how to explain how to fix it, but it involved a debugger // and to much coke(the legal one) // // If this wasn't present, self.upsample_dest would have the wrong length let mut handle_that_annoying_bug = false; if let Some(y_component) = img .components .iter() .find(|c| c.component_id == ComponentID::Y) { if y_component.horizontal_sample == 2 || y_component.vertical_sample == 2 { handle_that_annoying_bug = true; } } if handle_that_annoying_bug { for comp in &mut img.components { if (comp.component_id != ComponentID::Y) && (comp.horizontal_sample != 1 || comp.vertical_sample != 1) { comp.fix_an_annoying_bug = 2; } } } } if img.is_mjpeg { fill_default_mjpeg_tables( img.is_progressive, &mut img.dc_huffman_tables, &mut img.ac_huffman_tables ); } // check colorspace matches if img.input_colorspace.num_components() > img.components.len() { if img.input_colorspace == ColorSpace::YCCK { // Some images may have YCCK format (from adobe app14 segment) which is supposed to be 4 components // but only 3 components, see issue https://github.com/etemesi254/zune-image/issues/275 // So this is the behaviour of other decoders // - stb_image: Treats it as YCbCr image // - libjpeg_turbo: Does not know how to parse YCCK images (transform 2 app14) so treats // it as YCbCr // So I will match that to match existing ones warn!("Treating YCCK colorspace as YCbCr as component length does not match"); img.input_colorspace = ColorSpace::YCbCr } else { // Note, translated 
this to a warning to handle valid images of the sort // See https://github.com/etemesi254/zune-image/issues/288 where there // was a CMYK image with two components which would be decoded to 4 components // by the decoder. // So with a warning that becomes supported. // // djpeg fails to render an image from that also probably because it does not // understand the expected format. if !img.options.strict_mode() { warn!( "Expected {} number of components but found {}", img.input_colorspace.num_components(), img.components.len() ); warn!("Defaulting to multisample to decode"); // N/B: We do not post process the color of such, treating it as multiband // is the best option since I am not aware of grayscale+alpha which is the most common // two band format in jpeg. if img.components.len() > 0 { img.input_colorspace = ColorSpace::MultiBand( NonZeroU32::new(img.components.len() as u32).unwrap() ); } } else { let msg = format!( "Expected {} number of components but found {}", img.input_colorspace.num_components(), img.components.len() ); return Err(DecodeErrors::Format(msg)); } } } Ok(()) } ///Calculate number of fill bytes added to the end of a JPEG image /// to fill the image /// /// JPEG usually inserts padding bytes if the image width cannot be evenly divided into /// 8 , 16 or 32 chunks depending on the sub sampling ratio. 
So given a sub-sampling ratio,
/// and the actual width, this calculates the padding bytes that were added to the image.
///
/// # Params
/// - actual_width: Actual width of the image
/// - sub_sample: Sub-sampling factor of the image
///
/// # Returns
/// The padded width, i.e. how long a row is for this particular image
pub fn calculate_padded_width(actual_width: usize, sub_sample: SampleRatios) -> usize {
    match sub_sample {
        SampleRatios::None | SampleRatios::V => {
            // None + V send one MCU row, so that's a simple calculation
            ((actual_width + 7) / 8) * 8
        }
        SampleRatios::H | SampleRatios::HV => {
            // sends two rows, width can be expanded by up to 15 more bytes
            ((actual_width + 15) / 16) * 16
        }
        SampleRatios::Generic(h, _) => {
            ((actual_width + ((h * 8).saturating_sub(1))) / (h * 8)) * (h * 8)
        }
    }
}

// https://www.loc.gov/preservation/digital/formats/fdd/fdd000063.shtml
// "Avery Lee, writing in the rec.video.desktop newsgroup in 2001, commented that "MJPEG, or at
// least the MJPEG in AVIs having the MJPG fourcc, is restricted JPEG with a fixed -- and
// *omitted* -- Huffman table. The JPEG must be YCbCr colorspace, it must be 4:2:2, and it must
// use basic Huffman encoding, not arithmetic or progressive.... You can indeed extract the
// MJPEG frames and decode them with a regular JPEG decoder, but you have to prepend the DHT
// segment to them, or else the decoder won't have any idea how to decompress the data.
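Every arm of `calculate_padded_width` applies the same rule: round the width up to the next multiple of the MCU span (8, 16, or `h * 8`). A generic sketch of that rounding — `padded_width` is an illustrative helper, not a crate function:

```rust
// Round `actual` up to the next multiple of `mcu_span` (8 for no or
// vertical-only subsampling, 16 for horizontal, h * 8 in the generic case).
fn padded_width(actual: usize, mcu_span: usize) -> usize {
    actual.div_ceil(mcu_span) * mcu_span
}
```

So a 500-pixel-wide image pads to 504 bytes per row at 4:4:4 (span 8) but 512 at 4:2:2 (span 16); widths already on a boundary are unchanged.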
// The exact table necessary is given in the OpenDML spec."" pub fn fill_default_mjpeg_tables( is_progressive: bool, dc_huffman_tables: &mut [Option], ac_huffman_tables: &mut [Option] ) { // Section K.3.3 trace!("Filling with default mjpeg tables"); if dc_huffman_tables[0].is_none() { // Table K.3 dc_huffman_tables[0] = Some( HuffmanTable::new_unfilled( &[ 0x00, 0x00, 0x01, 0x05, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ], &[ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B ], true, is_progressive ) .unwrap() ); } if dc_huffman_tables[1].is_none() { // Table K.4 dc_huffman_tables[1] = Some( HuffmanTable::new_unfilled( &[ 0x00, 0x00, 0x03, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00 ], &[ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B ], true, is_progressive ) .unwrap() ); } if ac_huffman_tables[0].is_none() { // Table K.5 ac_huffman_tables[0] = Some( HuffmanTable::new_unfilled( &[ 0x00, 0x00, 0x02, 0x01, 0x03, 0x03, 0x02, 0x04, 0x03, 0x05, 0x05, 0x04, 0x04, 0x00, 0x00, 0x01, 0x7D ], &[ 0x01, 0x02, 0x03, 0x00, 0x04, 0x11, 0x05, 0x12, 0x21, 0x31, 0x41, 0x06, 0x13, 0x51, 0x61, 0x07, 0x22, 0x71, 0x14, 0x32, 0x81, 0x91, 0xA1, 0x08, 0x23, 0x42, 0xB1, 0xC1, 0x15, 0x52, 0xD1, 0xF0, 0x24, 0x33, 0x62, 0x72, 0x82, 0x09, 0x0A, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4A, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 0xDA, 0xE1, 0xE2, 0xE3, 0xE4, 
0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA ], false, is_progressive ) .unwrap() ); } if ac_huffman_tables[1].is_none() { // Table K.6 ac_huffman_tables[1] = Some( HuffmanTable::new_unfilled( &[ 0x00, 0x00, 0x02, 0x01, 0x02, 0x04, 0x04, 0x03, 0x04, 0x07, 0x05, 0x04, 0x04, 0x00, 0x01, 0x02, 0x77 ], &[ 0x00, 0x01, 0x02, 0x03, 0x11, 0x04, 0x05, 0x21, 0x31, 0x06, 0x12, 0x41, 0x51, 0x07, 0x61, 0x71, 0x13, 0x22, 0x32, 0x81, 0x08, 0x14, 0x42, 0x91, 0xA1, 0xB1, 0xC1, 0x09, 0x23, 0x33, 0x52, 0xF0, 0x15, 0x62, 0x72, 0xD1, 0x0A, 0x16, 0x24, 0x34, 0xE1, 0x25, 0xF1, 0x17, 0x18, 0x19, 0x1A, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4A, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 0xDA, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA ], false, is_progressive ) .unwrap() ); } } zune-jpeg-0.5.11/src/unsafe_utils.rs000064400000000000000000000003201046102023000154450ustar 00000000000000#[cfg(all(feature = "x86", any(target_arch = "x86", target_arch = "x86_64")))] pub use crate::unsafe_utils_avx2::*; #[cfg(all(feature = "neon", target_arch = "aarch64"))] pub use crate::unsafe_utils_neon::*; zune-jpeg-0.5.11/src/unsafe_utils_avx2.rs000064400000000000000000000123241046102023000164140ustar 00000000000000/* * Copyright (c) 2023. 
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */
#![cfg(all(feature = "x86", any(target_arch = "x86", target_arch = "x86_64")))]
//! This module provides unsafe ways to do some things
#![allow(clippy::wildcard_imports)]

#[cfg(target_arch = "x86")]
use core::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;
use core::ops::{Add, AddAssign, Mul, MulAssign, Sub};

/// A copy of `_MM_SHUFFLE()` that doesn't require
/// a nightly compiler
#[inline]
const fn shuffle(z: i32, y: i32, x: i32, w: i32) -> i32 {
    (z << 6) | (y << 4) | (x << 2) | w
}

/// An abstraction of an AVX ymm register that
/// allows some things to not look ugly
#[derive(Clone, Copy)]
pub struct YmmRegister {
    /// An AVX register
    pub(crate) mm256: __m256i
}

impl Add for YmmRegister {
    type Output = YmmRegister;

    #[inline]
    fn add(self, rhs: Self) -> Self::Output {
        unsafe {
            return YmmRegister {
                mm256: _mm256_add_epi32(self.mm256, rhs.mm256)
            };
        }
    }
}

impl Add<i32> for YmmRegister {
    type Output = YmmRegister;

    #[inline]
    fn add(self, rhs: i32) -> Self::Output {
        unsafe {
            let tmp = _mm256_set1_epi32(rhs);

            return YmmRegister {
                mm256: _mm256_add_epi32(self.mm256, tmp)
            };
        }
    }
}

impl Sub for YmmRegister {
    type Output = YmmRegister;

    #[inline]
    fn sub(self, rhs: Self) -> Self::Output {
        unsafe {
            return YmmRegister {
                mm256: _mm256_sub_epi32(self.mm256, rhs.mm256)
            };
        }
    }
}

impl AddAssign for YmmRegister {
    #[inline]
    fn add_assign(&mut self, rhs: Self) {
        unsafe {
            self.mm256 = _mm256_add_epi32(self.mm256, rhs.mm256);
        }
    }
}

impl AddAssign<i32> for YmmRegister {
    #[inline]
    fn add_assign(&mut self, rhs: i32) {
        unsafe {
            let tmp = _mm256_set1_epi32(rhs);

            self.mm256 = _mm256_add_epi32(self.mm256, tmp);
        }
    }
}

impl Mul for YmmRegister {
    type Output = YmmRegister;

    #[inline]
    fn mul(self, rhs: Self) -> Self::Output {
        unsafe {
            YmmRegister {
                mm256: _mm256_mullo_epi32(self.mm256, rhs.mm256)
            }
        }
    }
}

impl Mul<i32> for YmmRegister {
    type Output = YmmRegister;

    #[inline]
    fn mul(self, rhs: i32) -> Self::Output {
        unsafe {
            let tmp = _mm256_set1_epi32(rhs);

            YmmRegister {
                mm256: _mm256_mullo_epi32(self.mm256, tmp)
            }
        }
    }
}

impl MulAssign for YmmRegister {
    #[inline]
    fn mul_assign(&mut self, rhs: Self) {
        unsafe {
            self.mm256 = _mm256_mullo_epi32(self.mm256, rhs.mm256);
        }
    }
}

impl MulAssign<i32> for YmmRegister {
    #[inline]
    fn mul_assign(&mut self, rhs: i32) {
        unsafe {
            let tmp = _mm256_set1_epi32(rhs);

            self.mm256 = _mm256_mullo_epi32(self.mm256, tmp);
        }
    }
}

impl MulAssign<__m256i> for YmmRegister {
    #[inline]
    fn mul_assign(&mut self, rhs: __m256i) {
        unsafe {
            self.mm256 = _mm256_mullo_epi32(self.mm256, rhs);
        }
    }
}

type Reg = YmmRegister;

/// Transpose an array of 8 by 8 i32's using avx intrinsics
///
/// This was translated from [here](https://newbedev.com/transpose-an-8x8-float-using-avx-avx2)
#[allow(unused_parens, clippy::too_many_arguments)]
#[target_feature(enable = "avx2")]
#[inline]
pub unsafe fn transpose(
    v0: &mut Reg, v1: &mut Reg, v2: &mut Reg, v3: &mut Reg, v4: &mut Reg, v5: &mut Reg,
    v6: &mut Reg, v7: &mut Reg
) {
    macro_rules! merge_epi32 {
        ($v0:tt,$v1:tt,$v2:tt,$v3:tt) => {
            let va = _mm256_permute4x64_epi64($v0, shuffle(3, 1, 2, 0));
            let vb = _mm256_permute4x64_epi64($v1, shuffle(3, 1, 2, 0));

            $v2 = _mm256_unpacklo_epi32(va, vb);
            $v3 = _mm256_unpackhi_epi32(va, vb);
        };
    }

    macro_rules! merge_epi64 {
        ($v0:tt,$v1:tt,$v2:tt,$v3:tt) => {
            let va = _mm256_permute4x64_epi64($v0, shuffle(3, 1, 2, 0));
            let vb = _mm256_permute4x64_epi64($v1, shuffle(3, 1, 2, 0));

            $v2 = _mm256_unpacklo_epi64(va, vb);
            $v3 = _mm256_unpackhi_epi64(va, vb);
        };
    }

    macro_rules! merge_si128 {
        ($v0:tt,$v1:tt,$v2:tt,$v3:tt) => {
            $v2 = _mm256_permute2x128_si256($v0, $v1, shuffle(0, 2, 0, 0));
            $v3 = _mm256_permute2x128_si256($v0, $v1, shuffle(0, 3, 0, 1));
        };
    }

    let (w0, w1, w2, w3, w4, w5, w6, w7);

    merge_epi32!((v0.mm256), (v1.mm256), w0, w1);
    merge_epi32!((v2.mm256), (v3.mm256), w2, w3);
    merge_epi32!((v4.mm256), (v5.mm256), w4, w5);
    merge_epi32!((v6.mm256), (v7.mm256), w6, w7);

    let (x0, x1, x2, x3, x4, x5, x6, x7);

    merge_epi64!(w0, w2, x0, x1);
    merge_epi64!(w1, w3, x2, x3);
    merge_epi64!(w4, w6, x4, x5);
    merge_epi64!(w5, w7, x6, x7);

    merge_si128!(x0, x4, (v0.mm256), (v1.mm256));
    merge_si128!(x1, x5, (v2.mm256), (v3.mm256));
    merge_si128!(x2, x6, (v4.mm256), (v5.mm256));
    merge_si128!(x3, x7, (v6.mm256), (v7.mm256));
}

zune-jpeg-0.5.11/src/unsafe_utils_neon.rs

/*
 * Copyright (c) 2023.
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */
#![cfg(all(feature = "neon", target_arch = "aarch64"))]
// TODO can this be extended to armv7
//!
This module provides unsafe ways to do some things #![allow(clippy::wildcard_imports)] use core::arch::aarch64::*; use core::ops::{Add, AddAssign, BitOr, BitOrAssign, Mul, MulAssign, Sub}; pub type VecType = int32x4x2_t; pub unsafe fn loadu(src: *const i32) -> VecType { vld1q_s32_x2(src as *const _) } /// An abstraction of an AVX ymm register that ///allows some things to not look ugly #[derive(Clone, Copy)] pub struct YmmRegister { /// An AVX register pub(crate) mm256: VecType } impl YmmRegister { #[inline] pub unsafe fn load(src: *const i32) -> Self { loadu(src).into() } #[inline] pub fn map2(self, other: Self, f: impl Fn(int32x4_t, int32x4_t) -> int32x4_t) -> Self { let m0 = f(self.mm256.0, other.mm256.0); let m1 = f(self.mm256.1, other.mm256.1); YmmRegister { mm256: int32x4x2_t(m0, m1) } } #[inline] pub fn all_zero(self) -> bool { unsafe { let both = vorrq_s32(self.mm256.0, self.mm256.1); let both_unsigned = vreinterpretq_u32_s32(both); 0 == vmaxvq_u32(both_unsigned) } } #[inline] pub fn const_shl(self) -> Self { // Ensure that we logically shift left unsafe { let m0 = vreinterpretq_s32_u32(vshlq_n_u32::(vreinterpretq_u32_s32(self.mm256.0))); let m1 = vreinterpretq_s32_u32(vshlq_n_u32::(vreinterpretq_u32_s32(self.mm256.1))); YmmRegister { mm256: int32x4x2_t(m0, m1) } } } #[inline] pub fn const_shra(self) -> Self { unsafe { let i0 = vshrq_n_s32::(self.mm256.0); let i1 = vshrq_n_s32::(self.mm256.1); YmmRegister { mm256: int32x4x2_t(i0, i1) } } } } impl Add for YmmRegister where T: Into { type Output = YmmRegister; #[inline] fn add(self, rhs: T) -> Self::Output { let rhs = rhs.into(); unsafe { self.map2(rhs, |a, b| vaddq_s32(a, b)) } } } impl Sub for YmmRegister where T: Into { type Output = YmmRegister; #[inline] fn sub(self, rhs: T) -> Self::Output { let rhs = rhs.into(); unsafe { self.map2(rhs, |a, b| vsubq_s32(a, b)) } } } impl AddAssign for YmmRegister where T: Into { #[inline] fn add_assign(&mut self, rhs: T) { let rhs: Self = rhs.into(); *self = *self + 
rhs; } } impl Mul for YmmRegister where T: Into { type Output = YmmRegister; #[inline] fn mul(self, rhs: T) -> Self::Output { let rhs = rhs.into(); unsafe { self.map2(rhs, |a, b| vmulq_s32(a, b)) } } } impl MulAssign for YmmRegister where T: Into { #[inline] fn mul_assign(&mut self, rhs: T) { let rhs: Self = rhs.into(); *self = *self * rhs; } } impl BitOr for YmmRegister where T: Into { type Output = YmmRegister; #[inline] fn bitor(self, rhs: T) -> Self::Output { let rhs = rhs.into(); unsafe { self.map2(rhs, |a, b| vorrq_s32(a, b)) } } } impl BitOrAssign for YmmRegister where T: Into { #[inline] fn bitor_assign(&mut self, rhs: T) { let rhs: Self = rhs.into(); *self = *self | rhs; } } impl From for YmmRegister { #[inline] fn from(val: i32) -> Self { unsafe { let dup = vdupq_n_s32(val); YmmRegister { mm256: int32x4x2_t(dup, dup) } } } } impl From for YmmRegister { #[inline] fn from(mm256: VecType) -> Self { YmmRegister { mm256 } } } #[allow(clippy::too_many_arguments)] #[inline] unsafe fn transpose4( v0: &mut int32x4_t, v1: &mut int32x4_t, v2: &mut int32x4_t, v3: &mut int32x4_t ) { let w0 = vtrnq_s32( vreinterpretq_s32_s64(vtrn1q_s64( vreinterpretq_s64_s32(*v0), vreinterpretq_s64_s32(*v2) )), vreinterpretq_s32_s64(vtrn1q_s64( vreinterpretq_s64_s32(*v1), vreinterpretq_s64_s32(*v3) )) ); let w1 = vtrnq_s32( vreinterpretq_s32_s64(vtrn2q_s64( vreinterpretq_s64_s32(*v0), vreinterpretq_s64_s32(*v2) )), vreinterpretq_s32_s64(vtrn2q_s64( vreinterpretq_s64_s32(*v1), vreinterpretq_s64_s32(*v3) )) ); *v0 = w0.0; *v1 = w0.1; *v2 = w1.0; *v3 = w1.1; } /// Transpose an array of 8 by 8 i32 /// Arm has dedicated interleave/transpose instructions /// we: /// 1. Transpose the upper left and lower right quadrants /// 2. 
Swap and transpose the upper right and lower left quadrants #[allow(clippy::too_many_arguments)] #[inline] pub unsafe fn transpose( v0: &mut YmmRegister, v1: &mut YmmRegister, v2: &mut YmmRegister, v3: &mut YmmRegister, v4: &mut YmmRegister, v5: &mut YmmRegister, v6: &mut YmmRegister, v7: &mut YmmRegister ) { use core::mem::swap; let ul0 = &mut v0.mm256.0; let ul1 = &mut v1.mm256.0; let ul2 = &mut v2.mm256.0; let ul3 = &mut v3.mm256.0; let ur0 = &mut v0.mm256.1; let ur1 = &mut v1.mm256.1; let ur2 = &mut v2.mm256.1; let ur3 = &mut v3.mm256.1; let ll0 = &mut v4.mm256.0; let ll1 = &mut v5.mm256.0; let ll2 = &mut v6.mm256.0; let ll3 = &mut v7.mm256.0; let lr0 = &mut v4.mm256.1; let lr1 = &mut v5.mm256.1; let lr2 = &mut v6.mm256.1; let lr3 = &mut v7.mm256.1; swap(ur0, ll0); swap(ur1, ll1); swap(ur2, ll2); swap(ur3, ll3); transpose4(ul0, ul1, ul2, ul3); transpose4(ur0, ur1, ur2, ur3); transpose4(ll0, ll1, ll2, ll3); transpose4(lr0, lr1, lr2, lr3); } #[cfg(test)] mod tests { use super::*; #[test] fn test_transpose() { fn get_val(i: usize, j: usize) -> i32 { ((i * 8) / (j + 1)) as i32 } unsafe { let mut vals: [i32; 8 * 8] = [0; 8 * 8]; for i in 0..8 { for j in 0..8 { // some order-dependent value of i and j let value = get_val(i, j); vals[i * 8 + j] = value; } } let mut regs: [YmmRegister; 8] = core::mem::transmute(vals); let mut reg0 = regs[0]; let mut reg1 = regs[1]; let mut reg2 = regs[2]; let mut reg3 = regs[3]; let mut reg4 = regs[4]; let mut reg5 = regs[5]; let mut reg6 = regs[6]; let mut reg7 = regs[7]; transpose( &mut reg0, &mut reg1, &mut reg2, &mut reg3, &mut reg4, &mut reg5, &mut reg6, &mut reg7 ); regs[0] = reg0; regs[1] = reg1; regs[2] = reg2; regs[3] = reg3; regs[4] = reg4; regs[5] = reg5; regs[6] = reg6; regs[7] = reg7; let vals_from_reg: [i32; 8 * 8] = core::mem::transmute(regs); for i in 0..8 { for j in 0..i { let orig = vals[i * 8 + j]; vals[i * 8 + j] = vals[j * 8 + i]; vals[j * 8 + i] = orig; } } for i in 0..8 { for j in 0..8 { assert_eq!(vals[j * 8 + 
i], get_val(i, j)); assert_eq!(vals_from_reg[j * 8 + i], get_val(i, j)); } } assert_eq!(vals, vals_from_reg); } } } zune-jpeg-0.5.11/src/upsampler/avx2.rs000064400000000000000000000150721046102023000156460ustar 00000000000000/* * Copyright (c) 2025. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ #[cfg(target_arch = "x86")] use core::arch::x86::*; #[cfg(target_arch = "x86_64")] use core::arch::x86_64::*; #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[target_feature(enable = "avx2")] pub unsafe fn upsample_horizontal_avx2( input: &[i16], in_near: &[i16], in_far: &[i16], scratch: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 2, output.len()); assert!(input.len() > 2); let len = input.len(); if len < 18 { return super::scalar::upsample_horizontal(input, in_near, in_far, scratch, output); } // First two pixels output[0] = input[0]; output[1] = (input[0] * 3 + input[1] + 2) >> 2; let v_three = _mm256_set1_epi16(3); let v_two = _mm256_set1_epi16(2); let upsample16 = |input: &[i16; 18], output: &mut [i16; 32]| { let in_ptr = input.as_ptr(); let out_ptr = output.as_mut_ptr(); // SAFETY: The input is 18 * 16 bit long, so the loads are safe. let (v_prev, v_curr, v_next) = unsafe { ( _mm256_loadu_si256(in_ptr.add(0) as *const __m256i), _mm256_loadu_si256(in_ptr.add(1) as *const __m256i), _mm256_loadu_si256(in_ptr.add(2) as *const __m256i), ) }; let v_common = _mm256_add_epi16(_mm256_mullo_epi16(v_curr, v_three), v_two); let v_even = _mm256_srai_epi16(_mm256_add_epi16(v_common, v_prev), 2); let v_odd = _mm256_srai_epi16(_mm256_add_epi16(v_common, v_next), 2); let v_res_1 = _mm256_unpacklo_epi16(v_even, v_odd); let v_res_2 = _mm256_unpackhi_epi16(v_even, v_odd); let v_final_1 = _mm256_permute2x128_si256(v_res_1, v_res_2, 0x20); let v_final_2 = _mm256_permute2x128_si256(v_res_1, v_res_2, 0x31); // SAFETY: The output is 32 * 16 bit long, so the stores are safe. 
unsafe { _mm256_storeu_si256(out_ptr as *mut __m256i, v_final_1); _mm256_storeu_si256(out_ptr.add(16) as *mut __m256i, v_final_2); } }; for (input, output) in input .windows(18) .step_by(16) .zip(output[2..].chunks_exact_mut(32)) { upsample16(input.try_into().unwrap(), output.try_into().unwrap()); } // Upsample the remainder. This may have some overlap, but that's fine. if let Some(rest_input) = input.last_chunk::<18>() { let end = output.len() - 2; if let Some(rest_output) = output[..end].last_chunk_mut::<32>() { upsample16(rest_input, rest_output); } } // Last two pixels. output[output.len() - 2] = (3 * input[len - 1] + input[len - 2] + 2) >> 2; output[output.len() - 1] = input[len - 1]; } #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[target_feature(enable = "avx2")] pub unsafe fn upsample_vertical_avx2( input: &[i16], in_near: &[i16], in_far: &[i16], scratch: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 2, output.len()); assert_eq!(in_near.len(), input.len()); assert_eq!(in_far.len(), input.len()); let len = input.len(); if len < 16 { return super::scalar::upsample_vertical(input, in_near, in_far, scratch, output); } let middle = output.len() / 2; let (out_top, out_bottom) = output.split_at_mut(middle); let v_three = _mm256_set1_epi16(3); let v_two = _mm256_set1_epi16(2); let upsample16 = |input: &[i16; 16], in_near: &[i16; 16], in_far: &[i16; 16], out_top: &mut [i16; 16], out_bottom: &mut [i16; 16]| { // SAFETY: Inputs are all 16 * 16 bit long, so the loads are safe. 
let (v_in, v_near, v_far) = unsafe { ( _mm256_loadu_si256(input.as_ptr() as *const __m256i), _mm256_loadu_si256(in_near.as_ptr() as *const __m256i), _mm256_loadu_si256(in_far.as_ptr() as *const __m256i), ) }; let v_common = _mm256_add_epi16(_mm256_mullo_epi16(v_in, v_three), v_two); let v_out_top = _mm256_srai_epi16(_mm256_add_epi16(v_common, v_near), 2); let v_out_bottom = _mm256_srai_epi16(_mm256_add_epi16(v_common, v_far), 2); // SAFETY: Outputs are 16 * 16 bit long, so the stores are safe. unsafe { _mm256_storeu_si256(out_top.as_mut_ptr() as *mut __m256i, v_out_top); _mm256_storeu_si256(out_bottom.as_mut_ptr() as *mut __m256i, v_out_bottom); } }; let chunks = input .chunks_exact(16) .zip(in_near.chunks_exact(16)) .zip(in_far.chunks_exact(16)) .zip(out_top.chunks_exact_mut(16)) .zip(out_bottom.chunks_exact_mut(16)); for ((((input, in_near), in_far), out_top), out_bottom) in chunks { upsample16( input.try_into().unwrap(), in_near.try_into().unwrap(), in_far.try_into().unwrap(), out_top.try_into().unwrap(), out_bottom.try_into().unwrap(), ); } // Upsample the remainder. This may have some overlap, but that's fine. // Edition upgrade will fix this nested awfulness. 
if let Some(rest) = input.last_chunk::<16>() { if let Some(rest_near) = in_near.last_chunk::<16>() { if let Some(rest_far) = in_far.last_chunk::<16>() { if let Some(mut rest_top) = out_top.last_chunk_mut::<16>() { if let Some(mut rest_bottom) = out_bottom.last_chunk_mut::<16>() { upsample16(rest, rest_near, rest_far, &mut rest_top, &mut rest_bottom); } } } } } } #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[target_feature(enable = "avx2")] pub unsafe fn upsample_hv_avx2( input: &[i16], in_near: &[i16], in_far: &[i16], scratch_space: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 4, output.len()); assert_eq!(input.len() * 2, scratch_space.len()); upsample_vertical_avx2(input, in_near, in_far, &mut [], scratch_space); let scratch_half = scratch_space.len() / 2; let output_half = output.len() / 2; let (scratch_top, scratch_bottom) = scratch_space.split_at_mut(scratch_half); let (out_top, out_bottom) = output.split_at_mut(output_half); let mut t = [0]; upsample_horizontal_avx2(scratch_top, &[], &[], &mut t, out_top); upsample_horizontal_avx2(scratch_bottom, &[], &[], &mut t, out_bottom); } zune-jpeg-0.5.11/src/upsampler/neon.rs000064400000000000000000000134031046102023000157210ustar 00000000000000/* * Copyright (c) 2025. 
* * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ #[cfg(target_arch = "aarch64")] use core::arch::aarch64::*; #[cfg(target_arch = "aarch64")] #[target_feature(enable = "neon")] pub fn upsample_horizontal_neon( input: &[i16], in_near: &[i16], in_far: &[i16], scratch: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 2, output.len()); assert!(input.len() > 2); let len = input.len(); if len < 10 { return super::scalar::upsample_horizontal(input, in_near, in_far, scratch, output); } // First two pixels output[0] = input[0]; output[1] = (input[0] * 3 + input[1] + 2) >> 2; let v_three = vdupq_n_s16(3); let v_two = vdupq_n_s16(2); let upsample8 = |input: &[i16; 10], output: &mut [i16; 16]| { let in_ptr = input.as_ptr(); let out_ptr = output.as_mut_ptr(); // SAFETY: The input is 10 * 16 bit long, so the loads are safe. let (v_prev, v_curr, v_next) = unsafe { ( vld1q_s16(in_ptr), vld1q_s16(in_ptr.add(1)), vld1q_s16(in_ptr.add(2)), ) }; let v_common = vaddq_s16(vmulq_s16(v_curr, v_three), v_two); let v_even = vshrq_n_s16::<2>(vaddq_s16(v_common, v_prev)); let v_odd = vshrq_n_s16::<2>(vaddq_s16(v_common, v_next)); let v_res_1 = vzip1q_s16(v_even, v_odd); let v_res_2 = vzip2q_s16(v_even, v_odd); // SAFETY: The output is 16 * 16 bit long, so the stores are safe. unsafe { vst1q_s16(out_ptr, v_res_1); vst1q_s16(out_ptr.add(8), v_res_2); } }; for (input, output) in input .windows(10) .step_by(8) .zip(output[2..].chunks_exact_mut(16)) { upsample8(input.try_into().unwrap(), output.try_into().unwrap()); } // Upsample the remainder. This may have some overlap, but that's fine. if let Some(rest_input) = input.last_chunk::<10>() { let end = output.len() - 2; if let Some(rest_output) = output[..end].last_chunk_mut::<16>() { upsample8(rest_input, rest_output); } } // Last two pixels. 
output[output.len() - 2] = (3 * input[len - 1] + input[len - 2] + 2) >> 2; output[output.len() - 1] = input[len - 1]; } #[cfg(target_arch = "aarch64")] #[target_feature(enable = "neon")] pub fn upsample_vertical_neon( input: &[i16], in_near: &[i16], in_far: &[i16], scratch: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 2, output.len()); assert_eq!(in_near.len(), input.len()); assert_eq!(in_far.len(), input.len()); let len = input.len(); if len < 16 { return super::scalar::upsample_vertical(input, in_near, in_far, scratch, output); } let middle = output.len() / 2; let (out_top, out_bottom) = output.split_at_mut(middle); let v_three = vdupq_n_s16(3); let v_two = vdupq_n_s16(2); let upsample8 = |input: &[i16; 8], in_near: &[i16; 8], in_far: &[i16; 8], out_top: &mut [i16; 8], out_bottom: &mut [i16; 8]| { // SAFETY: Inputs are all 8 * 16 bit long, so the loads are safe. let (v_in, v_near, v_far) = unsafe { ( vld1q_s16(input.as_ptr()), vld1q_s16(in_near.as_ptr()), vld1q_s16(in_far.as_ptr()), ) }; let v_common = vaddq_s16(vmulq_s16(v_in, v_three), v_two); let v_out_top = vshrq_n_s16::<2>(vaddq_s16(v_common, v_near)); let v_out_bottom = vshrq_n_s16::<2>(vaddq_s16(v_common, v_far)); // SAFETY: Outputs are 8 * 16 bit long, so the stores are safe. unsafe { vst1q_s16(out_top.as_mut_ptr(), v_out_top); vst1q_s16(out_bottom.as_mut_ptr(), v_out_bottom); } }; let chunks = input .chunks_exact(8) .zip(in_near.chunks_exact(8)) .zip(in_far.chunks_exact(8)) .zip(out_top.chunks_exact_mut(8)) .zip(out_bottom.chunks_exact_mut(8)); for ((((input, in_near), in_far), out_top), out_bottom) in chunks { upsample8( input.try_into().unwrap(), in_near.try_into().unwrap(), in_far.try_into().unwrap(), out_top.try_into().unwrap(), out_bottom.try_into().unwrap(), ); } // Upsample the remainder. 
if let Some(rest) = input.last_chunk::<8>() { if let Some(rest_near) = in_near.last_chunk::<8>() { if let Some(rest_far) = in_far.last_chunk::<8>() { if let Some(mut rest_top) = out_top.last_chunk_mut::<8>() { if let Some(mut rest_bottom) = out_bottom.last_chunk_mut::<8>() { upsample8(rest, rest_near, rest_far, &mut rest_top, &mut rest_bottom); } } } } } } #[cfg(target_arch = "aarch64")] #[target_feature(enable = "neon")] pub fn upsample_hv_neon( input: &[i16], in_near: &[i16], in_far: &[i16], scratch_space: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 4, output.len()); assert_eq!(input.len() * 2, scratch_space.len()); upsample_vertical_neon(input, in_near, in_far, &mut [], scratch_space); let scratch_half = scratch_space.len() / 2; let output_half = output.len() / 2; let (scratch_top, scratch_bottom) = scratch_space.split_at_mut(scratch_half); let (out_top, out_bottom) = output.split_at_mut(output_half); let mut t = [0]; upsample_horizontal_neon(scratch_top, &[], &[], &mut t, out_top); upsample_horizontal_neon(scratch_bottom, &[], &[], &mut t, out_bottom); } zune-jpeg-0.5.11/src/upsampler/portable_simd.rs000064400000000000000000000120021046102023000176000ustar 00000000000000/* * Copyright (c) 2023. 
* * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ use std::simd::prelude::*; const LANES: usize = 16; type V = Simd; pub fn upsample_horizontal_simd( input: &[i16], in_near: &[i16], in_far: &[i16], scratch: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 2, output.len()); assert!(input.len() > 2); let len = input.len(); if len < 18 { return super::scalar::upsample_horizontal(input, in_near, in_far, scratch, output); } // First two pixels output[0] = input[0]; output[1] = (input[0] * 3 + input[1] + 2) >> 2; let v_three = V::splat(3); let v_two = V::splat(2); let upsample16 = |input: &[i16; 18], output: &mut [i16; 32]| { let v_prev = V::from_slice(&input[0..LANES]); let v_curr = V::from_slice(&input[1..LANES + 1]); let v_next = V::from_slice(&input[2..LANES + 2]); let v_common = v_curr * v_three + v_two; let v_even = (v_common + v_prev) >> 2; let v_odd = (v_common + v_next) >> 2; let (v_res_1, v_res_2) = v_even.interleave(v_odd); v_res_1.copy_to_slice(&mut output[0..LANES]); v_res_2.copy_to_slice(&mut output[LANES..2 * LANES]); }; for (input, output) in input .windows(18) .step_by(16) .zip(output[2..].chunks_exact_mut(32)) { upsample16(input.try_into().unwrap(), output.try_into().unwrap()); } // Upsample the remainder. This may have some overlap, but that's fine. if let Some(rest_input) = input.last_chunk::<18>() { let end = output.len() - 2; if let Some(rest_output) = output[..end].last_chunk_mut::<32>() { upsample16(rest_input, rest_output); } } // Last two pixels. 
output[output.len() - 2] = (3 * input[len - 1] + input[len - 2] + 2) >> 2; output[output.len() - 1] = input[len - 1]; } pub fn upsample_vertical_simd( input: &[i16], in_near: &[i16], in_far: &[i16], _scratch_space: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 2, output.len()); assert_eq!(in_near.len(), input.len()); assert_eq!(in_far.len(), input.len()); let len = input.len(); if len < 16 { return super::scalar::upsample_vertical(input, in_near, in_far, _scratch_space, output); } let middle = output.len() / 2; let (out_top, out_bottom) = output.split_at_mut(middle); let v_three = V::splat(3); let v_two = V::splat(2); let upsample16 = |input: &[i16; 16], in_near: &[i16; 16], in_far: &[i16; 16], out_top: &mut [i16; 16], out_bottom: &mut [i16; 16]| { let v_in = V::from(*input); let v_near = V::from(*in_near); let v_far = V::from(*in_far); let v_common = v_in * v_three + v_two; let v_out_top = (v_common + v_near) >> 2; let v_out_bottom = (v_common + v_far) >> 2; v_out_top.copy_to_slice(out_top.as_mut_slice()); v_out_bottom.copy_to_slice(out_bottom.as_mut_slice()); }; let chunks = input .chunks_exact(16) .zip(in_near.chunks_exact(16)) .zip(in_far.chunks_exact(16)) .zip(out_top.chunks_exact_mut(16)) .zip(out_bottom.chunks_exact_mut(16)); for ((((input, in_near), in_far), out_top), out_bottom) in chunks { upsample16( input.try_into().unwrap(), in_near.try_into().unwrap(), in_far.try_into().unwrap(), out_top.try_into().unwrap(), out_bottom.try_into().unwrap(), ); } // Upsample the remainder. This may have some overlap, but that's fine. // Edition upgrade will fix this nested awfulness. 
if let Some(rest) = input.last_chunk::<16>() { if let Some( rest_near) = in_near.last_chunk::<16>() { if let Some( rest_far) = in_far.last_chunk::<16>() { if let Some( rest_top) = out_top.last_chunk_mut::<16>() { if let Some( rest_bottom) = out_bottom.last_chunk_mut::<16>() { upsample16(rest, rest_near, rest_far, rest_top, rest_bottom); } } } } } } pub fn upsample_hv_simd( input: &[i16], in_near: &[i16], in_far: &[i16], scratch_space: &mut [i16], output: &mut [i16], ) { assert_eq!(input.len() * 4, output.len()); assert_eq!(input.len() * 2, scratch_space.len()); upsample_vertical_simd(input, in_near, in_far, &mut [], scratch_space); let scratch_half = scratch_space.len() / 2; let output_half = output.len() / 2; let (scratch_top, scratch_bottom) = scratch_space.split_at_mut(scratch_half); let (out_top, out_bottom) = output.split_at_mut(output_half); let mut t = [0]; upsample_horizontal_simd(scratch_top, &[], &[], &mut t, out_top); upsample_horizontal_simd(scratch_bottom, &[], &[], &mut t, out_bottom); } zune-jpeg-0.5.11/src/upsampler/scalar.rs000064400000000000000000000075311046102023000162340ustar 00000000000000/* * Copyright (c) 2023. 
* * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ pub fn upsample_horizontal( input: &[i16], _ref: &[i16], _in_near: &[i16], _scratch: &mut [i16], output: &mut [i16] ) { assert_eq!( input.len() * 2, output.len(), "Input length is not half the size of the output length" ); assert!( output.len() > 4 && input.len() > 2, "Too Short of a vector, cannot upsample" ); output[0] = input[0]; output[1] = (input[0] * 3 + input[1] + 2) >> 2; // This code is written for speed and not readability // // The readable code is // // for i in 1..input.len() - 1{ // let sample = 3 * input[i] + 2; // out[i * 2] = (sample + input[i - 1]) >> 2; // out[i * 2 + 1] = (sample + input[i + 1]) >> 2; // } // // The output of a pixel is determined by it's surrounding neighbours but we attach more weight to it's nearest // neighbour (input[i]) than to the next nearest neighbour. for (output_window, input_window) in output[2..].chunks_exact_mut(2).zip(input.windows(3)) { let sample = 3 * input_window[1] + 2; output_window[0] = (sample + input_window[0]) >> 2; output_window[1] = (sample + input_window[2]) >> 2; } // Get lengths let out_len = output.len() - 2; let input_len = input.len() - 2; // slice the output vector let f_out = &mut output[out_len..]; let i_last = &input[input_len..]; // write out manually.. 
f_out[0] = (3 * i_last[1] + i_last[0] + 2) >> 2; f_out[1] = i_last[1]; } pub fn upsample_vertical( input: &[i16], in_near: &[i16], in_far: &[i16], _scratch_space: &mut [i16], output: &mut [i16] ) { assert_eq!(input.len() * 2, output.len()); assert_eq!(in_near.len(), input.len()); assert_eq!(in_far.len(), input.len()); let middle = output.len() / 2; let (out_top, out_bottom) = output.split_at_mut(middle); // for the first row, closest row is in_near for ((near, far), x) in input.iter().zip(in_near.iter()).zip(out_top) { *x = (((3 * near) + 2) + far) >> 2; } // for the second row, the closest row to input is in_far for ((near, far), x) in input.iter().zip(in_far.iter()).zip(out_bottom) { *x = (((3 * near) + 2) + far) >> 2; } } pub fn upsample_hv( input: &[i16], in_near: &[i16], in_far: &[i16], scratch_space: &mut [i16], output: &mut [i16] ) { assert_eq!(input.len() * 4, output.len()); assert_eq!(input.len() * 2, scratch_space.len()); let mut t = [0]; upsample_vertical(input, in_near, in_far, &mut t, scratch_space); // horizontal upsampling must be done separate for every line // Otherwise it introduces artifacts that may cause the edge colors // to appear on the other line. // Since this is called for two scanlines/widths currently // splitting the inputs and outputs into half ensures we only handle // one scanline per iteration let scratch_half = scratch_space.len() / 2; let output_half = output.len() / 2; upsample_horizontal( &scratch_space[..scratch_half], &[], &[], &mut t, &mut output[..output_half] ); upsample_horizontal( &scratch_space[scratch_half..], &[], &[], &mut t, &mut output[output_half..] 
); } pub fn upsample_generic( input: &[i16], _in_near: &[i16], _in_far: &[i16], _scratch_space: &mut [i16], output: &mut [i16] ) { // use nearest sample let difference = output.len() / input.len(); if difference > 0 { // nearest neighbour for (input, chunk_output) in input.iter().zip(output.chunks_exact_mut(difference)) { chunk_output.iter_mut().for_each(|x| *x = *input); } } } zune-jpeg-0.5.11/src/upsampler.rs000064400000000000000000000254471046102023000147750ustar 00000000000000/* * Copyright (c) 2023. * * This software is free software; * * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license */ //! Up-sampling routines //! //! The main upsampling method is a bi-linear interpolation or a "triangle //! filter " or libjpeg turbo `fancy_upsampling` which is a good compromise //! between speed and visual quality //! //! # The filter //! Each output pixel is made from `(3*A+B)/4` where A is the original //! pixel closer to the output and B is the one further. //! //! ```text //!+---+---+ //! | A | B | //! +---+---+ //! +-+-+-+-+ //! | |P| | | //! +-+-+-+-+ //! ``` //! //! # Horizontal Bi-linear filter //! ```text //! |---+-----------+---+ //! | | | | //! | A | |p1 | p2| | B | //! | | | | //! |---+-----------+---+ //! //! ``` //! For a horizontal bi-linear it's trivial to implement, //! //! `A` becomes the input closest to the output. //! //! `B` varies depending on output. //! - For odd positions, input is the `next` pixel after A //! - For even positions, input is the `previous` value before A. //! //! We iterate in a classic 1-D sliding window with a window of 3. //! For our sliding window approach, `A` is the 1st and `B` is either the 0th term or 2nd term //! depending on position we are writing.(see scalar code). //! //! For vector code see module sse for explanation. //! //! # Vertical bi-linear. //! Vertical up-sampling is a bit trickier. //! //! ```text //! +----+----+ //! | A1 | A2 | //! +----+----+ //! +----+----+ //! 
| p1 | p2 | //! +----+-+--+ //! +----+-+--+ //! | p3 | p4 | //! +----+-+--+ //! +----+----+ //! | B1 | B2 | //! +----+----+ //! ``` //! //! For `p1` //! - `A1` is given a weight of `3` and `B1` is given a weight of 1. //! //! For `p3` //! - `B1` is given a weight of `3` and `A1` is given a weight of 1 //! //! # Horizontal vertical downsampling/chroma quartering. //! //! Carry out a vertical filter in the first pass, then a horizontal filter in the second pass. #![allow(unreachable_code)] use zune_core::options::DecoderOptions; use crate::components::UpSampler; #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[cfg(feature = "x86")] mod avx2; #[cfg(target_arch = "aarch64")] #[cfg(feature = "neon")] mod neon; #[cfg(feature = "portable_simd")] mod portable_simd; mod scalar; // choose the best possible implementation for this platform #[allow(unused_variables)] pub fn choose_horizontal_samp_function(options: &DecoderOptions) -> UpSampler { #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[cfg(feature = "x86")] if options.use_avx2() { return |a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: `options.use_avx2()` only returns true if avx2 is supported. unsafe { avx2::upsample_horizontal_avx2(a, b, c, d, e) } }; } #[cfg(target_arch = "aarch64")] #[cfg(feature = "neon")] if options.use_neon() { return |a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: `options.use_neon()` only returns true if neon is supported. 
unsafe { neon::upsample_horizontal_neon(a, b, c, d, e) } }; } #[cfg(feature = "portable_simd")] return portable_simd::upsample_horizontal_simd; return scalar::upsample_horizontal; } #[allow(unused_variables)] pub fn choose_hv_samp_function(options: &DecoderOptions) -> UpSampler { #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[cfg(feature = "x86")] if options.use_avx2() { return |a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: `options.use_avx2()` only returns true if avx2 is supported. unsafe { avx2::upsample_hv_avx2(a, b, c, d, e) } }; } #[cfg(target_arch = "aarch64")] #[cfg(feature = "neon")] if options.use_neon() { return |a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: `options.use_neon()` only returns true if neon is supported. unsafe { neon::upsample_hv_neon(a, b, c, d, e) } }; } #[cfg(feature = "portable_simd")] return portable_simd::upsample_hv_simd; return scalar::upsample_hv; } #[allow(unused_variables)] pub fn choose_v_samp_function(options: &DecoderOptions) -> UpSampler { #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[cfg(feature = "x86")] if options.use_avx2() { return |a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: `options.use_avx2()` only returns true if avx2 is supported. unsafe { avx2::upsample_vertical_avx2(a, b, c, d, e) } }; } #[cfg(target_arch = "aarch64")] #[cfg(feature = "neon")] if options.use_neon() { return |a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: `options.use_neon()` only returns true if neon is supported. 
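The three `choose_*_samp_function` helpers all follow the same shape: check a capability flag, return a wrapper around the SIMD kernel, otherwise fall back to scalar. A minimal self-contained sketch of that fn-pointer dispatch pattern (all names here are hypothetical, and the scalar routine stands in for the AVX2 kernel):

```rust
// Pick the fastest available implementation once, then call through
// a plain function pointer, as the choose_* helpers above do.
type Sampler2x = fn(&[i16], &mut [i16]);

// Scalar fallback: 2x nearest-neighbour duplication.
fn duplicate_2x(input: &[i16], output: &mut [i16]) {
    for (v, pair) in input.iter().zip(output.chunks_exact_mut(2)) {
        pair[0] = *v;
        pair[1] = *v;
    }
}

fn choose_sampler() -> Sampler2x {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    if is_x86_feature_detected!("avx2") {
        // The real crate returns a wrapper around an AVX2 kernel here;
        // the scalar routine stands in for it in this sketch.
        return duplicate_2x;
    }
    duplicate_2x
}
```

Returning a `fn` pointer keeps the per-row call cheap: the feature check runs once at setup, not once per row.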
unsafe { neon::upsample_vertical_neon(a, b, c, d, e) } }; } #[cfg(feature = "portable_simd")] return portable_simd::upsample_vertical_simd; return scalar::upsample_vertical; } /// Upsample nothing pub fn upsample_no_op( _input: &[i16], _in_ref: &[i16], _in_near: &[i16], _scratch_space: &mut [i16], _output: &mut [i16], ) { } pub fn generic_sampler() -> UpSampler { scalar::upsample_generic } #[cfg(test)] mod tests { use super::*; #[cfg(feature = "portable_simd")] mod portable_simd_impl { use super::*; #[test] fn portable_simd_vertical() { _test_vertical(portable_simd::upsample_vertical_simd) } #[test] fn portable_simd_horizontal() { _test_horizontal(portable_simd::upsample_horizontal_simd) } #[test] fn portable_simd_hv() { _test_hv(portable_simd::upsample_hv_simd) } } #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] #[cfg(feature = "x86")] #[cfg(target_feature = "avx2")] mod avx2_impl { use super::*; #[test] fn avx2_vertical() { _test_vertical(|a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: Test guarded behind `target_feature` unsafe { avx2::upsample_vertical_avx2(a, b, c, d, e) } }) } #[test] fn avx2_horizontal() { _test_horizontal(|a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: Test guarded behind `target_feature` unsafe { avx2::upsample_horizontal_avx2(a, b, c, d, e) } }) } #[test] fn avx2_hv() { _test_hv(|a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: Test guarded behind `target_feature` unsafe { avx2::upsample_hv_avx2(a, b, c, d, e) } }) } } #[cfg(target_arch = "aarch64")] #[cfg(feature = "neon")] #[cfg(target_feature = "neon")] mod neon_impl { use super::*; #[test] fn neon_vertical() { _test_vertical(|a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| { // SAFETY: Test guarded behind `target_feature` unsafe { neon::upsample_vertical_neon(a, b, c, d, e) } }) } #[test] fn neon_horizontal() { _test_horizontal(|a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], 
                              e: &mut [i16]| {
                // SAFETY: Test guarded behind `target_feature`
                unsafe { neon::upsample_horizontal_neon(a, b, c, d, e) }
            })
        }

        #[test]
        fn neon_hv() {
            _test_hv(|a: &[i16], b: &[i16], c: &[i16], d: &mut [i16], e: &mut [i16]| {
                // SAFETY: Test guarded behind `target_feature`
                unsafe { neon::upsample_hv_neon(a, b, c, d, e) }
            })
        }
    }

    fn _test_vertical(upsampler: UpSampler) {
        let width = 1024;

        let input: Vec<i16> = (0..width).map(|x| ((x + 10) % 256) as i16).collect();
        let in_near: Vec<i16> = (0..width).map(|x| ((x + 20) % 256) as i16).collect();
        let in_far: Vec<i16> = (0..width).map(|x| ((x + 30) % 256) as i16).collect();

        let mut scratch = vec![0i16; width];
        let mut output_scalar = vec![0i16; width * 2];
        let mut output_fast = vec![0i16; width * 2];

        scalar::upsample_vertical(&input, &in_near, &in_far, &mut scratch, &mut output_scalar);
        upsampler(&input, &in_near, &in_far, &mut scratch, &mut output_fast);

        assert_eq!(output_scalar, output_fast);
    }

    fn _test_horizontal(upsampler: UpSampler) {
        _test_horizontal_even_width(upsampler);
        _test_horizontal_odd_width(upsampler);
    }

    fn _test_horizontal_even_width(upsampler: UpSampler) {
        let width = 1024;

        let input: Vec<i16> = (0..width).map(|x| ((x + 10) % 256) as i16).collect();

        let mut scratch = vec![0i16; width];
        let mut output_scalar = vec![0i16; width * 2];
        let mut output_fast = vec![0i16; width * 2];

        scalar::upsample_horizontal(&input, &[], &[], &mut scratch, &mut output_scalar);
        upsampler(&input, &[], &[], &mut scratch, &mut output_fast);

        assert_eq!(output_scalar, output_fast);
    }

    fn _test_horizontal_odd_width(upsampler: UpSampler) {
        let width = 33;

        let input: Vec<i16> = (0..width).map(|x| ((x + 10) % 256) as i16).collect();

        let mut scratch = vec![0i16; width];
        let mut output_scalar = vec![0i16; width * 2];
        let mut output_fast = vec![0i16; width * 2];

        scalar::upsample_horizontal(&input, &[], &[], &mut scratch, &mut output_scalar);
        upsampler(&input, &[], &[], &mut scratch, &mut output_fast);

        assert_eq!(output_scalar, output_fast);
    }

    fn _test_hv(upsampler:
UpSampler) {
        let width = 512;

        let input: Vec<i16> = (0..width).map(|x| ((x + 10) % 256) as i16).collect();
        let in_near: Vec<i16> = (0..width).map(|x| ((x + 20) % 256) as i16).collect();
        let in_far: Vec<i16> = (0..width).map(|x| ((x + 30) % 256) as i16).collect();

        // Output len is width * 4 for HV (vertical * 2, then horizontal * 2 for each row)
        // scratch is width * 2
        let mut scratch_scalar = vec![0i16; width * 2];
        let mut scratch_fast = vec![0i16; width * 2];
        let mut output_scalar = vec![0i16; width * 4];
        let mut output_fast = vec![0i16; width * 4];

        scalar::upsample_hv(
            &input,
            &in_near,
            &in_far,
            &mut scratch_scalar,
            &mut output_scalar,
        );
        upsampler(
            &input,
            &in_near,
            &in_far,
            &mut scratch_fast,
            &mut output_fast,
        );

        assert_eq!(output_scalar, output_fast);
    }
}
zune-jpeg-0.5.11/src/worker.rs
/*
 * Copyright (c) 2023.
 *
 * This software is free software;
 *
 * You can redistribute it or modify it under terms of the MIT, Apache License or Zlib license
 */

use alloc::format;
use core::convert::TryInto;
use core::cmp::min;

use zune_core::colorspace::ColorSpace;

use crate::color_convert::ycbcr_to_grayscale;
use crate::components::{Components, SampleRatios};
use crate::decoder::{ColorConvert16Ptr, MAX_COMPONENTS};
use crate::errors::DecodeErrors;

/// fast 0..255 * 0..255 => 0..255 rounded multiplication
///
/// Borrowed from stb
#[allow(clippy::cast_sign_loss, clippy::cast_possible_truncation)]
#[inline]
fn blinn_8x8(in_val: u8, y: u8) -> u8 {
    let t = i32::from(in_val) * i32::from(y) + 128;
    return ((t + (t >> 8)) >> 8) as u8;
}

#[allow(clippy::cast_sign_loss, clippy::cast_possible_truncation)]
pub(crate) fn color_convert(
    unprocessed: &[&[i16]; MAX_COMPONENTS], color_convert_16: ColorConvert16Ptr,
    input_colorspace: ColorSpace, output_colorspace: ColorSpace, output: &mut [u8],
    width: usize, padded_width: usize
) -> Result<(), DecodeErrors> {
    if input_colorspace.num_components() == 3 && input_colorspace == output_colorspace
{ // sort things like RGB to RGB conversion copy_removing_padding(unprocessed, width, padded_width, output); return Ok(()); } if input_colorspace.num_components() == 4 && input_colorspace == output_colorspace { copy_removing_padding_4x(unprocessed, width, padded_width, output); return Ok(()); } // color convert match (input_colorspace, output_colorspace) { (ColorSpace::YCbCr | ColorSpace::Luma, ColorSpace::Luma) => { ycbcr_to_grayscale(unprocessed[0], width, padded_width, output); } ( ColorSpace::YCbCr, ColorSpace::RGB | ColorSpace::RGBA | ColorSpace::BGR | ColorSpace::BGRA ) => { color_convert_ycbcr( unprocessed, width, padded_width, output_colorspace, color_convert_16, output ); } (ColorSpace::YCCK, ColorSpace::RGB) => { color_convert_ycck_to_rgb::<3>( unprocessed, width, padded_width, output_colorspace, color_convert_16, output ); } (ColorSpace::YCCK, ColorSpace::RGBA) => { color_convert_ycck_to_rgb::<4>( unprocessed, width, padded_width, output_colorspace, color_convert_16, output ); } (ColorSpace::CMYK, ColorSpace::RGB) => { color_convert_cymk_to_rgb::<3>(unprocessed, width, padded_width, output); } (ColorSpace::CMYK, ColorSpace::RGBA) => { color_convert_cymk_to_rgb::<4>(unprocessed, width, padded_width, output); } (ColorSpace::MultiBand(n), _) => { if n.get() != 2 { return Err(DecodeErrors::Format(format!( "Unknown multiband sample ({n}), please share sample" ))); } copy_removing_padding_generic( unprocessed, width, padded_width, output, n.get() as usize ); } (ColorSpace::Luma, ColorSpace::RGB) => { // duplicate the luma channel three times to form RGB // Note, this may assume the direct conversion // from luma to RGB is by duplicating // // There may be a bit more complex ways // of doing it but won't get onto it convert_luma_to_rgb(unprocessed, width, padded_width, output) } (ColorSpace::Luma, ColorSpace::RGBA) => { // duplicate the luma channel three times to form RGB // add 255 as alpha // Note, this may assume the direct conversion // from luma to RGB is 
by duplicating // // There may be a bit more complex ways // of doing it but won't get onto it convert_luma_to_rgba(unprocessed, width, padded_width, output) } // For the other components we do nothing(currently) _ => { let msg = format!( "Unimplemented colorspace mapping from {input_colorspace:?} to {output_colorspace:?}"); return Err(DecodeErrors::Format(msg)); } } Ok(()) } fn convert_luma_to_rgb( mcu_block: &[&[i16]; MAX_COMPONENTS], width: usize, padded_width: usize, output: &mut [u8] ) { for (pix_w, y_w) in output .chunks_exact_mut(width * 3) .zip(mcu_block[0].chunks_exact(padded_width)) { for (pix, c) in pix_w.chunks_exact_mut(3).zip(y_w) { pix[0] = *c as u8; pix[1] = *c as u8; pix[2] = *c as u8; } } } fn convert_luma_to_rgba( mcu_block: &[&[i16]; MAX_COMPONENTS], width: usize, padded_width: usize, output: &mut [u8] ) { for (pix_w, y_w) in output .chunks_exact_mut(width * 4) .zip(mcu_block[0].chunks_exact(padded_width)) { for (pix, c) in pix_w.chunks_exact_mut(4).zip(y_w) { pix[0] = *c as u8; pix[1] = *c as u8; pix[2] = *c as u8; pix[3] = 255; } } } /// Copy a block to output removing padding bytes from input /// if necessary #[allow(clippy::cast_sign_loss, clippy::cast_possible_truncation)] fn copy_removing_padding( mcu_block: &[&[i16]; MAX_COMPONENTS], width: usize, padded_width: usize, output: &mut [u8] ) { for (((pix_w, c_w), m_w), y_w) in output .chunks_exact_mut(width * 3) .zip(mcu_block[0].chunks_exact(padded_width)) .zip(mcu_block[1].chunks_exact(padded_width)) .zip(mcu_block[2].chunks_exact(padded_width)) { for (((pix, c), y), m) in pix_w.chunks_exact_mut(3).zip(c_w).zip(m_w).zip(y_w) { pix[0] = *c as u8; pix[1] = *y as u8; pix[2] = *m as u8; } } } #[allow(clippy::cast_possible_truncation, clippy::cast_sign_loss)] fn copy_removing_padding_4x( mcu_block: &[&[i16]; MAX_COMPONENTS], width: usize, padded_width: usize, output: &mut [u8] ) { for ((((pix_w, c_w), m_w), y_w), k_w) in output .chunks_exact_mut(width * 4) 
        .zip(mcu_block[0].chunks_exact(padded_width))
        .zip(mcu_block[1].chunks_exact(padded_width))
        .zip(mcu_block[2].chunks_exact(padded_width))
        .zip(mcu_block[3].chunks_exact(padded_width))
    {
        for ((((pix, c), y), m), k) in pix_w
            .chunks_exact_mut(4)
            .zip(c_w)
            .zip(m_w)
            .zip(y_w)
            .zip(k_w)
        {
            pix[0] = *c as u8;
            pix[1] = *y as u8;
            pix[2] = *m as u8;
            pix[3] = *k as u8;
        }
    }
}

#[allow(clippy::cast_possible_truncation, clippy::cast_sign_loss)]
fn copy_removing_padding_generic(
    mcu_block: &[&[i16]; MAX_COMPONENTS], width: usize, padded_width: usize,
    output: &mut [u8], channels: usize
) {
    match channels {
        // just do 2 for now
        2 => {
            for ((pix_w, y_w), k_w) in output
                .chunks_exact_mut(width * channels)
                .zip(mcu_block[0].chunks_exact(padded_width))
                .zip(mcu_block[1].chunks_exact(padded_width))
            {
                for ((pix, c), k) in pix_w.chunks_exact_mut(2).zip(y_w).zip(k_w) {
                    pix[0] = *c as u8;
                    pix[1] = *k as u8;
                }
            }
        }
        _ => unreachable!()
    }
}

/// Convert a YCCK image to RGB
#[allow(clippy::cast_possible_truncation, clippy::cast_sign_loss)]
fn color_convert_ycck_to_rgb<const NUM_COMPONENTS: usize>(
    mcu_block: &[&[i16]; MAX_COMPONENTS], width: usize, padded_width: usize,
    output_colorspace: ColorSpace, color_convert_16: ColorConvert16Ptr, output: &mut [u8]
) {
    color_convert_ycbcr(
        mcu_block,
        width,
        padded_width,
        output_colorspace,
        color_convert_16,
        output
    );

    for (pix_w, m_w) in output
        .chunks_exact_mut(width * 3)
        .zip(mcu_block[3].chunks_exact(padded_width))
    {
        for (pix, m) in pix_w.chunks_exact_mut(NUM_COMPONENTS).zip(m_w) {
            let m = (*m) as u8;
            pix[0] = blinn_8x8(255 - pix[0], m);
            pix[1] = blinn_8x8(255 - pix[1], m);
            pix[2] = blinn_8x8(255 - pix[2], m);
        }
    }
}

#[allow(clippy::cast_sign_loss, clippy::cast_possible_truncation)]
fn color_convert_cymk_to_rgb<const NUM_COMPONENTS: usize>(
    mcu_block: &[&[i16]; MAX_COMPONENTS], width: usize, padded_width: usize,
    output: &mut [u8]
) {
    for ((((pix_w, c_w), m_w), y_w), k_w) in output
        .chunks_exact_mut(width * NUM_COMPONENTS)
        .zip(mcu_block[0].chunks_exact(padded_width))
        .zip(mcu_block[1].chunks_exact(padded_width))
.zip(mcu_block[2].chunks_exact(padded_width)) .zip(mcu_block[3].chunks_exact(padded_width)) { for ((((pix, c), m), y), k) in pix_w .chunks_exact_mut(3) .zip(c_w) .zip(m_w) .zip(y_w) .zip(k_w) { let c = *c as u8; let m = *m as u8; let y = *y as u8; let k = *k as u8; pix[0] = blinn_8x8(c, k); pix[1] = blinn_8x8(m, k); pix[2] = blinn_8x8(y, k); } } } /// Do color-conversion for interleaved MCU #[allow( clippy::similar_names, clippy::too_many_arguments, clippy::needless_pass_by_value, clippy::unwrap_used )] fn color_convert_ycbcr( mcu_block: &[&[i16]; MAX_COMPONENTS], width: usize, padded_width: usize, output_colorspace: ColorSpace, color_convert_16: ColorConvert16Ptr, output: &mut [u8] ) { let num_components = output_colorspace.num_components(); let stride = width * num_components; // Allocate temporary buffer for small widths less than 16. let mut temp = [0; 64]; // We need to chunk per width to ensure we can discard extra values at the end of the width. // Since the encoder may pad bits to ensure the width is a multiple of 8. 
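The `blinn_8x8` helper defined earlier in this file computes `round(a*b/255)` without a division, and the CMYK/YCCK paths lean on it to composite the K channel. The trick is exact, which a standalone copy can verify exhaustively against floating-point rounding:

```rust
// Standalone copy of the stb-derived blinn_8x8 trick: with t = a*b + 128,
// (t + (t >> 8)) >> 8 equals round(a * b / 255) for all byte inputs.
fn blinn_8x8(in_val: u8, y: u8) -> u8 {
    let t = i32::from(in_val) * i32::from(y) + 128;
    ((t + (t >> 8)) >> 8) as u8
}
```

Since `a*b/255` never lands exactly on a half-integer, the rounded result is unambiguous, so the integer form matches `f64` rounding on the full 256x256 input space.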
for (((y_width, cb_width), cr_width), out) in mcu_block[0] .chunks_exact(padded_width) .zip(mcu_block[1].chunks_exact(padded_width)) .zip(mcu_block[2].chunks_exact(padded_width)) .zip(output.chunks_exact_mut(stride)) { if width < 16 { // allocate temporary buffers for the values received from idct let mut y_out = [0; 16]; let mut cb_out = [0; 16]; let mut cr_out = [0; 16]; // copy those small widths to that buffer // Use a min with 16 to prevent some panics, see https://github.com/etemesi254/zune-image/issues/331 y_out[0..min(y_width.len(), 16)].copy_from_slice(&y_width[0..min(y_width.len(), 16)]); cb_out[0..min(cb_width.len(), 16)] .copy_from_slice(&cb_width[0..min(cb_width.len(), 16)]); cr_out[0..min(cr_width.len(), 16)] .copy_from_slice(&cr_width[0..min(cr_width.len(), 16)]); // we handle widths less than 16 a bit differently, allocating a temporary // buffer and writing to that and then flushing to the out buffer // because of the optimizations applied below, (color_convert_16)(&y_out, &cb_out, &cr_out, &mut temp, &mut 0); // copy to stride out[0..width * num_components].copy_from_slice(&temp[0..width * num_components]); // next continue; } // Chunk in outputs of 16 to pass to color_convert as an array of 16 i16's. for (((y, cb), cr), out_c) in y_width .chunks_exact(16) .zip(cb_width.chunks_exact(16)) .zip(cr_width.chunks_exact(16)) .zip(out.chunks_exact_mut(16 * num_components)) { (color_convert_16)( y.try_into().unwrap(), cb.try_into().unwrap(), cr.try_into().unwrap(), out_c, &mut 0 ); } //we have more pixels in the end that can't be handled by the main loop. //move pointer back a little bit to get last 16 bytes, //color convert, and overwrite //This means some values will be color converted twice. for ((y, cb), cr) in y_width[width - 16..] 
.chunks_exact(16) .zip(cb_width[width - 16..].chunks_exact(16)) .zip(cr_width[width - 16..].chunks_exact(16)) .take(1) { (color_convert_16)( y.try_into().unwrap(), cb.try_into().unwrap(), cr.try_into().unwrap(), &mut temp, &mut 0 ); } let rem = out[(width - 16) * num_components..] .chunks_exact_mut(16 * num_components) .next() .unwrap(); rem.copy_from_slice(&temp[0..rem.len()]); } } pub(crate) fn upsample( component: &mut Components, mcu_height: usize, i: usize, upsampler_scratch_space: &mut [i16], has_vertical_sample: bool ) -> Result<(), DecodeErrors> { match component.sample_ratio { SampleRatios::V | SampleRatios::HV => { /* When upsampling vertically sampled images, we have a certain problem which is that we do not have all MCU's decoded, this usually sucks at boundaries e.g we can't upsample the last mcu row, since the row_down currently doesn't exist To solve this we need to do two things 1. Carry over coefficients when we lack enough data to upsample 2. Upsample when we have enough data To achieve (1), we store a previous row, and the current row in components themselves which will later be used to make (2) To achieve (2), we take the stored previous row(second last MCU row), current row(last mcu row) and row down(first row of newly decoded MCU) and upsample that and store it in first_row_upsample_dest, this contains up-sampled coefficients for the last for the previous decoded mcu row. 
The caller is then expected to process first_row_upsample_dest before processing data in component.upsample_dest which stores the up-sampled components excluding the last row */ let mut dest_start = 0; let stride_bytes_written = component.width_stride * component.sample_ratio.sample(); if i > 0 { // Handle the last MCU of the previous row // This wasn't up-sampled as we didn't have the row_down // so we do it now let stride = component.width_stride; let dest = &mut component.first_row_upsample_dest[0..stride_bytes_written]; // get current row let row = &component.row[..]; let row_up = &component.row_up[..]; let row_down = &component.raw_coeff[0..stride]; (component.up_sampler)(row, row_up, row_down, upsampler_scratch_space, dest); } // we have the Y component width stride. // this may be higher than the actual width,(2x because vertical sampling) // // This will not upsample the last row // if false, do not upsample. // set to false on the last row of an mcu let mut upsample = true; let stride = component.width_stride * component.vertical_sample; let stop_offset = component.raw_coeff.len() / component.width_stride; if component.raw_coeff.len() != stop_offset * stride { // slice would panic below return Err(DecodeErrors::FormatStatic( "Invalid component dimensions, would panic" )); } for (pos, curr_row) in component .raw_coeff .chunks_exact(component.width_stride) .enumerate() { let mut dest: &mut [i16] = &mut []; let mut row_up: &[i16] = &[]; // row below current sample let mut row_down: &[i16] = &[]; // Order of ifs matters if i == 0 && pos == 0 { // first IMAGE row, row_up is the same as current row // row_down is the row below. 
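The branch logic below can be put in miniature: the sketch here runs the vertical triangle filter over a whole in-memory plane, clamping `row_up`/`row_down` at the image borders instead of carrying rows across MCU boundaries. It is a simplification of what the streaming code in this function does (the name `upsample_vertical_sketch` is hypothetical; the `+ 2` is the usual stb-style rounding bias):

```rust
// Vertical triangle filter over a full plane: each input row produces two
// output rows, computed as (3 * nearer_row + farther_row + 2) / 4, with the
// neighbour row clamped at the top and bottom edges.
fn upsample_vertical_sketch(plane: &[&[i16]]) -> Vec<Vec<i16>> {
    let h = plane.len();
    let mut out = Vec::with_capacity(h * 2);
    for pos in 0..h {
        let cur = plane[pos];
        let row_up = plane[pos.saturating_sub(1)]; // clamp at the top edge
        let row_down = plane[(pos + 1).min(h - 1)]; // clamp at the bottom edge
        // first output row leans on the row above, second on the row below
        let top: Vec<i16> = cur.iter().zip(row_up).map(|(&c, &u)| (3 * c + u + 2) >> 2).collect();
        let bottom: Vec<i16> = cur.iter().zip(row_down).map(|(&c, &d)| (3 * c + d + 2) >> 2).collect();
        out.push(top);
        out.push(bottom);
    }
    out
}
```

The streaming version in this file cannot index `plane[pos + 1]` freely, which is exactly why it saves `row` and `row_up` and defers the last row of each MCU until the next MCU row is decoded.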
row_up = &component.raw_coeff[pos * stride..(pos + 1) * stride]; row_down = &component.raw_coeff[(pos + 1) * stride..(pos + 2) * stride]; } else if i > 0 && pos == 0 { // first row of a new mcu, previous row was copied so use that row_up = &component.row[..]; row_down = &component.raw_coeff[(pos + 1) * stride..(pos + 2) * stride]; } else if i == mcu_height.saturating_sub(1) && pos == stop_offset - 1 { // last IMAGE row, adjust pointer to use previous row and current row row_up = &component.raw_coeff[(pos - 1) * stride..pos * stride]; row_down = &component.raw_coeff[pos * stride..(pos + 1) * stride]; } else if pos > 0 && pos < stop_offset - 1 { // other rows, get row up and row down relative to our current row // ignore last row of each mcu row_up = &component.raw_coeff[(pos - 1) * stride..pos * stride]; row_down = &component.raw_coeff[(pos + 1) * stride..(pos + 2) * stride]; } else if pos == stop_offset - 1 { // last MCU in a row // // we need a row at the next MCU but we haven't decoded that MCU yet // so we should save this and when we have the next MCU, // do the upsampling // store the current row and previous row in a buffer let prev_row = &component.raw_coeff[(pos - 1) * stride..pos * stride]; component.row_up.copy_from_slice(prev_row); component.row.copy_from_slice(curr_row); upsample = false; } else { unreachable!("Uh oh!"); } if upsample { dest = &mut component.upsample_dest[dest_start..dest_start + stride_bytes_written]; dest_start += stride_bytes_written; } if upsample { // upsample (component.up_sampler)( curr_row, row_up, row_down, upsampler_scratch_space, dest ); } } } SampleRatios::H => { //assert_eq!(component.raw_coeff.len() * 2, component.upsample_dest.len()); // Before it was an assert, but numerous and numerous and numerous // bug fixes and ad hoc solutions later, I have now just decided to keep it as a resize component .upsample_dest .resize(component.raw_coeff.len() * 2, 0); let raw_coeff = &component.raw_coeff; let dest_coeff = &mut 
component.upsample_dest; if has_vertical_sample { /* There have been images that have the following configurations. Component ID:Y HS:2 VS:2 QT:0 Component ID:Cb HS:1 VS:1 QT:1 Component ID:Cr HS:1 VS:2 QT:1 This brings out a nasty case of misaligned sampling factors. Cr will need to save a row because of the way we process boundaries but Cb won't since Cr is horizontally sampled while Cb is HV sampled with respect to the image sampling factors. So during decoding of one MCU, we could only do 7 and not 8 rows, but the SampleRatio::H never had to save a single line, since it doesn't suffer from boundary issues. Now this takes care of that, saving the last MCU row in case it will be needed. We save the previous row before up-sampling this row because the boundary issue is in the last MCU row of the previous MCU. PS(cae): I can't add the image to the repo as it is nsfw, but can send if required */ let length = component.first_row_upsample_dest.len(); component .first_row_upsample_dest .copy_from_slice(&dest_coeff.rchunks_exact(length).next().unwrap()); } // up-sample each row for (single_row, output_stride) in raw_coeff .chunks_exact(component.width_stride) .zip(dest_coeff.chunks_exact_mut(component.width_stride * 2)) { // upsample using the fn pointer, should only be H, so no need for // row up and row down (component.up_sampler)(single_row, &[], &[], &mut [], output_stride); } } SampleRatios::Generic(h, v) => { let raw_coeff = &component.raw_coeff; let dest_coeff = &mut component.upsample_dest; //let size = component.width_stride.div_ceil(v); // for (single_row, output_stride) in raw_coeff // .chunks_exact(size) // .zip(dest_coeff.chunks_exact_mut(component.width_stride * h)) // { // (component.up_sampler)(single_row, &[], &[], &mut [], output_stride); // // } for (single_row, output_stride) in raw_coeff .chunks_exact(component.width_stride) .zip(dest_coeff.chunks_exact_mut(component.width_stride * h * v)) { for row in 
output_stride.chunks_exact_mut(component.width_stride * h) { (component.up_sampler)(single_row, &[], &[], &mut [], row); } } } SampleRatios::None => {} }; Ok(()) }
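For reference, the horizontal triangle filter described in the upsampler module docs can be sketched as below. The edge handling (copying the outermost pixels) is one common choice and may differ in detail from the crate's scalar implementation; the name `upsample_horizontal_sketch` is hypothetical:

```rust
// Horizontal triangle filter: each output pixel is (3 * nearer + farther + 2) / 4.
// Even output positions take the previous input as the far sample, odd positions
// take the next one; the two edge pixels are copied as-is.
fn upsample_horizontal_sketch(input: &[i16], output: &mut [i16]) {
    let n = input.len();
    assert!(n >= 2 && output.len() == n * 2);
    output[0] = input[0];
    output[1] = (3 * input[0] + input[1] + 2) >> 2;
    for (i, w) in input.windows(3).enumerate() {
        // w = [prev, cur, next]; outputs for `cur` sit at 2*(i+1) and 2*(i+1)+1
        output[2 * (i + 1)] = (3 * w[1] + w[0] + 2) >> 2;
        output[2 * (i + 1) + 1] = (3 * w[1] + w[2] + 2) >> 2;
    }
    output[2 * n - 2] = (3 * input[n - 1] + input[n - 2] + 2) >> 2;
    output[2 * n - 1] = input[n - 1];
}
```

On `[0, 4, 8]` this produces the smoothly interpolated `[0, 1, 3, 5, 7, 8]`, illustrating why the "fancy" filter looks better than plain sample duplication.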